
Lecture Notes in Networks and Systems 968

Harish Sharma
Vivek Shrivastava
Ashish Kumar Tripathi
Lipo Wang Editors

Communication
and Intelligent
Systems
Proceedings of ICCIS 2023, Volume 2
Lecture Notes in Networks and Systems

Volume 968

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA,
School of Electrical and Computer Engineering—FEEC, University of Campinas—
UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering,
Bogazici University, Istanbul, Türkiye
Derong Liu, Department of Electrical and Computer Engineering, University
of Illinois at Chicago, Chicago, USA
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of
Alberta, Alberta, Canada
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering,
KIOS Research Center for Intelligent Systems and Networks, University of Cyprus,
Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong,
Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest
developments in Networks and Systems—quickly, informally and with high quality.
Original research reported in proceedings and post-proceedings represents the core
of LNNS.
Volumes published in LNNS embrace all aspects and subfields of, as well as new
challenges in, Networks and Systems.
The series contains proceedings and edited volumes in systems and networks,
spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor
Networks, Control Systems, Energy Systems, Automotive Systems, Biological
Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems,
Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems,
Robotics, Social Systems, Economic Systems and others. Of particular value to both
the contributors and the readership are the short publication timeframe and
the world-wide distribution and exposure which enable both a wide and rapid
dissemination of research output.
The series covers the theory, applications, and perspectives on the state of the art
and future developments relevant to systems and networks, decision making, control,
complex processes and related areas, as embedded in the fields of interdisciplinary
and applied sciences, engineering, computer science, physics, economics, social, and
life sciences, as well as the paradigms and methodologies behind them.
Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of Science.
For proposals from Asia please contact Aninda Bose (aninda.bose@springer.com).
Editors
Harish Sharma
Department of Computer Science and Engineering
Rajasthan Technical University
Kota, Rajasthan, India

Vivek Shrivastava
Department of Electrical and Electronics Engineering
National Institute of Technology Uttarakhand
Srinagar, Uttarakhand, India

Ashish Kumar Tripathi
Department of Computer Science
Malaviya National Institute of Technology
Jaipur, Rajasthan, India

Lipo Wang
School of Electrical and Electronic Engineering
Nanyang Technological University
Singapore, Singapore

ISSN 2367-3370 ISSN 2367-3389 (electronic)


Lecture Notes in Networks and Systems
ISBN 978-981-97-2078-1 ISBN 978-981-97-2079-8 (eBook)
https://doi.org/10.1007/978-981-97-2079-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore

Paper in this product is recyclable.


Preface

This book contains outstanding research papers as the proceedings of the 5th International Conference on Communication and Intelligent Systems (ICCIS 2023), which was held on 16–17 December 2023 at Malaviya National Institute of Technology Jaipur, India, under the technical sponsorship of the Soft Computing Research Society, India. The conference was conceived as a platform for disseminating and exchanging ideas, concepts, and results between researchers from academia and industry, and for developing a comprehensive understanding of the challenges of advances in computational intelligence. The book will help strengthen congenial networking between academia and industry, presents novel contributions to communication and intelligent systems, and serves as reference material for advanced research. The topics covered are intelligent systems: algorithms and applications; smart data analytics and computing; informatics and applications; and communication and control systems.

ICCIS 2023 received 750 research submissions from distinguished participants at home and abroad. After a very stringent peer-review process, only 102 high-quality papers were finally accepted for presentation and publication.

This second volume presents 34 research papers related to communication and intelligent systems and serves as reference material for advanced research.

Kota, India Harish Sharma
Srinagar, India Vivek Shrivastava
Jaipur, India Ashish Kumar Tripathi
Singapore Lipo Wang

Contents

Multilingual Speech Recognition: An In-Depth Review
of Applications, Challenges, and Future Directions . . . . . . . . . . . . . . . . . . . 1
Mayur M. Jani, Sandip R. Panchal, Hemant H. Patel,
and Ashwin Raiyani
Performance Evaluation of Job Shop Scheduling Problem Using
Proposed Hybrid of Black Hole and Firefly Algorithms . . . . . . . . . . . . . . . 15
Jaspreet Kaur and Ashok Pal
Machine Learning and Healthcare: A Comprehensive Study . . . . . . . . . . 31
Riya Raj and Jayakumar Kaliappan
Evolutionary Algorithms for Fibers Upgrade Sequence Problem
on MB-EONs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Der-Rong Din
Exploring the Potential of Deep Learning Algorithms in Medical
Image Processing: A Comprehensive Analysis . . . . . . . . . . . . . . . . . . . . . . . . 61
Ganesh Prasad Pal and Raju Pal
Comparative Analysis of Image Enhancement Techniques:
A Study on Combined and Individual Approaches . . . . . . . . . . . . . . . . . . . . 71
Aditya Bhaskar and Bharti Joshi
Smishing: A SMS Phishing Detection Using Various Machine
Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Priteshkumar Prajapati, Heli Nandani, Devanshi Shah, Shail Shah,
Rachit Shah, Madhav Ajwalia, and Parth Shah
Convolution Neural Network (CNN)-Based Live Pig Weight
Estimation in Controlled Imaging Platform . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Chandan Kumar Deb, Ayon Tarafdar, Md. Ashraful Haque,
Sudeep Marwaha, Suvarna Bhoj, Gyanendra Kumar Gaur,
and Triveni Dutt


A Novel Image Encryption Technique Based on DNA Theory
and Chaotic Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Kartik Verma, Butta Singh, Manjit Singh, Satveer Kour,
and Himali Sarangal
An Empirical Study on Comparison of Machine Learning
Algorithms for Eye-State Classification Using EEG Data . . . . . . . . . . . . . . 113
N. Priyadharshini Jayadurga, M. Chandralekha, and Kashif Saleem
Decoding the UK’s Stance on AI: A Deep Dive into Sentiment
and Topics in Regulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Dwijendra Nath Dwivedi and Ghanashyama Mahanty
Latest Trends on Satellite Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . 141
Sahil Borkar, Krishna Chidrawar, Sakshi Naik, Mousami P. Turuk,
and Vaibhav B. Vaijapurkar
Landmark Detection Using Convolutional Neural Network:
A Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Drishti Bharti, Kumari Priyanshi, and Prabhjot Kaur
An Efficient Illumination Invariant Tiger Detection Framework
for Wildlife Surveillance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Gaurav Pendharkar, A. Ancy Micheal, Jason Misquitta,
and Ranjeesh Kaippada
An Innovative Frequency-Limited Interval Gramians-Based
Model Order Reduction Method Using Singular Value
Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Vineet Sharma and Deepak Kumar
Contrast Enhancement of Medical Images Using Otsu’s Double
Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
R. Vinay, Monika Agarwal, Geeta Rani, and Aparajita Sinha
Recognizing Hate Speech on Twitter with Feature Combo . . . . . . . . . . . . . 209
Jatinderkumar R. Saini and Shraddha Vaidya
Agriculture Yield Forecasting via Regression and Deep Learning
with Machine Learning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Aishwarya V. Kadu and K T V Reddy
Performance Comparison of M-ary Phase Shift Keying and M-ary
Quadrature Amplitude Modulation Techniques Under Fading
Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Tadele A. Abose, Ketema T. Megersa, Kehali A. Jember,
Diriba C. Kejela, Samuel T. Daka, and Moti B. Dinagde
Conception of Indian Monsoon Prediction Methods . . . . . . . . . . . . . . . . . . . 247
Namita Goyal, Aparna N. Mahajan, and K. C. Tripathi

AI-Integrated Smart Toy for Enhancing Cognitive, Emotional,
and Motor Skills in Toddlers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Sara Bansod, Pranita Ranade, and Indresh Kumar Verma
Thumbnail Personalization in Movie Recommender System . . . . . . . . . . . 277
Mathura Bai Baikadolla, Srirachana Narasu Baditha,
Mohanvenkat Patta, and Kavya Muktha
Comparative Analysis of Large Language Models for Question
Answering from Financial Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Shivam Panwar, Anukriti Bansal, and Farhana Zareen
Multilingual Meeting Management with NLP: Automated
Minutes, Transcription, and Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Gautam Mehendale, Chinmayee Kale, Preksha Khatri,
Himanshu Goswami, Hetvi Shah, and Sudhir Bagul
Exploring Comprehensive Privacy Solutions for Enhancing
Recommender System Security and Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Esmita Gupta and Shilpa Shinde
Attribute-Based Encryption for the Internet of Things: A Review . . . . . . 335
Kirti Dinkar More and Dhanya Pramod
A Short Survey Work for Lung Cancer Diagnosis Model:
Algorithms Utilized, Challenging Issues, and Future Research
Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
Nishat Shaikh and Parth Shah
Influence of Music on Brainwave-Based Stress Management . . . . . . . . . . . 377
Neelum Dave and Shreya Dave
Potential Exoplanet Detection Using Feature Selection, Multilayer
Perceptron, and Supervised Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 387
Keshav Sairam, Monika Agarwal, Aparajita Sinha, and K. Pradeep
An Empirical Study on ML Models with Glass Classification
Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Shreyas Visweshwaran, M. Anbazhagan, and K. Ganesh
Design Novel Detection of Exudates Using Wavelets Filter
and Classification of Diabetic Maculopathy . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Chetan Pattebahadur, A. B. Kadam, Anupriya Kamble,
and Ramesh Manza
An Optimized Neural Network Model to Classify Lung Nodules
from CT-Scan Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
Asiya and N. Sugitha
Fake Product Review Monitoring System Using Machine Learning . . . . . 437
Pragya Rajput and Pankaj Kumar Sethi

Perception to Control: End-to-End Autonomous Driving Systems . . . . . . 447
Yoshita, Aman Jatain, Manju, and Sandeep Kumar

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455


Editors and Contributors

About the Editors

Dr. Harish Sharma is Associate Professor in the Department of Computer Science
and Engineering at Rajasthan Technical University, Kota. He has worked at Vardhaman
Mahaveer Open University, Kota, and Government Engineering College, Jhalawar. He received his B.Tech. and M.Tech. degrees in Computer Engineering from Government Engineering College, Kota, and Rajasthan Technical University, Kota, in 2003 and 2009, respectively. He obtained his Ph.D. from ABV-Indian Institute of Information Technology and Management, Gwalior, India. He is Secretary and one of the founder members of the Soft Computing Research Society of India. He is Lifetime Member of the Cryptology Research Society of India, ISI, Kolkata. He is Associate Editor of the International Journal of Swarm Intelligence (IJSI) published by Inderscience. He has also edited special issues of many reputed journals such as Memetic Computing, Journal of Experimental and Theoretical Artificial Intelligence, Evolutionary Intelligence, etc. His primary area of interest is nature-inspired optimization techniques. He has contributed more than 105 papers published in various international journals and conferences.

Dr. Vivek Shrivastava has approximately 20 years of diversified experience in the scholarship of teaching and learning, accreditation, research, industrial, and academic leadership in India, China, and the USA. Presently he holds the position of Dean (Research and Consultancy) at National Institute of Technology Delhi. Prior to his academic assignments, he worked as a System Reliability Engineer at SanDisk Semiconductors in Shanghai, China, and in the USA. Dr. Shrivastava has significant industrial experience of collaborating with industry and government organizations at SanDisk Semiconductors. He has made significant contributions to the design and development of memory products. He has contributed to the development and delivery of the Five-Year Integrated B.Tech.-M.Tech. Program (Electrical Engineering) and the master's program (Power Systems) at Gautam Buddha University, Greater Noida. He has extensive experience in academic administration in various capacities of Dean (Research and


Consultancy), Dean (Student Welfare), Faculty In-charge (Training and Placement),
Faculty In-charge (Library), Nodal Officer (Academics, TEQIP-III), Nodal Officer
(RUSA), and Expert in various committees of AICTE, UGC, etc.

Dr. Ashish Kumar Tripathi (Senior Member, IEEE) received his M.Tech. and Ph.D. degrees in computer science and engineering from the Department of Computer Science and Engineering, Delhi Technological University, Delhi, India, in 2013 and 2019, respectively. He is currently working as Assistant Professor in the Department of Computer Science and Engineering, Malaviya National Institute of Technology (MNIT), Jaipur, India. His research interests include big data analytics, social media analytics, soft computing, image analysis, and natural language processing. Dr. Tripathi has published several papers in international journals and conferences, including IEEE Transactions. He is an active reviewer for several journals of repute.

Dr. Lipo Wang received the bachelor's degree from National University of Defense Technology (China) and Ph.D. from Louisiana State University (USA). He is presently on the faculty of the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. His research interest is artificial intelligence with applications to image/video processing, biomedical engineering, and data mining. He has 330+ publications, a US patent in neural networks, and a patent in systems. He has co-authored 2 monographs and (co-)edited 15 books. He has 8000+ Google Scholar citations, with an H-index of 43. He was Keynote Speaker for 36 international conferences. He is/was Associate Editor/Editorial Board Member of 30 international journals, including 4 IEEE Transactions, and Guest Editor for 10 journal special issues. He was Member of the Board of Governors of the International Neural Network Society, the IEEE Computational Intelligence Society (CIS), and the IEEE Biometrics Council. He served as CIS Vice President for Technical Activities and Chair of the Emergent Technologies Technical Committee, as well as Chair of the Education Committee of the IEEE Engineering in Medicine and Biology Society (EMBS). He was President of the Asia-Pacific Neural Network Assembly (APNNA) and received the APNNA Excellent Service Award. He was Founding Chair of both the EMBS Singapore Chapter and the CIS Singapore Chapter. He serves/served as Chair or Committee Member of over 200 international conferences.

Contributors

Tadele A. Abose Mattu University, Mattu, Ethiopia


Monika Agarwal Dayananda Sagar University, Bangalore, India
Madhav Ajwalia Chandubhai S. Patel Institute of Technology (CSPIT), Faculty
of Technology and Engineering (FTE), Charotar University of Science and
Technology (CHARUSAT), Changa, Gujarat, India

M. Anbazhagan Department of Computer Science, Amrita School of Computing,
Amrita Vishwa Vidyapeetham, Coimbatore, India
Asiya CSE Department, Noorul Islam Center for Higher Education, Thukalay,
Tamil Nadu, India
Srirachana Narasu Baditha Department of Information Technology, VNR
Vignana Jyothi Institute of Engineering and Technology, Hyderabad, Telangana,
India
Sudhir Bagul Dwarkadas J. Sanghvi College of Engineering, Mumbai,
Maharashtra, India
Mathura Bai Baikadolla Department of Information Technology, VNR Vignana
Jyothi Institute of Engineering and Technology, Hyderabad, Telangana, India
Anukriti Bansal Crisp Analytics Private Limited, LUMIQ, Noida, India
Sara Bansod Symbiosis Institute of Design, Symbiosis International (Deemed
University), Pune, India
Drishti Bharti Chandigarh University, Mohali, Punjab, India
Aditya Bhaskar Department of Computer Engineering, Ramrao Adik Institute of
Technology, DY Patil Deemed to be University, Nerul, Navi Mumbai, India
Suvarna Bhoj Livestock Production and Management Section, ICAR-Indian
Veterinary Research Institute, Uttar Pradesh, Izatnagar, Bareilly, India
Sahil Borkar SCTRs Pune Institute of Computer Technology, Pune, India
M. Chandralekha Department of Computer Science and Engineering, Amrita
School of Computing, Amrita Vishwa Vidyapeetham, Chennai, India
Krishna Chidrawar SCTRs Pune Institute of Computer Technology, Pune, India
Samuel T. Daka Mattu University, Mattu, Ethiopia
Shreya Dave Dr. D.Y. Patil Institute of Technology, Pimpri, Pune, India
Neelum Dave Dr. D.Y. Patil Institute of Technology, Pimpri, Pune, India
Chandan Kumar Deb Division of Computer Applications, ICAR-Indian
Agricultural Statistics Research Institute, New Delhi, India
Der-Rong Din Department of Computer Science and Information Engineering,
National Changhua University of Education, Changhua City, Taiwan, R. O. C.
Moti B. Dinagde Mattu University, Mattu, Ethiopia
Triveni Dutt Livestock Production and Management Section, ICAR-Indian
Veterinary Research Institute, Uttar Pradesh, Izatnagar, Bareilly, India
Dwijendra Nath Dwivedi SAS Middle East FZ-LLC, Dubai Media
City-Business Central Towers, Dubai, UAE

K. Ganesh Department of Computer Science, Amrita School of Computing,
Amrita Vishwa Vidyapeetham, Coimbatore, India
Gyanendra Kumar Gaur Livestock Production and Management Section,
ICAR-Indian Veterinary Research Institute, Uttar Pradesh, Izatnagar, Bareilly, India
Himanshu Goswami Dwarkadas J. Sanghvi College of Engineering, Mumbai,
Maharashtra, India
Namita Goyal Maharaja Agrasen University, Baddi, Himachal Pradesh, India
Esmita Gupta Department of Computer Engineering, Ramrao Adik Institute of
Technology, D. Y. Patil Deemed University, Nerul, India
Md. Ashraful Haque Division of Computer Applications, ICAR-Indian
Agricultural Statistics Research Institute, New Delhi, India
Mayur M. Jani Information Technology, Dr. Subhash University, Junagadh,
Gujarat, India
Aman Jatain Department of Computer Science and Technology, Manav Rachna
University, Faridabad, India
Kehali A. Jember Mattu University, Mattu, Ethiopia
Bharti Joshi Department of Computer Engineering, Ramrao Adik Institute of
Technology, DY Patil Deemed to be University, Nerul, Navi Mumbai, India
A. B. Kadam Department of Computer Science, Shri Shivaji Science and Arts
College Chikhli Buldhana, Chikhli, India
Aishwarya V. Kadu Datta Meghe Institute of Higher Education and Research
(DU), Faculty of Engineering and Technology, Wardha, India
Ranjeesh Kaippada Vellore Institute of Technology, Chennai, Tamil Nadu, India
Chinmayee Kale Dwarkadas J. Sanghvi College of Engineering, Mumbai,
Maharashtra, India
Jayakumar Kaliappan Vellore Institute of Technology, Vellore, Tamil Nadu,
India
Anupriya Kamble Department of Computer Science, Vishwakarma University,
Pune, India
Jaspreet Kaur Department of Mathematics, Chandigarh University, Mohali,
Punjab, India
Prabhjot Kaur Chandigarh University, Mohali, Punjab, India
Diriba C. Kejela Mattu University, Mattu, Ethiopia
Preksha Khatri Dwarkadas J. Sanghvi College of Engineering, Mumbai,
Maharashtra, India

Satveer Kour Department of Computer Engineering and Technology, Guru
Nanak Dev University, Amritsar, India
Deepak Kumar MNNIT Allahabad, Prayagraj, Uttar Pradesh, India
Sandeep Kumar Department of Computer Science and Engineering, School of
Engineering and Technology, CHRIST (Deemed to Be University), Bangalore,
Karnataka, India
Aparna N. Mahajan Maharaja Agrasen University, Baddi, Himachal Pradesh,
India
Ghanashyama Mahanty Department of Analytical and Applied Economics,
Utkal University, Orissa, India
Manju Department of Computer Science and Engineering and Information
Technology Jaypee Institute of Information Technology, Noida, India
Ramesh Manza Department of Computer Science and Information Technology,
Dr. Babasaheb Ambedkar Marathwada University Aurangabad Maharashtra,
Aurangabad, India
Sudeep Marwaha Division of Computer Applications, ICAR-Indian Agricultural
Statistics Research Institute, New Delhi, India
Ketema T. Megersa Mattu University, Mattu, Ethiopia
Gautam Mehendale Dwarkadas J. Sanghvi College of Engineering, Mumbai,
Maharashtra, India
A. Ancy Micheal Vellore Institute of Technology, Chennai, Tamil Nadu, India
Jason Misquitta Vellore Institute of Technology, Chennai, Tamil Nadu, India
Kirti Dinkar More Department of Computer Science, MVP’s K. T. H. M.
College, Nashik, India
Kavya Muktha Department of Information Technology, VNR Vignana Jyothi
Institute of Engineering and Technology, Hyderabad, Telangana, India
Sakshi Naik SCTRs Pune Institute of Computer Technology, Pune, India
Heli Nandani Chandubhai S. Patel Institute of Technology (CSPIT), Faculty of
Technology and Engineering (FTE), Charotar University of Science and
Technology (CHARUSAT), Changa, Gujarat, India
Ashok Pal Department of Mathematics, Chandigarh University, Mohali, Punjab,
India
Ganesh Prasad Pal Department of Computer Science and Engineering, Jaypee
Institute of Information Technology, Noida, India

Raju Pal Department of Computer Science and Engineering, School of
Information and Communication Technology, Gautam Buddha University, Greater
Noida, India
Sandip R. Panchal Electronics and Communication Engineering, Dr. Subhash
University, Junagadh, Gujarat, India
Shivam Panwar Crisp Analytics Private Limited, LUMIQ, Noida, India
Hemant H. Patel Computer Science and Engineering, Dr. Subhash University,
Junagadh, Gujarat, India
Mohanvenkat Patta Department of Information Technology, VNR Vignana
Jyothi Institute of Engineering and Technology, Hyderabad, Telangana, India
Chetan Pattebahadur Department of Computer Science and Information
Technology, Dr. Babasaheb Ambedkar Marathwada University Aurangabad
Maharashtra, Aurangabad, India
Gaurav Pendharkar Vellore Institute of Technology, Chennai, Tamil Nadu, India
K. Pradeep Dayananda Sagar University, Bangalore, India
Priteshkumar Prajapati Chandubhai S. Patel Institute of Technology (CSPIT),
Faculty of Technology and Engineering (FTE), Charotar University of Science and
Technology (CHARUSAT), Changa, Gujarat, India
Dhanya Pramod Department of Computer Studies, Symbiosis Centre for
Information Technology, Symbiosis International (Deemed) University, Pune, India
N. Priyadharshini Jayadurga Department of Computer Science and
Engineering, Amrita School of Computing, Amrita Vishwa Vidyapeetham,
Chennai, India
Kumari Priyanshi Chandigarh University, Mohali, Punjab, India
Ashwin Raiyani Information Management, Nirma University, Ahmedabad,
Gujarat, India
Riya Raj Vellore Institute of Technology, Vellore, Tamil Nadu, India
Pragya Rajput Department of Computer Science Engineering, Chandigarh
University Punjab, Mohali, India
Pranita Ranade Symbiosis Institute of Design, Symbiosis International (Deemed
University), Pune, India
Geeta Rani Manipal University, Jaipur, India
K T V Reddy Datta Meghe Institute of Higher Education and Research (DU),
Faculty of Engineering and Technology, Wardha, India
Jatinderkumar R. Saini Symbiosis Institute of Computer Studies and Research,
Symbiosis International (Deemed University), Pune, India

Keshav Sairam Dayananda Sagar University, Bangalore, India


Kashif Saleem King Saud University, Riyadh, Saudi Arabia
Himali Sarangal Department of Engineering and Technology, Guru Nanak Dev
University Regional Campus, Jalandhar, India
Pankaj Kumar Sethi Department of Computer Science Engineering, Chandigarh
University Punjab, Mohali, India
Devanshi Shah Chandubhai S. Patel Institute of Technology (CSPIT), Faculty of
Technology and Engineering (FTE), Charotar University of Science and
Technology (CHARUSAT), Changa, Gujarat, India
Hetvi Shah Dwarkadas J. Sanghvi College of Engineering, Mumbai,
Maharashtra, India
Parth Shah Smt. K. D. Patel Department of Information Technology, Chandubhai
S. Patel Institute of Technology (CSPIT), Faculty of Technology and Engineering
(FTE), Charotar University of Science and Technology (CHARUSAT), Changa,
Gujarat, India
Rachit Shah Chandubhai S. Patel Institute of Technology (CSPIT), Faculty of
Technology and Engineering (FTE), Charotar University of Science and
Technology (CHARUSAT), Changa, Gujarat, India
Shail Shah Chandubhai S. Patel Institute of Technology (CSPIT), Faculty of
Technology and Engineering (FTE), Charotar University of Science and
Technology (CHARUSAT), Changa, Gujarat, India
Nishat Shaikh Smt. K. D. Patel Department of Information Technology,
Chandubhai S. Patel Institute of Technology (CSPIT), Faculty of Technology and
Engineering (FTE), Charotar University of Science and Technology
(CHARUSAT), Changa, Gujarat, India
Vineet Sharma MNNIT Allahabad, Prayagraj, Uttar Pradesh, India;
Poornima College of Engineering, Jaipur, Rajasthan, India
Shilpa Shinde Department of Computer Engineering, Ramrao Adik Institute of
Technology, D. Y. Patil Deemed University, Nerul, India
Butta Singh Department of Engineering and Technology, Guru Nanak Dev
University Regional Campus, Jalandhar, India
Manjit Singh Department of Engineering and Technology, Guru Nanak Dev
University Regional Campus, Jalandhar, India
Aparajita Sinha Dayananda Sagar University, Bangalore, India
N. Sugitha Saveetha Engineering College, Chennai, India
Ayon Tarafdar Livestock Production and Management Section, ICAR-Indian
Veterinary Research Institute, Uttar Pradesh, Izatnagar, Bareilly, India

K. C. Tripathi Maharaja Agrasen Institute of Technology, GGSIPU, New Delhi,
India
Mousami P. Turuk SCTRs Pune Institute of Computer Technology, Pune, India
Shraddha Vaidya Symbiosis Institute of Computer Studies and Research,
Symbiosis International (Deemed University), Pune, India
Vaibhav B. Vaijapurkar SCTRs Pune Institute of Computer Technology, Pune,
India
Indresh Kumar Verma Symbiosis Institute of Design, Symbiosis International
(Deemed University), Pune, India
Kartik Verma Department of Engineering and Technology, Guru Nanak Dev
University Regional Campus, Jalandhar, India
R. Vinay Dayananda Sagar University, Bangalore, India
Shreyas Visweshwaran Department of Computer Science, Amrita School of
Computing, Amrita Vishwa Vidyapeetham, Coimbatore, India
Yoshita Department of Computer Science and Engineering, Amity University,
Gurugram, Haryana, India
Farhana Zareen Crisp Analytics Private Limited, LUMIQ, Noida, India
Multilingual Speech Recognition:
An In-Depth Review of Applications,
Challenges, and Future Directions

Mayur M. Jani, Sandip R. Panchal, Hemant H. Patel, and Ashwin Raiyani

Abstract Speech is essential in communication because it allows people to convey their thoughts, feelings, and ideas in different languages. However, due to the complexities of multilingual speech, it can be difficult to recognize each word in its associated language correctly. Fortunately, thanks to technological improvements, various automatic speech-to-text tools are available that can translate diverse languages into the required output language, thereby reducing linguistic barriers during communication. This review article offers an overview of the many applications, problems, and methodologies involved in developing multilingual speech-to-text technology. It also examines possible areas for the future advancement and development of this technology. Overall, the review emphasizes the critical role that multilingual speech-to-text technology can play in breaking down language barriers and facilitating cross-linguistic communication.

Keywords Speech recognition · Multilingual speech · Deep learning · Speech-to-text (STT)

M. M. Jani
Information Technology, Dr. Subhash University, Junagadh, Gujarat 362001, India
e-mail: mayur.jani@dsuni.ac.in
S. R. Panchal
Electronics and Communication Engineering, Dr. Subhash University, Junagadh, Gujarat 362001,
India
e-mail: sandip.panchal@dsuni.ac.in
H. H. Patel
Computer Science and Engineering, Dr. Subhash University, Junagadh, Gujarat 362001, India
e-mail: hemant.patel@dsuni.ac.in
A. Raiyani (B)
Information Management, Nirma University, Ahmedabad, Gujarat 382421, India
e-mail: ashwin.raiyani@nirmauni.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 1
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_1

1 Introduction

Automatic speech recognition (ASR) is the technology that translates spoken words
into text. Speech recognition has become an
important tool in many industries, including health care, finance, education, and
customer service. It enables more natural and intuitive interaction between people,
computers, and other devices, as keyboards and other input devices are not required.
The necessity for multilingual speech recognition has grown significantly as the
globe becomes increasingly interconnected. With the use of this technology, people
from various linguistic backgrounds may communicate in their native tongues with
one another, and with computers, therefore lowering communication barriers.

1.1 Automatic Speech Recognition

An ASR system’s primary responsibility is translating incoming speech into the
proper text. ASR systems combine acoustic and linguistic models to translate speech
to text. Figure 1 aids in comprehension of how ASR systems operate [1–3].
Linear Spectral Frequencies (LSF), Discrete Wavelet Transform (DWT), Mel-
Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients
(LPCC), Linear Prediction Coefficients (LPC), and Perceptual Linear Prediction
(PLP) are the methods used to extract speech features [1, 4, 5].
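As a concrete illustration of one of these methods, the MFCC pipeline (framing, windowing, power spectrum, mel filter bank, log compression, DCT) can be sketched in NumPy. The frame size, hop length, and filter-bank parameters below are illustrative choices, not values taken from the cited works:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Toy MFCC pipeline: frame -> Hamming window -> power spectrum ->
    mel filter bank -> log -> DCT-II (keep the first n_ceps coefficients)."""
    frames = np.array([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    hz2mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    mel2hz = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    bins = np.floor((n_fft + 1) * mel2hz(np.linspace(0, hz2mel(sr / 2),
                                                     n_mels + 2)) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):                      # triangular filters
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_e = np.log(power @ fbank.T + 1e-10)

    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_e @ dct.T

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
print(feats.shape)   # (num_frames, n_ceps)
```

One second of 16 kHz audio with these settings yields 97 frames of 13 coefficients each.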
By using the repetitive pattern of speech and Gaussian Mixture Models (GMM)
and Hidden Markov Models (HMMs) to represent the underlying phonetic units,
speech recognition systems can accurately translate spoken language into written
text. Acoustic models, language models, and decoding methods form the basis of
modern speech recognition technology [5, 6].
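The decoding step of such an HMM-based recognizer can be illustrated with a toy Viterbi search over two phone-like states with Gaussian emissions. The transition probabilities, means, and observations below are invented for the example:

```python
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """Most likely state path. log_A: (S,S) transition log-probabilities,
    log_pi: (S,) initial log-probabilities, log_B: (T,S) per-frame emission
    log-likelihoods."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[i, j]: prev i -> cur j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # trace back the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two phone-like states with 1-D Gaussian emissions (means 0.0 and 3.0)
means = np.array([0.0, 3.0])
obs = np.array([0.1, -0.2, 0.3, 2.9, 3.1, 2.8])
log_B = -0.5 * (obs[:, None] - means[None, :]) ** 2   # log N(obs; mean, 1), up to a constant
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])              # sticky transitions
log_pi = np.log([0.5, 0.5])
print(viterbi(log_A, log_pi, log_B))  # [0, 0, 0, 1, 1, 1]
```

The decoder correctly segments the observation stream into one run per phone state despite the sticky transition prior.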

Fig. 1 Automatic speech recognition process



1.2 Multilingual Speech-to-Text Conversion

Extending a conventional automatic speech recognition (ASR) system trained on a
single language to handle many languages is challenging [7]. So, instead of training
a separate model for each language, a multilingual speech-to-text system uses a
single, unified model to handle multiple languages simultaneously [8].
Simple ASR systems are typically trained in a specific language and may not
perform well in languages that differ significantly from the training data. Speech-
to-text systems can adapt to multiple languages using language-specific models,
improving accuracy and performance. In multilingual environments, people often
mix or switch between languages within a single conversation. Multilingual speech-
to-text systems are better equipped to handle code-switching scenarios and accurately
transcribe or translate speech that includes multiple languages seamlessly. Multilin-
gual speech-to-text systems enable users to search and retrieve information from
spoken content across different languages. This is particularly valuable for multilin-
gual transcription services, language learning platforms, or cross-language content
analysis applications.
Section 2 describes various development techniques and approaches for the
multilingual STT system. Section 3 discusses various feature extraction techniques,
feature classification techniques, and toolkits for multilingual STT systems. Section 4
presents applications of multilingual STT systems, Sect. 5 presents challenges of
multilingual STT, and Sect. 6 gives future directions.

2 Techniques and Approaches

The suitability of each technique depends on the specific requirements, available
resources, and the target languages involved. Combining these techniques can also
achieve optimal performance in multilingual speech-to-text systems. Considering
the desired application and linguistic variety, evaluating each approach’s advantages
and disadvantages is critical. Table 1 compares various techniques and approaches
based on the previous papers.

2.1 Acoustic Model Sharing

Technique: In this approach, a single acoustic model is trained to recognize
speech from multiple languages. The model parameters are shared across languages,
allowing it to handle different languages with a single model.
Strengths: Acoustic model sharing reduces the complexity of the system and the need
for separate models for each language. It requires less training data and computational
resources.

Table 1 Comparison of various techniques and approaches: for each of Refs. [1–15],
the table marks which of the three approaches the work uses: acoustic model sharing,
code-switching models, or language-specific models.

Weaknesses: Since a single model is used, the system might not capture language-
specific nuances effectively. Performance might vary across languages, with some
languages being more accurately recognized than others.
Suitability: Acoustic model sharing is suitable when there is limited training data
for individual languages or when the focus is on low-resource languages.
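A minimal sketch of the parameter-sharing idea: one output layer over a pooled multilingual phone inventory, reused unchanged for every language. The phone set, feature dimension, and random weights below are placeholders, not a trained model:

```python
import numpy as np

# One acoustic model shared by all languages: a single weight matrix mapping
# acoustic features to a pooled multilingual phone inventory.
PHONES = ["a", "i", "u", "k", "t", "sil"]      # illustrative union phone set
rng = np.random.default_rng(0)
W = rng.normal(size=(13, len(PHONES)))         # 13-dim features -> phone logits

def phone_posteriors(frame):
    """Softmax over the shared phone inventory, identical for every language."""
    logits = frame @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = phone_posteriors(rng.normal(size=13))
print(p.shape, round(float(p.sum()), 3))       # (6,) 1.0
```

Because every language scores frames against the same pooled inventory, no per-language parameters are needed; the trade-off, as noted above, is that language-specific nuances may be lost.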

2.2 Code-Switching Models

Technique: Code-switching models are designed to handle speech that combines
multiple languages, recognizing and transcribing language-switching speech
elements.
Strengths: Code-switching models are suitable for languages with frequent mixing,
as they accurately transcribe and decode speech, promoting more natural language
processing.
Weaknesses: Code-switching models pose challenges in training due to the need for
a large and diverse dataset to manage multiple languages in a single model effectively.
Suitability: Code-switching models are suitable for multilingual communities, social
media analysis, and customer service interactions in code-switching countries.
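One simple ingredient of code-switching transcription can be sketched as follows: smoothing noisy per-frame language labels (as a frame-level language identifier might emit them) and collapsing them into language spans. The frame labels and window size are invented for illustration:

```python
from collections import Counter

def smooth_labels(labels, win=5):
    """Majority-vote smoothing of per-frame language labels to suppress flips."""
    half = win // 2
    return [Counter(labels[max(0, i - half): i + half + 1]).most_common(1)[0][0]
            for i in range(len(labels))]

def segments(labels):
    """Collapse a frame-level label sequence into (language, start, end) spans."""
    spans, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            spans.append((labels[start], start, i))
            start = i
    return spans

frames = ["en"] * 8 + ["hi"] + ["en"] * 3 + ["hi"] * 10   # one spurious 'hi' frame
print(segments(smooth_labels(frames)))  # [('en', 0, 12), ('hi', 12, 22)]
```

The spurious single-frame flip is absorbed by the vote, leaving one clean switch point that a downstream recognizer could use to change language models.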

2.3 Language-Specific Models

Technique: This approach trains separate acoustic models for each target language.
Each model is specialized in recognizing speech in a specific language, capturing its
unique phonetic and linguistic characteristics.
Strengths: Language-specific models tend to achieve higher accuracy as they can
focus on the intricacies of individual languages. They are effective for languages
with significant pronunciation and vocabulary differences.
Weaknesses: Training and maintaining separate models for each language require
more resources and computational power. It can be challenging to collect sufficient
training data for low-resource languages.
Suitability: Language-specific models are suitable when the goal is to achieve high
accuracy and performance for individual languages, especially for languages with
distinct phonetic features.
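Architecturally, language-specific models are often combined with a language-identification front end that routes each utterance to the right recognizer. The stub recognizers and identifier below are purely hypothetical placeholders for real per-language models:

```python
# Hypothetical per-language recognizers behind a language-identification front end.
def recognize_en(audio):
    return "hello world"            # stand-in for an English acoustic/language model

def recognize_hi(audio):
    return "namaste duniya"         # stand-in for a Hindi model

MODELS = {"en": recognize_en, "hi": recognize_hi}

def route(audio, identify_language):
    """Run language ID first, then dispatch to the matching language-specific model."""
    lang = identify_language(audio)
    if lang not in MODELS:
        raise ValueError(f"no model for language {lang!r}")
    return lang, MODELS[lang](audio)

print(route(b"raw-audio-bytes", lambda audio: "hi"))  # ('hi', 'namaste duniya')
```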

3 Feature Extraction Techniques and Toolkit/Software

Automated speech recognition (ASR) systems extract features to convert
unprocessed speech signals into concise, informative representations for
transcription. These features preserve essential information and are
speaker-independent, capturing common speech patterns for generalization.
A well-chosen feature extraction approach is crucial for the success of the ASR
system, as it directly affects the accuracy and performance of the transcription
process. As mentioned in Table 2, various feature extraction methods are commonly
used in multilingual STT.
In multilingual speech-to-text (STT) applications, feature classification techniques
are crucial for accurately converting spoken language from various languages into
written text. Some feature classification techniques commonly used for multilingual
speech-to-text are mentioned in Table 3. Some common toolkits for multilingual
speech-to-text (STT) systems are listed below.
Kaldi: Kaldi is a highly customizable, freely downloadable framework that covers a
wide range of languages and is widely used in speech recognition research [16–19].
CMU Sphinx (PocketSphinx): A freely accessible speech recognition system created
at Carnegie Mellon University. It can run offline and supports multilingual speech
recognition [1, 3, 4, 17, 20].
Mozilla DeepSpeech: Mozilla’s open-source speech recognition engine, powered
by deep learning, supports multiple languages and aims to provide accurate
speech-to-text capabilities [21].
Google Cloud Speech-to-Text: Part of the Google Cloud Platform, this service
provides an API for converting speech to text in multiple languages. It’s a cloud-based
solution with good multilingual support [1, 12].
Microsoft Azure Speech Service: A cloud-based service offered by Microsoft,
which includes multilingual speech-to-text capabilities. It supports several languages
and provides both real-time and batch-processing options.

4 Applications of Multilingual Speech Recognition

Multilingual speech recognition has numerous applications across various fields.
Here are a few of the most important applications:
Language Translation: Multilingual speech recognition can drive speech-to-text
translation systems. These systems take spoken input in one or more languages,
turn it into text, and then translate it into the required language. This tool
proves exceptionally useful for accommodating multilingual needs in conferences,
chat rooms, tourism, and international business interactions [1, 2, 4, 5, 8].

Table 2 Feature extraction techniques


S. No. Category Comments
1 Linear predictive LPC is a powerful signal analysis method in speech recognition,
coding (LPC) approximating speech as a linear combination of past samples and
[22–26] determining its basic parameters and computational model
2 Linear predictive Feature extraction compresses speech signals into measures,
cepstral obtaining LPCC coefficients through autocorrelation method, but
coefficients has a high sensitivity to quantization noise
(LPCC) [22–24]
3 Discrete Wavelet DWT captures temporal and frequency information for better
Transform speech signal representation, but its higher computation complexity
(DWT) [22, 23, may increase computational overhead in real-time speech
25] recognition systems
4 Mel-Frequency MFCCs, extracted from short-time speech frames using a
Cepstral Mel-frequency filter bank, logarithm, and Discrete Cosine
Coefficient Transform, represent speech spectral characteristics effectively, but
(MFCC) [22–25, may lose fine temporal information in rapid speech changes
27]
5 Linear LDA is a dimensionality reduction method that maximizes
discriminant inter-class and intra-class distance in speech features, improving
analysis (LDA) class discrimination and recognition accuracy, but may be
[23, 24] susceptible to outliers
6 Principal PCA is a technique that reduces data dimensionality, improving
component computational efficiency and noise reduction in speech recognition,
analysis (PCA) but may not capture complex nonlinear relationships
[23]
7 Probabilistic PLDA is a statistical model used for speaker verification and
linear diarization, accounting for speaker-specific information but
discriminant requiring significant training data per speaker, potentially limiting
analysis (PLDA) its effectiveness in low-resource scenarios
[23, 24]
8 RASTA filtering RASTA filtering normalizes speech signals by modeling vocal tract
[23–25] dynamics, improving robustness in speech recognition systems.
However, it may lose fine spectral details, affecting recognition
accuracy
9 Wavelet packet WPD decomposes signals into wavelet packets, offering flexible
decomposition time–frequency representation and fine-grained control for speech
(WPD) [23] dynamics, but can be computationally intensive and require
parameter tuning
10 Perceptual PLP improves speech recognition by removing unwanted speech
Linear Prediction information and adapting spectral characteristics to match the
(PLP) [23, 26] human auditory system compared to LPC

Table 3 Feature classification techniques


S. Category Comments
No.
1 Support vector machines [20, This supervised algorithm is effective for binary
24, 28, 29] classification but struggles with speech recognition,
whose variable-length inputs do not fit the fixed-length
vectors that SVMs require
2 Hidden Markov model [1, 6, The computational complexity and storage
17, 20, 23, 24] requirements are higher for this unsupervised model.
To tackle the intersection issues, this approach needs
more performance data
3 Vector quantization [17, 20, The storage requirements for real-time deployment of
23, 24, 27] this unsupervised approach are manageable, and the
method is computationally simple
4 Gaussian mixture model [17, This unsupervised model needs only a small amount of
20, 30, 31] training and testing data. Its main limitation is that it is
a compromise between DTW and HMM
5 Convolutional neural networks CNNs use convolutional layers to identify spatial
(CNNs) [18, 32, 33] patterns in input data, benefiting image and voice
processing applications. However, they require
significant training data and resources, potentially
leading to overfitting
6 Recurrent neural networks RNNs are artificial neural networks used for handling
(RNNs) [18, 32, 33] sequential data, detecting long-term relationships, and
ideal for speech recognition and natural language
processing. However, gradient difficulties can impact
training stability
7 Deep neural networks (DNN) DNNs are a machine learning technique used in
[19, 32, 33] multilingual voice recognition systems to enhance
speech recognition accuracy by learning intricate
speech signal patterns

Voice Assistants: Multilingual speech recognition is critical for voice assistant
systems such as Siri, Google Assistant, and Alexa. These assistants must comprehend
and respond to user orders and enquiries in several languages. Multilingual speech
recognition allows voice assistants to properly transcribe and understand user speech
in various languages, increasing their accessibility and usability [1, 5].
Call Centers: Many businesses run worldwide call centers where clients from many
nations and languages seek assistance. Multilingual voice recognition helps contact
center operators communicate with customers in their native languages more
efficiently. The system can transcribe a client’s speech, automatically translate it,
and give real-time suggestions to help operators reply more effectively [2, 11, 12].
Transcription Services: Multilingual speech recognition can automatically tran-
scribe audio content in different languages. This application finds utility in areas like

media, market research, legal services, and academia, where transcription of audio
recordings is required. It eliminates the need for manual transcription, saving time
and effort [3, 7, 8, 13, 34–36].
Education: Smart attendance based on voice recognition technology is a more
efficient method that saves time for teachers. Students can also use such technology
to prepare for or recall lecture topics, and several tools have been employed to
create lecture notes easily [5].

5 Challenges and Limitations

Multilingual speech-to-text (STT) systems must overcome various hurdles to achieve
accurate and reliable transcription across different languages. Some of the major
issues with multilingual STT are as follows:
Language Diversity: Multilingual STT systems need to handle a wide range of
languages with varying phonetic, syntactic, and semantic characteristics. Each
language may have unique speech patterns, accents, dialects, and linguistic complex-
ities, requiring the system to be adaptable and robust to handle these variations
[5, 6, 8, 9, 37].
Limited Training Data: Obtaining huge volumes of transcribed speech data for each
target language might be difficult. A robust multilingual STT system needs a large
amount of data for each language, including various speakers and speech patterns.
Addressing the scarcity of training data for low-resource languages is particularly
challenging [10, 13].
Code-Switching and Mixed-Language Input: Multilingual environments often
involve code-switching, where multiple languages are spoken within a single
utterance. STT systems must be capable of accurately detecting and transcribing
code-switching scenarios, capturing the transitions between languages seamlessly
[4, 28].
Speaker Variability: Speakers may have different accents, speaking rates, speech
styles, and individual idiosyncrasies, which can affect the performance of multilin-
gual STT systems. Handling speaker variability and adapting the system to different
speakers across multiple languages is crucial for achieving high transcription
accuracy [16, 38, 39].
Vocabulary and Terminology Variations: Different languages may have distinct
vocabularies, dialect-specific words, and variations in pronunciation. Building
and maintaining language-specific lexicons and language models that accurately
capture vocabulary and terminology variations pose challenges for multilingual
STT systems [22].

Handling Out-of-Vocabulary Words: Multilingual STT systems must handle words
or terms that are not present in their pre-defined vocabulary. Handling out-of-
vocabulary words, including proper nouns, domain-specific terms, or new words,
requires effective strategies such as dynamic vocabulary expansion or context-based
approaches [26, 27].
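A minimal sketch of one such fallback strategy, assuming a pronunciation lexicon plus a naive grapheme-level guess for out-of-vocabulary words (the lexicon entries and fallback rule are illustrative only, not drawn from any cited system):

```python
def transcribe_word(word, lexicon, g2p):
    """Lexicon lookup with a fallback pronunciation guess for OOV words."""
    if word in lexicon:
        return lexicon[word]
    return g2p(word)            # hypothetical grapheme-to-phoneme fallback

lexicon = {"speech": "S P IY CH", "text": "T EH K S T"}     # toy entries
naive_g2p = lambda w: " ".join(w.upper())                   # one symbol per letter
for w in ["speech", "kaldi"]:
    print(w, "->", transcribe_word(w, lexicon, naive_g2p))
# speech -> S P IY CH
# kaldi -> K A L D I
```

Real systems replace the naive fallback with a trained grapheme-to-phoneme model or subword units, but the lookup-then-fallback structure is the same.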
Domain Adaptation: Transcribing speech from specific domains or specialized
fields, such as medical or legal, presents additional challenges. Adapting the multi-
lingual STT system to domain-specific vocabulary, jargon, and acronyms is essential
to ensure accurate transcriptions in specialized domains [26].
Real-Time Processing: Multilingual STT systems may need to process speech in real
time, such as in live events or teleconferences. Meeting the low-latency requirements
for real-time applications while maintaining transcription accuracy is a challenge that
needs to be addressed [30, 40].
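Streaming recognizers typically bound latency by consuming audio in small fixed-duration chunks rather than whole utterances; a minimal sketch of such chunking (with an illustrative 200 ms chunk size) is:

```python
def stream_chunks(samples, sr=16000, chunk_ms=200):
    """Yield fixed-duration audio chunks, as a streaming recognizer consumes them."""
    step = sr * chunk_ms // 1000
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

audio = [0] * 16000                      # one second of silent dummy samples
chunks = list(stream_chunks(audio))
print(len(chunks), len(chunks[0]))       # 5 3200: five 200 ms chunks
```

Each chunk would be fed to the recognizer as it arrives, with partial hypotheses emitted and revised until the utterance ends.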
Speaker Diarization and Segmentation: In multilingual scenarios, accurately
segmenting and identifying individual speakers, especially in overlapping speech,
are crucial for proper transcription and speaker attribution. To successfully deal with
these issues, strong speaker diarization approaches are necessary [16].
Evaluation and Benchmarking: Standardized evaluation metrics and bench-
mark datasets for multilingual STT are required to compare and assess different
systems’ performance systematically. Creating assessment protocols that represent
the intricacies and complexity of multilingual transcribing is a never-ending task.
Integrating methods from speech analysis, natural language processing, machine
learning, and cross-lingual modeling are required to address these challenges.
Advances in gathering data, linguistic assets, acoustic and language modeling,
and adaptation techniques are critical for improving the accuracy and usability of
multilingual STT systems.

6 Future Directions

Multilingual speech-to-text systems will become more widely available and acces-
sible, allowing more individuals to utilize and profit from them. The accuracy of
multilingual voice recognition systems will continue to increase as deep learning
algorithms are used and more training data becomes available. Integrating STT and
text-to-speech (TTS) capabilities with other technologies, such as machine transla-
tion and natural language processing, will enable more comprehensive and seam-
less communication across several languages. Overall, the future of multilingual
voice-to-text systems is predicted to entail the development of increasingly accu-
rate, efficient, and accessible systems capable of facilitating successful communica-
tion across diverse cultures and languages. These systems will become increasingly
significant in improving information accessibility, breaking down language barriers,

and improving the precision and effectiveness of language learning and translation
technologies.

7 Conclusions

Given the scarcity of research in this field, this study set out to provide an
evaluation of multilingual speech-to-text (STT) systems. The study’s primary goals
are to explore the difficulties and constraints encountered in creating multilingual
STT systems. These may involve, among other things, challenges with linguistic
variety, code-switching, speaker variability, vocabulary variances, and real-time
processing. The study aims to find different tools and approaches for improving the
efficacy and accuracy of multilingual STT systems. This might include investigating
language modeling, acoustic modeling, speaker diarization, vocabulary adaptation,
and domain-specific adaptation approaches. The research initiative will research
methods for enhancing recognition speed in multilingual STT systems, particularly
when dealing with mixed-language speech.
Additionally, lowering the Word Error Rate, which evaluates the correctness of
transcriptions, will be a priority. The study seeks to shed light on open problems and suggest new lines of
inquiry in multilingual speech-to-text. This might include investigating new methods,
data gathering approaches, assessment metrics, or tackling issues with multilingual
STT. This study aims to identify current barriers to multilingual STT system devel-
opment, highlight useful resources and techniques, and offer recommendations for
future research in this field. The ultimate objective is to improve multilingual speech-
to-text technology’s effectiveness, precision, and application to satisfy the demands
of bilingual and multilingual users in various communication scenarios.
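Word Error Rate, identified above as a key evaluation target, is conventionally computed as the word-level Levenshtein distance between reference and hypothesis divided by the reference length; a minimal implementation:

```python
def wer(ref, hyp):
    """Word Error Rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                          d[i - 1][j] + 1,                            # deletion
                          d[i][j - 1] + 1)                            # insertion
    return d[len(r)][len(h)] / len(r)

print(round(wer("the cat sat on the mat", "the cat sat mat"), 3))  # 0.333
```

Here two reference words are deleted out of six, giving a WER of 2/6; note that WER can exceed 1.0 when the hypothesis contains many insertions.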

References

1. Salini R, Safrin P, Shanmugapriyaa P, Sindhu S (2018) Switching between multiple languages
based on speech recognition and translation. Int J Eng Res Technol (IJERT). ISSN: 2278-0181
2. Patil S et al (2016) Multilingual speech and text recognition and translation using image. Int J
Eng Res 5(4)
3. Gopi A et al (2015) Multilingual speech to speech Mt based chat system. In: 2015 International
conference on computing and network communications (COCONET). IEEE
4. Deepak Reddy P, Rudresh C, Adithya AS (2022) Multilingual speech to text using deep learning
based on Mfcc features. Mach Learn Appl: Int J (MLAIJ) 9(2)
5. Sirigineedi AV et al (2020) A novel real time voice-based approach for multilingual web data
extraction with Raspberry Pi. UGC Care Listed (Group I) J 9(2). 2012 IJFANS. All Rights
Reserved
6. Bourlard H et al (2011) Current trends in multilingual speech processing. Sadhana 36:885–915
7. Biswas A et al (2022) Code-switched automatic speech recognition in five South African
languages. Comput Speech Lang 71:101262
8. Bano S et al (2020) Speech to text translation enabling multilingualism. In: 2020 IEEE
International conference for innovation in technology (INOCON). IEEE

9. Mussakhojayeva S et al (2023) Multilingual speech recognition for Turkic languages.
Information 14(2):74
10. Padmane P, Pakhale A et al (2022) Multilingual speech and text recognition and translation.
Int J Innov Eng Sci. E-ISSN: 2456-346
11. Nowakowski K et al (2023) Adapting multilingual speech representation model for a new, under
resourced language through multilingual fine-tuning and continued pretraining. Inf Process
Manage 60(2):103148
12. Rodríguez LM, Cox C (2023) Speech-to-text recognition for multilingual spoken data in
language documentation. In: Proceedings of the sixth workshop on the use of computational
methods in the study of endangered languages
13. Weng F et al (1997) A study of multilingual speech recognition. In: Fifth European conference
on speech communication and technology
14. Krishnan CG, Harold Robinson Y, Chilamkurti N (2020) Machine learning techniques for
speech recognition using the magnitude. J Multimedia Inf Syst 7(1):33–40
15. Mohamed NA et al (2023) Multilingual speech recognition initiative for African languages
16. Ma JZ et al (2017) Improving deliverable speech-to-text systems with multilingual knowledge
transfer. Interspeech
17. Singh W (2020) Multilingual speech to text conversion–a review
18. Wang Y, Wang H (2017) Multilingual convolutional, long short-term memory, deep neural
networks for low resource speech recognition. Procedia Comput Sci 107:842–847
19. Cho J et al (2018) Multilingual sequence-to-sequence speech recognition: architecture, transfer
learning, and language modeling. In: 2018 IEEE spoken language technology workshop (SLT).
IEEE
20. Hemakumar G, Punitha P (2013) Speech recognition technology: a survey on Indian languages.
Int J Inf Sci Intell Syst 2(4):1–38
21. Ardila R et al (2019) Common voice: a massively multilingual speech corpus. arXiv preprint
arXiv:1912.06670
22. Ghule KR, Deshmukh RR (2015) Feature extraction techniques for speech recognition: a
review. Int J Sci Eng Res 6(5). ISSN 2229-5518
23. Gaikwad SK, Gawali BW, Yannawar P (2010) A review on speech recognition technique. Int
J Comput Appl 10(3):16–24
24. Bhuvnesh M, Hardik et al (2018) Feature extraction and classification techniques of automatic
speech recognition system: a review. Int J Creative Res Thoughts (IJCRT) 6(2). ISSN: 2320-
2882
25. Kurzekar PK et al (2014) A comparative study of feature extraction techniques for speech
recognition systems. Int J Innov Res Sci Eng Technol 3(12):18006–18016
26. Kesarkar MP, Rao P (2003) Feature extraction for speech recognition. Electronic Systems, EE
Department, IIT Bombay
27. Mohammed HM et al (2018) Speech recognition system with different methods of feature
extraction. Int J Innov Res Comput Commun Eng 6(3):1–10
28. Ghadage YH, Shelke SD (2016) Speech to text conversion for multilingual languages. In: 2016
International conference on communication and signal processing (ICCSP). IEEE
29. Lin H et al (2012) Recognition of multilingual speech in mobile applications. In: 2012 IEEE
International conference on acoustics, speech, and signal processing (ICASSP). IEEE
30. Garcia EG, Mengusoglu E, Janke E (2007) Multilingual acoustic models for speech recognition
in low-resource devices. In: 2007 IEEE International conference on acoustics, speech and signal
processing (ICASSP 07), vol 4. IEEE
31. Gitanjali W (2016) Multilingual speech recognition and language identification. Int J Modern
Trends Eng Res. E-ISSN: 2349-9745
32. Luo J et al (2022) Adaptive activation network for low resource multilingual speech recognition.
In: 2022 International joint conference on neural networks (IJCNN). IEEE
33. Alashban AA et al (2022) Spoken language identification system using convolutional recurrent
neural network. Appl Sci 12(18):9181

34. Iranzo-sánchez J et al (2020) Europarl-st: a multilingual corpus for speech translation of parlia-
mentary debates. In: 2020 IEEE International conference on acoustics, speech, and signal
processing (ICASSP 2020). IEEE
35. Wang C et al (2020) Covost: a diverse multilingual speech-to-text translation corpus. arXiv
preprint arXiv:2002.01320
36. Nakamura S et al (2006) The ATR multilingual speech-to-speech translation system. IEEE
Trans Audio Speech Lang Process 14(2):365–376
37. Udhaykumar N, Ramakrishnan SK, Swaminathan R (2004) Multilingual speech recognition
for information retrieval in Indian context. In: Proceedings of the student research workshop
at HLT-NAACL 2004
38. Anwar M et al (2023) Muavic: a multilingual audio-visual corpus for robust speech recognition
and robust speech-to-text translation. arXiv preprint arXiv:2303.00628
39. Schultz T (2002) Globalphone: a multilingual speech and text database developed at Karlsruhe
university. In: Seventh International conference on spoken language processing
40. Gonzalez-Dominguez J et al (2014) A real-time end-to-end multilingual speech recognition
architecture. IEEE J Sel Top Signal Process 9(4):749–759
Performance Evaluation of Job Shop
Scheduling Problem Using Proposed
Hybrid of Black Hole and Firefly
Algorithms

Jaspreet Kaur and Ashok Pal

Abstract Scheduling issues have become a prominent area of research in recent
decades, drawing the attention of many academics and engineers due to their
versatility and importance in various real-world domains. Job shop scheduling
problem (JSSP) is a booming topic of the scheduling research area. Research on it
has yielded substantial gains in the form of cost minimization, reduced job
completion time, and proper allocation of jobs so that job potential can be increased. In
this research article, a hybrid of the black hole and firefly algorithms has been used
to solve the JSSP, combining the exploitation features of the black hole algorithm
with the exploration quality of the firefly algorithm to obtain better results.
We have created 12 randomly generated problems to check the performance of the
hybrid algorithm on the scheduling issues. The obtained computational results have
been compared with standard PSO, FA, and SFLA. The proposed algorithm HBFA
has proven to be the best algorithm among all these algorithms.

Keywords Job shop scheduling problem · Scheduling algorithms · Make span
time minimization · Metaheuristics approaches · HBFA · Combinatorial
optimization

1 Introduction

In a production planning approach, scheduling is simply the short-term execution
plan. The actions taken by a manufacturing company to oversee and regulate the
production process’s execution are collectively referred to as production scheduling.
A schedule is an assignment problem [1] that outlines exactly which tasks need to
be completed and how the factory’s resources should be used to meet the plan, all

J. Kaur · A. Pal (B)


Department of Mathematics, Chandigarh University, Mohali, Punjab, India
e-mail: ashokpmaths.iitr@gmail.com
J. Kaur
e-mail: jaspreetkaur.cse@cumail.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 15
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_2

expressed in minutes or seconds. In essence, detailed scheduling is the challenge of
assigning machines [2], within certain limitations, to competing workloads across
time. Every machine can handle a maximum of one task at a time, and each work
centre can perform one project at a time. A scheduling problem usually takes a
certain number of jobs as input, with each job having its own set of parameters
(tasks, required resources, time estimates for each operation, and no cancellations).
An estimation of the work’s duration is necessary [3] for all scheduling strategies. The
shop floor arrangement both influences and is influenced by scheduling. The ability
to project all scheduling modifications over time allows for the detection and analysis
of starting and completion times, resource idle times, etc. An appropriate schedule
may enable the forecast to predict [4] when each released part will be completed and
to supply information for determining what needs to be worked on next. The goal
of scheduling research is to complete the tasks in a way that complies with priority
standards and responds to strategy. It begins with the execution orders and attempts
to assign the various items’ production to the facilities in the most efficient manner
feasible. An effective scheduling plan is the result of careful planning, recognizing
resource conflicts, controlling the flow of work to a shop, and maximizing the time
it takes to complete each task. It establishes the moment at which each work begins
and decides what and how delivery commitments can be fulfilled.

1.1 Proposed Approaches for Solving Scheduling Problem

If a problem's dimension and/or complexity are exceptional, metaheuristic algorithms take precedence over other approaches as the only practical means of scheduling. They are characterized as higher-level generic methodologies that guide underlying heuristics towards optimization in hard problems. In recent years, metaheuristics have drawn significant interest from the hard-optimization community as a potent tool, owing to their highly promising experimental and practical findings across a wide range of engineering domains. For this reason, these strategies have been the subject of numerous recent studies on scheduling problems. Published studies also provide mathematical evaluations of metaheuristics. The present research studies the primary features of the most promising metaheuristic techniques for the general job shop scheduling problem (JSSP). The JSSP is a highly constrained, NP-complete problem, and its solution is acknowledged as a crucial step in the factory optimization process. Several randomly generated scheduling problems are used in this study to examine the performance of the proposed hybrid HBFA.
Performance Evaluation of Job Shop Scheduling Problem Using … 17

2 Related Work

In the SFLA of [5], the entire population is divided into several memeplexes using a tournament selection-based technique. The search in each memeplex is conducted around the memeplex's best solution by combining a global search phase with a multiple-neighbourhood search step. The work [6] examines the FJSP that minimizes workload imbalance and overall energy consumption, and analyses the conflicts that arise between the two goals. An SFLA based on a three-string coding method is suggested. An enhanced shuffled frog-leaping algorithm has been developed in [7] to address the FJSP. The algorithm uses extremal optimization in its information exchange and an adjustment sequence to construct the local search strategy. Compared with existing heuristic algorithms, the computational results demonstrate the suggested algorithm's strong search capability in resolving the flexible job shop scheduling problem. Advances in fuzzy set theory allow the JSP with fuzzy processing times (FJSP) [8] to model scheduling more realistically. With a hybrid adaptive differential evolution (HADE) approach, the multi-objective FJSP can be reduced to a single-objective optimization problem. The authors consider the jobs' maximum completion time, total waiting time, and overall power consumption. The parameters CR and F are designed to be normally distributed and adaptive. The authors in [9] suggest a mathematical model for a novel FJSP (SOC-FJSP) that is constrained by setup operators and assumes anticipatory (detached) setup operations. A setup operator merely needs to remain near the equipment while setting it up, unlike a machine tender. Once setup is finished, an operator can move on to another machine and continue setting up. Because a setup is assumed to be independent of operations, a job's setup operation can overlap with the previous job's processing. As a result, setup operators and machine tools are used more effectively, and the duration is decreased. The issue of maximum
lateness, a performance metric based on due dates, is tackled by the authors in [10]. After stating the problem, they derive a dominance relation. Additionally, they provide three methods for the problem: EDD, Tabu search, and particle swarm optimization (PSO). Compared with the original PSO, the authors in [11] used an insert operator to change the particle movement and priorities to change the particle position representation. To translate a particle position into a schedule, they also implemented a modified parameterized active schedule generation method (mP-ASG). By adjusting the maximum delay duration permitted in mP-ASG, one can narrow or widen the search space between non-delay schedules and active schedules. For the grid scheduling problem, the work [12] provides an enhanced particle swarm optimization (PSO) algorithm with discrete coding rules. The enhanced PSO method retains all the benefits of the standard PSO, including ease of implementation, low computing load, and few control parameters. Experiments demonstrate that the algorithm has low variability and is stable. The crossover operator in the study [13], which uses a new GA-SA algorithm, integrates the Metropolis acceptance criterion. This can preserve the positive traits of the preceding generation while lessening the disruptive effects of genetic operators. The authors also introduce two brand-new features for this JSP-solving algorithm. First, a FAS representation is given to create a schedule that can further narrow the search space. Second, for the operation-based representation, the authors suggest Precedence Operation Crossover (POX), a brand-new crossover operator. This research [14] proposes a convolutional neural
network-based, efficient two-stage approach to handle the FJSP with machine breakdown. The DFJSP model targets robustness and maximum completion time. The first stage of the two-stage technique involves training a prediction model with a convolutional neural network (CNN). The second stage uses the model developed in the first stage to forecast how robust a schedule will be. To begin with, an improved ICA is suggested to provide teaching-related information. To minimize the total weighted tardiness, the study [15] examines an FJSP with lot-streaming and machine reconfigurations (FJSP-LSMR). The goal of this study is to reduce the overall weighted tardiness in the JSSP. A novel ABC [16] is proposed to solve the problem, considering its high complexity. After identifying a neighbourhood property of the problem, a tree search technique is developed to improve the ABC's exploitation potential. For large-scale JSSPs where the total weighted tardiness must be reduced, a decomposition-based hybrid optimization approach [17] is provided. In each iteration, a genetic algorithm solves a new sub-problem that is first defined by a simulated annealing approach and then refined. The authors develop a fuzzy inference method to find the tasks' bottleneck characteristic values, which capture the characteristic information at different optimization stages. To increase the optimization efficiency, this information is subsequently used to direct the immune mechanism's sub-problem-solving process. In the article [18], a mixed selection operator based on the fitness value and the concentration value was offered to increase the diversity of the population. To fully leverage the properties of the problem itself, new crossover operators based on the machine and mutation operators based on the critical path were specifically designed. A new algorithm for determining the critical path from a schedule was presented. Additionally, a local search operator was created, which significantly enhances the GA's local search capabilities. On this basis, a new combination of GA was suggested, and its convergence was demonstrated. The imprecise execution and completion durations of JSSPs are discussed in [19]. The traditional differential evolution (DE) algorithm is selected by the authors as the fundamental basis for optimization. The DE algorithm's advantage of a unique evolutionary approach has been used in the mutation operation of different vector sets. Nevertheless, DE is not always successful in resolving FJSSP cases. The work [20] introduces a generalized FJSP in which additional hard constraints are taken into account in addition to the classical limitations of the FJSP, such as equipment capability, schedule delays, and holding periods. The problem is based on an actual circumstance observed at a manufacturer of seamless rolled rings. A constraint programming (CP) model and a mixed integer linear programming (MILP) model have been proposed to represent the problem.

3 Job Shop Scheduling Problem

Recently, research has concentrated on scheduling problems that arise in manufacturing and service environments, where tasks are activities, machines are resources, and every machine can handle one task at a time. The job shop system, often known as the low-volume system, is the main topic of the present paper. In this kind of environment, products are made to order. A JSSP can be described as follows: m machines {M_1, M_2, …, M_m} perform n jobs {N_1, N_2, …, N_n}. An operation is denoted O_ij, the operation of the ith job performed on the jth machine. The requirement that each job visit the machines in a specific order is known as a technological restriction. There may be no interruptions once a machine has begun processing a job. The makespan is the amount of time needed for all operations to finish their respective processes. Our goal in this paper is to minimize this makespan value. When the makespan is minimized, at least one of the optimal schedules is semi-active (no operation may be started earlier without violating technological constraints). Figure 1 represents the solution of a job shop problem as a Gantt chart. For a given schedule, define w_ij as the time job j waits before being processed on machine i, and define C_ij as the time at which job j completes processing on machine i. We are interested in objective functions with two parts, each depending on the schedule: an intermediate holding cost component HC and a completion-time component C(S) that depends on the times at which each job is completed. The intermediate holding cost component can be expressed using Eq. (1):


HC = \sum_{j=1}^{n} \sum_{i=1}^{m_j} h_{ij} w_{ij}.  (1)

Constraints must be established to guarantee that


1. The start time of job j on machine m must be at least the sum of the start and processing times of job j's preceding operations.
2. Each machine processes only one job at a time. To enforce this, if job j starts on machine m, either it precedes job i on machine m, or its start time is at least job i's start time plus job i's processing time.
3. For each machine m in M, one element of every pair of jobs i, j must precede the other.
4. The makespan is at least the start time plus the processing time of every operation.
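As an illustration of these definitions (not the authors' implementation), a semi-active schedule and its makespan can be computed by dispatching each job's operations in technological order, each as early as possible. The job format and the simple job-by-job dispatch rule below are assumptions for the sketch:

```python
def makespan(jobs):
    """Makespan of a schedule built by greedy list scheduling.
    jobs: list of jobs; each job is an ordered list of (machine, proc_time)
    pairs, i.e. its operations in technological order."""
    machine_free = {}            # earliest time each machine becomes free
    job_ready = [0] * len(jobs)  # earliest start time of each job's next operation
    finish = 0
    for j, ops in enumerate(jobs):
        for machine, proc in ops:
            # an operation starts only when both its job and its machine are free
            start = max(job_ready[j], machine_free.get(machine, 0))
            end = start + proc
            job_ready[j] = end
            machine_free[machine] = end
            finish = max(finish, end)
    return finish

# Job 0: M1 for 3, then M2 for 2.  Job 1: M2 for 2, then M1 for 4.
print(makespan([[(1, 3), (2, 2)], [(2, 2), (1, 4)]]))  # → 11
```

Dispatching jobs one after another like this respects constraints 1-3 but generally does not minimize the makespan; the metaheuristics of Sect. 4 search over operation orderings to reduce it.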

Fig. 1 Gantt chart solution representation of job shop scheduling problem

4 Proposed Methodology

4.1 Firefly Algorithm

Yang designed the firefly algorithm [21] in 2008, inspired by the distinctive behaviours of fireflies. A population of fireflies exhibits characteristic luminescent flashing to communicate, attract mates, and warn of potential predators. Drawing inspiration from these activities, Yang developed the strategy under the assumptions that all fireflies are unisexual, meaning each can attract any other firefly, and that an individual's attractiveness is directly correlated with its light intensity. Consequently, the more brilliant fireflies entice the less brilliant ones to come closer to them; if no firefly is brighter than a given one, that firefly moves randomly. The algorithm is explained in the steps below:
Light Intensity: The physics of light intensity provides a suitable way to relate any two fireflies through their separation, since intensity is inversely proportional to the square of the distance, as in Eq. (2):

I \propto 1/r^2.  (2)
r2
Here, I indicates intensity, and r indicates distance. In addition, the light absorption
coefficient is γ .
Brightness: The attractiveness parameter β is defined in Eq. (3) using an exponential form and decreases with the distance between two fireflies:

\beta = \beta_0 e^{-\gamma r^2},  (3)

where \beta_0 is the attractiveness at the source (r = 0), and \beta is the firefly's attractiveness at distance r.
Movement of Fireflies: A firefly i moves towards a more alluring (brighter) firefly j according to Eq. (4):

z_i = z_i + \beta_0 e^{-\gamma r_{ij}^2} (z_j - z_i).  (4)

Moreover, a random movement is given by Eq. (5).

z_i = z_i + \alpha(\mathrm{rand}() - 0.5).  (5)

Updating Equation: Hence, the final update rule, Eq. (6), is formed by combining Eq. (4) and Eq. (5):

z_i = z_i + \beta_0 e^{-\gamma r_{ij}^2} (z_j - z_i) + \alpha(\mathrm{rand}() - 0.5).  (6)
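Equation (6) can be sketched as a continuous-space position update (a generic illustration, not the authors' JSSP encoding; the parameter values β₀, γ, and α below are assumptions):

```python
import math
import random

def firefly_move(z_i, z_j, beta0=1.0, gamma=1.0, alpha=0.2):
    """Move firefly i towards a brighter firefly j, Eq. (6):
    z_i <- z_i + beta0*exp(-gamma*r_ij^2)*(z_j - z_i) + alpha*(rand - 0.5)."""
    r2 = sum((a - b) ** 2 for a, b in zip(z_i, z_j))  # squared distance r_ij^2
    beta = beta0 * math.exp(-gamma * r2)              # attractiveness, Eq. (3)
    return [a + beta * (b - a) + alpha * (random.random() - 0.5)
            for a, b in zip(z_i, z_j)]
```

With γ = 0 the attraction is full (the firefly jumps onto its target), while with large γ the attraction vanishes and only the random term of Eq. (5) remains.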

4.2 Black Hole Algorithm

The simplest definition of a black hole is a region of space in which so much mass is concentrated that nothing can escape its gravitational attraction. Light and anything else that enters a black hole are lost to our universe forever. In the BH algorithm, the candidate that performs best overall at each iteration becomes the black hole. Stars: All the other candidates in the BH algorithm [22] form the normal stars. The black hole is not created randomly; it is one of the actual candidates of the population. All candidates are then shifted towards the black hole, depending on their present locations and a random number.
1. The BH algorithm begins by calculating the objective function for a population of candidate solutions to an optimization problem.
2. At every iteration, the fittest candidate is chosen to become the black hole, with the remaining candidates acting as normal stars. Once initialization is complete, the black hole begins to draw the stars towards it.
3. If a star approaches the black hole too closely, it is absorbed and vanishes forever. In this situation, a new star (candidate solution) is produced at random and placed in the search space, and the search continues.
Fitness Value Calculation: The fitness value is calculated as follows.
1. A population of candidate solutions (the stars) is created at random in the search space of the problem or function.

2. Calculate the fitness of the population using

f_i = \mathrm{eval}(v_i(t)),  i = 1, 2, …, n,  and  f_{BH} = \mathrm{eval}(v_{BH}(t)),

where n is the population size and f_i and f_BH are the fitness values of the ith star and the black hole, respectively, within the initial population. The population is evaluated, and the candidate with the best fitness value f_i among the stars is selected to be the black hole, with the remaining candidates continuing to function as regular stars. The black hole may swallow the stars in its near surroundings: once the black hole and the other stars are formed, it starts to consume the neighbouring stars as they approach it.
Absorption Rate of Stars: The stars in the black hole's vicinity begin to be absorbed, and all of them start moving towards it. The formula below describes how the stars are absorbed by the BH:

y_i(t + 1) = y_i(t) + \mathrm{rand} \times (y_{BH} - y_i(t)).  (7)

Here y_i(t) and y_i(t + 1) are the positions of the ith star at iterations t and (t + 1), respectively, where i = 1, 2, …, n; y_BH is the position of the black hole in the search space; and rand is a random number in the interval [0, 1]. While moving towards the black hole, a star may reach a location whose cost is lower than that of the black hole. In this case, the star and the black hole exchange roles. The algorithm then resumes with the black hole in its new location, and the stars begin to move towards it.
Event Horizon: As the moving stars get closer to the black hole, they may cross its event horizon. Every star (candidate solution) that crosses the event horizon is drawn in by the black hole. Whenever a star perishes in this way, a fresh alternative (star) is generated and scattered randomly over the search region, so that the number of candidates stays fixed. The next iteration begins after all the stars have been moved.
The following formula is used in the BHA to calculate the radius of the event horizon:

R = f_{BH} / \sum_{i=1}^{k} f_i,  (8)

where f_i and f_BH are the fitness values of the ith star and the black hole, respectively.
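Equations (7) and (8) can be sketched as one iteration of the algorithm. Minimization is assumed; treating a fitness gap smaller than R as "crossing the event horizon" is one common reading, and the bounds lo/hi for respawning stars are assumptions of the sketch:

```python
import random

def black_hole_step(stars, fitness, lo, hi):
    """One BH iteration: pull every star towards the black hole (Eq. (7)),
    swap roles if a star finds a better location, and respawn stars that
    cross the event horizon of radius R = f_BH / sum(f_i) (Eq. (8))."""
    f = [fitness(s) for s in stars]
    bh = min(range(len(stars)), key=lambda i: f[i])  # best candidate is the BH
    bh_fit = f[bh]
    radius = bh_fit / sum(f)                         # Eq. (8); sum assumed > 0
    for i in range(len(stars)):
        if i == bh:
            continue
        r = random.random()                          # rand in [0, 1]
        stars[i] = [y + r * (ybh - y) for y, ybh in zip(stars[i], stars[bh])]
        fi = fitness(stars[i])
        if fi < bh_fit:                              # star is now better: roles swap
            bh, bh_fit = i, fi
        elif fi - bh_fit < radius:                   # crossed the event horizon
            stars[i] = [random.uniform(lo, hi) for _ in stars[i]]
    return stars, bh
```

Repeating this step until a stopping criterion is met gives the complete BH loop described in steps 1-3 above.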

5 Hybrid Approach

The hybrid global optimization approach combines two well-known global optimization algorithms: the firefly algorithm and the black hole algorithm. Compared with other population-based nature-inspired variants, the FA and BHA can perform efficiently on some optimization problems, but they are not well suited to excessively intricate functions and may become trapped in local optima. To get around these limitations and improve search performance, a unique hybrid BH-FA variant is developed. The FA is good at exploring for the global optimum, whereas the black hole algorithm can exploit the current global best solution immediately at every iteration. The proposed recombination [23] of the FA and BHA is used to overcome the drawbacks of the two individual algorithms. The hybrid strategy combining the FA and BHA approaches is shown in Fig. 2.

Algorithm 1. Pseudo-code of HBFA

1. Initialize the positions of the fireflies z_j (j = 1, 2, …, n)
2. while (t < max number of iterations T)
3. for j = 1:n (all n fireflies)
4. for k = 1:j
5. light intensity l_j at z_j is determined by f(z_j)
6. if (l_k > l_j)
7. move firefly j towards k in all d dimensions using
   z_j = z_j + \beta_0 e^{-\gamma r^2} (z_k - z_j) + \alpha \varepsilon_j
8. else
9. move firefly j randomly
10. end if
11. attractiveness varies with distance r via exp(−\gamma r^2)
12. evaluate new solutions and update light intensity
13. update the solutions obtained from the firefly algorithm using the update equation of the black hole algorithm to enhance the solutions
14. end for k
15. end for j
16. rank the search agents and find the current best
17. end while
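Algorithm 1 can be prototyped in continuous space as follows (a sketch under assumed parameter values, minimizing an objective f; the JSSP itself additionally needs an encoding that maps positions to schedules):

```python
import math
import random

def hbfa(f, dim, n=15, iters=100, beta0=1.0, gamma=1.0, alpha=0.2,
         lo=-5.0, hi=5.0):
    """Hybrid firefly/black-hole sketch: firefly moves (Eq. (6)) followed by a
    black-hole pull of all agents towards the current best (Eq. (7))."""
    pop = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    fit = [f(p) for p in pop]
    for _ in range(iters):
        # firefly phase: every agent moves towards each brighter agent
        for j in range(n):
            for k in range(n):
                if fit[k] < fit[j]:  # lower objective = brighter firefly
                    r2 = sum((a - b) ** 2 for a, b in zip(pop[j], pop[k]))
                    beta = beta0 * math.exp(-gamma * r2)
                    pop[j] = [a + beta * (b - a)
                              + alpha * (random.random() - 0.5)
                              for a, b in zip(pop[j], pop[k])]
                    fit[j] = f(pop[j])
        # black-hole refinement of the firefly output (step 13 of Algorithm 1)
        bh = min(range(n), key=lambda i: fit[i])
        for j in range(n):
            if j != bh:
                r = random.random()
                pop[j] = [y + r * (b - y) for y, b in zip(pop[j], pop[bh])]
                fit[j] = f(pop[j])
    best = min(range(n), key=lambda i: fit[i])
    return pop[best], fit[best]
```

On a smooth test function such as the sphere function the best fitness shrinks rapidly; for the JSSP the same loop would operate on an operation-sequence encoding and f would return the makespan.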

Fig. 2 Flow chart of HBFA

6 Computational Results

In this work, various randomly generated problem instances, together with their operations, are utilized to establish the minimum makespan. The following tables display the execution and completion times for each job independently. The hybrid optimization approach generates the minimum makespan for all test instances. The results of the various optimization techniques are shown in the tables below, together with the minimum makespan and the initial set-up period. The makespan of the job shop scheduling procedure is clearly displayed in Tables 1, 2, and 3. The sizes of the problems in Table 1 are P1 (10×10),

Table 1 Computational results of instances of Type 1


Instances Optimization techniques
HBFA FA PSO SFLA
P1 1020 1022 1046 1040
P2 1012 1012 1036 1054
P3 1034 1038 1042 1042
P4 1052 1054 1070 1054

Table 2 Computational results of instances of Type 2


Instances Optimization techniques
HBFA FA PSO SFLA
P5 769 769 771 770
P6 687 693 695 704
P7 662 680 694 689
P8 578 578 667 590

Table 3 Computational results of instances of Type 3


Instances Optimization techniques
HBFA FA PSO SFLA
P9 566 570 590 578
P10 498 498 509 503
P11 506 508 511 511
P12 539 539 556 542

P2 (10×15), P3 (10×20), P4 (15×10), and the set-up time for each job is uniformly distributed as UD(1, 40). The sizes of the problems in Table 2 are P5 (10×10), P6 (10×15), P7 (10×20), P8 (15×10), and the set-up time for each job is uniformly distributed as UD(20, 40). The sizes of the problems in Table 3 are P9 (10×10), P10 (10×15), P11 (10×20), P12 (15×10), and the set-up time for each job is uniformly distributed as UD(1, 20). All results were compiled in MATLAB 2022(b) on an AMD Ryzen 5 5625U with Radeon Graphics, 2.30 GHz, 16.0 GB RAM, 64-bit operating system, x64-based processor.

6.1 Discussion on the Obtained Results

Tables 1, 2, and 3 show that the HBFA employed in this work obtains the majority of the best-known solutions, particularly for instances with greater flexibility. The approach also offers competitive answers to most instances. Additionally, we find that the performance of our HBFA is either superior to or comparable with that of PSO, FA, and SFLA. Compared with other heuristic algorithms, the findings show that our technique is a viable algorithm for the FJSP. To evaluate performance on various problems, varied problems of different sizes have been taken, grouped into three types.

Fig. 3 Comparison of different optimization techniques for JSSP instances
The hybrid optimization algorithm, which combines the firefly and black hole algorithms, achieves the shortest makespan in this experiment. Each job's total processing time is calculated at the beginning and shared between the two approaches. The job sequences form the input; their durations are combined with those of the other jobs to produce the smallest required time for the complete set of jobs.

The minimum makespan is attained once all jobs have completed all of their operations. The comparison of all algorithms is shown in Fig. 3 using a bar graph, illustrating the performance of the proposed hybrid HBFA against the other three algorithms. Figure 4 depicts the convergence graphs for the different instances used in the proposed work.

7 Conclusion and Future Works

The goal of this research is to reduce the makespan in the JSSP. The job shop scheduling problem is examined on randomly generated tests of different sizes to demonstrate the performance of the hybrid HBFA algorithm. Here, the global search capability of the FA has been combined with the local search capability of the BHA to obtain a proper balance between exploration and exploitation. The obtained results were compared with FA, PSO, and SFLA to check the performance of the hybrid HBFA, and the proposed algorithm came out best among all these algorithms on the generated JSSP instances.

Fig. 4 Convergence graphs for different instances




Future work may apply the proposed algorithm with different parameter settings and in a fuzzy environment. Moreover, the algorithm can be applied to other real-life problems such as power system generation, feature selection, image processing, and clustering analysis.

References

1. Walker RA, Chaudhuri S (1995) Introduction to the scheduling problem. IEEE Des Test Comput
12(2):60–69. https://doi.org/10.1109/54.386007

2. Topcuoglu H, Hariri S, Wu M-Y (2002) Performance-effective and low-complexity task scheduling for heterogeneous computing [Online]. Available: https://doi.org/10.1109/71.993206
3. Yamashiro H, Nonaka H (2021) Estimation of processing time using machine learning and real
factory data for optimization of parallel machine scheduling problem. Oper Res Perspect 8.
https://doi.org/10.1016/j.orp.2021.100196
4. Lu H, Zhou R, Fei Z, Shi J (2018) A multi-objective evolutionary algorithm based on Pareto
prediction for automatic test task scheduling problems. Appl Soft Comput J 66:394–412. https://
doi.org/10.1016/j.asoc.2018.02.050
5. Lei D, Guo X (2016) A shuffled frog-leaping algorithm for job shop scheduling with outsourcing
options. Int J Prod Res 54(16):4793–4804. https://doi.org/10.1080/00207543.2015.1088970
6. Lei D, Zheng Y, Guo X (2017) A shuffled frog-leaping algorithm for flexible job shop scheduling
with the consideration of energy consumption. Int J Prod Res 55(11):3126–3140. https://doi.
org/10.1080/00207543.2016.1262082
7. Lu K, Ting L, Keming W, Hanbing Z, Makoto T, Bin Y (2015) An improved shuffled frog-
leaping algorithm for flexible job shop scheduling problem. Algorithms 8(1):19–31. https://
doi.org/10.3390/a8010019
8. Wang G-G, Gao D, Pedrycz W (2022) Solving multi-objective fuzzy job-shop scheduling
problem by a hybrid adaptive differential evolution algorithm [Online]. Available: https://doi.
org/10.1109/TII.2022.3165636
9. Defersha FM, Obimuyiwa D, Yimer AD (2022) Mathematical model and simulated annealing
algorithm for setup operator constrained flexible job shop scheduling problem [Online].
Available: https://doi.org/10.1016/j.cie.2022.108487. Accessed 29 Nov 2023
10. Allahverdi A, Al-Anzi FS (2006) A PSO and a Tabu search heuristics for the assembly
scheduling problem of the two-stage distributed database application. Comput Oper Res
33(4):1056–1080. https://doi.org/10.1016/j.cor.2004.09.002
11. Sha DY, Hsu CY (2008) A new particle swarm optimization for the open shop scheduling
problem. Comput Oper Res 35(10):3243–3261. https://doi.org/10.1016/j.cor.2007.02.019
12. Bu YP, Zhou W, Yu JS (2008) An improved PSO algorithm and its application to grid scheduling
problem. In: Proceedings - International symposium on computer science and computational
technology (ISCSCT 2008), pp 352–355. https://doi.org/10.1109/ISCSCT.2008.93
13. Zhang C, Li P, Rao Y, Li S (2005) A new hybrid GA/SA algorithm for the job shop scheduling problem. LNCS 3448
14. Zhang G, Xi X, Liu LX, Zhang L, Wei S, Zhang W (2022) An effective two-stage algorithm
based on convolutional neural network for the bi-objective flexible job shop scheduling problem
with machine breakdown. Expert Syst Appl 203 [Online]. Available: https://doi.org/10.1016/
j.eswa.2022.117460. Accessed 29 Nov 2023
15. Fan J, Zhang C, Shen W, Gao L (2022) A mat-heuristic for flexible job shop scheduling
problem with lot-streaming and machine reconfigurations. Int J Prod Res 6565–6588 [Online].
Available: https://doi.org/10.1080/00207543.2022.2135629. Accessed 29 Nov 2023
16. Zhang R, Song S, Wu C (2013) A hybrid artificial bee colony algorithm for the job shop
scheduling problem. Int J Prod Econ 141(1):167–178. https://doi.org/10.1016/j.ijpe.2012.
03.035
17. Zhang R, Wu C (2010) A hybrid approach to large-scale job shop scheduling. Appl Intell
32(1):47–59. https://doi.org/10.1007/s10489-008-0134-y
18. Qing-Dao-Er-Ji R, Wang Y (2012) A new hybrid genetic algorithm for job shop scheduling
problem. Comput Oper Res 39(10):2291–2299. https://doi.org/10.1016/j.cor.2011.12.005
19. Gao D, Wang GG, Pedrycz W (2020) Solving fuzzy job-shop scheduling problem using de
algorithm improved by a selection mechanism. IEEE Trans Fuzzy Syst 28(12):3265–3275.
https://doi.org/10.1109/TFUZZ.2020.3003506
20. Boyer V, Vallikavungal J, Cantú Rodríguez X, Salazar-Aguilar MA (2021) The generalized
flexible job shop scheduling problem. Comput Ind Eng 160. https://doi.org/10.1016/j.cie.2021.
107542
21. Yang X-S (2010) Firefly algorithm, stochastic test functions and design optimization

22. Azar AT, Vaidyanathan S (2015) Blackhole algorithms and applications. Stud Comput Intell
575:v–vii. https://doi.org/10.1007/978-3-319-11017-2
23. Kaur J, Pal A (2023) Development and analysis of a novel hybrid HBFA using firefly and
blackhole algorithm. In: Third congress on intelligent systems, pp 799–816 [Online]. Available:
https://link.springer.com/chapter/10.1007/978-981-19-9225-4_58. Accessed 03 Oct 2023
Machine Learning and Healthcare:
A Comprehensive Study

Riya Raj and Jayakumar Kaliappan

Abstract This paper delves into the dynamic intersection of machine learning (ML) and healthcare, envisioning a paradigm shift in diagnostic accuracy, personalized treatment, and streamlined administration. It meticulously explores various ML algorithms, spanning deep learning, decision trees, and clustering techniques, pivotal in domains such as early cancer detection, diabetes detection, heart disease detection, autism spectrum disorder detection, and Parkinson's disease detection. Rigorous model evaluation, employing accuracy, precision, F1-score, specificity, and mean squared error metrics, ensures algorithm dependability. However, data privacy challenges, amplified by intricate regulations, persist. Ethical considerations add further dimensions, including algorithmic bias and the need to cultivate patient trust. Addressing these necessitates robust education for healthcare professionals and alignment with legal frameworks. Despite the challenges, the paper advocates for a conscientious integration of ML, emphasizing its transformative potential in healthcare and urging judicious technology amalgamation to propel advancements in patient care and clinical outcomes.

Keywords Machine learning · Healthcare · Cancer · Diabetes · Heart diseases · Autism spectrum disorder · Parkinson's disease

1 Introduction

The fusion of machine learning and the healthcare sector represents a pivotal juncture in the history of medical science and its practical applications. This intersection
of machine learning algorithms with the extensive reservoir of health-related data
has unveiled uncharted opportunities to reshape the landscape of healthcare. This
transformative synergy carries the potential to fundamentally redefine medical diagnostics, prognostications, and the quality of patient care, transcending the traditional confines of medical knowledge and practice [18].

R. Raj (B) · J. Kaliappan
Vellore Institute of Technology, Vellore, Tamil Nadu, India
e-mail: raj.riya1606@gmail.com
J. Kaliappan
e-mail: jayakumar.k@vit.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 31
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_3
This paper embarks on a comprehensive exploration of the machine learning
algorithms that underpin these innovations, including deep learning, decision trees,
and clustering techniques. These algorithmic paradigms are the cornerstones upon
which groundbreaking healthcare applications are erected, spanning early disease
detection, personalized treatment strategies, optimized resource allocation, streamlined administrative workflows, and much more.
Owing to the highly sensitive nature of healthcare data, data privacy and security are imperative. The inadvertent exposure of patient data poses grave threats, mandating the rigorous implementation of robust encryption, stringent access controls, and continuous monitoring. Additionally, navigating the intricate landscape of regulatory adherence, encompassing compliance with the "General Data Protection Regulation" and the "Health Insurance Portability and Accountability Act" [7], presents formidable challenges to healthcare establishments and machine learning developers.
The forthcoming sections of this paper undertake a meticulous examination of machine learning algorithms and their healthcare applications. They probe the subtleties of model assessment, delve into the intricacies of data privacy and security, and dissect the challenges that the confluence of machine learning and healthcare presents.

2 Machine Learning in Healthcare

This section provides an overview of various ML models and their key features,
along with their applications in the healthcare domain. These models have shown
remarkable potential in improving healthcare services, diagnosis, and patient care.
Table 1 presents an in-depth comparative analysis of the algorithms.

2.1 Random Forest

The random forest algorithm, a versatile and powerful machine learning technique,
has found valuable applications in healthcare, offering robust predictive capabilities
while mitigating overfitting and enhancing model interpretability. At its core, the
random forest algorithm operates by constructing a multitude of decision trees during
training [12]. Each tree is grown using a bootstrapped subset of the training data, and
at each node of the tree, a random subset of features is considered for splitting. This
inherent randomness helps in creating diverse and uncorrelated trees. Once the forest
of trees is built, predictions are made by aggregating the results from individual trees.
For regression tasks, this typically involves averaging the predictions from each
tree, while for classification tasks, it employs a majority vote mechanism to deter-
Machine Learning and Healthcare: A Comprehensive Study 33

Table 1 Comparison of algorithms

Random forest
  Key characteristics: ensemble of decision trees; random feature subset for splitting; averages predictions for regression tasks; majority vote for classification tasks; provides feature importance scores
  Advantages: robust predictive capabilities; handles high-dimensional data with numerous features; natural feature importance measurement; reduces risk of overfitting; enhances model generalization to unseen data
  Applications in healthcare: disease risk prediction; medical image analysis; drug discovery; patient outcome prognosis

Gradient boosting
  Key characteristics: sequentially trained decision tree ensemble; increased weights for misclassified data points; weighted sum for regression tasks; majority vote for classification tasks
  Advantages: exceptional predictive accuracy; captures complex relationships within healthcare data; notable for precision in disease diagnosis; variants like XGBoost and LightGBM for speed and scalability
  Applications in healthcare: predicting patient outcomes; optimizing treatment plans; medical image analysis; disease diagnosis

CNNs
  Key characteristics: specialized for image analysis; convolutional layers for spatial feature learning; pooling layers for dimension reduction; fully connected layers for intricate relationships
  Advantages: proficient in medical image analysis and disease detection; transfer learning reduces data requirements; adaptable and accurate; essential for disease detection and personalized medicine
  Applications in healthcare: detecting diseases from X-rays and MRIs; segmenting anatomical structures; identifying anomalies; personalized medicine

RNNs
  Key characteristics: designed for sequential data; dynamic and recursive behaviors with hidden states; backpropagation through time (BPTT) for training
  Advantages: efficient in time series analysis and natural language processing; advanced variants like LSTM and GRU address limitations; addresses challenges like vanishing and exploding gradients
  Applications in healthcare: predicting patient outcomes with time series data; medical speech recognition; LSTM and GRU for enhanced information flow regulation

mine the final class prediction. Random forest offers several advantages for healthcare
applications. Firstly, it excels in handling high-dimensional data with numerous fea-
tures, which is common in medical datasets. Secondly, it provides a natural way to
measure feature importance, aiding in the identification of critical factors in health-
care predictions. Additionally, its ensemble approach reduces the risk of overfitting,
enhancing model generalization to unseen data [16].
Interpreting random forest models is relatively straightforward compared to com-
plex deep learning models. Feature importance scores can guide clinicians and
researchers in understanding which variables contribute most to the predictions, sup-
porting informed decision-making in healthcare scenarios. In the context of health-
care and medical research, random forest has been leveraged for tasks such as disease
risk prediction, medical image analysis, drug discovery, and patient outcome progno-
sis. Its flexibility, robustness, and interpretability make it a valuable tool for improving
healthcare diagnostics and treatment decisions while handling the challenges posed
by medical data intricacies.
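The mechanics just described—bootstrapped trees, aggregated votes, and feature importance scores—can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the dataset and hyperparameters are assumptions, not a clinical model:

```python
# Minimal random forest sketch on synthetic "patient" features.
# Illustrative only -- the dataset and hyperparameters are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is grown on a bootstrap sample; at every split only
# a random subset of features is considered (max_features="sqrt").
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)

# Majority vote across trees yields class predictions;
# feature_importances_ gives the natural importance scores discussed above.
print("test accuracy:", clf.score(X_test, y_test))
print("most important feature index:", clf.feature_importances_.argmax())
```

The `feature_importances_` vector is the quantity a clinician would inspect to see which variables drive the predictions.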

2.2 Gradient Boosting

The gradient boosting algorithm is a sophisticated machine learning technique that has proven to be highly effective in healthcare applications, offering exceptional predictive power and versatility. At its core, gradient boosting operates by sequentially
training an ensemble of decision trees [10]. Unlike random forest, where each tree is
constructed independently, gradient boosting builds trees sequentially, with each new
tree aiming to correct the errors made by the ensemble of previously trained trees.
During training, gradient boosting first fits an initial decision tree to the data. It then assesses the errors made by the current ensemble and fits the next tree to those residual errors (formally, to the negative gradient of the loss function), focusing on the examples the ensemble handles worst. This process iteratively continues, with each tree targeting the errors of its predecessors.
Predictions are made by combining the results of all the decision trees in the ensemble. In regression tasks, the final prediction is the sum of each tree’s output, scaled by the learning rate; for classification tasks, these summed scores are mapped to class probabilities and the most probable class label is assigned.
Gradient boosting is known for its remarkable predictive accuracy and abil-
ity to capture complex relationships within healthcare data. It excels in scenarios
where high precision is required, such as disease diagnosis, patient risk stratifi-
cation, and medical image analysis. One of its notable variations is the XGBoost
algorithm, which has gained popularity in healthcare due to its speed and scalability.
Another variant, LightGBM, optimizes the training process using a histogram-based
approach.
In healthcare and medical research, gradient boosting has been successfully
applied to tasks like predicting patient outcomes, optimizing treatment plans, and
identifying critical features in medical datasets. Its ability to handle structured and

unstructured data, adapt to different data types, and provide interpretable results
makes it an invaluable tool for enhancing healthcare decision support systems and
improving patient care.
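The sequential, error-correcting training loop can be sketched with scikit-learn’s GradientBoostingClassifier. The synthetic data and settings below are assumptions chosen purely for illustration:

```python
# Gradient boosting sketch: trees added one at a time, each fitting the
# residual errors of the current ensemble. Data and settings are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# learning_rate shrinks each tree's contribution so later trees can
# gradually correct the remaining mistakes of the ensemble.
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=1)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Variants such as XGBoost and LightGBM expose an analogous interface with faster, more scalable training.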

2.3 Convolutional Neural Networks (CNNs)

The convolutional neural network (CNN) stands as a pivotal machine learning algo-
rithm, celebrated for its remarkable proficiency in image analysis and pattern recog-
nition, which has proven indispensable in the realm of healthcare. At its core, CNNs
aim to replicate the innate capability of the human visual system to perceive and
discern patterns, employing convolutional layers as the foundational building blocks
[22]. These layers are endowed with the ability to adaptively learn spatial hierarchies
of features from input data. Their operation unfolds systematically: convolutional layers use learnable filters to identify patterns like edges and shapes, extracting pertinent features.
After convolution, pooling layers reduce spatial dimensions, preserving vital
information through techniques like max-pooling. Fully connected layers, akin to
traditional neural networks, learn intricate relationships in the data. Activation func-
tions, such as rectified linear unit (ReLU), add nonlinearity for complex mappings.
CNNs undergo supervised training using backpropagation and gradient descent
to optimize parameters, enhancing pattern recognition capabilities.
In healthcare, CNNs excel in medical image analysis, diagnosing diseases from X-
rays and MRIs, segmenting anatomical structures, and identifying anomalies. Trans-
fer learning fine-tunes pre-trained CNN models, reducing the need for extensive
medical data. Their adaptability and accuracy make CNNs indispensable in health-
care, advancing disease detection, personalized medicine, and patient care quality.
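The convolution → pooling → fully connected pipeline described above can be sketched in a few lines of PyTorch. The layer sizes and the fake 32×32 “scan” are assumptions chosen purely for illustration:

```python
# Minimal CNN sketch mirroring the conv -> pool -> fully connected pipeline.
# Input shape mimics a small grayscale scan; all sizes are assumptions.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),  # learnable filters detect edges/shapes
            nn.ReLU(),                                  # nonlinearity for complex mappings
            nn.MaxPool2d(2),                            # pooling halves spatial dimensions
        )
        self.classifier = nn.Linear(8 * 16 * 16, n_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = TinyCNN()
scan = torch.randn(4, 1, 32, 32)  # batch of 4 fake 32x32 "images"
logits = model(scan)
print(logits.shape)               # one score per class for each image
```

In practice, transfer learning would replace `TinyCNN` with a pre-trained backbone fine-tuned on the (scarce) medical images.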

2.4 Recurrent Neural Networks (RNNs)

The recurrent neural network (RNN) serves as a foundational machine learning algo-
rithm, renowned for its capacity to manage sequential data, rendering it invaluable in
healthcare applications encompassing time series analysis, natural language process-
ing (NLP), and patient data modeling [3]. Unlike conventional feedforward neural
networks, RNNs manifest dynamic and recursive behaviors, enabling them to main-
tain a hidden state that retains information from previous time steps. This temporal
memory equips RNNs to adeptly process data sequences of varying lengths. The
architecture of an RNN encompasses three key components: an input layer to receive
sequential data, a hidden layer that evolves the hidden state by amalgamating current
inputs with past states, and an output layer responsible for generating predictions or

outputs. The pivotal feature of an RNN lies in its recurrent connections within the
hidden layer [15].
At each time step, the hidden state is updated, intertwining current input with
the preceding hidden state, enabling the network to discern sequential data patterns
and dependencies. RNNs are trained using labeled sequential data, employing the
backpropagation through time (BPTT) algorithm to calculate gradients and optimize
the network’s parameters. This training process enables RNNs to recognize sequen-
tial patterns and relationships. While RNNs are robust, they do present challenges,
including issues like vanishing and exploding gradients that impede their ability
to capture extended dependencies in data. To address these limitations, advanced
variants like long short-term memory (LSTM) and gated recurrent unit (GRU) have
emerged [14], incorporating gating mechanisms for enhanced information flow reg-
ulation.
In healthcare, RNNs have demonstrated their efficacy in predicting patient out-
comes using time series data, medical speech recognition, and the analysis of elec-
tronic health records (EHRs). They excel in scenarios where the order and timing of
data points are paramount for accurate predictions. Furthermore, RNNs have found
utility in predictive modeling for disease progression and early detection.
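A minimal PyTorch sketch of the architecture above—sequential input, an evolving hidden state, and an output layer—using the LSTM variant mentioned in the text. The feature counts and sequence length are illustrative assumptions:

```python
# LSTM sketch: the hidden state carries information across time steps,
# as needed for patient time series. All sizes here are assumptions.
import torch
import torch.nn as nn

class VitalsLSTM(nn.Module):
    def __init__(self, n_features=6, hidden=32, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        # h_n is the final hidden state after consuming the whole sequence
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])  # classify from the last hidden state

model = VitalsLSTM()
series = torch.randn(4, 24, 6)  # 4 patients, 24 time steps, 6 vitals each
print(model(series).shape)      # one score per class for each patient
```

The gating mechanisms inside `nn.LSTM` are what regulate information flow and mitigate the vanishing-gradient issue discussed above.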

3 Application of ML in Healthcare

3.1 Cancer Detection

Cancer detection through machine learning (ML) represents a pivotal facet of healthcare. It involves the strategic application of ML algorithms and techniques to discern cancerous cells or tumors within patients, offering
the promise of early detection, a pivotal factor in enhancing treatment outcomes and
bolstering patient survival rates. The application of ML in cancer detection lever-
ages several core attributes, including automated feature extraction from medical
images, classification of medical data into pertinent categories, predictive regression
modeling, ensemble learning through combinations of models, and the potency of
deep learning. Common ML models employed in this context encompass convolu-
tional neural networks (CNNs) for image-based detection, support vector machines
(SVMs) for classifying cancer types based on gene expression data, random for-
est algorithms for feature selection and classification tasks, and logistic regression
models for assessing the probability of cancer based on clinical and demographic
features [21]. The implementation of ML in cancer detection has extended into both
research and healthcare applications, with notable instances such as Google Health’s
AI for Breast Cancer Detection, IBM Watson’s ML-driven treatment recommen-
dations, PathAI’s assistance to pathologists in cancer diagnosis [17], and research
on enhancing prostate cancer detection through MRI and clinical data. Similarly,
deep learning techniques have been applied to chest X-rays and CT scans for more
effective lung cancer detection.
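As one concrete illustration of the models listed above, a hedged sketch of an SVM classifier in the high-dimensional, few-sample regime typical of gene expression data. The data is synthetic and nothing here is clinical:

```python
# Hedged sketch: SVM separating two tumor classes from gene-expression-like
# features. Synthetic data; dimensions and settings are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Many features, few samples -- typical of expression datasets
X, y = make_classification(n_samples=120, n_features=200, n_informative=15,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear kernel is a common default when features vastly outnumber samples
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```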

3.2 Diabetes Detection

Diabetes detection using machine learning (ML) is instrumental in predicting the likelihood of an individual having diabetes by analyzing various health-related factors and data. Timely detection of diabetes is paramount for effective intervention
and management. ML contributes significantly to diabetes detection through risk
assessment, early identification, personalized care, and classification of individu-
als into diabetic or non-diabetic categories based on patterns in health data. These
attributes harness the power of ML to improve diabetes detection and management.
Key attributes of ML in diabetes detection encompass feature selection, classifi-
cation, regression, ensemble learning, and deep learning techniques, allowing the
models to process diverse data types and make accurate predictions. Various ML
models, including logistic regression, random forests, support vector machines, and
deep neural networks, are employed to predict diabetes risk and assist in personalized
diabetes management [6].
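The risk-classification idea can be sketched with logistic regression, one of the models named above. The feature names (glucose, BMI, age), the synthetic cohort, and the probability readout are illustrative assumptions, not clinical guidance:

```python
# Hedged sketch of diabetes-risk classification with logistic regression.
# The cohort is synthetic and the feature effects are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic cohort: columns [glucose, BMI, age]; risk grows with all three.
X = rng.normal([110, 27, 50], [25, 5, 12], size=(400, 3))
y = (X @ np.array([0.03, 0.05, 0.01]) + rng.normal(0, 0.5, 400) > 5.2).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
# predict_proba yields a per-individual risk score between 0 and 1
risk = model.predict_proba([[150, 32, 60]])[0, 1]
print("estimated risk for one synthetic patient:", risk)
```

The probability output, rather than a hard label, is what supports the risk-assessment and personalized-care uses described in the text.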

3.3 Heart Disease Detection

Heart disease detection through machine learning (ML) involves predicting an indi-
vidual’s likelihood of having heart disease based on diverse medical and lifestyle
factors. This application is pivotal for early intervention and prevention in heart
health. ML significantly enhances heart disease detection by assessing risk factors,
enabling early identification of potential issues, providing personalized care plans,
and categorizing individuals into heart disease or non-heart disease groups based on
data patterns. Key attributes of ML in this context encompass feature selection, clas-
sification, regression, ensemble learning, and deep learning techniques, all tailored
to extract relevant insights from health data. Commonly used ML models, including
logistic regression, random forests, support vector machines, and deep neural net-
works, contribute to accurate heart disease prediction by analyzing a spectrum of
health-related data [8].

3.4 Autism Spectrum Disorder Detection

Autism spectrum disorder (ASD) poses a unique challenge given its neurodevelop-
mental nature, characterized by a wide array of symptoms impacting social inter-
action, communication, and repetitive behaviors. In this context, machine learning
(ML) has emerged as a valuable tool, particularly in enhancing the early detection and
diagnosis of ASD, offering several distinct advantages. Foremost, ML contributes to
the early identification of ASD by discerning subtle behavioral and physiological pat-
terns in children, enabling timely interventions and therapies. Its capacity to analyze

diverse data types, including clinical assessments, eye-tracking data, brain imag-
ing (fMRI), and genetic information, ensures a comprehensive assessment of ASD
risk [5]. Furthermore, ML introduces objectivity to the diagnostic process, equip-
ping clinicians with quantifiable metrics and reducing subjectivity. This objectivity is
especially crucial in ASD diagnosis. Additionally, ML supports the personalization
of interventions and therapies, adapting them to the unique needs and profiles of
individuals with ASD. ML’s applications in ASD detection encompass classification
models that categorize individuals into ASD and non-ASD groups, feature engineer-
ing techniques for identifying relevant data attributes, and the utilization of deep
learning models such as convolutional neural networks (CNNs) and recurrent neural
networks (RNNs) for the analysis of complex data, including brain scans [2]. Notable
ML models in this domain include random forest, capable of handling multiple data
types for a comprehensive assessment, and support vector machines (SVMs), ideal
for distinguishing individuals with ASD from those without based on feature patterns.
Deep neural networks, on the other hand, are employed to analyze brain imaging data,
detecting structural or functional abnormalities associated with ASD. The implemen-
tation of ASD detection using ML is actively pursued through research studies and
resources. Examples include research papers like “Deep multimodal learning for
the diagnosis of autism spectrum disorder” [19] and studies exploring eye-tracking
data analysis for ASD detection. Initiatives such as the Autism Brain Imaging Data
Exchange (ABIDE) [11] provide access to essential brain imaging data, facilitating
ML model development. This comprehensive approach harnesses the potential of ML
to make significant strides in early ASD detection and diagnosis, ultimately enhanc-
ing the quality of care and support provided to individuals on the autism spectrum.

3.5 Parkinson’s Disease Detection

Parkinson’s disease (PD) is a debilitating neurodegenerative condition primarily affecting movement control, characterized by symptoms like tremors and slowness
of movement. In the healthcare domain, machine learning (ML) is gaining promi-
nence as a tool to enhance early detection and monitoring of PD. ML brings notable
improvements to PD detection, notably through early diagnosis. By analyzing a com-
bination of clinical and biomedical data, ML models can identify PD in its nascent
stages, enabling timely intervention and treatment. Furthermore, ML provides objec-
tivity and quantifiable measurements, complementing clinical assessments and mit-
igating diagnostic subjectivity [4]. The continuous monitoring of motor symptoms,
facilitated by ML-based wearable devices and sensors, is another area where ML
shines. These devices offer a means of real-time tracking, thereby aiding in more
effective disease management. Additionally, ML excels in personalizing treatment
plans leveraging individual patient data to optimize therapeutic interventions [20].
ML’s application in PD detection encompasses classification models that distinguish
PD patients from healthy individuals based on patterns in data. Feature extraction
techniques identify pertinent features from diverse sources like speech, gait, and

clinical assessments. Moreover, time series data collected from sensors can be ana-
lyzed with specialized techniques such as recurrent neural networks (RNNs) to detect
subtle changes in motor patterns [23]. For instance, random forest models classify
individuals into PD and non-PD groups based on clinical and biomedical features.
Support vector machines (SVMs) are useful for binary classification tasks, effectively
separating PD patients from healthy individuals. Deep learning models, including
convolutional neural networks (CNNs) and recurrent neural networks (RNNs), ana-
lyze data like speech signals or gait patterns to identify PD-related abnormalities.
While PD detection using ML is an evolving field, several notable resources demon-
strate its potential. Initiatives like the Parkinson’s Voice Initiative [1] explore the use
of voice data and ML for PD detection, while research studies delve into gait analysis
and wearable sensors for monitoring. Publicly available datasets, such as the Parkin-
son’s Progression Markers Initiative (PPMI), provide essential data for research and
model development. This comprehensive approach harnesses the potential of ML to
make significant strides in early PD detection and monitoring, ultimately improving
the quality of care for individuals affected by this condition.
Table 2 presents a detailed comparison of these healthcare applications.

4 Assessment of Model Performance

4.1 Accuracy

Accuracy serves as a fundamental metric for evaluating model performance in healthcare applications, assessing the model’s overall correctness in its predictions. It quantifies the proportion of correctly predicted instances, encompassing both true positives and true negatives, out of the total instances in the dataset. In the healthcare
context, accuracy determines how effectively a model predicts patient outcomes or
conditions based on the input data.
The formula for calculating accuracy is:

\[
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \tag{1}
\]
\[
= \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}} \tag{2}
\]

4.2 Precision

Precision is a critical metric for evaluating model performance in healthcare applications, particularly when the consequences of false positives are significant. It measures the accuracy of positive predictions made by a model, specifically indicating

Table 2 Comparison of healthcare applications

Cancer detection
  Key ML attributes: automated feature extraction from medical images; classification of medical data; predictive regression modeling; ensemble learning through combinations of models; deep learning
  ML models: convolutional neural networks (CNNs); support vector machines (SVMs); random forest algorithms; logistic regression models

Diabetes detection
  Key ML attributes: risk assessment; early identification; personalized care; classification based on patterns in health data
  ML models: logistic regression; random forests; support vector machines (SVMs); deep neural networks

Heart disease
  Key ML attributes: assessing risk factors; early identification of potential issues; providing personalized care plans; categorizing individuals based on data patterns
  ML models: logistic regression; random forests; support vector machines (SVMs); deep neural networks

ASD detection
  Key ML attributes: early identification of ASD; comprehensive assessment of ASD risk; objectivity in diagnostic process; personalization of interventions and therapies
  ML models: random forests; support vector machines (SVMs); convolutional neural networks (CNNs); recurrent neural networks (RNNs)

Parkinson’s disease
  Key ML attributes: early detection and monitoring of PD; objectivity and quantifiable measurements; continuous monitoring of motor symptoms; personalizing treatment plans leveraging individual patient data to optimize therapeutic interventions
  ML models: random forests; support vector machines (SVMs); convolutional neural networks (CNNs); recurrent neural networks (RNNs)

how many of the positive predictions were correct. The formula for calculating precision is:

\[
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{3}
\]

4.3 F1-Score

The F1-score is a critical metric used in healthcare and other domains to assess the
performance of a model, especially when the balance between precision and recall
is crucial. It provides a single score that considers both false positives (FP) and false
negatives (FN) and is particularly valuable when dealing with imbalanced datasets
or scenarios where the cost of false positives and false negatives differs significantly.
The formula for calculating the F1-score is:

\[
\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\]

4.4 Specificity

Specificity is a critical performance metric used in healthcare to assess the accuracy of a model in correctly identifying negative instances. It measures the ability of a
model to avoid false alarms or false positives. In other words, specificity answers the
question, “Out of all the actual negative cases, how many did the model correctly
predict as negative?”
The formula for calculating specificity is:

\[
\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{5}
\]

4.5 Mean Squared Error (MSE)

Mean squared error (MSE) is another widely used metric for assessing the perfor-
mance of predictive models, including those employed in healthcare applications.
MSE measures the average of the squared differences between the predicted values
and the actual (ground truth) values in a dataset. It quantifies how well a model’s
predictions align with the true outcomes while emphasizing larger errors more than
smaller ones.
The mean squared error is built from the squared error (SE) of each prediction:

\[
\text{SE} = (\text{Actual Value} - \text{Predicted Value})^2 \tag{6}
\]

After calculating the squared errors for all data points, the mean squared error (MSE) is obtained by averaging them:

\[
\text{MSE} = \frac{1}{n} \sum \text{SE} \tag{7}
\]
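The metrics in Eqs. (1)–(7) can be checked directly with a few lines of Python; the confusion counts and toy regression values below are made up for illustration:

```python
# Computing the metrics of Eqs. (1)-(7) from confusion counts.
# The counts are made-up illustrative values.
tp, tn, fp, fn = 80, 90, 10, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)           # Eq. (2)
precision = tp / (tp + fp)                           # Eq. (3)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # Eq. (4)
specificity = tn / (tn + fp)                         # Eq. (5)

# MSE for a toy regression: squared errors, then their mean -- Eqs. (6)-(7)
actual = [3.0, 5.0, 2.5]
predicted = [2.5, 5.0, 3.0]
se = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(se) / len(se)

print(accuracy, precision, f1, specificity, mse)
```

With these counts, accuracy is 0.85 and specificity 0.9, showing how the two metrics answer different questions about the same confusion matrix.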

5 Challenges and Limitations

1. Data Privacy and Security

• Safeguarding Patient Data: Ensuring the confidentiality and security of patient data is a primary concern. The highly sensitive nature of healthcare data
demands rigorous measures to prevent unauthorized access and breaches [9].
• Secure Data Handling: The integration of ML systems requires meticulous
attention to the secure management and storage of patient data. This involves
the implementation of robust encryption, access controls, and monitoring to
protect against data vulnerabilities.

2. Data Diversity and Standardization

• Inherent Data Diversity: Healthcare data is inherently diverse and often unstruc-
tured, stemming from various sources and formats. This diversity poses a chal-
lenge for ML models designed for structured data [13].
• Standardization and Preprocessing: To make healthcare data compatible with
ML algorithms, it is crucial to standardize and preprocess the data. This includes
data cleaning, normalization, and structuring to enhance its usability.

3. Scalability and Interoperability

• Scalability Challenges: The scalability of ML models, particularly in the context of public healthcare networks, is a notable challenge. The increasing volume
of healthcare data requires significant computational resources.
• Interoperability Complexities: Achieving interoperability between various data
sources, ML models, and healthcare systems is intricate. It necessitates the
seamless exchange of data among disparate sources while maintaining data
consistency and integrity.

4. Regulatory Compliance

• Complex Regulatory Landscape: Adhering to healthcare regulations such as the “General Data Protection Regulation” and the “Health Insurance Portability and
Accountability Act” is complex and presents compliance challenges [7].
• Data Quality and Accuracy: Meeting regulatory standards necessitates main-
taining the quality and accuracy of healthcare data. Any inconsistencies or
inaccuracies can lead to compliance violations.

5. Ethical Concerns

• Fairness and Bias: ML models can inadvertently perpetuate biases present in their training data. In healthcare, this raises ethical considerations as biased
algorithms may lead to unfair or discriminatory outcomes.
• Patient Trust: Ethical concerns are particularly relevant in healthcare, where
patient trust is of utmost importance. Ensuring that ML models make decisions
that align with ethical standards is essential.

6 Discussion

6.1 Research Gaps

1. Limited Integration of Multimodal Data in Autism Spectrum Disorder (ASD) Detection

• While machine learning (ML) applications in ASD detection show promise, a research gap exists in the limited integration of multimodal data. Many studies
focus on individual data types such as clinical assessments, eye-tracking data, or
genetic information. Future research could explore the synergies and enhanced
accuracy achieved by combining these diverse data types in a comprehensive
ML model.

2. Longitudinal Studies in Parkinson’s Disease (PD) Detection

• The existing body of research on machine learning applications in PD detection is notably limited in terms of comprehensive longitudinal studies. In-depth
investigations involving prolonged monitoring and analysis of PD progression
patterns hold the potential to yield valuable insights into the efficacy of machine
learning models in capturing subtle changes over an extended period. Under-
taking such longitudinal studies would play a pivotal role in enhancing the
precision of early detection algorithms and fine-tuning personalized treatment
strategies.
3. Exploration of Explainability in ML Models for Heart Disease Detection

• While ML models exhibit high accuracy in heart disease detection, there is a research gap in the explainability of these models. Understanding the features
and factors contributing to model predictions is crucial for gaining trust from
healthcare professionals. Future research should focus on developing inter-
pretable ML models for heart disease detection.

6.2 Findings

1. Effectiveness of ML in Early Cancer Detection


• The application of ML, particularly convolutional neural networks (CNNs),
demonstrates high effectiveness in early cancer detection. Models such as those
used in Google Health’s AI for Breast Cancer Detection showcase the potential
for automated feature extraction from medical images, leading to improved
diagnostic accuracy.
2. ML’s Contribution to Personalized Diabetes Management
• ML contributes significantly to personalized diabetes management by assess-
ing diverse health-related factors. Logistic regression, random forests, support
vector machines, and deep neural networks collectively aid in risk assessment,
early identification, and personalized care for individuals with diabetes.
3. ML’s Role in Objectivity and Early Intervention in Autism Spectrum Disorder
• ML brings objectivity to the diagnostic process in autism spectrum disorder,
reducing subjectivity in assessments. The models, including random forests
and SVMs, enhance early identification of ASD by analyzing behavioral and
physiological patterns, enabling timely interventions and therapies.
4. Advancements in Continuous Monitoring for Parkinson’s Disease
• ML-based wearable devices and sensors contribute to continuous monitoring of
motor symptoms in PD. These advancements provide real-time tracking, offer-
ing valuable insights for more effective disease management and personalized
treatment plans based on individual patient data.

7 Conclusion

The integration of machine learning in healthcare presents a realm of immense promise coupled with intricate complexities. Our comprehensive analysis has delved
into algorithmic diversity, practical applications, model assessment metrics, data pri-
vacy and security, and the multifaceted challenges that underpin the symbiotic rela-
tionship between machine learning and healthcare. Machine learning algorithms,
ranging from deep learning to decision trees and clustering techniques, have emerged
as indispensable tools for healthcare practitioners. These algorithms facilitate early
disease detection, personalized treatment recommendations, and administrative opti-
mization, redefining healthcare delivery paradigms.
In assessing the performance of machine learning models in healthcare, a spectrum
of metrics, such as accuracy, precision, F1-score, specificity, and MSE, provides

quantitative and qualitative insights into predictive accuracy and model reliability.
Notably, the challenge of model interpretability, especially in deep learning models,
necessitates further attention. The sanctity of data privacy and security is paramount
in healthcare, where the exposure of sensitive patient information could lead to severe
consequences. Addressing these concerns mandates rigorous measures, including
robust encryption, access controls, and vigilant monitoring.
Ethical considerations surrounding algorithmic biases and patient trust are central
to responsible machine learning in healthcare. Correcting these biases is a pivotal
step in achieving fairness and equity in healthcare AI applications. The training and
adoption of machine learning technologies in healthcare necessitate educational ini-
tiatives tailored to the varying familiarity levels of healthcare professionals with these
tools. Legal matters concerning smart contract integration and data ownership are
imperative for the long-term viability of machine learning in healthcare, ensuring
alignment with existing legal frameworks. Furthermore, challenges related to infras-
tructure disparities, data quality, model interpretability, data labeling, and clinical
implementation demand nuanced solutions to realize the full potential of machine
learning in healthcare. In this pursuit of a harmonious synergy between machine
learning and healthcare, a collective commitment to ethical, secure, and compliant
innovation, along with rigorous educational advancements, holds the promise of
transformative healthcare improvements. As the journey unfolds, these considera-
tions shall drive the responsible utilization of machine learning, promising a healthier,
brighter future for all.

46 R. Raj and J. Kaliappan

Evolutionary Algorithms for Fibers Upgrade Sequence Problem on MB-EONs

Der-Rong Din

Abstract Elastic optical network (EON) technology proves to be a promising solution to meet the ever-increasing bandwidth demands of Internet traffic. The transition from single-band EON (SB-EON) to multi-band EON (MB-EON) is a complex,
multi-stage process, leading to the coexistence of fibers with different transmission
capacities (SB and MB) and the formation of a hybrid SB/MB-EON for an extended
period. This paper addresses the fiber upgrade sequence problem (FUSP), aiming to
find the optimal upgrade sequence with the minimum average weighted load ratio
(AWLR) while ensuring uninterrupted service throughout the entire process. We
propose two evolutionary algorithms, a genetic algorithm (GA) and a simulated annealing (SA) algorithm, to solve this problem. Our proposed algorithms outperform conventional heuristics. GA demonstrates superior performance compared to SA on larger networks or those with heavy traffic.

Keywords Fiber upgrade sequence · Multi-band · Elastic optical network


(EON) · Simulated annealing · Genetic algorithm

1 Introduction

Due to the widespread adoption of the Internet and 5G applications, it is anticipated that network traffic will increase at a rate of approximately 26% each year [1]. Therefore, the task of improving the capacity of the core network is very important. Based
on elastic optical networks (EONs), band-division multiplexing (BDM) technology
can be applied to construct multi-band EONs (MB-EONs). All frequency bands
(including U, L, C, S, E, and O) can be used on MB-EONs, and the total bandwidth
is about 50 THz [2]. In the investigation conducted in [3], the focus was on address-
ing the network upgrade challenge involved in transitioning from C band to C+L

D.-R. Din (B)
Department of Computer Science and Information Engineering, National Changhua University of Education, Changhua City 500, Taiwan, R.O.C.
e-mail: deron@cc.ncue.edu.tw
URL: http://deron.csie.ncue.edu.tw/

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 47
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_4
48 D.-R. Din

band. The primary consideration is the cost-effectiveness of the network, leading to the adoption of a multi-period batch upgrade strategy. Because the upgrade from
the single-band EON (SB-EON) to the MB-EON is a lengthy multi-stage process,
fibers with different transmission capacities may coexist for an extended period. This
resulting network is referred to as the hybrid SB/MB-EON, as detailed in [4].

1.1 Studied Problem

In this paper, the fibers upgrade sequence problem, involving the upgrade of SB-
EONs to MB-EONs without causing service disruption, was studied. To achieve this
goal, the lightpath through the upgraded fiber must be re-routed before the upgrade.
After upgrading, the MB-capable fiber can be utilized immediately. The demand
routing algorithm on the hybrid SB/MB-EON should be employed. Figure 1 shows
the upgrade process. Figure 1a shows all lightpaths. To upgrade fiber AB, lightpaths l_AB and l_BA should be re-routed (shown in Fig. 1b). In Fig. 1c, the upgrading of fiber AB results in the restoration of the original requests between A and B through the utilization of MB transmission. In Fig. 1d, e, the upgrade focus shifts to fiber BC, while in Fig. 1f, g, the upgrade attention is directed toward fiber AC.
In MB-EONs, the maximum distance of a lightpath is determined by [5]. The net-
work’s performance can be significantly influenced by the chosen upgrade sequence.
In this study, adhering to the no-service disruption constraint, we aim to identify the
optimal fiber upgrade sequence. This problem is formally referred to as the fibers
upgrade sequence problem (FUSP) [6]. The objective function of the problem is the
average weighted load ratio (AWLR). In a previous investigation [6], five heuristic algorithms were introduced to address the fiber upgrade sequence problem: (1) Random Sequence (RS), (2) Shortest Distance First (SDF), (3) Maximum Load First (MaxLF), (4) Minimum Load First (MinLF), and (5) Longest Distance First (LDF).
These conventional heuristic methods cannot solve large problems well due to their greedy improvement approach. Due to the hardness of the FUSP, the provision of an optimal

Fig. 1 Example illustrating fiber upgrade, re-routing, and restoration


Evolutionary Algorithms for Fibers Upgrade Sequence Problem … 49

solution in polynomial time is not guaranteed. To address this problem, effective genetic algorithm (GA) [7] and simulated annealing (SA) [8] algorithms were developed in this article.

2 Related Works

2.1 MB-EONs

In MB-EONs, the transmission band is expanded to the C+L band and even to other bands (S, E, etc.) [9, 10]. Moreover, the demand routing method should be re-developed by considering the multi-band transmitting capability. Hence, the conventional RMLSA problem in traditional EONs evolves into the routing, band, modulation format, and spectrum allocation (RBMLSA) problem within the context of MB-EONs [11, 12]. For a given demand, the RBMLSA problem is to find a lightpath between the two end nodes, determine the selected band and modulation format, and allocate a set of FSs on the network. These algorithms deal with pure MB-EONs, that is, networks in which all nodes and fibers are MB-capable.

2.2 MB-EON Upgrade Problem

The network upgrade can be done all at once or in stages. Since the upgrade of any node or fiber will interrupt transmission service, an all-at-once upgrade is almost impossible to adopt on an existing backbone network with a high transmission volume if transmission service must be provided continuously during the upgrade period. Thus, a multi-stage upgrade strategy is more feasible [3]. In [3], the authors considered the multi-stage cost-effectiveness of network expansion and adopted a multi-stage upgrade strategy. When choosing which optical fibers to upgrade, they considered not only the geographical location and network topology but also the annual traffic growth rate and cost estimates. In [13], the authors considered the importance of fiber upgrade options and explored the minimum cost of upgrading to support the C+L bands while maintaining transmission performance.
In [14, 15], the authors studied the resource allocation problem on hybrid SB/MB-EONs. They suggested that after a partial upgrade, re-provisioning should be performed on the lightpaths to obtain the actual benefits of the upgrade. In my previous study [6], the fiber upgrade sequence problem was first studied. Five algorithms were used to solve the FUSP: (1) Random Sequence, (2) Shortest Distance First, (3) Maximum Load First, (4) Minimum Load First, and (5) Longest Distance First.

For the hybrid SB/MB network, the node set consists of two subsets, containing the nodes with SB and MB transmission functions, respectively. The length of a lightpath for a demand is constrained by the limitations imposed on the transmitted signal by the hybrid SB/MB-EON. These limitations are estimated based on Table 1 [5]. In my previous article [16], service provisioning algorithms for the hybrid SB/MB-EONs were designed. The Least Load Ratio First (LLRF) algorithm, which focuses on balancing the band load across the entire network, yields the best performance.

3 Problem Formulation

This section outlines the notations, assumptions, and objective functions associ-
ated with the FUSP. The detailed problem formulation was stated in my previous
study [6].

3.1 Notations

• G = (V, E, dist, FSU): the physical network of the EON. V is the set of physical nodes, V = V_SB ∪ V_MB, where V_SB and V_MB represent nodes with SB and MB transmitting functions, respectively. E is the set of physical links, E = E_SB ∪ E_MB and E_SB ∩ E_MB = ∅, where E_SB and E_MB represent edges with SB and MB transmission functions, respectively. dist(e_i) is the length of link e_i ∈ E. Initially, V = V_SB, V_MB = ∅, E_SB = E, and E_MB = ∅.
• Λ: the |V| × |V| traffic requirement matrix of the network, where λ_sd ∈ Λ is the bandwidth of node pair (v_s, v_d).
• B: the set of available bands of the target MB-EON, where B = {C, L}, B = {C, L, S}, or B = {C, L, S, E}.
• FSU_bn: the total number of FSs provided by band bn ∈ B. Each FS can provide 12.5 Gb/s, and FSU_C = 344, FSU_L = 480, FSU_S = 760, and FSU_E = 1,136.
• ML(m) ∈ {x | 1 ≤ x ≤ 6, x integer}: the modulation level of modulation m ∈ M = {BPSK, QPSK, 8QAM, 16QAM, 32QAM, 64QAM}.
• TR(m, bn, B): the transparent reach (TR) of the selected modulation m ∈ M and band bn ∈ B. For the set of bands B, the value of TR(m, bn, B) can be determined from Table 1 [5].
• N_sd: the number of FSs of node pair (v_s, v_d).

Table 1 Transparent reach (TR) on MB-EONs [5]

{B}           Band   BPSK  QPSK  8QAM  16QAM  32QAM  64QAM
{C}           C      199   99    54    27     13     7
{C, L}        C      197   99    54    24     13     7
              L      167   84    46    22     11     6
{C, L, S}     C      174   87    47    23     12     6
              L      167   84    46    22     11     6
              S      148   74    41    20     10     5
{C, L, S, E}  C      130   65    35    17     8      4
              L      144   72    39    19     9      5
              S      102   51    29    14     7      3
              E      31    15    9     4      2      1

3.2 Assumptions

The assumptions of the FUSP are given as follows:


• After the lightpath is established, only the fibers between the nodes that have been
upgraded can use different bands for transmission.
• There is no-service disruption during fiber upgrading.
• Four routing constraints on EONs are considered in this article [17]: subcarrier
consecutiveness constraint, non-overlapping spectrum constraint, spectrum con-
tinuity constraint, and distance-limited transparent constraint.

3.3 Concept of Network Upgrade

The major requirement of the network upgrade problem considered in this paper is that there be no service disruption while the backbone is upgraded. To achieve this, all lightpaths passing through the fiber being upgraded must be re-routed to other links that are not suspended; however, additional transmission delays may occur between node pairs.
When a fiber is selected to be upgraded, before performing the upgrade, the
following actions are performed:
• Use the establish-then-remove strategy to reroute all lightpaths of requests passing through this fiber, to avoid service disruption.
• For all re-routed requests, the Least Load Ratio First algorithm [16] is performed on the current hybrid SB/MB-EON (with some upgraded nodes and fibers) to fully utilize the multi-band transmission functions and make delivery as efficient as possible.

During the fiber upgrade, perform the following processes:

• Stop all transmissions on the fiber and install new MB-transceivers (MB-TRs) and
MB-ROADM equipment (or an MB-BS band switch) on the end nodes of the fiber.
• The quality of transmission signals for all bands on this fiber is tested.
• Perform the LLRF algorithm [16] to route the connection requests on the new
network so that the most efficient retransmission allocations are made.
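As a concrete illustration, the two checklists above can be condensed into a per-stage loop on the triangle network of Fig. 1. This is a minimal sketch under simplifying assumptions: the toy `route` helper stands in for the LLRF algorithm of [16], and capacity, spectrum, and band bookkeeping are omitted.

```python
# Minimal sketch of the staged, no-disruption upgrade (Fig. 1).
# Assumptions: `route` is a toy stand-in for LLRF [16]; capacity,
# spectrum, and band bookkeeping are omitted.

NODES = {"A", "B", "C"}
LINKS = [("A", "B"), ("B", "C"), ("A", "C")]   # upgrade sequence e_1, e_2, e_3

def route(demand, avoid):
    """Toy router: direct link unless it must be avoided, else 2-hop detour."""
    u, v = demand
    direct = tuple(sorted((u, v)))
    if direct not in avoid:
        return [direct]
    w = (NODES - {u, v}).pop()                 # third node of the triangle
    return [tuple(sorted((u, w))), tuple(sorted((w, v)))]

def staged_upgrade(order, demands):
    mb = set()                                 # fibers already MB-capable
    paths = {d: route(d, avoid=set()) for d in demands}
    for fiber in order:                        # one upgrade stage per fiber
        # Part t_a: establish-then-remove -- move every lightpath off `fiber`
        # BEFORE the fiber is suspended, so no service is disrupted.
        for d in demands:
            if fiber in paths[d]:
                paths[d] = route(d, avoid={fiber})
        # Upgrade: suspend fiber, install MB-TRs/MB-ROADM, test all bands.
        mb.add(fiber)
        # Part t_b: re-provision all requests on the new hybrid SB/MB network.
        for d in demands:
            paths[d] = route(d, avoid=set())
    return paths, mb

paths, mb = staged_upgrade(LINKS, demands=list(LINKS))
print(mb == set(LINKS))                        # True: every fiber upgraded
```

After the last stage, every demand returns to its direct (now MB-capable) link, mirroring the restoration shown in Fig. 1c, e, g.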

3.4 Objective Function

In this article, in the t-th stage, the fiber e_t ∈ E is selected to be upgraded. Before upgrading the fiber e_t, the set (denoted as Path_ta) of all lightpaths currently routed through e_t is re-routed. In the first part of the stage (denoted as t_a), all requests are transmitted through Path_ta, and this part requires T_ta time units. In the second part of the stage (denoted as t_b), after the fiber e_t is upgraded and before going into the next upgrade stage, all connection requests are reallocated to find the lightpath set Path_tb [6]. This part requires T_tb time units. Thus, there are |E| upgrade stages to upgrade all fibers in total, and the total time is Σ_{t=1,...,|E|} (T_ta + T_tb). The upgrading time axis and the respective set of used lightpaths are shown in Fig. 2.
The upgrade sequence of optical fibers does affect the set of used lightpaths and
the spectrum’s utilization. How to find the best upgrade sequence is an important
problem.
The main issue of this article is to determine the upgrade sequence of fibers so
that the overall performance of the network is optimal during the upgrade process.
The problem is an optimal scheduling problem, and the objective function is the
AWLR [6]. The overall performance takes into account the weight of upgrade and
non-upgrade time.
Let G_ta(V_ta, E_ta, dist, FSU) and G_tb(V_tb, E_tb, dist, FSU) represent the upgrading and the upgraded physical networks of stage t, respectively. Let LR_ta and LR_tb represent the load ratios of the network at stages t_a and t_b, respectively. Considering the upgrade time-weighted load ratio of the network, the objective function is defined as

Fig. 2 Time axis of the upgrading process


AWLR = ( Σ_{t=1}^{|E|} (T_ta × LR_ta + T_tb × LR_tb) ) / |E|.   (1)

The load ratio (LR) signifies the FS utilization ratio within the network [6]. The LR of the network is defined as the ratio of the total number of used FSs to the total number of FSs provided by the network. The binary variable x_{i,e}^{bn} is set to 1 if the i-th FS of the edge e on band bn is occupied, and 0 otherwise. On the hybrid SB/MB-EON, the bands provided by the edges may differ. Let B_e represent all possible bands provided by the edge e. The load ratio LR_e of edge e can be computed by

LR_e = ( Σ_{bn∈B_e} Σ_{i=1}^{FSU_bn} x_{i,e}^{bn} ) / ( Σ_{bn∈B_e} FSU_bn ).   (2)

The LR of the whole band bn ∈ B can be computed by

LR_bn = ( Σ_{∀e∈E} Σ_{i=1}^{FSU_bn} x_{i,e}^{bn} ) / ( Σ_{∀e∈E} FSU_bn ).   (3)

And, at stage t_a, the LR of the network is computed by

LR_ta = ( Σ_{∀e∈E} Σ_{bn∈B_e} Σ_{i=1}^{FSU_bn} x_{i,e}^{bn} ) / ( Σ_{∀e∈E} Σ_{bn∈B} FSU_bn ).   (4)

Thus, besides assessing the utilization ratio of spectrum slots on the optical fibers, the aim is also to prevent the overloading of a single band. Therefore, the network utilization ratio LR_ta in the network upgrade stage t_a can be expressed as Equation (4) (the same formula can be deduced for stage t_b).
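To make Eqs. (1)-(4) concrete, the following sketch evaluates them on an invented two-edge example. The FS counts FSU_C = 344 and FSU_L = 480 are taken from Sect. 3.1, while the occupancy numbers are assumptions chosen purely for illustration.

```python
# Numerical sketch of Eqs. (1)-(4). FSU values follow Sect. 3.1; the
# occupancy counts in `used` are invented for illustration.

FSU = {"C": 344, "L": 480}
B = ["C", "L"]                                 # target band set {C, L}

bands = {"e1": ["C"], "e2": ["C", "L"]}        # B_e: e2 is already MB-capable
used = {"e1": {"C": 172}, "e2": {"C": 86, "L": 120}}  # sum_i x_{i,e}^{bn}

def lr_edge(e):                                # Eq. (2): per-edge load ratio
    return sum(used[e].values()) / sum(FSU[bn] for bn in bands[e])

def lr_band(bn):                               # Eq. (3): per-band load ratio
    num = sum(used[e].get(bn, 0) for e in used)
    den = sum(FSU[bn] for _ in used)           # |E| * FSU_bn
    return num / den

def lr_network():                              # Eq. (4): network load ratio
    num = sum(sum(used[e].values()) for e in used)
    den = sum(FSU[bn] for _ in used for bn in B)   # denominator ranges over B
    return num / den

def awlr(stages):                              # Eq. (1): (T_ta, LR_ta, T_tb, LR_tb) per stage
    return sum(ta * lra + tb * lrb for ta, lra, tb, lrb in stages) / len(stages)

print(lr_edge("e1"))                           # 172 / 344 = 0.5
```

Note that, per Eq. (4), the denominator counts every band in B for every edge, so a not-yet-upgraded SB edge lowers the network-wide ratio relative to its per-edge value from Eq. (2).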

4 Proposed Algorithms

The conventional heuristic methods face challenges in solving large problems due to their greedy improvement approach. Given the complexity of the fiber upgrade sequence problem on MB-EONs, providing an optimal solution in polynomial time is not feasible. To address real-world challenges, this article introduces effective algorithms, namely a GA [7] and an SA [8] algorithm. In these proposed algorithms, initial solutions are generated using previously proposed heuristics [6] or randomly generated upgrade sequences. The effectiveness of the proposed algorithms (GA and SA) is examined through numerical simulations.

4.1 Genetic Algorithm

In this subsection, I will provide more details about the GA designed to solve the
FUSP.
Chromosomal coding Since the FUSP involves determining the upgrade sequence of the fibers, the encoding method uses a one-dimensional array of integers with size |E|. A sequence chromosome SC is used to represent the upgrade sequence of all fibers. The value in SC[t] is the fiber e_SC[t] ∈ E, which will be selected and upgraded to provide MB-EON transmitting capability in the t-th stage. Note that SC represents a permutation of the distinct integers {1, 2, ..., |E|}. Initially, the population of SCs is generated randomly, except that five SCs of the population are encoded by the algorithms presented in [6].
This encoding allows the GA to explore different fiber upgrade sequences as potential solutions to the problem. Each chromosome in the population represents a unique sequence in which the fibers will be upgraded over the stages.
Fitness Function The GA maps objective costs to fitness values using a fitness function. The cost (denoted as α(SC)) of an SC is the objective function defined by Equation (1). In the GA, the best-fit chromosome should have a higher probability of being selected as a parent, and this probability is proportional to its fitness. To this end, a large number C_max is used: the objective cost is subtracted from it to obtain the fitness value, i.e., Maximize C_max − α(SC), where C_max represents the maximum value of the cost function observed so far over all populations. This formulation ensures that the chromosome with the lowest cost (according to the objective function) will have the highest fitness value.
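The cost-to-fitness mapping and fitness-proportional (roulette-wheel) selection can be sketched as follows; the cost values are invented stand-ins for AWLR costs from Eq. (1).

```python
# Sketch of fitness = (max cost observed) - alpha(SC) and roulette-wheel
# parent selection. Costs are invented stand-ins for AWLR values (Eq. (1)).

import random

def fitnesses(costs):
    c_max = max(costs)                 # maximum cost observed so far
    return [c_max - c for c in costs]  # lowest cost -> highest fitness

def select_parent(population, fits):
    # Selection probability proportional to fitness (roulette wheel).
    return random.choices(population, weights=fits, k=1)[0]

population = ["SC1", "SC2", "SC3"]
fits = fitnesses([0.95, 0.90, 0.99])   # SC2 is cheapest, hence fittest
print(select_parent(population, fits) in ("SC1", "SC2"))   # True: SC3 has fitness 0
```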
Crossover Operator The single-point crossover (SPC) is used in the proposed GA. First, two parents are selected randomly, based on their fitness values, for crossover. Second, the crossover point i is randomly selected in [1, |E|]. Then the traditional SPC is performed. After performing SPC, the resulting children's SCs may not be feasible sequences. To avoid this, each child SC is changed into a feasible one by modifying the smaller portion of the child SC, replacing it with the corresponding edge sequence from the parent so that no edges are replicated.
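One way to realize this repair is sketched below; the paper only outlines the repair rule, so the exact gene-filling order used here (tail genes taken from the other parent in their original order, skipping duplicates) is an assumption.

```python
# Sketch of single-point crossover on permutations with a repair step.
# The repair fills the tail with the other parent's genes in order,
# skipping duplicates; the exact repair order is an assumption.

import random

def spc(parent1, parent2, point=None):
    n = len(parent1)
    i = point if point is not None else random.randint(1, n - 1)
    def child(head_src, tail_src):
        head = head_src[:i]
        tail = [g for g in tail_src if g not in head]  # no replicated edges
        return head + tail
    return child(parent1, parent2), child(parent2, parent1)

c1, c2 = spc([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], point=2)
print(c1)   # [1, 2, 5, 4, 3]
print(c2)   # [5, 4, 1, 2, 3]
```

Both children inherit an intact head from one parent and remain valid permutations, which is the property the repair must guarantee.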
Mutations or Perturbation Mechanism Four types of mutation operators are proposed and used in the GA. These operations are exactly the same as the perturbations developed for the SA. After performing a mutation (perturbation), the resulting SC is still a constraint-satisfying one.

• Edge Exchanging Perturbation (EEP): First, two integers i and j (i ≠ j) in [1, |E|] are randomly selected. Then, the values of SC[i] and SC[j] are exchanged.
• Sub-sequence Reversing Perturbation (SSRP): First, two integers i and j in [1, |E|] (assume i < j) are randomly selected. Then, the sub-sequence between SC[i] and SC[j] (i.e., SC[i..j]) is reversed.
• Sub-sequence Shifting Perturbation (SSSP): First, a sub-sequence SC[i1..i2] (i2 > i1) and an integer j (with j < i1 or i2 < j < |E| − (i2 − i1)) are randomly selected. The sub-sequence SC[i1..i2] is shifted to SC[j..(j + i2 − i1)], and the other contents are adjusted accordingly.
• Sub-sequence Exchanging Perturbation (SSEP): First, an integer i in [1, |E|] is randomly selected. Then, the two sub-sequences SC[1..i] and SC[i + 1..|E|] are exchanged.
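The four perturbations can be sketched as follows (0-based indices here, versus the 1-based positions in the description above; each operator returns a copy and always yields a valid permutation).

```python
# The four perturbation operators, sketched with 0-based indices (the
# text above uses 1-based positions). Each returns a new valid sequence.

def eep(sc, i, j):                       # Edge Exchanging: swap two entries
    out = sc[:]
    out[i], out[j] = out[j], out[i]
    return out

def ssrp(sc, i, j):                      # Sub-sequence Reversing (i < j)
    return sc[:i] + sc[i:j + 1][::-1] + sc[j + 1:]

def sssp(sc, i1, i2, j):                 # Sub-sequence Shifting: move block to j
    block, rest = sc[i1:i2 + 1], sc[:i1] + sc[i2 + 1:]
    return rest[:j] + block + rest[j:]

def ssep(sc, i):                         # Sub-sequence Exchanging: swap halves
    return sc[i + 1:] + sc[:i + 1]

sc = [1, 2, 3, 4, 5]
print(eep(sc, 0, 4))      # [5, 2, 3, 4, 1]
print(ssrp(sc, 1, 3))     # [1, 4, 3, 2, 5]
print(sssp(sc, 1, 2, 0))  # [2, 3, 1, 4, 5]
print(ssep(sc, 2))        # [4, 5, 1, 2, 3]
```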
Termination Rule The number of chromosomes in the chromosome pool is kept constant at N_population. The execution of the GA is halted when the number of generations (N_generation) surpasses an upper limit defined by the user.

4.2 Simulated Annealing

In this subsection, an SA algorithm is presented, and the details of the SA are described as follows:

• Configuration encoding: A sequence configuration SC (a one-dimensional array with size |E|) is introduced to represent the upgrade sequence of the network.
• Cost function: The cost of a given sequence configuration SC can be determined by performing the processes described in Sect. 3. Since the goal is to minimize the total cost, the objective function, denoted as α(SC), is given by Equation (1) and is associated with each SC.
• Cooling schedule: The cooling schedule's temperatures are determined by the statistics calculated during the search process, as indicated in [8]. The initial value of the control temperature is set as T_0 = 9999. In the simulated annealing algorithm, the decrement rule is expressed as T_{k+1} = γ·T_k, where γ is empirically set to 0.99. The final value of the control temperature is defined as T_final = 0.01. Additionally, the length of the Markov chains is fixed at 10 transitions.
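Under these settings, the SA skeleton looks as follows. The toy cost function (counting inversions of a permutation) is an illustrative stand-in for the AWLR of Eq. (1), which in the real algorithm requires simulating the whole upgrade process.

```python
# SA skeleton with the stated schedule: T_0 = 9999, T_{k+1} = 0.99 T_k,
# T_final = 0.01, 10 transitions per temperature. The inversion-count
# cost is a toy stand-in for the AWLR objective of Eq. (1).

import math
import random

def simulated_annealing(sc0, cost, perturb,
                        t0=9999.0, gamma=0.99, t_final=0.01, chain_len=10):
    sc, best = sc0, sc0
    t = t0
    while t > t_final:
        for _ in range(chain_len):                 # fixed-length Markov chain
            cand = perturb(sc)
            delta = cost(cand) - cost(sc)
            # Metropolis rule: accept improvements; accept deteriorations
            # with probability exp(-delta / t).
            if delta <= 0 or random.random() < math.exp(-delta / t):
                sc = cand
            if cost(sc) < cost(best):
                best = sc
        t *= gamma                                 # geometric cooling
    return best

def swap_perturb(sc):                              # EEP-style neighbor move
    i, j = sorted(random.sample(range(len(sc)), 2))
    out = sc[:]
    out[i], out[j] = out[j], out[i]
    return out

def inversions(sc):                                # toy cost: 0 when sorted
    return sum(sc[a] > sc[b] for a in range(len(sc)) for b in range(a + 1, len(sc)))

random.seed(7)
best = simulated_annealing(list(range(6, 0, -1)), inversions, swap_perturb)
print(sorted(best) == [1, 2, 3, 4, 5, 6])          # True: still a permutation
```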

5 Simulation Results

We use the C++ programming language to implement the proposed algorithms. A personal computer (PC) is used for the simulations. The PC is equipped with an Intel Core i7-11700 CPU running at 2.5 GHz, 16 GB of RAM, and the Windows 11 Pro 64-bit operating system. The simulation scenarios involve the COST239 and NSF14 networks, depicted in Fig. 3. For each potential pair of nodes (v_s, v_d) in the network, λ_sd was randomly chosen from the interval [50, 200] Gb/s and then multiplied by the 'load factor.' The LLRF algorithm [16] was applied to determine the band of the routing path for every connection request. Subsequently, the spectrum assignment was carried out in a first-fit manner. To preclude any disruptions, the transparent reach (TR) of each request had to satisfy the constraints detailed in Table 1.
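The traffic generation just described can be sketched as follows. Converting λ_sd to an FS count via the 12.5 Gb/s granularity of Sect. 3.1 is an assumption made here for illustration; in the actual simulations, the modulation format chosen by LLRF determines how many FSs a demand occupies.

```python
# Sketch of the simulated traffic: lambda_sd ~ U[50, 200] Gb/s times the
# load factor. Converting Gb/s to an FS count at 12.5 Gb/s per FS is an
# illustrative assumption (the real FS need depends on the modulation).

import math
import random

def generate_traffic(num_nodes, load_factor, rng=None):
    rng = rng or random.Random(42)
    traffic = {}
    for s in range(num_nodes):
        for d in range(s + 1, num_nodes):
            lam = rng.uniform(50, 200) * load_factor   # demand in Gb/s
            traffic[(s, d)] = math.ceil(lam / 12.5)    # N_sd in FSs
    return traffic

t = generate_traffic(num_nodes=4, load_factor=1)
print(len(t))   # 6 node pairs for a 4-node network
```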

Table 2 Simulation summary for different sets of B

B            {C, L, S, E}  {C, L, S}  {C, L}  Average
COST239
SA_final     0.929         0.959      0.982   0.957
GA_initial   0.954         0.985      0.984   0.974
GA_final     0.947         0.980      0.980   0.969
NSF14
SA_final     0.941         0.939      0.971   0.950
GA_initial   0.976         0.975      0.969   0.973
GA_final     0.947         0.943      0.964   0.951
Average
SA_final     0.935         0.949      0.977   0.953
GA_initial   0.965         0.980      0.976   0.974
GA_final     0.947         0.961      0.972   0.960

In the proposed GA, we set the crossover probability to 0.8, the population size to 50, and the number of generations to 30. However, the mutation probability (p_m) is a critical factor that can impact the GA's performance, particularly in relation to the traffic load. To identify the optimal value of p_m, we conducted simulations on the COST239 network, considering two load factors: 1 and 5. These simulations were performed on the target set B = {C, L, S, E}. The results, depicted in Fig. 4, showcase the average AWLR (denoted as p_m with the suffix a) and the best AWLR (denoted as p_m with the suffix b) for each generation. In the case of a light load (load factor 1), as illustrated in Fig. 4a, p_m = 0.15 yielded the optimal result, closely followed by the scenario with p_m = 0.0. For a heavier load (load factor 5), depicted in Fig. 4b, the optimal p_m was found to be 0.2, with the second-best scenario occurring at p_m = 0.15.
For the COST239 network and the different sets B = {C, L}, B = {C, L, S}, and B = {C, L, S, E}, along with varying load factors {1, 1.5, ..., 5}, the simulation results of the GA, SA, and the initial heuristic algorithms are depicted in Fig. 5. Here, GA_initial represents the best AWLR of the initial SCs obtained by applying the five heuristic algorithms [6]. Meanwhile, SA_initial is the AWLR of the initial SC determined randomly by the SA algorithm, serving as a baseline at 100%. GA_final and SA_final signify the final results of the GA and SA, respectively. In Fig. 5, the ratio of the AWLR of the other methods to that of SA_initial is compared. A summary over the different load factors is presented in Table 2.
The results show that SA achieves better performance than GA and the heuristic algorithms for the cases with target sets B = {C, L, S, E} and B = {C, L, S}, shown in Fig. 5a, b, respectively. For the set B = {C, L}, GA obtains the best result in most of the cases due to the lack of network resources (shown in Fig. 5c). On average, SA emerges as the superior algorithm, achieving an AWLR ratio of 95.7%.

Fig. 3 Simulations networks: a COST239, b NSF14

Fig. 4 Simulations results for different . pm on COST239 network of GA: a load factor 1, b load
factor 5

For the NSF14 network and the different sets B, the results of the GA, SA, and the initial heuristic algorithms are presented in Fig. 6, and a summary over the various load factors is provided in Table 2. The results show that SA achieves the best performance compared with GA and the heuristic algorithms in Fig. 6a. However, the average AWLRs of GA and SA are close (95.1% vs. 95.0%). In some cases, GA obtains a better AWLR than SA for B = {C, L, S} and B = {C, L}, shown in Fig. 6b, c, respectively.
It is apparent that GA excels at finding the best AWLR among the set of chromosomes in the initial population, outperforming SA_initial. However, while the crossover operation in GA may contribute to increased diversity in the population, the perturbations affect the chromosomes only through localized changes. Although GA's performance might be suboptimal on the COST239 network, it demonstrates an ability to achieve a better AWLR than SA on larger networks or networks with heavy traffic.

Fig. 5 Simulations results for different sets. B on COST239 network a . B ={C, L, S, E}, b. B ={C,
L, S,}, c . B ={C, L}

Fig. 6 Simulations results for different sets . B on NSF14 network a . B ={C, L, S, E}, b . B ={C,
L, S,}, c . B ={C, L}

6 Conclusions

This paper explores the fiber upgrade sequence problem (FUSP) within the context
of MB-EONs, intending to minimize the AWLR. To tackle this issue, we introduce
GA and SA algorithms and validate their effectiveness through simulations. Our
proposed algorithms exhibit superior performance when compared to traditional heuristics. Notably, the GA can surpass the SA on larger networks or those with high traffic volumes. On the other hand, the SA outperforms the GA in terms of AWLR on smaller networks.

Acknowledgements This work was supported in part by the NSTC projects under Grant Numbers NSTC–111–2221–E–018–005 and NSTC–112–2221–E–018–008–MY2.

References

1. Antonelli C, Shtaif M, Mecozzi A (2015) Modeling of nonlinear propagation in space-division multiplexed fiber-optic transmission. J Lightwave Technol 34(1):36–54
2. Ferrari A et al (2020) Assessment on the achievable throughput of multi-band ITU-T G.652.D
fiber transmission systems. J Lightwave Technol 38(16):4279–4291
3. Ahmed T, Mitra A, Rahman S, Tornatore M, Lord A, Mukherjee B (2021) C+L-band upgrade
strategies to sustain traffic growth in optical backbone networks. J Opt Commun Netw
13(7):193–203
4. Yao Q, Yang H, Bao B, Zhang J, Wang H, Ge D, Liu S, Wang D, Li Y, Zhang D, Li H (2022)
SNR re-verification-based routing, band, modulation, and spectrum assignment in hybrid C-
C+L optical networks. J Lightwave Technol 40(11):3456–3469
5. Paz E, Saavedra G (2021) Maximum transmission reach for optical signals in elastic optical networks employing band division multiplexing. arXiv:2011.03671
6. Din D-R (2023) Fibers upgrade sequence problem on multi-band elastic optical networks. In:
Proceedings of 11th international conference on computer and communications management
(ICCCM 2023), pp 174–181, Nagoya, Japan, August 4–6
7. Mitchell M (1999) An introduction to genetic algorithms, 5th edn. The MIT Press, Cambridge,
MA
8. van Laarhoven PJM, Aarts EHL (1987) Simulated annealing: theory and applications.
Springer, Dordrecht
9. Virgillito E, Sadeghi R, Ferrari A, Borraccini G, Napoli A, Curri V (2020) Network performance
assessment of C+L upgrades vs. fiber doubling SDM solutions. In: Proceedings of 2020 optical
fiber communications conference and exhibition (OFC), San Diego, CA, USA, 2020, 1–3
10. Virgillito E, Sadeghi R, Ferrari A, Napoli A, Correia B, Curri V (2020) Network performance
assessment with uniform and non-uniform nodes distribution in C+L upgrades vs. fiber doubling
SDM solutions. In: Proceedings of optical network design and modeling (ONDM), 2020, pp 1–6
11. Sambo N, Ferrari A, Napoli A, Costa N, Pedro J, Sommerkorn-Krombholz B, Castoldi P, Curri
V (2020) Provisioning in multi-band optical networks. J Lightwave Technol 38(9):2598–2605
12. Mehrabi M, Beyranvand H, Emadi MJ (2021) Multi-band elastic optical networks: inter-channel
stimulated Raman scattering-aware routing, modulation level, and spectrum assignment. J
Lightwave Technol 39(11):3360–3370
13. Moniz D, Lopez V, Pedro J (2020) Design strategies exploiting C+L band in networks with
geographically-dependent fiber upgrade expenditures. In: Proceedings of optical fiber commu-
nications conference and exhibition (OFC), March 2020, pp 1–3
14. Ahmed T, Rahman S, Pradhan A, Mitra A, Tornatore M, Lord A, Mukherjee B (2021) C to C+L
bands upgrade with resource re-provisioning in optical backbone networks. In: Proceedings of
2021 optical fiber communications conference and exhibition (OFC), pp 1–3
15. Ahmed T, Mitra A, Rahman S, Tornatore M, Lord A, Mukherjee B (2021) C+L-band upgrade
strategies to sustain traffic growth in optical backbone networks. J Opt Commun Netw
13(7):193–203
16. Din D-R (2023) Heuristic algorithms for demand provisioning in hybrid single/multi-band
elastic optical networks. In: Proceedings of the 8th Optoelectronics Global Conference (OGC
2023), Shenzhen, China, Sept 5–8, 2023
17. Wang Y, Cao X, Pan Y (2011) A study of the routing and spectrum allocation in the SLICE
network. In: Proceedings of IEEE INFOCOM’11, 10–15, April 2011, Shanghai, China, pp
1503–1511
Exploring the Potential of Deep Learning
Algorithms in Medical Image Processing:
A Comprehensive Analysis

Ganesh Prasad Pal and Raju Pal

Abstract In this study, we comprehensively examine the potential of deep learning algorithms in the domain of medical image processing. Through a systematic anal-
ysis of existing literature, we explore the applications, methodologies, and outcomes
of these algorithms across diverse medical specialties. Our analysis underscores the
transformative impact of convolutional neural networks and transfer learning tech-
niques in enhancing diagnostic accuracy, image segmentation, and disease detec-
tion. We discuss challenges such as data variability, ethics, and interpretability
while emphasizing the importance of interdisciplinary collaboration. By synthe-
sizing findings and identifying future directions, we highlight the promising role
of deep learning in revolutionizing healthcare diagnostics and treatment planning.
After conducting a thorough analysis of various literature and empirical studies, we
have evaluated the capabilities and drawbacks of deep learning models in managing
different medical imaging modalities such as X-rays, MRIs, and CT scans.

Keywords Deep learning · Convolutional neural networks · COVID-19 detection · Medical imaging survey

1 Introduction

Medical image processing (MIP) is a cornerstone of modern health care, facilitating the early detection, accurate diagnosis, and effective treatment of a myriad of medical conditions. With the advent of sophisticated imaging modalities and the exponential

G. P. Pal (B)
Department of Computer Science and Engineering, Jaypee Institute of Information Technology,
Noida, India
e-mail: ganeshpal1ster@gmail.com
R. Pal
Department of Computer Science and Engineering, School of Information and Communication
Technology, Gautam Buddha University, Greater Noida, India
e-mail: raju.pal@gbu.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 61
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_5

growth in medical data, the need for efficient and accurate image analysis techniques
has become more pronounced than ever [1].
Deep learning (DL), a subtype of machine learning (ML), has attracted considerable attention for its ability to autonomously extract complex patterns and representations from data. Medical images, which are multi-
niques [2]. By integrating deep learning algorithms with medical image analysis
(MIA), we can significantly enhance the precision, efficiency, and scope of clinical
research and diagnostic efforts.
This paper’s main goal is to undertake a thorough study of the possibilities of
DL algorithms for analyzing medical images. We will explore how algorithms are
disrupting the healthcare industry by examining their capabilities [3]. We also explore
the rise of deep CNNs, which have become a dominant force in capturing complex image features and enabling precise analysis [4].
Our paper culminates in a thorough review of the current state of the art, accompanied by a critical discussion of emerging problems and future avenues of research. As a testament to the practicality of deep learning, we illuminate its appli-
cation in the specific contexts of COVID-19 detection and child bone age prediction,
showcasing the adaptability of deep learning algorithms to evolving medical needs
[5].
In summary, this paper endeavors to unravel the capability of DL algorithms in
revolutionizing MIP. By scrutinizing their capabilities, contributions, and limitations,
we aim to produce an extensive foundation for harnessing the power of deep learning
(DL) in the pursuit of improved healthcare diagnostics and treatment strategies.

2 Literature Review

This section provides an extensive review of the available literature (Table 1), highlighting the pivotal role of DL techniques across diverse medical domains [6].

2.1 Deep Learning Techniques in MIA

The use of DL algorithms for MIA has witnessed exponential growth, owing to their
ability to extract complex details, features, and patterns. A multitude of studies has
demonstrated the efficacy of convolutional neural networks in deciphering composite
medical images. Figure 1 shows the schematic layout of a fundamental CNN. For
instance, Smith et al. employed CNNs for robust lung nodule detection in pulmonary
images, achieving remarkable sensitivity and specificity rates [12].
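The CNN sketched in Fig. 1 is built from convolution layers followed by nonlinearities. As a rough illustration of that core operation only (not of any specific published architecture), the NumPy sketch below slides a hand-picked 3×3 kernel over a tiny synthetic image and applies ReLU; in a trained CNN the kernel values would be learned from data rather than chosen by hand.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (no padding, stride 1), as computed in a CNN layer."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            # Dot product of the kernel with the image patch under it
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def relu(x):
    """Nonlinearity applied after the convolution."""
    return np.maximum(x, 0.0)

# A tiny 5x5 "image" with a vertical edge, and a 3x3 edge-style kernel.
img = np.array([[0, 0, 1, 1, 1],
                [0, 0, 1, 1, 1],
                [0, 0, 1, 1, 1],
                [0, 0, 1, 1, 1],
                [0, 0, 1, 1, 1]], dtype=float)
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)

feature_map = relu(conv2d(img, edge_kernel))  # 3x3 activation map
```

Real CNNs learn many such kernels per layer and stack them with pooling and fully connected layers, as in the architectures cited throughout this survey.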

Table 1 Literature review findings

1. Dhiman et al. [7]
Contributions: Compares the performance of different COVID-19 models using metrics such as accuracy, recall, specificity, precision, and F1-score.
Limitations: Lacks a demonstration of the algorithms and implementation used to attain the presented results.

2. Narin et al. [8]
Contributions: Shows how the integration of CNNs and advances in transfer learning has revolutionized medical image processing, enabling accurate diagnosis, disease characterization, and treatment.
Limitations: The implementation's input parameters are not discussed.

3. Srinidhi et al. [9]
Contributions: Presents a comprehensive review of deep neural network models in computational histopathology image analysis.
Limitations: Does not describe the algorithms used to obtain the results or how the implementation was carried out.

4. Sedik et al. [10]
Contributions: Proposes a three-phase methodology to distinguish between COVID-19 and non-COVID-19 lung CT scan slices.
Limitations: Does not describe the algorithms used to obtain the results or how the implementation was carried out.

5. Shankar et al. [11]
Contributions: Proposes an effective model, FM-HCF-DLF, for diagnosing and classifying COVID-19; preprocessing uses the GF technique to eliminate image noise.
Limitations: The maximum outcome was achieved only by simulating the FM-HCF-DLF model on a chest X-ray dataset.

Fig. 1 Diagram of a basic CNN

2.2 Applications of DL in Medical Imaging

DL’s versatility is showcased through its applications in a spectrum of medical imaging domains. Neurology, a field marked by its intricate structural and func-
tional imaging requirements, has seen promising advancements. Notable research by
Johnson et al. [13] harnessed CNNs to perform automated brain lesion segmentation
with unprecedented accuracy, aiding neurologists in timely diagnosis.
Moreover, retinal imaging, a critical tool for early disease detection, has been
augmented by deep learning. Smithson et al. employed deep learning models for
diabetic retinopathy detection, demonstrating substantial improvements in diagnostic
accuracy compared to traditional methods.

2.3 Emerging Trends and Future Directions

The field of DL in medical image processing is constantly developing, with new trends emerging that are likely to shape its future. Transfer learning has gained
momentum as a way to speed up model convergence and improve predictive accuracy.
Researchers like Liu et al. [14] have shown that pre-trained models can be transferred
to new medical domains, making the process more efficient and enabling better
generalization.
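The transfer-learning recipe described above (reuse a pretrained backbone, train only a small task-specific head) can be sketched as follows. Everything here is illustrative: the "pretrained" extractor is a fixed random projection and the target-domain data are synthetic, purely to show the freeze-backbone/train-head split rather than any real medical pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pretrained backbone: a fixed feature projection.
W_frozen = 0.1 * rng.normal(size=(64, 8))

def extract_features(x):
    """Frozen backbone: these weights are NOT updated during fine-tuning."""
    return np.tanh(x @ W_frozen)

def train_head(feats, labels, lr=0.5, epochs=200):
    """Train only the new task-specific head (logistic regression)."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid
        grad = p - labels                            # dL/dlogits for log-loss
        w -= lr * feats.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

# Toy target-domain data: two classes with different mean intensity.
x = np.vstack([rng.normal(0.0, 1.0, (50, 64)), rng.normal(1.0, 1.0, (50, 64))])
y = np.array([0] * 50 + [1] * 50)

feats = extract_features(x)          # backbone stays frozen
w, b = train_head(feats, y)          # only the head is trained
pred = (feats @ w + b > 0).astype(int)
accuracy = (pred == y).mean()
```

In practice the frozen extractor would be, for example, a CNN pretrained on a large natural-image corpus, and the head would be fine-tuned on the (typically much smaller) medical dataset.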

3 Research Challenges

To conduct a comprehensive analysis of DL algorithms in MIP, we employed a systematic literature review approach. A thorough search of electronic databases,
including PubMed, IEEE Xplore, and Google Scholar, was conducted to identify
suitable research articles, conference proceedings, and scholarly publications. The
search utilized a combination of keywords such as “DL,” “MIP,” “CNN,” and domain-
specific terms (e.g., “neurology,” “retinal imaging”) [15].
The articles that were found have been reviewed based on predetermined criteria
for inclusion and exclusion. Only studies that focused on the application of DL algo-
rithms in MIA were included. Studies that provided insights into image registration,
segmentation, feature extraction, or classification were considered relevant [16].
The process of data extraction was carried out systematically to gather relevant
information from each study that was selected. Extracted data included the study’s
title, authors, publication year, deep learning techniques employed, medical domain
or application, primary findings, and any challenges or limitations identified. The
extracted data were categorized based on the specific medical domains addressed in
the study, such as neurology, retinal imaging, pulmonary studies, digital pathology,
breast imaging, cardiac analysis, and musculoskeletal assessments [17].

4 Methodology for Literature Review and Data Collection

4.1 Literature Collection and Selection

To conduct a comprehensive analysis of DL algorithms in MIP, a systematic literature review approach was employed. The following steps outline the process of collecting
and organizing data for analysis.
A systematic search was conducted across prominent electronic databases,
including PubMed, IEEE Xplore, and Google Scholar. The search strategy employed
a combination of keywords such as “DL,” “MIP,” “CNN,” and domain-specific terms.
The search was conducted within a specified timeframe to ensure the inclusion of
recent and relevant studies [18].
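The keyword-and-timeframe screening step can be mimicked programmatically. The records, keyword list, and year window below are illustrative placeholders; the paper does not publish its exact query strings or corpus, so none of these values come from the study itself.

```python
# Illustrative screening of retrieved records by keyword and publication window.
KEYWORDS = {"deep learning", "cnn", "medical image"}
YEAR_RANGE = (2015, 2023)  # assumed window; the paper does not state one

records = [
    {"title": "CNN-based lung nodule detection", "year": 2019},
    {"title": "A survey of classical edge detectors", "year": 2008},
    {"title": "Deep learning for retinal imaging", "year": 2021},
    {"title": "Medical image registration with deep learning", "year": 2014},
]

def matches(record):
    """Include a record if it falls in the window and mentions any keyword."""
    in_window = YEAR_RANGE[0] <= record["year"] <= YEAR_RANGE[1]
    text = record["title"].lower()
    return in_window and any(kw in text for kw in KEYWORDS)

included = [r["title"] for r in records if matches(r)]
```

A real screening pipeline would of course also apply the inclusion/exclusion criteria described in Sect. 3 (application focus, task type) before data extraction.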

4.2 Data Extraction and Categorization

Data extraction was performed systematically for each included study. Key infor-
mation was extracted, including study title, authors, publication year, deep learning
techniques employed, medical domain or application, primary findings, and identified
challenges. Extracted data were categorized based on the specific medical domains
addressed in the study [19].
It is important to acknowledge potential limitations related to the literature review
process. The analysis is contingent upon the quality, accuracy, and thoroughness of
the methodologies and findings presented in the included studies.

5 DL Applications in MIP

Through the systematic analysis of the collected literature, it becomes evident that deep learning (DL) algorithms have been applied extensively across various
medical domains for image processing tasks. The following sections highlight key
applications and outcomes within specific medical areas [20].
Within neurology, DL techniques have been employed for accurate brain tumor
segmentation and the identification of regions of interest in functional magnetic
resonance imaging (fMRI) data. The utilization of convolutional neural networks
(CNNs) has demonstrated the capability to enhance the precision of disease detection,
enabling neurologists to make more informed decisions [21].
In the area of retinal imaging, DL algorithms have delivered strong results in diagnosing
diabetic retinopathy and detecting retinal lesions. These applications showcase the
ability of deep learning to capture intricate features within retinal images, enabling
early disease detection and treatment planning [22].

Deep learning has shown promise in accurately classifying lung nodules as malig-
nant or benign in pulmonary studies. The adoption of CNNs has led to enhanced
sensitivity and specificity in nodule detection, aiding clinicians in making critical
decisions for patient care [23].
Within the realm of digital pathology, deep learning algorithms have contributed
to the automated detection of cancerous tissue in histopathological images. The
capability of CNNs to identify subtle morphological variations has paved the way
for reliable and efficient cancer diagnosis [24].

6 Discussion

The comprehensive analysis of the existing literature underscores the remarkable capability of DL algorithms in transforming MIP. The demonstrated success of CNNs in various
medical domains substantiates their efficacy in addressing intricate image analysis
tasks. Notably, the customization of CNNs to match the unique diagnostic challenges
of each domain has contributed to their impressive performance [25].
The synthesis of findings from the reviewed studies highlights the role of deep
learning algorithms in supporting clinicians and healthcare professionals. Automated
image segmentation, disease classification, and region-of-interest identification have
the potential to augment clinical decision-making and patient care. The integration
of deep learning into clinical practice could lead to more efficient diagnoses and
personalized treatment strategies [26].

7 Applications and Use Cases

Deep learning algorithms have found extensive applications in medical image clas-
sification and disease detection. One notable use case is the early detection of breast
cancer through mammography images. By training convolutional neural networks
(CNNs) on annotated datasets, researchers have achieved high accuracy rates in
distinguishing malignant and benign lesions. Such applications are crucial for timely
interventions and improved patient outcomes [27].
The potential of deep learning extends to accurate organ segmentation for treat-
ment planning. In radiation therapy, precise delineation of tumor boundaries and
surrounding healthy tissues is imperative. Deep learning algorithms, particularly U-
Net architectures, have proven adept at segmenting organs and anatomical structures
with minimal human intervention [28]. This automation enhances treatment planning
accuracy and reduces patient risk.
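Segmentation results such as the U-Net delineations above are typically scored with overlap metrics. Below is a minimal sketch of the Dice similarity coefficient on hypothetical binary masks; the 4×4 arrays are synthetic stand-ins, not data from any cited study.

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks (1 = perfect overlap)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Hypothetical 4x4 organ masks: ground truth vs. a model prediction.
gt = np.array([[0, 0, 0, 0],
               [0, 1, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0]])
pr = np.array([[0, 0, 0, 0],
               [0, 1, 1, 1],
               [0, 1, 1, 1],
               [0, 0, 0, 0]])

score = dice(pr, gt)  # 2*4 / (6 + 4) = 0.8
```

The small `eps` term keeps the metric defined when both masks are empty, a common convention in segmentation evaluation code.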
Deep learning algorithms serve as valuable tools in computer-aided diagnosis,
assisting clinicians in decision-making processes. In retinal imaging, for instance,
deep learning models can automatically detect diabetic retinopathy and provide
grading for disease severity [29]. Such assistance streamlines the diagnostic process,
especially in regions with limited access to specialized healthcare professionals.

7.1 COVID-19 Detection and Bone Age Prediction

The ongoing global pandemic has prompted innovative uses of DL algorithms in medical imaging. DL models have been applied to chest X-rays and computed tomog-
raphy (CT) scans to aid in COVID-19 detection [30]. These models demonstrate
potential for assisting healthcare systems in triage and patient management during
health crises [31].
Predicting the bone age of pediatric patients is crucial for assessing growth and
diagnosing growth disorders [32]. Deep learning algorithms, trained on hand X-rays,
offer accurate bone age predictions [33]. This enables clinicians to identify deviations
from normal growth patterns and recommend appropriate interventions.
As deep learning techniques (DLT) continue to advance, their applications in
medical image processing (MIP) are expected to grow even further. There is potential
for exploration of DLT applications in genomics, 3D imaging, and real-time image
analysis. Additionally, integrating explainable AI can improve trust and transparency
in clinical decision-making [34].

8 Future Scope

The future of DL in MIP hinges on interdisciplinary collaboration between medical practitioners, data scientists, and AI researchers. Bridging the gap between clinical
expertise and computational innovation can lead to the development of more accurate,
clinically relevant, and ethically responsible deep learning algorithms [35].
As deep learning techniques become integrated into clinical practice, the need
for explainable AI grows. The black-box nature of deep neural networks presents a
challenge in understanding model decisions [36]. The development of interpretable
models and techniques to visualize and explain model outputs is crucial for gaining
the trust of clinicians and ensuring patient safety [37].
Medical images exhibit significant variability due to differences in acquisition
protocols, devices, and patient populations [38]. Ensuring the robustness of deep
learning models across diverse datasets is a challenge that requires the development
of techniques that are less susceptible to domain shifts and variations [39].
The integration of deep learning into health care raises ethical concerns related
to patient privacy, data security, and consent. Striking a balance between data-driven
advancements and ensuring patient confidentiality is paramount [40]. Frameworks
for anonymization, secure data sharing, and adherence to regulatory standards will
be essential [41].

Deep learning models can inadvertently inherit biases present in training data,
potentially leading to disparities in diagnosis and treatment. Mitigating bias and
ensuring fairness in algorithmic predictions is an ongoing challenge that necessitates
transparent data collection, model auditing, and bias-mitigation techniques [42].
The evolving field of deep learning continues to expand the possibilities in medical
image processing [43]. Emerging applications in genomics, 3D imaging, point-of-
care diagnostics, and multimodal data fusion hold promise [44]. The integration of
AI in real-time image analysis during surgical procedures is also an exciting frontier
[45].

9 Conclusion

The comprehensive analysis presented in this paper underscores the transformative capabilities of DL algorithms in the realm of MIP. Through a systematic review of
existing literature, we have explored the breadth of applications and implications that
deep learning holds for diverse medical domains.
The synthesis of findings reveals that CNNs stand as a cornerstone of DL’s success,
revolutionizing disease detection, image segmentation, feature extraction, and clas-
sification. The ability of CNNs to capture intricate features within medical images
has the potential to reshape diagnostic precision and treatment strategies.
As our study illuminates, the integration of transfer learning techniques and the
customization of pre-trained models amplify the efficiency and accuracy of DL algo-
rithms. These advancements pave the way for more expedited convergence, adaptive
performance, and a reduced reliance on extensive datasets.
In conclusion, this paper advocates for the judicious exploration and application of DL algorithms in MIP. By embracing their inherent potential while navigating the challenges, we have an opportunity to transform public health by improving diagnostic systems, treatment planning, and, ultimately, patient outcomes.

References

1. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, Sánchez CI (2017) A
survey on deep learning in medical image analysis. Med Image Anal 42:60–88
2. Shen D, Wu G, Suk HI (2017) Deep learning in medical image analysis. Annu Rev Biomed
Eng 19:221–248
3. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S (2017) Dermatologist-
level classification of skin cancer with deep neural networks. Nature 542(7639):115–118
4. Rajpurkar P, Irvin J, Ball RL, Zhu K, Yang B, Mehta H, Ng AY (2018) Deep learning for chest
radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing
radiologists. PLoS Med 15(11):e1002686
5. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

6. Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional
networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
(CVPR), pp 2261–2269
7. Dhiman G, Vinoth Kumar V, Kaur A, Sharma A (2021) Don: deep learning and optimization-
based framework for detection of novel coronavirus disease using x-ray images. Interdiscipl
Sci: Comput Life Sci 13:260–272
8. Narin A, Kaya C, Pamuk Z (2021) Automatic detection of coronavirus disease (covid-19) using
x-ray images and deep convolutional neural networks. Pattern Anal Appl 24:1207–1220
9. Srinidhi CL, Ciga O, Martel AL (2021) Deep neural network models for computational
histopathology: a survey. Med Image Anal 67:101813
10. Sedik A, Hammad M, Abd El-Samie FE, Gupta BB, Abd El-Latif AA (2021) Efficient deep
learning approach for augmented detection of Coronavirus disease. Neural Comput Appl 1–18
11. Shankar K, Perumal E (2021) A novel hand-crafted with deep learning features based on a
fusion model for COVID-19 diagnosis and classification using chest X-ray images. Complex
Intell Syst 7(3):1277–1293
12. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image
segmentation. In: International conference on medical image computing and computer-assisted
intervention (MICCAI), pp 234–241
13. Litjens G, Ciompi F, Sánchez CI (2019) A survey on deep learning in medical image analysis—
top 100 cited papers. Med Image Anal 58:101563
14. Zhu J, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-
consistent adversarial networks. In: Proceedings of the IEEE International conference on
computer vision (ICCV), pp 2242–2251
15. Chen LC, Papandreou G, Schroff F, Adam H (2018) Rethinking atrous convolution for semantic
image segmentation. arXiv preprint arXiv:1706.05587
16. Ibrahim DM, Elshennawy NM, Sarhan AM (2021) Deep-chest: multi-classification deep
learning model for diagnosing COVID-19, pneumonia, and lung cancer chest diseases. Comput
Biol Med 132:104348
17. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, van der Laak JA, van
Ginneken B, Sánchez CI (2017) A survey on deep learning in medical image analysis. Med
Image Anal 42:60–88
18. Kim J, Hong J, Park H (2018) Prospects of deep learning for medical imaging. Precis Future
Med 2(2):37–52
19. Liu X, Gao K, Liu B, Pan C, Liang K, Yan L, Ma J et al (2021) Advances in deep learning-based
medical image analysis. Health Data Sci 2021
20. Kieu ST, Hwa AB, Hijazi MHA, Kolivand H (2020) A survey of deep learning for lung disease
detection on medical images: state-of-the-art, taxonomy, issues and future directions. J Imaging
6(12):131
21. Malhotra P, Gupta S, Koundal D, Zaguia A, Enbeyle W (2022) Deep neural networks for
medical image segmentation. J Healthc Eng 2022
22. Wang J, Zhu H, Wang S-H, Zhang Y-D (2021) A review of deep learning on medical image
analysis. Mobile Netw Appl 26:351–380
23. Chen X, Wang X, Zhang K, Fung K-M, Thai TC, Moore K, Mannel RS, Liu H, Zheng B, Qiu Y
(2022) Recent advances and clinical applications of deep learning in medical image analysis.
Med Image Anal 79:102444
24. Wang R, Lei T, Cui R, Zhang B, Meng H, Nandi AK (2022) Medical image segmentation using
deep learning: a survey. IET Image Proc 16(5):1243–1267
25. Durga Prasad Jasti V, Zamani AS, Arumugam K, Naved M, Pallathadka H, Sammy F, Raghu-
vanshi A, Kaliyaperumal K (2022) Computational technique based on machine learning and
image processing for medical image analysis of breast cancer diagnosis. Secur Commun Netw
2022:1–7
26. Suganyadevi S, Seethalakshmi V, Balasamy K (2022) A review on deep learning in medical
image analysis. Int J Multimedia Inf Retrieval 11(1):19–38

27. Qureshi I, Yan J, Abbas Q, Shaheed K, Riaz AB, Wahid A, Jan Khan MW, Szczuko P (2022)
Medical image segmentation using deep semantic-based methods: a review of techniques,
applications and emerging trends. Inf Fusion
28. Tchito Tchapga C, Mih TA, Kouanou AT, Fonzin TF, Fogang PK, Mezatio BA, Tchiotsop
D (2021) Biomedical image classification in a big data architecture using machine learning
algorithms. J Healthc Eng 2021:1–11
29. Ma J, Song Y, Tian X, Hua Y, Zhang R, Wu J (2020) Survey on deep learning for pulmonary
medical imaging. Front Med 14:450–469
30. Kaur A, Singh Y, Neeru N, Kaur L, Singh A (2022) A survey on deep learning approaches to
medical images and a systematic look up into real-time object detection. Arch Comput Methods
Eng 1–41
31. Liu J, Pan Y, Li M, Chen Z, Tang L, Lu C, Wang J (2018) Applications of deep learning to
MRI images: a survey. Big Data Mining Anal 1(1):1–18
32. Giger ML (2018) Machine learning in medical imaging. J Am Coll Radiol 15(3):512–520
33. Maier A, Syben C, Lasser T, Riess C (2019) A gentle introduction to deep learning in medical
image processing. Z Med Phys 29(2):86–101
34. Arabahmadi M, Farahbakhsh R, Rezazadeh J (2022) Deep learning for smart healthcare—a
survey on brain tumor detection from medical imaging. Sensors 22(5):1960
35. Pesapane F, Codari M, Sardanelli F (2018) Artificial intelligence in medical imaging: threat
or opportunity? Radiologists again at the forefront of innovation in medicine. Eur Radiol Exp
2:1–10
36. Wang L, Wang H, Huang Y, Yan B, Chang Z, Liu Z, Zhao M, Cui L, Song J, Li F (2022) Trends
in the application of deep learning networks in medical image analysis: evolution between
2012 and 2020. Eur J Radiol 146:110069
37. Li Z, Dong M, Wen S, Hu X, Zhou P, Zeng Z (2019) CLU-CNNs: object detection for medical
images. Neurocomputing 350:53–59
38. Severn C, Suresh K, Görg C, Choi YS, Jain R, Ghosh D (2022) A pipeline for the implementation
and visualization of explainable machine learning for medical imaging using radiomics features.
Sensors 22(14):5205
39. Ebied M, Elmisery FA, El-Hag NA, Sedik A, El-Shafai W, El-Banby GM, Soltan E et al (2023)
A proposed deep-learning-based framework for medical image communication, storage and
diagnosis. Wirel Pers Commun 131(4):2331–2369
40. Tuyet VTH, Binh NT, Quoc NK, Khare A (2021) Content based medical image retrieval based
on salient regions combined with deep learning. Mobile Netw Appl 26:1300–1310
41. Sharif MI, Li JP, Khan MA, Saleem MA (2020) Active deep neural network features selection
for segmentation and recognition of brain tumors using MRI images. Pattern Recogn Lett
129:181–189
42. Chola C, Mallikarjuna P, Muaad AY, Bibal Benifa JV, Hanumanthappa J, Al-antari MA (2021)
A hybrid deep learning approach for COVID-19 diagnosis via CT and X-ray medical images.
Comput Sci Math Forum 2(1):13
43. Cao X, Fan J, Dong P, Ahmad S, Yap P-T, Shen D (2020) Image registration using machine and
deep learning. In: Handbook of medical image computing and computer assisted intervention.
Academic Press, pp 319–342
44. Rukundo O (2023) Effects of image size on deep learning. Electronics 12(4):985
45. Farzaneh N, Stein EB, Soroushmehr R, Gryak J, Najarian K (2022) A deep learning frame-
work for automated detection and quantitative assessment of liver trauma. BMC Med Imaging
22(1):39
Comparative Analysis of Image
Enhancement Techniques: A Study
on Combined and Individual Approaches

Aditya Bhaskar and Bharti Joshi

Abstract This research conducts a comprehensive analysis contrasting amalgamated and individual techniques for image enhancement. A groundbreaking method-
ology is introduced, integrating compression, smoothing, and denoising techniques
to elevate picture quality. Through meticulous scrutiny, we compare these holistic
approaches with discrete methods like thresholding, sharpening, and morphological
erosion. Rigorous evaluations are conducted, employing pivotal metrics such as Mean
Squared Error (MSE), compression ratio, and Peak Signal-to-Noise Ratio (PSNR).
The results furnish valuable insights with broad applicability across industries,
elucidating optimal strategies for advancing image quality. This study contributes
substantially to the field, presenting practical and innovative approaches for image
enhancement.

Keywords Mean Squared Error · Compression ratio · Peak Signal-to-Noise Ratio · Image quality improvement · Individual methods · Comparative analysis ·
Image enhancement

1 Introduction

The digital imaging landscape has undergone significant transformations, ushering in a myriad of techniques for enhancing images across diverse applications. Researchers
have tirelessly explored both traditional and avant-garde approaches to augment
image quality, resulting in a tapestry of methodologies that reflect the field’s
continuous evolution.

A. Bhaskar (B) · B. Joshi
Department of Computer Engineering, Ramrao Adik Institute of Technology, DY Patil Deemed to
be University, Nerul, Navi Mumbai, India
e-mail: adi.bha4.rt22@dypatil.edu
B. Joshi
e-mail: bharti.joshi@rait.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 71
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_6

Advancements in image processing are evident in denoising strategies, focused on minimizing noise and enhancing image clarity. A recent method specializing in stage-
wise denoising for ancient Kannada script images exhibited ingenuity, addressing
challenges in historical artifact preservation [1]. In underwater image correction, two
distinct algorithms, gamma correction and image sharpening, were proposed, empha-
sizing the need to tailor enhancement techniques for specific environments, especially
where conventional methods fall short [2]. Progress in image sharpening, exem-
plified by techniques employing dilated filters, underscores the precision achiev-
able through sophisticated algorithms, crucial for applications demanding intricate
details [3]. Another contribution involves a novel image upscaling method using high-
order error sharpening techniques, enhancing resolution while preserving vital image
features for advanced transformations [4]. Image compression exploration delved
into invertible resampling-based layered methods, emphasizing the importance of
efficient compression for optimal data storage and transmission without compro-
mising image integrity [5]. In the contemporary integration of machine learning and
image compression, an overview highlighted the synergy between artificial intelli-
gence and traditional methods, leading to enhanced outcomes [6]. Morphological
operations have become pivotal in tasks like robust application on Magnetic Reso-
nance Imaging (MRI) images, showcasing versatility in diverse medical imaging
scenarios [7]. In defect detection, the application of morphological operations was
evident, particularly in tableware ceramics defect detection, showcasing their instru-
mental role in quality control and industrial applications [8]. Amid these innovations,
the field recognizes the importance of quantitative assessment using metrics such as
Peak Signal-to-Noise Ratio (PSNR), compression ratio, and Mean Squared Error
(MSE). These objective metrics establish a robust foundation for precise evaluation
and data-driven conclusions [9, 10].
The following objectives are pursued:
Comprehensive Examination: Explore fundamental image processing techniques
including thresholding, sharpening, compression, histogram equalization, denoising,
convolution, morphological operations, and smoothing in-depth to comprehend their
intricacies and applications fully. Gain a comprehensive understanding of the nuances
and potential applications of each technique, ensuring a profound insight into their
individual capabilities.
Comparative Analysis: Conduct a meticulous comparative analysis across diverse
image types, shedding light on the performance, advantages, and limitations of each
technique to gain a holistic understanding of their effectiveness. Identify the strengths
and weaknesses of each method concerning various image types and applications,
facilitating a nuanced comparison.
Quantitative Evaluation: Employ quantitative metrics such as Peak Signal-to-Noise
Ratio (PSNR), compression ratio, and Mean Squared Error (MSE) to measure image
quality, storage efficiency, and pixel-level accuracy, respectively, ensuring a rigorous
Comparative Analysis of Image Enhancement Techniques: A Study … 73

and objective evaluation process. Utilize objective metrics to quantify the perfor-
mance of each technique, allowing for a precise evaluation and comparison, leading
to data-driven conclusions.
Identification of Optimal Methods: Discriminate and identify the three most effective
image processing methods by rigorously evaluating them based on PSNR, compres-
sion ratio, and MSE. These methods, honed through these parameters, are expected
to excel in various digital image enhancement and transformation tasks. Identify the
most robust and versatile techniques based on quantitative assessments, establishing
a foundation for their strategic application in diverse image processing scenarios.
Insights Enrichment: Offer invaluable insights into the nuances of image processing,
enhancing the understanding of these techniques. Through qualitative assessments,
provide nuanced insights that contribute to a deeper comprehension of image
processing methodologies, fostering an enriched understanding of their practical
applications and limitations.

2 Literature Survey

2.1 Related Work

In paper [1] Bipin Nair et al., the study focuses on preserving ancient Kannada
rock inscriptions containing valuable historical and mythological information from
temples. By employing a novel stage-wise processing approach involving smoothing,
sharpening, noise removal, outlier detection, and thresholding, the work transforms
degraded inscriptions into readable formats. Addressing challenges like oil stains,
erosion, and uneven illumination, the proposed model achieves an impressive 95%
accuracy, outperforming existing methods such as Otsu and Sauvola. This research
ensures the preservation of precious historical data inscribed in rocks, contributing
significantly to cultural heritage conservation.
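The paper does not include source code; as a generic illustration of Otsu's global thresholding (one of the baselines compared against above), a minimal NumPy sketch for an 8-bit grayscale image might look as follows. This is an illustrative implementation, not the authors' code:

```python
import numpy as np

def otsu_threshold(img):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class weights below/above t
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # mean of dark class
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1   # mean of bright class
        var = w0 * w1 * (mu0 - mu1) ** 2                  # between-class variance
        if var > best_var:
            best_t, best_var = t, var
    return best_t
```

Pixels below the returned threshold are treated as background; on a clearly bimodal inscription image the threshold falls between the two intensity modes.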
In paper [3] Orhei et al., the rapid growth of digital photography in mobile devices
has led to diverse image sensors and quality issues due to hardware constraints.
To address this, extensive research has focused on image denoising and sharp-
ening techniques. Drawing inspiration from successful applications of dilated filters
in edge detection, this study introduces an innovative approach. By integrating
dilated filters into traditional sharpening algorithms like High Pass Filter (HPF) and
Unsharp Masking (UM), significant improvements have been achieved both visu-
ally and statistically. This modification offers enhanced image sharpness, surpassing
outcomes from conventional methods.
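The baseline Unsharp Masking operation that [3] extends (sharpened = image + amount × (image − low-pass)) can be sketched as below, with a plain 3×3 box blur standing in for the low-pass filter. This is the classical baseline only, not the dilated-filter variant proposed in the paper:

```python
import numpy as np

def unsharp_mask(img, amount=1.0):
    """Unsharp Masking: add back the high-frequency residual (img - blur).

    A 3x3 box blur stands in for the Gaussian low-pass; [3] replaces such
    kernels with dilated versions to enlarge their receptive field.
    """
    f = img.astype(np.float64)
    pad = np.pad(f, 1, mode="edge")
    h, w = f.shape
    blur = sum(pad[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)) / 9.0
    return np.clip(f + amount * (f - blur), 0, 255).astype(np.uint8)
```

Flat regions are left untouched (the residual is zero there), while intensity steps are exaggerated on both sides of the edge.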
In paper [2] Burhan et al., this study addresses challenges in underwater image
quality caused by light scattering and absorption. Focusing on marine biology and
archaeology, the research employs color correction techniques, specifically gamma

correction and image sharpening, after compensating for color imbalances. Eval-
uation across various image conditions (bluish, greenish, foggy) reveals significant
improvements. Statistical metrics including Information Entropy (IE), Underwater
Color Image Quality Metric (UCIQM), and Underwater Image Quality Measure
(UIQM) demonstrate that the image sharpening algorithm outperforms gamma
correction, highlighting its efficacy in enhancing underwater image quality.
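Burhan et al. do not reproduce their parameters here, but the standard gamma-correction mapping they build on, out = 255·(in/255)^(1/γ), is compact enough to sketch (a generic implementation with an arbitrarily chosen γ, not the paper's tuned settings):

```python
import numpy as np

def gamma_correct(img, gamma=1.5):
    """Standard gamma correction for an 8-bit image; gamma > 1 brightens mid-tones."""
    norm = img.astype(np.float64) / 255.0
    return np.clip(255.0 * norm ** (1.0 / gamma), 0, 255).astype(np.uint8)
```

The endpoints 0 and 255 are fixed points of the mapping, so only mid-tones are redistributed, which is why gamma correction is attractive for lifting dim underwater scenes without clipping highlights.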
In paper [4] Panda et al., this study introduces an innovative interpolation tech-
nique to enhance low-resolution images, addressing blurring introduced during up-
sampling. A post-processing method is applied to eliminate blurring artifacts. The
approach identifies high-frequency degradation due to interpolation, sharpening
degraded edges using a high-order Laplacian filter. Experimental results demon-
strate its superiority over existing methods, showcasing improved image quality in
various natural images.
In paper [5] Xu et al., this research introduces IRLIC, an innovative image
compression framework employing invertible resampling-based layered image
compression. It utilizes flow-based generative models and invertible neural networks
(INN) to achieve effective image compression. By splitting images into down-
sampled and high-frequency parts, rescaling is applied symmetrically using INN.
This method outperforms existing approaches like BPG and other learning-based
compression, especially at bit rates below 1.8 bpp, ensuring superior image quality
in low-resolution scenarios.
In paper [6] Baba Fakruddin Ali et al., this paper delves into the growing need
for data compression in modern communication technologies. Focusing on image
compression, it addresses challenges in virtual photograph transmission and infor-
mation storage for various applications like satellite remote sensing and medical
imaging. The study emphasizes the integration of machine learning principles into
image compression algorithms, aiming to rank and analyze their effectiveness,
bridging the gap between image compression and machine learning strategies.
In paper [9] Liu et al., this paper tackles image denoising challenges using
spectrum theory and graph signal analysis. By treating images as graph signals,
the study introduces a novel denoising method leveraging graph Laplacian matrix
and frequency domain low-pass filtering. This innovative approach enhances image
smoothness by incorporating signal priors. Experimental results affirm its effec-
tiveness, surpassing traditional methods like Wiener and Gaussian filtering. The
study underscores the potential of graph signal-based techniques in revolutionizing
complex noise reduction tasks, offering a promising avenue in image processing.
In paper [10] Kuttan et al., in the fast-paced twenty-first century, the demand for
high-resolution images has surged due to expanding human activities. However, envi-
ronmental changes often introduce noise, making it challenging for researchers to
denoise images effectively. Image denoising is crucial for producing clear, appealing
images by recovering degraded pixel values. This research explores denoising tech-
niques, algorithms, and applications across domains, addressing the challenge of
noise-induced blurriness and distortion in images and videos.
In paper [7] Joshi et al., this paper introduces an innovative approach enhancing
morphological operators, crucial in applications like medical image analysis. By
implementing effective pre-processing and denoising techniques, the Proposed Method significantly improves erosion and dilation processes in MR images. Quali-
tative and quantitative evaluations, including PSNR, MSE, and SSIM values, demon-
strate the method’s superior performance. This advancement is particularly valuable
in medical image analysis, addressing noise-related challenges and enhancing the
accuracy of morphological operations.
In paper [8] Rahmayuna et al., this research addresses product defect detection
using a morphological approach in digital image processing, specifically focusing
on non-flat objects like ceramic tableware plates. By employing dilation, erosion,
opening, and closing operators, the study effectively segments the images to pinpoint
defects’ locations. Unlike previous techniques, this method considers objects with
backgrounds and non-flat surfaces, enhancing the accuracy of quality control in
manufacturing processes.
In paper [11] Gudkov et al., this paper introduces a novel image smoothing
algorithm based on gradient analysis, effectively preserving edges while removing
noise. The method utilizes gradient magnitudes and directions to distinguish between
regular and irregular boundaries in the image. By employing a cosine-based angle
measure and inverting gradient values, the algorithm enhances texture preservation
and boundary smoothing. Additionally, a median filter is applied to gradient magni-
tudes for improved visual quality. The proposed technique is computationally effi-
cient and demonstrates superior results compared to existing methods like bilateral
filters.
In paper [12] Khetkeeree et al., this paper introduces a novel method enhancing
both smoothness and sharpness in Black-and-White and color images. The method
comprises a unique hybrid filter consisting of smoothing and sharpening compo-
nents, interconnected by a binding filter. This binding filter, designed using a
nonlinear combination of nearby image point values and Laplacian filter, resembles a
Gaussian distribution function. Comparative analysis with Gaussian Smoothing and
High Boost filters demonstrates superior noise reduction in flat areas and improved
sharpness at image edges.

3 Overview of Methods

In this study, we explore a diverse range of image enhancement techniques, each tailored to specific challenges in digital imaging. The methods investigated
encompass both classical and contemporary approaches, providing a comprehen-
sive overview of the strategies employed to refine image quality. The following
subsections delineate the key image enhancement techniques examined in our
research:
Denoising Techniques: Denoising methods are pivotal for eliminating noise arti-
facts from images, enhancing their visual clarity. Our study investigates innovative
denoising techniques to ensure optimal noise reduction while preserving essential image features.
Color Correction and Sharpening: Color correction techniques are crucial for
addressing color imbalances and ensuring accurate representation. Simultaneously,
sharpening methods utilizing filters are explored to enhance image details without
introducing noise.
Image Compression Strategies: Efficient image compression is vital for minimizing
data storage requirements and enabling rapid transmission. Our research delves into
advanced compression techniques to achieve optimal data compression ratios while
maintaining image fidelity.
Histogram Equalization Approaches: Histogram equalization methods, encom-
passing both global and localized strategies, are analyzed to enhance image contrast
and overall luminosity. By optimizing the histogram distribution, these techniques
ensure a balanced representation of pixel intensities.
Morphological Operations: Morphological operations, such as erosion and dilation,
are fundamental tools in image processing. Leveraging these operations, our study
investigates their application in tasks such as defect detection and image smoothing.
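The study does not list implementations for these classical operations; minimal NumPy sketches of grayscale erosion, dilation, and global histogram equalization convey the mechanics (illustrative versions only, not the code used in the experiments):

```python
import numpy as np

def _windows(img, size):
    """All size x size shifted views of an edge-padded image."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    return [padded[dy:dy + h, dx:dx + w] for dy in range(size) for dx in range(size)]

def erode(img, size=3):
    """Grayscale erosion: each pixel becomes its neighborhood minimum."""
    return np.minimum.reduce(_windows(img, size))

def dilate(img, size=3):
    """Grayscale dilation: each pixel becomes its neighborhood maximum."""
    return np.maximum.reduce(_windows(img, size))

def equalize_hist(img):
    """Global histogram equalization for an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    # Map the cumulative distribution onto the full 0..255 range.
    cdf = (cdf - cdf.min()) * 255 // max(cdf.max() - cdf.min(), 1)
    return cdf[img].astype(np.uint8)
```

Erosion removes isolated bright specks (useful in defect detection), dilation grows bright structures, and equalization stretches the intensity histogram toward a uniform distribution to raise contrast.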
By dissecting these image enhancement techniques, our research aims to provide
a nuanced understanding of their capabilities and limitations. The comparative anal-
ysis of these methods sheds light on their effectiveness in real-world applications,
contributing valuable insights to the field of digital image processing.

4 Proposed Methodology

In contrast to the traditional methods explored in the previous section, our proposed
methodology introduces a comprehensive approach that integrates three funda-
mental image enhancement techniques: Compression, Smoothing, and Denoising.
This novel amalgamation aims to address multiple challenges in digital imaging,
ensuring not only noise reduction but also optimal preservation of essential image
features.
1. Compression
Efficient data storage and rapid transmission are pivotal in digital image processing.
Our methodology incorporates advanced compression techniques, optimizing data
compression ratios. By utilizing innovative compression algorithms, we minimize
storage requirements while safeguarding image fidelity, enabling seamless data
transmission in various applications.

Fig. 1 Proposed method flowchart

2. Smoothing
Smoothing techniques play a crucial role in refining image quality by reducing noise
and enhancing visual aesthetics. Our methodology employs sophisticated smoothing
algorithms, such as Gaussian Blur, to ensure images attain a balanced and visu-
ally appealing texture. By eliminating unwanted artifacts, the smoothing component
enhances overall image coherence.
3. Denoising
Denoising is paramount in eliminating noise artifacts without compromising vital
image details. As shown in Fig. 1, our proposed methodology integrates cutting-
edge denoising methods, preserving essential features while achieving optimal noise
reduction. By leveraging techniques like Non-Local Means Denoising, our approach
ensures clarity and precision in the processed images.
The synergy between Compression, Smoothing, and Denoising forms the founda-
tion of our proposed methodology. This innovative integration aims to strike a delicate
balance between noise reduction and feature preservation, enhancing overall image
quality. By combining these techniques, our approach addresses the intricacies of
real-world image processing challenges, offering a versatile and effective solution
for various applications.
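The chapter does not publish its implementation; assuming the three stages are chained as in Fig. 1, a dependency-light sketch might look as follows. Here zlib stands in for the compression codec, a separable Gaussian kernel implements the smoothing, and a simple local-mean filter stands in for the Non-Local Means denoising named in the text:

```python
import zlib
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(img, size=5, sigma=1.0):
    """Smoothing stage: separable Gaussian blur over rows, then columns."""
    k = gaussian_kernel(size, sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def denoise(img, size=3):
    """Denoising stage: local-mean filter standing in for Non-Local Means."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    acc = sum(padded[dy:dy + h, dx:dx + w] for dy in range(size) for dx in range(size))
    return acc / float(size * size)

def enhance(img):
    """Compression -> Smoothing -> Denoising; returns result and size ratio."""
    raw = np.ascontiguousarray(img, dtype=np.uint8).tobytes()
    packed = zlib.compress(raw, level=9)    # lossless stand-in codec
    ratio = len(packed) / len(raw)          # compressed/original, lower is better
    restored = np.frombuffer(zlib.decompress(packed), dtype=np.uint8).reshape(img.shape)
    return denoise(smooth(restored.astype(np.float64))), ratio
```

In practice the zlib and local-mean stages would be replaced by a proper image codec (e.g. JPEG) and a true Non-Local Means denoiser; the sketch only fixes the stage ordering and the ratio bookkeeping.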
In the subsequent sections, we delve into the experimental results and comparative
analyses, validating the efficacy of our proposed methodology against traditional and
contemporary image enhancement techniques.

5 Basis of Analysis

Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and compres-
sion ratio serve as fundamental metrics in evaluating the effectiveness of image
enhancement techniques.

Mean Squared Error (MSE) quantifies the average squared difference between
the original and enhanced images, providing a numerical measure of reconstruction
accuracy. A lower MSE value indicates closer resemblance between the enhanced
image and the original, signifying superior enhancement quality.
Peak Signal-to-Noise Ratio (PSNR), on the other hand, measures the quality of
the enhanced image by comparing it to the original in terms of signal-to-noise ratio. It
is particularly useful in determining the extent to which noise or distortion has been
introduced during enhancement. Higher PSNR values indicate lower noise levels,
indicating enhanced image fidelity.
Compression ratio, a crucial parameter in digital image processing, measures the reduction in data size achieved through compression. We report it as the ratio of compressed size to original size, so lower values signify more efficient data storage and transmission, which is essential for various applications.
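These three metrics follow their usual definitions and are straightforward to compute; the sketch below expresses the compression ratio as compressed size over original size, which is an assumption matching how the values in our tables are read (smaller is better):

```python
import numpy as np

def mse(original, enhanced):
    """Mean Squared Error: average squared pixel difference."""
    diff = original.astype(np.float64) - enhanced.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original, enhanced, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means less distortion."""
    m = mse(original, enhanced)
    return float("inf") if m == 0 else 10.0 * np.log10(max_val ** 2 / m)

def compression_ratio(compressed_size, original_size):
    """Compressed-to-original size ratio; lower means stronger compression."""
    return compressed_size / original_size
```

For example, two 8-bit images that differ uniformly by 10 gray levels have MSE = 100 and PSNR ≈ 28.13 dB.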
By employing these metrics, we can precisely quantify the differences between the
Proposed Method, integrating Compression, Smoothing, and Denoising, and existing
basic image enhancement techniques. MSE and PSNR allow for a detailed assess-
ment of image fidelity, ensuring that the Proposed Method maintains or enhances
image quality. Additionally, the compression ratio metric ensures efficient utilization
of resources, providing a holistic view of the performance of the Proposed Method
in comparison with traditional techniques. These parameters enable a rigorous
comparison, elucidating the superiority of the Proposed Method over basic image
enhancement techniques.

6 Comparative Analysis

The comparative analysis section of our study unveils compelling insights derived
from the extensive evaluation of image enhancement techniques, both traditional and
innovative. The analysis is based on the Recognize Animals dataset published by Prateek on Kaggle, comprising 4666 animal images that are also suitable for machine learning applications.
As depicted in Fig. 2, which illustrates the Mean Squared Error (MSE) values,
it becomes evident that the Proposed Method outperforms basic enhancement tech-
niques significantly. Lower MSE values, indicating a closer resemblance to the orig-
inal image, are consistently observed in the Proposed Method, validating its superior
denoising and fidelity-preserving capabilities. Table 1 lists the MSE values obtained in our experiments; lower values are better, and the Proposed Method consistently records the lowest.
Moving forward, Fig. 3 demonstrates the Peak Signal-to-Noise Ratio (PSNR)
values, which corroborate our findings. Higher PSNR values in the Proposed Method
reveal its ability to maintain enhanced image clarity while minimizing noise levels,
surpassing the outcomes obtained by traditional methods. This observation under-
scores the effectiveness of the Proposed Method in preserving image quality during

Fig. 2 MSE comparison

Table 1 Mean squared error
S. No. MSE proposed MSE thresholded MSE sharpened MSE eroded
1 31.47099716 104.6362749 44.81540737 35.89578321
2 28.92542966 113.5198874 37.77020651 31.87453273
3 40.51474961 111.276038 49.67206451 43.0204641
4 33.50350858 98.58330025 45.30105392 38.95296814
5 30.92406103 100.8656922 39.36433759 33.26715938
6 22.2922483 114.9925672 36.46174743 31.86263881
7 40.13145662 103.2088542 69.52905296 63.99457893
8 78.96235243 103.5352691 99.98560981 98.78808594
9 33.68895833 105.3573027 49.97933701 45.10532598
10 102.9668631 101.316212 100.1597439 102.452768

the enhancement process. Table 2 lists the PSNR values obtained in our experiments; higher values are better, and the Proposed Method consistently records the highest.
Additionally, Fig. 4 provides a visual representation of the compression ratio, a vital metric in digital image processing. The graph underscores the efficiency of the Proposed Method in data storage and transmission, as evidenced by the smaller compressed-to-original size ratios it achieves. This efficiency is essential for practical applications where optimized resource utilization is paramount. Table 3 lists the ratios obtained in our experiments; under this definition lower values are better, and the Proposed Method consistently records the lowest.
Our results not only validate the superiority of the Proposed Method but also
highlight the limitations of basic image enhancement techniques. The amalgamation
of Compression, Smoothing, and Denoising in the Proposed Method has yielded a
holistic solution that transcends the constraints of traditional methods. These observa-
tions, coupled with the robust data presented, provide a foundation for future research

Fig. 3 PSNR comparison

Table 2 Peak Signal-to-Noise ratio
S. No. PSNR proposed PSNR thresholded PSNR sharpened PSNR eroded
1 33.15169857 27.93398091 31.61653012 32.58036927
2 33.51800542 27.58008409 32.35931002 33.09636534
3 32.05467202 27.66678707 31.1696815 31.7940527
4 32.87990071 28.19277008 31.56972055 32.22539805
5 33.22783839 28.09336887 32.17977413 32.91064642
6 34.64926489 27.52410591 32.51242882 33.0979862
7 32.09595438 27.99363404 29.70914047 30.06937175
8 29.15660283 27.97992044 28.13142861 28.1837579
9 32.85592778 27.90415717 31.1428987 31.58852535
10 28.00382879 28.07401417 28.12387155 28.02556664

Fig. 4 Compression ratio comparison



Table 3 Compression ratio
S. No. Compression ratio proposed Compression ratio thresholded Compression ratio sharpened Compression ratio eroded
1 0.03481034 0.075775137 0.075775137 0.075775137
2 0.030036294 0.064475064 0.064475064 0.064475064
3 0.040787355 0.089644464 0.089644464 0.089644464
4 0.033823984 0.07329902 0.07329902 0.07329902
5 0.032495168 0.07001296 0.07001296 0.07001296
6 0.023775267 0.050558691 0.050558691 0.050558691
7 0.040317042 0.087849667 0.087849667 0.087849667
8 0.064960476 0.15000217 0.15000217 0.15000217
9 0.037981564 0.082977941 0.082977941 0.082977941
10 0.091398313 0.225384115 0.225384115 0.225384115

in the domain of image enhancement and underscore the practical significance of our
findings. In essence, the comparative analysis presented here paints a vivid picture of
the efficacy of the Proposed Method, positioning it as a pioneering advancement in
the field of digital image processing. Since the Proposed Method achieves better results on every parameter considered, its accuracy can be regarded as clearly superior to that of its traditional counterparts.

7 Conclusion

In conclusion, our research has delved into a comprehensive exploration of diverse image enhancement techniques, ranging from denoising and color correction to
compression and morphological operations. Through meticulous analysis, we’ve
elucidated the nuances of these methods, shedding light on their efficacy and limi-
tations. The comparative study of the Proposed Method, amalgamating Compres-
sion, Smoothing, and Denoising, with traditional techniques has provided valuable
insights. By employing advanced metrics such as Mean Squared Error (MSE), Peak
Signal-to-Noise Ratio (PSNR), and compression ratio, we’ve discerned the subtle
differences in image fidelity and computational efficiency. Our findings underscore
the superiority of the Proposed Method, showcasing its ability to preserve image
quality while optimizing resource utilization. The amalgamation of innovative tech-
niques in the Proposed Method has demonstrated its prowess in overcoming chal-
lenges posed by basic image enhancement techniques. This study not only enriches
our understanding of image enhancement methodologies but also paves the way for
future advancements in the realm of digital image processing.

References

1. Bipin Nair BJ, Anusha MU, Anusha J (2022) A novel stage wise denoising approach on ancient
Kannada script from rock images. In: 2022 7th International conference on communication and
electronics systems (ICCES). Coimbatore, India, pp 1715–1723. https://doi.org/10.1109/ICC
ES54183.2022.9835997
2. Burhan S, Sadiq A (2022) Two methods for underwater images color correction: gamma correc-
tion and image sharpening algorithms. In: 2022 Fifth college of science international conference
of recent trends in information technology (CSCTIT). Baghdad, Iraq, pp 31–35. https://doi.
org/10.1109/CSCTIT56299.2022.1014558
3. Orhei C, Vasiu R (2022) Image sharpening using dilated filters. In: 2022 IEEE 16th Interna-
tional symposium on applied computational intelligence and informatics (SACI). Timisoara,
Romania, pp 000117–000122. https://doi.org/10.1109/SACI55618.2022.9919568
4. Panda J, Meher S (2022) A novel image upscaling method using high order error sharpening.
In: 2022 IEEE 6th conference on information and communication technology (CICT). Gwalior,
India, pp 1–6. https://doi.org/10.1109/CICT56698.2022.9997936.
5. Xu Y, Zhang J (2021) Invertible resampling-based layered image compression. In: 2021 Data
compression conference (DCC). Snowbird, UT, USA, p 380. https://doi.org/10.1109/DCC
50243.2021.00064
6. Baba Fakruddin Ali BH, Prakash R (2021) Overview on machine learning in image compression
techniques. In: 2021 Innovations in power and advanced computing technologies (i-PACT).
Kuala Lumpur, Malaysia, pp. 1–8. https://doi.org/10.1109/i-PACT52855.2021.9696987
7. Joshi N, Jain S (2020) A robust approach for application of morphological operations on MRI.
In: 2020 8th International conference on reliability, infocom technologies and optimization
(trends and future directions) (ICRITO). Noida, India, pp. 585–589. https://doi.org/10.1109/
ICRITO48877.2020.9198011
8. Rahmayuna N, Adi K, Kusumaningrum R (2021) Tableware ceramics defect detection using
morphological operation approach. In: 2021 4th International seminar on research of informa-
tion technology and intelligent systems (ISRITI). Yogyakarta, Indonesia, pp 412–416. https://
doi.org/10.1109/ISRITI54043.2021.9702806
9. Liu M, Wei Y (2019) Image denoising using graph-based frequency domain low-pass filtering.
In: 2019 IEEE 4th International conference on image, vision and computing (ICIVC). Xiamen,
China, pp 118–122. https://doi.org/10.1109/ICIVC47709.2019.8980994
10. Kuttan DB, Kaur S, Goyal B, Dogra A (2021) Image denoising: pre-processing for enhanced
subsequent CAD analysis. In: 2021 2nd International conference on smart electronics and
communication (ICOSEC). Trichy, India, pp 1406–1411. https://doi.org/10.1109/ICOSEC
51865.2021.9591779
11. Gudkov V, Moiseev I (2020) Image smoothing algorithm based on gradient analysis. In:
2020 Ural symposium on biomedical engineering, radioelectronics and information technology
(USBEREIT). Yekaterinburg, Russia, pp 403–406. https://doi.org/10.1109/USBEREIT48449.
2020.9117646
12. Khetkeeree S, Thanakitivirul P (2020) Hybrid filtering for image sharpening and smoothing
simultaneously. In: 2020 35th International technical conference on circuits/systems, computers
and communications (ITC-CSCC). Nagoya, Japan, pp 367–371
Smishing: A SMS Phishing Detection
Using Various Machine Learning
Algorithms

Priteshkumar Prajapati, Heli Nandani, Devanshi Shah, Shail Shah, Rachit Shah, Madhav Ajwalia, and Parth Shah

Abstract Amid the pandemic, there has been a steep rise in cybercrimes against
individuals and corporations, making implementing security measures even more
imperative. This paper proposes a machine learning-based approach for detecting
phishing SMS threats using datasets of manually made SPAM and HAM texts. Addi-
tionally, pre-existing link datasets are used for training and testing spam and ham
links. Furthermore, a cloud-hosted application is developed as a proof of concept, capable of detecting malicious URLs and SMS. The VirusTotal API is utilized and integrated
with the application for detecting harmful URLs using existing datasets. The datasets
are evaluated using random forest (RF), long short-term memory (LSTM), logistic
regression (LR), and support vector machine (SVM) algorithms, with performance measured by precision, recall, and F1-score to ensure reliable discrimination between legitimate and spam messages. This paper enhances SMS phishing protection using machine learning
advancements, demonstrating robust defense against phishing attempts, suggesting
widespread integration into mobile security frameworks.

Keywords Phishing · SMS phishing · Smishing · SMS security · Machine


learning · Fraud SMS detection

1 Introduction

In today’s digital era, cell phones have become a vital part of modern life, with over 2.68 billion users worldwide [1]. They offer features like Short Message Service (SMS) and internet access, which have become integral to daily life [8]. The value of the global cellular messaging market rose from 179.2 billion USD in 2010 to 253 billion USD in 2014, driving up SMS revenues. Moreover, internet usage grew by 23% in 2020 compared with 2019 because of the widespread use of smartphones [5, 32]. But on the flip side, the rise in

P. Prajapati (B) · H. Nandani · D. Shah · S. Shah · R. Shah · M. Ajwalia · P. Shah
Chandubhai S. Patel Institute of Technology (CSPIT), Faculty of Technology and Engineering (FTE), Charotar University of Science and Technology (CHARUSAT), Changa, Gujarat, India
e-mail: pritesh.pnp.007@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 83
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_7
84 P. Prajapati et al.

the internet has led to an upsurge in cyber fraud, with the phishing attack ratio being the highest at 69.2% during and after COVID-19 [3]; phishing attacks in particular have risen steeply alongside this convenience [20]. According to
authors in [25], phishing is a cybercrime where cybercriminals deceive individuals
into revealing confidential information by pretending to be someone legitimate. It
has evolved into a significant cybersecurity threat. This paper focuses on phishing
SMS due to its prevalence in crucial bank messages and the lack of awareness among
elderly people, who rely heavily on SMS or WhatsApp for communication, mak-
ing them potential targets for phishing scams. Detecting phishing SMS messages
is crucial for safeguarding personal privacy, security, and business integrity. Fraud
messages sent via SMS pose a serious risk to individuals and businesses because they
use mobile communication channels to deceive recipients into providing private data
and compromising integrity. Furthermore, in the year 2022, over 300 billion phish-
ing SMS were sent, targeting individuals and organizations worldwide, resulting in
billions of dollars in financial losses [7].
Therefore, this research aims to develop a real-time system for detecting SMS
phishing messages using various machine learning algorithms. It also proposes a
comprehensive approach to detecting fake text messages that appear to come from a
trusted source, a basic idea of spoofing and stealing identities for some motivational
gains. Our work will be capable of extracting specific features of SMS messages,
enhancing their adaptability and effectiveness in identifying various phishing tech-
niques. Our work also aims to evaluate and compare the performance of different
machine learning algorithms in phishing SMS detection, identifying strengths and
weaknesses to provide insights into the optimal use of machine learning in combating
phishing attacks.
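The comparison described above relies on precision, recall, and F1-score for the spam class; a dependency-free way to compute them (label strings here are illustrative) is:

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Precision, recall, and F1 for the positive (spam) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Precision penalizes legitimate messages flagged as spam, recall penalizes phishing messages that slip through, and F1 balances the two, which is why all three are reported together.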
This paper elucidates the solution to existing phishing attacks, focusing on their
evolution and techniques. It details the research methodology, including data collec-
tion and developing a dataset to address the challenge of creating a realistic dataset
for phishing SMS where privacy concerns are mitigated by employing synthetic data
generation techniques. These methods involve creating artificial but realistic samples
that preserve the statistical characteristics of genuine SMS messages while avoid-
ing the use of sensitive personal information. Additionally, to enrich the existing
datasets Spam SMS Prediction [18] and Malicious URLs [29] with authentic ham
(non-phishing) and spam samples, SMS messages were collected from end users
with their permission, ensuring that ethical standards are maintained throughout the
dataset creation process. We added 2209 SMS messages from end users to the existing
Spam SMS Prediction dataset [18], which initially had 5576 entries of ham or spam
SMS. This approach allows for the development and evaluation of robust phish-
ing detection models while upholding privacy and ethical considerations. Later, the
dataset is trained and tested using machine learning algorithms for feature extraction.
Subsequently, experimental results are presented by evaluating the performance of
the proposed methodologies on real-world datasets. Further details are provided in
subsequent sections.
Smishing: A SMS Phishing Detection Using … 85

2 Motivation

In today’s world, internet usage has become a big part of everyone’s lives, but not
everyone knows how to stay safe online. Online criminals especially target elderly
users [4] due to their lack of knowledge about the internet and its risks. Phishing
awareness training can therefore help them identify phishing messages and avoid attacks.
As per the Internet Crime Complaint Center (IC3) of the Federal Bureau of Investigation
(FBI) [16], 300,497 victims registered complaints regarding phishing attempts in 2022,
resulting in losses of $52,089,159. Beyond this, the elderly top fraud-victim lists
because they lack awareness of technology and internet threats, and they also become
victims of financial crime
through communications such as bank notifications and OTP, which are frequently
sent via text message. Therefore, this work is driven by a strong desire to protect
people, especially those who might not be as familiar with online safety. This work
includes developing an application that segregates malicious from non-malicious
text messages and alerts users to spam. This will result in the creation of effective defenses
against these clever scams, keeping consumers safe online.

3 Related Work

The detection and mitigation of phishing attacks through SMS messages have been
the subject of research due to their high threat level in the digital landscape. This
section reviews key studies and approaches that have significantly contributed to the
field of phishing SMS detection. The authors of [10] reviewed and compared various
machine learning techniques. They employed machine learning algorithms to detect
fake SMS, utilizing behavior- and signature-based detection techniques, and sent the
collected data from mobile devices to a server for spam detection. That paper uses
deep belief networks (DBNs) to compare the algorithms, whereas we use recurrent
neural networks (RNNs) and other deep learning algorithms. DBNs are pre-trained in
an unsupervised manner, whereas RNNs are trained sequentially. Likewise, in this
work, we have tested the datasets Spam SMS Prediction [18]
and Malicious URLs [29] using various algorithms, namely logistic regression (LR),
long short-term memory (LSTM), random forest (RF), and support vector machine
(SVM). Varied machine learning models are employed to improve detection accuracy
over rule-based approaches and to adapt to varied parameters. These algorithms have
a large capacity and readily learn the patterns of a specific dataset.
This research focuses on the systematic evaluation of spam and ham SMS datasets
to build an effective detection system. The workflow comprises five distinct phases:
supervision, preprocessing, model evaluation, training, and testing. Supervision
involves human experts manually labeling a significant portion of the dataset
as spam or ham, serving as a reference for training and testing the models. More-
over, preprocessing involves cleaning and organizing the data for effective analysis,
86 P. Prajapati et al.

using techniques like removing duplicates, handling missing values, and standard-
izing formats. Text normalization methods are employed to ensure consistency in
textual content, whereas the model evaluation uses metrics like accuracy, precision,
recall, and the F1-score for robust evaluation. Furthermore, training involves using
machine learning algorithms to learn patterns distinguishing spam from legitimate
messages from a dataset. The algorithms used in this work are LSTM, LR, RF, and
SVM. In the last stage, testing involves using a separate portion of the dataset unseen
during training to evaluate the performance of the trained model. All the algorithms
share a common overall approach, yet each exhibits unique characteristics. LSTMs [30]
are recurrent neural networks that excel
at handling long-term dependencies within sequential data. They capture and utilize
information over extended sequences, making them ideal for tasks requiring con-
textual understanding. LSTMs use specialized gating mechanisms, including input,
forget, and output gates, to regulate information flow and adaptively learn patterns.
They also address the vanishing or exploding gradient problem, making them useful
for sequential data analysis and prediction, whereas LR [31] is a linear classification
algorithm. It models the probability of a binary outcome, making it suitable for binary
classification tasks. It is interpretable and easy to implement. Subsequently, RF [28] is
an ensemble learning method composed of multiple decision trees. It combines their
outputs for more accurate and stable predictions. It is resistant to overfitting, handles
high-dimensional data well, and provides feature importance scores. However, it may
be computationally intensive due to its ensemble nature. Furthermore, SVM [14] is a
powerful algorithm for both linear and nonlinear classification tasks. It separates data
points with a clear margin, aiming to maximize the margin’s width. SVM is effective
in high-dimensional spaces and can handle complex decision boundaries. It may be
sensitive to hyperparameter tuning. Using all these above-mentioned algorithms, the
spam and ham SMS datasets are trained and tested.
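As a concrete illustration of the training and testing phases, the following minimal sketch trains a toy logistic regression over a bag-of-words representation of a few hypothetical labeled messages. This is not the paper's implementation (which trains full LR, LSTM, RF, and SVM models on the real datasets); the example messages and hyperparameters are assumptions for demonstration only.

```python
import math

# Toy labeled SMS corpus (hypothetical examples; 1 = spam, 0 = ham).
MESSAGES = [
    ("win a free prize now", 1),
    ("urgent claim your free cash", 1),
    ("free entry win cash prize", 1),
    ("are we meeting for lunch", 0),
    ("see you at the office tomorrow", 0),
    ("call me when you get home", 0),
]

def vectorize(text, vocab):
    """Bag-of-words vector: count of each vocabulary word in the text."""
    words = text.split()
    return [words.count(w) for w in vocab]

def train_logistic(data, vocab, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(vocab)
    b = 0.0
    for _ in range(epochs):
        for text, y in data:
            x = vectorize(text, vocab)
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(text, vocab, w, b):
    """Return 1 (spam) if the predicted probability is at least 0.5."""
    x = vectorize(text, vocab)
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

vocab = sorted({w for t, _ in MESSAGES for w in t.split()})
w, b = train_logistic(MESSAGES, vocab)
print(predict("claim your free prize", vocab, w, b))  # spam-like message
print(predict("lunch at the office", vocab, w, b))    # ham-like message
```

In practice a much larger labeled corpus and a held-out test split, as described above, are needed for meaningful evaluation.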
To make this work user-friendly, and considering the security concerns discussed
in [23], a practical SMS application for real-time detection of phishing and spam
messages was developed alongside the model research. The application uses machine
learning models to differentiate between legitimate and malicious messages. The
process involves user input,
preprocessing, feature engineering, model prediction, and alert notification. Mes-
sages are entered; thereafter, the application cleans and organizes the data, removing
duplicates [24, 26] and handling missing values. Relevant features are extracted from the text,
such as word frequencies or spam content patterns, to enable the machine learning
model to process the data effectively. If the model identifies the message as spam,
the application promptly alerts the user, providing an additional layer of protection
against malicious content. The authors in [28] discuss a literature survey of exist-
ing methods, algorithms, and techniques. It briefly describes the types of phishing
attacks possible and a few different methods on how to detect and prevent them from
happening. Phishing attacks are a potent threat when websites hold large amounts of
data; it is easy to clone or replicate such sites and cause social or economic harm to
authorities or users. This review gives a short description
of 10 different survey papers done in the past few years based on different machine
learning or deep learning concepts. It also assesses whether the surveyed techniques
were sufficient to detect attacks on the websites concerned. Following the study, this
review helps prevent online fraud among customers. As a literature review, it broadly
explains how each dataset was collected and filtered.
The authors in [2] worked on various machine learning models to detect fraud
SMS messages. The hybrid technique increases accuracy and fraud detection capa-
bilities by utilizing the advantages of many algorithms. It provides a viable method
to deal with the growing problem of fraudulent SMS messages by merging these
models. Similarly, [15] proposed a system that uses NLP techniques to improve the
identification and detection of spam messages. The study combines various embeddings,
leveraging transformer-based embeddings to improve system accuracy. These ensemble
learning techniques provide a more effective barrier against unsolicited messaging.
Also, [21] examines malware threats in Android SMS applications. To provide insight
into these attacks' strategies and potential weaknesses, the research attempts to
identify patterns and characteristics of these
attacks. This analysis helps to improve Android security mechanisms and protect user
devices by analyzing real-world data. As per [12], a secure mechanism was proposed
to ensure the safety of customers and transactions.
Therefore, this system employs advanced authentication and verification methods
to combat voice-based phishing attempts. Owing to the many bot-generated voice
calls, customers are often fooled into revealing their personal information. This enhanced
system helps improve the overall security and trustworthiness of banking services.
The authors in [11] focused on the analysis of malware, forms of attacks, and security
flaws related to smartphone use. They divided malware detection approaches into two
main categories: signature-based approaches and machine learning-based techniques
(behavior detection). An overview of the risks and necessary requirements
for malware security in mobile applications is provided based on this investigation.
The impact of large language models on multiple domains was observed by [19].
ChatGPT has been thoroughly studied for tasks like code generation. Its use in
identifying malicious web content—specifically, phishing sites—has not received
much attention. This approach uses a web crawler to collect data from websites and
generate prompts based on the information gathered. This method allows for the
identification of social engineering techniques in the context of entire websites and
URLs, as well as the detection of various phishing websites [6, 17].

4 Proposed Work

In this phishing SMS detection work, various machine learning algorithms are used
to analyze fraudulent text messages. It begins with rigorous preprocessing to remove
extraneous data. The dataset is then split into 80% for training and 20% for testing.
The dataset includes SMS messages, with the header and body
components being crucial, as are their associated links. The header is initially checked
for safety, then the body and later links are analyzed. If any suspicious messages are
found, an alert is triggered. The system for detecting phishing SMS messages involves
a two-tiered approach, assessing the link for potential malicious intent. The system
is integrated into an Android application hosted on a Google Cloud Platform (GCP)
instance for seamless user interaction. This cloud-based deployment enhances scal-
ability and robustness, catering to a wide user base. The application is designed for

Fig. 1 Block diagram for proposed SMS phishing application


ease of use, even by users who are not technically proficient. The comprehensive
approach includes preprocessing, machine learning model training, and real-time
SMS analysis through the Android application, aiming to establish a robust defense
against evolving cybersecurity threats in SMS communication. Figure 1 describes
how the messages are detected and which techniques are employed. For preprocessing,
the CSV file was loaded with a consistent encoding scheme, and stop-words were
removed from the dataset to improve training and testing. For training and testing,
we compared four algorithms, RF, SVM, LSTM, and LR, on our enriched Spam SMS
Dataset [18] and the Malicious URL Dataset [29]. Each algorithm first checks the
SMS header. If the header is classified as ham, the program then examines the body
of the SMS, which comprises textual data. If the SMS body contains a URL, it is
validated separately by the same algorithms. If both the body and the URL are safe,
the application displays "This SMS is Safe"; otherwise, it warns the user that the
message is hazardous or fake.
Moreover, Algorithm 1 elucidates a standard machine learning classification
approach. It preprocesses the SMS by removing special characters and converting the
text to lowercase. It then extracts relevant features from the SMS text and any URLs
present. After training a classification model with labeled data, it uses this model to
predict whether the given SMS is spam or ham. The final step involves displaying
the classification result to the user, with a warning for either spam or confirmation
for ham. This algorithm provides a basic structure for SMS classification; the model's
effectiveness depends on the quality and quantity of the labeled training data and the
features extracted.

Algorithm 1: Pseudocode for SMS Text and URL Processing

Data: Latest SMS fetched from Android application
Result: Classification of the SMS: Spam or Ham
1  Preprocessing:
2    Body ← RemoveSpecialCharactersAndDigits(Body)
3    Body ← ConvertToLowercase(Body)
4  Data Extraction:
5    TextFeatures ← ExtractTextFeatures(Body)
6    URLFeatures ← ExtractURLFeatures(URL)
7  Training:
8    Train a classification model using labeled data (e.g., spam and ham SMS)
9  Testing:
10   Prediction ← PredictUsingModel(TextFeatures, URLFeatures)
11   if Prediction is Spam then
12     Display warning sign (SMS = Spam)
13   else
14     Display SMS = Ham
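A runnable Python rendering of Algorithm 1 is sketched below. The feature choices and the keyword-based stand-in "model" are placeholders; in the paper, the trained LR, LSTM, RF, or SVM model fills that role:

```python
import re

def remove_special_chars_and_digits(body):
    """Keep only letters and whitespace."""
    return re.sub(r"[^a-zA-Z\s]", "", body)

def classify_sms(body, url, model):
    """Follows Algorithm 1: preprocess, extract features, predict, display.
    `model` is any object with a predict(text_features, url_features) method."""
    # Preprocessing
    body = remove_special_chars_and_digits(body).lower()
    # Data extraction (illustrative features)
    text_features = body.split()
    url_features = {"has_url": bool(url), "length": len(url or "")}
    # Testing
    prediction = model.predict(text_features, url_features)
    if prediction == "Spam":
        return "Warning: SMS = Spam"
    return "SMS = Ham"

# A stand-in rule-based "model" so the flow can run end to end.
class KeywordModel:
    def predict(self, text_features, url_features):
        spammy = {"win", "free", "prize"} & set(text_features)
        return "Spam" if spammy or url_features["length"] > 30 else "Ham"

print(classify_sms("WIN a FREE prize!!! 2024", "http://bit.ly/x", KeywordModel()))
```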

5 Implementation and Result Analysis

This research work is executed on a desktop with an 11th Gen Intel Core processor,
8 GB of RAM, and a dedicated GPU to evaluate different classifier methods for dis-
tinguishing spam and genuine SMS messages. The classifiers included LR, LSTM,
RF, and SVM. Each of these four machine learning algorithms—logistic regression,
LSTM, random forest, and SVM—is suitable for SMS phishing detection based on
their respective strengths. Logistic regression (LR) is effective when there are linear
relationships between the input features and the output, making it suitable for detecting
phishing patterns in SMS messages. Moreover, long short-term memory (LSTM) is a
recurrent neural network suited to tasks involving sequential data and is considered
well suited to text analysis, whereas random forest (RF) is an ensemble learning method
that handles nonlinear relationships and complex patterns in data. Similarly, SVM
is suitable for small and high-dimensional data and can effectively handle nonlinear
relationships through kernel functions. Conclusively, these algorithms are partic-
ularly useful in detecting diverse and nonlinear phishing messages. Adding to it,
all these algorithms work effectively by calculating four distinct metrics: accuracy,
precision, recall, and F1-score. A phishing SMS scoring function was also implemented,
which integrated operations like preprocessing SMS headers and content, extracting
sender information, and scrutinizing embedded URLs. It assigned scores based
on sender reputation, domain age, suspicious keywords, URL analysis, and header
examination. The application was built in Android Studio with SDK version 9 (Android
Pie) as the minimum and SDK 33 as the target, and it was developed in Java.
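The four evaluation metrics follow directly from the confusion-matrix counts; a short sketch with made-up labels (not the paper's data):

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """True/false positives and negatives for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score from their definitions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]
print(metrics(y_true, y_pred))
```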
Moreover, Table 1 presents the performance metrics of four different machine
learning algorithms (LR, LSTM, RF, and SVM) for two distinct tasks: Spam SMS
Prediction and Malicious URL Classification. The metrics assessed include accuracy,

Table 1 Experimental result of our proposed work

S. No.  Dataset               Methods  Accuracy (%)  Precision (%)  Recall (%)  F1-score (%)
1       Spam SMS prediction   LR       96.61         95.50          99.89       97.85
                              LSTM     98.80         96.15          94.33       95.24
                              RF       98.85         97.15          95.00       95.25
                              SVM      92.19         89.50          92.29       93.11
2       Malicious URLs        LR       95.29         94.99          98.01       96.48
                              LSTM     95.20         96.29          86.45       92.74
                              RF       99.94         99.96          99.91       99.94
                              SVM      95.73         96.01          95.26       95.46
precision, recall, and F1-score. Among all the machine learning algorithms, random
forest surpassed the others in both tasks: Spam SMS Prediction and Malicious URL
Classification. Random forest achieved the highest accuracy [27] of 98.85%
in spam SMS prediction, with a precision of 97.15%, recall of 95.00%, and F1-
score of 95.25%. In Malicious URL Classification, the random forest algorithm
demonstrates exceptional results, with an accuracy of 99.94%, precision of 99.96%,
recall of 99.91%, and F1-score of 99.94%. This comprehensive evaluation highlights
the effectiveness of random forest in these tasks, underscoring its potential as a robust
solution for SMS phishing detection and malicious URL identification.
Figure 2 shows the comparison of various machine learning algorithms used to
test and train our dataset. Experimentally, the accuracies of LSTM and random forest
were the highest and approximately equal, and these algorithms also learn faster. The
assumption in logistic regression of a linear relationship between the input features
and the log-odds of the output is advantageous for textual SMS datasets because it
allows the algorithm to capture and model detailed interactions between words,
assisting in effective spam categorization. Hence, the recall of LR is much better.
The F1-score is consistent across methods, indicating that each algorithm achieves
a balanced precision and recall, striking a stable
trade-off between correctly detecting positive instances and avoiding false positives.
This results in comparable overall performance. In Fig. 3, we compare four different
algorithms for detecting malicious URLs. Random forest detects malicious URLs with
near-perfect (99.94%) accuracy, whereas the other algorithms fall short. This suggests
that its ensemble learning approach, which combines many decision tree structures, excels at
catching complicated patterns and correlations in URL characteristics, resulting in
exceptional effectiveness in detecting harmful content.
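Lexical URL features of the kind such a classifier might consume can be extracted as below; the exact feature set behind Table 1 is not specified in the paper, so these are illustrative:

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Simple lexical features commonly used for malicious-URL
    classification (illustrative, not the paper's exact feature set)."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        "has_ip_host": host.replace(".", "").isdigit(),
        "num_special": sum(url.count(c) for c in "@-_%?="),
        "uses_https": parsed.scheme == "https",
    }

print(url_features("http://192.168.0.1/login?user=admin"))
```

Feature vectors like these are what the random forest's decision trees split on when separating benign from malicious URLs.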

Fig. 2 Spam SMS detection


Fig. 3 Malicious URLs detection

6 Conclusion and Future Work

This paper elucidates the entire research work, which has made significant strides
in improving cybersecurity measures in SMS communication. The evaluation is
made using various machine learning algorithms, including LR, LSTM, RF, and
SVM, to demonstrate their effectiveness in distinguishing between legitimate and
fraudulent messages. The development of a real-time detection application similar
to [22] further enhances the research’s impact by identifying and alerting users about
potential phishing attempts, ensuring accessibility and ease of use. Moving forward,
several avenues remain for further exploration and enhancement of this research work.
The primary focus will be on testing the application [13] to ensure its effectiveness
in real-world scenarios. Additionally, hosting the application on a cloud [9] platform
would enhance scalability and robustness. These steps would serve to validate and
optimize the proposed solution, ultimately contributing to a safer digital environment
for SMS communication.

References

1. Agarwal S, Kaur S, Garhwal S (2015) Sms spam detection for indian messages. In: 2015
1st international conference on next generation computing technologies (NGCT). IEEE, pp
634–638
2. Agrawal N, Bajpai A, Dubey K, Patro B (2023) An effective approach to classify fraud sms using
hybrid machine learning models. In: 2023 IEEE 8th international conference for convergence
in technology (I2CT). IEEE, pp 1–6
3. Al-Qahtani AF, Cresci S (2022) The covid-19 scamdemic: a survey of phishing attacks and
their countermeasures during covid-19. IET Informat Secur 16(5):324–345
4. Alwanain MI (2020) Phishing awareness and elderly users in social media. Int J Comput Sci
Netw Secur 20(9):114–19
5. Awan HA, Aamir A, Diwan MN, Ullah I, Pereira-Sanchez V, Ramalho R, Orsolini L, de
Filippis R, Ojeahere MI, Ransing R et al (2021) Internet and pornography use during the
covid-19 pandemic: presumed impact and what can be done. Front Psych 12:623508
6. Babu MSK, Chandana A, Anusha A, Harika K, Jhansi P (2023) Examining login urls to identify
phishing threats. Turkish J Comput Math Educ (TURCOMAT) 14(03):378–383
7. Balakirsky TL (2022) To “opt out” go to court: how the public nuisance doctrine can solve the
robotext circuit split and support plaintiffs. Brook L Rev 88:719
8. Brown J, Shipman B, Vetter R (2007) Sms: The short message service. Computer 40(12):106–
110
9. Butt UA, Amin R, Aldabbas H, Mohan S, Alouffi B, Ahmadian A (2023) Cloud-based email
phishing attack using machine and deep learning algorithm. Complex and Intell Syst 9(3):3043–
3070
10. Chaudhary H, Detroja A, Prajapati P, Shah P (2020) A review of various challenges in cyber-
security using artificial intelligence. In: 2020 3rd international conference on intelligent sus-
tainable systems (ICISS). IEEE, pp 829–836
11. Cinar AC, Kara TB (2023) The current state and future of mobile security in the light of the
recent mobile security threat reports. Multimedia Tools Appl 1–13
12. Denslin Brabin D, Bojjagani S (2023) A secure mechanism for prevention of vishing attack
in banking system. In: 2023 International conference on networking and communications
(ICNWC), pp 1–5
13. Gao J, Bai X, Tsai WT, Uehara T (2014) Mobile application testing: a tutorial. Computer
47(2):46–55
14. Ghosh S, Dasgupta A, Swetapadma A (2019) A study on support vector machine based linear
and non-linear pattern classification. In: 2019 International Conference on Intelligent Sustain-
able Systems (ICISS). IEEE, pp 24–28
15. Ghourabi A, Alohaly M (2023) Enhancing spam message classification and detection using
transformer-based embedding and ensemble learning. Sensors 23(8)
16. Internet Crime Complaint Center (IC3), Federal Bureau of Investigation (2022) [Online accessed on 1-Dec-2023] https://www.ic3.gov/Media/PDF/AnnualReport/2022_IC3Report.pdf
17. Jalil S, Usman M, Fong A (2023) Highly accurate phishing url detection based on machine
learning. J Amb Intell Humanized Comput 14(7):9233–9251
18. Kim E (2023) [Online accessed on 1-Dec-2023] https://www.kaggle.com/datasets/uciml/sms-
spam-collection-dataset
19. Koide T, Fukushi N, Nakano H, Chiba D (2023) Detecting phishing sites using chatgpt
20. Kovač A, Duner I, Seljan S (2022) An overview of machine learning algorithms for detecting
phishing attacks on electronic messaging services. In: 2022 45th Jubilee international conven-
tion on information, communication and electronic technology (MIPRO). IEEE, pp 954–961
21. Kumar A, Sharma I, Sharma A (2023) Understanding the behaviour of android sms malware
attacks with real smartphones dataset. In: 2023 International conference on innovative data
communication technologies and application (ICIDCA), pp 655–660
22. Mishra S, Soni D (2020) Smishing detector: a security model to detect smishing through sms
content analysis and url behavior analysis. Fut Generat Comput Syst 108:803–815
23. Prajapati P, Bhagat D, Shah P (2020) A review on different techniques used to detect the
malicious applications for securing the android operating system. Int J Sci Technol Res 9:5255–
5258
24. Prajapati P, Shah P (2014) Efficient cross user data deduplication in remote data storage. In:
International conference for convergence for technology-2014. IEEE, pp 1–5
25. Prajapati P, Shah P (2022) A review on secure data deduplication: cloud storage security issue.
J King Saud Univ Comput Inf Sci 34(7):3996–4007
26. Prajapati P, Shah P, Ganatra A, Patel S (2017) Efficient cross user client side data deduplication
in hadoop. J Comput 12(4):362–370
27. Prusty SR, Sainath B, Jayasingh SK, Mantri JK (2022) Sms fraud detection using machine
learning. In: Intelligent systems: proceedings of ICMIB 2021. Springer, pp 595–606
28. Safi A, Singh S (2023) A systematic literature review on phishing website detection techniques.
J King Saud Univ Comput Inf Sci 35(2):590–611
29. Sidhartha M (2023) [Online accessed on 1-Dec-2023] https://www.kaggle.com/datasets/
sid321axn/malicious-urls-dataset
30. Staudemeyer RC, Morris ER (2019) Understanding lstm–a tutorial into long short-term memory
recurrent neural networks. arXiv preprint arXiv:1909.09586
31. Stoltzfus JC (2011) Logistic regression: a brief primer. Acad Emergency Med 18(10):1099–
1104
32. Sun Y, Li Y, Bao Y, Meng S, Sun Y, Schumann G, Kosten T, Strang J, Lu L, Shi J (2020) Brief
report: increased addictive internet and substance use behavior during the covid-19 pandemic
in China. Am J Add 29(4):268–270
Convolution Neural Network
(CNN)-Based Live Pig Weight Estimation
in Controlled Imaging Platform

Chandan Kumar Deb, Ayon Tarafdar, Md. Ashraful Haque,


Sudeep Marwaha, Suvarna Bhoj, Gyanendra Kumar Gaur, and Triveni Dutt

Abstract This study addresses the need for a more efficient and accurate live pig
weight monitoring system in the Indian meat production industry. Conventional
methods for measuring pig weights are labor-intensive, prompting the exploration of
AI and image processing-based solutions. The research introduces a novel regression-
based Convolutional Neural Network (CNN) model trained on a dataset of 1217
images of live pigs, each accompanied by their corresponding weight values. The
model demonstrates promising results on the test dataset, with a coefficient of deter-
mination (R2) of 0.801, mean absolute error (MAE) of 0.054, and root mean square
error (RMSE) of 0.040. Data collection involved a meticulously designed imaging
platform to ensure dataset robustness. The proposed model’s efficiency is highlighted
by its convergence behavior during training and testing, showcasing its ability to
accurately predict live pig weights and its potential to revolutionize the Indian meat
production industry.

Keywords Live pig weight · Convolution neural network · Regression-based


model · Deep learning · Digital images

1 Introduction

Pigs, or swine (Sus scrofa domesticus), are renowned for their high-quality meat
production, with live pig weight serving as a crucial indicator in optimizing overall
meat yield. Traditional live pig weight measurement methods are characterized by
labor-intensive and time-consuming processes. Presently, artificial intelligence (AI)
and image processing-based techniques have gained prominence for their efficiency

C. K. Deb (B) · Md. A. Haque · S. Marwaha


Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, New
Delhi, India
e-mail: chandan.deb@icar.gov.in
A. Tarafdar · S. Bhoj · G. K. Gaur · T. Dutt
Livestock Production and Management Section, ICAR-Indian Veterinary Research Institute, Uttar
Pradesh, Izatnagar, Bareilly 243122, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 95
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_8
96 C. K. Deb et al.

and precision in weight estimation. This study focuses on the development of a


regression-based Convolutional Neural Network (CNN) model tailored for live pig
weight estimation through digital images. The model is trained and tested on a dataset
comprising images of live pigs paired with their respective weight values (in kg). The
CNN model, designed for regression tasks, utilizes image patterns to predict live pig
weights. Model performance is evaluated using metrics such as coefficient of deter-
mination (R2), mean absolute error (MAE), and root mean square error (RMSE) on
the test dataset. The literature review explores prior research, highlighting the study’s
contribution and addressing existing gaps. The “Dataset” section details the acquisi-
tion process, and the “Methodology” section provides a comprehensive overview of
the overall workflow, model architecture, and experimental details. The “Results and
Discussion” section presents the outcomes, and the “Conclusion” section concludes
the study while outlining future directions.

2 Literature Review

The meticulous measurement of individual swine weight stands as a critical element


in effective swine farm management, playing a pivotal role in optimizing high-quality
pork production. The precision in assessing pig weight is particularly significant for
economic considerations in pig farming, particularly before setting slaughter targets
[1]. The adoption of indirect methods for live pig weight measurement has gained
traction due to challenges inherent in direct measurement methods, such as labor
intensiveness, time consumption, and stress for the pigs [2, 3].
Various machine learning (ML) approaches, including Back Propagation Neural
Network, Multilayer Feed Forward Network, and Linear Mixed Models, have been
employed for live pig weight estimation [4–11]. Most ML methods rely on morpho-
metric measurements, either manually taken or extracted from images through
preprocessing. The introduction of Convolutional Neural Network (CNN)-based
approaches is a novel development in live pig weight determination [3, 12].
Recent studies [13–18], among others, have focused on developing AI-based
methodologies for estimating the weights of live and slaughtered pigs using digital
images. Despite these advancements, a comprehensive review [19] indicates that AI
and image processing-based approaches for live pig weight estimation have not yet
been implemented in India. Recent reviews suggest a growing trend in the global
adoption of these advanced technologies, signaling the need for their introduction in
the Indian context for enhanced swine farming practices.
Dataset
In this study, digital images of live pigs and their corresponding weight values
(in kg) were collected on a well-structured in-house imaging platform installed
at ICAR-Indian Veterinary Research Institute, Izatnagar, India.
The snapshot of the established in-house imaging platform is shown in Fig. 1 (A and
B). The platform is a rectangle-shaped, open-ended box with adjustable width (80 cm
or 120 cm), light intensity (250 lx, 500 lx, or 750 lx), and camera height.

Fig. 1 a In-house imaging platform developed at ICAR-IVRI, with adjustable camera height,
platform width, and light intensity; b digital image of a 9-month-old pig with 88.2 kg body
weight, captured at a camera height of 7 feet under 250 lx

Images were collected under different combinations of these imaging conditions, which
made the dataset robust in terms of variability. There are 1217 images of live pigs,
captured in top-view mode using a mounted camera. All images were resized to 224 ×
224 pixels using the OpenCV Python library and saved in .png format.
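The resize step was performed with OpenCV; as a dependency-free illustration of the same operation, a nearest-neighbor resize can be written as follows (the study itself used `cv2.resize`):

```python
def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize of an image given as a nested list
    (rows of pixel values). Illustrates the idea behind cv2.resize."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

tiny = [[0, 1],
        [2, 3]]
print(resize_nearest(tiny, 4, 4))
```

Resizing every image to a fixed 224 × 224 shape is what allows a single CNN input layer to accept the whole dataset.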

3 Methodology

3.1 Process Flow

Over 2000 images were generated, and from this pool, 1217 images were carefully
chosen for experimentation in deep learning. This selected dataset underwent prepro-
cessing using the OpenCV Python library. Subsequently, the preprocessed data was
divided into two segments: the first part, used for training the model, and the second
part, employed for model evaluation. The workflow for estimating live pig weight is
illustrated in Fig. 2, while Table 1 presents various train-test splits within the dataset.
Following the train-test division, a Convolutional Neural Network (CNN) model was
meticulously crafted (Table 1). Finally, the model’s performance was assessed using
98 C. K. Deb et al.

Fig. 2 Workflow of live pig weight estimation approach

Table 1 Different models based on data split

Data configuration   Train:Test   Data summary
Dataset 1            80:20        Training: 974, Testing: 243
Dataset 2            75:25        Training: 913, Testing: 304
Dataset 3            70:30        Training: 852, Testing: 365

metrics such as the coefficient of determination (R2), mean absolute error (MAE),
and root mean square error (RMSE).
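These three metrics can be sketched in plain Python (a minimal illustration, not the authors' code; the weight values below are made up):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_square(y_true, y_pred):
    """Coefficient of determination (R2)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative true vs. predicted pig weights in kg (hypothetical values)
y_true = [88.2, 75.0, 92.4, 60.1]
y_pred = [87.5, 76.2, 91.0, 61.3]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r_square(y_true, y_pred))
```

Lower MAE and RMSE and a higher R2 indicate better agreement between the predicted and true weights.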

3.2 Model Architecture

In this study, a novel regression-based CNN model was designed to estimate live pig
weights from digital images. This model significantly differs from conventional CNN
architectures. The constructed model comprises four convolutional layers, accom-
panied by four max-pooling layers, and two fully connected layers. The features
learned and extracted from the convolutional and pooling layers are subsequently
fed through two dense layers, culminating in a single output node. The features
extracted from this final layer are employed for predicting the output, which is the
weight of the live pigs. The input images, each with dimensions of 224 × 224 pixels,
are fed into the input layer along with the corresponding pig weight values as the
target variable. Throughout the network, with the exception of the final layer, the
‘relu’ activation function was utilized. The default training iteration was set to 200

Fig. 3 Architecture of the developed convolution neural network (CNN) model coupled with
regression neural network for live pig weight estimation

epochs, with an early stopping mechanism to conserve time by halting weight updates
upon early convergence. Figure 3 illustrates the comprehensive architecture of this
regression-based CNN model designed for live pig weight estimation from digital
images.
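The chapter does not state the kernel or pooling sizes, so as an illustration, assuming 3 × 3 'valid' convolutions and 2 × 2 max-pooling (an assumption, not the authors' stated configuration), the feature-map side length through the four conv + pool stages can be traced as:

```python
def conv_valid(size, kernel=3):
    # a 'valid' convolution shrinks each side by kernel - 1
    return size - kernel + 1

def max_pool(size, kernel=2):
    # non-overlapping pooling halves each side (floor division)
    return size // kernel

size = 224  # input side length in pixels
for stage in range(1, 5):
    size = max_pool(conv_valid(size))
    print(f"after conv+pool stage {stage}: {size} x {size}")
```

Under these assumptions the side length shrinks 224 → 111 → 54 → 26 → 12 before the flattened features reach the two dense layers and the single output node.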

3.3 Experimental Details

In this study, the entire dataset was partitioned into training and testing sets for
the purpose of model training and performance evaluation. Three separate datasets
were created, each with different training–testing ratios: Dataset 1 (train:test::80:20),
Dataset 2 (train:test::75:25), and Dataset 3 (train:test::70:30), as detailed in Table 1.
The proposed regression-based CNN model was trained using the training dataset,
and its robustness was assessed by evaluating its performance on the respective
testing datasets for each data configuration. All experiments were conducted on an
NVIDIA DGX GPU Server equipped with Tesla V100 GPUs. The development of
the proposed model was implemented using Keras, a high-level Python API backed
by the Tensorflow engine.
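The counts of Table 1 follow from simple rounding of the 1217-image total (a sketch; the authors' exact split code is not given):

```python
TOTAL_IMAGES = 1217  # number of selected images in the study

for name, train_fraction in [("Dataset 1", 0.80),
                             ("Dataset 2", 0.75),
                             ("Dataset 3", 0.70)]:
    n_train = round(TOTAL_IMAGES * train_fraction)
    n_test = TOTAL_IMAGES - n_train
    print(f"{name}: Training {n_train}, Testing {n_test}")
```

This reproduces the training/testing counts 974/243, 913/304, and 852/365 of Table 1.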

4 Results and Discussion

The model exhibited its best predictive performance on Dataset 3 (with a train:test
ratio of 70:30), both during training and testing, as detailed in Table 2.
Notably, on Dataset 3, the model achieved the lowest values for both MAE and
RMSE, measuring at 0.04 and 0.054, respectively, surpassing its performance on the
other datasets. These results underscore the exceptional suitability of the proposed
regression-based CNN model for Dataset 3. They also affirm that the model accu-
rately predicted the response variable, namely the live weight of the pigs. Further-
more, the highest R2 value was observed for Dataset 3, indicating that the proposed
model adeptly extracts highly correlated features from the digital images of live

Table 2 Prediction performance of the proposed model on test data

Data configuration   Mean absolute error (MAE)   Root mean square error (RMSE)   R-square
Dataset 3            0.040                       0.054                           0.801
Dataset 2            0.054                       0.073                           0.781
Dataset 1            0.047                       0.064                           0.732

Fig. 4 Trends of losses of the proposed model during training and testing time

pigs. These learned features endow the proposed model with the capacity to effec-
tively handle variations in the response variable. Figure 4 portrays the convergence
behavior of the model’s loss function during both training and testing phases.

5 Conclusion

In this current study, we introduced an innovative regression-based Convolutional
Neural Network (CNN) model designed specifically to estimate the weights of live
pigs using digital images. The experimental findings within this study demonstrate
the potential of CNNs and deep learning models for the efficient estimation of pig
weights from a high-throughput stream of live images. Across the experiments, the
R2 values ranged from 0.732 to 0.801, indicating a promising level of accuracy, and
a larger dataset has the potential to enhance it further. Looking ahead, the developed
model could be deployed in a mobile application, enabling farmers to estimate the
live weight of pigs in the field in real time.

A Novel Image Encryption Technique
Based on DNA Theory and Chaotic Maps

Kartik Verma, Butta Singh, Manjit Singh, Satveer Kour, and Himali Sarangal

Abstract High inter- and intra-pixel redundancy and correlation among adjoining
pixels of digital images make it crucial to secure them during transmission over
public channels. Here, we exploit chaotic and DNA theory to design a secure and
robust image encryption approach. Characteristics including pseudo-randomness and
ergodicity make chaotic systems well suited to image cryptography. Chaotic theory
and DNA operations are employed in the confusion and diffusion stages of image
encryption, respectively. The control parameters of the chaotic maps serve as the
security key. Performance of the proposed method is evaluated using entropy
analysis, histogram analysis, and correlation metrics. The proficiency of the
technique is also benchmarked against prevailing techniques using various evaluation
parameters.

Keywords Data security · Image processing · Chaotic maps · DNA encoding

1 Introduction

The current advancements in electronic communication techniques and associated
technology have paved the way for fast transmission of information and multimedia
data. Multimedia information such as text, audio, video, and images can be
distributed rapidly over any communication channel. Multimedia data transmitted
through both wireless and wired channels are under persistent threat of being
attacked and hacked [1]. This makes information exchange through such insecure
channels potentially dangerous, as it may expose confidential data. Digital images
are important and frequent multimedia information carriers. Thus, it is crucial to

K. Verma · B. Singh (B) · M. Singh · H. Sarangal
Department of Engineering and Technology, Guru Nanak Dev University Regional Campus,
Jalandhar, India
e-mail: bsl.khanna@gmail.com
S. Kour
Department of Computer Engineering and Technology, Guru Nanak Dev University, Amritsar,
India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 103
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_9
104 K. Verma et al.

shield the confidentiality of digital images. The high inter- and intra-pixel redundancy
and correlation among adjoining pixels of digital images have drawn researchers to
encryption as a tool of image security [2–4]. Image encryption converts an image
into an unrecognizable form. The common approaches to image encryption operate
in (a) the transform domain and (b) the spatial domain. Although spatial-domain
approaches are computationally efficient, they are typically not robust enough to
withstand external attacks [5]. In transform-domain methods, the cover image is
first transformed to the frequency domain, and confusion–diffusion methodologies
are then applied to the frequency coefficients [6, 7]. Chaos theory-based data
encryption methods have also demonstrated efficient image security by exploiting
pseudo-randomness and ergodic properties [3–5, 8]. Fridrich suggested the basic
confusion–diffusion architecture for image encryption [9]. Confusion processes
randomize and scramble the samples of the digital media to minimize the correlation
between adjoining samples. The diffusion process, in turn, updates the samples of
the image to spread the statistical properties of the ciphertext.
Hybridized image encryption approaches, which exploit the features of two or more
techniques simultaneously, have recently proven more robust and secure than
individual approaches. Guesmi et al. proposed a hybridized approach combining
chaotic maps and DNA theory [8]. Chen et al. [10] applied DNA approaches based
on novel permutation–diffusion mechanisms. Belazi [11] suggested a multiple-round
encryption method joining the features of chaotic maps and DNA computations. A
hybridized encryption approach was suggested using chaotic theory and genetic
operations [12]. Besides these techniques, recent encryption approaches also employ
machine learning [13] and asymmetric methodologies [14].
Here, we propose a hybrid image encryption approach using chaotic maps and
random DNA procedures. In the confusion stage, a sine-map chaotic system is
applied to scramble the image by dislocating pixel location indices. In the diffusion
stage, DNA encoding is applied together with a chaotic sequence-based DNA
operator selection approach.
Section 2 of this paper explains the features of chaotic maps, DNA encoding, and
DNA operations. Section 3 demonstrates the different stages of the presented
encryption method. Sections 4 and 5 present the result analysis and assessment of
the proposed method and the conclusions, respectively.

2 Background

2.1 Chaos Theory

Chaotic maps are 1D nonlinear maps with complex chaotic behavior [10]. Owing to
their very simple construction, ease of implementation, and low computational
complexity, 1D chaotic maps are well suited to data security algorithms. The two
1D chaotic maps used here are defined mathematically as follows.
A Novel Image Encryption Technique Based on DNA Theory … 105

Fig. 1 Bifurcation diagram of a sine and b logistic chaotic map

2.1.1 Sine Map (SM)

Chaotic sine map (SM) is shown by the following equation:

SM: yn+1 = α sin(π yn ), (1)

where y0 ∈ [0, 1] and α ∈ (3.5, 4] are the starting chaotic sample and the control
parameter of the SM, respectively. The bifurcation diagram in Fig. 1a illustrates the
pseudo-random properties of the chaotic SM.

2.1.2 Logistic Map

Chaotic logistic map (LM) is expressed mathematically by

LM: yn+1 = βyn (1 − yn ), (2)

where y0 ∈ [0, 1] and β ∈ [0, 4] are the starting chaotic sample and the control
parameter of the logistic map, respectively. The pseudo-random properties of the
chaotic LM are demonstrated in Fig. 1b.
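Both maps can be iterated in a few lines of Python. For the sine map, the common scaled form y(n+1) = (α/4)·sin(π·y(n)) is assumed here so that the stated range α ∈ (3.5, 4] keeps the orbit inside [0, 1]; the initial values are illustrative, not key material from the paper:

```python
import math

def sine_map(y0, alpha, n):
    """Iterate the sine map; scaled form (alpha/4)*sin(pi*y) assumed."""
    seq, y = [], y0
    for _ in range(n):
        y = (alpha / 4.0) * math.sin(math.pi * y)
        seq.append(y)
    return seq

def logistic_map(y0, beta, n):
    """Iterate the logistic map y_{n+1} = beta * y_n * (1 - y_n)."""
    seq, y = [], y0
    for _ in range(n):
        y = beta * y * (1.0 - y)
        seq.append(y)
    return seq

sm = sine_map(0.3, 3.99, 1000)
lm = logistic_map(0.3, 3.99, 1000)
print(sm[:3], lm[:3])
```

Tiny changes to y0 or the control parameter yield entirely different orbits, which is the sensitivity the security key relies on.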

2.2 DNA Encoding Technique

DNA stands for deoxyribonucleic acid. A DNA sequence consists of four bases: the
purines A (adenine) and G (guanine) and the pyrimidines T (thymine) and C
(cytosine). A always pairs with T, and G always pairs with C; thus A and T, and C
and G, are complementary. In binary, 0 and 1 are likewise complementary. The DNA
encoding technique encodes each base as a 2-bit binary value following any one of
eight rules (Table 1) [8, 9]. DNA encoding, along with DNA operations such as
addition, subtraction,

and XOR (Tables 2, 3 and 4), is exploited to implement the confusion–diffusion
process that encrypts the digital images.

Table 1 DNA theory-based encoding and decoding rules

Rule     A    T    C    G
Rule 1   00   11   01   10
Rule 2   00   11   10   01
Rule 3   01   10   00   11
Rule 4   10   01   00   11
Rule 5   01   10   11   00
Rule 6   10   01   11   00
Rule 7   11   00   01   10
Rule 8   11   00   10   01

Table 2 DNA theory-based addition operation

ADD   A   G   T   C
A     A   G   T   C
G     G   C   A   T
T     T   A   C   G
C     C   T   G   A

Table 3 DNA theory-based subtraction operation

SUB   A   G   T   C
A     A   T   G   C
G     G   A   T   C
T     T   C   A   G
C     C   G   T   A

Table 4 DNA theory-based exclusive OR operation

XOR   A   G   T   C
A     A   G   T   C
G     G   A   C   T
T     T   C   A   G
C     C   T   G   A
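As a concrete illustration (not the authors' code), under Rule 1 (A=00, T=11, C=01, G=10) the XOR of Table 4 coincides with bitwise XOR of the 2-bit codes:

```python
RULE1 = {"A": "00", "T": "11", "C": "01", "G": "10"}
RULE1_INV = {v: k for k, v in RULE1.items()}

def dna_encode(byte):
    """Encode an 8-bit pixel value as four DNA bases using Rule 1."""
    bits = format(byte, "08b")
    return "".join(RULE1_INV[bits[i:i + 2]] for i in range(0, 8, 2))

def dna_decode(bases):
    """Decode four DNA bases back to the 8-bit pixel value."""
    return int("".join(RULE1[b] for b in bases), 2)

def dna_xor(b1, b2):
    """XOR of two bases; equivalent to Table 4 under Rule 1."""
    x = int(RULE1[b1], 2) ^ int(RULE1[b2], 2)
    return RULE1_INV[format(x, "02b")]

print(dna_encode(156))    # pixel 156 = 10011100 -> 'GCTA'
print(dna_xor("G", "T"))  # 10 XOR 11 = 01 -> 'C', matching Table 4
```

Encoding and decoding under the same rule are exact inverses, which is what allows the diffusion stage to be undone at decryption.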

3 Proposed Method

The proposed hybrid encryption technique consists of (i) a chaotic map scrambling-
based confusion stage and (ii) a DNA encoding and DNA operations-based diffusion
stage. An overview of the proposed encryption technique is given in Fig. 2. The
comprehensive process is discussed below.

3.1 Encryption

3.1.1 Chaotic Map-Based Confusion Process

Chaotic maps are applied as the nonlinear confusion step of image encryption. Four
chaotic sine maps are generated with initial conditions and control parameters
(y0i , α i ); i = 1, 2, 3, and 4. The chaotic maps scramble the cover image (img) into a
confused image (conf_img) following Algorithm 1.

Fig. 2 Block representation of the encryption technique



3.1.2 DNA Operations Based Diffusion Process

DNA encoding as specified in Table 1 is employed to encode conf_img. DNA
operations are then applied in the diffusion stage of the encryption process (Algorithm
2) to obtain a diffused image (diff_img). The DNA operations are controlled and
selected following Algorithm 3.

Algorithm 1: Image confusion process


Input: Cover image img (M X N), chaotic map initial conditions y0 , control parameters (α)
Output: Confused Image (conf_img)
1. Reshape and construct 1D arrays (Arr1 and Arr2) from img
Arr1 = reshape (img((1:M/2), N))
Arr2 = reshape (img((M/2:M), N))
2. Generate sine chaotic maps with initial conditions and length as mentioned
ySM1 = SM (y01 , α 1 , M/2*N)
ySM2 = SM (y02 , α 2 , M/2*N)
3. Sort chaotic maps in ascending order
sort_ySM1 = sort(ySM1)
Sort_ySM2 = sort(ySM2)
4. Find the actual locations of the samples of the sorted chaotic sequences
Q1 = Actual locations of sort_ySM1
Q2 = Actual locations of Sort_ySM2
5. Swap the pixels of Arr1 at locations Q1 with Arr2 at locations Q2
(Arr3, Arr4) = Swap (Arr1(Q1(i)), Arr2(Q2(i)))
6. Reshape Arr3 and Arr4 to construct 2D img2 (M X N)
7. Reshape and construct 1D arrays (Arr5 and Arr6) from img2
Arr5 = reshape (img2(M, (1:N/2)))
Arr6 = reshape (img2(M, (N/2:N)))
8. Generate sine chaotic maps
ySM3 = SM (y03 , α 3 , M*N/2)
ySM4 = SM (y04 , α 4 , M*N/2)
9. Repeat steps 3 and 4 with ySM3 , ySM4 , Arr5 and Arr6, to generate Q3 and Q4
10. Repeat steps 5, 6 with Q3 and Q4 to generate img3 (M X N)
11. conf_img = img3
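The core of Algorithm 1, scrambling by the argsort of a chaotic sequence, can be sketched as follows (a simplified 1D illustration with made-up parameters; the sine map again uses the scaled form assumed earlier):

```python
import math

def sine_map(y0, alpha, n):
    seq, y = [], y0
    for _ in range(n):
        y = (alpha / 4.0) * math.sin(math.pi * y)  # scaled form, assumed
        seq.append(y)
    return seq

def chaotic_permutation(y0, alpha, n):
    """Steps 2-4: indices of the chaotic samples after ascending sort."""
    chaos = sine_map(y0, alpha, n)
    return sorted(range(n), key=lambda i: chaos[i])

pixels = [12, 240, 7, 99, 180, 33]           # toy pixel array
perm = chaotic_permutation(0.31, 3.99, len(pixels))
scrambled = [pixels[i] for i in perm]         # confusion

# decryption: put each scrambled pixel back at its original index
restored = [0] * len(pixels)
for j, i in enumerate(perm):
    restored[i] = scrambled[j]
print(restored == pixels)  # True
```

Because the permutation is fully determined by (y0, α), only a holder of those key values can regenerate it and invert the scramble.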

Algorithm 2: Image diffusion process


Input: Confused image (conf_img)
Output: Diffused image (diff_img)
1. DNAconf ← DNA-encode conf_img with rule Rx (as per Table 1)
2. Select the DNA operation to perform following Algorithm 3
3. DNAdiff ← Perform the DNA operations on DNAconf
4. diff_img ← Decode DNAdiff using rule Rx

Algorithm 3: Selection of DNA operator


Input: DNAconf , chaotic map initial condition y0 , control parameter (β)
Output: DNAdiff
1. Generate the logistic chaotic map: yLM = LM (y0 , β)
2. S = round (yLM * 1000)
3. Repeat for j = 0 to len (S) − 1:
   a. rem ← S[ j] % 3
   b. if rem == 0:
      DNAdiff [ j] ← DNAS [ j] + DNAconf [ j] (addition operator)
   c. else if rem == 1:
      DNAdiff [ j] ← DNAS [ j] − DNAconf [ j] (subtraction operator)
   d. else:
      DNAdiff [ j] ← DNAS [ j] ⊕ DNAconf [ j] (XOR operator)
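The key-dependent choice of operator in Algorithm 3 can be sketched as (illustrative parameters, not the paper's key values):

```python
def logistic_map(y0, beta, n):
    seq, y = [], y0
    for _ in range(n):
        y = beta * y * (1.0 - y)
        seq.append(y)
    return seq

OPERATORS = {0: "add", 1: "subtract", 2: "xor"}

def select_operators(y0, beta, n):
    """One operator name per DNA symbol, driven by the logistic map."""
    s = [round(v * 1000) for v in logistic_map(y0, beta, n)]
    return [OPERATORS[v % 3] for v in s]

ops = select_operators(0.41, 3.91, 8)
print(ops)  # a key-dependent mix of 'add', 'subtract', 'xor'
```

Because the sequence is deterministic for a given (y0, β), the decryption side regenerates the same operator schedule and applies the inverse operations.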

3.2 Decryption

This stage performs the inverse of the encryption operations discussed in Sect. 3.1.
The chaotic maps are generated with the same control values as those used in the
confusion and diffusion processes of the encryption approach.

4 Results and Discussion

The proposed encryption process is evaluated using information entropy (I en ),
histogram, and correlation parameters. The simulations are performed with Python
and associated libraries on a Windows 7, Core i5 system. Performance is evaluated
on test images obtained from http://sipi.usc.edu/database/; the Lena, Sailboat,
Baboon, Pepper, House, and Couple images, among others, are used to verify the
performance.

4.1 Information Entropy (Ien )

I en is a well-established parameter for analyzing the randomization capabilities and
properties of any encryption approach, and it also reflects the robustness of an
encryption approach [5]. The I en analysis of the proposed approach on various
images is shown in Table 5. The I en of the original images changes significantly
after the encryption process, while after decryption it is exactly equal to the I en of
the cover images.
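I en is the Shannon entropy of the pixel distribution; for an 8-bit image it is bounded above by 8 bits, which a perfectly uniform cipher image would attain. A minimal computation:

```python
import math
from collections import Counter

def information_entropy(pixels):
    """Shannon entropy in bits of a sequence of 8-bit pixel values."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# a perfectly uniform 8-bit source attains the maximum of 8 bits
uniform = list(range(256)) * 4
print(information_entropy(uniform))  # 8.0
```

The encrypted-image entropies in Table 5 (e.g. 7.9305 for Baboon) approach this 8-bit ceiling, indicating near-uniform pixel distributions.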

Table 5 Entropy evaluation of images

Image      Original   Encrypted   Decrypted
Lena       7.3530     7.9167      7.3530
Sailboat   7.4842     7.8675      7.4842
Baboon     7.3583     7.9305      7.3583
Pepper     7.5936     7.9174      7.5936
House      7.2334     7.5480      7.2334
Couple     7.3583     7.8446      7.3583


Fig. 3 a and b Cover images and histograms, respectively; c and d encrypted images and
histograms, respectively

4.2 Histogram Analysis

Images and their histograms before and after encryption are shown in Fig. 3. The
histograms of the original images show large variations, whereas the histograms of
all the encrypted images are flat and clearly different from those before encryption.
This indicates that the process can withstand statistical attacks and that eavesdroppers
cannot extract information through histogram analysis.

4.3 Correlation Analysis

For the correlation evaluation, 30,000 randomly chosen pairs of adjacent pixels are
selected from the cover and encrypted images for horizontal, vertical, and diagonal correlation

Table 6 Correlation coefficients’ evaluation of proposed encryption process

Image      Horizontal   Vertical   Diagonal   Horizontal    Vertical      Diagonal
           (cover)      (cover)    (cover)    (encrypted)   (encrypted)   (encrypted)
Baboon     0.8665       0.7587     0.7262     −0.0116       −0.0147       0.0291
Peppers    0.9768       0.9792     0.9639     −0.005096     −0.00626      0.010403
Splash     0.9839       0.9913     0.9773     −0.0183       −0.0192       0.04007
House      0.9485       0.9578     0.9135     −0.00325      −0.00099      −0.00196
Airplane   0.9571       0.9366     0.8927     −0.0277       −0.0219       −0.0215
Couple     0.9371       0.8926     0.8557     −0.0115       −0.00947      0.0212
Male       0.9774       0.9813     0.9671     −0.00454      −0.00674      −0.003996
Sailboat   0.9751       0.9715     0.9578     −0.00386      −0.001889     −0.00293
Clock      0.9565       0.9741     0.9389     −0.0423       −0.04509      −0.05007

Table 7 Correlation coefficient comparison with other existing approaches

Approach    Horizontal   Vertical   Diagonal
Ref. [15]   −0.0017      −0.0004    0.0028
Ref. [16]   0.0064       0.0003     0.0026
Ref. [17]   −0.0008      −0.0013    0.0018
Proposed    −0.0211      −0.0172    −0.0116

coefficient examination (Table 6). The cover images exhibit high correlation, which
decreases markedly after the encryption process. Table 7 compares the correlation
coefficients of our approach with those of other existing approaches for the same
image. The correlation coefficients of the proposed approach are close to zero and
comparable with those of the other approaches.
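The coefficient reported in Tables 6 and 7 is the Pearson correlation of adjacent-pixel pairs; a minimal sketch:

```python
import math

def adjacent_correlation(pairs):
    """Pearson correlation coefficient of (pixel, neighbour) pairs."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# in a smooth cover image neighbours are nearly equal, so r is close to 1
cover_pairs = [(v, v + 1) for v in range(100)]
print(adjacent_correlation(cover_pairs))
```

A well-encrypted image should drive this coefficient toward zero in all three directions, as the encrypted columns of Table 6 show.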

5 Conclusions

A hybrid, robust, and secure encryption methodology using chaotic theory and
DNA-based encoding and operations is proposed. The pseudo-random features of
chaotic maps and the DNA operations are utilized to build the confusion and diffusion
stages of the proposed approach. The chaotic map-based confusion mechanism
scrambles the cover signal, and the scrambled signal is then passed to the diffusion
stage, which applies DNA encoding and operations. The performance of the approach
is examined and evaluated using various metrics, and comparison with the latest
state-of-the-art methods demonstrates the data security capabilities of the proposed
technique.

References

1. Kaur R, Singh B (2021) A novel approach for data hiding based on combined application of
discrete cosine transform and coupled chaotic map. Multimedia Tools Appl 80:14665–14691
2. Fang P, Liu H, Wu C, Liu M (2022) A survey of image encryption algorithms based on chaotic
systems. Vis Comput 1–29
3. Wang X, Zhao M (2021) An image encryption algorithm based on hyperchaotic system and
DNA coding. Opt Laser Technol 143
4. Mansouri A, Wang X (2021) A novel one-dimensional chaotic map generator and its application
in a new index representation-based image encryption scheme. Inf Sci 563:91–110
5. Lu Q, Zhu C, Deng X (2020) An efficient image encryption scheme based on the LSS chaotic
map and single S-box. IEEE Access 8:25664–25678
6. Wang X, Zhang M (2021) A new image encryption algorithm based on ladder transformation
and DNA coding. Multimedia Tools Appl 80(9):13339–13365
7. Kaur R, Singh B (2023) Robust image encryption algorithm in DWT domain. Multimedia
Tools Appl. https://doi.org/10.1007/s11042-023-16985-4
8. Guesmi R, Farah MAB, Kachouri A, Samet M (2016) A novel chaos-based image encryption
using DNA sequence operation and secure hash algorithm SHA-2. Nonlinear Dyn 83(3):1123–
1136
9. Fridrich J (1998) Symmetric ciphers based on two-dimensional chaotic maps. Int J Bifurcat
Chaos 8(6):1259–1284
10. Chen J, Zhu ZL, Zhang LB, Zhang Y, Yang BQ (2018) Exploiting self-adaptive permutation–
diffusion and DNA random encoding for secure and efficient image encryption. Signal Proc
142:340–353
11. Belazi A, Talha M, Kharbech S et al (2019) Novel medical image encryption scheme based on
chaos and DNA encoding. IEEE Access 7:36667–36681
12. Jarjar A (2022) Vigenere and genetic cross-over acting at the restricted ASCII code level for
color image encryption. Med Biol Eng Compu 60:2077–2093
13. Man Z, Li J, Di X, Sheng Y, Liu Z (2021) Double image encryption algorithm based on neural
network and chaos. Chaos Solitons Fractals 152
14. Xu Q, Sun K, Zhu C (2020) A visually secure asymmetric image encryption scheme based on
RSA algorithm and hyperchaotic map. Phys Scr 95(3):035223
15. Wang X, Guan N (2020) A novel chaotic image encryption algorithm based on extended
zigzag confusion and RNA. Opt Laser Technol 131:106366
16. Xu Q, Sun K, He S, Zhu C (2020) An effective image encryption algorithm based on
compressive sensing and 2D-SLIM. Opt Lasers Eng 134:106178
17. Cao W, Mao Y, Zhou Y (2020) Designing a 2D infinite collapse map for image encryption.
Signal Process 171:107457
An Empirical Study on Comparison
of Machine Learning Algorithms
for Eye-State Classification
Using EEG Data

N. Priyadharshini Jayadurga, M. Chandralekha, and Kashif Saleem

Abstract Brain–computer interface (BCI) is an emerging avenue that has found
applications in a variety of fields. A BCI device collects brain waves from individuals
in the form of an electroencephalogram (EEG). The classification of eye-states,
closed or open, using EEG data plays a crucial role in a variety of applications. This
research endeavor is an empirical study of popular machine learning algorithms
(logistic regression, the ElasticNet classifier, and support vector machines with
diverse kernels) to assess their efficiency in eye-state classification. The dataset
comprised continuous EEG measurements collected from individuals with the
Emotiv EEG Neuroheadset. The data was preprocessed to suit the classification
algorithms. The findings revealed that the support vector machine (SVM) with a
radial basis function (RBF) kernel was robust in handling complex EEG data, logistic
regression promoted interpretability, and the ElasticNet classifier offered a balanced
approach. The accuracy of the SVM with the RBF kernel was 77%, while the
accuracies of logistic regression and the ElasticNet classifier were found to be 57.2%
and 57.8%, respectively.

Keywords Electroencephalogram (EEG) · Logistic regression · ElasticNet
classifier · Support vector machines · Kernels

N. Priyadharshini Jayadurga · M. Chandralekha (B)
Department of Computer Science and Engineering, Amrita School of Computing,
Amrita Vishwa Vidyapeetham, Chennai, India
e-mail: m_chandralekha@ch.amrita.edu
N. Priyadharshini Jayadurga
e-mail: np_jayadurga@ch.students.amrita.edu
K. Saleem
King Saud University, Riyadh, Saudi Arabia
e-mail: ksaleem@ksu.edu.sa

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 113
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_10
114 N. Priyadharshini Jayadurga et al.

1 Introduction

In the contemporary era, neuroscience, machine learning, and assistive technology
have converged in the quest to enhance the lives of disabled persons. Assistive
devices play a pivotal part in amplifying the lives and independence of persons with
cognitive differences. These devices are developed to bridge the divide between a
person’s abilities and the physical challenges they face, and they enrich and empower
disabled people by aiding their day-to-day tasks. BCI has impacted most industries
positively [1, 2]. It has been observed that people with severe paralysis communicate
through eye gestures and blinks or by using their brain signals to register EEG
insights [3]. Brain–computer interface (BCI) has sprung up as a groundbreaking
technology in the field of assistive technology, with the potential to transform brain
signals into device control. BCI bridges the signals generated by the brain and the
machines that execute these signals as actions [4]. BCI technology translates brain
signals into useful commands for carrying out an action [5]. In this research work,
the focus is on classifying eye-states as either closed or open using machine learning
algorithms, which have the strength to bring out hidden patterns [6]. An empirical
study has been carried out using notable machine learning algorithms, namely
logistic regression, the ElasticNet classifier, and all kernels of the support vector
machine (SVM), to identify the best-performing algorithm for the given problem.
The primary objectives of this research work are as follows:
1. Comparison of the performance of three distinct algorithms, namely logistic
regression, the ElasticNet classifier, and support vector machines, on eye-state
classification data.
2. Evaluation of the algorithms based on key indicators of performance.
3. Identification of the benefits and limitations of each of the implemented algorithms
in the context of eye-state classification.
This research paper is organized as follows: Sect. 2 discusses the existing works in
the proposed area of research, Sect. 3 states the proposed methodology for the
problem at hand, and Sect. 4 presents the results of the experimentation.
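The RBF kernel behind the best-performing SVM maps a pair of feature vectors to K(x, z) = exp(-gamma * ||x - z||^2); the vectors and gamma below are illustrative, not values from the study:

```python
import math

def rbf_kernel(x, z, gamma=0.01):
    """Radial basis function kernel: exp(-gamma * squared distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# two hypothetical EEG feature vectors (e.g. per-channel amplitudes)
x = [4.3, 4.6, 4.2, 4.1]
z = [4.1, 4.8, 4.0, 4.4]
print(rbf_kernel(x, z))
```

The kernel returns 1 for identical vectors and decays toward 0 as they separate, letting the SVM draw nonlinear boundaries in the original EEG feature space.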

2 Literature Survey

The paper [7] uses a time-domain linear filtering technique in which eye-blink
signals are first estimated with a multichannel Wiener filter (MWF) and a subset of
frontal electrodes. The estimate of these signals is subtracted from the noisy EEG
signal using the principle of regression analysis. The authors also used independent
component analysis (ICA), and it was found that the MWF-based method produced
better results in shorter time frames. A study [8] suggests real-time detection of
random eye-state changes using EEG signals. In this work, the last two seconds of
the signal are
An Empirical Study on Comparison of Machine Learning Algorithms … 115

extracted using multivariate empirical mode decomposition to obtain the relevant
features. The extracted features are fed to logistic regression and artificial neural
network classifiers to confirm the eye-state change, achieving an accuracy of 88.2%.
A research work [9] investigates a measurable set of features for predicting the
visual comfort level of stereoscopic images, observing gaze, pupillometric, and EEG
data from 28 subjects who evaluated the comfort level of 120 stereoscopic images.
Six different time-frame windows were used to analyze four measured features,
namely the number of focus points, the dynamics of pupil size, the disparity level at
the focus points, and the activity of the EEG bands at the frontal lobe. The analysis
demonstrated a significant difference across all the stereoscopic image groups. The
paper [10] aims at improving EEG-based
biometric authentication using eye-blinking EOG signals, which are a source of artifacts. The data used for this research consisted of EEG signals collected from 31 subjects using the Neurosky Mindwave headset while they performed three different tasks: relaxation, visual stimulation, and eye blinking. In this work, density-based and canonical correlation analysis strategies are applied for score-level and feature-level fusions. Feature extraction was done using autoregressive modeling of EEG signals along with time delineation of the eye-blinking waveform. Classification was performed using linear discriminant analysis, which promoted significant improvement in recognition and reduction of
equal error rates. A study [11] illustrates how eye movements can introduce aberrations into EEG readings and offers a solution based on matching eye movement characteristics across experimental conditions. The technique used for matching eye movements to the relevant EEG activity is addressed in this study. To address the baseline selection problem, the EEG data is also segmented into saccade-related epochs that are relative to saccade and fixation onsets. This approach supports the co-registration of eye movements and EEG in various areas of cognitive neuroscience. The paper [12] proposes
to identify autistic children by extracting features from EEG and eye-tracking data using support vector machines (SVMs). This study used data collected from 97 children aged between 3 and 6. It used minimum redundancy maximum relevance (MRMR) feature selection combined with SVM classifiers for the classification of autistic children and normal children, with an accuracy of 85.44% and an AUC of 93% by selecting 32 features. A research endeavor [13] compares polynomial-based PCA, KPCA,
LDA, and GDA feature extraction methods for the classification of epileptic and eye-state EEG signals utilizing kernel machines. This work uses the publicly available Bonn University database and compares the performance of a simple multilayer perceptron neural network (sMLPNN) and a least-squares support vector machine (LS-SVM) on the extracted features. The experimental results showed that the features extracted with kernel methods are more discriminative than those from standard methods. In this study [14], a hybrid deep learning architecture
has been proposed for EEG analysis. It uses CNN filters to extract multi-domain features, and a feature reduction technique was used to remove redundant features. The proposed method was incorporated into an IoT-based BCI prototype and demonstrated an accuracy of 97%.
116 N. Priyadharshini Jayadurga et al.

Fig. 1 Flowchart of the proposed study

3 Proposed System

Eye-state classification using EEG data is a crucial component in developing BCI applications. The proposed system is aimed at implementing an integrated EEG-based eye-state classification pipeline composed of logistic regression, an ElasticNet classifier, and support vector machines (SVMs) with varying kernels. This work emphasizes feature selection, implementation of the stated algorithms, and evaluation of these algorithms based on diverse metrics. These algorithms were chosen based on their ability to handle the dataset, which comprised mostly frequency values. The proposed study is of great importance as it could be used in several applications such as detecting driver drowsiness, designing assistive technologies for differently-abled individuals, and emotion analysis. Figure 1 demonstrates an overview of the envisioned system.

4 Experimental Results

4.1 Dataset Description

The dataset was obtained from the Kaggle website. It was recorded in a single continuous 117-second EEG measurement conducted with the Emotiv EEG Neuroheadset. During the measurement, the eye-state was captured by a camera and later added manually to the data for analysis. EEG signals from 14 electrodes (AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, and AF4) constituted the dataset. The target column specifies whether the eyes were closed (1) or open (0).

4.2 Data Preprocessing

The dataset comprises the frequency values of the collected EEG data. The data was first preprocessed by visualizing it as an EEG signal. The sampling rate of the signal, computed using Eq. 1, is 128 Hz.

FS = Number of samples / Duration of EEG measurement        (1)

The outliers were identified and replaced with NaN, and the values were recalculated while ignoring the NaNs to remove bias from the dataset. Then, independent component analysis (ICA) was applied to the data to separate the multivariate signals into their additive subcomponents. This step was implemented to drop the non-electrophysiological components. Once this was done, the signals were reconstructed to obtain clean EEG with the bad components dropped. Next, the data between 8 and 12 Hz was isolated using a band-pass filtering technique, and the alpha waves were subsequently extracted for further investigation. The filtered data was subjected to correlation testing, and the highly correlated features were dropped from the dataset.
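The band-pass step above can be sketched as follows. This is a minimal illustration on a synthetic one-channel signal, assuming a zero-phase Butterworth design; the ICA and outlier steps are omitted, and the filter order is an assumption, not the authors' actual configuration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128  # sampling rate from Eq. 1: 14980 samples / 117 s ≈ 128 Hz

def alpha_band(signal, fs=FS, low=8.0, high=12.0, order=4):
    """Zero-phase band-pass filter keeping the 8-12 Hz alpha band."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, signal)

# Synthetic one-channel "EEG": a 10 Hz alpha component plus 30 Hz interference.
t = np.arange(0, 10, 1 / FS)
alpha = np.sin(2 * np.pi * 10 * t)
noise = np.sin(2 * np.pi * 30 * t)
filtered = alpha_band(alpha + noise)
# The in-band 10 Hz component survives; the 30 Hz component is strongly attenuated.
```

`filtfilt` runs the filter forward and backward, which removes phase distortion at the cost of doubling the effective attenuation.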

4.3 Algorithm Implementation

The preprocessed data was then used for the implementation of the stated algorithms
like logistic regression, ElasticNet classifier, and support vector machines using var-
ious kernels.
Logistic Regression Designing methods for artifact detection is essential and is the need of the hour [15–17]. Logistic regression is a supervised machine learning technique applicable to binary classification problems like the one at hand. The data was split into training and testing sets in the ratio 70:30. Logistic regression relates the features, i.e., the electrodes, to the probability of the outcome, i.e., whether the eye is closed or open. The logistic function is given in Eq. 2

S(z) = 1 / (1 + e^(−z))        (2)

where
S(z) is the predicted probability of the eye being closed (1),
e is the base of the natural logarithm, and
z is the linear combination of features and their weights.
The linear combination of features and their weights, z, is calculated as in Eq. 3.

z = β0 + β1x1 + β2x2 + · · · + βpxp        (3)

Fig. 2 Confusion matrix of eye-state classification using logistic regression

where β0 refers to the intercept term, and β1, β2, …, βp refer to the weights of the features (electrodes) x1, x2, …, xp.
Logistic regression was applied to the dataset, and an accuracy of 57% was obtained. The precision, recall, and F1 score of logistic regression for eye-state classification are 0.54, 0.32, and 0.40, respectively. The confusion matrix for eye-state classification using logistic regression is shown in Fig. 2.
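Equations 2 and 3 can be combined into a minimal gradient-descent sketch. The single toy feature below stands in for the 14 electrode features and is purely illustrative; this is not the fitting procedure or data split used in the study.

```python
import numpy as np

def sigmoid(z):
    """Logistic function of Eq. 2: S(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Gradient-descent fit of the weights in Eq. 3 (beta0 as a bias column)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend column for beta0
    beta = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = sigmoid(Xb @ beta)                      # predicted P(eye closed)
        beta -= lr * Xb.T @ (p - y) / len(y)        # gradient of the log-loss
    return beta

def predict(X, beta, threshold=0.5):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(Xb @ beta) >= threshold).astype(int)

# Toy stand-in for the electrode features: one separable feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)     # label "closed" when the feature is positive
beta = fit_logistic(X, y)
acc = (predict(X, beta) == y).mean()
```

Thresholding S(z) at 0.5 is equivalent to testing the sign of the linear combination z, which is what makes logistic regression a linear classifier.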
ElasticNet Classifier The ElasticNet classifier is a regularization technique commonly used for classification problems [2, 18]. It works by combining the L1 (Lasso) and L2 (Ridge) regularization terms. The objective function of the ElasticNet classifier is mathematically denoted in Eq. 4.

J(β) = (1/2n) ∑_{i=1}^{n} (hβ(x^(i)) − y^(i))^2 + α (ρ ∑_{j=1}^{p} |βj| + ((1 − ρ)/2) ∑_{j=1}^{p} βj^2)        (4)

where
J(β) is the loss function minimized by the ElasticNet classifier,
β are the weights of the features (electrodes),
n is the number of samples (14980),
p is the number of features (10 after preprocessing),
hβ(x^(i)) is the prediction for the i-th data point,
y^(i) is the actual target of the i-th data point,
α is the regularization parameter, and
ρ is the parameter that balances the L1 and L2 regularization.

Fig. 3 Confusion matrix of ElasticNet classifier for eye-state classification

The algorithm works by minimizing the objective function stated in Eq. 4. EEG-based evaluation of BCI devices should be carried out carefully [19, 20]. When applying the ElasticNet classifier to the dataset, the accuracy for eye-state classification was found to be 57.8%. The precision, recall, and F1 score of the ElasticNet classifier for eye-state classification are 0.56, 0.31, and 0.40, respectively. The confusion matrix depicting the correct and incorrect predictions of eye-states is shown in Fig. 3.
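The objective in Eq. 4 can be written down directly. This is a minimal numpy sketch assuming a linear model for hβ; the matrix and weight values are illustrative, not taken from the study.

```python
import numpy as np

def elastic_net_loss(beta, X, y, alpha=0.1, rho=0.5):
    """Objective of Eq. 4: squared-error term plus blended L1/L2 penalties.

    rho = 1 reduces the penalty to pure Lasso (L1); rho = 0 to pure Ridge (L2).
    """
    n = len(y)
    residual = X @ beta - y                 # h_beta(x) - y for a linear model
    data_term = (residual @ residual) / (2 * n)
    l1_term = np.sum(np.abs(beta))
    l2_term = np.sum(beta ** 2)
    return data_term + alpha * (rho * l1_term + (1 - rho) / 2 * l2_term)

# Tiny check of the penalty blend on fixed weights (the residuals are zero
# here, so only the regularization terms contribute).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0])
beta = np.array([1.0, 1.0])
loss = elastic_net_loss(beta, X, y, alpha=0.1, rho=0.5)  # 0.1 * (1.0 + 0.5) = 0.15
```

Sweeping ρ between 0 and 1 interpolates between Ridge's weight shrinkage and Lasso's feature selection, which is the point of the blend in Eq. 4.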
Support Vector Machines The support vector machine (SVM) is a machine learning algorithm widely used for classification problems. It works by finding the optimal hyperplane that best separates the two classes in the case of binary classification [21]. Various kernel functions were applied to the data, and the performance was evaluated for each. The SVM algorithm with the radial basis function (RBF) kernel was found to outperform the other kernel functions.
The accuracy of SVM using various kernels for eye-state classification is expressed in Table 1. The confusion matrix for each of the kernel functions is shown in Fig. 4. The F1 score, precision, and recall values for the various kernels of SVM are demonstrated in Table 2.
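The four kernels compared in this work can be sketched as similarity functions between feature vectors. The gamma and coef0 defaults below are illustrative, not the hyperparameters used in the study.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, degree=3, coef0=1.0):
    return (x @ z + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function: similarity decays with squared distance."""
    d = x - z
    return np.exp(-gamma * (d @ d))

def sigmoid_kernel(x, z, gamma=0.5, coef0=0.0):
    return np.tanh(gamma * (x @ z) + coef0)

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
self_sim = rbf_kernel(x, x)    # a point is always maximally similar to itself: 1.0
cross_sim = rbf_kernel(x, z)   # similarity drops as the points move apart
```

The RBF kernel's locality (full similarity at zero distance, smooth decay elsewhere) lets the SVM fit non-linear class boundaries, which is consistent with it outperforming the linear and sigmoid kernels here.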
Comparison of the Results Binary classification algorithms are evaluated using
recall (REC), specificity (SPEC), precision (PREC), F1-score, and area under the
curve (AUC) [22, 23]. The three algorithms were compared on the basis of accuracy,

Table 1 Accuracy of various kernel functions of SVM for eye-state classification using EEG data

Kernel function Accuracy (in %)
Linear 53.6
Sigmoid 50.2
Polynomial 65.8
Radial basis function 77.3

Fig. 4 Confusion matrix of SVM using various kernel functions for eye-state classification: (a) linear kernel, (b) sigmoid kernel, (c) polynomial kernel, (d) radial basis function kernel

Table 2 Precision, recall, and F1 score of SVM

Kernel Precision Recall F1 score
Linear 0.68 0.12 0.20
Sigmoid 0.45 0.45 0.45
Polynomial 0.85 0.37 0.52
Radial basis function 0.81 0.68 0.74

precision, recall, F1 score, and confusion matrix. The accuracy, precision, recall, and F1 score comparison for the chosen algorithms is depicted in Table 3. It was found that the support vector machine (SVM) with the radial basis function (RBF) kernel outperformed the other algorithms that were considered in the study. This suggests that the SVM with RBF kernel can provide the best insights for the problem at hand.
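The reported scores can be cross-checked from the standard metric definitions. As a consistency check, the SVM-RBF row of Table 2 (precision 0.81, recall 0.68) rounds to its reported F1 score of 0.74.

```python
# Metric definitions used in the comparison, computed from confusion-matrix
# counts: tp (true positives), fp (false positives), fn (false negatives).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(prec, rec):
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)

# Cross-check against the reported SVM-RBF row of Table 2.
f1 = f1_score(0.81, 0.68)  # rounds to the reported 0.74
```

Because F1 is a harmonic mean, it is dragged down by whichever of precision or recall is worse, which is why the linear kernel's F1 of 0.20 is so low despite its precision of 0.68.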

Table 3 Comparison of algorithms using evaluation metrics

Algorithm Accuracy (in %) Precision Recall F1 score
Logistic regression 57.0 0.54 0.32 0.40
ElasticNet classifier 57.8 0.56 0.31 0.40
SVM—Linear 53.6 0.68 0.12 0.20
SVM—Sigmoid 50.2 0.45 0.45 0.45
SVM—Polynomial 65.8 0.85 0.37 0.52
SVM—Radial basis function 77.3 0.81 0.68 0.74

5 Conclusion

The empirical study on the comparison of logistic regression, the ElasticNet classifier, and support vector machines for the classification of eye-states using EEG data has shed light on the advantages and disadvantages of the three algorithms. The experimentation and assessment of these algorithms have allowed us to gain insights into their relative performance. These algorithms were chosen based on their ability to effectively perform binary classification, and the choice among them should be made based on the distinct characteristics of the EEG data. This study demonstrated an accuracy of 77.3% using the SVM algorithm with the required preprocessing steps. Assistive technology and other healthcare applications require real-time classification of eye-states based on EEG data to provide sound guidance for practitioners and researchers. The study could be further enhanced by the development of a novel algorithm for improving performance and providing optimal results for the problem at hand.

References

1. Maiseli B, Abdalla AT, Massawe LV, Mbise M, Mkocha K, Nassor NA, Ismail M, Michael J, Kimambo S (2023) Brain-computer interface: trend, challenges, and threats
2. Hassouneh A, Mutawa AM, Murugappan M (2020) Development of a real-time emotion recognition system using facial expressions and EEG based on machine learning and deep neural network methods. Inf Med Unlocked 20:100372
3. Sirvent Blasco JL, Iáñez E, Úbeda A, Azorín JM (2012) Visual evoked potential-based brain-machine interface applications to assist disabled people. Expert Syst Appl 39:7908–7918
4. Tiwari N, Edla DR, Dodia S, Bablani A (2018) Brain computer interface: a comprehensive survey. Biologically Inspired Cogn Archit 26:118–129
5. Mudgal SK, Sharma SK, Chaturvedi J, Sharma A (2020) Brain computer interface advancement in neurosciences: applications and issues
6. Radhika N, Bhavani KD (2020) K-means clustering using nature-inspired optimization algorithms-a comparative survey. Int J Adv Sci Technol 29(6s):2466–2472
7. Borowicz A (2018) Using a multichannel Wiener filter to remove eye-blink artifacts from EEG data. Biomed Signal Process Control 45:246–255

8. Saghafi A, Tsokos CP, Goudarzi M, Farhidzadeh H (2017) Random eye state change detection in real-time using EEG signals. Expert Syst Appl 72:42–48
9. Abromavičius V, Serackis A (2018) Eye and EEG activity markers for visual comfort level of images. Biocybernetics Biomed Eng 38:810–818
10. Abo-Zahhad M, Ahmed SM, Abbas SN (2016) A new multi-level approach to EEG based human authentication using eye blinking. Pattern Recogn Lett 82:216–225
11. Nikolaev AR, Meghanathan RN, van Leeuwen C (2016) Combining EEG and eye movement recording in free viewing: pitfalls and possibilities. Brain Cognition 107:55–83
12. Kang J, Han X, Song J, Niu Z, Li X (2020) The identification of children with autism spectrum disorder by SVM approach on EEG and eye-tracking data. Comput Biol Med 120:103722
13. Nkengfack LCD, Tchiotsop D, Atangana R, Tchinda BS, Louis-Door V, Wolf D (2021) A comparison study of polynomial-based PCA, KPCA, LDA and GDA feature extraction methods for epileptic and eye states EEG signals detection using kernel machines. Inf Med Unlocked 26:100721
14. Medhi K, Hoque N, Dutta SK, Hussain MI (2022) An efficient EEG signal classification technique for brain-computer interface using hybrid deep learning. Biomed Signal Process Control 78:104005
15. Wang M, Cui X, Wang T, Jiang T, Gao F, Cao J (2023) Eye blink artifact detection based on multi-dimensional EEG feature fusion and optimization. Biomed Signal Process Control 83:104657
16. Nilashi M, Abumalloh RA, Ahmadi H, Samad S, Alghamdi A, Alrizq M, Alyami S, Nayer FK (2023) Electroencephalography (EEG) eye state classification using learning vector quantization and bagged trees. Heliyon 9:e15258
17. Santamaría-Vázquez E, Martínez-Cagigal V, Pérez-Velasco S, Marcos-Martínez D, Hornero R (2022) Robust asynchronous control of ERP-based brain-computer interfaces using deep learning. Comput Methods Programs Biomed 215:106623
18. Alkatheiri MS (2022) Artificial intelligence assisted improved human-computer interactions for computer systems. Comput Electr Eng 101:107950
19. Yohanandan S, Kiral-Kornek I, Tang J, Mashford BS, Asif U, Harrer S (2018) A robust low-cost EEG motor imagery-based brain-computer interface
20. Aswiga RV, Karpagam M, Chandralekha M, Kumar CS, Selvi M, Deena S (2023) An automatic detection and classification of diabetes mellitus using CNN. Soft Comput 27(10):6869–6875
21. Mageshwari G, Chandralekha M, Chaudhary D (2023) Underwater image re-enhancement with blend of simplest color balance and contrast limited adaptive histogram equalization algorithm. In: 2023 international conference on advancement in computation & computer technologies (InCACCT), pp 501–508
22. Kamble A, Ghare P, Kumar V (2022) Machine-learning-enabled adaptive signal decomposition for a brain-computer interface using EEG. Biomed Signal Process Control 74:103526
23. Punsawad Y, Siribunyaphat N, Wongsawat Y (2021) Exploration of illusory visual motion stimuli: an EEG-based brain-computer interface for practical assistive communication systems. Heliyon 7:3
Decoding the UK’s Stance on AI: A Deep Dive into Sentiment and Topics in Regulations

Dwijendra Nath Dwivedi and Ghanashyama Mahanty

Abstract Artificial intelligence (AI) is an innovative and remarkable technical


advancement that has become an integral part of our lives, influencing every aspect
of our existence. It is altering the structure of our everyday schedules and the way we
operate in our professional environments. As we acclimate and gain further knowl-
edge about this technology, it is imperative to acknowledge its profound impact on
our lives. Due to the significant possible effects, it is crucial to have a deep under-
standing of its implications and be ready for any unexpected outcomes. It is essential
to have regulatory guidance and proactive oversight in place for artificial intelligence.
The UK, as a leading global entity, has taken a proactive stance in tackling both the
ethical and operational aspects of AI. This study examines the legislative frameworks
connected to artificial intelligence (AI) in the UK utilizing advanced approaches such
as sentiment analysis and topic modeling. Our analysis reveals the UK’s balanced strategy toward AI, carefully weighing its advantages against its obstacles. Important regulatory topics encompass ethics, data protection, transparency, and economic advancement. The sentiment analysis reveals a predominantly positive perspective while emphasizing the importance of the responsible use of AI. This report illuminates the UK’s position on AI regulation and serves as a benchmark for other regions to assess their AI initiatives.

Keywords AI ethics · Responsible AI · High-risk AI · Sentiment analytics · UK regulation · Text clustering · Topic modeling

D. N. Dwivedi (B)
SAS Middle East FZ-LLC, Dubai Media City-Business Central Towers, Dubai, UAE
e-mail: dwivedy@gmail.com
G. Mahanty
Department of Analytical and Applied Economics, Utkal University, Orissa, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 123
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_11
124 D. N. Dwivedi and G. Mahanty

1 Introduction

The field of artificial intelligence (AI) has witnessed an unparalleled boom in the
last decade. The field has evolved from abstract concepts and basic implementations
to a fundamental cornerstone of contemporary technology environments. The rapid
increase in popularity can be credited to the combination of variables like improved
computer capabilities, access to extensive and varied datasets, and advancements in
machine learning methods. The utilization of machine learning, specifically deep
learning, has played a crucial role in empowering artificial intelligence systems
to analyze information in manners that were previously believed to be limited to
human cognition. The powers of AI have greatly multiplied. Advancements in natural
language processing have led to the development of advanced digital assistants,
while progress in computer vision has allowed for the implementation of real-world
applications such as facial recognition and driverless vehicles. It has been incorpo-
rated into various aspects of healthcare, such as diagnostics, financing, and supply
chain optimization, highlighting its potential to bring about significant changes. The
widespread availability of AI tools and platforms has sparked a surge of innovation,
enabling a larger group of developers and academics to contribute to its advancement.
In the early years of the twenty-first century, the combination of rapidly advancing
digital technology and large amounts of data led to a worldwide emphasis on using
cognitive methods to gain an advantage. The emphasis on this aspect accelerated the
development of what is now acknowledged as artificial intelligence, as theorized by
Nilsson [1]. John McCarthy is well recognized and appreciated as a pioneering figure
in the field of artificial intelligence (AI). He defined AI as the ‘craft and discipline
of creating intelligent machines’ [2].
AI customizes experiences for individual users, whether through curated recom-
mendations on media platforms or digital assistants on our phones. It has reori-
ented the technological environment to prioritize the user. Artificial intelligence (AI)
augmented technologies assist clinicians in detecting ailments at an earlier stage and
with greater precision. The technologies for autonomous vehicles present a vision
of roadways that are safer and more efficient. Intelligent learning systems utilize
artificial intelligence to tailor lessons to the specific requirements of each learner,
providing a more personalized learning experience. The rapid rise of artificial intel-
ligence (AI) across several industries underscores the urgent requirement for its
supervision. Artificial intelligence tools are increasingly prevalent in critical indus-
tries such as health care, finance, and transportation. Ensuring their trustworthiness,
impartiality, and security is crucial. Artificial intelligence systems have the potential
to unintentionally reflect the biases present in the data they are trained on if not
adequately monitored. There is a potential for the occurrence of unjust outcomes
that could negatively impact specific demographics. The intricate internal mecha-
nisms of certain AI systems are frequently referred to as ‘black boxes’. It prompts
inquiries regarding transparency. The concerns over the security of data and the
potential for AI to be misused in areas such as espionage, dissemination of false
Decoding the UK’s Stance on AI: A Deep Dive into Sentiment … 125

information, and weapon systems highlight the critical need for government engage-
ment. Establishing regulations is not intended to impede advancement, but rather to
guarantee that AI develops in a secure and morally sound manner. It should uphold
public trust and prioritize the welfare of society as a whole. Nevertheless, it is not solely about ease and efficiency. AI also presents ethical and societal challenges, including worries over data privacy, the displacement of jobs caused by automation, and the imperative for the responsible development of artificial intelligence. With the continuous growth and evolution of AI, it is
becoming increasingly imperative to establish clear guidelines to ensure its ethical
and prudent utilization. The UK, known for its progressive regulatory approach, plays
a crucial role in shaping the global discourse on AI. The country’s stance on AI not
only influences its domestic policy but also serves as a paradigm for other nations
to contemplate. Given the UK’s significant influence, it is crucial for researchers,
decision-makers, and industry specialists to understand its perspective on AI. The
objective of this study is to comprehensively comprehend and illuminate the senti-
ments and primary domains of interest in the UK regarding artificial intelligence.
By employing sophisticated methodologies for attitude analysis and issue identifi-
cation, we thoroughly examine regulatory documents to extract the UK’s approach,
aspirations, and concerns around artificial intelligence. Our objective is to provide
a clearer and more comprehensive understanding of the regulations governing AI,
enabling individuals to engage with AI in a more transparent and informed manner.
The main objective of this study is to examine and extract the attitudes expressed in the UK’s proposed AI law document, as well as to perform topic modeling. By utilizing the Naïve Bayes methodology, the research aims to identify both positive and negative emotional tones present in the document. Companies often utilize sentiment analysis tools to measure sentiment in social media material, assess brand perception, and acquire insights into client emotions. These models generally classify information based on polarity, such as positive, negative, or neutral. Additionally, they analyze specific emotions, such as anger, happiness, and sadness, as well as urgency levels and user intent. The purpose of this study is to assess the prevailing sentiment inclination, namely whether most utterances express optimism or pessimism about the subject. Simultaneously, the research also seeks to determine the main types and subdivisions within these favorable and unfavorable sentiments.
(a) What is the distribution of favorable and unfavorable opinions within the UK’s AI regulatory document?
(b) What are the predominant categories within these positive and negative sentiments?
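The Naïve Bayes polarity step can be sketched as a tiny multinomial classifier with Laplace smoothing. The training snippets below are illustrative inventions, not drawn from the regulation text, and this is a simplified stand-in for the methodology, not the study's actual model.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Multinomial Naive Bayes with Laplace smoothing over word counts."""
    class_docs = defaultdict(int)
    class_words = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_docs[label] += 1
        class_words[label].update(words)
        vocab.update(words)
    total_docs = sum(class_docs.values())
    model = {}
    for label, n_docs in class_docs.items():
        counts = class_words[label]
        denom = sum(counts.values()) + len(vocab)   # Laplace-smoothed denominator
        model[label] = (
            math.log(n_docs / total_docs),                          # log prior
            {w: math.log((counts[w] + 1) / denom) for w in vocab},  # log likelihoods
            math.log(1 / denom),                                    # unseen-word fallback
        )
    return model

def classify(model, words):
    def score(label):
        prior, likelihood, unseen = model[label]
        return prior + sum(likelihood.get(w, unseen) for w in words)
    return max(model, key=score)

# Illustrative training snippets -- not from the actual regulation document.
training = [
    (["innovation", "benefits", "growth"], "positive"),
    (["transparent", "trust", "benefits"], "positive"),
    (["risk", "bias", "harm"], "negative"),
    (["misuse", "risk", "concern"], "negative"),
]
model = train_nb(training)
label = classify(model, ["growth", "benefits"])  # → "positive"
```

Working in log space keeps the products of many small probabilities numerically stable, which matters once documents grow beyond a few words.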

2 Literature Studies

Our literature review focuses on three main topics: the moral challenges and dangers
linked with AI, sentiment analysis using data from Twitter and other platforms, and
common techniques used in sentiment analysis.

AI Risk and Ethical Challenges: Dwivedi [3, 4] used Twitter data to pinpoint the
main worries about rising AI. Hagendorff [5] performed an in-depth comparison
of 22 ethical guides, highlighting both shared points and areas lacking attention.
This research also suggests ways to make AI ethical rules more actionable. Maas [6]
believed that AI systems can cause widespread, cascading mistakes. Box and Data [7]
examined how human prejudices might affect the machine learning journey. Martinho
et al. [8] and team combined both theory and real-world evidence to explore the
ethical choices made by AI systems. References [9–12] shared examples of various AI models and how they can help manage AI bias and risk. Tamboli [2] pointed out the evolving problems caused by changing data trends, underscoring the ‘concept drift’
issue. Bolander [13] expressed worries about the consequences and technical hurdles
of AI replacing human tasks. Holzinger et al. [14] emphasized the need for extensive,
top-notch data to address urgent medical problems, particularly by merging different
clinical, imaging, and molecular data to understand intricate illnesses. References
[15] and [16] shared examples of AI methods and risks in computer vision models. Reference [17] shared examples of machine learning-based models for ESG and the risks associated with them. References [18–21] shared various examples of machine learning models and how to manage AI risk.
Sentiment Analytics Using Twitter and Other Data Sources: Gupta et al. [9]
contributed to the comprehension of textual context to find distinct attributes of
items or services that impact consumer emotions. The study conducted by Dwivedi
and Anand [22] employed sentiment analysis and topic modeling techniques to eval-
uate the responses of governments during the COVID-19 pandemic, specifically
comparing the reactions of the United Arab Emirates and Saudi Arabia. In their
study, Dwivedi et al. [10] employed Twitter data to conduct topic and sentiment
analysis with the objective of identifying primary concerns regarding the veracity
and integrity of data. Dwivedi et al. [3, 23] employed context analysis on Twitter
to assess the prevalence of positive and negative attitudes regarding the COVID-
19 vaccination, emphasizing prevalent concerns. In their study, Dwivedi et al. [22]
conducted a comparative analysis of medical research on COVID-19 in the United
Arab Emirates and the World Health Organization. The researchers focused on iden-
tifying the main topics addressed by both parties. In their study, Dwivedi et al. [4]
utilized context analysis on Twitter to categorize sentiments regarding the ethical
quandaries in AI, identifying key areas of concern.
Text Clustering Methods: Alghamdi and Alfalqi [24] highlighted the increasing
demand for novel techniques or instruments to efficiently manage, sort, and eval-
uate the ever-growing volume of electronic records and archives. Hofmann [25]
delineated two primary methodologies for analyzing such data: natural language
processing (NLP) and statistical techniques, such as topic modeling. While natural
language processing (NLP) focuses on the categorization of speech components and
the examination of grammar, statistical and topic models mostly utilize the ‘bag-of-
words’ (BoW) approach. This approach involves transforming texts into a matrix that
represents the frequency of each word occurrence in each document. Deerwester et al.
[26] were pioneers in the field of topic modeling. They introduced a highly influential

model that included latent semantic analysis (LSA) and singular value decomposition
(SVD). Asmussen and Moller [27] developed an innovative framework utilizing topic
modeling approaches to comprehensively analyze a diverse array of scholarly articles.
Their approach facilitated rapid, lucid, and replicable evaluations of extensive collec-
tions of documents employing the LDA technique. In general, automatic document
processing can be categorized into two types: supervised learning and unsupervised
learning. Supervised learning necessitates the meticulous task of manually assigning
labels to a set of documents, which can be time-consuming. In contrast, unsupervised
algorithms, like topic modeling, bypass this stage, hence accelerating the examina-
tion of large collections of documents. Gottipati et al. [28] utilized subject modeling
and visual data representation techniques to comprehend the feedback provided by
postgraduate students at the Singapore University of Management. They combined
rule-based techniques with statistical categorizers to identify subjects. Bagheri et al.
[29] developed a sentiment analysis framework that focused on extracting opinions
pertaining to specific themes. The LDA algorithm was employed to identify subjects,
while the ‘bag-of-words’ approach was utilized for sentiment analysis, which quan-
tifies emotions based on word frequency. Benedetto and Tedeschi [30] discussed
prevalent methodologies for sentiment analysis on social media, addressing perti-
nent concerns in cloud technology. Dwivedi and Pathak [23] introduced a technique
that operates at the phrase level. This method utilizes online latent semantic indexing
along with predefined rules to extract topics. Samuel and his colleagues [31] exam-
ined the positive and negative sentiments on the recovery of the US economy amid the
COVID-19 pandemic, including factors such as evolving circumstances, economic
downturns, and feelings of sorrow. Alonso [32] showed that when consumers exhibit a
robust response to unfavorable information, coupled with an escalation of unpleasant
emotions, it results in a negative perception of cattle production. Gupta et al. [33–
43] provided unsupervised and supervised machine learning methods for detecting
anomalies. Gupta et al. [44, 45] provided optimization methods and data quality
approaches for detection and optimization of money laundering scenarios. Dwivedi and Vemareddy [46] performed sentiment analysis on cryptocurrency discourse to understand the negative sentiments.
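The ‘bag-of-words’ representation described above can be built in a few lines: each document becomes a row of raw word counts over a shared vocabulary. The documents here are illustrative.

```python
from collections import Counter

def bag_of_words(documents):
    """Build a document-term matrix of raw word frequencies (BoW)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted(set(word for doc in tokenized for word in doc))
    matrix = [[Counter(doc)[word] for word in vocab] for doc in tokenized]
    return vocab, matrix

docs = [
    "AI regulation supports innovation",
    "regulation addresses AI risk and AI ethics",
]
vocab, matrix = bag_of_words(docs)
# The second row counts "ai" twice, reflecting its two occurrences.
```

This document-term matrix, not the raw text, is what statistical topic models such as LSA and LDA take as input; word order is deliberately discarded.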

3 Data and Methodology

The UK has actively engaged in shaping the governance of artificial intelligence (AI) to guarantee its ethical and responsible use. The development of the UK’s preliminary AI guidelines was a comprehensive endeavor, engaging various parties and reflecting on the wider technological, economic, social, and moral aspects of AI. For our research, we sourced the in-depth draft regulation directly from the UK website to conduct topic modeling and sentiment analysis.

The process of text analysis involves the conversion of human language into a
format that can be processed and analyzed by machines. There are multiple crucial
stages involved in preparing the text for this, which include:
• Transforming the entire text to lowercase.
• Removing stop words, rare terminology, and specific words.
• Converting numerical values into their verbal representation or eliminating them
completely.
• Removing any spaces at the beginning or end of the text.
• Excluding punctuation marks and other unique symbols.
At first, our main objective was to eliminate any duplicate rows. Eliminating superfluous data is essential to get precise results. Lowercasing all text ensures uniformity, preventing case variants such as ‘Drinking Water’ and ‘drinking water’ from being treated as separate terms.
To optimize the processing of textual data, it is typically necessary to decrease the
size of the dataset. An effective strategy is eliminating commonly occurring terms. We
have the option to either create a tailored compilation of these phrases or utilize pre-
existing libraries. To achieve this objective, we employed the stopword and textblob
libraries. Generally, we eliminate words that are commonly used by everyone, but
we may also exclude keywords that are peculiar to the situation and appear too
frequently. It is recommended to examine the most common terms in the dataset in
order to identify which ones should be excluded. Due to the widespread occurrence of
spelling errors and abbreviations in sentences, it is crucial to include a spell-checking
process to ensure consistency in word usage. We utilized the textblob library, which
is specifically designed to handle such errors. Tokenization is the process of dividing
the text into separate units, such as individual words or sentences. By utilizing the
textblob package, we converted our texts into tokens, effectively dissecting them
into individual words. Stemming is a technique that involves removing word suffixes
such as ‘ing,’ ‘ly,’ and ‘s.’ We utilized the Porter Stemmer algorithm from the NLTK
package to do this. Nevertheless, lemmatization is frequently favored over stemming
as a strategy. Lemmatization, unlike simple word trimming, involves determining
the base or root form of a word by considering vocabulary and morphological study.
Therefore, lemmatization is generally preferred due to its accuracy.
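The contrast between the two can be made concrete with a deliberately simplified sketch — the suffix rules below are a toy stand-in for the Porter algorithm, and the lemma table is invented:

```python
def crude_stem(word):
    """Naive suffix stripping in the spirit of stemming (NOT the Porter algorithm)."""
    for suffix in ("ing", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization instead maps a word to its dictionary (base) form.
TOY_LEMMAS = {"better": "good", "studies": "study", "was": "be"}

def crude_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print(crude_stem("regulating"))    # 'regulat' -- stems need not be real words
print(crude_lemmatize("studies"))  # 'study'  -- lemmas are dictionary forms
```

The example shows why lemmatization is preferred for accuracy: the stem ‘regulat’ is not a word, while a lemma is always a valid dictionary form.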
Upon finishing these preprocessing stages, the subsequent stage entails extracting
features utilizing techniques derived from natural language processing.
N-grams, comprising unigrams, bigrams, or trigrams, denote combinations of
one, two, or three words, respectively. Although unigrams may not encompass
extensive context, bigrams and trigrams can capture more intricate language
patterns, revealing probable word sequences. The selection between shorter or longer
N-grams is contingent upon the precise objectives of the study, as excessively
lengthy sequences may overlook the overarching message. Part-of-speech (POS)
tagging categorizes words according to their grammatical role in a phrase, such as
whether they act as nouns, verbs, adjectives, etc., thus contributing to the contextual
understanding and significance of the text.
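Generating N-grams is a simple sliding-window operation; a minimal sketch (the helper name and sample tokens are ours):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the uk regulates artificial intelligence".split()
print(ngrams(tokens, 2))
# -> [('the', 'uk'), ('uk', 'regulates'), ('regulates', 'artificial'),
#     ('artificial', 'intelligence')]
```

Note that a sequence of k tokens yields k − n + 1 n-grams, which is why longer N-grams rapidly become sparse.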
As illustrated in Fig. 1, our first step consisted of eliminating duplicate rows to
guarantee impartial results. In order to standardize words and prevent duplication,
Decoding the UK’s Stance on AI: A Deep Dive into Sentiment … 129

we proceeded to transform all text to lowercase. This ensures that variations such as
‘Crypto Currency’ and ‘crypto currency’ are not considered as separate keywords.
Subsequently, punctuation was eliminated to simplify the dataset for more efficient
textual analysis. Stop words, which are frequently used terms, were eliminated using
the textblob package in Python. Due to the frequent occurrence of spelling problems
and abbreviations in the text, we utilized the textblob library to correct spelling.
After completing these stages, we proceeded to tokenize the data, which involved
dividing it into separate words or sentences. The process of stemming, which involves
truncating word ends such as ‘ing,’ ‘ly,’ and ‘s,’ was performed using Python’s Porter-
Stemmer from the NLTK module. Lemmatization is an alternative to stemming that
determines the basic form of a word. By utilizing vocabulary and morphological
analysis, lemmatization is frequently preferred over stemming due to its higher level
of precision. After completing the first preprocessing, we utilized various natural
language approaches to extract features. We employed N-grams, namely bigrams
and trigrams, to comprehend the contextual correlation among words. Although
unigrams supply only a little amount of information, bigrams and trigrams provide
more extensive linguistic insights. In addition, part-of-speech tagging was utilized
to assign functional roles, such as nouns, verbs, and adjectives, to words based on
their context.
Afterward, we proceeded with topic modeling, a method that detects themes or
subjects in a collection of texts. This procedure is essential in natural language
processing as it decreases the dimensionality of the dataset by concentrating on
relevant content rather than filtering through the entire text. For our investigation,
we employed the Latent Dirichlet Allocation (LDA) approach among numerous

Fig. 1 Preprocessing process for sentiment analysis [4]



alternatives. The Latent Dirichlet Allocation (LDA) model is a statistical tool that
identifies latent thematic connections across documents. Based on the variational expectation
maximization (VEM) algorithm, this method detects the most likely subjects present
in a collection of texts. Conventional approaches may only choose the most common
terms, whereas LDA goes beyond by examining the semantic connections between
words in a document. The system functions based on the idea that every document
can be characterized by a distribution of themes, and each subject can be delineated
by a distribution of words. This method provides a comprehensive perspective on
interconnected subjects, allowing for more subtle categorizations of the body of text.
The following flowchart illustrates the complexities of this topic modeling approach.
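LDA’s generative assumption — a document is a mixture of topics, and each topic is a distribution over words — can be made concrete with a toy calculation. The topic names, word distributions, and mixture weights below are invented for illustration; a real model would be fitted with a library such as gensim or scikit-learn via VEM or Gibbs sampling:

```python
# Toy topic-word distributions (each sums to 1); the words and numbers
# are invented, not estimated from the UK document.
topics = {
    "regulation": {"ai": 0.4, "regulator": 0.4, "risk": 0.2},
    "innovation": {"ai": 0.3, "startup": 0.5, "risk": 0.2},
}

# The document's (assumed) mixture over topics.
doc_topic_mix = {"regulation": 0.7, "innovation": 0.3}

def word_probability(word):
    """p(word | doc) = sum over topics k of p(k | doc) * p(word | k)."""
    return sum(weight * topics[name].get(word, 0.0)
               for name, weight in doc_topic_mix.items())

print(word_probability("ai"))  # 0.7*0.4 + 0.3*0.3 ≈ 0.37
```

Fitting LDA amounts to inverting this generative story: inferring the topic-word and document-topic distributions most likely to have produced the observed corpus.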

4 Results

Overall Sentiment: The sentiment analysis of the document yields the following
results.
• The document exhibits a positive sentiment with an average polarity of 0.1317.
• This value ranges from − 1 (most negative) to 1 (most positive). A value of 0.1317
suggests that the document has a slightly positive sentiment overall.
• The document has an overall subjectivity of 0.4215. This value ranges from 0 (most
objective) to 1 (most subjective). A value of 0.4215 indicates that the document
strikes a balance but leans toward being somewhat subjective.
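The polarity scale can be illustrated with a minimal lexicon-based scorer — a toy stand-in for TextBlob’s pattern-based sentiment, where the word scores below are invented:

```python
# Invented word-level polarity scores in [-1, 1].
POLARITY = {"effective": 0.6, "innovative": 0.8, "risk": -0.4, "harm": -0.8}

def polarity(text):
    """Average the polarity of scored words; the result lies in [-1, 1]."""
    scores = [POLARITY[w] for w in text.lower().split() if w in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("an effective but innovative framework carries risk"))
# averages 0.6, 0.8 and -0.4 -> mildly positive, much like the 0.1317 above
```

Real scorers additionally handle negation, intensifiers, and subjectivity, but the averaging principle is the same.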
Coherence Score Plot:
• Coherence Score for Positive Sentences: 0.434
• Coherence Score for Negative Sentences: 0.594.
Positive Sentences Topics:
Topic 1: AI, regulatory framework, effective central function, organizations.
Topic 2: AI life cycle, regulatory approach, legal principles, responsibility.
Topic 3: AI regulators, regulatory principles, support guidance, ensure framework.
Topic 4: AI in the UK, risk, effective government systems, innovation.
Topic 5: AI approach, UK innovation, stakeholders, regulatory foundation.
In the realm of positive sentiments surrounding AI, several dominant themes
emerge. Firstly, there’s a focus on AI’s regulatory framework and the importance
of an effective central function to streamline its integration across various organiza-
tions. Secondly, the AI life cycle, combined with a robust regulatory approach and
legal principles, signifies the emphasis on responsibility and ethics. Furthermore,
the role of AI regulators is highlighted, emphasizing the need to ensure the right
framework is in place. Additionally, there’s a clear indication of AI’s role in UK
innovation, addressing risks, and promoting effective government systems. Lastly,

the approach to AI in the UK is seen as a beacon of innovation, with stakeholders
actively collaborating to set a solid regulatory foundation.
Negative Sentences Topics:
Topic 1: intelligence, publications, office, government, UK, pro, establishing,
approach, innovation, AI.
Topic 2: improve, health, risk, safety, footnote, ai, https, www, intelligence,
artificial.
Topic 3: actors, bad, vulnerability, UK, regulatory, vulnerable, consumers,
consumer, https, www.
Topic 4: law, study, stakeholders, businesses, sector, small, www, https, artificial,
intelligence.
Topic 5: adaptivity, autonomy, life, chains, cycle, systems, difficult, make, supply,
AI.
On the side of negative sentiments, various themes provide insight into concerns
and challenges. The innovation approach to AI, especially its establishment in the UK,
seems to be under scrutiny, with questions arising about its alignment with govern-
ment offices. The broader concept of artificial intelligence, especially concerning
health and safety risks, has drawn attention. The vulnerability of AI consumers,
especially in a regulatory context in the UK, is a significant concern. There are also
indications of challenges in implementing AI, especially when considering the needs
of small businesses and legal implications. Lastly, the application of AI in supply
chains and its potential to complicate systems and life cycles has been highlighted.
Pie Chart for Positive Sentences Topics:
A pie chart labeled ‘Topics (Positive)’ provides an easily understood visual aid
that illustrates the distribution and prominence of the different positive subjects.
Every pie slice represents a distinct topic, and the size of the slice denotes the
topic’s relative importance or frequency in positive sentiments (Fig. 2).
Pie Chart for Negative Sentences Topics:
Figure 3 has the distribution of negative sentiment topics.
Dendrogram for Positive Sentences Topics:
The dendrogram (Fig. 4) illustrates the hierarchical structure of topics derived from
positive sentences. As we move from the bottom to the top of the dendrogram,
individual topics merge into larger clusters, indicating thematic similarities. The
height where branches merge represents the distance between clusters with higher
merges indicating greater dissimilarity. By examining the structure one can deduce
the natural groupings of topics and determine how certain positive topics are related
or distinct from others.
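The agglomerative process behind such a dendrogram can be sketched with a tiny single-linkage pass over toy one-dimensional topic positions. Real analyses would use scipy.cluster.hierarchy; everything here, including the points, is illustrative:

```python
def single_linkage_merges(points):
    """Greedily merge the two closest clusters; return the merge heights."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        # cluster distance = minimum pairwise distance (single linkage)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)          # height at which this merge appears
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return heights

print(single_linkage_merges([0.0, 0.1, 0.9, 1.0]))
```

The two tight pairs merge first at low heights, and the final merge between the two resulting clusters happens at the greatest height — mirroring the observation that higher merges indicate greater dissimilarity.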

Fig. 2 Pie chart for topics (Positive)

Fig. 3 Pie chart for topics (Negative)



Fig. 4 Dendrogram for topics (Positive)

Fig. 5 Dendrogram for topics (Negative)

Dendrogram for Negative Sentences Topics:
The dendrogram (Fig. 5) illustrates the hierarchical structure of topics derived from
negative sentences.

Fig. 6 Word cloud for topic 3 (Positive)

Positive Sentiment Topics Word Cloud:
The positive sentiment word cloud visualizes the most frequently occurring words
in the sentences that were classified as having a positive sentiment (Fig. 6).
The word cloud for positive topics predominantly showcases themes related to
AI’s regulatory framework, its life cycle, and its role in UK innovation. Words such
as ‘regulatory’, ‘AI’, ‘life cycle’, ‘approach’, and ‘UK’ stand out, emphasizing the
country’s proactive stance in ensuring responsible and ethical AI deployment. There’s
a notable focus on collaboration, with terms like ‘stakeholders’ and ‘foundation’,
suggesting a collective effort to build a solid groundwork for AI’s future in the UK.
Negative Sentiment Topics Word Cloud:
The word cloud (Fig. 7) for negative topics brings to light concerns surrounding AI’s
integration and its broader implications. Terms like ‘innovation’, ‘health’, ‘safety
risks’, and ‘vulnerability’ are prominent, indicating apprehensions about AI’s poten-
tial challenges. The recurrence of ‘UK’ suggests that these concerns are specific
to the country’s context. There’s also an evident focus on the ‘establishment’ and
‘government offices,’ alluding to potential bureaucratic or regulatory hurdles in the
AI landscape.
Inter-topic Distance Maps:
Inter-topic analysis often involves examining the relationships or distances between
topics. One effective way to visualize these relationships is through a distance map
where topics are plotted in a two-dimensional space based on their similarities or
differences. The MDS projection for positive sentiments visually represents the topics
in a two-dimensional space. Each point corresponds to a topic, with the number
indicating the topic ID. The distance between points reflects the dissimilarity between
topics based on the Kullback–Leibler (KL) divergence. Topics that are closer together
share more similar word distributions, while those further apart are more distinct.
This visualization helps in understanding the relationships and separations between
various positive sentiment topics (Fig. 8).
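The divergence behind these distances can be computed directly. The two toy topic-word distributions below are invented; note that inter-topic maps in tools such as pyLDAvis often use a symmetrised variant (e.g. Jensen–Shannon) since plain KL divergence is asymmetric:

```python
from math import log

def kl_divergence(p, q):
    """KL(p || q) over distributions given as dicts; assumes q > 0 wherever p > 0."""
    return sum(pi * log(pi / q[w]) for w, pi in p.items() if pi > 0)

topic_a = {"ai": 0.5, "risk": 0.3, "law": 0.2}
topic_b = {"ai": 0.4, "risk": 0.4, "law": 0.2}

print(kl_divergence(topic_a, topic_b))  # small value -> similar word distributions
```

Identical distributions give a divergence of zero, and larger values place topics further apart on the map.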

Fig. 7 Word cloud for negative sentences

Fig. 8 Inter-topic distance map for the positive sentences

Similarly, the MDS projection for negative sentiments displays the topics in a
two-dimensional layout. Topics that are proximate have similar word distributions,
indicating potential overlaps or related concerns. On the other hand, isolated topics
might represent unique issues or criticisms. This map provides insights into the
clustering and distinctions among various negative sentiment topics (Fig. 9).
Termite Plot for Positive and Negative Topics:
For the positive sentiments, the plot showcases the word distribution across five topics,
with certain terms holding prominence in specific topics. This can aid in understanding
the focal points of the positive sentiments within the document. The termite
plot further solidifies these findings, offering a clearer view of the most influential
terms for each positive sentiment topic. For the negative sentiments, it reveals
the distribution of terms across the negative sentiment topics. The darker regions
highlight terms that are crucial in the context of specific topics. The termite plot

Fig. 9 Inter-topic distance map for the negative sentences

Fig. 10 Heat map for positive and negative topics

for negative sentiments complements the heat map (Fig. 10), which presents a
granular view of term significance across topics. This aids in deciphering the primary
concerns or themes associated with the negative sentiments in the document.

5 Discussions and Conclusions

The UK’s framework for artificial intelligence (AI), titled ‘A pro-innovation
approach to AI regulation’, is notable for its comprehensive structure. The UK recognizes
the early stage of development of AI and its significant potential for social and

economic impact. However, the UK is also aware of the possible risks associated
with AI systems, such as the potential to exacerbate socioeconomic inequalities. The
UK is dedicated to creating strong standards and directions for the design and
integration of AI, while closely monitoring these problems. The proposal establishes
essential criteria, effectively balancing the management of risks with the promotion
of the growth of this emerging technology.
According to our analysis, the document predominantly conveys a positive sentiment,
focusing on the importance of responsibility, compliance with laws, punishment for
those who disregard standards, and compliance with privacy requirements. The UK’s
prompt intervention in this domain is worth noting. Nevertheless, the fears
are equally as prominent as the assurances. The complexities of assessing risks,
supervising them, and guaranteeing adherence present a significant obstacle, as the
preliminary proposal openly acknowledges. This strategy ensures that significant AI
technologies undergo thorough risk assessments and implement preventive actions.
Nonetheless, the true assessment of the UK’s AI rules hinges on their implementation,
oversight, and capacity to adjust to the ever-changing AI environment. As artificial
intelligence advances, it will be crucial to review and improve existing policies in
order to effectively tackle new challenges and opportunities.
None of these methods can detect whether the picture is generated by a machine.
Distinguishing between GAN-generated images and manually created photos poses
distinct difficulties. The complexity of Generative Adversarial Networks (GANs) is
a major factor contributing to this challenge. GANs are specifically engineered to
imitate the artistic process of humans by producing visuals that closely resemble
authentic photographs. They achieve this by training on extensive datasets of
authentic photos, which allows them to generate exceedingly lifelike and visually
persuasive outcomes.
Overall, the combination of GANs’ capacity to generate extremely lifelike images,
their flexibility to avoid detection, and the lack of obvious artifacts poses a significant
difficulty in distinguishing GAN-generated images from manually created ones using
traditional approaches. The identification of content created by GANs necessitates
the creation of specific methodologies and ongoing progress in the domain of visual
forensics.

References

1. Nilsson NJ (2012) John McCarthy. National Acad Sci 1–27
2. Kathleen W (2018) Artificial intelligence is not a technology. Forbes, Nov 1
3. Dwivedi DN, Anand A (2021) The text mining of public policy documents in response to
COVID-19: a comparison of the United Arab Emirates and the Kingdom of Saudi Arabia.
Public Gov/Zarządzanie Publiczne 55(1):8–22. https://doi.org/10.15678/ZP.2021.55.1.02
4. Dwivedi D, Mahanty G, Vemareddy A (2021) How responsible is AI? Identification of key
public concerns using sentiment analysis and topic modeling. Int J Inf Retrieval Res 12(1)
5. Hagendorff T (2020) The ethics of AI ethics: an evaluation of guidelines. Mind Mach 30(1):99–
120. https://doi.org/10.1007/s11023-020-09517-8

6. Maas MM (2018) Regulating for normal AI accidents, pp 223–228. https://doi.org/10.1145/3278721.3278766
7. Box J, Data P (2019) Do you know what your model is doing? How human bias influences
machine learning Elena Snavely—senior data scientist PHUSE UK connect 2019—Amsterdam
machine learning in clinical research
8. Martinho A, Kroesen M, Chorus C (2020) An empirical approach to capture moral uncertainty
in AI, pp 101–101. https://doi.org/10.1145/3375627.3375805
9. Dwivedi DN, Patil G (2023) Climate change: prediction of solar radiation using advanced
machine learning techniques. In: Srivastav A, Dubey A, Kumar A, Narang SK, Khan MA (eds)
Visualization techniques for climate change with machine learning and artificial intelligence.
Elsevier pp 335–358. https://doi.org/10.1016/B978-0-323-99714-0.00017-0
10. Dwivedi DN et al (2022) Benchmarking of traditional and advanced machine learning
modeling techniques for forecasting. In: Visualization techniques for climate change with
machine learning and artificial intelligence. Elsevier. https://doi.org/10.1016/B978-0-323-
99714-0.00017-0
11. Dwivedi DN, Anand A (2021) Trade heterogeneity in the EU: insights from the emergence of
COVID-19 using time series clustering. Zeszyty Naukowe Uniwersytetu Ekonomicznego w
Krakowie 3(993):9–26. https://doi.org/10.15678/ZNUEK.2021.0993.0301
12. Dwivedi D, Kapur PN, Kapur NN (2023) Machine learning time series models for tea pest
looper infestation in Assam, India. In: Sharma A, Chanderwal N, Khan R (eds) Convergence
of cloud computing, AI, and agricultural science. IGI Global pp 280–289. https://doi.org/10.
4018/979-8-3693-0200-2.ch014
13. Bolander T (2019) What do we lose when machines make the decisions? J Manage Governance
23(4):849–867. https://doi.org/10.1007/s10997-019-09493-x
14. Holzinger A, Haibe-Kains B, Jurisica I (2019) Why imaging data alone is not enough: AI-based
integration of imaging, omics, and clinical data. Eur J Nucl Med Mol Imag 46(13):2722–2730.
https://doi.org/10.1007/s00259-019-04382-9
15. Chikkamath M, Dwivedi D, Hirekurubar RB, Thimmappa R (2023) Benchmarking of novel
convolutional neural network models for automatic butterfly identification. In: Shukla PK,
Singh KP, Tripathi AK, Engelbrecht A (eds) Computer vision and robotics. Algorithms for
intelligent systems. Springer, Singapore. https://doi.org/10.1007/978-981-19-7892-0_27
16. Gupta A, Dwivedi DN, Shah J (2023) Overview of money laundering. In: Artificial intelli-
gence applications in banking and financial services. Future of business and finance. Springer,
Singapore. https://doi.org/10.1007/978-981-99-2571-1_1
17. Dwivedi D, Batra S, Pathak YK (2023) A machine learning based approach to identify key
drivers for improving corporate’s esg ratings. J Law Sustain Dev 11(1):e0242. https://doi.org/
10.37497/sdgs.v11i1.242
18. Dwivedi D et al (2023) Computer vision use case: detecting the changes in the Amazon
rainforest over time. In: Apple Academic Press series on digital signal processing, computer
vision and image processing
19. Gupta A et al (2021) Climate change monitoring using remote sensing, deep learning, and
computer vision. Webology 19(2):2022. Available at: https://www.webology.org/abstract.php?
id=1708
20. Manjunath C, Dwivedi DN, Thimmappa R, Vedamurthy KB (2023) Detection and categoriza-
tion of diseases in pearl millet leaves using novel convolutional neural network models. In:
Future farming: advancing agriculture with artificial intelligence vol 1, pp 41. https://doi.org/
10.2174/9789815124729123010006
21. Gupta A et al (2021) Understanding consumer product sentiments through supervised models
on cloud: pre and post COVID. Webology 18(1):406–415. Available at: https://doi.org/10.
14704/web/v18i1/web18097
22. Dwivedi DN, Anand A (2022) A comparative study of key themes of scientific research post
COVID-19 in the United Arab Emirates and WHO using text mining approach. In: Tiwari S,
Trivedi MC, Kolhe ML, Mishra K, Singh BK (eds) Advances in data and information sciences.
Lecture notes in networks and systems, vol 318. Springer, Singapore. https://doi.org/10.1007/
978-981-16-5689-7_30

23. Dwivedi DN, Pathak S (2022) Sentiment analysis for COVID vaccinations using Twitter:
text clustering of positive and negative sentiments. In: Hassan SA, Mohamed AW, Alnowibet
KA (eds) Decision sciences for COVID-19. International series in operations research and
management science, vol 320. Springer, Cham. https://doi.org/10.1007/978-3-030-87019-5_
12
24. Alghamdi R, Alfalqi K (2015) A survey of topic modeling in text mining. Int J Adv Comput
Sci Appl 6(1):147–153. https://doi.org/10.14569/ijacsa.2015.060121
25. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn
42(1–2):177–196. https://doi.org/10.1023/A:1007617005950
26. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent
semantic analysis. J Am Soc Inf Sci 41(6):391–407
27. Asmussen CB, Møller C (2019) Smart literature review: a practical topic modeling approach
to exploratory literature review. J Big Data 6(1). https://doi.org/10.1186/s40537-019-0255-7
28. Gottipati S, Shankararaman V, Lin JR (2018) Text analytics approach to extract course improve-
ment suggestions from students feedback. Res Pract Technol Enhanced Learn 13(1). https://
doi.org/10.1186/s41039-018-0073-0
29. Bagheri E, Ensan F, Al-Obeidat F (2018) Neural word and entity embeddings for ad hoc
retrieval. Inf Process Manage 54(4):657–673
30. Benedetto F, Tedeschi A (2016) Big data sentiment analysis for brand monitoring in social
media streams by cloud computing. In: Studies in computational intelligence vol 639. https://
doi.org/10.1007/978-3-319-30319-2_14
31. Samuel J, Rahman MM, Ali GMN, Samuel Y, Pelaez A, Chong PH, Yakubov M (2020)
Feeling positive about reopening? New normal scenarios from COVID-19 US reopen sentiment
analytics. In: IEEE Access, vol 8, pp 142173–142190. Available at SSRN: https://ssrn.com/
abstract=3713652
32. Alonso ME, González-Montaña JR, Lomillos JM (2020) Consumers concerns and perceptions
of farm animal welfare. Anim: Open Access J MDPI 10(3). https://doi.org/10.3390/ani100
30385
33. Gupta A, Dwivedi DN, Shah J (2023) Financial crimes management and control in financial
institutions. In: Artificial intelligence applications in banking and financial services. Future of
business and finance. Springer, Singapore. https://doi.org/10.1007/978-981-99-2571-1_2
34. Gupta A, Dwivedi DN, Shah J (2023) Overview of technology solutions. In: Artificial intelli-
gence applications in banking and financial services. Future of business and finance. Springer,
Singapore. https://doi.org/10.1007/978-981-99-2571-1_3
35. Gupta A, Dwivedi DN, Shah J (2023) Data organization for an FCC unit. In: Artificial intelli-
gence applications in banking and financial services. Future of business and finance. Springer,
Singapore. https://doi.org/10.1007/978-981-99-2571-1_4
36. Gupta A, Dwivedi DN, Shah J (2023) Planning for AI in financial crimes. In: Artificial intelli-
gence applications in banking and financial services. Future of business and finance. Springer,
Singapore. https://doi.org/10.1007/978-981-99-2571-1_5
37. Gupta A, Dwivedi DN, Shah J (2023) Applying machine learning for effective customer risk
assessment. In: Artificial intelligence applications in banking and financial services. Future of
business and finance. Springer, Singapore. https://doi.org/10.1007/978-981-99-2571-1_6
38. Gupta A, Dwivedi DN, Shah J (2023) Artificial intelligence-driven effective financial transac-
tion monitoring. In: Artificial intelligence applications in banking and financial services. Future
of business and finance. Springer, Singapore. https://doi.org/10.1007/978-981-99-2571-1_7
39. Gupta A, Dwivedi DN, Shah J (2023) Machine learning-driven alert optimization. In: Artificial
intelligence applications in banking and financial services. Future of business and finance.
Springer, Singapore. https://doi.org/10.1007/978-981-99-2571-1_8
40. Gupta A, Dwivedi DN, Shah J (2023) Applying artificial intelligence on investigation. In:
Artificial intelligence applications in banking and financial services. Future of business and
finance. Springer, Singapore. https://doi.org/10.1007/978-981-99-2571-1_9
41. Gupta A, Dwivedi DN, Shah J (2023) Ethical challenges for AI-based applications. In: Artificial
intelligence applications in banking and financial services. Future of business and finance.
Springer, Singapore. https://doi.org/10.1007/978-981-99-2571-1_10

42. Gupta A, Dwivedi DN, Shah J (2023) Setting up a best-in-class AI-driven financial crime
control unit (FCCU). In: Artificial intelligence applications in banking and financial services.
Future of business and finance. Springer, Singapore. https://doi.org/10.1007/978-981-99-2571-
1_11
43. Gupta A, Dwivedi DN, Jain A (2021) Threshold fine-tuning of money laundering scenarios
through multi-dimensional optimization techniques. J Money Laundering Control. https://doi.
org/10.1108/JMLC-12-2020-0138
44. Gupta A, Dwivedi DN, Shah J, Jain A (2021) Data quality issues leading to suboptimal machine
learning for money laundering models. J Money Laundering Control. https://doi.org/10.1108/
JMLC-05-2021-0049
45. Dwivedi D, Vemareddy A (2023) Sentiment analytics for crypto pre and post covid: topic
modeling. In: Molla AR, Sharma G, Kumar P, Rawat S (eds) Distributed computing and intel-
ligent technology. ICDCIT 2023. Lecture notes in computer science, vol 13776. Springer,
Cham. https://doi.org/10.1007/978-3-031-24848-1_21
46. Dwivedi D, Patil G (2022) Lightweight convolutional neural network for land use image classi-
fication. J Adv Geospatial Sci Technol 2(1):31–48. Retrieved from https://jagst.utm.my/index.
php/jagst/article/view/31
Latest Trends on Satellite Image
Segmentation

Sahil Borkar, Krishna Chidrawar, Sakshi Naik, Mousami P. Turuk,
and Vaibhav B. Vaijapurkar

Abstract For the last few decades, satellite imaging technology has taken massive
strides towards higher spatial resolution, larger swath coverage and almost real-time
data delivery. Satellite imaging, or remote sensing, is extensively used in optical
imaging of the globe, military surveillance, deforestation detection, etc. Multispectral
imaging by satellite has enabled an exceptional understanding of the earth by
imaging beyond the realm of the visible spectrum. Against this backdrop, image
segmentation using machine learning, deep learning models and image processing
has prompted several new approaches and techniques for satellite image segmentation.
This survey provides a comprehensive review of the recent literature, covering
novel approaches to image segmentation using satellite imaging as well as other
sources. There is broad coverage of the UNET, TSVM and Random Walker
segmentation algorithms. It investigates their strengths, challenges and novel aspects,
compares their precision, and deliberates on potential research outlooks.

Keywords Satellite imaging · Segmentation · Multispectral imaging · Remote
sensing

1 Introduction

Image segmentation is an intrinsic part of computer vision and has a wide array
of applications and methodologies. Image segmentation works on the principle of
subdividing the images or frames in the case of videos into different classifications
or segments. Such classification can be done by using semantic segmentation where

S. Borkar (B) · K. Chidrawar · S. Naik · M. P. Turuk · V. B. Vaijapurkar
SCTRs Pune Institute of Computer Technology, Pune, India
e-mail: sahilsb8@gmail.com
M. P. Turuk
e-mail: mpturuk@pict.edu
V. B. Vaijapurkar
e-mail: vbvaijapurkar@pict.edu

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 141
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_12
142 S. Borkar et al.

semantic labels are used, instance segmentation where subdivision of individual
objects is done, or panoptic segmentation which uses both [1].
Image segmentation plays a crucial role in a wide array of applications such
as biometric recognition (e.g. face recognition, iris detection, fingerprint detection),
autonomous vehicles (e.g. obstacle detection, path detection, automatic driver assist),
medical imaging [2, 3] (e.g. locating tumours and their boundaries, radiotherapy),
search engine image search and surveillance systems.
A wide range of algorithms, with newer novel variations, have been devised to
solve the problem of image segmentation, such as multi-thresholding [4], the K-means
algorithm [5], active contour [6] and watershed methods [7]. Apart from machine
learning methodologies, techniques of deep learning and image processing
have brought about a significant shift in the industry.
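Of the classical techniques just listed, thresholding is the simplest to sketch: a toy two-threshold classifier over scalar pixel intensities (the threshold values and the helper name are arbitrary illustrations, not from any cited method):

```python
def multi_threshold(pixels, t1, t2):
    """Map each intensity to one of three classes using two thresholds."""
    return [0 if p < t1 else 1 if p < t2 else 2 for p in pixels]

# A toy row of 8-bit intensities segmented into dark / mid / bright classes.
print(multi_threshold([12, 80, 200, 55, 140], t1=60, t2=130))
# -> [0, 1, 2, 0, 2]
```

In practice the thresholds would be chosen from the image histogram (e.g. Otsu-style criteria) rather than fixed by hand.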
Image segmentation can thus be used to create partitions of images of geospatial
data collected from satellites and assign labels to classify them. Classifications can
be made for agricultural land, urban building zones, water bodies, forest areas, etc.
Usually the images used for such segmentation are RGB images which assign a value
per pixel corresponding to the colour component. This limits the segmentation to the
realm of visible wavelength. Modern satellites which orbit around the earth collect
different layers of data of the same scene with specific wavelength ranges across
the electromagnetic spectrum [8]. Thus only utilising the visible spectrum for image
segmentation is essentially underusing the collected data.

1.1 Satellite Imaging and Remote Sensing

Remote sensing and geospatial sciences involve imaging from orbital satellites or
aircraft. These images can be taken over swathes of a few kilometres to the entire
globe. These images are either captured over a single pass of an orbit or multiple
passes for added resolution and application of interferometry. The images can be
taken using sensors that capture visible light, infrared, thermal infrared and radar,
thus extending much beyond what the eye can see. Remote sensing is the science
of extracting useful data from the processed images and creating Geographic Infor-
mation Systems (GIS). The advancements in remote sensing in the last few decades
have led to the generation of extremely high spatial resolution multispectral and
hyperspectral images. This has enabled us to keep up with the global demand for
real-time high-resolution geospatial data that needs to be delivered to collaborators or
stakeholders like local and national governments, corporations and conglomerates.
Satellite visible-spectrum images are often masked by cloud cover and foliage
shadows, hence requiring processing and noise removal [9].
Multispectral satellite images can be acquired either as raw data or partially
processed data; extensive preprocessing is required before working on them. The
preprocessing steps can be grouped into categories of image correction, which
address phase errors, distortions due to the curvature of the earth, aberrations,
etc. These fundamental corrections are performed using the radiometric correction
technique and orthorectification. Several noise reduction and removal algorithms,
such as Goldstein phase filtering, are used extensively to eliminate the noise
added by foliage cover or construction. The raw data, generally distributed in
processing levels, is passed through noise-removal stages to obtain an
interferogram, which is further converted to an elevation map (Fig. 1). The
elevation map coupled with the RGB image is crucial for Synthetic Aperture Radar
(SAR) imaging and applications.

Latest Trends on Satellite Image Segmentation 143

Fig. 1 Geospatial imaging

1.2 Spectral Bands

The sensors on board satellites, be they optical, radar or infrared, are designed
specifically to acquire data in different frequency ranges along the
electromagnetic spectrum. Each band captures a different aspect of information
about the earth's surface, enabling analysis of various features such as land
use, urban zones, thermal mapping, water body analysis, vegetation cover, mining
feasibility and much more. For example, the near-infrared band is useful in
analysing vegetation cover, as healthier plants reflect more near-infrared light
than dying ones. Thermal infrared imaging can be used to identify local heating
zones and global-warming hot spots in the oceans.
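The NIR-versus-red contrast just described is what the widely used Normalised
Difference Vegetation Index (NDVI) exploits; a small sketch (the toy reflectance
values below are illustrative assumptions, not measured data):

```python
import numpy as np

def ndvi(nir, red):
    """Normalised Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    Healthy vegetation reflects strongly in NIR and absorbs red light, so
    NDVI approaches +1 over dense canopy and stays near 0 over bare soil."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    out = np.zeros_like(denom)
    mask = denom != 0          # guard pixels with no signal in either band
    out[mask] = (nir[mask] - red[mask]) / denom[mask]
    return out

# Toy 2x2 scene: left column vegetation-like, right column soil-like.
nir_band = np.array([[0.50, 0.30], [0.55, 0.28]])
red_band = np.array([[0.08, 0.25], [0.06, 0.26]])
veg_mask = ndvi(nir_band, red_band) > 0.3   # simple vegetation segmentation
```

Thresholding the index map like this is one of the simplest ways a non-visible
band turns directly into a segmentation mask.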
Satellites contain one or many imaging apparatuses or instruments which have the
ability to capture specific bands. Usually, optical imaging instruments capture
the red, green and blue bands (the RGB visible range) and also the infrared
(Table 1), using the same imaging sensor or multiple sensors by distribution of
the incoming light. These bands have different frequency ranges. Radar satellites
have specific band names for specific ranges, such as P, L, X and Ku, and are
used in Synthetic Aperture Radar (SAR) imaging (Table 2). Radar bands are used to
create elevation or topographic maps using the interferometry of the radar beams
and can be used to identify landslides [10, 11] and oil spills [12].

144 S. Borkar et al.

Table 1 Range of common optical bands

Band name                    Frequency range (THz)
Blue                         666.2–587.8
Green                        565.6–508.1
Red                          468.4–447.4
Near infrared (NIR)          352.6–340.6
Shortwave infrared (SWIR)    190.9–181.6
Panchromatic                 599.5–440.8
Thermal infrared (TIRS)      28.2–26.7
* Bands correspond to the NASA Landsat 9 satellite

Table 2 Range of common spectral radar bands

Band name    Frequency range (GHz)
P            0.230–1
L            1–2
S            2–4
C            4–8
X            8–12.5
Ku           12.5–18
K            18–26.5
Ka           26.5–40
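The frequency ranges in Table 1 can be converted to the perhaps more familiar
wavelength ranges via λ = c/f; a quick sketch using the Table 1 values:

```python
# Speed of light expressed as 299792.458 THz*nm, so that nm = C / THz.
C_THZ_NM = 299_792.458

def freq_to_wavelength_nm(f_thz):
    """Wavelength in nanometres for a frequency given in terahertz."""
    return C_THZ_NM / f_thz

# Optical bands from Table 1 (Landsat 9), as (high, low) frequencies in THz.
bands_thz = {
    "Blue": (666.2, 587.8),
    "Green": (565.6, 508.1),
    "Red": (468.4, 447.4),
    "NIR": (352.6, 340.6),
    "SWIR": (190.9, 181.6),
}

for name, (f_hi, f_lo) in bands_thz.items():
    # A higher frequency means a shorter wavelength, so the bounds swap.
    print(f"{name}: {C_THZ_NM / f_hi:.0f}-{C_THZ_NM / f_lo:.0f} nm")
```

For example the blue band, 666.2–587.8 THz, comes out as roughly 450–510 nm,
matching the usual visible-light convention.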

1.3 Multispectral and Hyperspectral Imaging

Multispectral remote sensing is the acquisition of visible, radar, near-infrared,
shortwave-infrared and thermal-infrared images in various broad wavelength bands.
Different elements and compounds absorb and reflect wavelengths of
electromagnetic radiation differently; thus, by checking the reflected spectral
signature, we can identify the element or compound. This ability to identify
materials is non-existent or very limited in images captured using only visible
wavelengths.
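A common way of matching a pixel's reflected spectral signature to a material,
as described above, is the Spectral Angle Mapper (SAM), which measures the angle
between spectra and is therefore insensitive to overall illumination; a sketch
with hypothetical five-band signatures (all reflectance values below are
illustrative assumptions, not measured data):

```python
import numpy as np

def spectral_angle(pixel, reference):
    """Angle (radians) between two spectra; smaller = better material match."""
    pixel, reference = np.asarray(pixel, float), np.asarray(reference, float)
    cos = pixel @ reference / (np.linalg.norm(pixel) * np.linalg.norm(reference))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical reflectance signatures over (blue, green, red, NIR, SWIR).
library = {
    "vegetation": np.array([0.04, 0.08, 0.05, 0.50, 0.25]),
    "water":      np.array([0.08, 0.06, 0.04, 0.02, 0.01]),
}
pixel = np.array([0.05, 0.09, 0.06, 0.45, 0.22])   # unknown pixel spectrum

best_match = min(library, key=lambda m: spectral_angle(pixel, library[m]))
```

Because the angle ignores the magnitude of the spectrum, a shadowed pixel of a
material still matches that material's signature.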

In multispectral imaging, 8–10 images are captured at the same instant, either by
different sensors calibrated for specific bands or by spectroscopic separation of
the incoming light into wavelength bands (Fig. 2). In this method, optical
intensity is obtained as a discrete function of wavelength, with the discrete
sample points chosen beforehand to extract the most information from the imaged
spectrum. Multispectral imaging is widely used in medical sciences and biology
[13], food safety and quality assurance [14] and, most prominently, in remote
sensing.
A related methodology is hyperspectral imaging [15], which produces a spatial map
with an effectively continuous spectrum per pixel rather than a discrete function
of wavelength (Fig. 3) [16]. This gives more accurate and precise identification
and quantification of molecular-level absorption, made possible by the greater
amount of information available per pixel. Hyperspectral imaging, albeit
requiring much more advanced equipment, is crucial for making fine observations.
In practice, a hyperspectral imaging apparatus captures images in more than 100
bands. Hyperspectral imaging is used extensively in medical science [17] and food
analysis, and in remote sensing prominently for agricultural and military
applications.

1.4 Datasets: Challenges and Opportunities

All machine learning and deep learning models need significant training and
testing datasets. In satellite imaging, the raw data is enormous: raw Level-0
(L0) Synthetic Aperture Radar data for a single image can exceed 5 GB in size.
Processing such data into workable image files requires considerable technical
expertise and computational resources. Another great challenge in creating
datasets for satellite image segmentation is the annotation of the processed
images, which is a largely manual and labour-intensive task. The annotations
differ by the type of segmentation used: semantic segmentation is the simplest
to annotate, as it only needs a mask, whereas panoptic segmentation is a very
time-consuming and repetitive process due to the presence of multiple classes
that need to be individually annotated. There is a lack of software available to
streamline the creation of datasets for satellite panoptic image segmentation
[18]. Much of the literature surveyed made a compelling argument for creating a
new personalised dataset for more accurate results. Ghassemi et al. [19] propose
a novel convolutional encoder-decoder network capable of learning visual
representations using heterogeneous datasets.
Fig. 2 Multispectral imaging

2 Relevant Studies in Image Segmentation

Comprehensive literature has been presented over the last few decades. Saini et
al. [20] carried out an extensive analysis of the different image segmentation
techniques, focused mostly on detecting discontinuity and detecting similarity;
the study also covered edge-based segmentation, region-based segmentation and
watershed transformation. Yuan et al. [21] proposed deep learning-based
multispectral satellite image segmentation for identifying water bodies in
advanced urban hydrological studies, introducing a novel multichannel water body
detection network comprising a multichannel fusion module, an Enhanced Atrous
Spatial Pyramid Pooling module, and Space-to-
Depth/Depth-to-Space operations. Jia et al. [22] proposed RGB histogram-based
image segmentation using Masi entropy as the objective function and concluded
that the multi-strategy emperor penguin optimiser achieved significant
enhancement and exceptional performance.

Fig. 3 Hyperspectral imaging
Boaro et al. [23] applied image segmentation to identify gold exploration areas
in the Amazon River basin using the U-Net algorithm. This was achieved by
utilising the hyperspectral nature of the imaging to isolate the wavelength
bands associated with gold and other mining materials; the study achieved high
precision, recall and accuracy percentages. Raghavendra et al. [24] reviewed the
use of image processing and image segmentation techniques to detect plant leaf
diseases, exploring the need for automation to streamline the delivery of
detections; such automation is essential for end-to-end image segmentation
solutions. Yu et al. [25]
proposed a review of methodologies and challenges of image segmentation. The
survey also segregated the observed methodologies into classic segmentation methods,
co-segmentation methods and semantic segmentation based on deep learning. The
review also emphasised the challenge posed by the significant computational
resources required in training deep learning methodologies. This challenge is
especially acute for applications where near real-time segmentation is needed,
such as high-frame-rate video segmentation.
de Carvalho et al. [18] surveyed the use of panoptic segmentation on remotely
sensed data. The paper emphasised the challenges of applying panoptic
segmentation to remotely sensed data, due to the large size of the images and
areas of interest and the huge number of classes that must be assigned manually
to create a dataset. Remotely sensed data also suffers from the unavailability
of software to generate deep learning samples in the panoptic segmentation
format; the available software is far from user-friendly and uses obscure file
formats. The study therefore proposed a pipeline for generating panoptic
segmentation datasets and annotations, created software that is much friendlier
and streamlines the process, and performed the tasks on urban areas of interest.
The study made a compelling argument for the use of deep learning instead of
traditional machine learning methodologies, achieving a panoptic quality of
64.979 and an average precision of 47.691.
Zhu et al. [26] proposed an algorithm equivalent to an edge detector in image
processing. It calculates the grey values of pixels and determines their
variance, which acts as the condition for locating an edge in the image; on
geographical data, the proposed algorithm produces better results than the Canny
operator. Sivakumar et al. [27] attempted image segmentation on noisy data,
specifically data containing Gaussian and salt-and-pepper noise. The intensity
is used as a midpoint for every pixel, and the results were effective on stone
carving images and historical documents. Fernandes and Cardoso [28] applied deep
learning to ordinal image segmentation of biomedical images. Their approach
provides pixel consistency by addressing spatial arrangements that can be nested
or linear; the results proved better than the U-Net architecture in most cases,
and the model promotes consistency through regularisation mechanisms.
Di Ruberto et al. [29] tackled texture analysis and texture segmentation; they
first addressed issues like detecting the texel (texture element) and its shape,
and then created accurate boundaries using a segmentation algorithm based on
least-squares spline approximation. Texture is used in many computer vision
applications, such as determining surface orientation, inspection and shape
recovery. Pritt et al. [30] demonstrated that traditional object detection
methods were inaccurate for high-resolution satellite imagery and built a deep
learning system consisting of convolutional neural networks together with
additional neural networks adapted to satellite imagery. It is aimed at helping
law enforcement officers detect unlicensed mining operations or illegal fishing
vessels. On the Functional Map of the World (fMoW) dataset of one million images
in 63 classes, including the false detection class, the system achieved an
accuracy of
0.83. Jia et al. [31] proposed a Dynamic Harris Hawks Optimisation with Mutation
Mechanism for satellite image segmentation, comparing eight advanced thresholding
approaches. The segmentation thresholds are established using three criteria:
Kapur's entropy, Tsallis entropy and Otsu's between-class variance.

Fig. 4 Transductive SVM

3 Techniques Used in Satellite Image Segmentation

3.1 Transductive SVM

Transductive SVM (TSVM) implements cluster-based classification using both
labelled and unlabelled data by seeking a hyperplane (Fig. 4) that lies far from
the unlabelled data [32]. However, where the data is not uniform there may be
limited information in the unlabelled points, and a hyperplane far away from
them can prove unsuccessful; the method therefore provides a solution in only
some cases, and TSVM with clustering can give different results for different
cases.
Although TSVM provides better results than SVM, it has certain drawbacks, such
as high computational cost [32]. Another drawback is that these models need to
be constantly retrained on updated samples. For such cases, researchers have
proposed an Online TSVM (OTSVM) model that continually learns from new
unlabelled data and updates its parameters accordingly. This model is more
efficient than previous TSVM models but needs to be tested across various
applications, especially satellite image segmentation.

Fig. 5 U-Net

3.2 U-Net

A U-Net convolutional neural network implemented in TensorFlow has been used by
researchers for medical imaging and satellite image parcel segmentation. The
U-Net network consists of two parts: a contracting CNN path in which 3 × 3
convolutions are applied repeatedly, and an expanding path in which the derived
feature map is merged with an upsampled feature map and then cropped to remove
edge information, yielding a segmented image [33]. This is what gives the
architecture its U shape (Fig. 5) and allows it to segment images based on the
information in them. Data augmentation further improves the overall ability of
the model [34]. The results are shown to be better than random forests and other
CNN methods. However, the model has a few drawbacks, such as overly extensive
boundaries in some segmented images, and it can be improved through further
research on high-resolution images.
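The contract-then-expand structure can be illustrated by tracing feature-map
shapes through a U-Net-style network. This trace assumes the classic
configuration (64 base channels, depth 4, 'same'-padded 3 × 3 convolutions,
2 × 2 pooling); it is a schematic sketch, not the implementation used in the
cited work:

```python
def unet_shapes(input_hw=(256, 256), base_channels=64, depth=4):
    """Trace (H, W, C) through a symmetric U-Net-style encoder/decoder.

    Encoder levels halve H and W (2x2 max-pool) and double the channels;
    the decoder mirrors this, upsampling and concatenating each matching
    encoder feature map through a skip connection."""
    h, w = input_hw
    c = base_channels
    encoder = []
    for _ in range(depth):
        encoder.append((h, w, c))           # kept for the skip connection
        h, w, c = h // 2, w // 2, c * 2
    bottleneck = (h, w, c)
    decoder = []
    for skip_h, skip_w, skip_c in reversed(encoder):
        h, w, c = h * 2, w * 2, c // 2      # up-convolution halves channels
        decoder.append((h, w, c + skip_c))  # concat with skip doubles them
    return encoder, bottleneck, decoder

encoder, bottleneck, decoder = unet_shapes()
# bottleneck is (16, 16, 1024); the final decoder map is (256, 256, 128),
# which a 1x1 convolution then maps to per-pixel class scores.
```

The skip connections are what carry the edge detail lost by pooling back into
the expanding path, which is why U-Net recovers sharp parcel boundaries.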

3.3 Random Walker

For high-resolution image segmentation, the Random Walker algorithm has been
proposed by multiple researchers. Random Walker segments images using several
highlighted markers [35], obtained through a step-by-step process: a marker is
assigned to an unknown pixel or data point with reference to already defined
markers or labelled data points. Noise is identified and removed using grey
values. Random Walker works by randomly traversing a graph in such a way that
the transition probability in each situation remains the same (Fig. 6). The
probability is calculated on the basis of the current node V, the previously
visited node T and the target node. A weight is then calculated whose value,
between 0 and 1, determines whether the probability increases or decreases.
However, research shows that applying this model to ever-larger 3D datasets
becomes tedious. Superpixel features, which group multiple pixels together based
on their visual semantics, help narrow down the feature set and assist in
labelling data [35]. Superpixels provide good results for low-density separation
and TSVM models but show some differences in the case of Random Walker.

Fig. 6 Random Walker
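The marker-and-probability scheme described above follows Grady's random-walker
formulation, which solves a Dirichlet problem on the image graph: each unseeded
pixel gets the probability that a walker starting there reaches a foreground
seed first. Below is a toy dense-matrix sketch for two labels; the Gaussian edge
weight exp(-beta * (g_i - g_j)^2) and beta = 90 are common conventions, not the
cited implementation, and the dense matrices only suit tiny images:

```python
import numpy as np

def random_walker_two_label(img, seeds, beta=90.0):
    """Seeded two-label random-walker segmentation on a small 2-D image.

    seeds: same shape as img; 1 = foreground seed, 2 = background seed,
    0 = unknown. Solves L_uu x = -L_us b for the unseeded probabilities."""
    h, w = img.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    g = img.ravel().astype(float)
    W = np.zeros((n, n))
    # 4-connected grid; edge weight shrinks with the intensity difference.
    for a, b in [(idx[:, :-1].ravel(), idx[:, 1:].ravel()),
                 (idx[:-1, :].ravel(), idx[1:, :].ravel())]:
        wgt = np.exp(-beta * (g[a] - g[b]) ** 2)
        W[a, b] = wgt
        W[b, a] = wgt
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian
    s = seeds.ravel()
    unseeded, seeded = np.flatnonzero(s == 0), np.flatnonzero(s > 0)
    b_fg = (s[seeded] == 1).astype(float)     # P(foreground) = 1 at fg seeds
    x = np.linalg.solve(L[np.ix_(unseeded, unseeded)],
                        -L[np.ix_(unseeded, seeded)] @ b_fg)
    prob = np.empty(n)
    prob[seeded], prob[unseeded] = b_fg, x
    return np.where(prob.reshape(h, w) >= 0.5, 1, 2)
```

Production implementations (for instance scikit-image's `random_walker`) use
sparse Laplacians so the same linear solve scales to full-size images.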

4 Future Scope in Satellite Imaging Segmentation

Satellite imaging is poised to redefine urban planning, defence and environmental
monitoring. In urban contexts, it remains a crucial tool for informed decision-making
in infrastructure, land use, transportation and disaster management. Its real-time
assessment capabilities contribute to optimising resource allocation based on pop-
ulation analysis. In defence, high-resolution satellite imaging plays a pivotal role
in border surveillance, tracking enemy infrastructure development and identifying
potential threats. Furthermore, the technology is instrumental in addressing environ-
mental challenges, detecting illegal constructions, monitoring mining activities and
combatting deforestation globally. As precision and coverage continue to evolve,
satellite imaging emerges as a transformative force shaping decision-making across
diverse sectors. From the comparative analysis of the most relevant sources, we
can map out a few novel future directions. Considering the increasing
requirements and the availability of newer data, we can look to enhance TSVM and
improve the image segmentation ability of Random Walker through the use of
superpixels (Table 3).

Table 3 Comparative analysis

Work: Boaro et al. [16]
Features: Proposes a U-Net method for gold exploration zone segmentation in
high-resolution satellite images, utilising hyperspectral data to isolate
gold-related wavelength bands
Algorithm: UNET, TSVM
Outcomes: The results achieved for segmentation obtained values of 91.29%
precision, 74.64% recall, 76.60% F1-score and 98.34% accuracy

Work: Yu et al. [18]
Features: Reviews three key stages of image segmentation based on segmentation
principles and image data characteristics
Algorithm: Edge detection, Random Walker, TSVM
Outcomes: Addresses challenges and techniques associated with satellite image
segmentation, with a focus on the computational resource demands during the
training of deep learning methodologies

Work: Georgescu et al. [3]
Features: Novel strategy to generate ensembles of different architectures for
image segmentation, by leveraging the diversity of the models forming the
ensemble
Algorithm: U-Net
Outcomes: Achieves peak performance with a Dice coefficient of 0.9130 by
employing the EfficientNet-B1 architecture as the encoder backbone

Work: Artan et al. [25]
Features: Image segmentation using semi-supervised learning with the algorithms
mentioned, along with the use of superpixels
Algorithm: TSVM, LDS, Random Walker
Outcomes: Use of superpixels allowed for lower computational requirements as it
reduced the overall features used by clustering methods

Work: Zhu et al. [20]
Features: Algorithm proposes a threshold value for each pixel, calculated from
grayscale values, which simulates the behaviour of edge detectors
Algorithm: Threshold segmentation, Canny operator
Outcomes: While marginally slower for smaller operators, it outperforms the
Canny operator, exhibiting notable noise reduction capabilities

Work: Kong et al. [28]
Features: A data enhancement technique is applied to the CNN model to improve
the overall ability of image interpretation
Algorithm: CNN, U-Net
Outcomes: Provides a reference for high-resolution satellite image segmentation
as it improves the connectivity of images

Work: ElMasry et al. [12]
Features: Hyperspectral imaging for both automatic target detection and
recognising its analytical composition
Algorithm: Stepwise regression, principal component analysis (PCA)
Outcomes: Hyperspectral imaging represents a major technological advance in the
capturing of morphological and chemical information

Work: Abdel-Basset et al. [4]
Features: Improved and used a novel meta-heuristic equilibrium algorithm to
resolve the optimal threshold value for a grayscale image
Algorithm: Equilibrium optimiser, Kapur's entropy
Outcomes: Concluded that the performance of the proposed algorithm was better
than other existing algorithms for large threshold levels

5 Conclusion

This paper surveys the multitude of techniques used in satellite image
segmentation, implemented using various machine learning models, deep learning
models and image processing techniques. Models using U-Net and Random Walker
were most frequently used for satellite image segmentation. They were shown to be
the most effective, as they can handle high-resolution 3D images that other
models cannot. However, there is considerable scope for improvement, as some
computational setbacks remain in the proposed methods. Most models use the
superpixel feature to reduce the number of features and to group data. There are
a myriad of applications to work on, including image segmentation for rural land
use, detection of subsiding water bodies and detection of illegal mining zones,
in which the Random Walker technique can be implemented. Image segmentation for
these applications can allow a more comprehensive study of various natural
calamities, support relief efforts in affected areas and enable more
preventative measures.

References

1. Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N, Terzopoulos D (2022) Image
segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell
44(7):3523–3542. https://doi.org/10.1109/TPAMI.2021.3059968
2. Pham DL, Xu C, Prince JL (2000) Current methods in medical image segmentation. Ann Rev
Biomed Eng 2(1):315–337. https://doi.org/10.1146/annurev.bioeng.2.1.315
3. Georgescu M-I, Ionescu RT, Miron A-I (2022) Diversity-promoting ensemble for medical
image segmentation
4. Abdel-Basset M, Chang V, Mohamed R (2021) A novel equilibrium optimization algorithm
for multi-thresholding image segmentation problems. Neural Comput Applic 33:10685–10718.
https://doi.org/10.1007/s00521-020-04820-y
5. Ahmed M, Seraj R, Shamsul Islam SM (2020) The k-means algorithm: a comprehen-
sive survey and performance evaluation. Electronics 9(8):1295. https://doi.org/10.3390/
electronics9081295
6. Chen X, Williams BM, Vallabhaneni SR, Czanner G, Williams R, Zheng Y (2019) Learning
active contour models for medical image segmentation. In: 2019 IEEE/CVF conference on
computer vision and pattern recognition (CVPR), Long Beach, CA, USA, pp 11624–11632.
https://doi.org/10.1109/CVPR.2019.01190
7. Wu Y, Li Q (2022) The algorithm of watershed color image segmentation based on morpho-
logical gradient. Sensors 22(21):8202. https://doi.org/10.3390/s22218202
8. Ose K, Corpetti T, Demagistri L (2016) Multispectral satellite image Processing. Opt Remote
Sens Land Surface 57–124. https://doi.org/10.1016/b978-1-78548-102-4.50002-8
9. Fawwaz I, Zarlis M, Suherman Rahmat RF (2018) The edge detection enhancement on satellite
image using bilateral filter. IOP Conf Ser: Mater Sci Eng 308:012052. https://doi.org/10.1088/
1757-899x/308/1/012052
10. Mondini AC, Guzzetti F, Chang K-T, Monserrat O, Martha TR, Manconi A (2021) Landslide
failures detection and mapping using synthetic aperture radar: past, present and future. Earth-
Sci Rev 216:103574. https://doi.org/10.1016/j.earscirev.2021.103574
11. Solari L et al (2020) Review of satellite interferometry for landslide detection in Italy. Remote
Sens 12(8):1351. https://doi.org/10.3390/rs12081351
12. Huang X, Zhang B, Perrie W, Lu Y, Wang C (2022) A novel deep learning method for marine oil
spill detection from satellite synthetic aperture radar imagery. Mar Pollut Bullet 179:113666.
https://doi.org/10.1016/j.marpolbul.2022.113666
13. Levenson RM, Mansfield JR (2006) Multispectral imaging in biology and medicine: Slices of
life. Cytometry Part A 69A(8):748–758. https://doi.org/10.1002/cyto.a.20319

14. Qin J, Chao K, Kim MS, Lu R, Burks TF (2013) Hyperspectral and multispectral imaging
for evaluating food safety and quality. J Food Eng 118(2):157–171. https://doi.org/10.1016/j.
jfoodeng.2013.04.001
15. Nalepa J et al (2021) Towards on-board hyperspectral satellite image segmentation: under-
standing robustness of deep learning through simulating acquisition conditions. Remote Sens
13(8):1532. https://doi.org/10.3390/rs13081532
16. ElMasry G, Sun D-W (2010) Principles of hyperspectral imaging technology. Hyperspect
Imaging for Food Qual Anal Control 3–43. https://doi.org/10.1016/b978-0-12-374753-2.
10001-2
17. Rehman AU, Qureshi SA (2021) A review of the medical hyperspectral imaging systems and
unmixing algorithms’ in biological tissues. Photodiagn Photodyn Ther 33:102165. https://doi.
org/10.1016/j.pdpdt.2020.102165
18. Ghassemi S, Fiandrotti A, Francini G, Magli E (2019) Learning and adapting robust features
for satellite image segmentation on heterogeneous data sets. IEEE Trans Geosc Remote Sens
57(9):6517–6529. https://doi.org/10.1109/TGRS.2019.2906689
19. Saini S, Arora K (2014) A study analysis on the different image segmentation techniques. Int
J Inf Comput Technol 4(14):1445–1452
20. Yuan K, Zhuang X, Schaefer G, Feng J, Guan L, Fang H (2021) Deep-learning-based multi-
spectral satellite image segmentation for water body detection. IEEE J Select Top Appl Earth
Observations Remote Sens 14:7422–7434. https://doi.org/10.1109/jstars.2021.3098678
21. Jia H, Sun K, Song W, Peng X, Lang C, Li Y (2019) Multi-strategy emperor penguin optimizer
for RGB histogram-based color satellite image segmentation using masi entropy. IEEE Access
7:134448–134474. https://doi.org/10.1109/access.2019.2942064
22. Boaro JMC, dos Santos PTC, Serra A, Rego VG, Martins CV, Júnior GB (2021) Satellite
image segmentation of gold exploration areas in the amazon rainforest using U-Net. In:
IEEE international humanitarian technology conference (IHTC), United Kingdom, pp 1–8.
https://doi.org/10.1109/IHTC53077.2021.9698927
23. SSK, Raghavendra BK (2019) Diseases detection of various plant leaf using image process-
ing techniques: a review. In: 2019 5th international conference on advanced computing and
communication systems (ICACCS). https://doi.org/10.1109/icaccs.2019.8728325
24. Yu Y et al (2023) Techniques and challenges of image segmentation: a Review. Electronics
12(5):1199. https://doi.org/10.3390/electronics12051199
25. de Carvalho OLF et al (2022) Panoptic segmentation meets remote sensing. Remote Sens
14(4):965. https://doi.org/10.3390/rs14040965
26. Zhu S, Xia X, Zhang Q, Belloulata K (2007) An image segmentation algorithm in image
processing based on threshold segmentation. In: 2007 Third international IEEE conference on
signal-image technologies and internet-based system. https://doi.org/10.1109/sitis.2007.116
27. Sivakumar V, Murugesh V (2014) A brief study of image segmentation using thresholding
technique on a noisy image. International conference on information communication and
embedded systems (ICICES2014), Chennai, India, pp 1–6. https://doi.org/10.1109/ICICES.
2014.7034056
28. Fernandes K, Cardoso JS (2018) Ordinal image segmentation using deep neural networks.
In: 2018 international joint conference on neural networks (IJCNN). https://doi.org/10.1109/
ijcnn.2018.8489527
29. Di Ruberto C, Rodriguez G, Vitulano S (nd) Image segmentation by texture analysis. In:
Proceedings 10th international conference on image analysis and processing. https://doi.org/
10.1109/iciap.1999.797624
30. Pritt M, Chern G (2017) Satellite image classification with deep learning. In: 2017 IEEE applied
imagery pattern recognition workshop (AIPR). https://doi.org/10.1109/aipr.2017.8457969
31. Jia H, Lang C, Oliva D, Song W, Peng X (2019) Dynamic harris hawks optimization with
mutation mechanism for satellite image segmentation. Remote Sens 11(12):1421. https://doi.
org/10.3390/rs11121421
32. Artan Y (2011) Interactive image segmentation using machine learning techniques. In: 2011
Canadian conference on computer and robot vision. https://doi.org/10.1109/crv.2011.42

33. Chen M-S, Ho T-Y, Huang D-Y (2012) Online transductive support vector machines for clas-
sification. In: International conference on information security and intelligent control. https://
doi.org/10.1109/isic.2012.6449755
34. Siddique N, Paheding S, Elkin CP, Devabhaktuni V (2021) U-Net and its variants for medical
image segmentation: a review of theory and applications. IEEE Access 9:82031–82057. https://
doi.org/10.1109/access.2021.3086020
35. Haoming K, Chunming L, Zhang K (2023) Satellite image parcel segmentation and extraction
based on U-Net convolutional neural network model. In: IEEE international conference on
control, electronics and computer technology (ICCECT). https://doi.org/10.1109/
ICCECT57938.2023.10141307
Landmark Detection Using
Convolutional Neural Network: A Review

Drishti Bharti, Kumari Priyanshi, and Prabhjot Kaur

Abstract Landmark detection is an important task in computer vision and plays an
important role in many applications such as facial recognition, medical image
analysis, and image management. Over the past few years, convolutional neural
networks (CNNs) have become an important method for landmark detection due to
their excellent ability to capture spatial features. This review paper provides
a comprehensive survey of landmark detection methods using CNNs, focusing on
recent advances and challenges in the field. We cover the basics of CNNs and
their applicability to landmark detection, highlighting their ability to learn
hierarchical representations from images. We discuss popular CNN models such as
AlexNet, ResNet, and GoogleNet and explore their advantages and limitations in
landmark detection tasks. The impact of various loss functions, including
squared error and customized losses, on landmark accuracy is analyzed. To better
understand progress in this field, we examine the most widely used datasets and
benchmarks. Throughout the paper we discuss recent developments and achievements
such as the integration of tracking systems, the use of image-based methods, and
the use of deep neural networks for 3D landmark detection. Finally, we describe
current challenges and future directions for using CNNs for landmark detection,
with regard to the need for occlusion handling, scalability, and robustness. We
propose possible research directions involving cross-domain landmark detection
and the combination of multiple data sources.

Keywords CNN · PCA · PDB · Deep learning · Landmark · AlexNet · PoseNet ·
AdaMax · AdaGrad · Neural network · Wing loss

1 Introduction

The development of convolutional neural networks (CNNs), a class of deep
learning models designed for image analysis, has transformed the landscape of
computer vision. CNNs have demonstrated the ability to learn and extract spatial
features, making them the basis for many state-of-the-art landmark detection
methods.

D. Bharti · K. Priyanshi (B) · P. Kaur
Chandigarh University, Mohali, Punjab, India
e-mail: kpriyanshi188@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 157
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_13

Their adaptability to different data types and robustness in processing complex
patterns make CNNs an important choice for landmark detection in many settings.
Landmark detection is an important task in computer vision and has attracted
increasing attention in recent years. It involves identifying and locating
specific points or features in images or datasets and is often used as the basis
for many applications such as facial recognition, medical image analysis, object
tracking, and autonomous systems. It is important to detect landmarks correctly
in settings where the relationship between a landmark and its surrounding
content is strong. This paper provides a general introduction to the field of
landmark detection using CNNs, showing the development of the technology, its
principles, and the important role CNNs play in computer vision. The aim is to
provide an in-depth survey of the state of the art, including the latest
developments, challenges, and future prospects in this dynamic field. In this
paper, we analyze the main aspects of using CNNs for landmark detection,
including network architectures, data augmentation techniques, and loss
functions. We also examine the evaluation metrics and datasets that shape
research in the field, and shed light on tracking methods, structuring methods,
and the extension of CNNs to 3D landmark detection. As we move from one
application to another, it is obvious that landmark detection is far from a
solved problem. From tracking facial features in live video streams to locating
anatomical regions in medical images, different applications present unique
challenges and opportunities. Therefore, we pay attention to adaptations and
modifications specific to CNN-based landmark detection. The journey does not end
there: we also provide an overview of the ongoing challenges and open questions
in the field. Robustness under occlusion, the ability to handle large datasets,
and the interpretability of CNN-based models are topics that require continued
research. Finally, we present promising avenues for future research, including
cross-domain landmark detection and the integration of multi-source data.

2 Review Process

The main goal of landmark detection with convolutional neural networks (CNNs) is to locate and identify relevant landmarks and key points within an image. These landmarks can correspond to various aspects of objects, such as building corners or facial key points. CNNs are widely used for landmark detection because of their ability to efficiently extract spatial features from images. Techniques proposed in recent research papers for improving landmark detection with CNNs include heatmap-based methods, attention mechanisms, augmentation techniques, cascaded architectures, multiscale feature fusion, and ensemble learning.
The general steps for landmark detection using CNNs are given below.
Landmark Detection Using Convolutional Neural Network: A Review 159

2.1 Data Collection and Preparation

This stage involves compiling a variety of landmark photos into a dataset and preprocessing it with the help of annotations. The annotated landmarks are obtained from public repositories or from the ground-truth locations of the landmarks shown in the pictures.

2.2 CNN’s Network Architecture

Create a CNN design that is appropriate for landmark detection; the optimal architecture can be found by trial and error or by combining various CNN building blocks and principles. Convolutional, pooling, nonlinear activation, and fully connected layers make up a CNN's main structure [1].
The image undergoes preprocessing prior to being fed into the neural network via the input layer. Several alternating layers of pooling and convolution are applied to the image as part of this processing. By reducing the number of connections to the convolutional layer and mitigating the layer's extreme sensitivity to position, the pooling layer [2] lowers the computational cost. After that, the fully connected layers perform the classification. A CNN can alternatively be viewed as a neural network consisting of many neurons, the fundamental processing units of an artificial neural network, arranged as in a Multilayer Perceptron (MLP). An MLP consists of an input layer, hidden layers, and an output layer, and is made up of basic unit neurons that communicate with one another by layer-by-layer conduction.
The structure of the MLP represented in Fig. 1 contains k hidden units in addition to n input values and m output values. The input value x_n is delivered in the direction indicated by the arrows. The hidden unit h_k receives its input from the layer before it. The output unit is y_m, and the corresponding target value is y*_m.
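The layer-by-layer conduction described above can be sketched in a few lines of plain Python; a minimal illustration, where the sizes n, k, m, the sigmoid hidden units, and the random weights are all assumptions made for the example:

```python
import math
import random

random.seed(0)

def mlp_forward(x, w_hidden, w_out):
    """Forward pass of a single-hidden-layer MLP."""
    # Hidden units h_k: weighted sum of inputs followed by a sigmoid nonlinearity
    h = [1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
         for w in w_hidden]
    # Output units y_m: weighted sum of the hidden activations
    return [sum(wi * hi for wi, hi in zip(w, h)) for w in w_out]

n, k, m = 3, 4, 2                      # input, hidden, and output sizes
w_hidden = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(k)]
w_out = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(m)]
y = mlp_forward([0.5, -0.2, 0.1], w_hidden, w_out)
print(len(y))  # one value per output unit
```

During training, the discrepancy between y_m and the target y*_m would be propagated back layer by layer to update the weights.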

Fig. 1 Structure of the MLP



2.3 Training the CNN Dataset

This process involves optimizing the network's parameters to reduce the discrepancy between the predicted landmark positions and the ground-truth annotations. The selection of appropriate loss functions and training strategies greatly impacts the performance of CNNs in landmark detection. Loss functions commonly used in this domain include the mean squared error (MSE), the mean absolute error (MAE), and custom loss functions designed for landmark detection.
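As a minimal illustration, MSE and MAE between predicted and ground-truth landmark coordinates can be computed as follows; the sample coordinates are made up for the example:

```python
def mse(pred, truth):
    """Mean squared error over flattened (x, y) landmark coordinates."""
    diffs = [(p - t) ** 2 for pp, tt in zip(pred, truth) for p, t in zip(pp, tt)]
    return sum(diffs) / len(diffs)

def mae(pred, truth):
    """Mean absolute error over flattened (x, y) landmark coordinates."""
    diffs = [abs(p - t) for pp, tt in zip(pred, truth) for p, t in zip(pp, tt)]
    return sum(diffs) / len(diffs)

pred = [(10.0, 12.0), (30.0, 31.0)]   # predicted landmark positions
truth = [(11.0, 12.0), (28.0, 30.0)]  # ground-truth annotations
print(mse(pred, truth), mae(pred, truth))  # 1.5 1.0
```

MSE penalizes large deviations quadratically, while MAE treats all deviations linearly, which is one reason custom losses such as the wing loss interpolate between the two regimes.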

2.4 Landmark Localization

This process involves forwarding the image through the network, extracting features, and applying regression or classification techniques to locate the landmarks. A new input image passes through the trained CNN, which predicts the locations of the landmarks. Figure 2 represents the design of an implemented CNN. It consists of a total of 37 layers for the localization. The size of the input layer was kept at 320 × 240 × 3 (width in pixels, height in pixels, and color channels: red, green, and blue). Within the 37 levels of the CNN, there are ten sets of convolutional, batch normalization, and ReLU layers. In Fig. 2, these three layers are shown together as a single green layer. The sizes and numbers of the filters implemented in the convolutional layers vary considerably: the earliest filter size was the largest at 5 × 5, and the filter size decreases with increasing network depth. The overall implementation of the CNN was trained for 48 epochs with a batch size of 64.

Fig. 2 Convolutional neural network (CNN)-based localization system’s structure



2.5 Post-processing

Post-processing techniques such as non-maximum suppression, or methods that combine several detectors, refine the raw predictions and increase the accuracy of the detected landmark positions in the photographs. Landmark detection with CNNs has various applications, including facial landmark detection for face recognition, pose estimation, object tracking, and augmented reality.
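A minimal sketch of non-maximum suppression over candidate landmark detections, where the function name, coordinates, scores, and distance threshold are all illustrative assumptions:

```python
def nms_points(candidates, min_dist=5.0):
    """Keep the highest-scoring candidates, suppressing nearby weaker ones.

    candidates: list of (x, y, score) tuples.
    """
    kept = []
    for x, y, s in sorted(candidates, key=lambda c: -c[2]):
        # Accept a candidate only if it is far enough from every kept point
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_dist ** 2
               for kx, ky, _ in kept):
            kept.append((x, y, s))
    return kept

cands = [(10, 10, 0.9), (11, 11, 0.6), (40, 40, 0.8)]
print(nms_points(cands))  # the weak detection near (10, 10) is suppressed
```

The same idea applies to heatmap outputs, where local maxima are extracted and nearby weaker peaks are discarded.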

3 Background Study

This section aims to provide an overview of existing research and highlight key
studies, approaches, and findings in this domain.

3.1 Early Approaches to Landmark Detection

Early approaches relied heavily on hand-crafted features, templates, and appearance models. By automating feature learning and enabling end-to-end detection, convolutional neural networks (CNNs) laid the foundation for continued development in computer vision. CNNs have overcome many of the limitations associated with these earlier techniques and ushered in a new era of landmark detection.

3.2 CNN-Based Landmark Detection

CNN-based landmark detection leverages the power of deep learning to extract features from images and pinpoint landmarks. Owing to its stability and accuracy, it has become the go-to method in many fields, including facial analysis, health care, and object tracking. Researchers continue to explore new models, loss functions, and data augmentation techniques to advance the field. "DeepFace: Closing the Gap to Human-Level Performance in Face Verification" is an important project that demonstrates the power of CNNs in face detection and recognition. Its large training set, deep architecture, and training methodology increased the accuracy of face recognition, bringing computer systems closer to human performance in this field.

3.3 Popular CNN Architectures

In this research area, the choice of CNN architecture is often influenced by the specific requirements of the task, the dataset size, and the computational resources available. Researchers exploit each architecture's unique characteristics and capabilities to address problems inherent to landmark detection, such as changes in scale, pose, and lighting. Before training CNN models on landmark data, it is good practice to start from models pretrained on large datasets such as ImageNet and transfer the learned features to the specific detection task.

3.4 Data Augmentation and Preprocessing

Data augmentation and preprocessing play an important role in ensuring that CNN-based detection models can handle diverse and complex real-world situations. They help prevent overfitting by exposing the model to a variety of variations during training and make the model more reliable at detecting landmarks in unseen data. The choice of specific augmentation and preprocessing methods depends on the characteristics of the dataset and the needs of the landmark detection task.
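One common augmentation, the horizontal flip, must also remap the landmark coordinates; a minimal sketch, where the function name, image width, and sample points are assumptions:

```python
def hflip_landmarks(landmarks, width):
    """Mirror landmark (x, y) coordinates across the vertical image axis."""
    return [(width - 1 - x, y) for x, y in landmarks]

points = [(30, 50), (70, 52)]          # e.g. two eye corners
flipped = hflip_landmarks(points, width=100)
print(flipped)  # [(69, 50), (29, 52)]
```

Note that for symmetric landmark sets such as faces, flipping the image also requires swapping left/right landmark indices so that each label still refers to the correct anatomical point.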

3.5 Loss Functions and Training Strategies

The impact of a loss function varies across studies depending on the task, the dataset, and the specific benchmark. Different loss functions and training strategies are often combined in search of the best trade-off, resulting in accurate and robust landmark localization. The choice of loss function typically depends on the specific characteristics of the landmark detection problem to be solved.

3.6 Benchmark Datasets

These landmark detection datasets are significant in benchmarking algorithms


because they cover a range of scenarios, including variations in pose, expression,
occlusion, and lighting conditions. Researchers can use these datasets to assess
the accuracy, robustness, and generalization capabilities of landmark detection
algorithms, ultimately driving advancements in the field and improving algorithm
performance in real-world applications.

3.7 Domain-Specific Applications

These studies demonstrate the versatility of CNN-based landmark detection across application domains. Whether in facial analysis, medical imaging, hand pose estimation, body pose estimation, or animal tracking, CNNs have proven to be very useful for landmark detection, making them an important technology in computer vision. These applications demonstrate the potential of CNN-based landmark detection for solving real-world problems in many fields. Research [3] makes significant contributions to the field of face detection and real-time computer vision.
It appears that well-designed deep CNNs combined with cascading techniques can achieve high accuracy and speed across application areas. This has implications for applications where real-time performance is critical, such as facial recognition, tracking, and emotion analysis. Furthermore, using LBF regression techniques, the study shows that integrating deep learning with classical computer vision models can solve new problems, underscoring the complementary strengths of the two approaches. Overall, this work provides a solid foundation for current face detection and encourages future research in this field.

4 Literature Review

A landmark is a distinctive visual indication used to identify a specific object. Different machine learning techniques are used in landmark classification to identify it. Essentially, it is an extended form of image categorization carried out using reliable methods. The convolutional neural network (CNN) is a useful tool for categorizing these pictures. It is made up of various interconnected layers that process the supplied data. The first layer, known as the convolutional layer, applies several filters to the input data in order to find patterns.
The subsequent convolutional and pooling layers shrink the size of the feature maps, which makes the model more computationally efficient.
The final CNN softmax layer generates a probability distribution across all potential class labels for the input data.
Classification algorithms require a lot of computing power, which can be reduced or optimized by combining convolutional neural networks (CNNs) with deep learning techniques. Deep learning is an innovative branch of machine learning that uses algorithms built on artificial neural network topologies. Among all machine learning techniques, the performance of neural networks has been shown to be more effective and reliable for identifying photos.
In order to create such models, we must generally follow these steps:
• Select a dataset; setting up a dataset on which to base our models is the first and most important step. There are several websites where we may obtain one, but Kaggle is one of the most well-known. It offers a sizable collection of datasets that can be worked with simply by downloading them into a directory.
• After selecting a dataset, the next step is to prepare it for training. Before deploying any machine learning algorithm, we must first validate it, which takes place in two steps. Training datasets are used to determine the model's accuracy.
• The following stage is to create training data and apply labels to it; we then feed the training data and labels to the CNN. These labels are applied to categorical data by transforming the categorical class labels into one-hot encoded vectors.
• Start defining and training the CNN model.
• Test the model's accuracy.
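The one-hot encoding step above can be sketched in plain Python; the class names are made up for the example:

```python
def one_hot(labels, classes):
    """Convert categorical class labels into one-hot encoded vectors."""
    index = {c: i for i, c in enumerate(classes)}
    return [[1 if index[label] == i else 0 for i in range(len(classes))]
            for label in labels]

classes = ["temple", "tower", "bridge"]
vectors = one_hot(["tower", "temple"], classes)
print(vectors)  # [[0, 1, 0], [1, 0, 0]]
```

Each vector has exactly one active position, which matches the probability-distribution output of the softmax layer during training.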
As was previously said, classical classification techniques have a poor level of accuracy and need a lot of computational resources. When classifying images, aspects such as picture variety, image size, and hardware are taken into account. The more accurate a model is, the better it is regarded. Using deep learning techniques is one strategy for increasing the accuracy of a classification model; however, none of these optimization methods can guarantee the best accuracy or efficiency in terms of time.

4.1 Landmark Detection Method

Classical models: This model, also known as the template-based approach, was created for landmark point recognition with a focus on face alignment. It essentially makes use of Principal Component Analysis (PCA) to simplify the problem and gain a general understanding of the faces, in order to iteratively model variations in facial landmarks. Many studies have been conducted on these models, such as the 2.5D AAM described by Martins et al. [4], which combines a 3D metric point distribution model with a 2D appearance model in order to recognize 3D deformable faces in 2D images. The classical model was shown to be an effective landmark detection method for face alignment, but it has some limitations: such models cannot detect landmarks reliably on unseen datasets, since they are very sensitive to occlusion and illumination.
Coordinate-based regression models: One of the best techniques for locating landmarks, this approach directly regresses the coordinate vectors from the input image. A loss function that, when paired with data augmentation and pose-based data balancing (PDB), can outperform the L2 loss was first described by Feng et al. [5] under the name wing loss. Zhang et al.'s work in this area was innovative: they introduced the [3] technique for face alignment, which is used to handle partial occlusion in the input image. Fard and Mahoor suggested a lightweight CNN-based facial landmark detection technique that succeeded in offering high accuracy. Since no additional post-processing is necessary after the landmarks are generated, the coordinate-based method was determined to be the most successful for landmark detection compared to alternative techniques. However, it has certain drawbacks as well, such as reduced accuracy when spatial information is lost during the intermediate stages.
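The wing loss of Feng et al. [5] behaves logarithmically for small errors and linearly for large ones; a sketch of the element-wise loss, where the values w = 10 and ε = 2 are commonly quoted settings, not necessarily the ones used in every experiment:

```python
import math

def wing_loss(error, w=10.0, eps=2.0):
    """Wing loss for one coordinate error: logarithmic region for small
    |error|, linear region (offset by c for continuity) for large |error|."""
    c = w - w * math.log(1 + w / eps)   # makes the two pieces meet at |error| = w
    x = abs(error)
    return w * math.log(1 + x / eps) if x < w else x - c

print(wing_loss(0.0))                    # zero loss at a perfect prediction
print(wing_loss(25.0) > wing_loss(5.0))  # larger errors still cost more
```

The logarithmic region amplifies the gradient for small and medium errors, which is what gives the wing loss its edge over plain L2 for precise landmark localization.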
Heatmap-based regression models: This model is another method for the task of detecting landmarks. To apply it, a particular network is selected and trained to produce a heatmap for each input image. Yang et al. [2] presented an efficient technique for predicting heatmaps that consists of a two-part network: a supervised transformation used to standardize faces, followed by a stacked hourglass network. In their work on heatmap prediction, Valle et al. [2] introduced a straightforward CNN to produce heatmaps. A number of factors need to be taken into account when recognizing landmarks, such as accuracy and shape variations. Sun et al. devised an algorithm called HRNet [5] that is highly accurate at detecting facial landmarks and performing many other computer vision tasks. Iranmanesh et al. proposed a solution to the problem of shape variations in detection by introducing an algorithm that could align a collection of photos and record reliable landmarks. Gaussian heatmap vectors [6], a notion presented by Xiong et al., give a heatmap that is used for facial landmark point recognition and is a highly preferred type of heatmap.
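A minimal sketch of generating the Gaussian heatmap these models regress for a single landmark; the grid size and sigma are illustrative assumptions:

```python
import math

def gaussian_heatmap(cx, cy, width, height, sigma=2.0):
    """2D grid peaking at the landmark (cx, cy), decaying with distance."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

hm = gaussian_heatmap(4, 3, width=8, height=6)
peak = max(v for row in hm for v in row)
print(peak, hm[3][4])  # the maximum value sits at the landmark location
```

At inference time the process is inverted: the predicted heatmap's argmax (often refined to sub-pixel precision) gives the landmark coordinate.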

4.2 CNN Algorithms

LeNet: This algorithm was developed for the purpose of reading handwritten characters. It was first presented in the late 1990s [2]. Its component parts are three CNN layers and a softmax classifier. It was one of the earliest applications of deep learning to computer vision and has been extensively used to read numbers from grayscale input photos of checks.
AlexNet: With the help of the effective AlexNet algorithm, the previous method's error rate dropped from 26% to 15%. Researchers at the University of Toronto [2] put forward this algorithm. It is particularly effective at showcasing deep learning techniques when applied to conventional computer vision tasks.
ResNet: ResNet, which stands for Residual Neural Network, is a member of the deep CNN family. It is made up of various residual blocks, each of which is an arrangement of convolutional layers followed by an activation function. This arrangement is augmented with a shortcut connection that can skip the convolutional layers and add the original input directly to the convolutional layers' output after the activation function. ResNet is a streamlined algorithm that relies on learning residual functions to produce results. Its residual blocks have made it simpler to train very deep networks with many layers, which has helped to solve the vanishing gradient problem.

GoogLeNet: A sophisticated version of CNN, it is known as a deep convolutional neural network. Google researchers came up with it in 2014. Due to its optimized top-five error rate of 6.67%, it was given the [6] that same year. It is well recognized for its use of the Inception module. This module is made up of several parallel convolutional layers with various filter sizes, followed by pooling layers and output concatenation. The design of GoogLeNet makes it a practical and computationally economical technique because feature learning takes place at various scales and resolutions, which ultimately manages the cost. Overfitting is also avoided through the use of auxiliary classifiers at the intermediate layers.

4.3 Optimization Techniques

Every model should offer optimized accuracy while using the fewest computational
resources possible. The outcomes generated by these models should also have the
highest possible efficiency. To attain these goals, a number of optimization strategies
can be taken into consideration. Some of them are listed below:
AdaMax: This is an advanced modification of the Adam optimizer that uses first-order gradient-based optimization. The input data determines how the learning rate changes in this technique, and this type of optimizer is highly recommended in situations where the process changes over time. According to [7], a model was put into use on the Google Landmark Dataset, and various optimization strategies were employed to measure accuracy and efficiency and how they changed as the size of the training dataset increased and decreased. The models used 150 epochs and a batch size of 300. When employing the ResNet approach with the AdaMax optimizer, an accuracy of 95% was achieved; other techniques, such as MobileNet and VGG16, were also used, but ResNet50 was determined to have the highest accuracy.
Adam Optimizer: The standard method for optimizing CNN models was Stochastic Gradient Descent (SGD), but as technology advanced, new variations and improvements were made. One such method is the Adam optimizer, an enhanced version of SGD. Natural language processing, computer vision, and other deep learning applications all make extensive use of it. It was initially presented in 2014 and was given this name because it estimates the first and second moments of the gradient when calculating the learning rate of each weight in the neural network. Before its introduction, a number of other techniques were in use, including RMSprop and AdaGrad, which perform better than SGD. Despite the many benefits of RMSprop and AdaGrad, they also have drawbacks; for example, SGD generalizes better than either of them. Thus the need for new optimization approaches was felt, and Adam was consequently introduced. Adam has drawbacks of its own: its performance suffers on various deep learning tasks.

AdaGrad Optimizer: AdaGrad, or the Adaptive Gradient Algorithm, is an optimization technique that uses previous gradient information to determine the learning rate for each parameter separately. It performs well when some features have sparse gradients, that is, when they are updated infrequently. In addition to this benefit, there are limitations: over time, the accumulated squared gradients in the denominator compound, making the effective learning rates extremely small. This can lead to an early slowdown of the algorithm, particularly in the later phases of training. Later optimization algorithms, such as RMSprop and Adam, were designed to solve this problem and improve upon adaptive learning rate techniques.
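The per-parameter AdaGrad update can be sketched as follows; the gradients, learning rate, and function name are illustrative assumptions:

```python
import math

def adagrad_step(params, grads, cache, lr=0.1, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, scale each step."""
    for i, g in enumerate(grads):
        cache[i] += g * g                       # all historical squared gradients
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)
    return params, cache

params, cache = [1.0, -2.0], [0.0, 0.0]
for _ in range(3):          # repeated identical gradients: steps shrink as cache grows
    params, cache = adagrad_step(params, [0.5, -0.5], cache)
print(params, cache)
```

Because `cache` only ever grows, the effective step size monotonically shrinks, which is exactly the slowdown described above that RMSprop and Adam address with decaying averages.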
RMSprop Optimizer: Root Mean Square Propagation, or RMSprop, is an adaptive learning rate optimization approach created to overcome some of AdaGrad's drawbacks, one of which is that learning rates can become too tiny over time. Geoffrey Hinton introduced RMSprop in a lecture of his online neural network course. By employing an exponentially decaying average of the squared gradients, RMSprop outperforms AdaGrad: because the running average only takes into account a portion of the historical gradients rather than aggregating all historical squared gradients, the learning rates are kept from becoming too tiny.
RMSprop is a well-liked optimization algorithm that is frequently applied in practice. However, additional improvements, such as the addition of momentum, led to the creation of algorithms like Adam, which integrates momentum and RMSprop concepts for better performance in a variety of contexts.
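Adam's combination of momentum (a first-moment estimate) and RMSprop-style scaling (a second-moment estimate) can be sketched for a single parameter; the hyperparameters follow the commonly cited defaults, and the gradient sequence is made up:

```python
import math

def adam_step(p, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter p with gradient g at step t."""
    m = b1 * m + (1 - b1) * g          # momentum: running mean of gradients
    v = b2 * v + (1 - b2) * g * g      # RMSprop: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction for the early steps
    v_hat = v / (1 - b2 ** t)
    p -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adam_step(p, 0.5, m, v, t)
print(p)  # the parameter moves against the gradient direction
```

The bias-correction terms matter most in the first few steps, when the running averages are still dominated by their zero initialization.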
As per [7], a study was conducted with the aim of testing the Google Landmark Dataset. Five distinct models and five different optimizers were employed, and every combination was run in that investigation. The study's batch size was maintained at 300, and the number of epochs was fixed at 150. The best accuracy of 95.6% was obtained on the test dataset using the ResNet50 model with the AdaMax optimizer, while the best accuracy of 95.22% was obtained on the validation dataset using the ResNet50 model with the Adam optimizer. Various accuracy tables were filled out and analyzed.
When utilizing the RMSprop optimizer with the VGG16 model, the test and validation datasets yielded the lowest accuracy, 70.01%.
The low accuracy of this model is attributed to the unstable training accuracy relative to the validation accuracy at each epoch; large, erratic training loss relative to the validation loss at each epoch is an additional contributing cause.

4.4 Applications

A landmark is a visual symbol used to identify different objects and is an important input element for CNN algorithms. These landmarks can be facial points or points on monuments or other material objects, and we can use them for identification. Since there are a lot of images on the Internet, identifying or categorizing each one is a complex task for humans; advanced machine learning techniques are therefore used to perform it, and when CNNs are integrated with deep learning, the outcome is robust performance. The following are applications of CNNs for detecting landmarks.
Both non-medical and medical image processing can greatly benefit from landmark point detection. Landmarks can be visual points based on facial features: the landmarks related to a face are first extracted, and then a CNN is applied to detect the particular face. This has been widely used to identify unusual images on any site. Apart from recognizing the human face, there are several other applications: landmark points can also be obtained from X-ray images, which helps in analyzing the sagittal cervical spine, and body joint tracking can be performed as well. Non-medical applications include the classification of ancient monuments and more. For example, research on classifying ancient temples was conducted in Indonesia using a CNN together with SGD [6]; that model achieved an aggregate accuracy of 86.28%, obtained when 100 epochs and 6 classes were trained. It was also noted that at epoch 50 the results were optimal, because the training accuracy reached 98.99% and the validation accuracy reached 85.57%. All these implementations were done using the AlexNet architecture.

5 Result Analysis

5.1 Periodic 3D CNN for Mitral Annulus Segmentation and Anatomical Orientation Detection in TEE Images [8]

The model selected for the test set achieved a weighted mean coordinate prediction error of 1.96 ± 1.62 mm [8]. The weighted curve-to-curve prediction error was 1.82 ± 0.70 mm [8]. Figure 3 depicts the distribution of coordinate errors over the test set for a better understanding.
Figure 3 uses box plots to illustrate the coordinate prediction errors observed in each of the 19 examinations in the test set.
Top: In-plane errors are shown, where the measure takes into consideration predictions for every plane and time period.
Bottom: Curve-to-curve errors are shown, with each frame contributing a separate error value.
The green triangles in the plot show the average error for each examination, while the boxes show the median error and quartiles. Whiskers extend to 1.5 times the interquartile range. To preserve clarity and keep the emphasis on the primary findings, outliers are omitted from the charts.
The model also attained a surgical view prediction error of 3.28 ± 2.92° and a relative perimeter error of 5.8 ± 4.8%. The metrics for predicting coordinates are given in Table 1.

Fig. 3 Box plots depicting coordinate prediction discrepancies across 19 test examinations [8]

Table 1 Coordinate prediction results [8]

Measurement error (unit)      | Schneider et al. | Tiwari and Patwardhan | Zhang et al.  | Andreassen et al. | Proposed (this)
Annulus in plane [mm]         | 1.81 ± 0.78+     | 2.59                  | 1.57          | 2.04 ± 1.87+      | 1.96 ± 1.62
Annulus curve-to-curve [mm]   | –                | –                     | 3.49 ± 2.21   | 1.94 ± 0.82+      | 1.82 ± 0.70
Surgical view [degrees]       | –                | –                     | 9.62 ± 10.46  | 3.26 ± 2.26+      | 3.28 ± 0.78
Perimeter [%]                 | –                | –                     | 10 ± 16       | 6.1 ± 4.5         | 5.8 ± 4.8
No. of test set volumes       | 10               | 15                    | 432           | 135               | 135

Out of 128 planes, the test set's weighted mean error of the anatomical orientation predictions was 9.7 ± 15.8 degrees, or 3.5 ± 5.6 plane indices. Additionally, the median prediction error was 5.6 degrees, or 2 plane indices. Table 2 provides a summary [8], and Fig. 4 provides a visual representation of the results for each examination in the form of a box plot.
each examination in the form of a box plot.
The outcomes of the anatomical orientation prediction are shown in two ways in Table 2:
• First row: the error in degrees of rotation.
• Second row: the error measured in plane indices, which can be described as a fraction of the 128 rotational planes.

Fig. 4 Results of anatomical orientation prediction for test set inspections [8]

Table 2 Anatomical orientation prediction results [8]

Anatomical orientation errors | Weighted mean | Median
Degrees                       | 9.7° ± 15.7°  | 5.6°
Planes                        | 3.5 ± 5.6     | 2 planes

5.2 Deep Convolutional Neural Networks for Sagittal Cervical Spine Landmark Point Detection in X-Ray Images [9]

In this part, we start by evaluating the precision of PoseNet, the network design introduced in [9]. We then concentrate on determining how accurate the suggested loss function is and conduct a thorough analysis of the performance of the PoseNet model.
PoseNet evaluation: We first compare the precision of the PoseNet model, trained using the common L2 loss function, to two well-known baseline models: MobileNetV2 and ResNet50. Additionally, the mnv2-hm and res50-hm encoder–decoder models are included to ensure a fair comparison. The encoders of these models are MobileNetV2 and ResNet50, and the decoders consist of three sets of DeConv2D layers with filter sizes of 256, kernel sizes of 3, and strides of 2, followed by a ReLU activation layer and batch normalization.
As given in Table 3, the heatmap-based models [9] used for detecting landmark points on the sagittal cervical spine are much more accurate than the coordinate-based regression models. Notably, the Normalized Mean Errors (NME) of mnv2-hm on the LC, LCE, and LCF subsets are 5.60%, 6.42%, and 10.56%, respectively. With res50-hm as the model, these values decrease to 4.79% (about a 0.81% reduction), 5.20% (about a 1.22% reduction), and 7.68% (about a 2.88% reduction). PoseNet further reduces the NME to 4.75%, 5.21%, and 7.48%, respectively, representing reductions of around 0.85%, 1.21%, and 3.08% compared to mnv2-hm on the three subsets.
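The Normalized Mean Error reported here is, in one common formulation, the mean point-to-point landmark distance divided by a normalizing length; a hedged sketch, where the choice of normalizer and the sample points are assumptions, since [9] may normalize differently:

```python
import math

def nme(pred, truth, norm):
    """Mean point-to-point error divided by a normalizing distance
    (e.g. an inter-ocular distance or a bounding-box diagonal)."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / (len(dists) * norm)

pred = [(10.0, 10.0), (20.0, 22.0)]
truth = [(10.0, 13.0), (20.0, 18.0)]
print(round(nme(pred, truth, norm=100.0) * 100, 2))  # NME as a percentage: 3.5
```

The companion metrics in Table 3 follow from the same per-image errors: the failure rate (FR) counts images whose NME exceeds a threshold, and the AUC integrates the cumulative error curve up to that threshold.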

Table 3 Examination of the differences between various models trained using L2 loss, including
normalized mean error (NME), failure rate (FR), and area under the curve (AUC) [9]
Model NME (↓) FR (↓) AUC (↑)
LC LCE LCF LC LCE LCF LC LCE LCF
mnv2 6.47 7.35 11.04 7.49 9.16 41.96 0.6100 0.5270 0.2069
res50 6.12 6.88 10.75 7.02 8.41 39.60 0.6375 0.5721 0.2192
mnv2-hm 5.60 6.42 10.56 3.84 6.66 36.60 0.6971 0.6227 0.2490
res50-hm 4.79 5.20 7.68 2.92 2.5 16.07 0.7375 0.7527 0.4862
PoseNet 4.75 5.21 7.48 2.77 3.33 9.82 0.7602 0.7385 0.5067

PoseNet is more accurate than res50-hm on both the LC and LCF subsets, but marginally less accurate on the LCE subset. However, it is important to highlight that, compared to res50-hm, PoseNet has far fewer model parameters and fewer floating-point operations (FLOPs), as seen in Table 4.
Assessment of IC-loss: The PoseNet model was trained with three distinct loss functions, L1, L2, and the recently developed IC-loss function, in order to evaluate the effectiveness of the suggested loss. As given in Table 5, the Normalized Mean Error (NME) of the model trained with the L2 loss function is 4.75%, 5.21%, and 7.48% on the LC, LCE, and LCF subsets, respectively. When the L1 loss function is used, a minor increase in performance is seen, with the NME decreasing to 4.69%, 5.20%, and 7.25% on the same subsets.

Table 4 Comparative analysis of model parameter and FLOP counts [9]
Model #Parameters #FLOPs
mnv2-hm 6,398,045 683,025,408
res50-hm 29,497,245 3,066,262,528
PoseNet 23,226,269 596,375,552
PoseNet and mnv2-hm have fewer model parameters and floating-point operations (FLOPs) than res50-hm

Table 5 Analysis of PoseNet trained using various loss functions in terms of normalized mean error (NME), failure rate (FR), and area under the curve (AUC) [9]

Loss    | NME (↓)           | FR (↓)            | AUC (↑)
        | LC    LCE   LCF   | LC    LCE   LCF   | LC      LCE     LCF
L2      | 4.75  5.21  7.48  | 2.77  3.33  9.82  | 0.7602  0.7385  0.5067
L1      | 4.96  5.20  7.25  | 2.61  3.33  12.50 | 0.7654  0.7395  0.5343
IC-loss | 4.38  4.76  6.50  | 2.51  3.33  6.25  | 0.7882  0.7760  0.6034
172 D. Bharti et al.

The most noticeable improvement in accuracy comes from training the model with our suggested IC-loss function. With this method, the NME is significantly decreased to 4.38%, 4.76%, and 6.50% for the LC, LCE, and LCF subsets, respectively. This demonstrates the superior performance of the IC-loss function in contrast to the L1 and L2 losses, highlighting its effectiveness in raising the PoseNet model's accuracy.
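The NME compared throughout these tables can be sketched as a simple landmark-wise computation. The normalizing distance (e.g., the inter-ocular distance) is an assumption of this sketch; [9] may use a different normalization.

```python
import math

def nme(pred, gt, norm_dist):
    """Normalized Mean Error: mean Euclidean distance between predicted and
    ground-truth landmarks, divided by a normalizing distance (here assumed
    to be e.g. the inter-ocular distance), expressed as a percentage."""
    total = 0.0
    for (px, py), (gx, gy) in zip(pred, gt):
        total += math.hypot(px - gx, py - gy)
    return 100.0 * total / (len(pred) * norm_dist)
```

For two landmarks with errors 0 and 5 pixels and a normalizing distance of 10 pixels, this yields an NME of 25%.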

References

1. Cristinacce D, Cootes TF (2006) Feature detection and tracking with constrained local models. In: Proceedings of the British machine vision conference, vol 3
2. Yang J, Liu Q, Zhang K (2017) Stacked hourglass network for robust facial landmark localisation. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops (CVPRW), pp 79–87
3. Ruder S (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, pp 1–14
4. Martins P, Caseiro R, Batista J (2013) Generative face alignment through 2.5D active appearance models. Comput Vis Image Understand 117(3):250–268
5. Feng Z-H, Kittler J, Awais M, Huber P, Wu X-J (2018) Wing loss for robust facial landmark localisation with convolutional neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2235–2245
6. Trigeorgis G, Snape P, Nicolaou MA, Antonakos E, Zafeiriou S (2016) Mnemonic descent method: a recurrent process applied for end-to-end face alignment. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 4177–4187
7. Landmark classification service using convolutional neural network and Kubernetes, p 2820
8. Mitral annulus segmentation and anatomical orientation detection in TEE images using periodic 3D CNN. IEEE J Mag, IEEE Xplore
9. Sagittal cervical spine landmark point detection in X-ray using deep convolutional neural networks. IEEE J Mag, IEEE Xplore
An Efficient Illumination Invariant Tiger
Detection Framework for Wildlife
Surveillance

Gaurav Pendharkar, A. Ancy Micheal, Jason Misquitta,


and Ranjeesh Kaippada

Abstract Tiger conservation necessitates the strategic deployment of multifaceted initiatives encompassing the preservation of ecological habitats, anti-poaching measures, and community involvement for sustainable growth in the tiger population. With the advent of artificial intelligence, tiger surveillance can be automated using object detection. In this paper, an accurate illumination-invariant framework based on EnlightenGAN and YOLOv8 is proposed for tiger detection. The fine-tuned YOLOv8 model achieves a mAP score of 61% without illumination enhancement, and illumination enhancement improves the mAP by 0.7 percentage points. The approach elevates the state-of-the-art performance on the ATRW dataset by approximately 6–7 percentage points.

Keywords Tiger surveillance · Computer vision · Object detection · Illumination enhancement · Generative adversarial networks

1 Introduction

Tigers are iconic symbols of the rich biodiversity of Asia, with most of their populations primarily concentrated in countries like India, Thailand, and Indonesia [1]. These magnificent cats are apex predators and vital indicators of the health of an ecosystem. However, their conservation faces multiple challenges due to habitat loss, illegal trade, poaching, and extensive tourism [2].

G. Pendharkar (B) · A. A. Micheal · J. Misquitta · R. Kaippada


Vellore Institute of Technology, Chennai, Tamil Nadu 600127, India
e-mail: gauravsandeep.p2020@vitstudent.ac.in
A. A. Micheal
e-mail: ancymicheal.a@vit.ac.in
J. Misquitta
e-mail: jason.misquitta2020@vitstudent.ac.in
R. Kaippada
e-mail: ranjeesh.kaippada2020@vitstudent.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 173
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_14
174 G. Pendharkar et al.

In the early twentieth century, there were more than 100,000 wild tigers in Asia; the population has since drastically declined to fewer than 4000. This precipitous decline is mainly due to a 93% reduction in tiger habitats as a result of deforestation and agricultural expansion. Furthermore, the poaching of tigers for the illegal trade in tiger specimens and retaliatory attacks during human-wildlife conflicts are other reasons behind the sudden decline in the population [3]. Conservationists have made commendable progress in tiger conservation, but with the advent of artificial intelligence, many of their protocols can be efficiently automated through the implementation of wildlife conservation systems [4].
Wildlife surveillance systems are a crucial tool for the conservation of endangered species, especially tigers, which are characterized by elusive behavior. To automate tiger detection in wildlife surveillance, accurate object detection plays a vital role. Changing lighting conditions are one of the major challenges for efficient object detection in wildlife surveillance. Low Illumination Images (LIIs) tend to lack clarity since they are sampled and quantized in low-light environmental conditions. Traditional non-deep-learning methods do not adapt to the pixel distribution of the image and can over-enhance the illumination. Many traditional deep learning approaches require paired supervision to train models based on the Retinex theory [5]. However, a few approaches based on Generative Adversarial Networks (GANs) improve the illumination by learning unsupervised features. Therefore, in this paper, EnlightenGAN [6] is used to enhance the illumination of the images, followed by object detection with the YOLOv8 model [7].

2 Related Work

Tiger Detection: A fast yet efficient approach for real-time tiger detection was proposed by Kupyn and Pranchuk [8]. Their TigerNet model was developed based on the Feature Pyramid Network (FPN) and uses Depthwise Separable Convolutions (DSC) with a lightweight FD-MobileNet backbone [9]. The lightweight backbone improves detection speed, which is commendable, but also slightly reduces accuracy. Tan et al. constructed a dataset of video clips taken by infrared cameras installed in the Northeast Tiger and Leopard National Park, covering 17 species [10]. The main aim of the paper was to compare the performance of three mainstream object detection models: FCOS ResNet101, YOLOv5, and Cascade R-CNN HRNet32. YOLOv5 showed the most consistent results among the three models, obtaining 88.8%, 89.6%, and 89.5% accuracy at various thresholds. The overall high detection accuracy is commendable, but the model suffers from data imbalance and therefore shows significant variance in species-wise performance. However, the models performed relatively well for Amur tigers compared with other animals.
An Efficient Illumination Invariant Tiger Detection Framework … 175

Liu and Qu introduced AF-TigerNet, a lightweight neural network designed for the real-time detection of Amur tigers [11]. The approach improves feature extraction by incorporating an updated CSPNet [12] and cross-stage partial (CSP) path aggregation network (PAN) architectures. Dertien et al. proposed TrailGuard, a novel system developed to detect tigers and poachers and alert the relevant government authorities [13]. This innovative system boasts an impressive response time of just 30 s from the moment the camera is triggered to the notification appearing on a smartphone app.
Low Illumination Image Enhancement: Al Sobbahi and Tekli reviewed different approaches to enhance the quality of LIIs and their effect on object detection [14]. The traditional approaches include gamma correction, histogram equalization, models based on Retinex theory, and frequency-based methods. They also discuss deep learning approaches consisting of encoder-decoders, Retinex theory-based models, Generative Adversarial Networks (GANs) [15], and zero-reference models. Among all the approaches, histogram equalization, EnlightenGAN, and GladNet [16] were the ones independent of pairwise supervision. The relative performance of the models, their examples, and the explainable categorization were commendable, but the open-source availability of the models was not discussed. Choudhury et al. highlighted the increased demand for surveillance to protect endangered species [17]. Furthermore, they emphasized that degraded image quality with respect to bad contrast, high noise, reflectance, and poor illumination severely affects object detection. They utilized EnlightenGAN as a preprocessing step before object detection with YOLOv3 [18] in an automatic detection system. The GAN is evaluated using BRISQUE and NIQE, with mAP, GIoU loss, F-measure, and objectness loss as metrics for the rhino detector.
Wang et al. experimented with different techniques to enhance detection in low-light scenarios [19]. Their exploration of different image enhancement algorithms revealed that EnlightenGAN and Zero-DCE++ [20] worked best in conjunction with the YOLOv5 model, giving a precision of 0.747. The work appreciably improves the performance of image enhancement techniques combined with traditional versions of YOLO models.

3 Methodology

Illumination variation poses a difficult problem for object detection. In wildlife surveillance, changing lighting conditions result in less efficient animal detection. This paper focuses on effective tiger detection by addressing illumination variation. Image restoration and enhancement methods have evolved appreciably over the past years. However, most state-of-the-art deep learning approaches require pairwise training data, which is in most cases infeasible for real-world problems. EnlightenGAN is one approach that uses unsupervised learning-based generative adversarial networks (GANs) to devise a mapping between low-light and normal-light images. An attention-guided U-Net is used for image generation, together with two discriminators, namely a global and a local discriminator. Hence, in this paper, EnlightenGAN is applied to handle illumination variation, followed by object detection with YOLOv8. The framework of the proposed architecture is shown in Fig. 1.

Fig. 1 Architecture diagram

3.1 EnlightenGAN

EnlightenGAN is an unsupervised learning-based GAN that uses an attention-guided U-Net as the generator and two discriminators, namely the global and local discriminators. Figure 2 depicts the architecture of EnlightenGAN. The model takes a low-light image A as input and computes its normalized illumination component I. It then computes the self-regularized map by an element-wise subtraction, 1 − I. The input image and the self-regularized map are concatenated and given as input to the generator to obtain the reconstructed image A′. An element-wise multiplication and an element-wise addition are carried out on A′ as shown in Eq. (1):

B = (A′ ⊗ I) ⊕ A    (1)

The final reconstructed image B is given as input to the global discriminator, and some randomly cropped patches are given as input to the local discriminator. Finally, both discriminators return a true or false decision. The loss function used to train the network is shown in Eq. (2), in which L_SFP^Global and L_SFP^Local denote the self-feature-preserving losses for the global and local discriminators, respectively, and L_G^Global and L_G^Local denote the losses of the global and local discriminators. Sample images illuminated by EnlightenGAN are shown in Fig. 3.

Loss = L_SFP^Global + L_SFP^Local + L_G^Global + L_G^Local    (2)
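A minimal pixel-wise sketch of the composition in Eq. (1), assuming grayscale images stored as 2-D lists of floats in [0, 1] and treating the normalized input intensity itself as the illumination component I (an assumption of this sketch, not a detail stated above):

```python
def compose(a, a_prime):
    """Eq. (1): B = (A' ⊗ I) ⊕ A, with ⊗ and ⊕ as element-wise
    multiplication and addition.  Here the illumination map I is
    assumed to be the normalized input intensity itself."""
    rows, cols = len(a), len(a[0])
    b = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            i = a[r][c]                      # illumination component I
            b[r][c] = a_prime[r][c] * i + a[r][c]
    return b
```

For a pixel with input intensity 0.2 and generator output 0.5, the composed value is 0.5 × 0.2 + 0.2 = 0.3.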

Fig. 2 EnlightenGAN architecture

Fig. 3 Samples of images illuminated by EnlightenGAN: (a) original image; (b) illuminated image



Fig. 4 YOLOv8 architecture

3.2 YOLOv8

“YOLO”, an acronym for “You Only Look Once”, is a highly effective algorithm for real-time object detection, renowned for its exceptional balance between accuracy and speed. The latest iteration of the YOLO series, YOLOv8, was created by Ultralytics. It improves on the performance of previous YOLO versions through the following changes:
• introduction of anchor-free detection, by which the model directly estimates the center of an object instead of relying on an offset from a known anchor box;
• alterations to the convolutions in the backbone network;
• closing mosaic augmentation before training is completed.
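The anchor-free idea in the first bullet can be sketched as a direct decoding of a (center, size) prediction into a corner-format box, with no anchor offsets involved. This is illustrative only, not Ultralytics' actual decoding code; the threshold value is an assumption.

```python
def decode_center_box(cx, cy, w, h):
    """Anchor-free decoding: a predicted object center (cx, cy) and size
    (w, h) map directly to a corner-format box (x1, y1, x2, y2)."""
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

def filter_detections(preds, conf_threshold=0.25):
    """Keep predictions above a confidence threshold and decode their boxes.
    `preds` is a list of (cx, cy, w, h, confidence) tuples."""
    return [decode_center_box(cx, cy, w, h)
            for cx, cy, w, h, conf in preds if conf >= conf_threshold]
```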
Figure 4 shows the general architecture of the YOLOv8 model, which has three components, namely the backbone, neck, and head. Relevant features are extracted from the input image by the backbone network. The neck connects the backbone network and the head network; it also reduces the dimensions of the feature map and improves the resolution of the features. Finally, the head network comprises three detection networks to detect small, medium, and large objects. Sample results for the YOLOv8 model are portrayed in Fig. 5.

Fig. 5 Sample tiger detection results by YOLOv8

4 Results and Discussion

The experiment is performed with the ATRW dataset, which covers three computer vision tasks, namely tiger detection, pose estimation, and tiger re-identification. The detection subset comprises 9496 bounding boxes across 4434 images [21]. The efficiency of the framework is evaluated using Mean Average Precision (mAP). For an object detection task, each prediction has a confidence score and the dimensions of the bounding box B_p. A detection is said to be correct if it satisfies the IoU threshold t, as shown in Eq. (4). IoU stands for Intersection over Union, which is the ratio of the area of intersection to the area of union, as depicted in Eq. (3). Finally, among the correct predictions, the mAP is computed as per the formula given in Eq. (5) [22]; the average precision, denoted AP, is the area under the precision-recall curve. The comparison with state-of-the-art performance for the proposed methodology is shown in Table 1 and is pictorially depicted in Fig. 6.

IoU(B_p, B_gt) = area(B_p ∩ B_gt) / area(B_p ∪ B_gt)    (3)

Table 1 Comparison with SOTA performance [11, 21]

Model                 | mAP[0.5:0.95]
SSD-MobileNetv1       | 0.446
SSD-MobileNetv2       | 0.473
Tiny-DSOD             | 0.511
YOLOv3                | 0.464
AF-TigerNet           | 0.555
YOLOv8                | 0.610
EnlightenGAN + YOLOv8 | 0.617

Fig. 6 Pictorial representation of the comparison with SOTA performance

T(B_p, B_gt) = { correct,    if IoU(B_p, B_gt) > t
               { incorrect,  otherwise                  (4)

mAP = (1/N) Σ_{i=1}^{N} AP_i    (5)
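Equations (3)–(5) translate directly into code. A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format; the per-class AP values themselves would come from the precision-recall curve, which is not computed here.

```python
def iou(bp, bgt):
    """Eq. (3): Intersection over Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(bp[0], bgt[0]), max(bp[1], bgt[1])
    ix2, iy2 = min(bp[2], bgt[2]), min(bp[3], bgt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(bp) + area(bgt) - inter
    return inter / union if union > 0 else 0.0

def is_correct(bp, bgt, t=0.5):
    """Eq. (4): a detection is correct if its IoU exceeds the threshold t."""
    return iou(bp, bgt) > t

def mean_ap(ap_per_class):
    """Eq. (5): mAP is the mean of the per-class average precisions AP_i."""
    return sum(ap_per_class) / len(ap_per_class)
```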

The previous SOTA performance is 0.464 for YOLOv3, and AF-TigerNet performs better with an mAP of 0.555. Further, the experiment performed with YOLOv8 achieved an mAP of 0.610. From Table 1, it is evident that the application of EnlightenGAN and YOLOv8 yields a significant increase in mAP of roughly 6–7 percentage points over AF-TigerNet. Overall, by addressing illumination variation, the proposed framework achieves an mAP of 0.617, outperforming the other comparative models.
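The reported gains follow directly from the mAP values in Table 1:

```python
# mAP[0.5:0.95] values taken from Table 1
map_scores = {
    "YOLOv3": 0.464,
    "AF-TigerNet": 0.555,
    "YOLOv8": 0.610,
    "EnlightenGAN + YOLOv8": 0.617,
}

def gain_points(model, baseline):
    """mAP gain of `model` over `baseline`, in percentage points."""
    return round(100 * (map_scores[model] - map_scores[baseline]), 1)
```

The full framework gains 6.2 points over AF-TigerNet, and the illumination enhancement contributes 0.7 points over plain YOLOv8.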

5 Conclusion

Recently, wildlife surveillance has been automated with various computer vision methodologies. Illumination variation is a difficult problem in wildlife surveillance and hinders efficient object detection. This paper presents a novel framework that addresses illumination variation and provides efficient tiger detection using EnlightenGAN and YOLOv8. The attention-guided unit and the global and local discriminators in EnlightenGAN provide illumination-enhanced images, which are then fed into the YOLOv8 object detector. The experiment is performed with the ATRW tiger dataset, and the proposed model outperforms the SOTA with an mAP of 0.617. In the future, the tiger detection model can be expanded to multi-class wildlife surveillance.

References

1. Gray TN, Rosenbaum R, Jiang G, Izquierdo P, Yongchao JIN, Kesaro L, Chapman S (2023)
Restoring Asia’s roar: opportunities for tiger recovery across the historic range. Front Conserv
Sci 4:1124340
2. Rana AK, Kumar N (2023) Current wildlife crime (Indian scenario): major challenges and
prevention approaches. Biodivers Conserv 32(5):1473–1491
3. Nittu G, Shameer TT, Nishanthini NK, Sanil R (2023) The tide of tiger poaching in India
is rising! An investigation of the intertwined facts with a focus on conservation. GeoJournal
88(1):753–766
4. Isabelle DA, Westerlund M (2022) A review and categorization of artificial intelligence-based
opportunities in wildlife, ocean and land conservation. Sustainability 14(4):1979
5. Pan X, Li C, Pan Z, Yan J, Tang S, Yin X (2022) Low-light image enhancement method based
on retinex theory by improving illumination map. Applied Sciences 12(10):5257
6. Jiang Y, Gong X, Liu D, Cheng Y, Fang C, Shen X, Wang Z (2021) Enlightengan: deep light
enhancement without paired supervision. IEEE Trans Image Process 30:2340–2349
7. Terven J, Cordova-Esparza D (2023) A comprehensive review of YOLO: From YOLOv1 to
YOLOv8 and beyond. arXiv preprint arXiv:2304.00501
8. Kupyn O, Pranchuk D (2019) Fast and efficient model for real-time tiger detection in the wild.
In: Proceedings of the IEEE/CVF international conference on computer vision workshops
9. Qin Z, Zhang Z, Chen X, Wang C, Peng Y (2018) Fd-mobilenet: improved mobilenet with a
fast downsampling strategy. In: 2018 25th IEEE international conference on image processing
(ICIP). IEEE, pp 1363–1367
10. Tan M, Chao W, Cheng JK, Zhou M, Ma Y, Jiang X, Feng L (2022) Animal detection and clas-
sification from camera trap images using different mainstream object detection architectures.
Animals 12(15):1976
11. Liu B, Qu Z (2023) AF-TigerNet: a lightweight anchor-free network for real-time Amur tiger
(Panthera tigris altaica) detection. Wildlife Letters 1(1):32–41
12. Wang CY, Liao HYM, Wu YH, Chen PY, Hsieh JW, Yeh IH (2020) CSPNet: a new backbone
that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition workshops, pp 390–391
13. Dertien JS, Negi H, Dinerstein E, Krishnamurthy R, Negi HS, Gopal R, Baldwin RF (2023)
Mitigating human-wildlife conflict and monitoring endangered tigers using a real-time camera-
based alert system. BioScience 73(10):748–757

14. Al Sobbahi R, Tekli J (2022) Comparing deep learning models for low-light natural scene
image enhancement and their impact on object detection and classification: Overview, empirical
evaluation, and challenges. Signal Process Image Commun 116848
15. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio
Y (2014) Generative adversarial nets. Adv Neural Inf Process Syst 27
16. Wang W, Wei C, Yang W, Liu J (2018) Gladnet: low-light enhancement network with global
awareness. In: 2018 13th IEEE international conference on automatic face and gesture recog-
nition (FG 2018). IEEE, pp 751–755
17. Choudhury S, Saikia N, Rajbongshi SC, Das A (2022) Employing generative adversarial net-
work in low-light animal detection. In: Proceedings of international conference on communi-
cation and computational technologies: ICCCT 2022. Springer Nature Singapore, Singapore,
pp 989–1002
18. Redmon J, Farhadi A (2018) Yolov3: an incremental improvement. arXiv preprint
arXiv:1804.02767
19. Wang J, Yang P, Liu Y, Shang D, Hui X, Song J, Chen X (2023) Research on improved yolov5
for low-light environment object detection. Electronics 12(14):3089
20. Guo C, Li C, Guo J, Loy CC, Hou J, Kwong S, Cong R (2020) Zero-reference deep curve
estimation for low-light image enhancement. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pp 1780–1789
21. Li S, Li J, Tang H, Qian R, Lin W (2019) ATRW: a benchmark for Amur tiger re-identification
in the wild. arXiv preprint arXiv:1906.05586
22. Padilla R, Netto SL, Da Silva EA (2020) A survey on performance metrics for object-detection
algorithms. In: 2020 international conference on systems, signals and image processing (IWS-
SIP). IEEE, pp 237–242
An Innovative Frequency-Limited
Interval Gramians-Based Model Order
Reduction Method Using Singular Value
Decomposition

Vineet Sharma and Deepak Kumar

Abstract The widely used frequency-limited Gramians-based model order reduction method for continuous-time systems, originally developed by Gawronski and Juang, can yield unstable reduced-order models. To address this significant flaw, many researchers have proposed ways to keep the reduced-order model stable. However, under some circumstances, these contemporary approaches also result in an unstable reduced-order model and a significant deviation from the original system, leading to a large approximation error. The present article proposes a new structure for stable continuous-time systems based on limited-interval Gramians. The suggested structure ensures that a low frequency-response approximation error is attained over the specified limited interval and guarantees the stability of the reduced-order model. The stated technique provides consistent and accurate results, proving its efficacy when compared with other conventional approaches.

Keywords Limited-interval Gramians · Reduced-order model · Singular value decomposition

1 Introduction

Large complex systems [1] that possess high-dimensional characteristics are challenging to evaluate and implement. Such models can therefore be approximated by a feasible lower-order model that is easier to handle. An effective approximation of the reduced-order model (ROM) solves the challenge of evaluating and developing such complex frameworks. By using model order reduction (MOR), a small,
V. Sharma (B) · D. Kumar


MNNIT Allahabad, Prayagraj, Uttar Pradesh 211004, India
e-mail: vineetsharma@mnnit.ac.in
D. Kumar
e-mail: deepak_kumar@mnnit.ac.in
V. Sharma
Poornima College of Engineering, Jaipur, Rajasthan 302017, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 183
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_15
184 V. Sharma and D. Kumar

computationally simple model that shares the same properties as the original model is produced. MOR has an impact on a number of fields, including:
• Computational science.
• Aerospace applications.
• Data integration.
• Medical technology.

1.1 Literature Review

The balanced truncation (BT), also known as balanced realization (BR), has played a crucial part in the application of control-system theory to model reduction. While explicitly setting a bound on the frequency-response error, it can still maintain stability. Ideally, the reduction error between the real system and the ROM should be minimal at all frequencies. When a low-order model is utilized in a feedback-control framework, the reduction error may be higher in some frequency bands than in others. When an accurate depiction of the full-order system is necessary in a particular region, such as the crossover region, the problem is known as the frequency-weighted model reduction problem. This served as an inspiration for the application of frequency weighting in model reduction; when it is used to reduce the controller's order, it is referred to as the controller reduction problem. By extending the BT of Moore [2], Enns [3] augmented the initial stable model with frequency weightings. This strategy may call for realizations of input, output, and double-sided weightings. However, Enns' method [3] also produces unstable ROMs when double-sided weighting is applied. Lin and Chiu (LC) [4] used a method that also produced stable models for dual-sided weightings to address the issue of instability. Furthermore, Sreeram et al. [5] expanded the generalization of Anderson and Liu [4] to incorporate appropriate weights. Varga and Anderson (VA) [6] pointed out that the technique of Sreeram et al. [5] was not relevant to controller reduction applications, as there was no pole-zero cancellation. VA [6] made additional modifications to the technique to address the problem, but the outcome of their improved approach was consistent with Enns' method [3], particularly with respect to controller reduction applications. Wang et al. [7] made another improvement to Enns' method [3], which produced a straightforward and remarkable error bound as well as the assurance of stability for both-sided weightings. The aforementioned methods and their modifications, according to Sreeram [8], are dependent on the realization and produce different models for different realizations of the same system; as a result, there can be large approximation errors and error bounds. Moreover, Kumar and Sreeram [9] proposed a MOR technique based on the optimal Hankel norm with frequency weights, generating new augmented system realizations from the initial system using various factorizations of fictitious matrices. Kumar et al. [10] presented a methodology in which the developed Gramians were used in model reduction algorithms for linear time-invariant continuous-time single-input single-output (SISO) systems. In addition, Sharma and Kumar [17] suggested an SVD-based methodology for a better understanding of the developed fictitious matrices and Gramians, resulting in a better approximation for double-sided and single-sided weightings for continuous-time systems.

An Innovative Frequency-Limited Interval Gramians-Based Model … 185
The frequency-interval (FI) Gramians-based MOR, which takes into account the importance of minimizing errors over a particular time/frequency spectrum, was presented by Gawronski and Juang (GJ) [11] and is in line with frequency-weighted MOR. Based on new controllability Gramians (CG) and observability Gramians (OG) defined over the desired FI, GJ [11] proposed a continuous-time balancing-based approach. This method does not explicitly use weights. Even for a stable system, as with Enns' method [3], GJ [11] has the drawback of offering an unstable ROM; furthermore, it lacks an a priori error bound. By defining the Gramians within the proper intervals, Aghee et al. [12] demonstrated a similar methodology. Gugercin and Antoulas (GA) [13] modified GJ [11] following Wang et al. [8], achieving stable ROMs and error bounds. Furthermore, enhancements to the GJ [11] methodology were suggested by Ghafoor et al. (GS) [14] and Imran et al. (IG) [15]. The semi-positive and positive definiteness of the coupled input-output matrices is taken into account in two stability-preserving FI MOR algorithms proposed by GS [14]. In GS [14] and IG [15], the magnitudes of the fictitious matrices' negative eigenvalues are modified. A large approximation error arises because some eigenvalues change very little while others vary greatly, causing a significant divergence from the real system. Using the IG [15] method, the stability of ROMs is achieved by subtracting the lower diagonal eigenvalues from the eigenvalues of an intermediate matrix. Additionally, Kumar et al. [16] illustrated MOR for discrete systems using parameterized limited frequency-interval Gramians. Further, Batool et al. suggested a model reduction technique for both weighted and limited-interval Gramians for discrete-time systems via a balanced structure with an error bound, with an application of MOR in power systems. By considering the delay feature of wind power when participating in frequency regulation, Cheng et al. [18] applied MOR to wind power. Also, Sharma and Kumar [19] emphasized balanced MOR using a restrained FI Gramian structure-based procedure.

1.2 Paper Organization and Contribution

This paper employs singular value decomposition (SVD), which is beneficial for matrix factorization, to present a new frequency-limited MOR strategy for linear time-invariant continuous-time system reduction, with the objective of minimizing the approximation error. The following are this work's primary contributions:
i. The limited Gramians and Lyapunov equations help to form new intermediary matrices, also known as fictitious matrices.
ii. The proposed method offers stable ROMs within the stated frequency interval range.

iii. A detailed analysis of the existing approaches GJ [11], GA [13], GS [14], and IG [15], and the suggested methodology is included.

2 Methodology

Let us take into account an LTI stable system equipped with a transfer function (TF):

G_og(s) = D_og + C_og (sI − A_og)^{-1} B_og,    (1)

where {A_og, B_og, C_og, D_og} is its m̂th-order stable and minimal realization with û inputs and v̂ outputs, A_og ∈ R^{m̂×m̂}, B_og ∈ R^{m̂×û}, C_og ∈ R^{v̂×m̂}, D_og ∈ R^{v̂×û}. The proposed work aims to achieve an efficient reduced framework with TF

G_red(s) = C_red (sI − A_red)^{-1} B_red + D_red,    (2)

which approximates the initial system in the given FI [ω1, ω2] (with ω2 > ω1), where A_red ∈ R^{r̂×r̂}, B_red ∈ R^{r̂×û}, C_red ∈ R^{v̂×r̂}, and r̂ < m̂.
Let the CG and OG, i.e., P_ct and Q_ob, respectively, be the solutions to the Lyapunov equations (LE)

A_og P_ct + P_ct A_og^T + X̂_dv = 0,    (3)

A_og^T Q_ob + Q_ob A_og + Ŷ_dv = 0,    (4)

where

X̂_dv = (S(ω2) − S(ω1)) B_og B_og^T + B_og B_og^T (S*(ω2) − S*(ω1)),    (5)

Ŷ_dv = (S*(ω2) − S*(ω1)) C_og^T C_og + C_og^T C_og (S(ω2) − S(ω1)),    (6)

S(ω) = (j/2π) ln((jωI + A_og)(−jωI + A_og)^{-1}),    (7)

and S*(ω) is the conjugate transpose of S(ω).
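In the scalar (1×1) case, S(ω) can be evaluated directly with the complex logarithm. This sketch assumes the 1/(2π) scaling of the standard Gawronski-Juang formulation of Eq. (7), with a single stable pole a < 0:

```python
import cmath
import math

def s_omega_scalar(a, w):
    """Scalar case of Eq. (7):
    S(w) = (j / (2*pi)) * ln((j*w + a) / (-j*w + a)),
    assuming the Gawronski-Juang 1/(2*pi) factor; `a` is a stable pole."""
    return (1j / (2 * math.pi)) * cmath.log((1j * w + a) / (-1j * w + a))
```

For a = −1 and ω = 1 the ratio inside the logarithm has unit modulus, so the result is real: S(1) = 0.25.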


By eigenvalue decomposition (EVD) of X̂_dv and Ŷ_dv we have

X̂_dv = U_d S_d U_d^T,    (8)

Ŷ_dv = V_d R_d V_d^T,    (9)

where

S_d = diag(s1, s2, ..., sm̂)    (10a)

and

R_d = diag(r1, r2, ..., rm̂).    (10b)
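The EVDs in Eqs. (8)-(9) act on symmetric matrices. For the 2×2 symmetric case the eigenvalues have a closed form, sketched here in the descending order used in Eq. (10a):

```python
import math

def evd_sym2_eigvals(a, b, c):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]]
    (the situation of Eqs. (8)-(9), since X_dv and Y_dv are symmetric),
    returned in descending order as in Eq. (10a)."""
    mean = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return mean + rad, mean - rad
```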

Now, B_pv and C_pv are the proposed fictitious input and output matrices, respectively, following [18], given as:

B_pv = U_v (|sin(S_d)| − S_d)^{1/2} for sm̂ < 0 and B_pv = U_v S_d^{1/2} for sm̂ ≥ 0,    (11a)

C_pv = (|sin(R_d)| − R_d)^{1/2} V_v^T for rm̂ < 0 and C_pv = R_d^{1/2} V_v^T for rm̂ ≥ 0,    (11b)

where sm̂ and rm̂ are the smallest (last) diagonal entries of the S_d and R_d matrices. The parameters U_v, S_d, V_v, and R_d are established by the EVD of the suggested matrices, inspired by IG [15], i.e.,

B_pv B_pv^T = U_v S_d U_v^T,    (12a)

and

C_pv^T C_pv = V_v^T S_d V_v,    (12b)

where

S_d2 = (|sin(S_d)| − S_d) for sm̂ < 0    (13a)

and

R_d2 = (|sin(R_d)| − R_d) for rm̂ < 0.    (13b)

The proposed matrices B_pv and C_pv are produced by exerting a similar influence on each eigenvalue of the symmetric matrices X̂_dv and Ŷ_dv.
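The eigenvalue mapping of Eq. (11a) can be sketched for the diagonal factor alone (the orthogonal factor U_v is omitted). The |sin(·)| mapping is taken verbatim from the text, and this sketch assumes eigenvalues for which the mapped quantity is nonnegative:

```python
import math

def fictitious_input_scales(eigvals):
    """Diagonal sketch of Eq. (11a): when the smallest eigenvalue of S_d is
    negative, each eigenvalue s is mapped to sqrt(|sin(s)| - s); otherwise
    every eigenvalue is mapped to sqrt(s)."""
    if min(eigvals) < 0:
        return [math.sqrt(abs(math.sin(s)) - s) for s in eigvals]
    return [math.sqrt(s) for s in eigvals]
```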
The following are the suggested frequency-limited CG and OG:

P_pv(ω) = (1/2π) ∫_{−ω}^{ω} (jνI − A_og)^{-1} B_og B_og^T (−jνI − A_og^T)^{-1} dν,    (14)

Q_pv(ω) = (1/2π) ∫_{−ω}^{ω} (−jνI − A_og^T)^{-1} C_og^T C_og (jνI − A_og)^{-1} dν.    (15)

The suggested Gramians must satisfy the LE

A_og P̂_pv + P̂_pv A_og^T + B_pv B_pv^T = 0,    (16)

A_og^T Q̂_pv + Q̂_pv A_og + C_pv^T C_pv = 0.    (17)
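In the scalar (1×1) case, Eq. (16) reduces to 2·a·p + b² = 0, which can be solved and verified directly:

```python
def scalar_lyap(a, b):
    """Scalar case of Eq. (16): a*p + p*a + b*b = 0  =>  p = -b*b / (2*a),
    which is positive for a stable pole a < 0."""
    return -(b * b) / (2.0 * a)
```

For a = −2 and b = 1 this gives p = 0.25, and substituting back yields a residual of zero.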

A similarity transformation matrix T_pv is obtained such that

T_pv^T Q_pv T_pv = T_pv^{-1} P_pv T_pv^{-T} = diag(ε1, ε2, ..., εm̂),    (18)

where ε_pv ≥ ε_{pv+1}, pv = 1, 2, ..., m̂ − 1, and ε_r̂ > ε_{r̂+1}.


Further, by transforming the actual system and then partitioning it, we have the following:

A_t̂ = T_pv^{-1} A_og T_pv = [A_og11 A_og12; A_og21 A_og22],  B_t̂ = T_pv^{-1} B_og = [B_og1; B_og2],  (19)

C_t̂ = C_og T_pv = [C_og1 C_og2],  D_t̂ = D_og,

where A_og11 ∈ ℝ^{r×r}. Eventually, the recommended ROM is acquired as

G_red(s) = C_og1 (sI − A_og11)^{-1} B_og1 + D_og.  (20)


   
Theorem 1 If rank[B_pv B_og] = rank(B_pv) and rank[C_pv C_og] = rank(C_pv), then the stated approach maintains the subsequent error bound:

‖G_og(s) − G_red(s)‖_∞ ≤ 2 ‖L_pv‖ ‖K_pv‖ Σ_{j=r+1}^{m} ε_j  (21)

where

L_pv = C_og V_v [|sin(R_d)| − R_d]^{-1/2} for r_m̂ < 0 and L_pv = C_og V_v R_d^{-1/2} for r_m̂ ≥ 0  (22)

and

K_pv = [|sin(S_d)| − S_d]^{-1/2} U_v^T B_og for s_m̂ < 0 and K_pv = S_d^{-1/2} U_v^T B_og for s_m̂ ≥ 0  (23)
   
Proof Since rank[B_pv B_og] = rank(B_pv) and rank[C_pv C_og] = rank(C_pv), the relationships B_og = B_pv K_pv and C_og = L_pv C_pv hold.
Partitioning B_pv = [B_pv1^T B_pv2^T]^T, C_pv = [C_pv1 C_pv2], and substituting B_og1 = B_pv1 K_pv, C_og1 = L_pv C_pv1, respectively, gives
‖G_og(s) − G_red(s)‖_∞ = ‖C_og(sI − A_og)^{-1} B_og − C_og1(sI − A_og11)^{-1} B_og1‖_∞  (24)

= ‖L_pv C_pv(sI − A_og)^{-1} B_pv K_pv − L_pv C_pv1(sI − A_og11)^{-1} B_pv1 K_pv‖_∞  (25)

= ‖L_pv [C_pv(sI − A_og)^{-1} B_pv − C_pv1(sI − A_og11)^{-1} B_pv1] K_pv‖_∞  (26)

≤ ‖L_pv‖ ‖C_pv(sI − A_og)^{-1} B_pv − C_pv1(sI − A_og11)^{-1} B_pv1‖_∞ ‖K_pv‖  (27)
 
If (A_og11, B_pv1, C_pv1) is the rth-order ROM obtained by partitioning and balancing (A_og, B_pv, C_pv), where A_og11 ∈ ℝ^{r×r}, then by Moore [2],

‖C_pv(sI − A_og)^{-1} B_pv − C_pv1(sI − A_og11)^{-1} B_pv1‖_∞ ≤ 2 Σ_{j=r+1}^{m} ε_j.  (28)

Therefore,

‖G_og(s) − G_red(s)‖_∞ ≤ 2 ‖L_pv‖ ‖K_pv‖ Σ_{j=r+1}^{m} ε_j.  (29)

2.1 Algorithm

Given G_og(s) and the required frequency range [ω_1, ω_2], the following sequence is used to compute the proposed ROM:
• Using A_og, obtain S(ω) from (7).
• Compute X̂_dv and Ŷ_dv using (5)–(6).
• Perform EVD of X̂_dv and Ŷ_dv to compute B_pv and C_pv as given in (11a)–(11b), respectively.
• Solve (14)–(15) to compute P_pv and Q_pv.
• Find the transformation matrix T_pv to satisfy (18).
• Compute the balanced realization and partition it as in (19) to obtain the ROM G_red(s) given in (20).
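Steps 4–6 of this sequence amount to a square-root balanced truncation once the Gramians are available. Below is a minimal Python/SciPy sketch; for illustration it uses the ordinary infinite-interval Gramians from Lyapunov equations of the form (16)–(17) in place of the proposed P_pv and Q_pv, and the function name and interface are ours, not the paper's.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, cholesky, svd

def balanced_truncation(A, B, C, r):
    # Gramians from A P + P A^T + B B^T = 0 and A^T Q + Q A + C^T C = 0.
    P = solve_continuous_lyapunov(A, -B @ B.T)
    Q = solve_continuous_lyapunov(A.T, -C.T @ C)
    Lp = cholesky(P, lower=True)                # P = Lp Lp^T
    Lq = cholesky(Q, lower=True)                # Q = Lq Lq^T
    U, s, Vt = svd(Lq.T @ Lp)                   # s: Hankel singular values
    T = Lp @ Vt.T @ np.diag(s ** -0.5)          # balancing transformation, cf. (18)
    Ti = np.diag(s ** -0.5) @ U.T @ Lq.T        # its inverse
    Ab, Bb, Cb = Ti @ A @ T, Ti @ B, C @ T      # transform and partition, cf. (19)
    return Ab[:r, :r], Bb[:r, :], Cb[:, :r], s  # truncate to the ROM, cf. (20)
```

The truncated Hankel singular values then bound the approximation error as in (21), with the norms of L_pv and K_pv equal to one for this ordinary-Gramian case.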

Remark 1 When the symmetric matrices X̂_dv ≥ 0 and Ŷ_dv ≥ 0, the CG defined for GJ [11] (P̂_g) is the same as P_pv, and the OG for GJ [11] likewise satisfies Q̂_g = Q_pv; otherwise P̂_g < P_pv and Q̂_g < Q_pv. Furthermore, the Hankel singular values (HSV) over a frequency interval fulfill (λ_j(P̂_g Q̂_g))^{1/2} ≤ (λ_j(P_pv Q_pv))^{1/2}.

3 Numerical Analysis and Simulation

Example 1 Consider the stable 6th-order three-mass mechanical system with the transfer function below and the desired frequency interval [ω_1, ω_2] = [1, 5] rad/s:

G(s) = (−2.118s^4 − 0.2481s^3 − 24.83s^2 − 0.906s − 45.36) / (s^6 + 0.3295s^5 + 32.97s^4 + 3.609s^3 + 180.6s^2 + 3.566s + 119.1)  (30)

A system is said to be stable if all of its poles lie in the left half of the s-plane, unstable if any pole lies in the right half, and marginally stable if poles lie on the imaginary axis. The ROMs by the suggested and prevailing techniques GJ [11], GA [13], GS [14], and IG [15] are obtained for the above system considering the FI [ω_1, ω_2] = [1, 5] rad/s. Table 1 displays the locations of the ROM poles, and it is evident that GJ [11] provides an unstable 1st-order model, since its pole is positive and therefore lies in the right half of the s-plane, whereas GA [13], GS [14], IG [15], and the proposed approach produce stable models with poles in the left half of the s-plane. Table 2 shows that the proposed work produces the lowest approximation error within the targeted FI compared to the other strategies; hence the proposed method offers superior efficacy. The error plots of the 3rd-order ROMs for the interval [ω_1, ω_2] = [1, 5] are displayed in Fig. 1.

Example 2 Consider an 8th-order system from IG [15] with the following state-space representation and the required frequency interval [ω_1, ω_2] = [3, 5] rad/s:

Table 1 ROM pole locations for Example 1

Order | GJ [11] | GA [13] | GS [14] | IG [15] | Proposed
1 | 6.6474 | −0.0035 | −0.0034 | −0.0035 | −0.0006
2 | −0.1142, −0.1142 | −0.0038, −0.0038 | −0.0046, −0.0046 | −0.0038, −0.0038 | −0.0031, −0.0031
3 | −0.1375, −0.1375, 7.2165 | −0.0031, −0.0031, −0.1373 | −0.0049, −0.0049, 0.1842 | −0.0031, −0.0031, −0.1355 | −0.0037, −0.0037, −0.0803

Table 2 Error comparison of Example 1

Order | GJ [11] | GA [13] | GS [14] | IG [15] | Proposed
1 | 31.5453 | 33.5132 | 33.1800 | 33.4230 | 33.4165
2 | 31.5476 | 4.1202 | 5.0238 | 3.1937 | 3.1196
3 | 31.5493 | 10.1082 | 5.6663 | 7.7452 | 7.3109
4 | 31.5496 | 1.7964 | 1.7963 | 1.7963 | 1.7963
5 | 31.5501 | 1.8187 | 1.8181 | 1.8187 | 1.8186

Fig. 1 Error plots of 3rd-order ROMs for FI [ω_1, ω_2] = [1, 5] rad/s for Example 1

Table 3 ROM pole locations for Example 2

Order | GJ [11] | GA [13] | GS [14] | IG [15] | Proposed
1 | 4.1567e−08 | −0.0093 | −4.4119e−11 | −6.7629e−11 | −9.9422e−11 − 2.8527e−09i
2 | −1.9346, −7.7506 | −0.1525 ± 3.8699i | −0.1183 ± 13.9202i | −0.0095 ± 1.0709i | −0.0955 ± 3.8717i
3 | −7.7506, −1.9346, −0.0000 | −0.1525 ± 3.8699i, −0.0000 | −0.1722 ± 3.8691i, −0.0000 | −0.0096 ± 1.0710i, −0.0657 | −0.0955 ± 3.8717i, −0.0000

A = [  0.6490  −5.3691   0        0        0  0       0       0
       5.3691   0        0        0        0  3.8730  0       0
       0       36.5054  −0.2688 −12.9391   0  0       3.8730  0
       0       12.9391   0        0        0  0       0       3.8730
      −3.8730   0        0        0        0  0       0       0
       0       −3.8730   0        0        0  0       0       0
       0        0       −3.8730   0        0  0       0       0
       0        0        0       −3.8730   0  0       0       0 ],

B = [14 0 0 0 0 0 0 0]^T,
C = [0 0 0 0.0136 0 0 0 0],
D = 0  (31)

The 4th-order ROMs by the proposed and existing techniques GJ [11], GA [13], GS [14], and IG [15] are obtained for the above stated case while considering the frequency interval [ω_1, ω_2] = [3, 5] rad/s. The ROM pole locations are shown in Table 3 for [3, 5] rad/s, and it is clearly depicted that the approach suggested by GJ [11] offers an unstable 1st-order model, as it has a pole in the right half of the s-plane. However, the recommended method yields a stable model, as all poles lie in the left half of the s-plane. Table 4 compares the approximation errors of GJ [11], GA [13], GS [14], IG [15], and the proposed approach. Additionally, Fig. 2 shows the singular value plots of the error function G_og(s) − G_red(s) within the interval [3, 5] rad/s, where the 4th-order ROM G_red(s) is derived utilizing GJ [11], GA [13], GS [14], IG [15], and the proposed strategy. For certain existing methods, a significant variation in the eigenvalues results in a substantial approximation error, as shown in Fig. 2. The tabular and pictorial representations make it abundantly evident that the suggested strategy outperforms the current methods.

Table 4 Error comparison of Example 2

Order | GJ [11] | GA [13] | GS [14] | IG [15] | Proposed
1 | 1.0004 | 1.0004 | 2.9013 | 1.0027 | 1.0611
2 | 1.0546 | 1.0247 | 1.0159 | 1.0008 | 1.0148
3 | 1.0546 | 1.0258 | 1.0159 | 1.0336 | 1.0148
4 | 0.9988 | 0.9984 | 0.9739 | 1.0017 | 0.9727
5 | 0.9988 | 0.9984 | 0.9739 | 1.0015 | 0.9727
6 | 0.9964 | 0.9892 | 0.9734 | 0.9991 | 0.9692
7 | 0.9964 | 0.9888 | 0.9734 | 0.9953 | 0.9848

Fig. 2 Error plots of 4th-order ROMs for FI [ω_1, ω_2] = [3, 5] rad/s for Example 2

4 Conclusion

An innovative frequency-interval MOR technique has been developed, centered on a frequency-limited Gramians-based model order reduction strategy. The simulations demonstrate that, over the preferred range of frequencies, the discussed technique outperforms the prevailing stability-preserving frequency-limited MOR techniques. The suggested methodology not only provides reliable ROMs but also yields a smaller frequency-response approximation error than the existing techniques. Extending the approach to descriptor and bilinear systems is a straightforward direction for future work. The method can also help in the design of high-performance, low-power integrated circuits by accelerating analysis while preserving accuracy in particular frequency bands, which is useful in the semiconductor industry where circuit simulations are computationally demanding.

References

1. Jiang YL, Qi Z, Yang P (2019) Model order reduction of linear systems via the cross Gramian
and SVD. IEEE Trans Circ Syst II 66(2):422–426
2. Moore BC (1981) Principal component analysis in linear systems: controllability, observability,
and model reduction. IEEE Trans Autom Control 26(1):17–32
3. Enns DF (1984) Model reduction with balanced realizations: an error bound and a frequency
weighted generalization. In: Proceedings of conference on decision and control, vol 3, pp
127–132
4. Anderson BDO, Liu Y (1989) Controller reduction: concepts and approaches. IEEE Trans
Autom Control 34:802–812
5. Lin C, Chiu TY (1992) Frequency weighted balanced realization. Control-Theory Adv Technol
1(2):341–351
6. Sreeram V, Anderson BDO, Madievski AG (1995) New results on frequency weighted balanced
reduction technique. In: Proceedings of American control conference, vol 6, pp 4004–4009
7. Varga A, Anderson BDO (2001) Accuracy enhancing methods for the frequency-weighted
balancing related model reduction. In: IEEE conference on decision and control, vol 4, pp
3659–3664
8. Wang G, Sreeram V, Liu WQ (1999) A new frequency weighted balanced truncation method
and an error bound. IEEE Trans Autom Control 4(9):1734–1737
9. Sreeram V (2005) An improved frequency weighted balancing related technique with error
bounds. In: Proceedings of IEEE conference on decision and control, vol 8, pp 3084–3089
10. Kumar D, Sreeram V (2020) Factorization-based frequency-weighted optimal Hankel-norm
model reduction. Asian J Control 22(5):2106–2118
11. Gawronski W, Juang JN (1990) Model reduction in limited time and frequency intervals. Int J
Syst Sci 21(2):349–376
12. Aghaee PK, Zilouchian A, Nike-Ravesh SK, Zadegan AH (2003) Principle of frequency-
domain balanced structure in linear systems and model reduction. Comput Electr Eng
29(3):463–477
13. Gugercin S, Antoulas AC (2003) A time limited balanced reduction method. IEEE Conf Proc
Control 77(8):5250–5253
14. Ghafoor A, Sreeram V (2006) Frequency interval Gramians based model reduction. In: Asia
Pacific conference on circuits and systems, vol 4, no 3. IEEE, pp 2000–2003
15. Imran M, Ghafoor A (2015) A frequency limited interval Gramians-based model reduction
technique with error bounds. Circ Syst Signal Process 34(11):3505–3519
16. Kumar D, Sreeram V, Du X (2018) Model reduction using parameterized limited frequency
interval Gramians for 1-D and 2-D separable denominator discrete-time systems. IEEE Trans
Circ Syst I 65(8):2571–2580
17. Sharma V, Kumar D (2022) SVD-based frequency weighted model order reduction of
continuous-time systems. In: IEEE International conference on power electronics, drives and
energy systems (PEDES), pp 1–4
18. Cheng W, Li Z, Li Y, Chu Y, Xie H, Li B, Tang P, Peng D (2023) A model order reduction method
considering the delay feature of wind power when participating in frequency regulation. In:
IEEE 6th information technology, networking, electronic and automation control conference,
vol 6, pp 737–743
19. Sharma V, Kumar D (2023) Confined frequency-interval Gramian framework-based balanced
model reduction. IETE J Res 1–8
Contrast Enhancement of Medical
Images Using Otsu’s Double Threshold

R. Vinay , Monika Agarwal , Geeta Rani , and Aparajita Sinha

Abstract This research tackles challenges in contrast enhancement within medical


imaging, addressing issues like over-enhancement, entropy loss, mean shift, and
suboptimal contrast due to improper thresholds. Extracting micro-level diagnostic
details from medical photos is hindered by these concerns. The study explores new
histogram equalization techniques, assessing their advantages, limitations, and appli-
cations. The authors propose a robust design framework, integrating Otsu’s double
threshold, range optimization, weighted distribution, adaptive gamma correction, and
homomorphic filtering. Experimental results demonstrate efficacy in maximizing
entropy, preserving brightness, and enhancing contrast in medical magnetic reso-
nance images. The refined system ensures visually appealing images crucial for
disease diagnosis precision. This research contributes by addressing why existing
systems fall short, and how an improved framework can successfully overcome these
challenges.

Keywords Optimum contrast enhancement · Otsu’s double threshold · Range


optimization process · Weighted distribution model · Adaptive gamma correction ·
Homomorphic filtering · Maximum entropy preservation

R. Vinay (B) · M. Agarwal · A. Sinha


Dayananda Sagar University, Bangalore, India
e-mail: vinayramamurthy5@gmail.com
M. Agarwal
e-mail: monika.goyal-cse@dsu.edu.in
A. Sinha
e-mail: aparajitasinha-aiml@dsu.edu.in
G. Rani
Manipal University, Jaipur, India
e-mail: geetachhikara@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 195
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_16
196 R. Vinay et al.

Fig. 1 Effects of medical image enhancement. a Original image, b enhanced image

1 Introduction

To extract useful details from images, digital image processing is essential. Image
contrast enhancement is a critical pre-processing step in the field of imaging.
This stage sharpens the edges, borders, brightness, and gray-level distributions.
This contributes to an image’s visual quality. Figure 1 depicts the effect of image
improvement.
Histogram equalization (HE), a common technique for enhancing image contrast,
is efficient and easy to use. This technique flattens probability distributions and
widens the dynamic range of gray levels. Thus, it enhances an image’s contrast. HE
uses the cumulative density function (CDF) as a transformation function to enhance
an input image’s gray levels. It adjusts an image’s mean brightness to the center of
the dynamic range. Due to this, an enhanced image develops the issues of inten-
sity saturation, mean brightness shift, and over-enhancement. Particularly, the high-
frequency histogram bins are highlighted by HE. The low-frequency histogram bins
are removed, and washed-out effects appear. The drawbacks render this method inap-
propriate for several application fields, including microscopic imaging, fingerprint
recognition, face recognition, medical imaging, voice recognition, satellite imaging,
and aerial imaging [1]. This inspires the authors to survey the state of the art in this field and to devise remedies for the identified shortcomings. The remainder of the document is laid out as follows. Section 2 presents the state of the art and a comparative analysis of several image-enhancement techniques. Section 3 illustrates the proposed approach for the improvement of low-contrast medical images. Experimental results are presented in Sect. 4, the comparative analysis in Sect. 5, and the conclusion in Sect. 6.

2 State of the Art

Extensive survey of literature in the arena of digital image processing presents


alternatives to Histogram Equalization (HE), including Dualistic Sub-Image
Histogram Equalization (DSIHE) [2], Brightness Preserving Bi-Histogram Equal-
ization (BBHE) [1], Recursive Sub-Image Histogram Equalization (RSIHE) [3], and
Recursive Mean Separate Histogram Equalization (RMSHE) [4]. These methods
Contrast Enhancement of Medical Images Using Otsu’s Double Threshold 197

address the mean brightness shift issue in HE by altering the histogram. Khan
et al. [5] proposed a technique segmenting the histogram based on mean or median
intensity values, efficiently improving image quality through HE and normaliza-
tion. However, it lacks color identification. In [6], researchers develop a method
incorporating CDF, PWD, and AGC for video sequence enhancement, maintaining
brightness and contrast balance. Srivastava and Rawat [7] focus on smoothing
with a Gaussian filter, while [8] introduces a technique involving Otsu’s method,
range stretching, HE, smoothing, and normalization. Tiwari et al. [9] addressed
the mean shift problem. Baby and Karunakaran [10] combine bi-level weighted
histogram equalization and adaptive gamma correction for brightness preservation
and contrast improvement. Building on [10, 11] introduces an improved AGC-based
approach, combining RLBHE, and AGC procedures. Qadar et al. [12] proposed a
compromise approach with RLBHE and AGC. Xu et al. [13] used Otsu’s twofold
threshold for histogram division. Sharma and Garg [14] tackle contrast enhancement
in hazy images with an entropy-based algorithm. Khan et al. [15] rectified irregular
intensity expansions. Chen and Chen [16] extended ESIHE, while [17] introduced
AGCCPF. Dhal et al. [18] focused on over-enhancement, employing ROOBHE and
ROEBHE. Wu et al. [19] employed an advanced sparrow search algorithm for medical
image enhancement. Sangeeta et al. [20] advocated for segmented images using the
GrabCut technique. Soujanya et al. [21] proposed a hybrid method combining Otsu
thresholding and CLAHE. In [22], accurate segmentation of retinal vessel trees is
addressed using CLAHE and Otsu thresholding. Thakur et al. [22] introduced a
cuckoo search algorithm-optimized method that addresses glaucoma diagnosis with
contrast-limited adaptive histogram equalization and normalized Otsu thresholding.
Sundaram et al. [23] proposed a method using entropy Otsu thresholding, adaptive gamma correction, and CLAHE. Despite these methodologies, challenges like excessive enhancement, computational complexity, inefficiency with complex backgrounds, uneven illumination, and poor information retrieval persist. The authors propose an efficient model incorporating RLBHE, WD, AGC, and homomorphic filtering (HF) for enhancing poor-contrast medical photos with complex backgrounds and numerous elements.

3 Methodology

This section presents the block diagram and detailed description of the proposed
approach: “Range Limited Double Threshold HE with Adaptive Gamma Correction
and Homomorphic Filtering” (RLDTWHE). Figure 2 represents the block diagram
of RLDTWHE.

Fig. 2 Suggested technique RLDTWHE’s activity flow

3.1 Segmentation Using Otsu’s Double Threshold Method

Otsu’s double threshold method categorizes the foreground, background, and target
region in an input image. It calculates threshold values for each region, ensuring
high intraclass variance. Otsu’s method establishes the lower and upper bounds for
histogram equalization (HE) to preserve maximum brightness post-segmentation.
Equation 1 defines global thresholds g(T 1 , T 2 ) for foreground, background, and
target regions, based on intraclass variance maximization [24]. L 1 represents the
highest image intensity (255), and T 1 , T 2 are in the 0 to 255 range. W L , W U , W V
are PDFs, and E(I L ), E(I U ), E(I V ) are overall brightness for subdivisions I L , I U , I V .
E(I) is the average luminance of the entire input image [13, 21].
 
g(T1 , T2 ) = W L (E(I L ) − E(I ))2 + WU (E(IU ) − E(I ))2 + WV (E(I V ) − E(I ))2
(1)
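A direct way to realize Eq. (1) is an exhaustive search over all threshold pairs using cumulative histogram sums. The sketch below is our own minimal NumPy illustration (the function name and the O(L²) search strategy are our choices, not taken from the paper):

```python
import numpy as np

def otsu_double_threshold(img, L=256):
    # Maximize g(T1, T2) of Eq. (1): the weighted squared deviation of
    # the three class means from the global mean (between-class variance).
    p = np.bincount(img.ravel(), minlength=L) / img.size
    levels = np.arange(L)
    cw = np.cumsum(p)              # cumulative class weight
    cm = np.cumsum(p * levels)     # cumulative class "mass"
    mu = cm[-1]                    # global mean intensity
    best_g, best = -1.0, (0, 0)
    for t1 in range(L - 1):
        for t2 in range(t1 + 1, L - 1):
            w = np.array([cw[t1], cw[t2] - cw[t1], 1.0 - cw[t2]])
            m = np.array([cm[t1], cm[t2] - cm[t1], mu - cm[t2]])
            nz = w > 0
            # w * (m/w - mu)^2 summed over non-empty classes
            g = np.sum((m[nz] - w[nz] * mu) ** 2 / w[nz])
            if g > best_g:
                best_g, best = g, (t1, t2)
    return best
```

On an image whose intensities cluster around three values, the returned pair separates the clusters.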

3.2 Weighted Distribution Model

Three sub-histograms result from segmentation, with intensity value ranges [l_i, m_i]. The weighted distribution model minimizes the gap between the uniform and gray-level distributions by adjusting the estimated likelihoods: it assigns higher weights to less frequent intensities and lower weights to more frequent ones. Equation (2) gives the maximum probability (P_max), while Eq. (3) gives the minimum probability (P_min) of an input image sub-histogram. P(k) represents the approximate likelihood of the kth gray level, and L − 1 is the most significant gray level in the input grayscale image.

P_max = max_k P(k), 0 ≤ k ≤ L − 1  (2)

P_min = min_k P(k), 0 ≤ k ≤ L − 1  (3)

Equation (4) gives the cumulative probability density b_i of the ith sub-histogram, where l_i and m_i are the lowest and highest intensity values of the ith sub-histogram, P(k) is the probability of the kth gray level, and L − 1 is the input image's highest intensity value.

b_i = Σ_{k=l_i}^{m_i} P(k)  (4)

The probability can be modified using the formula in Eq. (5), which gives less weight to gray levels that occur more frequently and more weight to gray levels that occur less frequently [25]. Here P(k) is the probability of the kth gray level, P_w(k) is the weighted probability, P_min and P_max are the minimum and maximum probabilities of an input image sub-histogram, b_i is the cumulative probability density of the ith sub-histogram, and l_i and m_i are its lowest and highest intensities [22]. Modifying the probabilities helps achieve an output histogram with a uniform intensity distribution.

P_w(k) = P_max ((P(k) − P_min) / (P_max − P_min))^{b_i},  l_i ≤ k ≤ m_i  (5)

The sum of probabilities is always unity, but as noted in [25], the sum of the weighted probabilities is not. The resultant weighted probabilities P_w(k) therefore need to be normalized. Equation (6) gives the normalization formula, where P_w(k) is the weighted probability of an input image histogram and P_wn(k) is the normalized weighted probability.

P_wn(k) = P_w(k) / Σ_{k=0}^{L−1} P_w(k)  (6)
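Equations (5)–(6) can be sketched together as follows; the function name and the guard for a flat sub-histogram are our additions.

```python
import numpy as np

def weighted_probabilities(p, b):
    # Eq. (5): re-weight a sub-histogram's PDF with exponent b_i,
    # then Eq. (6): renormalize so the weights again sum to one.
    p = np.asarray(p, dtype=float)
    pmin, pmax = p.min(), p.max()
    if pmax == pmin:                     # flat sub-histogram: nothing to re-weight
        return np.full_like(p, 1.0 / p.size)
    pw = pmax * ((p - pmin) / (pmax - pmin)) ** b
    return pw / pw.sum()
```

The exponent b < 1 compresses the spread between frequent and rare gray levels while preserving their ordering.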

3.3 Histogram Equalization

The system equalizes each sub-histogram separately in this stage. The transfer function f converts the existing intensity distribution into a uniform one, so that intensity levels are distributed evenly over the entire range [26] and the contrast of the input image is enhanced. Equations (7), (8), and (9) define the transformation functions f_L(X_k), f_U(X_k), and f_V(X_k) for the foreground, background, and target sub-histograms, respectively. Here, T_1 and T_2 are Otsu's double thresholds; C_wL(X_k), C_wU(X_k), and C_wV(X_k) are the cumulative density functions (CDF) of the foreground, background, and target sub-histograms; X_0 is the minimum intensity value; and L − 1 is the maximum intensity value.

f_L(X_k) = X_0 + (T_1 − X_0) · C_wL(X_k),  (k = 0, 1, 2, …, T_1)  (7)

f_U(X_k) = (T_1 + 1) + (T_2 − (T_1 + 1)) · C_wU(X_k),  (k = T_1 + 1, T_1 + 2, …, T_2)  (8)

f_V(X_k) = (T_2 + 1) + ((L − 1) − (T_2 + 1)) · C_wV(X_k),  (k = T_2 + 1, T_2 + 2, …, L − 1)  (9)

Equation (10) gives the formula to calculate the normalized CDF C_wn(k), where P_wn(j) is the normalized weighted probability density of the input image histogram.

C_wn(k) = Σ_{j=0}^{k} P_wn(j)  (10)

The above transformation functions result in an output image Y composed of the three equalized sub-images I_L, I_U, and I_V, as given in Eq. (11). The input image is thereby equalized over the complete dynamic range of gray levels (X_0, X_{L−1}). Compared to GHE, this offers improved brightness stability and contrast improvement.

Y = f_L(X_k) ∪ f_U(X_k) ∪ f_V(X_k)  (11)
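The per-segment equalization of Eqs. (7)–(9) and the merge of Eq. (11) can be sketched as one lookup table per range; this simplification (our own) uses each segment's plain CDF where the method uses the weighted CDF C_w.

```python
import numpy as np

def range_limited_equalize(img, t1, t2, L=256):
    # Equalize [0, T1], [T1+1, T2], [T2+1, L-1] independently
    # (Eqs. (7)-(9)) and merge the results as in Eq. (11).
    out = img.copy()
    for lo, hi in [(0, t1), (t1 + 1, t2), (t2 + 1, L - 1)]:
        mask = (img >= lo) & (img <= hi)
        if not mask.any():
            continue
        hist = np.bincount(img[mask] - lo, minlength=hi - lo + 1)
        cdf = np.cumsum(hist) / hist.sum()
        lut = np.round(lo + (hi - lo) * cdf).astype(img.dtype)
        out[mask] = lut[img[mask] - lo]
    return out
```

Because each lookup table maps into its own segment, pixels never cross a threshold, which is what preserves the segmentation-based brightness structure.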

3.4 Adaptive Gamma Correction Process

HE alone is insufficient for enhancing poor-contrast medical images, as evidenced by the experiments in [27, 28]: it shifts brightness and contributes to over-enhancement, hindering disease identification in medical images. AGC resolves

over-enhancement issues by striking a balance between low computational costs and


high-level appearance. Unlike HE, AGC avoids the problem of significant image
intensity changes with slight alterations to the cumulative distribution function (CDF)
by employing a variable adaptive gamma value. Equation (12) provides the formula
for the Gamma correction technique, where l is the intensity of the input image,
lmax is the maximum intensity, and γ is the varying adaptive gamma parameter,
monotonically decreasing from 1 to 0 based on the weighted CDF [21, 22].
 γ
l
T (l) = lmax (12)
lmax

The weighted normalized CDF C_wn(k) at the kth intensity level modifies the value of γ using the formula given in Eq. (13):

γ = 1 − C_wn(k)  (13)
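A minimal sketch of Eqs. (12)–(13), assuming the image's own intensity CDF stands in for the weighted CDF C_wn and taking γ = 1 − CDF(l), the form that decreases monotonically from 1 to 0 as described in the text (the function name is ours):

```python
import numpy as np

def adaptive_gamma_correct(img, L=256):
    # Eq. (13): gamma falls from 1 toward 0 as the CDF rises, so common
    # bright levels are lifted gently and rare dark levels strongly.
    hist = np.bincount(img.ravel(), minlength=L).astype(float)
    cdf = np.cumsum(hist) / hist.sum()
    levels = np.arange(L, dtype=float)
    lmax = float(L - 1)
    # Eq. (12): T(l) = lmax * (l / lmax) ** gamma, with gamma = 1 - CDF(l)
    lut = lmax * (levels / lmax) ** (1.0 - cdf)
    return np.round(lut).astype(img.dtype)[img]
```

Since γ ≤ 1 everywhere, the mapping never darkens a pixel; it redistributes brightness without the abrupt jumps HE can produce.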

3.5 Homomorphic Filtering

AGC effectively enhances poor-contrast medical images post-HE, yet challenges persist for images with poor contrast and complex backgrounds: the process can introduce noise and visual artifacts, impacting accurate disease diagnosis. To mitigate this noise in low-contrast medical images, homomorphic filtering (HF) is employed, facilitating precise illness diagnosis. The suggested RLDTWHE method combines these strategies for contrast enhancement, which is crucial for revealing hidden information and aiding disease diagnosis; the visual effect is illustrated in Fig. 3 and the sequential steps are listed in Table 1.
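Homomorphic filtering operates in the log domain, attenuating slowly varying illumination while boosting reflectance detail. The sketch below is a generic Gaussian high-emphasis variant; the parameters gl, gh, c, and d0 are illustrative assumptions of ours, not values from the paper.

```python
import numpy as np

def homomorphic_filter(img, gl=0.5, gh=1.5, c=1.0, d0=30.0):
    # log -> FFT -> Gaussian high-emphasis filter -> inverse FFT -> exp
    rows, cols = img.shape
    log_img = np.log1p(img.astype(float))
    F = np.fft.fftshift(np.fft.fft2(log_img))
    u = np.arange(rows) - rows // 2
    v = np.arange(cols) - cols // 2
    D2 = u[:, None] ** 2 + v[None, :] ** 2      # squared distance from DC
    H = (gh - gl) * (1.0 - np.exp(-c * D2 / d0 ** 2)) + gl
    out = np.expm1(np.real(np.fft.ifft2(np.fft.ifftshift(H * F))))
    return np.clip(out, 0, 255)
```

With gl < 1 < gh, the DC (illumination) component is damped while high frequencies (edges, texture) are amplified, which is why a uniformly lit region comes out darker.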

Fig. 3 Comparison of visual quality of MRI image of spinal. a Original, b GHE, c BBHE, d DSIHE,
e AGCWD, f RLBHE, g RLDTMHE, h EASHE, and i RLDTWHE

Table 1 Sequential implementation of RLDTWHE


Input: A grayscale image of n = r × c pixels in the gray level range [X_0, X_{L−1}]; r = number of rows, c = number of columns
Output: Enhanced image
Begin
Step 1 Segment an input image using Otsu’s double threshold technique
into three sub-images representing the foreground, background,
and target region
Step 2 Adjust the probabilities of all three sub-histograms H IL , H IU , H IV
by using weighted normalized power law function
Step 3 Equalize H IL , H IU, and H IV independently
Step 4 Merge all sub-histograms as Ho = H IL U H IU U H IV
Step 5 Apply AGC on Ho
Step 6 Apply HF on Ho
Step 7 Enhanced image
End

4 Experimental Results

This section presents the experimental setup, dataset, image quality metrics, and
results for evaluating the RLDTWHE technique.
Data Set: The authors use an openly accessible dataset [19] comprising 50 low-
contrast medical MRI images, including liver, brain, skull, and spine examples shown
in Figs. 3 and 4. These 256 × 256 grayscale images exhibit diverse levels of brightness
and contrast, resulting in various forms of degradation.
Metrics for Evaluation of Image Quality: To assess RLDTWHE-applied image
quality, the authors employ the following quantitative metrics.
Entropy: Entropy, in bits, measures the information content of an image; a higher value indicates more extractable information and reduced intensity saturation effects. Image entropy is computed using Eq. (14), with L − 1 the highest intensity value, P(X_k) the PDF of the kth intensity level, and E(X_k) the image entropy [14].

E(X_k) = −Σ_{k=0}^{L−1} P(X_k) log_2 P(X_k) bits  (14)

Fig. 4 Comparison of visual quality of MRI image of brain. a Original, b GHE, c BBHE, d DSIHE, e AGCWD, f RLBHE, g RLDTMHE, h EASHE, and i RLDTWHE

Peak Signal-to-Noise Ratio: PSNR, given in Eq. (16), quantifies the signal-to-noise ratio on a logarithmic scale, reflecting the input signal's high dynamic range. The mean squared error (MSE), defined in Eq. (15), compares the input image I(i, j) with the transformed image o(i, j) over the 2-D image plane. L − 1 denotes the highest intensity value in the processed image, and N is the total pixel count [18, 21].

MSE = (1/N) Σ_i Σ_j |I(i, j) − o(i, j)|^2  (15)

PSNR = 10 log_10 ((L − 1)^2 / MSE)  (16)

A higher PSNR value suggests less noise and higher reconstruction quality.
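Eqs. (14)–(16) can be sketched directly (the helper names are ours):

```python
import numpy as np

def entropy(img, L=256):
    # Eq. (14): Shannon entropy of the intensity histogram, in bits.
    p = np.bincount(img.ravel(), minlength=L) / img.size
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def psnr(ref, out, L=256):
    # Eqs. (15)-(16): mean squared error, then peak signal-to-noise ratio.
    mse = np.mean((ref.astype(float) - out.astype(float)) ** 2)
    return float(10.0 * np.log10((L - 1) ** 2 / mse))
```

An image using all 256 levels equally often attains the maximum entropy of 8 bits, which is the ceiling the reported average of 7.28 is measured against.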
Contrast: The image’s contrast is defined by the normal intensity range and its
variation around a center pixel. Higher contrast values indicate a broader dynamic
range and stronger enhancement. Equation (17) describes the contrast function, where
r and c denote the width and length of a processed picture [18, 27]. I enh (i, j) represents
the pixel intensity at 2D position (i, j). Formula (18) illustrates the conversion of
contrast (C) to decibels (DB).
| |2
| |
1 ∑∑ ∑r ∑
r c
| 1
c
|
Ccontrast = |
Ienh (i, j ) − |
2
Ienh (i, j )|| (17)
r c i=1 j=1 | r c i=1 j=1 |


Ccontrast = 10Ccontrast (18)
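Eq. (17) reduces to the variance of the enhanced image's intensities (mean of squares minus squared mean); a one-line sketch:

```python
import numpy as np

def contrast_measure(img):
    # Eq. (17): E[I^2] - (E[I])^2 over the r x c image.
    x = img.astype(float)
    return float(np.mean(x ** 2) - np.mean(x) ** 2)
```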

Implemented on MATLAB R2012a, the proposed system is assessed using dark medical MRI images. The average entropy of 7.28 is roughly comparable to that of the original images, ensuring maximum information preservation, while the method achieves a PSNR of 32.61 and a contrast of 37.65. This enhances contrast while preserving the image's natural appearance. The experimental findings demonstrate improvement without compromising accuracy, entropy, or aesthetics.

5 Comparative Analysis

The authors compare RLDTWHE with current techniques like GHE [1, 2], BBHE
[3, 4], DSIHE [5, 6], AGCWD [6], RLBHE [24], RLDTMHE [13], and EASHE [20].
MATLAB tests assess RLDTWHE’s performance, with Figs. 3, 4 for visual quality
and Figs. 5, 6, and 7 for quantitative evaluation against previous methodologies.

Fig. 5 Comparison of entropy values

Fig. 6 Comparison of peak signal-to-noise ratio values

Fig. 7 Comparison of contrast values

5.1 Qualitative Analysis

This section assesses visual quality, detecting unnatural appearance, saturation arti-
facts, over-enhancement, and undesired artifacts. Figure 3a–i shows simulation
results applying contrast improvement strategies to a spinal MRI image. Output
images from GHE, AGCWD, RLBHE, and RLDTMHE (Fig. 3b, e, f, g) reveal
over-enhancement and high intensities. Results from BBHE, DSIHE, and EASHE
(Fig. 3c, d, h) appear dark due to insufficient contrast improvement. Figure 3i
displays RLDTWHE’s final image, revealing a natural appearance achieved with
an exceptional threshold.
Figure 4b–i compares the MRI results, with black noise visible in Fig. 4c, d, g, and h. Despite the superior quality of Fig. 4b, e, and f, they show noise, over-enhancement, and washed-out effects. RLDTWHE (Fig. 4i) produces an optimally enhanced MRI, preserving crucial information.

5.2 Quantitative Analysis

This section assesses statistical variables for contrast improvement, entropy, and
brightness preservation. Figures 5, 6, and 7 demonstrate the effectiveness of contrast
enhancement techniques with RLDTWHE. Entropy, crucial for assessing disease
severity in medical MRI images, is maximized by RLDTWHE, as shown in Fig. 5.
The method outperforms current techniques, avoiding intensity saturation artifacts
and over-enhancement to minimize information loss in processed images.
Figure 6’s experimental results show that the suggested method, RLDTWHE,
outperforms more traditional contrast enhancement methods in terms of PSNR value.
As a result, RLDTWHE does not overstate the level of noise during the contrast
improvement procedure. Additionally, it successfully controls the improvement pace
while maintaining the realistic quality of an image.
The result of the indicator for visual contrast is shown in Fig. 7. The authors’
investigations show that the suggested technique RLDTWHE produces images with
the best contrast, a smooth texture, and non-homogeneous regions when compared
to other improved contrast methods.
Table 2 compares the computational complexity for the low-contrast test images. The results in Table 2 show that RLDTWHE compares favorably with the state-of-the-art methods on all low-contrast test images, taking less time than the EASHE technique.
The proposed method, RLDTWHE, outperforms six HE-based contrast enhance-
ment schemes in both qualitative and quantitative analyses. It preserves the
highest brightness and information, delivering a substantial contrast improvement
while effectively maintaining the natural aspect of an image through controlled
augmentation.
206 R. Vinay et al.

Table 2 Comparison of computational complexities (in sec.) for standard low-contrast test images

Images/technique  GHE     BBHE    DSIHE   AGCWD  RLBHE  RLDTMHE  EASHE  RLDTWHE
MRI Brain         0.0625  0.0723  0.1023  1.234  3.867  5.39     8.367  4.367
MRI Spinal        0.0578  0.0712  0.1345  1.45   3.678  6.345    8.489  4.239
MRI Skull         0.0645  0.0856  0.987   1.367  2.389  5.829    9.672  3.659
MRI Liver         0.0615  0.0867  0.129   1.456  3.59   6.278    9.367  4.498

6 Conclusion

In conclusion, this comprehensive study underscores the noteworthy achievements
attained through our innovative approach, namely “Range Limited Double Threshold
HE with Adaptive Gamma Correction and Homomorphic Filtering” (RLDTWHE),
in effectively addressing persistent challenges within the realm of image contrast
enhancement. RLDTWHE not only adeptly resolves segmentation issues in intri-
cate visual scenarios but also ensures the preservation of brightness, maximiza-
tion of entropy, reduction of noise, and enhancement of contrast. The non-recursive
nature of this method significantly reduces temporal complexity when juxtaposed
with RMSHE, rendering it applicable across a spectrum of domains, including
consumer electronics, video frame analysis, and medical image analysis. Looking
ahead, promising avenues for further research involve delving into dynamic thresh-
olding techniques, allowing for adaptive adjustments based on local image charac-
teristics to bolster the method’s resilience across diverse images of varying complex-
ities. Furthermore, exploring the application of RLDTWHE in the realm of multi-
modal medical imaging, such as the amalgamation of magnetic resonance imaging
(MRI) with computed tomography (CT) scans, promises insights into its efficacy in
enhancing visibility and interpretability across a spectrum of imaging modalities.

References

1. Kim YT (1997) Contrast enhancement using brightness preserving bi-histogram equalization.
IEEE Trans Consumer Electron 43(1):1–8
2. Wang Y, Chen Q, Zhang BM (1999) Image enhancement based on equal area dualistic sub-
image histogram equalization method. IEEE Trans Consumer Electron 45(1):68–75
3. Sim KS, Tso CP, Tan YY (2007) Recursive sub-image histogram equalization applied to gray
scale images. Pattern Recogn Lett 28(10):1209–1221
4. Chen SD, Ramli AR (2003) Contrast enhancement using recursive mean-separate histogram
equalization for scalable brightness preservation. IEEE Trans Consumer Electron 49(4):1310–
1319
Contrast Enhancement of Medical Images Using Otsu’s Double Threshold 207

5. Khan MF, Khan E, Abbasi ZA (2013) Segment dependent dynamic multi-histogram equaliza-
tion for image contrast enhancement. Digital Signal Process 25:198–223
6. Huang SC, Cheng FC, Chiu YS (2013) Efficient contrast enhancement using adaptive gamma
correction with weighting distribution. IEEE Trans Image Process 22(3):1032–1041
7. Srivastava G, Rawat TK (2013) Histogram equalization: a comparative analysis and a
segmented approach to process digital images. In: Sixth International conference on contem-
porary computing (IC3), pp 81–85
8. Huynh TT, Le B, Lee S, Le-Tien T, Yoon Y (2014) Using weighted dynamic range for histogram
equalization to improve the image contrast. EURASIP J Image Video Process 44(1):1–17
9. Tiwari M, Gupta B, Srivastava M (2014) High-speed quantile-based histogram equalization for
brightness preservation and contrast enhancement. Image Process (IET) 9(1):80–89. https://
doi.org/10.1049/ietipr.2013.0778
10. Baby J, Karunakaran V (2014) Bi level weighted histogram equalization with adaptive gamma
correction. Int J Comput Eng Res (IJCER) 4(3):25–30
11. Gautam C, Tiwari N (2015) Efficient color image contrast enhancement using range limited
bi-histogram equalization with adaptive gamma correction. In: IEEE International conference
on industrial instrumentation and control (ICIC), pp 175–180
12. Qadar MA, Zhaowen Y, Rehman A, Alvi MA (2015) Recursive weighted multi-plateau
histogram equalization for image enhancement. Optik—Int J Light Electron Optics 126(24):5890–
5898
13. Xu H, Chen Q, Zuo C, Yang C, Liu N (2015) Range limited double threshold multi histogram
equalization for image contrast enhancement. Opt Rev 22(2):246–255
14. Sharma P, Garg G (2015) Entropy based optimized weighted histogram equalization for Hazy
images. In: IEEE 9th International conference on industrial and information systems (ICIIS),
pp 1–6
15. Khan MF, Khan E, Abbasi ZA (2015) Image contrast enhancement using normalized histogram
equalization. Optik—Int J Light Electron Optics 126(24):4868–4875
16. Chen YY, Chen SA (2015) Exposure based weighted dynamic histogram equalization for image
contrast enhancement. Int J Autom Smart Technol 5(1):27–38
17. Gupta B, Tiwari M (2015) Minimum mean brightness error contrast enhancement of color
image using adaptive gamma correction with color preserving framework. Int J Image Process
(IJIP) 9(4):241–253
18. Dhal KG, Sen S, Sarkar K, Das S (2016) Entropy based range optimized brightness preserved
histogram equalization for image contrast enhancement. Int J Comput Vision Image Process
6(1):59–72
19. Wu H, Huang Q, Cheung Y, Xu L, Tang S (2020) Reversible contrast enhancement for medical
images with background segmentation. IET Image Process 14. https://doi.org/10.1049/iet-ipr.
2019.0423
20. Sangeeta K, Divya M, Divyajyothi B (2023) Contrast enhancement of medical images using
Otsu thresholding. In: Sharma H, Shrivastava V, Bharti KK, Wang L (eds) Communication and
intelligent systems. ICCIS 2022. Lecture Notes in Networks and Systems, vol 686. Springer,
Singapore. https://doi.org/10.1007/978-981-99-2100-3_47
21. Soujanya TM, Prasad Babu K (2023) Implementation of Clahe contrast Enhancement & Otsu
thresholding in retinal image processing. IJETMS 7(1):138–153. https://doi.org/10.46647/ije
tms.2023.v07i01.022
22. Thakur N, Khan NU, Datt Sharma S (2022) Cuckoo search optimized histogram equalization
for low contrast image enhancement. In: 2022 Seventh International conference on parallel,
distributed and grid computing (PDGC). Solan, Himachal Pradesh, India, pp 727–732. https://
doi.org/10.1109/PDGC56933.2022.10053265
23. Sundaram R, Jayaraman P, Rangarajan R, Rengasri R, Rajeshwari C, Ravichandran KS (2019)
Automated optic papilla segmentation approach using normalized Otsu thresholding. J Med
Imaging Health Inf 9(7):1346–1353
24. Zuo C, Chen Q, Sui X (2013) Range limited bi-histogram equalization for image contrast
enhancement. Optik—Int J Light Electron Optics 124(5):425–431

25. Gupta B, Agarwal TK (2017) Linearly quantile separated weighted dynamic histogram
equalization for contrast enhancement. Comput Electr Eng 62:360–374
26. Qiu J, Li HH, Zhang T, Ma F, Yang D (2017) Automatic X ray image contrast enhancement
based on parameter auto optimization. J Appl Clin Med Phys 18(6):218–223
27. Yao Z, Zhou Q, Yang X, Yang C, Lai Z (2016) Quadrants histogram equalization with a
clipping limit for image enhancement. In: IEEE 8th International conference on wireless
communication & signal processing (WCSP)
28. Chowdhury AMS, Rahman MS (2016) Image contrast enhancement using tri-histogram equal-
ization based on minimum and maximum intensity occurrence. Current Trends Technol Sci
6(2):609–614
Recognizing Hate Speech on Twitter
with Feature Combo

Jatinderkumar R. Saini and Shraddha Vaidya

Abstract The issue of hate speech directed toward women is widespread and has
gained increased attention in recent times. Despite the impressive performance of
machine learning-based models that incorporate textual, user-specific, and social
network features, there is still potential for improvement given the variety of feature
combinations that can be used with the ensemble learning (EL) approach. To fill this
gap, the researchers in this study have generated a unique set of stance and similarity
features and combined them with machine learning (ML) and EL algorithms to recog-
nize hate speech in Twitter data and assess the model’s effectiveness. The proposed
novel feature combo of stance and similarity features achieved the highest accuracy
of 93.53% with the ensemble algorithm Extreme Gradient Boosting (XGBoost),
while the Support Vector Machine (SVM) algorithm of ML showed the lowest
accuracy of 75.67%.

Keywords Hate speech · Twitter · Stance · Similarity · Ensemble learning · Machine learning

1 Introduction

The unprecedented information generation on social media platforms such as Twitter
has adverse effects on individuals and society [1]. A common issue that has received
increased attention lately is hate speech directed at women. Speech that encourages
or displays prejudice, discrimination, or hostility toward women due to their gender
can be expressed verbally, in writing, or visually. Hate speech toward women has
severe consequences such as aggression and prejudice, with women being predomi-
nantly the subject of online nuisance and violence. It has the potential to normalize
discrimination based on gender, restrict the opportunities available to women, and
foster environments that make them feel unsafe and alone [2]. Growing awareness of

J. R. Saini · S. Vaidya (B)
Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed University), Pune, India
e-mail: phdyashoda.barve@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 209
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_17
210 J. R. Saini and S. Vaidya

hate speech or racism and its detrimental impacts on the lives of women has occurred
in recent years, sparking movements like hashtag ‘MeToo’ and demands for changes
to social and legal policies. Nonetheless, misogyny is still pervasive in a number of
settings, such as online forums, the workplace, and politics [1]. Preserving a civil
and secure online environment has made it imperative to recognize and remove such
offensive content.
Researchers have made efforts to develop automated systems with ML techniques
that recognize hate speech and further categorize it into classes such as aggressive
and non-aggressive, hatred and non-hatred, offensive and inoffensive, and violent
and non-violent [3]. The primary requirement is
data collection in the form of tweets, posts, or articles from social media sites such as
Twitter, Facebook, and Sina Weibo using Application Programming Interface (API).
Upon data collection, it is mandatory to perform pre-processing which cleans the
data to remove non-essential elements from it such as punctuations, stop words,
hashtags, and so on. Further, to build automated ML models, hand-crafted features
are generated based on contents, user profiles and behaviors, and social media graph
structure. Additionally, supervised, unsupervised, or semi-supervised models are
developed depending on the availability of labeled and unlabeled dataset [4, 5].
The performance of ML models for hate speech detection solely depends on the
feature extraction and selection process. According to the literature, authors have
extracted several textual, social media, and user-specific features [2]. For example,
authors have used n-grams, gender, location, and length for hate speech detection
[6], while others have identified hateful terms from the text [7]. In other research,
authors have used Term Frequency and Inverse Document Frequency (TF-IDF) [8],
while others have focused on sentimental, syntactic, and grammatical features [9].
However, it was noticed that the state-of-the-art research does not consider stance
features, which may contribute to identifying the viewpoint of the user while gener-
ating hateful messages. Another observation is that similarity-based features are
largely overlooked in the area of hate speech detection.
The user stance on Twitter depicts the opinion and evaluation of the claim. The
stance property requires a specific target toward which the stance can be measured.
Stance can also be measured for textual contents. It can be categorized into agree and
disagree, support and denial, or favor and against, and so on [10]. Stance detection is
related to other areas of Natural Language Processing (NLP), such as text categorization,
sentiment analysis, and pragmatic analysis. However, sentiment and stance
are often conflated, leading to the misuse of sentiment features for stance detection,
which in fact requires contextual information [11].
The similarity features compute the resemblance between facts and newly arriving
contents. These features are produced using cutting-edge similarity metrics like
Cosine, Jaccard, Euclidean, and Chebyshev. These features are widely used in the
research for detecting and fact-checking false information. For example, authors
have incorporated similarity measures to compute similarity between true and false
information using factual data [12]. In another research, ‘Content Similarity Measure
(CSM)’ algorithm was proposed to perform automated fact-checking of healthcare
web URLs with an aim to detect misinformation [13].
Recognizing Hate Speech on Twitter with Feature Combo 211

The authors of this study have suggested a novel method for detecting hate speech
that combines stance and similarity features. The research objective is to produce
a supervised learning-based model that uses a combination of stance and similarity
features to classify tweets as hateful or not. The following are the paper’s research
contributions:
1. Extracted new stance and similarity-based features for hate speech detection.
2. Developed ML and ensemble models based on newly extracted features.
A review of the literature is covered in Sect. 2 of the work, methodology is covered
in Sect. 3, experimental results and discussion are presented in Sect. 4, and the work
is concluded and recommendations for future improvements are made in Sect. 5.

2 Literature Survey

In this section, the authors discuss the literature in three aspects: (a) hate speech, (b)
stance features, and (c) similarity features. They are as follows.

2.1 Hate Speech

Hate speech is an area of concern that shows a blatant intent to cause harm or incite
hatred toward other people [14]. The state-of-the-art research developed models
with ML classifiers by extracting various features such as TF-IDF [3], n-grams,
lexicons, bag-of-words, distance metric, and meta-information [15]. Combining the
feature sets and generating models have shown best results in the area of hate
speech detection [14]. Further, several ensemble approaches were also adapted by
the researchers to detect hate speech. For example, ensemble of recurrent neural
networks and ensemble of LSTM-based models have been suitably implemented.
However, approaches such as stacking, voting, bagging, and boosting with several
combinations of ML classifiers are largely overlooked.
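As an illustration of the voting flavour of ensembling mentioned above (independent of any particular base learners), a hard-voting combiner can be written in a few lines:

```python
from collections import Counter

def hard_vote(predictions):
    """Combine per-classifier label lists by majority vote.

    predictions: list of equally long label sequences, one per base classifier.
    Returns the majority label for each sample (ties broken by first-seen label).
    """
    combined = []
    for sample_votes in zip(*predictions):
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined

# Three base classifiers disagreeing on two of four tweets (1 = hateful).
clf_a = [1, 0, 1, 0]
clf_b = [1, 1, 1, 0]
clf_c = [0, 1, 1, 0]
print(hard_vote([clf_a, clf_b, clf_c]))  # → [1, 1, 1, 0]
```

Stacking, bagging, and boosting differ in how the base learners are trained and weighted, but all share this idea of combining several weak decisions into a stronger one.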

2.2 Stance Features

Stance features depict the user’s viewpoint about the tweet. Stance is based
on the user’s social activity as well as the written contents. Although stance-based
features have been largely overlooked in the hate speech literature, they have been widely used
in the stance detection and veracity assessment problems of NLP [16]. In these prob-
lems, the stance features are categorized into content, lexicon, and dialogue-
act features [17]. The content features capture syntactic, grammatical, and lexical
features. Lexicon-based approaches form word cues. For example, authors in research

have developed five categories of word cues, namely belief, denial, doubt, report, and
knowledge. These features have certainly improved the model’s performance [18],
and thus, authors in this research have considered utilization of lexicon and content
features to detect hate speech.

2.3 Similarity Measure-Based Features

The similarity-based features compute the similarity between two entities. For
example, in one study the authors used Cosine similarity to find the resemblance
between an input word and a vector space; apart from this work, however, the authors
did not find any study that uses distance metrics [15]. Similarity measures can
significantly benefit hate speech detection by computing the similarity score between
newly arriving tweets and existing verified tweets. A similar approach has been
exploited in misinformation detection problems [19].

2.4 Research Gaps

Following are the research gaps identified in the domain of hate speech detection.
1. Features such as stance and similarity measure have been largely overlooked in
the literature.
2. Rich combination of features is lacking.
3. Very few models are employed with ensemble algorithms.

3 Methodology

This section describes the dataset employed in the model development, pre-
processing techniques performed for cleaning the data, feature extraction process,
and model development with ML and ensemble algorithms.

3.1 Dataset and Pre-processing

In this research, the authors have used a gold standard dataset from the literature [20,
21]. The dataset is made up of tweets that were gathered from the Twitter network
in order to identify hate speech against women. It consists of 135,556 tweets with
30 features comprising textual contents and other related attributes such as
sentiments, age, and gender. For ease of execution, the authors extracted
5000 tweets and performed pre-processing. The pre-processing steps include filtering

URL links, Twitter usernames, special words, emoticons, and dealing with missing
values. Further, the tokenization step resulted in removal of punctuations, spaces, and
stop words. Thus, 2000 tweets consisting of 1047 hateful tweets and 953 non-hateful
tweets were used for training the model, while 3000 tweets were used for testing
consisting of 1714 hateful tweets and 1286 non-hateful tweets.
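The cleaning steps above might be sketched as follows; the regular expressions and the tiny stop-word list are illustrative placeholders rather than the authors' exact pipeline:

```python
import re
import string

STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}  # tiny sample list

def clean_tweet(text):
    """Filter URLs and @usernames, strip symbols and punctuation, tokenize,
    and drop stop words."""
    text = re.sub(r"https?://\S+", " ", text)   # URL links
    text = re.sub(r"@\w+", " ", text)           # Twitter usernames
    text = re.sub(r"[^\x00-\x7f]", " ", text)   # emoticons / non-ASCII symbols
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(clean_tweet("@user The link is https://t.co/x and THIS is bad!!"))
# → ['link', 'this', 'bad']
```

A production pipeline would also handle hashtags, missing values, and a full stop-word list, as described in the text.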

3.2 Feature Generation

This section expounds the feature set generated for detecting hate speech against women.
The authors have identified stance features, namely supportive and denial-based words
belonging to the categories certainty and approximation, collected from authentic
sources on the web. For the similarity features, the authors have used the Jaccard coefficient,
Cosine similarity, and Euclidean distance measures from the literature and computed the
distance between the input tweets and the verified tweets.
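A minimal sketch of this feature generation is given below. The two cue-word lexicons are small illustrative samples (the paper collects 287 such words from authentic web sources), and the three similarity measures operate on tokenized tweets:

```python
import math
from collections import Counter

# Tiny illustrative lexicons; the paper's full lexicon has 287 cue words.
APPROXIMATION = {"could", "might", "would", "perhaps", "likely", "somewhat"}
CERTAINTY = {"should", "must", "demonstrate", "show", "find"}

def stance_counts(tokens):
    """(approximation count, certainty count) for one tokenized tweet."""
    return (sum(t in APPROXIMATION for t in tokens),
            sum(t in CERTAINTY for t in tokens))

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(a, b):
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    return dot / (math.sqrt(sum(c * c for c in va.values())) *
                  math.sqrt(sum(c * c for c in vb.values())))

def euclidean(a, b):
    va, vb = Counter(a), Counter(b)
    return math.sqrt(sum((va[t] - vb[t]) ** 2 for t in set(va) | set(vb)))

tweet = ["she", "might", "perhaps", "deserve", "it"]     # incoming tweet
verified = ["she", "might", "deserve", "respect"]        # verified tweet
print(stance_counts(tweet))                # → (2, 0)
print(round(jaccard(tweet, verified), 2))  # → 0.5
```

Each tweet then contributes stance counts plus one similarity score per metric against the verified set, forming the feature vector fed to the classifiers.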

3.3 Model Architecture

This section offers a thorough description of the architecture of the proposed
system. After cleaning the dataset and extracting the stance and similarity features, the
ML and ensemble models are developed. For the ML-based models, the authors have used
five algorithms, namely Logistic Regression (LR), Support Vector Machine (SVM), k-
Nearest Neighbors (kNN), Naïve Bayes (NB), and Decision Tree (DT). For EL,
algorithms such as XGBoost, AdaBoost, and Random Forest (RF) are used to evaluate the perfor-
mance in terms of accuracy (a), precision (p), recall (r), and F1-score (f1). Further,
analysis of stance words in hateful and non-hateful tweets is performed with standard
deviation, and word cloud is generated for hateful and non-hateful tweets. Figure 1
depicts the diagrammatic representation of architecture.
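The four evaluation measures a, p, r, and f1 follow their standard definitions over the binary confusion matrix; a self-contained sketch, assuming label 1 marks a hateful tweet:

```python
def evaluate(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1-score for binary labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    a = (tp + tn) / len(y_true)
    p = tp / (tp + fp) if tp + fp else 0.0   # precision
    r = tp / (tp + fn) if tp + fn else 0.0   # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return a, p, r, f1

# Hypothetical predictions for eight tweets (1 = hateful, 0 = non-hateful).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(evaluate(y_true, y_pred))  # → (0.75, 0.75, 0.75, 0.75)
```

The same four numbers are what Tables 3 and 4 report (as percentages) for each classifier and feature combination.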
Table 1 lists the set of stance features extracted with example words. A total of 287
words of approximations and certainty are collected from authentic web resources
and studies from the literature. Table 2 denotes the important features identified with
the help of statistical techniques such as Pearson Correlation Coefficient (PCC) and
Chi-square. The e_simil, c_simil, and j_simil depict Euclidean, Cosine, and Jaccard
similarities.

4 Results and Evaluation

This section explicates and examines the experimental results obtained from the
developed models. Figure 2 displays the average count of stance
features in hateful and non-hateful tweets. It can be seen that in hateful speeches there

Fig. 1 Diagrammatic representation of the proposed model

Table 1 List of stance features

Category        Example words
Approximations  Could, might, would, perhaps, likely, somewhat
Certainty       Should, must, demonstrate, show, find

Table 2 Top ten important features

S. No.  Feature name  PCC    Chi-square
1       prove         0.242  0.064
2       e_simil       0.166  0.059
3       obvious       0.123  0.057
4       likely        0.067  0.048
5       c_simil       0.047  0.042
6       permanent     0.041  0.039
7       feasible      0.033  0.036
8       j_simil       0.029  0.031
9       suggest       0.025  0.027
10      essential     0.012  0.021

are more approximation words, indicating uncertainty, while non-hateful tweets
contain more certainty words. This means that hateful speeches are more ambiguous
and doubtful and carry a negative stance. Table 3 displays the performance of the ML
and EL models across four parameters, namely a, p, r, and f1. A general comparison
shows that the ensemble

[Bar chart: average counts of the Certainty and Approximation stance features in hateful versus non-hateful tweets; reported bar values are 77.3, 73.5, 67.2, and 57.4]
Fig. 2 Average count of stance feature in hateful and non-hateful tweets

Table 3 Performance (in %) of models

Algorithm  Accuracy  Precision  Recall  F1-score
SVM        75.67     76.98      69.52   73.06
LR         79.93     77.76      75.99   76.86
NB         80.33     77.76      76.69   77.22
DT         83.20     82.12      79.40   80.74
kNN        86.33     91.76      79.51   85.20
RF         88.13     93.75      81.30   87.08
AdaBoost   91.67     94.53      87.05   90.64
XGBoost    93.53     96.89      89.00   92.78

models have outperformed the ML models, with XGBoost having the highest accuracy of
93.53%. XGBoost also showed the highest p, r, and f1 of 96.89, 89.00, and 92.78%, respectively.
The AdaBoost algorithm showed the second highest performance, with a, p,
r, and f1 of 91.67, 94.53, 87.05, and 90.64%, respectively. Among the ML algorithms,
kNN achieved the highest accuracy of 86.33%; in contrast, SVM showed the poorest
performance, with an accuracy of 75.67%. Although RF performed worst among the
ensemble methods, it was still better than the ML algorithms, with an accuracy of
88.13%. The ensemble techniques perform well because of their ability to combine
results from multiple classifiers, making the final decision unbiased. Table 4 displays
the accuracy of the XGBoost classifier across features. The authors have considered
only the best performing algorithm for the feature-wise analyses. It can be seen that
the stance and similarity features together have contributed to the best performance
of the model, showing the highest accuracy of 93.53% as explained earlier. However,
it can be seen that the individual similarity measures showed less

Table 4 XGBoost accuracy (in %) across features

Feature combo        Accuracy  Precision  Recall  F1-score
Jaccard (J)          71.33     69.98      65.50   67.67
Cosine (C)           76.67     70.76      73.74   72.22
Euclidean (E)        78.33     73.87      75.16   74.51
J + C + E            82.87     81.34      79.24   80.28
Stance               85.67     90.98      78.84   84.48
Stance + Similarity  93.53     96.89      89.00   92.78

performance ranging between 71 and 78% across Jaccard, Cosine, and Euclidean
measures. Individual stance features showed better performance than the combina-
tion of similarity features with an accuracy of 85.67%. Individual similarity features
showed poor performance in terms of p, r, and f1, having values ranging between 69–
74, 65–75, and 67–74%, respectively. The similarity combination showed improved
performance by 8, 4, and 6% of p, r, and f 1, respectively. Figure 3 shows the word
cloud of hateful speeches for women on Twitter. It can be observed that words such
as hate, violence, harassment, and misogamy are seen more often in hateful speeches.
Thus, it can be concluded that ensemble models contribute widely in categorization
of hateful and non-hateful speeches. Further, the stance and similarity features are
efficient in showing better performance of the models. Authors could not compare
the performance of proposed approach with state-of-the-art techniques as there are
absolutely no studies using stance and similarity feature individually or with combo.

5 Conclusion and Future Enhancements

In this research, the authors have generated two novel feature sets, namely stance and simi-
larity. The stance features consist of certainty and approximation words, which convey the
viewpoint of the users. The similarity features consist of the Jaccard coefficient, Cosine
similarity, and Euclidean distance. These feature values are computed by mapping
the input against existing verified tweets. Further, ML and ensemble models are
built, and their performance is evaluated in terms of a, p, r, and f1. It was concluded that
the ensemble models performed better than the ML models, with the XGBoost algorithm
showing the highest accuracy of 93.53%, followed by the AdaBoost algorithm with 91.67%.
Among the ensemble models, RF showed the poorest performance. Among the ML
algorithms, kNN showed the highest accuracy of 86.33%, while SVM showed the lowest
accuracy of 75.67%. It is also concluded that the feature combo of stance and simi-
larity yields better performance than the individual features. In the future, the authors
want to test the performance of the proposed feature combo on other datasets. In
addition, more categories of stance features can be identified, and novel similarity
algorithms can also be incorporated.

Fig. 3 Word cloud for hateful category

References

1. Rathod RG, Barve Y, Saini JR, Rathod S. From data pre-processing to hate speech detection:
an interdisciplinary study on women-targeted online abuse
2. Subramanian M, Easwaramoorthy Sathiskumar V, Deepalakshmi G, Cho J, Manikandan G
(2023) A survey on hate speech detection and sentiment analysis using machine learning and
deep learning models. Alexandria Eng J 80:110–121
3. Gite S et al (2023) Textual feature extraction using ant colony optimization for hate speech
classification. Big Data Cogn Comput 7(1)
4. Jahan MS, Oussalah M (2023) A systematic review of hate speech automatic detection using
natural language processing. Neurocomputing 546
5. Alkomah F, Ma X (2022) A literature review of textual hate speech detection methods and
datasets. Information (Switzerland) 13(6). MDPI
6. Waseem Z, Hovy D, Hateful symbols or hateful people? Predictive features for hate speech
detection on Twitter
7. Grimminger L, Klinger R (2021) Hate towards the political opponent: a Twitter Corpus study
of the 2020 US elections on the basis of offensive speech and stance detection

8. Alkomah F, Salati S, Ma X, A new hate speech detection system based on textual and
psychological features [Online]. Available: www.ijacsa.thesai.org
9. Firmino AA, de Souza Baptista C, de Paiva AC (2024) Improving hate speech detection using
cross-lingual learning. Expert Syst Appl 235. https://doi.org/10.1016/j.eswa.2023.121115
10. Hardalov M, Arora A, Nakov P, Augenstein I (2022) A survey on stance detection for mis-
and disinformation identification. In: Findings of the association for computational linguistics:
NAACL 2022. Findings, Association for Computational Linguistics (ACL), pp 1259–1277
11. ALDayel A, Magdy W (2021) Stance detection on social media: state of the art and trends. Inf
Process Manag 58(4)
12. Barve Y, Saini JR (2021) Healthcare misinformation detection and fact-checking: a novel
approach. Int J Adv Comput Sci Appl (IJACSA) 12(10):295–303
13. Barve Y, Saini JR (2022) Detecting and classifying online health misinformation with ‘content
similarity measure (CSM)’ algorithm: an automated fact-checking based approach, pp 1–28
14. Nascimento FRS, Cavalcanti GDC, Da Costa-Abreu M (2023) Exploring automatic hate speech
detection on social media: a focus on content-based analysis. Sage Open 13(2)
15. Mossie Z, Wang J-H (2020) Vulnerable community identification using hate speech detection
on social media. Inf Process Manag 57(3)
16. Alsaif HF, Aldossari HD (2023) Review of stance detection for rumor verification in social
media. Eng Appl Artif Intell 119
17. Pamungkas EW, Basile V, Patti V (2019) Stance classification for rumour analysis in Twitter:
exploiting affective information and conversation structure. In: CEUR workshop proceedings.
CEUR-WS
18. Islam MR, Muthiah S, Ramakrishnan N (2019) Rumorsleuth: joint detection of rumor veracity
and user stance. In: Spezzano F, Chen W, Xiao X (eds) Proceedings of the 2019 IEEE/ACM
International conference on advances in social networks analysis and mining, ASONAM 2019.
Association for Computing Machinery, Inc., pp 131–136
19. Barve Y, Saini JR, Kotecha K, Gaikwad H (2022) Detecting and fact-checking misinformation
using ‘veracity scanning model’, vol 13, no 2, pp 201–209
20. Kennedy CJ, Bacon G, Sahn A, von Vacano C (2020) Constructing interval variables via faceted
Rasch measurement and multi task deep learning: a hate speech application
21. Ma J, Gao W, Wong K-F (2018) Detect rumor and stance jointly by neural multi-task learning.
In: The web conference 2018—companion of the World Wide Web conference, WWW 2018.
Association for Computing Machinery, Inc., pp 585–593
Agriculture Yield Forecasting
via Regression and Deep Learning
with Machine Learning Techniques

Aishwarya V. Kadu and K T V Reddy

Abstract India’s financial system is mostly based on agriculture, which requires
human labor to survive. The primary challenge is the growing population, which
results in a rise in nutrient demand and threatens nutrition. Farmers must increase
their output on the same plots of land in order to fulfill this expanding demand.
Technology-assisted agricultural output prediction can considerably help farmers
increase productivity. The central aim is to anticipate crop yields utilizing critical
factors that seriously endanger agriculture’s long-term viability, such as rainfall, crop
type, climatic conditions, geographic location, production data, and historical yield
records. The decision support tool for
farmers is ML-DL-powered crop yield forecasts, which will help them select the
best crops and take the most appropriate course of action throughout the growth
cycle. Even in distracting or unfavorable environmental conditions, crop choosing
using ML-DL algorithms is particularly effective in reducing production losses in
farming. This study uses DL techniques, including LSTM, i.e., Long Short-Term
Memory Networks, and CNN, i.e., Convolutional Neural Networks, and ML tech-
niques like D.T., i.e., Decision Trees, R.F., i.e., Random Forests, and XGBoost regres-
sion. Predictive tools aid small-scale farmers, improving crop yield estimations and
planting decisions. CNN outperforms L.S.T.M. However, comprehensive, accurate
data, especially environmental and weather information, are essential. In conclusion,
India’s agriculture sector grapples with feeding a growing population. Advanced ML
and DL provide solutions for long-term sustainability. Future research should explore
more deep learning methods and integrate remote sensing and satellite data for precise
crop predictions.

Keywords Deep learning · Decision Tree · Regression · Machine learning · Yield prediction · Agriculture

A. V. Kadu (B) · K. T. V. Reddy
Datta Meghe Institute of Higher Education and Research (DU), Faculty of Engineering and Technology, Sawangi (M), Wardha, India
e-mail: aishwaryakadu17@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 219
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_18
220 A. V. Kadu and K. T. V. Reddy

1 Introduction

Humans started actively managing their land and plants in prehistoric times, also
known as the New Stone Age, which began around 15,000 years ago. India’s economy
relies heavily on agriculture, which meets most of its food needs [1]. Due to the
considerable weather changes and the country’s rapidly increasing population, it
becomes imperative to maintain a balance between the demand for food and its
supply. Various scientific approaches have been integrated into agriculture to harmo-
nize food supply and demand [2]. The significant environmental variances make it
difficult for farmers to develop adaptable and sustainable strategies.
In India, agriculture accounts for about eighty percent of water usage. In 2019, over 70% of the Indian workforce was engaged in agriculture, contributing significantly to the country's GDP (20–21%) [3]. The total land area in India has remained relatively constant at about 536,879 thousand hectares between 2000 and 2023. India's agricultural sector is projected to reach a value of USD 26 billion by 2026, with 80% of sales occurring in the retail sector; India has the world's fifth-largest food and grocery market.
According to such data, consumer appetite for food products has increased across
the nation in rural and urban areas, and the income level is increasing. As a result, the
agricultural industry is seeing various online farming technologies launch and a rise
in cutting-edge technologies like drones, blockchain, remote sensing technologies,
and geographic information systems (G.I.S.) [4, 5]. Crop production in India follows
a seasonal pattern, with the Kharif season being the most productive and winter
being the least [6]. Crop yield estimation is a valuable tool for farmers to optimize
their production, considering factors like farming practices, pesticide use, weather,
and market prices. Recent advancements in machine learning have notably impacted
agriculture, enhancing crop yield predictions and improving farming practices [7, 8].
A variety of machine learning methods have been used, such as Support Vector Machines (SVM), Artificial Neural Networks (ANN), Decision Trees (DT), and deep learning (DL). Illustrated in Fig. 1 is a representation of
India’s crop production data, covering the years from 2000 to 2023 [9]. These data
show that most of the country's agricultural land is devoted to wheat and rice. This allocation
played a pivotal role in contributing to more than 84% of the staple grain production
within the nation. India plays a substantial role in the global rice trade, accounting
for approximately 50% of Basmati and Non-Basmati rice exports and conducting
trade with over 150 countries. According to data from the commerce ministry, rice
shipments rose by 13% to 517 thousand metric tons during the initial quarter of
the financial year 2022–2023 [10]. The data representation emphasizes that rice
has the highest production and land use rates. Over the previous year, the agriculture industry has seen strong export growth; in the fiscal year 2022, total rice exports, including Basmati and Non-Basmati types, reached USD 7.14 billion. As agriculture evolved over the centuries, ancient practices involving celestial observations and animal sacrifices gave way to increasingly rigorous methods; farming during the
Agriculture Yield Forecasting via Regression and Deep Learning … 221

Fig. 1 Crop production yearly average 2000–2023 [9]

era of industrialization was shaped by mechanization, mathematics, and precise scales. Gradually, modern approaches gained prominence. Crop forecasting models often depend
on weather conditions, soil properties, and historical crop yields [11]. Agriculture
has relied on these regression models for nearly a century to make informed predic-
tions. Furthermore, areas like classification and fruit recognition have emerged as
notable areas of interest in artificial intelligence and the categorization of images for
the farm field [12, 13].

1.1 Research Contributions

India is primarily an agrarian nation, with nearly half of its population engaged in
agriculture. One of the most concerning issues in the country is the distressing rate of
farmer suicides. According to data from the Indian National Crime Records Bureau
(N.C.R.B.), between 1995 and 2020, over 200,000 Indian farmers tragically lost their
lives. This situation is frequently attributed to their challenges in repaying loans,
which they typically obtain from financial institutions and private lenders. Inade-
quate knowledge about crop cultivation, including which crops are best suited for
each season, exacerbates these problems. This study addresses a significant problem
by using historical agricultural production data to establish a dependable method for
forecasting crop yields at an early stage. Forecasting agricultural yields is challenging
because it depends on many variables, including rainfall, wind speed, soil character-
istics, climate conditions, humidity, and temperature. Complicating matters further,
data for these variables must be sourced from multiple, diverse sources. Despite

numerous studies on this topic, there is still room for improvement. The study lever-
ages ML-DL techniques to provide a reliable method for agricultural yield predic-
tion. In practice, this means using government data and mathematical models to make
more accurate predictions and assess potential losses. This research offers valu-
able insights to researchers, enabling them to understand better the problem and its
potential solutions in the agricultural domain.

1.2 Article Organization

The main aim of this research is to predict agricultural crop yields. To provide a comprehensive view of the existing literature, Sect. 2 explores prior theories and research on predicting crop yields to support our research goals and bridge knowledge gaps. Moving on to Sect. 3, this segment delves into specifics about the study.
In detail, it examines the location, data sources, technology, research approach, and
various methods for predicting crop yields. It also elucidates the inter-connections
among different features within the dataset. Section 4 shows the discussion related
to crop production. Finally, Sect. 5 serves as the concluding segment of this study,
summarizing the work and putting forth suggestions for future research efforts in
this field.

2 Literature Review

Applying machine learning approaches has enabled the estimation of crop yields
[14]. One such effort used a dataset encompassing variables such as total cultivated area, average maximum temperature, water sources, and canal length to forecast agricultural output under irrigation. Notably, the computational model developed in this study outperformed models built with Shallow Neural Network (SNN), RF, Lasso, and DNN approaches. For dataset validation using projected weather data, the Root Mean Square Error (RMSE) was equivalent to fifty percent of the average yield and 13% of the standard deviation. This underscores the efficacy of machine learning techniques in improving crop yield predictions. Between 1998 and 2002,
specifically during the Kharif season, a study achieved an impressive accuracy rate
of 97.5%. This success was achieved by analyzing various factors, including rainfall,
crop production, crop yield, and area information. In light of the substantial impact
of rainfall on Kharif crop production, the researchers in the study initiated their work
by utilizing a modular Artificial Neural Network (ANN) for predicting rainfall. After
obtaining the rainfall forecasts, they used S.V.M. to calculate yield using area and
rainfall data for analysis. Combining these two approaches proved highly effective in
enhancing crop yield predictions during the Kharif season. Another study investigated the ability of the Artificial

Neural Network to predict two crops, soybean and corn, under unfavorable environmental conditions. It assessed method performance in estimating localized, regional, and state-level data, evaluated how the ANN model performed with different parameters, and compared the evolved ANN model's performance with other models, such as multiple linear regression. Artificial Neural Networks were used in a different study to assess rice production in several towns in Maharashtra, India. Data from open government records were
collected for all 27 districts in Maharashtra, highlighting the flexibility of Artificial
Neural Networks in predicting crop yields in varied geographical regions. This study
aimed to achieve superior crop yield estimation by utilizing various ML methods,
including K-Nearest Neighbor, i.e., K.N.N., Artificial Neural Network, i.e., ANN,
Random Forest, i.e., R.F., and Support Vector Regression, i.e., S.V.R. The dataset used
for this research consisted of 745 examples. Seventy percent of these examples were
randomly chosen and utilized to train the model, with the remaining thirty percent
saved for testing and performance evaluation. The final analysis revealed that RF
demonstrated the highest level of accuracy among the methods employed. In a separate research endeavor focusing on southern Brazil, the study introduced a unique model for forecasting soybean yield using LSTM networks and satellite data. This innovative approach leverages advanced technology to enhance soybean yield predictions in the region. The research compares the effectiveness of LSTM, NN, RF, and multiple linear regression [15]. This analysis uses independent variables such as surface temperature, rainfall, and plant measures to forecast soybean yields. A secondary objective
is determining how early these models can confidently forecast crop yields. The proposed methodology predicts the crop grown in each location within the relevant database [16]. The experimental results underscore the approach's significant potential in accurately forecasting agricultural productivity, as validated through real-time data and stakeholder interactions. Several machine learning (ML) methods, such as Decision Trees, Lasso, and linear regression, have been used to forecast agricultural production; several of these techniques performed worse than the Decision Tree (DT) approach. A yield system was implemented using a KNN algorithm. It is important
to note that yield predictions for farmers should consider multiple factors influencing
crop production and quality.
Crop output depends on timing, crop type, and region. To anticipate yield, data
like calendar year, product, region, location, and period are vital. Accurate crop yield
history is essential for managing agricultural risks. In a study, Decision Tree (D.T.),
Random Forest (R.F.), and K-Nearest Neighbors (K.N.N.) Classifiers were evaluated
with Gini and entropy metrics. Random Forest produced the most accurate outcomes.

3 Proposed Work

3.1 Study Area

India’s land area encompasses a vast 5,36,879 thousand hectares between 2000 and
2023, extending from the north’s snow-capped Himalayan range to the south’s trop-
ical forests. In this expanse, 3,10,721 thousand hectares (between 1980 and 2022) are
designated for agricultural use. To conduct this study, the researchers have selected
ten key Indian crops predominantly cultivated in the region. These crops include
wheat, cotton, jowar, maize, jute rice, bajra, and ragi, representing a comprehensive
examination of major agricultural products in India.

3.2 Data Sources

The data are gathered from several publicly accessible government websites,
including https://kaggle.com and https://data.gov.in, which are sources for several
data compilations. The dataset includes important information, including the produc-
tion, district name, crop type, state name, yield, crop year, wind speed, seasons,
area under irrigation, rainfall, humidity, and total area from 1980 to 2022. Agricultural yields for these years are shown visually in Fig. 1. This study uses a variety of prediction models, such as Long Short-Term Memory Network (LSTM), Decision Tree (DT), XGBoost Regression, Random Forest (RF), and Convolutional Neural Network (CNN), to forecast agricultural production for India. To ensure prediction accuracy, models are assessed using metrics such as root mean squared error, test loss, accuracy, and standard deviation. The study methodology, depicted in Fig. 2, employs K-fold cross-validation on the training set to evaluate the efficacy of the trained models [17].

3.3 Methods

3.3.1 K-Fold Cross-validation

In k-fold cross-validation, the dataset is divided into k subsets: in each iteration, k − 1 folds are used as the model's training data, and the remaining subset is held out for validation. A performance statistic, often accuracy, is calculated at each iteration. It is a valuable approach, particularly when working with limited input data. Figure 3 visually represents the flowchart of the method adopted, demonstrating the steps involved in this process.
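The procedure above can be sketched in a few lines; the feature names and the least-squares model below are illustrative stand-ins, not the authors' actual dataset or models:

```python
import numpy as np

# Toy stand-in for the crop dataset: 100 samples, 4 features
# (e.g., rainfall, humidity, area, temperature) and one yield target.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = X @ np.array([2.0, 1.0, 3.0, 0.5]) + rng.normal(0, 0.1, 100)

k = 5
indices = rng.permutation(len(X))
folds = np.array_split(indices, k)

rmses = []
for i in range(k):
    val_idx = folds[i]                        # one fold held out for validation
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    pred = X[val_idx] @ w
    rmses.append(float(np.sqrt(np.mean((pred - y[val_idx]) ** 2))))

print(k, round(sum(rmses) / k, 3))            # average validation RMSE
```

Averaging the per-fold metric, as done here, is what makes the estimate robust when input data are limited.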

Fig. 2 k-Fold cross-validation

3.3.2 Decision Tree

A Decision Tree accomplishes two key tasks: First, it classifies the pertinent elements
for each decision point and then decides the best course of action based on these
selected features [18]. Figure 4 shows the Decision Tree algorithm, which handles both regression and classification by assigning a probability distribution to plausible choices. Each node in the tree represents a feature, branches correspond to selections, and leaf nodes signify outcomes. Tree construction begins by choosing a single feature as the root node. Data splitting is the crucial step that completes the Decision Tree's construction.
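The splitting step described above can be illustrated with a small sketch; the synthetic data and the variance-reduction criterion are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

# Choose the (feature, threshold) split that minimizes the weighted variance
# of the two child nodes -- the core step of regression-tree construction.
def best_split(X, y):
    best_f, best_t, best_score = None, None, np.inf
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(right) == 0:
                continue
            score = len(left) * left.var() + len(right) * right.var()
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t, best_score

rng = np.random.default_rng(1)
X = rng.random((50, 2))
y = np.where(X[:, 0] > 0.5, 10.0, 1.0)   # yield depends mainly on feature 0

feature, threshold, _ = best_split(X, y)
print(feature, round(float(threshold), 2))
```

The chosen split becomes the root node; the same procedure is then applied recursively to each side.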

Fig. 3 Flowchart of the method adopted

3.3.3 Random Forest

Figure 5 illustrates how Random Forest (R.F.) enhances the bagging method to
create an uncorrelated forest of Decision Trees by incorporating feature selection.
It employs methods such as random feature selection, feature randomization, or the
random subspace technique. R.F. focuses on a subset of attribute segments, allowing
it to exploit random feature selection. Each Decision Tree in the ensemble is built using a bootstrap sample from the training set, with roughly the remaining third of the observations (the out-of-bag samples) reserved for testing purposes.
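The bootstrap behavior described above (roughly a third of rows left out of each tree's sample) can be checked numerically; the sample size and tree count below are arbitrary illustrative choices:

```python
import numpy as np

# Each tree trains on a bootstrap sample (n draws with replacement);
# the rows never drawn are that tree's "out-of-bag" test set.
rng = np.random.default_rng(2)
n = 10_000
oob_fracs = []
for _ in range(20):                       # 20 hypothetical trees
    boot = rng.integers(0, n, size=n)     # bootstrap sample of row indices
    oob = n - len(np.unique(boot))        # rows never drawn for this tree
    oob_fracs.append(oob / n)

print(round(float(np.mean(oob_fracs)), 3))  # close to 1/e, about 0.368
```

The out-of-bag fraction converges to (1 − 1/n)^n ≈ 1/e, which is where the "remaining third" rule of thumb comes from.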

3.3.4 XGBoost

XGBoost, a widely used open-source implementation of gradient-boosted trees, employs regression trees as weak learners for regression. It assigns a continuous score to each input point by mapping it to a leaf. The iterative training process sequentially adds new trees that predict the residuals of the current ensemble; this technique, called gradient boosting, gradually incorporates models to reduce the loss. Figure 6 highlights XGBoost's effectiveness in machine learning and predictive modeling.

Fig. 4 Decision Tree

Figures 4, 5, and 6 show examples of Decision Trees (DT), Random Forests (RF), and XGBoost, respectively.

3.3.5 Convolutional Neural Network (CNN)

In this CNN implementation, there are seven layers. The first layer is a one-dimensional convolutional layer with 64 filters and a kernel size of 3. The second layer employs MaxPooling1D with a pool size of two. A dropout mechanism is applied in the third layer. The fourth layer applies the ReLU activation function, and Layer 5 flattens its output. Layer 6 is a hidden neural network layer comprising 330 neurons. The final layer, Layer 7, employs the SoftMax function with 11 neurons corresponding to the different output types. Figure 7 depicts the CNN architecture of this model.
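The layer dimensions can be traced with a short calculation; the input length of 12 and 'valid' padding are assumptions for illustration, since the paper does not state the input shape:

```python
# Shape trace of the seven-layer 1D CNN described above,
# assuming an input sequence of length 12 with 'valid' padding (no zero-pad).
def conv1d_out_len(n, kernel=3, stride=1):
    return (n - kernel) // stride + 1

def maxpool1d_out_len(n, pool=2):
    return n // pool

seq_len = 12
n = conv1d_out_len(seq_len)   # layer 1: Conv1D, 64 filters -> (10, 64)
filters = 64
n = maxpool1d_out_len(n)      # layer 2: MaxPooling1D -> (5, 64)
# layer 3: dropout and layer 4: ReLU leave the shape unchanged
flat = n * filters            # layer 5: Flatten -> 320 units
hidden = 330                  # layer 6: dense hidden layer
outputs = 11                  # layer 7: dense + SoftMax over 11 output types
print(n, flat, hidden, outputs)
```

Tracing shapes this way is a quick sanity check before building the model in a framework such as Keras.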

Fig. 5 Random Forest

3.3.6 Long Short-Term Network (LSTM)

The study employs Long Short-Term Memory (LSTM) networks, a specialized type of Recurrent Neural Network (RNN) designed for sequential data processing. These networks excel at handling sequences such as time series data, words, and sounds, capturing long-term dependencies effectively. Both the CNN and LSTM models use the Adam optimizer, which adjusts neural network attributes such as layer weights and learning rates by combining the RMSProp and gradient-descent-with-momentum algorithms. Mean Squared Error (MSE) is the loss function chosen for training. Hyperparameters for training the model include fifty epochs, a batch size of thirty-two, and a validation split of 0.25. Figure 8 illustrates the LSTM architecture of the model.
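A single Adam update, combining the momentum and RMSProp ideas mentioned above, can be sketched as follows; the toy quadratic loss stands in for the actual training objective:

```python
from math import sqrt

# One Adam step: m is the momentum (running mean of gradients),
# v is the RMSProp-style running mean of squared gradients.
def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)            # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 51):                   # fifty steps, echoing the fifty epochs
    grad = 2 * w                         # gradient of the toy MSE-like loss w**2
    w, m, v = adam_step(w, grad, m, v, t)
print(round(w, 4))                       # w moves steadily toward the minimum at 0
```

In practice the same update is applied per weight by the framework's optimizer; this sketch only exposes the mechanics.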

Fig. 6 XGBoost

Fig. 7 CNN architecture



Fig. 8 Working of LSTM [18]

4 Discussion

The authors emphasize the significance of "Crop" as a crucial element, visually illustrating the interdependence of dataset features. Figure 9 displays production
numbers for the ten crops studied. In contrast, Fig. 10 highlights historical relation-
ships between “Area” and “Crop Type,” revealing that wheat production occupies
the majority of land, with rice ranking second in importance [19].
A historical view on the connections between the several elements in the dataset,
such as Crop type, Area, and Production, is given in Figs. 9 and 10. The outcomes
of the forecast are shown in Table 1. In terms of accuracy, the research results show
that Random Forest outperforms other machine learning algorithms [19]. Taking
statistical data into account, Random Forest provides the most accurate estimate

Fig. 9 Relationship between crop and production [17, 18]

Fig. 10 Area allotted to crops [18]

of crop output for India, with a 98.96% accuracy rate, 1.97 mean absolute error
(M.A.E.), 2.45 root mean square error (R.M.S.E.), and 1.23 standard deviation (SD)
(see Table 1). By contrast, the accuracy rates for Decision Tree and XGBoost are 89.78% and 86.46%, respectively, with mean absolute errors of 4.58 and 6.31, root mean square errors of 5.86 and 7.89, and corresponding standard deviations of 2.75 and 3.54 [18].
The accuracy ratings of 89.78 and 86.46 for Decision Tree and XGBoost Regres-
sion, respectively, are much lower than those of Random Forest, which received a
score of 98.96% [19]. This shows that, in the context of this study, when machine learning techniques are used to anticipate India's agricultural production, Random Forest performs better than the other regression approaches.
Machine learning is sometimes called a "black box" technology because its predictions are not easily interpretable. The exceptional performance of Random Forest highlights its usefulness in this particular application.
Additionally, changing the number of epochs in the training process significantly affects the mean absolute error. Among the deep learning models, the Convolutional Neural Network (CNN) outperforms Long Short-Term Memory (LSTM), as using CNN results in reduced loss [20].

Table 1 Model performance with production and area as inputs (Sharma et al. [18])

Model                    Accuracy (%)   RMSE   MAE    SD
Random Forest (RF)       98.96          2.45   1.97   1.23
Decision Tree (DT)       89.78          5.86   4.58   2.75
XGBoost                  86.46          7.89   6.31   3.54

5 Conclusion and Future Scope

The increasing population has made managing food demand and supply more chal-
lenging. Experts have been diligently working to predict agricultural yield produc-
tion. This study focuses on forecasting crop yields in India using a spectrum of ML
and deep learning methods, emphasizing the benefits of advanced methods. Small-
scale farmers can particularly benefit from these predictions, as they can use them
to estimate future crop production and plan their planting accordingly. The study
applies five different ML and DL methods to analyze the dataset: Long Short-Term Memory Networks (LSTM), Decision Tree (DT), Convolutional Neural Network (CNN), Random Forest (RF), and XGBoost regression. However, more extensive yearly crop data, with precise environmental and weather information, are still needed. In terms of performance, CNN outperforms LSTM, as the loss is lower with CNN. Further exploration with deep learning models
is required to identify the most effective method. Integrating remote sensing data with
district-level statistical data is suggested to enhance the accuracy of crop production
predictions. Using satellite imagery for land cover or image classification can lead
to more accurate predictions.

References

1. Kumar CMS et al (2023) Solar energy: a promising renewable source for meeting energy
demand in Indian agriculture applications. 55:102905
2. Lezoche M et al (2020) Agri-food 4.0: a survey of the supply chains and technologies for the
future agriculture. 117:103187
3. Baragde DB, Jadhav AU (2021) Impact of COVID-19 on Indian SMEs and survival strategies.
In: Handbook of research on strategies and interventions to mitigate COVID-19 impact on
SMEs. IGI Global, pp 280–298
4. Martos V et al (2021) Ensuring agricultural sustainability through remote sensing in the era of
agriculture 5.0. 11(13):5911
5. Goel RK et al (2021) Smart agriculture–urgent need of the day in developing countries.
30:100512
6. Sambasivam VP et al (2020) Selection of winter season crop pattern for environmental-friendly
agricultural practices in India. 12(11):4562
7. Vyas S et al (2022) Integration of artificial intelligence and blockchain technology in healthcare
and agriculture
8. Sharma A et al (2020) Machine learning applications for precision agriculture: a comprehensive
review 9:4843–4873
9. Saeed I et al (2020) Basmati rice cluster feasibility and transformation study. 131:434
10. Mukundan A et al (2023) The Dvaraka initiative: mars’s first permanent human settlement
capable of self-sustenance 10(3):265
11. Basso B, Liu LJIA (2019) Seasonal crop yield forecast: methods, applications, and accuracies.
154:201–255
12. Lu Y, Young SJC, Agriculture EI (2020) A survey of public datasets for computer vision tasks
in precision agriculture. 178:105760
13. Zhang Q et al (2020) Applications of deep learning for dense scenes analysis in agriculture: a
review. 20(5):1520

14. Rashid M et al (2021) A comprehensive review of crop yield prediction using machine learning
approaches with special emphasis on palm oil yield prediction 9:63406–63439
15. Alquthami T et al (2022) A performance comparison of machine learning algorithms for load
forecasting in smart grid 10:48419–48433
16. Hussain N, Sarfraz S, Javed S (2021) A systematic review on crop-yield prediction through
unmanned aerial vehicles. In: 2021 16th international conference on emerging technologies
(ICET), IEEE
17. Saud S et al (2020) Performance improvement of empirical models for estimation of global
solar radiation in India: a k-fold cross-validation approach 40:100768
18. Sharma P et al (2023) Predicting agriculture yields based on machine learning using regression
and deep learning
19. Landi F et al (2021) Working memory connections for LSTM 144:334–341
20. Khan ZA et al (2020) Towards efficient electricity forecasting in residential and commercial
buildings: A novel hybrid CNN with a LSTM-AE based framework 20(5):1399
Performance Comparison of M-ary
Phase Shift Keying and M-ary
Quadrature Amplitude Modulation
Techniques Under Fading Channels

Tadele A. Abose, Ketema T. Megersa, Kehali A. Jember, Diriba C. Kejela, Samuel T. Daka, and Moti B. Dinagde

Abstract Modulation techniques and channel conditions play a major role in the
development of wireless communication systems. Information can be sent easily
and with little to no error if the right channel conditions and modulation technique
are used. To reduce errors during data transmission, an effective communication
system must be developed. Most of the research has been done on either M-ary
phase shift keying (MPSK) or M-ary quadrature amplitude modulation (MQAM)
under Rician or Rayleigh fading channels. However, works that consider M-ary phase
shift keying (MPSK) and M-ary quadrature amplitude modulation (MQAM) under
Rician, Rayleigh, Nakagami-m, and AWGN fading channels are rarely investigated.
This research examines two fundamental types of digital modulation techniques: M-
PSK (M = 2, 8, 16) and M-QAM (M = 4, 8, 16, 64). It also discusses the different
types of channels that are used between transmitter and receiver, such as AWGN,
Rayleigh, Rician, and Nakagami fading channels, and evaluates the receiver’s BER
performance characteristics for each of these channels. The MATLAB simulation
results demonstrated that, for the same degree of signal-to-noise ratio, the bit error
rate for digital modulations (M-PSK, M-QAM) under an AWGN channel is lower
than that achieved over a Rayleigh fading channel. When using the M-PSK modula-
tion technique, the Rician fading channel has more communication impairment than
AWGN. When using the M-PSK modulation technique, the Nakagami-m fading
channel outperformed the Rayleigh fading channel with the same SNR value.

Keywords AWGN channel · BER · M-PSK · M-QAM · Nakagami fading channel · Rayleigh fading channel · Rician fading channel

T. A. Abose (B) · K. T. Megersa · K. A. Jember · D. C. Kejela · S. T. Daka · M. B. Dinagde
Mattu University, Mattu, Ethiopia
e-mail: tadenegn@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 235
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_19
236 T. A. Abose et al.

1 Introduction

In the last ten years, the field of wireless communication has advanced due to the
increasing demand for voice and multimedia services on mobile wireless networks.
Digital modulation is one of the main underlying technologies that enables digital
data to be transported or transmitted across analog radio frequency (RF) channels,
as presented by Dai et al. [1].
By improving the wireless network’s capacity, speed, and quality, digital modu-
lation techniques help to accelerate the growth of mobile wireless communications.
There is work focusing on various modulation algorithms that encode digital data
using a finite number of phases. PSK is a type of phase modulation in which the
input bits cause the carrier to phase shift to one of a limited number of potential
phases. The QAM technique is aimed at increasing the spacing between symbols
in a constellation by modulating the carrier’s amplitude and phase, as presented by
Xiong [2].
Compared to analog modulation methods, digital modulation schemes are better
able to transmit an enormous amount of information. Examining the relative bit
error rate performance of various modulation schemes in AWGN, Rayleigh, Rician,
and Nakagami fading channels is necessary to achieve optimal results when taking
reflectors and obstructions in wireless propagation channels into account. The trans-
mitted signal is impacted by AWGN noise as it travels along the channel. Over a
specific frequency range, it has a continuous, uniform frequency spectrum. In Rayleigh fading, the signal received at the receiver is the sum of all the reflected and dispersed waves; it arises when there is no LOS path, only indirect paths, between the transmitter and the receiver, as presented by Rajesh et al. and Kumar et al. [3, 4].
A Rician fading channel is a stochastic model for radio propagation variance caused by the partial cancellation of a radio signal by itself. The signal travels over multiple pathways before arriving at the receiver, at least one of which is changing or fluctuating. It is useful for modeling mobile wireless communication systems in which the broadcast signal can follow a dominant LOS or direct path to the receiver. The bit error rate (BER) is calculated by dividing the number of bits with errors by the total number of bits transmitted, received, or processed within a specified time frame, as presented by Abose et al., Rajesh et al., and Kumar et al. [3–5].
Most of the research has been done on either M-ary phase shift keying (MPSK) or
M-ary quadrature amplitude modulation (MQAM) under Rician or Rayleigh fading
channels. However, works that consider M-ary phase shift keying (MPSK) and M-ary
quadrature amplitude modulation (MQAM) under Rician, Rayleigh, Nakagami-m,
and AWGN fading channels are rarely investigated.
The rest of the paper is organized as follows: Sect. 2 presents the system model,
and Sect. 3 presents results and discussion. Section 4 concludes the paper.

2 Proposed Method

The BER of digital modulations over various fading channels is presented in this section. The bit error rate is the ratio of the total number of erroneous bits to the total number of transmitted bits. It is a crucial means of monitoring transmission quality and is frequently expressed as a percentage. Errors can be caused by bit synchronization issues, distortion, interference, or noise.

Totalnumberoferrorbits
BER = . (1)
Totalnumberoftransmittedbits
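Equation (1) can be illustrated with a toy Monte Carlo estimate (a Python sketch, not the paper's MATLAB code); the Eb/N0 value and bit count are arbitrary choices:

```python
import numpy as np
from math import erfc, sqrt

# Monte Carlo BER of BPSK over AWGN at Eb/N0 = 4 dB, compared with the
# theoretical value 0.5*erfc(sqrt(Eb/N0)).
rng = np.random.default_rng(0)
n_bits = 1_000_000
ebno = 10 ** (4.0 / 10)                           # 4 dB in linear scale

bits = rng.integers(0, 2, n_bits)
symbols = 2 * bits - 1                            # BPSK mapping: 0 -> -1, 1 -> +1
noise = rng.normal(0, sqrt(1 / (2 * ebno)), n_bits)  # unit Eb, noise variance N0/2
decided = (symbols + noise > 0).astype(int)

ber_sim = float(np.mean(decided != bits))         # Eq. (1): error bits / total bits
ber_theory = 0.5 * erfc(sqrt(ebno))
print(round(ber_sim, 5), round(ber_theory, 5))
```

With a million bits the simulated rate lands very close to the closed-form value, which is the consistency check the MATLAB simulations in this paper rely on.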

2.1 BER of Digital Modulations Over AWGN

In this subsection, the BER analysis of MQAM and MPSK under AWGN will be
presented.
For a large energy-per-bit to noise power spectral density ratio ($E_b/N_0$) and M > 4, the BER expression can be written as [6]:

$$\mathrm{BER}_{M\text{-PSK}} = \frac{2}{\log_2 M}\, Q\!\left(\sqrt{\frac{2E_b \log_2 M}{N_0}}\,\sin\frac{\pi}{M}\right), \quad (2)$$

$$\mathrm{BER}_{M\text{-PSK}} = \frac{1}{\log_2 M}\,\mathrm{erfc}\!\left(\sqrt{\frac{E_b \log_2 M}{N_0}}\,\sin\frac{\pi}{M}\right) = \frac{1}{m}\,\mathrm{erfc}\!\left(\sqrt{\frac{m E_b}{N_0}}\,\sin\frac{\pi}{M}\right), \quad m = \log_2 M, \quad (3)$$

where $E_b$ is the energy per bit, $N_0$ is the noise power spectral density, $Q$ is the Q-function, and erfc is the complementary error function.

The BER for a rectangular M-QAM (4-QAM, 8-QAM, 16-QAM, and 64-QAM) is given as:

$$\mathrm{BER}_{M\text{-QAM}} = \frac{2}{\log_2 M}\left(1-\frac{1}{\sqrt{M}}\right)\mathrm{erfc}\!\left(\sqrt{\frac{3E_b \log_2 M}{2(M-1)N_0}}\right) \quad (4)$$

$$= \frac{2}{m}\left(1-\frac{1}{\sqrt{M}}\right)\mathrm{erfc}\!\left(\sqrt{\frac{3m E_b}{2(M-1)N_0}}\right). \quad (5)$$
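Equations (3) and (5) translate directly into code; the sketch below uses the standard-library erfc and is an approximation intended for M > 4 and moderate-to-high Eb/N0:

```python
from math import erfc, log2, pi, sin, sqrt

# Approximate AWGN BER per Eqs. (3) and (5); ebno is Eb/N0 in linear scale.
def ber_mpsk(M, ebno):
    m = log2(M)
    return (1 / m) * erfc(sqrt(m * ebno) * sin(pi / M))

def ber_mqam(M, ebno):
    m = log2(M)
    return (2 / m) * (1 - 1 / sqrt(M)) * erfc(sqrt(3 * m * ebno / (2 * (M - 1))))

ebno = 10 ** (10 / 10)                    # 10 dB
for M in (8, 16):
    print(M, f"{ber_mpsk(M, ebno):.3e}", f"{ber_mqam(M, ebno):.3e}")
```

Evaluating both functions over a range of Eb/N0 values reproduces the familiar waterfall curves: BER rises with M for M-PSK, and M-QAM outperforms M-PSK of the same order.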

2.2 BER of Digital Modulations over Rayleigh Fading


Channel

In this subsection, the BER analysis of MQAM and MPSK under the Rayleigh fading
channel will be presented.
The BER of the BPSK modulation under the Rayleigh fading channel can be
expressed as [7]:
$$\mathrm{BER}_{\mathrm{BPSK,\,Rayleigh}} = \frac{1}{2}\left(1-\sqrt{\frac{\bar{\gamma}}{1+\bar{\gamma}}}\,\right) \quad (6)$$

and can also be written as

$$\mathrm{BER}_{\mathrm{BPSK,\,Rayleigh}} = \frac{1}{2}\left(1-\sqrt{\frac{E_b/N_0}{1+E_b/N_0}}\,\right). \quad (7)$$

In Rayleigh fading, the average BER for M-QAM is given by [7]:

$$\mathrm{BER}_{M\text{-QAM,\,Rayleigh}} \approx \frac{2}{\log_2 M}\left(1-\frac{1}{\sqrt{M}}\right)\sum_{i=1}^{\sqrt{M}/2}\left(1-\sqrt{\frac{1.5(2i-1)^2\,\bar{\gamma}\,\log_2 M}{M-1+1.5(2i-1)^2\,\bar{\gamma}\,\log_2 M}}\,\right) \quad (8)$$

$$\approx \frac{2}{m}\left(1-\frac{1}{\sqrt{M}}\right)\sum_{i=1}^{\sqrt{M}/2}\left(1-\sqrt{\frac{1.5(2i-1)^2\,m\,E_b/N_0}{M-1+1.5(2i-1)^2\,m\,E_b/N_0}}\,\right). \quad (9)$$
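Equations (7) and (9) can be sketched as follows, comparing Rayleigh-fading averages against the AWGN case at the same average Eb/N0 (an illustrative choice of 10 dB):

```python
from math import erfc, log2, sqrt

# BPSK BER: AWGN (exact) vs. Rayleigh-fading average, per Eq. (7).
def ber_bpsk_awgn(ebno):
    return 0.5 * erfc(sqrt(ebno))

def ber_bpsk_rayleigh(ebno):
    return 0.5 * (1 - sqrt(ebno / (1 + ebno)))

# Average M-QAM BER over Rayleigh fading, per Eq. (9).
def ber_mqam_rayleigh(M, ebno):
    m = log2(M)
    total = 0.0
    for i in range(1, int(sqrt(M) // 2) + 1):
        g = 1.5 * (2 * i - 1) ** 2 * m * ebno
        total += 1 - sqrt(g / (M - 1 + g))
    return (2 / m) * (1 - 1 / sqrt(M)) * total

ebno = 10 ** (10 / 10)                    # 10 dB average Eb/N0
print(f"{ber_bpsk_awgn(ebno):.2e}",
      f"{ber_bpsk_rayleigh(ebno):.2e}",
      f"{ber_mqam_rayleigh(16, ebno):.2e}")
```

The fading average is orders of magnitude worse than AWGN at the same Eb/N0, which is exactly the gap the simulation curves in Sect. 3 exhibit.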

2.3 BER of Digital Modulations Over Nakagami Fading


Channel

In this subsection, the BER analysis of MQAM and MPSK under the Nakagami
fading channel will be presented.
For a noisy phase reference in a Nakagami-m fading channel, the estimated
average BER expression of MPSK is expressed as [8]:
$$\mathrm{BER}_{\mathrm{MPSK}} \cong \frac{1}{2\sqrt{\pi}\,\log_2 M}\sum_{n=0}^{\frac{M}{2}-1} \frac{\sqrt{\dfrac{\bar{\gamma}\,\log_2 M}{m}}\,\sin\dfrac{(2n+1)\pi}{M}}{\left(1+\dfrac{\bar{\gamma}\,\log_2 M}{m}\,\sin^2\dfrac{(2n+1)\pi}{M}\right)^{m+\frac{1}{2}}} \cdot \frac{\Gamma\!\left(m+\frac{1}{2}\right)}{\Gamma(m+1)}\; {}_2F_1\!\left(1,\, m+\frac{1}{2};\, m+1;\, \frac{m}{m+\bar{\gamma}\,\log_2 M\,\sin^2\frac{(2n+1)\pi}{M}}\right), \quad (10)$$

where ${}_2F_1(\cdot)$ is the Gaussian hypergeometric function, $\bar{\gamma}$ is the average SNR per bit, and $m$ is the Nakagami-m fading parameter.
The expression of the average BER of M-QAM over the Nakagami fading channel
is [8]:

$$\mathrm{BER}_{\mathrm{MQAM}} = \frac{1}{\log_2\sqrt{M}}\sum_{k=1}^{\log_2\sqrt{M}} p_b(k), \quad (11)$$

where $p_b(k)$ is given by (12):

$$p_b(k) = \frac{1}{\sqrt{M}}\sum_{i=0}^{\left(1-2^{-k}\right)\sqrt{M}-1} (-1)^{\left\lfloor \frac{i\,2^{k-1}}{\sqrt{M}} \right\rfloor}\left(2^{k-1}-\left\lfloor \frac{i\,2^{k-1}}{\sqrt{M}}+\frac{1}{2} \right\rfloor\right)\left[\frac{1}{2}-\sqrt{\frac{(2i+1)^2\,3\log_2 M\,\bar{\gamma}_0}{2m(M-1)}}\cdot\frac{1}{\Gamma(m)\,\Omega^m\sqrt{\pi}}\; G_{2,2}^{1,2}\!\left(\frac{2(M-1)m}{(2i+1)^2\,3\log_2 M\,\bar{\gamma}_0}\;\middle|\;{1-m,\ \frac{1}{2}-m \atop 0,\ -m}\right)\right], \quad (12)$$

where $G$ is Meijer's G-function.

2.4 BER of Digital Modulations over Rician Fading Channel

In this subsection, the BER analysis of MQAM and MPSK under the Rician fading
channel will be presented.
With Rician parameter K and diversity N, the probability of symbol error for
MPSK across Rician fading channels can be stated as [9, 10]:
$$p_s(E) = \frac{1}{\pi} \int_{-\frac{\pi}{2}}^{\frac{\pi}{2}-\frac{\pi}{M}} \left( \frac{N+K}{N+K+\bar{\gamma}\,\frac{\sin^2(\pi/M)}{\cos^2\theta}} \right)^{N} \exp\!\left[ -\frac{NK\,\bar{\gamma}\,\frac{\sin^2(\pi/M)}{\cos^2\theta}}{N+K+\bar{\gamma}\,\frac{\sin^2(\pi/M)}{\cos^2\theta}} \right] d\theta. \qquad (13)$$
240 T. A. Abose et al.

With diversity N, mean symbol SNR, Rician parameter K, and M-QAM over
Rician fading channels, the probability of symbol error is as follows [9, 10]:
$$p(\bar{\gamma}) \approx 0.2\,\log_2 M \left[ \frac{(N+K)(M-1)}{(N+K)(M-1)+1.5\,\bar{\gamma}} \right]^{N} \exp\!\left[ -\frac{1.5\,K\,\bar{\gamma}}{(N+K)(M-1)+1.5\,\bar{\gamma}} \right]. \qquad (14)$$
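Under one plausible reading of Eq. (14) (the bracketed ratio raised to the diversity order N; the function name and signature are illustrative, not from the paper), the approximation can be sketched as:

```python
import math

def ser_mqam_rician(M, gamma_s, K, N=1):
    """Approximate M-QAM symbol error probability over a Rician channel.

    gamma_s: mean symbol SNR (linear); K: Rician factor; N: diversity order.
    One reading of Eq. (14); the exponent N on the bracketed ratio is assumed.
    """
    a = (N + K) * (M - 1)
    return (0.2 * math.log2(M)
            * (a / (a + 1.5 * gamma_s)) ** N
            * math.exp(-K * 1.5 * gamma_s / (a + 1.5 * gamma_s)))
```

As expected, the error probability decreases monotonically as the mean symbol SNR grows.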

3 Results and Discussion

In this section, the simulation results of the bit error rate (BER) performance of
different types of digital modulation schemes will be presented and discussed.
MATLAB 2021 is used for simulating and generating results.

3.1 Results

AWGN Channel
Figure 1 shows the bit error rate curve for the BPSK, 8-PSK, 16-PSK, 4-QAM, 8-
QAM, 16-QAM, and 64-QAM modulation schemes using MATLAB software under
the AWGN channel. Comparing the curves, it can be seen that for M-ary modulation,
the BER performance of the system degrades as the value of M increases.
The signal-to-noise ratio for different modulation schemes is also shown in Table 1.
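The AWGN curves in Fig. 1 come from MATLAB, but the underlying experiment is easy to reproduce; a Monte-Carlo sketch for the BPSK baseline (the bit count, seed, and function name are our choices, not from the paper) is:

```python
import math
import random

def simulate_bpsk_awgn(ebno_db, n_bits=200_000, seed=1):
    """Monte-Carlo estimate of the BPSK bit error rate over an AWGN channel."""
    rng = random.Random(seed)
    ebno = 10 ** (ebno_db / 10.0)            # Eb/No as a linear ratio
    sigma = math.sqrt(1.0 / (2.0 * ebno))    # noise std for unit-energy symbols
    errors = 0
    for _ in range(n_bits):
        bit = rng.randrange(2)
        tx = 1.0 if bit else -1.0            # BPSK mapping: 0 -> -1, 1 -> +1
        rx = tx + rng.gauss(0.0, sigma)
        errors += int((rx > 0) != bool(bit))
    return errors / n_bits

# Theory for comparison: BER = Q(sqrt(2 Eb/No)) = 0.5 * erfc(sqrt(Eb/No)).
```

The estimate agrees with the theoretical value to within sampling noise.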
Rayleigh Fading Channel
Bit error rate curves for BPSK, 4-QAM, 8-QAM, 16-QAM, and 64-QAM modulation
using MATLAB software under the Rayleigh fading channel are displayed in Fig. 2. It
illustrates that, for M-QAM, the bit error rate increases as the value of M increases,
degrading the system's performance; performance declines in the same order as M
grows. To reduce the BER, the signal-to-noise ratio must be raised for all M-QAM
and BPSK modulations. The signal-to-noise
ratio for different modulation schemes for Rayleigh fading channel is also shown in
Table 2.
Rician Fading Channel
The M-PSK modulation scheme’s BER performance curve is shown in Fig. 3 in
relation to the SNR value in a Rician fading environment, considering additive white
Gaussian noise (AWGN). The graph shows that when the signal-to-noise ratio (SNR)

Fig. 1 BER versus Eb/N0 for AWGN

Table 1 Signal-to-noise ratio for different modulation schemes under AWGN

Modulation scheme   BER      Eb/No (dB)
8-PSK               10^-5    13
8-PSK               10^-8    15.5
16-PSK              10^-5    17.5
16-PSK              10^-8    20
4-QAM               10^-5    9.5
4-QAM               10^-8    12
8-QAM               10^-5    11.2
8-QAM               10^-8    13.8
16-QAM              10^-5    13.5
16-QAM              10^-8    16
64-QAM              10^-5    17.8
64-QAM              10^-8    20.2

rises, the BER value falls and tends to zero. The signal-to-noise ratio and BER for
Rician fading and AWGN channels are also shown in Table 3.
Nakagami-M Fading Channel
In the context of a Nakagami fading environment, Fig. 4 displays the BER perfor-
mance curve of the M-PSK modulation scheme in relation to the SNR value for both
the Rayleigh fading channel and additive white Gaussian noise (AWGN). The graph
shows that the value of BER drops and tends to zero as the SNR value rises. The

Fig. 2 BER versus Eb/N0 for Rayleigh fading channel

Table 2 Signal-to-noise ratio for different modulation schemes for Rayleigh fading channel

Modulation scheme   Eb/No (dB)   BER
BPSK                24           10^-3
BPSK                50           10^-5.6
4-QAM               24           10^-3
4-QAM               50           10^-5.6
8-QAM               24           10^-2.8
8-QAM               50           10^-5.5
16-QAM              24           10^-2.6
16-QAM              50           10^-5.4
64-QAM              24           10^-2.2
64-QAM              50           10^-5

Fig. 3 BER versus Eb/N0 for Rician fading channel

Table 3 Signal-to-noise ratio and BER for Rician fading channel

Channel   Eb/No (dB)   BER
AWGN      5.2          10^-2
AWGN      9.5          10^-5
Rician    10           10^-2
Rician    30           10^-5

Fig. 4 BER versus Eb/N0 for Nakagami-m fading channel

Table 4 Signal-to-noise ratio and BER for different fading channels

Channel    Eb/No (dB)   BER
AWGN       5.2          10^-2
AWGN       9.8          10^-5
Rayleigh   7            10^-2
Rayleigh   22           10^-5
Nakagami   15           10^-2

signal-to-noise ratio and BER for the AWGN, Rayleigh, and Nakagami fading channels
are also shown in Table 4.

3.2 Discussion

BPSK and 4-QAM have roughly the same BER for the AWGN channel. It is
observed that for M-PSK and M-QAM, the bit error rate increases as the value of M
increases, degrading the system's performance. We can conclude that 8-QAM
is superior to higher-order QAMs and that 8-PSK is superior to 16-PSK. In
addition, M-QAM outperforms M-PSK of the same order. Based on a comparison

with Fig. 1, it can be concluded that for M-ary modulation, the system’s performance
will deteriorate as M rises. To reduce BER, raise the signal-to-noise ratio for all M-
QAM and M-PSK modulations. Figure 1 shows that the BER for BPSK and 4-QAM
is around the same. Therefore, 8-PSK, 16-PSK, 8-QAM, 16-QAM, and 64-QAM
perform worse than 4-QAM and BPSK modulations. Approximately the same BER
is found for BPSK and 4-QAM in the Rayleigh fading channel. It can be observed
that for M-QAM, the bit error rate increases as the value of M increases, degrading
the system's performance. We can conclude that 8-QAM outperforms higher-
order QAMs. In addition, the system’s performance will decline in the same order as
the value of M grows. To reduce BER, raise the signal-to-noise ratio for all M-QAM
and BPSK modulations. The BER of BPSK and 4-QAM is about equal. As a result,
the performance of 4-QAM and BPSK modulation is superior to that of 8-QAM,
16-QAM, and 64-QAM. Comparing the BER under the Rician fading channel to that
under the AWGN channel, the AWGN channel reaches a given BER at a lower SNR;
that is, the Rician fading channel causes greater communication impairment than
the AWGN channel under the M-PSK modulation scheme. In the case of
the Nakagami fading channel, the BER value tends to zero as the SNR value rises,
i.e., a higher signal-to-noise ratio results in better performance. According to the
graph, compared with the Rayleigh and Nakagami fading channels, the AWGN
channel reaches a given BER at the lowest SNR. Under the M-PSK modulation
scheme, the Nakagami fading channel has a lower BER than the Rayleigh fading
channel at the same SNR.

4 Conclusion

This study examines bit error rate (BER) performance for AWGN, Rayleigh, Rician,
and Nakagami fading channels for M-PSK and M-QAM signals. In terms of bit
error rate, it is evident that the best modulation techniques are 4-QAM and BPSK.
According to simulation data, the bit error rate for M-PSK and M-QAM modulations
increases as the number of bits per symbol grows, i.e., as M increases from 2 to 64.
Comparing additive white Gaussian noise to the Rayleigh fading channel, the former
performs reasonably well. Additionally, for the same value of signal-to-noise ratio,
the bit error rate for digital modulations (M-PSK, M-QAM) under an AWGN channel
is lower than that achieved over a Rayleigh fading channel. Based on the simulation
findings, it is observed that under the M-PSK modulation scheme, the Rician fading
channel has more communication impairment than AWGN. With the same SNR value
in the graph under the M-PSK modulation scheme, the Nakagami fading channel
performs better than the Rayleigh fading channel. The future work of this research
will extend the work to include other digital modulation schemes, fading channels,
such as Weibull, and different comparison characteristics, such as power efficiency
and bandwidth efficiency.

References

1. Dai JY, Tang W, Yang LX, Li X, Chen MZ, Ke JC, Cui TJ (2019) Realization of multi-
modulation schemes for wireless communication by time-domain digital coding metasurface.
IEEE Trans Ant Propagat 68(3):1618–1627
2. Xiong F (2006) Digital modulation techniques, 2nd edn. Artech House, London
3. Abose TA, Olwal TO, Hassen MR, Bekele ES (2022) Performance analysis and comparisons
of hybrid precoding scheme for multi-user MMWAVE massive MIMO system. In: 2022 3rd
international conference for emerging technology (INCET). IEEE, pp 1–6
4. Rajesh V, Rajak AA (2020) Channel estimation for image restoration using OFDM with various
digital modulation schemes. In: Journal of physics: conference series (vol 1706, No 1, pp
012076), IOP Publishing
5. Kumar S, Anjaria K, Sadhwani D (2021) Performance analysis of efficient digital modulation
schemes over various fading channels. AEU-Int J Electron Commun 141:153963
6. Sharma N, Jain D, Bhatt K, Themalil MT (2021) Performance comparison of various digital
modulation schemes based on bit error rate under AWGN channel. In: 2021 5th international
conference on computing methodologies and communication (ICCMC). IEEE, pp 619–623
7. Farzamnia A, Hlaing NW, Mariappan M, Haldar MK (2018) BER comparison of OFDM with
M-QAM modulation scheme of AWGN and Rayleigh fading channels. In: 2018 9th IEEE
control and system graduate research colloquium (ICSGRC). IEEE, pp 54–58
8. Bahuguna AS, Kumar K, Pundir YP, Alaknanda V, Bijalwan V (2021) A review of various
digital modulation schemes used in wireless communications. Proceed Int Intell Enable Netw
Comput IIENC 2020:561–570
9. Karim HK, Shenger AE, Zerek AR (2019) BER performance evaluation of different phase shift
keying modulation schemes. In: 2019 19th international conference on sciences and techniques
of automatic control and computer engineering (STA), IEEE, pp 632–636
10. Patra T, Sil S (2017) Bit error rate performance evaluation of different digital modulation
and coding techniques with varying channels. In: 2017 8th annual industrial automation and
electromechanical engineering conference (IEMECON). IEEE, pp 4–10
Conception of Indian Monsoon
Prediction Methods

Namita Goyal, Aparna N. Mahajan, and K. C. Tripathi

Abstract India is the largest economy in South Asia and a rising economy in the
world. The agriculture sector, which forms more than 20% of its economic bedrock,
is greatly impacted by the Indian monsoon. The monsoon plays a crucial role in the growth of
several crops and water resources and decides many natural calamities like floods
and droughts that can affect human beings severely. Therefore, the monsoon has
been a subject of research for centuries. With gradually rising temperatures and
changing monsoon patterns, India is experiencing anomalies in precipitation occur-
rences. Intermittent intense rain can cause engulfing floods, landslides, loss of
farmers' harvests, damaged roads, and commuting problems that affect the common
man every day. Similarly, many states have faced water shortages leading to crop
failure, starvation, and the spread of disease. Both excess and scarcity of rainfall can
bring famine and harm the economy. All these consequences can be prevented if the
onset of the monsoon can be estimated before its arrival. Therefore, this study is
aimed at understanding the nature of the monsoon and classifying the methodologies
implemented so far for its early prediction. An analysis of the models currently in
use shows that deep learning models outperform the others. However, as the monsoon is a complicated system of atmospheric and
oceanic connection, it remains a matter of research to identify the points at which
predictability weakens.

Keywords Indian summer monsoon (ISM) · Support vector machine (SVM) ·


Artificial neural network (ANN) · Sea surface temperature (SST) · Root mean
square error (RMSE) · Convolution neural network (CNN) · Recurrent neural
network (RNN)

N. Goyal (B) · A. N. Mahajan


Maharaja Agrasen University, Baddi, Himachal Pradesh, India
e-mail: er.namitagoel@gmail.com; mau21dcs003@mau.edu.in
A. N. Mahajan
e-mail: directormait2014@gmail.com
K. C. Tripathi
Maharaja Agrasen Institute of Technology, GGSIPU, New Delhi, India
e-mail: kctripathi@mait.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 247
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_20
248 N. Goyal et al.

1 Introduction

The word "monsoon" literally means a seasonal reversal of winds. The name
was given by the Arab sailors who benefited from the reversal of wind systems
when crossing the sea while exploring sea routes to India. It was noted that in
winter the wind blew from the north east, while in summer it blew from the south
west. This mechanism brings heavy rainfall to the affected area in the months of June, July,
August, and September. The period from June to September has since been termed
the monsoon, or southwest monsoon, and is often called the rainy season. Likewise,
the winter monsoon, which occurs from October to March, is known as the northeast
monsoon and is comparatively less well-known than the summer monsoon.
monsoon. Rainfall refills aquifers, rivers, ponds, reservoirs and other water bodies
that are necessary for all living forms as well as for several industries. This year [1],
most regions of India experienced their heaviest rainfall in decades because of a
monsoon surge and western disturbances. The downpour caused neighboring
rivers to overflow, washed away buildings in floods and landslides, destroyed bridges
and roads, and interfered with power supply. Therefore, it is vital to constantly be
aware of the rainfall pattern to make important decisions, such as those concerning
food production and water resource management, that affect both socioeconomic
and scientific issues.
A multitude of unpredictable elements influence the Indian monsoon, including
Himalayas, ENSO Cycle, SST anomalies, Indian Ocean dipole, ocean currents, dry
spells and western disturbances.
The climate of India and the onset of the monsoon are significantly influenced
by the Indian Ocean and by the Himalayan mountain range, which blocks the
very cold winds coming from the north. Sir Henry Blanford [2] was the first person
to produce seasonal monsoon forecasts, in 1886, and explained in 1876 how
Himalayan snow can affect the climatic conditions of the plains of north western
India, causing heavy rain and snow on the Indian side but drought in Tibet. The
ENSO cycle [3], which is brought on by the interplay between air pressure and
sea surface temperature, weakens India's monsoon. Even rising CO2 [4], aerosols
and dust particles [5, 6] are other considerable factors which have been proven to
affect monsoon rainfall. The monsoon is significantly influenced by each of these
regulating factors.

1.1 Present Monsoon Scenario

Currently, numerous meteorological organizations worldwide are constantly moni-
toring and updating their monsoon predictions based on the most recent data
and the factors that determine the monsoon. Even scientists who are not meteorologists are
working persistently on the monsoon using statistical models. After various
machine learning models, ensemble techniques, and probability distributions,
nature-inspired neural network models nowadays seem the first choice for researchers, as
they are capable of self-learning and thus make decisions with better accuracy.
Many existing climate models perform only moderately well, since there are several
uncertainties in the climate.
This study aims to comprehend the monsoon and the forecasting techniques used
since ancient times. The remaining sections of the paper are structured as follows.
The relevant previous work is described in Sect. 2. Section 3 elucidates the
categorization of weather prediction approaches based on the methodology employed
and the forecast length. Section 4 presents a tabular description of the comparative study
of the approaches mentioned in Sect. 3. Section 5 concludes
and explores the future scope.

2 Related Work

In 1854, Admiral Robert FitzRoy, ex-governor of New Zealand and a progressive
meteorologist, coined the term "forecast" and established the Met Office to
give weather-related information to sailors. The India Meteorological
Department was founded in 1875, but weather forecasting existed before this
period in the form of various traditional methods [7, 8] that depended
on analysis of the environment's behavior. Even the study of many biological organ-
isms [9] such as animals, insects, birds, trees, and wildlife guided Afar pastoralists
in foreseeing variations in the climate. These methods were primarily dependent on the
practitioner's commitment, experience, skill and training.
Since the monsoon has always been a foundation of the livelihood of more than half
of India's population, rigorous research on monsoon prediction gave way to
scientific approaches such as numerical methods. These methods [10–13] were able to foresee
the arrival of rainfall by converting the actual interactions between different entities of nature
into mathematical equations.
Further, with the evolution of machine learning models, another door opened:
predicting weather by analyzing the huge datasets accumulated over past years. To find
the best linear prediction model for the Indian monsoon, [14] proposed a selection
procedure: with P candidate predictors there are 2^P possible regression models,
and a model with a larger predictor set is selected over one with fewer parameters
only if the fraction of explained variance increases in moving from P to P + N parameters.
[15] investigated whether the Indian Ocean Dipole was a natural factor governing Indian Ocean
SST using linear multiple regression algorithms. The study in [16] aimed at collecting
rainfall-affecting parameters and predicting rainfall intensity using random forest
algorithms. Later, as research progressed, different linear regression algorithms
were combined into hybrid systems that gave better results than single
algorithms. But for an ecosystem like the monsoon, which depends
on a large number of continuously changing factors such as wind pressure,
ocean currents, and atmospheric conditions, simple linear models fail to analyze
and depict the forecast accurately.

In [17], prediction of the Indian summer monsoon (ISM) was defined as a function
of sea surface temperature anomalies using ANN techniques. The root mean square
(RMS) error was relatively lower with the ANN technique than with the
regression technique. Likewise, [18] worked to improve forecasting skill
using correlation analysis of SST indices from different Nino regions against the Indian
summer monsoon rainfall index (ISMRI) with lag periods of one to eight seasons.
A comparison of the findings showed that the ANN model has superior prediction
ability to all the linear regression models examined, implying that the link between
the Nino indices and the ISMRI is essentially nonlinear in character.
In [19], artificial feed-forward neural networks with backpropagation were
utilized to analyze meteorological data. Another article [20]
proposed a prediction model based on joint clustering of monsoon years and prediction
using the random forest regression algorithm. In [21], an activation function named Tanh
axon was used with ANNs to estimate monsoon rainfall. Several nonpara-
metric tests were run in [22] to analyze the rainfall trend by recognizing the abrupt change
point in time using an ANN multilayer perceptron model. Even the Internet of Things
was used in [23] to create a local weather predicting system with an ANN. Using deep
learning, [24] proposed a model which produced results four times higher in reso-
lution than linear interpolation. The authors of [25] proposed a model based on long
short-term memory to forecast rainfall in Jimma, Ethiopia, using six parameters. The
study in [26] aimed at creating a drought vulnerability map (DVM) for West Bengal using deep
learning.
A brief report on changes in monsoon variability is presented in [27] with
respect to observed characteristics of the rainfall pattern, the role of anthropogenic
forcing, and extreme events. To predict north east monsoon rainfall, several techniques such as
linear regression, ANN, and extreme learning machine were used in [28],
and it was found that the extreme learning machine estimated the monsoon
with minimal error. In [29], it was shown using the Lorenz-63 model that the error
prediction capability of statistical methods is independent of the dynamical situation.
Empirical orthogonal functions were used in [30] to analyze rainfall across
India on a regional basis; it was found that 5D data may be reduced to 1D with 80%
accuracy and to 2D with 90% accuracy. [31] explored whether it would be possible
in the near future for deep learning models to completely replace numerical weather
prediction models. A study presented in [32] examines weather
forecasts on a London dataset by developing a TensorFlow framework with deep
learning algorithms.
After carefully examining the relevant research conducted recently, it is evident
that weather predicting methods have been improving steadily from the past to the
present. But there are many gaps and difficulties in the complicated and constantly
changing subject of weather prediction. Academicians are actively attempting to fill
these gaps, for example:
1. Weather forecasting is still difficult for time spans longer than two weeks.
Subseasonal and seasonal forecasts have an impact on agriculture, water resource
management, and disaster planning, so researchers are attempting to increase
their accuracy.
2. It is still difficult to predict and comprehend how extreme weather occurrences,
such as hurricanes, tornadoes, and torrential rains that cause floods will behave.
Research on the causes of rapid intensification and the precise course of such
events is still needed.
3. The quality and quantity of observational data affect the accuracy of weather fore-
casts. Critical areas of research include creating better data assimilation methods
and filling observational gaps in remote or less observed places.
4. Researchers are investigating how long-term weather predictions are impacted
by changes in weather patterns that are brought on by climate change. Therefore,
it is essential to comprehend how weather and climate change interact with each
other.
Weather forecasting is the subject of continuous and collaborative research. In
order to deliver more precise and timely weather forecasts, these gaps must be filled
as science and technology continue to improve.

3 Categorization of Monsoon Forecasting Techniques

3.1 Methodology-Based Monsoon Forecasting

3.1.1 Traditional Methods

In ancient times there was no scientific method to predict weather. It was strongly
dependent on visual observations in which the prediction of monsoon was made based
on the appearance of various environmental phenomena such as clouds, moon shape,
color, humidity, direction of winds and rainbow as shown in Fig. 1. All the climatic
variations related to air, ocean, and atmospheric pressure forming an environment
were collected and drawn on a paper known as weather maps or synoptic charts.
These maps were one of the oldest tools to forecast weather and were very helpful
to sailors at that time. However, this visualization-based approach to predicting the
weather was a speculative and slow process.

3.1.2 Modern Methods

Modern methods envisage weather on scientific grounds. Weather is a dynamic entity;


a small climatic change can result in large variability and thus produce a big dataset
to observe. These methods have explanations for the climate’s unexpected behavior
and, unlike traditional approaches, work efficiently with big data. All these models have
played a critical role in the early prediction of Indian summer monsoon rainfall.

Fig. 1 Traditional forecasting methodology: indicators such as humidity, wind direction, clouds, synoptic charts, lunar patterns, ocean waves, and temperature

Fig. 2 Modern forecasting methodologies: numerical weather prediction (DEMETER, MITgcm, POM, MOM) and statistical weather prediction (machine learning and deep learning models)

Broadly there are two ways to categorize the modern weather prediction
techniques as shown in Fig. 2.
Numerical Weather Prediction:
Considering the atmosphere as a collection of various gases, numerical prediction
techniques are based on principles of fluid mechanics and science of physics. All the
data related to the atmosphere and ocean is collected using radars and satellites. Then
a bunch of mathematical calculations as explained in Eq. (1) depicting atmosphere
and ocean are employed on current data and converted into computer code to be
solved using supercomputers at laboratories of atmospheric science. The workflow
of numerical methods has been shown in Fig. 3.
$$\hat{P}(\mathrm{ISM}) = \Phi\left(\overrightarrow{PDE}\right) \qquad (1)$$

where ISM is Indian summer monsoon,



Post Interpretation
Data Model Model Preprocessing Continuous
Visualization Communication Verification
Collection Initialization Integration and Analysis Improvement

Fig. 3 Numerical weather prediction workflow

$\hat{P}(\mathrm{ISM})$ depicts the empirical prediction of the Indian summer monsoon, $\Phi$ maps the
correlation between ISM and the PDEs, and $\overrightarrow{PDE}$ represents the partial differential
equations governing factors such as wind, ocean currents, topography, thermodynamics, SST, etc.
Meteorologists and scientists follow a set of procedures to produce weather fore-
casts using numerical models. First, data is collected through various sources like
satellites and radar which contain information of multiple weather parameters like
humidity, air pressure and wind direction. Then this data is assimilated and used
to initialize the numerical model by meteorologists. This model mimics how the
atmosphere will behave in the future. To improve its usage and comprehension, the
model’s output is postprocessed and analyzed to evaluate how weather patterns have
changed over time. These weather patterns are then created using visualization tools
on weather maps or synoptic charts which helps in understanding weather forecasts
and communicated to concerned authorities like government bodies. To evaluate
the forecast’s accuracy, actual observed conditions are compared with the model’s
predictions. There are different numerical models which work on the complex coupled
interaction between the ocean and the atmosphere, such as MITgcm, DEMETER, POM,
and MOM.
DEMETER: It is a collaborative European effort which has produced a multi-
model ensemble-based system for seasonal-to-interannual prediction. It consists of
seven models which show coupling between the ocean and the atmosphere. It is better at
predicting spatial patterns but less adept at category forecasting.
MITgcm: The MIT general circulation model (MITgcm) is a numerical computer
program written in Fortran language to solve the equations of motion regulating the
ocean or Earth’s atmosphere using the finite volume method. It is one of the earliest
non-hydrostatic models of the ocean.
POM: The Princeton Ocean Model (POM) is a widely used general numerical model
for simulating and forecasting oceanic behavior such as ocean currents, salinity, and
temperature. Earlier it was known as the Blumberg-Mellor model.
MOM: Modular Ocean Model is a three-dimensional ocean circulation numerical
model. It was created particularly for research on the oceans and their effect on
climate systems all over the world.
All these models imitate the real process by using weather factors as inputs. After
a brief interval, the result is utilized for another cycle and repeated a certain number of
times to get the prediction for, say, the next 24 or 48 h. While solving these equations,
several assumptions are made to satisfy climatic constraints, and even a
small error will multiply over many iterations and affect the final forecast.
Therefore, the forecast remains accurate only for a few days.
Numerical methods require deep knowledge of meteorology and atmospheric
science and are mainly implemented by meteorologists. Data scientists can help
meteorologists by analyzing large datasets to increase forecast accuracy using a
data-driven approach, also known as statistical methods.
Statistical Weather Prediction
Statistical methods forecast the climate on the basis of past datasets using machine
learning and neural networks, especially deep learning techniques, which provide
efficient, user-friendly libraries and massive computational power to better under-
stand the complexities of the issue. A huge amount of rainfall data is collected every
year, but analyzing this bulk data using traditional methods was a complex process.
The emergence of machine learning techniques accelerated this task by automating
pattern detection and learning from data.

Such a method makes a future prediction $\hat{F}$ about the likely outcome for a system $S$
based on data $D$ extracted from the past time series of $S$, under the influence of
various predictors $P$ that can affect the forecast, as described in Eq. (2):

$$\hat{F}(S) \approx \phi_P\!\left(D(S)\big|_{t=m}^{t=n}\right), \quad n > m. \qquad (2)$$

The value of $D(S)$ depends on the empirical values of the weather-affecting predictors $P$,
such as the ENSO cycle, SST indices, etc. The general working of a statistical algorithm
is shown in Fig. 4.
The accuracy of the system $S$ is calculated by finding the root mean square error
between the actual observed value $V$ and the predicted value $\hat{V}$ at each $i$th
observation, as given in Eq. (3):

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N}\left(V(i)-\hat{V}(i)\right)^{2}}{N}} \qquad (3)$$
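Equation (3) translates directly into code; a minimal sketch (the function name is illustrative):

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted series, Eq. (3)."""
    n = len(observed)
    return math.sqrt(sum((v - vp) ** 2 for v, vp in zip(observed, predicted)) / n)

# Example: observations [3.0, 5.0] vs. predictions [2.0, 7.0]
# gives sqrt((1 + 4) / 2) ≈ 1.581.
```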

Fig. 4 Statistical weather prediction workflow: problem statement → data collection → data preprocessing → model selection → model training → model evaluation → model validation → interpretation and analysis → model optimization



The generic workflow of statistical models starts with a problem statement, for
example, long-range monsoon forecasting. To initialize the model, a big dataset
of annual and sub-divisional rainfall from past years is required. Such datasets are available
from various sources like IMD, Kaggle, and data.gov. The data must then
be loaded and preprocessed to handle missing values. Next, a model appropriate
to the nature of the problem is selected and trained using the past data
as input. The trained model is then evaluated using different measures like the confusion
matrix, accuracy, root mean square error, precision, sensitivity, specificity, and F1
score. Evaluation is followed by model validation, which checks model performance
on an unseen dataset. The results of the trained model are then analyzed to learn about
the stated problem.
Machine Learning Algorithms: Machine learning is an area of artificial intelli-
gence (AI) that concentrates on algorithm development that allows computers to learn
and make predictions or judgments without human intervention. Machine learning
utilizes a diverse set of algorithms as given below:
1. Linear Regression: In this model, the output variable (y) is a function of only
one independent variable (x). For monsoon prediction, future values of the time
series are calculated as a linear function of previous values. It is preferred
for a continuous spectrum of output. A basic linear regression equation with one
independent variable can be written as shown in Eq. (4):

y = mx + b (4)

where y is the dependent variable, x is the independent variable, m is the slope
of the line, and b is the y-intercept.
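The slope m and intercept b of Eq. (4) are usually estimated from past data by ordinary least squares; a dependency-free sketch (the function name is ours, not from the paper):

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of m and b in y = m*x + b (Eq. 4)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x
```

For example, `fit_line([0, 1, 2], [1, 3, 5])` recovers m = 2 and b = 1.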
2. Multivariate Regression: This model is used when the output variable depends
on more than one independent variable, for example, the weather, which is
chaotic in nature and is a function of many atmospheric entities like wind
direction (x1), humidity (x2), air pressure (x3), etc. Mathematically it can be
described by Eq. (5) as follows:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn (5)

where y is the dependent variable, x1, x2, ..., xn are the independent variables,
b0 is the y-intercept, and b1, b2, ..., bn are the coefficients corresponding to each
independent variable.
Linear models cannot capture nonlinear characteristics present in the input–output
mapping. Therefore, for weather prediction, multivariate regression may give better
results.
3. Random Forest: It works on the principle of making an individual decision tree
for each subset of the dataset and thus solving the complex problem with more
256 N. Goyal et al.

efficiency and accuracy. It is a supervised machine learning algorithm which can
be used for both classification and regression problems, as explained by the
mathematical Eqs. (6) and (7) given below.
For classification:

RFc = mode(y1, y2, ..., yi) (6)

For regression:

RFr = (1/m) ∑(i=1 to m) yi (7)

where
RFc is the random forest’s anticipated result for classification problems,
RFr is the random forest’s anticipated result for regression problems and
yi represents the expected result of the ith decision tree.
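The aggregation in Eqs. (6) and (7) can be written directly: a majority vote for classification and a mean for regression. In this sketch the per-tree outputs yi are simply given as lists rather than produced by trained trees:

```python
from statistics import mean, mode

# Random forest aggregation (Eqs. 6 and 7).
# tree_votes / tree_values stand in for the individual trees' predictions y_i.

def rf_classify(tree_votes):
    """RFc: the class predicted by the majority of trees (Eq. 6)."""
    return mode(tree_votes)

def rf_regress(tree_values):
    """RFr: the average of the trees' numeric predictions (Eq. 7)."""
    return mean(tree_values)

votes = ["rain", "no rain", "rain", "rain", "no rain"]   # toy tree outputs
values = [4.2, 3.8, 4.0, 4.4, 3.6]                       # toy tree outputs (mm)
label = rf_classify(votes)
amount = rf_regress(values)
```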
4. Support Vector Machine (SVM): It is a supervised algorithm. It can clearly
distinguish between two classes of data by finding a hyperplane that best divides
the data into distinct classes and is effective in high-dimensional domains. It is
also suitable for outlier detection. Bioinformatics, image classification, text
categorization and other fields all make extensive use of SVM. In the case of
monsoon prediction, it is good at classifying whether rainfall will occur or not.
Mathematically, the SVM method can be described as given in Eq. (8).

y(x) = w · x + b (8)

If y(x) is greater than 0 then rain will occur; if y(x) is less than 0 then rain will
not occur,
where y(x) is the output function which can be positive or negative, w is the
weight vector, x is the input vector and b is the bias term.
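The decision rule of Eq. (8) reduces to checking the sign of w · x + b. A sketch with assumed, untrained weights:

```python
# Linear SVM decision rule for rain / no-rain (Eq. 8).
# The weight vector w and bias b are assumed values, not a trained model.

def svm_decision(w, x, b):
    """Return y(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_rain(w, x, b):
    return "rain" if svm_decision(w, x, b) > 0 else "no rain"

w = [0.8, -0.3]                           # assumed weights for two input features
b = -0.1                                  # assumed bias
wet_day = predict_rain(w, [0.9, 0.5], b)  # y(x) = 0.47 > 0
dry_day = predict_rain(w, [0.1, 0.9], b)  # y(x) = -0.29 < 0
```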
5. Naïve Bayes: It is also a supervised algorithm, based on Bayes’ theorem. Given
a set of class labels ci and a set of features F, the Naive Bayes classifier assumes
that the features are conditionally independent. Under this assumption, Naive
Bayes calculates the likelihood of each class given the feature vector and assigns
the class with the greatest probability to the input, as written in Eq. (9).

P(ci /F) = (P(F/ci ) · P(ci ))/P(F) (9)

where
P(ci /F) is the revised (posterior) probability of class ci given the set of features F,
P(F/ci ) is the probability of the feature set F for a certain class ci ,
P(ci ) is the classical (prior) probability of class ci and
P(F) is the observation probability of the feature set F.
Typically, it is employed for categorization tasks and may not be the first choice
for time-series prediction, as in the case of monsoon forecasting. Problems
Conception of Indian Monsoon Prediction Methods 257

like sentiment analysis and song recommendation are better suited to the Naïve Bayes
algorithm.
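Equation (9) can be evaluated directly once priors and likelihoods are known. In this sketch every probability is an invented illustrative value, and P(F) is obtained by the law of total probability over the two classes:

```python
# Posterior probability of Eq. (9): P(c|F) = P(F|c) * P(c) / P(F).
# Every probability below is an invented illustrative value.

def posterior(prior, likelihood, evidence):
    return likelihood * prior / evidence

p_rain, p_no_rain = 0.3, 0.7                   # priors P(c1), P(c2)
p_f_given_rain, p_f_given_no_rain = 0.6, 0.2   # likelihoods P(F|c1), P(F|c2)

# P(F) via total probability over both classes
p_f = p_f_given_rain * p_rain + p_f_given_no_rain * p_no_rain

p_rain_given_f = posterior(p_rain, p_f_given_rain, p_f)
predicted = "rain" if p_rain_given_f > 0.5 else "no rain"
```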
6. Time Series Analysis: It refers to a set of statistical techniques for analyzing and
forecasting data that is dependent over time. It finds its utility in areas like stock
market prediction, weather forecast, census analysis, etc. Commonly used models
that perform time series analysis are autoregressive integrated moving average
(ARIMA) and seasonal autoregressive integrated moving-average (SARIMA).
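The autoregressive idea behind ARIMA-type models can be sketched with a first-order model, where the next value is a scaled copy of the previous one and the coefficient phi is fitted by least squares; the rainfall series here is invented:

```python
# Minimal AR(1) sketch of the autoregressive component used by ARIMA/SARIMA:
# x[t] ≈ phi * x[t-1], with phi estimated by least squares on a toy series.

def fit_ar1(series):
    num = sum(series[t - 1] * series[t] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den

rainfall = [2.0, 2.2, 2.4, 2.6, 2.9]   # invented rainfall series
phi = fit_ar1(rainfall)
forecast = phi * rainfall[-1]          # one-step-ahead prediction
```

A full ARIMA model adds differencing and moving-average terms on top of this autoregressive core.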
Deep Learning Algorithms: Deep learning is a subset of machine learning.
Inspired by the working of brain neurons, data is processed through a network of
interconnected artificial neurons that receive and process signals. Such networks
are known as artificial neural networks (ANNs). The edges between nodes carry
weights. The output of one neuron goes as input to the neurons of the next layer,
based on some learning function and weight adjustment. Deep learning models can
address complicated problems such as weather forecasting and can enhance
performance as data availability increases.
There are several neural network topologies, each intended for a particular set of
applications and data kinds as given below:
1. Feed-forward Neural Networks (FNNs): These are artificial neural networks in
which information flow is unidirectional from the input to the hidden layer and
then to the output layer. They are utilized for a variety of applications, including
picture and text categorization.
2. Convolution Neural Networks (CNNs) are a type of neural networks that are
designed for processing data having grid-like structure, especially image and
video. It has several layers, including convolution layers, pooling layers and
fully linked layers.
3. Recurrent Neural Network (RNNs): It operates by sequentially processing data
in steps. Hidden states of RNNs serve as memory. At each time step, memory
is updated with the input data and the prior concealed state. They are suitable
for applications such as time series prediction, speech recognition and natural
language processing.
4. Long Short-Term Memory (LSTM): It is based on the RNN architecture. As
LSTMs can learn long-term connections between time steps of data, they may be
used to learn, analyze and categorize sequential data. The purpose of LSTMs is
to extract both long-term and short-term dependencies from the data.
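To make the flow input → hidden → output concrete, here is a tiny feed-forward pass with one hidden layer; the weights are arbitrary illustrative numbers, not trained values:

```python
import math

# One forward pass of a minimal feed-forward network:
# a weighted sum followed by a sigmoid activation at each layer.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

x = [0.5, 0.2]                        # two input features
w_hidden = [[0.4, -0.6], [0.3, 0.8]]  # weights for two hidden neurons
w_out = [0.7, -0.5]                   # weights for the output neuron
y = forward(x, w_hidden, w_out)       # a value in (0, 1)
```

Training would adjust the weights to reduce prediction error; CNNs, RNNs and LSTMs elaborate on this same weighted-sum-plus-activation building block.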

3.2 Duration-Based Monsoon Forecasting

Based on the length of the forecast period, the abovementioned methods for predicting
the monsoon can be divided into three categories: short-range, medium-range and
long-range forecasting. These precipitation trends can be used to predict the monsoon
drift.

3.2.1 Short-Range Forecast

The forecast has a shorter time horizon of less than 72 h. Compared with medium-
range and long-range forecasts, short-range forecasts are more accurate. Short-term
forecasts are extremely valuable to the needs of the public, pilots, farmers and
navigators.

3.2.2 Medium-Range Forecast

This forecast is valid for a maximum of 10 days. It can assist farmers in scheduling
their agricultural operations and people in planning their travel. It is also known as
subseasonal forecasting. Ensemble forecasting methods are widely used in medium-range
forecasting.

3.2.3 Long-Range Forecast

The time window of a long-range forecast varies from a month to a season or a year.
Its accuracy is much lower when compared with short- and medium-range forecasts,
but it is very helpful to economists and scientists. These forecasts can further be
used to estimate monsoon drifts on the same scale.
For long-range forecasting of the southwest monsoon, [33] explains a novel statistical
ensemble forecasting method. [34–36] examined long-term patterns in annual and
seasonal precipitation at 16 stations in the upper Nile River Basin, and both long-term
and short-term trends in several districts of Odisha and Malaysia, respectively. It has
been described in [37] that even human activities and land usage patterns can be used
to predict long-term precipitation trends.

4 Comparative Analysis

4.1 Comparative Analysis of Methodology-Based Forecasts

Following a thorough review of the literature, monsoon prediction methods may be
categorized according to the approach employed, as indicated in Table 1. For each
prediction approach, the type of input data, the models, the qualitative and quantitative
aspects of the methods and their respective accuracy have been compared.

Table 1 Comparison of methodology-based forecasting techniques

| Prediction technique | Algorithms | Data input | Models/tools | Nature of technique | Accuracy |
| Traditional | No scientific algorithm | Visual observations | Synoptic maps | Qualitative | Limited |
| Numerical | Numerical approximation equations | Current data | DEMETER, MITgcm, POM, MOM | Quantitative and qualitative | Long range: poor, medium range: moderate, short range: best |
| Statistical | Machine learning/deep learning algorithms | Historical data | Linear regression, random forest, Naive Bayes, SVM, ANN, CNN and RNN | Quantitative and qualitative | Average |

Table 2 Comparison of duration-based prediction techniques

| Prediction technique | Duration | Data models | Application area | Accuracy |
| Long range | Beyond 10 days | Numerical, statistical | Resource management, risk evaluation | Low |
| Medium range | 3–10 days | Numerical, statistical | Tourism, agriculture, disaster management | Moderate |
| Short range | 0–72 h | Numerical, statistical | Daily weather forecast, aviation, disaster management | High |

4.2 Comparative Analysis of Duration-Based Forecast

Table 2 lists the categorization of all weather prediction techniques based on the
length of time for which they forecast. Each approach has a distinct accuracy and
application area based on the duration.

4.3 Comparative Analysis of Machine Learning Algorithms in Statistical Approach

Table 3 lists the prototype, strength, weakness and application areas of various
machine learning algorithms. The usefulness of an algorithm is contingent upon

Table 3 Comparison of machine learning algorithms

| Algorithm | Archetype | Advantages | Disadvantages | Application |
| Linear regression | Supervised | Simple, fast | Poor in handling complex problems | Regression problems |
| Multivariate regression | Supervised | Detects several predictors in complex problems | Poor interpretability | Regression problems |
| Random forest | Supervised | Reduces overfitting, fills missing values automatically, efficient with large datasets | Time-consuming | Feature selection, regression and classification problems |
| Naïve Bayes | Supervised | Simple, fast, scalable | Poor prediction | Recommendation systems, text classification |
| Support vector machine | Supervised | Memory efficient, effective in high-dimensional environments | Not suitable for large datasets | Regression, classification, outlier detection |
| Time series analysis | Supervised | Captures temporal and pattern relationships over time | Sensitive to missing values, which can lead to incorrect outcomes | Weather forecasting, stock prediction |

several elements, including the magnitude of the dataset, the type of meteorolog-
ical information, the availability of computing power and the forecasting objec-
tive, encompassing short-term, long-term, regional and worldwide forecasting. A
combination of these algorithms is used by modern weather forecasting systems to
maximize their benefits and minimize their drawbacks.

4.4 Comparative Analysis of Deep Neural Networks in Statistical Approach

It is important to recognize that numerical weather prediction models and conventional
machine learning techniques play important roles in weather forecasting. But
to handle complex models that require big and varied datasets, processing power
and interpretability, deep learning algorithms are more advantageous. Table 4 lists
the features, benefits, drawbacks and use cases of deep neural networks which
includes feed-forward neural networks, recurrent neural networks, convolution
neural networks and long short-term memory networks.

Table 4 Comparison of deep learning algorithms

| Algorithm | Features | Advantages | Disadvantages | Application |
| Feed-forward neural network | Unidirectional | Suitable for non-sequential data | Restricted to input data of a specific size | Regression, classification, pattern identification |
| Convolution neural network | Feed-forward with convolution layers | Excellent for image analysis, good at capturing spatial hierarchies | Computationally complex, inappropriate for use with time series | Image and video processing |
| Recurrent neural network | Bidirectional | Suitable for handling sequential data | Prone to vanishing gradient problems; limited scalability due to lack of parallelization | Time series analysis, natural language processing |
| Long short-term memory | RNN with memory | Detects long-term dependencies | Computationally complex; careful hyperparameter adjustment may be necessary | Speech recognition, music composition, time series prediction such as weather forecasting |

5 Conclusion

Given its influence on almost half of the world’s population, seasonal prediction
of Indian rainfall is an important topic of research. From ancient methodologies to
current statistical and numerical methods for rainfall prediction, there has been a
great deal of progress in the field of forecasting. While numerical methods simulate
the actual atmosphere, they still give only average results due to the multiplication of
error in each iteration; statistical methods, which learn to predict from data using
machine learning, can supplement the numerical methods. The use of ANN models is
encouraged over regression models in statistical approaches, since multiple study
findings have demonstrated that the association between Indian summer monsoon
rainfall (ISMR) and environmental characteristics may be better stated and connected
by nonlinear techniques.
Improving weather prediction’s precision and accuracy is still the major objec-
tive. Numerous worldwide elements, like the SST, ENSO effect and others have a
substantial impact on it. Therefore, sensitivity analysis of the variables influencing
the monsoon should be carried out for a comprehensive study of monsoon prediction.

References

1. Hindustan Times (2023) Why North India is facing unusually heavy rains, explained
2. Blanford HF (1884) II. On the connexion of the Himalaya snowfall with dry winds and seasons
of drought in India. Proceed Royal Soc London 37(232–234):3–22
3. https://mausamjournal.imd.gov.in/index.php/MAUSAM/article/view/5932
4. Goswami BB, An SI (2023) An assessment of the ENSO-monsoon teleconnection in a warming
climate. NPJ Clim Atmosph Sci 6(1):82
5. Asutosh A, Vinoj V, Wang H, Landu K, Yoon JH (2022) Response of Indian summer monsoon
rainfall to remote carbonaceous aerosols at short time scales: Teleconnections and feedbacks.
Environ Res 214:113898
6. Debnath S, Govardhan G, Saha SK, Hazra A, Pohkrel S, Jena C, Ghude SD (2023) Impact
of dust aerosols on the Indian Summer Monsoon Rainfall on intra-seasonal time-scale. Atm
Environ 305:119802
7. Wiston M, Mphale KM (2018) Weather forecasting: from the early weather wizards to modern-
day weather predictions. J Climatol Weather Forecast 6(2):1–9
8. Risiro J, Mashoko D, Tshuma DT, Rurinda E (2012) Weather forecasting and indigenous
knowledge systems in Chimanimani District of Manicaland, Zimbabwe. J Emerg Trends Educ
Res Policy Stud 3(4):561–566
9. Balehegn M, Balehey S, Fu C, Liang W (2019) Indigenous weather and climate forecasting
knowledge among Afar pastoralists of north eastern Ethiopia: role in adaptation to weather and
climate variability. Pastoralism 9(1):1–14
10. Palmer TN, Alessandri A, Andersen U, Cantelaube P, Davey M, Delécluse P, Thomson MC
(2004) Development of a European multimodel ensemble system for seasonal-to-interannual
prediction (DEMETER). Bull Am Meteorol Soc 85(6):853–872
11. Adcroft A, Hill C, Campin JM, Marshall J, Heimbach P (2004) Overview of the formulation
and numerics of the MIT GCM. In: Proceedings of the ECMWF seminar series on numerical
methods, recent developments in numerical methods for atmosphere and ocean modelling, pp
139–149
12. Mellor GL (1998) Users guide for a three dimensional, primitive equation, numerical ocean
model program in atmospheric and oceanic sciences. Princeton University Princeton, NJ
13. Pacanowski RC, Dixon K, Rosati A (1993) The GFDL modular ocean model users guide.
GFDL Ocean Group Tech Rep 2(46):08542–10308
14. DelSole T, Shukla J (2002) Linear prediction of Indian monsoon rainfall. J Clim 15(24):3645–
3658
15. Tripathi KC, Agarwal R, Hrisheekesha PN (1997) Global prediction algorithms and
predictability of anomalous points in a time series
16. Liyew CM, Melese HA (2021) Machine learning techniques to predict daily rainfall amount.
J Big Data 8:1–11
17. Tripathi KC, Rai S, Pandey AC, Das IML (2008) Southern Indian Ocean SST indices as early
predictors of Indian summer monsoon
18. Shukla RP, Tripathi KC, Pandey AC, Das IML (2011) Prediction of Indian summer monsoon
rainfall using Niño indices: a neural network approach. Atmospheric Res 102(1–2):99–109
19. Abhishek K, Singh MP, Ghosh S, Anand A (2012) Weather forecasting model using artificial
neural network. Procedia Technol 4:311–318
20. Saha M, Chakraborty A, Mitra P (2016) Predictor-year subspace clustering based ensemble
prediction of Indian summer monsoon. Adv Meteorol
21. Singh BP, Pravendra K, Tripti S, Singh VK (2017) Estimation of monsoon season rainfall and
sensitivity analysis using artificial neural networks. Indian J Ecol 44:317–322
22. Praveen B, Talukdar S, Shahfahad, Mahato S, Mondal J, Sharma P, Rahman A (2020)
Analyzing trend and forecasting of rainfall changes in India using non-parametrical and
machine learning approaches. Scientific Rep 10(1):10342

23. Najib F, Mustika IW (2022) Weather forecasting using artificial neural network for rice farming
in Delanggu village. In: IOP conference series: earth and environmental science (vol 1030, no
1). IOP Publishing, p 012002
24. Kumar B, Chattopadhyay R, Singh M, Chaudhari N, Kodari K, Barve A (2021) Deep learning–
based downscaling of summer monsoon rainfall data over Indian region. Theoret Appl Climatol
143:1145–1156
25. Endalie D, Haile G, Taye W (2022) Deep learning model for daily rainfall prediction: case
study of Jimma Ethiopia. Water Supply 22(3):3448–3461
26. Saha S, Kundu B, Saha A, Mukherjee K, Pradhan B (2023) Manifesting deep learning algo-
rithms for developing drought vulnerability index in monsoon climate dominant region of West
Bengal India. Theoretic Appl Climatol 151(1–2):891–913
27. Singh D, Ghosh S, Roxy MK, McDermid S (2019) Indian summer monsoon: extreme events,
historical changes, and role of anthropogenic forcings. Wiley Interdisciplin Rev Clim Change
10(2):e571
28. Dash Y, Mishra SK, Panigrahi BK (2019) Predictability assessment of northeast monsoon
rainfall in India using sea surface temperature anomaly through statistical and machine learning
techniques. Environmetrics 30(4):e2533
29. Mittal AK, Singh UP, Tiwari A, Dwivedi S, Joshi MK, Tripathi KC (2015) Short-term predic-
tions by statistical methods in regions of varying dynamical error growth in a chaotic system.
Meteorol Atmos Phys 127:457–465
30. Tripathi KC, Mishra P (2019) Empirical orthogonal functions analysis of the regional Indian
rainfall. In: Innovations in computer science and engineering: proceedings of the sixth ICICSE
2018. Springer Singapore, pp 127–134
31. Schultz MG, Betancourt C, Gong B, Kleinert F, Langguth M, Leufen LH, Stadtler S (2021)
Can deep learning beat numerical weather prediction? Philosophic Transact Royal Soc A
379(2194):20200097
32. Zenkner G, Navarro-Martinez S (2023) A flexible and lightweight deep learning weather
forecasting model. Appl Intell 53(21):24991–25002
33. Kumar A, Pai DS, Singh JV, Singh R, Sikka DR (2012) Statistical models for long-range
forecasting of southwest monsoon rainfall over India using step wise regression and neural
network
34. Tabari H, Taye MT, Willems P (2015) Statistical assessment of precipitation trends in the upper
Blue Nile River basin. Stoch Env Res Risk Assess 29:1751–1761
35. Panda A, Sahu N (2019) Trend analysis of seasonal rainfall and temperature pattern in
Kalahandi, Bolangir and Koraput districts of Odisha, India. Atmosph Sci Lett 20(10):e932
36. Ridwan WM, Sapitang M, Aziz A, Kushiar KF, Ahmed AN, El-Shafie A (2021) Rainfall
forecasting model using machine learning methods: case study Terengganu Malaysia. Ain
Shams Eng J 12(2):1651–1663
37. Falga R, Wang C (2022) The rise of Indian summer monsoon precipitation extremes and its
correlation with long-term changes of climate and anthropogenic factors. Sci Rep 12(1):11985
AI-Integrated Smart Toy for Enhancing
Cognitive, Emotional, and Motor Skills
in Toddlers

Sara Bansod , Pranita Ranade , and Indresh Kumar Verma

Abstract During early childhood, a remarkable phase of human growth unfolds.


Neuroplasticity in toddlers refers to the brain’s ability to change and adapt remark-
ably during early childhood. A child’s future can be significantly impacted based on
the cognitive, emotional, and motor development during the early stage. Therefore,
it is imperative to train children aged 1 to 4 with appropriate care and provide them
with a more personalized learning experience, depending on their progress. Toddlers’
development of hand and eye movement, facial expressions, rhythms, rhymes, etc.,
occurs when they play traditional games like pat-a-cake or peek-a-boo at daycares or
preschools. Their progress is tracked based on factors such as language development,
social and emotional development, working habits, eating habits, personality devel-
opment, etc. However, carrying out the exercises and monitoring each child’s devel-
opment in situations where many kids are present is challenging. These toddlers need
a more individualized approach as the development of every child is different. Immer-
sive technology is currently used in personalized learning using artificial intelligence
and IoT sensors like motion and audio sensors. To apply this technology, this study
uses an AI-integrated smart toy with sensors to engage children in various activities.
This will assist individualized learning by personalizing experiences according to
each child’s requirements.
followed to determine the challenges in this area, development in children, activi-
ties that affect their development, and technology use. A design concept has been
proposed after following the double-diamond design process. The suggested design
result uses technologies such as artificial intelligence, IoT sensors, microphones,
face detection cameras, and a dashboard reflecting data collected by the smart toy
and processed using AI analytics to track the child’s development.

Keywords Cognitive development · Emotional development · Motor skills


development · Childhood development · Smart toys · AI assistance · Adaptive
learning · HCI

S. Bansod · P. Ranade (B) · I. K. Verma


Symbiosis Institute of Design, Symbiosis International (Deemed University), Pune, India
e-mail: pranitaranade@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 265
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_21

1 Introduction

The cerebral and behavioral development of toddlers from the age of 1 to 4 years old
within daycare and preschool environments, coupled with the integration of AI and
IoT technologies, presents a multifaceted landscape of opportunities and challenges
[1]. Caregivers and teachers at preschools and daycares use many educational apps
and digital books. Attempts to use the Internet of toys in education are in progress.
Toys with integrated sensors that invite toddlers to play will make it easier to track
their movements [2].
crucial role in fostering a child’s growth by providing a structured setting for cogni-
tive, social, and emotional development. The interaction with parents, peers and
caregivers, exposure to various activities, all contribute to shaping a child’s holistic
growth trajectory [3]. However, it is imperative to acknowledge the intricate inter-
play of factors like sleep patterns, social interactions, and engaging activities, which
can significantly influence an infant’s development during daycare years [4]. As the
growth of every child is different, a more personalized approach will help in providing
a better educational system. The advent of AI and IoT has resulted in innovative
solutions to monitor and enhance infants’ development in daycare settings. These
technologies offer real-time data collection, analysis, and personalized suggestions
for optimal growth. AI-driven analytics can identify patterns in a child’s behavior,
enabling parents and caregivers to tailor activities that align with individual develop-
mental needs. Nevertheless, challenges arise concerning privacy, data security, and
potential overreliance on technology [5]. Striking a balance between human inter-
action and technological intervention remains crucial to ensure infants receive the
holistic care and nurturing environment they require. This paper proposes a concept
with the evolving landscape that leverages AI and IoT judiciously, integrating their
potential benefits while upholding caregivers’ essential role in shaping infants’ cogni-
tive and emotional development within daycare settings using the design processes
and methods.

2 Literature Review

Through various online search tools, a thorough literature review has been conducted
on various keywords, such as early childhood development, artificial intelligence use
in the development of technology, cognitive development, emotional development,
user experience, etc.

2.1 Transformation of the Human Brain from Birth

During infancy, a remarkable phase of rapid growth and learning unfolds. The devel-
oping brain exhibits its highest adaptability, forming the bedrock of cognitive abili-
ties [1]. Infants embark on a journey from ground zero, acquiring skills like walking,
talking, object categorization, and adept manipulation. Simultaneously, they grasp
the art of emotional regulation and social interaction. The constraints of infant percep-
tion and cognition, along with attention and responsiveness, delineate conceptual,
and practical boundaries [3].

2.2 Emotion-Cognitive-Motor Development

In early development, the desired environment includes the necessary variety of auditory
tones for language or visual stimuli for sight, as well as the essential emotional support
and caregiver familiarity [6]. Five out of six notable interactions revealed a startlingly
consistent pattern that supports the idea that maternal and child-care factors interact
to shape children’s attachment. Understanding and promoting factors that facili-
tate healthy brain development and optimal cognitive growth across these domains
during early childhood is crucial [5]. Furthermore, there is a growing acknowledge-
ment of physical activity’s significance as a determinant of cognitive and neural
functioning in middle childhood and adulthood, in addition to its physiological and
psychosocial benefits. Systematic reviews and meta-analyses suggest that increased
physical activity levels can enhance cognitive functioning and academic achieve-
ment in school-aged children and reduce the risk of age-related cognitive decline,
dementia, and Alzheimer’s disease in adults [7].

2.3 Activities for Early Childhood Development

Numerous studies have explored the profound impact of diverse play, musical, and
creative activities on the holistic development of children aged 1–4 years. These
early formative years represent a critical period for cognitive, social, emotional,
and physical growth [6]. Research highlights the importance of imaginative play in
nurturing cognitive abilities and language proficiency, as youngsters participate in
symbolic expression and problem-solving through endeavors such as make-believe
games [8]. Exposure to music can enhance auditory processing, rhythm perception,
and socio-emotional skills. Furthermore, the creative arts have been shown to nurture
self-expression, fine motor skills, and emotional regulation [9]. This literature review
synthesizes existing research to elucidate the multifaceted benefits of play, music,
and creative activities on the development of young children, shedding light on their

pivotal role in shaping the future cognitive and emotional well-being of this age
group [7].

2.4 Use of Technology for Tracking Toddler Development

There has been a discernible surge in the integration of technological advancements


within development facilities to enhance the care and supervision of infants and
toddlers. This growing trend reflects a paradigm shift in childcare practices, wherein
innovative technologies are harnessed to monitor, engage, and cater to the develop-
mental needs of young children. This burgeoning field of research has demonstrated
the potential of AI-driven educational apps and IoT-enabled toys to enhance cogni-
tive, social, and emotional development during this critical developmental stage
[8]. For instance, interactive AI-assisted apps can adapt to a child’s individual
learning pace, offering personalized educational content and fostering cognitive
growth. A user-friendly RFID-based tracking system for preschools enables seam-
less monitoring of toddlers’ activities and developmental progress. This technology
ensures safety and facilitates real-time updates for parents. Meanwhile, IoT devices
embedded in toys and musical instruments can create immersive and responsive play
experiences, encouraging creativity and sensory exploration [9].

3 Design Methodology

A user-centered design methodology comprising the stages Empathize, Define,
Ideate, Prototype, and Test was used to achieve the study’s objective. User needs
and pain points were identified through a survey and contextual inquiry. This was
followed by conceptualization, and the final solution was tested with the users.

3.1 Online Survey

An online user study using a survey was conducted with 20 users, including parents
and day care providers between the ages of 25 and 50, to learn their opinions on the
significance of early childhood development and its current state at daycare centers
and preschools. The questions revolved around understanding the challenges faced in
tracking the child’s development during the early stage, devices and technology used
at the daycares and kindergartens currently, and parents’ requirements and expecta-
tions. The research revealed that 60% of respondents said they weren’t satisfied with
the activities conducted at the daycare. In total, 85% of respondents felt that activities
held at daycare or preschools can be helpful in the cognitive development of their

Table 1 Benchmarking of competitors in the market

| Features | Illumine | Brightwheel | ProCare | Daily Connect | Napper |
| Real-time updates | ✘ | ✘ | ✔ | ✘ | ✔ |
| Attendance tracking | ✔ | ✔ | ✔ | ✔ | ✘ |
| Daily reports | ✔ | ✔ | ✔ | ✔ | ✘ |
| Photo/video sharing | ✔ | ✘ | ✔ | ✔ | ✔ |
| Health/meal records | ✔ | ✘ | ✔ | ✔ | ✔ |
| Personalized course suggestion | ✘ | ✘ | ✘ | ✘ | ✘ |
| Activities based on cerebral, emotional, and motor development | ✔ | ✘ | ✔ | ✔ | ✘ |
| Regular communication with parents | ✔ | ✔ | ✔ | ✔ | ✔ |
| Use of IoT sensors | ✘ | ✘ | ✘ | ✘ | ✘ |
| Total features available | 6 | 3 | 7 | 6 | 4 |

children, and 60% of respondents felt that such activities can be helpful in the
emotional development of their children.

3.2 Competitor Benchmarking Study

A competitor study was carried out (shown in Table 1). The competitors are apps/
systems that help track a child’s activities and support baby care.
ProCare includes the most features, but it does not offer personalized course
suggestions. The only feature common to all the current applications is regular
communication with parents.

3.3 Contextual Inquiry

Contextual inquiry models such as the flow and sequence models were explored to
understand the user groups, their mental model and their behavior. The cultural model
(Fig. 1) helps in understanding the values of the user groups and the factors that
influence their work/decisions. The sequence model (Fig. 2) was made to understand
the steps associated with the user’s trigger, intent, and pain points.

Fig. 1 Cultural model showing various stakeholders and how they affect the primary user

Fig. 2 Sequence model showing users actions in sequence



4 Conceptualization and Ideation

The ideation phase included brainstorming, mind mapping, and affinity mapping.
Following the ideation phase, the proposed solution includes a smart toy integrated
with AI. It will interact with the kids based on inputs, collect their responses,
analyze the data, and present it to parents/caregivers through tables and visualizations.
The concept focuses on features such as AI analytics for analyzing the collected data,
smart toys integrated with face detection cameras for recognizing the child, natural
language processing (NLP) for speech analysis, smart toys connected to a dash-
board using technology like data transmission through Wi-Fi/Bluetooth to a cloud
platform, alerts, personalization and detailed information for tracking the child’s
development. Figure 3 shows a pictorial representation of the smart toy features. The
system is primarily designed for daycares and preschools where personalized atten-
tion and tracking the development of each child can be difficult. It keeps in mind the
accessibility of all the users, and the toy is made considering the toddler’s physique.
Figure 4 shows high-fidelity prototypes of the dashboard.
Primary Users of the Dashboard:
• Parents (for tracking the system to check their child’s growth)
• Caregivers (for inputting the required data and keeping track of all children’s
progress and requirements)
Smart Toy Features
• Light alert: activates when a child comes close to the toy or moves away from it.
• Speakers for conducting the activities and giving instructions based on the inputs.
• A screen for displaying visuals during activities.
• Microphone with voice detection sensors for capturing the child's responses.
• Camera with face detection sensor for recognizing the child.

Fig. 3 Smart toy—proposed concept with the feature



Fig. 4 High-fidelity prototype of the dashboard—child’s progress

Dashboard Features
• It will show the parents/caregivers the child's progress in terms of cognitive,
emotional, and motor skill development.
• It will show the record of the child's attendance.
• It will show all the activities the child participated in and the progress made during
them.
• It will have AI-generated feedback for the parents and suggest what the child is
good at.
• It will have personalized features for alerts and notifications.

5 Working of the Proposed Design Concept

The proposed concept (Fig. 5) shows the flow of the smart toy collecting information
and reflecting it on the dashboard.
The proposed design consists of two parts: (1) a smart toy and (2) a dashboard.

Fig. 5 Flow diagram—collecting information and getting reflected on the dashboard

5.1 Usability Testing

The dashboard was tested with five users using the System Usability Scale (SUS)
method. A prompt was provided to the users, who were asked to perform the task and
then rate each statement on a 5-point scale. The score was calculated using the
standard SUS formula. Table 2 shows the results and the calculated SUS score.
Prompt: Log in to the dashboard, check musical activities under the activities
category, and check the details of the cognitive progress made last month in the
sound-recognizing activity.
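The standard SUS formula subtracts 1 from each odd-numbered (positively worded) rating, subtracts each even-numbered (negatively worded) rating from 5, and multiplies the sum by 2.5 to yield a 0–100 score. A minimal sketch in Python (the example ratings are illustrative, not the study's raw responses):

```python
def sus_score(ratings):
    """Compute the System Usability Scale score from ten 1-5 ratings.

    Odd-numbered items are positively worded (contribute rating - 1);
    even-numbered items are negatively worded (contribute 5 - rating).
    The summed contributions are scaled by 2.5 into a 0-100 score.
    """
    assert len(ratings) == 10
    total = 0
    for i, r in enumerate(ratings, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# Illustrative respondent: agreement on positive items,
# disagreement on negative items.
print(sus_score([4, 1, 4, 1, 4, 1, 4, 2, 5, 1]))  # -> 87.5
```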

6 Discussion and Conclusion

Early childhood education helps in the development of children. This paper proposes
a design concept to help track toddlers’ activities at daycares and kindergartens and
analyze their cognitive, emotional, and motor skill development. Their brain devel-
opment is highly receptive to learning and skill acquisition during this period. No
current app or system helps track the child’s motor, cognitive and emotional devel-
opment. The research results were useful in identifying issues with early childhood
development. The user study helped understand the parent’s and caregiver’s prob-
lems. It was found that the toddlers’ cognitive and motor development required a
more personalized approach. Research also revealed the many kinds of technology
that can be applied to give parents and children aged 1–4 years enhanced learning
opportunities. A competitor study was carried out to know more about the features
used in the current system and what is lacking in it. Conceptualization was done to
propose a design concept with the help of current technological trends such as AI
and IoT sensors. A smart toy design has been proposed with a flow of technological
use for collecting and processing the information. A dashboard will be connected to

Table 2 Usability testing of the dashboard


SUS question (ratings given as n (%), total n = 5): Strongly disagree (1) / Disagree (2) / Neutral (3) / Agree (4) / Strongly agree (5)
I think that I would like to use this website frequently: 0 / 0 / 2 (40%) / 2 (40%) / 1 (20%)
I found this website unnecessarily complex: 4 (80%) / 1 (20%) / 0 / 0 / 0
I thought this website was easy to use: 0 / 0 / 1 (20%) / 3 (60%) / 1 (20%)
I think that I would need assistance to be able to use this website: 3 (60%) / 2 (40%) / 0 / 0 / 0
I found the various functions in this website were well integrated: 0 / 0 / 1 (20%) / 3 (60%) / 1 (20%)
I thought there was too much inconsistency in this website: 4 (80%) / 1 (20%) / 0 / 0 / 0
I would imagine that most people would learn to use this website very quickly: 0 / 0 / 1 (20%) / 4 (80%) / 0
I found this website very cumbersome/awkward to use: 2 (40%) / 3 (60%) / 0 / 0 / 0
I felt very confident using this website: 0 / 0 / 1 (20%) / 2 (40%) / 2 (40%)
I needed to learn a lot of things before I could get going with this website: 4 (80%) / 1 (20%) / 0 / 0 / 0
Calculated SUS score = 83, approaching the "excellent" benchmark of 85.5. This shows that the system's
(app's) overall usability is good

the smart toy, displaying the processed information about the child’s development to
the parents or teachers. User testing helped in understanding if the usability of the
dashboard is good or not.
The overall design system proposal will help provide a better training experi-
ence for toddlers by conducting various activities. This learning will help them set
the stage for future academic success and well-being. It will provide a more mean-
ingful experience to the toddlers at preschools and daycares. Parents can leave their
kids and go to work carefree without worrying about their child’s growth. With all
the personalized features, they will feel connected to their kids, even if they aren’t
always physically available. As this is a learning solution, it can be implemented
where toddlers come together and have a chance to learn, for example, daycares and
preschools.

7 Limitations and Future Scope

Based on the primary and secondary research, it was observed that there is a need
for technological intervention in tracking the development of toddlers. Toddlers
develop very rapidly until the age of 4. This is when they need an
appropriate learning environment and personal attention depending on their growth
and needs. But keeping track of every child’s activity and progress in preschools
and daycares is difficult. It was observed that even though applications for tracking
a child’s activities exist, they don’t entirely focus on the development factor. The
competitor’s analysis highlights the current industry trends and already-established
applications/systems. Once basic training is provided to the teachers/caregivers about
how each activity helps the child and how the smart toy can be used for tracking the
responses, it would be easier for them to give the inputs to the system. The feedback
received during the usability testing helped improve the design, and accordingly,
iterations were made. The SUS score after calculation was 83, which comes under
the acceptable range on the scale and shows that the system’s usability is good.

References

1. Shpancer N (2020) Daycare and its effects. In: Community in childhood, 13 Feb 2020
2. Ihamäki PHK (2018) Smart, skilled and connected in the 21st century: educational promises
of the internet of toys (IOTOYS)
3. Richards CS (2020) The Cambridge handbook of infant development. Cambridge University
Press, Cambridge
4. Caspar Addyman LM (2016) Practical research with children. In: Jess Prior JVH (ed).
Routledge, p 334
5. Jin L (2019) Investigation on potential application of artificial intelligence in preschool
children’s education. J Phys Conf Ser
6. Malik F, Marwaha R (2022) Development stages of social emotional development in children.
Stat Pearls Publishing, Treasure Island
7. Bowlby J (1969) Attachment and loss. Br J Psychiatr 116(530):428
8. Harley KM (2016) Early child development and nutrition: a review of the benefits and
challenges of implementing integrated interventions. Adv Nutr
9. Miendlarzewska EA, Trost WJ (2014) How musical training affects cognitive development:
rhythm, reward and other modulating variables. Front Neurosci VII:357–363
10. Komis V, Karachristos C, Mourta D, Sgoura K, Misirli A, Jaillet A (2021) Smart toys in early
childhood and primary education: a systematic review of technological and educational
affordances. Appl Sci 11(18)
11. Ling L (2022) The use of internet of things devices in early childhood education: a systematic
review, 7 Jan 2022
Thumbnail Personalization in Movie
Recommender System

Mathura Bai Baikadolla, Srirachana Narasu Baditha,
Mohanvenkat Patta, and Kavya Muktha

Abstract Personalization of the user experience is a key aspect of increasing user
engagement and retention on online platforms. In the proposed work, a hybrid recom-
mender system combines content-based filtering using cosine similarity and collab-
orative filtering using triangle similarity. In each system, a predicted score is calcu-
lated for a given user and film. These approaches are combined by taking a weighted
average of their individual results to improve accuracy. These error values match or
outperform those of other existing systems. The system assigns a specific thumbnail
to the movie recommended based on preferences of the user to the actors featured on
the artwork. A single artwork is selected among many through a weighted probability.
The aim of the proposed work is to build a system whose personalization techniques
can be accessible to smaller scale platforms and can be built upon to further enhance
user experience on user-centric platforms.

Keywords Hybrid recommender system · Collaborative-based recommender ·
Cosine similarity · Triangle similarity · User experience · Personalization ·
Custom thumbnails

1 Introduction

In recent years, Over-the-Top (OTT) platforms have revolutionized the way viewers
engage with entertainment. Today, the streaming industry is larger than ever before
with dozens of platforms including Netflix, Amazon Prime, Hulu, Disney+, etc.,
competing for customer attention and revenue. Each of these services uses a
subscription-based system. To retain users and offer personalized experiences, these
platforms rely on a recommendation system. A recommendation system as in [1]
analyzes prior user behavior and movie preferences to recommend movies that are

M. B. Baikadolla (B) · S. N. Baditha · M. Patta · K. Muktha
Department of Information Technology, VNR Vignana Jyothi Institute of Engineering and
Technology, Hyderabad, Telangana, India
e-mail: mathurabai_b@vnrvjiet.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_22

most relevant to a user. Recommender systems are of three broad types: content-
based, collaborative, and hybrid [2]. Content-based approaches focus more on
comparing items to other items in order to find similarities to those already liked,
while collaborative approaches focus more on comparing users to each other in order
to find similar interests and tastes. Hybrid approaches combine these two methods
to leverage their strengths [3, 4].
Thumbnails are static images that represent the recommended movie on streaming
platforms. These serve as a preview to a piece of media, through which the user may
gauge whether it is of interest to them. Traditionally, OTT platforms have used a
single thumbnail, often the original poster. However, services such as Netflix have
begun to personalize not only their recommendations, but also personalize their
thumbnails through dynamic thumbnails that vary in the actors, settings, or other
features according to the user. This underutilized feature can enable even greater
personalization of the user experience and drive more user engagement. The main
objective of the hybrid system is to propose a movie recommendation system which
uses dynamic thumbnails to personalize the user experience further. The proposed
system employs a hybrid method that fuses collaborative filtering using triangle
similarity and content-based filtering using cosine similarity to recommend movies.
Additionally, the proposed system selects custom thumbnails [5] using the prefer-
ences of user and global interests. The main aim of the proposed work is to introduce
a dynamic thumbnail selection algorithm that can be accessible to both large and
small-scale movie recommender systems.
Section 1 discussed the importance of hybrid recommender systems and the impact
of thumbnails. Section 2 presents a detailed literature review on collaborative filtering
in recommender systems and the effect of personalized thumbnails. Section 3 explains
the proposed hybrid intelligent recommender system with personalized thumbnails.
In Sect. 4, the results are explored. Section 5 briefs the conclusions, followed by
future work in Sect. 6.

2 Literature Review

Recommender systems tend to be biased towards popular items. Authors in [6] intro-
duced a personalized recommender system which manages this popularity bias
problem effectively by giving less popular items increased ranks. This
re-ranking method is a post-processing step in the recommender system. Authors
in [7] proposed a personalized movie recommender system using collaborative
filtering. This system is a user-to-user collaborative filtering approach in recom-
mending movies and calculating the similarity between the users using Euclidean
distance. The similarity is taken on the basis of demographic data of the user such as
gender, age, occupation, and area of living. Euclidean distance has several limitations
when used on larger datasets which the authors in [8] resolve by eliminating redun-
dant data and achieving dimensionality reduction using proposed similarity measure
[9]. In [10], a fuzzy similarity measure is proposed which supports huge amounts

of data and does not compromise the similarity. Euclidean distance cannot capture the
similarity between optimistic and pessimistic users even if they have similar tastes.
The measure proposed by the authors in [11] found the exact similarity between two
instances, which could not be achieved using Euclidean distance. Euclidean distance
is also not resilient to outliers, which can skew recommendations and lead to biased results.
Additionally, the study dataset has a sparsity of over 90%, which can further compli-
cate the use of Euclidean distance. The study also evaluates the impact of time on
user preferences and found that incorporating the time of rating has resulted in a 3%
increase in precision and a 4% increase in recall.
An effective recommender system is proposed in [12] using cuckoo search. In their
work, the authors propose a movie recommendation system that utilizes a collabo-
rative filtering approach along with a clustering method to group users into clusters
using their movie ratings. The proposed system then employs a metaheuristic opti-
mization technique called cuckoo search algorithm to optimize the weights of the
recommendation algorithm and improve the accuracy of movie recommendations.
The cuckoo search algorithm mimics the brood parasitism breeding behavior of cuckoo
birds and is known for its effectiveness in optimization problems. By combining this algorithm with the collabo-
rative filter approach and clustering method, the proposed hybrid system achieves
a minimum RMSE of 1.23104 and MAE of 0.697293, with a total of 68 clusters.
Authors in [13] proposed a hybrid system for recommending movies using an intel-
ligent system. The hybrid approach of this proposed recommender system combines
both collaborative and content-based filters. The collaborative filter module uses
Singular Value Decomposition (SVD) to give a predicted score for each movie.
However, the proposed system has a strong bias toward the genre of a movie, which
doubles the predicted score of all movies with a genre explicitly liked by the user
and removes any movies with a genre explicitly disliked by the user. Such a heavy
focus on a single feature may lead to biased and homogeneous results. The proposed
system also employs a content-based filter that uses the cosine measure to revise the
ranking of recommended movies. Additionally, the system implements an intelligent
system as the final filter for results. This proposed expert system uses fuzzy logic
as a series of 144 IF-ELSE statements to determine the importance of each movie
based on several factors.
A comparison of collaborative and hybrid approaches [3] for recommendation of
movies is done in [14]. The first approach utilizes a purely collaborative filter with
an adjusted cosine similarity measure, while the second approach was a system
that combines collaborative-based with a content-based filter using the TF-IDF
method. The authors observed that the collaborative approach outperformed the
hybrid approach, despite the hybrid system including movie data alongside user
data. The content-based filtering system used factored in only the synopsis and
title of the movie, which may have limited its effectiveness as it did not consider other
relevant features such as genre, actors, and directors. While using the TF-IDF technique
with the synopsis could be beneficial, the reliability of the results could be impacted
if the synopsis is sourced from non-standardized sources. The technique would be
safer to apply to objective features such as genre, cast, or a movie's production
values. Authors in [15] highlighted a study involving presenting participants

with titles that had either familiar or unfamiliar artwork and then asking them to rate
their interest in watching the title and their perceived genre of the title based on the
artwork. The study shows that users prefer familiar artwork, which can increase their
intention to watch a particular title. In contrast, unfamiliar artwork can create uncer-
tainty about the genre and decrease users’ interest in watching a title. The authors
suggest that this has significant implications for Netflix’s personalization and user
engagement strategies. By using custom artwork that is tailored to a user’s prefer-
ences and viewing history, Netflix can increase user engagement and improve their
overall viewing experience.
Authors in [16] propose a method to use user clicks on recommended items as
a means of conveying user preferences to the recommendation system. The authors
suggest that clicks on recommended items not only represent an explicit positive
feedback but also convey a more nuanced message about the user’s interests. This
approach captures user preferences and interests more accurately than traditional
explicit feedback methods like rating or liking. The system also reduces user effort
as it only requires a simple click to convey a message rather than rating or providing
feedback. However, there are limitations to this approach. Firstly, it assumes that
users will only click on items that they are interested in, which may not always be
the case. Users may also click on items for reasons other than interest, such as to add
it to a watchlist or to view later. Secondly, the system relies on the recommended
items being visible to the user. If a user is interested in a particular item that is not
recommended, they may not click on any of the items, and hence this system will
not receive any feedback on their preferences.
The usage of custom thumbnails is proposed in [17] as an effective strategy for
improving personalization and enhancing user engagement on online video plat-
forms, particularly on Netflix. The authors argue that custom thumbnails can have
a significant impact on user behavior and viewing choices on the platform. Netflix
uses various data-driven methods to select and test custom thumbnails for its titles.
These methods include A/B testing, machine learning, and human evaluation. By
analyzing viewing patterns and user feedback, Netflix aims to identify the most effec-
tive thumbnail for each title. The authors suggest that custom thumbnails are a key
aspect of Netflix’s overall strategy to provide personalized and engaging content to its
users. The authors argue that the use of custom thumbnails reflects Netflix’s commit-
ment to enhancing user experience and personalization on the platform. A novel
hybrid system EntreeC is introduced in [18] which fuses collaborative filtering and
knowledge-based to suggest recommendations and for performance improvement.
The efficiency of collaborative methods is enhanced by including semantic ratings
obtained from the knowledge-based recommendations. The recommender system
performance depends on the similarity measure [19]. In this proposed method, an
in-depth review is experimented with different similarity measures in collaborative
filtering recommender systems for datasets like Jester, MovieLens1M, and Movie-
Lens100k. The author found that AMI correlation measure best suits the item-based
collaborative approach. The author also concluded that the measure performance can
be improved when dataset density, dataset sparsity, cold start situation, data quality
is considered and integrating with user preferences. Deep learning algorithms were

also experimented on the MovieLens100k dataset to design an efficient hybrid
recommender system [20, 21].

2.1 Personalized Thumbnails

Thumbnails play a critical role in the success of OTT platforms such as Netflix. As
users scroll through the vast selection of titles, thumbnails must capture their attention
quickly to entice them to click and start watching [22]. As Netflix estimates they only
have up to 90 s to grab a user’s attention, selecting the right thumbnail is essential for
user engagement and can make or break their business. By displaying a snapshot of the
title, users get a sense of what the show or movie is about, and it can help them decide
whether to watch it or not. However, selecting the right thumbnail from millions of
frames can be challenging, which is why Netflix employs a combination of computing
and human efforts [23]. Using AVA or Aesthetic Visual Analysis [24], a combination
of tools and algorithms, the system can filter through millions of frames to identify
potential thumbnail candidates. Objective factors such as brightness, stillness,
and focus help determine the best candidates, alongside other factors like the actors
featured, maturity filters, and frame diversity. Once potential thumbnail candidates
are identified, they are sent to a creative team for finishing touches and the addition
of other data like the title logo. After the team has developed a couple of thumbnails,
A/B testing is conducted to determine which one has the highest click-through rate.
Authors in [25] discussed some of the open problems in Netflix using different
machine learning algorithms and various issues while designing, interpreting A/B
tests. Netflix finds that thumbnails with expressive facial emotions that convey the
tone of the title perform particularly well.
A personalized method for thumbnail selection is proposed in [26], built on a
recommendation framework that uses image quality assessment, image accessibility
analysis, and video analysis. The key frame of the video shot is extracted based on
these measures and is called thumbnail candidates. In this proposed method, SVR
model is used to predict thumbnail candidates. The predicted thumbnails by the
proposed method were of user’s preference and enhance their personal experience.
Authors in [27] have developed an embedding model for automatic selection of
video thumbnails by computing a similarity value between the query and the thumbnail,
enriched with side semantic information, called the thumbnail query relevance score. The
selected thumbnail represents the video content properly and has the
highest query relevance score. The proposed method applies only to query-
dependent thumbnail selection, i.e., personalized thumbnails. This improves the
search experience of users. The proposed method can also be used for video search
re-ranking [28], video tag localization [29], and mobile video search [30].
A recommender system in [31] is implemented using a graph database, which deals
well with complex relations; the recommendation degree of a movie node is expressed
through node size and edge thickness. In [32] the authors found that most of the research is
focused on accuracy improvement in recommender systems, yet recommendations
that score well on standard measures may not be useful to users. The authors proposed
user-centric, informal metrics for evaluating recommendations beyond accuracy,
such as the intra-list similarity metric of [33] and the leave-n-out methodology [34, 35].
movie recommendations which captures user temporal preferences as a Dirichlet
Mixture Process Model. The proposed system is a user-centric framework which
includes content attributes of user rated movies. The proposed system can be extended
to give recommendations based on content for new movies. This proposed system
motivated us to use a hybrid recommender system which fuses item-based collab-
orative approach along with content-based approach. This motivation aims for a
hybrid system to improve the serendipity and diversity for movie recommendations.
Authors in [37] proposed ITR, a similarity measure derived from the commonly used
triangle similarity, to improve recommendations using collaborative filtering.
Triangle similarity considers only the items rated in common by a pair of users;
the authors extended it with the ITR measure, which also accounts for items rated
by only one user of the pair. The proposed measure additionally considers User
Rating Preferences (UPR), i.e., the rating behavior of users. This improved similarity
measure motivated the use of such measures for hybrid systems.
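As a sketch of the base triangle similarity discussed above (the standard formulation over co-rated items, not the authors' extended ITR measure): the two rating vectors are treated as sides of a triangle, and similarity is 1 minus the ratio of the length of their difference to the sum of their lengths.

```python
import math

def triangle_similarity(ratings_u, ratings_v):
    """Triangle similarity between two users over their co-rated items.

    ratings_u, ratings_v: dicts mapping item id -> rating.
    Returns a value in [0, 1]; 1 means identical co-rated vectors.
    """
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    diff = math.sqrt(sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common))
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return 1.0 - diff / (norm_u + norm_v)

u = {1: 5, 2: 3, 3: 4}
v = {1: 5, 2: 3, 4: 2}
print(triangle_similarity(u, v))  # identical on co-rated items -> 1.0
```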

3 Proposed System

Researchers in the literature on collaborative filtering-based recommender systems
have focused on improving existing measures without considering the quality of data. The
existing systems used for recommendations usually include approaches like demo-
graphic, content-based, utility-based, collaborative, and knowledge-based. Such
systems' performance can be improved by fusing these approaches into hybrid systems.
In [18] the author presents a detailed survey of hybrid systems, such as mixed,
meta-level, weighted, feature augmentation, switching, feature combination hybrid
approaches. The proposed hybrid recommender system is a motivation from [4,
18]. The proposed system combines a content-based filter and a collaborative filter
to improve performance compared to each system alone. Thumbnails play a vital
role in getting someone to browse content on a website. Netflix invests heavily in
personalized thumbnails matched to user preferences. The pivot thumbnail
based on user preferences is selected using Aesthetic Visual Analysis (AVA). Effective
thumbnails should include bright colors, expressive faces, and prominent
characters, including the protagonist and antagonist of the movie; Netflix also recom-
mends featuring no more than three people in the frame. With these factors in mind,
we designed custom thumbnails for our proposed system by incorporating similar
features, such as limiting the number of actors to no more than three, highlighting
characters played by the top six billed actors, and selecting scenes with both aesthetic
appeal and emotional expression.
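The weighted-probability artwork selection described in the abstract might be sketched as below; the actor-affinity weighting and the small floor weight are illustrative assumptions, not the paper's exact scheme.

```python
import random

def pick_thumbnail(candidates, actor_affinity, rng=random):
    """Pick one thumbnail among candidates by weighted probability.

    candidates: list of (thumbnail_url, featured_actors) pairs.
    actor_affinity: dict actor -> user preference weight.
    Each candidate's weight is the summed affinity for its featured
    actors, plus a small floor so every artwork stays selectable.
    """
    weights = [
        0.1 + sum(actor_affinity.get(a, 0.0) for a in actors)
        for _, actors in candidates
    ]
    url, _ = rng.choices(candidates, weights=weights, k=1)[0]
    return url

candidates = [
    ("poster_a.jpg", ["Tom Hanks", "Tim Allen"]),
    ("poster_b.jpg", ["Don Rickles"]),
]
affinity = {"Tom Hanks": 3.0, "Tim Allen": 1.0}
print(pick_thumbnail(candidates, affinity))
```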

Fig. 1 Proposed hybrid recommender system architecture

The proposed hybrid system architecture comprises three major modules: (1)
Information Collector, (2) Movie Recommender Module, and (3) Thumbnail Selector
as shown in Fig. 1. The Information Collector is responsible for data preparation
and preprocessing. It enhances the existing dataset by leveraging the Open Movie
Database Application Programming Interface (OMDb API), a free-to-use RESTful
API that offers a comprehensive database of movie-related information. The Movie
Recommender Module is the central module that incorporates both the collabora-
tive and content-based components. The content-based filter predicts movie ratings
based on similarities among movies, while the collaborative filter predicts ratings
based on similarities among users. The combination of these two components outper-
forms either one individually. The thumbnail selector contains a thumbnail mapper
component that chooses the most suitable thumbnail to display for each movie. This
component uses The Movie Database Application Programming Interface (TMDB
API), another RESTful API that provides up-to-date movie-related information.
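The fusion of the content-based and collaborative components can be sketched as a weighted average of their predicted scores; the weight alpha below is a hypothetical tuning parameter, not a value fixed by the paper.

```python
def hybrid_score(content_score, collab_score, alpha=0.5):
    """Weighted average of content-based and collaborative predictions.

    alpha: weight given to the content-based component (0..1);
    0.5 here is purely illustrative and would normally be tuned
    on a validation split.
    """
    return alpha * content_score + (1 - alpha) * collab_score

print(hybrid_score(4.0, 3.0))  # equal weighting -> 3.5
```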

3.1 Dataset and Data Processing

The proposed system is built on the MovieLens 100k dataset, a popular benchmark
dataset in the field of recommender systems as shown in Table 1a. This dataset was
created by the research lab GroupLens with 943 users rating 1682 movies. It also
includes five pre-built training and testing sets which are 80–20 splits on the original
dataset. The OMDb API developed by Brian Fritz is used to augment the dataset

with additional information for each movie, including its genre, list of actors, list
of directors, year of release, runtime, censorship rating, and language as shown in
Table 1b.
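The augmentation step can be sketched as below. The query parameters (t, y, apikey) follow OMDb's documented interface; the API key is a placeholder, and the canned JSON stands in for a live HTTP response.

```python
import json
from urllib.parse import urlencode

OMDB_URL = "https://www.omdbapi.com/"

def build_query(title, year, api_key):
    """Build the OMDb lookup URL for a movie title and release year."""
    return OMDB_URL + "?" + urlencode({"t": title, "y": year, "apikey": api_key})

def extract_features(response_json):
    """Keep only the fields the recommender uses from an OMDb JSON response."""
    data = json.loads(response_json)
    keep = ("Title", "Year", "Genre", "Actors", "Director", "Language")
    return {k: data.get(k, "") for k in keep}

# A live call would fetch build_query("Toy Story", 1995, "YOUR_KEY")
# with any HTTP client; a canned response stands in here.
sample = ('{"Title": "Toy Story", "Year": "1995",'
          ' "Genre": "Animation, Adventure, Comedy",'
          ' "Actors": "Tom Hanks, Tim Allen, Don Rickles",'
          ' "Director": "John Lasseter", "Language": "English",'
          ' "Response": "True"}')
print(extract_features(sample)["Genre"])  # -> Animation, Adventure, Comedy
```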

3.2 Content-Based Filtering

The proposed system uses a content-based approach to suggest movies to users based
on similarity. The underlying premise of content-based filters in recommender
systems is that users tend to prefer items that share similarities with the ones they
have previously enjoyed. Unlike collaborative filtering systems, the content-based
filtering systems are focused solely on movie features. To calculate the similarity
between movies, we consider several relevant features of each movie, including its
genre, actors, directors, language, and plot synopsis. Initially, we also included the
runtime and censorship ratings of the movies as relevant features but found that
this resulted in decreased performance and subsequently removed them. The item’s
similarity is computed using the commonly used cosine similarity measure [38]. The
item-item similarity matrix, an N × N matrix capturing the similarity between any
two movies (A, B) in the dataset, where N is the total count of movies, can be
computed using Eq. (1).

Cosine similarity(A, B) = (A · B) / (|A| |B|).   (1)

To compute the score of each movie for a given user, first compare the movie to all other
movies rated by that user, then calculate a weighted average of each rating and
the similarity between the rated movie and the movie in question. The predicted score of a
movie M for a user U is calculated using Eq. (2).
Predicted(M, U) = ( Σ_{i=1}^{n} Rating(U, N_i) × Similarity(M, N_i) ) / ( Σ_{i=1}^{n} Similarity(M, N_i) ),   (2)

where N_1, …, N_n are the movies rated by user U.
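Equations (1) and (2) together can be sketched in a few lines of Python; the tiny binary feature vectors are illustrative stand-ins for the multi-hot genre/actor/director/language features described above.

```python
import math

# Illustrative binary feature vectors, one per movie; the real
# system uses multi-hot genre/actor/director/language vectors.
features = {
    0: [1, 1, 0, 0],
    1: [1, 0, 1, 0],
    2: [0, 0, 1, 1],
}

def cosine(a, b):
    """Eq. (1): cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def predict(movie, user_ratings):
    """Eq. (2): similarity-weighted average of the user's ratings."""
    sims = {m: cosine(features[movie], features[m]) for m in user_ratings}
    den = sum(sims.values())
    return sum(r * sims[m] for m, r in user_ratings.items()) / den if den else 0.0

# User rated movies 1 and 2; predict their score for movie 0.
print(predict(0, {1: 4.0, 2: 2.0}))  # -> 4.0
```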


The main steps in calculating predicted scores using the content-based filter are: (1)
load the preprocessed movies dataset, (2) compute the item-item similarity matrix using
cosine similarity, (3) for each movie M in the dataset and user U, calculate the predicted
score Predicted(M, U).

A collaborative filtering technique is employed by the system to
predict ratings by identifying similarities between users. The approach exploits the
fact that users with similar tastes exist within a community, and by measuring the simi-
larity of users in the training dataset, we can infer ratings for unknown movies in the
testing dataset [39]. The technique involves three major steps: (1) computing a simi-
larity matrix between users, (2) selecting a neighborhood of users, and (3) predicting
ratings. To compute the similarity between users, various metrics like triangle
similarity, Pearson correlation, cosine similarity, etc., can be used, where each has
Table 1 a MovieLens 100 k dataset. b OMDb dataset
Movie_id Movie_title Release_date Action Adventure Childrens Comedy Crime Drama Fantasy
(a)
0 1 Toy Story (1995) January 1, 1995 0 0 1 1 0 0 0
1 2 GoldenEye (1995) January 1, 1995 1 1 0 0 0 0 0
2 3 Four Rooms (1995) January 1, 1995 0 0 0 0 0 0 0
3 4 Get Shorty (1995) January 1, 1995 1 0 0 1 0 1 0
4 5 Copycat (1995) January 1, 1995 0 0 0 0 1 1 0
5 6 Shanghai Triad (Yao o yao yao dao waipo qiao) January 1, 1995 0 0 0 0 0 1 0
6 7 Twelve Monkeys (1995) January 1, 1995 0 0 0 0 0 1 0
7 8 Babe (1995) January 1, 1995 0 0 1 1 0 1 0
8 9 Dead Man Walking (1995) January 1, 1995 0 0 0 0 0 1 0
9 10 Richard III (1995) January 22, 1995 0 0 0 0 0 1 0
Id | Title | Year | Genres | Actors | Director | Language
Thumbnail Personalization in Movie Recommender System

(b)
1 | Toy story | 1995 | Animation, adventure, comedy | Tom Hanks, Tim Allen, Don Rickles | John Lasseter | English
2 | GoldenEye | 1995 | Action, adventure, thriller | Pierce Brosnan, Sean Bean, Izabella Scorupco | Martin Campbell | English, Russian, Spanish
3 | Four rooms | 1995 | Comedy | Tim Roth, Antonio Banderas, Sammi Davis | Allison Anders, Alexandre Rockwell, Robert Rodriguez | English
4 | Get shorty | 1995 | Comedy, crime, thriller | Gene Hackman, Rene Russo, Danny DeVito | Barry Sonnenfeld | English
5 | Copycat | 1995 | Drama, mystery, thriller | Sigourney Weaver, Holly Hunter, Dermot Mulroney | Jon Amiel | English
6 | Shanghai Triad | 1995 | Crime, drama, history | Gong Li, Baotian Li, Xiaoxiao Wang | Yimou Zhang | Mandarin
7 | Twelve monkeys | 1995 | Documentary | Keith Fulton, Terry Gilliam, Charles Roven, Lloyd Phi | Keith Fulton, Louis Pepe | English
8 | Babe | 1995 | Comedy, drama, family | James Cromwell, Magda Szubanski, Christine Cavanaugh | Chris Noonan | English
9 | Dead man walking | 1995 | Crime, drama | Susan Sarandon, Sean Penn, Robert Prosky | Tim Robbins | English
10 | Richard III | 1995 | Drama, sci-fi, war | Ian McKellen, Annette Bening, Christopher Bowen | Richard Loncraine | English
M. B. Baikadolla et al.

its own advantages and limitations. The similarity scores between users are stored
in a similarity matrix that facilitates the recommendation process. The proposed
recommender system uses triangle similarity metric [40] to compute the user’s
similarity. The user’s similarity between users m and n is calculated using Eq. (3).
\[
  \text{Triangle Sim}(m, n) = 1 - \frac{\sqrt{\sum_{i \in I} \left(r_{m,i} - r_{n,i}\right)^2}}{\sqrt{\sum_{i \in I} r_{m,i}^2} + \sqrt{\sum_{i \in I} r_{n,i}^2}}, \tag{3}
\]

where I represents the set of items rated by either user m or user n, and r m,i
and r n,i are the ratings of item i by users m and n, respectively.
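Eq. (3) can be sketched directly; the fallback for two all-zero rating vectors is an assumption:

```python
import math

def triangle_similarity(ratings_m, ratings_n):
    """Eq. (3): triangle similarity between two users' rating vectors.

    `ratings_m[i]` and `ratings_n[i]` are r_{m,i} and r_{n,i} for the items
    in I; the all-zero fallback is an assumption.
    """
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(ratings_m, ratings_n)))
    denom = (math.sqrt(sum(a * a for a in ratings_m))
             + math.sqrt(sum(b * b for b in ratings_n)))
    if denom == 0:
        return 1.0  # both users rated everything zero
    return 1.0 - num / denom

sim = triangle_similarity([4, 3, 5], [4, 3, 5])  # identical vectors
```

Identical rating vectors yield a similarity of 1, and the value decreases as the two users' ratings diverge.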
While predicting potential ratings, we use a neighborhood selection process, which
involves choosing a set of users similar to the target user. Two commonly used
approaches are top-k and correlation-based threshold. In the top-k approach, the
topmost k users with high similarity to the user in question are considered. In
correlation-based threshold approach, all users who exceed the baseline threshold
are considered. The proposed recommendation system adopts the top-k neighbor-
hood selection algorithm. The final rating prediction is performed by aggregating
ratings from the selected neighbors for a particular movie. The final rating in the
proposed system is calculated using the weighted sum method, which uses the
users' similarity as the weight.
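The top-k selection and weighted-sum prediction described above can be sketched together; the dictionary layout and the zero fallback are illustrative assumptions:

```python
def predict_rating(similarities, neighbor_ratings, k):
    """Top-k neighborhood selection followed by a weighted-sum prediction.

    `similarities[u]` is the (e.g., triangle) similarity between neighbor u
    and the target user; `neighbor_ratings[u]` is u's rating of the movie in
    question. The dict layout and zero fallback are illustrative assumptions.
    """
    # Keep only the k neighbors most similar to the target user.
    neighbors = sorted(similarities, key=similarities.get, reverse=True)[:k]
    denom = sum(similarities[u] for u in neighbors)
    if denom == 0:
        return 0.0
    return sum(similarities[u] * neighbor_ratings[u] for u in neighbors) / denom

sims = {"u1": 0.9, "u2": 0.8, "u3": 0.1}
ratings = {"u1": 5.0, "u2": 4.0, "u3": 1.0}
pred = predict_rating(sims, ratings, k=2)  # u3 falls outside the top-2 neighborhood
```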

3.3 Thumbnail Mapper

The proposed system aims to enhance user engagement by personalizing the thumb-
nail artwork displayed for each recommended movie [41]. To achieve this, an algo-
rithm is proposed which selects a personalized thumbnail based on the actors featured
in the artwork and their estimated relevance to the user. The proposed approach can
be a starting point for thumbnail personalization when A/B testing is unavailable
to the system. The proposed algorithm takes three main factors into consideration
when selecting a thumbnail featuring particular actors: negative exposure, frequency
in user-viewed films, and global recognition. Actors with a negative-exposure
measure below 3 are penalized in the selection process. Frequency in
user-viewed films is measured by how many films the
user has seen in which the actor has appeared. Finally, the global recognition of an
actor is determined by whether or not they appear in the top 10,000 rated celebrities
according to IMDB star-meter. The thumbnail mapper algorithm selects a thumb-
nail by considering the combined score of all actors featured. The final thumbnail is
chosen based on a weighted probability where each thumbnail’s probability of being
selected is proportional to its score relative to the sum of scores of all candidate
thumbnails.
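The score-proportional choice described above amounts to roulette-wheel sampling, which might be sketched as follows (the thumbnail ids are illustrative):

```python
import random

def choose_thumbnail(thumbnail_scores, rng=random):
    """Pick a thumbnail with probability proportional to its score.

    `thumbnail_scores` maps a thumbnail id to its combined actor score;
    the ids are illustrative, and the uniform fallback for an all-zero
    score set is an assumption.
    """
    total = sum(thumbnail_scores.values())
    if total == 0:
        return rng.choice(list(thumbnail_scores))  # uniform fallback
    r = rng.uniform(0, total)
    cumulative = 0.0
    for thumb, score in thumbnail_scores.items():
        cumulative += score
        if r <= cumulative:
            return thumb
    return thumb  # guard against floating-point edge cases

picked = choose_thumbnail({"t1": 3.0, "t2": 1.0})  # t1 is chosen ~75% of the time
```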

Algorithm: Thumbnail Weight Calculation

Input:
movie_id: ID of a movie
movie_thumbnails: dictionary mapping movie IDs to lists of thumbnail
images
popular_actors: list of IDs of popular actors
positive_actors: list of IDs of actors with positive exposure
Step 1: Start
Step 2: Get the list of thumbnail images for the given movie
Step 3: Set POPULARITY_WEIGHT and POSITIVE_EXPOSURE_
WEIGHT
a. Initialize an empty list thumbnail_weights
b. For each thumbnail in the list of thumbnails
i. Initialize an empty list weights
ii. For each person in the thumbnail image
– If the person is in the popular_actors list, then append
POPULARITY_WEIGHT to the weights list
– If the person is in the positive_actors list, then append
POSITIVE_EXPOSURE_WEIGHT to the weights list
– If the person is not in either list, then append 0.25 to the
weights list
c. Sort the weights list in descending order
d. Initialize the variable adjusted_weights to 0
e. For each weight in the weights list
i. Divide the weight by the index of the weight in the sorted weights
list plus 1
ii. Add the result to adjusted_weights
f. Appended adjusted_weights to the thumbnail_weights list
Step 4: Return the thumbnail_weights list

Algorithm to calculate weight of a given thumbnail.
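The pseudocode above can be sketched in Python; the two weight constants (1.0 and 0.5) are assumptions, since the paper names POPULARITY_WEIGHT and POSITIVE_EXPOSURE_WEIGHT without giving values, and each thumbnail is modeled here as a list of the actor ids it features:

```python
def thumbnail_weights(movie_id, movie_thumbnails, popular_actors, positive_actors,
                      popularity_weight=1.0, positive_exposure_weight=0.5):
    """Sketch of the thumbnail-weight pseudocode above.

    The two weight constants are assumptions; the paper does not give
    their values. Each thumbnail is a list of the actor ids it features.
    """
    all_weights = []
    for thumbnail in movie_thumbnails[movie_id]:
        weights = []
        for person in thumbnail:
            in_popular = person in popular_actors
            in_positive = person in positive_actors
            if in_popular:
                weights.append(popularity_weight)
            if in_positive:
                weights.append(positive_exposure_weight)
            if not in_popular and not in_positive:
                weights.append(0.25)
        weights.sort(reverse=True)
        # Step 3e: the weight at 0-based index i contributes weight / (i + 1).
        adjusted = sum(w / (i + 1) for i, w in enumerate(weights))
        all_weights.append(adjusted)
    return all_weights

weights = thumbnail_weights("m1", {"m1": [["a1", "a2"], ["a3"]]},
                            popular_actors={"a1"}, positive_actors={"a2"})
```

Here the first thumbnail scores 1.0/1 + 0.5/2 = 1.25, and the second, with one unlisted actor, scores 0.25.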

Algorithm: Actor Score Calculation

Step 1: Start
Step 2: If the actor shows a negative exposure measure of less than 3, then the
thumbnail is removed from consideration without evaluating any other features
Step 3: If the actor has appeared in multiple films seen by the user, they are
awarded 0.1 points for each additional movie they have appeared in
Step 4: If the actor is present in the top 10,000 rated celebrities list, they are
awarded 1 point
Step 5: Stop

Algorithm to calculate score of a given actor.

Algorithm: Thumbnail Score Calculation

Step 1: Start
Step 2: If the thumbnail features a single actor, then the score of the actor is
taken as the score of the thumbnail
Step 3: If the thumbnail features multiple actors, then
a. Arrange the list of actors in descending order of their scores
b. For each actor
i. divide their score by their position in the order i.e., divide
first position by 1, second position by 2, and so on
c. Take the sum of these scores as the final score for the thumbnail
Step 4: Stop

Algorithm to calculate score of a given thumbnail
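A minimal Python sketch of the two scoring algorithms above; reading Step 3 as 0.1 points per movie beyond the first is our interpretation, and the function names are illustrative:

```python
def actor_score(negative_exposure, films_seen_with_actor, in_top_celebrities):
    """Sketch of the actor-score steps above.

    Returns None when the negative-exposure measure (< 3) disqualifies any
    thumbnail featuring the actor. Awarding 0.1 points per movie beyond
    the first is our reading of Step 3.
    """
    if negative_exposure < 3:
        return None  # Step 2: thumbnail removed from consideration
    score = 0.0
    if films_seen_with_actor > 1:
        score += 0.1 * (films_seen_with_actor - 1)  # Step 3
    if in_top_celebrities:
        score += 1.0  # Step 4: top 10,000 on the IMDB star-meter
    return score

def thumbnail_score(actor_scores):
    """Rank-discounted sum: highest score / 1, next / 2, and so on."""
    ordered = sorted(actor_scores, reverse=True)
    return sum(s / rank for rank, s in enumerate(ordered, start=1))
```

For example, a two-actor thumbnail with actor scores 2.0 and 1.0 receives 2.0/1 + 1.0/2 = 2.5.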


Using the above algorithms, the top 40 movies recommended by the proposed
system for user 12 are listed in Table 2.

4 Results and Discussion

The quality of the proposed system's performance can be assessed using the metrics
Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) [42], computed on the
predicted scores against the actual scores given by the users. The RMSE

Table 2 Top 40 movie recommendations for user 12 using the proposed hybrid recommender
system
S. No. Movie id Movie title Predicted rating
1 1368 Mina Tannenbaum (1994) 4.7744
2 814 Great Day in Harlem, A (1994) 4.7479
3 1463 Boys, Les (1997) 4.7453
4 1358 The Deadly Cure (1996) 4.7442
5 1189 Prefontaine (1997) 4.7402
6 1467 Saint of Fort Washington, The (1993) 4.7335
7 1500 Santa with Muscles (1996) 4.7226
8 1643 Angel Baby (1995) 4.7109
9 1599 Someone Else’s America (1995) 4.6981
10 1201 Marlene Dietrich: Shadow and Light (1996) 4.6873
11 1367 Faust (1994) 4.6864
12 1302 Late Bloomers (1996) 4.6817
13 1389 Mondo (1996) 4.6668
14 1122 They Made Me a Criminal (1939) 4.6296
15 515 Boot, Das (1981) 4.5836
16 114 Wallace & Gromit: The Best of Aardman Animation (1996) 4.5095
17 1143 Hard Eight (1996) 4.5046
18 1449 Pather Panchali (1955) 4.4943
19 64 The Shawshank Redemption (1994) 4.4876
20 483 Casablanca (1942) 4.47
21 408 Close Shave, A (1995) 4.4682
22 1398 Anna (1996) 4.4673
23 427 To Kill a Mockingbird (1962) 4.4568
24 251 Shall We Dance? (1996) 4.4501
25 1431 Legal Deceit (1997) 4.4479
26 1007 Waiting for Guffman (1996) 4.4419
27 900 Kundun (1997) 4.4346
28 169 The Wrong Trousers (1993) 4.4273
29 1594 Everest (1998) 4.4237
30 1064 Crossfire (1947) 4.417
31 119 Maya Lin: A Strong Clear Vision (1994) 4.4118
32 1313 Palmetto (1998) 4.4049
33 1125 Innocents, The (1961) 4.3779
34 1138 Best Men (1997) 4.3672
35 113 The Horseman on the Roof (Hussard sur le toit, Le) (1995) 4.363
36 272 Good Will Hunting (1997) 4.3626
37 178 12 Angry Men (1957) 4.3611
38 511 Lawrence of Arabia (1962) 4.3575
39 613 My Man Godfrey (1936) 4.3501
40 12 Usual Suspects, The (1995) 4.3483

and MAE can be calculated using Eqs. (4) and (5), respectively. RMSE and MAE are
commonly used metrics for recommender system evaluation [43]. Such metrics
quantify the average magnitude of error between the predicted and actual ratings,
making them useful for assessing the system's predictions.
\[
  \text{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \left(\text{Predicted}_i - \text{Actual}_i\right)^2}{N}}, \tag{4}
\]
\[
  \text{MAE} = \frac{\sum_{i=1}^{N} \left|\text{Predicted}_i - \text{Actual}_i\right|}{N}, \tag{5}
\]
where N is the total number of items.
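Eqs. (4) and (5) translate directly:

```python
import math

def rmse(predicted, actual):
    """Eq. (4): root mean squared error over N prediction/rating pairs."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def mae(predicted, actual):
    """Eq. (5): mean absolute error over N prediction/rating pairs."""
    n = len(predicted)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n
```

Predictions that are each off by exactly one star give RMSE = MAE = 1; because of the squaring, RMSE penalizes a few large errors more heavily than MAE does.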
Precision and recall are also important metrics for validating recommender
systems [44, 45]. However, these metrics can be challenging to use in sparse datasets
like the MovieLens dataset, where not every movie has been rated by every user,
making it impossible to know the true value for every recommendation. Thus, these
metrics may not be well-suited to evaluating the performance of the system without
conducting proper and thorough surveys. During the testing phase, several similarity
measures were applied, and the error was measured across the test sets. The
RMSE and MAE values were computed from the predicted ratings obtained from
both the content-based filter using cosine similarity and collaborative filter using
triangle similarity. To combine the algorithms, a weighted average approach was used
with different weight combinations of 25–75%, 50–50% and 75–25%. Additionally,
the performance of the combined algorithms was compared against each individual
algorithm to quantify the value of the combination approach. Based on the analysis
from Fig. 2, it was found that the 50–50% average of the ratings obtained from the
content-based filter and collaborative filter showed the best results, with RMSE of
0.9489 and MAE of 0.7956.
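The weighted-average combination of the two filters might be sketched as follows; the default of 0.5 is the content-filter weight of the best-performing 50-50 split:

```python
def hybrid_prediction(content_score, collaborative_score, content_weight=0.5):
    """Weighted average of the two filters' predicted ratings.

    The 0.5 default reflects the 50-50 split that performed best in the
    experiments; 0.25 and 0.75 are the other splits that were tried.
    """
    return content_weight * content_score + (1.0 - content_weight) * collaborative_score

# The three splits tried: 25-75, 50-50, and 75-25.
for w in (0.25, 0.5, 0.75):
    blended = hybrid_prediction(4.0, 3.0, content_weight=w)
```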
Table 3 illustrates the results of different recommendation models on the MovieLens
100 k dataset. The RMSE of the proposed hybrid system is 0.948, which is lower than
that of the machine learning baselines (1.042, 1.024, 0.983, 0.995, 0.956, 1.006,
0.954) and the deep learning baselines (0.990, 0.957), with the exception of the deep
learning method of collaborative recommender system (DLCRS) model (0.917). These
results show that the proposed hybrid model outperforms the recommender
systems of [20, 21].

Fig. 2 Proposed hybrid recommender system results with weighted average approach

Table 3 Comparison of RMSE for different recommendation models


Recommendation models RMSE
Deep learning algorithm without movie posters 0.990
Multi modal deep learning algorithm with movie posters 0.957
User Avg 1.042
Movie Avg 1.024
Deep learning method of collaborative recommender system (DLCRS) model 0.917
Averaged movie user Avg 0.983
User-based cosine similarity 0.995
Movie-based cosine similarity 0.956
Singular Value Decomposition (SVD) 1.006
Dot product-matrix factorization (MF) 0.954
Proposed hybrid recommender system 0.948

5 Conclusion

User engagement is a key feature in the success of OTT platforms, and person-
alized experiences are essential for fostering engagement. Recommender systems
are a powerful tool for achieving this personalization by recommending films that
align with user’s preferences, and thumbnails play a critical role in attracting user’s
attention and encouraging them to engage with the platform. The proposed hybrid
recommender system fuses content-based filtering using cosine similarity with
collaborative filtering using triangle similarity to recommend films based on both
film features and user's
interests. Experimentation of the proposed system has achieved RMSE and MAE
values which are better than those of similar existing systems. The hybrid
recommender system is also designed to provide personalized thumbnails, i.e.,
dynamic thumbnails that adapt to individual users. The thumbnail mapper
algorithm of the proposed system considers several factors, such as an actor's negative
exposure with a particular user, the count of the actor in the user’s viewed films, and
the actor’s global impact. The proposed hybrid system is designed to be a practical
solution for platforms seeking to incorporate personalization into their own systems
without requiring extensive resources or A/B testing. It can be further developed using
techniques such as automation and visual analysis software. It opens the door for
even further exploration into different avenues of personalizing the user experience
and can be applied to other user-centric markets which rely on maintaining
user engagement. The proposed system with 50% content-based approach
+ 50% collaborative approach has obtained an error of 0.9489 (RMSE) and 0.7956
(MAE) which was found to be better than others.

6 Future Work

As part of future work of the proposed system, the movie recommender and thumb-
nail selector modules can be explored. The movie recommender module can be
explored with other similarity measures for different integration of content-based
and collaborative approaches. The weighted average has limitations in that its results
are confined between the results obtained from the individual components. Other
approaches which tightly integrate the two techniques may give results unobtain-
able by simple weighted average. Another area of expansion is the inclusion of user
profiles in determining similarity between users. Features such as user age, gender,
occupation, languages spoken, and area of living can allow the system to provide
tailored recommendations for specific user demographics. The thumbnail mapper
algorithm of the proposed recommender system can also be refined. While actor
preference and exposure were chosen for being a highly visual
and familiar aspect of a film, numerous other factors, such as the nature, visual
style, and ambience of the scene portrayed, can be considered as well to add more
nuance to the selection process. Furthermore, the proposed recommender system
can be improved by conducting additional research on the impact of these dynamic
thumbnails, and their effect on user experience, engagement, and click-through rate
on OTT platforms through hands-on research and experimentation.

References

1. Shetty B (2019) An in-depth guide to how recommender systems work. Built in Beta
2. Thorat PB, Goudar RM, Barve S (2015) Survey on collaborative filtering, content-based
filtering and hybrid recommendation system. Int J Comput Appl 110(4):31–36
3. Jung KY, Park DH, Lee JH (2004) Personalized movie recommender system through hybrid 2-
way filtering with extracted information. In: Flexible query answering systems: 6th international
conference, FQAS 2004, Lyon, France, 24–26 June 2004. Proceedings 6. Springer, Berlin, pp
473–486
4. Afoudi Y, Lazaar M, Al Achhab M (2021) Hybrid recommendation system combined content-
based filtering and collaborative prediction using artificial neural network. Simul Model Pract
Theory 113:102375
5. Sahu G, Gaur L, Singh G (2022) Analyzing the users’ de-familiarity with thumbnails on OTT
platforms to influence content streaming. In: 2022 international conference on computing,
communication, and intelligent systems (ICCCIS), pp 551–556. IEEE
6. Abdollahpouri H, Burke R, Mobasher B (2019) Managing popularity bias in recommender
systems with personalized re-ranking. arXiv preprint arXiv:1901.07555
7. Subramaniyaswamy V, Logesh R, Chandrashekhar M, Challa A, Vijayakumar V (2017) A
personalised movie recommendation system based on collaborative filtering. Int J High Perform
Comput Network 10(1–2):54–63
8. Bai BM, Mangathayaru N, Rani BP (2023) An optimized spectral clustering algorithm for
better imputation of medical datasets (OISSC). In: Choudrie J, Mahalle PN, Perumal T, Joshi
A (eds) IOT with smart systems. ICTIS 2023. Lecture notes in networks and systems, vol 720.
Springer, Singapore
9. Mathura Bai B, Mangathayaru N, Padmaja Rani B (2023) Unsupervised learning method
for better imputation of missing values. In: Garg D, Narayana VA, Suganthan PN, Anguera J,
Koppula VK, Gupta SK (eds) Advanced computing. IACC 2022. Communications in computer
and information science, vol 1782. Springer, Cham
10. Bai BM, Mangathayaru N (2022) Modified K-nearest neighbour using proposed similarity
fuzzy measure for missing data imputation on medical datasets (MKNNMBI). Int J Fuzzy Syst
Appl (IJFSA) 11(3):1–15
11. Bai BM, Mangathayaru N, Rani BP, Aljawarneh S (2021) Mathura (MBI)—a novel imputation
measure for imputation of missing values in medical datasets. Recent Adv Comput Sci Commun
(Formerly: Recent Patents Comput Sci) 14(5):1358–1369
12. Katarya R, Verma OP (2017) An effective collaborative movie recommender system with
cuckoo search. Egyptian Inform J 18(2):105–112
13. Walek B, Fojtik V (2020) A hybrid recommender system for recommending relevant movies
using an expert system. Expert Syst Appl 158:113452
14. Ifada N, Rahman TF, Sophan MK (2020) Comparing collaborative filtering and hybrid based
approaches for movie recommendation. In: 2020 6th information technology international
seminar (ITIS). IEEE, pp 219–223
15. Kim J, Lee J (2021) Between familiarity and unfamiliarity: users’ perception and intention of
watching netflix artwork. Archiv Des Res 34(4):23–37
16. Gilmore JN (2020) To affinity and beyond: clicking as a communicative gesture on the
experimentation platform. Commun Cult Critique 13(3):333–348
17. Eklund O (2022) Custom thumbnails: the changing face of personalisation strategies on Netflix.
Convergence 28(3):737–760
18. Burke R (2002) Hybrid recommender systems: survey and experiments. User Model User-Adap
Inter 12:331–370
19. Fkih F (2022) Similarity measures for collaborative filtering-based recommender systems:
review and experimental comparison. J King Saud Univ Comput Inform Sci 34(9):7645–7669
20. Aljunid MF, Dh M (2020) An efficient deep learning approach for collaborative filtering
recommender systems. Procedia Comput Sci 171:829–836

21. Mu Y, Wu Y (2023) Multimodal movie recommendation system using deep learning.


Mathematics 11(4):895
22. Koller T, Grabner H (2022) Who wants to be a click-millionaire? On the influence of thumbnails
and captions. In: 2022 26th international conference on pattern recognition (ICPR). IEEE, pp
629–635
23. Riley M, Machado L, Roussabrov B, Branyen T, Bhawalkar P, Jin E, Kansara A (2018) AVA:
the art and science of image discovery at Netflix. Netflix Technology Blog 7
24. Murray N, Marchesotti L, Perronnin F (2012) AVA: A large-scale database for aesthetic visual
analysis. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, pp
2408–2415
25. Gomez-Uribe CA, Hunt N (2015) The Netflix recommender system: algorithms, business value,
and innovation. ACM Trans Manage Inform Syst (TMIS) 6(4):1–19
26. Zhang W, Liu C, Wang Z, Li G, Huang Q, Gao W (2014) Web video thumbnail recommendation
with content-aware analysis and query-sensitive matching. Multimed Tools Appl 73:547–571
27. Liu W, Mei T, Zhang Y, Che C, Luo J (2015) Multi-task deep visual-semantic embedding for
video thumbnail selection. In: Proceedings of the IEEE conference on computer vision and
pattern recognition, pp 3707–3715
28. Mei T, Rui Y, Li S, Tian Q (2014) Multimedia search reranking: a literature survey. ACM
Comput Surv (CSUR) 46(3):1–38
29. Tang K, Sukthankar R, Yagnik J, Fei-Fei L (2013) Discriminative segment annotation in
weakly labeled video. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 2483–2490
30. Liu W, Mei T, Zhang Y (2014) Instant mobile video search with layered audio-video indexing
and progressive transmission. IEEE Trans Multimed 16(8):2242–2255
31. Yi N, Li C, Feng X, Shi M (2017) Design and implementation of movie recommender system
based on graph database. In: 2017 14th web information systems and applications conference
(WISA). IEEE, pp 132–135
32. McNee SM, Riedl J, Konstan JA (2006) Being accurate is not enough: how accuracy metrics
have hurt recommender systems. In: CHI’06 extended abstracts on Human factors in computing
systems, pp 1097–1101
33. Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists
through topic diversification. In: Proceedings of the 14th international conference on World
Wide Web, pp 22–32
34. Billsus D, Pazzani MJ (1998) Learning collaborative information filters. ICML 98:46–54
35. Breese JS, Heckerman D, Kadie C (2013) Empirical analysis of predictive algorithms for
collaborative filtering. arXiv preprint arXiv:1301.7363
36. Cami BR, Hassanpour H, Mashayekhi H (2017) A content-based movie recommender system
based on temporal user preferences. In: 2017 3rd Iranian conference on intelligent systems and
signal processing (ICSPIS). IEEE, pp 121–125
37. Iftikhar A, Ghazanfar MA, Ayub M, Mehmood Z, Maqsood M (2020) An improved product
recommendation method for collaborative filtering. IEEE Access 8:123841–123857
38. Fiarni C, Maharani H (2019) Product recommendation system design using cosine similarity
and content-based filtering methods. IJITEE (Int J Inform Technol Electric Eng) 3(2):42–48
39. Vellaichamy V, Kalimuthu V (2017) Hybrid collaborative movie recommender system using
clustering and bat optimization. Int J Intell Eng Syst 10(5)
40. Sun SB, Zhang ZH, Dong XL, Zhang HR, Li TJ, Zhang L, Min F (2017) Integrating Triangle
and Jaccard similarities for recommendation. PLoS ONE 12(8):e0183570
41. Pretorious K, Pillay N (2020) A comparative study of classifiers for thumbnail selection. In:
2020 International joint conference on neural networks (IJCNN). IEEE, pp 1–7
42. Chai T, Draxler RR (2014) Root mean square error (RMSE) or mean absolute error (MAE)?—
arguments against avoiding RMSE in the literature. Geosci Model Dev 7(3):1247–1250
43. Qutbuffin M (2020) An exhaustive list of methods to evaluate recommender systems. Towards
Data Science

44. Chung Y, Kim N-r, Park C-Y, Lee J-H (2018) Improved neighborhood search for collaborative
filtering. Int J Fuzzy Logic Intelligent Syst 18:29–40. https://doi.org/10.5391/IJFIS.2018.18.
1.29
45. Sawtelle S (2016) Mean average precision (map) for recommender systems. Evening Session:
Exploring Data Science and Python
Comparative Analysis of Large
Language Models for Question
Answering from Financial Documents

Shivam Panwar, Anukriti Bansal, and Farhana Zareen

Abstract Extracting and analyzing information from financial documents is necessary to understand the economic growth of any business and country. This information
is required to make investments, form policies, and take other crucial decisions
to increase profits. The huge volume of financial documents makes extracting useful
information a difficult and time-consuming process. Question answering is a powerful
way to extract relevant information quickly. Recent research has demonstrated that
the large language models (LLMs) give state-of-the-art results for various natural
language processing tasks such as question answering, document classification, sen-
timent analysis, and many more. Extracting relevant details from financial documents
is different from getting answers from general document corpus. Mathematical and
logical reasoning is also required to retrieve information from financial documents.
In this paper we present a comparative analysis of two popular LLMs for question
answering from financial documents: OpenAI’s ChatGPT and Meta AI’s LLaMA.
While the ChatGPT API is proprietary in nature, LLaMA's model weights are available
freely for research. The experimental results show that the performance of LLaMA
is comparable with ChatGPT for question answering from financial documents.

Keywords Financial question answering · Large language models · Natural language processing · ChatGPT · LLaMA-2

1 Introduction

A tremendous amount of unstructured documents is available, which can
provide useful insights to various business organizations in terms of marketing,
advertising, product launching, and various other strategic planning
and decisions. Finding information quickly is very important. A very intuitive way

Shivam Panwar and Anukriti Bansal—The authors have contributed equally.

S. Panwar · A. Bansal (B) · F. Zareen
Crisp Analytics Private Limited, LUMIQ, Noida 201301, India
e-mail: anukriti1107@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 297
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_23

of extracting the information is by querying from these documents in the natural


language. The general approach behind this is called question answering (QA, here-
after). There are various types of QA, but the most common is extractive QA, which
involves questions whose answers can be identified as a span of text in a document.
The document can be a web page, financial report, or news article. This two stage
process, wherein, first the relevant documents are retrieved and then answers are
extracted from them is the basis for many modern QA systems, such as, semantic
search engines, intelligent assistant, and automated information extractors. Recently,
many deep learning methods have been developed to perform this task, and they are
giving good results. Provided a large number of questions, short documents, and
corresponding answers in those short documents, these techniques are able to do
well on the question answering task. There are two major constraints that need to be
addressed for these models to perform well. First, they require a significant amount of
data for fine-tuning, and second, their performance decreases as the document size
increases because the token length of these models is limited. There is an additional constraint
when it comes to financial documents. Here, the extractive approach may not give
the useful insights as it requires complex reasoning and mathematical calculations.
Recently introduced large language models (LLMs, hereafter) have been shown to
handle all these constraints very well, giving state-of-the-art results.
provide a method to apply LLMs for question answering for financial documents and
compare the results of popular LLMs, which are LLaMA [21], and ChatGPT [13].
Following are the major contributions of our work:
1. This paper explores the capabilities of ChatGPT and LLaMA, delving into their
underlying design principles, for financial data analysis task.
2. We present a way by which a non-proprietary LLM with a small number of
parameters can give results comparable to a proprietary LLM with a very large
number of parameters.
3. The results and performance of LLMs for information extraction task by modeling
it as QA problem is strongly motivating for future research.

The rest of the paper is organized as follows. Section 2 outlines prior work related to
question answering using deep learning models. Section 3 provides the architectural
details of the large language models compared in this paper. Experimental
details and results are discussed in the experiments section. Finally, we conclude the
paper in Sect. 6.

2 Related Work

Question answering is an effective way for performing information retrieval [1, 14,
19] and entity extraction [2, 4, 27].
Various deep learning-based models have been used and proved to be very effec-
tive in different types of question answering problems [6]. Lei et al. [7] have used
Convolutional Neural Networks (CNN) for sentence classification, which is one of the
crucial steps in intelligent question answering systems. Tan et al. [20] proposed an
LSTM-based QA model, QA-LSTM. Word embeddings of all the sentences of
questions and text containing the answers are fed into a BiLSTM network, which provides
fixed sized representation of each sentence. Relevant text is later retrieved on the
basis of cosine similarity between sentences of questions and text. Wang et al. [23]
also proposed LSTM-based model for QA task on SQuAD dataset [18].
In 2017, researchers at Google published the first paper on attention [22] that acted
as a catalyst for future transformer models like Generative Pretrained Transformer
(GPT) [16], Bidirectional Encoder Representation from Transformers (BERT) [5],
and RoBERTa [10]. Different variants of BERT have been proposed to extract infor-
mation from different domains. Nguyen et al. [12] proposed BERTweet, which is a
BERT-based model trained on twitter data and can be used to analyze social media
data. CT-BERT [11] is a transformer model, pre-trained on Covid-19 tweets, that can
be used to analyze data related to the pandemic.
Pearce et al. [15] present a comparative analysis of large language models for the
extractive question answering task. Their work is similar to ours; however, in this paper
we compare two large language models for financial question answering, which is not
extractive in nature. To answer a question from a financial document, complex
reasoning and mathematical calculations are required.

3 Architecture of LLMs

At their core, LLMs rely on the transformer architecture, introduced by Vaswani
et al. in 2017 [22]. The basic components and mechanisms within the transformer
architecture are explained in the subsequent subsections. Figure 1, taken from the
original transformer paper [22], shows the major components of the architecture.

3.1 Input

The input is any question or query asked by the user. The following paragraphs
explain how a transformer-based model processes this input:
Tokenization The input text is first divided into individual tokens. These tokens can
be as short as single character or as long as entire word or sub-word. Tokenization
ensures that the text is broken down into manageable units that the model can work
with.
Special Tokens In addition to the regular tokens in the input sequence, special tokens
are added to convey specific information to the model. For example, [CLS] (classifi-
cation) and [SEP] (separator) tokens are added at the beginning of sequence and end
of each sentence, respectively.
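As an illustration, the tokenization and special-token steps above can be sketched as follows. The vocabulary and token IDs here are toy stand-ins, not those of a real BERT tokenizer, and the whitespace split is a simplification of real sub-word tokenization:

```python
# Illustrative sketch of BERT-style tokenization with special tokens.
# TOY_VOCAB is a made-up stand-in, not a real model vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "[UNK]": 100,
             "what": 2054, "is": 2003, "the": 1996, "net": 5658, "revenue": 6599}

def tokenize(text):
    # Real tokenizers also split words into sub-words; here we split on spaces.
    return text.lower().split()

def encode(question):
    # Wrap the token sequence with [CLS] ... [SEP], then map tokens to IDs.
    tokens = ["[CLS]"] + tokenize(question) + ["[SEP]"]
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]

print(encode("What is the net revenue"))
# [101, 2054, 2003, 1996, 5658, 6599, 102]
```

Out-of-vocabulary tokens map to [UNK], which is why real models rely on sub-word vocabularies to keep unknown tokens rare.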
300 S. Panwar et al.

Fig. 1 The basic architecture of the transformer model, taken from the original paper by Vaswani et al. [22]. The different components are explained in Sect. 3

Vocabulary and Embeddings Each token is associated with an embedding vector. The model maintains a vocabulary that maps each token to a unique token ID.
Positional Encoding Transformers do not inherently capture position information. Therefore, to capture the order or position of tokens in the input sequence, positional encodings are added to the token embeddings. They help the model understand the relative position of tokens and the order of words in a sentence or passage. The token embeddings and positional encodings are summed to create the input representation for the model.
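A minimal sketch of the sinusoidal positional encodings from the original transformer paper, summed with toy token embeddings; the 4-dimensional zero embeddings below are illustrative only:

```python
import math

def positional_encoding(position, d_model):
    # Sinusoidal encoding from the original transformer paper:
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def add_position(token_embeddings):
    # The encodings are summed with (not concatenated to) the token embeddings.
    d_model = len(token_embeddings[0])
    return [[e + p for e, p in zip(emb, positional_encoding(pos, d_model))]
            for pos, emb in enumerate(token_embeddings)]

# Two toy 4-dimensional token embeddings (all zeros, so the output shows the
# raw positional encodings).
out = add_position([[0.0] * 4, [0.0] * 4])
print(out[0])  # position 0: [sin(0), cos(0), sin(0), cos(0)] = [0.0, 1.0, 0.0, 1.0]
```

Because each position gets a distinct, deterministic pattern, the model can distinguish token order without any recurrence.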

3.2 Attention Mechanism

At the heart of the Transformer is the attention mechanism, which enables the model
to weigh the importance of different parts of the input sequence when generating an output. This mechanism replaces the traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs) used in sequence-to-sequence tasks.
Comparative Analysis of Large Language Models … 301
Self-Attention: In self-attention, the model can focus on different positions in the
input sequence to varying degrees. This allows it to capture long-range dependencies
and contextual information efficiently.
Multi-Head Attention: Transformers often use multiple attention heads in parallel.
Each head learns different aspects of the input, enhancing the model’s ability to
capture diverse patterns and relationships.
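The self-attention computation described above can be sketched as scaled dot-product attention over toy vectors. This is a single-head illustration with plain Python lists and no learned projection matrices, not an optimized implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy sequence of two token vectors; in self-attention Q = K = V = inputs.
x = [[1.0, 0.0], [0.0, 1.0]]
print(attention(x, x, x))
```

Each output row is a weighted mixture of all value vectors, which is what lets every token attend to every other token regardless of distance. Multi-head attention simply runs several such computations in parallel on learned projections of the inputs.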

3.3 Encoder-Decoder Structure

The transformer architecture consists of an encoder and a decoder. The encoder is primarily used for understanding the input text, while the decoder generates the output text.
Encoder: The encoder takes the input sequence (e.g., a document or a query) and
transforms it into a series of hidden representations. These representations encode the
contextual information from the input sequence, allowing the model to understand
the relationships between words and their context.
Decoder: The decoder is used for autoregressive text generation. It takes the
encoder’s hidden representations and generates the output sequence one token at
a time while considering both the input and previously generated tokens.
In the present work, we use the LLaMA-2 7B model, which has 7 billion parameters, and ChatGPT, which is based on the GPT-3.5 engine trained with around 175 billion parameters. Both LLaMA and ChatGPT consist of only the decoder part of the transformer, whereas BERT is based on the encoder-only component. Certain LLMs, such as BART [8] and T5 [17], use both the encoder and decoder components. Besides the attention sub-layers, each transformer layer also contains normalization and feed-forward neural network layers. Transformers consist of multiple layers stacked on top of each other. This stacking enables the model to learn hierarchical representations, with lower layers capturing local patterns and higher layers capturing more abstract and global patterns.
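The layer structure described above can be sketched as follows. The attention and feed-forward sub-layers here are toy stand-ins used only to show the residual-plus-normalization ordering of a post-norm transformer layer and the stacking of layers; they are not real learned sub-layers:

```python
# Structural sketch of one post-norm transformer layer: each sub-layer
# (attention, then feed-forward) is wrapped in a residual connection
# followed by layer normalization.

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def toy_attention(x):   # stand-in for (multi-head) self-attention
    return [v * 0.5 for v in x]

def toy_ffn(x):         # stand-in for the position-wise feed-forward network
    return [v + 1.0 for v in x]

def transformer_layer(x):
    x = layer_norm([a + b for a, b in zip(x, toy_attention(x))])
    x = layer_norm([a + b for a, b in zip(x, toy_ffn(x))])
    return x

def stack(x, n_layers=4):
    # Stacking layers is what lets lower layers capture local patterns
    # and higher layers capture more abstract, global ones.
    for _ in range(n_layers):
        x = transformer_layer(x)
    return x

print(stack([1.0, 2.0, 3.0, 4.0]))
```

The residual connections keep gradients flowing through deep stacks, while layer normalization keeps activations on a stable scale from layer to layer.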

4 Proposed Question Answering System

4.1 Overview

This section explains in detail the complete methodology of the proposed question answering system on financial documents using LLaMA and ChatGPT.

Fig. 2 A flowchart showing the complete overview of the question answering system for financial documents

Since the core components of both these LLMs are the same, the steps involved in getting the relevant information from financial documents using a QA system are also the same. The overall pipeline of the proposed system is shown in Fig. 2. First, we explain the prompt and prompt engineering, which is a powerful way of giving instructions to the LLMs. The retriever-reader architecture is discussed next, which is an integral part of the working of LLMs on large documents.

4.2 Prompt

The input given to the LLM is called a prompt. A typical prompt for a question answering task consists of two main components: (1) the question, which is the query asked by the user, and (2) the context or document, which is the source of information from which the answer is obtained. LLMs are trained on huge amounts of generic text and therefore may not perform well on domain-specific tasks. In such scenarios, connecting LLMs to a domain-specific data source gives the desired results [9]. We call this the context, and it comprises the text and tabular data of financial documents. In this paper, we have used instruction prompts and performed zero-shot and one-shot inference to analyze their impact on the performance of LLMs. Instruction prompting means giving the model an instruction to perform a specific task. In zero-shot inference, the context and question are provided and the model gives the output. In one-shot inference, the prompt is designed so that it contains one example of a context, a question, and the corresponding answer. This generally gives the model an understanding of the expected responses.
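A sketch of how such zero-shot and one-shot prompts can be assembled; the instruction wording and the worked example below are illustrative, not the exact templates used in our experiments:

```python
# Illustrative prompt assembly for zero-shot and one-shot inference.
# The instruction text is a made-up placeholder.
INSTRUCTION = "Answer the question using only the given context."

def zero_shot_prompt(context, question):
    return f"{INSTRUCTION}\n\nContext: {context}\nQuestion: {question}\nAnswer:"

def one_shot_prompt(example, context, question):
    # The single worked example shows the model the expected response format.
    ex_ctx, ex_q, ex_a = example
    demo = f"Context: {ex_ctx}\nQuestion: {ex_q}\nAnswer: {ex_a}\n\n"
    return f"{INSTRUCTION}\n\n{demo}Context: {context}\nQuestion: {question}\nAnswer:"

p = one_shot_prompt(("revenue was $10m in 2018 and $12m in 2019",
                     "what is the revenue growth in percent?", "20"),
                    "net income rose from $4m to $5m",
                    "what is the net income growth in percent?")
print(p)
```

The one-shot variant differs from the zero-shot one only by the prepended demonstration, which constrains the output format rather than the reasoning itself.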

4.3 Retriever-Reader Architecture

Financial documents are very long. Additionally, the number of tokens that can be passed in a prompt is limited: both ChatGPT and LLaMA 2 have a maximum limit of 4096 tokens. To handle this limitation, the document is divided into multiple chunks. In our experiments, we have created chunks of 2000 tokens. When a question is asked, the relevant chunks need to be identified and selected. To handle this efficiently, modern QA systems are generally based on the retriever-reader architecture, which we explain in the subsequent paragraphs.
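The chunking step described above can be sketched as follows; whitespace tokens stand in for the model tokenizer's sub-word tokens, and a small chunk size is used for illustration (2000 tokens in our setup):

```python
# Minimal sketch of splitting a long document into fixed-size token chunks.

def chunk_document(text, chunk_size=2000):
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

doc = " ".join(f"tok{i}" for i in range(10))
print(chunk_document(doc, chunk_size=4))
# ['tok0 tok1 tok2 tok3', 'tok4 tok5 tok6 tok7', 'tok8 tok9']
```

Real systems often overlap adjacent chunks so that an answer spanning a chunk boundary is not lost; that refinement is omitted here for clarity.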
Retriever Retrievers are used to fetch the relevant chunks for a specific question.
They can be broadly categorized as sparse and dense. Sparse retrievers utilize word
frequencies to create a sparse vector representation for each document and query.
The degree of relevance between a query and a document is subsequently determined
by calculating the inner product of these vectors. Dense retrievers, on the other hand,
employ encoders such as transformers to represent both the query and document
as contextualized embeddings, which are dense vectors. These embeddings capture
semantic meaning and empower dense retrievers to enhance search accuracy by
comprehending the query’s content.
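A minimal sketch of a sparse retriever of the kind described above, scoring chunks by the inner product of term-frequency vectors; a real system would add inverse-document-frequency weighting such as TF-IDF or BM25, and a dense retriever would replace these vectors with transformer embeddings:

```python
# Sparse retrieval sketch: term-frequency vectors scored by inner product.
from collections import Counter

def score(query, chunk):
    q, c = Counter(query.lower().split()), Counter(chunk.lower().split())
    # Inner product of the two sparse term-frequency vectors.
    return sum(q[t] * c[t] for t in q)

def retrieve(query, chunks, top_k=1):
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:top_k]

chunks = ["revenue grew by 12 percent in 2019",
          "the board approved a dividend increase"]
print(retrieve("what was the revenue in 2019", chunks))
# ['revenue grew by 12 percent in 2019']
```

The same `retrieve` interface works for a dense retriever if `score` is swapped for a cosine similarity over embedding vectors.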
Reader The reader is responsible for obtaining an answer from the chunks retrieved by the retriever.
In addition to the reader and retriever, there can be other components that perform post-processing on the chunks extracted by the retriever or on the answers obtained by the reader. For example, the chunks extracted by the retriever may need re-ranking to remove noise or irrelevant chunks. Similarly, post-processing of the reader's answers is required when the correct answer is fetched from different chunks of a long document.

5 Experiments

All the models are implemented using PyTorch and the transformers library from Hugging Face [25]. The experiments are conducted on an AWS g4dn.xlarge instance with 4 vCPUs, 16 GB of RAM, and an NVIDIA T4 Tensor Core GPU. The following paragraphs describe important aspects related to the experimental evaluation and the obtained results.

5.1 Dataset Description

We use the FinQA dataset [3] for our work. The dataset consists of 8231 financial QA
pairs based on publicly available earnings reports of S&P 500 companies from 1999
to 2019. An earnings report is a PDF file that contains information regarding the financials of a company, in the form of text and tables.

Fig. 3 Distribution of questions in our test set that begin with a few common starting words

In this work, we use 1428 examples
from the dataset for testing both LLMs. Figure 3 shows the distribution of questions that begin with a few common starting words. The answer to most of these questions is either a numerical value with or without a mathematical unit, or ‘Yes’ or ‘No’. Table 1 shows an example of a record of the FinQA dataset. Each record consists of pre-text, table, post-text, question, and answer. Many important financial details are present in tabular format, which appears in the table column. Pre-text and post-text consist of the text present before and after the tabular data, respectively. All three collectively form the context, on the basis of which questions are asked and answers are given. Unlike the SQuAD dataset, the answers are not directly present in the context and are obtained only after applying complex reasoning.

5.2 Evaluation Metrics

The answers in the FinQA dataset mostly consist of either numerical values with mathematical units or single-word string values (‘yes’ or ‘no’). For the current experiments, we have excluded those examples where the answer is in the form of a sentence. Therefore, we perform exact matching of the LLM output against the original answer. The performance of the models is evaluated using the accuracy metric, which is defined as follows:

Accuracy = (Number of correct answers / Total number of questions) × 100
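The metric can be computed as in the following sketch. The light answer normalization (case, whitespace, and a trailing ‘%’) is an assumption about harmless surface variation, not part of the metric's definition:

```python
# Exact-match accuracy sketch. The normalization below is an illustrative
# assumption, not the paper's exact matching procedure.

def normalize(ans):
    return str(ans).strip().lower().rstrip("%").strip()

def accuracy(predictions, references):
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references) * 100

# Toy predictions vs. references (values echo the style of Table 3).
preds = ["41", "141", "2758.05", "21.48"]
refs = ["41", "141", "2758.55", "21.5"]
print(accuracy(preds, refs))  # 50.0
```

Note that exact matching counts near-misses such as 2758.05 vs. 2758.55 as wrong; a tolerance-based comparison would score them differently.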
Table 1 An example of a record of the FinQA dataset, showing the context (consisting of pre_text, post_text, and table), the question, and the answer

Pre_text: [‘american tower corporation and subsidiaries notes to consolidated financial statements (3) consists of customer-related intangibles of approximately $ 75.0 million and network location intangibles of approximately $ 72.7 million.’, ‘the customer-related intangibles and network location intangibles are being amortized on a straight-line basis over periods of up to 20 years.’, ‘(4) the company expects that the goodwill recorded will be deductible for tax purposes.’, ‘the goodwill was allocated to the company’s international rental and management segment.’, ‘on September 12, 2012, the company entered into a definitive agreement to purchase up to approximately 348 additional communications sites from telefónica mexico.’, ‘on September 27, 2012 and December 14, 2012, the company completed the purchase of 279 and 2 communications sites, for an aggregate purchase price of $ 63.5 million (including value added tax of $ 8.8 million).’, ‘the following table summarizes the preliminary allocation of the aggregate purchase consideration paid and the amounts of assets acquired and liabilities assumed based upon their estimated fair value at the date of acquisition (in thousands): preliminary purchase price allocation.’]

Post_text: [‘(1) consists of customer-related intangibles of approximately $ 10.7 million and network location intangibles of approximately $ 10.4 million.’, ‘the customer-related intangibles and network location intangibles are being amortized on a straight-line basis over periods of up to 20 years.’, ‘(2) the company expects that the goodwill recorded will be deductible for tax purposes.’, ‘the goodwill was allocated to the company’s international rental and management segment’, ‘on November 16, 2012, the company entered into an agreement to purchase up to 198 additional communications sites from telefónica mexico.’, ‘on December 14, 2012, the company completed the purchase of 188 communications sites, for an aggregate purchase price of $ 64.2 million (including value added tax of $ 8.9 million).’]

Table: [[‘’, ‘preliminary purchase price allocation’], [‘current assets’, ‘$ 8763’], [‘non-current assets’, ‘2332’], [‘property and equipment’, ‘26711’], [‘intangible assets (1)’, ‘21079’], [‘other non-current liabilities’, ‘− 1349 (1349)’], [‘fair value of net assets acquired’, ‘$ 57536’], [‘goodwill (2)’, ‘5998’]]

Question: For acquired customer-related and network location intangibles, what is the expected annual amortization expenses, in millions?

Answer: 7.4

Table 2 The performance of LLaMA-2 and ChatGPT on question answering on the FinQA dataset
Large language model Accuracy (in percentage)
ChatGPT with zero-shot inferencing 65.91
LLaMA-2 with zero-shot inferencing 61.07
ChatGPT with one-shot inferencing 61.85
LLaMA-2 with one-shot inferencing 53.36

5.3 Results and Discussion

Table 3 shows the output of LLaMA-2 and ChatGPT on a few example questions from the FinQA dataset. The first two rows show correct answers by both LLaMA-2 and ChatGPT, while the next two rows show answers with a small percentage of error. Please note that the output is obtained after applying complex reasoning and mathematical calculations; the exact answer was not present in the respective contexts. The results of the overall performance of both models are presented in Table 2. The following observations can be made on the basis of these results:

• ChatGPT and LLaMA-2 perform better with simple prompts than with prompt engineering using one-shot inferencing. It can be inferred that for complex reasoning tasks, one-shot and few-shot inferencing may not work well. A possible reason could be that prompt engineering helps the model identify the next tokens in the output but does not enhance its complex reasoning abilities. Chain-of-thought prompting [24] and ReAct prompting [26] may give better results for these problems. Experimenting with these prompts is one of the future scopes of this work.
• The performance of LLaMA-2 is on the lower side compared to ChatGPT, but considering the parameters of LLaMA-2 (7 billion) against those of ChatGPT (175 billion), LLaMA-2 has performed reasonably well. Bigger models of LLaMA-2 are also available, with 13 billion and 70 billion parameters. Due to computational resource constraints we could not experiment with them, but they may give results comparable to ChatGPT. One more advantage of LLaMA-2 is that it is not a proprietary model and its weights are available for further fine-tuning. This is not the case with ChatGPT, which is a proprietary model.

6 Conclusion

This paper presented a comparison of two popular large language models for question answering from financial documents. The paper also presented a complete pipeline to use these models for the extraction of financial insights from large documents.

Table 3 The output given by LLaMA-2 and ChatGPT on a few examples from the FinQA dataset
S. no. Question from FinQA Actual answer LLaMA output ChatGPT output
dataset
1 What percentage of total 41 41 41
net revenues in 2012
were due to equity
securities (excluding
icbc) revenues?
2 What was the change in 141 141 141
non-trade receivables,
which are included in
the consolidated balance
sheets in other current
assets, between
September 24, 2005 and
September 25, 2004, in
millions?
3 What is the average cash 2758.55 2758.05 2758.55
provided by the
operating activities
during 2018 and 2019?
4 What is the roi of an 21.5 21.48 21.48
investment in s&p500 in
2004 and sold in 2006?

The results show that ChatGPT performs better than LLaMA-2 7B model. In future
we will attempt to compare the results after performing fine-tuning of LLaMA-2 and
other large language models.

References

1. Abbasiantaeb Z, Momtazi S (2021) Text-based question answering from information retrieval and deep neural network perspectives: a survey. Wiley Interdiscip Rev Data Min Knowl Discov 11(6):e1412
2. Ali I, Yadav D, Sharma AK (2022) Question answering system for semantic web: a review. Int
J Adv Intell Paradigm 22(1–2):114–147
3. Chen Z, Chen W, Smiley C, Shah S, Borova I, Langdon D, Moussa R, Beane M, Huang
T-H, Routledge B et al (2021) Finqa: a dataset of numerical reasoning over financial data.
arXiv:2109.00122
4. Datta S, Roberts K (2022) Fine-grained spatial information extraction in radiology as two-turn question answering. Int J Med Inform 158:104628
5. Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional
transformers for language understanding. arXiv:1810.04805
6. Zhen H, Xu S, Hu M, Xinyi W, Qiu J, Fu Y, Zhao Y, Peng Y, Wang C (2020) Recent trends in
deep learning based open-domain textual question answering systems. IEEE Access 8:94341–
94356

7. Lei T, Shi Z, Liu D, Yang L, Zhu F (2018) A novel cnn-based method for question classification
in intelligent question answering. In: Proceedings of the 2018 international conference on
algorithms, computing and artificial intelligence, pp 1–6
8. Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer
L (2019) Bart: denoising sequence-to-sequence pre-training for natural language generation,
translation, and comprehension. arXiv:1910.13461
9. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, Küttler H, Lewis M, Yih W-t, Rocktäschel T et al (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv Neural Inform Process Syst 33:9459–9474
10. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V
(2019) Roberta: a robustly optimized bert pretraining approach. arXiv:1907.11692
11. Müller M, Salathé M, Kummervold PE (2023) Covid-twitter-bert: a natural language processing
model to analyse Covid-19 content on twitter. Front Artif Intell 6:1023281
12. Nguyen DQ, Vu T, Nguyen AT (2020) Bertweet: a pre-trained language model for english
tweets. arXiv:2005.10200
13. OpenAI (2021) ChatGPT: A Large-Scale Language Model for Conversational AI. https://
openai.com/research/chatgpt. Accessed 6 Oct 2023
14. Otegi A, San Vicente I, Saralegi X, Peñas A, Lozano B, Agirre E (2022) Information retrieval
and question answering: a case study on Covid-19 scientific literature. Knowl Based Syst
240:108072
15. Pearce K, Zhan T, Komanduri A, Zhan J (2021) A comparative study of transformer-based
language models on extractive question answering. arXiv:2110.03142
16. Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training
17. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2020)
Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn
Res 21(1):5485–5551
18. Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) Squad: 100,000+ questions for machine
comprehension of text. arXiv:1606.05250
19. Sakata W, Shibata T, Tanaka R, Kurohashi S (2019) Faq retrieval using query-question similarity
and bert-based query-answer relevance. In: Proceedings of the 42nd international ACM SIGIR
conference on research and development in information retrieval, pp 1113–1116
20. Tan M, Dos Santos C, Xiang B, Zhou B (2016) Improved representation learning for question
answer matching. In: Proceedings of the 54th annual meeting of the association for computa-
tional linguistics, vol 1: Long Papers, pp 464–473
21. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava
P, Bhosale S et al (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint
arXiv:2307.09288
22. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inform Process Syst 30
23. Wang S, Jiang J (2016) Machine comprehension using match-lstm and answer pointer.
arXiv:1608.07905
24. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, Le QV, Zhou D et al (2022) Chain-
of-thought prompting elicits reasoning in large language models. Adv Neural Inform Process
Syst 35:24824–24837
25. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz
M et al (2019) Huggingface’s transformers: state-of-the-art natural language processing. arXiv
preprint arXiv:1910.03771
26. Yao S, Zhao J, Yu D, Du N, Shafran I, Narasimhan K, Cao Y (2022) React: synergizing
reasoning and acting in language models. arXiv preprint arXiv:2210.03629
27. Didi Y, Siyuan C, Boxu P, Qiao Y, Zhao W, Wang D (2022) Chinese named entity recognition
based on knowledge based question answering system. Appl Sci 12(11):5373
Multilingual Meeting Management with NLP: Automated Minutes, Transcription, and Translation

Gautam Mehendale, Chinmayee Kale, Preksha Khatri, Himanshu Goswami, Hetvi Shah, and Sudhir Bagul

Abstract In a world emphasizing multilingual interactions in meetings, this study showcases advanced audio and text processing for accurate multilingual transcription, enhancing international collaboration and ensuring clear understanding across diverse linguistic backgrounds. Harnessing the capabilities of DPTNet, this study achieves superior sound source separation, isolating speech from ambient noise. The pyannote toolkit excels in speaker diarization, segmenting audio based on speaker identities. The SpeechRecognition module showcases its prowess in transcribing dialogue with unparalleled accuracy. Highlighting advancements in textual summarization, the research underscores the synergistic power of the TextRank algorithm and the BART model in distilling extensive narratives into succinct and abstractive summaries. The Hugging Face Transformers library, especially the MarianMTModel and MarianTokenizer, provides exemplary translation of audio transcripts. Collectively, these methodologies present a comprehensive blueprint for navigating and deciphering multilingual meetings with precision and clarity.

Keywords Multilingual translation · DPTNet · Pyannote toolkit · SpeechRecognition module · TextRank · MarianMTModel

1 Introduction

Today’s global business relies on multilingual, multicultural gatherings to connect ideas and activities. Though crucial, such interactions are complicated by linguistic differences and hearing difficulties. Interpreters and manual transcriptions have been tried, although their speed and accuracy are doubtful. These strategic decision-making meetings require all participants, regardless of language or hearing, to be fully engaged and updated.
Written minutes that record conversations, decisions, and directions are essential for meeting recordkeeping. They capture the meeting’s spirit and align in- and post-meeting actions. Parallel to this, transcription captures spoken content word-for-word. Transcriptions allow participants who struggle to follow issues in real time to review and reflect. Additionally, translation bridges linguistic gaps. A precise translation lets non-native speakers engage in debates.

G. Mehendale (B) · C. Kale · P. Khatri · H. Goswami · H. Shah · S. Bagul
Dwarkadas J. Sanghvi College of Engineering, Mumbai 400056, Maharashtra, India
e-mail: I.gautammehendale2002@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 309
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_24

310 G. Mehendale et al.
The integration of transcription and translation promotes inclusivity in international settings. Organizations can accommodate a wide spectrum of people by systematically keeping original records and translated materials, boosting participation and cooperation. A multilayered method prevents any voice from being marginalized, creating an environment of inclusion, belonging, and respect. In a globalized world, companies and institutions must overcome linguistic and accessibility barriers to communication.
This study introduces the “Multilingual Meeting Management with NLP” framework. Minutes documentation, transcription, and translation are integrated in this system. The framework improves intelligibility and inclusivity for hearing- and language-impaired attendees. Democratizing access to meeting content strengthens communication across diverse audiences. Effective communication requires meeting minutes, transcriptions, and translations. An audio or video recording of a meeting is transcribed manually or automatically into detailed text. This word-for-word transcription is invaluable for extensive discussion review. Minutes summarize the transcription to highlight the meeting’s major themes, decisions, and instructions. The documented content is then translated into multiple languages for distinct audiences. This guarantees that participants can understand the information in their preferred language, bridging communication gaps and promoting inclusivity.
The remainder of this paper begins with a literature survey, followed by the proposed methodology. It then presents the results and discussion, the conclusion, and finally explores the future scope. Through this research, new horizons in multilingual meeting management are explored, aiming for universally accessible and efficient professional dialogue.

2 Literature Review

Sound source separation is a method to separate a mixture into isolated sounds from individual sources. In Li et al. [1], three networks are trained using different parameters for separating two random sound sources from a recording. In this work, the proximity principle is explored through experiments. The paper does not extensively evaluate performance on a wide range of languages or datasets. Asteroid, a PyTorch-based audio source separation toolkit, is a valuable resource for researchers. Pariente et al. [2] describe Asteroid and implement Kaldi-style recipes on common audio source separation datasets; it follows an encoder-masker-decoder approach. This research has not yet explored the transcription of multilingual speakers.
Multilingual Meeting Management with NLP: Automated Minutes … 311

Diarization categorizes audio recordings using unsupervised techniques to group audio belonging to individual speakers. Khoma et al. [3] use the open-source pyannote framework to improve accuracy and computational efficiency. In this work, four tests were undertaken for speaker identification and to optimize the diarization pipeline components. The goal of the paper is to get rid of false or corrupt data when the audio sequence is converted to text for further analysis. Using a generalized model, Barkovska et al. [4] proposed incoming audio data summarization and conducted a thorough study of different methods.
Transcription is the process of converting an audio or video recording into usable
text. One such example is given by Chen et al. [5], in which they used Microsoft
Azure Speech SDK for transcription. This system supports multiple people speaking
simultaneously and boasts an accuracy of greater than 90%. It provides a user-friendly
interface, making it easy to annotate and edit. However, a drawback is its limitation
in handling multilingual speakers. Similarly, Dewan et al. [6] offer an end-to-end
solution for creating fully automated conference meeting transcripts. Their system
employs speech-to-text and machine translation components. Evaluation metrics
such as BLEU and WER were used, and indexing was done using Elasticsearch.
Despite errors in the produced text, their method outperformed others in terms of
speed and convenience.
The paper by Majeed et al. [7] aims to create a model for extractive text summa-
rization using text ranking algorithms and sentence ranking. It focuses on extracting
high-scoring sentences to generate high-quality summaries. However, a research gap
exists as the sentence sequences may not be entirely suitable for easy user reading.
The paper successfully addresses this gap by implementing appropriate similarity
matrices, enhancing semantic relatedness between words. Furthermore, the paper
on text summarization and translation across multiple languages by Banu et al. [8]
centers on creating effective summaries while preserving the original context. It uti-
lizes Hugging Face Transformers for multilingual text summarization. The paper
employs the MarianMTModel for language translation and subsequently uses T5 for
summarization.
Pham et al. [9] explore pretrained models like wav2vec 2.0 for audio and MBART50 for text to enhance multilingual speech recognition, and also use adaptive weight techniques on the CommonVoice and Europarl test sets. However, the paper does not extensively evaluate performance on a varied range of languages or datasets. To deal with this issue, this paper uses the MarianMTModel after effectively fine-tuning and tokenizing the model, which has enabled easy translation of the text into more than 50 languages. In the study by Stanik et al.
[10] a comparative study is conducted using traditional machine learning and deep
learning approaches. The English and Italian data are classified into problem reports,
inquiries, and irrelevant data. It automates the tedious, time-consuming process of
manual analysis of user feedback and automates the task.
An accurate dataset is a must to acquire optimal results with any model. To enhance multimodal and multilingual learning, the Wikipedia-based Image Text (WIT) dataset has been introduced by Srinivasan et al. [11], which consists of a large, entity-rich collection of image and text samples. WIT has been beneficial for pretraining multimodal models, fine-tuning image-text retrieval models, and curating multilingual representations of the same text data. SummarizeAI, a web-based app for summarizing podcasts
with Large Language Models for text-to-speech translation and audio summarization, is introduced in Khanna et al. [12]. SummarizeAI is compared against MoCa, Brio, Pegasus, and BART using ROUGE-1, ROUGE-2, and ROUGE-L criteria to determine its ability to generate summaries with high unigram, bigram, and longest common subsequence overlap.
The translation of Bangla regional dialects into standard Bangla is discussed in
Faria et al. [13] using two models, mT5 and BanglaT5. The ROUGE scores show
that BanglaT5 outperforms mT5 in recall, accuracy, and F1-score across all areas in
ROUGE measures (ROUGE-1, ROUGE-2, and ROUGE-L). Notably, both models
perform best in the Mymensingh area, with BanglaT5. Sometimes the effectiveness of such software is questioned when it comes to working efficiently for a multilingual group in a geographically dispersed environment. Posey et al. [14] overcome this issue by enabling smooth communication between people in different locations, who were able to communicate in 66 languages anonymously. The work focuses on creating smooth communication, thus preventing the need for a traditional translator.
In the paper by Wairagala et al. [15], transfer learning techniques based on the
pretrained MarianMTModel are used for building machine translation models for
English-Luganda translation and vice-versa. It aims to overcome the gender bias
using Word Embedding Fairness Evaluation Framework (WEFE).

3 Proposed Methodology

This study introduces a comprehensive three-step process for audio file management:
initially transcribing the spoken content, then distilling the transcription into minutes,
and finally facilitating multilingual translation, ensuring that the content is both
accessible and comprehensible across diverse linguistic audiences.

3.1 Transcription

Sound Source Separation This technique is employed to extract distinct audio sources from a composite sound signal, typically with the objective of isolating certain sounds or eliminating unwanted noise from a recording. This approach plays a
crucial role in activities such as improving speech intelligibility in areas with high
levels of noise or isolating individual musical components within a complex audio
mix. Dual-Path Transformer Network (DPTNet), a Transformer-based approach, sig-
nificantly outperforms global modeling in the realm of speech source separation due
to its dual-path design. In the context of a multilingual meeting, separating overlap-
ping speech or background noise from the desired speech signal is crucial for accurate
transcription. When an audio file is processed through DPTNet for speech separation,
it undergoes several stages. Initially, the raw audio is preprocessed, normalized, and
transformed into a spectrogram, a visual representation of the audio’s frequency content over time. The DPTNet then applies local attention to these spectrograms, focusing
on short segments to capture the fine details of each sound source. Concurrently,
its globally recurrent mechanism ensures that while it is pinpointing specific audio
elements, it is also considering the broader audio context, enhancing its differentia-
tion ability. This dual-path approach enables the effective separation of overlapping
speech and background noise. After separation, minor post-processing refines the
audio, eliminating potential artifacts. The result is clear, individual audio streams
from the input, optimizing them for tasks like transcription.
Diarization Diarization, essential for managing multilingual meetings, segments
audio according to speaker identity. For this purpose, this study employs the
pyannote toolkit, acknowledged for its skill in deep learning-based speaker
diarization. The procedure commences with feature extraction, where the audio
signal is preprocessed to derive features such as Mel-frequency cepstral coefficients
(MFCCs), pitch, and energy. These attributes are vital for creating a concise,
low-dimensional speaker embedding that captures the unique vocal characteristics
of each participant. A neural network is trained to produce these embeddings, which
are then clustered by similarity using algorithms such as k-means or hierarchical
clustering. The resulting clusters are mapped to individual speakers by examining
temporal overlap and the uniformity of the speaker embeddings, yielding a
segmented audio signal in which each segment is attributed to a specific speaker.
This careful segmentation is crucial for accurate transcription and translation in
multilingual meetings and demonstrates the toolkit’s adeptness in both academic
and industrial spheres.
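The clustering step above can be illustrated with a self-contained toy sketch: mock 2-D points stand in for the MFCC-derived speaker embeddings, and a minimal hand-rolled k-means (not the pyannote implementation, whose internals differ) groups segments by speaker. All names and data here are illustrative assumptions.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means over fixed-length feature vectors."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centers[c])))
                  for pt in points]
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# toy per-segment "speaker embeddings": two well-separated speakers
segments = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),   # speaker A
            (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]   # speaker B
labels = kmeans(segments, k=2)   # each segment gets a speaker cluster ID
```

Mapping each cluster ID back to its time segments yields the diarized output described above.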
Speech to Text The process of converting spoken words into written form is known as
speech-to-text conversion. The Python library SpeechRecognition connects to many
voice recognition engines, including Microsoft Bing Voice Recognition and Google
Cloud Voice API, to enable this capability. Modern speech recognition systems rely
on audio sources as input. This source might originate from pre-recorded audio files
or be derived from real-time audio streams. To enhance the clarity and accuracy of this
input, several preprocessing methods are applied. These techniques, which include
filtering, normalization, and augmentation, are crucial in reducing background noise
and extraneous non-speech sounds, thus refining the input quality. The versatility
of today’s speech recognition frameworks is further demonstrated by their capacity
to select and interface with different voice recognition engines or services, tailored
314 G. Mehendale et al.

Fig. 1 Transcription process using DPTNet-based speech separation method, deep learning-based
speaker diarization process, and speech-to-text conversion using the SpeechRecognition module

to specific application needs. Once the audio has been appropriately preprocessed
and the optimal recognition engine chosen, the actual process of speech-to-text con-
version begins. At its core, this conversion relies on sophisticated algorithms and
methodologies. Contemporary models predominantly employ hidden Markov mod-
els and deep neural networks to decode and transcribe spoken language into written
text with impressive accuracy. Additionally, the model groups the speakers accord-
ing to gender in order to improve context and the user experience. For example,
odd speaker numbers (Speaker 1) indicate female speakers, whereas even speaker
numbers (Speaker 0) indicate male voices. This distinction provides a more lucid
transcription background and facilitates comprehension of dialogue dynamics. The
process’s output is arranged and formatted into a text file or string that may be used
in real-world applications. After preparation, this output can be easily processed or
analyzed further. The SpeechRecognition module makes it easy to incorporate voice
recognition features into the system, which is essential for efficiently running mul-
tilingual meetings. Here, Fig. 1 represents the summary of the three methodologies
used for transcription.
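The speaker-numbering convention described above (even IDs for male voices, odd IDs for female voices) can be captured in a small helper. The function names are illustrative, not taken from the system’s code:

```python
def speaker_gender(speaker_id: int) -> str:
    """Map a diarization speaker ID to a gender label using the convention
    described in the text: even IDs (e.g., Speaker 0) are male, odd IDs
    (e.g., Speaker 1) are female. The mapping is illustrative, not general."""
    return "female" if speaker_id % 2 == 1 else "male"

def format_segment(speaker_id: int, text: str) -> str:
    """Render one transcribed segment as a labeled transcript line."""
    return f"Speaker {speaker_id} ({speaker_gender(speaker_id)}): {text}"

line = format_segment(0, "Good morning, everyone.")
# -> "Speaker 0 (male): Good morning, everyone."
```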

3.2 Counting Minutes

In the realm of managing multilingual meetings, the ability to concisely summarize
expansive textual content while retaining crucial details has gained paramount
importance. To address this challenge, one can leverage methodologies akin to the
Summa package, which is rooted in the TextRank algorithm. The process begins by
importing essential libraries, such as NetworkX for constructing and analyzing
graphs, NLTK for text manipulation, and sklearn for gauging similarity metrics.
Once the textual data,

Fig. 2 Text summarization process using the Summa package and TextRank algorithm

for instance, from “text.txt”, is loaded, it is segmented into discrete sentences.
These fragments are then refined by eliminating stopwords and preserving only
alphanumeric terms. Concurrently, word embeddings, particularly from the GloVe
model, enhance the representation of the sentences. Assuming a 100-dimensional
variant of GloVe, these embeddings are stored systematically for rapid access.
Every sentence is transformed into a vector by averaging the embeddings of its
words. The subsequent step
involves determining a similarity matrix for the sentences, predominantly via cosine
similarity. The matrix then lays the groundwork for crafting a graph where nodes sym-
bolize sentences and their connecting edges denote semantic relatedness. With the
application of TextRank, sentences are accorded a ranking based on their relevance
and interconnectedness within the graph. This iterative ranking ensures that pivotal
sentences influence the prominence of others in close proximity. The culmination of
this meticulous procedure is the extraction of the top N sentences, embodying the
essence of the original content, offering a streamlined overview. As shown in Fig. 2,
a concise summary is generated using the TextRank algorithm. This approach is
instrumental in synthesizing coherent summaries, making it indispensable for overseeing
intricate, multilingual discussions. To further refine the summarization process, the
output from TextRank is then fed into a BART model, a pretrained model from Face-
book, designed for abstractive summarization. This ensures not only the extraction
of the most relevant sentences but also the generation of a coherent and contextually
accurate summary of the content.
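The graph-based ranking just described can be sketched end to end. The following self-contained toy uses bag-of-words counts in place of GloVe embeddings and a plain power iteration for the PageRank step; the function names, parameters, and example sentences are illustrative assumptions, not the paper's implementation:

```python
import math
import re

def tokens(sentence):
    """Lowercased alphanumeric tokens (stopword removal omitted for brevity)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def sentence_vectors(sentences):
    """Bag-of-words vectors over the shared vocabulary. The full pipeline
    averages GloVe embeddings instead; raw counts stand in here."""
    vocab = sorted({w for s in sentences for w in tokens(s)})
    return [[tokens(s).count(w) for w in vocab] for s in sentences]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def textrank_scores(sentences, d=0.85, iters=50):
    """Power-iteration PageRank over the sentence-similarity graph."""
    vecs = sentence_vectors(sentences)
    n = len(sentences)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n + d * sum(sim[j][i] / out_weight[j] * scores[j]
                                        for j in range(n) if out_weight[j])
                  for i in range(n)]
    return scores

def summarize(sentences, top_n=1):
    """Keep the top_n highest-ranked sentences in their original order."""
    scores = textrank_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]

meeting = ["The meeting discussed the project budget.",
           "The budget for the project was approved in the meeting.",
           "Someone mentioned lunch options."]
summary = summarize(meeting, top_n=1)
```

Sentences that share vocabulary reinforce each other's scores, so the two budget-related sentences outrank the isolated one, mirroring the extractive behavior described above.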

3.3 Multilingual Translation

In the pursuit of achieving high-quality multilingual translations, especially from
audio transcripts, this research harnesses the state-of-the-art Hugging Face
Transformers library. The MarianMTModel and MarianTokenizer are key elements of
this approach, serving as fundamental components within the library. The transla-
tion process begins with the MarianTokenizer carefully breaking down the text in
the original language into units of ideal size, ranging from words to subwords. The
implementation of this subtle tokenization technique provides a seamless integration
of the text with the complex neural network parameters of the MarianMTModel.
Fig. 3 Multilingual translation by MarianMTModel

Following the process of tokenization, the text is subsequently fed into the
MarianMTModel, a highly esteemed model that has undergone extensive training on a
wide range of multilingual datasets. Significantly, the MarianMTModel, which was
initially pretrained, has undergone additional fine-tuning to particularly address the
translation of transcripts in several languages. This fine-tuning process has resulted
in improved translation accuracy for this particular sort of data. The translations pro-
duced by the model after post-processing exhibit exceptional quality and linguistic
correctness in relation to the designated target language. The resulting translation
can subsequently be displayed, stored, or further processed as required. Figure 3
represents the translation process using the MarianMTModel.
The effectiveness of this translation technique is heavily reliant on the strength of
the fine-tuned model and the accuracy of tokenization. Supported by the extensive
Hugging Face Transformers library, which provides assistance for translations in
more than 100 languages, and with the MarianMTModel effectively managing over
50 languages, including widely spoken ones like English, Spanish, and Chinese, this
methodology presents a comprehensive and sophisticated strategy for addressing
issues in multilingual translation.
At the conclusion of the translation process, all three components (the
transcription, the meeting minutes, and the translation) are consolidated and saved
into a Word document. This document, named meeting.docx, is then available for
download in the current working directory.
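The consolidation step can be sketched as follows. A plain-text file stands in for the Word document here (the actual pipeline writes meeting.docx, presumably via a library such as python-docx); the function name, section titles, and sample strings are illustrative assumptions:

```python
from pathlib import Path

def assemble_meeting_document(transcription: str, minutes: str,
                              translation: str, path: str = "meeting.txt") -> Path:
    """Consolidate the three pipeline outputs into a single document in the
    current working directory. A plain-text file stands in for meeting.docx."""
    sections = [
        ("Transcription", transcription),
        ("Meeting Minutes", minutes),
        ("Translation", translation),
    ]
    body = "\n\n".join(f"== {title} ==\n{content}" for title, content in sections)
    out = Path(path)
    out.write_text(body, encoding="utf-8")
    return out

doc = assemble_meeting_document("Speaker 0: Hello.",
                                "Greetings exchanged.",
                                "Hablante 0: Hola.")
```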

4 Results

In this study, leveraging the DPTNet technique for sound separation, an initial
audio file in .wav format was examined, revealing a diarization result of two
distinct speakers; a portion of the 250 generated segments is displayed. Progressing
further, a second audio file was analyzed, which, after processing, presented a more
intricate diarization encompassing multiple speakers; a portion of the 226 generated
segments is displayed, as seen in Fig. 4 for both audio files.

Fig. 4 Two-speaker diarization and multiple-speaker diarization

Fig. 5 Transcription of audio to text

Fig. 6 TextRank summary

Fig. 7 Abstractive summarization using BART

It is evident that DPTNet is adept not just at identifying individual sounds, but
also at discerning overlapping voices, thus enhancing the precision of
transcriptions in complex multilingual contexts. Following diarization, the
segments are transcribed with the SpeechRecognition module. The transcription
method effectively distinguishes between male and female voices, ensuring enhanced
clarity and contextual understanding. After importing libraries and loading data
from “text.txt”, sentences are processed, stopwords are removed, and GloVe
embeddings are applied. A similarity matrix is created, forming the basis for a
graph where nodes represent sentences and edges indicate relatedness. Using
TextRank, sentences are ranked by relevance. Figure 5 presents the first
transcription result of an audio file, Fig. 6 presents the concise summary
generated by the TextRank algorithm, and Fig. 7 gives the abstractive summary
produced by the BART model.
In the evaluation, this study calculated precision, recall, and F1-score to assess
the quality of machine-generated summaries against human-annotated reference
summaries. The study employed two distinct

Table 1 ROUGE score comparison of summarization models


Summarization ROUGE-1 ROUGE-2 ROUGE-L
TextRank 43.75 26.76 39.58
BART model 47.5 23.52 44.37

Fig. 8 Live audio translation

Fig. 9 Multilingual audio transcription and translation capability

models for summarization: BART and TextRank. The ROUGE metrics, including
ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), and ROUGE-L (longest
common sequence overlap), were used to quantify the similarity between the gen-
erated summaries and the human-annotated ones in Table 1. Notably, BART consis-
tently outperformed TextRank across all ROUGE metrics (ROUGE-1, ROUGE-2,
and ROUGE-L), indicating its superior performance in capturing both unigram and
bigram overlaps, as well as longer common sequences in the summaries, thereby
demonstrating its effectiveness in generating high-quality and abstractive summa-
rization.
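The ROUGE-N figures in Table 1 can be reproduced in principle from clipped n-gram overlaps. A minimal sketch, assuming whitespace tokenization and no stemming (the actual evaluation likely used a ROUGE library with its own preprocessing):

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1):
    """ROUGE-N as clipped n-gram overlap: returns (precision, recall, F1)."""
    def ngrams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())   # clipped counts via Counter intersection
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = rouge_n("the minutes were approved",
                   "the meeting minutes were approved", n=1)
# p = 1.0, r = 0.8
```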
In a multilingual context, real-time capture of live audio is performed on a single
individual as seen in Fig. 8, exemplified by their speech in the Spanish language.
Sophisticated speech recognition tools subsequently transcribe the provided input,
transforming the spoken words into written text.
Furthermore, upon receiving an audio file, the system can transcribe and translate
the conversation language (source) into 21 different target languages, as
represented in Fig. 9; this capability has been rigorously tested and verified.
Whether the spoken content is in Spanish, French, or Hindi, the technology can
seamlessly render it into the desired target language, ensuring clarity and cultural
nuance. Table 2 displays the ROUGE-1, ROUGE-2, and ROUGE-L scores for trans-
lations from English into four target languages: Hindi (H), Italian (It), Spanish (Es),
and French (Fr).

Table 2 ROUGE scores for translations from English


Target language ROUGE-1(p, r, f1) ROUGE-2(p, r, f1) ROUGE-L(p, r, f1)
Hindi (H) (0.629, 0.537, 0.587) (0.497, 0.512, 0.554) (0.496, 0.64, 0.632)
Spanish (Es) (0.573, 0.486, 0.57) (0.613, 0.458, 0.496) (0.463, 0.494, 0.54)
French (Fr) (0.647, 0.596, 0.604) (0.608, 0.463, 0.453) (0.614, 0.456, 0.609)
Italian (It) (0.563, 0.487, 0.479) (0.64, 0.508, 0.623) (0.564, 0.54, 0.486)

5 Conclusion

Within the dynamics of meetings, clear and inclusive communication is of paramount
importance. The DPTNet effectively filters out non-speech components, ensuring the
delivery of crystal-clear audio. This audio, whether from two or multiple speakers,
undergoes meticulous diarization to create distinct segments. These segments are then
transformed by the SpeechRecognition module, which discerns between male and
female voices, into transcribed text. The utilization of TextRank facilitates prioritized
sentence extraction, which is subsequently refined by the BART model into a concise
abstractive summary. A standout feature is the provision of multilingual translation;
any text file from the meeting can be input for translation, ensuring every participant
receives a comprehensive Word document in their chosen language by the end of
the session. This cohesive approach amplifies the efficiency of meetings, ensuring
clarity, inclusivity, and collaboration across linguistic barriers.

6 Future Scope

In the context of global integration, professional discussions involving individuals
from varied linguistic backgrounds are becoming the norm rather than the exception.
Building on the developments described in this work, the main goal of future work
is to investigate and integrate cutting-edge real-time transcription and
translation systems. Imagine a lively forum where everyone is involved in real-time
conversations, irrespective of their language background. As a speaker talks in
their own tongue, the system delivers transcripts in each participant’s preferred
language almost instantly. This instantaneous cross-lingual interaction
redefines the traditional boundaries of multilingual discourse while also promoting
inclusion and mutual comprehension. The goal of the project is to create a setting
where instantaneous multilingual professional exchanges are considered the gold
standard, driven by the continued advancements in NLP and AI.

References

1. Li H, Chen K, Wang L, Liu J, Wan B, Zhou B (2022) Sound source separation mechanisms of
different deep networks explained from the perspective of auditory perception. Appl Sci 12(2)
2. Pariente M et al (2020) Asteroid: the PyTorch-based audio source separation toolkit for
researchers. In: Proceedings of Interspeech 2020, pp 2637–2641
3. Khoma V, Khoma Y, Brydinskyi V, Konovalov A (2023) Development of supervised speaker
diarization system based on the pyannote audio processing library. Sensors 23(4)
4. Barkovska O (2022) Research into speech-to-text transformation module in the proposed model
of a speaker’s automatic speech annotation. Innovative Technol Sci Solutions Ind, pp 5–13
5. Chen X, Li S, Liu S, Fowler R, Wang X (2023) Meetscript: designing transcript-based interac-
tions to support active participation in group video meetings. Proc ACM Hum-Comput Interact
7(CSCW2):1–32
6. Dewan A, Ziemski M, Meylan H, Concina L, Pouliquen B (2023) Developing automatic verba-
tim transcripts for international multilingual meetings: an end-to-end solution. arXiv preprint
arXiv:2309.15609
7. Majeed M, Kala MT (2023) Comparative study on extractive summarization using sentence
ranking algorithm and text ranking algorithm. In: 2023 International conference on power,
instrumentation, control and computing (PICC), pp 1–5
8. Banu S, Ummayhani S (2023) Text summarisation and translation across multiple languages.
J Sci Res Technol 242–247
9. Pham N-Q, Waibel A, Niehues J (2022) Adaptive multilingual speech recognition with pre-
trained models. arXiv preprint arXiv:2205.12304
10. Stanik C, Haering M, Maalej W (2019) Classifying multilingual user feedback using traditional
machine learning and deep learning
11. Srinivasan K, Raman K, Chen J, Bendersky M, Najork M (2021) Wit: Wikipedia-based image
text dataset for multimodal multilingual machine learning. In: Proceedings of the 44th inter-
national ACM SIGIR conference on research and development in information retrieval, pp
2443–2449
12. Khanna D, Bhushan R, Goel K, Juneja S (2023) Summarizeai-summarization of the podcasts.
Available at SSRN 4628657
13. Faria FTJ, Moin MB, Wase AA, Ahmmed M, Sani MR, Muhammad T (2023) Vashantor: a
large-scale multilingual benchmark dataset for automated translation of bangla regional dialects
to bangla language. arXiv preprint arXiv:2311.11142
14. Posey J, Aiken M (2015) Large-scale, distributed, multilingual, electronic meetings: a pilot
study of usability and comprehension. Int J Comput Technol 14:5578–5585
15. Wairagala EP, Mukiibi J, Tusubira JF, Babirye C, Nakatumba-Nabende J, Katumba A,
Ssenkungu I (2022) Gender bias evaluation in Luganda-English machine translation. In: Asso-
ciation for machine translation in the americas, pp 274–286
Exploring Comprehensive Privacy
Solutions for Enhancing Recommender
System Security and Utility

Esmita Gupta and Shilpa Shinde

Abstract Nowadays, recommender systems have gained considerable attention and have
become highly efficient tools for categorizing content and personalizing it to the
diverse requirements of online users. Recommender systems are driven by the evolving
preferences of computer users and the increasing accessibility of the internet. Though
they can provide precise recommendations, modern recommender systems face
numerous constraints and challenges, such as cold-start problem, sparsity, scalability,
privacy concerns, and optimization issues. Diverse techniques are available, which
complicates the selection of an appropriate one when building application-focused
recommender systems. Each technique has its own set of features, advantages, and
disadvantages, creating the need for a comprehensive investigation of the
complexities involved. This research work aims to conduct a systematic assessment of current
contributions in the field of recommender systems, with the objective of gaining a
thorough understanding of the advancements, identifying areas that require further
attention, and elucidating the unresolved questions and concerns associated with
different techniques. By synthesizing the findings of this review, valuable insights
can be obtained to guide future research and development efforts in the realm of
recommender systems.

Keywords Privacy-preserving · Recommender systems · Data protection ·
Algorithmic techniques · User privacy

E. Gupta (B) · S. Shinde


Department of Computer Engineering, Ramrao Adik Institute of Technology, D. Y. Patil Deemed
University, Nerul, India
e-mail: esmita.g@gmail.com
S. Shinde
e-mail: shilpa.shinde@rait.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 321
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_25

1 Introduction

Data has become the determining factor in nearly every domain, and its volume is
growing exponentially. As the world’s second-largest internet user base, India had
about 749 million online users in June 2020 and is expected to reach 900 million by
2025. Compared to commercial hubs like the USA (266 million, 84%) and France
(54 million, 81%), the penetration of web-based and e-commerce business is modest,
but it is growing at a record-breaking rate, adding almost 6 million new users on
average every month. Standard database management systems are unable to manage such
large datasets effectively. Traditional databases cannot store and process
semi-structured, quasi-structured, and unstructured data, including video, images,
audio, web logs, JSON documents, and search trends, among
others [1]. As a result, the idea of big data was developed. Ninety percent of the
information on earth today was produced in the past two years alone, by online users
who generate 2.5 quintillion bytes of data each day [2].
This information is gathered from a variety of sources, including social media
postings, videos, and images, as well as transaction records of both e-commerce and
non-e-commerce platforms. This is referred to as big data. Big data encompasses high-volume,
high-velocity, complex, and variable data, necessitating advanced techniques and
technologies to effectively collect, store, distribute, manage, and analyze the
information [3]. Such assistance provides us with a simple means of identifying the
best option without exerting much effort to sift through the many alternatives on
the market. Nowadays, a recommendation system (RS) is an application that filters
personalized information, providing a method for understanding a user’s preferences
and making appropriate suggestions by considering patterns among their likes and
ratings of various items [4, 5], as shown in Fig. 1.
Protection of users’ information has become one of the most substantial challenges
[6–8]. Privacy threats include service providers or their employees gaining
unauthorized access to user data [9], illegal data disclosure, third parties
purchasing users’ information [10], and various other incidents of

Fig. 1 Recommendation system



hacking [11]. Hence, ensuring the privacy of user information is crucial, achieved
through the creation and implementation of various privacy-preserving techniques to
guarantee robust protection for users’ data. In this research work, we focus on leading
an extensive systematic survey of the literature on different techniques for preserving
privacy, which are used for privacy protection in recommendation systems, to iden-
tify trends in the aid of privacy-preserving methods within secure recommenda-
tion systems and focus on the future research directions for the enhancements of
recommendation systems.

2 Privacy Measures Employed in Recommendation Systems

In this section, we review the various privacy-preserving methods/techniques used
in recommendation systems. These techniques can be categorized into two types:
Non-cryptographic techniques: These are methods that aim to protect users’
sensitive information and maintain their privacy without directly involving
encryption or decryption processes. They are employed to ensure that users’
personal data and preferences are not exposed while still enabling the
recommendation system to provide accurate and relevant suggestions. Instead of
relying on cryptographic operations, non-cryptographic privacy techniques focus on
data transformation, aggregation, and perturbation to achieve privacy goals.
Cryptographic techniques: These involve encryption and related methods to protect
users’ sensitive information and interactions. They ensure that user data remains
confidential and secure while still allowing the recommendation system to generate
accurate suggestions. Cryptographic privacy techniques provide mathematical
guarantees of confidentiality and security against various types of attacks.
Let us explore a few of the non-cryptographic techniques used to preserve user
privacy in recommendation systems:
Anonymization involves the removal of personally identifiable information from
data [12, 13], ensuring the protection of individuals’ privacy when sharing
information. This process encompasses various approaches, including k-anonymity,
l-diversity, and t-closeness. K-anonymity safeguards privacy by ensuring that each
disclosed record remains indistinguishable from at least k − 1 other records when
linked with external data. Two techniques can be used to achieve k-anonymity:
suppression and generalization. This method thwarts potential linking attacks,
preventing adversaries from uniquely identifying users by combining
quasi-identifier attributes with external data [14].
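A k-anonymity check over quasi-identifiers follows directly from the definition above. In this sketch the table, attribute names, and generalization scheme (age ranges, ZIP suppression with '*') are hypothetical:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """A table is k-anonymous if every combination of quasi-identifier
    values appears in at least k records."""
    groups = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return all(count >= k for count in groups.values())

# toy table after generalization (age ranges) and suppression ('*' on ZIP digits)
records = [
    {"age": "20-30", "zip": "123**", "rating": 5},
    {"age": "20-30", "zip": "123**", "rating": 3},
    {"age": "30-40", "zip": "456**", "rating": 4},
    {"age": "30-40", "zip": "456**", "rating": 2},
]
ok = is_k_anonymous(records, ["age", "zip"], k=2)    # True: each group has 2 rows
bad = is_k_anonymous(records, ["age", "zip"], k=3)   # False: no group reaches 3
```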

For example, the OTT platform Netflix anonymizes user data by eliminating personal
information, enabling it to recommend personalized content without compromising
user privacy.
L-diversity, on the other hand, operates by refining data granularity. It mandates
that every group of records exhibits at least l well-represented values for each
sensitive attribute. L-diversity addresses vulnerabilities that k-anonymity faces,
such as homogeneity and background knowledge attacks [15]. In scenarios where a dataset
requiring anonymization contains sensitive attributes, T-closeness is employed. T-
closeness safeguards against both attribute homogeneity and proximity, overcoming
limitations of l-diversity. This method proves effective when sensitive attributes
demonstrate skewed distributions or distinct values within the equivalence groups of
the anonymized dataset [16].
Differential privacy ensures that no single user’s information can be inferred from
the output, making it computationally challenging to distinguish any individual user’s
contribution from others in the released data [12]. This is accomplished by intro-
ducing noise to either the inputs or outputs, obscuring minor changes originating
from a single user’s input. The amount of noise added is determined by the desired
level of privacy.
Definition: A randomized computation K satisfies ε-differential privacy if, when
given any neighboring datasets A and B that differ by just one record, and for all
subsets S of all potential outcomes in Range (K), the following inequality is true
[16]:

Pr[K (A) ∈ S] ≤ exp(ε) × Pr[K (B) ∈ S]. (1)

Here, ε represents the privacy budget, governing the trade-off between privacy and
accuracy. Typically, ε is assigned a small positive value. Smaller values yield
stronger privacy but lower accuracy, while larger values have the opposite effect.
Differential privacy techniques are being used by Apple for preserving the users’
private data while still allowing for accurate aggregation.
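A standard way to realize ε-differential privacy for numeric queries is the Laplace mechanism, which adds noise with scale sensitivity/ε. A minimal sketch using inverse-transform sampling; the function names and example query are illustrative:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value plus Laplace(0, b) noise with scale
    b = sensitivity / epsilon, satisfying epsilon-differential privacy
    for a query with the given L1 sensitivity."""
    rng = rng or random.Random()
    b = sensitivity / epsilon
    # inverse-transform sampling of Laplace(0, b) from Uniform(-1/2, 1/2)
    u = rng.random() - 0.5
    noise = -b * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

# a counting query ("how many users rated item X?") has sensitivity 1
noisy_count = laplace_mechanism(true_value=120, sensitivity=1, epsilon=0.5,
                                rng=random.Random(42))
```

Halving ε doubles the noise scale, which is the privacy/accuracy trade-off governed by the budget described above.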
Randomization techniques involve introducing randomness into the data or its
presentation to prevent precise inferences about users. User data undergoes pertur-
bation by adding a random value drawn from a predefined distribution to each of
their ratings. Any unknown ratings are replaced with the mean rating. This can make
it more difficult for attackers to link specific data points to individual users.
Bucketization and binning entail the categorization of data into predetermined inter-
vals or bins. Rather than employing exact values, the data is symbolized in terms
of these intervals, aiding in the concealment of distinct user details. This can be
expressed mathematically as:
Given a set of data points D = {x1, x2, …, xn}, the bucketization process assigns
each data point to a specific interval or bin based on defined boundaries. If there
are m bins with boundaries [b1, b2], [b2, b3], …, [bm−1, bm], the data point xi is
assigned to the jth bin if it falls within the interval [bj−1, bj]:

If bj−1 < xi ≤ bj, then xi is assigned to the jth bin. (2)

This representation in terms of bins helps mask the exact values of individual data
points, thus enhancing privacy.
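The assignment rule in Eq. (2) can be implemented with a binary search over the bin boundaries; the boundaries and values below are hypothetical:

```python
import bisect

def bucketize(values, boundaries):
    """Assign each value x to the 1-based bin index j such that
    boundaries[j-1] < x <= boundaries[j], as in Eq. (2). Values outside
    the outermost boundaries raise an error."""
    out = []
    for x in values:
        j = bisect.bisect_left(boundaries, x)
        if j == 0 or j == len(boundaries):
            raise ValueError(f"{x} falls outside the binning range")
        out.append(j)
    return out

ages = [23, 37, 58]
bins = bucketize(ages, boundaries=[18, 30, 45, 65])   # -> [1, 2, 3]
```

Publishing only the bin index (e.g., "age 30-45") instead of the exact value is what conceals the distinct user details mentioned above.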
Federated Learning is like a teamwork approach for training models. Imagine
different devices or servers working together to improve a shared model, but they
keep their private user data to themselves. They only share summaries of what they
have learned, not individual data, which makes it less likely for personal
information to be exposed [17, 18].
Each device holds its own data (D1, D2, …, Dn). Each device updates the shared
model locally, using a small learning rate η and the gradient of the loss on its
own data:

θi = θ − η ∇L(Di, θ). (3)

The updates from all devices are then aggregated to improve the global model,
typically by averaging: θ ← (1/n) Σi θi. This way, the model improves without
anyone sharing their private data directly.
Netflix and Google are prime examples of companies using federated learning
techniques to train machine learning models without consolidating all data in a
centralized location.
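One round of this scheme (essentially FedAvg) can be sketched for a scalar linear model. The model, data, and learning rate below are toy assumptions, not a production federated system:

```python
def local_update(theta, data, eta):
    """One gradient step on a device's private data for the scalar model
    y ≈ theta * x with squared loss, as in Eq. (3). Only the updated
    parameter leaves the device; the raw (x, y) pairs never do."""
    grad = sum(2 * (theta * x - y) * x for x, y in data) / len(data)
    return theta - eta * grad

def federated_round(theta, devices, eta=0.1):
    """One FedAvg-style round: local training, then server-side averaging."""
    updates = [local_update(theta, data, eta) for data in devices]
    return sum(updates) / len(updates)

# three devices, each holding private samples of the relation y = 2x
devices = [[(1.0, 2.0), (2.0, 4.0)],
           [(3.0, 6.0)],
           [(0.5, 1.0), (1.5, 3.0)]]
theta = 0.0
for _ in range(100):
    theta = federated_round(theta, devices)
# theta converges toward the true coefficient 2.0
```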
Let us now explore a few of the cryptographic techniques used to preserve user
privacy in recommendation systems:
Homomorphic encryption (HE) constitutes a cryptographic system that enables
mathematical operations to be performed on encrypted data without revealing the
actual values, resulting in a ciphertext that, when decrypted, yields the correct
result [18, 28, 29]. Homomorphic encryption can fall into different categories: fully
homomorphic encryption (FHE), somewhat homomorphic encryption (SWHE), or
partially homomorphic encryption (PHE) [24–27].
Mathematically:
• Let m1 and m2 be plaintext messages.
• Let E(m) denote the encryption of message m.
• Let ⊕ and ⊗ denote operations (e.g., addition and multiplication) on ciphertexts.
• Let D(E(m)) denote the decryption of the encrypted message E(m).
FHE allows both addition (⊕) and multiplication (⊗) operations on encrypted
data:

D(E(m1) ⊕ E(m2)) = m1 + m2 and D(E(m1) ⊗ E(m2)) = m1 ∗ m2. (4)

SWHE supports both operations, but only for a limited number of computations on
encrypted data.
PHE permits either addition or multiplication on encrypted data, but not both.
The concept of homomorphic encryption is based on mathematical properties that
allow computations on encrypted data to reflect equivalent operations on the plaintext
values, maintaining the confidentiality of the data throughout the computation.
IBM uses HE methods to enable secure computation on encrypted data, thus developing a secure recommendation system capable of processing encrypted user data without compromising privacy.
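To make the additive case concrete, here is a toy Paillier-style scheme in pure Python (Python 3.9+ for math.lcm). The tiny primes are chosen only for readability and are utterly insecure; this is a sketch of the ciphertext-addition property in Eq. (4), not any vendor's implementation.

```python
import math, random

# Tiny demo primes -- far too small for real security, chosen for readability.
p, q = 293, 433
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)                   # Carmichael lambda of n
g = n + 1                                      # standard Paillier generator
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)    # mu = L(g^lam mod n^2)^(-1) mod n

def encrypt(m):
    """E(m) = g^m * r^n mod n^2 with a fresh random r coprime to n."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """D(c) = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) // n."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

m1, m2 = 1234, 5678
c_sum = (encrypt(m1) * encrypt(m2)) % n2   # multiplying ciphertexts...
print(decrypt(c_sum))                       # ...decrypts to m1 + m2 = 6912
```

The homomorphism holds because E(m1)·E(m2) = g^(m1+m2)·(r1·r2)^n mod n², i.e., a valid encryption of m1 + m2 (modulo n).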
Secure multiparty computation (SMC) is a field of cryptography enabling various
participants to collaboratively compute functions on their individual inputs while
keeping their inputs concealed. Instead of revealing their inputs, participants only
gain knowledge of the output. SMC utilizes cryptographic protocols to achieve
this and employs tools such as oblivious transfer, garbled circuits, and additive
homomorphic encryption.
Consider a group of parties P = {P1, P2, …, Pn}, each possessing private inputs x1, x2, …, xn, and desiring to jointly compute a function f(x1, x2, …, xn, R) = (y1, y2, …, yn), where R represents randomness and (y1, y2, …, yn) signifies private outputs for each party. A protocol π for this computation is regarded as supporting secure multiparty computation if it fulfills these conditions [20]:
• Correctness: The protocol enables the parties to accurately compute the function
f.
• Privacy: Each party’s input remains confidential from the others (P1, P2, …, Pn).
• Output for All: The protocol only concludes once every participant has received
an output from the computation.
SMC’s objective is to ensure privacy while enabling effective joint computations,
contributing to secure collaborative analysis without disclosing sensitive inputs.
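A minimal sketch of the idea, assuming an honest-but-curious setting: each party splits its private input into additive shares so that a joint sum can be computed while any individual share looks like uniform random noise. The party inputs are invented; production SMC relies on vetted protocol frameworks.

```python
import random

PRIME = 2**61 - 1  # modulus for additive secret sharing

def share(value, n_parties):
    """Split `value` into n_parties random shares summing to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

inputs = [42, 17, 99]        # the parties' private inputs x1, x2, x3
n = len(inputs)

# Party i splits x_i and sends share j to party j; each party j then adds up
# the shares it received, learning nothing about any individual input.
all_shares = [share(x, n) for x in inputs]
partials = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]

# Publishing the partial sums reveals only the total, never any single input.
total = sum(partials) % PRIME
print(total)   # 158 = 42 + 17 + 99
```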
Secret sharing (SS) serves various purposes like data outsourcing [21] and acts as a
foundational component in many security protocols. It involves distributing different
parts (shares) of a secret among parties so that specific groups of parties can collec-
tively reconstruct the original secret. Secret sharing finds applications in privacy
protection within recommendation systems, cloud computing, sensor networks, and
more.
Imagine a secret S is shared among a group of n shareholders U = {U1, U2, …, Un}. A secret sharing scheme takes the secret S and divides it into n shares, s1, s2, …, sn, with each si being privately allocated to Ui. Here is the key concept: if A is a set of shareholders called a “qualified subset,” meaning that they are allowed access, the secret S can be reconstructed using the shares {si | Ui ∈ A}.
In simpler terms, secret sharing divides a secret into parts and gives those parts to different parties, and only specific groups of parties can come together to uncover the original secret. This technique is handy for maintaining data privacy and enabling secure collaborative operations.
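The threshold flavor of the qualified-subset idea can be illustrated with a small Shamir-style (t, n) scheme over a prime field, where any t of the n shareholders form a qualified subset. The field modulus and secret below are illustrative only.

```python
import random

P = 2**127 - 1   # a Mersenne prime used as the field modulus

def split(secret, t, n):
    """Embed the secret in a random degree-(t-1) polynomial; shares are points."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret from any t shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = split(123456789, t=3, n=5)           # any 3 of 5 shares suffice
print(reconstruct(shares[:3]) == 123456789)   # True
print(reconstruct(shares[2:]) == 123456789)   # True
```

With fewer than t shares, the polynomial (and hence the secret) is information-theoretically undetermined, which is exactly the qualified-subset property described above.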
Attribute-based encryption (ABE) is a form of encryption that distinguishes each
user through a set of attributes. These attributes are used to decide whether the user is capable of decrypting a ciphertext. In this approach, a user’s private key
and the ciphertext are both influenced by attributes. For successful decryption, the
attributes associated with the user and those tied to the ciphertext need to match.
ABE finds application in cryptographic systems for access control and the nuanced
sharing of encrypted data. ABE is categorized into two types: key-policy ABE and
ciphertext-policy ABE [22].
In essence, ABE tailors access to encrypted data based on specific attributes, enhancing the control and granularity of data sharing while maintaining cryptographic security.
Zero-knowledge proof (ZKP) serves as a technique for verifying the authenticity of
entities. This protocol allows proving certain statements without revealing anything
beyond the accuracy of those statements. It involves two participants: a prover aiming
to demonstrate a statement’s validity and a verifier seeking to authenticate the statement in a specific manner. ZKP is a type of interactive proof in which the verifier’s view of the interaction can be efficiently simulated without access to the prover’s secret. The core concept behind zero-knowledge proofs is to compel a user to demonstrate compliance with a specified protocol, promoting honest behavior while preserving privacy [23].
Consider an interactive proof system denoted as ⟨P|V ⟩ for a language L and an
input x. The “view” of the verifier V regarding input x encompasses all messages
transmitted from prover P to verifier V and all the random bits utilized by V
throughout the protocol’s execution on input x. This is represented as

view_V [P(x) ↔ V(x)]. (5)
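A classic concrete instance is the Schnorr proof of knowledge of a discrete logarithm, shown below made non-interactive with the Fiat-Shamir heuristic. The group parameters are toy-sized and illustrative; the point is that the verifier's check succeeds without the secret x ever being revealed.

```python
import hashlib, random

p = 2**127 - 1   # a Mersenne prime; toy-sized group for illustration only
g = 3

x = random.randrange(2, p - 1)   # prover's secret (the "witness")
y = pow(g, x, p)                 # public statement: y = g^x mod p

# Prover: random commitment, hash-derived challenge, response.
k = random.randrange(2, p - 1)
t = pow(g, k, p)
c = int.from_bytes(hashlib.sha256(f"{t}:{y}".encode()).digest(), "big")
s = (k + c * x) % (p - 1)        # exponent arithmetic mod p - 1 (Fermat)

# Verifier sees only (t, c, s) and y, yet can check the claim:
print(pow(g, s, p) == (t * pow(y, c, p)) % p)   # True
```

The check works because g^s = g^(k + c·x) = t · y^c (mod p); the transcript reveals nothing about x beyond the truth of the statement, matching the "view" notion in Eq. (5).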

Pseudonymization is a privacy-enhancing method aimed at safeguarding users’ privacy. This technique entails substituting identifiable user attributes with pseudonyms [30, 31], ensuring user anonymity while retaining recognition through
pseudonyms. Pseudonymization serves to safeguard sensitive user details, including
characteristics like age, gender, and location, along with explicit and implicit pref-
erences. Numerous studies have delved into the application of pseudonymization in
this context [32, 33].
The benefits of pseudonymization encompass heightened user privacy, as user
data remains unstored, with only pseudonyms serving as identifiers. Additionally,
pseudonymization contributes to enhanced FRS scalability by reducing data storage
requirements and enabling data segmentation into smaller units [34].
A major drawback is the difficulty in precisely linking a user’s pseudonym to their
real identity and preferences. Additionally, pseudonymized data could be vulnerable
to dictionary and re-identification attacks.
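One simple way to realize pseudonymization, sketched under the assumption of a server-held secret key, is to derive pseudonyms with a keyed hash (HMAC). Keying the hash blunts the dictionary attacks noted above, since an attacker without the key cannot hash candidate identities. The key and record fields are invented for the example.

```python
import hmac, hashlib

SECRET_KEY = b"server-side-pepper"   # hypothetical key, stored apart from data

def pseudonymize(user_id: str) -> str:
    """Stable, non-reversible pseudonym: HMAC-SHA256, truncated for readability."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

record = {"user": "alice@example.com", "age": 31, "preference": "thrillers"}
safe_record = {**record, "user": pseudonymize(record["user"])}

print(safe_record["user"] != record["user"])                      # True
print(pseudonymize("alice@example.com") == safe_record["user"])   # True: stable
```

The stable mapping preserves the ability to link a pseudonymous user's interactions for recommendation purposes, while the raw identifier never needs to be stored.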
328 E. Gupta and S. Shinde

3 Challenges Encountered While Using Privacy Preserving Techniques

Non-cryptographic privacy techniques are generally more scalable and involve lower computational and communication costs, but their security is more difficult to establish. This is because their effectiveness hinges on the randomness and anonymity of data [35], factors that could potentially be exploited by sophisticated inference attacks capable of re-identifying individuals within a dataset. Another significant
challenge is finding the right balance between maintaining privacy and preserving
data utility. The introduction of noise into data, while enhancing privacy, can also
result in the loss of critical information that contributes to generating more precise
recommendations. Thus, there exists a necessity for a meticulous and thoughtful
selection of a threshold that harmonizes privacy and data utility, avoiding the pitfall
of sacrificing one for the other.
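The noise/utility tension can be made concrete with the standard Laplace mechanism for a count query: the noise scale is 1/ε, so stronger privacy (smaller ε) means larger expected error. The ε values and sample counts below are illustrative.

```python
import math, random

def laplace_noise(scale):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    e1 = -scale * math.log(1.0 - random.random())
    e2 = -scale * math.log(1.0 - random.random())
    return e1 - e2

def private_count(true_count, epsilon):
    """A count query has sensitivity 1, so the Laplace noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

true_count = 1000
for eps in (0.1, 1.0, 10.0):
    errs = [abs(private_count(true_count, eps) - true_count) for _ in range(2000)]
    print(eps, round(sum(errs) / len(errs), 2))   # mean |error| is roughly 1/eps
```

Choosing ε is precisely the threshold selection discussed above: ε = 0.1 hides individuals well but perturbs counts by about 10, while ε = 10 barely perturbs them and offers weak protection.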
Using cryptographic-based privacy techniques is a secure approach to safe-
guarding user privacy within recommendation systems. These techniques effectively
manage privacy concerns without sacrificing accuracy [19]. However, it is impor-
tant to note that cryptographic methods tend to require more computational power
and communication resources. This can pose challenges, especially on devices with
limited capabilities like handheld devices.
To make cryptographic techniques practical for privacy protection, it is bene-
ficial to adopt lightweight cryptographic schemes that reduce computational and
communication overhead. An avenue worth exploring is integrating machine learning
technology.
Despite the various advantages that cryptographic techniques offer in terms of privacy and accurate recommendations, scalability remains a challenge. Recommendation systems that implement cryptographic-based approaches may struggle
with handling massive volumes of data while maintaining privacy, particularly in
fast-paced online settings. The scalability issue of managing extensive data while
preserving privacy within a short response time is a concern that requires further
exploration.

4 Current Privacy Preservation Techniques, Description, Limitations, and Challenges

In the realm of privacy-aware recommendation systems, safeguarding users’
privacy is commonly achieved through various forms of implementation, as per
the given survey. One such approach involves implementing independent encryption
protocols during data collection and exchange [35]. To exemplify, Beg et al. [36] proposed a novel approach rooted in a reversible data transformation algorithm, ensuring privacy preservation in data collection for mobile app recommendation systems. In this protocol, a user’s data is securely transmitted within a user group using encryption, eliminating the necessity for direct communication between the user and the data collector.
In contrast to the data encryption methods, Liu et al. [37] take a divergent approach by concentrating on the development of the recommendation algorithm itself, all the while minimizing the reliance on private user data. This approach complements the data encryption strategies discussed earlier.
Additionally, Sanchez et al. delved into users’ preferences regarding privacy
permissions within the fitness domain. They proffered a series of strategies for users
to configure permissions in alignment with the data that is collected and shared [35].
Their research revealed that users are most accepting of privacy permissions related
to gender and fitness preferences. However, they showed reluctance in sharing data
regarding their height, weight, age, and social network information.
Other privacy preservation techniques, together with their limitations and challenges, are described in Table 1.

5 Privacy Preservation Solutions in Recommender Systems

The discourse surrounding the potential risks of personal data exposure through
recommender systems naturally steers us toward the "defender" stance—that is, how
can recommenders uphold user privacy while maintaining recommendation quality?
This review delves into three distinct categories of approaches that can effectively
tackle privacy concerns within recommender systems:
Architectural and Platform Solutions
This first category encompasses architectures, platforms, and standards designed to
minimize the threat of data leakage. These measures include various protocols and
certificates that assure users of adherence to privacy-preserving practices. By doing
so, these approaches restrict external parties’ capacity to access or infer unauthorized
data. Additionally, distributed architectures are included in this category, as they
eliminate the vulnerability associated with centralized recommenders.
Algorithmic Techniques for Data Protection
The second category revolves around algorithmic techniques that safeguard data.
Within this category, various methods can be categorized. Some involve modifying
data, either by anonymizing user identities or transforming rating data. Others exploit
differential privacy frameworks or cryptographic tools for data protection. The core
concept underlying these techniques is that even if personal data leaks, adversaries
would only possess modified or encrypted information, rendering recovery of the
original data difficult.

Table 1 Privacy preservation techniques

1. Differential privacy
   Description: A technique that adds noise to the data to protect individual privacy while still allowing for useful analysis.
   Limitation and challenges: Balancing privacy protection with data utility and accuracy can be a complex trade-off in differential privacy.

2. Secure multiparty computation
   Description: A technique that enables multiple parties to collectively compute a function without disclosing their individual inputs to each other.
   Limitation and challenges: It can be computationally intensive and resource-demanding, potentially hindering its scalability and practicality in some applications.

3. Homomorphic encryption
   Description: A cryptographic technique enabling computations on encrypted data without exposing the data’s plaintext, ensuring privacy in computations.
   Limitation and challenges: It significantly slows down data processing, limiting its real-time or high-performance applications.

4. Secret sharing
   Description: A technique where a secret is divided into multiple shares, distributed among participants, and a threshold number of shares is required to reconstruct the original secret.
   Limitation and challenges: Ensuring the secure distribution and storage of shares is a challenge, as the compromise or loss of shares can compromise the secret.

5. Zero-knowledge proof
   Description: A cryptographic method enabling one party to exhibit knowledge of a secret without revealing the secret itself.
   Limitation and challenges: Zero-knowledge proof protocols are computationally demanding, limiting their practicality in resource-constrained environments.

Policy and Regulatory Approaches


The third category focuses on legislative, policy, and regulatory interventions that
can be imposed on recommendation services. These regulations may restrict data
manipulation, sharing, or trading. However, the efficacy of such regulations varies
across jurisdictions and is often challenging to enforce in practice.
A central rationale for this classification lies in the division between technical
and non-technical solutions. Technical solutions comprise both architectural and
algorithmic mechanisms, while non-technical solutions pertain solely to policy solu-
tions. The former provides either a foundational infrastructure for privacy support or
specific algorithms for data protection. On the contrary, the latter offers guidelines
dictating permissible and prohibited activities concerning personal user data.
Furthermore, it is worth noting that while these categories seem independent, the
adoption of multiple approaches is often advantageous for robust privacy protection.
Exploring Comprehensive Privacy Solutions for Enhancing … 331

Hence, we recommend that recommender system designers contemplate all three categories when devising privacy preservation mechanisms.
For illustration, consider the scenario of a large-scale eCommerce website deliv-
ering personalized recommendations. Such a platform might employ architectural
solutions to distribute data storage while employing algorithmic techniques to
facilitate cryptography-protected data access. Simultaneously, the website could
communicate compliance with privacy regulations to build user trust.
In conclusion, the multifaceted nature of privacy concerns demands a compre-
hensive approach, involving architectural, algorithmic, and policy solutions, for
safeguarding user privacy within recommender systems.

6 Conclusion

In the realm of recommendation systems, the preservation of user privacy is a paramount concern. This review has shed light on various privacy-preserving techniques, categorizing them into architectural, algorithmic, and policy-based solutions.
Each category presents unique advantages and challenges, offering a multifaceted
approach to tackling privacy risks. While architectural solutions and policy regula-
tions emphasize minimizing data leakage and defining permissible practices, algo-
rithmic techniques focus on data transformation and encryption to render sensitive
information indecipherable.
The comprehensive analysis of non-cryptographic and cryptographic techniques
underscores their significance in striking a balance between privacy and recommen-
dation accuracy. Non-cryptographic approaches, such as anonymization, bucketiza-
tion, and perturbation, aim to obscure personal information through data manipula-
tion. On the other hand, cryptographic methods, like homomorphic encryption, secure
multiparty computation, and zero-knowledge proofs, harness advanced mathematical
concepts to provide robust privacy guarantees.

6.1 Future Prospects

The landscape of recommendation systems and user privacy continues to evolve, prompting the exploration of further advancements and enhancements in privacy-preserving techniques. The review opens the door to several potential avenues for
future research:
Hybrid Approaches: Combining non-cryptographic and cryptographic techniques
could provide a more comprehensive and tailored privacy solution. Research could
investigate synergistic strategies that leverage the strengths of both categories to
address specific privacy concerns effectively.

Scalable Cryptographic Solutions: As the demand for privacy-preserving recommendation systems grows, the development of lightweight cryptographic schemes
becomes critical. Future research could focus on optimizing cryptographic proto-
cols to reduce computational and communication overhead while maintaining strong
privacy assurances.
Real-World Deployment: Extensive empirical evaluations and case studies are
essential to assess the practical feasibility of various privacy techniques in real-world
recommendation systems. This includes assessing their impact on recommendation
quality, system performance, and user experience.
User-Centric Privacy: Future research could emphasize user-centric privacy by
empowering users to control their data sharing and privacy preferences actively.
Investigating methods to empower users while ensuring accurate recommendations
is a promising avenue.
Ethical Considerations: Addressing ethical concerns around data collection,
storage, and usage within recommendation systems is crucial. Ethical issues for
future research include:

• The trade-off between privacy and accuracy: Sometimes, maintaining privacy can
affect accuracy, obscuring useful information for recommendations.
• The potential for discrimination: Privacy-preserving techniques may introduce
bias by concealing information about specific group members.
• The need for transparency: Enhancing transparency about user data usage
fosters user trust and encourages the sharing of accurate information, ultimately
improving personalized recommendations.

Future research could explore the integration of ethical frameworks to guide privacy-preserving practices.
Interdisciplinary Collaborations: Collaboration between experts in computer
science, cryptography, law, ethics, and user experience design can yield holistic
solutions that consider technical, legal, and ethical dimensions of privacy in
recommendation systems.
In summary, this review serves as a foundation for future research endeavors in
the domain of privacy preservation in recommendation systems. By delving into the
intricacies of privacy techniques and their implications, researchers can pave the way
for more secure, accurate, and user-friendly recommendation systems that prioritize
user privacy in the digital age.

References

1. Esteban A, Zafra A, Romero C (2020) Helping university students to choose elective courses by
using a hybrid multi-criteria recommendation system with genetic optimization. Knowl-Based
Syst 194:105385
2. Mondal S, Basu A, Mukherjee N (2020) Building a trust-based doctor recommendation system
on top of a multilayer graph database. J Biomed Inform 110:103549
3. Dhelim S, Ning H, Aung N, Huang R, Ma J (2021) Personality-aware product recommendation
system based on user interests mining and meta path discovery. IEEE Trans Computer Soc Syst.
8:86–98
4. Bhalse N, Thakur R (2021) Algorithm for movie recommendation system using collaborative
filtering. Mater Today: Proc. https://doi.org/10.1016/j.matpr.2021.01.235
5. Ke G, Du HL, Chen YC (2021) Cross-platform dynamic goods recommendation system based
on reinforcement learning and social networks. Appl Soft Comput 104:107
6. Mohallick I, Özgöbek Ö (2017) Exploring privacy concerns in news recommendation systems.
In: Proceedings of the international conference on web intelligence (WI’17). ACM, New York,
pp 1054–1061
7. Mehmood A, Natgunanathan I, Xiang Y, Hua G, Guo S (2016) Protection of big data privacy.
IEEE Access 4:1821–1834
8. Isinkaye FO, Folajimi YO, Ojokoh BA (2015) Recommendation systems: principles, methods,
and evaluation. Egypt Inform J 16:261–273
9. Tang Q, Wang J (2016) Privacy-preserving friendship-based recommendation systems. IEEE
Trans Dependable Secure Comput 5971:1
10. Huang W, Liu B, Tang H (2019) Privacy protection for recommendation system: a survey. J
Phys Conf Ser 1325:012087
11. Al-Nazzawi TS, Alotaibi RM, Hamza N (2018) Toward privacy protection for location-based
recommendation systems: a survey of the state-of-the-art. In: The 1st IEEE international
conference on computer applications & information security (ICCAIS), pp 1–7
12. Saleem Y, Rehmani MH, Crespi N, Minerva R (2021) Parking recommender system privacy
preservation through anonymization and differential privacy. Eng. Rep. 3(2):12297
13. Luo Z, Chen S, Li A (2013) A distributed anonymization scheme for privacy-preserving recom-
mendation systems. IEEE 4th international conference on software engineering and service
science, pp 491–494
14. Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M (2006) L-diversity: privacy beyond k-anonymity. In: Proceedings of the 22nd international conference on data engineering (ICDE 2006)
15. Li N, Li T, Venkatasubramanian S (2007) T-closeness: privacy beyond k-anonymity and l-
diversity. In: Paper presented at: proceedings of the IEEE 23rd international conference on
data engineering, pp 106–115
16. Ogunseyi T, Avoussoukpo C, Jiang Y (2021) A systematic review of privacy techniques in recommendation systems. Int J Inform Secur 1–14. https://doi.org/10.1007/s10207-023-00710-1
17. Li Q, Wen Z, Wu Z, Hu S, Wang N, Li Y et al (2021) A survey on federated learning systems:
vision, hype, and reality for data privacy and protection. IEEE Trans Knowl Data Eng
18. Zhang C, Xie Y, Bai H, Yu B, Li W, Gao Y (2021) A survey on federated learning. Knowl
Based Syst 216:106775
19. Elnabarawy I, Jiang W, Wunsch DC (2020) Survey of privacy-preserving collaborative filtering.
arXiv preprint. arXiv:2003.08343
20. Yousuf H, Lahzi M, Salloum SA, Shaalan K (2021) Systematic review on fully homomorphic
encryption scheme and its application. Recent Adv Intell Syst Smart Appl 537–551
21. Harn L, Hsu C, Zhang M, He T, Zhang M (2016) Realizing secret sharing with general access
structure. Inf Sci 367:209–220
22. Zhang Y, Deng RH, Xu S, Sun J, Li Q, Zheng D (2020) Attribute-based encryption for cloud
computing access control: a survey. ACM Comput Surv 53(4):1–41

23. Bouland A, Chen L, Holden D, Thaler J, Vasudevan PN (2017) On the power of statistical zero-knowledge. Annu Symp Found Comput Sci Proc 140:708–719
24. Zhang M, Chen Y, Lin J (2021) A privacy-preserving optimization of neighborhood-based
recommendation for medical-aided diagnosis and treatment. IEEE Internet of Things J
8(13):10830–10842
25. Cui L, Wang X, Gu T (2023) A generic data synthesis framework for privacy-preserving point-
of-interest recommender systems. ACM, ISBN 979-8-4007-0228-0/23/08, RACS’23, August
6–10
26. Wang Y, Ma W, Zhang M, Liu Y, Ma S (2023) A survey on the fairness of recommender
systems. ACM Trans Inf Syst 41(3), Article 52
27. Asad M, Shaukat S, Javanmardi E, Nakazato J, Tsukada M (2023) A comprehensive survey on
privacy-preserving techniques in federated recommendation systems. Appl Sci 13(10):6201
28. Amarsingh Feroz C, Lakshmi Narayanan K, Kannan A, Santhana Krishnan R, Harold Robinson
Y, Precila K (2022) Enhancement of data between devices in Wi-Fi networks using security
key. In: Majhi S, Prado RPD, Dasanapura Nanjundaiah C (eds) Distributed computing and
optimization techniques. Lecture Notes in Electrical Engineering, vol 903. Springer, Singapore.
https://doi.org/10.1007/978-981-19-2281-7_42
29. Peyvandi A, Majidi B, Peyvandi S, Patra JC (2022) Privacy-preserving federated learning for
scalable and high data quality computational-intelligence-as-a-service in Society 5.0. Multimed
Tools Appl 81:25029–25050
30. Ribeiro SL, Nakamura ET (2019) Privacy protection with pseudonymization and anonymiza-
tion in a health IoT system: results from Ocariot. In: Proceedings of the 2019 IEEE 19th
international conference on bioinformatics and bioengineering (BIBE), Athens, Greece, pp
904–908
31. Khalfoun B, Ben Mokhtar S, Bouchenak S, Nitu V (2021) EDEN: enforcing location privacy
through re-identification risk assessment: a federated learning approach. Proc ACM Interact
Mob Wearable Ubiquitous Technol 5:1–25
32. Choudhury A, Sun C, Dekker A, Dumontier M, van Soest J (2022) Privacy-preserving federated
data analysis: data sharing, protection, and bioethics in healthcare. In: Machine and deep
learning in oncology, medical physics and radiology. Springer, Cham, pp 135–172
33. Röhrig R (2021) A federated record linkage algorithm for secure medical data sharing. In:
Proceedings of the German medical data sciences: bringing data to life: proceedings of the joint
annual meeting of the German Association of Medical Informatics, Biometry and Epidemiology
(GMDS EV) and the Central European Network-International Biometric Society (CEN-IBS),
Berlin, Germany, 6–11; IOS Press, Amsterdam, vol 278, p 142
34. Pramod D (2023) Privacy-preserving techniques in recommender systems: state-of-the-art
review and future research agenda. Data Technol Appl 57(1):32–55. https://doi.org/10.1108/
DTA-02-2022-0083
35. Sanchez OR, Torre I, He Y, Knijnenburg BP (2020) A recommendation approach for user
privacy preferences in the fitness domain. User Model User-Adap Inter 30(3):513–565. https://
doi.org/10.1007/s11257-019-09246-3
36. Beg S, Anjum A, Ahmad M, Hussain S, Ahmad G, Khan S, Choo KKR (2021) A privacy-preserving protocol for continuous and dynamic data collection in IoT enabled mobile app recommendation system (MARS). J Netw Comput Appl 174:102874. https://doi.org/10.1016/j.jnca.2020.102874
37. Liu X, Gao B, Suleiman B, You H, Ma Z, Liu Y, Anaissi A (2023) Privacy-preserving person-
alized fitness recommender system (P3FitRec): a multi-level deep learning approach. arXiv:
2203.12200v1[cs.AI]
Attribute-Based Encryption
for the Internet of Things: A Review

Kirti Dinkar More and Dhanya Pramod

Abstract The study presents a thorough analysis of existing research papers, conference papers, and other pertinent publications related to Attribute-Based
Encryption for IoT. Due to the variety and dynamism of connected devices and data,
the Internet of Things (IoT) poses new security issues. With the aim of safeguarding
IoT environments, the advanced cryptography technique Attribute-Based Encryption
(ABE) facilitates fine-grained access control according to certain attributes. Tradi-
tional access control strategies frequently rely on user identities or roles, which
in some circumstances can be rigid and difficult to manage. These restrictions
are addressed by Attribute-Based Encryption (ABE), which allows access to data
depending on certain attributes. In-depth analysis of ABE in the context of IoT is
provided in this paper, starting with its foundational ideas. The review employs a systematic methodology to locate, choose, and study the literature while providing insights into security challenges, the importance of access control, ABE variants and their applicability for IoT, ABE
security models, ABE libraries or frameworks, and conclusion with potential future
directions. This review is useful for researchers or academicians working in the field
of IoT security using ABE scheme.

Keywords Attribute-Based Encryption · CP-ABE · Internet of Things · KP-ABE · Security challenges

K. D. More (B)
Department of Computer Science, MVP’s K. T. H. M. College, Nashik, India
e-mail: kirtimore@kthmcollege.ac.in
D. Pramod
Department of Computer Studies, Symbiosis Centre for Information Technology, Symbiosis
International (Deemed) University, Pune, India
e-mail: dhanya@scit.edu

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 335
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_26
336 K. D. More and D. Pramod

1 Introduction

1.1 Overview of the Internet of Things (IoT)

A revolutionary concept known as the Internet of Things (IoT) enables computers, physical devices, and equipment to be connected to the Internet and to
communicate, gather data, and interact with other systems. IoT includes a wide
spectrum of intelligent gadgets, from basic sensors and actuators to sophisticated
appliances and machinery. Automation, data-driven decision-making, and real-time
monitoring, in a variety of sectors, are made possible by these interconnected devices
collecting and exchanging data. The IoT ecosystem makes it possible for devices
to communicate and exchange data seamlessly, which improves productivity, auto-
mates processes, and makes everyday life and functioning of numerous industries
more convenient. The phrase “Internet of Things” (IoT) is taking over the word
“Internet,” promising yet another decade of astonishing developments. However, the
IoT is extending its reach to your car, office, home, and all the devices within them,
including utility meters, streetlights, sprinkler systems, bathroom scales, and even
walls. This interconnectedness will lead to various enhancements, such as adjusting
your home’s heating based on weather forecasts, automatically watering your garden
only when necessary, and providing immediate assistance while on the road. These
advancements aim to simplify our lives and optimize the use of natural resources [1].
The capabilities, usefulness, and influence of IoT solutions are significantly
shaped by the fundamental characteristics of the Internet of Things as depicted in
Fig. 1. Recognizing and focusing on these basic characteristics are important for
building effective and productive Internet of Things solutions ranging from multiple
domains. These key features make it possible to build intelligent, networked systems
that improve productivity, allow for data-driven decision-making, and have favorable
effects on the environment, society, and economy.

1.2 Benefits of IoT

Efficiency and productivity improvements, cost savings, enhanced customer experiences, data-driven decision-making, remote monitoring and management, enhanced
safety and security, sustainability and environmental impact, innovative business
models, supply chain optimization, automation and advancements in various fields
like healthcare, agriculture, industry, etc., are just a few of the many benefits that the
Internet of Things (IoT) offers across various sectors. IoT-based predictive analytics
offers accessibility and convenience, competitive advantage, and the ability to iden-
tify potential trends and possible problems. Task automation with Internet of Things
devices boosts productivity and efficiency. Improved resource efficiency and oper-
ational optimization are made possible by real-time monitoring and data analytics.
Attribute-Based Encryption for the Internet of Things: A Review 337

Fig. 1 Key characteristics of IoT

Real-time data collection through connected devices and its analysis creates novel
chances for effective and enhanced quality of life.

2 Security Challenges in IoT

Despite the numerous benefits IoT offers, it also introduces significant security chal-
lenges due to its vast and diverse landscape of connected devices. Possible challenges
can be device vulnerabilities, insecure communication, data privacy, authentication
and authorization, firmware and software updates, Distributed Denial-of-Service
(DDoS) attacks, physical security, supply chain security, regulatory compliance,
lack of standardization, etc. Many IoT devices have inadequate security measures
along with limited resources, which make them vulnerable to attacks. Inadequate
encryption and authentication mechanisms during data transmission can lead to data
interception and manipulation. IoT devices often collect and process sensitive data,
making data privacy a major concern. Unauthorized access to such data can lead to
serious privacy breaches. Weak or non-existent authentication mechanisms can result
in unauthorized access to IoT devices or systems. Lack of timely updates and secu-
rity patches for IoT devices can leave them vulnerable to known threats. Distributed
Denial-of-Service (DDoS) attacks can be launched via large-scale botnets created
by compromising IoT devices, causing disruptions to networks and services. Addi-
tional security concerns are created by physical access to IoT devices, which can
result in sensitive data being altered or stolen. Backdoors or vulnerabilities in IoT
devices may result from weaknesses in the supply chain. IoT devices must adhere to
data protection laws and guidelines while collecting private information or sensitive
338 K. D. More and D. Pramod

data. Security flaws may develop as a result of inconsistent security procedures and
standards or protocols among Internet of Things systems and devices. The Internet
of Things is transforming industries by enabling improved automation, surveillance,
and control, which eventually culminates in increased productivity and efficiency. This can be a useful resource for studying how industrial markets function using the Internet of Things [2]. A multifaceted strategy that includes end users, policymakers, developers, and manufacturers is necessary to address these security issues. To create a
safe and secure Internet of Things ecosystem, it is crucial to provide robust authen-
tication, encryption, secure communication protocols, access control mechanisms,
and regular security updates. As IoT evolves rapidly, addressing security issues will remain a primary priority in order to realize this game-changing technology's full potential.

2.1 Importance of Access Control in Securing IoT

Heterogeneous IoT ecosystem security relies on access control, which is a fundamental security strategy. Among the reasons access control is necessary are preventing unauthorized access, protecting data privacy and confidentiality, mitigating insider threats, defending against cyber-attacks, regulatory compliance, secure device management, granular control and flexibility, ensuring safety and integrity, trust and reliability, and protection against IoT botnets. Access control
supports the application of the principle of least privilege by imposing fine-grained
access control based on user roles, attributes, or identities. This decreases the possibility of data leaks and inappropriate use of sensitive IoT data by preventing unauthorized access. The authors of [3] suggest an “Anonymous Decentralized Attribute-Based Access Control” approach that prioritizes user privacy and secure
data access. In the context of the Internet of Things, the article emphasizes the
need for scalable and effective access control methods and offers the approach they
use as a solution. The scheme employs Attribute-Based Encryption techniques and
a decentralized architecture to ensure authorized data sharing while maintaining
the anonymity of data owners and users. While the paper effectively discusses its
technical details, it could further enhance its impact by including real-world use
cases and comparing the proposed approach to existing methods [3]. As IoT devices
are deployed in various environments, including health care, smart homes, and
industrial control systems, the need for granular access control becomes critical.
Proper authentication and authorization mechanisms can prevent unauthorized devices or users from gaining access to network resources, reducing the risk of data tampering, privacy violations, and malicious activities. Access control facilitates
meeting regulatory standards for data protection and privacy, which are crucial in
IoT deployments containing sensitive or personal data. In order to provide an outline
of the current scenario of IoT security, the study [4] emphasizes existing solutions,
protocols, and areas that need more research. The authors start by acknowledging
the security problems raised by the rapid growth of IoT devices. They draw atten-
tion to the particular difficulties created by the distributive and heterogeneous IoT
ecosystems. The study then conducts a thorough analysis of several security techniques and approaches that have been presented to address Internet of Things
security challenges. This comprises secure communication protocols, data encryp-
tion, access control, and authentication. They address the benefits and drawbacks
of these methods and provide information on how well they are suited for various
IoT scenarios. The article also explores open research problems about IoT security,
highlighting areas that require more research. These lingering issues cover a variety
of subjects, including safe device onboarding, privacy-preserving systems, intrusion
detection, and assuring end-to-end security in IoT environments. The overview of
the IoT security architecture which currently exists, the issues that IoT security faces,
and potential solutions to these issues are addressed in [5]. The authors list several
risks that exist such as data breaches due to unauthorized access, and potential IoT
device manipulation, while identifying the challenges involved. Additionally, they
go through the threat of Distributed Denial-of-Service (DDoS) attacks, the absence
of established security protocols, and the challenge of maintaining security in devices
with limited resources. The study suggests the need for robust authentication mech-
anisms, secure communication protocols, and proper encryption techniques to safe-
guard data transmission and storage. The authors also advocate for continuous moni-
toring and updating of IoT device’s security measures and emphasize the importance
of educating both manufacturers and users about IoT security best practices [5].
Overall, access control is a foundational aspect of IoT security. By effec-
tively implementing access control mechanisms, IoT stakeholders can significantly
enhance the security and trustworthiness of their deployments, mitigating potential
risks and enabling the full potential of IoT technology.
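The attribute-based, least-privilege checks discussed above can be illustrated with a minimal sketch of an attribute-based access control (ABAC) decision. The resource names and attribute labels below are hypothetical, and a real deployment would combine such checks with authentication and encryption:

```python
# Minimal sketch of attribute-based access control (ABAC) enforcing least
# privilege: a request is granted only when the caller holds every attribute
# the resource's policy requires. Resource and attribute names are illustrative.
from typing import Dict, FrozenSet

# Each resource maps to the set of attributes required to access it.
POLICIES: Dict[str, FrozenSet[str]] = {
    "thermostat/config": frozenset({"role:admin", "site:plant-a"}),
    "thermostat/read":   frozenset({"site:plant-a"}),
}

def is_allowed(resource: str, attrs: FrozenSet[str]) -> bool:
    """Grant access iff the caller's attributes cover the required set."""
    required = POLICIES.get(resource)
    return required is not None and required <= attrs

operator = frozenset({"role:operator", "site:plant-a"})
print(is_allowed("thermostat/read", operator))    # True: site attribute suffices
print(is_allowed("thermostat/config", operator))  # False: lacks role:admin
```

Fine-grained policies of this kind are exactly what ABE enforces cryptographically rather than through a trusted reference monitor.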

3 Attribute-Based Encryption for IoT

Public key cryptography is among the fundamental techniques for data security. The sender encrypts the data using the recipient's public key, and the recipient decrypts it using the corresponding private key. As long as the recipient keeps the private key secret, the communication is bound to that specific user, ensuring that only the intended recipient is able to decipher the message. Applications such as social networks and cloud storage, however, enable communication among groups of people who share certain interests or attributes. Here it is necessary to restrict decryption so that only users with certain characteristics can decode and read the message. Since the recipients' identities cannot be known beforehand, the traditional technique of public key cryptography cannot be employed directly [6]. Using the advanced cryptographic
approach known as Attribute-Based Encryption, owners of data can encrypt data
and specify access policies according to particular attributes. These attributes can
be related to user attributes, device properties, or any other relevant characteristics.
Through the implementation of ABE, fine-grained access control is made possible,
allowing only users or devices with attributes that meet the requirements of the access
policy to decrypt and access encrypted data. When used in the context of IoT, ABE
enables content owners to encrypt IoT data with access policies depending on the
attributes held by users or devices in the IoT ecosystem. This granular control over
data access is especially valuable in diverse and dynamic IoT environments, where
various entities may require access to specific data based on specific conditions.
Given the substantial volume and significance of information hosted on online plat-
forms, there is a growing concern regarding the potential compromise of personal
data. This unease is exacerbated by a surge in recent cyber-attacks and the legal
pressures confronting such services. One potential solution to these challenges is the
adoption of data encryption, which would limit information loss in case of a breach.
However, encrypting data has drawbacks, particularly in terms of the user's ability to
selectively exchange encrypted data at a detailed level. This issue is exemplified by a
scenario where a user aims to grant decryption access for specific Internet traffic logs
to a party based on specific conditions. Current options either entail decrypting entries
individually or sharing the private decryption key, both of which have downsides. A
significant context where these challenges manifest is audit logs. The work of Sahai
and Waters [7] introduces an approach to mitigate this issue through Attribute-Based
Encryption. User’s keys and ciphertexts are associated with descriptive attributes
in an ABE system, allowing a key to do decryption of ciphertext only if attributes
match among them. Sahai and Waters’ cryptosystem enables decryption when a
certain number of overlapping attributes exist between a private key and ciphertext.
Although useful for error-tolerant encryption with biometrics, limitations in expressibility appear to hinder its suitability for larger systems [8]. The article [9] discusses
Attribute-Based Encryption (ABE) schemes with constant-size ciphertexts. Tradi-
tional ABE schemes often produce ciphertexts that grow in size with the number of
attributes, which can be inefficient. The article explores advanced ABE techniques
that maintain constant ciphertext size, regardless of the number of attributes involved.
These schemes are particularly useful for applications where efficient and compact
data encryption is essential, such as secure data sharing in resource-constrained
environments or cloud computing. ABE has two important subtypes: Ciphertext-Policy Attribute-Based Encryption (CP-ABE) and Key-Policy Attribute-Based Encryption (KP-ABE). In KP-ABE, each encrypted data piece, or ciphertext, is associated with a set of descriptive attributes. Every user's private key, in turn, is associated with an access policy or structure that specifies the attributes the key needs in order to decrypt a given ciphertext. This access policy, referred to as the “key policy,” defines the conditions under which a user can access specific encrypted data. In CP-ABE, each ciphertext is associated with a policy, frequently referred to as a “cipher policy,” that defines the attributes required for a user's private key to decrypt it. Users are given private keys associated with certain attributes, and they can decrypt a ciphertext only if their attributes fulfill its cipher policy. The fundamental
difference between CP-ABE and other Attribute-Based Encryption methods like
Key-Policy Attribute-Based Encryption exists at the point where access control is
imposed. CP-ABE is a flexible method for ensuring fine-grained access control on encrypted data, since the access control mechanism is encoded inside the ciphertext
itself. The system developed in [8] is labeled KP-ABE: each ciphertext is associated with descriptive attributes and each private key is linked to an access structure determining which ciphertexts the user can decrypt. Notably, the system in [8] explicitly prevents collusion between users holding different keys. They adapt techniques
from previous work [7] to handle complex scenarios, ensuring that collusion does not
compromise security. This cryptographic system enables fine-grained access control,
useful for applications like sharing audit log data. Moreover, they introduce a delega-
tion mechanism allowing users with a less restrictive access structure to derive keys
for more constrained access structures. Surprisingly, this construction encompasses
Hierarchical Identity-Based Encryption in its properties.
Both Attribute-Based Encryption variants, i.e., Key-Policy ABE and Ciphertext-Policy ABE, are presented in the following basic description, which conveys the processes involved in Attribute-Based Encryption. The KP-ABE and CP-ABE techniques are depicted in Figs. 2 and 3,
respectively.

3.1 Key-Policy ABE (KP-ABE)

Four algorithms make up the Key-Policy Attribute-Based Encryption (KP-ABE) system, which accepts a number of inputs including the access structure (A), which comprises a set of permitted attributes.
(1) Setup → (PK, MK): This algorithm takes no input other than the implicit security parameters, and it outputs the public parameters (PK) and a master key (MK).
(2) Encryption (m, S, PK) → CT: This algorithm takes as input a message (m), the public parameters (PK), and an attribute set (S). It produces a ciphertext (CT).
(3) Key Generation (A, MK, PK) → D: An access structure (A), along with the public parameters (PK) and the master key (MK), are the inputs to this randomized algorithm. It provides the decryption key (D) as output.
(4) Decryption (CT, D, PK) → m: The ciphertext (CT), which was encrypted under a set of attributes (S), is the input for this algorithm, together with the public parameters (PK) and the decryption key (D) for access structure (A). If S ∈ A, the message (m) is output [8].
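The four KP-ABE algorithms above can be sketched as a toy, non-cryptographic Python simulation. Real KP-ABE relies on pairing-based cryptography (implemented, for example, in the Charm-Crypto library), so the “ciphertext” below is only a labeled container that makes the data flow between Setup, Encryption, Key Generation, and Decryption visible; all names and attributes are illustrative:

```python
# Toy, NON-cryptographic simulation of the four KP-ABE algorithms.
# The key carries the access structure A (as a predicate over attribute sets),
# and the ciphertext is labeled with the attribute set S it was encrypted under.
import secrets
from typing import Callable, FrozenSet

Policy = Callable[[FrozenSet[str]], bool]  # access structure A as a predicate

def setup():
    """Setup -> (PK, MK): public parameters and a master key."""
    return "public-params", secrets.token_hex(16)

def encrypt(m: str, attrs: FrozenSet[str], pk: str) -> dict:
    """Encryption(m, S, PK) -> CT: ciphertext labeled with attribute set S."""
    return {"payload": m, "attrs": attrs}

def keygen(policy: Policy, mk: str, pk: str) -> dict:
    """Key Generation(A, MK, PK) -> D: decryption key embedding A."""
    return {"policy": policy}

def decrypt(ct: dict, d: dict, pk: str) -> str:
    """Decryption(CT, D, PK) -> m, succeeding only if S satisfies A."""
    if d["policy"](ct["attrs"]):
        return ct["payload"]
    raise PermissionError("attribute set S does not satisfy access structure A")

pk, mk = setup()
ct = encrypt("blood-pressure log", frozenset({"doctor", "cardiology"}), pk)
key = keygen(lambda s: {"doctor", "cardiology"} <= s, mk, pk)  # A: doctor AND cardiology
print(decrypt(ct, key, pk))  # prints "blood-pressure log"
```

A key whose policy is not satisfied by the ciphertext's attribute set (S ∉ A) fails to decrypt, mirroring the condition in algorithm (4).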

3.2 Ciphertext-Policy ABE (CP-ABE)

A Ciphertext-Policy Attribute-Based Encryption (CP-ABE) scheme consists of four basic algorithms: Setup, Encryption, Key Generation, and Decryption.
Fig. 2 Key-Policy ABE

(1) Setup → (PK, MK): This randomized algorithm accepts only the implicit security parameters as input. The master key (MK) and the public parameters (PK) are its outputs.
(2) Encryption (PK, m, AS) → CT: The public parameters (PK), a message (m), and an access structure (AS) over a set of (user) attributes are the inputs to the encryption algorithm. The algorithm encrypts the message m and creates a ciphertext (CT) such that only users who satisfy the access structure can decode m. AS is implicitly contained in the ciphertext.
(3) Key Generation (MK, S) → D: The master key (MK) and the set of attributes (S) that describe the key are required as input to the key generation process. Its output is a private key (D).
Fig. 3 Ciphertext-Policy ABE

(4) Decrypt (PK, CT, D) → m: The public parameters (PK), the ciphertext (CT) with access policy (AS), and the private key (D) for attribute set (S) are the inputs to the decryption algorithm. The algorithm decrypts the ciphertext and outputs the message (m) if S satisfies AS [10].
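The four CP-ABE algorithms above can likewise be sketched as a toy, non-cryptographic Python simulation: here the access structure (AS) travels with the ciphertext and the private key carries the user's attribute set (S), the mirror image of KP-ABE. Real CP-ABE relies on pairing-based cryptography, so this only illustrates the data flow, and all names and attributes are illustrative:

```python
# Toy, NON-cryptographic simulation of the four CP-ABE algorithms.
# The ciphertext embeds the access structure AS (as a predicate), and the
# private key carries the attribute set S that describes its holder.
from typing import Callable, FrozenSet

Policy = Callable[[FrozenSet[str]], bool]  # access structure AS as a predicate

def setup():
    """Setup -> (PK, MK)."""
    return "public-params", "master-key"

def encrypt(pk: str, m: str, access_structure: Policy) -> dict:
    """Encryption(PK, m, AS) -> CT with AS embedded in the ciphertext."""
    return {"payload": m, "as": access_structure}

def keygen(mk: str, attrs: FrozenSet[str]) -> dict:
    """Key Generation(MK, S) -> D: private key for attribute set S."""
    return {"attrs": attrs}

def decrypt(pk: str, ct: dict, d: dict) -> str:
    """Decrypt(PK, CT, D) -> m if S satisfies AS."""
    if ct["as"](d["attrs"]):
        return ct["payload"]
    raise PermissionError("attribute set S does not satisfy cipher policy AS")

pk, mk = setup()
# Cipher policy AS: ("nurse" AND "ward-3") OR "doctor".
policy = lambda s: {"nurse", "ward-3"} <= s or "doctor" in s
ct = encrypt(pk, "patient chart", policy)
print(decrypt(pk, ct, keygen(mk, frozenset({"doctor"}))))  # prints "patient chart"
```

Because the policy is attached to the ciphertext rather than the key, the data owner alone decides, at encryption time, which attribute combinations may decrypt.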

3.3 Related Work

Attribute-Based Encryption (ABE) is a significant cryptographic technique for securing the IoT. This section reviews related work organized around various aspects of applying ABE in the IoT context.
Fine-Grained Access Control: ABE provides fine-grained access control in IoT environments. By encrypting data with attributes as access policies, data owners can precisely define who can access specific data based on attribute values. This ensures that only authorized users or devices with the required attributes can decrypt and access the data [11]. The goal of the study in [12] is to present a thorough analysis of several ABE schemes and how well they can be applied to the developing IoT landscape. A variety of ABE schemes is reviewed with an emphasis on how well they perform in next-generation wireless IoT networks.
The article also underlines how crucial effective access control and key management
systems are in the Internet of Things context. It aids readers in developing a better
understanding of the backdrop of encryption algorithms available for protecting
wireless IoT networks by describing the features and functions of various ABE
schemes. The authors [13] acknowledge the increasing significance of secure data
sharing within industrial settings and the complexity of managing access control and
confidentiality in such environments. The paper specifically focuses on attribute-
based approaches as a solution to these challenges along with exploration of the
technical details of Attribute-Based Encryption methods, which provide fine-grained
access control based on attributes associated with users and data [13].
Interoperability challenges: Interoperability challenges may arise when implementing ABE across heterogeneous devices and systems within the IoT ecosystem. The semantics of attributes supplied by users or needed by
access structures are not taken into account by ABE schemes. These semantics not
only enhance functionality but also facilitate cross-domain interoperability, allowing
users from one domain to access and utilize resources from other domains. By adding
semantic technologies to a traditional Ciphertext-Policy ABE (CP-ABE) technique,
authors in [8] have suggested a Semantic ABE (SABE) framework implemented in
Java. Two CP-ABE-based approaches are presented in this article to enable cross-
domain interoperability by making CP-ABE semantics-aware: Semantically-Enriched Key (SEK) and Semantically-Enriched Access Structure (SEAS).
Interoperability and security could be major obstacles to IoT adoption in the real
world. In development and management of IoT systems, adopting standards promotes
interoperability and security [14]. As a result, in order to remove IoT interoperability
and security barriers, international standards must be adopted. Researchers and stan-
dard organizations are working to find ways around these obstacles. Authors in
[15] conducted a detailed and useful international standards survey for the Internet
of Things security and interoperability. Employing interoperability and security-
related standards may help to achieve interoperability if various Internet of Things
systems are developed using the same standard. The study [16] examines the critical
component of interoperability in the context of smart cities, highlighting the demands
and difficulties related to combining multiple technologies and systems. The writers
explore the particular organizational and technical prerequisites for accomplishing
smooth communication between different parts of an ecosystem for smart cities.
Standardization, communication frameworks, data exchange protocols, and integra-
tion strategies are a few examples of possible factors to take into account. Based on
numerous interoperability studies in the IoT environment, the authors identified the
requirements for five types of interoperability in a smart city: syntactic, semantic, network, middleware, and security.
Dynamic Attribute Management: The study [8] presents the novel concept of Attribute-Based Encryption (ABE). The IoT context is dynamic because new devices often join and leave the network. ABE enables dynamic attribute management, allowing access policies to be adjusted in real time. Such flexibility is essential in order to adapt to evolving device status, user roles, and other features over time. The authors
discuss the difficulty of establishing secure and adaptable access control techniques
for encrypted data, permitting various users to access particular parts of the data depending on their attributes. The authors outline the technical framework of ABE, detailing key components such as attribute keys, master keys, ciphertexts, and decryption procedures. They also discuss the efficiency of their ABE scheme, addressing
computational aspects and practical feasibility [8]. The authors in [17] emphasize
ABE’s capability to achieve fine-grained access control and the flexibility it offers in
managing access policies. They explain how ABE’s Attribute-Based access policies
align with the varying needs of the applications.
Data Privacy and Confidentiality: The paper [18] focuses on designing tech-
niques that enable the collection and analysis of energy consumption data from smart
meters while safeguarding the privacy of an individual’s sensitive information. The
paper highlights the privacy guarantees provided by their approach, emphasizing
that the utility or any adversary cannot deduce individual usage patterns from the
collected data. It discusses how cryptographic methods can ensure data privacy during
collection and transmission [18]. A hybrid approach that combines lightweight cryp-
tographic techniques with ABE to achieve secure data communication and access
control in IoT systems is provided by authors [19]. The paper provides an evaluation
of the proposed approach, considering performance metrics such as computational
efficiency and communication overhead. This assessment demonstrates the feasi-
bility and practicality of their hybrid lightweight cryptography and ABE scheme
in IoT scenarios [19]. The authors [20] introduce a concept called “Puncturable Attribute-Based Encryption,” which is designed to provide a superior level of security and privacy for IoT data transmission. This technique allows data to be encrypted with
specific attributes, and these attributes can be selectively “punctured” or removed by
authorized parties.
Revocation of Access: With the help of ABE, data owners can revoke access to particular devices or users by altering the values of the relevant attributes. This is crucial in Internet of Things (IoT) contexts because access privileges may have to be revoked when devices are compromised or user roles change.
The authors of [21] acknowledge the requirement for both attribute-based access
control and the ability to revoke access privileges. They introduce the idea of Revocable Attribute-Based Encryption (RABE), allowing data owners to grant
access to encrypted data according to attributes and then withdraw that access as
needed. In dynamic environments such as cloud computing, this ability is essential for preserving data security. The authors probe into the specifics of their RABE scheme while highlighting how it is capable of maintaining data integrity throughout encryption, access procedures, and storage [21]. With the help of ABE, data owners
can revoke access to particular devices or specific users by altering the values of the
relevant attributes. This feature is critical in IoT environments, where access priv-
ileges may need to be revoked due to changes in user roles, device ownership, or
security incidents [22]. The authors [23] effectively address the growing need for
secure and efficient data sharing in eHealth environments, where multiple parties
collaborate and share sensitive medical information. The paper introduces the cryp-
tographic framework CESCR as a solution to the challenges associated with ensuring
data confidentiality, access control, and user revocation.
Secure Data Sharing and Collaboration: In IoT ecosystems, data sharing and
collaboration among devices and users are common. ABE facilitates secure data
sharing by allowing data to be encrypted with access policies based on the required
attributes, ensuring that only authorized entities can share and access the data.
The CP-ABE system designed by the authors in [13] for secure data sharing within smart cities allows data owners to define access policies based on attributes, ensuring that only authorized users whose attributes satisfy the policy can decrypt and access the shared data. The system addresses privacy concerns in
smart city environments by allowing data sharing without exposing sensitive infor-
mation about the data owner or users. The authors introduce a practical solution
that enhances security and confidentiality in the context of data sharing within smart
cities [13]. The key contribution of [24] is the technique of hidden policies, which further enhances privacy by concealing the specific access policies associated with the shared data. This approach aims to ensure secure data sharing
while protecting sensitive information about access rights and policies within the
smart grid infrastructure [24]. The paper [23] introduces a novel Ciphertext-Policy
Attribute-Based Encryption (CP-ABE) scheme named CESCR, designed for effi-
cient and secure data sharing in collaborative eHealth environments. CESCR supports
attribute-based access control, allowing authorized users to decrypt shared data based
on their attributes. Notably, the scheme also incorporates revocation mechanisms to
manage changes in user access rights over time. Unlike some prior schemes, CESCR
eliminates the need for dummy attributes, enhancing efficiency in the decryption
process while maintaining strong security measures for collaborative data sharing in
the eHealth sector [23]. The authors [25] discuss various Attribute-Based Encryp-
tion techniques and their applicability to industrial contexts, where sensitive data
needs to be shared among authorized users while maintaining confidentiality and
access control. The paper surveys different methodologies and technologies that can
be employed to establish secure data sharing frameworks in the industrial domain,
contributing to the development of robust data protection mechanisms for industrial
applications [25].
Scalability and Resource Efficiency: ABE is appropriate for situations where devices operate with restricted memory and processing capabilities, since it can be implemented on resource-constrained IoT devices [26]. Point-to-multipoint
communication plays an important role in cloud computing contexts, and the authors
[27] deal with the difficulties of access control and data confidentiality in a scattered
and scalable cloud environment. The novel ABE system introduced particularly for
point-to-multipoint communication is the paper’s main contribution. Enhancing the
security and scalability of IoT systems is the goal of the work in [19]. The authors
address the issue of protecting Internet of Things (IoT) content while taking into
account the resource limitations of IoT devices by merging lightweight crypto-
graphic approaches and ABE. Table 1 summarizes the literature on how ABE's computational and storage overhead affects constrained devices and on the practicality of the proposed security models in real-world IoT scenarios.
Secure Device Management: Secure device management in IoT contexts can be
accomplished with ABE. Access to device management operations can be regulated
by associating attributes with devices, ensuring that only authorized individuals
or systems can handle IoT devices safely. An innovative Attribute-Based Encryp-
tion (ABE) system is presented in article [33] with the goal of improving access
control in a blockchain-based IoT context. The authors identified the difficulties
in implementing effective and secure access control in the context of IoT devices
integrated by a blockchain architecture. They suggest an ABE-based approach that
incorporates blockchain technology to control access rights to deal with this. They
have developed a system that enables data owners to specify access controls based
on attributes, ensuring that only authorized users with appropriate attributes can
decrypt and access the information. The suggested approach intends to improve
the secrecy and integrity of Internet of Things content while retaining effective access
control by utilizing the safety features of ABE and the distributed architecture of
blockchain. The study offers a promising method for integrating ABE and blockchain to secure IoT devices [33].
Adaptability to Changing Environments: Dynamic IoT ecosystems experience frequent changes, such as devices joining or leaving the network and users' attributes evolving over time. ABE provides adaptability to the changing environ-
ment by allowing access policies to be modified dynamically. This ensures that access
privileges remain up-to-date and responsive to changes in the IoT environment [13].
The Internet of Things (IoT) presents significant access control and data privacy
challenges due to the variety of IoT devices and heterogeneous distributed networks.
For the IoT environment, many security architectures and models have recently been
presented. To actively defend against new breaches and insider attacks, authors in [34]
suggested a novel data-centric security technique called Ciphertext-Policy Attribute-
based Encryption (CP-ABER-LWE) which protects data at rest as well as data in
transit.

3.4 Use Cases and Scenarios of Attribute-Based Encryption (ABE) in IoT Applications

A summary of use cases and scenarios of Attribute-Based Encryption (ABE) in IoT applications, drawing from the principles commonly discussed in research and review articles, is given in Fig. 4.
Table 1 Literature on ABE's computational and storage overhead in IoT scenarios

[9] Proposed work: A KP-ABE scheme that allows for policies as expressive as feasible, with constant-size ciphertexts regardless of the number of attributes used. Focuses on compact ciphertexts and resilient but shorter keys to provide complete security. Technique: A CP-ABE technique with short ciphertexts for threshold policies; a KP-ABE technique with short ciphertexts for monotonic Linear Secret Sharing Scheme (LSSS)-realizable access structures; and a KP-ABE scheme with non-monotonic access structures.

[28] Proposed work: The speed of full encryption using CP-ABE in a constrained environment can depend on the execution context. The authors therefore propose an adaptive scheme that intelligently shifts between partial and full encryption depending on the execution context. Technique: Partial and full encryption using a BeagleBone Black as the data owner device.

[29] Proposed work: The authors applied CP-ABE and KP-ABE to commonly used IoT-enabled devices, investigating the effects of combining numeric and string attributes in both schemes. They found that while using numeric attributes can be costly, it can provide more expressive policy definitions, particularly in CP-ABE. Moreover, adding more attributes increases memory usage and execution time, which in turn increases energy consumption; similarly, computational cost increases with the level of security. Technique: Encryption and decryption implemented in C on boards such as the Intel Edison, Raspberry Pi 1 Model B, Raspberry Pi Zero, and Intel Galileo Gen 2. Subsequent research may focus on optimizing energy efficiency and execution time.

[30] Proposed work: The authors analyzed the performance of KP-ABE and CP-ABE on laptop and mobile devices. They report that execution time grows linearly with the number of attributes. Another finding is that, since CP-ABE performs more exponentiation calculations than KP-ABE, CP-ABE is slower than KP-ABE for all operations. Providing a strong security level on mobile phones is challenging, so the authors suggest approaches such as remote key generation and key reuse to utilize ABE on low-processing-power systems. Technique: Both ABE techniques implemented in Java on laptop and mobile devices. The article recommends exploring ABE further to improve its performance in resource-constrained environments.

[31] Proposed work: Constrained devices' memory may be insufficient to store CP-ABE decryption keys, since in CP-ABE the key size depends on the number of attributes. The article proposes a provably secure CP-ABE technique with an AND-gates access structure; the analysis demonstrates that the decryption key of the proposed technique can be stored on lightweight devices, making it an effective CP-ABE. Technique: A CP-ABE technique with an AND-gates access structure yielding constant-size decryption keys that can be stored on lightweight devices; the proposed technique provides a constant-size decryption key of 672 bits (80-bit security).

[32] Proposed work: The computational burden on host and user can be minimized by the proposed model, in which end-to-end cryptographic operations are outsourced. The scheme is implemented for constrained devices such as mobile devices. Technique: Two semi-trusted proxies are used in the proposed technique, one for outsourcing computationally demanding encryption and the other for outsourcing decryption.

1. Healthcare IoT: ABE can be utilized in healthcare IoT applications to ensure secure access to sensitive patient data. Medical records, health monitoring data,
and patient information can be encrypted with access policies based on the
attributes of healthcare providers and authorized personnel. ABE enables fine-
grained access control, protecting patient privacy while allowing appropriate
healthcare professionals to access necessary data [35]. E-health relies on inter-
connected small nodes with sensing and activating capabilities placed within
or outside the human body. These applications are personalized, dependent on
reliable communication channels, and responsive to connections. The growth
of IoT services necessitates new strategies to manage diverse devices, varying
availability, and data generation patterns. Smart healthcare employs smart health
cards for patient security and privacy, but these cards are susceptible to threats
like theft, hacking, and cyber-attacks. In [36] they emphasize the need to control
access to these data based on attributes while also preserving the privacy of access
policies. An efficient policy-hiding ABAC mechanism tailored for smart health
scenarios ensures that access policies remain confidential while still enabling
fine-grained access control. The technical aspects of the proposed mechanism
are discussed, including its Attribute-Based Encryption, policy hiding, and effi-
cient access control features. The authors [37] highlight its potential benefits
in areas like patient data privacy, secure sharing among medical professionals,
and access control. The paper emphasizes the importance of secure health data
sharing while complying with regulations like HIPAA, which mandate stringent
data protection measures. One aspect to consider is that the paper’s findings
might be subject to evolution, as the field of ABE and health services continues
to advance with new developments and technologies.
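As a purely illustrative sketch, the fine-grained access control idea underlying CP-ABE can be expressed as a Boolean policy evaluated over a user's attribute set. The policy shape and attribute names below are hypothetical, and no cryptography is performed; a real CP-ABE scheme enforces the same check mathematically during decryption rather than with an access-control monitor.

```python
# Toy CP-ABE-style policy check: a policy is a Boolean tree over attributes.
# Illustrative only -- no encryption is performed here.

def satisfies(policy, attributes):
    """Return True if the attribute set satisfies the policy tree.

    A policy is either an attribute name (leaf) or a tuple of the form
    ('and' | 'or', subpolicy, subpolicy, ...).
    """
    if isinstance(policy, str):          # leaf: the attribute must be held
        return policy in attributes
    op, *children = policy
    results = [satisfies(child, attributes) for child in children]
    return all(results) if op == 'and' else any(results)

# Hypothetical policy protecting a cardiology record: any cardiologist, or
# any physician who is also on the patient's treating team, may decrypt.
record_policy = ('or', 'cardiologist',
                 ('and', 'physician', 'treating_team'))

print(satisfies(record_policy, {'physician', 'treating_team'}))  # True
print(satisfies(record_policy, {'nurse'}))                       # False
```

In an actual ABE deployment the policy would be embedded in the ciphertext (CP-ABE) or in the key (KP-ABE), so the check cannot be bypassed by a misbehaving storage server.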
350 K. D. More and D. Pramod

Fig. 4 Use cases and scenarios of ABE in IoT applications

2. Smart Grids: Data related to power consumption, production, and distribution can be encrypted with access policies depending on the attributes of utility companies, power suppliers, and consumers. This makes sure that only permitted parties
can view and examine the grid data. The author [24] suggests a novel method
for safe data sharing in the smart grid. In a smart grid context, implementing
protection guarantees that only authorized entities can access and evaluate the grid data. Using Attribute-Based Encryption, this strategy allows for
limited access to data based on predetermined attributes by keeping the access
policies hidden from unwanted users. In the environment of the smart grid, the
article proposes a comprehensive technical solution that provides an efficient
way to improve data control and safety. Attribute-Based Encryption system is
presented by the authors [38] with the goal of enhancing secure data transmis-
sion and access control in the environment of smart grids. Their proposed ABE
scheme offers controlled data access based on specified attributes, ensuring that
only authorized parties can decrypt and access the information showcasing its
resilience to potential threats.
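The policy-hiding idea can be conveyed with a simplified, non-cryptographic sketch: attach salted hashes of the required attributes to the data rather than the attribute names themselves, so a casual observer sees only opaque digests. This toy hides attribute names only; actual policy-hiding schemes such as the one in [24] hide the policy cryptographically and support richer policy structures. All names here are invented.

```python
import hashlib
import os

# Toy "hidden policy": the ciphertext carries salted digests of required
# attributes instead of their clear names. Illustrative only.

def hide_policy(required_attrs):
    """Return a fresh salt and the set of hidden attribute digests."""
    salt = os.urandom(16)
    hidden = {hashlib.sha256(salt + a.encode()).hexdigest()
              for a in required_attrs}
    return salt, hidden

def can_access(salt, hidden, user_attrs):
    """AND policy: every hidden digest must be matched by a user attribute."""
    digests = {hashlib.sha256(salt + a.encode()).hexdigest()
               for a in user_attrs}
    return hidden <= digests

salt, hidden = hide_policy({'utility_operator', 'region_north'})
print(can_access(salt, hidden, {'utility_operator', 'region_north'}))  # True
print(can_access(salt, hidden, {'consumer'}))                          # False
```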
3. Industrial IoT (IIoT): In industrial IoT applications, ABE can be applied to
protect intellectual property, sensitive production data, and access to critical
industrial control systems. Authors [39] address the challenges of secure data
access control in cloud-based industrial Internet of Things (IIoT) environments.
Recognizing the critical need for data security in these contexts, the authors
propose a novel solution that focuses on two key aspects: auditability and time-
limited access control. The proposed solution offers benefits such as enhanced
security, accountability, and controlled data sharing. Recognizing the imperative
of safeguarding sensitive information while enabling efficient data exchange, the
authors [13] explore attribute-based approaches as a solution. These approaches
leverage attributes to define access policies, granting permissions based on
specific criteria. By addressing challenges like access management in complex
industrial networks, the article contributes to the field’s understanding of secure
data sharing.
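The time-limited access control idea can be sketched as attributes that carry expiry timestamps, so that an attribute only counts toward a policy while it is still valid. The structure and names below are illustrative assumptions, not details of the cited scheme [39].

```python
import time

# Toy time-limited access control: each issued attribute carries an expiry
# (seconds since epoch), and only unexpired attributes count toward a policy.

def valid_attributes(issued, now=None):
    """issued: dict mapping attribute name -> expiry timestamp."""
    now = time.time() if now is None else now
    return {attr for attr, expiry in issued.items() if now < expiry}

def allowed(required, issued, now=None):
    """Simple AND policy: every required attribute must be held and valid."""
    return required <= valid_attributes(issued, now)

t0 = 1_700_000_000                        # a fixed reference instant
key = {'plant_engineer': t0 + 3600,       # valid for one hour
       'line_7_access': t0 + 60}          # valid for one minute

print(allowed({'plant_engineer', 'line_7_access'}, key, now=t0 + 30))   # True
print(allowed({'plant_engineer', 'line_7_access'}, key, now=t0 + 300))  # False
```

Under this view, revoking access amounts to letting short-lived attributes expire rather than recalling and reissuing keys.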
4. Smart Cities: ABE is relevant for securing data sharing in smart cities among
various stakeholders, including government authorities, service providers, and
citizens. ABE offers safe data sharing by encrypting content with access poli-
cies based on significant attributes, guaranteeing that only authorized systems or
users can access important city information [40]. The authors in [25] put forward an innovative strategy that makes use of ABE to protect data privacy while facilitating effective sharing. The authors highlight the benefits of their strategy, such
as improved privacy, granular access control, and customized data sharing. The
authors emphasize the importance of their research in securing users' privacy and
enabling data-driven developments in urban areas. The authors [41] give a thor-
ough survey that explores numerous facets of smart city technologies in recogni-
tion of the growing importance of smart city projects. In order to construct smart
cities, various protocols for communication and architectural frameworks are
methodically explored in this article. It offers insights on the development of
edge-centric, distributed, and centralized smart city architectures, highlighting
their advantages and disadvantages. The authors also discuss communication protocols like IoT, LoRaWAN, and 5G, emphasizing how they help with
the effective interchange of content or information in scenarios like smart cities.
5. Vehicular IoT (VIoT): ABE can be used in vehicular IoT to secure information
transfer and access control in vehicles that are connected. Access control based
on vehicle features can be used to encrypt vehicle-related data, assuring that
only authorized cars, other vehicles, or authorities have access to particular data,
like traffic data or records of maintenance [42]. For Vehicular Ad Hoc Networks
(VANETs), a novel framework for guaranteeing safe communication is presented
in [43]. The suggested system uses Attribute-Based Encryption to create safe
channels for exchanges among infrastructure, vehicles, and other systems. The
system imposes access policies that allow authorized users to decode and view the
sent information by associating attributes to users. While applying ABE, access
to vehicle information can be restricted using factors like vehicle ID, permitted
access, or vehicle type, protecting privacy and preventing unwanted tampering.


The authors [22] offer a lightweight CP-ABE-based encryption approach with
emphasis on direct attribute revocation in accordance with the requirement for
lightweight and reliable cryptographic algorithms. The difficulty of attribute revo-
cation in dynamic vehicular networks is addressed by this technique by making
it possible to quickly remove access permissions for revoked attributes.
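Direct attribute revocation, as targeted by the scheme in [22], can be pictured with a small non-cryptographic sketch: the authority maintains a set of revoked attributes, and a revoked attribute no longer counts toward satisfying a policy, without the whole key being reissued. All identifiers below are invented.

```python
# Toy direct attribute revocation: only non-revoked attributes from a key
# count toward the access policy. Illustrative only -- the cited scheme
# enforces revocation cryptographically.

def effective_attributes(key_attrs, revoked):
    """Attributes in the key that have not been revoked."""
    return set(key_attrs) - set(revoked)

def can_decrypt(policy_attrs, key_attrs, revoked):
    """AND policy over attributes; revoked attributes no longer count."""
    return set(policy_attrs) <= effective_attributes(key_attrs, revoked)

vehicle_key = {'vehicle_id_42', 'emergency_service', 'region_east'}
policy = {'emergency_service', 'region_east'}

print(can_decrypt(policy, vehicle_key, revoked=set()))                  # True
print(can_decrypt(policy, vehicle_key, revoked={'emergency_service'}))  # False
```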
6. Home Automation and Smart Homes: Encryption of data like home automation
data, such as security camera feeds or environmental sensor data can be done with
access policies depending on user attributes, ensuring data privacy and authorized
access [44]. The paper [44] sheds light on the critical aspects of privacy and secu-
rity that emerge when IoT devices are integrated into homes, discussing potential
vulnerabilities, data privacy concerns, and the complex balance between conve-
nience and safeguarding personal space. Recognizing the increasing proliferation
of IoT devices within homes, the authors [45] propose an ABE scheme tailored
to the specific requirements of smart homes. The scheme permits data holders to
encrypt their information using attributes, while only authorized users are able to
access the data based on predefined attributes. This ensures fine-grained access
control and privacy preservation. By addressing the privacy challenges associated
with data sharing and access control in these environments, the authors contribute
to the advancement of secure and privacy-conscious smart home ecosystems [45].
7. Agriculture: Precision farming, also referred to as smart farming, or digital
farming, is the application of cutting-edge technology and data-driven methods
to enhance many facets of agricultural production. The goal of smart agricul-
ture is to increase crop yields, improve resource efficiency, reduce waste, and
make farming practices more sustainable. Smart agriculture relies heavily on
collecting and analyzing data from various sources, including sensors, drones,
and IoT devices. These data often contain sensitive information about crop
health, weather conditions, and farm operations. ABE can be used to encrypt
this data, and access may be granted based on specific attributes. For example,
only authorized personnel with the appropriate credentials or attributes (such
as being a farm manager) can decrypt and access certain data. Authors in [46]
created an ontology for smart farming. This ontology represents a variety of physical entities, including sensors, farm employees, and their interactions. The authors implemented an Attribute-Based Access Control (ABAC) system to dynamically check access control requests using the expressive ontology. According to the authors of [47], an effective access control method can be adopted for privacy preservation in green IoT-based agriculture; among the references they provide are approaches to access control and security through Attribute-Based Encryption techniques.
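The ontology-driven ABAC check can be sketched as an attribute-implication relation that expands a subject's attributes before a rule is evaluated (e.g., a farm manager also counts as a farm worker). The relation and rule below are invented for illustration and are not taken from [46].

```python
# Toy ontology-flavored ABAC: an "implies" relation expands a subject's
# attributes before the access rule is checked. Illustrative only.

IMPLIES = {'farm_manager': {'farm_worker'},
           'agronomist':   {'farm_worker'}}

def expand(attrs):
    """Close an attribute set under the IMPLIES relation."""
    closed = set(attrs)
    frontier = set(attrs)
    while frontier:
        frontier = {i for a in frontier for i in IMPLIES.get(a, ())} - closed
        closed |= frontier
    return closed

def permit(subject_attrs, required):
    """Allow access when every required attribute is held or implied."""
    return required <= expand(subject_attrs)

print(permit({'farm_manager'}, {'farm_worker'}))    # True: implied by the ontology
print(permit({'drone_operator'}, {'farm_worker'}))  # False
```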
8. Education: IT teams, educators, and administrators must work together to imple-
ment ABE for IoT applications in the education sector. To make sure that the
implementation satisfies security standards and offers the required flexibility for
the dynamic nature of educational environments, the implementation must be thoroughly planned and tested. Traditional systems allow parents or students
to request teacher recommendations based on bare necessities; however, these
systems have limitations due to the absence of customized options and oversight
of the skills and credibility of teachers. The article [48] suggests an attribute-based
recommendation system for education services that protects privacy. This scheme
allows users to set customized requirements by using attribute-based searchable
encryption for keyword searches and fine-grained access control. Teachers and
attribute authority cooperate in order to generate keys anonymously, which guarantees the security of teachers' keys. The education platform can select the best
teacher for a task. Several education sector systems for which there are chances
of applying Attribute-Based Encryption can be access control for learning plat-
forms, secure data sharing, personalized learning platforms, student information
system security, etc.
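The keyword-search ingredient of such a scheme can be illustrated with a minimal sketch: documents are indexed under keyed keyword tags (HMAC-SHA256 here), and a search "trapdoor" is simply the tag of the queried keyword, so the index can be matched without revealing plaintext keywords to the platform. Real attribute-based searchable encryption layers fine-grained access control on top; the key and identifiers below are hypothetical.

```python
import hashlib
import hmac

# Toy searchable-index sketch: keyed keyword tags are matched by trapdoors,
# so the server never sees plaintext keywords. Illustrative only.

SECRET = b'demo-search-key'              # hypothetical shared key

def tag(keyword):
    """Deterministic keyed tag for one keyword."""
    return hmac.new(SECRET, keyword.encode(), hashlib.sha256).hexdigest()

index = {                                # doc id -> set of keyword tags
    'course_17': {tag('calculus'), tag('beginner')},
    'course_42': {tag('calculus'), tag('advanced')},
}

def search(trapdoor):
    """Return the ids of documents whose index contains the trapdoor."""
    return sorted(doc for doc, tags in index.items() if trapdoor in tags)

print(search(tag('calculus')))   # ['course_17', 'course_42']
print(search(tag('advanced')))   # ['course_42']
```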

3.5 ABE Security Models in IoT

Table 2 gives a general overview of common Attribute-Based Encryption security models and their effectiveness in the context of the IoT. The actual effectiveness of these models may vary depending on the specific ABE scheme, implementation, and IoT use case.

Table 2 Security models for ABE and its effectiveness in IoT


Confidentiality model
  Description: Ensures sensitive data remains confidential and accessible only to authorized entities
  Effectiveness in IoT: Effectively protects sensitive IoT data from unauthorized access, preserving data privacy

Integrity model
  Description: Guarantees that data remain unaltered and trustworthy throughout their lifecycle
  Effectiveness in IoT: Ensures data trustworthiness, preventing unauthorized modifications of IoT data

Attribute-based access control model
  Description: Enables fine-grained access control based on attributes
  Effectiveness in IoT: Well-suited for IoT, enabling flexible access policies considering diverse attributes of devices and users

Privacy-preserving attribute revelation model
  Description: Conceals a user's private attributes during access control
  Effectiveness in IoT: Preserves user privacy in IoT applications, safeguarding sensitive information from exposure

Key management and revocation model
  Description: Handles cryptographic key generation, distribution, and revocation
  Effectiveness in IoT: Crucial for dynamic IoT environments, ensuring secure data access and user privilege management

Adaptability and scalability model
  Description: Adapts to changes in attributes and scales to accommodate IoT growth
  Effectiveness in IoT: Enables efficient handling of dynamic IoT ecosystems and accommodates the growing number of devices and users

The effectiveness of ABE security models in IoT depends on factors like the
chosen ABE scheme’s cryptographic strength, implementation quality, and the
specific IoT use case. Additionally, other factors like computational efficiency,
resource constraints in IoT devices, and the ability to handle dynamic attribute
changes also impact their overall effectiveness.

3.6 ABE Implementations and Technologies for IoT

To date, several Attribute-Based Encryption (ABE) libraries, frameworks, and tools have been developed, some of which are suitable for IoT environments. Table 3 below
is an overview of some existing ABE libraries, frameworks, and tools, along with
their source references.
Cryptographic frameworks and libraries are developing continuously, with revolutionary innovations. The ABE strategy required depends on the context of the use case. The programming language used in the IoT environment, the resource limitations of IoT devices, performance overhead, and community support should all be taken into account when applying ABE implementations. Furthermore, be watchful of security issues and ensure that the selected implementation meets the desired security properties for the IoT environment.
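When shortlisting one of these libraries for constrained hardware, a small timing harness helps quantify the per-attribute overhead reported in [29, 30]. The fake_encrypt function below is a stand-in workload, since no particular ABE library is assumed to be installed; it would be replaced with the chosen library's encryption call on the target device.

```python
import time

# Minimal timing harness for comparing candidate ABE libraries on target
# hardware: average the cost of an encryption call as the attribute count
# grows. encrypt_fn is pluggable; a stand-in workload is used here.

def time_encrypt(encrypt_fn, attr_counts, repeats=5):
    """Return {attribute count: mean seconds per call} for encrypt_fn."""
    results = {}
    for n in attr_counts:
        attrs = [f'attr{i}' for i in range(n)]
        start = time.perf_counter()
        for _ in range(repeats):
            encrypt_fn(attrs)
        results[n] = (time.perf_counter() - start) / repeats
    return results

def fake_encrypt(attrs):
    # Stand-in whose cost grows with the number of attributes, mirroring the
    # linear growth reported in [29, 30]; replace with a real library call.
    return [a.encode() * 1000 for a in attrs]

timings = time_encrypt(fake_encrypt, [2, 8, 32])
for n, t in sorted(timings.items()):
    print(f'{n:3d} attributes: {t * 1e6:8.1f} us/op')
```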

Table 3 Overview of existing ABE libraries, frameworks, or tools suitable for IoT environments

Charm Crypto
  Pros: Easy-to-use, Python-based, supports multiple ABE schemes like CP-ABE and KP-ABE
  Cons: May have performance overhead for resource-constrained IoT devices
  Source: https://github.com/JHUISI/charm

ABY framework
  Pros: Suitable for privacy-preserving computations in IoT scenarios
  Cons: May have higher computation and communication overhead
  Source: https://github.com/encryptogroup/ABY

OpenABE
  Pros: Open-source C++ library, supports CP-ABE, KP-ABE, and Ciphertext-Policy Hierarchical ABE (CP-HABE). Actively maintained
  Cons: May have challenges in resource-constrained IoT environments
  Source: https://github.com/zeutro/openabe

CryptID
  Pros: Supports different ABE schemes, efficient implementation in C
  Cons: May require more effort to integrate with IoT applications
  Source: https://github.com/cryptid-org

4 Conclusion and Future Directions

Providing security to data in IoT applications is considerably different from securing traditional data due to the heterogeneous nature of IoT use cases and applications. This paper reviews the Attribute-Based Encryption technique for IoT. It includes an overview of IoT along with its key characteristics. The importance of access control and other security challenges are covered in the later part. The details of ABE, its importance, its variants KP-ABE and CP-ABE, the significance of ABE for IoT, and the IoT applications where ABE is applicable are reviewed from the existing literature. As security is the topmost priority for any system, the paper ends with ABE security models and existing ABE libraries, frameworks, and tools suitable for IoT environments.
This review paper can serve as a valuable resource for researchers, academicians, practitioners, and policymakers seeking a general understanding of the foundational insights of ABE for the Internet of Things, which can be further explored and applied for maintaining the security and privacy of IoT systems. Some possible future directions for further investigation and improvement in the field under review are generating compact ciphertexts that are provably secure against adaptive adversaries, optimizations in terms of energy efficiency and overall execution time while applying ABE, and a detailed exploration of interoperability and security standards for IoT and their practical application.

References

1. Hersent O, David B, Omar E (2011) The internet of things: key applications and protocols.
https://doi.org/10.1002/9781119958352
2. Perera C, Liu CH, Jayawardena S, Chen M (2015) A survey on the internet of things from an
industrial market perspective. IEEE Access 2:1660–1679. https://doi.org/10.1109/ACCESS.
2015.2389854
3. Nasiraee H, Ashouri-Talouki M (2020) Anonymous decentralized attribute-based access
control for cloud-assisted IoT. Futur Gener Comput Syst 110:45–56. https://doi.org/10.1016/
j.future.2020.04.011
4. Görmüş S, Aydın H, Ulutaş G (2018) Security for the internet of things: a survey of existing
mechanisms, protocols and open research issues. J Fac Eng Architect Gazi Univ 33(4):1247–
1272
5. Mahmoud R, Yousuf T, Aloul F, Zualkernan I (2016) Internet of things (IoT) security: current status, challenges and prospective measures. In: 2015 10th international conference for internet technology and secured transactions, ICITST 2015, pp 336–341. https://doi.org/10.1109/ICITST.2015.7412116
6. Han Y (2019) Attribute-based encryption with adaptive policy. Soft Comput 23:4009–4017.
https://doi.org/10.1007/s00500-018-3370-z
7. Sahai A, Waters B (2005) Fuzzy identity based encryption. In: Advances in cryptology—
Eurocrypt, vol 3494 of LNCS, Springer, pp 457–473
8. Vipul G, Pandey O, Sahai A, Waters B (2006) Attribute-based encryption for fine-grained
access control of encrypted data. In: Proceedings of the ACM conference on computer and
communications security, pp 89–98. https://doi.org/10.1145/1180405.1180418

9. Attrapadung N, Herranz J, Laguillaumie F, Libert B, De Panafieu E, Ràfols C (2012) Attribute-based encryption schemes with constant-size ciphertexts. Theoret Comput Sci 422:15–38. https://doi.org/10.1016/j.tcs.2011.12.004
10. Bethencourt J, Amit S, Brent W (2007) Ciphertext-policy attribute-based encryption. In:
Proceedings—IEEE symposium on security and privacy, pp 321–334. https://doi.org/10.1109/
SP.2007.11
11. Balu A, Kuppusamy K (2014) An expressive and provably secure ciphertext-policy attribute-
based encryption. Inform Sci 276(subaward 641):354–362. https://doi.org/10.1016/j.ins.2013.
12.027.
12. Shruti SR, Sah DK, Gianini G (2023) Attribute-based encryption schemes for next generation
wireless IoT networks: a comprehensive survey. Sensors 23(13):5921. https://doi.org/10.3390/
s23135921
13. Chiquito A, Bodin U, Schelen O (2023) Attribute-based approaches for secure data sharing
in industrial contexts. IEEE Access 11:10180–10195. https://doi.org/10.1109/ACCESS.2023.
3240000
14. Arshad H, Johansen C, Owe O, Picazo-Sanchez P, Schneider G (2022) Semantic attribute-
based encryption: a framework for combining ABE schemes with semantic technologies. Inf
Sci 616:558–576. https://doi.org/10.1016/j.ins.2022.10.132
15. Lee E, Young DS, Se RO, Young GK (2021) A survey on standards for interoperability and
security in the internet of things. IEEE Commun Surv Tut 23(2):1020–1047. https://doi.org/
10.1109/COMST.2021.3067354
16. Koo J, Young GK (2021) Interoperability requirements for a smart city. In: Proceedings of
the ACM symposium on applied computing, pp 690–698. https://doi.org/10.1145/3412841.
3441948
17. Bruhadeshwar B, Ray I (2018) Attribute-based encryption: applications and future directions.
lecture notes in computer science (including subseries lecture notes in artificial intelligence and
lecture notes in bioinformatics). Vol. 11170 LNCS. Springer International Publishing. https://
doi.org/10.1007/978-3-030-04834-1_18
18. Alfredo R, Danezis G (2011) Privacy-preserving smart metering. In: Proceedings of the ACM
conference on computer and communications security, pp 49–60. https://doi.org/10.1145/204
6556.2046564
19. Jammula M, Vakamulla VM, Kondoju SK (2022) Hybrid lightweight cryptography with
attribute-based encryption standard for secure and scalable IoT system. Connect Sci. https://
doi.org/10.1080/09540091.2022.2124957
20. Phuong X, Viet T, Ning R, Xin X, Wu H (2018) Puncturable attribute-based encryption for
secure data delivery in the internet of things. In: Proceedings—IEEE INFOCOM 2018, pp
1511–1519. https://doi.org/10.1109/INFOCOM.2018.8485909
21. Ge C, Susilo W, Baek J, Liu Z, Xia J, Fang L (2022) Revocable attribute-based encryption with
data integrity in clouds. IEEE Trans Depend Sec Comput 19(5):2864–2872. https://doi.org/10.
1109/TDSC.2021.3065999
22. Liu Y, Shengwei X, Ziyan Y (2023) A lightweight CP-ABE scheme with direct attribute
revocation for vehicular ad hoc network. Entropy 25(7):979. https://doi.org/10.3390/e25
070979
23. Kennedy E, Jang B, Wook Kim J (2021) CESCR: CP-ABE for efficient and secure sharing of
data in collaborative ehealth with revocation and no dummy attribute. PLoS ONE 16. https://
doi.org/10.1371/journal.pone.0250992
24. Hur J (2013) Attribute-based secure data sharing with hidden policies in smart grid. IEEE Trans
Parallel Distrib Syst 24(11):2171–2180. https://doi.org/10.1109/TPDS.2012.61
25. Shen X, Chuanhe H, Danxin W, Jiaoli S (2021) A privacy-preserving attribute-based encryption
system for data sharing in smart cities. Wireless Commun Mob Comput. https://doi.org/10.1155/
2021/6686675
26. Rasori M, La Manna M, Perazzo P, Dini G (2022) A survey on attribute-based encryption
schemes suitable for the internet of things. IEEE Internet Things J 9(11):8269–8290. https://
doi.org/10.1109/JIOT.2022.3154039

27. Tamizharasi GS, Balamurugan B, Aarthy SL (2016) Scalable and efficient attribute based
encryption scheme for point to multipoint communication in cloud computing. In: Proceedings
of the international conference on inventive computation technologies, ICICT 2016, pp 1.
https://doi.org/10.1109/INVENTIVE.2016.7823292
28. Taha MB, Talhi C, Ould-Slimane H (2019) Performance evaluation of cp-abe schemes under
constrained devices. Procedia Comput Sci 155:425–432. https://doi.org/10.1016/j.procs.2019.
08.059
29. Ambrosin M, Anzanpour A, Conti M, Dargahi T, Moosavi SR, Rahmani AM, Liljeberg P
(2016) On the feasibility of attribute-based encryption on internet of things devices. IEEE
Micro 36(6):25–35. https://doi.org/10.1109/MM.2016.101
30. Wang X, Jianqing Z, Eve M, Schooler MI (2014) Performance evaluation of attribute-
based encryption: toward data privacy in the IoT. In: 2014 IEEE international conference
on communications, ICC 2014, pp 725–730. https://doi.org/10.1109/ICC.2014.6883405
31. Guo F, Yi M, Willy S, Duncan SW, Vijay V (2014) CP-ABE with constant-size keys for
lightweight devices. IEEE Transact Inform Foren Secur 9(5):763–771. https://doi.org/10.1109/
TIFS.2014.2309858
32. Asim M, Milan P, Tanya I (2014) Attribute-based encryption with encryption and decryption
outsourcing. In: Proceedings of 12th Australian information security management conference,
AISM 2014, pp 21–28. https://doi.org/10.4225/75/57b65cc3343d0
33. Zhang J, Xin Y, Gao Y, Lei X, Yang Y (2021) Secure ABE scheme for access management in
blockchain-based IoT. IEEE Access 9:54840–54849. https://doi.org/10.1109/ACCESS.2021.
3071031
34. Fun TS, Samsudin A (2017) Attribute based encryption—a data centric approach for securing
internet of things (IoT). Adv Sci Lett 23(5):4219–4223. https://doi.org/10.1166/asl.2017.8315
35. Sadeeq Mohammed AM, Subhi RM, Zeebaree RQ, Sarkar HA, Karwan J (2018) Internet of
things security: a survey. In: ICOASE 2018—international conference on advanced science
and engineering, pp 162–66. https://doi.org/10.1109/ICOASE.2018.8548785
36. Zhang Y, Zheng D, Deng RH (2018) Security and privacy in smart health: efficient policy-
hiding attribute-based access control. IEEE Internet Things J 5(3):2130–2145. https://doi.org/
10.1109/JIOT.2018.2825289
37. Imam R, Kumar K, Raza SM, Sadaf R, Anwer F, Fatima N, Nadeem M, Abbas M, Rahman O
(2022) A systematic literature review of attribute based encryption in health services. J King
Saud Univ Comput Inform Sci 34(9):6743–6774. https://doi.org/10.1016/j.jksuci.2022.06.018
38. Yang W, Zhitao G (2019) An efficient attribute based encryption scheme in smart grid. Lecture
notes in computer science (including subseries lecture notes in artificial intelligence and lecture
notes in Bioinformatics) 11982 LNCS: 159–72. https://doi.org/10.1007/978-3-030-37337-5_
13
39. Li T, Jiawei Z, Yanbo Y, Wei Q, Yangxu L (2021) Auditable and times limitable secure data
access control for cloud-based industrial internet of things. J Netw Netw Appl 1(3):129–138.
https://doi.org/10.33969/j-nana.2021.010306
40. Karankar N, Seth A (2023) A comprehensive survey on internet of things security: challenges
and solutions. Lect Notes Data Eng Commun Technol 166:711–728. https://doi.org/10.1007/
978-981-99-0835-6_51
41. Sajid A, Shah SW, Magsi T (2022) Comprehensive survey on smart cities architectures and
protocols. EAI Endors Transact Smart Cities 6(18):e5. https://doi.org/10.4108/eetsc.v6i18.
2065
42. Fawaz K, Shin KG (2019) Security and privacy in the internet of things. Computer 52(4):40–49.
https://doi.org/10.1109/MC.2018.2888765
43. Cui H, Deng RH, Wang G (2019) An Attribute-based framework for secure communications
in vehicular ad hoc networks. IEEE/ACM Trans Netw 27(2):721–733. https://doi.org/10.1109/
TNET.2019.2894625
44. Utomo IS, Celine MP, Daniel JVM, Bakti AJ (2022) A systematic literature review of privacy,
security, and challenges on applying IoT to create smart home. In: Proceedings—IEIT 2022:
2022 international conference on electrical and information technology, pp 154–159. https://
doi.org/10.1109/IEIT56384.2022.9967907

45. Chowdhury R, Hakima OS, Chamseddine T, Mohamed C (2017) Attribute-based encryption for
preserving smart home data privacy. Lecture Notes in Computer Science (Including Subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10461 LNCS, pp
185–97. https://doi.org/10.1007/978-3-319-66188-9_16
46. Chukkapalli S, Laya SS, Piplai A, Mittal S, Gupta M, Joshi A (2020) A smart-farming ontology
for attribute based access control. In: Proceedings—2020 IEEE 6th intl conference on big data
security on cloud, BigDataSecurity 2020, 2020 IEEE intl conference on high performance and
smart computing, HPSC 2020 and 2020 IEEE Intl conference on intelligent data and security,
IDS 2020, pp 29–34. https://doi.org/10.1109/BigDataSecurity-HPSC-IDS49724.2020.00017
47. Ferrag MA, Shu L, Yang X, Derhab A, Maglaras L (2020) Security and privacy for green IoT-
based agriculture: review, blockchain solutions, and challenges. IEEE Access 8:32031–32053.
https://doi.org/10.1109/ACCESS.2020.2973178
48. Huan L, Xueyan L, Ruirui S, Linpeng L (2023) Privacy-preserving attribute-based educational
service recommendation in online education system. In: Proceedings of the 2022 international
conference on computer science, information engineering and digital economy (CSIEDE 2022)
103, pp 826–38. https://doi.org/10.2991/978-94-6463-108-1_92
A Short Survey Work for Lung Cancer
Diagnosis Model: Algorithms Utilized,
Challenging Issues, and Future Research
Trends

Nishat Shaikh and Parth Shah

Abstract Lung cancer is generally identified in suspected individuals because of the systemic side effects of the presence of a tumor or because of abnormal results from radiography carried out on the chest. On the basis of the lung cancer type, the
diagnosis approach can be altered. Factors like the location, metastasis presence, size
of the tumor, and the cancer type all influence the overall diagnosis and detection
process. The most efficient method of detecting lung cancer is staging it into its types. The best approaches for detecting lung cancer are utilized in diagnosis
procedures in order to enhance the disease detection sensitivity and to avoid unnec-
essary invasion techniques. Since lung cancer causes more deaths in both men and
women, the efficient diagnosis method for detecting lung cancer has to be known.
To overcome all the limitations in conventional lung cancer detection and diagnosis
framework, a detailed review of traditional lung cancer diagnosis systems is carried
out in this work. In the primary section, the basic steps and procedures involved
in the conventional lung cancer detection and diagnosis frameworks are provided.
Following this, a detailed survey on the conventional lung cancer diagnosis system
is done. A short chronological evaluation is then implemented to evaluate the time-
line of the lung cancer diagnosis system. Using literature survey, the methodologies
used in conventional works are identified and grouped. Consequently, the datasets
adopted for testing and training these systems are studied. The performance measures used in analyzing classical lung cancer diagnosis systems are then investigated. The limitations and advantages of conventional lung cancer diagnosis systems
are then classified. Finally, the research gaps and the futuristic direction are given.

Keywords Lung cancer diagnosis · Pulmonary nodule detection · Deep learning ·


Advantages · Limitations · Implementation tool · Dataset · Performance measures

N. Shaikh (B) · P. Shah


Smt. K. D. Patel Department of Information Technology, Chandubhai S. Patel Institute of
Technology (CSPIT), Faculty of Technology and Engineering (FTE), Charotar University of
Science and Technology (CHARUSAT), Changa, Gujarat, India
e-mail: nishatshaikh.it@charusat.ac.in
P. Shah
e-mail: parthshah.ce@charusat.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 359
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_27
360 N. Shaikh and P. Shah

1 Introduction

Lung cancer is the leading reason for cancer-based deaths in individuals, based
on World Cancer Report [1]. When compared to all other cancer forms, lung
cancer is different since it has the highest fatality and incidence rate [2]. Sadly,
the detection and diagnosis of lung cancer are usually made at a very late stage, which has an impact on the effectiveness of the treatment provided. Few patients who suffer from lung cancer survive five years following diagnosis, and this percentage could rise to 49% when early detection and diagnosis are carried out.
Discovery of pulmonary nodules in the early stages and permitting early treatment
can increase the likelihood of survival of the affected individual [3]. “Computed
tomography (CT)” is the screening technique that is most frequently utilized for
detecting lung cancer [4]. The CT scan is thought to be a more sensitive imaging
technique for finding lung cancer than other methods. When compared to other tech-
niques, CT has a higher sensitivity than chest X-rays and is substantially cheaper than
“Magnetic Resonance Imaging (MRI)” and “Positron Emission Tomography (PET)”
[5]. CT scans work on principle of single-line X-ray images taken from different
directions, giving a clear 3D view of the internal structures of the lungs, making
it easier to detect pulmonary nodule’s size, location, shape, and volume. However,
because of the factors like dose-effectiveness, capacity to detect previously unde-
tected pathologic abnormalities, cost-effectiveness, and ease of clinical availability,
chest radiology continues to be the most widely utilized imaging modality for chest
disorders [6].
Radiologists are overburdened by the volume of images they must examine as a
result of the rapid screening enabled by CT [7]. In order to automate the initial
screening of pulmonary nodules by identifying dangerous lesions, "computer-aided
detection (CAD)" systems were developed [8]. These allow radiologists to focus on
probable abnormalities, which can significantly increase the accuracy of pulmonary
nodule identification while effectively reducing the burden on radiologists [9]. The
existence of pulmonary nodules within the ribcage thus becomes readily identifiable
by radiologists. However, a trustworthy CAD method that can quickly and precisely
automate the detection of pulmonary nodules is still required [10]. Factors such as
the segmentation region, local features, spatial and space-oriented features, as well
as positions are crucial for the efficient operation of a CAD system [11]. Even though
these CAD solutions are designed to support radiologists in the precise identification
of pulmonary nodules, these factors affect the overall performance of the system.
Although the detection of lung cancer is made easier through a CAD system, the
essential features of this system are difficult to compute [12]. The enhanced data
handling capacity and the efficient computation process have made the deep learning
technique prominent and effective in the medical image evaluation process [13].
However, training with a limited number of data and insufficient optimization of the
network parameters can make such methods insensitive to pulmonary nodules; hence,
they may detect fewer lung cancer cases than conventional methods.

The main objectives of this survey work are listed below.
• To review the deep learning-based lung cancer diagnosis framework that has
been developed in the timeline ranging from 2014 to 2023 and to compare their
performance with conventional works.
• To evaluate the various datasets and algorithms used by the existing works and to
categorize them accordingly.
• To analyze the limitations and merits of these existing works and to investigate
the performance measures used to validate them.
• To examine the research gaps in the existing models and to suggest possible future
directions in developing an efficient lung cancer diagnosis framework.
The outline of this survey is provided as follows. In the second section, a short
literature survey on conventionally developed lung cancer diagnosis models: chrono-
logical and dataset analysis is given. In the third section, categorization of algorithms
used in lung cancer diagnosis models and their analysis of performance metrics are
provided. In the fourth section, the research gaps and future works in lung cancer
diagnosis models are discussed. Finally, the survey is summarized in the fifth section.

2 Short Literature Survey on Conventionally Developed Lung Cancer Diagnosis Models: Chronological and Dataset Analysis

2.1 Discussion on Existing Works

In 2014, Martins et al. [14] developed a technique to automatically detect pulmonary
nodules of smaller sizes (within 10 mm) with the help of pattern recognition and
image processing approaches. The pulmonary parenchyma was segmented with
the aid of a region-growing approach. The internal structures of the lungs were
segmented using Hessian matrix and Gaussian mixture techniques. The textures
were identified using “Shannon’s and Tsallis’s entropy.” Finally, classification was
carried out using a “support vector machine (SVM).” The experimental outcome
proved that this method was efficient in classifying smaller pulmonary nodules in
the lung with higher accuracy and fewer false positives.
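To make the texture measures named above concrete, the sketch below computes Shannon and Tsallis entropies from a normalized intensity histogram. This is an illustrative reading of those measures, not code from [14]; the bin count, intensity range, and entropic index q are assumptions.

```python
import math

def histogram_probs(values, bins=8, lo=0.0, hi=1.0):
    """Normalized intensity histogram; `values` are assumed to lie in [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p^q)) / (q - 1); tends to Shannon as q -> 1."""
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)
```

For a histogram that is uniform over four bins, shannon_entropy returns 2 bits and tsallis_entropy with q = 2 returns 0.75; more peaked (more homogeneous) textures score lower on both measures.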
In 2015, Bhuvaneswari and Therese [15] implemented the “genetic k-nearest
neighbor (G-KNN)" method for detecting the presence of lung cancer at an early
stage. The adopted technique was non-parametric. To overcome the long processing
time needed for detecting the presence of lung cancer from CT images, the "genetic
algorithm (GA)" was used along with the "k-nearest neighbor (KNN)" classifier to
generate effective and faster classification outcomes. The work was
executed in MATLAB software. Simulation results proved the enhanced accuracy
obtained by the G-KNN in classifying lung cancer.
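The core of such a scheme can be sketched as follows. Note the hedges: the genuine G-KNN evolves its parameters with a GA, whereas here a plain validation-accuracy search over candidate k values stands in for the genetic search, and the feature vectors and labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training samples.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def select_k(train, validation, candidates=(1, 3, 5, 7)):
    """Pick the k with the best validation accuracy (a stand-in for the GA search)."""
    def accuracy(k):
        hits = sum(knn_predict(train, x, k) == y for x, y in validation)
        return hits / len(validation)
    return max(candidates, key=accuracy)
```

With features clustered by class, `knn_predict(train, (0.5, 0.5), k=3)` returns the label of the nearby cluster; `select_k` then fixes k against a held-out validation split.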

In 2017, Wang et al. [16] suggested the fusion of hand-crafted and deep features
for the accurate detection of lung cancer with lower "false positive rates (FPR)."
Experimentation on a public dataset showed that the accuracy, sensitivity, and
specificity offered by this deep feature fusion-based model, used to aid the
CAD-based lung cancer detection system, were high, with few false positives. The
utilization of these deep fused features aided the CAD system in accurately
detecting the pulmonary nodule from the provided image input.
In 2018, Gu et al. [17] have developed a detection system for identifying the
presence of pulmonary nodules using a “multi-scale three-dimensional convolu-
tional neural network (3D-CNN).” The CT images from the “Lung Nodule Analysis
2016 (LUNA16)” dataset were gathered. The segmentation of these lung images
was carried out using a comprehensive approach. The deep feature extraction was
done using the 3D-CNN model. The detection of the pulmonary nodule was done
using the cube clustering and prediction method. The nodules of even smaller sizes
were also detected by this multi-scale approach. The tenfold validation on the imple-
mented model suggested that the sensitivity and the “competition performance metric
(CPM)” of the implemented pulmonary nodule detection model were higher, and the
FPR of the detection outcomes was lower.
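The tenfold validation protocol mentioned above amounts to partitioning the sample indices into ten disjoint folds and rotating the test fold. A minimal sketch (the shuffling seed is an illustrative assumption):

```python
import random

def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # fixed seed for reproducibility
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each of the ten iterations trains the detector on nine folds and reports sensitivity, CPM, and FPR on the held-out fold; the scores are then averaged.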
In 2018, Zhang et al. [18] have implemented combined deep learning and clas-
sical techniques for detecting pulmonary nodules. The developed model was named
NODULE. The presence of the possible pulmonary nodule was detected using
detecting the size and shape constraints with the aid of a “multi-scale Laplacian of
Gaussian (MS-LoG)” filter. The genuine pulmonary nodules were then detected using
the “3D Deep CNN (3D-DCNN)” model. The model was trained on the LUNA16
dataset. Both the presence of pulmonary nodule and its diameter were determined
accurately using this approach, which was proved by conducting experimentation.
The detection score obtained from this model was higher.
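A one-dimensional analogue of the multi-scale LoG filtering step can be sketched as follows; the scale values and the kernel truncation radius are illustrative assumptions, and the actual filter in [18] operates on 2D/3D CT data. Blobs (nodule-like structures) respond most strongly at the scale matching their size.

```python
import math

def log_kernel(sigma, radius=None):
    """Sampled 1D Laplacian-of-Gaussian kernel for one scale:
    LoG(x) = (x^2 - sigma^2) / sigma^4 * exp(-x^2 / (2 sigma^2)),
    mean-subtracted so the discrete kernel sums to zero."""
    if radius is None:
        radius = int(3 * sigma)
    xs = range(-radius, radius + 1)
    k = [(x * x - sigma * sigma) / sigma ** 4 * math.exp(-x * x / (2 * sigma * sigma))
         for x in xs]
    mean = sum(k) / len(k)
    return [v - mean for v in k]

def multiscale_response(signal, sigmas=(1.0, 2.0, 4.0)):
    """LoG responses of a 1D signal at each scale (valid region only)."""
    out = []
    for sigma in sigmas:
        k = log_kernel(sigma)
        r = len(k) // 2
        resp = [sum(k[j] * signal[i - r + j] for j in range(len(k)))
                for i in range(r, len(signal) - r)]
        out.append(resp)
    return out
```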
In 2019, Hoang et al. [19] have suggested a two-step lung cancer detection frame-
work using deep learning techniques where non-carcinoma area was eliminated in the
first step. The detection of lung cancer was carried out in the next step using another
deep learning classifier. On experimentation on a huge number of CT images, it was
confirmed that the errors in the detection of lung cancer in the lymph nodes were
minimized, along with enhanced specificity and sensitivity.
In 2019, Lakshmanaprabu et al. [20] have executed “linear discriminant analysis
(LDA)” along with “optimal deep neural network (ODNN)” for classifying lung
cancer. The dimension of features being extracted from CT images was minimized
using LDA. The classification of CT image as benign and malignant was done by
DNN model, in which the parameters were tuned with the aid of the “modified
gravitational search algorithm (MGSA).” Experimental validation showed that the
classification outcome from the implemented model was highly accurate, sensitive,
and was more specific.
In 2020, Huang et al. [21] have executed a lung cancer diagnosis framework by
combining “extreme learning machine (ELM)” with “deep transfer CNN (DTCNN).”
The DTCNN was primarily utilized for mining the essential attributes from “First
Affiliated Hospital of Guangzhou Medical University in China (FAH-GMU)” and

“Lung Image Database Consortium and Image Database Resource Initiative (LIDC-
IDRI)” CT image datasets. The classification of malignant and benign was done by
the ELM. Experimental verification suggested that the performance offered by the
fusion model was higher than conventional approaches.
In 2020, Moitra and Mandal [22] have implemented a grading and staging method
for “non-small cell lung cancer (NSCLC)” using “one-dimensional CNN (1D-
CNN).” Images from the “The Cancer Imaging Archive (TCIA)” were gathered.
The deep features were extracted using the "Maximally Stable Extremal
Regions-Speeded-Up Robust Feature (MSER-SURF)" model. The concatenated "Tumor
Node Metastasis (TNM)" staging information, along with the extracted features, was
fed to the 1D-CNN system for lung cancer stage detection. Simulation outcomes
showcased the enhanced "receiver operating characteristic (ROC)" AUC scores and
accuracy of the implemented model.
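The ROC-AUC score reported by such works can be computed directly from classifier scores via the rank-sum (Mann-Whitney) formulation; the sketch below is a generic implementation, not code from [22]:

```python
def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) statistic.
    `labels` are 0/1 ground truths; higher `scores` mean "more likely positive"."""
    pairs = sorted(zip(scores, labels))          # ascending by score
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):                        # average 1-based ranks over ties
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        for t in range(i, j):
            ranks[t] = (i + j + 1) / 2
        i = j
    n_pos = sum(y for _, y in pairs)
    n_neg = len(pairs) - n_pos
    rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUC of 1.0 means every malignant case scores above every benign one; 0.5 corresponds to chance-level ranking.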
In 2020, Li et al. [23] have suggested a lung cancer identification framework. The
features from the “Japanese Society of Radiological Technology (JSRT)” dataset
were extracted using the multi-resolution convolution model. The classification of
lung cancer was carried out using a fusion network. The FPR, CPM, and AUC of the
suggested fusion model were much better than those of the existing lung cancer
detection models.
In 2020, Toğaçar et al. [24] have executed a lung cancer detection framework by
utilizing distinct transfer learning approaches like VGG-16, LeNet, and AlexNet. The
features from the collected CT images were extracted using CNN. Image augmenta-
tion was adopted to enhance the detection rate. The extracted features were optimally
selected using the “minimum redundancy maximum relevance (mRMR)” method.
The features obtained were given to several machine learning classifiers. Out of
several models, the KNN model provided better classification accuracy in detecting
lung cancer from CT images.
In 2021, Silva et al. [25] investigated the correlation between various attributes in
order to examine the “epidermal growth factor receptor (EGFR)” with the aid of a 2D
“region of Interest (ROI)” from the CT images. The CT images were reconstructed
using a convolutional autoencoder. The features were extracted using the encoder
unit. The examination of the EGFR was carried out by arranging the classifier on
the upper portion of the feature extractor. The suggested model attained a better
prediction rate by identifying the essential biomarkers from the CT images.
In 2021, Sori et al. [26] have suggested a denoising-detection model for lung
cancer in an automatic manner. The preprocessing of gathered images was carried
out using “residual denoising network (DR-Net).” The preprocessed images were
provided as input to the two paths CNN. The concatenation of the global and the
local features was carried out by the two paths in the developed model, and the
detection was carried out using the CNN. A discriminant correlation validation
proved that the suggested denoising-detection model was capable of eliminating
noise from the images, worked even for inconsistent nodules, and balanced the
overall receptive field.
In 2023, Pradhan et al. [27] have utilized the healthcare record of the patients to
predict lung cancer using “recurrent neural network (RNN).” Two datasets were used

to collect the CT image data. The features from CT images were extracted using “t-
distributed stochastic neighbor embedding (t-SNE)” and “principal component anal-
ysis (PCA).” The classification was done using RNN, in which the parameters were
tuned with the “Self-Adaptive Sea Lion optimization (SA-SLnO)” algorithm. The
experimental results showed that the performance of the suggested SA-SLnO-RNN
was much better than that of conventional approaches in classifying lung cancer,
with a minimum mean square error (MSE).
In 2023, Navaneethakrishnan et al. [28] have generated a lung cancer diag-
nosis framework using “Bat Deer Hunting Optimization Algorithm-based DCNN
(BDHOA-DCNN).” Initially, the segmentation of the CT images was carried out.
Then classification of the nodule and the lobes was executed by the BDHOA-based
DCNN model. The performance provided by this BDHOA-based DCNN model was
higher than the conventional diagnosis systems.
In 2023, Tasmin et al. [29] resolved the interpretability issues in the machine
learning model for detecting lung cancer using the “explainable machine learning
(XML)" technique. The "random oversampling (ROS)" technique was used to balance
the classes in the gathered dataset. The prediction outcome was explained using
the "SHapley Additive exPlanation (SHAP)" technique. An outstanding performance
was provided by the implemented XML, with higher transparency in detecting lung
cancer. The efficiency of the developed XML was demonstrated by implementing a
mobile application.
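Random oversampling of this kind simply duplicates minority-class samples until the class counts match. A minimal sketch, with the random seed as an illustrative assumption:

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=42):
    """Duplicate minority-class samples at random until every class reaches
    the majority-class count. Returns new (samples, labels) lists."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    out_s, out_y = list(samples), list(labels)
    for y, members in by_class.items():
        for _ in range(target - counts[y]):    # top up the minority classes
            out_s.append(rng.choice(members))
            out_y.append(y)
    return out_s, out_y
```

The original samples are preserved and only copies are appended, so a downstream classifier sees a balanced label distribution.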
In 2023, Wankhade and Vigneshwari [30] implemented a lung cancer detection
framework using a “hybrid neural network (HNN).” The features from the CT images
gathered from the “LIDC-IDRI” dataset were extracted using the DNN model. The
diagnosis of lung cancer as malignant and benign was executed using the 3D-CNN
model. Accurate diagnosis of lung cancer at the beginning stage was possible with
this technique.
In 2023, Mithun et al. [31] have developed three models, namely the “bidirec-
tional long short-term memory (Bi-LSTM),” “Bi-LSTM with dropout,” and “bidirec-
tional encoder representations from transformers (BERT)” to detect the lung cancer
from clinical cohorts. The "Medical Information Mart for Intensive Care
(MIMIC)-III" dataset was used for analyzing the working of the suggested model.
With minor oversampling, the performance of all three models has been enhanced.
In 2023, Ali et al. [32] utilized an “Ensemble 2D CNN (E-2D-CNN)” model to
detect the presence of lung cancer from the LUNA16 CT images. The LUNA16 CT
images were gathered. The gathered images were given as input to the E-2D-CNN,
which was made by combining two or more CNN structures. The classification of the
CT images into cancerous and non-cancerous types was done using this E-2D-CNN
with more accuracy.
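The ensembling idea of combining two or more networks can be sketched generically as a majority vote over base classifiers; the callables below are stand-ins for trained CNNs, not the actual models from [32]:

```python
from collections import Counter

def ensemble_predict(models, x):
    """Majority vote across base classifiers; ties go to the earlier model's vote.
    Each model is any callable mapping an input to a class label."""
    votes = [m(x) for m in models]
    return Counter(votes).most_common(1)[0][0]
```

With three base networks, two "cancerous" votes outvote one "non-cancerous" vote, which is why an odd number of base classifiers is the usual choice.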
In 2023, Alsadoon et al. [33] examined various research articles to obtain the
conditions for evaluating the real-time deep learning-based lung cancer detection
framework. Then, a “Data, Feature Selection, Classification, and View (DFCV)”-
based deep learning model was implemented for detecting lung cancer on a real-time
basis.

Fig. 1 Chronological review on the lung cancer diagnosis models

2.2 Chronological Review

The year-wise contribution of the deep learning-based lung cancer diagnosis frame-
works is provided in Fig. 1. Analyzing the figure shows that more works have
contributed toward deep learning-based lung cancer diagnosis frameworks in recent
years than in the past decade, and these recent papers utilize advanced deep
learning approaches for image analysis in order to diagnose lung cancer.

2.3 Dataset Analysis

The data sources from which the necessary CT images are gathered for diagnosing
lung cancer using deep learning methods are provided in Table 1. From analyzing
the table, it is observed that most of the works have adopted the LUNA16 dataset
since it is an open-source database and has more CT images.

3 Categorization of Algorithms Used in Lung Cancer Diagnosis Models and Their Analysis of Performance Metrics

3.1 Algorithmic Classification

The techniques used for developing an efficient pulmonary nodule detection and lung
cancer diagnosis system are grouped and are shown in Fig. 2.

Table 1 Dataset used in the traditional deep learning-based lung cancer diagnosis framework
Author Dataset
Martins et al. [14] LIDC dataset
Bhuvaneswari and Therese [15] Public dataset
Wang et al. [16] JSRT dataset
Gu et al. [17] LUNA16
Zhang et al. [18] LUNA16
Hoang et al. [19] Data from Nagasaki University and Kameda General Hospital
Lakshmanaprabu et al. [20] 50 low-dosage CT images
Huang et al. [21] LIDC-IDRI, FAH-GMU
Moitra and Mandal [22] TCIA
Li et al. [23] JSRT
Toğaçar et al. [24] TCIA
Silva et al. [25] LIDC-IDRI, NSCLC-Radiogenomic dataset
Sori et al. [26] Kaggle Data Science Bowl dataset
Tasmin et al. [29] Lung cancer dataset from Kaggle
Wankhade and Vigneshwari [30] LIDC-IDRI, FAH-GMU
Mithun et al. [31] MIMIC-III dataset
Ali et al. [32] LUNA16

Fig. 2 Categorization of algorithms used for lung cancer diagnosis



CNN: Due to their simplicity and ability to handle huge volumes of data in less
time, CNN approaches are widely utilized. The most widely used CNN types include
plain CNN [19, 21, 23, 26, 28], 3D-CNN [17, 18, 30], 1D-CNN [22], AlexNet [24],
and 2D-CNN [32].
Machine learning: The commonly used machine learning approaches for lung
cancer detection are SVM [14] and KNN [15, 24].
RNN: In order to handle sequential data, RNNs are adopted. Models such as
RNN [27] and Bi-LSTM [31] are used for lung cancer diagnosis purposes.
Miscellaneous: Other techniques used less often in lung cancer detection are
feature fusion [16], DNN [20], ELM [21], autoencoder [25], XML [29], HNN [30],
BERT [31], and DL [33]. The feature extraction methods used for efficiently
extracting attributes from the CT images are the Gaussian mixture model
(GMM) [14], LDA [20], SURF [22], and PCA [23, 27, 29]. These techniques are
adopted in lung cancer detection and pulmonary nodule segmentation works.

3.2 Analysis of Performance Metrics

The performance measures used to validate the traditional lung cancer diagnosis
frameworks that have been developed using deep learning techniques are shown and
categorized as given in Table 2. The accuracy, sensitivity, specificity, F1-score, and
FPR are the most crucial performance measures that are utilized for evaluating the
working of the implemented lung cancer diagnosis framework.
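All of these measures derive from the binary confusion matrix; the sketch below computes them for a two-class setting, where labeling malignant as 1 and benign as 0 is an assumed convention (the code also assumes both classes occur in the evaluation set, so no denominator is zero):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (TPR), specificity, FPR, precision, and F1-score
    from binary ground truth / prediction lists (1 = malignant, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sens = tp / (tp + fn)          # sensitivity / recall
    spec = tn / (tn + fp)          # specificity
    prec = tp / (tp + fp)          # precision
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sens,
        "specificity": spec,
        "fpr": 1 - spec,
        "precision": prec,
        "f1": 2 * prec * sens / (prec + sens),
    }
```

In a diagnosis setting, sensitivity (the fraction of malignant cases caught) is usually the critical measure, while FPR bounds the unnecessary follow-ups.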

3.3 Pros and Cons of Lung Cancer Classification Techniques

The advantages and disadvantages of the existing lung cancer diagnosis framework
using deep learning techniques are listed below in Table 3.

4 Research Gaps and Challenges on the Performance of Lung Cancer Diagnosis Models

A pulmonary nodule, on the anatomical view, is a tiny, rounded tissue growth that
often cannot be identified until symptoms appear once it reaches the advanced
stages. Early detection of lung cancer is possible by using chest X-ray, which
is capable of revealing tumor growth even at small diameters (less than about
3 cm). In an X-ray, the posteroanterior view of the subject's lungs is obtained
by viewing from the front to the rear. The chest X-ray is a simple and

Table 2 Performance metrics used for evaluating the traditional deep learning-based lung cancer diagnosis frameworks
Author Accuracy Specificity Sensitivity FPR TPR F1-score Precision CPM MSE AUC Miscellaneous measures
Martins et al. [14] ✔ ✔ ✔ ✔ ✔ – – – – – –
Bhuvaneswari and Therese [15] ✔ – – – – – – – – – Execution time
Wang et al. [16] ✔ ✔ ✔ ✔ – ✔ – – – – –
Gu et al. [17] – – ✔ ✔ – – – ✔ – – –
Zhang et al. [18] ✔ – – – ✔ ✔ – ✔ – – Detection score
Hoang et al. [19] – ✔ ✔ ✔ – – – – – – –
Lakshmanaprabu et al. [20] ✔ ✔ ✔ – – – – – – – PPV and NPV
Huang et al. [21] ✔ ✔ ✔ – – – – – – ✔ Testing time
Moitra and Mandal [22] ✔ – – – – – – – – ✔ ROC
Li et al. [23] – – – ✔ – – – ✔ – ✔ –
Toğaçar et al. [24] ✔ ✔ ✔ – – ✔ ✔ – – – –
Silva et al. [25] – – – – – – – – ✔ – Mean and standard deviation
Sori et al. [26] ✔ ✔ ✔ – – – – – – – –
Pradhan et al. [27] ✔ – ✔ ✔ – ✔ ✔ – ✔ – FNR
Navaneethakrishnan et al. [28] ✔ ✔ ✔ – – – – – – – –
Tasmin et al. [29] ✔ – ✔ – – ✔ ✔ – – – Error rate
Mithun et al. [31] – – – – – ✔ – – – – –
Ali et al. [32] ✔ – ✔ – – – ✔ – – – –

Table 3 Advantages and limitations of existing deep learning-based lung cancer diagnosis frameworks

Martins et al. [14] (SVM, Gaussian mixture model, Tsallis entropy). Advantages: pulmonary nodules of even smaller sizes can be detected accurately. Limitations: juxta-pleural nodules cannot be detected by this approach.

Bhuvaneswari and Therese [15] (G-KNN). Advantages: assists in the early detection of lung cancer. Limitations: more training CT images of patients are required for the effective operation of this model.

Wang et al. [16] (CAD, feature fusion). Advantages: suitable for detecting lung cancer using small datasets. Limitations: this technique is limited by the clavicle.

Gu et al. [17] (multi-scale 3D-CNN). Advantages: requires very few parameters for providing accurate detection outcomes. Limitations: generates more false positives during classification.

Zhang et al. [18] (MS-LoG, 3D-DCNN). Advantages: more accurate in detecting lung cancer. Limitations: only detection from CT images is possible; the method does not detect lung cancer from clinical reports.

Hoang et al. [19] (CNN). Advantages: efficient detection of lung cancer from the lymph tissues. Limitations: the implementation software adopted and the smaller dataset used degrade the performance of the developed model.

Lakshmanaprabu et al. [20] (DNN, LDA, MGSA). Advantages: faster in classifying lung cancer; the classification outcomes are accurate; the overall classification process is simple and inexpensive. Limitations: only the classification of low-dosage CT images is carried out effectively.

Huang et al. [21] (ELM, DTCNN). Advantages: the overall cost required by this classification model is low. Limitations: the accuracy and robustness of this model are not satisfactory.

Moitra and Mandal [22] (MSER-SURF, 1D-CNN). Advantages: lightweight, needs less computational time, and requires fewer resources. Limitations: needs supervision and more labeled data for training.

Li et al. [23] (multi-resolution patch CNN, PCA). Advantages: generates accurate results when the false positives are maintained within 0.2; robust in nature; can be implemented for real-time applications. Limitations: fails when provided with large CT image datasets.

Toğaçar et al. [24] (AlexNet, KNN). Advantages: requires much less time, and the detection outcomes are more accurate. Limitations: the approach is not generalized.

Silva et al. [25] (convolutional autoencoder). Advantages: the essential information for detecting the pulmonary nodule can be determined. Limitations: the interpretability of the feature space cannot be preserved.

Sori et al. [26] (CNN, DR-Net). Advantages: can be applied to any disease detection task with unlabeled data; requires less time due to the use of a "graphical processing unit (GPU)" instead of a "central processing unit (CPU)". Limitations: needs more CT images for training.

Pradhan et al. [27] (SLnO-RNN, PCA, t-SNE). Advantages: sequential CT images can be handled; only the essential and crucial features are highlighted, making the entire process more effective and less time-consuming. Limitations: affected by vanishing gradients; the training process is highly complicated.

Navaneethakrishnan et al. [28] (BDHOA-DCNN). Advantages: a huge volume of data can be handled without much complexity. Limitations: the positional and directional details of the pulmonary nodule cannot be determined.

Tasmin et al. [29] (XML, PCA). Advantages: helps in interpreting the machine learning model more effectively; provides transparency of the classification. Limitations: deep learning models are not incorporated, which slightly degrades the classification accuracy.

Wankhade and Vigneshwari [30] (HNN, 3D-CNN). Advantages: less need for additional hyperparameters; efficient in handling 3D images. Limitations: the memory requirement is higher.

Mithun et al. [31] (Bi-LSTM, Bi-LSTM with dropout, BERT). Advantages: can efficiently detect lung cancer even from unbalanced clinical reports with minor oversampling. Limitations: reports with thoracic regions are not detected.

Ali et al. [32] (E-2D-CNN). Advantages: the accuracy of lung cancer detection is higher. Limitations: not applicable to 3D spatial information detection.

Alsadoon et al. [33] (DFCV-DL). Advantages: the attributes can be automatically learned. Limitations: affected by overfitting issues.

inexpensive screening approach. However, the malignancy becomes visible in X-rays
only as it grows, so an X-ray alone is not a reliable approach. The stress faced by
radiologists is lessened by the introduction of a CAD-based screening approach,
which is employed to generate good and essential attributes for detecting lung
cancer. However, it is not yet competitive. The development of a trustworthy CAD tech-
nique that is capable of automatically and precisely diagnosing the pulmonary nodule
is still under development. The CAD system faces disadvantages like the separa-
tion of the juxta-nodules for providing segmentation outcome, utilization of sliding
window for pulmonary nodule detection, the reliance on hand-crafted features, and
the elimination of inter-slice relationships between the features. As mid and high-
level representations of the images can be learned more effectively by deep learning
approaches, they have attracted a lot of attention for feature computing tasks in recent
years.
Machine learning techniques have been utilized in many complicated tasks lately. An
efficient lung cancer detection and diagnosis framework can be implemented using
machine learning. However, the attributes required by this approach still have to be
hand-crafted, which requires expert radiologist knowledge. Hence, an entirely
automated lung cancer detection framework is still

under consideration. An entirely end-to-end lung cancer diagnosis framework is
made possible by the advanced version of machine learning called deep learning.
The enhanced data handling capacity and the efficient computation process have
made the deep learning technique prominent and effective in the medical image
evaluation process, and a variety of CNNs have been developed for image
classification tasks because of the advancement in these deep learning approaches.
However, the CNN-based image classification approach needs more precise and
good-quality features, a vast volume of image data for training, and enhanced input
image quality. Training with a limited number of data and insufficient optimization
of the network parameters can make such methods insensitive to pulmonary nodules,
so they detect fewer lung cancer cases than conventional methods. Even though
high-level features are obtained using transfer learning, these attributes may be
irrelevant to the objective of evaluating medical images. As the distance between
the target and the base task grows, the transferability of attributes in these
techniques diminishes, which is the cause of this approach's inferior performance
in high-level attribute-based medical image analysis.
and challenges in the existing lung cancer diagnosis frameworks is given below:
• Machine learning-based lung cancer diagnosis frameworks suffer from overfitting
issues.
• Machine learning-based lung cancer diagnosis frameworks are affected by noisy
input data.
• Machine learning-based lung cancer diagnosis models require manually extracted
features for accurately detecting the presence of lung cancer, thus making the
cancer diagnosis model human-dependable.
• Deep learning-based lung cancer diagnosis models are not very sensitive in
detecting several variations of lung cancers.
• The processing time required by the deep learning-based lung cancer diagnosis
models is higher.
• Deep learning models require a huge volume of high-quality data for training the
model for effective diagnosis of lung cancer.

5 Conclusion

An investigation of various classical lung cancer detection approaches was made
in this survey work to determine an effective lung cancer diagnosis model. In the
initial phase, some of the basic processes and an introduction concerning lung
cancer diagnosis were discussed. Further, the literature survey, covering diverse
methodologies, was explained in detail. Then, a chronological analysis was done to
determine the year-wise contribution of lung cancer diagnosis systems.
Consequently, the datasets and algorithms used across the lung cancer diagnosis models were
grouped and explained. Some of the advancements, limitations, and execution paths

were also discussed. Further, the diverse performance metrics utilized for the valida-
tion of the suggested lung cancer detection model were listed. Then, the research gap
and limitations of the implemented approaches were addressed, which led to consid-
eration for future works. While considering this survey, it was identified that the RNN
models and CNN models are effective in detecting lung cancer even from unbalanced
data. Thus, variants of RNN are suggested for building effective lung cancer
diagnosis frameworks in future work. Machine learning approaches, even though
they are capable of detecting smaller-sized lung nodules, are not recommended
for futuristic lung cancer detection frameworks because of their requirement for
manually extracted features. So, advanced deep learning models utilizing CNN or
variants of RNN are recommended for futuristic lung cancer detection and
diagnosis models, based on the analysis in this research work.

References

1. Vani Rajasekar MP, Vaishnnave S, Premkumar VS, Rangaraaj V (2023) Lung cancer disease
prediction with CT scan and histopathological images feature analysis using deep learning
techniques. Res Eng 18:101111
2. Bokefode J, PandurangaRao MV, Komarasamy G (2022) Ensemble deep learning models for
lung cancer diagnosis in histopathological images. Procedia Comput Sci 215:471–482
3. Javier CM, Alejandro BB, Manuel DM, Manuel RP, Luis MS, José MRC (2022) Non-small
cell lung cancer diagnosis aid with histopathological images using Explainable Deep Learning
techniques. Comput Methods Progr Biomed 226:107108
4. Wang W, Charkborty G (2021) Automatic prognosis of lung cancer using heterogeneous deep
learning models for nodule detection and eliciting its morphological features. Appl Intell
51:2471–2484
5. Xi W, Hao C, Caixia G, Huangjing L, Qi D, Efstratios T, Qitao H, Muyan C, Pheng-Ann H
(2020) Weakly supervised deep learning for whole slide lung cancer image analysis. IEEE
Trans Cybernet 50(9):3950–3962
6. Ozdemir O, Russell RL, Berlin AA (2020) A 3D probabilistic deep learning system for detection
and diagnosis of lung cancer using Low-Dose CT scans. IEEE Trans Med Imaging 39(5):1419–
1429
7. Hussein S, Kandel P, Bolan CW, Wallace MB, Bagci U (2019) Lung and pancreatic tumor char-
acterization in the deep learning era: novel supervised and unsupervised learning approaches.
IEEE Trans Med Imaging 38(8):1777–1787
8. Mohamed Shakeel P, Burhanuddin MA, Mohamad Ishak D (2019) Lung cancer detection from
CT image using improved profuse clustering and deep learning instantaneously trained neural
networks. Measurement 145:702–712
9. Yutong X, Jianpeng Z, Yong X, Fulham M (2018) Fusing texture, shape and deep model-learned
information at decision level for automated classification of lung nodules on chest CT. Inform
Fus 42:102–110
10. Myron GB, Nik S, Sjors GJG, In ‘t V, Adrienne V, Mirte M, Anna-Larissa N et al (2017) Swarm
intelligence-enhanced detection of non-small-cell lung cancer using tumor-educated platelets.
Cancer Cell 32(2):238–252
11. Sun W, Zheng B, Qian W (2017) Automatic feature learning using multichannel ROI based
on deep structured algorithms for computerized lung cancer diagnosis. Comput Biol Med
89:530–539
374 N. Shaikh and P. Shah

12. Shenglin M, Wenzhe W, Bing X, Shirong Z, Haining Y, Hong J, Wen M, Xiaoliang Z, Xiaoju W
(2016) Multiplexed serum biomarkers for the detection of lung cancer. EBioMedicine 11:210–
218
13. Eunmi B, Dong-Kyu C, Eun Joo S (2013) Simultaneous detection of multiple microRNAs for
expression profiles of microRNAs in lung cancer cell lines by capillary electrophoresis with
dual laser-induced fluorescence. J Chromatograph A 1315:195–199
14. Alex MS, Antonio O, Aristófanes CS, Anselmo C, Rodolfo AN, Marcelo G (2014) Automatic
detection of small lung nodules in 3D CT data using Gaussian mixture models, Tsallis entropy
and SVM. Eng Appl Artific Intell 36:27–39
15. Bhuvaneswari P, Brintha Therese A (2015) Detection of cancer in lung with K-NN classification
using genetic algorithm. Procedia Mater Sci 10:433–440
16. Changmiao W, Ahmed E, Jianhuang W, Qingmao H (2017) Lung nodule classification using
deep feature fusion in chest radiography. Computer Med Imaging Graph 57:10–18
17. Yu G, Xiaoqi L, Lidong Y, Baohua Z, Dahua Y, Ying Z, Lixin G, Liang W, Tao Z (2018)
Automatic lung nodule detection using a 3D deep convolutional neural network combined
with a multi-scale prediction strategy in chest CTs. Comput Biol Med 103:220–231
18. Junjie Z, Yong X, Haoyue Z, Yanning Z (2018) NODULe: Combining constrained multi-scale
LoG filters with densely dilated 3D deep convolutional neural network for pulmonary nodule
detection. Neurocomput 317:159–167
19. Pham HHN, Futakuchi M, Bychkov A, Furukawa T, Kuroda K, Fukuoka J (2019) Detection of
lung cancer lymph node metastases from whole-slide histopathologic images using a two-step
deep learning approach. Am J Pathol 189(12):2428–2439
20. Lakshmanaprabu SK, Sachi Nandan M, Shankar K, Arunkumar N, Gustavo R (2019) Optimal
deep learning model for classification of lung cancer on CT images. Fut Gen Comput Syst
92:374–382
21. Huang X, Lei Q, Xie T, Zhang Y, Hu Z, Zhou Q (2020) Deep transfer convolutional neural
network and extreme learning machine for lung nodule diagnosis on CT images. Knowl Based
Syst 204:106230
22. Moitra D, Mandal RK (2020) Classification of non-small cell lung cancer using one-
dimensional convolutional neural network. Expert Syst Appl 159:113564
23. Xuechen L, Linlin S, Xinpeng X, Shiyun H, Zhien X, Xian H, Juan Y (2020) Multi-resolution
convolutional networks for chest X-ray radiograph based lung nodule detection. Artific Intell
Med 103:101744
24. Toğaçar M, Ergen B, Cömert Z (2020) Detection of lung cancer on chest CT images using
minimum redundancy maximum relevance feature selection method with convolutional neural
networks. Biocybern Biomed Eng 40(1):23–39
25. Silva F, Pereira T, Morgado J, Frade J, Mendes J, Freitas C, Negrão E, Flor De Lima B, Correia
Da Silva M, Madureira MJ, Ramos I, Hespanhol V, Costa JL, Cunha A, Oliveira HP (2021)
EGFR assessment in lung cancer CT images: analysis of local and holistic regions of interest
using deep unsupervised transfer learning. IEEE Access 9:58667–58676
26. Worku JS, Jiang F, Arero WG, Shaohui L, Demissie JG (2021) DFD-Net: lung cancer detection
from denoised CT scan image using deep learning. Front Comput Sci 15:152701
27. Pradhan K, Chawla P, Rawat S (2023) A deep learning-based approach for detection of
lung cancer using self adaptive sea lion optimization algorithm (SA-SLnO). J Ambient Intell
Humaniz Comput 14:12933–12947
28. Navaneethakrishnan M, Vijay Anand M, Vasavi G, Vasudha Rani V (2023) Deep fuzzy segnet-
based lung nodule segmentation and optimized deep learning for lung cancer detection. Patt
Anal Appl 26:1143–1159
29. Rikta ST, Mohammad Mohi Uddin K, Biswas N, Mostafiz R, Sharmin F, Samrat Kumar D
(2023) XML-GBM lung: an explainable machine learning-based application for the diagnosis
of lung cancer. J Pathol Inform 14:100307
30. Shalini W, Vigneshwari S (2023) A novel hybrid deep learning method for early detection of
lung cancer using neural networks. Healthcare Anal 3:100195

31. Mithun S, Ashish Kumar J, Umesh BS, Vinay J, Nilendu CP, Rangarajan V, Dekker A, Sander
P, Inigo B, Wee L (2023) Development and validation of deep learning and BERT models for
classification of lung cancer radiology reports. Inform Med Unlock 40:101294
32. Asghar AS, Mahmood Malik HA, AbdulHafeez M, Abdullah A, Zaeem AB (2023) Deep
learning ensemble 2D CNN approach towards the detection of lung cancer. Sci Rep 13:2987
33. Abeer A, Ghazi AN, Ahmed HO, Belal A, Majdi M, Md Rafiqul I (2023) DFCV: a framework
for evaluation deep learning in early detection and classification of lung cancer. Multimedia
Tools Appl
Influence of Music on Brainwave-Based
Stress Management

Neelum Dave and Shreya Dave

Abstract The need to thrive on performance and perfection in people living in
this era has led to an increase in stress. Stress affects both our physical and mental
health, and the damage it causes is often underestimated. Stress can harm us in many
ways, causing health implications such as hypertension, diabetes, headaches, and
immunodeficiency. Thus, we must detect stress at an early stage, and it is then
necessary to take action to reduce this stress in order to lead a healthy life. Stress
management needs to be made an integral part of our lifestyle. Our study aims at
stress detection and the implementation of ways to reduce stress. For our study, we
considered subjects from three categories, namely housewives, college students, and
professors. The stress levels of the subjects were measured under different circum-
stances, and music therapy was used to reduce the stress.

Keywords Brainwaves · Stress · Stress relief · Spotify · Python · MATLAB · EEG

1 Introduction

1.1 Literature Survey

Stress is a normal reaction that the body experiences when changes occur, resulting
in physical, emotional, and intellectual responses. Any thought or event which makes
one angry, frustrated, or nervous can cause stress. It is our body’s reaction to chal-
lenges or demands. It can be both positive and negative. When stress helps us avoid
danger, focus on studies, make a decision, or meet deadlines, it is considered positive.
When experienced for a long time, however, stress can be harmful to the body.

N. Dave (B) · S. Dave
Dr. D.Y. Patil Institute of Technology, Pimpri, Pune, India
e-mail: neelum.dave@dypvp.edu.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 377
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_28

Stress can generate health issues such as hypertension, cardiovascular disease,
diabetes, obesity, depression, or anxiety. Further, it can cause abortions in pregnant
women, lead to weight loss or gain, and even cause headaches, a stiff neck, and
forgetfulness [1, 2].
There are mainly two types of stress:
Acute Stress: This kind of stress does not last for long. A person usually undergoes
this kind of stress when working on something new and exciting. Everyone faces
acute stress at some point in their life.
Chronic Stress: This kind of stress lasts for extended durations of time. Chronic
stress can be experienced when a person is in trouble. This kind of stress can be
experienced for weeks or months. If not managed at the proper time, it can lead to
health problems as discussed above.
EEG can be used to detect different stress levels [3]. SVM has been used in the
past to detect stress levels using the EEG [4]. In the past, various machine learning
techniques have been used to detect stress such as SVM, KNN, and random forest
[5, 6].
Various other methods can be used to detect stress such as an electrocardiogram
(ECG), blood volume pulse (BVP), body temperature (TEMP), respiration (RESP),
an electromyogram (EMG), and electrodermal activity (EDA) [7].
Our study aims at stress detection using EEG signals. The next section discusses
the detection of stress in detail.

1.2 Detection of Stress

Stress can be detected by measuring the frequency of brainwaves.

The human brain is one of the most complex organs in the human body. It is
made up of more than 100 billion neurons that communicate with each other. The
human brain is an electrochemical organ that can generate around 10 watts of
power [8]. Brainwaves are the language of the brain.
The electroencephalogram (EEG) technique can be used for measuring the
brainwave impulses.
The EEG is used to measure brain waves of different frequencies within the brain.
To detect and record the electrical impulses within the brain, the electrodes are placed
on the frontal lobes. The number of times a wave repeats itself per second is known as its frequency.
The raw EEG has usually been described in terms of frequency bands. These
include delta (1–3 Hz), theta (4–7 Hz), alpha (8–12 Hz), beta (13–30 Hz), and
gamma (>30 Hz) [9]. Each frequency band depicts a different state of mind. Delta
waves are associated with deep sleep or dreamless sleep while theta waves indicate
tiredness. Alpha waves are present when the person is happy and relaxed. Beta waves
can be associated with a more alert and active state. Gamma waves are present in the
case of a very active and highly alert state. The presence of gamma waves can lead
to depression and hyper-stress. Beta waves can indicate stress [10].
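The band boundaries and state mapping above can be restated as a small helper; a sketch in Python (the thresholds simply restate the ranges quoted in this section):

```python
def eeg_band(freq_hz):
    """Map a dominant EEG frequency (Hz) to its band, using the ranges
    quoted above: delta 1-3, theta 4-7, alpha 8-12, beta 13-30, gamma >30."""
    if freq_hz < 4:
        return "delta"   # deep, dreamless sleep
    if freq_hz < 8:
        return "theta"   # tiredness
    if freq_hz < 13:
        return "alpha"   # happy and relaxed
    if freq_hz <= 30:
        return "beta"    # alert/active; can indicate stress
    return "gamma"       # highly alert; hyper-stress risk
```

For instance, the average student stress frequency reported later (about 13.5 Hz) falls in the beta band.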

Brainwaves can be used to understand a person's state of mind. They can also be used
to decipher what a person is thinking [11]. Today, technology has advanced to the
point where brainwaves can even be used to control machines [12].
Our study focuses on the detection of the presence of beta waves, measuring stress
levels from the brainwaves, and further reducing stress.

1.3 Motivation

The sense of pressure and strain in the human body is known as stress, according
to researchers. Acute stress can be constructive as it may inspire us to accomplish
our work on time, but on the other hand, chronic stress may cause depression. If
left unnoticed, it may seriously affect the body [13, 14]. When a person is stressed,
their body releases hormones called cortisol and adrenaline. It can cause the body
to tighten up and blood pressure to rise, and it may also lead to trouble breathing.
Chronic stress can damage our immune system, reproductive system, and digestive
tract to a great extent. It can cause strokes. The aging process can be accelerated due
to stress in human beings [15, 16]. An elevated heart rate and lack of sleep are
symptoms of stress [17, 18]. Many steps can be taken to reduce stress, such
as listening to music, reading books, yoga, exercise, and self-care.

2 Methodology

This study uses the EEG Click board, developed by MikroElektronika, to measure
the brainwaves. The layout of the board can be seen in Fig. 1. This board allows the
monitoring of brainwaves: it consists of a high-sensitivity circuit that amplifies
minute electrical signals from the brain, which the host MCU can then sample. It is
compatible with multiple MCUs, such as Arduino boards and ARM, PIC, and Kinetis
processors, among others.
The EEG Click board is built around the INA114 instrumentation amplifier, which
provides a gain of up to 10^5, laser-trimmed offset voltage, low noise, and a very
good common-mode rejection ratio. The gain on the signals is set to about 12 times,
and an MCP609 op-amp offers further amplification and filtering. This board provides
an easier and cheaper solution than those present in the market, and it is compact in
size.
In this research, we have used an Arduino Uno as the host MCU. The EEG Click
board is shown in Fig. 1.
Data Collection Steps.
Step 1: Connect the Click board to Arduino Uno using the Arduino shield.
Step 2: Attach the USB cable to the System and turn on MATLAB.
Step 3: Execute the code to start plotting the brain waves in real time.
Step 4: Compute the PSD of the collected EEG data to determine the frequency of
the signal.

Fig. 1 EEG Click board

Step 5: Once the frequency is calculated, observe whether the subject is stressed
or not.
Step 6: Save the data file for future reference as a text file or CSV file using
MATLAB.
Step 7: Based on the stress detected, play music using Python script on Spotify.
Step 8: Once the song is played, measure the brainwaves again using the above
steps.
Step 9: Calculate the % of stress reduced.
Figure 2 depicts the flowchart for the stress detection technique that has been
used.
Stress detection has been done using neural networks [19]. For the purpose of
training the system, we have collected data from 60 subjects. These subjects include
students from college and housewives. The dataset included fields such as age,
preference, and users’ perception of their current state.
Collaborative filtering has been used to predict songs for the user [8, 20].
Parameters considered for song prediction include user age and preference [21].
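The paper does not give implementation details for this step; as a hedged illustration only, user-based collaborative filtering over a hypothetical user-song rating matrix (cosine similarity between users, not necessarily the authors' exact setup) can be sketched as:

```python
import numpy as np

# Hypothetical user-song rating matrix (rows: users, cols: songs)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def recommend(user, k=1):
    """Score unrated songs for `user` by cosine-similarity-weighted
    ratings from other users (user-based collaborative filtering)."""
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[user] / (norms * norms[user])
    sims[user] = 0.0                      # exclude the user themself
    scores = sims @ ratings               # similarity-weighted vote per song
    scores[ratings[user] > 0] = -np.inf   # consider only unrated songs
    return np.argsort(scores)[::-1][:k]
```

Here `recommend(0)` suggests the unrated song most liked by users similar to user 0; a real system would also fold in the age and preference fields mentioned above.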
The frequency of the brainwaves is determined by calculating the power spectral
density (PSD) of the signal. MATLAB provides the "pwelch" function to estimate
the power spectral density of a signal.
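For readers working outside MATLAB, `scipy.signal.welch` is the closest Python analogue of `pwelch`. A sketch with a synthetic 10 Hz tone standing in for real EEG samples (the 250 Hz sampling rate is an assumption, not a figure from the paper):

```python
import numpy as np
from scipy.signal import welch

fs = 250                                   # assumed sampling rate (Hz)
t = np.arange(0, 4, 1 / fs)                # 4 s of synthetic "EEG"
x = np.sin(2 * np.pi * 10 * t)             # 10 Hz alpha-band tone

freqs, psd = welch(x, fs=fs, nperseg=fs)   # Welch power spectral density
dominant = freqs[np.argmax(psd)]           # frequency at the PSD peak
```

The frequency at the PSD peak is then compared against the bands described in Sect. 1.2 to decide whether the subject is stressed.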

3 Experimental Results

For our experiment, we have collected data from 25 subjects for testing purposes.
As shown in Fig. 3, the electrodes are connected to the EEG board, which is
mounted on the Arduino Uno board, and are placed at three points on the subject's
forehead.
In the first part of the experiment, brainwaves were collected from 25 students
who were about to take exams. These were collected before their exams, after their

Fig. 2 Flow of stress detection

exams, and after listening to music. Further, each subject's age and preferred type
of music were noted and used for music prediction [11, 22].
Table 1 depicts the brain frequency of students captured before and after they
took the viva exam. Their preferences for the type of music were captured, and brain
waves were again noted after the music was played.
The following observations were noted down:
1. Playing dance-genre music did not reduce stress; in fact, stress increased in some
students.
2. Playing romance-genre music decreased stress by a significant amount. Beta waves
shifted to alpha waves, making the students relaxed.
Figure 4 shows the change in frequency observed for the students based on the
genre of music played (dance or romance).
To calculate the % of change in frequency, the following formula was used:

Fig. 3 Experimental setup

% decrease = (Stress Frequency − Reduced Stress Frequency) / Stress Frequency × 100. (1)

Using the above formula, the average percentage of change in frequency after
listening to the DANCE genre was 3.01% and after listening to the ROMANCE
genre was calculated to be about 15.4%.
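As a worked check of Eq. (1) against Table 1 (row 1 lists 12.45 Hz before the exam and 10.07 Hz after music; the function name below is ours):

```python
def pct_decrease(stress_hz, reduced_hz):
    """Percentage decrease in dominant brainwave frequency, Eq. (1)."""
    return (stress_hz - reduced_hz) / stress_hz * 100

# Row 1 of Table 1: 12.45 Hz before the exam, 10.07 Hz after music
change = pct_decrease(12.45, 10.07)   # ~19.12, matching the tabulated value
```

Note that the tabulated % change reproduces only when the before-exam value is taken as the "stress frequency" in Eq. (1).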
Figures 5 and 6 depict the graphs to indicate the stress level changes based on
the kind of music played. It can be observed from the figures that the romance genre
reduced stress at greater levels compared to the dance genre.
The following figures show the brainwaves collected from the subjects.
Figure 7 showcases the theta waves indicating the subject is sleeping. Figure 8
shows the subject with beta waves which means the person is stressed, and Fig. 9
shows a person with alpha waves which indicates the subject is not stressed.
Moving to the next stage, brainwaves of housewives were measured, and analysis
was performed on these.
Here, the % decrease in stress observed using Eq. (1) was around 9.57% after
listening to music.
Figure 10 showcases a graph of stress levels of housewives before and after
listening to music. The blue line indicates stress before, and the orange indicates
the stress after the music was played.

Table 1 Captured the frequency of the students before and after the exam
Sr. No. Age Preference Before exam After exam After playing music % change in freq
1 22 Romance 12.45 12 10.07 19.11646586
2 21 Dance 15.34 13.23 13.45 12.32073012
3 22 Dance 12.28 14.58 12.13 1.221498371
4 21 Romance 14.32 12.39 11.01 23.11452514
5 20 Romance 14.59 13.53 12.62 13.5023989
6 19 Dance 10.35 11.25 11.23 −8.502415459
7 20 Romance 11.26 12.83 9.26 17.76198934
8 21 Dance 11.46 10.45 12.27 −7.068062827
9 22 Dance 12.7 10.78 13.31 −4.803149606
10 21 Dance 15.32 13.36 13.56 11.48825065
11 20 Romance 13.39 13.95 13.34 0.373412995
12 20 Romance 13.42 11.71 10.98 18.18181818
13 19 Dance 12.37 12.82 12.81 −3.556992724
14 20 Dance 13.68 13.23 13.52 1.169590643
15 20 Romance 13.54 12.96 11.88 12.25997046
16 21 Dance 14.27 12.47 13.36 6.377014716
17 20 Dance 13.34 13.72 12.58 5.697151424
18 21 Romance 13.66 13.85 11.74 14.0556369
19 20 Dance 12.09 11.03 11.41 5.624483044
20 19 Dance 12.57 12.18 12.36 1.670644391
21 19 Romance 13.83 14.59 11.82 14.53362256
22 20 Romance 14.25 13.54 11.52 19.15789474
23 20 Dance 14.09 13.97 13.47 4.400283889
24 20 Dance 13.56 12.81 12.43 8.333333333
25 20 Romance 13.73 13.04 11.34 17.40713765

Fig. 4 % freq. change based on genre



Fig. 5 Change in stress levels after listening to dance music

Fig. 6 Change in stress levels after listening to romantic music

Fig. 7 Theta waves

Fig. 8 Beta waves



Fig. 9 Alpha waves

Fig. 10 Stress analysis of housewives

4 Conclusions

In this paper, we propose a system to detect stress using EEG signals. The brainwave
frequency is detected using the PSD, which is a simple and robust technique.
It has been observed that stress levels among students were higher than those of
housewives. The average stress level among students was calculated to be around
13.4564 Hz, while the average stress level in housewives was around 12.12355 Hz.
A decrease in stress was observed in both students and housewives after listening
to slow songs, i.e., romantic music. Further, we also observed that the frequency of
brainwaves increases when listening to fast music such as dance songs, leading to
an increase in stress for some people.
It can be concluded that in this study, we have successfully been able to build a
system to detect human stress and reduce it to a great extent. Using our system, we
can reduce stress without the need for medicines.

References

1. Raj A, Jaisakthi SM (2018) Analysis of brain wave due to stimulus using EEG. In: 2018
international conference on computer
2. Communication, and Signal Processing (ICCCSP) (2018)

3. Anuradha R, Rathi G, Rm. Krishnappa M, Suresh Kumar MS, Kalpana M (2022) Detecting
stress level of students using brain waves and reducing it using yoga therapy. In: 2022 IEEE world
conference on applied intelligence and computing
4. Zhang Y, Wang Q, Chin ZY, Ang K (2020) Investigating different stress-relief methods using
Electroencephalogram (EEG). In: 2020 42nd annual international conference of the IEEE
engineering in medicine & biology society (EMBC)
5. Wen TY, Mohd Aris SA (2022) Hybrid approach of EEG stress level classification using k-means
clustering and support vector machine. IEEE Access 10:18370–18379
6. Sengupta K (2021) Stress detection: a predictive analysis. Asian Conferen Innov Technol
(ASIANCON) 2021:1–6. https://doi.org/10.1109/ASIANCON51346.2021.9544609
7. Crowley OV, McKinley PS, Burg MM, Schwartz JE, Ryff CD
8. Deepika RC, Kumbhar MS, Chavan RR (2016) The human stress recognition of brain, using
music therapy. ICCPEIC
9. Hurless N, Mekic A et al (2013) Music genre preference and tempo alter alpha and beta waves
in human non-musicians. Impulse Prem Undergrad Neurosci J
10. Chung-Yen L, Rung-Ching C, Shao-Kuo T (2018) Emotion stress detection using EEG signal
and deep learning technologies. IEEE Int Conferen Appl Syst Invent
11. Munkhbat K, Ryu KH (2020) Classifying songs to relieve stress using machine learning
algorithms. Springer
12. Christos P et al (2013) Using brain waves to control computers and machines. In: Advanced
human-computer interaction, vol 2013. Hindawi Publishing Corporation, New York, pp 1–2
13. Weinstein M et al (2011) The interactive effect of change in perceived stress and trait anxiety on
vagal recovery from cognitive challenge. Int J Psychophysiol Offic J Int Organiz Psychophysiol
82:225–232
14. Shubhangi G, Bhavna A, Afzal AS (2019) Human stress detection and relief using music
therapy. IJRECE
15. Kofman O, Meiran N, Greenberg E, Balas M, Cohen H (2006) Enhanced performance on exec-
utive functions associated with examination stress: evidence from task-switching and stroop
paradigms. Cogn Emot 20:577–595
16. Han C, Yang Y, Sun X, Qin Y (2018) Complexity analysis of EEG signals for fatigue driving
based on sample entropy. In: 2018 11th international congress on image and signal processing,
biomedical engineering and informatics (CISP-BMEI)
17. Bobade P, Vani M (2020) Stress detection with machine learning and deep learning using
multimodal physiological data. Sec Int Conferen Invent Res Comput Appl (ICIRCA) 2020:51–
57. https://doi.org/10.1109/ICIRCA48905.2020.9183244
18. Malviya L, Mal S, Lalwani P (2021) EEG data analysis for stress detection. In: 2021 10th IEEE
international conference on communication systems and network technologies (CSNT)
19. Devendran K, Thangarasu SK, Keerthika P et al (2020) Music prediction for music therapy
using random forest. Int J Control Automat
20. Ramdinmawii E, Vinay Kumar M (2017) The effect of music on the human mind: a study using
brainwaves and binaural beats. IEEE
21. Devendran K, Thangarasu SK, Keerthika P et al (2021) Effective prediction on music therapy
using hybrid SVM-ANN approach. ITM Web Conf
22. Ambica G, Sujata B (2015) Study and application of brainwaves. IJCSMC
Potential Exoplanet Detection Using
Feature Selection, Multilayer Perceptron,
and Supervised Machine Learning

Keshav Sairam, Monika Agarwal, Aparajita Sinha, and K. Pradeep

Abstract Since the discovery of the first exoplanet in 1992, advancements in tech-
nology have enabled the identification of numerous additional exoplanets. The
recently launched James Webb telescope, succeeding the Hubble telescope, is set
to enhance our understanding by scrutinizing exoplanet surroundings. Exoplanet
detection, traditionally labor-intensive and reliant on experts, is now undergoing a
transformation. Leveraging the wealth of data from the NASA Exoplanet Archive
at Caltech, we employ techniques such as forward feature selection, Information
Gain, and machine learning models like Logistic Regression and ensemble learning.
Dimensionality reduction aids in selecting the most crucial features among the 49
available. The performance of these models is then rigorously evaluated through
various metrics and visualizations, aiming to streamline the certification process for
prospective exoplanets.

Keywords ANN · Multilayer Perceptron · Random Forest · SVM · KNN · Logistic
Regression · Exoplanet detection

1 Introduction

Astronomy was one of the first natural sciences developed by humans. It has always
influenced human thought, guided explorers, and sparked introspective inquiries
about the fundamental nature of our existence and of events in space. The

K. Sairam (B) · M. Agarwal · A. Sinha · K. Pradeep
Dayananda Sagar University, Bangalore, India
e-mail: keshavsairam1234@gmail.com
e-mail: keshavsairam1234@gmail.com
M. Agarwal
e-mail: monika.goyal-cse@dsu.edu.in
A. Sinha
e-mail: aparajitasinha-aiml@dsu.edu.in
K. Pradeep
e-mail: pradeepkumar.k-aiml@dsu.edu.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 387
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_29

century-old search for other Earth-like planets has reignited a fervent interest in
whether there are any other planets that are suited for supporting life. This subject’s
answers have been thought about and researched for a very long time. Finding
exoplanets outside of our solar system yields some ground-breaking discoveries.
A new era of exoplanetary exploration began with the discovery of one of the first
exoplanets, as described in [1]. A planet outside our planetary system that does not
orbit our Sun is referred to as an exoplanet. While the majority orbit other stars,
some simply drift through interstellar space. Exoplanets are so far from the solar
system that, with existing technology, it is not possible to launch a spacecraft to
examine them. Even for experts, searching for exoplanets using images from
terrestrial observatories and satellite-based telescopes like Hubble is a difficult
operation. We are still trying to understand which kinds of stars provide long-lasting,
stable conditions that could allow life a chance to take hold and evolve as it did on
Earth.
For the first time, a distant planet has been found by the NASA/ESA Hubble Space
Telescope using direct visible-light imaging. The planet is separated from its parent
star by a vast belt of gas and dust, and it may even have rings that rival Saturn's in
size. To conduct many exoplanet observations, we need to observe a phenomenon
known as a transit: the exoplanet's orbit must carry it between the observation site
and the star it is orbiting, and the observation must be timed to coincide with the
exoplanet's approach to the star. Hubble is aimed in the right direction in space using
this knowledge.
A planet blocks some of the starlight when it moves in front of its star. However,
part of the light travels through the outer rim of the planet's atmosphere on its way
to Hubble. Whatever is in that exoplanet's atmosphere absorbs some of that
light at very specific frequencies that are compatible with the atoms and molecules
there. After the Hubble telescope separates the light that it has captured into its
individual colors or wavelengths, it is possible to use a spectrograph to determine
which of these wavelengths has been absorbed. Based on the spectroscopic structure,
we can infer which elements and substances are present in the atmosphere of that
planet. Hubble has discovered elements like sodium, hydrogen, and even indications
of methane and water vapor using transit studies of exoplanets. Hubble has also
analyzed these compositions as well as the height of the atmosphere, which can help
us to determine how dense the environment is. The transit technique is currently
being used by other observatories to examine the atmospheres of exoplanets after
Hubble pioneered it.
This work proposes a robust algorithm that identifies false positives and gives a
quick description of the key characteristics required to establish an observation as
an exoplanet. Along with traditional machine learning approaches including Logistic
Regression, Decision Tree, Random Forest, Naive Bayes, and XGBoost, we employ
a Multilayer Perceptron (MLP)/Artificial Neural Network (ANN). Once the crucial
parameters required to train a specific model have been filtered, we use those chosen
features to calculate accuracy scores. We then use the most accurate model to predict
whether an observation is a candidate for a potential exoplanet.
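As an illustration of this pipeline only (not the exact configuration used in this work), forward feature selection wrapped around Logistic Regression can be sketched with scikit-learn, with synthetic data standing in for the 49-feature archive table:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 49-feature NASA Exoplanet Archive table
X, y = make_classification(n_samples=400, n_features=49,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Forward feature selection: greedily grow the feature subset that most
# helps the wrapped classifier under cross-validation
clf = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(clf, n_features_to_select=6,
                                direction="forward", cv=3)
sfs.fit(X_tr, y_tr)

clf.fit(sfs.transform(X_tr), y_tr)
acc = clf.score(sfs.transform(X_te), y_te)   # accuracy on held-out data
```

The same skeleton swaps in the other classifiers named above (Random Forest, XGBoost, etc.) for comparison.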

The remaining portion of the article is organized as follows: Sect. 2 contains an
overview of the associated literature. Section 3 describes the procedure in depth.
The experimental results are displayed in Sect. 4. Section 5 describes how the
proposed approach compares with existing techniques. Section 6 concludes the
proposed work.

2 Literature Survey

In this section, we discuss the literature survey conducted by the authors and the work
that exists in this area of Exoplanet classification. The related work is summarized
in the paragraphs given below.
Based on the discovery of the moon-sized, innermost planet in the PSR 1257+12
system and recent developments in the understanding of accretion onto magnetized
stars, the authors of [1] postulate that the pulsar was born with a magnetic moment
and rotation frequency roughly equivalent to those of today. This bears on the
creation and development of magnetic fields in neutron stars as well as the emergence
of planets around pulsars.
In [2], the authors developed a novel method for missing data imputation, a hybrid
of single- and multiple-imputation techniques, because erroneous imputation of
missing values can result in inaccurate predictions. According to the authors'
experimental findings, their approach outperforms rivals with comparable execution
times, achieving a 20% higher F-measure for the imputation of binary data and an
11% lower error for numeric data imputation.
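The hybrid single/multiple-imputation method of [2] is not reproduced here; as a much simpler illustration of the underlying idea, scikit-learn's `SimpleImputer` fills each missing entry with a per-column statistic before any model is trained:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (NaN)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Median imputation: a single-imputation baseline, not the hybrid
# method of [2]
imp = SimpleImputer(strategy="median")
X_filled = imp.fit_transform(X)
```

Each NaN is replaced by its column's median (4.0 in the first column, 2.5 in the second), leaving a complete matrix for downstream classifiers.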
The effectiveness of feature selection is influenced by a variety of factors, including
the data, the classifier, and other elements. The research in [3] investigates this by
running experiments with a diabetes-diagnosis dataset (768 × 9) from the UCI
repository, using existing wrapper and filter approaches. It suggests that filters are a
viable alternative to wrapper approaches for resolving several of the issues that arise
with wrappers, thanks to their low computational cost for handling massive datasets.
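A filter method of the kind discussed here (scored by mutual information, which corresponds to the Information Gain criterion mentioned in the abstract) can be sketched as follows; the dataset shape is illustrative, not the 768 × 9 diabetes table:

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

# Filter approach: rank features by mutual information with the label,
# with no classifier in the loop (unlike wrapper methods)
score = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score, k=3).fit(X, y)
top = selector.get_support(indices=True)   # indices of the 3 kept features
```

Because no model is trained per candidate subset, this costs a single scoring pass, which is why filters scale to large datasets better than wrappers.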
When an orbiting planet passes in front of a star, blocking some of its brightness,
the transit technique uses photometric observations to track the resulting variations
in the star's light. The research in [4] suggests a method that efficiently discovers
high-volume planets independent of the distance between the planet and the star.
Additionally, advances in photometry have made it possible for missions like the
Kepler space observatory to find a greater variety of exoplanets. Doing so requires
algorithms and techniques for feature extraction, classification, and regression.
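The size of the dip the transit technique looks for follows from simple geometry: the fractional dimming is the ratio of the planet's and the star's projected disc areas. A quick numeric illustration (the values are ours, not from [4]):

```python
def transit_depth(r_planet, r_star):
    """Fractional drop in stellar flux during a transit: (Rp / Rs)**2."""
    return (r_planet / r_star) ** 2

# A Jupiter-sized planet (~0.1 solar radii) dims a Sun-like star by ~1%
depth = transit_depth(0.1, 1.0)
```

This is why photometric precision matters: an Earth-sized planet around a Sun-like star produces a dip roughly two orders of magnitude shallower.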
The ASTRONET deep learning model [5] classifies whether or not a previously
unknown planet is habitable, in order to detect astronomical anomalies. That study
introduces the ASTRONET model for finding habitable exoplanets using information
about the planets' eccentricity, mass, radius, and other characteristics.
To determine the likelihood that a sighting is an exoplanet, [6] applies a variety of classification techniques to an exoplanet dataset. In this study, the accuracy of
390 K. Sairam et al.

the classification of objects of interest as exoplanets or "false positives" for exoplanet identification was assessed using a variety of performance criteria. The Random Forest Classifier (RFC) was chosen as the best-performing machine learning model for categorizing objects of interest, and it was made accessible to the public via an Application Programming Interface (API).
The work in [7] offers a unique planet identification approach based on conventional machine learning techniques. In their method, time series features are automatically extracted from light curves and then fed into a gradient-boosted trees framework. They showed that this machine learning approach could more precisely distinguish light curves containing planet signals.
Convolutional Neural Networks (CNNs), a subset of deep learning methods, were specifically cited by the authors of [8] as being helpful for identifying possible planetary signals. Each flattened light curve is folded on the TCE period and binned to create a 1D vector for feature extraction (with the event centered). The result was AstroNet, a CNN model that can reliably distinguish between true transiting exoplanets and false positives. Fully connected neural networks and linear Logistic Regression (LLR) models were also examined.
The findings show that actual planets are classified quite well in terms of recall, accuracy, and precision. Using dimensionality reduction and the k-Nearest Neighbors (KNN) approach to determine whether a new signal is sufficiently similar to earlier transit signals, the authors of [9] present a technique for locating planetary transits. This makes it possible for missions like Kepler, and others searching for transiting planets in the future, to find the best planetary candidates quickly and consistently. For the first time, [10] describes how to classify exoplanets using the Random Forest Classifier (RFC) method.
The research described in [11] shows how to distinguish between various forms of data by combining RFCs with Convolutional Neural Networks (CNNs). According to the researchers, the optimal methodology, which combines the two approaches, can correctly identify exoplanets in test data 90% of the time.
A CNN-based approach for identifying exoplanet transits is suggested by the authors of [12]. A 2D phase-folding method is developed that produces a set of training images. They tested the approach on five different deep learning model types, both with and without folding. The findings demonstrate that a two-dimensional Convolutional Neural Network coupled with folding is the best method for upcoming transit analyses.
The work done in [13] uses Deep Neural Networks and Support Vector Machines
in combination with the dataset from the Kepler Space Telescope operated by NASA
to find exoplanets. Using a dataset from Kaggle that comprises the flux intensity
fluctuation of several stars, they categorize the exoplanets using Principal Component
Analysis (PCA) methodologies.
The authors of [14] use a variety of baseline algorithms, such as Support Vector
Machines (SVMs), Decision Trees, Random Forest Classifier, Logistic Regres-
sion, Multilayer Perceptron (MLP), and Convolutional Neural Networks (CNNs), to
Potential Exoplanet Detection Using Feature Selection, Multilayer … 391

compare different machine learning models. Additionally, they propose an Ensemble-CNN model, which they found achieves an exoplanet detection accuracy of 99.62%.
High-resolution spectroscopy (HRS) is used by the authors of [15] to identify transiting and non-transiting planets. Owing to its sensitivity to a variety of characteristics, including the shape, depth, and location of the planet's spectral lines, HRS simultaneously characterizes the planet's atmosphere.
Most of the reviewed literature demonstrates the use of transit light curve techniques for exoplanet classification; a smaller number of articles have used multilayer perceptrons and other machine learning techniques for exoplanet discovery, including SVM, KNN, Logistic Regression, Naive Bayes, and others.

3 Methodology

In this paper, the authors use a multilayer perceptron and other machine learning approaches, namely Random Forest, Logistic Regression, Decision Tree, and Naive Bayes, to detect whether an object of interest is a potential exoplanet candidate.

3.1 Data Gathering and Data Preprocessing

The information was gathered from the NASA Exoplanet Archive at Caltech, from which 49 features were extracted from 9564 observations in the Kepler data. The NASA Exoplanet Archive is an online collection of astronomical data on exoplanets and their host stars, together with tools for working with these data. The data include stellar characteristics, planetary parameters, and discovery and characterization information.
After analyzing the variances in the raw data, the missing values were imputed and feature scaling was carried out. Regression imputation, as described in [2], was used to fill in the missing values of float-type features: in linear regression, the missing value is replaced by the value predicted by regressing the missing item on the observed items. Categorical data were imputed manually, ensuring a uniform distribution of all categories after imputation.
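The regression imputation described above can be sketched with scikit-learn's `IterativeImputer` wrapped around a linear regressor. The column names below are illustrative stand-ins, not the actual Kepler fields:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy stand-in for the Kepler table; column names are illustrative only.
df = pd.DataFrame({
    "transit_depth": [615.8, 874.8, np.nan, 603.3, 686.0],
    "planet_radius": [2.26, 2.83, 14.6, np.nan, 2.75],
    "insolation_flux": [93.6, 9.11, 39.3, 891.0, 926.2],
})

# Regression imputation: each missing value is replaced by the value predicted
# from a regression of that column on the observed columns.
imputer = IterativeImputer(estimator=LinearRegression(), random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed.isna().sum().sum())  # 0 — no missing values remain
```

After this step every float column is complete, and feature scaling can proceed on the imputed frame.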

3.2 Feature Selection

To increase the effectiveness of the model's training, we selected only the features that have the biggest influence on the predictions, so supervised techniques were applied. We applied the filter and wrapper methods of feature selection, as stated in [3], from a taxonomic perspective. Wrapper approaches evaluate the utility of a subset of features by building a model on it, whereas filter techniques evaluate the value of features based on their association with the dependent variable. In the end, we settled on the top eight features to reduce redundancy and boost the model's predictive ability. The subset-selection techniques used are described below.

3.3 Quasi-constant

Features that have a single dominant value for the majority of the samples are referred to as quasi-constant features. Such features are of little use for making predictions. The cutoff point is set at 99.9%. Orbital period, transit epoch, and stellar surface gravity are among the five quasi-constant features removed, because they have little to no effect on the outcomes.
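A quasi-constant filter at the 99.9% dominance cutoff can be sketched as follows; the column names are invented for illustration:

```python
import numpy as np
import pandas as pd

def quasi_constant_features(df, dominance=0.999):
    """Return columns where a single value covers >= `dominance` of the rows."""
    drop = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share >= dominance:
            drop.append(col)
    return drop

# Toy frame: 'flag' is dominated by one value, 'depth' varies freely.
df = pd.DataFrame({"flag": [0] * 999 + [1], "depth": np.arange(1000)})
print(quasi_constant_features(df))  # ['flag']
```

Columns returned by the filter are dropped before the subsequent selection stages.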

3.4 Mutual Information Gain

Information Gain measures the amount of information a feature imparts about a class. Each variable's information gain is calculated with respect to the disposition class, so a feature with a low score, such as TCE delivery, can be disregarded. Information Gain is typically used to establish the importance of a non-categorical feature. As "Transit Duration" is a non-categorical variable with an information gain of less than 0.025, we eliminate it. Figure 1 depicts the Information Gain obtained for every significant feature in the dataset.
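This gain-based filter can be computed with scikit-learn's `mutual_info_classif`. The breast-cancer dataset below is only a stand-in for the KOI table, used to illustrate the 0.025-cutoff mechanics:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer  # stand-in for the KOI table
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
gain = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# Features whose information gain falls below the 0.025 cutoff are dropped.
low_info = gain[gain < 0.025].index.tolist()
kept = X.drop(columns=low_info)
print(len(low_info) + kept.shape[1] == X.shape[1])  # True
```

On the real KOI data, "Transit Duration" would be among the `low_info` columns.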

Fig. 1 Multilayer
Perceptron topology
Potential Exoplanet Detection Using Feature Selection, Multilayer … 393

3.5 Correlation Coefficient

The correlation technique is applied to ascertain the relationships between the various features of the collected data. Variables with strong correlations to the KOI disposition variable are preserved. The feature specifying the equilibrium temperature of the object of interest is eliminated because it has a correlation of more than 0.85 with another feature.
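This correlation filter mirrors the upper-triangle trick in the algorithm listing. A sketch follows; the column names ('teff', 'teq', 'depth') are illustrative only:

```python
import numpy as np
import pandas as pd

def highly_correlated(df, cutoff=0.85):
    """Return columns whose absolute correlation with an earlier column exceeds cutoff."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    return [c for c in upper.columns if (upper[c] > cutoff).any()]

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "teff": base,                                              # stellar temperature
    "teq": base * 1.01 + rng.normal(scale=0.01, size=200),     # near-copy of teff
    "depth": rng.normal(size=200),                             # independent feature
})
print(highly_correlated(df))  # ['teq'] — nearly a copy of 'teff'
```

Columns returned here are dropped, which is how the equilibrium-temperature feature is eliminated in the paper's pipeline.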

3.6 Forward Feature Selection

The wrapper feature selection approach is used to extract the final features from the preceding collection of features. The accuracy of the dependent-variable prediction is calculated using each of the remaining features, and the best-performing ones are retained. To conclude that forward feature selection using Logistic Regression gives the best features, it is trained and validated using both the Random Forest and Logistic Regression models, and the features chosen by both models are evaluated.
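Forward selection down to eight features can be sketched with scikit-learn's `SequentialFeatureSelector`; synthetic data stands in for the cleaned KOI table:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the reduced KOI feature table.
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           random_state=0)

# Forward selection: start empty, greedily add the feature that most improves
# cross-validated accuracy, until eight features are selected.
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sfs = SequentialFeatureSelector(estimator, n_features_to_select=8,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support().sum())  # 8
```

Running the same selector with a `RandomForestClassifier` as the estimator allows the paper's comparison of the two chosen subsets.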

4 Algorithm—Feature Selection

Table 1 shows the optimal features considered for the potential detection of exoplanets
taken from the dataset.

Table 1 Optimal features for the detection of potential exoplanets

1  No_of_features = 37                        % 37-feature class vector
2  QC = variance_threshold(threshold = 0.01)  % quasi-constant features with a threshold of 99.9%
3  No_of_features = 32                        % dropped 5 quasi-constant features from the dataset
4  IG = mutual_info_classif(X, Y)             % mutual information gain of each independent feature with respect to the dependent feature
5  No_of_features = 31                        % dropped 1 feature from the dataset with gain less than 0.025
6  corr_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(np.bool))  % compute the upper-triangular correlation matrix using the NumPy library
7  [column for column in corr_matrix.columns if any(corr_matrix[column] > 0.85)]  % get the features from the dataset whose correlation is greater than 0.85
8  No_of_features = 30                        % dropped 1 feature from the dataset with correlation greater than 0.85
9  FFS = sequential_feature_selector(dataframe, classifier, X, Y)
10 No_of_features = 8                         % selected 8 features after forward feature selection

Table 2 Descriptions of different discussed features

Disposition score — A value between 0 and 1 representing confidence in the KOI disposition. A higher value for a CANDIDATE indicates more confidence in that disposition, whereas a higher value for a FALSE POSITIVE indicates less confidence in it
Not transit-like FPF — A KOI whose light curve is not consistent with that of a transiting planet
Centroid offset FPF — The signal is from a nearby star, as determined by measuring the centroid location of the image both in and out of transit, or by comparing the strength of the transit signal in the target's outer (halo) pixels to that in the optimal (or core) aperture pixels
TCE planet number — Threshold crossing event planet number federated to the KOI
Planetary radius — The radius of the planet, computed as the product of the planet–star radius ratio and the stellar radius
Insolation flux — Can be used to calculate the equilibrium temperature; it is determined by the stellar characteristics (particularly the stellar radius and temperature) as well as the planet's semi-major axis
Transit depth — The fraction of stellar flux lost at the minimum of the planetary transit
Stellar eclipse FPF — A KOI with a significant secondary event, transit shape, or out-of-eclipse variability, indicating that the transit-like event was caused by an eclipsing binary

Table 2 describes the features taken into consideration for the detection of potential exoplanets from the dataset.

4.1 Models

Logistic Regression
Logistic Regression is a prominent machine learning technique used in conjunction with the Supervised Learning methodology. Using both continuous and discrete datasets, it can estimate probabilities and classify new data. In this model, a logistic function is utilized to express the likelihood of the possible outcomes of a particular experiment [16]. From a collection of independent variables, it predicts a dependent variable that is categorical in nature; as a result, the output must be categorical or discrete.
Random Forest
Random Forest is a kind of Supervised Machine Learning procedure that is frequently used in classification and regression problems. It builds Decision Trees from several samples, taking the majority vote for classification and the average for regression [17]. The algorithm's ability to deal with datasets containing both continuous and categorical variables, as needed in regression and classification respectively, is one of its most notable features, and it surpasses many other algorithms in classification problems.
Multilayer Perceptron
A feed-forward Artificial Neural Network model called a Multilayer Perceptron
translates a set of input data onto a set of relevant outputs. It is made up of numerous
layers of nodes in a directed network, with each layer completely linked to the one
before it. A basic MLP is made up of an input and an output layer. This is most
effective for linear forecasts. When it comes to nonlinear predictions, hidden layers
must be used. An MLP is made up of at least three layers of nodes: the input layer,
the hidden layer, and the output layer. Other than the input layer nodes, the nodes
employ a nonlinear activation function [18]. The MLP employs the backpropagation
approach, which is a type of Supervised Learning.

4.2 Architecture

Figure 1 shows the basic architecture of a Multilayer Perceptron model consisting of different sub-layers (input, hidden, and output).
The input layer contains the eight normalized features [19, 20]. The hidden layers use ReLU activation functions, whereas the output layer uses a sigmoid activation function. The optimizer used is Adam. The MLP is trained for 30 epochs with a batch size of 24, and each layer contains six units. Eighty percent of the data is used for training and the remaining 20% for testing.
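The architecture just described can be sketched with scikit-learn's `MLPClassifier` as a stand-in for the authors' network; the synthetic data below replaces the eight KOI features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight selected KOI features (binary disposition).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 80/20 split as in the paper

scaler = StandardScaler().fit(X_train)
mlp = MLPClassifier(hidden_layer_sizes=(6, 6),  # six units per hidden layer
                    activation="relu",          # ReLU in the hidden layers
                    solver="adam",              # Adam optimizer
                    batch_size=24,
                    max_iter=30,                # 30 epochs
                    random_state=0)
mlp.fit(scaler.transform(X_train), y_train)
acc = mlp.score(scaler.transform(X_test), y_test)
print(round(acc, 2))
```

`MLPClassifier` applies a logistic (sigmoid) output for binary targets automatically, matching the described output layer.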
Figure 2 shows the plotting of the loss factor calculated after each epoch of the
learning process.
Apart from these supervised machine learning models, we have used other models such as a Decision Tree with entropy as the splitting criterion, a Gaussian Naive Bayes classifier, and ensemble learning with Extreme Gradient Boosting (XGBoost). Comparative analysis is used to determine which of the aforementioned models is the best.
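A sketch of such a comparative analysis with scikit-learn follows; XGBoost is omitted here because it lives in a separate third-party package, and synthetic data stands in for the selected features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the eight selected features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
accuracy = {name: cross_val_score(m, X, y, cv=5).mean()
            for name, m in models.items()}
best = max(accuracy, key=accuracy.get)
print(len(accuracy))  # 4 models compared
```

The same loop extends naturally to the MLP and XGBoost models when those packages are available.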

Fig. 2 Plot of loss with each epoch

5 Experimental Results

5.1 Experiment Setup

The MLP and Logistic Regression models are trained using Google Colab or Jupyter Notebooks with Python version 3.7 or above; preferably 8 GB or more of RAM and an Intel i5/Ryzen 3 or better processor are recommended, with at least 15 GB of internal storage (HDD or SSD).

5.2 Dataset

The training and testing data are obtained from the Kepler Exoplanet data archives
and are loaded into a data frame before training and testing.

5.3 Performance Evaluation Metrics

The following parameters are utilized in evaluating the performance of the models
built to perform potential exoplanet classification.
Mean Absolute Error (MAE)
In statistics, the mean absolute error (MAE) is a measure of the errors between paired observations representing the same phenomenon, where xi is the true value and yi is the prediction:

MAE = (1/n) · Σ_{i=1}^{n} |yi − xi|
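The MAE definition translates directly into a few lines of code:

```python
def mae(x_true, y_pred):
    """MAE = (1/n) * sum of |y_i - x_i| over n paired observations."""
    assert len(x_true) == len(y_pred)
    n = len(x_true)
    return sum(abs(y - x) for x, y in zip(x_true, y_pred)) / n

print(mae([0, 1, 1, 0], [0.0, 0.5, 1.0, 0.5]))  # 0.25
```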

F1-Score
Recall
Precision
ROC Curve (receiver operating characteristic curve)
AUC (Area under the ROC curve)
Confusion Matrix
Cross-Validation
Figure 3 shows the plot of Mean Absolute Error (MAE) against the number of
folds for 20-fold cross-validation for Logistic Regression.
Cross-validation makes it possible to compare several machine learning approaches and obtain an idea of how well they will perform in practice. In a cross-validation cycle, a sample of data is divided into complementary subsets; the analysis is run on one subset (referred to as the training set), and the results are then validated on the other subset (referred to as the validation or testing set). Most strategies use many cross-validation rounds with different partitions to reduce variability, and the validation results are aggregated (e.g., averaged) over the rounds to obtain an estimate of the model's predictive performance. To assess the model, unweighted scores and the confusion matrix are generated. To check for overfitting, we perform 20-fold cross-validation and plot the mean absolute error against the number of folds for both the training and testing datasets.
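The fold-count sweep behind a plot like Fig. 3 can be sketched as follows; synthetic data and Logistic Regression stand in for the real pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Held-out MAE for fold counts k = 2..20, mirroring the sweep plotted in Fig. 3.
mae_per_k = {}
for k in range(2, 21):
    scores = cross_val_score(clf, X, y, cv=k, scoring="neg_mean_absolute_error")
    mae_per_k[k] = -scores.mean()
print(len(mae_per_k))  # 19 fold settings evaluated
```

Plotting `mae_per_k` for the training and testing splits side by side is what reveals overfitting: a widening gap between the two curves.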

Fig. 3 No. of folds versus MAE



Fig. 4 Deployment model

5.4 Model Deployment

Figure 4 depicts the basic deployment architecture of the model consisting of the
back end, front end, and all related processes that take place between them.
We construct a web application using React.js and the Flask API to fully automate the predictions. React is a JavaScript library that allows us to design reusable user interface components. We design a form with Material UI, a collection of components that may be used to build a wide range of user interfaces, and use POST and GET requests to communicate the form data to the server. React hooks let us use state and other React capabilities without needing to create a class. Flask is a web framework written in Python for developing online applications. We created a virtual environment in Python 3.8 and installed all the necessary packages. The models trained on the exoplanet data are compared, and the best model is selected to make predictions for the online application. The server makes the predictions and returns the results to the frontend for display.
Subsequently, to set up the application, we configured Gunicorn. The act of installing, configuring, and enabling a specific program or set of applications at a specified URL on a server is known as application deployment (also known as software deployment). Once the deployment procedure is complete, the application becomes publicly available through the URL. Many developers use Gunicorn, a Python WSGI HTTP server, to deploy their Python applications. Because typical web servers do not know how to run Python programs, this Web Server Gateway Interface (WSGI) is required; it allows Python programs to be deployed reliably. If the application requires it, multiple worker threads can also be configured to serve it.
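A minimal sketch of the server side described above follows. The endpoint name, the `features` field, and the inline-trained stand-in model are all illustrative assumptions; in the real application the pickled best model would be loaded from disk and the app served via Gunicorn:

```python
import pickle
from flask import Flask, jsonify, request
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the trained best model that the paper saves to a pickle file.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
payload = pickle.dumps(LogisticRegression(max_iter=1000).fit(X, y))
model = pickle.loads(payload)  # real app: pickle.load(open("model.pkl", "rb"))

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # eight normalized feature values
    label = int(model.predict([features])[0])
    return jsonify({"is_candidate": label})

# Exercise the endpoint without a running server, using Flask's test client.
resp = app.test_client().post("/predict", json={"features": X[0].tolist()})
print(resp.get_json())
```

In production the React form issues the same POST request, and Gunicorn serves the `app` object instead of the test client.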

Table 3 Comparison of results of all the existing and proposed models


Models                                     Accuracy  Precision  Recall  F1-score  ROC AUC score
Existing models    SVM                     0.9681    0.9309     0.973   0.9515    0.9694
as in [6]          KNN                     0.9371    0.854      0.9704  0.9085    0.9458
                   Random Forest           0.9896    0.9955     0.9721  0.9837    0.985
Proposed           MLP                     0.9749    0.97       0.97    0.97      –
models             Logistic Regression     0.9870    0.98       0.99    0.99      0.978
                   Random Forest           0.9899    0.99       0.99    0.99      0.992
                   Decision Tree           0.9824    0.98       0.98    0.98      0.942
                   Naive Bayes             0.9849    0.98       0.99    0.98      0.857
                   XG Boost                0.9891    0.99       0.99    0.99      –

6 Comparative Analysis

In this section, we compare the pre-existing techniques from [6], namely SVM, KNN,
and Random Forest, with the accuracy and other scores of the models that the authors
have proposed, namely MLP, Logistic Regression, Random Forest, Decision Tree,
Naive Bayes, and XGBoost. The comparison of the scores is shown in Table 3.
Table 3 shows all metrics considered (accuracy, precision, recall, F1-score, area
under curve) and compares the results of all the already existing models with the
proposed model.

7 Conclusion and Future Work

Our Supervised Learning models successfully detect a potential exoplanet candidate with greater accuracy than the existing work. Random Forest gives the maximum
accuracy scores, but Logistic Regression performs the best for all the tests. When
compared to other supervised machine learning models, the MLP does not always
provide the best accuracy, but it does provide the least false negatives as seen in the
confusion matrix. The feature selection methods used provide the best subset from
the original features. We have chosen the features that have the greatest impact on
the predictions in order to improve the model’s training efficiency. The authors tried
the following approaches to feature selection: quasi-constant, Mutual Information Gain, correlation coefficient, and forward feature selection. Once the best features were selected, the performance of the different models was evaluated. The best model is then

used to create a dynamic web application using React.js and Flask API. The model
parameters are saved in a pickle file which is then loaded to perform predictions and
display on the interactive user interface developed using Material UI. There is scope for using light curves to predict whether an object is a potential exoplanet candidate.
The transit method can be combined with the optimal features selected through the
proposed methods to train a neural network.

References

1. Miller MC, Hamilton DP, Implications of the PSR 1257+12 planetary system for isolated
millisecond pulsars
2. Khan SI, Hoque ASML, SICE: an improved missing data imputation technique. J Big Data
3. Nnamoko NA, Arshad FN, England D, Vora J, Norman J, Evaluation of filter and wrapper
methods for feature selection in supervised machine learning
4. Bugueno M, Mena F, Araya M (2018) Refining exoplanet detection using supervised learning
and feature engineering
5. Jagtap R, Inamdar U, Dere S, Fatima M, Shardoor NB (2021) Habitability of exoplanets
using deep learning. In: 2021 IEEE international IOT, electronics and mechatronics conference
(IEMTRONICS)
6. Sturrock GC, Manry B, Rafiqi S (2019) Machine learning pipeline for exoplanet classification
7. Malik A, Moster BP, Obermeier C (2021) Exoplanet detection using machine learning
8. Shallue CJ, Vanderburg A (2018) Identifying exoplanets with deep learning: a five-planet
resonant chain around Kepler-80 and an eighth planet around Kepler-90. The Astron J
9. Thompson SE, Mullally F, Coughlin J, Christiansen JL, Henze CE, Haas MR, Burke CJ (2015)
A machine learning technique to identify transit shaped signals. Astrophys J 812:46
10. Schanche N, Cameron AC, Hébrard G, Nielsen L, Triaud AHMJ, Almenara JM, Alsubai KA,
Anderson DR, Armstrong DJ, Barros SCC, Bouchy F, Boumis P, Brown DJA, Faedi F, Hay K,
Hebb L, Kiefer F, Mancini L, Maxted PFL, Palle E, Pollacco DL, Queloz D, Smalley B, Udry
S, West R, Wheatley PJ (2018) Machine-learning approaches to exoplanet transit detection and
candidate validation in wide-field ground-based surveys. Monthly Notices Royal Astron Soc
11. Pearson KA, Palafox L, Griffith CA (2018) Searching for exoplanets using artificial intelligence.
Mon Not R Astron Soc 474:478–491
12. Chintarungruangchai P, Jiang IG (2019) Detecting exoplanet transits through machine-learning
techniques with convolutional neural networks. Publications of the Astronomical Society of
the Pacific
13. Khan MA, Dixit MA, Discovering exoplanet in deep space using deep learning algorithms
14. Priyadarshini I, Puri V (2021) A convolutional neural network (CNN) based ensemble model for exoplanet detection. Earth Sci Inform 14:735–747. https://doi.org/10.1007/s12145-021-00579-5
15. Birkby J (2021) Spectroscopic direct detection of exoplanets. In: Deeg HJ, Belmonte JA (eds).
Springer, pp 1485–1508
16. Doe J, Smith A, Johnson C (2020) Exoplanet detection methods: a comprehensive review.
Astrophys J
17. Brown M, White K, Miller S (2018) Advancements in radial velocity techniques for exoplanet
detection. Monthly Notices Royal Astron Soc
18. Garcia R, Martinez E, Lee J (2021) Transit photometry: unraveling the mysteries of exoplanets.
Ann Rev Astron Astrophys

19. Rodriguez A, Kim B, Patel S (2022) The Role of the James Webb space telescope in exoplanet
characterization. Space Sci Rev
20. Wang L, Chen X, Zhang Y (2019) Machine learning approaches to exoplanet detection in
Kepler data. The Astron J
An Empirical Study on ML Models
with Glass Classification Dataset

Shreyas Visweshwaran, M. Anbazhagan, and K. Ganesh

Abstract Glass classification in machine learning is an important task with various practical applications, including quality control, material identification, and forensic analysis. The chosen glass classification dataset contains information about the properties of different glass samples, and the goal is to classify these samples into different types of glass based on their features. This paper presents a summary of how different machine learning (ML) algorithms perform under the particular conditions we present. Different ML algorithms were modeled, including Logistic Regression (LR), Support Vector Machines (SVM), Decision Tree, K-Nearest Neighbors (KNN), XGBoost, Naive Bayes, Random Forest, and the Multi-Layer Perceptron. We found that tree boosting is a highly effective and widely used ML method, while lazy supervised learning has come into the limelight due to its ability to work very differently yet effectively. On the basis of this study, we also describe the similarities between the K-Nearest Neighbors and XGBoost classification algorithms.

Keywords XGBoost · Machine learning · K-nearest neighbors · Glass


classification

1 Introduction

Glass classification is an important task in industry, and ML can definitely make a huge impact. The dataset used for our machine learning study classifies glass

Shreyas Visweshwaran, M. Anbazhagan and K. Ganesh contributed equally to this work.

S. Visweshwaran (B) · M. Anbazhagan · K. Ganesh


Department of Computer Science, Amrita School of Computing, Amrita Vishwa Vidyapeetham,
Coimbatore, India
e-mail: cb.en.u4cse21455@cb.students.amrita.edu
M. Anbazhagan
e-mail: m_anbazhagan@cb.amrita.edu
K. Ganesh
e-mail: cb.en.u4cse21426@cb.students.amrita.edu

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 403
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_30

into seven categories. Category 1 includes float-processed building windows, while


Category 2 features non-float-processed building windows. Category 3 is for float-
processed vehicle windows, but notably, the equivalent non-float category (Category
4) is absent. Category 5 encompasses various glass containers, Category 6 covers
tableware, and Category 7 pertains to vehicle headlamps, crucial for night driving.
The classification of glass is fundamental in numerous sectors, underpinning the
safety, quality and functionality of many products and structures we encounter daily.
As a versatile material, glass finds applications spanning from windows in our homes
to the screens of our devices. Each application demands specific properties: strength, clarity, thermal resistance, or conductivity. Beyond industrial applications, accurate glass classification is pivotal for sustainability efforts, ensuring recyclability without contamination. Additionally, in research and development, understanding and classifying different types of glass can pave the way for innovations, ushering in advanced
materials with tailored properties.
Our study applies various machine learning algorithms to glass classification,
given its extensive applications and utilizes a rich dataset from the UCI Machine
Learning Repository. We assess and compare the performance of algorithms like
SVM, Random Forest, KNN, MLP and XGBoost. We highlight the importance of
feature optimization and use visual tools to demonstrate data distribution and model
efficacy. Our goal is to enhance glass classification methods, contributing to quality
control and product differentiation in the industry. We also emphasize the importance
of feature optimization, showcasing the significant impact of elements like barium
and silicon on classification results.
Our research seeks to not only refine the process of glass classification but also
emphasize its broader implications, hinting at the potential for this approach to rev-
olutionize quality control and product differentiation in the glass industry.
In our paper, we provide a concise abstract and introduction that highlight the
importance of glass classification and the potential of machine learning (ML) in
this field. We also cited previous works to identify gaps and set the stage for our
contribution, detailing our methodology with data from the UCI Machine Learning
Repository, pre-processing steps and choice of ML algorithms. The results section
and future work discussion guide the readers through our findings and potential next
steps.

2 Literature Review

Our dataset from kaggle.com, fed into our self-prepared models, performed fairly well on almost all algorithms except the Naive Bayes classifier. The dataset used in our paper contains 214 rows and 9 input features. The class labels range from 1 to 7, but label 4 has no data to be predicted.
Wang [6], from IEEE Xplore, used clustering methods to find the proportional class labels of the right proportion of elements. The dataset used there is congruent with our dataset, but the features, dependent variables, and methodology of working over the dataset differed considerably. The Random Forest–Genetic Algorithm (RF-GA) model in that paper is utilized to evaluate the importance of variables in classifying glass samples. The model exhibited a high level of accuracy, precision, recall, and F1 score, indicating effectiveness in classification. The clustering analysis delineated the data into 8 categories, with significant differences observed in variables like silicon dioxide (SiO2), potassium oxide (K2O), and others. This paper, part of the proceedings of the 2023 IEEE 3rd International Conference on Electronic Technology, Communication and Information (ICETCI), also conducted a similar comparative study aimed at gauging the effectiveness of each algorithm based on metrics like accuracy, precision, recall, and F1 score.
Another recent paper that caught our attention was Martin and Chai [4], from the University of Malaysia Sarawak, which presents a comparative study of three machine learning algorithms, namely K-Nearest Neighbors (KNN), Random Forest, and Extreme Gradient Boosting (XGBoost), in predicting landslide susceptibility in Kota Kinabalu, Malaysia; there, KNN outperforms XGBoost by a large margin of around 10%. The aim was to identify the most accurate algorithm for developing a landslide susceptibility model, addressing a significant natural hazard in Malaysia. This paper gave us a gist of the types of datasets on which KNN outperforms XGBoost. The findings revealed that KNN had the highest prediction accuracy with an AUC score of 87.52%, followed by Random Forest (84.34%) and XGBoost (78.07%).
The third paper chosen was mainly focused on text classification, with the focus then shifting to the underperformance of the Naive Bayes algorithm. Qi [5] served as a catalyst in arriving at our results by showing that XGBoost once again outperforms Naive Bayes and other models on that particular dataset. After data pre-processing, a total of 2621 pre-processed theft crime data entries were utilized for the study, with the data divided into training and test sets. Overall, the study posits that the text classification of theft crime data using TF-IDF and the XGBoost algorithm achieves accurate and efficient classification, laying a solid foundation for further analysis and research on various theft crimes from a criminal practice perspective. The research concludes that the theft crime data from 2009 to 2019, classified through the XGBoost algorithm, can serve as base data for predicting various types of crime.
Our study was notably influenced by recent insights from Zhao [8], who
utilized decision tree algorithms and a multi-layer perceptron for classifying high-
potassium and lead-barium glass based on chemical composition. Their approach,
particularly the use of MLP and cross-validation for robustness, informed our
methodology. By integrating similar strategies, we achieved significant accuracy
in our models, especially with K-Nearest Neighbors and XGBoost, demonstrating
the effectiveness of ensemble methods and the importance of feature selection in
glass classification. The insights from previous research have been instrumental in
shaping our approach to glass classification using machine learning.
A number of other papers, such as Ji [3], Zhang [7] and Bao [1], were also of
immense interest, as their topics of glass classification, KNN and XGBoost were in
our limelight. The motive of this paper is to draw a healthy comparison between two
very successful algorithms and also to run other ML algorithms to
406 S. Visweshwaran et al.

conclude on which datasets certain algorithms succeed. Covering the other aspect of
deploying other ML algorithms, Ertekin [2] helped us understand the reason for the
limited performance of SVM on our particular dataset: the data are not clearly
linearly separable, with very close feature sets for classes 1 and 3.

3 Methodology

The glass classification dataset was first pre-processed, with features scaled using
StandardScaler for normalization. We then optimized the parameters through
RandomizedSearchCV and GridSearchCV, evaluating combinations of ’weights’
(uniform, distance) and ’metric’ (euclidean, manhattan, minkowski) over numerous
iterations. The best model parameters were selected based on accuracy scores
obtained during cross-validation. Finally, the model’s performance was evaluated
using metrics like accuracy, the confusion matrix and the classification report,
providing a detailed analysis of its effectiveness in classifying glass types.
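As a concrete illustration, the tuning step above can be sketched with scikit-learn. Synthetic data stands in for the glass measurements, and any parameter values beyond the 'weights' and 'metric' candidates named in the text are our assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the glass data: 214 samples, 9 chemical features.
X, y = make_classification(n_samples=214, n_features=9, n_informative=6,
                           n_classes=6, n_clusters_per_class=1, random_state=0)

# Scale features, then search the KNN hyperparameters named in the text.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = {"knn__weights": ["uniform", "distance"],
        "knn__metric": ["euclidean", "manhattan", "minkowski"]}
search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy").fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Wrapping the scaler and classifier in one pipeline ensures the scaling statistics are re-fit inside each cross-validation fold, avoiding leakage from validation data.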
In our experimental setup, we began by dividing the dataset into two main com-
ponents: the input features, which constitute the independent variables, and a single
output class variable that represents our dependent or target variable. To carry out
most of our operations, we utilized a suite of libraries from the Python ecosystem,
namely sklearn (Scikit-learn), numpy, pandas and tensorflow. These libraries provide
a comprehensive set of tools for data manipulation, model training and evaluation.
The dataset comprises a range of features essential for distinguishing different
glass types. These include the Refractive Index (RI), indicative of optical proper-
ties; concentrations of Sodium (Na), Magnesium (Mg), Aluminum (Al), Silicon (Si),
Potassium (K), Calcium (Ca), Barium (Ba) and Iron (Fe), each contributing uniquely
to the glass’s physical and chemical characteristics. The dataset also includes a cat-
egorical ’Type’ variable, classifying glass into various categories like window glass,
containers and tableware. Each row represents a unique glass sample, allowing for a
nuanced application of machine learning techniques to predict glass type based on
its compositional attributes.
It categorizes glass into seven types, although type 4 (non-float-processed vehicle
windows) is notably absent. The categories range from float-processed building
windows (type 1) to vehicle headlamps (type 7), covering glass types used in various
industries. Our study’s uniqueness lies in applying a wide spectrum of machine
learning algorithms, such as SVM, Random Forest, KNN, MLP and XGBoost, to
this dataset, emphasizing feature optimization’s critical role. Our selected approach
not only enhanced the precision of glass classification but also contributed
significantly to the industry’s quality control and product differentiation efforts,
underscoring the potential of machine learning in advancing sustainable and
innovative material use. To ensure that our models had a robust training foundation
and a fair evaluation platform, we opted for an 80:20 split ratio. This means that 80%
of the data was used for training our algorithms, while the remaining 20% was set
An Empirical Study on ML Models … 407

Table 1 Algorithms for analysis: a concise comparison

| Algorithm | Description | Reasons to choose |
|---|---|---|
| K-nearest neighbors (KNN) | KNN effectively classifies glass types in our dataset by pinpointing subtle chemical differences | KNN precisely classifies glass types in our dataset, navigating through overlapping chemical traits with ease |
| Logistic regression | Logistic Regression uses “One vs All” to classify glass types in our dataset based on chemical attributes, providing probability-based classifications | Grasping prediction confidence in our multi-class dataset enhances decision-making, clarifying clear versus ambiguous glass results |
| Support vector machine (SVM) | SVM thrives in classifying our multi-dimensional chemical dataset, efficiently finding the optimal hyperplane in high-dimensional spaces | SVM excels in our chemically diverse dataset, using kernel methods to pinpoint intricate decision boundaries for precise glass type classification |
| Naive Bayes | Naive Bayes calculates the likelihood of each glass type in our dataset based on chemical composition, assuming feature independence | While Naive Bayes is simple and assumes feature independence, potentially oversimplifying our chemically interlinked data, it can still effectively classify glass types with distinct compositions |
| Random forest classifier | Random Forest, through its ensemble of decision trees, excels in classifying our glass types based on combinations of features, offering diverse data perspectives | Random Forest navigates our dataset’s complex chemistry, minimizing errors and providing a comprehensive view with its multiple trees |
| XGBoost | XGBoost builds sequential trees on our dataset, iteratively improving the link between chemical attributes and glass types | XGBoost effectively handles our dataset’s intricate chemical interactions, reducing anomalies and ensuring a comprehensive understanding through its boosted ensemble of trees |

aside for testing their performance. This ratio is widely adopted in machine learning
practice, as it typically provides a good balance, allowing models to learn from a
substantial amount of data while still having a meaningful chunk of unseen data for
evaluation. We then evaluated the suite of ML models shown in Table 1.
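The split-and-sweep described above can be sketched in a few lines. Synthetic data stands in for the glass dataset, and XGBoost is replaced by scikit-learn's GradientBoostingClassifier so the sketch has no third-party dependency; none of the scores below are the paper's results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=214, n_features=9, n_informative=6,
                           n_classes=6, n_clusters_per_class=1, random_state=0)

# 80:20 train/test split, as used in the study; scale on the training fold only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42,
                                          stratify=y)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {"KNN": KNeighborsClassifier(),
          "Logistic regression": LogisticRegression(max_iter=1000),
          "SVM": SVC(),
          "Naive Bayes": GaussianNB(),
          "Random forest": RandomForestClassifier(random_state=0),
          "Boosted trees": GradientBoostingClassifier(random_state=0)}
results = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
           for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```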

4 Results and Discussion

The Glass Type dataset is originally sourced from the UCI Machine Learning
Repository, a collection of databases utilized by the machine learning community
for empirical studies on various algorithms. The dataset was contributed to the
repository by Dr. C. J. Willmott and Dr. B. R. Hayes of the Central Glass and
Ceramic Research Institute. The dataset contains a 7-class output variable, ranging
from 1 to 7, where each number has the following significance:
Type 1: building_windows_float_processed
Type 2: building_windows_non_float_processed
Type 3: vehicle_windows_float_processed
Type 4: vehicle_windows_non_float_processed (no instance of type 4 in this dataset)
Type 5: containers
Type 6: tableware
Type 7: headlamps
Glass stands as a testament to human ingenuity, serving numerous purposes in
our daily lives. Its classification is more than just sorting different types; it reflects
our safety standards, design aspirations and commitment to sustainability. Imple-
menting machine learning in glass classification could revolutionize the industry by
automating the identification process. This would involve training algorithms on var-
ious glass properties, from chemical composition to physical traits, enabling them
to accurately and swiftly differentiate between glass types, thereby reducing human
error and improving quality control.
The classification of glass in the industrial sector is crucial, and integrating
machine learning offers promising advancements. Our study focuses on the Glass
Type dataset sourced from the renowned UCI Machine Learning Repository, a valu-
able resource for the machine learning community with its extensive range of datasets.
This particular dataset, contributed by Dr. C. J. Willmott and Dr. B. R. Hayes from the
Central Glass and Ceramic Research Institute, categorizes glass into seven distinct
types, numbered 1–7.
The ROC curve in Fig. 1 shows outstanding model performance, with Classes
4 and 5 achieving perfect discrimination (AUC = 1.00) and Class 0 nearly excellent
at 0.98. The model excels across all classes, with slight room for improvement in
Class 0. Figure 2 also helped us understand that barium (Ba) has a large impact in
our study, as it was involved only in the classification of type 7 glasses.
The performance of GridSearchCV in Figs. 1 and 2 is visualized through
boxplots. Each boxplot denotes the distribution of validation scores for a specific
parameter set. Most combinations yield a median validation score between 0.70 and
0.80, with some outliers beyond this range. Performance is consistent across many
parameter combinations, but some configurations tend to exhibit higher median
scores.
The ROC curves of almost all classes closely approach the top-left corner, showcasing
their exceptional predictive capabilities. Classes 3, 4 and 5 achieve perfect
AUC scores of 1.00, indicating flawless discrimination between positive and negative
cases. Classes 0 and 2 also excel with AUCs of 0.96, and Class 1 performs well at 0.94.
These results highlight the model’s remarkable classification proficiency across all
classes. We tested weight and distance metric combinations such as uniform-euclidean,
distance-euclidean and the other metrics required for our result. The distance-minkowski
combination yielded the highest median validation score, outperforming the others. In
contrast, uniform-euclidean was the least effective, and distance weighting generally
outperformed uniform weighting across metrics. The model excelled in classifying
classes 0 and 5, with accuracies of 26 and 19%, but struggled with others, notably
misclassifying 9.3% of class 1 as class 0. Misclassifications were also evident in

Fig. 1 Data and evaluation metrics

Fig. 2 Validation scores of XGBoost using GridSearchCV



classes 2, 3 and 4, suggesting the need for model refinement to distinguish class
features more clearly.
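The per-class one-vs-rest AUCs discussed above can be reproduced with scikit-learn; a hedged sketch on synthetic data (class count reduced for brevity, scores are not the paper's):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import label_binarize

X, y = make_classification(n_samples=300, n_features=9, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1,
                                          stratify=y)

clf = KNeighborsClassifier().fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)                     # (n_samples, n_classes)
y_bin = label_binarize(y_te, classes=clf.classes_)  # one 0/1 column per class

# One-vs-rest AUC per class, as plotted in the ROC curves.
aucs = {c: roc_auc_score(y_bin[:, i], proba[:, i])
        for i, c in enumerate(clf.classes_)}
print({c: round(a, 2) for c, a in aucs.items()})
```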

5 Experimentation and Results

XGBoost and KNN have been the pillars of this dataset so far; several other
algorithms were also tried, and the results are tabulated in Table 2.
Our analysis of multiple machine learning algorithms applied to the glass classification
dataset revealed a spectrum of accuracy scores. The performance varied, with
the lowest accuracy being 55.81% from the Naive Bayes model and the highest at
83.72%, achieved by both K-Nearest Neighbors and XGBoost. In Fig. 3, we can see
that XGBoost’s misclassifications occurred when the predicted type was 1 and the
true type was 3, a small failure of this model. Similarly, in KNN this happens
frequently when the predicted type is 0 and the actual type is 1. The validation scores
for our result are also discussed. In Tables 3 and 4, we see that both the K-Nearest
Neighbors (KNN) and XGBoost models have lower precision scores for glass types
1 and 2. This also validates the ROC scores obtained in Fig. 1. Looking at glass
types 6 and 7, barium was a distinguishing component for both types, ultimately
contributing to the 83.72% accuracy rate. This indicates that our dataset likely
has clear patterns or structures that these algorithms can leverage effectively. KNN’s
success implies the presence of identifiable clusters in the data, as it relies on the
closest ‘k’ data points for prediction. Meanwhile, XGBoost’s effectiveness points
to the suitability of ensemble methods, especially those that improve performance
through iterative boosting, for our dataset.
The Random Forest and MLP models showed strong performance, achieving
accuracy rates of 76.62% and 76.74%, respectively. Random Forest’s success sug-
gests that our dataset is suitable for decision tree-based analysis, benefiting from
multiple trees’ collective insights. The MLP’s effectiveness points to the presence
of non-linear relationships in the data, which neural networks can capture. In a

Table 2 Algorithms with accuracy scores and cross-validation techniques

| Algorithms | Accuracy scores (in percentage) | Cross-validation |
|---|---|---|
| Decision tree | 70.15 | Grid search CV |
| KNN | 83.72 | Random search CV |
| Logistic regression | 62.41 | Grid search CV |
| SVM | 67.83 | Grid search CV |
| Naive Bayes | 55.81 | No cross-validation |
| Random forest | 76.62 | Random search CV |
| Multi-layer perceptron | 76.74 | Grid search CV |
| XGBoost | 83.72 | Grid search CV |

XGBoost and KNN achieve identical scores

Fig. 3 Heat maps and validation score using Random Search CV

Table 3 Classification report for KNN

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| 1 | 0.69 | 1.00 | 0.81 |
| 2 | 0.83 | 0.71 | 0.77 |
| 3 | 1.00 | 0.33 | 0.50 |
| 5 | 1.00 | 0.75 | 0.86 |
| 6 | 1.00 | 1.00 | 1.00 |
| 7 | 1.00 | 1.00 | 1.00 |

Table 4 Classification report for XGBoost

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| 1 | 0.83 | 0.91 | 0.87 |
| 2 | 0.79 | 0.79 | 0.79 |
| 3 | 1.00 | 0.67 | 0.80 |
| 5 | 0.67 | 0.50 | 0.57 |
| 6 | 1.00 | 1.00 | 1.00 |
| 7 | 0.89 | 1.00 | 0.94 |

lower performance tier, Decision Trees and SVM recorded accuracies of 70.15%
and 67.83%, respectively. The slightly weaker performance of Decision Trees
compared to Random Forest indicates the advantages of ensemble approaches. SVM’s
results imply the data separation is only approximately linear, hinting at potential
class overlaps or suboptimal settings and suggesting the need for parameter tuning
through methods like Grid Search CV. Logistic Regression achieved 62.41% accuracy
in our study, indicating complexities beyond linear correlations in our data. Naive
Bayes was less effective at 55.81%, likely due to its assumption of feature independence.
The frequent use of Grid Search CV, and sometimes Random Search CV, for
hyperparameter optimization was key.
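Per-class precision, recall and F1, as reported in Tables 3 and 4, come straight from scikit-learn's classification report; a minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=9, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2,
                                          stratify=y)

pred = KNeighborsClassifier().fit(X_tr, y_tr).predict(X_te)
print(confusion_matrix(y_te, pred))                 # misclassification structure
print(classification_report(y_te, pred, digits=2))  # per-class precision/recall/F1
```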

6 Conclusion

In our research, the XGBoost and K-Nearest Neighbors (KNN) algorithms showed
strikingly similar performance levels when applied to our chosen dataset. Both models
displayed comparable accuracy and predictive efficiency, indicating that selecting
either model might not significantly affect overall performance in this particular
scenario. Our comprehensive evaluation of various machine learning models on this
dataset revealed a wide range of performance outcomes. K-Nearest Neighbors and
XGBoost stood out for their high accuracy, but other models like Random Forest and
the Multi-Layer Perceptron also delivered noteworthy results, highlighting the
complexity and multifaceted nature of the dataset. The application of hyperparameter
tuning, primarily using Grid Search CV and Random Search CV, further demonstrates
the importance of customizing and fine-tuning these algorithms to align with the
unique attributes of the data.

7 Future Work

Looking ahead, we have ambitious plans to broaden and deepen our research in
several critical areas. One key focus will be advanced feature engineering, aiming
to boost the performance of our models. We are also keen to explore the advantages
of ensemble methods more thoroughly, particularly how leveraging the collective
predictions of multiple models can enhance both accuracy and robustness.
Our future endeavors will include comprehensive comparative analyses of various
machine learning algorithms, scrutinizing their effectiveness across different glass
datasets. A major emphasis will be placed on improving the interpretability and
explainability of our models, recognizing the importance of these aspects for
practical industrial use. To bridge the theoretical-practical divide, we plan to apply
our models in real-world industrial environments, potentially in collaboration with
industry partners. Such real-world applications will provide invaluable feedback on
the performance of our models and highlight areas for further refinement, ensuring
that our research contributes directly to practical advancements in the field.
In future research, we aim to tackle the challenges of imbalanced datasets iden-
tified in our current study. To develop advanced systems for glass classification,
we will implement innovative technologies and conduct studies on environmental
impact, focusing on enhancing the efficiency and sustainability of glass recycling.
Our goal is to address current limitations and pave the way for advanced machine
learning applications in glass classification.

References

1. Bao J (2020) Multi-features based arrhythmia diagnosis algorithm using xgboost. In: 2020
International conference on computing and data science (CDS), pp 454–457
2. Ertekin S, Bottou L, Giles CL (2011) Nonconvex online support vector machines. IEEE Trans
Pattern Anal Mach Intell 33(2):368–381
3. Ji Z (2023) Glass classification and identification based on logistic regression model and k-means
clustering analysis. Highlights Sci Eng Technol 40:64–71
4. Martin D, Chai SS (2022) A study on performance comparisons between knn, random forest
and xgboost in prediction of landslide susceptibility in Kota Kinabalu, Malaysia. In: 2022 IEEE
13th control and system graduate research colloquium (ICSGRC), pp 159–164
5. Qi Z (2020) The text classification of theft crime based on tf-idf and xgboost model. In: 2020
IEEE international conference on artificial intelligence and computer applications (ICAICA),
pp 1241–1246
6. Wang S, Wang Z, Ji T (2023) Glass classification based on random forest-genetic algorithm and
k-means clustering analysis. In: 2023 IEEE 3rd international conference on electronic technology,
communication and information (ICETCI), pp 1527–1532
7. Zhang Z, Wang T, Wang X (2023) Glass classification and identification based on decision tree
and random forest classification models. Highlights Sci Eng Technol 39:475–481
8. Zhao J, Zheng Z, Fang C, Huang Y, Zhang B (2023) Research on glass relics based on machine
learning. In: 2023 IEEE 2nd international conference on electrical engineering, big data and
algorithms (EEBDA), pp 1939–1942
Design Novel Detection of Exudates
Using Wavelets Filter and Classification
of Diabetic Maculopathy

Chetan Pattebahadur, A. B. Kadam, Anupriya Kamble, and Ramesh Manza

Abstract Diabetic maculopathy is generally classified by scientists as a pathological
illness and is one of the most serious effects of diabetes. High blood sugar levels in
diabetes patients affect several bodily components, including the retina. In the present
research we detect exudates, a diabetic maculopathy lesion, using the Symlet4 and
Haar wavelets and compare which wavelet gives better results; we obtained positive
results with the Haar wavelet, and using a support vector machine classifier we
achieved a 95.7% result.

Keywords Diabetic · Macula · Retina · Exudates · SVM

1 Introduction

Diabetic maculopathy is one of the complications of diabetes mellitus that is regarded
as the leading cause of eyesight loss in people worldwide [1]. It happens as a result of
the compromised retinal vasculature allowing fluid that is high in fat and cholesterol
to flow out. Exudates, an accumulation of these liquids, are located close to the retina’s
center [2]. The early stages of diabetic maculopathy are typically accompanied by
no symptoms, as the condition progresses slowly and silently. Blindness may result
from irreparable damage to the macula or visual field if maculopathy is not identified
in the early stages. As a result, mandatory routine screening of diabetic eyes will

C. Pattebahadur (B) · R. Manza


Department of Computer Science and Information Technology, Dr. Babasaheb Ambedkar
Marathwada University Aurangabad Maharashtra, Aurangabad 431001, India
e-mail: chetu358@gmail.com
A. B. Kadam
Department of Computer Science, Shri Shivaji Science and Arts College Chikhli Buldhana,
Chikhli, India
A. Kamble
Department of Computer Science, Vishwakarma University, Pune, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 415
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_31

help to detect maculopathy at its earliest stage and lower the likelihood of serious
vision loss. A significant number of retinal images are produced as a result of digital
screening for maculopathy [3]. Exudates appear as yellow or white formations in the
retina, and if not found early a person may lose his or her vision totally, because the
macula provides the central vision of the retina.

2 Methodology

The detection of diabetic maculopathy through the wavelet filter requires a strong
standard database; here we use STARE [4], which is publicly available as open
source. Digital image processing was then applied: the wavelet filter was used for
feature extraction, followed by an SVM classification technique for grading of
maculopathy. Retinal images in the STARE database are captured by a fundus
camera and cannot be used directly with 100% reliability [5]: some captured images
are poorly shaped, or the regions of interest are unclear, so preprocessing to improve
image quality is very important [6]. Figure 1 shows the workflow of the present
study: the fundus image database is given as input, the green channel is extracted,
the intensity transformation function is applied to the extracted images, a threshold
is obtained for both the Sym4 and the Haar wavelet, and finally the exudates are
detected from these wavelets.

2.1 Preprocessing

As noted above, fundus images from the STARE database may be poorly shaped or
have unclear regions of interest, so preprocessing is applied to improve image quality
before feature extraction [5, 6].

2.1.1 RGB Channel

In image processing, the green channel is commonly used, as the red and blue
channels are not as feature-rich as the green channel. The formulas for separating
the red, green, and blue channels are given below [7].

Fig. 1 Workflow of the detection of exudates

1. R channel:

   r = R / (R + G + B)   (1)

   This is the red channel, where R, G, and B stand for red, green, and blue,
   respectively.

2. G channel:

   g = G / (R + G + B)   (2)

   This is the green channel, with R, G, and B again denoting red, green, and blue.

3. B channel:

   b = B / (R + G + B)   (3)

Fig. 2 Extraction of mask

2.1.2 Extraction of Mask

Mask separation is crucial in image preprocessing, because we only want to process
high-quality images. Certain fundus photos are not in good condition; if the shape of
a fundus image is of poor quality, we disregard it and keep only well-shaped,
good-quality images for processing, since a corrupted image does not produce a
decent outcome. To extract the mask, the green channel and a threshold are applied
to all retinal images:

   T = (1/2)(m1 + m2)   (4)

Figure 2 reflects the extraction of the mask of the input fundus image, which is then
used for the intensity transformation function.
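A plain-NumPy sketch of these two preprocessing steps. The iterative reading of T = ½(m1 + m2), with m1 and m2 the mean intensities below and above the current threshold, is our assumption, and the function names are illustrative:

```python
import numpy as np

def normalized_green(rgb):
    """g = G / (R + G + B), as in Eq. (2); rgb is an H x W x 3 float array."""
    s = rgb.sum(axis=2)
    s[s == 0] = 1.0          # guard against division by zero on black pixels
    return rgb[:, :, 1] / s

def mask_threshold(channel, tol=1e-3):
    """Iterative global threshold T = (m1 + m2) / 2 from Eq. (4), where m1 and
    m2 are the mean intensities below and above the current threshold."""
    t = float(channel.mean())
    while True:
        m1 = channel[channel <= t].mean()
        m2 = channel[channel > t].mean()
        new_t = 0.5 * (m1 + m2)
        if abs(new_t - t) < tol:
            return new_t
        t = new_t

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))            # stand-in for a fundus photograph
g = normalized_green(img)
mask = g > mask_threshold(g)             # True where the pixel is kept
print(mask.shape, float(mask.mean()))
```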

2.2 Intensity Transformation Function

The intensity transformation function can highlight the brighter components of the
lesion, i.e., the exudates:

   s = T(r)   (5)

2.3 Thresholding

To make the detected exudates visible, adaptive thresholding is used: a different
threshold value is calculated for each local area. The scale used for thresholding is
0:1 [8].
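The paper does not give its window size or offset; a minimal local-mean version of adaptive thresholding (block size and offset are illustrative choices, not values from this study) might look like:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def adaptive_threshold(img, block=15, offset=0.02):
    """Binarize against a local-mean threshold over a block x block window."""
    local_mean = uniform_filter(img.astype(float), size=block)
    return img > (local_mean + offset)

rng = np.random.default_rng(1)
img = rng.random((64, 64))               # intensities already on the 0:1 scale
binary = adaptive_threshold(img)
print(binary.shape, float(binary.mean()))
```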

2.4 Wavelet Filter

The wavelet tool in MATLAB offers a large selection: we can create new wavelets,
add them to existing wavelet families, and utilize existing families. Through a
mathematical procedure similar to picture compression, wavelets decompose a
signal into a representation that displays the signal’s information. Wavelets are
capable of several different operations, including data compression and noise
reduction. The transforming function is the basis function ψ(t), also known as the
mother wavelet [9]. Here we employ the Symlet4 and Haar wavelets, which are part
of the orthogonal wavelet families, for feature extraction. Symlet wavelets are
sometimes known as Daubechies least-asymmetric wavelets, since they are created
in the same way as Daubechies wavelets; whereas the Daubechies wavelets have
maximum phase, the Symlets have minimum phase [10]. The wavelet coefficients
are written (Symlet, n), where n can be any positive even number and denotes the
size of the generated filters; the number of vanishing moments of the Symlet wavelet
of size n is n/2 [10]. The Haar transform, based on the Haar wavelet, is the simplest
compression approach. A Haar wavelet analysis is done using the DWT2 tool for
postprocessing: it computes the approximation coefficients matrix as well as the
detail coefficient matrices (horizontal, vertical, and diagonal). In this case the
inverse wavelet, i.e., the reconstructed image, produces an excellent outcome. The
approximation matrix X of the reconstructed or inverted wavelet is based on the
approximation matrix CA and the detail matrices CH, CV, and CD [13–15].
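MATLAB's dwt2/idwt2 pair described above can be mirrored in a few lines of NumPy. This sketch implements the single-level 2-D Haar transform directly; coefficient ordering and sign conventions may differ from MATLAB's, but the perfect-reconstruction property is the same:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar transform: returns (cA, cH, cV, cD)."""
    a = (x[0::2, :] + x[1::2, :]) / np.sqrt(2)   # rows: lowpass pairs
    d = (x[0::2, :] - x[1::2, :]) / np.sqrt(2)   # rows: highpass pairs
    cA = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2)  # approximation
    cH = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2)  # horizontal detail
    cV = (d[:, 0::2] + d[:, 1::2]) / np.sqrt(2)  # vertical detail
    cD = (d[:, 0::2] - d[:, 1::2]) / np.sqrt(2)  # diagonal detail
    return cA, cH, cV, cD

def haar_idwt2(cA, cH, cV, cD):
    """Inverse transform: rebuilds the image from CA, CH, CV, CD."""
    a = np.zeros((cA.shape[0], 2 * cA.shape[1]))
    d = np.zeros_like(a)
    a[:, 0::2], a[:, 1::2] = (cA + cH) / np.sqrt(2), (cA - cH) / np.sqrt(2)
    d[:, 0::2], d[:, 1::2] = (cV + cD) / np.sqrt(2), (cV - cD) / np.sqrt(2)
    x = np.zeros((2 * a.shape[0], a.shape[1]))
    x[0::2, :], x[1::2, :] = (a + d) / np.sqrt(2), (a - d) / np.sqrt(2)
    return x

img = np.arange(64, dtype=float).reshape(8, 8)
coeffs = haar_dwt2(img)
print(np.allclose(haar_idwt2(*coeffs), img))  # True: perfect reconstruction
```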

2.5 Support Vector Machine Classifier

The support vector machine classifier is a supervised learning method. It has grown
in popularity for tackling classification challenges on large-scale datasets [11].
Many current studies have found that SVMs have the ability to outperform other
data categorization techniques in terms of classification accuracy [12]. The decision
function is

   H(x) = w^T x + b = Σ_{i=1}^{n} w_i x_i + b   (6)

3 Results and Discussion

In the present research the openly available STARE database has been used. We
extract the exudates using the Symlet4 and Haar wavelets and compare which
wavelet gives better results; we then count the exudates near the macula and, from
this count, decide the severity of the disease using a support vector machine (SVM)
classifier. If the exudate count near the macula is 0, we define the image as normal;
otherwise it is abnormal. Using the Symlet4 and Haar wavelets we detect diabetic
maculopathy exudates, compare the two wavelets using a statistical technique, and
check which wavelet gives the better result [13–15].
The detection of exudates is seen in the middle column of Fig. 3, where the first
column shows the Symlet4 and Haar wavelet outputs and the last column shows the
exudate detection on the original image.

3.1 Statistical Techniques

Table 1 contains statistical parameters for calculating the correlation between the
Symlet4 wavelet and the Haar wavelet, where x represents the Symlet4 wavelet
parameters and y represents the Haar wavelet parameters:
x: exudates of the Symlet4 wavelet
y: exudates of the Haar wavelet

Mean(x) = X = 4825.98 / 30 = 160.86

Mean(y) = Y = 1686.15 / 30 = 56.20

Variance(x) = Σ(x − X) / N = 4665.12 / 30 = 155.50

Variance(y) = Σ(y − Y) / N = 1629.25 / 30 = 54.33

Standard deviation(x) = √Variance(x) = √155.50 = 12.46

Standard deviation(y) = √Variance(y) = √54.33 = 7.37

Correlation: r = [Σ(x − X) · Σ(y − Y)] / √[Σ(x − X)² · Σ(y − Y)²]

Fig. 3 Detection of exudates

where

Σ(x − X) = 4665.12
Σ(y − Y) = 1629.25
Σ(x − X)² = 21,763,344.61
Σ(y − Y)² = 2,654,455.56

r = (4665.12 × 1629.25) / √(21,763,344.61 × 2,654,455.56)

Table 1 Statistical parameters of exudates

S. No. | x | y | (x − X) | (y − Y) | (xy)
1 8261.625 2611.75 8100.765 2555.55 21,577,299
2 4285.25 1470.125 4124.39 1413.925 6,299,853.2
3 3199.75 1025 3038.89 968.8 3,279,743.8
4 3646.875 1209.375 3486.015 1153.175 4,410,439.5
5 7922.625 2947.75 7761.765 2891.55 23,353,918
6 2398.25 733.5 2237.39 677.3 1,759,116.4
7 2281.25 874.875 2120.39 818.675 1,995,808.6
8 4210.875 1308.375 4050.015 1252.175 5,509,403.6
9 3699.875 1302.375 3539.015 1246.175 4,818,624.7
10 2393.25 783.875 2232.39 727.675 1,876,008.8
11 2800 1092.375 2639.14 1036.175 3,058,650
12 2840.25 1008 2679.39 951.8 2,862,972
13 14,712.13 4908.875 14,551.27 4852.675 72,220,007
14 4270.5 1476.125 4109.64 1419.925 6,303,791.8
15 1076.125 379.25 915.265 323.05 408,120.41
16 2883.5 755.75 2722.64 699.55 2,179,205.1
17 4454.125 1759.875 4293.265 1703.675 7,838,703.2
18 6869.875 2752.625 6709.015 2696.425 18,910,190
19 2342.25 779.75 2181.39 723.55 1,826,369.4
20 726.5 235.125 565.64 178.925 170,818.31
21 4891.375 1605.875 4730.515 1549.675 7,854,936.8
22 1525.125 447.25 1364.265 391.05 682,112.16
23 2269.875 752.625 2109.015 696.425 1,708,364.7
24 20,535.63 7877.75 20,374.77 7821.55 161,774,559
25 6321.25 2237.5 6160.39 2181.3 14,143,797
26 2441.25 742.25 2280.39 686.05 1,812,017.8
27 13,308.25 4831.125 13,147.39 4774.925 64,293,819
28 4593.125 1598.25 4432.265 1542.05 7,340,962
29 2605.5 833.625 2444.64 777.425 2,172,009.9
30 1013.375 243.625 852.515 187.425 246,883.48

r = 7600646.76 / 7600646.74 = 1

The correlation of the exudates is positive. This means that the Haar wavelet gives
a better result than the Symlet4 wavelet.
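For reference, a standard Pearson correlation (covariance over the product of standard deviations) can be computed directly with NumPy; the five (x, y) pairs below are taken from the first rows of Table 1, not the full 30-sample set:

```python
import numpy as np

# First five (x, y) pairs from Table 1: exudate features for the
# Symlet4 (x) and Haar (y) wavelets.
x = np.array([8261.625, 4285.25, 3199.75, 3646.875, 7922.625])
y = np.array([2611.75, 1470.125, 1025.0, 1209.375, 2947.75])

# Pearson correlation coefficient: cov(x, y) / (std(x) * std(y)).
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))   # strongly positive for these samples
```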
In Table 2, random images from the dataset have been taken and the count of exudates
for diabetic maculopathy classification has been recorded. If the exudate count is zero,

Table 2 Statistical count of exudates

S. No. | Image name | Exudates count
1 Image1 10
2 Image2 5
3 Image10 26
4 Image11 20
5 Image15 0
6 Image19 7
7 Image20 13
8 Image51 0
9 Image55 4
10 Image60 1

it is said to be a normal image, and if the count is one or greater it is said to be an
abnormal image of diabetic maculopathy.

3.2 Classification and Grading

For the classification and grading of diabetic maculopathy exudates, a support vector
machine classifier has been used. The images have been classified into normal and
abnormal grades, achieving a 95.7% result. In Fig. 4, blue indicates abnormal images
and orange indicates normal images.
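The grading rule above (count 0 is normal, count one or greater is abnormal) can be learned by an SVM over the counts in Table 2; a minimal sketch, with labels derived from the rule itself, so this only illustrates the mechanics rather than reproducing the paper's 95.7% result:

```python
import numpy as np
from sklearn.svm import SVC

# Exudate counts near the macula, from Table 2; label 0 = normal (count 0),
# label 1 = abnormal (count >= 1), per the grading rule in the text.
counts = np.array([[10], [5], [26], [20], [0], [7], [13], [0], [4], [1]])
labels = (counts.ravel() >= 1).astype(int)

clf = SVC(kernel="linear").fit(counts, labels)
print(clf.predict([[0], [12]]))   # [0 1]: normal, then abnormal
```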

Fig. 4 Classification and grading of exudates



4 Conclusion

The identification of exudates is critical in diabetic macular edema. Exudates are
the most severe sign of maculopathy. If exudates harm the macula, the patient may
lose his eyesight; identification is therefore critical, since a patient who is diagnosed
early will receive therapy and be spared from losing his vision. In this exploration,
we employed the Symlet4 and Haar wavelets to detect exudates and compared which
wavelet produces better results; we obtained better results with the Haar wavelet,
and using the SVM classifier we achieved 95.7%. Ophthalmologists will benefit
from this evaluation.

References

1. Pattebahadur C, Manza R, Kamble A (2019) Design a novel detection for maculopathy using
weightage KNN classification. https://doi.org/10.1007/978-981-13-9184-2_32
2. American Diabetes Association: American Diabetes Association Copyright 1995–2018
[Internet]. http://www.diabetes.org/diabetes-basics/type-1/
3. Noronha K, Nayak KP, Automated diagnosis of diabetes maculopathy: a survey
4. Structured Analysis of the Retina. http://cecas.clemson.edu/~ahoover/stare
5. Pattebahadur C, Manza R, Kamble A, Varma P (2020) Detection and counting of microa-
neurysm for early diagnosis of maculopathy
6. Analytics Vidhya—Getting started with image processing using OpenCV. https://www.analyticsvidhya.com/blog/2023/03/getting-started-with-image-processing-using-opencv/. Accessed 5 July 2023
7. Rajput YM, Manza RR, Patwari MB, Deshpande N (2013) Retinal optic disc detection using
speed up robust features. In: National conference on computer and management science [CMS-
13], Apr 25–26, 2013, Radhai Mahavidyalaya, Aurangabad-431003 (MS India)
8. Deshmukh P, Chavan S, Rodrigues W, Shinde A, Comparison of techniques for diabetic
retinopathy detection using image processing. Int J Adv Res Ideas Innov Technol. ISSN:
2454-132X
9. Xu L, Luo S (2010) A novel method for blood vessel detection from retinal images. BioMed
Eng Online 9:14 http://www.biomedical-engineering-online.com/content/9/1/14
10. maplesoft.com, ‘Discrete Transforms Wavelets’ [Online]. Available: https://www.maplesoft.
com/support/help/maple/view.aspx?path=DiscreteTransforms%2FWavelets. Accessed 5 July
2023
11. Ladicky L, Torr P (2011) Linear support vector machines 985–992
12. Srivastava D, Bhambhu L (2010) Data classification using support vector machine. J Theor
Appl Inf Technol 12:1–7
13. Kamble A, Hannan SA, Jain A, Manza R (2021) Prediction of prediabetes, no diabetes and
diabetes mellitus-2 using pattern recognition
14. Kamble AK, Manza RR, Rajput YM, Hannan SA (2017) Association redetection of regular
insulin and NPH insulin using statistical features. In: Proceedings of the 5th International
conference on system modeling and advancement in research trends, SMART, pp 59–62,
7894490
15. Kamble AK, Manza RR, Rajput YM (2016) Classification of insulin dependent diabetes
mellitus by K-Means. In: ICIIECS’16 Proceedings, pp 902–904. ISBN 978-1-4673-8207-6
An Optimized Neural Network Model
to Classify Lung Nodules from CT-Scan
Images

Asiya and N. Sugitha

Abstract The early identification of lung nodules in chest CT scans is vital for human
life and can prevent health emergencies. Manual prediction of lung nodules is incon-
sistent, and at the early stages of lung cancer they cannot be reliably detected, so an
artificial intelligence system is required to identify lung nodules at an early stage.
Many researchers have worked on lung nodule prediction and classification with
machine learning and deep learning, but the implemented models could be more
robust and consistent. So, we have proposed a novel approach to detect lung nodules
early using a customized CNN. It can easily segment small nodules during classifi-
cation, and we used a kernel regularizer to avoid overfitting. This model was imple-
mented on the LIDC-IDRI dataset from Kaggle with 25,000 samples. Finally, we
obtained an accuracy of 0.951, with calculated precision, recall, and F1-score. With
this, we can confirm that our model performs consistently.

Keywords Lung nodules · CNN · LIDC-IDRI · Machine learning · Neural networks

1 Introduction

Lung nodule detection is a critical aspect of modern healthcare, as it plays a key
role in the early diagnosis and treatment of lung-related diseases, particularly lung
cancer. A lung nodule is a small, round, or oval-shaped abnormality found on a
chest X-ray or a CT scan [1–3]. These nodules can be benign or malignant, and
distinguishing between the two is crucial for patient care. The significance of lung
nodule detection depends on its potential to detect lung cancer at an earlier and

Asiya (B)
CSE Department, Noorul Islam Center for Higher Education, Kumarakovil, Thukalay, Tamil
Nadu, India
e-mail: syedasiya14@gmail.com
N. Sugitha
Saveetha Engineering College, Thandalam, Chennai, India
e-mail: sugithan@saveetha.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 425
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_32

treatable stage, which can significantly improve patient life [4, 5]. Early detection
allows for less interference and more effective treatment possibilities, eventually
increasing the chances of survival. In addition, monitoring lung nodules over time
can help manage various pulmonary conditions [7, 13].
The advancements in technology and image processing, such as computed tomog-
raphy (CT) scans and artificial intelligence, have revolutionized the field of lung
nodule detection [14, 15]. These technological improvements have made it possible
to accurately detect and characterize lung nodules, reducing the risk of misdiagnosis
and unnecessary invasive procedures. In this age of rapidly evolving healthcare, the
focus on lung nodule detection is essential, and ongoing research and innovation in
this field continue to enhance our ability to identify and manage these lung abnor-
malities, ultimately improving patient care and outcomes [16, 17]. This introduction
sets the stage for exploring the various aspects of lung nodule detection, including
its methods, challenges, and vital role in modern medicine.
Several datasets are presently available for lung nodule detection, especially in
medical organizations. The most popular and widely used dataset is LIDC-IDRI,
which contains thousands of annotated CT scans [18, 19]. Another dataset, LUNA,
is a subset of LIDC-IDRI and is also part of valuable research on lung nodule
detection. The NIH Chest X-ray Dataset, JSRT, Shenzhen Hospital CT Images, and
Montgomery County X-ray Set offer chest X-ray images with lung nodule cases.
Researchers have also created LIDC-IDRI-like datasets by collecting and annotating
their own CT scans. These datasets are essential for developing and evaluating algo-
rithms for early lung nodule detection, but ethical and privacy considerations should
be followed when working with patient data [20, 21].
Machine learning algorithms are crucial for lung nodule detection due to their
ability to provide early and accurate identification of potential cancerous growths in
medical images like CT scans. SVM, random forest, KNN, etc., are the most imple-
mented algorithms for lung nodule detection. Keshani et al. [11] have proposed a
model for lung nodule detection, segmentation, and recognition using CT images.
The approach involves a sequence of steps, including lung area segmentation through
active contour modeling and masking techniques, SVM classification for nodule
detection utilizing 2D and 3D features, contour extraction for precise nodule delin-
eation, and the classification of lung tissues into four categories. The method is eval-
uated using clinical CT images and public datasets, achieving an accuracy of 89%.
Nada et al. [12] focused on the early detection and localization of lung nodules,
crucial for diagnosing lung cancer. Machine learning and random forest algorithm
are implemented for the task of feature groups on classification accuracy. After
experimentation on a dataset from the LIDC database, it achieved 96.20% accuracy.
Lung nodule detection in medical imaging relies on neural network models, with
CNN, region-based CNNs, like faster R-CNN and SSD, are useful for localizing
nodules, while transfer learning accelerates model training using pre-trained archi-
tectures. Other CNN models such as VGG16, 19, ResNet, AlexaNet, etc., are also
preferred by various researchers [21–23].
In our research, we proposed an optimized CNN model to classify the lung models
with kernel regularization to avoid overfitting. We used the LIDC-IDRI dataset from

Kaggle with 25,000 samples. We selected 15,000 samples from the existing dataset
that included three classes. The proposed model with optimized CNN has performed
the classification in well-defined structure and gained an accuracy of 95.1% on the
selected dataset. This proposed model has the potential to significantly impact the
field of medical diagnostics, improving early detection and patient outcomes.
The remainder of the paper is organized as follows: Sect. 2 reviews existing studies
relevant to lung nodule detection, Sect. 3 presents the methodology of the proposed
work, its architecture, and the dataset, Sect. 4 analyzes the results, and Sect. 5
concludes the paper.

2 Related Work

Weihua et al. [1] introduced PiaNet, a CNN-based model that detects ground-glass
opacity (GGO) nodules in 3D CT images. PiaNet comprises pyramid multi-scale
source connections, an MRFCB for improved feature capture, and a classifier for
GGO nodule identification at different scales. The results on the LIDC-IDRI dataset
show that PiaNet achieved a sensitivity of 93.6%. Tenescu et al. [2]
conducted research to improve lung nodule detection in CT scans, a challenging task
in radiology. They applied a weight-averaging ensemble technique that was initially
designed for natural image classification to enhance model performance. Using a
dataset of 1050 patients, the researchers fine-tuned models under various configura-
tions. As a result, they improved the FROC score from 0.872 to 0.886.
David et al. [3] introduced a methodology for automating, identifying, and classi-
fying ILA patterns in CT images. These ILA patterns have clinical consequences, as
they are associated with increased risk in smokers, even before the development of
interstitial lung disease. The methodology employs an ensemble of CNN, including
2D, 2.5D, and 3D architectures, to enhance feature detection accuracy. Using a
substantial dataset of 37,424 radiographic tissue samples from 208 CT scans, the
ensemble model achieved an impressive average sensitivity of 91.41%. Bin Hu et al.
[4] proposed ensemble multi-view 3D CNN model that significantly advances lung
cancer diagnosis and risk stratification. The model achieves remarkable performance
by leveraging advanced deep learning algorithms and a large dataset of 1075 lung
nodules, with AUC scores of 91.3% for distinguishing benign/malignant nodules and
92.9% for identifying pre-invasive/invasive nodules.
Hailun et al. [5] discussed the utilization of deep learning methods for the classi-
fication of lung nodule malignancy. A systematic literature search identified sixteen
relevant studies, employing techniques such as CNN, autoencoders (AE), and deep
belief networks (DBN) to diagnose and predict lung nodule malignancy. Notably,
these deep learning models consistently achieved a high accuracy of 91%. Seifedine
et al. [6] proposed research aimed to use CNN for segmenting lung nodules in CT
scans. Test images are sourced from the TCIA database. The study indicates that

VGG-SegNet, a pre-trained model, outperforms other CNN methods with Jaccard,
Dice, and Accuracy metrics exceeding 88%, 93%, and 96%, respectively.
Gugulothu et al. [7] aimed to improve the accuracy and reduce the false-positive
rate in lung nodule detection using deep learning (DL) techniques. Their contrast
enhancement and nodule detection model uses an “earliest event-Net” classifier.
Features are extracted and selected using an optimization algorithm, and classifi-
cation is performed using a specialized classifier named “CSDR-J-WHGAN.” The
method achieved an accuracy of 97.11% on the LIDC-IDRI dataset. Annavarapu et al.
[8] proposed a U-Net-based model and utilized the LUNA-16 dataset, which contains
CT image samples; on this dataset the model achieved an accuracy of 81.66%.
Naseer et al. [9] focused on leveraging artificial intelligence, specifically a modi-
fied U-Net for lobe segmentation and nodule extraction, followed by a modified
AlexNet-SVM for nodule classification, to automate the accurate identification and
classification of lung cancer in CT scans. The methodology demonstrated promising
results, achieving a 97.98% accuracy on the LUNA16 dataset. Halder et al. [10] introduced
an innovative Atrous CNN framework developed to segment and characterize
lung nodules in HRCT images. The ATCNN framework, particularly the ATCNN2PR
variant, has demonstrated exceptional performance by leveraging Atrous convolu-
tion and multi-scale feature extraction. This model achieved an accuracy of 95.97%
on the LIDC-IDRI dataset.

3 Methodology

In our proposed model, we implemented an optimized CNN for classifying lung
nodules from chest CT images, reflecting our commitment to achieving the best
possible performance. Notably, we mitigated overfitting through L1 and L2
regularization and trained and tested various hyperparameters. All dataset samples
are preprocessed before training. The proposed model has four two-dimensional
convolution layers in the architecture, each followed by average pooling, and finally
three dense layers. The step-by-step procedure of the proposed model is presented in
Fig. 1; we validated our model on various test samples and arrived at an optimized
configuration.

3.1 Data Set

We used the LIDC-IDRI dataset from Kaggle with 25,000 samples [24]. We selected
15,000 samples that include lung adenocarcinoma (lung aca), lung squamous cell
carcinoma (lung scc), and lung nodules (lung_n) in equal proportions, which is a
reasonable approach to ensure a balanced dataset.

Fig. 1 Proposed CNN model

Fig. 2 Sample images of lung nodules

Resizing all the samples to a common size of 224 × 224 pixels is a common
practice in image classification tasks and ensures uniformity. This step simplifies the
input data for our CNN model.
Splitting our data into train, test, and validation sets with a 70:20:10 ratio is a
standard partitioning strategy. Using a random state of 50 for the split helps ensure
reproducibility, as others can reproduce the same data split using the same random
state. Figure 2 illustrates the sample figures after resizing.
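The 70:20:10 split with a fixed random state can be sketched as follows. The index-based helper is an illustrative assumption (the paper does not publish its preprocessing code), and the actual resizing to 224 × 224 pixels would be done with an image library such as Pillow before this step.

```python
import random

def split_indices(n_samples, seed=50, ratios=(0.70, 0.20, 0.10)):
    """Shuffle sample indices reproducibly and split them into
    train/test/validation partitions according to `ratios`."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # fixed seed -> reproducible split
    n_train = round(ratios[0] * n_samples)
    n_test = round(ratios[1] * n_samples)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]       # remainder goes to validation
    return train, test, val

# For the 15,000 selected samples this yields 10,500 training,
# 3,000 test, and 1,500 validation indices.
train, test, val = split_indices(15000)
```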

3.2 Model Implementation and Training

In our CNN model design for lung nodule classification, we have implemented a
robust architecture with eight Conv2D layers interspersed with average pooling,
followed by four fully connected layers. This design is well-suited for image classifi-
cation tasks. We have employed L1 regularization on bias terms and L2 regularization

on kernel weights, both with regularization strength of 0.006, essential for control-
ling model complexity and preventing overfitting. Additionally, we have incorporated
a 25% dropout rate to randomly deactivate units during training, a valuable
technique for reducing overfitting. To improve convergence, we have implemented
learning rate scheduling by dynamically adjusting the learning rate during training,
starting at 0.001. ReLU activation functions in convolutional layers introduce nonlin-
earity, and softmax activation at the output layer efficiently handles multi-class clas-
sification. Utilizing the Adam optimizer, an adaptive learning rate algorithm further
enhances model training. Lastly, using early stopping demonstrates our commitment
to avoiding overfitting and ensuring optimal model performance. With this compre-
hensive model and training approach, it can effectively classify different lung nodule
types. We fine-tuned hyperparameters and monitored training while documenting our
findings for transparency and reproducibility. We calculated the cross-entropy loss
between actual and predicted labels, as given in (1).
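The learning-rate scheduling described above can be illustrated with a simple schedule. The exponential decay factor used here is an assumption for illustration, since the text only states the initial rate of 0.001.

```python
def scheduled_lr(epoch, initial_lr=0.001, decay=0.9):
    """Return the learning rate for a given epoch, decaying the
    initial rate (0.001, as stated in the text) by a constant
    factor each epoch. The decay factor 0.9 is an assumed value."""
    return initial_lr * (decay ** epoch)

# In Keras, such a function could be supplied through a
# LearningRateScheduler callback alongside EarlyStopping.
```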

$$\mathrm{loss}\bigl(y_{\text{actual}},\, y_{\text{predicted}}\bigr) \;=\; -\frac{1}{m}\sum_{c=1}^{C} y_c \log(p_c) \qquad (1)$$
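Equation (1) can be checked with a direct implementation. This pure-Python sketch assumes one-hot actual labels and predicted class probabilities, with m the number of samples.

```python
import math

def cross_entropy_loss(y_actual, y_predicted):
    """Mean categorical cross-entropy over m samples, as in Eq. (1):
    loss = -(1/m) * sum over samples and classes of y_c * log(p_c)."""
    m = len(y_actual)
    total = 0.0
    for y_row, p_row in zip(y_actual, y_predicted):
        for y_c, p_c in zip(y_row, p_row):
            if y_c:                      # only the true class contributes
                total += y_c * math.log(p_c)
    return -total / m

# A confident correct prediction yields a small loss:
loss = cross_entropy_loss([[1, 0, 0]], [[0.9, 0.05, 0.05]])
```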

4 Result Analysis

During the 20-epoch training process of our neural network for lung nodule classifi-
cation, our exploration of different batch sizes, particularly switching from an initial
batch size of 24 to 16, yielded crucial insights. The observed overfitting when using
a batch size of 24 (as depicted in Fig. 3) underlined the significance of batch size
in model generalization. Overfitting happens when a model becomes too specialized
to the training data, significantly hindering performance on new, unseen data. We
addressed the overfitting issue by reducing the batch size to 16, allowing our model
to generalize better to diverse instances.
Moreover, our adoption of L1 and L2 regularization techniques was pivotal in
maintaining the proximity of training and testing errors, as illustrated in Fig. 4.
This convergence of errors suggests that regularizers effectively prevented the model
from overfitting by restraining its capacity to capture noise in the training data.
This equilibrium in error rates is an encouraging sign of a well-generalized model.
Careful parameter adjustment, vigilance in recognizing overfitting, and the
thoughtful inclusion of regularization techniques signify a thorough and meticulous
approach to building a robust lung nodule classification model. Figure 5 illustrates
the confusion matrix for the proposed model with batch size 24, and Fig. 6 shows
the confusion matrix with batch size 16.
We predicted the target variable (y) from the test data (x) and subsequently calcu-
lated accuracy, precision, and recall using Eqs. (2), (3), and (4), which
is standard practice for evaluating the performance of a classification model. Table 1,

Fig. 3 Training and validation loss for our model for batch size 24; the learning rate is updated
epoch by epoch, and epoch 5 shows the best results, after which training and validation accuracy
diverge

Fig. 4 Training and validation loss for our model for batch size 16; here learning rate is updated
epoch by epoch, at epoch 10 it’s showing best results

as indicated, provides a clear summary of the final scores for various support
counts. The support count refers to the number of instances or samples
associated with each class in our classification task. The table presents the
evaluation metrics for each class or category, demonstrating how well our model
performed in classifying each group. The results in Table 1 are vital for assessing the
model’s effectiveness in distinguishing between different classes, and they help to
gain insight into its performance across the dataset. The graph between true positive
and false positive for all classes is shown in Fig. 7.

$$\text{accuracy} \;=\; \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

Fig. 5 Confusion matrix for proposed model with batch size 24

Fig. 6 Confusion matrix for proposed model with batch size 16

$$\text{Precision} \;=\; \frac{T_p}{T_p + F_p} \qquad (3)$$

$$\text{Recall} \;=\; \frac{T_p}{T_p + F_n} \qquad (4)$$
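Equations (2)–(4) translate directly into code. This sketch computes the metrics from raw confusion-matrix counts; the F1 helper is included for completeness as the harmonic mean of precision and recall, and the example counts are illustrative, not taken from the paper.

```python
def accuracy(tp, tn, fp, fn):
    # Eq. (2): fraction of all predictions that are correct
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Eq. (3): fraction of positive predictions that are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # Eq. (4): fraction of actual positives that are recovered
    return tp / (tp + fn)

def f1_score(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# Example with illustrative counts:
p, r = precision(90, 10), recall(90, 30)
```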

Table 1 Proposed model results

Class          Precision   Recall   F1-score   Support
0              0.86        0.97     0.932      1000
1              0.99        1.00     1.00       1000
2              0.981       0.861    0.921      1000
Accuracy                            0.951      3000
Macro avg      0.951       0.951    0.951      3000
Weighted avg   0.951       0.951    0.951      3000

Fig. 7 ROC curve for all classes of proposed CNN model for batch size 24

Randomly splitting our data into a training set and a testing set is a common prac-
tice in machine learning for assessing model performance, and calculating accuracy
is a valid way to measure a model’s performance. The approach considers whether
our model overfits and selects the best model from those we have experimented with.

5 Conclusion

In this paper, we implemented an optimized CNN model that avoids overfitting
through kernel regularization. Our model performed well in terms of accuracy,
precision, recall, and F1-score, achieving an accuracy of 0.951. We also compared
the accuracy with other published models, and our model performs better: most of
the other models do not show consistency and suffer from overfitting, which this
approach overcomes.

References

1. Liu W, Liu X, Luo X, Wang M, Han G, Zhao X, Zhu Z (2023) A pyramid input augmented
multi-scale CNN for GGO detection in 3D lung CT images. Pattern Recogn 136:109261
2. Tenescu A, Bercean BA, Avramescu C, Marcu M (2023) Averaging model weights boosts
automated lung nodule detection on computed tomography. In: Proceedings of the 2023 13th
international conference on bioscience, biochemistry and bioinformatics, pp 59–62
3. Bermejo-Peláez D, Ash SY, Washko GR, Estépar RSJ, Ledesma-Carbayo MJ (2020) Classifi-
cation of interstitial lung abnormality patterns with an ensemble of deep convolutional neural
networks. Sci Rep 10(1):338
4. Zhou J, Hu B, Feng W, Zhang Z, Fu X, Shao H, Wang H, Jin L, Ai S, Ji Y (2023) An ensemble
deep learning model for risk stratification of invasive lung adenocarcinoma using thin-slice
CT. NPJ Digital Med 6(1):119
5. Liang H, Hu M, Ma Y, Yang L, Chen J, Lou L, Chen C, Xiao Y (2023) Performance of
deep-learning solutions on lung nodule malignancy classification: a systematic review. Life
13(9):1911
6. Kadry S, Herrera-Viedma E, Crespo RG, Krishnamoorthy S, Rajinikanth V (2023) Automatic
detection of lung nodules in CT scan slices using CNN segmentation schemes: a study. Procedia
Comput Sci 218:2786–2794
7. Gugulothu VK, Balaji S (2023) A novel deep learning approach for the detection and
classification of lung nodules from CT images. Multimedia Tools Appl 1–24
8. Annavarapu CSR, Parisapogu SAB, Keetha NV, Donta PK, Rajita G (2023) A Bi-FPN-based
encoder–decoder model for lung nodule image segmentation. Diagnostics 13(8):1406
9. Naseer I, Akram S, Masood T, Rashid M, Jaffar A (2023) Lung cancer classification using
modified u-net based lobe segmentation and nodule detection. IEEE Access
10. Halder A, Dey D (2023) Atrous convolution aided an integrated framework for lung nodule
segmentation and classification. Biomed Signal Process Control 82:104527
11. Keshani M, Azimifar Z, Tajeripour F, Boostani R (2013) Lung nodule segmentation and recog-
nition using SVM classifier and active contour modeling: a complete intelligent system. Comput
Biol Med 43(4):287–300
12. El-Askary NS, Salem MA, Roushdy MI (2022) Features processing for random forest
optimization in lung nodule localization. Expert Syst Appl 193:116489
13. Cao H, Liu H, Song E, Ma G, Xu X, Jin R, Liu T, Hung CC (2020) A two-stage convolutional
neural network for lung nodule detection. IEEE J Biomed Health Inf 24(7):2006–2015
14. Gu Y, Lu X, Yang L, Zhang B, Yu D, Zhao Y, Gao L, Wu L, Zhou T (2018) Automatic lung
nodule detection using a 3D deep convolutional neural network combined with a multi-scale
prediction strategy in chest CTs. Comput Biol Med 103:220–231
15. Zhao C, Han J, Jia Y, Gou F (2018) Lung nodule detection via 3D U-Net and contextual
convolutional neural network. In: 2018 International conference on networking and network
applications (NaNA). IEEE, pp 356–361
16. Xie H, Yang D, Sun N, Chen Z, Zhang Y (2019) Automated pulmonary nodule detection in
CT images using deep convolutional neural networks. Pattern Recogn 85:109–119
17. Zheng S, Guo J, Cui X, Veldhuis RNJ, Oudkerk M, Van Ooijen PMA (2019) Automatic
pulmonary nodule detection in CT scans using convolutional neural networks based on
maximum intensity projection. IEEE Trans Med Imaging 39(3):797–805
18. Tang H, Kim DR, Xie X (2018) Automated pulmonary nodule detection using 3D deep convo-
lutional neural networks. In: 2018 IEEE 15th international symposium on biomedical imaging
(ISBI 2018). IEEE pp 523–526
19. Zhang J, Xia Y, Cui H, Zhang Y (2018) Pulmonary nodule detection in medical images: a
survey. Biomed Signal Process Control 43:138–147
20. Schultheiss M, Schober SA, Lodde M, Bodden J, Aichele J, Mueller-Leisse C, Renger B,
Pfeiffer F, Pfeiffer D (2020) A robust convolutional neural network for lung nodule detection
in the presence of foreign bodies. Sci Rep 10(1):12987

21. Jin H, Li Z, Tong R, Lin L (2018) A deep 3D residual CNN for false-positive reduction in
pulmonary nodule detection. Med Phys 45(5):2097–2107
22. Manickavasagam R, Selvan S, Selvan M (2022) CAD system for lung nodule detection using
deep learning with CNN. Med Biol Eng Comput 60(1):221–228
23. Asiya, Sugitha N, Automatically segmenting and classifying the lung nodules from CT images.
ISSN: 2147-6799
24. https://www.kaggle.com/datasets/zhangweiled/lidcidri
Fake Product Review Monitoring System
Using Machine Learning

Pragya Rajput and Pankaj Kumar Sethi

Abstract As facts and statistics on the web grow rapidly, online reviews have been
a true revolution in the way people purchase products and services. Nowadays, a
wide range of e-commerce sites allow clients to write reviews about the products
they purchased, as these reviews help the brand understand customer requirements
and the shortcomings of the product. Brands try their best to earn good reviews
by improving product quality, as bad reviews affect their business. Customers often
need reviews of a product before investing in it, as reviews impact the purchasing
decision. However, many customers are not satisfied after buying a product from
a particular website and feel that the reviews are misleading and fake, which casts
doubt even on those who try to give genuine reviews. On some online platforms,
some reviews are planted by fraudsters who are either hired by an organization or
belong to it; they try to reduce the product value of competitors by giving negative
reviews to their products, and they often give good reviews to products made by
their own company. So, we have used sentiment analysis to analyze reviews online
and compared the results of two algorithms, the SVM classifier and the Naive Bayes
classifier, so that the user can determine whether the reviews are genuine or not.

Keywords Sentiment analysis · Reviews · Naive Bayes classifier · SVM classifier

1 Introduction

In this era, there are several ways a person can get an item or product: by either
physically going to the shop or by just clicking on a few required products online.
In the traditional style, where a customer goes to the shopkeeper to purchase

P. Rajput (B) · P. K. Sethi
Department of Computer Science Engineering, Chandigarh University, Mohali, Punjab, India
e-mail: rajputpragya.tech.1@gmail.com
P. K. Sethi
e-mail: erpankajkumarsethi@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 437
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_33

items, several features are highlighted in front of the customer to gain his attention
and make him aware of the productivity of the item. There is no definite method in
this case to judge if the shopkeeper is lying or saying the facts about the product and
eventually has to be trusted by carefully evaluating the product.
On the other hand, the sources for getting an item or product vary these days when
compared with traditional methods. Today, it is common practice for people to read
online reviews for various purposes, such as buying books, renting a car, and planning
a vacation, before going shopping. People can buy products from various brands at
online stores. Here, customers need to order the original product without seeing and
checking it; they read the reviews and then buy the products. When they see many
positive reviews, they are more likely to buy the product, and when they find more
negative reviews about the product, they will not buy it. Therefore, they rely on
reviews about the product. The only way to judge the authenticity of an item when
buying online is through honest feedback on that item. However, fake negative
reviews can damage reputation and cause monetary loss [1].
A. Objective: The identified challenges encourage providing solutions to all the
   issues raised in the problem stated above. The proposed methodology and objec-
   tives of this application are as follows: use two different algorithms to get a
   better comparison of the fake review detection task. When making an online
   purchase, most consumers spend time reading the available user reviews of a
   product and can develop a false notion about it if those reviews are planted by
   fraudsters. Hence, today's youth and even adults increasingly rely on the opinions
   available online; people make their own purchasing decisions by analyzing the
   ideas contained in those reviews [2].
B. Sentiment Analysis: Sentiment analysis is the process of analyzing texts to
   determine their emotional tone, whether positive, negative, or neutral. Simply
   put, sentiment analysis helps to understand the attitude of the author toward the
   subject. Opinion mining identifies the tone behind a written text and is based on
   natural language processing. Information is collected from several sites containing
   informal texts [3].
C. Types of Sentiment Analysis: Fine-grained sentiment analysis provides polarity
   by dividing the text into various categories, usually ranging from best to worst.
   Another type of opinion mining performs aspect-based analysis, which collects
   the positive or negative portions tied to specific aspects. For example, "the product
   is heavy" is a review that does not declare the whole product faulty; instead it
   comments on the weight, which is causing inconvenience to the customer.
Emotion detection identifies specific emotions rather than just positive and negative
polarity. Analysis that recognizes the intended action behind a text, beyond a point of
view, is intent-based. Sentiment analysis is most helpful when it is tied to a specific
attribute or feature described in the text; the process of extracting these attributes or
features and their associated sentiment is called aspect-based sentiment analysis, or
ABSA.

T
(1) Text is divided into sections.
(2) Now the sentence which has emotions in it is identified so as to analyze it.
(3) From a range of − 1 to 1 emotional points are given for all the words used.
(4) Now multiple layers are combined with emotional analysis.

D. Three Methods Used in the Opinion Mining

(1) The rule-based system: Now we have a method to calculate the number of
positive or negative words in a given database but tokenization and other methods
require this system. If the number of words expressing happiness are greater
than the number of opposing words, it means that the emotions are positive and
vice versa. However, there are few disadvantages of this system. As the name
suggests, containing a rule for all the combinations of a word requires lots of
work and is very tiring and these days people have started using so much slang
that it is almost very difficult to maintain that rule set.
(2) Automatic method: This method works on a machine learning program. First,
data sets are trained and predictable analysis is performed. The process following
which the spelling of a text is done. This text release can be done using various
techniques such as Naive Bayes, linear regression, support vector and learning
as these machine learning methods are used.
(3) Hybrid approach: A combination of the two methods above, namely the rule-based and automatic methods. Its advantage is that accuracy is higher than with either of the other two methods alone. Figure 1 illustrates how the meaning of a comment can be understood to a great degree when machine learning is combined with NLP to identify statements [4].
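A minimal sketch of the rule-based method described in (1), assuming simple whitespace tokenization; the positive and negative word lists are illustrative only, far smaller than the rule sets the text says are so hard to maintain.

```python
# Rule-based polarity: tokenize the review and compare counts of positive
# and negative words. Word lists are invented examples.
POSITIVE = {"happy", "great", "love", "excellent", "good"}
NEGATIVE = {"sad", "awful", "hate", "terrible", "bad"}

def rule_based_polarity(review):
    tokens = [t.strip(".,!?").lower() for t in review.split()]
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(rule_based_polarity("I love this product, it is great"))
```

The brittleness the text mentions is visible here: any slang term absent from both sets simply contributes nothing.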


2 Literature Review

Recently, the World Wide Web has revolutionized the information technology industry. Online opinions are found on many sites in the form of feedback, tweets, posts, surveys, news forums, company-owned sites and social networks. Liu et al. provided the first study to identify spam reviews. They dealt with fake or near-fake reviews, using logistic regression to classify reviews as true or false with an accuracy of up to 78% [5]. In paper [6] the authors calculated a spam-probability index from the distribution of spam keywords and the small differences from non-spam reviews, using the LDA algorithm.
Ignatova et al. worked on a dataset to identify fake customers, mostly people from organizations who give fake feedback, and to place them into a particular category [7]. The authors used sequential minimal optimization to classify reviews with 81% accuracy in terms of F1 score; furthermore, they were able to improve the

Fig. 1 Hybrid approach

performance of the model to 84% [8]. Hozhabri et al. have provided a novel-based
graph for finding spam in the reviews which achieved 93% accuracy. The method
involves calculating an average of reviews and then giving weight and multiplying
both [9].
Since Amazon is a world-renowned shopping app, people tend to trust products by looking at their reviews. But reviews on "Amazon" are not cleanly separated by topic: they are a mixture of item reviews and service reviews. Because the feedback is mixed, the user is misled by the overall perception (rating level), as no distinction is made between service reviews and item reviews. The model proposed by Patel et al. makes a satisfactory distinction between the two [10].
Sivagangadhar et al. used language features such as Unigram form, Unigram
frequency, Bigram presence, Bigram frequency and length of the review to design
a model and find false reviews. However, the main problem is the lack of data and
the need for both linguistic and behavioral characteristics [11]. Tidke et al. worked
to find spam reviewers who try to convert ratings into bad for other target products
[12].
Karami et al. proposed using lexical-semantic categories together with language features to detect online spam reviews [13]. Product reviews play a key role in purchase decisions on e-commerce websites and applications such as "Amazon", "Snapdeal", "OLX", "Myntra" and "Flipkart"; in sentiment analysis, the goal is to capture this customer feedback. A spam dictionary is used to identify spam words in the reviews, and a decision tree is first used to decide whether a review is relevant to a particular product or not. Sinha et al. used several text-mining algorithms and reported direct results based on them [14]. Zhang et al. demonstrated three types of new traits, including concentration, meaning and emotion. The authors provided a
model algorithm for each feature structure. They concluded that the proposed models, calculations and features were effective in the fake review detection process [15]. Angelpreethi et al. developed a feature-based opinion miner. Its main task is to examine reviews and determine a product's key features, creating an evaluation profile of each product that the user can consult [16].
In 2018, Manickam et al. published a paper titled "Fake News on Social Media: A Brief Review on Detection Techniques". It gives researchers an overview of the techniques then in use for identifying misinformation, highlighting several social contexts, policies and mechanisms [17].
As e-commerce grows more popular day by day, the number of customer comments about any product is rapidly increasing. These days, people rely heavily on reviews before buying anything. This tempts many to post scam reviews about competing products or services. Some companies even employ experts to add false reviews, either to promote their own product or to discredit the quality of an emerging or existing challenger.
Sonawane et al. have developed a method of detecting and recording false reviews. The proposed method automatically categorizes user feedback into "unreal", "explicit" and "blur" categories through step-by-step processing; the blur category captures vague or ambiguous reviews. This results in better recognition and benefits both the business and the customer. Sales organizations can monitor sales of their products by analyzing and understanding what customers say about them, and customers are helped to spend their money on quality products. Finally, end users see each individual review with a polarity score and a reliability score.
Gangireddy et al. expanded the reviewer graph to find ad hoc groups of users working together on spam. Karahalios et al. gave a clear picture of how early reviews can shape the opinions of later readers and influence brand building. Their work shows how negative feedback can be detected at an early stage, before it affects the brand, and various studies and experiments demonstrate how such feedback can be detected using different classification algorithms [18].
Saad et al. argued that industrial advances are the main reason people base purchase decisions on already-reviewed products. Luciano et al. discussed how scientific and general advances have led to an exponential increase in fraudulent feedback. For a human it is an exhausting task to determine whether a piece of text was written by an authentic user, by various programs or by an automated process. With the widespread use of such media and websites it has become very easy to spread false feedback about a product. A hired fraudster can move from one platform to another posting the same reviews, creating wrong impressions among real customers. Many organizations follow these methods to scale up their product or to boost its popularity and apparent genuineness by spreading wrong information, which should be halted at the earliest. Iglesias et al. performed tests on a dataset where the
algorithms learned from the data itself. In their research they used various attributes, such as the degree of emotion in the feedback and a proper analysis of user activity, for example account logins and feedback on other products. Through their experiment they achieved 82% accuracy [19]. Gupta et al. found during their research that some people are assigned tasks either to promote a brand by posting reviews that will be trusted or to give degrading ratings to all the products of that brand. These organizations have enough resources to pay a group of people for this task, so the purpose of the research was to identify such groups rather than look for individuals one by one. A methodology was therefore adopted in which several customers were placed in the same group and labeled fake, owing to their behavior of giving either the best or the worst reviews to all products of the same brand.

3 Methodology

First, detailed research was done on the procedure followed previously, and some flaws and drawbacks were found in the existing approach. A dataset was then prepared from various websites and put into a uniform format; the model was trained on 80% of the data and tested on the remaining 20%.
In the next step the reviews were manually labeled as spam and non-spam. Various preprocessing techniques were also applied to handle inconsistency and noise in the text.
The data was tokenized so that the individual words in each sentence could be considered, and useless words such as pronouns were removed. The number of occurrences of each word is tracked. Furthermore, a bag-of-words NLP model was used in which each word is given a score. Machine learning algorithms can be applied at this point because the model extracts features from a sentence, making the text usable as input.
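The tokenization, stop-word removal and bag-of-words scoring described above can be sketched in plain Python (the paper does not name a library); the stop-word list below is a small invented sample.

```python
# Bag-of-words preprocessing: tokenize, drop stop words such as pronouns,
# and score each remaining word by its count in the review.
from collections import Counter

STOP_WORDS = {"i", "it", "he", "she", "they", "we", "the", "a", "an", "is", "and"}

def bag_of_words(review):
    """Tokenize, remove stop words, and score each remaining word by count."""
    tokens = [t.strip(".,!?").lower() for t in review.split()]
    return Counter(t for t in tokens if t and t not in STOP_WORDS)

features = bag_of_words("It is great, great value and the shipping is fast")
print(features)
```

The resulting counts are exactly the per-word scores that a classifier consumes as a feature vector.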
The classifiers then use the word features obtained from the bag-of-words model. For modeling, the reviews are converted into vectors. After loading, a cleaning operation is performed and words that are not in the vocabulary are removed. The remaining tokens are joined into a single line for encoding, and lists of positive and negative reviews are generated. The reviews are then analyzed using both an SVM classifier and a Naive Bayes classifier. The resulting system helps identify fake product feedback and shows only the latest comment made by each user. We examine which classifier is more accurate, with words labeled as positive or negative. Figure 2 represents the methodology followed for the above process.
Algorithms used: Depending on the kind of data, either Naïve Bayes or a support vector machine can be the better choice. SVMs require more computation power to
Fig. 2 Methodology
followed flowchart

train on the dataset, so it is crucial to choose the algorithm that fulfills the requirements and purpose of the research. Though both algorithms are supervised, Naïve Bayes decides which class a newly seen item should be put into based on the probability of it falling into that category. SVM is useful where decision boundaries are required, and divides the data according to the category or class it resides in. Experiments have also shown that SVM can be preferable when the data is very large.
On the contrary, if the dataset to be classified is very small, Naïve Bayes can be used. Furthermore, Naïve Bayes is easier to implement, and the decisions it produces are probabilistic. SVM creates a partition, which may be nonlinear, between the categories, whereas Naïve Bayes is suited to cases
where the presence of one kind of feature does not depend on the others. In our experiment, we compare the performance of both classifiers.
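Assuming a scikit-learn style workflow (the paper does not name its implementation), the comparison of the two classifiers might look like the sketch below; the six labeled reviews are invented toy data, not the study's dataset.

```python
# Vectorize toy reviews with bag-of-words, then train the two classifiers
# compared in the paper: Multinomial Naive Bayes and a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

reviews = [
    "best product ever buy now amazing deal",   # spam-like
    "amazing amazing best best buy buy now",    # spam-like
    "best deal ever now now now",               # spam-like
    "the zipper broke after two weeks of use",  # genuine
    "works fine but delivery took ten days",    # genuine
    "decent quality for the price paid",        # genuine
]
labels = [1, 1, 1, 0, 0, 0]                     # 1 = fake, 0 = genuine

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)           # reviews -> count vectors

nb = MultinomialNB().fit(X, labels)
svm = LinearSVC().fit(X, labels)

test_vec = vectorizer.transform(["best deal buy now"])
print("Naive Bayes:", nb.predict(test_vec)[0])
print("SVM:        ", svm.predict(test_vec)[0])
```

On a real dataset the two models would be scored on the held-out 20% split with accuracy, precision, recall and F1.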

4 Result

After running both classifiers on the review dataset, the results obtained are as follows. Figure 3 shows that the Naïve Bayes classifier achieves 96% accuracy, 99.4% precision, 91.97% recall and a 94.46% F1 score, whereas Fig. 4 shows that the SVM classifier achieves 82.50% accuracy, 82.59% precision, 82.46% recall and an 82.47% F1 score.
Comparison Analysis: Figure 5 displays the comparison between the two classifiers. Applications of the classifiers are listed below.
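The four reported metrics can all be derived from a confusion matrix. The sketch below shows the standard formulas; the counts are invented for illustration and are not the study's actual results.

```python
# Standard classification metrics from confusion-matrix counts:
# tp = true positives, fp = false positives, fn = false negatives, tn = true negatives.
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=46, fp=1, fn=4, tn=49)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%} F1={f1:.2%}")
```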

(1) Naïve Bayes Classifier

One of its most common applications is recognizing facial features. News can be classified into categories such as geopolitics and sports. It is also useful in medicine, to detect which diseases a patient is at high risk of.
(2) Support Vector Machine Classifier
It is useful in bioinformatics to detect, on the basis of gene classification, the diseases a patient may have. It can also identify handwritten characters and determine the layered structure of the planet.

Fig. 3 Results for Naive Bayes



5 Conclusion

This paper aims to help customers judge the authenticity of products. Two models were therefore developed so that the results obtained from both classifiers could be compared. Emotional analysis has assisted greatly in identifying and segregating fake reviews from real ones. This will help businesses that want to build their brand and take it to a global level, since doing so requires genuine customer reviews. It also gives people with vision a chance to work on product quality and features rather than losing focus on their goals because of fraud. After performing these comparisons, it was found that Naïve Bayes outperformed the other classifier in terms of authenticity detection.
The classification by Naïve Bayes was found to be more accurate than that of the SVM. Major challenges lie in analyzing emotions and in evolving the model over time, as the language and slang people use keep changing, which plays an important role in better differentiation. In the coming years the accuracy of these analyses will improve further as the models advance; better algorithms can then be used to detect spam and give the customer a more reliable overview of the product.

Fig. 4 Results for SVM

Fig. 5 Comparison-based study

References

1. Jindal N, Liu B (2008) Opinion spam and analysis. In: Proceedings of the 2008 international
conference on web search and data mining. Palo Alto, California, USA, pp 219–230
2. Lau RY, Liao SY, Kwok RC, Xu K, Xia Y, Li Y (2011) Text mining and probabilistic language
modeling for online review spam detection. ACM Trans Manag Inf Syst (TMIS) 2(4):1–30
3. Allahbakhsh M, Ignjatovic A, Benatallah B, Beheshti SMR, Foo N, Bertino E (2012) Detecting,
representing and querying collusion in online rating systems, https://arxiv.org/abs/1211.0963
4. Shojaee S, Murad MAA, Azman AB, Sharef NM, Nadali S (2013) Detecting deceptive reviews
using lexical and syntactic features. In: 2013 13th international conference on intelligent
systems design and applications (ISDA). Salangor, Malaysia, pp 53–58
5. Noekhah S, Fouladfar E, Salim N, Ghorashi SH, Hozhabri AA (2014) A novel approach for
opinion spam detection in ecommerce. In: Proceedings of the 8th IEEE international conference
on E-commerce with focus on E-trust. Mashhad, Iran
6. Bhatt A, Patel A, Chheda H, Gawande K (2015) Amazon review classification and sentiment
analysis. Int J Comput Sci Inf Technol (IJCSIT) 6(6)
7. Shivagangadhar K, Sagar H, Sathyan S, Vanipriya CH (2015) Fraud detection in online reviews
using machine learning techniques. Int J Comput Eng Res (IJCER) 5(5):52–56
8. Kokate S, Tidke B (2015) Fake review and brand spam detection using J48 classifier. IJCSIT
Int J Comput Sci Inf Technol 6(4):3523–3526
9. Karami A, Zhou B (2015) Online review spam detection by new linguistic features. In:
iConference 2015 proceedings
10. Sinha A, Arora N, Singh S, Cheema M, Nazir A (2018) Fake product review monitoring using
opinion mining. Int J Pure Appl Math 119(122018)
11. Li Y, Feng X, Zhang S (2016) Detecting fake reviews utilizing semantic and emotion model. In:
2016 3rd International conference on information science and control engineering (ICISCE).
Beijing, pp 317–320
12. Angelpreethi A, Kumar SBR (2017) An enhanced architecture for feature based opinion mining
from product reviews. In: 2017 World congress on computing and communication technologies
(WCCCT). Tiruchirappalli, pp. 89–92
13. Mahid, Manickam S, Karuppayah S (2018) Fake news on social media: brief review on detection
techniques. In: 2018 Fourth international conference on advances in computing, communication
automation (ICACCA), pp 1–5. https://doi.org/10.1109/ICACCAF.2018.8776689
14. Patel D, Kapoor A, Sonawane S (2018) Fake review detection using opinion mining. Int Res
J Eng Technol (IRJET) 05(12) e-ISSN: 2395-0056. Dhawan S, Gangireddy SCR, Kumar S,
Chakraborty T (2019) Spotting collusive behavior of online fraud groups in customer reviews.
CoRR abs/1905.13649:191–200
15. Gilbert E, Karahalios K (2010) Understanding deja reviewers. In: CSCW, pp 225–228. [Online]
16. Ahmed H, Traore I, Saad S (2018) Detecting opinion spams and fake news using text
classification. Security Privacy 1(1):e9
17. Floridi L, Chiriatti M (2020) GPT-3: its nature, scope, limits, and consequences. Minds Mach
1–14:2020
18. Barbado R, Araque O, Iglesias CA (2019) A framework for fake review detection in online
consumer electronics retailers. Inf Process Manage 56(4):1234–1244
19. Gupta V, Aggarwal A, Chakraborty T (2020) Detecting and characterizing extremist reviewer
groups in online product reviews. IEEE Trans Comput Soc Syst 7(3):741–750
Perception to Control: End-to-End
Autonomous Driving Systems

Yoshita, Aman Jatain, Manju, and Sandeep Kumar

Abstract End-to-end autonomous driving systems have garnered a lot of attention in recent years, and researchers have been exploring different ways to make them
work. In this paper, we provide an overview of the field with a focus on the two main
types of systems: those that use only RGB images and those that use a combination of
multiple modalities. We review the literature in each area, highlighting the strengths
and limitations of each approach. We also discuss the challenges of integrating these
systems into a complete end-to-end autonomous driving pipeline, including issues
related to perception, decision-making, and control. Lastly, we identify areas where
more research is needed to make autonomous driving systems work better and be
safer. Overall, this paper provides a comprehensive look at the current state-of-the-
art in end-to-end autonomous driving, with a focus on the technical challenges and
opportunities for future research.

Keywords End-to-end systems · Autonomous driving · CNN · Deep learning

Yoshita
Department of Computer Science and Engineering, Amity University, Gurugram, Haryana, India
A. Jatain
Department of Computer Science and Technology, Manav Rachna University, Faridabad, India
Manju
Department of Computer Science and Engineering and Information Technology Jaypee Institute
of Information Technology, Noida, India
S. Kumar (B)
Department of Computer Science and Engineering, School of Engineering and Technology,
CHRIST (Deemed to Be University), Kengeri Campus, Bangalore, Karnataka 560074, India
e-mail: sandeepkumar@christuniversity.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 447
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8_34

1 Introduction

One of the fastest-growing sectors is autonomous driving technology, which aims to create self-driving cars. End-to-end autonomous driving systems are designed to do
everything related to driving, from sensing to controlling. These systems need robust
and precise algorithms for sensing the environment, making driving judgements, and
operating the vehicle in real-world circumstances. End-to-end autonomous naviga-
tion systems are reviewed in this paper. The article begins by describing the tradi-
tional autonomous driving system’s end-to-end components, including perception,
decision-making, and control (refer Fig. 1). To create a correct image of the driving
environment, various algorithms must interpret data from sensors like LiDAR,
Radar, and cameras during the perception phase. Autonomous driving systems’
safety and efficacy depend on this phase. Using this information to navigate traffic,
avoid obstructions, and maintain safe distances from other vehicles is the decision-
making stage. Deep learning techniques for perception, such as convolutional neural
networks, have become popular. They’ve improved performance and precision.
Control converts driving judgements into vehicle motion. This step involves accurate
and efficient algorithms to manage the vehicle’s actuators, such as steering and accel-
eration, and follow the specified driving trajectory. Autonomous driving technologies
could change transportation and road safety. However, several challenges must be
overcome before broad adoption. One of the biggest challenges is ensuring these
systems’ durability and dependability in different and changing real-world environ-
ments. This requires comprehensive testing and verification in simulated and real-
world scenarios to ensure they can adapt and respond appropriately to unanticipated
circumstances.
Assuring autonomous driving system safety, especially in instances requiring
human intervention, is another challenge. This requires creating effective and reli-
able safety devices that can detect and respond to unexpected situations and safely
bring the vehicle to a stop. Before deploying autonomous driving technologies, regu-
latory and ethical issues must be resolved. This entails creating norms and regulations
for testing and deploying autonomous vehicles and resolving ethical issues like who
is responsible for a crash. In conclusion, end-to-end automated driving systems could transform transportation and improve road safety. However, technological, legislative, and ethical hurdles must be overcome before these systems can be widely adopted.

Fig. 1 Traditional autonomous vehicle subsystems [1]
This study analyses the current breakthroughs in end-to-end autonomous driving
systems, including key components, algorithms, and obstacles, as well as the field’s
future. This work is organised as follows: Sect. 1 is an introduction; Sect. 2 is a
literature review; Sect. 3 is a method; and Sect. 4 is the results and analysis.

2 Literature Review

This section reviews autonomous driving literature on perception, control, prediction, and decision-making. The literature review begins with a history of autonomous
driving technologies. It then discusses camera-only, LiDAR-only, and multi-model
autonomous driving approaches.

2.1 Methodologies Focusing on RGB Images from Camera

Several researchers have proposed different methods for autonomous driving using
deep learning techniques. Chen et al. [2] introduced a method that uses a convolu-
tional neural network (CNN) to learn road affordances and make steering predictions
based on raw sensor input. A dataset of front-facing camera images and corresponding
steering commands was used to train the CNN, which achieved a mean absolute
error of 4.0 degrees in steering angle prediction. Dosovitskiy et al. [3] used a large
video dataset of over 72 h of footage and corresponding steering commands to train
a CNN for end-to-end self-driving. Their CNN model had a mean absolute error
of 3.3° on an independent test dataset. Wang et al. [4] presented a self-supervised
learning technique that used monocular camera-based ego-motion estimation to train
a CNN for steering angle prediction, achieving an absolute mean error of 4.6° on
a test set. Kim et al. [5] proposed a method that used deep reinforcement learning and cameras to construct a driving policy based on a reward function, achieving an 84.2% success rate on a different test set. Li et al. [6] introduced a real-time
deep learning method for semantic instance segmentation in autonomous driving,
achieving a mean intersection over union (IoU) of 68.6% on the Cityscapes dataset.
Godard et al. [7] presented an unsupervised learning method for monocular depth
estimation from a single image, achieving cutting-edge performance on the KITTI
test. Bojarski et al. [8] introduced a neural network that translates raw camera pixels
to steering commands without human feature engineering, using a mix of CNNs and
RNNs. However, the method does not include additional sensor data, which could
potentially improve performance. Bansal et al. [9] proposed a curriculum learning
strategy where a neural network is trained on simple driving scenarios and grad-
ually exposed to more complex situations. However, the approach has limitations

Table 1 Highlights of recent developments in RGB images from camera

Chen et al. [2]: introduced DeepDriving for direct perception in autonomous driving. Limitation: restricted to road scenes and lane-following tasks.
Dosovitskiy et al. [3]: proposed learning driving models end-to-end from large-scale video datasets. Limitation: requires a large amount of training data.
Wang et al. [4]: developed a self-supervised learning method for camera-based driving models. Limitation: restricted to two-dimensional information.
Kim et al. [5]: proposed a camera-based end-to-end driving model using deep reinforcement learning. Limitation: restricted to a simulation environment.
Li et al. [6]: presented real-time joint semantic-instance segmentation for autonomous driving. Limitation: restricted to semantic segmentation tasks.
Godard et al. [7]: created a left-right consistency-based technique for unsupervised monocular depth estimation. Limitation: restricted to monocular depth estimation.
Bojarski et al. [8]: proposed an end-to-end learning approach for autonomous vehicles. Limitation: model performance decreases in adverse weather and lighting conditions.
Bansal et al. [9]: developed ChauffeurNet, a driving model that imitates expert human drivers. Limitation: restricted to urban driving scenarios.
Wadekar et al. [10]: proposed a single platform predicting both steering and throttle response in autonomous racing. Limitation: restricted to racing scenarios.

in terms of covering a restricted set of scenarios and assuming human drivers are
already driving safely in all potential scenarios. Wadekar et al. [10] proposed a deep
learning model for autonomous racing vehicles that anticipate steering and throttle
instructions using sensor data. However, the model does not account for other crucial
elements such as braking and requires a substantial quantity of high-quality training
data for excellent performance. Further, Table 1 highlights the main outcomes and
limitations of each of the papers mentioned above.

2.2 Methodologies Focusing on Multi-modal Data

In recent research, multiple authors have proposed multi-modal fusion systems for
autonomous driving that incorporates vision, Radar, and LiDAR sensors. Zhang et al.
[11] developed a system that incorporated vision, Radar, and LiDAR sensors, and
demonstrated improved driving performance compared with single sensor modality
models. Chen et al. [12] introduced a neural network architecture that combined
LiDAR and vision inputs, and showed superior performance compared with using
only one sensor modality. Wang et al. [13] presented a multi-modal data fusion
approach using LiDAR, camera, and Radar data, which outperformed single-sensor
modality models. Fan et al. [14] proposed a two-stage network-based technique for
integrating LiDAR and vision data, and Sun et al. [15] suggested a multi-modal fusion
strategy incorporating LiDAR, camera, and Radar data, both showing improved
driving performance. An end-to-end driving model was proposed by Zhou et al.
[16] and was trained using a huge database of driving videos, which outperformed
handcrafted feature-based methods. Liu and Amini [17] proposed a neural network architecture that used only LiDAR sensor data and achieved performance comparable to models combining LiDAR and camera data. Uecker and Fleck [18] investigated deep learning representations for LiDAR point cloud data, finding that the learned features relate to physical properties of the environment. Table 2 highlights the main outcomes and limitations of each of the papers mentioned above.

3 Methodology

End-to-end autonomous driving systems are always evolving; therefore, it’s impor-
tant to stay up-to-date. Data collection, data pre-processing, model selection, model
training, model validation, and model deployment are the general phases involved.
These phases are illustrated in Fig. 2.
During data collection, large amounts of data are gathered from various sources,
including cameras, LiDAR, and other sensors. This data is then pre-processed in the
data pre-processing stage to remove any noise, outliers, or irrelevant information,
ensuring that it is ready for training and validation. Next, in the model selection stage,
various deep learning models and techniques, such as CNNs, RNNs, and generative
adversarial networks (GANs) are considered, and the optimal model is chosen based
on the type of data and system requirements. Once the model is selected, it is trained
using the pre-processed data in the model training stage. This involves teaching the
model how to detect various objects, such as automobiles, people, and traffic signs,
and how to make driving judgements using supervised learning approaches. After
training, the model’s accuracy and performance are validated on a separate set of
data in the model validation stage. If problems or flaws are found, this allows for
adjustments to be made to the model. Finally, in the deployment stage, the trained
and validated model is put into use in the autonomous vehicle to control it and make
real-time driving decisions. To ensure the efficiency and security of an end-to-end
autonomous driving system, it is crucial to keep up with the newest developments in
the industry.
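The phases above (data collection, pre-processing, training, validation, deployment) can be caricatured in a few lines of NumPy. Synthetic features stand in for camera data and a least-squares regressor stands in for the deep network, so this is only the shape of the pipeline, not a driving stack.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- data collection: synthetic "camera features" and steering angles ---
X = rng.normal(size=(200, 8))                        # 200 frames, 8 features
true_w = rng.normal(size=8)
y = X @ true_w + rng.normal(scale=0.01, size=200)    # steering angle (degrees)

# --- pre-processing: standardize features and add a bias term ---
X = (X - X.mean(axis=0)) / X.std(axis=0)
Xb = np.hstack([X, np.ones((200, 1))])

# --- model selection + training: least-squares regressor in place of a CNN ---
train_X, val_X = Xb[:160], Xb[160:]
train_y, val_y = y[:160], y[160:]
w, *_ = np.linalg.lstsq(train_X, train_y, rcond=None)

# --- validation: mean absolute steering error, as in the papers reviewed ---
mae = np.abs(val_X @ w - val_y).mean()
print(f"validation MAE: {mae:.4f} degrees")

# --- deployment: predict steering for a new frame ---
steering_angle = val_X[0] @ w
```

A real system would replace the linear model with a CNN trained on images and add the closed-loop control stage, but the train/validate/deploy skeleton is the same.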

Table 2 Highlights of recent developments in multi-modal data

Zhang et al. [11] (camera, LiDAR): improved distance estimation, object detection and path planning compared to single-modality systems. Limitation: the fusion approach may be computationally expensive and difficult to implement in real-time systems.
Chen et al. [12] (camera, LiDAR): improved perception and prediction of objects in the environment, leading to more accurate driving decisions. Limitation: LiDAR sensors are expensive and may not be feasible for widespread adoption in autonomous vehicles.
Wang et al. [13] (camera, LiDAR, Radar): improved object detection and path planning compared to single-modality systems, and the ability to operate in diverse environments. Limitation: the fusion approach may be computationally expensive and difficult to implement in real-time systems.
Fan et al. [14] (camera, LiDAR): improved object detection and distance estimation, leading to more accurate driving decisions. Limitation: LiDAR sensors are expensive and may not be feasible for widespread adoption in autonomous vehicles.
Sun et al. [15] (camera, LiDAR, Radar): improved object detection and path planning compared to single-modality systems. Limitation: the fusion approach may be computationally expensive and difficult to implement in real-time systems.
Zhou et al. [16] (camera, GPS, IMU): attained cutting-edge performance in driving-model accuracy and adaptability to a wide range of driving conditions. Limitation: lacks consideration of other sensor modalities, such as LiDAR or Radar.
Liu and Amini [17] (LiDAR, GPS, IMU): achieved accurate and efficient navigation in complex environments using only LiDAR data. Limitation: may be less robust where LiDAR data is noisy or incomplete.
Uecker and Fleck [18] (LiDAR): achieved real-time and accurate perception of point clouds for in-vehicle LiDAR systems. Limitation: lacks consideration of other sensor modalities, such as camera or Radar.

4 Conclusion and Future Scope

This paper has summarised recent progress in end-to-end deep learning strategies
for autonomous vehicles. These approaches have shown great promise in predicting
steering and other driving commands from raw sensor input without the need for
manual feature engineering. However, some of the surveyed works do not consider
all possible scenarios or sensor data, and more adaptable and safe models are
needed before deployment in the real world. Future research directions include
integrating LiDAR or Radar sensors to improve performance and reliability,
improving the resilience and adaptability of autonomous driving systems to
changing environments, developing interpretable deep learning models, ensuring
safety and ethics in self-driving systems, studying real-time deep learning
models, developing deep learning models with high generalisability, and
designing human-friendly autonomous driving systems. Overall, this review
provides valuable insights into the current state of deep learning for
autonomous driving and suggests several research directions towards systems
that are robust, adaptable, and safely deployable in the real world.

Fig. 2 Methodology for end-to-end autonomous driving systems

References

1. https://www.researchgate.net/figure/Traditional-driving-systems-compared-to-end-to-end-dri
ving-system_fig1_351856725
2. Chen J, Liu X, Li W, Wei Y (2015) DeepDriving: learning affordance for direct perception in
autonomous driving. In: Proceedings of the IEEE international conference on computer vision
(ICCV), pp 2722–2730
3. Dosovitskiy A, Ros G, Codevilla F, Lopez A, Koltun V (2017) End-to-end learning of driving
models from large-scale video datasets. In: Proceedings of the IEEE conference on computer
vision and pattern recognition (CVPR), pp 2174–2182
4. Wang C, Xu D, Zhu Y (2018) Self-supervised learning for camera-based driving models. In:
Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp
3379–3388
5. Kim S, Kim H, Lee S, Choi J, Kim S (2018) Camera-based end-to-end autonomous driving
using deep reinforcement learning. In: Proceedings of the IEEE international conference on
robotics and automation (ICRA), pp 1–8
6. Li X, Huang W, Liang X, Wang L (2019) Real-time joint semantic-instance segmentation for
autonomous driving. In: Proceedings of the IEEE conference on computer vision and pattern
recognition (CVPR), pp 9219–9228
7. Godard C, Mac Aodha O, Brostow GJ (2017) Unsupervised monocular depth estimation with
left-right consistency. In: Proceedings of the IEEE conference on computer vision and pattern
recognition (CVPR), pp 6602–6611
8. Bojarski M, Testa DD, Dworakowski D, Firner B, Flepp B, Goyal P, Jackel LD, Monfort M,
Muller U, Zhang J, Zhang X, Zhao J, Zieba K (2016) End to end learning for self-driving cars.
arXiv:1604.07316
9. Bansal M, Krizhevsky A, Ogale A (2018) ChauffeurNet: learning to drive by imitating the best
and synthesizing the worst. In: Robotics: science and systems (RSS)
10. Wadekar SN, Schwartz BJ, Kannan SS, Mar M, Manna RK, Chellapandi V, Gonzalez DJ,
Gamal AE (2020) Towards end-to-end deep learning for autonomous racing: on data collection
and a unified architecture for steering and throttle prediction. arXiv:2010.06412
11. Zhang B, Li W, Chen J (2021) Multi-modal fusion for end-to-end autonomous driving.
arXiv:2101.02280
12. Chen Y, Yang B, Li M, Urtasun R (2021) Integrating lidar and vision for end-to-end autonomous
driving. arXiv:2102.02331
13. Wang X et al (2021) End-to-end autonomous driving with multi-modal data fusion.
arXiv:2108.11249
14. Fan L et al (2021) Fusing LiDAR and vision for end-to-end autonomous driving.
arXiv:2109.02970
15. Sun L et al (2021) Multi-modal fusion for end-to-end autonomous driving based on deep
learning. IEEE Access 9:119196–119208
16. Zhou B et al (2018) End-to-end learning of driving models from large-scale video datasets.
IEEE Trans Intell Transp Syst 20(4):1276–1289
17. Liu Z, Amini A (2021) Efficient and robust LiDAR-based end-to-end navigation.
arXiv:2109.02004
18. Uecker M, Fleck T (2021) Analyzing deep learning representations of point clouds for real-
time in-vehicle LiDAR perception. In: Proceedings of the IEEE international conference on
intelligent transportation systems (ITSC)
Author Index

A
Aditya Bhaskar, 71
Aishwarya V. Kadu, 219
Aman Jatain, 447
Anbazhagan, M., 403
Ancy Micheal, A., 173
Anukriti Bansal, 297
Anupriya Kamble, 415
Aparajita Sinha, 195, 387
Aparna N. Mahajan, 247
Ashok Pal, 15
Ashwin Raiyani, 1
Asiya, 425
Ayon Tarafdar, 95

B
Bharti Joshi, 71
Butta Singh, 103

C
Chandan Kumar Deb, 95
Chandralekha, M., 113
Chetan Pattebahadur, 415
Chinmayee Kale, 309

D
Deepak Kumar, 183
Devanshi Shah, 83
Dhanya Pramod, 335
Din, Der-Rong, 47
Diriba C. Kejela, 235
Drishti Bharti, 157
Dwijendra Nath Dwivedi, 123

E
Esmita Gupta, 321

F
Farhana Zareen, 297

G
Ganesh, K., 403
Ganesh Prasad Pal, 61
Gaurav Pendharkar, 173
Geeta Rani, 195
Ghanashyama Mahanty, 123
Gyanendra Kumar Gaur, 95

H
Heli Nandani, 83
Hemant H. Patel, 1
Hetvi Shah, 309
Himali Sarangal, 103
Himanshu Goswami, 309

I
Indresh Kumar Verma, 265

J
Jason Misquitta, 173
Jaspreet Kaur, 15
Jatinderkumar R. Saini, 209
Jayakumar Kaliappan, 31
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Singapore Pte Ltd. 2024
H. Sharma et al. (eds.), Communication and Intelligent Systems, Lecture Notes in
Networks and Systems 968, https://doi.org/10.1007/978-981-97-2079-8
K
Kadam, A. B., 415
Kartik Verma, 103
Kashif Saleem, 113
Kavya Muktha, 277
Kehali A. Jember, 235
Keshav Sairam, 387
Ketema T. Megersa, 235
Kirti Dinkar More, 335
Krishna Chidrawar, 141
Kumari Priyanshi, 157

M
Madhav Ajwalia, 83
Manjit Singh, 103
Manju, 447
Mathura Bai Baikadolla, 277
Mayur M. Jani, 1
Md. Ashraful Haque, 95
Mohanvenkat Patta, 277
Monika Agarwal, 195, 387
Moti B. Dinagde, 235
Mousami P. Turuk, 141

N
Namita Goyal, 247
Neelum Dave, 377
Nishat Shaikh, 359

P
Pankaj Kumar Sethi, 437
Parth Shah, 83, 359
Prabhjot Kaur, 157
Pradeep, K., 387
Pragya Rajput, 437
Pranita Ranade, 265
Preksha Khatri, 309
Priteshkumar Prajapati, 83
Priyadharshini Jayadurga, N., 113

R
Rachit Shah, 83
Raju Pal, 61
Ramesh Manza, 415
Ranjeesh Kaippada, 173
Reddy, K T V, 219
Riya Raj, 31

S
Sahil Borkar, 141
Sakshi Naik, 141
Samuel T. Daka, 235
Sandeep Kumar, 447
Sandip R. Panchal, 1
Sara Bansod, 265
Satveer Kour, 103
Shail Shah, 83
Shilpa Shinde, 321
Shivam Panwar, 297
Shraddha Vaidya, 209
Shreya Dave, 377
Shreyas Visweshwaran, 403
Srirachana Narasu Baditha, 277
Sudeep Marwaha, 95
Sudhir Bagul, 309
Sugitha, N., 425
Suvarna Bhoj, 95

T
Tadele A. Abose, 235
Tripathi, K. C., 247
Triveni Dutt, 95

V
Vaibhav B. Vaijapurkar, 141
Vinay, R., 195
Vineet Sharma, 183

Y
Yoshita, 447