
Advances in Intelligent Systems and Computing 1130

Kohei Arai
Supriya Kapoor
Rahul Bhatia Editors

Advances in
Information and
Communication
Proceedings of the 2020 Future
of Information and Communication
Conference (FICC), Volume 2
Advances in Intelligent Systems and Computing

Volume 1130

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing,
Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering,
University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University,
Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas
at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao
Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology,
University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute
of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro,
Rio de Janeiro, Brazil
Ngoc Thanh Nguyen , Faculty of Computer Science and Management,
Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering,
The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications
on theory, applications, and design methods of Intelligent Systems and Intelligent
Computing. Virtually all disciplines such as engineering, natural sciences, computer
and information science, ICT, economics, business, e-commerce, environment,
healthcare, life science are covered. The list of topics spans all the areas of modern
intelligent systems and computing such as: computational intelligence, soft comput-
ing including neural networks, fuzzy systems, evolutionary computing and the fusion
of these paradigms, social intelligence, ambient intelligence, computational neuro-
science, artificial life, virtual worlds and society, cognitive science and systems,
Perception and Vision, DNA and immune based systems, self-organizing and
adaptive systems, e-Learning and teaching, human-centered and human-centric
computing, recommender systems, intelligent control, robotics and mechatronics
including human-machine teaming, knowledge-based paradigms, learning para-
digms, machine ethics, intelligent data analysis, knowledge management, intelligent
agents, intelligent decision making and support, intelligent network security, trust
management, interactive entertainment, Web intelligence and multimedia.
The publications within “Advances in Intelligent Systems and Computing” are
primarily proceedings of important conferences, symposia and congresses. They
cover significant recent developments in the field, both of a foundational and
applicable character. An important characteristic feature of the series is the short
publication time and world-wide distribution. This permits a rapid and broad
dissemination of research results.
** Indexing: The books of this series are submitted to ISI Proceedings,
EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **

More information about this series at http://www.springer.com/series/11156


Kohei Arai · Supriya Kapoor · Rahul Bhatia
Editors

Advances in Information
and Communication
Proceedings of the 2020 Future of Information
and Communication Conference (FICC),
Volume 2

Editors

Kohei Arai
Faculty of Science and Engineering
Saga University
Saga, Japan

Supriya Kapoor
The Science and Information (SAI) Organization
Bradford, West Yorkshire, UK

Rahul Bhatia
The Science and Information (SAI) Organization
Bradford, West Yorkshire, UK

ISSN 2194-5357 ISSN 2194-5365 (electronic)


Advances in Intelligent Systems and Computing
ISBN 978-3-030-39441-7 ISBN 978-3-030-39442-4 (eBook)
https://doi.org/10.1007/978-3-030-39442-4
© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

We welcome all of you to the Future of Information and Communication
Conference (FICC) 2020, which was held at The Park Central Hotel San Francisco
on 5–6 March 2020.
The conference provides a platform to discuss information and communication
technologies with participants from around the globe, both from academia and
industry. The success of this conference is reflected in the papers received, with
participants coming from several countries, allowing a real exchange of experiences
and ideas. Renowned experts and scholars shared their excellent views and expe-
riences through speeches. The main topics of the conference include
Communication, Security and Privacy, Networking, Ambient Intelligence, Data
Science, and Computing.
Each submission has been double-blind peer reviewed by two to four reviewers
in the right area. More than 440 submissions have been received out of which only
135 (including 10 poster papers) were accepted for final presentation and will be
published in the conference proceedings. This conference showcases paper pre-
sentations of new research, demos of new technologies, and poster presentations of
late-breaking research results, along with inspiring keynote speakers and moderated
challenge sessions for participants to explore and respond to big challenge ques-
tions about the role of technology in creating thriving, sustainable communities.
The conference success was due to the collective efforts of many people.
Therefore, we would like to express our sincere gratitude to the technical program
committee and the reviewers who helped ensure the quality of the papers as well as
their invaluable input and advice. A special thanks goes to the keynote speakers and
the conference organizers for their various support to FICC 2020.
We hope that the FICC 2020 experience was enjoyable and beneficial for all the
participants and the interested readers.
See you in our next SAI Conference, with the same amplitude, focus and
determination.
Regards,
Kohei Arai
Saga University, Japan

Contents

Comparative Study on Swarm Based Algorithms for Feature
Reduction in Twitter Sentiment Analysis on Figurative Language . . . . 1
Akshi Kumar, Aarushi Gupta, Anant Jain, and Vansh Farma
Statistical Analysis of the Effects of Institutions on the Economic
Growth of France in Recent Years . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Yaimara Céspedes-González, Guillermo Molero-Castillo,
Patricia Arieta-Melgarejo, Everardo Bárcenas,
and Alejandro Velázquez-Mena
A Taxonomy of Social Engineering Defense Mechanisms . . . . . . . . . . . . 27
Dalal N. Alharthi, Mahmoud M. Hammad, and Amelia C. Regan
A Trust Framework for the Collection of Reliable
Crowd-Sourced Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Shiva Ramoudith and Patrick Hosein
Sentiment Analysis for University Students’ Feedback . . . . . . . . . . . . . . 55
Nguyen Thi Phuong Giang, Tran Thanh Dien, and Tran Thi Minh Khoa
Development of Waste Collection Model Using Mobile Phone Data:
A Case Study in Latvia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Irina Arhipova, Gundars Berzins, Aldis Erglis, and Evija Ansonska
Artificial Social Intelligence: Hotel Rate Prediction . . . . . . . . . . . . . . . . 78
James J. Lee and Misuk Lee
New Metric Based on SQuAD for Evaluating Accuracy
of Enterprise Search Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Harshad Kulkarni, Himanshu Gupta, Kalpesh Balar, and Praful Krishna
Case-Based Generation of Regulatory Documents
and Their Semantic Relatedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Andreas Korger and Joachim Baumeister

A Comparative Evaluation of Preprocessing Techniques for Short
Texts in Spanish . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Marcos Orellana, Andrea Trujillo, and Priscila Cedillo
Automatic Visual Recommendation for Data Science and Analytics . . . 125
Manoj Muniswamaiah, Tilak Agerwala, and Charles C. Tappert
A Novel Recommender System for Healthy Grocery Shopping . . . . . . . 133
Yadagiri Bodike, David Heu, Bhavishya Kadari, Brandon Kiser,
and Matin Pirouz
Using Topic Modelling to Correlate a Research
Institution’s Outputs with Its Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Nicholas Chamansingh and Patrick Hosein
Long Period Re-identification Approach to Improving the Quality
of Education: A Preliminary Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Irina Arhipova, Gatis Vitols, and Inga Meirane
A Quantum Annealing-Based Approach to Extreme Clustering . . . . . . . 169
Tim Jaschek, Marko Bucyk, and Jaspreet S. Oberoi
Clustering and Classification to Evaluate Data Reduction
via Johnson-Lindenstrauss Transform . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Abdulaziz Ghalib, Tyler D. Jessup, Julia Johnson,
and Seyedamin Monemian
Application of Statistical Learning in Ferro-Titanium Industry . . . . . . . 210
Mahan Balal Pour, Vahid Partovinia, and Robert Pellerin
Assessing the Effectiveness of Topic Modeling Algorithms
in Discovering Generic Label with Description . . . . . . . . . . . . . . . . . . . 224
Shadikur Rahman, Syeda Sumbul Hossain, Md. Shohel Arman,
Lamisha Rawshan, Tapushe Rabaya Toma, Fatama Binta Rafiq,
and Khalid Been Md. Badruzzaman
BeagleTM: An Adaptable Text Mining Method for Relationship
Discovery in Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Oliver Bonham-Carter
Comparison of Imputation Methods for Missing Values
in Air Pollution Data: Case Study on Sydney Air Quality Index . . . . . . 257
W. M. L. K. N. Wijesekara and Liwan Liyanage
BERT Feature Based Model for Predicting the Helpfulness Scores
of Online Customers Reviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Shuzhe Xu, Salvador E. Barbosa, and Don Hong
Evaluating Taxonomic Relationships Using Semantic Similarity
Measures on Sensor Domain Ontologies . . . . . . . . . . . . . . . . . . . . . . 282
Mireya Tovar Vidal, Aimee Cecilia Hernández García,
José de Jesús Lavalle Martínez, José A. Reyes-Ortiz,
and Darnes Vilariño Ayala
Trained Synthetic Features in Boosted Decision Trees with an
Application to Polish Bankruptcy Data . . . . . . . . . . . . . . . . . . . . . . . . . 295
Thomas R. Boucher and Tsitsi Msabaeka
UI Design Patterns for Flight Reservation Websites . . . . . . . . . . . . . . . . 310
Zeeshan Haider Malik, Tayyab Munir, and Mesan Ali
Conceptual Model for Challenges and Succession Opportunities
for Virtual Project Teams in the GCC . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Rasha Abou Samra, Nizar Al Sharari, and Salem AlTunaiji
Virtual Construction: Interactive Tools for Collaboration
in Virtual Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
Juha Ojala, Jukka Selin, Timo Partala, and Markku Rossi
Implementing Material Changes in Augmented Environments . . . . . . . . 352
Adam Pike and Sudhanshu Kumar Semwal
Using Activity Theory and Task Structure Charts to Model
Patient-Introduced Online Health Information into the Family
Physician/Patient Examination Process . . . . . . . . . . . . . . . . . . . . . . . . . . 362
Beth Ellington
Naming Anonymous Processes Using an Optimal Number
of Test-and-Set Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Layla S. Aldawsari
Development Trends of Information Technology
Industry in Armenia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Ashot Davoyan
A Study on the Inspection of Fundus Retinal Picture . . . . . . . . . . . . . . . 409
K. Sundaravadivu, N. Hariprasad, A. Sadeesh Kumar,
and N. Siva Balan
Accelerating Block-Circulant Matrix-Based Neural Network Layer
on a General Purpose Computing Platform: A Design Guideline . . . . . . 419
Krittaphat Pugdeethosapol, Zhao Jin, Daniel Rider, and Qinru Qiu
Energy Aware Next Fit Allocation Approach for Placement
of VMs in Cloud Computing Environment . . . . . . . . . . . . . . . . . . . . . . 436
Jyotsna Sengupta, Pardeep Singh, and P. K. Suri
Multi-user Expert System for Operation and Maintenance
in Energized Lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
Erika F. Moreno, Evelyn E. Pacheco, Víctor H. Andaluz,
and Álvaro S. Mullo
The Repeatability of Human Swarms . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Gregg Willcox, Louis Rosenberg, and Colin Domnauer
A Two-Stage Machine Learning Approach to Forecast the Lifetime
of Movies in a Multiplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
Abhijith Ragav, Sai Vishwanath Venkatesh, Ramanathan Murugappan,
and Vineeth Vijayaraghavan
A Cost-Reducing Partial Labeling Estimator
in Text Classification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
Jiangning Chen, Zhibo Dai, Juntao Duan, Qianli Hu, Ruilin Li,
Heinrich Matzinger, Ionel Popescu, and Haoyan Zhai
Unsupervised Cross-Lingual Mapping for Phrase
Embedding Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
Abraham G. Ayana, Hailong Cao, and Tiejun Zhao
Sidecar: Augmenting Word Embedding Models
with Expert Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
Mathieu Lemay, Daniel Shapiro, Mary Kate MacPherson, Kieran Yee,
Hamza Qassoud, and Miodrag Bolic
Performance Analysis of Support Vector Regression
Machine Models in Day-Ahead Load Forecasting . . . . . . . . . . . . . . . . . 541
Lemuel Clark P. Velasco, Daisy Lou L. Polestico,
Dominique Michelle M. Abella, Genesis T. Alegata,
and Gabrielle C. Luna
Ecommerce Fraud Detection Through Fraud Islands
and Multi-layer Machine Learning Model . . . . . . . . . . . . . . . . . . . . . . . 556
Jay Nanduri, Yung-Wen Liu, Kiyoung Yang, and Yuting Jia
Automated Drug Suggestion Using Machine Learning . . . . . . . . . . . . . . 571
Vikrant Doma, Sahil Singh, Nikita Arora, Gabriel Ortiz,
Parneet Kaur Saran, Salvador Chavarin, and Matin Pirouz
Conditional Random Fields Based on Weighted Feature Difference
Potential for Remote Sensing Image Classification . . . . . . . . . . . . . . . . . 590
Yi Sun, Yan Tian, and Yiping Xu
Feature Selection Using Flower Pollination Optimization
to Diagnose Lung Cancer from CT Images . . . . . . . . . . . . . . . . . . . . . . 604
Dhalia Sweetlin Johnson, Daphy Louis Lovenia Johnson,
Pravin Elavarasan, and Ashok Karunanithi
Detecting Cyberbullying in Social Commentary Using Supervised
Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
Muhammad Owais Raza, Mohsin Memon, Sania Bhatti,
and Rahim Bux
Predicting the Risk Factor for Developing Chronic Kidney Disease
Using a 3-Stage Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
Hossam Medhat Aly, Mohamed Aborizka, and Soha Safwat Labib
Classification of Diabetic Retinopathy and Retinal Vein Occlusion
in Human Eye Fundus Images by Transfer Learning . . . . . . . . . . . . . . 642
Ali Usman, Aslam Muhammad, A. M. Martinez-Enriquez,
and Adrees Muhammad
Crop Monitoring Agent System Based on Pattern
Recognition Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
Ahad Hanif, Aslam Muhammad, A. M. Martinez-Enriquez,
and Andrees Muhammad
Small Ship Detection on Optical Satellite Imagery with YOLO
and YOLT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
Wilder Nina, William Condori, Vicente Machaca, Juan Villegas,
and Eveling Castro
A Text Classification Model to Identify Performance Bonds
Requirement in Public Bidding Notices . . . . . . . . . . . . . . . . . . . . . . . . . 678
Urias Cruz da Cunha, Ricardo Silva Carvalho, and Alexandre Zaghetto
Estimating the Time-Lapse Between Medical Insurance
Reimbursement with Non-parametric Regression Models . . . . . . . . . . . 692
Mary Akinyemi, Chika Yinka-Banjo, Ogban-Asuquo Ugot,
and Akwarandu Nwachuku
CAMLPAD: Cybersecurity Autonomous Machine
Learning Platform for Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . 705
Ayush Hariharan, Ankit Gupta, and Trisha Pal
A Holistic Approach for Detecting DDoS Attacks by Using
Ensemble Unsupervised Machine Learning . . . . . . . . . . . . . . . . . . . . . . 721
Saikat Das, Deepak Venugopal, and Sajjan Shiva
Moving Towards Open Set Incremental Learning:
Readily Discovering New Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
Justin Leo and Jugal Kalita
Automatic Modulation Classification Using Induced Class Hierarchies
and Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
Toluwanimi Odemuyiwa and Birsen Sirkeci-Mergen
Using Digital Image Processing to Characterize Flocculation
of Papermaking Wastewater . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 770
Ming Li, Kaitang Hu, and Jin Wang
Detection of Anomalous Gait as Forensic Gait in Residential Units
Using Pre-trained Convolution Neural Networks . . . . . . . . . . . . . . . . . . 775
Hana’ Abd Razak, Ali Abd Almisreb, and Nooritawati Md. Tahir
Occluded Traffic Signs Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794
Shwu-Huey Yen, Chun-Yung Shu, and Hui-Huang Hsu
Consumer Use Pattern and Evaluation of Social Media Based
Consumer Information Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805
Yafang Zhang, Harim Yeo, Xu Li, and Hyesun Hwang
Hardware-Software Implementation of a McEliece Cryptosystem
for Post-quantum Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
Mariano López-García and Enrique Cantó-Navarro
Design of Ride Sharing Service “ZOUGAME”
in Neighborhood Community . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 826
Minako Matsui, HongYu Chang, Satomi Manzaki, Chihiro Sato,
and Naohito Okude
Deep Learning Based Face Recognition Application
with Augmented Reality Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
Andrew Kim, Ehsan Kamalinejad, Kelby Madal-Hellmuth,
and Fay Zhong
e-Canteen for the Smart Campus Application . . . . . . . . . . . . . . . . . . . . 842
Zulfajri Basri Hasanuddin, Muhammad Niswar, Aulia Fadhila,
and Mayong Adi Wardana
Voluntary Information and Knowledge “Hiding” in a Conservative
Community and Its Consequences: The Case of the
Ultra-orthodox in Israel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 855
Moshe Yitzhaki
Emoji Prediction: A Transfer Learning Approach . . . . . . . . . . . . . . . . . 864
Linrui Zhang, Yisheng Zhou, Tatiana Erekhinskaya, and Dan Moldovan
On the Emerging Area of Biocybersecurity and Relevant
Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 873
Xavier-Lewis Palmer, Lucas Potter, and Saltuk Karahan
The Impact of an Online Course of Inclusive Physical Education
on Teachers’ Skills and Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 882
Noa Choresh and Yesha’ayahu Hutzler
5G Service and Discourses on Hyper-connected Society
in South Korea: Text Mining of Online News . . . . . . . . . . . . . . . . . . 892
Li Xu, Harim Yeo, Hyesun Hwang, and Kee Ok Kim
A Digital Diagnostic Aide for Skincare: The Role of Computer
Vision and Machine Learning in Revealing Skin Texture Changes . . . . 898
Jaya Shankar Vuppalapati, Santosh Kedari, Anitha Ilapakurti,
Chandrasekar Vuppalapati, Sharat Kedari, and Rajasekar Vuppalapati
Deep-Learned Artificial Intelligence for Semantic Communication
and Data Co-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916
Nicolay Vasilyev, Vladimir Gromyko, and Stanislav Anosov

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 927


Comparative Study on Swarm Based
Algorithms for Feature Reduction in Twitter
Sentiment Analysis on Figurative Language

Akshi Kumar(&), Aarushi Gupta, Anant Jain, and Vansh Farma

Computer Science Department, Delhi Technological University,


New Delhi 110042, India
drkumarakshi@gmail.com, aarushidtu@gmail.com,
anantjain245@gmail.com, farmavansh@gmail.com

Abstract. In this paper, we deal with the issue of feature selection by com-
paring different approaches based on Gravitational Search Algorithm, Particle
Swarm Optimisation and Genetic Algorithm. The comparison is drawn on the
parameters of feature reduction percentage, accuracy and time taken. The
optimization is performed with the following supervised predictive models -
Multinomial Naive Bayes, Support Vector Machine, Decision Tree, K-Nearest
Neighbour and Multilayer Perceptron. Datasets were acquired from SemEval
2015 task on the sentiment analysis of figurative language on Twitter (Task 11),
which provided a set of 5198 tweets scored with finely grained sentiment score
(−5 to +5 including 0). 11538 features were generated from the datasets and the
experiments performed have been successful in reducing an average of 55% of
the features, without any decline in the accuracy.

Keywords: Gravitation search algorithm · Particle Swarm Optimisation ·
Genetic Algorithm · Natural Language Processing · Twitter · Figurative
speech · Curse of dimensionality

1 Introduction

Natural language processing (NLP) is a leading research area which deals with com-
putational methods to achieve human-like language processing. Traditionally, NLP is
widely used to devise efficient and robust techniques to accomplish tasks like syntactic
and semantic analysis, text classification, grammar induction, document clustering, etc.
Recently, researches in sentiment analysis (SA) have increased to a great extent.
Sentiment Analysis includes algorithms which can provide an overall sentiment
expressed in the document. Traditional methods used to determine sentiment were less
accurate and had a little future scope. Later with the use of supervised algorithms like
Naive Bayes, SVR, and KNN, far more accurate sentiment could be predicted. These
algorithms, however, lead to high computations due to a large number of features
generated. Among thousands of features derived per document, very few actually
provided some useful information to the classifiers; hence a need for feature reduction
techniques was raised.

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 1–16, 2020.
https://doi.org/10.1007/978-3-030-39442-4_1

Social media produces large volumes of varied data at high rate, describing peo-
ple’s beliefs, views and feelings towards an object. Thus, a huge number of tweets
generated from one such social media platform, Twitter enable sentiment analysis to
transform text into an efficient knowledge database. There exists two major challenges
in twitter sentiment analysis: (a) tweets consists of figurative speech - irony, metaphor
and sarcasm - when literal meanings should be undermined and tangential or projected
meaning should be considered as they affect the polarity of the overall sentiment of the
tweet, and (b) high dimensionality of the features used to describe texts, which raises
problem of expensive computations in applying many widely-used sophisticated
supervised learning algorithms to determine sentiment.
Figurative language poses a major challenge while assigning a polarity to the
tweets. These speeches are unfortunately not restricted to specific forms of literature but
are part of common life. Veale and hao [9] while harvesting standards from the internet
knowledge targeted simile construction of the form of “as A as B” (e.g. “as brave as
lion”, “as tall as giraffe”, “as sweet as sugar”, etc.). However, to their amazement up to
20% of the similes harvested were ironic (e.g. “as white as coal”, “as clear as mud”, “as
private as a park bench” etc.). Hence, not able to gather any significant knowledge from
such texts, irony and other figurative speeches are considered worst kind of noise for
semantic analysis. Therefore, to predict “irony”, “metaphor” and “sarcasm” from the
data acquired during Semeval 2015 task 11 and other crowdfunded sources, partici-
pants were asked to provide with a fine-grained sentiment score between −5 and +5
including 0 (zero) - degree to which this sentiment has been expressed as positive,
negative or neutral. These scores were to represent the overall sentiment of the tweet.
Having a high number of variables requires more space to store the training data,
expensive computations and long training times, and for some algorithms it can even
degrade the classifier's performance. The required computation power increases as the
number of attributes grows, and the models therefore need larger amounts of training
data. The "Hughes phenomenon" states that a classifier's performance improves with
the number of features only up to a threshold; adding further features while keeping
the size of the training data constant only reduces the classifier's performance. Thus,
feature reduction methods are used to deal with the issue of high dimensionality
(Fig. 1). The objective of feature reduction is to find the subset of the original feature
set which is most relevant for sentiment analysis, so as to improve classification
accuracy and reduce the execution time of the learning algorithms.

Fig. 1. Graph showing the Hughes phenomenon.



In this paper, we have considered the training dataset rich with irony, metaphor and
sarcasm. Using Term Frequency-Inverse Document Frequency (TF-IDF), 11538 features
were generated from the raw data. TF gives the frequency with which a word appears
in each document in the corpus, while IDF gives greater weight to words that are rare
across all the tweets; combining both scores selects the more significant words. Thousands
of features are then reduced using three modified versions of wrapper feature selection
techniques based on GSA, PSO and GA, used in the binary form. These techniques are
used with five supervised predictive models - Multinomial Naive Bayes, Support
Vector Machine, Decision Tree, K-Nearest Neighbour and Multilayer Perceptron. The
vector of the most relevant and meaningful set of features is then selected, corre-
sponding to which relevant training data is extracted and the predictive model is
executed. The results are evaluated on the basis of the efficacy measures like percentage
of features reduced, the accuracy of the predictions and execution time.
The discussions of the topic are arranged in the following manner. Section 2
summaries the literature survey and need of this study, along with the related work that
already has been done in this field. The following Sect. 3 provides a brief on the
algorithms of PSO, GSA and GA with their application in feature reduction. Section 4
describes the implementation and the experimental steps. Later Sect. 5 demonstrates
the findings and Sect. 6 concludes the work.

2 Related Work

Sentiment analysis appeared in early published work [6, 8, 14], and research in this
area has expanded since then. Comprehensive research has been conducted to determine
the polarity of tweet sentiment. Such studies explore and evaluate the microblogging
platform Twitter [3, 5, 22, 24, 26].
The challenge these studies face arises from natural language processing.
Traditional experiments generate thousands of attributes as a result of spelling
variations across tweets, for example "oh my god" as "omg" and "before" as "b4"
[11, 16]. The use of multilingual words, emoticons, conjunctions, prepositions and other
irrelevant words does not differentiate the polarity of the tweets significantly. This
causes the difficulty of high dimensionality. Many other studies have been done to
overcome these challenges [1, 4, 18, 19, 25]. Some of these studies contributed techniques
such as the wrapper approach, which we applied in our algorithm for subset selection [12].
This research is aimed at text classification, opinion mining and sentiment analysis.
Document frequency (DF), Information Gain (IG), Mutual Information (MI), the
Chi-Squared test (CHI) and Term Strength (TS) have been compared in the past
for statistical learning of text categorization [27]; those experiments found that IG
and the CHI test were the most effective methods among all used.
Focusing on the use of metaheuristic learning algorithms for feature reduction,
several studies have applied them to this task [2, 13]. Kurniawati and Pardede [15]
proposed a hybrid of particle swarm optimisation (PSO) and information gain (IG) to
select the most appropriate attributes from the documents, with SVM as the classifier.
Their experiments achieved 94.80% accuracy, which was higher than the accuracy
obtained without using PSO and IG.
Similarly, several studies report the use of swarm-based algorithms for feature
reduction [7, 20, 23]. Various PSO variants have been applied with SVM in the wrapper
phase [21]; with the most suitable PSO variant, the wrapper phase showed higher
accuracy than the filter phase in the problem domain. Experiments in [17] combined
GSA and KNN and concluded that classification improved while the number of features
decreased by 66%. Likewise, the use of a Genetic Algorithm for feature selection showed
better performance than using all features for text clustering and classification [10].
To the best of our knowledge, no studies have compared these metaheuristic
methods for feature selection in opinion analysis. Therefore, using the five supervised
soft-computing models mentioned above, we implement and apply PSO, GSA and GA
to these models. In this paper, we focus on drawing a contrast between the use of these
three methods on the grounds of accuracy, time taken and feature reduction percentage.

3 Feature Reduction

This section briefly describes the steps followed to apply PSO, GA and GSA with
different classification models. The fitness score is calculated by an evaluation function
that applies the k-fold cross-validation technique to the learning algorithm.

3.1 Binary Particle Swarm Optimisation


The PSO algorithm maintains a population, called a swarm, which contains multiple
candidates known as particles or individuals. Each particle has two attributes: position
and velocity. Each particle moves around the whole search space in order to find the
best position it has personally achieved (local best) as well as the entire swarm's best
known position (global best). Every individual trying to find the most optimal solution
for itself drives the development of the entire swarm and hence provides the most
relevant solution to the problem. The method is iterated a number of times to improve
the reliability of the global best solution obtained.
We use the binary version of the PSO algorithm, where each particle consists of a
vector consisting of 0’s (zeroes) and 1’s (ones) which determines the position of the
particle. Each particle also consists of velocity vector which determines what is the
probability of change of 1 into 0 and vice versa in the position vector. The size of
position and velocity vector is equal to the dimension or number of features in the data.
The complete process is briefly described through the following algorithm.

Algorithm 1: Binary Particle Swarm Optimisation


Input -
evaluate_func : the estimating function with classification algorithm
calculating fitness score.
n: the population size.
max_it : the maximum number of iterations.
dim: number of features.
w1: move rate.
c1, c2: two fixed variables or constants.
vel_max: limit search range.

Output -
Best Value: type float.
Best Position: type list of length ‘dim’.
Begin -
Initialise pbest(i) = 0, fit = [-infinity] * n, it = 0;
gens = generate the initial population randomly.
/* randomly initialise the position and velocity vectors of the population generated */
gbest = min(fit)
xgbest = index of min(fit)
While it < max_it do
for each particle i = 1, …, n do
/* estimate the performance score of each particle using evaluate_func */
score = evaluate_func(gens[i]);
fit[i] = score;

/* update the local best performance score of each particle so far */


If fit[i] > pbest[i] then
pbest[i] = fit[i]; pid = xid, d = 1, …, dim;

/* update the global best score so far */


If fit[i] > gbest then
gbest = fit[i]; xgbest = gens[i];

for each particle i = 1, …, n do


for each dimension d = 1, …, dim do
/* calculate the hamming distance between the particle i and
pbest(i) & particle i and gbest */

vid(new) = w1.vid(old) + c1.U(0,1).(pid - xid(old)) + c2.U(0,1).(pgd - xgd(old));
/* updating velocity where pi is the personal best position of i and
pg is the entire swarm’s best position */

If vid > vel_max then vid = vel_max;
if vid < -vel_max then vid = -vel_max;

/* update the position following gbest */


if vij <= distance(gbest) then
xij = rand();
else
for every k do
if(gbestik == xik)
xik = rand();
it = it + 1;
return gbest, xgbest;
End.

The output returned is the vector containing zeroes and ones, where each 1 rep-
resents selection of that attribute. Then the selected subset of attributes is used to reduce
the dimensions of training and testing data to execute the prediction models.
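To make the binary update step concrete, the following is a minimal Python sketch of one particle update. It uses the widely used sigmoid transfer function of the standard binary PSO formulation rather than the hamming-distance rule of Algorithm 1, so it should be read as a generic illustration of how a real-valued velocity drives bit flips, not as the exact update used in our experiments; the function name and parameter values are illustrative.

    import numpy as np

    def update_particle(x, v, pbest_x, gbest_x, w1=0.7, c1=1.5, c2=1.5, vel_max=4.0):
        """One binary PSO step for a single particle (illustrative sketch).
        x, pbest_x, gbest_x are 0/1 feature masks; v is a real-valued velocity."""
        x, pbest_x, gbest_x = np.asarray(x), np.asarray(pbest_x), np.asarray(gbest_x)
        v = np.asarray(v, dtype=float)
        r1, r2 = np.random.rand(x.size), np.random.rand(x.size)
        # velocity pulled towards the personal best and the global best positions
        v = w1 * v + c1 * r1 * (pbest_x - x) + c2 * r2 * (gbest_x - x)
        v = np.clip(v, -vel_max, vel_max)              # limit the search range
        prob_one = 1.0 / (1.0 + np.exp(-v))            # sigmoid: velocity -> P(bit = 1)
        x_new = (np.random.rand(x.size) < prob_one).astype(int)
        return x_new, v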

3.2 Binary Genetic Algorithm


A genetic algorithm is a search heuristic motivated by Charles Darwin's theory of
natural evolution. The algorithm reflects the process of natural selection, where the
fittest individuals are chosen for reproduction in order to create the descendants of the
next generation. In a genetic algorithm a population of individuals is generated; the
better half of the population is selected and the worse half is rejected. The best
candidates of the current generation are selected, crossed over and mutated to produce
offspring for the next generation. A typical genetic algorithm requires a genetic
representation of the solution space and a fitness function to evaluate the solution space.
Algorithm 2: Binary Genetic Algorithm
Input -
evaluate_func : the estimating function with classification algorithm calculating
fitness score.
n: the population size.
max_it : the maximum number of iterations.
dim: the number of features.
mutation_rate = probability of mutation.
Output -
Best Value: type float.
Best Position: type list of length ‘dim’.
Begin -
Initialise fitness = [0] , best_val = 0, best_pos = [0];
gens = generate initial population;
Initialise it = 0;
While it < max_it do
for each particle i = 1,…,n do
/* evaluate the fitness of each individual of population using
evaluate_func */
score = evaluate_func(gens[i]);
fit[i] = score;

/* update the best value and best position */


if best_val < score then
best_val = score;
best_pos = gens[i];

/* Select two best individuals from the current generation */


alter_gens = sorted(gens, order = descending)[:2] // top two
individuals with highest number of 1’s in chromosomes

/* crossover and mutation */


sample_num = random( list( combination of gens taken two at a
time ));
next_gen = [crossover( s, m ) for s,m in sample_num];
ggens = mutation(next_gen, mutation_rate);
gens = []; /* next generation population */
gens.append(ggens);
gens.append(alter_gens[0]);
gens.append(alter_gens[1]);

return best_val, best_pos;


END.

The aim of the algorithm is to remove the worst performers in the population and to
retain as many of the best individuals as possible, so that only a good population is
carried forward. It is therefore well suited to search problems, and here we use it to
find the most optimal feature subset.
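As an illustration of the crossover and mutation steps on 0/1 feature masks, a minimal Python sketch is given below; it assumes single-point crossover and an independent per-bit flip probability, which are common choices but not necessarily the exact operators of Algorithm 2 (and, unlike the pseudocode, mutation is applied here to a single chromosome).

    import random

    def crossover(parent_a, parent_b):
        # single-point crossover of two 0/1 feature masks of equal length
        point = random.randint(1, len(parent_a) - 1)
        return parent_a[:point] + parent_b[point:]

    def mutation(chromosome, mutation_rate=0.01):
        # flip each bit independently with probability mutation_rate
        return [1 - bit if random.random() < mutation_rate else bit for bit in chromosome]

    # example: produce one child from two candidate feature subsets of 8 features
    child = mutation(crossover([1, 0, 1, 1, 0, 0, 1, 0], [0, 1, 0, 1, 1, 0, 0, 1]))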

3.3 Binary Gravitational Search Algorithm


GSA is an algorithm based on the Newtonian law of universal gravitation and the law
of motion. The former states that "each particle attracts every other particle with a
force that is directly proportional to the product of their masses and inversely
proportional to the square of the distance between them".

Force = G × (Mass1 × Mass2) / Distance²                (1)

Acceleration = Force / Mass                            (2)

However, Rashedi et al. showed through their experimental results that making the
force inversely proportional to the distance between agents gave better results than
the inverse proportionality to the square of the distance stated by the gravitation
law [22].
In GSA, individual particles are considered as entities and their performance
measures are their masses. Each massive object has four specifications: position,
inertial mass, active gravitational mass and passive gravitational mass. All the
particles are pulled by gravity towards objects with heavier mass. This is exploited
because the best individual particle, being the heaviest, moves slowly and hence stays
close to the most optimal solution. The steps and formulas are explained in detail
in [22]. The binary GSA (BGSA) differs in that each dimension can take the value 0 or 1
only; a position update in BGSA means changing a value from "0" to "1" or vice versa,
and the velocity determines the probability of switching between the values. Hence, the
final value returned is the most relevant subset of the feature set.

Algorithm 3 - Binary Gravitational Search Algorithm

1. Generate initial population.
2. Repeat
   2.1 Estimate the fitness score of each object using the classification model being used.
   2.2 Update the values of G, best_score and worst_score of the population.
   2.3 For each object, calculate its Mass and Acceleration.
   2.4 Update the velocity and position.
3. Until (iteration > maximum_iteration or the termination condition is satisfied).
4. Return the best value and best position.
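A condensed Python sketch of steps 2.2–2.4 is shown below, following the general BGSA formulation of Rashedi et al. [22]: fitness values are normalised into masses, the resulting accelerations update the velocities, and |tanh(v)| is used as the bit-flip probability. The constants and helper names are illustrative; this is a sketch of the general method, not of the authors' exact implementation.

    import numpy as np

    def bgsa_step(positions, velocities, fitness, G):
        """One simplified BGSA iteration over a population of 0/1 feature masks."""
        n, dim = positions.shape
        best, worst = fitness.max(), fitness.min()
        m = (fitness - worst) / (best - worst + 1e-12)    # heavier mass = better agent
        mass = m / (m.sum() + 1e-12)
        accel = np.zeros((n, dim))
        for i in range(n):
            for j in range(n):
                if i != j:
                    dist = np.linalg.norm(positions[i] - positions[j]) + 1e-12
                    # attraction proportional to mass and inverse distance (not distance^2, per [22])
                    accel[i] += np.random.rand(dim) * G * mass[j] * (positions[j] - positions[i]) / dist
        velocities = np.random.rand(n, dim) * velocities + accel
        flip = np.random.rand(n, dim) < np.abs(np.tanh(velocities))   # velocity -> flip probability
        positions = np.where(flip, 1 - positions, positions)
        return positions, velocities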

4 Systematic Architecture

The overall System Architecture (Fig. 2) can be divided into 5 parts: collecting the
data, making features from the collected tweets, using feature selection techniques,
applying machine learning models and analysing the results.

Fig. 2. System architecture



4.1 Collecting Data


The data is collected from SemEval-2015 Task 11. They have provided a python script
to retrieve the same dataset that was used during the task. The target score ranges
from −5 to +5, with −5 indicating negative sentiment, 0 indicating neutral sentiment
and +5 indicating positive sentiment. Scores from −1 to −5 indicate the intensity of a
negative sentiment (−5 being the maximum negative intensity and −1 the least negative),
and the same holds for positive sentiment. A total of 5198 tweets were retrieved.
Once the data is collected, it is pre-processed. The tweets may contain content that is
totally irrelevant for the prediction models, so there is a need to preprocess the data:
for example, "that's" is expanded to "that is", "where's" to "where is", "can't" to
"cannot", "won't" to "will not", and so on. Similarly, all words are lower-cased so that
two occurrences of the same word are not treated as different, the extra spaces are
removed, and stop words such as "is", "are", "the", "a", etc. are removed. After
preprocessing, only the important and meaningful words remain in the data.
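A minimal sketch of this preprocessing is shown below; it assumes NLTK's English stop-word list and a small hand-written contraction map, so the details may differ from the script actually used.

    import re
    from nltk.corpus import stopwords   # assumes the NLTK stopwords corpus is downloaded

    CONTRACTIONS = {"that's": "that is", "where's": "where is",
                    "can't": "cannot", "won't": "will not"}
    STOP_WORDS = set(stopwords.words("english"))

    def preprocess(tweet):
        tweet = tweet.lower()                                  # avoid case duplicates
        for short, full in CONTRACTIONS.items():               # expand contractions
            tweet = tweet.replace(short, full)
        tweet = re.sub(r"\s+", " ", tweet).strip()             # remove extra spaces
        return " ".join(w for w in tweet.split() if w not in STOP_WORDS)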

Fig. 3. Distribution of fine gradient score as negative, zero and positive scores.

Let us analyse the fine gradient score distribution. Figure 3 shows that 88.7% of the
tweets have been marked with a negative score, whereas 6.9% of the tweets are marked
with a positive score and the remaining 4.4% of the tweets are marked as neutral,
i.e. a zero score.

4.2 TF - IDF Algorithm


TF-IDF is a weighted matrix which gives the weight of each word used in the tweets,
based on TF (Term Frequency) and IDF (Inverse Document Frequency). The TF-IDF
weighted matrix is the product of the TF and IDF matrices, and the TF-IDF score is used
to rank words.
TF: Term Frequency. The TF of a word "x" in a document of "d" words is calculated
as TF(x) = (occurrences of x)/d. For example, if the word "dog" appears 10 times in a
100-word document d, then TF(dog) = 10/100 = 0.1.

IDF: Inverse Document Frequency. Words like "is", "are", "the", "a", etc. are usually
present in every document, and words that appear in every document are not significant
for ranking the documents. IDF is calculated as the logarithm of the total number of
documents in the corpus divided by the number of documents in which the word "x" is
present, i.e. IDF(x) = log(N/d(x)). For example, if the word "dog" is present in 100,000
documents out of 1,000,000 documents, then IDF(dog) = log(1,000,000/100,000) = 1, and
TF-IDF(dog) = TF(dog) * IDF(dog) = 0.1 * 1 = 0.1.
Hence, TF-IDF has been applied to the pre-processed tweets and 11538 features were
obtained. The features here are the words used in the tweets, and each feature has
been assigned a TF-IDF weighted score.
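This step maps directly onto a standard TF-IDF vectoriser; a sketch using scikit-learn is shown below. The exact vectoriser settings that produce 11538 features are not stated in the paper, so defaults are assumed and the corpus shown is a placeholder.

    from sklearn.feature_extraction.text import TfidfVectorizer

    clean_tweets = ["sweet sugar irony", "brave lion", "clear mud irony"]   # placeholder corpus
    vectorizer = TfidfVectorizer()                 # default tokenisation, smoothed IDF
    X = vectorizer.fit_transform(clean_tweets)     # sparse matrix: tweets x features
    print(X.shape)                                 # (number of tweets, number of TF-IDF features)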

4.3 Feature Selection


Such a huge number of features (11538) entails costly computation and time
consumption, yet accuracy should not be compromised when reducing the features.
Keeping all this in mind, feature selection techniques are needed. Here, three
Evolutionary Computation (EC) algorithms have been implemented for this purpose:
Gravitational Search Algorithm (GSA), Particle Swarm Optimisation (PSO) and Genetic
Algorithm (GA). All three algorithms are implemented in their binary form, i.e. a
feature is either selected (1) or not selected (0).

4.4 Models
The various Machine Learning Models are used to fit and test the results based on the
features subset produced by GSA, PSO and GA. The following Machine Learning
Models are used:
1. Multinomial Naive Bayes (MNB) - Family of probabilistic algorithms based on
Bayes theorem with the “naive” assumption that each pair of attributes is inde-
pendent of each other [5, 24]. To estimate the crucial parameters a small amount of
training data is required.
2. Decision Tree (DT) - This algorithm breaks down the dataset into smaller subsets
creating a tree-like structure, where each internal node is the condition, branch is the
outcome and leaf node is the result. It contains only conditional control statements.
3. Support Vector Regression (SVR) - One of the most powerful machine learning
   techniques, based on maximising the distance between the hyper-plane and the
   nearest points of the classes; in this way the classes are separated by the
   resulting planes [5, 11, 24].
4. Multi-Layer perceptron (MLP) - It is a simple form of Deep Learning technique,
which consists of input layer, hidden layers and the output layer. The hidden layers
have an activation function, which helps in generating a non-linear output. Each
layer has some weights associated with it and it’s adjusted in each epoch (forward
and backward learning) using back propagation.

Table 1. Comparison between PSO, GA and GSA.

Model       Features reduced   Previous accuracy (%)   Final accuracy (%)   Time taken (min)
MNB + GSA   3803               89.7998                 89.5804              0.9
MNB + PSO   3548               89.7998                 89.7576              0.82
MNB + GA    4475               89.7998                 89.6613              0.82
DT + GSA    11238              89.0072                 89.8206              28.99
DT + PSO    8899               89.0072                 89.5933              28.28
DT + GA     9430               89.0072                 89.3514              27.03
SVR + GSA   5364               90.389                  90.3889              105.58
SVR + PSO   7284               90.389                  90.3895              89.57
SVR + GA    4488               90.389                  90.3893              104.44
MLP + GSA   10692              88.0946                 89.8151              87.98
MLP + PSO   10940              88.0946                 90.0242              90.81
MLP + GA    8884               88.0946                 89.6409              94.05
KNN + GSA   3680               89.6032                 89.6631              56.03
KNN + PSO   2379               89.6032                 89.9156              42.49
KNN + GA    5252               89.6032                 89.7605              43.67

5. K-Nearest Neighbour (KNN) - In a feature space, classification is done on the basis
   of the k closest training examples. The candidate is assigned to the class which gets
   the majority vote of its neighbours, that is, the selected class is the most widespread
   class in the neighbourhood of the candidate (k is a typically small, positive integer).
These models are fitted and predicted using K Fold Cross Validation (K = 5) and
the final results are then averaged.
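The fitness evaluation shared by the three search algorithms (the evaluate_func of the pseudocode) can be sketched as follows: keep only the columns of the TF-IDF matrix selected by a 0/1 mask and score a model with 5-fold cross-validation. This helper is illustrative; the paper's own scoring function may differ in detail (in particular, the accuracy metric described in Sect. 4.5 is not sklearn's default).

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB

    def evaluate_func(mask, X, y, model=None, k=5):
        """Fitness of a 0/1 feature mask: mean k-fold score on the selected columns."""
        selected = np.flatnonzero(mask)
        if selected.size == 0:            # an empty subset gets the worst possible score
            return 0.0
        model = model or MultinomialNB()
        return cross_val_score(model, X[:, selected], y, cv=k).mean()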

4.5 Experiment Results


There are mainly three metrics to be observed for every feature selection technique and
the ML model combination.
1. Features Reduced: It means the total number of features reduced out of 11538
features.
2. Accuracy: It measures how close the actual Fine Gradient Score values are to the
   predicted Fine Gradient Score values. For example, if the actual value is −4 and the
   predicted value is also −4, this means 100% accuracy, whereas if the actual value is 5
   and the predicted value is −5, this means 0% accuracy (a plausible formula consistent
   with these examples is sketched after this list).
3. Time Taken: It means the time taken by the feature selection algorithm to select the
   best possible features. It is a one-time cost: once the features are selected, they can
   be reused without further computational effort when predicting the gradient score of
   future tweets.
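The paper does not give an explicit formula for this accuracy, but both examples are consistent with a linear scaling of the absolute error over the 10-point span of the score range; the reading below is an assumption inferred from the examples, not a formula quoted from the paper.

    def score_accuracy(actual, predicted, span=10):
        # assumed reading: accuracy falls linearly from 100% (exact match)
        # to 0% (error equal to the full -5..+5 span)
        return max(0.0, 1.0 - abs(actual - predicted) / span) * 100

    print(score_accuracy(-4, -4))   # 100.0
    print(score_accuracy(5, -5))    # 0.0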

5 Result and Analysis

The feature reduction, final accuracies and the time taken have been recorded for the
feature selection techniques GSA, PSO and GA using the machine learning models
Multinomial Naive Bayes (MNB), Decision Tree (DT), Support Vector Regression
(SVR), Multi-Layer Perceptron (MLP) and K-Nearest Neighbour (KNN). Table 1
shows the comparison between the methods adapted during experiment and their
findings. Let’s compare the results on the above mentioned three factors:

Fig. 4. Feature reduction, final accuracy and time taken (minutes) observed using GSA, PSO
and GA on various Machine Learning Models.

5.1 Feature Selection


The following graph plots the number of features reduced (out of 11538 features)
against the different combination of feature reduction techniques applied to 5 ML
models. Following are the results derived:
a. The TF-IDF algorithm produced 11538 features, out of which at least 2000 features
   were reduced by every algorithm, whereas the maximum number of features reduced was
   11238 (in GSA using DT).
b. MLP was able to reduce as many as 8884 features using GA, and around 11000 features
   with GSA and PSO (10692 in GSA and 10940 in PSO).
c. MNB and KNN performed somewhat similarly. The maximum features reduced in MNB were
   4475 using GA, and in KNN 5252 using GA. So, GA outperformed GSA and PSO for MNB
   and KNN.
d. The minimum features reduced in SVR were 4488 using GA, whereas the maximum features
   reduced were 7284 using PSO. Hence, GA was not able to outperform GSA and PSO in SVR.
e. Decision Tree was the most impressive of all, reducing a minimum of 8899 features in
   PSO, 9430 features in GA and 11238 features in GSA (using just 300 features out
   of 11538).

5.2 Final Accuracy


Observations showing that high accuracy is maintained even after reducing a significant
number of features are listed below. Figure 4 shows the comparison between the previous
and new accuracy, before and after applying the feature reduction techniques on the
various machine learning models.
a. Despite such a huge reduction in the number of features by the various feature
   selection techniques, which could have had a negative impact on accuracy, 4 out of
   5 machine learning models used (DT, SVR, MLP and KNN) gave higher accuracy when the
   input features were treated with GSA, PSO and GA. MNB gave a comparable accuracy
   (a difference of less than 0.2%).
b. SVR gave the best accuracies (90.3889, 90.3895, 90.3893 in GSA, PSO and GA
respectively) among all the models. SVR outperformed the previous accuracies
(without any feature reduction) with a little margin.
c. MLP showed the drastic improvement from previous accuracies. After applying
PSO with MLP, a drastic improvement of over 1% was observed (from 88.0946 to
90.0242). The point to note here is that 10940 features were reduced in PSO with
MLP. This improvement in accuracy was obtained with just using 598 features for
prediction.
d. Just like MLP, Decision Tree (DT) and K-Nearest Neighbour (KNN) showed significant
   improvement in accuracy despite such a huge reduction in the number of features.
e. Though MNB also reduced the features considerably, a small drop in accuracy was
   observed; this drop is less than 0.2%.

5.3 Time Taken


The following are the listings of all the observations made for time taken by all models.
a. The time taken factor is the amount of time taken in minutes to select the best
features out of 11538 features for the prediction.
b. MNB takes the least time, i.e. less than a minute, with all three feature reduction
   techniques.
c. It is clear from the graph that the time taken depends mainly on the model used and
   is mostly independent of the feature reduction technique used.
d. The time taken by the models is in the increasing order i.e.
MNB < DT < KNN < MLP < SVR.

6 Conclusions

The improvement in final accuracy despite the huge reduction in the number of features
leads to the conclusion that only a certain number of features are useful for the
prediction. These features are reliable in terms of accuracy, even on completely new
data on which the model has not been trained.
Such a large reduction in features also prevents long computation times. Time is
consumed only once, during the selection of the most optimal subset of features. Once
the best features are selected, the model takes less time than with the original number
of features when predicting the output, because the algorithm runs on fewer features.
It also shows that it is not safe to say that more features mean better prediction;
our experiment has shown that better features mean better prediction, that is, "quality
over quantity".

References
1. Agarwal, B., Mittal, N.: Categorical probability proportion difference (CPPD): a feature
selection method for sentiment classification. In: Proceedings of the 2nd Workshop on
Sentiment Analysis where AI meets Psychology, COLING 2012, Mumbai, December,
pp. 17–26 (2012)
2. Ahmad, S.R., Bakar, A., Yaakub Mohd, R.: Metaheuristic algorithms for feature selection in
sentiment analysis. In: Science and Information Conference (SAI), London, UK, pp. 222–
226, 28–30 July 2015
3. Bing, L., Chan, K.C.C.: Fuzzy logic approach for opinion mining on large scale twitter data.
In: Proceedings of 7th International IEEE Conference Utility and Cloud Computing,
pp. 652–657 (2014)
4. Brezočnik, L., Fister, Jr., I., Podgorelec, V.: Swarm intelligence algorithms for feature
selection: a review. Appl. Sci. 8(9), 1521 (2018)
5. Dash, A., Rout, J., Jena, S.K.: Harnessing twitter for automatic sentiment identification using
machine learning techniques. In: Proceedings of 3rd International Springer Conference.
Advanced Computing, Networking and Informatics, India, pp. 507–514 (2016)
6. Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: opinion extraction and
semantic classification of product reviews. In: Proceedings of 12th International ACM
Conference World Wide Web, Hungary, pp. 519–528, 20–24 May 2003
7. Fong, S., Gao, E., Wong, R.: Optimized swarm search-based feature selection for text
mining in sentiment analysis. In: IEEE International Conference on Data Mining Workshop
(ICDMW), Atlantic City, NJ, USA, pp. 1153–1162, 14–17 November 2015

8. Gamon, M.: Sentiment classification on customer feedback data: noisy data, large feature
vectors, and the role of linguistic analysis. In: COLING 2004: Proceedings of the 20th
International Conference on Computational Linguistics, Geneva, Switzerland, pp. 841–847,
23–27 August 2004
9. Ghosh, A., Li, G., Veale, T., Rosso, P., Shutova, E., Barnden, J., Reyes, A.: SemEval-2015
task 11: sentiment analysis of figurative language in Twitter. In: Proceedings of the 9th
International Workshop on Semantic Evaluation, Denver, Colorado, pp. 470–478, 4–5 June
2015
10. Hong, S., Lee, W., Han, M.: The feature selection method based on genetic algorithm for
efficient of text clustering and text classification. Int. J. Adv. Soft Comput. Appl. 7(1), 2074–
8523 (2015)
11. Huq, M.R., Ali, A., Rahman, A.: Sentiment analysis on twitter data using KNN and SVM.
IJACSA Int. J. Adv. Comput. Sci. Appl. 8(6), 19–25 (2017)
12. Kohavi, R., John, G.: Wrappers for feature subset selection. Artif. Intell. 97(1–2), 273–324
(1997)
13. Kristiyanti, D.A., Wahyudi, M.: Feature selection based on Genetic algorithm, particle
swarm optimization and principal component analysis for opinion mining cosmetic product
review. In: 5th International Conference on Cyber and IT Service Management (CITSM),
Denpasar, Indonesia, 8–10 August 2017
14. Kumar, A., Sebastian, T.M.: Sentiment analysis on twitter. Int. J. Comput. Sci. 9(4), 372–
378 (2012)
15. Kurniawati, I., Pardede, H.F.: Hybrid method of information gain and particle swarm
optimization for selection of features of SVM – based sentiment analysis. In: Proceedings of
2018 International Conference on Information Technology Systems and Innovation
(ICITSI), Bandung - Padang, Indonesia, 22–26 October 2018
16. Larsen, M.E., Boonstra, T.W., Batterham, P.J., Bridianne, O., Paris, C., Christensen, H.: We
feel: mapping emotion on twitter. IEEE J. Biomed. Health Inform. 19(4), 2168–2194 (2015)
17. Nagpal, S., Arora, S., Dey, S.: Feature selection using gravitational search algorithm for
biomedical data. Procedia Comput. Sci. 115, pp. 258–265 (2017)
18. Nicholls, C., Song, F.: Comparison of feature selection methods for sentiment analysis. In:
Farzindar, A., Kešelj, V. (eds.) Advances in Artificial Intelligence. Canadian AI 2010.
LNCS, vol. 6085, pp. 286–289. Springer, Heidelberg (2010)
19. O’Keefe, T., Koprinska, I.: Feature selection and weighting methods in sentiment analysis.
In: Proceedings of the 14th Australasian Document Computing Symposium, Sydney,
pp. 67–74 (2009)
20. Papa, J.P., Pagnin, A., Schellini, S.A., Spadotto, A., Guido, R.C., Ponti, M., Chiachia, G.,
Falcao, A.X.: Feature selection through gravitational search algorithm. In: IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2052–2055
(2011)
21. Rahman, S.A., Bakar, A.A., Mohamed-Hussein, Z.A.: An intelligent data pre-processing of
complex datasets. Intell. Data Anal. 16(2), 305–325 (2012)
22. Rashedi, E., Nezamabadi, H., Saryazdi, S.: BGSA: binary gravitational search algorithm.
Nat. Comput. 9(3), 727–745 (2010)
23. Vieira, S.M., Mendonça, L.F., Farinha, G.J., Sousa, J.M.C.: Modified binary PSO for feature
selection using SVM applied to mortality prediction of septic patients. Appl. Soft Comput.
13(8), 3494–3504 (2013)
16 A. Kumar et al.

24. Wang, N., Varghese, B., Donnelly, P.D.: A machine learning analysis of twitter sentiment to
the sandy hook shootings. In: Proceedings of 12th International IEEE Conference e-Science,
USA, pp. 303–312 (2016)
25. Wang, S.: A feature selection method based on fisher’s discriminant ratio for text sentiment
classification. In: International Conference on Web Information Systems and Mining. LNCS,
vol. 5854, pp. 88–97. Springer, Heidelberg (2009)
26. Whitley, D.: A genetic algorithm tutorial. Stat. Comput. 4(2), 65–85 (1994)
27. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In:
Proceeding of ICML 1997 Proceedings of the Fourteenth International Conference on
Machine Learning, San Francisco, CA, USA, pp. 412–420, 08–12 July 1997
Statistical Analysis of the Effects of Institutions
on the Economic Growth of France
in Recent Years

Yaimara Céspedes-González1(&), Guillermo Molero-Castillo2, Patricia Arieta-Melgarejo1, Everardo Bárcenas2, and Alejandro Velázquez-Mena2
1 Universidad Veracruzana, Xalapa, Veracruz, Mexico
yaimara.cespedes@gmail.com, parieta@uv.mx
2 Universidad Nacional Autónoma de México, Mexico City, Mexico
{gmoleroca,barcenas,mena}@fi-b.unam.mx

Abstract. This paper presents an analysis of the effects of institutions on the economic growth of France during the period 2014–2017. We chose to study this period because of the reforms that took place from 2014 onwards, aimed at improving productivity and economic growth so that France could preserve its place in the world economy. The data source used was the OECD Economic Outlook database, which collects information to encourage governments to make decisions on economic and social issues for the future, such as finding stability and promoting responsibility. The analysis was made in two stages: a preliminary review and a trend analysis. The variables analyzed were business confidence, labor market, exports, and public spending. The results show that business confidence in France, backed by growing economic investment, has increased significantly in recent years. In addition, the labor market showed substantial employment growth, while exports tend to be volatile because demand from France's business partners has not yet fully strengthened.

Keywords: Statistical analysis · French economic growth · Business confidence · Labor market · Public spending

1 Introduction

The prosperity of a country depends on its economic institutions, and these, in turn, depend on political power and political institutions [1, 2]. This results in an accumulation of physical and human capital, access to technology, an appropriate allocation of resources, and innovation [3–5]. However, reaching a sufficient level of these elements may be hindered by institutional characteristics [6]. These characteristics vary according to the type of organization, the distribution of political power, how the productive areas work, the security of property rights, the risk of expropriation, and the efficiency of the legal system and of the government, among others [1].
The concept of institutions has been defined by several authors. According to [2], institutions are the constraints and incentives that shape human interaction in political, social, and economic exchange. The author of [7] defines institutions as
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 17–26, 2020.
https://doi.org/10.1007/978-3-030-39442-4_2
organized entities with clear, structured, and regulated processes. For [1], they represent the combination of three connected notions: (a) the economic institutions, which establish the distribution of resources and influence the assignment of de facto power; (b) the political power, shaped by political institutions, which sets up the economic institutions and economic performance and can therefore have an impact on prosperity and on the future assignment of political power; and (c) the political institutions, which assign de jure political power to certain groups.
The connection between these three notions establishes a subordination of the economic institutions to the political ones. On the one hand, the political institutions set up the distribution of de jure political power, while on the other hand, the allocation of resources influences the distribution of de facto political power. Both powers, de jure and de facto, affect the development of political institutions and the choice of economic institutions, and it is the latter that define the economic results and the future allocation of resources.
Institutions are considered quality institutions when they guarantee and protect property rights. In addition, these institutions ensure equitable access to economic resources and, to some degree, equality of opportunity for a significant number of individuals [1, 8]. This creates competition and encourages the involvement of people in economic activities. Therefore, quality institutions are more likely to be implemented when political power is held by a representative group of society and when adequate regulations exist.
On the other hand, economic growth means the increase in the value of goods and services produced in an economy during a specific period. Economic growth is generally measured by the Gross Domestic Product (GDP). This growth is linked to productivity and to the improvement of people's standard of living [6]. Therefore, this indicator is used to weigh the socioeconomic conditions of a country.
France is one of the five largest economies worldwide, measured by GDP, a position mainly due to its strength in several sectors such as defense, technology, aeronautics, and the nuclear industry, among others [9]. In this context, French economic growth in recent years, from 2014 onwards, has shown a stable pace of expansion. Even though this growth process has been stable for some years, periods of economic decline and stagnation have also been evident. Nevertheless, these periods have been overcome successfully, without lasting as long as in other countries of the region. That is why, in the European sphere, French growth can be considered a stable experience.
In this regard, this study aimed to analyze the effects of institutions on the economic growth of France in the last four years, 2014–2017. To reach this goal, we investigated the impact of the institutional reforms on French economic growth. According to [10], the proper implementation of these measures foretells a rise in GDP, in the employment rate, in the trade balance, and in the government balance. It is essential to emphasize that French economic development has its antecedents in a historical process: the transition from an absolute monarchy to institutions that supported the feudal system and, later, capitalism.
2 Background

Democracy was established in France a long time ago; it was the first European country to use the voting system for elections, in 1848. France has gone through empires, monarchies, and republics. Currently, it is a semi-presidential republic [11, 12], with relatively independent executive powers. France is a member of international organizations such as the G8, the European Union, the Schengen Area, the United Nations, and NATO, among others [10]. In addition, it is home to important headquarters, such as the European Council and the European Parliament, as well as to market-leading multinational companies in several economic sectors.

2.1 Economic Growth


In recent years, there have been studies on institutions and economic growth. The authors of [13] analyzed the differences between countries in terms of output per worker. To measure the impact of institutions, they used the ICRG and Sachs-Warner databases. As a result, they found that the variation in output per worker was a consequence of differences in the social structure, that is, in the institutions and governmental policies.
On the other hand, [1] analyzed the relationship between the type of colonization and the quality of institutions in developing countries. To perform the analysis, the mortality variable was used. As a result, a negative correlation was found between the mortality variable and the quality of institutions. This analysis led to the conclusion that colonists settled and established better institutions in territories where mortality rates were low.
Nowadays, other variables that have gained importance in economic growth are technological change, corruption, financial development, and human capital [3]. Even though the quality of institutions matters, it is a cycle in which the concentration of power in a single group determines whether investments are made in research and development. Thus, the role of institutions becomes fundamental because, ultimately, it is institutions that condition the redistribution of resources, which can also lead to equality or inequality [14, 15].

2.2 Related Work


Several studies have been carried out on the causes of economic growth in France. In 1968, Lévy-Leboyer studied the economic growth of France in the 19th century [16]. At the beginning of that century, France had maintained rising growth due to agriculture. The author's hypothesis suggested that agricultural activity was the main factor influencing the growth of countries. However, the industrial revolution was another factor that influenced the economic growth of the country, accompanied by new forms of organization and restrictions based on rural and urban markets. It is during this period that the importance of establishing incentives and restrictions on commercial activity was strengthened.
Subsequently, the authors of [11] studied the political influence of the executive power on the economic growth of France. They adopted a historical approach, since they used data from the period 1871–2009. As a result, they reached three conclusions: (a) the
role of political ideology was confirmed as an indirect determinant of growth; (b) it was shown that left-wing parties promote equality at the expense of economic growth; and (c) the channels through which political ideology affected economic growth were identified, such as public expenditure and fiscal and budgetary policies; these elements affect employment and income inequality and, therefore, GDP growth.
In the current context, different studies have also been carried out. One of these was conducted by the Organization for Economic Cooperation and Development (OECD), which analyzed some structural reforms and their impact on France's economic growth. The analysis showed that France's economic growth between 2008 and 2013 was slow, at 1.25%, while in 2014 it was 0.4%. Due to this situation, in 2014 France faced the challenge of improving its productivity and growth. For this, it was necessary to modify its economic and social structures through reforms with the aim of preserving its place in the world economy. The reforms were made in four areas [9]:
– Improve competitiveness in the goods and services market. These regulations were designed to optimize competition and reduce prices, especially in sectors such as energy, transport, commerce, legal services, accounting, and architecture. In addition, other regulations were modified to protect some professional practices and to avoid monopolies in certain sectors.
– Improve the functioning of the labor market, in order to stimulate labor supply and reduce its cost. These reforms were accompanied by fiscal modifications and other reforms aimed at improving labor supply, incentives, and the quality of the workforce.
– Clean up the fiscal structure of distortions. These reforms were aimed at reducing corporate tax rates, introducing a carbon tax, increasing VAT for certain rates, and reducing income tax.
– Simplify the territorial distribution of the country. The intention was to reduce political segmentation and to facilitate the proper functioning of local labor markets.
In general, France has devoted special attention to reducing the bureaucracy that affects companies and the productive process. The goal of improving education and training is noteworthy, as it guarantees the implementation and continuity of appropriate mechanisms. The creation of metropolitan areas is an important step toward defining responsibilities at the territorial level, with the aim of promoting local productivity and improving trade between regions.
Therefore, according to [10], the implementation of the reforms promises, by the year 2020, an increase in GDP of 0.4%, in employment levels of 0.31%, in the trade balance of 0.03%, and in the government balance of 0.27%.

3 Method
3.1 Data Source
The analysis of the effects of institutions on the economic growth of France was based on data from the OECD Economic Outlook 102 database. The data source was accessed through the OECD institutional page (http://stats.oecd.org/index.aspx). This data source includes a set of macroeconomic data from the 35
countries that belong to OECD, including France. This database also includes data
from 10 other countries that do not belong to OECD, such as Colombia, Brazil, China,
Russia, and others.
The OECD, the Organization for Economic Cooperation and Development, collects and analyzes data and information to encourage governments to make decisions on economic and social issues for the future, such as fighting poverty, finding stability, and promoting responsibility [17]. The OECD Economic Outlook database includes data on: spending,
foreign trade, production, employment and unemployment, interest and exchange rates,
payments balance, disbursements and revenues from government and households,
government debt, supply, and fiscal indicators.
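As an illustration of how such an extract can be prepared for analysis, the following sketch loads a manually exported copy of the database with pandas; the file name and column labels are assumptions made for illustration and are not part of the OECD documentation.

```python
# Hedged sketch (not the authors' code) of loading an OECD Economic Outlook
# extract exported as CSV from http://stats.oecd.org/index.aspx.
# The file name and column names below are illustrative assumptions.
import pandas as pd

economic_outlook = pd.read_csv("oecd_economic_outlook_102.csv")  # hypothetical export
france_2014_2017 = economic_outlook[
    (economic_outlook["Country"] == "France")
    & (economic_outlook["Time"].between("2014-Q1", "2017-Q2"))
]
print(france_2014_2017.head())
```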

3.2 Data Analysis


Data analysis was carried out in two stages. The first consisted of a preliminary review of all the variables listed in the OECD Economic Outlook database, in order to identify significant variables with data for the 2014–2017 period. The variables identified were: (a) business confidence, (b) labor market, (c) exports, and (d) public spending. Table 1 shows the cumulative quarterly values (three-month moving averages) for the business confidence, labor market, and export variables. Table 2 shows the data for the public spending variable in 2017.
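The following sketch illustrates one way to compute the cumulative quarterly values (three-month moving averages) reported in Table 1; it assumes a monthly pandas series and is not taken from the authors' own processing pipeline.

```python
# Illustrative reconstruction of the "cumulative quarterly values
# (three-month moving average)"; `monthly` is an assumed pandas Series of
# monthly observations indexed by a DatetimeIndex.
import pandas as pd

def quarterly_moving_average(monthly: pd.Series) -> pd.Series:
    smoothed = monthly.rolling(window=3).mean()  # three-month moving average
    return smoothed.resample("Q").last()         # keep the value at each quarter end

# Example with synthetic data:
idx = pd.date_range("2014-01-31", periods=6, freq="M")
print(quarterly_moving_average(pd.Series([94.0, 94.1, 94.2, 94.3, 94.1, 93.9], index=idx)))
```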

Table 1. Cumulative quarterly values for the business confidence, labor market, and exportation
variables.
Date | Business confidence (index) | Employment (% change) | Unemployment (% of labor force) | Motor vehicles | Air and spacecraft | Other goods (the last three columns are exports, index 2008 = 100)
2014q1 94.073 0.485 10.135 67.195 146.448 103.177
2014q2 94.172 0.190 10.204 67.494 144.872 102.559
2014q3 91.799 −0.216 10.415 68.158 143.081 102.397
2014q4 92.507 0.120 10.462 68.113 146.036 102.789
2015q1 94.224 −0.174 10.314 68.289 150.851 103.125
2015q2 97.095 −0.048 10.439 70.192 155.761 104.422
2015q3 99.678 0.367 10.505 71.625 162.634 105.747
2015q4 100.856 0.283 10.193 74.375 166.653 105.953
2016q1 100.932 0.727 10.180 76.149 165.038 106.331
2016q2 100.973 0.675 9.955 77.369 162.780 106.292
2016q3 101.371 0.510 10.121 77.349 160.748 105.508
2016q4 102.930 0.478 9.959 78.069 158.720 105.281
2017q1 104.246 0.448 9.541 79.749 162.154 106.192
2017q2 105.303 1.233 9.424 81.798 160.762 107.300
Table 2. Values of the public spending variable.


Item | Social spending (% of GDP) | Public expenditure (% of GDP)
IRL 20.990 16.853
CAN 19.225 18.938
CZE 16.964 21.997
GBR 20.614 21.495
ESP 20.575 21.661
HUN 21.638 20.599
ITA 21.738 21.987
PRT 25.795 18.538
BEL 26.767 18.132
FRA 24.736 21.441
USA 22.127 26.889
POL 24.003 26.628
OECD 28.640 22.253
DEU 28.112 23.347
NLD 25.210 26.557
GRC 28.414 24.306
SWE 30.685 24.415
AUT 30.140 25.180
DNK 31.879 25.403
FIN 31.036 27.052

In the second stage, the trends of the previously selected variables were analyzed. The analysis of economic growth in France starts from 2014 because the reforms aimed at improving productivity and growth, and thereby preserving France's place in the world economy, began in that year.
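One simple way to quantify the trends examined in this stage is to fit a least-squares line to each quarterly series; the sketch below applies this to the business confidence values of Table 1 and is offered only as an illustrative assumption about how a trend could be measured, not as the authors' exact procedure.

```python
# Fit a least-squares line to a quarterly series and report its slope.
import numpy as np

def linear_trend(values) -> float:
    """Slope of a least-squares line fitted to a quarterly series."""
    t = np.arange(len(values))
    slope, _intercept = np.polyfit(t, values, deg=1)
    return slope

# Business confidence index, 2014q1-2017q2 (values taken from Table 1).
business_confidence = [94.073, 94.172, 91.799, 92.507, 94.224, 97.095, 99.678,
                       100.856, 100.932, 100.973, 101.371, 102.930, 104.246, 105.303]
print(f"Business confidence trend: {linear_trend(business_confidence):+.3f} index points per quarter")
```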

4 Results

Figures 1 and 2 show the results obtained for the four variables analyzed: business confidence, labor market, exports, and public spending. Regarding business confidence in France, it was observed that in recent years this confidence has increased continuously, thus supporting the growth of economic investment. This may be because of the increase in social security and business tax cuts for companies. However, it is notable that investment growth slowed slightly during 2017, which may be related to the depreciation observed in April 2017.
Fig. 1. Trends of analyzed variables: (a) Business confidence, and (b) Labor market.

On the other hand, it was also observed that the labor market had significant growth during 2017. This indicates a recovery driven by low interest rates and supported by household consumption and investment. In addition, according to [16], business surveys indicate favorable hiring prospects. To improve labor market outcomes, the French government must promote the inclusion of less qualified workers, encouraging the use of short-term and indefinite contracts. This could improve access to more stable jobs and the training of many workers.
Fig. 2. Trends of analyzed variables: (a) Exportation, and (b) Public spending.

In the case of exports, they show a negative (volatile) pattern due to a series of temporary factors, such as supply interruptions in aeronautics and bad weather affecting agricultural exports. This situation arises because exports have been driven by only a few sectors, especially the transport industries. Another unfavorable factor is that French companies have not yet fully benefited from the strengthening of demand from their business partners.
In terms of public spending, in comparison with other countries, in 2017 the French government took some measures to reduce such spending. These measures include reducing the overlap of responsibilities, that is, merging small municipalities, and improving the access of older workers to training and gradual retirement. In addition,
it is intended to reduce hiring and housing subsidies, as well as to limit the growth of public spending to less than 0.5% per year between 2018 and 2022. Another measure adopted, according to the OECD, is the increase in taxes on energy and tobacco, which helps make economic growth more ecological and strengthens prevention in health care.

5 Conclusions

France was one of the first democratic countries, which allowed it to become a solid
nation with rising economic growth.
In the French economy, despite the elimination of hiring subsidies for small businesses and the reduction of jobs, employment growth is expected to continue, supported by GDP growth. This could strengthen the growth of private consumption and a gradual fall in the unemployment rate.
Labor market reform in France can facilitate negotiations in the business sector, especially in small businesses. In addition, substantial investment in training and in the strengthening of learning should be planned.
Exports in France should benefit from increased demand from trading partners and a rebound in emerging markets, as well as from the rebound of tourism after the terrorist attacks that have occurred in recent years.
French institutions determine the economic growth of the country. However, political events in other European countries, such as terrorist attacks, could undermine confidence in the economic union and in trade between member countries.
France has faced and overcome crises caused by global events thanks to certain institutions the country has forged. However, France still has multiple challenges to face in order to increase its competitiveness. For this to be achieved, it is essential to have efficient regulatory systems that simplify and foster economic growth and that protect property rights.

References
1. Acemoglu, D., Johnson, S., Robinson, J.: Institutions as a fundamental cause of long-run
growth. In: Handbook of Economic Growth, vol. 1A, pp. 385–472. Elsevier (2005)
2. North, D.: Institutions, institutional change and economic performance. Fourth reprint.
Fondo de Cultura Económica, Mexico (2012)
3. Esso, L.: Changement technologique, croissance et inégalité: l’importance du capital humain
et des institutions, Ph.D. thesis, Economies et finances, Université Panthéon-Sorbonne-Paris
I (2006)
4. Galindo, M.: La innovación y el crecimiento económico: una perspectiva histórica. Econ.
Ind. 368, 7–25 (2008)
5. Galindo, M.: Crecimiento económico. Tendencias y Nuevos Desarrollos de la Teoría
Económica 858, 39–56 (2011)
6. Docquier, F.: Identifying the Effect of Institutions on Economic Growth. Institutional
Competition between Common Law and Civil Law, pp. 25–40. Springer, Berlin (2014)
7. Williamson, O.: The new institutional economics: taking stock, looking ahead. J. Econ. Lit.
38(3), 595–613 (2000)
8. Yahyaoui, A., Rahmani, A.: Développement financier et croissance économique: Rôle de la qualité des institutions. Panoeconomicus 56(3), 327–357 (2009)
9. OECD: France. Structural reforms impact on growth and options for the future.
Report OECD, Organization for Economic Cooperation and Development (2014)
10. European Commission: The Economic Impact of Selected Structural Reform Measures in
Italy, France, Spain and Portugal. European Economy, Institutional paper 023 (2016)
11. Facchini, F., Melki, M.: Political Ideology and Economic Growth in a Democracy: The
French Experience, 1871–2009. Documents de Travail du Centre d’Economie de la
Sorbonne (2012)
12. European Union. https://europa.eu/european-union/about-eu/countries/member-countries/
france_en. Accessed 11 Feb 2019
13. Hall, R., Charles, J.: Why do some countries produce so much more output per worker than
others? Q. J. Econ. 114(1), 83–116 (1999)
14. Rodrik, D.: One Economics, Many Recipes: Globalization, Institutions, and Economic Growth. Princeton University Press, Princeton (2007)
15. Sachs, J.: Les institutions n’expliquent pas tout. Finances & Développement, pp. 38–41
(2003)
16. Lévy-Leboyer, M.: La croissance économique en France au XIXe siècle. Résultats
préliminaires. Annales. Histoire, Sciences Sociales 23(4), 788–807 (1968)
17. OECD: OECD Economic Outlook. Database Inventory 102. Database documentation,
Organization for Economic Cooperation and Development (2017)
A Taxonomy of Social Engineering
Defense Mechanisms

Dalal N. Alharthi1(B), Mahmoud M. Hammad2, and Amelia C. Regan1
1 Computer Science Department, University of California, Irvine, USA
{dalharth,aregan}@uci.edu
2 Software Engineering Department, Jordan University of Science and Technology, Irbid, Jordan
m-hammad@just.edu.jo

Abstract. Humans have become the weakest point in the information security chain, and social engineers take advantage of that fact. Social
engineers manipulate people psychologically to convince them to divulge
sensitive information or to perform malicious acts. Social engineering
security attacks can be severe and difficult to detect. Therefore, to pre-
vent these attacks, employees and their organizations should be aware
of relevant defense mechanisms. This research develops a taxonomy of
social engineering defense mechanisms that can be used to develop edu-
cational materials for use in various kinds of organizations. To develop
the taxonomy, the authors conducted a systematic literature review of
related research efforts and extracted the main target points of social
engineers and the defense mechanisms regarding each target point.

Keywords: Cybersecurity · Social engineering · Attack vectors · Defense mechanisms

1 Introduction

In our digital world, information security threats can be divided into two primary
types: technical hacking and social engineering attacks. In technical hacking,
cyberattackers conduct attacks using advanced techniques to gain unauthorized
access to systems.
However, it becomes difficult for hackers to successfully attack computer
systems and networks using purely technical means [3]. Therefore, hackers rely
on social engineering attacks to bypass technical controls. Social engineering
allows cyberattackers to gain unauthorized access to systems by psychologically
manipulating users [7,17].
Compared to technical hacking, social engineering is easier, cheaper, and more effective for gaining access to confidential information. Numerous previous
research efforts have demonstrated the success of social engineering attacks [6,15,
21,27,34]. Social engineering attacks are conducted either by person-to-person
interaction (in person or over the phone) or by computer-interaction (email,
c Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 27–41, 2020.
https://doi.org/10.1007/978-3-030-39442-4_3
pop-up window, instant message, or malicious website). Social engineers target individuals, organizations, and countries as well.
Since the consequences of social engineering attacks are severe and chal-
lenging to detect, employees and organizations need to be aware of the defense
mechanisms that can protect against various types of these attacks. Mouton et
al. [25] outlined the importance of increasing the awareness level of employees
against social engineering attacks.
Even though preventing social engineering attacks is crucial for organizations
and countries, the research lacks a well-designed taxonomy of the defense mech-
anisms against the ever-increasing types of social engineering attack vectors. To
fill this research gap, this paper provides a taxonomy of the main target points of
social engineers and the defense mechanisms against various social engineering
attacks. The taxonomy developed in this study will help researchers, practition-
ers, and organizations understand the defense mechanisms for social engineering
security attacks. Organizations can use the taxonomy to elevate the awareness
level of their employees about the various defense mechanisms and hence better
protect their organizations and their information.
The remainder of this paper is structured as follows: Sect. 2 provides the nec-
essary background for this study. Section 3 presents the related research efforts
on social engineering attacks. Section 4 presents the research question this paper
aims to answer and describes the authors’ methodology for developing the taxon-
omy. Section 5 describes the taxonomy in details. Threats to validity for building
the taxonomy are reported in Sect. 6. Finally, the paper concludes with avenues
of future work.

2 Background

This section provides an overview of the different kinds of common social engi-
neering security attacks.
Organizations mainly focus on deploying high quality and sophisticated secu-
rity tools to detect security vulnerabilities or even prevent security attacks. How-
ever, security is only as strong as the weakest point in the system, which includes
the human actors. Since humans are the weakest point in the information security
chain, they are being targeted by social engineers. According to [10], misuse of
information systems by humans, both intentionally and unintentionally, accounts
for 50% to 75% of cybersecurity threats.
As stated by Granger [14], social engineering is “the art and science of getting
people to comply with your wishes”. It can be defined as the practice of acquiring
information through technical and non-technical means [22]. Therefore, social
engineering attacks rely on convincing people that a social engineer is a trusted
friend or colleague. Social engineering attacks can be carried out either by a
human or by a machine through a software system [23]. Social engineering attacks
have no limit, and they only depend on the creativity of social engineers. In the
past few years, the number and the sophistication of social engineering attacks
have increased and became more diverse. These attacks are difficult to detect
and prevent, resulting in loss of confidential data, intellectual property, financial data, money, and organizational credibility with customers [2].
Figure 1 depicts the various social engineering attacks. As shown in the figure,
there are two types of attacks: technical SE attacks and non-technical SE attacks.
Below the paper provides a brief description of each one so the reader can under-
stand the discussion that ensues.

Fig. 1. Social engineering attack vectors.

2.1 Technical Social Engineering Attacks

As shown in the left side of Fig. 1, technical social engineering attacks are Vish-
ing, Phishing, Spear Phishing, Spam Email, Interesting Software, Popup Win-
dow, Baiting, Tailgating, and Waterholing.

– Phishing and Trojan Email rely on carefully crafted messages to entice victims
to open attachments or click on embedded hyperlinks [3]. In this security
attack, the victim is entirely unknown to the social engineer.
The phishing attack is one of the most successful social engineering attacks.
One of the biggest phishing attacks occurred in March of 2016 during the
U.S. presidential election. It targeted John Podesta, the former chairman
of Hillary Clinton’s U.S. presidential campaign, and through his account,
some of Clinton's emails. The target of the attack was Clinton's personal Gmail account, which had messages from 2007 through 2016 [29,33]. The phishing email contained a "change password" link; once John Podesta clicked on it and changed his password, the social engineers maliciously obtained his password and locked his account.
– Vishing (voice phishing) occurs by tricking people into revealing sensitive
information through a phone call.
– Spear Phishing is similar to the phishing attack, but in this case the victim's information is known to the social engineer. Therefore, the social engineer can launch customized cyberattacks.
– Spam Email is an email that offers friendships, diversion, gifts and various
free pictures and information in order to plant malicious code on the reader’s
machine.
– Interesting Software and Popup Windows are other social engineering tech-
niques in which a social engineer convinces a victim to download and install
a useful program or application such as a CPU performance enhancer or
displays a pop-up window that prevents a victim from proceeding with the
session unless he reenters his username and password.
– Baiting happens when a malware-infected storage medium is left in a location
where it is likely to be used by targeted victims [22].
– Tailgating aims at accessing unauthorized places by getting help from an
authorized person.
– Waterholing means compromising a website that is likely to be of interest to
a chosen victim [3].

2.2 Non-Technical Social Engineering Attacks


There are several social engineering attacks where technology is not involved such
as Pretexting/Impersonation, Dumpster Diving, Shoulder Surfing/Spying, Hoax-
ing, Authoritative Voice, Smudge Attack, Support Staff, and Technical Expert.

– Pretexting/Impersonation occurs when a social engineer pretends to be some-


one else who is known to a target person.
– Dumpster Diving happens by sifting through the trash of an organization to
find discarded items that include sensitive information.
– Shoulder Surfing and Spying use direct observation techniques to obtain information [22]. When a social engineer attempts to extract sensitive information about the recent activity of a user by using, for example, residual oils on touch screen devices to detect the user's input, such an attack is called a Smudge Attack. This attack method can be applied to a significantly large set of devices, such as the touch screens of ATMs and DRE voting machines [5].
– Hoaxing is an attempt to trick an audience into believing that something false is real [32].
– Authoritative Voice is another SE attack, discussed in [32], in which a social engineer calls a company's computer help desk and pretends to have access to a troubleshooting system.
– Support Staff and Technical Expert are physical attacks used by social engi-
neers by acting as support staff or as technical staff. As an example, a man
dressed as a cleaning crew member, walks into a work area, carrying cleaning
equipment, then in the process of appearing to clean a desk area, he can snoop
around and get valuable information such as passwords, or confidential files
that an employee forgot to hide, or even make a phone call impersonating an
employee from his desk. Another example is that an attacker can pretend to
be a technical support person working on a network problem and request the


user to let him have access to a data center to “fix” the problem.

According to a 2018 Verizon report [1], phishing and pretexting combined account for 98% of social engineering incidents, and email is the most common medium used to carry out social engineering attacks.

3 Related Work

A large body of research efforts focuses on the pure technical security attacks
while fewer researchers have focused on social engineering attacks. This section
discusses the related research efforts in light of this research.
Medlin et al. [24] conducted a study to analyze the vulnerability of U.S.
hospitals to social engineering attacks. Employees who volunteered to complete
the survey were rewarded with both candy and a chance to win a gift card.
Within the questions, employees were asked to reveal their passwords and some
other confidential information. Surprisingly, 73% of them shared their passwords.
Krombholz et al. [22] illustrated some real-world examples of social engineer-
ing attacks against major companies, including the New York Times, Apple,
Facebook, Twitter, and the RSA Network Security LLC company. In 2013, social
engineers targeted the New York Times. The initial attack was a Spear Phishing
attack, recall Sect. 2, which sent fake FedEx notifications. Then the New York
Times hired computer security experts to analyze the attack, and they found
that some of the methods used to break into the company’s infrastructure were
associated with the Chinese military, i.e., a political motive. Because of this SE
attack, social engineers stole the passwords of some employees in The New York
Times, and hence they were able to access the personal devices of 53 employees.
As another example, leveraging Waterholing SE attacks in 2013 against
Apple, Facebook, and Twitter, social engineers were able to exploit a zero-day
vulnerability. Specifically, they were able to sneak into the corporate networks
and inject malicious code onto their websites. Once a user visited the infected
website, his device would be compromised. Moreover, in 2011, a small number of
RSA employees received an email entitled “2011 Recruitment Plan”. The email
was well written, so readers were convinced that it was legitimate. The email
contained a spreadsheet which contained a malicious payload to exploit a vulner-
ability on the user’s device. This SE attack led to stealing sensitive information
of the RSA SecureID system [22].
Aldawood and Skinner [2] suggested a few methods organizations can fol-
low to educate their employees about reducing the effect of social engineer-
ing attacks. These are Serious Games, Gamification, Virtual Labs, Simulations,
Modern Applications, and Tournaments. Serious game is a method that allows
employees to face real-time scenarios with an opportunity to use their knowl-
edge to implement mitigation strategies. Similarly, an organization can use the
Gamification to assess the behavior of hypothetical victims of social engineer-
ing attacks. Remote online networks are another method, known as Virtual Labs,
which helps trainees learn about threats of social engineering via virtual solu-
tions. Simulations can be used as models of real scenarios to evaluate various
social engineering attacks. Additionally, Modern Applications that rely on the
use of software application training and learning modules can be used to assess
different types of social engineering threats. Furthermore, Tournaments, i.e., communication-threat competitions, can be held between multiple organizations that need social engineering mitigation training.
Orgill et al. [27] demonstrated two metrics for determining security compli-
ance in an organization. These are user education and security auditing. They
emphasized the importance of educating employees about social engineering
attacks and how to prevent them.
Ghafir et al. [13] emphasized the importance of adopting a multi-layer
defense, also referred to as defense in-depth, to lower the risk associated with
social engineering attacks. They showed that a good defense in-depth structure
should include a mixture of security policy, user education/training, audits/compliance, as well as safeguarding the organization's network, software and
hardware. The paper also illustrated four steps of social engineering which are
(1) information gathering, (2) developing relationships, (3) exploitation, and (4)
execution.
Chitrey et al. [9] developed a model of social engineering attacks. The model
categorized social engineering attacks under two main entities: (1) vulnerable
entities which are human, technology, and government laws and (2) safeguards
entities which are information security awareness program, organization security
policies, physical security, access control, technical control, and secure applica-
tions development. Such a model can be used in the development of organization-
wide information security policy.
Gupta and Sharman [16] proposed a framework for the development of a
Social Engineering Susceptibility Index (SESI) based on social network theory
propositions. The framework reveals the real risks of social engineering attack
that employees are exposed to. The framework suggested five indices: social
function, organizational hierarchy, organizational environment, network charac-
teristics, and relationship characteristics.
Beuran et al. [8] used the main cybersecurity training programs in Japan
as a detailed case study for analyzing the best practices and methodologies
in the field of cybersecurity education and training. The paper defined a tax-
onomy of requirements to ensure adequate cybersecurity education and train-
ing. The developed taxonomy has two main aspects, which are training con-
tent and training activities. As far as the training content, there are three
main categories, which are attack-oriented training, defense-oriented training,
and analysis/forensics-oriented training. Another perspective on cybersecurity
training is considered to focus on security-related activities that include individ-
ual skills, team skills, and Computer Security Incident Response Team (CSIRT)
skills.
According to [20], a combination of technical, social, economic, and psycho-
logical factors affect an employee’s decision-making process when contemplating
whether to comply with or ignore the terms of information security policies.


The social engineer might rely on some principles to raise the effectiveness of
the cyberattack, such as authority, intimidation, consensus, scarcity, familiarity,
trust, and urgency. According to [3], trust, authority, and fear are contribut-
ing to the success of social engineering attacks. These internal pressures can be
exploited by social engineers to achieve certain purposes, such as encouraging
someone to share sensitive information that they probably should not. Addi-
tionally, when someone does something nice to us, we automatically feel obliged
to return the favor [18]. Risky-shift is another critical factor that was coined
by James Stoner in 1961 [31]. It occurs when an employee (as part of a team)
tries to make decisions about the risk associated with the use of information
technology which is different from when he is using his personal devices. At the
personal level, employees tend to be more careful about their data. In contrast,
when working as a team, they are more likely to make riskier decisions.

4 Research Methodology

This section describes the research question this research aims to answer, presented in Sect. 4.1, and the research methodology followed to develop the taxonomy, described in Sect. 4.2.

4.1 The Research Question

In this research, the target is to develop a taxonomy of the main defense mech-
anisms against social engineering attacks. To that end, this section presents and
discusses the research question this study tries to answer.
RQ1. What are the main defense mechanisms against social engineering
attacks that employees and organizations should be aware of?
In order to answer this research question, the paper conducted a thorough
literature review and discovered the main target points of social engineers. Then,
the paper outlined the various defense mechanisms regarding each target point.
These defense mechanisms reduce or even prevent social engineering attacks.
Hence, employees should be aware of them either through training programs or
reading materials given to them periodically through their organizations.
By answering this research question, organizations will have a better under-
standing of social engineering defense mechanisms so they can take the right
actions to incorporate them and make them part of their organizational culture.

4.2 Building the Taxonomy Methodology

To develop the taxonomy, the authors conducted a systematic literature review and followed the SANS institute guidelines. Below is a brief description of each one.
Systematic Literature Review (SLR). To develop a taxonomy of the defense mechanisms against social engineering attacks, the authors followed a Systematic Literature Review (SLR) technique, as recommended by Okoli and Schabram in [26]. To do that, this study conducted a literature review of the most recent journal and conference papers that contained "social engineering" in their titles. For each paper, the authors extracted the target points that social engineers use to conduct their social engineering attacks, as well as any suggested defense mechanisms.
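A minimal sketch of the title-based screening step is shown below; the candidate records are made-up placeholders, since the actual search results are not reproduced here.

```python
# Illustrative title filter for the SLR screening step; the candidate list
# is a placeholder, not the authors' actual search results.
candidate_papers = [
    {"title": "Advanced social engineering attacks", "year": 2015},
    {"title": "An unrelated machine learning study", "year": 2018},
]

selected = [paper for paper in candidate_papers
            if "social engineering" in paper["title"].lower()]
print(selected)
```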

The SANS Institute. The SANS (SysAdmin, Audit, Network, Security) Institute is a private company based in the United States and founded in 1989. SANS is the largest source of cybersecurity training in the world. It provides the guidelines that organizations need for the rapid development and implementation of information security policies. These guidelines are divided into four categories: general, network security, server security, and application security. To build the taxonomy, the authors followed the guidelines in the SANS InfoSec Policies and the SANS Awareness Survey.

5 The Taxonomy
To answer the research question (RQ1 in Sect. 4), the authors conducted a thorough investigation of the literature and found that there are five main target points for social engineers.

Fig. 2. A comprehensive taxonomy of social engineering defense mechanisms for each target point.

Social engineers try to achieve their
malicious goals through these five target points, which are the main assets of
any organization. These five target points are People, Data, Software and Hardware (SW/HW), and Network. For each target point, the authors determined the
defense mechanisms to prevent any potential social engineering security attack
targeting that target point. Figure 2 depicts a tree-structure taxonomy of the
main target points and the defense mechanisms for each target point. Next, the
paper provides a detailed description of each target point and the defense mecha-
nisms against social engineering attacks targeting these target points. Employees
and organizations should be aware of these defense mechanisms to prevent any
social engineering attack.

5.1 People (Employees)


Social engineers target organizations' employees using social intelligence techniques to convince them to perform tasks that they should not do, such as giving out their passwords or sharing private data. To protect this asset, organizations should consider (1) educating their employees periodically, and (2) hiring IT technical staff knowledgeable about social engineering security attacks. Below is a description of these two defense mechanisms.

Awareness Training Program. Previous research has shown that Information Security Awareness (ISA) is vital in mitigating the risks associated with
information security breaches [4]. Raising employees’ awareness level is the best
way to limit the effect of social engineering techniques. Employees need guidance
to make the right decisions in the digital world. According to [30], “without any
guidelines to follow when exceptional situations arise, an employee is liable to
take actions that compromise the organization’s data or cause the organization
to miss out on lucrative business prospects”. Therefore, organizations need to
provide well-rounded awareness training programs to their employees to stay
secure.

Social Engineering Technical Staff. To prevent social engineers from performing their tricks on people, each organization should be equipped with a
social engineering technical staff. This team needs to be knowledgeable about
social engineering security attacks and their consequences. The existence of this
team can be helpful and beneficial to educate employees and to prevent social
engineering attacks.

5.2 Data
Data is a valuable asset for any organization, and it is a critical target for cyberattackers, either at the personal level or at the organizational level. At the personal level, social engineers might target the personal data of a high-profile employee, such as family pictures, videos, or salary. At the organizational level, many types of sensitive information can be targeted, such as planning
documents, employees' personally identifiable information (PII), financial information, or any private organizational data. To defend this asset, organizations need to (1) perform backups and replication of their data periodically, (2) determine the minimum information each employee and system needs to perform their tasks and grant only that information to that employee or system, and (3) create clear security policies that identify the sharing boundaries of information, so employees know what to share and with whom.

Backup and Replication. Constantly backing up the data and creating repli-
cation of the data, either inline or offline replication, ensure the integrity and
the availability of the data. Employees should be aware of any backup and repli-
cation policy in their organizations so they can consider it for the organization’s
data stored either on the servers or even on their work computers. Providing
consistent rules for backup and replication management is critical to ensure the
High Availability (HA) of the organization’s data.
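As a minimal sketch of the kind of routine such a backup policy might mandate, the following example copies a data directory into a timestamped folder; the paths and naming scheme are illustrative assumptions.

```python
# Periodic backup routine sketch; paths and naming scheme are assumptions.
import datetime
import pathlib
import shutil

def backup(source_dir: str, backup_root: str) -> pathlib.Path:
    """Copy source_dir into a new timestamped folder under backup_root."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    destination = pathlib.Path(backup_root) / f"backup-{stamp}"
    shutil.copytree(source_dir, destination)
    return destination

# Example (hypothetical paths): backup("/srv/org-data", "/mnt/replica")
```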

Least Privileges (LP) Determination and Enforcement. Determining the exact data each employee or system needs could be a complex task, but it is
crucial for security comprehension as well as preventing a data breach or, at least,
minimizing it. Applying the least privilege security principle ensures that each
employee has access only to the data he needs to perform his work. According
to [19], protecting organizations’ systems and securing their data require correct
enforcement of the “least user rights” and administrator privileges.
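The following sketch illustrates the least-privilege principle in its simplest form: a role is allowed an action only if that permission was explicitly granted, and everything else is denied by default. The role and permission names are assumptions used for illustration.

```python
# Minimal least-privilege check: permissions not explicitly granted are denied.
ROLE_PERMISSIONS = {
    "hr_clerk":   {"read:employee_pii"},
    "accountant": {"read:financials", "write:financials"},
    "developer":  {"read:source_code", "write:source_code"},
}

def is_allowed(role: str, permission: str) -> bool:
    """Grant access only if the permission was explicitly assigned to the role."""
    return permission in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("accountant", "read:financials")
assert not is_allowed("developer", "read:employee_pii")  # denied by default
```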

Sharing Boundaries. This defense mechanism ensures that employees are aware of the sharing-boundary policies for each piece of information they have access to, both within and outside their organizations. For example, some information can be shared only between workers in the same department, whereas other information can be shared only between workers in the same organization, and so on.

5.3 Software and Hardware (SW/HW)


Organizations should educate their employees about the importance of the hardware and software of their organizations. To secure an organization's equipment and systems against social engineering attacks, the organization needs to educate its employees regarding (1) the management process for the organization's hardware and software, (2) work emails and accounts, (3) any authentication policy, and (4) the Bring Your Own Device (BYOD) policy.

SW/HW Management. Managing software systems and hardware components to prevent social engineering attacks requires that each organization have clear policies regarding software installation, configuration, updates, hardware maintenance plans, etc. Such clear security policies, if they exist and employees are aware of them, would prevent many social engineering attacks, such as the Technical Expert and Support Staff attacks (recall Sect. 2).
Work Emails and Accounts. Protecting work emails by filtering potential spam is critical since, in most cases, email is considered a formal means of communication. Therefore, if a social engineer is able to send an email from an employee's work email, this can lead to severe consequences. On the other hand, social engineers mainly use email as a medium to spread their malicious intent. Therefore, employees need to know about the relevant security policies. Organizations must ensure that their employees are aware of what is acceptable and unacceptable use of work emails and accounts, as well as prevent any unauthorized computers or locations from accessing employees' work emails and accounts.
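As a rough illustration of the kind of filtering rule such an email policy might automate, the sketch below flags messages that come from outside a trusted domain, contain a link, and use typical lure phrases; the phrase list, trusted domain, and threshold are assumptions rather than a production-grade filter.

```python
# Heuristic spam/phishing flag; phrase list, trusted domain, and threshold
# are illustrative assumptions only.
import re

SUSPICIOUS_PHRASES = ("verify your account", "change password", "urgent", "click here")

def looks_suspicious(subject: str, body: str, sender_domain: str,
                     trusted_domains=("example.org",)) -> bool:
    text = f"{subject} {body}".lower()
    phrase_hits = sum(phrase in text for phrase in SUSPICIOUS_PHRASES)
    is_external = sender_domain.lower() not in trusted_domains
    has_link = bool(re.search(r"https?://", body, flags=re.IGNORECASE))
    return is_external and has_link and phrase_hits >= 1

print(looks_suspicious("2011 Recruitment Plan", "Please click here: http://evil.test", "mailer.test"))
```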

Authentication. To prevent social engineers from impersonating employees, an authentication process should include "something they have", such as biometric
information or their phone devices, in addition to the “something they know”,
such as username and password, as described in [33]. If organizations incorporate
such a policy, then employees are protected against many social engineering
attacks. There are many techniques to increase the security of the authentication
process, including two-factor authentication, using captcha, complex passwords,
specific IP address, open a service during certain hours, etc.
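As a concrete illustration of the "something they have" factor, the sketch below computes a time-based one-time password following the widely used TOTP scheme (RFC 6238); this is an illustrative example rather than a mechanism prescribed by the sources above, and the shared secret shown is a placeholder.

```python
# Time-based one-time password (TOTP, RFC 6238) as one realization of the
# "something they have" factor; the shared secret below is a placeholder.
import base64
import hashlib
import hmac
import struct
import time

def totp(shared_secret_b32: str, interval: int = 30, digits: int = 6) -> str:
    key = base64.b32decode(shared_secret_b32, casefold=True)
    counter = int(time.time()) // interval                  # current time step
    message = struct.pack(">Q", counter)                    # 8-byte big-endian counter
    digest = hmac.new(key, message, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                              # dynamic truncation
    code = (struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)

# The server and the employee's device share the secret; a login succeeds only
# if the submitted code matches the server-side value for the current window.
print(totp("JBSWY3DPEHPK3PXP"))
```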

Bring Your Own Device (BYOD). Many organizations allow their employ-
ees to use their own devices for work purposes to increase the efficiency and the
productivity of employees during the working hours. However, many employees
do not pay attention to the security risks associated with this BYOD policy.
Therefore, such employees need to be aware of any security risks that these
devices might pose. According to [12], the lack of understanding of BYOD by
organizations puts them at risk of losing control of their critical information
resources and assets. Hence, it is essential to ensure that these devices are not
compromising the confidentiality, integrity, and availability goals of information
security. This can be done by incorporating effective security and privacy policies
to manage BYOD.
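A hedged sketch of an automated compliance check that a BYOD policy could rely on is shown below; the device attributes and policy thresholds are assumptions for illustration only.

```python
# BYOD compliance check sketch; attributes and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Device:
    os_version: tuple
    disk_encrypted: bool
    screen_lock_enabled: bool

BYOD_POLICY = {"min_os_version": (13, 0), "require_encryption": True, "require_screen_lock": True}

def is_compliant(device: Device) -> bool:
    """Return True only if the personal device meets every policy requirement."""
    return (device.os_version >= BYOD_POLICY["min_os_version"]
            and (device.disk_encrypted or not BYOD_POLICY["require_encryption"])
            and (device.screen_lock_enabled or not BYOD_POLICY["require_screen_lock"]))

print(is_compliant(Device(os_version=(14, 2), disk_encrypted=True, screen_lock_enabled=True)))
```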

5.4 Network
Employees access databases and other servers through a network. The network
could be a Local-Area Network (LAN), Wide-Area Network (WAN), wireless
network or wired network, etc. Each network has a different security policy. For
example, if an employee is connecting to the LAN network of an organization, he
might have access to servers that he would not have access to if he is connecting
from his home network.
Moreover, most organizations nowadays allow VPN (Virtual Private Net-
work) or RDP (Remote Desktop Protocol) to allow their employees to access
the local network remotely. All of these different network security policies bring
security threats to organizations if the employees are not aware of them. For
example, if an employee accesses his/her organization’s local network using a
VPN from a public computer or his/her friend’s computer, if that computer is
compromised, then that employee puts his/her organization at risk. To protect organizations' networks from potential social engineering attacks, employees should be aware of the different Internet configurations as well as the different network security policies regarding VPN and RDP.

Internet Configuration. The Internet has enabled interconnection of different computer networks all over the world. Protecting confidential information
has been made especially challenging due to the ever-changing array of social
engineering tactics using the Internet. Thus, for any organization, it is vital to
secure its networks, including the wired and the wireless networks. Addition-
ally, employees should be aware of the different network configurations in their
organizations.

Remote Desktop Protocol (RDP) and Virtual Private Network (VPN). Although both RDP and VPN protocols support secure Internet communications and provide traffic integrity, confidentiality, and authentication, configuring them correctly remains a complex and error-prone task. Hence, it is important to ensure the
creation of a secure connection to an organization’s network from any host by
incorporating effective security policies. Such security policies need to be speci-
fied correctly in order to enforce access control and traffic protection appropri-
ately. Moreover, employees should be aware of these security policies and the
risks these protocols might bring to their organizations. Fu et al. [11] indicate
that RDP and VPN security policies require considering two levels: requirements
level and implementation level. The correctness of implementation level security
policy can be verified by checking satisfaction of requirement level security policy.
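The toy example below illustrates the two-level idea attributed to Fu et al. [11]: each requirement-level statement is checked against the implementation-level VPN rules. The rule format and field names are assumptions made for illustration, not the formalism used in that work.

```python
# Requirement-level statements checked against implementation-level VPN rules.
# Rule format and field names are illustrative assumptions.
REQUIREMENTS = [
    {"traffic": "vpn", "src": "remote_employee", "dst": "intranet", "must_be_encrypted": True},
]
IMPLEMENTATION_RULES = [
    {"traffic": "vpn", "src": "remote_employee", "dst": "intranet", "action": "allow", "encryption": "ipsec"},
]

def satisfies(requirement: dict, rule: dict) -> bool:
    same_flow = all(requirement[k] == rule[k] for k in ("traffic", "src", "dst"))
    encrypted_ok = (not requirement["must_be_encrypted"]) or rule.get("encryption") is not None
    return same_flow and rule.get("action") == "allow" and encrypted_ok

for requirement in REQUIREMENTS:
    assert any(satisfies(requirement, rule) for rule in IMPLEMENTATION_RULES), requirement
print("implementation-level policy satisfies all requirement-level statements")
```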

6 Threats to Validity
This section discusses the threats to validity for building the taxonomy and the
authors’ steps to minimize those threats.
To develop the taxonomy of the main target points and the various defense
mechanisms, the authors mainly relied on a comprehensive literature review. As
with any such review, some significant references could have gone unnoticed. To
minimize this threat, the authors examined papers with “Social Engineer” or
“Social Engineering” keywords in the title and read their abstracts, introduc-
tions, and conclusions. Moreover, this study also leveraged the Human Aspects
of Information Security Questionnaire (HAIS-Q) [28], SANS Awareness Sur-
vey, and the Essential Cybersecurity Controls of Saudi National Cybersecurity
Authority (2018).

7 Conclusion
Humans have become the weakest link in the security pipeline, and social engi-
neers are taking advantage of the knowledge gap in this area. Successful social
engineering attacks can be extremely damaging to organizations. The results


of such attacks can include the loss of reputation and public trust, legal ram-
ifications, loss of competitive edge, and financial damages. To that end, this
research conducted a large-scale study to develop a comprehensive taxonomy
of the social engineering defense mechanisms. Developing the taxonomy pre-
sented in this paper is considered the starting point towards the authors’ research
goal of preventing social engineering security attacks and increasing employees' awareness of these types of security attacks and their consequences. Future research involves measuring employees' awareness levels of the various defense mechanisms, as well as developing the necessary training sessions for employ-
ees and managers to educate them about the risk of social engineering security
attacks and their consequences.
Moreover, as another venue of future directions, the authors are developing
a set of social engineering security policies (SESPs) that organizations should
incorporate to prevent social engineering security attacks. The authors will apply
these SESPs through extensive surveys in public and private organizations to
measure their effectiveness in reducing social engineering attacks. The survey
results will be used to develop recommendations regarding translating those
written policies to technical processes within organizations.

Acknowledgment. The first author was supported by a generous fellowship from Shaqra University, Saudi Arabia. All errors and omissions are the responsibility of the authors alone.

References
1. Verizon 2018. 2018 data breach investigations report (2018)
2. Aldawood, H., Skinner, G.: An academic review of current industrial and commer-
cial cyber security social engineering solutions. In: Proceedings of the 3rd Inter-
national Conference on Cryptography, Security and Privacy, pp. 110–115. ACM
(2019)
3. Applegate, S.D.: Social engineering: hacking the wetware! Inf. Secur. J.: Glob.
Perspect. 18(1), 40–46 (2009)
4. Arachchilage, N.A.G., Love, S.: Security awareness of computer users: a phishing
threat avoidance perspective. Comput. Hum. Behav. 38, 304–312 (2014)
5. Aviv, A.J., Gibson, K.L., Mossop, E., Blaze, M., Smith, J.M.: Smudge attacks on
smartphone touch screens. Woot 10, 1–7 (2010)
6. Bakhshi, T., Papadaki, M., Furnell, S.: A practical assessment of social engineering
vulnerabilities. In: HAISA, pp. 12–23 (2008)
7. Berg, A.: Cracking a social engineer. LAN times (1995)
8. Beuran, R., Chinen, K., Tan, Y., Shinoda, Y.: Towards effective cybersecurity
education and training (2016)
9. Chitrey, A., Singh, D., Singh, V.: A comprehensive study of social engineering
based attacks in India to develop a conceptual model. Int. J. Inf. Netw. Secur.
1(2), 45 (2012)
10. Choi, M., Levy, Y., Hovav, A.: The role of user computer self-efficacy, cybersecurity
countermeasures awareness, and cybersecurity skills influence on computer misuse.
In: Proceedings of the Pre-International Conference of Information Systems (ICIS)


SIGSEC–Workshop on Information Security and Privacy (WISP) (2013)
11. Fu, Z., Wu, S.F., Huang, H., Loh, K., Gong, F., Baldine, I., Xu, C.: IPSec/VPN
security policy: correctness, conflict detection, and resolution. In: International
Workshop on Policies for Distributed Systems and Networks, pp. 39–56. Springer,
Heidelberg (2001)
12. Garba, A.B., Armarego, J., Murray, D., Kenworthy, W.: Review of the information
security and privacy challenges in bring your own device (BYOD) environments.
J. Inf. Privacy Secur. 11(1), 38–54 (2015)
13. Ghafir, I., Prenosil, V., Alhejailan, A., Hammoudeh, M.: Social engineering attack
strategies and defence approaches. In: 2016 IEEE 4th International Conference on
Future Internet of Things and Cloud (FiCloud), pp. 145–149. IEEE (2016)
14. Granger, S.: Social engineering fundamentals, part i: Hacker tactics. Security Focus,
18 December 2001
15. Greening, T.: Ask and ye shall receive: a study in “social engineering”. ACM
SIGSAC Rev. 14(2), 8–14 (1996)
16. Gupta, M., Sharman, R.: Social network theoretic framework for organizational
social engineering susceptibility index. In: AMCIS 2006 Proceedings, p. 408 (2006)
17. Hadnagy, C.: Social Engineering: The Art of Human Hacking. Wiley, Indianapolis
(2010)
18. Happ, C., Melzer, A., Steffgen, G.: Trick with treat-reciprocity increases the will-
ingness to communicate personal data. Comput. Hum. Behav. 61, 372–377 (2016)
19. Heartfield, R., Loukas, G.: A taxonomy of attacks and a survey of defence mecha-
nisms for semantic social engineering attacks. ACM Comput. Surv. (CSUR) 48(3),
37 (2016)
20. Herath, T., Rao, H.R.: Encouraging information security behaviors in organiza-
tions: role of penalties, pressures and perceived effectiveness. Decis. Supp. Syst.
47(2), 154–165 (2009)
21. Karakasiliotis, A., Furnell, S.M., Papadaki, M.: Assessing end-user awareness of
social engineering and phishing (2006)
22. Krombholz, K., Hobel, H., Huber, M., Weippl, E.: Advanced social engineering
attacks. J. Inf. Secur. Appl. 22, 113–122 (2015)
23. Manske, K.: An introduction to social engineering. Inf. Syst. Secur. 9(5), 1–7 (2000)
24. Medlin, B.D., Cazier, J.A., Foulk, D.P.: Analyzing the vulnerability of US hospitals to social engineering attacks: how many of your employees would share their
password? Int. J. Inf. Secur. Privacy (IJISP), 2(3), 71–83 (2008)
25. Mouton, F., Malan, M.M., Leenen, L., Venter, H.S.: Social engineering attack
framework. In: 2014 Information Security for South Africa, pp. 1–9. IEEE (2014)
26. Okoli, C., Schabram, K.: A guide to conducting a systematic literature review of
information systems research (2010)
27. Orgill, G.L., Romney, G.W., Bailey, M.G., Orgill, P.M.: The urgency for effective
user privacy-education to counter social engineering attacks on secure computer
systems. In: Proceedings of the 5th Conference on Information Technology Educa-
tion, pp. 177–181. ACM (2004)
28. Parsons, K., Calic, D., Pattinson, M., Butavicius, M., McCormac, A., Zwaans, T.:
The human aspects of information security questionnaire (HAIS-Q): two further
validation studies. Comput. Secur. 66, 40–51 (2017)
29. Shane, S., Schmidt, M.S.: Hillary Clinton emails take long path to controversy. The
New York Times (2015)
30. Siponen, M.T., Iivari, J.: IS security design theory framework and six approaches to the application of ISPs and guidelines. J. Assoc. Inf. Syst. 7(7), 445–472 (2006)
31. Stoner, J.A.F.: Risky and cautious shifts in group decisions: the influence of widely
held values. J. Exp. Soc. Psychol. 4(4), 442–459 (1968)
32. Thapar, A.: Social engineering: An attack vector most intricate to tackle. CISSP:
Infosec Writers (2007)
33. Thomas, K., Li, F., Zand, A., Barrett, J., Ranieri, J., Invernizzi, L., Markov, Y.,
Comanescu, O., Eranti, V., Moscicki, A., et al.: Data breaches, phishing, or mal-
ware?: understanding the risks of stolen credentials. In: Proceedings of the 2017
ACM SIGSAC Conference on Computer and Communications Security, pp. 1421–
1434. ACM (2017)
34. Workman, M.: A test of interventions for security threats from social engineering.
Inf. Manage. Comput. Secur. 16(5), 463–483 (2008)
A Trust Framework for the Collection
of Reliable Crowd-Sourced Data

Shiva Ramoudith(B) and Patrick Hosein

The University of the West Indies, St. Augustine, Trinidad and Tobago
shiva@lab.tt, patrick.hosein@sta.uwi.edu

Abstract. With increasing access to the Internet through a multitude


of devices, it is relatively easy to collect data from the public through
a process called crowd-sourcing. Unfortunately, this approach has two
problems: enticing users to contribute data, and determining if users
have provided accurate data. The problem of inaccurate responses is
compounded when a group of malicious users intentionally skew survey
results in order to benefit themselves in some way. The first issue can
be addressed through incentives, but this increases the need to solve
the second issue since it is likely that users would produce fake data
in order to obtain more of the offered incentives. We present a simple
trust framework for addressing the issue of data quality in surveys, and
illustrate its use with a real-world example. In this model, users are given
access to the platform only when trusted users on the platform vouch
for them. Although not foolproof, it does increase the quality of the
collected data. Access to the survey results is used to incentivize users.
We compared our solution to a traditional internet survey; we found that
our solution reduced the number of invalid submissions by 9.29%.

Keywords: Trust framework · Trust modelling · Survey platform ·


Social network · Data science · Data integrity

1 Introduction

Trust holds a key role in society. It is a multidisciplinary concept that has been
studied in sociology, economics, psychology, and computer science. Each of these disciplines has its own definition of trust. Broadly speaking, trust can be
defined as a measure of confidence that an entity would behave in an expected
manner [1]. It allows users to make decisions, sort and filter information, receive
recommendations, and develop a context within a community with respect to
whom to trust and why [2].
Networks such as Facebook, Twitter and Slashdot are examples of decentral-
ized environments where users are allowed to create and upload content at their
discretion [3]. This freedom can result in the fabrication of misinformation and
exploitation of systems. Creating a web of trust (a network where a link between two nodes or entities means that a trust decision has been made and the value of that
c Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 42–54, 2020.
https://doi.org/10.1007/978-3-030-39442-4_4

trust is known) is a common solution to this problem where an algorithm is used


to compute and predict the trust of unknown users [4].
Numerous works exist on computing trust in networks. Based on the type
of data being utilized, algorithms that compute trust can be classified into the
following categories: network structure based models [5], interaction based mod-
els [6], and hybrid models [1]. Network structure models emphasize representing
users on the network via a graph and using its properties (in-degree, out-degree,
eccentricity, radius, etc.) to compute trust. In contrast, interaction based models
would take factors such as the volume, frequency, and even the nature of inter-
actions among users into consideration when computing trust. Lastly, hybrid
models leverage the structural as well as the user interaction data when com-
puting trust [7]. In the literature, mathematical expressions are commonly used
to express and compute trust values for users within a network.
There are a multitude of algorithms used to compute trust, also known as
trust metrics. Trust metrics can be classified as either local or global [3]. Local
trust metrics would take into account the relationship between users and predict
trust for each user. On the other hand, global trust metrics take the trust of
a user from members within the network into account and formulate a global
reputation for this user, representing how the network as a whole perceives this
user [3]. A popular global trust metric is PageRank, used by Google. Local
trust metrics have numerous benefits: they are more tailored to each user and, compared to global trust metrics, they are better suited to preventing the trust propagation of malicious users (those with large numbers of trusts and distrusts), since such users do not influence the personalization of users who do not trust them explicitly. However, local metrics are computationally more expensive than global trust metrics as they have to be computed for each user within the network
[8]. Global trust metrics also have the potential to reach and represent a large
portion of users within a network, allowing for informed decisions to be made on
existing users or topics [8]. Regardless of trust metric classification, computing
trust has the following challenges: determining the key dimensions on which to
base trust, the dynamic environments which affect trust estimation and, the
verification and validation of trust models [9].
Computing distrust is as important as computing trust, yet it is commonly overlooked in many existing algorithms that model trust [10]. Real-world trust systems (e.g., Epinions and Slashdot) treat distrust among users as being as important as trust. Guha et al.'s [10] findings indicate that distrust information is very important when determining the trust of a particular user.
The concept of trust can be applied to Recommender systems and the Seman-
tic Web [2]. Examples of its applications are Epinions and Slashdot. Within the
Epinions ecosystem, users are encouraged to flag other users as trustworthy if they find their reviews useful, and as untrustworthy otherwise. This provides users with
higher quality recommendations on products and services to purchase. Slashdot
is a social news website that provides news on technology. Similarly as in the case
of Epinions, users have the ability to rate each other as friend or foe. Regarding

the Semantic Web, trust can be applied in scenarios such as information quality
assessment and Semantic Web service composition [2].
Introducing trust can have a positive impact on the confidence placed in
users' data. Internet-based surveys are quite common. They reduce the mailing/distribution costs of questionnaires, lower human error, and make it easier to reach people in varying geographic areas [11]. Despite these benefits, a host of issues emerge concerning the exploitation of internet-based surveys and the impact they can have on research conducted using flawed data [12]. Studies reveal that careless or insufficient-effort responses can account for 1–30% of respondents, with a modal rate of roughly 8–12% [12]. Even a small percentage of these
types of respondents can have an impact on measures of central tendency, spread
and reliability [13].
We provide an approach for increasing the quality of crowd-sourced data in online surveys. Section 2 presents an example of the problems faced with online surveys. Section 3 describes the proposed framework in detail. Finally, Sects. 4 and 5 give a discussion of the results (compared to a traditional online survey) and the conclusion.

2 The Trust Issue in Crowd-Sourced Data

Let us consider, for example, a simple survey created on Google Forms to


gather the download rate of the clients of various Internet Service Providers
(ISPs). After the survey is publicized, members within a population submit
their responses. To determine the validity of data, anomaly detection techniques
[14] (Z-score, Principal Component Analysis, proximity-based models, etc.) can
be used to identify irregularities among the responses. Suppose that a group of
users wishes to skew the results of a survey. Let us assume that they all fabri-
cate similar responses to the survey in order to make a particular ISP appear to
provide excellent performance. Outlier detection techniques cannot easily iden-
tify this type of malicious intent and there is no mechanism that exists in the
current survey distribution process to mitigate this. Furthermore, invalid sub-
missions decrease the number of valid responses. As a result, conclusions based
on analyses of the responses would be incorrect. Our proposed solution to this
problem is described in the following sections.

3 Framework for Creating Trust

Our approach to this problem of increasing the reliability of user responses in


surveys is through the establishment of a trust framework. This framework mimics a social network (a hybrid structure with a global trust metric, since we monitor both the connections among users and the nature of these connections), and each
user has the ability to perform certain actions (recommend users, trust users
and complete surveys).

3.1 User and Framework Interaction

The interactions between the user types (trusted and non-trusted) are illustrated
in Fig. 1. Note that there are no self loops.

Fig. 1. Trust framework: invited users register and verify their accounts; registered users become trusted once trusted users recommend and trust them; trusted users can invite new users, recommend users, complete surveys, and view statistics, while untrusted users cannot.

We assume that we are given a set of potential data providers (users), U, and
a set of surveys, S, which require responses. The following information is stored
about each user:

Ti : the trust level of the user.
Vi ∈ {0, 1}: whether or not the user has verified their account.
Ii ∈ {0, 1}: whether or not the user has created their profile.
Cij ∈ {0, 1}: 1 if Ui has completed survey Sj.
Lij : the trust level that Ui places on Uj.

Initially, we assume that a subset of these users, T , are trustworthy and that
they will provide valid data as well as recommend other users whom they trust.
When new users are added to the system, they are initially not trusted and
belong to the subset ¬T . In order for a new user to be considered trusted, they
must satisfy all of the following conditions:

1. Ti ≥ Tmin, where Tmin ≥ 6.0.
2. Ii = 1 (additional account information submitted).
3. Vi = 1 (user account verified).
As mentioned above, one of the criteria for a trusted user is having associations with other users on the platform that satisfy a minimum trust score. Let there be N such associations. Based on the value Lij of each association, the trust score Tj for user Uj is computed by summing Lij across all i. Once this value satisfies the predetermined constraint, in addition to the rest of the criteria, the user is considered trusted. Our intuition behind this trust framework is that a user who is considered trusted has a higher probability of submitting an accurate response to a survey than a user who is not trusted. Note that there is a trade-off between being too stringent in requesting recommendations and the ease with which the platform can be used (i.e., of becoming trusted). We varied parameters to find a suitable balance.
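In the notation above, the trust score of Uj is simply the sum of the trust levels placed on Uj by its N associates, and the trusted-user criteria can be summarized as:

Tj = Σ (i = 1, …, N) Lij,   with Uj trusted when Tj ≥ Tmin, Ij = 1 and Vj = 1.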
Users interact with each other on the platform through invitation or recom-
mendation. The degree by which a user’s trust level increases depends on the
person making the recommendation or invitation.
For a recommendation, a user can specify the degree of trust with values given
in Fig. 2. Note that recommendations can include negative points (e.g., when a
user knows that a registered or trusted user is in fact not trustworthy). Therefore
it is possible for a user to become untrustworthy and eventually be eliminated from the platform.

Fig. 2. Degrees of trust: close friend (+3 pts), friend (+2 pts), acquaintance (+1 pt), suspicious (−1 pt), untrustworthy (−2 pts).

For an invitation, a new user’s initial trust level is dependent on the trust level
of the user sending the invitation. When a new user Uk is added to the platform

by a trusted user Ui, the following variables are set: Tk = min{Ti − 3, 6}, Ik = 0, Vk = 0.
A new user can only be added to the system through a referral from an
already trusted user. Upon receiving an invitation email, the new user has lim-
ited access to the system. This user has to complete their profile information (so
other users can easily identify them) and verify their account. The verification
process involves using a script to send a randomly generated sequence of char-
acters to the user's mobile phone via SMS. Once the sequence of characters is entered correctly by the user, the account is considered verified. When the user has completed the two aforementioned tasks and satisfied the Tmin constraint, they are considered for transitioning from a non-trusted user to a trusted user. When a user is
considered highly trusted (multiple users have trusted them positively and their
trust level is ≥9), any user that they invite will have their trust points assigned
to 6. This would allow the new user to access the available surveys immediately
(once they submit their information and validate their user account).
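A minimal sketch of these invitation and verification rules follows, under the simplification that "highly trusted" is captured only by the trust-level condition (the variable and function names are ours, not the platform's).

# Sketch of the invitation/verification rules described above; names are illustrative.
T_MIN = 6.0           # minimum trust level required to become trusted
HIGHLY_TRUSTED = 9.0  # the text also requires multiple positive trusts, omitted here

def initial_trust_for_invitee(inviter_trust: float) -> float:
    # T_k = min(T_i - 3, 6); invitees of highly trusted users start directly at 6.
    if inviter_trust >= HIGHLY_TRUSTED:
        return 6.0
    return min(inviter_trust - 3.0, 6.0)

def is_trusted(trust_level: float, profile_complete: bool, verified: bool) -> bool:
    # A user is trusted once T_i >= T_min, I_i = 1 and V_i = 1.
    return trust_level >= T_MIN and profile_complete and verified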
Furthermore, a user may want to use the platform without knowing any existing users, and would therefore be unable to receive an invitation. On the login page of the application, such users can submit their email address and information pertaining to their user account. Trusted users within the platform receive intermittent emails about these potential users and can invite those whom they know. To prevent exploitation of this module, users are asked to complete a CAPTCHA1 in addition to submitting their information.

3.2 Malicious Users

There always exists the possibility of malicious users within social networks. As mentioned in the literature, global trust metrics are susceptible to malicious users, so we present some novel features within our framework that prevent such users from being easily introduced into the network and gaining trust quickly. Table 1 contains a list of common issues that were expected to occur with malicious users and the preventative measures built into the framework to mitigate them. These measures are by no means foolproof, but the difficulty in overcoming them may deter malicious users.
Furthermore, each recommendation made within the platform is stored and can be used to identify malicious users (as mentioned in the introduction) who are already trusted.
Let Pj denote the number of positive trusts (Lij ≥ 2.0) for Uj and let ¬Pj be the number of negative trusts (Lij ≤ −1.0) for Uj. One possible way of detecting an anomalous user is when |¬Pj − Pj| ≥ 3 and both counts cross some threshold (≥5). When this type of user is detected, they are flagged as not trusted and the administrator is able to discard their submissions from the respective survey(s).
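Written as a predicate (the function name is ours; the thresholds follow the text):

def is_anomalous(positive_trusts: int, negative_trusts: int) -> bool:
    # Flag U_j when |notP_j - P_j| >= 3 and both counts cross the threshold (>= 5).
    return (abs(negative_trusts - positive_trusts) >= 3
            and positive_trusts >= 5
            and negative_trusts >= 5)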

1 A CAPTCHA is a program that protects websites against bots by generating and grading tests that humans can pass but current computer programs cannot. http://www.captcha.net.

Table 1. Measures for handling vulnerabilities (vulnerability/issue: preventative measure)

Unauthorized user trying to access the system: the user must have an account to access services within the system.
User creating multiple accounts: each user in the system has a phone number and email address associated with them; these two identifiers are unique for each user.
Newly added user being able to trust other users and complete surveys: the system requires that the user be considered trusted by other trusted users before such actions can be performed.
User becoming trusted by simple association with many peers: a function computes the trust points of a user from the degree of each association among peers; in addition, the user's profile information must be completed and their account verified.
A trusted user recommending a series of users one after the other: a limit is placed on each user's account; they are limited to two recommendations per day.
An anomalous user bypassing the measures mentioned above and submitting inaccurate information: each survey within the framework has custom anomaly detection measures which flag undesirable entries; administrators have the ability to address these submissions (refer to Fig. 3).
A user submitting multiple responses: only the most recent submission is considered for statistics.
A user trusting other users multiple times or trusting themselves: measures exist within the platform to detect and prevent these scenarios; a user can only trust another user once and cannot trust themselves.


3.3 Framework Implementation and Survey Design

The trust framework was implemented using the Django framework2. A relational database is used for storing information concerning users, user relationships, user requests, surveys and user responses to surveys. User responses are stored as JSON3 objects within the relational system. Scheduled recurring tasks are created to log the number of trusted and non-trusted users in the system, as well as the number of users that completed each survey, on a daily basis. A cache is also implemented to reduce the time taken to display the overall statistics (average, standard deviation, etc.) for each survey.
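For concreteness, a minimal sketch of how such a schema might be declared in a Django models.py; the model and field names here are hypothetical and not the platform's actual code (JSONField requires Django 3.1+).

# Hypothetical Django models sketching the schema described above.
from django.db import models

class Participant(models.Model):
    email = models.EmailField(unique=True)                  # unique per user
    phone = models.CharField(max_length=20, unique=True)    # unique per user
    trust_level = models.FloatField(default=0.0)            # T_i
    profile_complete = models.BooleanField(default=False)   # I_i
    verified = models.BooleanField(default=False)           # V_i

class TrustLink(models.Model):
    # L_ij: the trust level that 'truster' places on 'trustee'
    truster = models.ForeignKey(Participant, related_name="trust_given", on_delete=models.CASCADE)
    trustee = models.ForeignKey(Participant, related_name="trust_received", on_delete=models.CASCADE)
    level = models.IntegerField()                            # +3 .. -2, as in Fig. 2

    class Meta:
        unique_together = ("truster", "trustee")             # a user can trust another only once

class SurveyResponse(models.Model):
    user = models.ForeignKey(Participant, on_delete=models.CASCADE)
    survey_id = models.IntegerField()
    payload = models.JSONField()                             # responses stored as JSON objects
    submitted_at = models.DateTimeField(auto_now_add=True)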

2 Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design. https://www.djangoproject.com.
3 JavaScript Object Notation (JSON) is a lightweight data-interchange format. https://www.json.org.

Fig. 3. Survey submission procedure: a submission flagged as invalid triggers the administrator to contact the user; if the user resubmits and the administrator verifies the new submission, it is included in the statistics, otherwise it is rejected and investigated.

Survey design also plays an important role in the collection of user informa-
tion. Obtaining information about a user’s ISP was the first survey administered
on our platform. We decided that a survey must be fairly short to complete and
it should require the least amount of user input while still gathering the most
information possible. This reduces the likelihood of users submitting inaccurate
or dishonest information. An example of this is with our first survey done on
ISPs. Instead of asking users to submit their ISP information (achieved rate,
advertised rate, ISP, etc.) we asked the user to submit their subscription rate
and the link to their speed test results (which gave us all required information).
The platform ensured that duplicate links were not submitted and verified that
the ISP identified in the link belonged to the set of ISPs we were considering for
the survey. Failure of either check results in a rejected submission. Users have the opportunity to resubmit.
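A sketch of the two checks mentioned above (duplicate speed-test links and membership of the surveyed ISP set); the link parsing that recovers the ISP is assumed to happen elsewhere, and the names here are illustrative.

ALLOWED_ISPS = {"ISP A", "ISP D", "ISP F"}   # the ISPs considered in the survey
seen_links = set()                            # speed-test links already submitted

def validate_submission(speedtest_link: str, isp_from_link: str) -> bool:
    # Reject duplicate links and links whose ISP is outside the surveyed set.
    if speedtest_link in seen_links or isp_from_link not in ALLOWED_ISPS:
        return False
    seen_links.add(speedtest_link)
    return True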
In order to create incentive for users to complete surveys, we created a survey
statistics module to display useful information regarding a user’s submission and
other users. In the case of the ISP survey, we provided statistics that allowed a
user to gain a sense of the performance associated with their ISP and compare
this to competing ISPs. Statistics can be highly customized for each survey.

3.4 Benefits of Framework


As mentioned previously, the framework’s goal was to collect reliable survey data.
Reliable data would have a positive impact on measures of central tendency and
spread. Analysis of the data would give a more truthful representation of the
survey participants and by extension, the population. In a typical survey plat-
form, users are not likely to have access to any statistical reports immediately
after completing a survey. The feature of the framework that gives users imme-
diate access to statistical reports has the potential to entice users. Lastly, the

framework allows customized statistics to be computed, thus helping users make informed decisions about whatever the survey addresses.

4 Comparison of Traditional Survey and Trusted Survey


There were no publicly available datasets that represented trust in the context
of gathering survey data. As a result, we decided to collect this data ourselves.
We invited some users to our trust platform where the ISP survey was imple-
mented. Additionally, we conducted a similar survey using Google Forms which
was posted on a social media platform. This was done for comparison purposes.
The same information was collected by both approaches. The survey imple-
mented on our trust platform, however, was able to collect more data without
any additional user input. All of the user recommendation data was considered, and no user within our network displayed traits of being anomalous. As mentioned above, the proposed framework monitors the numbers of trusted and non-trusted users. In initializing the process, however, we had to make minor changes to the framework, and this prevented us from collecting that data.
We first investigated outliers by using the ratio of the provided achievable
rate (determined from a speed test) and the subscription rate. If this ratio was
less than 0.5 or greater than 2.0, then the sample was deemed an outlier. The
statistics for these outliers for the traditional survey and the trust framework
survey are provided in Fig. 4. We find a significantly larger number of these
outliers in the traditional survey.
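The outlier rule reduces to a one-line check on the ratio of achieved to subscribed rate:

def is_outlier(achieved_rate: float, subscription_rate: float) -> bool:
    # A sample is an outlier when achieved/subscribed falls outside [0.5, 2.0].
    ratio = achieved_rate / subscription_rate
    return ratio < 0.5 or ratio > 2.0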
We used the expected cost to achieve 50 Mbps to compare ISPs. The cost
function is obtained through a non-linear regression model using the pricing
plans of the ISP as the sample points [15]. Let Pq (where q is the ISP) be the estimated price that a user would need to pay to achieve a speed of 50 Mbps. Pq is computed based on the following functions derived from the regression analysis for ISPs A, D and F, respectively:

PA(ri) = −319.6 + 177.5 ln(ri)                (1)
PD(ri) = 271 + 0.333ri + 0.0095ri²            (2)
PF(ri) = 182.5 + 1.72ri − 0.000617ri²         (3)
Let
μq = the average price to achieve a download rate of 50 Mbps for ISP q,
σq = the standard deviation of these samples,
σq /μq = the corresponding coefficient of variation.
We provide these results for the traditional survey in Table 2 and the trusted
platform survey in Table 3. We find a significant increase in the Coefficient of
Variation for the traditional survey indicating that the data is potentially less
reliable.
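As a sketch, the per-ISP statistics can be reproduced by evaluating Eqs. (1)–(3) on each user's rate sample and aggregating; the sample values below are placeholders, not the collected responses.

import math
import statistics

# Price models from Eqs. (1)-(3), keyed by ISP.
PRICE = {
    "A": lambda r: -319.6 + 177.5 * math.log(r),
    "D": lambda r: 271 + 0.333 * r + 0.0095 * r ** 2,
    "F": lambda r: 182.5 + 1.72 * r - 0.000617 * r ** 2,
}

def isp_stats(isp: str, rates):
    # mu_q, sigma_q and sigma_q/mu_q over the prices implied by each rate sample.
    prices = [PRICE[isp](r) for r in rates]
    mu = statistics.mean(prices)
    sigma = statistics.stdev(prices)
    return mu, sigma, sigma / mu

print(isp_stats("F", [45.0, 50.0, 60.0]))  # placeholder samples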
Next, we removed the outliers from both datasets, and recomputed the rel-
evant statistics. These are provided in Table 4 for the traditional survey and in
Table 5 for the trust platform survey. Here we find that both surveys have similar coefficients of variation.

Fig. 4. Number of submissions and number of outliers for the survey conducted on the two platforms (Google Form survey vs. trust framework survey).

Table 2. Statistics for the traditional survey

Metric ISP F ISP D ISP A


μq 339.17 555.81 NA
σq 120.03 706.93 NA
σq /μq 0.3539 1.2719 NA

Table 3. Statistics for the trust platform survey

Metric ISP F ISP D ISP A


μq 302.35 377.89 336.24
σq 41.39 143.30 36.52
σq /μq 0.1369 0.3792 0.1086

4.1 Analysis
We find that there were 9.29% fewer invalid submissions from the trusted platform survey compared with the traditional survey. Given the nature of the survey we conducted, it was possible for a user's invalid response to be nullified and a new response submitted (only the most recent submission is considered). If we determined that a user submitted invalid data, we attempted to contact them via email; however, we did not receive any responses from such users in the traditional survey. This may be due to users supplying invalid email addresses. Similarly, for

Table 4. Statistics for the traditional survey without outliers

Metric ISP F ISP D ISP A


μq 295.06 320.19 NA
σq 46.88 11.14 NA
σq /μq 0.1589 0.0348 NA

Table 5. Statistics for the trust platform survey without outliers

Metric ISP F ISP D ISP A


μq 302.35 318.01 336.24
σq 41.39 19.46 36.52
σq /μq 0.1369 0.0612 0.1086

the trust platform survey, all users with irregular submissions (4 in total) were
contacted. These invalid submissions in the trusted platform were determined
to be truthful after contacting the relevant users and asking them to retake
the survey. There was no significant change when compared to their original values. However, since we know that these responses are valid, the explanation may be that an ISP has incorrectly provisioned a client. One user decided to contact their ISP (ISP D)
and the company realized they were provisioning the user according to an out-
of-date plan. The user then retook the survey, and this affected the properties
measured positively (decrease in μq , σq and σq /μq ). The rest of the users are
still in liaison with their ISPs (all of them have ISP D as their provider), and we
have decided to classify their submissions as invalid until they get confirmation
from their ISP on their subscription rate since they may also be victims of the
provisioning issue.
Tables 2 and 3 show the metrics using all data collected from each platform.
We notice that σ/μ decreases by an average of 74.39% across the ISPs in the trusted case when compared with the traditional case. The differences in σ/μ are quite large, and this reveals that the invalid data present in the traditional survey misrepresent the ISPs. In particular, for ISP D only, we notice a substantial increase in μ, indicating that users might be trying to skew the
results concerning this ISP. This also indicates that there is less variability in
the data gathered from users in the trusted case versus the traditional case.
Furthermore, Tables 4 and 5 show the metrics using only valid responses.
Interestingly, σ/μ values are comparable for both approaches. Our platform has
the ability to easily flag submissions and link them back to a user. The submission
can then be inspected by the administrator, potentially be converted to a valid
submission and then included as part of the statistical results. Moreover, our
platform is also able to collect additional information from the user, such as their ping, upload rate and IP address, without requiring any additional input.

The trusted platform was able to give a unique insight into the ISP survey.
Users can compare their service against competing ISPs using our metrics. On
the other hand, the traditional survey was only able to provide a basic level of
statistical reporting to users such as the number of people who are associated
with an ISP and the number of users that took part in the survey.

5 Conclusions and Future Work


A trust framework was developed and implemented to address the issue of data quality in online surveys. The data collected by our trust platform is considered to be more trustworthy than that of a similar survey conducted using Google Forms. Our platform also demonstrated the potential to convert responses that were initially determined to be invalid into valid responses. Future work includes allowing trust to be altered over time among the users and improving the detection of trusted users who have malicious intent. The ISPs have also expressed an interest in the approach and we will be assisting them in improving their performance in the survey.

References
1. Sherchan, W., Nepal, S., Paris, C.: A survey of trust in social networks. ACM
Comput. Surv. 45(4), 47:1–47:33 (2013). https://doi.org/10.1145/2501654.2501661
2. DuBois, T., Golbeck, J., Srinivasan, A.: Predicting trust and distrust in social
networks. In: 2011 IEEE Third International Conference on Privacy, Security, Risk
and Trust and 2011 IEEE Third International Conference on Social Computing,
pp. 418–424, October 2011
3. Massa, P., Avesani, P.: Trust-aware recommender systems. In: Proceedings of the
2007 ACM Conference on Recommender Systems, RecSys 2007, pp. 17–24. ACM,
New York (2007). https://doi.org/10.1145/1297231.1297235
4. Artz, D., Gil, Y.: A survey of trust in computer science and the semantic web.
Web Semant. Sci. Serv. Agents World Wide Web 5(2), 58–71 (2007)
5. Wang, Y., Vassileva, J.: Bayesian network-based trust model in peer-to-peer net-
works. In: Proceedings of the Workshop on Deception, Fraud and Trust in Agent
Societies, pp. 57–68. Citeseer (2003)
6. Bhattacharya, R., Devinney, T.M., Pillutla, M.M.: A formal model of trust based
on outcomes. Acad. Manag. Rev. 23(3), 459–472 (1998)
7. Zhao, K., Pan, L.: A machine learning based trust evaluation framework for online
social networks. In: 2014 IEEE 13th International Conference on Trust, Security
and Privacy in Computing and Communications, pp. 69–74, September 2014
8. Massa, P., Avesani, P.: Trust metrics on controversial users: balancing between
tyranny of the majority. Int. J. Semant. Web Inf. Syst. (IJSWIS) 3(1), 39–64
(2007)
9. Cho, J.H., Chan, K., Adali, S.: A survey on trust modeling. ACM Comput. Surv.
48(2), 28:1–28:40 (2015). https://doi.org/10.1145/2815595
10. Guha, R., Kumar, R., Raghavan, P., Tomkins, A.: Propagation of trust and
distrust. In: Proceedings of the 13th International Conference on World Wide
Web, WWW 2004, pp. 403–412. ACM, New York (2004). https://doi.org/10.1145/
988672.988727

11. Roztocki, N.: Using internet-based surveys for academic research: opportunities
and problems. In: Proceedings of the 2001 American Society for Engineering Man-
agement (ASEM) National Conference, pp. 290–295 (2001)
12. Curran, P.G.: Methods for the detection of carelessly invalid responses in survey
data. J. Exp. Soc. Psychol. 66, 4–19 (2016)
13. Curran, P., Hauser, D.: Understanding responses to check items: a verbal protocol analysis. Paper presented at the 30th Annual Conference of the Society for Industrial and Organizational Psychology, Philadelphia, PA (2015)
14. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Com-
put. Surv. (CSUR) 41(3), 15 (2009)
15. Ramoudith, S., Hosein, P.: A metric for the fair comparison of ISPs (To be pre-
sented at the Sixth International Conference on Internet Science in December 2019)
Sentiment Analysis for University
Students’ Feedback

Nguyen Thi Phuong Giang1(&), Tran Thanh Dien2,


and Tran Thi Minh Khoa1
1
Industrial University of HCM City-IUH, HCMC, Vietnam
nguyenthiphuonggiang@iuh.edu.vn, ttmk84@gmail.com
2
Ngo Si Lien High School, Ho Chi Minh City, Kien Giang Province, Vietnam
thanhdien1984@gmail.com

Abstract. Student feedback has recently become an essential part of higher education institutions, such as universities, in Vietnam. This feedback may concern lectures, facilities, the curriculum, and how to improve them. It is usually collected and processed manually at the end of each semester. We propose a system that is able to categorize student feedback automatically, saving time, human resources and money for higher education institutions. First, we collected two years of university student feedback and organized it into three classes: Positive, Negative and Neutral, building a Vietnamese sentiment dataset of 5,000 classified sentences. We then applied three classifiers, Naïve Bayes, Maximum Entropy and Support Vector Machine, to our annotated data. The results show that the Maximum Entropy classifier outperforms Naïve Bayes and Support Vector Machine, with a best accuracy of 91.36%. With this level of accuracy, we can confidently use the results to develop a student feedback system that detects students' opinions. Based on negative and positive opinions, lectures, facilities and the curriculum can be adjusted and improved, raising the quality of the university over the years.

Keywords: Sentiment analysis · Opinion mining · Student feedback

1 Introduction

Opinion mining can solve classification problems over data such as customer reviews, automatic suggestions based on a user's purchase history, and so on. Kieu and Pham [1] use sentiment analysis of computer product reviews to help sales managers decide which products receive the most positive responses. Most organizations that use opinion mining are e-commerce businesses in retail, travel and similar sectors, especially in Vietnam. Meanwhile, improving education is becoming a top priority: the evolution of education is at the heart of a nation, which is why we need to point out the weak spots in the way we deliver knowledge to students.
In recent years, every higher education institution has had its own way of collecting student feedback. Some use paper polls, others do it online. Most polls consist of yes/no questions or rating scales, which are easy to process but not very informative. These kinds

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 55–66, 2020.
https://doi.org/10.1007/978-3-030-39442-4_5

of polls prevent students from expressing their opinions freely. With full-text responses, on the other hand, summarizing takes a lot of time, since someone must read them one by one and take notes.
Feedback is a helpful source for any institution that wants a better view of its overall education. There are two types of feedback: feedback from lecturers to students, which supports the students' self-improvement, and feedback from students to lecturers, which guides lecturers towards teaching the course in ways students can understand better. Most research focuses on feedback from the student side, since students have the higher expectations of education. Some sentiment analysis experiments, such as [2–5], point out that digital methods such as web forms and social media are better than older approaches such as paper forms or direct communication in class. Students can express their feelings and opinions about any aspect of their studies, with or without revealing their identity.
In our experiments, we propose a student feedback model based on classification algorithms that analyzes content and assigns a sentiment of positive, negative or neutral. We compare three different classifiers, Naive Bayes (NB), Maximum Entropy (MaxEnt) and Support Vector Machines (SVM), which have been used in sentiment analysis for many years. Our goal is to find the best classifier for our Vietnamese educational sentiment data.
Section 2 presents the foundation of this work: papers, articles and research related to opinion mining. Section 3 describes the data construction process from raw data to an annotated dataset; we choose three labels for our data: positive (POS), negative (NEG) and neutral (NEU). In Sect. 4, we build the classifier system for our experiments and explain how the features work with it. Section 5 covers the experiments and our results on the annotated student feedback data, where our error analysis tool lets us examine in detail which features contribute most to mislabeled results.

2 Related Work

The COPUS observation protocol [6] is used to assess the teaching process in United States higher education institutions. Students have 13 codes and faculty have 12 codes with which to express whether they agree or disagree with the methods and activities in class. This protocol was tested by Achen and Lumpkin (2015) [4] in combination with ordinary student feedback. Besides the COPUS codes, students and faculty can also give quantitative responses. The results from both protocols help faculty see which aspects of their lectures are more effective. The downside is that the quantitative responses are not processed automatically, unlike the COPUS protocol.
Delen [7] used machine learning techniques to determine which freshmen are likely to drop out based on a range of factors. His study mainly focuses on comparing machine learning algorithms, such as an SVM classifier and a neural network, on two-class datasets. The dataset is quite complex, with many attributes that can be continuous or binary. SVM achieves an accuracy of 81.18%, with per-class accuracies above 74% on the balanced dataset. A sensitivity analysis detects interactions between factors and identifies the important attributes in the dataset.
In Vietnam, Duyen et al. [8] use hotel reviews from the agoda.com site to compare three models: Naive Bayes, Maximum Entropy and SVM. Their results show that SVM with word-based and unigram features gives better results, and that the overall score plays a very important role in predicting sentiment. As attention to the education field rises, we see more studies related to education in Vietnam. Phuc and Phung [5] use a Naive Bayes classifier to determine the main subject of student messages, using POS features to handle Vietnamese text. Based on the classifier's results, their system moves misplaced messages to the right topics in the school forum.

3 Vietnamese Students’ Feedback Data

Our student feedback data were collected from a university in Vietnam and cover two years, 2017 and 2018. After removing all junk, useless and duplicated sentences, we had more than 5,000 raw sentences. The data were then annotated with three labels, positive (POS), negative (NEG) and neutral (NEU), by two annotators with a good background in Vietnamese linguistics. We divide this stage into two steps.
(1) Step 1:
In the first step, we built the annotation guideline. Following this document, an annotator can label each sentence easily. We randomly chose 100 sentences from the raw data and let each annotator assign them POS, NEG or NEU. This early work also helped us find conflicts between the annotators. All data attributes were added to the corpus guideline so that any similar labelling disputes in the future can be resolved based on its rules. Because the feedback is written from the students' perspective, student annotators can assign sentences more accurately. The annotator agreement at this step was 88%, with the positive and negative classes causing less disagreement than Neutral; recommendation or refinement sentences are not really considered negative.
(2) Step 2:
In this step, the rest of the raw data were processed. Table 1 shows our data as D1: the negative class is the largest with 69.4% of sentences, and the positive class comes second with 26.38%. Since all three classification models are based on statistics, we can foresee that this imbalance poses an obstacle for any statistical algorithm. Researchers usually balance the number of sentences per label so that models perform fairly; however, we do not do so here. We want to try the raw data first and then, based on the results, decide whether balancing is necessary.
We also built two more cross-domain datasets with more complex structure and meaning. The first, D2 in Table 1, consists of comments on two vnexpess.net articles and contains 1,013 sentences. We use D2 to see how the models perform with a small dataset that has fewer misspelled words and relates to two subjects. The second, D3, consists of 3,477 Facebook status sentences. This data covers more than one field and contains more abbreviations and emoji; it is not as large as our main dataset but is bigger than D2. Social opinion mining has become popular in recent years, but big companies such as Facebook and Google mainly use neural networks. D2 and D3 were processed through the same two steps as D1 and annotated by the same two annotators. D2 is the smallest dataset, with 1,013 sentences, which may make the training stage less accurate; moreover, D2 is cross-domain data. The three classes of D3 are fairly balanced, with 36.54%, 37.14% and 26.32% for POS, NEG and NEU, respectively.

Table 1. Annotated datasets


Dataset POS NEG NEU Total
D1 1319 3470 211 5000
D2 86 682 245 1013
D3 1270 1291 915 3476

Fig. 1. The process of building an opinion mining system for Vietnamese students’ feedback.

We use D1 as the main dataset for comparing the performance of the three models. After that, we run tests on D2 and D3 with the model that performs best overall on D1, keeping the same feature settings. The D1 and D2 results are used to examine how the amount and domain of the data affect our models.

4 Methodology

Figure 1 shows the complete process of our sentiment system for the student feedback data. In natural language processing, data can be processed at the word, sentence or document level. For student feedback, we consider analysis at the sentence level to carry enough information, since most students use only one or two sentences to express their opinions. In our experiments, we focus on single-sentence feedback.

Preprocessing includes splitting paragraphs into sentences, removing the junk text “Không có ý kiến” (meaning “No opinion”) and replacing lecturers' names with “A” to make sure there is no personal information in our data. After that, the data are sent to the annotators, who assign them to classes. We use three types of label: positive, negative and neutral. Most sentiment research uses only positive and negative because this makes the classification problem easier to solve [9]. Neutral can mean that the writer has no opinion or that the sentence is subjective. Since n-grams are our main feature, neutral features can also hold back the effect of strong positive (or negative) features on the end results [9]. In addition, our dataset could be used in other educational studies to determine whether sentences are objective or subjective. In the end, the positive and negative labels are the most useful in most use cases.
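A minimal sketch of this preprocessing step, assuming a simple punctuation-based sentence splitter and a placeholder list of lecturer names:

import re

JUNK = {"Không có ý kiến"}              # "No opinion" responses are dropped
LECTURER_NAMES = ["Nguyễn Văn B"]       # placeholder; real names are anonymised to "A"

def preprocess(feedback: str):
    # Split a feedback paragraph into sentences, drop junk text and mask lecturer names.
    sentences = [s.strip() for s in re.split(r"[.!?]+", feedback) if s.strip()]
    cleaned = []
    for s in sentences:
        if s in JUNK:
            continue
        for name in LECTURER_NAMES:
            s = s.replace(name, "A")
        cleaned.append(s)
    return cleaned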
Each input sentence is represented by two parts: a label and its features [10]. There is one label (POS, NEG or NEU) but many features, and any classification model needs the right features in addition to the data. Instead of using only one type of n-gram, we combine unigrams, bigrams, trigrams and 4-grams; each of these features can thus consist of one word or more, depending on our configuration.
Because 92% of long sentences (above 20 words) are negative, we also use sentence length as a feature. The first feature type is word context, a bag of three word tokens: the previous word, the word itself and the next word [11]; the word itself can be one word or more depending on the tokenization process. The second feature type is the length of the sentence, which is helpful in many cases.
A classifier such as Naive Bayes, Maximum Entropy or Support Vector Machine matches features with suitable labels based on their statistics. In the training process, the statistical scores are built from the input training data; in the testing process, the model labels the features of the input data using what it learnt in the previous stage. The result is a sentence associated with a set of label-feature scores. We use a confusion matrix to analyze the performance of each classification model.

4.1 Classification Methods


We use three opinion mining techniques: Naive Bayes, Maximum Entropy and Support Vector Machine. All three algorithms are statistical, which means that the input data plays a significant role in each model.
Naïve Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. The algorithm is simple and easy for beginners but good enough for most uses; its drawback is that it cannot capture interactions between features [12].
Multinomial Logistic Regression, or Maximum Entropy (MaxEnt), is a classification method that generalizes logistic regression to multi-class problems with more than two possible discrete outcomes. This classifier is commonly used as an alternative to Naive Bayes because it does not assume statistical independence of the random features that serve as predictors. However, learning is slower than for Naïve Bayes when the number of classes is substantial [12]; we use only three classes, so speed is not a problem.
Support Vector Machines (SVM) belong to a family of generalized linear models which reach a classification or regression decision based on the value of a linear combination of features [7]. SVM can easily be improved by using a dictionary specific to our field and by choosing the right kernel [13].
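Purely for illustration, the same three model families can be sketched in Python with scikit-learn (this is not the toolchain, Datumbox, libSVM/SVMlight and the Stanford Classifier, used in our experiments):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression   # Maximum Entropy analogue
from sklearn.naive_bayes import MultinomialNB         # Naive Bayes
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC                     # linear Support Vector Machine

def build_models():
    # Three pipelines over combined 1- to 4-gram count features, mirroring the
    # NB, MaxEnt and SVM classifiers compared in this paper.
    def vec():
        return CountVectorizer(ngram_range=(1, 4))
    return {
        "NB": make_pipeline(vec(), MultinomialNB()),
        "MaxEnt": make_pipeline(vec(), LogisticRegression(max_iter=1000)),
        "SVM": make_pipeline(vec(), LinearSVC()),
    }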

4.2 Features
As features, we use word n-grams, previous/next words and sentence length. For example, the sentence “thầy dạy rất tuyệt vời” (meaning “The lecturer teaches amazingly”) is split into “thầy”, “dạy”, “rất”, “tuyệt vời”. In Vietnamese, “tuyệt vời” is a single meaningful word and cannot be tokenized into smaller units. Table 2 lists the 19 features obtained from this sentence. The word n-gram features are the unigrams, bigrams, trigrams and 4-grams in rows 5 to 14 of Table 2, and each word feature is also used as a previous-word (or next-word) feature. The sentence length value can be 1 to 10, 11 to 20, 21 to 30 or above 30; this example has length 5. A sketch of this feature extraction is given after Table 2.

Table 2. Features example


Feature Type
1 B-thầy Previous words
2 B-thầy dạy Previous words
3 B-thầy dạy rất Previous words
4 B-thầy dạy rất tuyệt vời Previous words
5 thầy Unigram
6 thầy dạy Bigram
7 thầy dạy rất Trigram
8 thầy dạy rất tuyệt vời 4-gram
9 dạy Unigram
10 dạy rất Bigram
11 dạy rất tuyệt vời Trigram
12 rất Unigram
13 rất tuyệt vời Bigram
14 tuyệt vời Unigram
15 E-tuyệt vời Next words
16 E-rất tuyệt vời Next words
17 E-dạy rất tuyệt vời Next words
18 E-thầy dạy rất tuyệt vời Next words
19 Len-21-30 Length
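A minimal sketch of the feature extraction illustrated in Table 2, assuming the sentence is already tokenised into Vietnamese word units (e.g. “tuyệt vời” as a single token):

def extract_features(tokens, max_n=4):
    # Word n-grams (n = 1..4), previous-word (B-) and next-word (E-) context features,
    # and a bucketed sentence-length feature, following the example in Table 2.
    feats = []
    n_tokens = len(tokens)
    for i in range(1, min(max_n, n_tokens) + 1):          # B- features: sentence prefixes
        feats.append("B-" + " ".join(tokens[:i]))
    for n in range(1, max_n + 1):                          # word n-grams
        for i in range(n_tokens - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    for i in range(1, min(max_n, n_tokens) + 1):          # E- features: sentence suffixes
        feats.append("E-" + " ".join(tokens[-i:]))
    if n_tokens <= 10:                                     # sentence-length bucket
        feats.append("Len-1-10")
    elif n_tokens <= 20:
        feats.append("Len-11-20")
    elif n_tokens <= 30:
        feats.append("Len-21-30")
    else:
        feats.append("Len-30+")
    return feats

# extract_features(["thầy", "dạy", "rất", "tuyệt vời"]) yields the B-, n-gram and E-
# features of Table 2 plus a length bucket.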

5 Experiments and Results

5.1 Experiment Settings


For Naive Bayes, we use the Datumbox framework [14]. This framework supports many algorithms, but we only use its Naive Bayes implementation, which is sufficient for basic use. For SVM, libSVM [15] is a popular open-source Support Vector Machine library with flexible configuration and many kernels; we choose Joachims's SVMlight kernel [13] because of its fast optimization algorithm. For Maximum Entropy, we choose the Stanford Classifier, which uses Maximum Entropy as its main algorithm [16] and has been shown to give better results than Naive Bayes and SVM classifiers on an English dataset. The Stanford Classifier also comes with many feature tweaks, but we apply the same feature settings to all three models.
To estimate the performance of the classification models, we use k-fold cross-validation with k = 10. With 5,000 labelled sentences, this gives 500 sentences in each fold. Empirical studies by Kohavi [17] suggest that 10 is close to the optimal number of folds. In 10-fold cross-validation the entire dataset is divided into 10 mutually exclusive folds; each fold is used once to test the performance of the model trained on the combined data of the remaining nine folds.
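Again only as an illustration (with scikit-learn rather than the tools above), the 10-fold protocol can be expressed as follows:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

def evaluate(sentences, labels):
    # 10 mutually exclusive folds (~500 of the 5,000 labelled sentences each);
    # each fold is used once as the test set for a model trained on the other nine.
    model = make_pipeline(CountVectorizer(ngram_range=(1, 4)),
                          LogisticRegression(max_iter=1000))
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(model, sentences, labels, cv=folds).mean()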

5.2 Results
Based on the 10-fold cross-validation, Table 3 summarizes the results of the three classifiers in our study. MaxEnt obtains the best overall score with 91.36%, followed by Naïve Bayes with 88.00%. Surprisingly, SVM has the lowest accuracy, at 78.45%. Some previous studies show that SVM classifiers stand out in binary settings, i.e., with positive and negative classes only; indeed, SVM obtains the highest negative-class accuracy with 98.56%, but it is the only model that fails completely on the neutral class.

Table 3. Classifier results for 10-fold cross validation


Classifier models      Naïve Bayes   MaxEnt    SVM
Overall accuracy       88.00%        91.36%    78.45%
Per-class  POS         81.70%        84.10%    40.91%
           NEG         98.28%        96.83%    98.56%
           NEU         09.53%        23.81%    00.00%

Although MaxEnt's neutral-class accuracy is not acceptable for real-life use, its positive and negative accuracies are fairly balanced.
Tables 4, 5 and 6 show the confusion matrices of the three classifier models. With only 4.22% neutral sentences, we do not expect any classifier to achieve an accuracy above 50% on the neutral class. Table 4 shows that when Naïve Bayes misclassifies POS and NEU sentences, it labels them NEG in 100% and 94.74% of those cases, respectively.

Table 4. Naïve Bayes confusion matrix


Naïve Bayes POS NEG NEU
POS 70.45% 29.55% 00.00%
NEG 01.73% 98.27% 00.00%
NEU 04.76% 85.71% 09.53%

In Table 5, MaxEnt has lower accuracy on the NEG class but the best result on the NEU class. Although MaxEnt makes mistakes similar to those of Naive Bayes for POS and NEU, it has the lowest rate of mislabeling NEU as NEG (81.25% of its NEU errors) among the three models. Table 6 confirms that SVM can do well on a binary dataset but fails on the NEU class: it cannot classify even one neutral case correctly.

Table 5. MaxEnt confusion matrix

MaxEnt  POS      NEG      NEU
POS     84.09%   04.43%   11.48%
NEG     02.88%   96.83%   00.29%
NEU     14.29%   61.90%   23.81%

Table 6. SVM confusion matrix

SVM     POS      NEG      NEU
POS     40.91%   59.09%   00.00%
NEG     01.44%   98.56%   00.00%
NEU     27.27%   72.73%   00.00%

Along with our own data, we ran the MaxEnt model on the newspaper comments (D2) and the social media dataset (D3) from Sect. 3. As seen in Table 7, D2 achieves 67.29% overall accuracy and D3 scores 60.25%. The D2 result is lower than in our main experiment, which may be caused by the small dataset size, but it is still impressive with 92.75% accuracy on the NEG class; this can be explained by the narrow domain coverage of the dataset. D3 covers a much more complex field than any other dataset here, since Facebook posts are truly cross-domain. Although D3 does not have the best overall score, it has the most even per-class accuracies, thanks to an almost balanced dataset.

Table 7. Maxent model results for 10-fold cross validation of three different datasets
Dataset                D1       D2       D3
Overall accuracy       91.36%   67.29%   60.25%
Per-class  POS         84.10%   12.50%   58.27%
           NEG         96.83%   92.75%   63.56%
           NEU         23.81%   25.00%   57.61%

5.3 Error Analysis


The Stanford research on a balanced English dataset (Reuters) [12] already suggested that MaxEnt would outperform the other two models. Our unbalanced data may make this experiment statistically less rigorous, but it is practical and acceptable, because in most cases we only care about positive or negative feedback from students.
In this section, we analyze the MaxEnt errors to find out which parts of our model lead to mislabeling. Naive Bayes and SVM have lower overall accuracy, so they are ignored from here on. From the MaxEnt confusion matrix above, we have 48 misclassified cases, which our analysis attributes to three types of causes.
• Unbalanced dataset: 18 cases are caused by the high density of negative sentences in the training data. Because of the amount of data in the negative class, a feature that appears very often is likely to obtain a higher negative score than usual. On the other hand, if a feature never appears in the training data, the final score of the sentence depends only on the other words. For example, in “cách giảng dạy thấm thía vào học sinh” (meaning “the teaching method is understandable for students”), the word “thấm thía” (easy to understand) is misspelled, but in real life we accept it. We only have 4.22% NEU sentences in the dataset, so cutting all three classes down to equal sizes is not practical. We could instead combine the positive and neutral classes and then balance these two classes; this is much easier and more effective than taking all samples of the minority class (Neutral) and randomly selecting an equal number of samples from the majority class, since a Neutral class that is too small can pull down the accuracy of the other classes.
• Sentence length: If a sentence is too short (2 or 3 words) or too long (more than 10
words), our model mostly marks it as NEG; POS sentences are misclassified most often in
this way. We ran a small experiment without this feature and the accuracy dropped by at
least 10%, mostly on the NEG class.
• Previous word/Next word features: In 13 cases, the previous-word or next-word feature
drives most of the result. When we eliminate these features, accuracy decreases by 3 to 5%.
A sketch of this feature template follows the list.
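
The following is a hypothetical sketch of the per-token feature template discussed in the last two bullets (current word, previous word, next word, plus a sentence-length bucket). The feature names and bucket boundaries are illustrative; the actual Maxent feature set may differ.

def extract_features(tokens):
    # sentence-length bucket: short (<= 3 words), long (> 10 words), medium otherwise
    if len(tokens) <= 3:
        features = ["len=short"]
    elif len(tokens) > 10:
        features = ["len=long"]
    else:
        features = ["len=medium"]
    # current word plus previous-word and next-word features for every position
    for i, tok in enumerate(tokens):
        features.append(f"word={tok}")
        features.append(f"prev={tokens[i - 1] if i > 0 else '<s>'}")
        features.append(f"next={tokens[i + 1] if i < len(tokens) - 1 else '</s>'}")
    return features

print(extract_features("môn học rất có ích".split()))
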
Furthermore, we see minor errors related to higher-level negation and contrastive
conjunctions. We cannot classify comparative sentences, but for a complex one such as
“thầy rất nhiệt tình, nói rất nhiều đến khản giọng mà sức truyền tải không bằng thầy
XYZ” (“the lecturer is so enthusiastic that he talks until his voice is hoarse, yet he does
not get the message across as well as Mr. XYZ”) we know it is completely a negative
response. This is not a common comparative sentence because it compares one aspect
(attitude) of object A (the lecturer), which is good, with a totally different aspect (teaching
ability) of object B (Mr. XYZ). By using a dependency feature with a neural network, we
could handle some complex sentences containing contrastive conjunction words such as
“nhưng”, “mà”, … (“but”, “yet”, … in English). We use our tool to analyze these errors.
For example, “môn học rất có ích” (“the subject is helpful”) is assigned as positive by the
model based on the word feature “ích” and also the next-word feature “ích” (Fig. 3).
These feature scores are over 1.0, whereas the NEU and NEG classes have mostly negative scores.

5.4 Application
We want to build an application that can take raw user data and export the results as
text or as visual charts. We can compare the results of two or three models side by side in
one table, as seen in Fig. 2, which shows our application interface. The application targets
both testers and daily users, so we added one function for each of them: error analysis and
data charts.
As testers, we need a tool that helps us figure out where our system goes wrong. We can
inspect mislabeled sentences with the tool shown in Fig. 3 or challenge the system with any
new sentence we make up. The four columns show the feature type and the statistical data
of each feature in the three classes.
The second addition is visualizing our results in an intuitive way. School administrators
can already export data as text, but in case they want a quick look at the results, our tool
can draw pie or column charts. If we choose more than one year of student feedback, we can
view the charts side by side, as in Fig. 4. Moreover, we can export these charts as images
and import them easily into any student report.

Fig. 2. Main user interface of application

Fig. 3. Error analysis tool

Fig. 4. Visual summary data



6 Conclusion and Future Work

The three classifiers in our experiment are based on probability and statistics, so the
training data plays a key role. The per-class accuracy results show that unbalanced data
does not affect the negative class as much as it affects the positive class. That our data
contains more negative than positive sentences is usual in real life, which makes our
accuracy reliable.
We contribute our educational sentiment dataset of 5,000 sentences labeled positive,
negative or neutral. Our data could be a helpful resource for the sentiment analysis
community in the future.
With the best accuracy of 91.36%, we see that MaxEnt is full of promise. Our research
also shows that Naïve Bayes may be old and less accurate in some fields, but not in this one.
Although feedback is complex, traditional algorithms can work effectively with
appropriately selected features. Based on our results and analysis, we suggest some
possible directions for the development of sentiment analysis on our data in the future:
(1) Enriching our data by collecting and labeling students’ feedback from the universities.
Our goal is to reach 10,000 sentences in our dataset.
(2) Classifying students’ feedback by topics such as lecturers, facilities and curriculum.
(3) Building a sentiment treebank for our data based on the method of the English
Sentiment Treebank [18].
(4) Developing a sentiment dictionary for the education domain to improve the accuracy
of the classifiers.

References
1. Kieu, B.T., Pham, S.B.: Sentiment analysis for Vietnamese. In: 2010 Second International
Conference on Knowledge and Systems Engineering (KSE), pp. 152–157 (2010)
2. Altrabsheh, N., Gaber, M., Cocea, M.: SA-E: sentiment analysis for education. In: 5th KES
International Conference on Intelligent Decision Technologies (2013)
3. Mac Kim, S., Calvo, R.A.: Sentiment analysis in student experiences of learning. In:
Educational Data Mining 2010 (2010)
4. Achen, R.M., Lumpkin, A.: Evaluating classroom time through systematic analysis and
student feedback. Int. J. Sch. Teach. Learn. 9, 4 (2015)
5. Phuc, D., Phung, N.T.K.: Using Naïve Bayes model and natural language processing for
classifying messages on online forum. In: 2007 IEEE International Conference on Research,
Innovation and Vision for the Future, pp. 247–252 (2007)
6. Smith, M.K., Jones, F.H., Gilbert, S.L., Wieman, C.E.: The classroom observation protocol
for undergraduate STEM (COPUS): a new instrument to characterize university STEM
classroom practices. CBE Life Sci. Educ. 12, 618–627 (2013)
7. Delen, D.: A comparative analysis of machine learning techniques for student retention
management. Decis. Support Syst. 49, 498–506 (2010)
8. Duyen, N.T., Bach, N.X., Phuong, T.M.: An empirical study on sentiment analysis for
Vietnamese. In: 2014 International Conference on Advanced Technologies for Communi-
cations (ATC 2014), pp. 309–314 (2014)
9. Liu, B.: Sentiment analysis and opinion mining. Synth. Lect. Hum. Lang. Technol. 5, 1–167
(2012)
10. Rohrer, B.: How to choose algorithms for Microsoft Azure machine learning (2015)
11. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level
sentiment analysis. In: Proceedings of the Conference on Human Language Technology and
Empirical Methods in Natural Language Processing, pp. 347–354 (2005)
12. Klein, D., Manning, C.: Maxent models, conditional estimation, and optimization. In:
HLTNAACL 2003 Tutorial (2003)
13. Joachims, T.: SVM-Light Support Vector Machine, vol. 19. University of Dortmund (1999).
http://svmlight.joachims.org/
14. Vryniotis, V.: Developing a Naive Bayes text classifier in JAVA, 27 January 2014 (2014)
15. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans.
Intell. Syst. Technol. (TIST) 2, 27 (2011)
16. Klein, D.: The Stanford Classifier. The Stanford Natural Language Processing Group (2003)
17. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model
selection. In: Ijcai, pp. 1137–1145 (1995)
18. Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., et al.: Recursive
deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the
Conference on Empirical Methods in Natural Language Processing (EMNLP), p. 1642
(2013)
Development of Waste Collection Model Using
Mobile Phone Data: A Case Study in Latvia

Irina Arhipova , Gundars Berzins, Aldis Erglis(&),


and Evija Ansonska

University of Latvia, Riga 1050, Latvia


irina.arhipovasalajeva@gmail.com, {gundars.berzins,
evija.ansonska}@lu.lv, aldis.erglis@gmail.com

Abstract. In organizing household waste management and controlling waste


collection and disposal, it is necessary to minimise risks to the environment and
human health and, where possible, ensure that waste is recycled and returned to
the economic cycle. Different models are being applied to increase waste col-
lection management efficiency, but in recent years, the mobile phone data is
widely used to solve various application problems. The research objective is to
develop a waste collection model, which responds to the population’s current
demands and allows planning waste container loading, based on mobile phone
data statistics. The developed approach, techniques and data model can be used
for waste container analysis and optimisation of their placement near small
commercial structures, information kiosks, residential areas and other places
attracting larger amounts of people. The developed relational data model
includes information about mobile phone base stations, waste container data,
calendar table and geographic location table. Further steps include data pro-
cessing and data modelling in order to generate a data model for visual and
quantitative analysis. The methods and data analysis techniques used in this
research could be used to build a commercial product for mobile data operators
allowing predicting the most appropriate placement of waste containers in any
territory where mobile base station data is available. The choice of any of the
proposed strategies allows achieving both direct benefits, like increasing the
collected amount of recyclable glass, and indirect benefits – an increase in the
amount of glass collected in the remaining containers.

Keywords: Conceptual architecture · Functionality model · Principal Component Analysis

1 Introduction

The waste management policy in Latvia is laid out in the National Waste Management
Plan 2013–2020 as well as in the regional plans of ten waste management regions and
Riga City. The regional plans envisage the establishment of a separate waste collection
system in the regions. In cities, it means a separate waste collection point for every 300
to 500 inhabitants. Sorted waste collection points should be created in larger cities, as
well as in regional populated areas with over 1000 inhabitants.

One of the goals envisaged in the plan is to ensure that the generated waste is not
hazardous and with a low risk to the environment and human health. Where possible,
the collected waste should be returned to the economic cycle, particularly through
recycling. Specifically designed areas would constitute the necessary infrastructure for
collecting sorted household waste. Organizing household waste management and
controlling waste collection and disposal is one of the functions of the local govern-
ments. The production of glass is expensive and energy consuming, so glass products
should be in use as long as possible through re-use and recycling. In the production of
new glass products, it is economically advantageous to add up to 30% of used glass. In
Latvia, the existing glass sorting and collection system is not optimal and does not
ensure efficient waste collection and recycling.
Laws and regulations in Latvia oblige producers of household waste to entrust the
management of their waste to companies which have received the permit of the
respective local government for the operation in its territory. An individual or a legal
entity in question is also responsible for covering the costs associated with the man-
agement of household waste, including hazardous waste. These requirements also
apply to owners and users of summer cottages or other short-term accommodation
buildings. In line with the objectives set by the European Union (EU), by 2020 Latvia,
like other EU countries, must collect and hand over for recycling 68% of glass placed
in its territory and ensure annual increase in recycling. Often due to the absence of
containers for sorted glass waste, glass packaging ends up in household waste con-
tainers, making it more difficult to sort and recycle.
To increase the efficiency of the waste collection management, different models are
used. For example, the GIS based model [1], the data driven model [2], the integrated
management model [3], the network optimization model [4], the decision-making
model, the cost-minimization model and the landfill model [5], as well as multi-criteria
decision models and graphical models [6]. In recent years, the mobile phone data is
being widely used to solve various application problems in different fields – tourism
management [7, 8], population estimation and migration monitoring [9, 10], traffic flow
measuring [11], regional economic activity evaluation [12–14] and others. The research
objective is to develop responsive waste collection model, which responds to the
current demands of the population and allows planning waste container loading, based
on mobile phone data statistics.

2 Glass Waste Collection System and Mobile Phone Base


Stations Location in Riga

Glass waste collection services are available to every resident and company in Latvia.
The environmental management company, Eco Baltia Vide, Ltd. provides the widest
range of environmental management services – collection of household and sorted
waste, management of used packaging, construction waste and bulky waste manage-
ment, cleaning of premises and territories, and different seasonal services. Eco Baltia
Vide, Ltd. is a part of Eco Baltia, the largest waste management company in the Baltic
countries, providing a complete waste management cycle from collection to recycling.

Almost 55 000 tons of glass products are reused and returned to the Latvian market
every year, and only 36 000 tons of glass are recycled. Therefore, Eco Baltia Vide in
co-operation with local governments of Latvia are planning to place 1000 additional
containers for glass waste collection, allowing to collect at least 5000 tons more glass
annually. This will reduce the amount of glass waste disposed in municipal landfills
and will improve the collection and recycling of glass in Latvia. Usually, a waste
collection point is fixed up where it is physically possible due to municipal rules and
agreements with property owners, and not as a result of decisions based on the data on
potential volumes of waste. Therefore, waste collection infrastructure is created upon
the easy-to-go principle, which means that a waste container may not be in its optimal
placement but in the possible one. For the collection of glass waste, containers of two
different volume capacities are used: 1.5 m3 container can contain up to 300 kg of glass
waste and 2.5 m3 – up to 500 kg of glass waste.
In this research, the mobile station data by the operator, LMT (Latvia Mobile
Telephone) are used. Other mobile operators are also operating in Riga, but LMT is the
oldest and largest one in the region and their data include fixed calls and text messages
made from and to other mobile operators as well. Geographically, the entire territory of
Riga is covered by 298 mobile phone base stations (see Fig. 1).

1.5 m3 volume container; 2.5 m3 volume container; mobile base station.

Fig. 1. Locations of glass waste containers and LMT mobile phone base stations in Riga.

Mobile phone base stations (BS) are erected according to the consumption in the
respective part of the city. In areas where the use of mobile phone services is growing,
the operator installs a base station, ensuring that the distribution of base stations

corresponds with the level of use of mobile phone services. For each BS, the following
parameters are used:
• base station ID;
• base station position by Longitude and Latitude;
• the number of the unique users on a base station;
• base station activity or number of the outgoing call and text-message activities.
The data from BS is available in 15-min intervals, allowing to identify users’
behaviour in the respective area. This time-series data can be used for forecasts and
dynamic analysis across time and as aggregated values. In the data model for mobile
activity analysis, data could be used on different aggregation levels: daily, weekly,
monthly or even yearly aggregations. The data from various mobile base stations do not
overlap; therefore, they can be used for geographically precise location analysis.

3 Analysis of Collected Waste Volumes Based on Location


and Time Period

The database used for the collected waste volume study includes information on the
weight of every glass waste container, the date of collection and the location address
for the period of 40 months, from January 2015 to April 2018 (Table 1).

Table 1. Extract from glass waste data statistics.


Container ID Date Volume (m3) Weight (kg) Latitude Longitude
1 22-03-2015 2.5 500 56.909995 24.084610
305 15-04-2018 1.5 270 56.954303 24.204908

Based on the methodology of regional economic development evaluation, which


uses mobile positioning data [13, 14], the call detail record (CDR) data of LMT of 2016
and 2017 were analysed. The database includes the number of outgoing call and text-
message activities, the number of unique phone users, the exact date, aggregate time by
15 min, the 298 mobile base stations ID and address, altogether 20 883 840 rows in the
table (Table 2).

Table 2. Extract from CDR database in Riga city.


Date Time Number of call activities Number of unique phone users Base station ID Address
10-10-2016 08:15 2770 2117 100102 119 Brivibas street, Riga
10-10-2016 18:30 2578 1919 100104 96 Brivibas street, Riga
10-10-2016 18:45 6207 4745 100103 98 Kr. Barona street, Riga

3.1 Data Statistical Analysis


The statistical analysis of the data shows that there is no significant correlation between
glass waste container weight and time period between the actual and the previous waste
collection days (r = 0.03). The hypothesis is that the volume of collected waste
depends on the type of district of the city – business, residential or mixed.
Using the Principal Component Analyses (PCA) with Varimax rotation, similar
groups of 298 mobile phone base stations (by call activity and number of unique phone
users) in the districts of Riga were found. The PCA results show that 93% of the total
variance is described by the first two principal components (PC), where the first PC
describes 54% and the second PC – 39% of the total variance.
To define the type of each Riga city district, the average values of the first and the
second principal components (PC) were calculated, regarding the time and the day of
the week. The first principal component has higher values before 4.00 p.m. and lower
values after standard business hours, but the second principal component has higher
values after 4.00 p.m. A similar pattern is found for workdays and weekends, where
the first principal component has the highest values during workdays and the lowest on
weekends in comparison with the second principal component (see Fig. 2).

Fig. 2. PC average values depending on time and day of the week.

As a result, the first PC can be interpreted as a business district, but the second PC –
as a residential district. Using the component loadings, which are the correlations of the
particular mobile base stations’ call activities and unique phone users with the first two
principal components, three similar groups of mobile phone base stations by mobile
phone call activities and unique phone users of Riga city district were found:
• the 1st group includes 45% of all base stations in business districts, when the 1st PC
loadings are more than 0.7 and 2nd PC loadings are less than 0.5;
• the 2nd group includes 22% of all base stations in residential districts, when the
1st PC loadings are less than 0.5 and 2nd PC loadings are more than 0.7;
• the 3rd group includes 33% of all base stations in mixed districts, when both PC
loadings are more than 0.5 and mobile phone call activities are relatively high at
all times during the day (a sketch of this grouping follows the list).
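
A minimal sketch of this grouping, under assumptions: the input is a matrix of 15-minute activity values with one column per base station (so that, after standardization, the rotated loadings approximate each station's correlation with the two components), the thresholds follow the bullets above, and the varimax routine is a generic implementation rather than the authors' own.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Generic varimax rotation of a loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0))))
        rotation = u @ vt
        if s.sum() - var < tol:
            break
        var = s.sum()
    return loadings @ rotation

rng = np.random.default_rng(0)
activity = rng.random((7000, 298))            # placeholder: 15-min intervals x 298 base stations

Z = StandardScaler().fit_transform(activity)
pca = PCA(n_components=2).fit(Z)
loadings = varimax(pca.components_.T * np.sqrt(pca.explained_variance_))

def district_type(pc1, pc2):
    # thresholds from the bullet list above
    if pc1 > 0.7 and pc2 < 0.5:
        return "business"
    if pc1 < 0.5 and pc2 > 0.7:
        return "residential"
    if pc1 > 0.5 and pc2 > 0.5:
        return "mixed"
    return "unclassified"

groups = [district_type(pc1, pc2) for pc1, pc2 in loadings]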

The distribution of mobile base stations by the type of district can be marked on the
map – business districts are located in the city centre, residential districts are further
away from the city centre, while mixed districts are located in between the business and
the residential districts (see Fig. 3).

Waste containers: in business district; in residential district; in mixed district.


Mobile base stations: in business district; in residential district; in mixed district.

Fig. 3. Distribution of base stations and container types in the city’s districts in 2018.

The two-factor analysis of variance shows that the district type (p-value < 0.01) and
the month of the year (p-value < 0.01) have a significant effect on the collected glass waste
(kg/day), but there is no significant factor interaction effect in 2015–2018. Higher volumes
of collected glass waste are associated with spring cleaning in April and during
Easter time. The summary of the various time periods in Riga city districts from 2015 to
2018, using the two component loadings or correlation coefficients, is shown in Table 3.
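
A sketch of such a two-factor analysis of variance using statsmodels; the column names and placeholder observations are hypothetical, the real input being the collected kg/day per container together with its district type and collection month.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# placeholder frame; in practice each row is a container-month observation
df = pd.DataFrame({
    "kg_per_day": [6.2, 5.8, 6.9, 6.4, 3.0, 3.3, 3.5, 2.9, 4.7, 5.1, 5.2, 4.6],
    "district":   ["business"] * 4 + ["residential"] * 4 + ["mixed"] * 4,
    "month":      ["Mar", "Mar", "Apr", "Apr"] * 3,
})

model = ols("kg_per_day ~ C(district) + C(month) + C(district):C(month)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # main effects and interaction term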

Table 3. Summary of Riga city district types by mobile activity, component loadings and average collected glass waste.


# Type of district Call activities and unique users (business days) Call activities and unique users (weekend) 1st PC loadings 2nd PC loadings Waste average (kg/day)
1 Business High Low 0.7–1.0 0.0–0.5 6.0
2 Residential Low High 0.0–0.5 0.7–1.0 3.1
3 Mixed Average Average 0.5–1.0 0.5–1.0 4.9

3.2 The Strategies for Waste Containers Location


Based on the data statistical analysis results, four strategies were considered for the
responsive waste collection model development. The first strategy envisaged that 5
containers with the lowest collected volume of glass waste (kg/day) from business,
residential and mixed districts, altogether 15 containers, are moved to other business
districts with insufficient number of containers.
The second strategy assumed that 5 containers with the lowest collected volume of
glass waste (kg/day) from business, residential and mixed districts, altogether 15
containers, are moved to districts of the same type with insufficient number of
containers.
The third strategy assumed that 15 containers with the lowest collected volume of
glass waste (kg/day) from all three types of districts are moved to the business districts
with insufficient number of containers.
The fourth strategy assumed that 15 containers with the lowest collected volume of
glass waste (kg/day) from all types of districts are moved to the same types of districts
with insufficient number of containers.
Using the data of average collected glass waste volumes (kg/day) in business,
residential and mixed districts (Table 3) and actual collected glass waste volume for 15
moved containers, it is possible to evaluate the absolute growth (kg/day) and the
growth rate for all strategies (Table 4).

Table 4. Summary of the waste containers moving strategies.


Indicator 1st strategy 2nd strategy 3rd strategy 4th strategy
Absolute growth (kg/day) 76.59 56.59 77.25 67.75
Growth rate 5.7 4.2 6.1 5.3
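
The following sketch reproduces this evaluation under the assumption that a moved container is expected to collect the destination district's average from Table 3, that absolute growth is the summed gain over the moved containers, and that the growth rate is that gain relative to their current daily collection; the container figures are placeholders.

DISTRICT_AVG = {"business": 6.0, "residential": 3.1, "mixed": 4.9}   # kg/day, from Table 3

# (current kg/day, destination district type) for each moved container -- placeholder values
moved = [(0.9, "business"), (0.7, "business"), (1.1, "residential"), (0.8, "mixed")]

current_total = sum(kg for kg, _ in moved)
expected_total = sum(DISTRICT_AVG[dest] for _, dest in moved)
absolute_growth = expected_total - current_total                     # kg/day gained by moving
growth_rate = absolute_growth / current_total                        # gain relative to current yield
print(f"absolute growth: {absolute_growth:.2f} kg/day, growth rate: {growth_rate:.1f}")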

4 The Data Model of Glass Waste Collection Information


System

All data transformations are performed to convert data in a format allowing to create a
common relational data model which can be used for data analysis from different
perspectives. Because of the amount and structure of the data, it is more reasonable to
use a relational model (normalized) instead of a single large table approach. The
relational data model also allows data to be loaded and reloaded, fully or incrementally, from
different data sources during analysis, and even allows different data sources to be added if necessary.

4.1 MS Excel Functionality Model


Microsoft Excel 2016 is the main tool used in the data analysis. All data transformation
and visualization is performed, using Excel 2016 advanced add-ins: PowerQuery for
data load and processing, PowerPivot for building a data model and PowerMap for
visualization of the maps (see Fig. 4).

Fig. 4. Microsoft Excel advanced functionality model.

Microsoft Excel provides functionality similar to state-of-the-art data warehouses
and business intelligence solutions. It allows building the needed ETL (extract, transform,
load) processes, connecting to external API services such as the Google Maps API, storing data in
a relational data model in PowerPivot and visualizing data using PivotTables, graphs and
3D maps. The final relational data model includes information on mobile phone base
stations, waste container data, a calendar table and a geographic location table. The following
data processing and data modelling steps were taken to create the data model
for visual and quantitative analysis (a sketch of the radius-assignment step follows the list):
1. Cleaning of the source data stored in Excel file (removing errors and removing an
entire row in case of missing values, correcting addresses).
2. Table of dates converted in transactional form where every record represents one
waste container collection period.
3. Using Google MAP API GPS coordinates calculated for each waste container and
added to the table.
4. Mobile phone base station data added to the data model.
5. Calculated mobile phone base stations in a 1 km radius from each waste container.
6. Each waste container classified according to the district of the respective base
station.
7. Results displayed on Bing 3D map and tables.
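
A sketch of steps 5 and 6, assuming the haversine great-circle distance is used for the 1 km radius and that a container takes the district type most common among its nearby base stations; the coordinates and district labels below are placeholders.

from collections import Counter
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# placeholder records: containers as (id, lat, lon), stations as (id, lat, lon, district type)
containers = [(1, 56.909995, 24.084610), (305, 56.954303, 24.204908)]
stations = [(100102, 56.951, 24.113, "business"), (100104, 56.956, 24.199, "residential")]

for cid, clat, clon in containers:
    nearby = [d for _, slat, slon, d in stations if haversine_km(clat, clon, slat, slon) <= 1.0]
    district = Counter(nearby).most_common(1)[0][0] if nearby else "unassigned"
    print(cid, district)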

4.2 Conceptual Architecture of Waste Container Placement Software


All data processing steps mentioned above can be automated and included in software
(see Fig. 5). The methodology used in this research could be standardised and included
in the product for commercial use, including the following parts: standardised relational
data model and data input interfaces, algorithms and transformation methods, pre-
sentation of results in tabular and graphical format using BING maps.

[Diagram summary: (1) the enterprise prepares input data (mobile phone CDR data, waste container data, GIS map data, other data) according to the data model format; (2) the application calculates according to the chosen parameters; (3) results and recommendations are returned in tabular and graphical form, which the enterprise can show to clients and partners, as well as in machine-readable form (API, CSV).]

Fig. 5. Conceptual architecture of the waste container placement software.

The most time-consuming task was data cleaning and processing to combine it with
all needed data sources in a common relational data model. The relational data model
contains several tables and some of the tables are larger than 1 million rows, therefore,
PowerPivot add-in for Excel was used. PowerPivot allows storing relational data with
tables larger than 1 million rows in Excel and is built on top of Vertipaq in-memory
engine, which allows working with data 20 times faster than in an Excel sheet.
The methods and the data analysis techniques used in this research could be built in
a commercial product for mobile data operators allowing to predict the optimal
placement of waste containers in any territory where mobile base station data is
available. It could work in three major steps:
1. preparation of source data (CDR data, waste container data, other data);
2. calculation and parametrisation;
3. results and analysis of maps and tabular results.
The main benefit of the standardised solution is the data model and data structure.
By preparing and mapping source data to the data model data, it is possible to repeat all
calculations and generation of results. Calculations could be adjusted using different
parameters, such as in what radius base stations are assigned to a waste container. To
create visualisations of geographical information on the map we needed to generate
multilayer maps that allow combining different data together to enable part of analysis
and conclusions visually. Visual examination of the data results helps to understand the
big picture and how data is distributed across a geographical area.
Several GIS tools allow creating multilayer maps such as Quantum GIS or ArcGIS.
In this research, we used 3D Bing multilayer maps built in Excel 2016. Bing maps are
the original Microsoft add-in for Excel 2016 and provide needed functionality to
complete this research in data visualisation part. The following PowerMap functions
are used:
• geocoding – enables to find point on the map using the street address of the object;
• filtering – enables to filter objects;
• layers – enables to create several layers where visualisation on each layer could be
different;
• transparency – enables to overlap several layers with clarity;
• exact location – enables to find the objects’ location using GPS coordinates;
• map themes – enables to add different map visualisation themes and colour
schemes.
All these functions of PowerMap allowed visualising GIS information in the way it
could automatically generate results from a relational data model using a large amount
of data – several million rows and stored in PowerPivot.

5 Conclusions

The developed approach, techniques and data model could be used for:
• other waste container analysis and other applications, such as small places of
commerce, real estate, information kiosks, where optimization of placement means
possible attractiveness for a larger amount of people,
• providing digital products and applications using CDR data by mobile operators in
combination with different business data,
• short-term and long-term forecasts on the number of people located across the
geographical area for the planning of public events, marketing campaigns, and the
placement of short-term engagements.
The methods and the data analysis techniques used in this research could be built in
a commercial product for mobile data operators allowing to predict the optimal
placement of waste containers in any territory where mobile base station data is
available.
The main benefit of the standardised solution is the data model and data structure.
By preparing and mapping source data to the data model data, it is possible to repeat all
calculations and generation of results. Calculations could be adjusted using different
parameters, such as in what radius base stations are assigned to a waste container.
Choosing any of the proposed strategies helps achieve direct benefits - increasing the
amount of glass collected, and indirect benefits - increasing the amount of glass col-
lected in the remaining containers.
If the anticipated effect is achieved after a long-term monitoring, it will be nec-
essary to recalculate the appropriate locations of the remaining containers and to create
the most optimal container placement map for the entire area. An additional effect can
be achieved through a public awareness campaign and by promoting an innovative
approach to glass waste collection. The amount of data available makes it possible to
offer optimisation of routes and periods of waste collection.
Optimal placement of glass waste collection containers can contribute to the devel-
opment of the circular economy, which is still at an early stage in Latvia. In the future, the
amount of waste disposed in landfills and the costs of municipal waste management will
also decrease as the volume of glass packaging will no longer end up in household waste
containers. Besides, absence of glass, a rather abrasive material, in municipal waste


sorting plants would significantly prolong the life of the sorting equipment.

Acknowledgments. The research leading to these results has received funding from the research
project “Development of Responsive Glass Waste Collection System”, the contract Nr.
ZD2018/20580 signed between the University of Latvia and Eco Baltia Vide, Ltd.

References
1. Vu, H.L., Ng, K.T.W., Bolingbroke, D.: Parameter interrelationships in a dual phase GIS-
based municipal solid waste collection model. Waste Manag 78, 258–270 (2018)
2. Esmaeilian, B., Wang, B., Lewis, K., Duarte, F., Ratti, C., Behdad, S.: The future of waste
management in smart and sustainable cities: a review and concept paper. Waste Manag 81,
177–195 (2018)
3. Ilankoon, I.M.S.K., Ghorbani, Y., Chong, M.N., Herath, G., Moyo, T., Petersen, J.: E-waste
in the international context – a review of trade flows, regulations, hazards, waste management
strategies and technologies for value recovery. Waste Manag 82, 258–275 (2018)
4. Van Engeland, J., Beliën, J., De Boeck, L., De Jaeger, S.: Literature review: strategic
network optimization models in waste reverse supply chains. Omega 91, 102012 (2020).
https://doi.org/10.1016/j.omega.2018.12.001
5. Eiselt, H.A., Marianov, V.: Location modeling for municipal solid waste facilities. Comput.
Oper. Res. 62, 305–315 (2015)
6. Kayakutlu, G., Daim, T., Kunt, M., Altay, A., Suharto, Y.: Scenarios for regional waste
management. Renew. Sustain. Energy Rev. 74, 1323–1335 (2017)
7. Ahas, R., Aasa, A., Roose, A., Mark, U., Silm, S.: Evaluating passive mobile positioning
data for tourism surveys: an Estonian case study. Tour. Manag. 29, 469–486 (2008)
8. Zhao, X., Lu, X., Liu, Y., Lin, J., An, J.: Tourist movement patterns understanding from the
perspective of travel party size using mobile tracking data: a case study of Xi’an, China.
Tour. Manag. 69, 368–383 (2018)
9. Balzotti, C., Andrea, B., Briani, M., Cristiani, E.: Understanding human mobility flows from
aggregated mobile phone data. IFAC-PapersOnLine 51(9), 25–30 (2018)
10. Bwambale, A., Choudhury, C.F., Hess, S.: Modelling trip generation using mobile phone
data: a latent demographics approach. J. Transp. Geogr. 76, 276–286 (2019). https://doi.org/
10.1016/j.jtrangeo.2017.08.020
11. Ni, L., Wang, X.C., Chen, X.M.: A spatial econometric model for travel flow analysis and real-
world applications with massive mobile phone data. Transp. Res. Part C 86, 510–526 (2018)
12. Arhipova, I., Berzins, G., Brekis, E., Opmanis, M., Binde, J., Steinbuka, I., Kravcova, J.:
Pattern identification by factor analysis for regions with similar economic activity based on
mobile communication data. Adv. Intell. Syst. Comput. 886, 561–569 (2019)
13. Arhipova, I., Berzins, G., Brekis, E., Kravcova, J., Binde, J.: The methodology of region
economic development evaluation using mobile positioning data. In: 20th International
Scientific Conference on Economic and Social Development, pp. 111–120. Varazdin
Development and Entrepreneurship Agency, Prague, University North, Koprivnica, Croatia,
Faculty of Management University of Warsaw, Poland (2017)
14. Arhipova, I., Berzins, G., Brekis, E., Binde, J., Opmanis, M.: Mobile phone data statistics as
proxy indicator for regional economic activity assessment. In: 1st International Conference
on Finance, Economics, Management and IT Business, pp. 27–36. SCITEPRESS – Science
and Technology Publication, Lda., Crete, Greece (2019)
Artificial Social Intelligence: Hotel Rate
Prediction

James J. Lee(&) and Misuk Lee

Seattle University, Seattle, WA 98122, USA


jamesleeseattle@gmail.com

Abstract. Artificial Intelligence has enabled new possibilities in today’s


business domain from operational efficiency to smart decision making and even
innovative product/service design. Still there are plenty of grey areas where
human modelers are struggling to create optimal machine learning scenarios.
This research is the first attempt to build machine level structuration where the
human modelers’ continuous commitment to enhance machine learning models
can be eliminated. In Artificial Social Intelligence Framework, those require-
ments are replaced at the machine level by adopting cloud native computing
foundation (CNCF) with continuous integration and development. The sug-
gested machine level structuration is demonstrated with hotel rate predictions.

Keywords: Artificial Social Intelligence · Cloud native · Hotel rate prediction · Structuration

1 Introduction

Artificial Intelligence (AI) has been revived in modern days and regained its fame from
the 1950’s. Today initiatives from business world are the main driving force. With help
from advancements in cloud computing and data analytics, AI is spread out using
today’s fast networking capacity. Though the fine source code libraries of AI in the
majority of programming platforms have been established, AI solutions still require
human modelers’ tremendous efforts and interactions. AI models must be refined with
parameters to be adopted today, then tweaked again tomorrow.
Artificial Social Intelligence (ASI) is an innovative framework capable of replacing
human modelers’ time-consuming jobs with machine agents implemented in
microservices. These agents are then facilitated by cloud native computing foundation
(CNCF).

2 Microservices

With the “Keep it Simple, Stupid (KISS)” Unix philosophy, Linux naturally supports
the modular nature of cloud architecture with virtualization technology that maxi-
mizes resource utilization alongside the concept of multi-tenant systems in one phys-
ical server computer [3]. This creates one of the major benefits of using cloud
computing, scalability, by creating as many virtual machines as needed. However,

because Linux applications have so many dependencies based on resource require-


ments, it is difficult to manage multiple versions of applications. With the advancement
of Docker containers, this complexity problem is completely resolved by enforcing
polylithic design principles [4, 6]. Docker is a very lightweight virtual machine, called
a container virtualization [1].
The philosophy of microservices is to split an application into set of smaller,
interconnected services [7]. A microservice can be a loosely coupled functionality,
such as registration management, download management, order management, pro-
duction management, etc. Therefore, each microservice is a mini-application with its
own database that complements polyglot persistence - different kinds of data are best
dealt with by different databases. Microservices can be implemented either in a cloud
virtual machine or a container. This paper mainly uses the example of the Docker
container today as it provides various tools with ease of maintenance in many
departments, such as scheduling, scaling, upgrades, health checking, and service
discovery.

3 Cloud Native Computing Foundation

Like parallel computing in today's data analytics field, container orchestration is
getting attention because the microservice architecture is a double-edged sword: decomposing
applications into multiple microservices adds complexity, and managing the
overwhelming number of microservices that comes with application growth proves challenging
[6]. Orchestration steps into this position to manage resources systematically.
Managing a multitude of containers can be a daunting job without orchestration
tools such as Kubernetes, Mesos, ECS, Swarm, and Nomad.
At this time a new paradigm has emerged from the practices of the last decade in
the form of fast moving microservices architecture deployment, called cloud native
computing foundation (CNCF). CNCF is an open source software stack that deploys
applications as microservices, packaging each part into its own containers, and
dynamically orchestrating those containers to optimize resource utilization. The com-
puting in AI field vastly needs this processing power. Today’s data analytics foundation
has widely accepted parallel computing, such as the Hadoop and Spark processing
engines. Still, human analysts are fully responsible for modeling any AI algorithms.
This research is the first step to use CNCF to aid modeling jobs by human analysts with
machine agents using microservices architecture.

4 Artificial Social Intelligence

The process of structuration is the reciprocal interaction of human players and insti-
tutional properties of organizations [2]. The theory of structuration recognizes that
human actions are enabled and constrained by structures, yet these structures are the
result of previous actions [5]. Though run by machines, CNCF adopts the structuration
process in AI use cases. This research is theorizing possible structuration on
microservices (machine agents) and institutional properties (rules and resources).

AI algorithms are purely logical ways of thinking. But there are various methods of
implementation when one applies algorithms to their data source. Each analyst creates
his or her own methods to interpret the data source. As a result, multiple views of data
interpretation are created and the best explanation on a case-by-case basis is adopted.
With CNCF, Artificial Social Intelligence is proposed where modeling is run by
machine agents with an additional layer of blending other algorithms. This will be
discussed in the following case study in hotel rate prediction.

5 Case Study: Hotel Rate Prediction

When buying a perishable good, most consumers consider the trade-off of
buying today and waiting a little longer in the hope that prices will drop. This trade-off
is more pressing the larger the uncertainty of one’s preferences in the future. A buyer
must consider whether sellers will drop prices or increase prices. In the presence of
price uncertainty, price forecasts for the remaining days in the booking horizon are
valuable information to drive consumer’s purchase decision for perishable goods.
In this case study, the CNCF architecture for predicting minimum hotel prices is
applied. Several different forecasting models including traditional time-series models
and machine learning models are examined, and machine learning blending architec-
ture to improve forecasting accuracy is proposed. Using real-life hotel data, this
research provides the empirical results of the proposed approach along with a com-
parison to traditional forecasting methods.

5.1 Data Layer


Data layer retrieves, cleans, and manipulates raw data. The data is then to be fed into
forecasting models. This part of the system may require updating data in database as
well as running specific program modules for data processing.
For the hotel price prediction, daily price data of three hotels from a major hotel
chain are used. A booking window of 0–60 days out for 728 arrival dates of each hotel
is examined. The last three months’ data is reserved for validations.

5.2 Modeling Layer


Modeling layer prepares various prediction models at the abstract levels. When a
program instance for a specific prediction model is executed, it retrieves the pre-
processed data from either database or files and runs the forecasting model. Modeling
layer of the hotel price forecasting system contains time-series forecasting models
including ETS (exponential smoothing), ARIMA, and machine learning models (including
support vector machines and neural networks).
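
A sketch of such a modeling layer in Python, with statsmodels and scikit-learn standing in for whatever implementation the system actually uses; the price series is synthetic and the model orders and lag length are illustrative.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
rates = 100 + 10 * np.sin(np.arange(200) / 7) + rng.normal(0, 2, 200)   # synthetic daily minimum rates
horizon = 60

ets_forecast = ExponentialSmoothing(rates, trend="add").fit().forecast(horizon)
arima_forecast = ARIMA(rates, order=(2, 1, 1)).fit().forecast(horizon)

# lagged windows as features for the machine learning models
lags = 7
X = np.column_stack([rates[i:len(rates) - lags + i] for i in range(lags)])
y = rates[lags:]
next_window = rates[-lags:].reshape(1, -1)
svm_next = SVR().fit(X, y).predict(next_window)                          # one-step-ahead example
nn_next = MLPRegressor(max_iter=2000, random_state=0).fit(X, y).predict(next_window)
print(ets_forecast[:3], arima_forecast[:3], svm_next, nn_next)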

5.3 Blending Layer


Modeling process in a typical forecasting project runs multiple models and selects the
best model based on a certain prediction performance measure. Thus, the final forecasts
are from a single model. Whereas obtaining forecasts from a single best model is the
predominant approach, combined models are also accepted in practice. Combined
models use either regression or a weighted average of forecasts obtained from different
models. While blending can improve the forecasting accuracy in certain cases, there is
no systematic procedure to combine multiple forecasts.
The blending layer of CNCF framework uses machine learning algorithms to blend
forecasts generated in the modeling layer. Any machine learning regression algorithm
such as the genetic algorithm, neural network, and support vector machine can be
adopted in the blending layer. However, neural networks are used in hotel price
forecasting because neural networks generally offer low error rates and high modeling
flexibility.
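
A minimal sketch of such a blending step, assuming the base models' forecasts are stacked as features and a small neural network regressor is trained against the realized rates; the arrays below are placeholders.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
# each row: [ETS, ARIMA, SVM, NN] forecasts for one (forecast date, days-out) pair -- placeholders
base_forecasts = rng.uniform(90, 130, size=(500, 4))
observed = base_forecasts.mean(axis=1) + rng.normal(0, 3, 500)           # realized minimum rates

blender = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
blender.fit(base_forecasts[:400], observed[:400])                        # train on earlier dates
blended_forecasts = blender.predict(base_forecasts[400:])                # blend for validation dates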

6 Results

The performance of the proposed blending method with benchmarks pertaining to the
predictive accuracy for three hotels is compared. For the benchmarks, the ARIMA
model (one of the most advanced time-series models), support vector machine, and the
neural network model are used.
The Mean Absolute Percentage Error (MAPE) is used to assess the performance of
forecasting models:
MAPE = \sum_{t} \frac{|F_t - A_t|}{A_t}

where A_t denotes the actual value and F_t the forecast value.
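
In code this can be computed as follows; since the values reported in Table 1 are percentages averaged over forecasting runs, the sketch takes the mean of the per-period ratios and multiplies by 100, which is the conventional form of MAPE.

import numpy as np

def mape_percent(actual, forecast):
    actual, forecast = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs(forecast - actual) / actual)

print(mape_percent([100, 110, 120], [98, 115, 118]))   # roughly 2.74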

Table 1. Average MAPE of 60 runs (forecasting dates) for 60 days forecasts


Models Hotel 1 Hotel 2 Hotel 3 Average
ARIMA 23.41% 24.84% 31.35% 26.53%
Support vector machine 5.20% 14.87% 14.63% 11.57%
Neural networks 5.17% 9.52% 10.69% 8.46%
Machine learning blended 3.75% 9.31% 8.96% 7.34%

For model validation, the daily hotel rate data of three hotels with a booking
window of 1–60 days out of 120 arrival dates is used. The suggested system for
60 days is tested, i.e. forecasting models for 60 forecasting dates. For each forecasting
run (date), the system generates forecasts for the next 60 days. Table 1 reports the
average MAPE of 60 forecasting runs of machine learning blended model along with
benchmarks. The empirical result in Table 1 confirms that machine learning blended
model can significantly improve forecasting accuracy for all cases.

7 Concluding Remarks

Currently, microservices are implemented in the modeling layer alone. Data
layer will be implemented later, followed by the blending layer. Once all three layers
are implemented in microservices, CNCF will be adopted with tools such as Kuber-
netes, GitLab, and Digitalocean in CI/CD approaches.
The CNCF framework allows for deploying, scheduling, and running multiple
machine learning models. Moreover, the machine learning selection process leverages
complementary characteristics of distinctive models to produce optimal forecasts. The
empirical study indicates that the CNCF machine learning approach can significantly
improve the prediction accuracy of hotel minimum rates.

References
1. Anderson, C.: Docker. IEEE Software, May/June 2015
2. Giddens, A.: The Constitution of Society. University of California Press, Berkeley (1984)
3. Gupta, D., Lee, S., Vrable, M., Savage, S., Snoeren, A., Varghese, G., Voelker, G., Vahdat,
A.: Difference engine: harnessing memory redundancy in virtual machines. Commun. ACM
53(10), 85–93 (2010)
4. Julian, S., Shuey, M., Cook, S.: Containers in research: initial experiences with lightweight
infrastructure. In: XSEDE16 Proceedings of the XSEDE16 Conference on Diversity, Big
Data, and Science at Scale, Miami, USA (2016). Article no. 25
5. Orlikowski, W.J., Robey, D.: Information technology and the structuring of organizations.
Inf. Syst. Res. 2(2), 143–169 (1991)
6. Richardson, C., Smith, F.: Microservices: from design to deployment. Nginx Inc. (2016)
7. Thones, J.: Microservices. IEEE Software, January/February 2015
New Metric Based on SQuAD for Evaluating
Accuracy of Enterprise Search Algorithms

Harshad Kulkarni, Himanshu Gupta, Kalpesh Balar,


and Praful Krishna(&)

Arbot Solutions Inc. dba Coseer,


301 Mission St Suite 9F, San Francisco, CA 94105, USA
{harshad,himanshu,kalpesh,praful}@coseer.com

Abstract. Enterprise Search is a continuously evolving and important field,


which is seeing a resurgence driven by artificial intelligence. Still, there is no
objective, generally accepted way to compare various enterprise search systems.
SQuAD is becoming popular for measuring algorithmic reading comprehension
(MRC) but is ineffective for quantifying effectiveness of enterprise search in
business-use situations. In this paper we modify the SQuAD scoring method-
ology to propose a scoring system for enterprise search systems that aligns with
the real world expectations of users. Further, we use a search system based on
Calibrated Quantum Mesh (CQM) to underscore the relevance of this metric.

Keywords: Enterprise search · Scoring system · SQuAD · Calibrated Quantum Mesh · CQM

1 Introduction

In the field of Information Retrieval, searching for textual answers among enterprises
has always been thought of as being different than similar searches on the Internet. As
early as 2004, Hawking, et al. have described such differences in detail [1]. Multiple
challenges make it harder for enterprise search systems to be as accurate as search
systems focused on the Internet. First, there is no hierarchy among documents like the
hyperlinks used by Google’s Page Rank algorithm [2]. Second, the traffic on the
enterprise search system is a fraction of that of a system like Google, which is used for
more than five billion searches every day as per Internet Live Statistics as of publi-
cation date. Third, every enterprise has a highly specialized taxonomy or vocabulary
that is different than other enterprises [3]. Sometimes there are differences even among
different departments of business units of enterprises.
Search has kept evolving for both, searching for textual answers within enterprises
and over the Internet. Particularly, the concept of Natural Language Queries has
become popular, wherein the search string is just a sentence in natural language and the
search system can parse it for relevant keywords to complete the search [4]. Over time
semantic search has also been discussed, which reduces dependence on typing exact
keywords to get the right answers [5].
As enterprise search evolves there is a dearth of widely accepted objective metrics
that can measure and compare performance of various enterprise search systems. In this
paper we take such a metric for machine reading comprehension and modify it for
enterprise search. We discuss the modifications in detail and provide source code for
others to evaluate their own systems.

2 Objectives of NextGen Enterprise Search

Today, with more sophisticated techniques like artificial intelligence, enterprise search
can achieve new successes despite the challenges enumerated above. As a research
team focused exclusively on enterprise search, and working with numerous enterprises
to solve these problems, we define Next Generation (NextGen) Enterprise Search as a
search system with the following objectives:
A. Accept a Natural Language Query and return a nugget of information that answers
the query. This answer could be a specific data point, a text snippet or an image,
but should not be a document or a list of documents. If such an answer does not
exist, the search system should not return any answers. If multiple such answers
exist, the search system should return all these answers.
B. Complete the objective A, as stated above, by focusing on the intent behind the
question and not on keywords. If the user asks the same question using a different
set of words such that the meaning of the question does not change, the answer
should not change either. On the other hand, if for the same question, if any
relevant information in the context changes, the answer should change to reflect the
new intent based on the changed context.
C. Complete the objectives A and B, as stated above, within a latency of five seconds.
There is an inverse correlation between accuracy and latency for most search
systems used for enterprise search [6]. Fixing the upper limit of latency levels the
playing field.
Different teams may use different objectives; however, there is a need for a metric
that can evaluate the performance of algorithmic search systems against any such set of
objectives. In the rest of the paper we call this metric NextGen Enterprise Search
Accuracy.

3 SQuAD as a Metric for Reading Comprehension

Since its publication in 2016, a scoring system based on Stanford Question Answering
Dataset (SQuAD) [7] has emerged to be a well-accepted methodology for comparing
different algorithmic approaches to reading comprehension (MRC). The SQuAD team
used more than 500 publicly available articles in Wikipedia and crowdsourced more
than 100,000 questions about these articles. The team required crowdworkers to enter
questions as free form text and mark one or multiple answers as spans within the
article. This ensures that answers are always contained in the articles. The team also
gave standard scripts to generate two scores for any reading comprehension system:
• Exact Match (EM) Score gives the percentage of time a predicted answer exactly
matched the expected answer, controlling for punctuation.
• F1 Score is the average of the F1 scores of each question, computed from the precision
and recall of tokens in the predicted answer vs. tokens in the expected answer (a simplified
sketch of both scores follows this list).
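
A simplified sketch of the two scores is given below; the official SQuAD evaluate script additionally strips articles and punctuation during normalization and takes the maximum score over all gold answers for a question.

from collections import Counter

def normalize(text):
    return " ".join(text.lower().split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1(prediction, gold):
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Barack Obama", "Barack Obama"), f1("President Barack Obama", "Barack Obama"))
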
In 2018, Rajpurkar et al. enhanced the dataset by crowdsourcing another set of
more than 50,000 questions that were unanswerable by the original set of articles [8].
As of the date of this publication, these works have been cited more than 743 times by
major publications as per Google Scholar [9]. At least 144 successful evaluations of
different algorithms have been completed against the prescribed methodology in
SQuAD as per the SQuAD leaderboards [10]. Of these attempts, an ensemble approach
by Joint Laboratory of HIT and iFLYTEK Research has achieved an Exact Match score
of 87.147, which is higher than the Exact Match score of 86.831 achieved by humans.

4 SQuAD Scores for Calibrated Quantum Mesh (CQM)

To illustrate the need for a new metric for NextGen Enterprise Search Accuracy other
than SQuAD scores, we used a search system based on an algorithm called Calibrated
Quantum Mesh (CQM) [11]. Other algorithms or search systems can be used for
establishing NextGen Enterprise Search Accuracy.
We chose CQM because it has been developed specifically for information retrieval
from natural language text corpora. The objectives are given in Sect. 2. It avoids many
problems related to use of other machine learning algorithms, notably Deep Learning,
in enterprise search systems. Specifically, using CQM, unsupervised learning is pos-
sible without any kind of annotation, tagging or structuring, which leads to a signifi-
cantly lower effort in training a search system.

Fig. 1. Basic tenets of Calibrated Quantum Mesh (CQM)

CQM works on three basic principles, as shown in Fig. 1. First, it recognizes that
any symbol, word or text can have more than one meaning (or quantum state) with
different probabilities. Second, it recognizes that everything is correlated in a mesh.
Each node modifies other’s probability distribution across quantum states. Finally,
CQM sequentially adds all available information to help converge the mesh into a
single meaning. The calibrations are implemented using training data, contextual data,
reference data and other known facts about the problem. These calibrating systems are
called Calibrating Data Layers. When the training data is passed through CQM, it
defines many of the mesh’s interrelationships. Where applicable, data layer algorithms
learn from such data.

Table 1. Score of CQM based search method as per SQuAD methodology.


F1 Score 4.56%
EM Score 0.00%

We used a search system based on CQM to predict answers on the SQuAD


questions. The scores are shown in Table 1. These scores are for the training corpus
itself, and not for the validation corpus. The training corpus has 87,599 questions based
on 442 articles. To reproduce these results, please complete the steps given below.
1. Download the output of the CQM based search system as run on SQuAD with one
answer per question from this link and unzip:
https://www.dropbox.com/s/wqgslqeppu3td94/cqm-output-1.zip?dl=0
2. Download the SQuAD training data and unzip from:
https://www.dropbox.com/s/id51mfcymdrox8i/train-v1.1.json.zip?dl=0
3. Download the SQuAD scoring script from:
https://worksheets.codalab.org/bundles/0x4c6febb3f9574587a6729b23b5e2f290
4. On the command prompt run: python evaluate-v1.0.py train-v1.1.json cqm-output-1.json
5. Look for the output string: “f1”: 4.561697533651976, “exact_match”: 0.0
The script took 26.18 s to complete on a standard Mac laptop with 8 GB RAM and
Intel’s i5-8250U CPU.

5 Need for a New Metric

The scores in Table 1 suggest that the search system based on CQM used in this
experiment is very limited in its reading comprehension capabilities. However, from
our experience applying CQM in other situations we know that the search system
performs much better. For example, CQM was used to develop an Intelligent Machine
for Document Preparation at Eli Lilly [12]. The CQM based search system achieved an
accuracy of 89% when used by Eli Lilly’s scientists. While many other situations have
not been published in academic journals, we have observed that the CQM-based search system consistently achieves between 87% and 98% on NextGen Enterprise Search Accuracy.
We hypothesize that this discrepancy exists because, while SQuAD scores are a great metric for reading comprehension, they do not reflect the real expectations of business enterprises. First, while SQuAD focuses on getting the exact information, for enterprises finding the snippet or image containing that information is sufficient. Second, SQuAD insists on getting this answer as the first and only answer from the search system, whereas for enterprises it is sufficient to produce the right answer among the top few. For all analysis in this paper, we have assumed that if the right answer is among the top three, it is sufficient for an enterprise search system. Third, SQuAD, in its second version (SQuAD 2.0), requires that a search system should not return any answer if the exact information is not present in the corpus. While this is consistent with the objectives of NextGen Enterprise Search, enterprise users do expect related information. The related information helps them conclude on their own that the answer is absent, which, in turn, helps them trust the search system better. Table 2 tabulates these differences between SQuAD expectations and the expectations of enterprise users.

Table 2. Key differences between Machine Reading Comprehension and NextGen Enterprise Search.

Focus
  Machine Reading Comprehension: Focuses on returning the exact answer, e.g., for the query “Who was the president in 2012?” the system must return “Barack Obama”.
  NextGen Enterprise Search: Allows a small text nugget containing the answer, e.g., for the query “Who was the president in 2012?” it accepts a sentence like “Barack Obama was the President in 2012.” as the correct answer.

Number of answers
  Machine Reading Comprehension: Insists on the correct answer being the only answer from the system.
  NextGen Enterprise Search: Allows the correct answer to be among several top answers. This number varies from use case to use case, but is typically 3.

Handling absent answers
  Machine Reading Comprehension: Insists on reporting a “No Answer Found” or equivalent if the answer is not available in the corpus.
  NextGen Enterprise Search: Requires a best-effort search so that the user can conclude that the answer is not available in the corpus.
While it is tempting to argue that MRC is a more sophisticated application than
enterprise search, and a more stringent set of metrics helps both applications, in truth
such a stark difference makes it impractical to apply SQuAD metrics for enterprise
search. We propose to leverage SQuAD’s strength in curated questions and answers,
and still use it meaningfully to evaluate algorithmic systems for enterprise search.

6 Proposed Metric to Measure NextGen Enterprise Search Accuracy

Based on this hypothesis, we propose a new metric to measure NextGen Enterprise Search Accuracy. We term it the Answer Score; it is the counterpart of SQuAD's EM Score. For each question, we take all expected answers and the top three predicted answers. If any expected answer is found to be completely contained in any of the three predicted answers, we count it as a success. The Answer Score for each successful question is set to one. The aggregate Answer Score for a search system is computed as the sum of Answer Scores across the corpus divided by the number of questions in the corpus.

If Q is the set of all questions, Ei is the set of expected answers for the question i,
and Pi the set of predicted answers, then we can define a function to determine whether
an answer was produced:

A(i) = \begin{cases} 1, & \text{if } \exists\, e \in E_i \text{ and } p \in P_i \text{ such that } e \text{ is a substring of } p \\ 0, & \text{otherwise} \end{cases} \quad (1)

Then,

\mathrm{AnswerScore} = \frac{\sum_{i \in Q} A(i)}{|Q|} \quad (2)

We also note that, in the proposed methodology, precision has little meaning. We therefore propose to replace SQuAD's F1 Score by a Recall Score. For each question, we create tuples of all expected answers and the top three predicted answers. For each tuple, we compute the Tuple Recall as the ratio of the number of tokens in the expected answer that are also present in the predicted answer to the total number of tokens in the expected answer. The Recall Score of each question is taken as the maximum Tuple Recall over all tuples related to the question. The aggregate Recall Score for a search system is computed as the arithmetic mean of the Recall Scores of all questions in the corpus.
If T(s) represents the set of tokens in string s, then recall of a string e versus another
string p is:

R(e, p) = \frac{|T(e) \cap T(p)|}{|T(e)|} \quad (3)

Further,

\mathrm{RecallScore} = \frac{\sum_{q \in Q} \max\{R(e, p) \mid e \in E_q,\, p \in P_q\}}{|Q|} \quad (4)
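To make the computation concrete, the following minimal sketch implements Eqs. (1)-(4); the function names, the whitespace tokenization, and the lower-casing are our own illustrative assumptions and may differ from the evaluation script linked below.

```python
# Illustrative sketch of the Answer Score and Recall Score from Eqs. (1)-(4).
# Names, tokenization and normalization are our own choices, not the released script.

def tokens(s):
    """T(s): the set of whitespace-delimited, lower-cased tokens of a string."""
    return set(s.lower().split())

def answer_hit(expected, predicted):
    """A(i): 1 if any expected answer is a substring of any predicted answer."""
    return int(any(e.lower() in p.lower() for e in expected for p in predicted))

def tuple_recall(e, p):
    """R(e, p): fraction of expected-answer tokens also present in the prediction."""
    te = tokens(e)
    return len(te & tokens(p)) / len(te) if te else 0.0

def scores(expected_per_q, predicted_per_q):
    """Aggregate Answer Score and Recall Score over all questions (Eqs. 2 and 4)."""
    answer_sum, recall_sum, n = 0.0, 0.0, len(expected_per_q)
    for qid, expected in expected_per_q.items():
        predicted = predicted_per_q.get(qid, [])[:3]   # top three predictions only
        answer_sum += answer_hit(expected, predicted)
        recall_sum += max((tuple_recall(e, p) for e in expected for p in predicted),
                          default=0.0)
    return 100.0 * answer_sum / n, 100.0 * recall_sum / n

if __name__ == "__main__":
    gold = {"q1": ["Barack Obama"]}
    pred = {"q1": ["Barack Obama was the President in 2012.",
                   "The White House", "2012 election"]}
    print(scores(gold, pred))   # (100.0, 100.0) for this toy example
```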

We used the same search system based on CQM but used the new metrics to
measure the NextGen Enterprise Search Accuracy. The scores are shown in Table 3.
To reproduce these results, please complete the following steps.
1. Download the output of the CQM based search system as run on SQuAD with three
answers per question from this link and unzip:
https://www.dropbox.com/s/517d1m0igt48ghw/cqm-output-3.zip?dl=0
2. Download the SQuAD training data and unzip from:
https://www.dropbox.com/s/id51mfcymdrox8i/train-v1.1.json.zip?dl=0
3. Download the new metric scoring script from this link and unzip:
https://www.dropbox.com/s/qnb8wskte1j6z2o/evaluate-cqm.zip?dl=0
4. On the command prompt, run: python evaluate-cqm.py train-v1.1.json cqm-output-3.json

5. Look for the output string: “f1”: 5.20978433300829, “exact_match”: 0.0, “r1”: 92.4494139313303, “para_match”: 88.42110069749654
The new script took 146.09 s to complete on a standard Mac laptop with 8 GB
RAM and Intel’s i5-8250U CPU, compared to 26.18 s taken by original SQuAD script
on the same computer. This is expected because of the additional computations and the
fact that the new script compares three answers instead of one.

Table 3. Score of CQM based search method as per the proposed methodology
F1 Score 5.21%
EM Score 0.00%
Recall Score 92.45%
Answer Score 88.42%

These results are more in line with expectations of the performance of CQM.

7 Conclusion and Further Work

The Stanford Question Answering Dataset (SQuAD) is a generally accepted benchmark for algorithmic approaches to machine reading comprehension (MRC). Based on our experience with numerous real-world situations, we noted that the objectives of enterprise search differ from those of MRC. We proposed the Answer Score and the Recall Score, based on SQuAD's EM and F1 Scores respectively, as new metrics for NextGen Enterprise Search Accuracy.
We used a search system based on CQM to compare the two. As illustrated in Table 4, the new metric gives results closer to the accuracies observed in real-world situations, supporting the hypothesis. We invite other teams to use different search
systems to compare the two metrics, as long as the candidate search system meets all
the objectives outlined in Sect. 2. The python script to implement the proposed metrics
is enclosed with the paper as well as hyperlinked within it.

Table 4. Score of CQM based search method as per the various metrics and observations
SQuAD metric (EM): 0.00%
Proposed metric (Answer Score): 88.42%
Observed accuracy at various clients (user annotated): 87 to 98%

References
1. Hawking, D.: Challenges in enterprise search. In: Proceedings of the 15th Australasian
Database Conference (ADC 2004), vol. 27, pp. 15–24. Australian Computer Society, Inc.,
Darlinghurst (2004)
2. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing
order to the web (1999)
3. Guha, R., McCool, R., Miller, E.: Semantic search. In: Proceedings of the 12th International
Conference on World Wide Web (WWW 2003), pp. 700–709. ACM, New York (2003)
4. Voigt, C.A., Gordon, D.B., Mayo, S.L.: Trading accuracy for speed: a quantitative
comparison of search algorithms in protein sequence design. J. Mol. Biol. 299(3), 789–803
(2000). Edited by J Thornton
5. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine
comprehension of text. CoRR, abs/1606.05250 (2016)
6. Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for
SQuAD. CoRR, abs/1806.03822 (2018)
7. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: Squad Rajpurkar - Google Scholar (2019).
Accessed 21 May 2019
8. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: The Stanford Question Answering Dataset
(2019). Accessed 21 May 2019
9. Kulkarni, R., Kulkarni, H., Balar, K., Krishna, P.: Cognitive natural language search using
calibrated quantum mesh. In: 2018 IEEE 17th International Conference on Cognitive
Informatics Cognitive Computing (ICCI*CC), pp. 174–178, July 2018
10. Viswanath, S., Yates, M., Burt, J., Yazell, J., Kuhr, R., Strum, B., Krishna, P., Balar, K.,
Kulkarni, R., Kulkarni, H., Fennell, J.: An intelligent machine for document preparation. In:
AICHE Annual Meeting, October 2018
11. Han, K.H., Park, J.W.: Process-centered knowledge model and enterprise ontology for the
development of knowledge management system. Expert Syst. Appl. 36(4), 7441–7447
(2009)
12. Redfern, D.M.: Natural language meta-search system and method. US Patent 6,078,914, 20 June 2000
Case-Based Generation of Regulatory
Documents and Their Semantic
Relatedness

Andreas Korger1(B) and Joachim Baumeister2,3

1 Angesagt GmbH, Dettelbachergasse 2, 97070 Würzburg, Germany
a.korger@angesagt-gmbh.de
2 denkbares GmbH, Friedrich-Bergius-Ring 15, 97076 Würzburg, Germany
3 University of Würzburg, Am Hubland, 97074 Würzburg, Germany

Abstract. Regulatory documents are required or provided by authorities in many domains. They commonly point out the incidents relevant for specific scenarios, and for these incidents they have to present suitable preventive and reactive measures. We introduce an approach that connects a case-based description of the incident structure with a pre-calculated word embedding and describe how to adapt this word embedding to different contexts. This paper shows how to use case-based methods to retrieve, adapt, and reuse incident descriptions. Subsequently, they are used to generate new regulatory documents via case-based reasoning.

Keywords: Case-based reasoning · Experience management · Knowledge management · SKOS · Semantic relatedness · Natural language generation

1 Introduction

A regulatory document describes incidents that are likely to happen in a certain situation. Preventive measures are elaborated to avoid the occurrence of relevant incidents. Nevertheless, sometimes incidents occur; for this case a regulatory document proposes adequate reactions. Further, harmful consequences are to be avoided or mitigated. This underlying structure is reflected in the document's structure [13,14]. Popular examples are regulatory documents for public events, for the handling of hazardous material, or for industrial workplace safety. For a festival, a regulatory document would describe incidents such as terror attacks, storm, and fire, and relevant measures such as the allocation of fire extinguishers and the evacuation of the event site.
In this work we exploit the special kind of similarity of regulatory docu-
ments. We present an ontology for the representation of these special parts of
documents. We use natural language processing to connect graph-based and
textual knowledge representation. Our goal is to retrieve passages of documents
depicting such knowledge and to adapt the passages for usage in another context.

For this, we implement and adapt techniques like word and sentence embeddings
in a small corpus.
For generating a new document, we need to answer three questions. Which incidents are likely to happen? Which preventive and reactive measures are suitable for each incident under a certain context? How important is each measure? The last question reflects the fact that a limited budget and limited time do not allow for the implementation of all preventive measures. Additionally, in the case of an incident there may only be time to take limited action and implement some of the available reactive measures. Altogether, this calls for a suitable context-based ranking of the incidents and measures. The presented approach is a general framework and easily adaptable to domains providing textual as well as ontological information for the context-dependent classification of, prevention of, and reaction to incidents.
For reasons of simplification and consistency we give examples from the domain of public events. We therefore present a domain ontology for the classification and security assessment of public events. The ontology is also capable of modeling the corresponding regulatory documents. Our approach is driven by theoretical and case-based considerations which we describe in the first part of this work. Afterwards we show a novel technique for the adaptation of word vector spaces. Then we present the experimental setup we used to conduct a case study showing the practical capabilities of our approach. We finish with related work as well as discussions and future work.

2 Ontology for the Modeling of Regulatory Documents

In this section we first introduce the fundamental ontological structures for the modeling of regulatory documents. Then we extend this structure to be capable of representing incidents, measures, and special relations between them. Finally we explain why we use the methods of case-based reasoning and introduce our case-based architecture. We make the assumption that there exists a corpus of regulatory documents of a certain domain. The documents are sub-classified into passages that are connected with incidents or the corresponding measures. Those passages of text are called information units. The work of identifying these passages was done by domain experts. Figure 1 shows the annotation of a document using a hierarchical classification structure.
All information beyond the textual corpus of available documents is coded
into a knowledge base [4]. This knowledge base consists of entities (ontological
concepts) as logical units and the relations between them. An entity may be
for instance an action, an agent, an event, or a resource. In a textual context, an entity is coded or described as a single word (term) or as several words, up to a few sentences. Entities may be composed of sub-parts in an arbitrary manner. In the
following, we introduce the basic concepts of our scenario.

Fig. 1. Semantic annotation of a textual regulatory document.

2.1 Ontology for the Modeling of Regulatory Documents in the Domain of Public Events

Fig. 2. The elements of the security document development workflow: Event Classification (ECLA) is the base for Risk Management (PERM), which identifies Security Incidents (SECRI) used to create the Security Concept (SECCO); the whole workflow follows a case-based approach.

Due to the tense international security situation, the implementation of regulatory security documents for public events has become a crucial requirement. The elaboration of a security document is in fact mandatory for the official approval of larger events. Additionally, major events require detailed planning to minimize security risks. For the modeling and description of official security documents for public events we developed the SECCO ontology (SECurity Concept Ontology). The ontology is the base for a general approach to structuring regulatory documents for public events according to diverse recommendations, as depicted in Fig. 2. The ontology makes relevant content of such documents accessible, comparable, and interchangeable.

Fig. 3. The hierarchical structure of security concepts (secco:Document, secco:SecurityConcept, secco:Chapter, skos:Concept, prov:Entity).



We use the SKOS ontology (Simple Knowledge Organization System) [22] as an upper ontology to represent informational units and the relations between them. The security concept ontology models the actual regulatory document in OWL2 [21]. The class secco:SecurityConcept is defined as a subclass of secco:Document, and merges the textual and conceptual character of a regulatory document. Therefore, secco:Document is a subclass of skos:Concept and prov:Entity [23] in order to interlink the semantics of these concepts, as depicted in Figs. 3 and 4. This clarifies that we consider a real document (prov:Entity) but also a hierarchical concept within a formal knowledge system (skos:Concept). The PROV ontology [16] provides access to model documentary information and document provenance in a standardized manner. A document has authors (prov:Agent) and a final version that is developed step by step, releasing intermediate versions through modifications (prov:Activity). The document substructure is refined by secco:Chapter as a subclass of skos:Concept.
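As an illustration of this class layout, the following minimal sketch (assuming the rdflib library and a placeholder namespace URI for secco, since the paper does not publish the ontology IRI) reproduces the subclass axioms described above.

```python
# Sketch of the class hierarchy described above, built with rdflib.
# The secco namespace URI below is a placeholder of our own.
from rdflib import Graph, Namespace, RDF, RDFS, OWL

SECCO = Namespace("http://example.org/secco#")   # placeholder URI
SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("secco", SECCO)
g.bind("skos", SKOS)
g.bind("prov", PROV)

for cls in (SECCO.Document, SECCO.SecurityConcept, SECCO.Chapter):
    g.add((cls, RDF.type, OWL.Class))

# secco:Document merges the documentary and the conceptual view
g.add((SECCO.Document, RDFS.subClassOf, SKOS.Concept))
g.add((SECCO.Document, RDFS.subClassOf, PROV.Entity))
# secco:SecurityConcept and secco:Chapter refine the document structure
g.add((SECCO.SecurityConcept, RDFS.subClassOf, SECCO.Document))
g.add((SECCO.Chapter, RDFS.subClassOf, SKOS.Concept))

print(g.serialize(format="turtle"))
```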

Fig. 4. Properties linking concepts in the SECCO ontology (secco:hasPart, secco:hasMandatoryPart, secco:partOf, secco:mandatoryPartOf, secco:coveredByConcept, secco:partOfGermanGovRecommendations, secco:partOfBavarianGovRecommendations, and secco:partOfDINRecommendation, related via rdfs:subPropertyOf and owl:inverseOf to skos:broader, skos:narrower, and skos:related).

Definition 1. Let KB = (E, R, D) be a knowledge base. Let E ⊂ KB be the set of available entities (ontological concepts). Let I ⊂ E be the set of known incidents. Let M ⊂ E be the set of known measures. Let R ⊆ E × E be the relations between elements of E. Let D be the set of available documents D = {d_1, .., d_j}. Let U = {(u ∈ d_i) | d_i ∈ D} be the set of available information units contained in D and T the set of terms used to textually build them.

It is very important for the assessment of safety measures to pay attention to the context in which they are applied. For instance, supplying rescue boats at a festival F1 beside a river makes sense, but for a festival F2 in the forest it is pointless. The same holds for relevant incidents: some incidents are more relevant in a certain context because they are more likely to happen than others. The context
are the relevant parameters (features) of the environment. If the parameters change, the relevant incidents, the corresponding measures, and the importance of both change. Looking at the documentary corpus D, the context is, for instance, represented by certain parameters fulfilled by the content of each document, or by the parameters all documents have in common.
Definition 2. Let KB = (E, R, D) be the knowledge base. For an entity ei ∈ E
let Cei ⊆ E \ {ei } be the context of ei with Cei = {c1 , .., cj }.
For instance, for the previous entities F1 and F2 the context CF1 would be
near river and CF2 in the forest. We want to consider relations making enti-
ties a preventive or reactive measure to incidents. This means to focus on the
chronological order of the execution. In some domains measures are classified
into before, during and after an incident. We consider the relational classes dur-
ing and after as unified. A measure that is taken before an expected incident is
a preventive measure, a measure that is taken during or after an incident is a
reactive measure. A measure may be of preventive as well as of reactive character.
Definition 3. Let R_{PM-C} ⊆ M × I be a relation under a context C, indicating which measures are taken in this context C before an incident, making them preventive measures. Let R_{RM-C} ⊆ I × M be the analogous relation, indicating which measures are taken after the occurrence of an incident, making them reactive measures.
Additionally we rely on an importance ranking of incidents and measures
under a given context. The importance is quantified by assigning a value between
1 for important and 0 for not important.
Definition 4. Let IMP(i, C) ∈ ]0, 1] be the importance of an element of I under the context C. Let IMP(m, i, C) ∈ ]0, 1] be the importance of a measure m for the incident i under the context C.
For a given context and relevant incident induced by the context the accord-
ing measures are ordered by importance and classified into preventive and reac-
tive measures. Altogether they build a kind of facilitated process snippet we call
PIRI (Preventive-Incident-Reactive-Interrelation). The presented model simpli-
fies the real world for facilitation of assessment. Typically there is a cascade of
measures that are executed in a specific order, e.g. in case of fire first evacuate
all people, then close the doors and windows. A PIRI-snippet is formally defined
as follows.
Definition 5. Let KB = (E, R, D) be the knowledge base. Let i ∈ I be an incident and C ⊆ E a context. Then, we define a PIRI-snippet PIRI(i, C) = {C, PG}, where PG = (N, E) is a directed graph. We call PG the PIRI graph, with nodes N = (C ∩ M) ∪ {i}, all measures mentioned by C and the incident i, and edges E = {R_{PM-C} ∪ R_{RM-C}} ∩ {i}, all edges containing the node i. The graph is weighted with node weights IMP(N, C) and edge weights IMP(E, i, C).
Figure 5 shows a PIRI-diagram for the incident I1 under the context C con-
sisting of several contextual elements (c1 , c2 , c3 , ..).

Fig. 5. PIRI-diagram under a context C = (C1, ..., Cj) showing the ranked preventive and reactive measures with the corresponding importance weights.
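For illustration, a PIRI-snippet can be held in a small data structure such as the following sketch; the field names are our own choice, and the example values are taken from the storm example discussed later in Sect. 4.3.

```python
# Illustrative data structure for a PIRI-snippet (Definition 5).
# Field names are our own; the weights mirror the storm example of Sect. 4.3.
from dataclasses import dataclass, field

@dataclass
class PiriSnippet:
    incident: str
    context: tuple                                    # contextual elements (C1, C2, ...)
    preventive: dict = field(default_factory=dict)    # measure -> IMP(m, i, C)
    reactive: dict = field(default_factory=dict)      # measure -> IMP(m, i, C)

snippet = PiriSnippet(
    incident="Storm",
    context=("NearRiver", "OpenAir"),                 # assumed context for illustration
    preventive={"WeightTents": 0.9, "GetWeatherForecast": 0.8},
    reactive={"CallFireDepartment": 0.9, "FullEvacuation": 0.8},
)
```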

2.2 Ontological Representation of the PIRI-Structure

Fig. 6. Ontology interrelation and document creation workflow: Event Classification (ECLA) is the base for Risk Management (PERM), which identifies Incidents and Measures (SECRI) used to create the Document Structure (SECCO).

We extend the SECCO ontology by the definition of incidents, measures, and the PIRI-snippet. The existing ontology was used for the classification of public events (O_ECLA) and the structuring of the corresponding regulatory documents (O_SECCO), as depicted in Fig. 6. For the ontological description of incidents we continued the elaboration of the SECRI ontology (O_SECRI). The ontology
lic events. For the ontological implementation of a PIRI-snippet we introduce
the analogous classes and interweave them with the documentary structure. An
information unit is represented as a secri:InformationUnit. This passage of text
semantically targets a secri:Incident or secri:Measure and is part of a document
represented by secco:Document. Incidents and measures are subsuming classes at the top of a hierarchy. In the case study we will see an example of this hierarchy in the domain of public events. A graphical excerpt of the ontology can be seen in Fig. 7.

Fig. 7. Class representation of a PIRI-snippet (secri:InformationUnit is part of a secco:Document and targets a secri:Incident, a secri:Measure, or a secri:Importance; secri:PreventiveMeasure and secri:ReactiveMeasure are subclasses of secri:Measure).

2.3 Case-Based Representation of the PIRI-Structure

Case-based reasoning is a computational reasoning method. The paradigm is based on the assumption that similar problems have similar solutions. A problem together
with its solution is considered as one case. A case base containing already solved
past problems is queried for the solution of new problems. The most similar case
is retrieved by the use of a similarity function. The past solution is adapted to
the present problem. If the adapted solution was successful a new case can be
added to the case base. This mechanism makes the case-based system learn and
the problem solving capacity is increased [6].
A convenient case-based representation for the so far described scenario inter-
nalizes the document description of incidents and measures. For each incident
mentioned by the regulatory document the preventive and reactive measures are
combined into a PIRI-snippet. We choose a structural case representation using
attributes and their values [6].

Definition 6. A case c_1 = (d_1, l_1) is defined by the incident and its context as problem description d_1 = {C_1, i_1}, and by its solution l_1 = {(C_1), (m ∈ R_{PM-C_1} ∩ i_1), (m ∈ R_{RM-C_1} ∩ i_1)}, the combination of measures targeting the incident i_1 under a context C_1, separated into preventive and reactive measures.

The problem descriptions and the solutions are conjunctions of elements of the knowledge base. The context may be replaced by a unique identifier naming the context without citing every component. For instance, C_1 = RegDocument1, then d_1 = {RD1, RainStorm} and l_1 = {(RD1), (WeightTents ∧ GetForecast), (CloseLiquidGas ∧ LockDoors ∧ GetRainCoat ∧ Evacuate)}.

Definition 7. The case base CB = {c1 , ..., cm } is the collection of all cases ci
extracted from available regulatory documents and constructed as described before
as PIRI-snippets. A query q to the case base is a conjunct subset of (negated)
measures and incidents.

For instance, the query q_1 = CloseDoors ∧ ¬LockDoors ∧ Evacuation ∧ RainStorm retrieves all other PIRI-snippets containing an evacuation and a closing, but not a locking, of doors.
To retrieve cases, we search the case base for similar problem descriptions di
for the query q1 . To define a similarity function, we consider all preventive mea-
sures, reactive measures and the incident as individual sub-parts. Each of these
parts is then compared by a local similarity measure. With an aggregation func-
tion a global similarity measure is composed by weighting with the parameters (ω_P, ω_I, ω_R) and summed up as follows:

\mathrm{Sim}_{\mathrm{PIRI}}(c_k, c_l) = \big(\omega_P\, \mathrm{Sim}_P(P_k, P_l) + \omega_I\, \mathrm{Sim}_I(i_k, i_l) + \omega_R\, \mathrm{Sim}_R(R_k, R_l)\big) / 3 \quad (1)

The incidents and measures are classified by a taxonomy that was derived from the connected ontology, building the base for the similarity assessment and adaptation. The local similarities Sim_P, Sim_I, Sim_R are calculated via the taxonomic order of their elements. The incidents I and the measures P, R are hierarchically structured. Each element of the hierarchy is assigned a likelihood symbolizing the similarity of its sub-elements. The similarity of the leaf elements is set to 1 and that of the root element to 0. The similarity increases with the depth d of the element according to, for instance, sim_d = 1 − 1/2^d [2]. If we want to compare two PIRI-snippets it is desirable to consider the context. For this reason we define the following extended similarity measure under the context C:

\mathrm{Sim}_{\mathrm{Context}}(c_k, c_l) = \big(\omega_1\, \mathrm{Sim}_{\mathrm{PIRI}}(c_k, c_l) + \omega_2\, \mathrm{Sim}_C(\mathrm{Context}_k, \mathrm{Context}_l)\big) / 2 \quad (2)

The context may, for instance, be the fulfillment of a classification hierarchy describing the environmental parameters, as can be seen in Fig. 8.

Fig. 8. Excerpt of the event type taxonomy: EventType (0) with the sub-types SportEvent, MusicEvent, and Festival (0.5 each), and the leaves Running, MartialArts, Motorsport, Rock, Classical, StreetFestival, and ChristmasFair (1 each).

For instance, security measures under a context of high consumption of alcoholic beverages are to be considered differently than under a context of low alcohol consumption. Sim_C is therefore set to the similarity function used in that scenario, weighted by the weights ω_i ∈ [0, 1], which act as biases.
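The following sketch illustrates the taxonomic local similarity sim_d = 1 − 1/2^d and the aggregations of Eqs. (1) and (2). It is not the myCBR implementation used in the paper; in particular, comparing two sets of measures by a symmetric best-match average is our own simplifying assumption.

```python
# Sketch of taxonomy-based local similarity and the aggregated similarities
# of Eqs. (1) and (2). Illustrative only; the paper models these in myCBR.

def taxonomic_sim(depth):
    """Similarity value of an inner taxonomy node at the given depth,
    following sim_d = 1 - 1/2^d (root = 0, deeper nodes closer to 1)."""
    return 1.0 - 1.0 / (2 ** depth)

def local_sim(a, b, ancestors):
    """Similarity of two taxonomy elements: 1 if equal, otherwise the value of
    their deepest common ancestor. `ancestors` maps element -> ancestors from root."""
    if a == b:
        return 1.0
    common = [x for x in ancestors.get(a, []) if x in ancestors.get(b, [])]
    return taxonomic_sim(len(common) - 1) if common else 0.0

def set_sim(A, B, ancestors):
    """Compare two sets of measures by a symmetric best-match average
    (a simplifying assumption of this sketch)."""
    if not A or not B:
        return 1.0 if set(A) == set(B) else 0.0
    def directed(X, Y):
        return sum(max(local_sim(x, y, ancestors) for y in Y) for x in X) / len(X)
    return (directed(A, B) + directed(B, A)) / 2.0

def sim_piri(ck, cl, ancestors, wP=1.0, wI=1.0, wR=1.0):
    """Eq. (1): cases are dicts with a preventive set 'P', an incident 'i', a reactive set 'R'."""
    return (wP * set_sim(ck["P"], cl["P"], ancestors)
            + wI * local_sim(ck["i"], cl["i"], ancestors)
            + wR * set_sim(ck["R"], cl["R"], ancestors)) / 3.0

def sim_context(ck, cl, ancestors, sim_c, w1=1.0, w2=1.0):
    """Eq. (2): combines the PIRI similarity with a context similarity sim_c."""
    return (w1 * sim_piri(ck, cl, ancestors) + w2 * sim_c(ck, cl)) / 2.0

# small demo on the taxonomy excerpt of Fig. 8
taxonomy = {"Rock": ["EventType", "MusicEvent"],
            "Classical": ["EventType", "MusicEvent"],
            "Running": ["EventType", "SportEvent"]}
print(local_sim("Rock", "Classical", taxonomy))  # 0.5 (common parent MusicEvent)
print(local_sim("Rock", "Running", taxonomy))    # 0.0 (only the root in common)
```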

2.4 Constrained-Based Extension of the PIRI-Structure


The importance ranking can also be used as an order of execution of mea-
sures. The most important measures have to be taken first. But sometimes less
important measures have to be taken before other, more important measures. This reflects the so-called concatenation of circumstances. It is therefore necessary to introduce a (partial) order of measures in addition to the order induced by the preventive/reactive classification and the importance ranking.

Definition 8. For two measures m1 and m2 the constraint m1 ≺ m2 states that


m1 should be taken before m2 .

An obvious problem is as follows. To avoid theft or unauthorized access, especially large buildings have to be locked after an evacuation. This can result in people being locked inside the building. In reality it is often too complex or not possible to describe, for each incident, a complete order in which the measures are to be taken. Additionally, in a multi-agent scenario it is very difficult to execute instructions that are too complex or too numerous. We therefore take the simple strategy of providing rules only for pairwise measures, as described before (Evacuate ≺ LockDoors).
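One simple way to realize such an ordering, sketched below, is a topological sort over the pairwise constraints that prefers measures with higher importance whenever no constraint intervenes; the concrete strategy and the example values are our own illustration, not a procedure prescribed by the paper.

```python
# Illustrative ordering of measures: rank by importance, then enforce the
# pairwise constraints of Definition 8 via an importance-aware topological sort.
# All constrained measures are assumed to appear in the importance mapping.
import heapq

def order_measures(importance, constraints):
    """importance: measure -> IMP value; constraints: iterable of (m1, m2)
    pairs meaning m1 must be taken before m2."""
    succ = {m: set() for m in importance}
    indeg = {m: 0 for m in importance}
    for m1, m2 in constraints:
        if m2 not in succ[m1]:
            succ[m1].add(m2)
            indeg[m2] += 1
    # max-heap on importance among the currently available measures
    heap = [(-importance[m], m) for m in importance if indeg[m] == 0]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _, m = heapq.heappop(heap)
        ordered.append(m)
        for n in succ[m]:
            indeg[n] -= 1
            if indeg[n] == 0:
                heapq.heappush(heap, (-importance[n], n))
    return ordered

# toy example with assumed importance values
print(order_measures(
    {"Evacuate": 0.8, "LockDoors": 0.9, "CallFireDepartment": 0.95},
    [("Evacuate", "LockDoors")]))
# -> ['CallFireDepartment', 'Evacuate', 'LockDoors']
```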

3 Retrieval Adaptation by Re-calculation of Pre-calculated Word Embeddings
Regulatory documents are often not exposed to the public. Therefore it is diffi-
cult to obtain a corpus that is sufficiently large to conveniently apply statistical
NLP-methods. For this reason we rely on the adaptation of pre-trained word
embeddings created from a reference corpus. In our scenario we deal with the
problem of exploiting an importance ranking for preferred treatment of informa-
tion units. This ranking has to be transferred to the textual retrieval process of
terms. In the following we suggest a new method of adapting pre-trained word
embeddings for this purpose. The process of creating word embeddings is of a statistical and averaging character, so the context-dependent change in the meaning of words is neglected.
For instance, the locking of doors is important to avoid theft while leaving an event site unsupervised. But in a scenario of a quickly spreading fire it is much more important to evacuate all people and to set a focus on this measure. A strategy to retrieve related entities is to consider all neighbors lying within a certain threshold around the corresponding word embedding.
We want to improve this by a more sophisticated way of retrieval. This shall
help to model certain scenarios within a word-vector-representation. This makes
the highly descriptive capability of word vectors accessible for the explanation
of a case-based assessment.

Definition 9. Let VS be a trained word vector space with its word vectors v_i(t_i) of a term t_i. Let A ⊂ VS = {a_k | k ∈ {1, 2, .., n}} be a set of n terms contained in the information units targeting the measures M_i relevant for an incident i.

Definition 10. Let ω_k ∈ ]0, 1] be the importance IMP(m_j, i, C) of an element a_k, where m_j is the measure it is contained in. If a term is contained in more than one measure m, the highest importance is used.

Definition 11. Let f_r : A → R be a function assigning a retrieval threshold to each element a_k of A depending on the weight ω_k. Let d_k = f_r(a_k) be the retrieval threshold around an element a_k. All elements {v_j ∈ VS : |v_j − a_k| < d_k} are considered as “retrieved”.

Figure 9 shows the two-dimensional projection of an exemplary word vector space VS1. In this vector space the case-based query q = CloseGas ∧ CloseDoors is depicted with its literal components. The thresholds are depicted as the circles around the marked vectors close, gas, and door. The retrieved elements lying within are colored gray. All information units containing a retrieved term are also considered as retrieved. This way of retrieval only captures local similarity; the constellation of the marked elements relative to each other is neglected.

Fig. 9. Retrieval thresholds d_i around marked elements a_i in a two-dimensional projection of an exemplary word vector space VS1.

An improved strategy would be to respect the influence of the weighted marked elements by changing all other vectors in a convenient manner. This can be imagined as a kind of bending of the vector space: every marked element causes a dimple or bulge in the vector space, drawing surrounding elements nearer or pushing them further away. The influences of each element bending the vector space are accumulated. In this manner an operator is constructed that adds a shifting vector to each vector of the pre-trained vector space. An additional effect is that the positions of the vectors towards each other in the area of influence are slightly changed. It is important, however, that the structure of the vector space is not changed so heavily that illogical changes arise; for instance, vectors should not overtake each other in the direction of the weighting element. Every weighted vector is surrounded by a retrieval threshold d that we call the event horizon. The event horizon is a hyperparameter that has to be adjusted to the pre-trained
vector space. Depending on the weights, elements are drawn inside the event
horizon or are pushed outside.
Definition 12. To calculate the word vector space VS′, each element v_j ∈ VS is updated by the operator SHIFT, where v′_j = SHIFT(v_j, A). Let f_s : (A, VS, ω) → R be a shifting function assigning the scalar norm of the shift to each pair (a_k, v_j) ∈ A × VS. Let (v_j − a_k) be the vector direction the shift has to be applied in. For each element of A the shifting components are summed up:

v'_j = \mathrm{SHIFT}(v_j, A) = v_j + \sum_{k=1}^{n} f_s(a_k, v_j, \omega_k)\,(v_j - a_k), \qquad n = |A|.
Figure 10 shows how the vector shifts of the weighting elements a_1, a_2, and a_3 are applied to move the element v_1 to its new position v′_1. This strategy works in vector spaces of any dimension. The effort needed for the calculation grows linearly with the number of weighting elements and the number of elements to shift. Additionally, the recalculation could be limited to a certain neighborhood around the weighting elements. The retrieval based on word embeddings is used in addition to or in combination with the case-based retrieval. If a term is not covered by the domain ontology, similar terms are retrieved by searching, e.g., for the most similar ontological concept covering the retrieved terms. Two informational units are compared by comparing the semantic distances of the words of the informational units. The values are aggregated and the outcome is the similarity of the two informational units. These similarity values are influenced by the recalculation of the word vector space [15].

Fig. 10. Recalculation of v_1 and v_2 in an exemplary word vector space VS1.
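The following numpy sketch illustrates the event-horizon retrieval of Definition 11 and the SHIFT operator of Definition 12. The concrete shifting function f_s is one possible choice of our own (the paper deliberately leaves its construction open); a negative, distance-decaying magnitude draws vectors towards an anchor without letting them overtake it.

```python
# Sketch of event-horizon retrieval (Def. 11) and the SHIFT operator (Def. 12).
# The shifting function f_s below is one assumed, illustrative choice.
import numpy as np

def retrieve(vs, anchors, thresholds):
    """Return indices of vectors lying inside the event horizon of any anchor."""
    hits = set()
    for a, d in zip(anchors, thresholds):
        dist = np.linalg.norm(vs - a, axis=1)
        hits.update(np.nonzero(dist < d)[0].tolist())
    return sorted(hits)

def f_s(a, v, weight, horizon=1.0):
    """Scalar shift magnitude: negative (towards a), scaled by the importance
    weight and decaying with the distance between v and a."""
    dist = np.linalg.norm(v - a)
    return -weight * np.exp(-dist / horizon)

def shift(vs, anchors, weights):
    """v'_j = v_j + sum_k f_s(a_k, v_j, w_k) * (v_j - a_k)   (Definition 12)."""
    shifted = vs.copy()
    for j, v in enumerate(vs):
        delta = np.zeros_like(v)
        for a, w in zip(anchors, weights):
            delta += f_s(a, v, w) * (v - a)
        shifted[j] = v + delta
    return shifted

# toy example in two dimensions
vs = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
anchors = [np.array([1.2, 1.0])]
print(shift(vs, anchors, weights=[0.9]))
```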

4 Case Study
We exemplify the previous approach by a case study in the domain of public
events. We started with 30 real world regulatory documents of different public
events of which the 15 most relevant were annotated manually by three differ-
ent domain experts. This corpus is the basis for the present evaluation. For the
annotation process we developed and evaluated several ontological components.
These were used for the classification of public events (O_ECLA) and the structuring of the corresponding security documents (O_SECCO). Table 1 shows the number of ontological concepts covered by each ontology.

Table 1. Number of ontological concepts for each ontology used.
O_SECCO (structuring): 278
O_ECLA (classification): 136
O_SECRI (incidents): 115
O_SECRI (measures): 72

4.1 Overview of the General Architecture

The architecture we use to build the knowledge-based system consists of different components. The multi-modal knowledge base contains all ontological, case-based, and text-based information and the corresponding tools. The text-based system holds a word vector model and the corpus of annotated documents.

Fig. 11. Case-based cycle of regulatory document assessment (steps 1–8: new problem, old problem, feature selection, query, retrieval, reuse and adaptation, solution generation, and retaining in the multi-modal knowledge base).

Figure 11 shows the user interaction and the case-based cycle of natural doc-
ument extraction and generation. We assume that there already exists a corpus
that has been processed using the following workflow of extraction. In step (1) a
new problem arises. That may be for instance that a new regulatory document
is required or an existing document has to be improved as shown in step (2). All
features are extracted out of the problem description and the old document at
step (3) and queried to the knowledge base at step (4). The retrieved features,
phrases, and documents are returned in step (5) and adapted in step (6), which requires interaction and work by the user. The new regulatory document is used (7) and retained in the corpus, enlarging the case base (8).
For the search in the textual corpus we used indexed text files. The index files and the corresponding term-frequency vectors were created using Apache Lucene [1]. The tool is an environment that allows for the usage of NLP techniques such as stop-word removal and stemming in the German language. It turned out that for our application the wikipedia-ratio (defined in Sect. 4.2) was almost 100%. This will surely be different in domains with a more exceptional vocabulary and domain knowledge. Figure 12 shows the workflow of breaking documents into reusable information units. Beginning with selected features, it shows how they can be put together to form a new document. It presents which methods are used on each level for extraction, retrieval, and adaptation.

Fig. 12. Workflow of decomposing and recomposing security documents. Document similarity assessment uses ECLA, TF-IDF, and SECRI, with adaptation by the user filling gaps; sentence similarity assessment uses PIRI and sentence embeddings to find and adapt similar information units; concept similarity assessment uses word embeddings, ontologies, and joint embeddings.

4.2 Retrieval of Similar Cases


To retrieve cases we need a structure for the comparison of different case representations. This structure is derived from a distinct ontology. For the ontological description of security incidents and the corresponding measures we elaborated the SECRI ontology (O_SECRI). The SECRI ontology describes the hierarchical context of security incidents and safety measures for public events. An excerpt of the ontology can be seen in Fig. 13. We extended the ontology by the capacity of modeling preventive and reactive measures for security incidents in the domain of public events. All ontologies were implemented using the semantic wiki KnowWE [5]. Amongst others, we introduce the new class secri:Measure as well as secri:PreventiveMeasure and secri:ReactiveMeasure as subclasses of secri:Measure.
For the case-based implementation we made use of myCBR [3]. The tool
supplies functionality for case representation and similarity modeling. The hier-
archically structured incidents and measures represented in the SECRI ontology
were exported to a myCBR model. The case-based attributes were arranged into
taxonomies as local similarity measures. Those were aggregated into a global sim-
ilarity measure for the assessment of the according PIRI-snippets. A number of
Fig. 13. Excerpt of the SECRI ontology showing 7 of 72 measures (secri:Measure with, among others, secri:Disaggregation, secri:ObeyAuthorities, secri:CrowdControl, secri:Reprimand, secri:InspectionMeasure, and secri:SiteDismissal).

relevant cases was extracted out of the corpus and installed in myCBR mak-
ing up the experimental case base. Table 2 shows the number of different cases
contained in the case base.

Table 2. Overview of different cases.
Different contexts: 15
PIRI-cases: 300
(Measures under context)-cases: 1500
Information units: 500

To evaluate the similarity assessment induced by the PIRI-strategy we constructed a postmortem analysis. This means taking every case of the case base and using it as a query to the same case base. Our similarity measures are constructed symmetrically; consequently, the query is commutative. As context we use the event classification ontology O_ECLA. The context of each PIRI-snippet is the set of fulfilled parameters classifying the event, extracted from the corresponding regulatory document. Figure 14 shows the ontological sub-part as the beginning of the workflow for the document generation.

Fig. 14. The elements of the security concept development workflow.

The pairwise similarities of the event classification cases are already available due to a postmortem analysis done in previous work and can be seen in Fig. 15. The case-based postmortem analysis was compared to a postmortem analysis of the same cases done by domain experts.

Pages 17 30 31 41 58 72 26 10 64 4 4 9 2 47 16
Coverage 11,90% 23,70% 23,70% 18,00% 22,30% 14,40% 15,50% 16,20% 30,22% 13,31% 12,23% 13,67% 16,91% 25,18% 23,38%
Event christm wine wine folk city carne folk music carne fair fair running camp arena campus
Event Case ecla0 ecla1 ecla2 ecla3 ecla4 ecla5 ecla6 ecla7 ecla8 ecla9 ecla10 ecla11 ecla12 ecla13 ecla14
christm ecla0 x 0.63 0.5 0.6 0.6 0.6 0.75 0.6 0.6 0.65 0.65 0.35 0.34 0.59 0.34
wine ecla1 0.63 x 0.65 0.6 0.48 0.48 0.75 0.35 0.48 0.65 0.53 0.35 0.4 0.34 0.34
wine ecla2 0.53 0.65 x 0.57 0.4 0.4 0.65 0.28 0.4 0.75 0.63 0.33 0.5 0.44 0.44
folk ecla3 0.6 0.6 0.57 x 0.68 0.68 0.6 0.68 0.68 0.54 0.45 0.5 0.33 0.57 0.33
city ecla4 0.6 0.48 0.4 0.68 x 0.75 0.6 0.75 0.75 0.53 0.4 0.43 0.28 0.43 0.18
carne ecla5 0.6 0.48 0.4 0.68 0.75 x 0.6 0.75 0.75 0.53 0.53 0.43 0.28 0.46 0.21
folk ecla6 0.75 0.75 0.65 0.6 0.6 0.6 x 0.6 0.6 0.62 0.53 0.35 0.4 0.65 0.4
music ecla7 0.6 0.35 0.28 0.68 0.75 0.75 0.6 x 0.75 0.24 0.28 0.43 0.28 0.46 0.21
carne ecla8 0.6 0.48 0.4 0.68 0.75 0.75 0.6 0.75 x 0.53 0.53 0.43 0.28 0.46 0.21
fair ecla9 0.65 0.65 0.75 0.54 0.53 0.53 0.62 0.24 0.53 x 0.75 0.33 0.44 0.38 0.38
fair ecla10 0.65 0.53 0.63 0.45 0.4 0.53 0.53 0.28 0.53 0.75 x 0.33 0.44 0.44 0.44
running ecla11 0.35 0.35 0.33 0.5 0.43 0.43 0.35 0.43 0.43 0.33 0.33 x 0.33 0.48 0.23
camp ecla12 0.34 0.4 0.5 0.33 0.28 0.28 0.4 0.28 0.28 0.44 0.44 0.33 x 0.38 0.63
arena ecla13 0.59 0.34 0.44 0.57 0.43 0.46 0.65 0.46 0.46 0.38 0.44 0.48 0.38 x 0.5
campus ecla14 0.34 0.34 0.44 0.33 0.18 0.21 0.4 0.21 0.21 0.38 0.44 0.23 0.63 0.5 x

Fig. 15. Results of the postmortem analysis for event classifications.

The domain experts were informed about the public events' parameters. Then they estimated the similarity of the events with regard to writing a regulatory security document for those events. Precision and recall [17] show how well the manual measure done by real persons matches the objective measure done by the case-based system. For the calculation we merged the evaluations of the three experts into one by neglecting multiple classifications and just considering whether an event was rated by one of the three experts, as depicted in Fig. 16. As recall and precision simply switch when we estimate how well the case-based classification matches the manual classification, we did not depict this information.

Fig. 16. Precision and recall of the evaluation, 0 = cbr, 1 = aggregated domain experts.

Each document of the corpus mentions about 20 different incidents. We now focus on the incident FireAndExplosion. For this incident we pairwise calculate the similarity of the corresponding PIRI-snippets. Afterwards, we apply the context and calculate the context-dependent similarity. Figure 17 shows, for each pair of documents, the similarity of the PIRI-snippet for the incident FireAndExplosion as well as the PIRI-similarity combined with the context. This comparison
makes clear where the influence of the context changes the similarity ranking of the retrieved PIRI-cases.

Fig. 17. Postmortem analysis of the PIRI-snippets for the measure FireAndExplosion
without and with the context of the event classification. The values show the similari-
ties SimPIRI | SimContext . The value for SimContext was calculated out of SimPIRI and
SimECLA which were weighted with 0.5 each. A significant change of the retrieval by
the incorporated context is marked bold.

In addition to the case-based assessment, the importance ranking for the PIRI-snippets was done manually by the domain experts. They were asked to choose a number between 0 and 1 for the importance of a security measure in case of each incident and relevant context. These values are used as weights for the adaptation of the corresponding word vector space. As the available corpus is too small to train word embeddings on it, we fall back to pre-trained word vectors. We chose Wikipedia as a convenient reference corpus. We want to know how well the domain vocabulary is covered by the Wikipedia corpus and therefore define the wikipedia-ratio as the percentage of words in a text having an article in Wikipedia. Entities of the domain vocabulary not covered by the reference corpus also have to be assigned a vector; strategies for this process are not part of this work. The pre-trained word embeddings of Wikipedia in the German language were provided by the text classification library fastText [9].

4.3 Generation of Adapted PIRI-Snippets


In the following we present the strategy for the textual construction of PIRI-snippets and their adaptation for reuse. The relevant textual elements were extracted from the corpus and transferred into the ontological structures. The next step is to find the context-dependent information and replace it to make the PIRI-snippets reusable. We therefore searched for all elements of the domain vocabulary. Everything else we considered normal text or context-related information that can be abstracted. A strategy for abstraction is to replace words by
their class name. For instance, a city name is replaced by location data or by the part-of-speech class. The following exemplary text for the incident storm shows how a corresponding passage of a security document would look in reality.
“Storm. Get weather information on a regularly basis from the Munich
weather station. Weight all tents with heavy material or fix with ropes. In case
of upcoming storm, evacuate the event site using the Franz Josef avenue and call
the fire department.”
The PIRI-snippet with exemplary importance values for this would be:
Preventive(WeightTents(0.9),GetWeatherForecast(0.8))
Incident(Storm)
Reactive(CallFireDepartement(0.9),FullEvacuation(0.8)).
An abstracted information unit for the measure FullEvacuation would be:
“[FullEvacuation][StopWord][EventSite][Verb][StopWord][LocationData]”
This information unit can, for instance, be adapted to the measure PartialEvacuation. The ontological concept FullEvacuation is replaced by a retrieved information unit for the new measure. The concept EventSite is, for instance, replaced by the more specific concept EventSiteComponent. This information can be retrieved from other cases because PartialEvacuation is commonly combined with EventSiteComponent. The concept LocationData has to be replaced by the contextual location information, which is left to the user. The stop words are inserted and corrected by an NLG tool or the user. The generated textual passage before stop-word correction and context correction looks as follows:
“[Partial evacuation of the affected area][the]
[EventSiteComponent][using][the][LocationData]”
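A minimal sketch of this slot-based adaptation is shown below; the replacement table stands for what a retrieved similar case would supply, and unresolved slots such as [LocationData] are left open for the user, as described above.

```python
# Illustrative slot-based adaptation of an abstracted information unit.
# The replacement table mimics what a retrieved similar case would provide;
# unknown slots remain open for the user or an NLG component.
import re

abstracted = "[FullEvacuation][StopWord][EventSite][Verb][StopWord][LocationData]"

replacements = {
    "FullEvacuation": "Partial evacuation of the affected area",
    "EventSite": "EventSiteComponent",   # more specific concept from a similar case
    "Verb": "using",
    "StopWord": "the",
    # "LocationData" stays unresolved: the user supplies the concrete location
}

def adapt(unit, table):
    """Fill every [Slot] for which an adaptation is known; keep unknown slots open."""
    return re.sub(r"\[([^\]]+)\]",
                  lambda m: "[" + table.get(m.group(1), m.group(1)) + "]",
                  unit)

print(adapt(abstracted, replacements))
# -> [Partial evacuation of the affected area][the][EventSiteComponent][using][the][LocationData]
```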

4.4 Results and Discussion


The results of the case study for the retrieval of similar information units are
very promising even without user support. The incorporation of the contex-
tual paradigm significantly improved the simulation of the real world scenario.
Regarding the generation of regulatory documents the results were quite good
when supported by the user. To answer the initial question, which incidents are
likely to happen, the context-based assessment can be used - similar context
points to similar incidents. Same holds for the measures that are suitable for an
incident. The importance of incidents and measures is made accessible by the
percentage of cases covering the incident or measure under a certain context.

5 Related Work
We started the search for related work with an overview of state-of-the-art publications in the domain of natural language generation presented by Gatt and Krahmer [10]. Most of the presented work requires a large corpus for the application of statistical methods. More suitable for our necessities seemed
grammar-based approaches. This led us to the idea of abstracting text by giving it a pseudo-grammar structure.
There exists some work for the assessment of incidents in different domains.
A similar approach we want to mention was presented by Sizov et al. [19]. The
work focuses on the extraction and the (case-based) adaptation of explanations
contained in incident reports in the transportation domain. That work differs from ours in that we aim for a holistic, document-oriented and ontology-based approach with user support for generation.
The concept of retrofitting word embeddings has already been used in different scenarios. For instance, Faruqui et al. [8] use the graph-based information of semantic lexicons for the improvement of word embeddings. To represent different senses of one word in a word embedding, Remus and Biemann [18] present a retrofitting strategy. For the similarity of sentences and documents the word mover's distance presented by Kusner et al. [15] is an accepted approach. A
promising work for a document metric incorporating weighted word-importance
vectors was presented by Huang et al. [11].
The structural integration of context into the case-based assessment was
covered by various authors. A conceptual revision of the context-based reasoning
paradigm was presented by Stensrud et al. [20]. Good work on the role of context in case-based reasoning was published by Khan et al. [12], and Craw and Aamodt [7] address the use of similar case clusters for representing context.

6 Conclusions

In this paper we presented a data structure called PIRI for the representation of a regulatory document describing incidents and the corresponding measures. After formally describing it, we transferred the structure into a case-based model. Using
this model a novel approach for the adaptation of similarity measures and word
embeddings to different contexts was shown. In a case study the approach was
applied to a corpus of regulatory documents of the domain of public events. A
case base was built and a postmortem analysis was done on it. The results of
the case study and the general architecture were discussed with domain experts.
What we leave for future work is the construction and semantic evaluation of different shifting functions. Depending on their influence on the word vectors, a meaning has to be assigned to them. An approach for the evaluation of the
retrofitted word vector space in comparison to the original one is needed. The
major task is to find an evaluation strategy working without assistance of domain
experts. Additionally the integration of grammar-based natural language gener-
ation approaches seems to be promising to adapt abstracted information units
to different contexts and help to reduce the needed user support.

References
1. Apache. Lucene: http://lucene.apache.org/
2. Bach, K., Althoff, K.-D.: Developing case-based reasoning applications using
myCBR3. In: Agudo, B.D., Watson, I. (eds.) Case-Based Reasoning Research and
Development, pp. 17–31. Springer, Heidelberg (2012)
3. Bach, K., Sauer, C., Althoff, K.D., Roth-Berghofer, T.: Knowledge modeling with
the open source tool myCBR. In: Proceedings of the 10th International Conference
on Knowledge Engineering and Software Engineering - Volume 1289, KESE 2014,
Aachen, Germany, pp. 84–94. CEUR-WS.org (2014)
4. Baumeister, J., Reutelshoefer, J.: The connectivity of multi-modal knowledge
bases. CEUR Work. Proc. 1226, 287–298 (2014)
5. Baumeister, J., Reutelshoefer, J., Puppe, F.: KnowWE: a semantic Wiki for knowl-
edge engineering. Appl. Intell. 35(3), 323–344 (2011)
6. Bergmann, R.: Experience Management. Springer, Heidelberg (2002)
7. Craw, S., Aamodt, A.: Case-based reasoning as a model for cognitive artificial
intelligence. In: Proceedings of the 26th International Conference, ICCBR 2018,
Stockholm, Sweden, 9–12 July 2018, pp. 62–77, July 2018
8. Faruqui, M., Dodge, J., Jauhar, S.K., Dyer, C., Hovy, E.H., Smith, N.A.:
Retrofitting word vectors to semantic lexicons. CoRR, abs/1411.4166 (2014)
9. fastText. https://github.com/facebookresearch/fastText
10. Gatt, A., Krahmer, E.: Survey of the state of the art in natural language generation:
core tasks, applications and evaluation. CoRR, abs/1703.09902 (2017)
11. Huang, G., Quo, C., Kusner, M.J., Sun, Y., Weinberger, K.Q., Sha, F.: Super-
vised word mover’s distance. In: Proceedings of the 30th International Conference
on Neural Information Processing Systems, NIPS 2016, pp. 4869–4877. Curran
Associates Inc., USA (2016)
12. Khan, N., Alegre, U., Kramer, D., Augusto, J.C.: Is ‘context-aware reasoning =
case-based reasoning’ ? In: Brézillon, P., Turner, R., Penco, C. (eds.) Modeling and
Using Context, pp. 418–431. Springer, Cham (2017)
13. Korger, A., Baumeister, J.: The SECCO ontology for the retrieval and generation
of security concepts. In: Cox, M.T., Funk, P., Begum, S. (eds.) ICCBR. Lecture
Notes in Computer Science, vol. 11156, pp. 186–201. Springer, Cham (2018)
14. Korger, A., Baumeister, J.: Textual case-based adaptation using semantic related-
ness - a case study in the domain of security documents. In: Wissensmanagement
Potsdam (2019)
15. Kusner, M.J., Sun, Y., Kolkin, N.I., Weinberger, K.Q.: From word embeddings
to document distances. In: Proceedings of the 32nd International Conference on
International Conference on Machine Learning - Volume 37, ICML 2015, pp. 957–
966. JMLR.org (2015)
16. Moreau, L., Groth, P.: Provenance: an introduction to PROV. Synthesis Lectures
on the Semantic Web: Theory and Technology. Morgan and Claypool (2013)
17. Perry, J.W., Kent, A., Berry, M.M.: Machine literature searching x. Machine lan-
guage; factors underlying its design and development. Am. Doc. 6(4), 242–254
(1955)
18. Remus, S., Biemann, C.: Retrofitting word representations for unsupervised sense
aware word similarities. In: LREC (2018)
19. Sizov, G., Ozturk, P., Marsi, E.: Let me explain: adaptation of explanations
extracted from incident reports. AI Commun. 30, 1–14 (2017)

20. Stensrud, B.S., Barrett, G.C., Trinh, V.C., Gonzalez, A.J.: Context-based reason-
ing: a revised specification. In: FLAIRS Conference (2004)
21. W3C: OWL2 Profiles, April 2009. http://www.w3.org/tr/owl2-profiles/
22. W3C: SKOS Simple Knowledge Organization System Reference, August 2009.
http://www.w3.org/TR/skos-reference
23. W3C: PROV-O: The PROV Ontology, April 2013. http://www.w3.org/TR/prov-o/
A Comparative Evaluation of Preprocessing
Techniques for Short Texts in Spanish

Marcos Orellana1, Andrea Trujillo1, and Priscila Cedillo1,2(&)

1 Universidad del Azuay, Av. 24 de Mayo 7-77, Cuenca, Ecuador
{marore,atrujillo}@uazuay.edu.ec
2 Universidad de Cuenca, Av. 12 de Abril, Cuenca, Ecuador
priscila.cedillo@ucuenca.edu.ec

Abstract. Natural Language Processing (NLP) is used to identify key information, generate predictive models, and explain global events or trends; it also supports the process of creating knowledge. Therefore, it is important to apply refinement techniques in major stages such as preprocessing, where data is frequently produced and processed with poor results. This document analyzes and measures the impact of combinations of preprocessing techniques and libraries for short texts written in Spanish. These techniques were applied to tweets for sentiment analysis, considering as evaluation parameters the processing time and the characteristics of the techniques of each library. The performed experimentation provides readers with insights for choosing the appropriate combination of techniques during preprocessing. The results show improvements of 5% to 9% in the performance of the classification.

Keywords: Natural Language Processing · Preprocessing · Twitter · Sentiment analysis · Text mining

1 Introduction

Natural Language Processing (NLP) is formally defined as a field of study that combines informatics, artificial intelligence, linguistics, and analysis processes related to natural language to generate knowledge and intelligence [1]. The relevance of this science lies in the processing and comprehension of information expressed through different media. These media are used in areas such as searching [2], machine translation, named entity recognition (NER), information grouping, classification [3], and sentiment analysis [4], among others [5].
NLP consists of techniques applied in Knowledge Discovery in Databases
(KDD) such as the selection of a dataset, data cleaning and preprocessing, transfor-
mation, data mining, and evaluation/interpretation [6]. Preprocessing is one of the most critical stages; its purpose is to clean and prepare the text for its final analysis [7]. This stage includes techniques (e.g., tokenization, stemming, lemmatization, stop-word removal, lowercase) usually applied during NLP
processes. There are specific techniques that depend on the nature of the text in which
the analyst is working. For example, when working with Twitter data, it is necessary to
apply data cleaning techniques, such as: deleting web links, usernames, hashtag sym-
bols, blank spaces, proper names, numeric words, and punctuation marks [8].
In improving the text classification through pre-processing techniques, researchers
focused their studies on texts obtained from social networks [8–10], news and emails
[3], and movie reviews [7]. However, these studies were applied to English texts and used only one library. Preprocessing techniques are important for processing low-quality data and allow obtaining high-quality datasets [7]. Applying preprocessing techniques to Spanish texts using different libraries generates knowledge that is useful as a foundation for NLP studies, mainly in short-text classification.
Pre-processing techniques have been implemented using libraries or tools such as NLTK [11], Stanford NLP [12], SpaCy [13], and FreeLing [14]. FreeLing is considered by researchers one of the most powerful libraries, due to its functionalities and its coverage of the Spanish language, as well as its support for other languages (e.g., English, Portuguese, Italian, Russian, Catalan) [15].
The libraries for different languages have evolved substantially concerning the
corpora, dictionaries, and other functions [16]; however, for the Spanish language, they are still in a maturation stage due to its complex syntax and semantics [5]. Thus, defining the techniques and tools for NLP is a challenge, even more so when it is applied to natural language in Spanish.
The combination of techniques can help in obtaining relevant results in the final
processing of the text. However, most authors evaluate the combinations of techniques shallowly, relegating the experimental libraries. Hence, it is necessary to assess each of
these combinations in more detail, focusing on the main problem: data analysis in
Spanish.
The aim of this study is to analyze and measure the impact of preprocessing techniques in NLP in the area of sentiment analysis. This study reports the results of applying a sentiment classification model using libraries with support for the Spanish language. The data used in this study are tweets related to the Football World Cup 2018 and transit in Ecuador. The evaluation parameters, the processing time, and the characteristics of the techniques for each library are also reported. The results show the differences in performance when applying preprocessing techniques to short texts written in Spanish.
The structure of this article is as follows. Section 2 presents the related work, Sect. 3
explains the methodology applied in the experimentation, Sect. 4 presents the results
obtained during this study, and finally, Sect. 5 presents the conclusions.

2 Related Works

In the last years, online data have grown exponentially as a result of the variety of
online communities, forums, and societies that promote network interactions among
users. The search for techniques that allow automatic gathering and processing of the
information generated in real-time is gaining increasing attention. NLP has become
dominant in recent years, and research relating to this topic has centered especially on
sentiment analysis, detection, and classification; areas that are linked to tools, algo-
rithms, and libraries.
NLP considers different levels of analysis including semantics [17], lexical [18],
syntactic [19], and pragmatics [20]. Researchers mainly have focused their studies on
techniques applied to the pre-processing stage. Among the most relevant techniques
addressed by researchers are: lowercase [21], tokenization [22], stemming [23],
lemmatization [24], and stop-word removal [25]. Gupta [8] explored more specific
techniques applied in social networks such as Twitter, which cover two aspects: (i) the removal of noise, and (ii) the normalization of the text by converting non-standard words to their canonical forms. In his work, he used the NLTK library applied to the tweets domain in the English language. A more specific application is the study conducted by [9]; in this comparative research on pre-processing techniques, the authors focused on opinion analysis in Twitter, assessing the effect of six types of techniques (i.e., removal of web links, stop-word removal, replacement of negative words, elimination of numbers, deletion of repeated letters, extension of acronyms to their original words). These techniques were used to evaluate the effect on performance when applying sentiment analysis through two types of methods, experimenting with five different data sets in English. The evaluation shows that precision and the F-value improve when preprocessing methods are used to expand acronyms and replace negations, but there are no significant changes when web links and numbers are removed or stop-word removal is applied in a simple way.
Other languages of complex structures such as Arabic were also investigated in
areas such as automatic text summaries [23], and information retrieval [26]. Arabic is a
highly flexible language, with variable words and complex morphology; for these
reasons, the application of techniques such as stemming is suitable since it allows the
reduction of the text length and fast searching of information [23]. Jianqiang et al. [9] highlighted the importance of pre-processing techniques applied to this language. The authors experimented with two data sets, applying three types of classification algorithms: SVM (Support Vector Machine), Naive Bayes, and K-Nearest Neighbors. Regarding pre-processing techniques, they used feature correlation, stemming, and n-grams, demonstrating that the use of these techniques on reviews increases the performance of the classifiers.
In NLP, authors include libraries such as FreeLing and SpaCy, which serve as tools
when processing and pre-processing text. FreeLing is considered one of the most
powerful libraries, building robust opinion mining systems, and providing language
analysis functionalities for different languages [14]. SpaCy is a library focused on the
advanced processing of natural language, which is aimed at commercial applications,
and according to [25], it is considered one of the fastest and most accurate libraries,
achieving the best accuracy in terms of tokenization, with 90% accuracy compared to
others.
Although there are studies related to preprocessing techniques and their impact in
NLP applications, most of them have been implemented for the English or Arabic language and measure the impact of these techniques with a single library. This approach can
lead to incomplete results in short text classification.
3 Methodology

The methodology for the experimentation consists of three main sections: (i) Data
Extraction (see Fig. 1A), (ii) Techniques Application (see Fig. 1B), and (iii) Classifi-
cation (see Fig. 1C). Also, the method is based on several settings, features, and
algorithms.
The task of data extraction includes the download process, extraction settings and
the domains (i.e., Football, Transit). This process is performed by a Software Engineer because previous programming knowledge is required to connect to the Twitter API. The process includes a cleaning sub-task performed by a Data Analyst, who identifies the prior processing required by the extracted tweets. The application of preprocessing techniques is also executed by the Data Analyst, whose role is to decide the correct techniques and libraries to generate clean text that is easy to analyze. Finally, the sentiment analysis task is performed by a specialist in data analysis, who sets the correct configuration for an efficient analysis.

Fig. 1. Experimentation methodology

3.1 Data Extraction


3.1.1 Downloading Tweets
The pre-processing techniques were evaluated by using two domains of tweets:
(i) “Football World Cup 2018” (Dataset M), and (ii) Transit in Ecuador (Dataset T).
The tweets were downloaded using the Twitter API [27] and the programming language R. The Twitter API requires setting search values including n (number of tweets), lang (language), since (start date), until (ending date), geocode (location radius in miles), and string (search string). According to the search conditions, two domains of datasets were downloaded. The total number of tweets in each domain was 1,300. The M set of tweets was acquired from different countries including Argentina, Colombia, Spain, Mexico, and Uruguay. These data were collected during the times of highest user activity, which corresponded to the semifinal and final games of the Football World Cup. Subsequently, this dataset was classified in a binomial way (i.e., positive and negative), according to the sentiment of the text. This classification yielded 912 positive tweets and 410 negative tweets. The outcome of this process was later used to compare with the results of the proposed classification model, in order to
measure the recall, precision, and to determine F-measure as a parameter of the model
performance.
The tweets related to transit in different cities of Ecuador were extracted in set T. This set consists of tweets posted during May 2018 and, as in the M set case, it was classified in a binomial manner, obtaining 381 positive tweets and 919 negative tweets. The classification yielded two sets of unbalanced data; that is, these datasets have a large difference in size between classes (positive, negative). Unbalanced datasets are common in social media data since a positive or negative opinion is usually predominant among social media users. This type of data can cause the classifier to minimize the error in the predominant class, which produces biased accuracy metrics (see Fig. 1A).
Two different domains were included in the experimentation in order to collect data sets with different language dimensionality. For example, the M set has different expressions and forms of communication from a variety of countries. To evaluate the accuracy of the results with the proposed method, it was necessary to assess two domains, considering the dimensionality, the way in which each country communicates, and the particular characteristics of each form of communication.
On the other hand, set T offers another perspective by applying the same techniques to a set that is more closed with respect to language.

3.2 Cleaning Tweets


RStudio [28] is a set of integrated tools to work with the programming language R.
This software was used for cleaning the data gathered. The cleaning technique used
during the text processing of data from social networks includes discarding irrelevant
information such as user names, special characters and web links (see Fig. 1A). This
cleaning technique is important for processing any text to obtain a dataset with less
noise.
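The cleaning itself was performed in R; purely as a hedged illustration of the same steps (removing web links, user names, and special characters, and collapsing blank spaces), a minimal Python sketch could look as follows. The function name, regular expressions, and sample tweet are illustrative assumptions, not the authors' code.

import re

def clean_tweet(text):
    # Remove web links, user names, and the hashtag symbol, then collapse extra blank spaces.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("RT @usuario: golazo #MundialRusia2018 https://t.co/abc123"))
# -> "RT : golazo MundialRusia2018"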

3.3 Application of Techniques


The application of pre-processing techniques was the main process to be evaluated. For
this, we selected techniques common to this type of task: tokenization, lemmatization, lowercase, and stop-word removal. In addition, we chose libraries that support these techniques. The choice of these elements was based on the prior analysis detailed below:
(a) Combination Techniques
The preprocessing techniques were the basic elements of the research, since they
were the object of evaluation and study, in order to give a criterion of their work in
NLP for short texts in Spanish. According to their functionalities, the selected techniques were: text cleaning, tokenization, stop-word removal, lemmatization, and lowercase (see Table 1).
Table 1. Evaluated techniques.


Technique How is it applied?
Text cleaning Elimination of URL links
Removal of special characters: @, #
Removal of user names
Tokenization Division of words or tokens
Stop-word removal Elimination of empty words by means of corpus
Lemmatization Transformation to his root word by means of corpus
Lowercase Transformation to lowercase

The selected techniques were evaluated both individually and in combination. All possible individual applications and combinations of these techniques were obtained to find the interactions with the best results.
The combinations were made based on a study by Uysal et al. [3], who concluded that choosing appropriate combinations in the preprocessing stage, instead of using all or only one of the techniques, provides a significant improvement in language processing applied to text classification.
The techniques of text cleaning, lemmatization (LEM), stop-word removal (STO), tokenization (TOK), and lowercase (LOW) were selected as the most relevant techniques in NLP. However, the tokenization technique is usually excluded from the combinations, due to its obligatory participation in the classification models. On the other hand, text-cleaning
techniques are applied in all cases (see Fig. 1B).
Each technique was set to 0 or 1, interpreted as discarded or applied, respectively, in each combination. According to the number of techniques, 15 different combinations were obtained (including tokenization), described in Table 2, without counting the all-zero combination (LEM: 0, STO: 0, LOW: 0) that serves as the baseline for comparison in the evaluation (a sketch of this enumeration is given after Table 2).

Table 2. Combinations of evaluated techniques.


No. Techniques
LEM STO TOK LOW
1 1 0 0 0
2 0 1 0 0
… … … … …
15 1 1 1 1
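As an illustrative sketch only (not the authors' implementation), the 15 on/off combinations of Table 2 can be enumerated programmatically; nothing beyond the four technique labels used in this section is assumed.

from itertools import product

TECHNIQUES = ["LEM", "STO", "TOK", "LOW"]

# Every 0/1 assignment of the four techniques except the all-zero baseline: 2**4 - 1 = 15.
combinations = [dict(zip(TECHNIQUES, flags))
                for flags in product([0, 1], repeat=len(TECHNIQUES))
                if any(flags)]

for number, combination in enumerate(combinations, start=1):
    print(number, combination)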

(b) Libraries
There are multiple libraries that focus on providing tools in different areas of NLP
such as text classification, automatic labeling, sentiment analysis, NER analysis and
preprocessing. However, their functionality is still limited for certain languages such as Spanish, because it is a widespread and difficult language to process. For this reason,
a more objective evaluation was performed, experimenting with different libraries that
provide support for these types of techniques. The selected libraries, NLTK, SpaCy, and FreeLing, were chosen after an analysis of the characteristics that make them suitable for the purpose of the research. Additionally, they support the majority of the established combinations of techniques (see Table 3; a short SpaCy sketch of these preprocessing steps is given after the table).

Table 3. Libraries and technical characteristics.


Library Techniques Algorithms/Corpus
NLTK Tokenization Simple splitter
Lowercase Generic
Stop-word removal Corpus: 313 words
Lemmatization Not supported
SpaCy Tokenization Simple splitter
Lowercase Generic
Stop-word removal Corpus: 410 words
Lemmatization Part-of-speech-sensitive
FreeLing Tokenization Regular expressions
Lowercase Generic
Stop-word removal Not supported
Lemmatization Corpus: 555 words, 76000 combinations
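As a hedged sketch of how the techniques in Table 3 can be chained with SpaCy, the Spanish model name es_core_news_sm and the helper function below are illustrative assumptions, not the configuration used by the authors.

import spacy

# Assumes the Spanish model is installed: python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")

def preprocess(tweet, use_stop=True, use_lemma=True, use_lower=True):
    doc = nlp(tweet)                                        # tokenization is always applied
    tokens = []
    for tok in doc:
        if use_stop and tok.is_stop:                        # stop-word removal (STO)
            continue
        text = tok.lemma_ if use_lemma else tok.text        # lemmatization (LEM)
        tokens.append(text.lower() if use_lower else text)  # lowercase (LOW)
    return tokens

print(preprocess("Los jugadores marcaron dos goles en la final"))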

3.4 Classification
The evaluation stage was developed based on a specific application of NLP, such as
sentiment analysis. Through this experimentation, the impact of the techniques and
combinations in NLP was evaluated. For this process, an analysis model based on the Support Vector Machine (SVM) algorithm with a Radial Basis Function (RBF) kernel was built in the RapidMiner [29] tool (see Fig. 1C). This model consists of three phases: (i) the process of reading documents, where the combinations of techniques were applied and the documents were classified according to the sentiment they expressed (i.e., positive, negative); (ii) the number of tweets for training and application of the model was established, at 70% and 30% respectively. In addition, an optimization operator was used, which finds the optimal values of the parameters chosen for the kernel operators of the model, in this case the parameters C and γ. By means of this optimization operator, the values of the constants C and γ were configured taking as a basis the investigation of [26]. In this case, it is recommended to use a “grid search” by means of the cross-validation method. This method consists of testing different pairs of values of C and γ and selecting the pair that obtains the best value of the chosen evaluation parameter, in order to measure the general performance of the model. Here, the F-measure value was used, since it is the most objective and optimal for working with unbalanced text. Taking into account that exponentially growing sequences of the parameters (C and γ) represent the most practical method, the sequence used was C taking values of 2^-5, 2^-3, …, 2^15 and γ
taking values of 2^-15, 2^-13, …, 2^3. The last phase was the writing of the generated confusion matrix and the classification results of the model with its evaluation parameters (see Fig. 2).

Fig. 2. Sentiment analysis model developed in RapidMiner.
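The model above was built in RapidMiner; purely as a hedged illustration for readers who prefer code, the following scikit-learn sketch reproduces the described setup (RBF-kernel SVM, cross-validated exponential grid search over C and γ, F-measure scoring). The TF-IDF features and the toy corpus are our own assumptions, since the paper does not detail the feature representation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus standing in for the preprocessed tweets; labels are illustrative.
texts = ["gran partido hoy", "que golazo increible", "me encanta este equipo",
         "pesimo trafico otra vez", "accidente terrible en la avenida", "odio este congestionamiento"]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC(kernel="rbf"))])

# Exponentially growing grid for C and gamma, as described in the text.
param_grid = {"svm__C": [2.0 ** e for e in range(-5, 16, 2)],
              "svm__gamma": [2.0 ** e for e in range(-15, 4, 2)]}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=3)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))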

4 Results

Based on the experimentation, the characteristics of each library influenced the selection of techniques, which implied a reduction of the combinations of techniques in the sentiment analysis study. Among those characteristics, the transformation to lowercase was mandatory within the stop-word removal technique, leading to the elimination of combinations that explicitly included the lowercase technique. In addition, the presence of the tokenization technique inside the sentiment analysis algorithm was taken into account, eliminating it as well from the proposed combinations (see Table 4).

Table 4. Combinations of the evaluation techniques


Libraries Combinations
NLTK STO
LOW
SpaCy STO
LOW
LEM
LOW-LEM
STO-LEM
FreeLing LOW
LEM
LOW-LEM
4.1 Techniques Evaluated by Libraries


Each dataset generated ten documents processed according to the resulting combina-
tions (see Table 4). These documents were the inputs for the sentiment analysis model,
where the f-measure value makes it possible to measure the impact according to the evaluation parameter. The first evaluation was made at the library level, obtaining the best
combinations, according to the f-measure value, and the related libraries.
The experimentation categorized by libraries, applied to set M, showed that the best combinations of techniques were LEM and LOW-LEM in the SpaCy and FreeLing libraries, whose result in both combinations was 90.5% in the f-measure value, surpassing by 4.1% the value obtained when no pre-processing techniques were applied (see Fig. 3). In set T the best results were obtained applying the same combinations, with an f-measure of 95.8% each, compared to 86.6% without applying techniques; although the best library in this dataset was SpaCy, FreeLing reached 95.5% with a lower processing time than SpaCy (see Fig. 4).

Fig. 3. Evaluation of techniques by libraries in dataset M

Fig. 4. Evaluation of techniques by libraries in dataset T.



4.2 General Evaluation


For the evaluation of NLP techniques applied to the sentiment analysis in both data
sets, precision and recall were taken as parameters for the evaluation, and the f-measure
value as a measure of general performance and value to be considered in the deter-
mination of the best preprocessing techniques in NLP.
This evaluation extracted the best results of the 5 resulting combinations based on
the preliminary assessment according to the libraries, in order to make a comparison of
the performance of the combinations versus two different data domains, taking as a
performance parameter general, the f-measure value, considering the classification time
of the model in each scenario. Taking the “low-lem” combination as the best result; and
therefore, the technique of lemmatization which gave a significant impact related to the
performance in both datasets (see Fig. 5). Additionally, it has been obtained a com-
parative between the three parameters of the evaluation during the sentiment analysis
for both datasets (see Figs. 6 and 7).

Fig. 5. General evaluation of techniques in dataset M and T.


Fig. 6. Evaluation of value f_measure, recall and precision in dataset M with each combination
of techniques.

Fig. 7. Evaluation of value f_measure, recall and precision in dataset T with each combination of techniques.

5 Conclusions

This study presents the impact of pre-processing techniques in NLP, taking as a case study the field of sentiment analysis, one of the most widely explored applications of this type of language processing. The techniques were evaluated in two different domains of
tweets, World Cup 2018 and Transit in Ecuador, using all possible combinations of the selected pre-processing techniques, evaluated across different NLP libraries. The aspects considered in the evaluation were the processing time of the libraries for each technique, the processing characteristics and capacity of their corpora and algorithms applied to the Spanish language and, in terms of the sentiment analysis, the impact of each of the combinations obtained, considering accuracy, recall, and f-measure as measures of overall performance.
The experimental analysis revealed that the appropriate combinations of techniques
such as lemmatization and lowercase in the pre-processing stage according to the f-
measure obtained after applying the sentiment analysis model, provide a significant
improvement in the classification. Although there are pre-processing techniques, such as lowercase or stop-word removal, that do not reach the maximum improvement in the sentiment analysis process, there is still a demonstrable improvement. Therefore, for a
text classification problem such as sentiment analysis in any domain, it is advisable to
carefully analyze all possible combinations of techniques, instead of using one or all in
the pre-processing stage. When working with other fields of NLP, it is necessary to examine the aim of the study and analyze the function of each technique in the text, in order to perform more efficient processing with greater precision in the final results.

6 Future Works

Our work evaluated three libraries: spaCy, FreeLing, and NLTK. For future work we propose the evaluation of new libraries such as UDPipe, a library written in C++ and currently available as an R package for RStudio. This library was designed for data preparation in NLP tasks and has support for the Spanish language. Additionally, we propose to apply the selected techniques in other areas such as NER analysis and classification.

Acknowledgment. This research was supported by the vice-rectorate of investigations of the


Universidad del Azuay. We thank our colleagues from Laboratorio de Investigación y Desarrollo
en Informática (LIDI) at Universidad del Azuay who provided insight and expertise that greatly
assisted this work. Part of this research is supported by the Design of architectures and interaction
models for assisted living environments aimed at older adults project of the XVIII DIUC Call for
Research.

References
1. Reese, R.M.: Natural Language Processing with Java. Packt Publishing (2015)
2. Battistelli, D., Charnois, T., Minel, J.L., Teissèdre, C.: Detecting salient events in large
corpora by a combination of NLP and data mining techniques. Comput. y Sist. 17, 229–237
(2013)
3. Uysal, A.K., Gunal, S.: The impact of preprocessing on text classification. Inf. Process.
Manage. 50, 104–112 (2014). https://doi.org/10.1016/j.ipm.2013.08.006
4. Krouska, A., Troussas, C., Virvou, M.: The effect of preprocessing techniques on twitter
sentiment analysis. In: 2016 7th International Conference on Information, Intelligent System
Application (IISA), pp. 1–5 (2016). https://doi.org/10.1109/iisa.2016.7785373
5. Hidalgo, O., Jaimes, R., Gomez, E., Luján-mora, S.: Análisis de sentimiento aplicado al
nivel de popularidad del líder político ecuatoriano Rafael Correa Sentiment Analysis applied
to the popularity level of the Ecuadorian political leader Rafael Correa. In: 2017
International Conference on Information Systems and Computer Science (INCISCOS),
pp. 340–346 (2017)
6. Gómez-Jiménez, G., Gonzalez-Ponce, K., Castillo-Pazos, D.J., Madariaga-Mazon, A.,
Barroso-Flores, J., Cortes-Guzman, F., Martinez-Mayorga, K.: The OECD Principles for (Q)
SAR Models in the Context of Knowledge Discovery in Databases (KDD). Elsevier Inc.
(2018)
7. Haddi, E., Liu, X., Shi, Y.: The role of text pre-processing in sentiment analysis. Procedia
Comput. Sci. 17, 26–32 (2013). https://doi.org/10.1016/j.procs.2013.05.005
8. Gupta, I., Joshi, N.: Tweet normalization : a knowledge based approach. In: 2017
International Conference on Infocom Technologies and Unmanned Systems (Trends Future
Directions) (ICTUS), pp. 1–6 (2017)
9. Jianqiang, Z., Xiaolin, G.: Comparison research on text pre-processing methods on twitter
sentiment analysis. IEEE Access. 5, 2870–2879 (2017). https://doi.org/10.1109/ACCESS.
2017.2672677
10. Galadanci, B.S., Muaz, S.A., Mukhtar, M.I.: Comparing research outputs of Nigeria Federal
Universities based on the scopus database. In: CEUR Workshop Proceedings, vol. 1755,
pp. 79–84 (2016). https://doi.org/10.1177/0165551510000000
11. Paramkusham, S.: NLTK: The natural language toolkit. Int. J. Technol. Res. Eng. 5, 2845–
2847 (2017)
12. Weerasooriya, T., Perera, N., Liyanage, S.R.: A method to extract essential keywords from a
tweet using NLP tools. In: 16th International Conference on Advances in ICT for Emerging
Regions, ICTer 2016 - Conference Proceedings, pp. 29–34 (2017)
13. SpaCy: spaCy. https://spacy.io/usage/linguistic-features#_title
14. Padró, L., Stanilovsky, E.: FreeLing 3.0: towards wider multilinguality. In: Proceedings
Language Resources Evaluation Conference (LREC 2012), pp. 2473–2479 (2012)
15. Henríquez, C., Guzmán, J., Salcedo, D.: Minería de Opiniones basado en la adaptación al
español de ANEW sobre opiniones acerca de hoteles. Proces. del Leng. Nat. 41, 25–32
(2016)
16. Prata, D.N., Soares, K.P., Silva, M.A., Trevisan, D.Q., Letouze, P.: Social data analysis of
Brazilian’s mood from twitter. Int. J. Soc. Sci. Humanit. 6, 179–183 (2016). https://doi.org/
10.7763/IJSSH.2016.V6.640
17. Altszyler, E., Brusco, P.: Análisis de la dinámica del contenido semántico de textos. In:
Argentine Symposium on Artificial Intelligence, pp. 256–263 (2015)
18. Pérez-guadarramas, Y., Rodríguez-blanco, A., Simón-cuevas, A.: Combinando patrones
léxico - sintácticos y análisis de tópicos para la extracción automática de frases relevantes en
textos. Proces. L. 59, 39–46 (2017)
19. Antonio, F., Velásquez, C., Paul, J., De Paz, Z., Guzmán, J.F.: Aplicación del análisis
sintáctico automático en la atribución de autoría de mensajes en redes sociales. Res. Comput.
Sci. 137, 109–119 (2017)
20. Soto Kiewit, L.D.: Un acercamiento a la concepción de gobernabilidad en los discursos
presidenciales de José María Figueres Olsen. Rev. Rupturas. 7, 1 (2017). https://doi.org/10.
22458/rr.v7i1.1609
21. Poornima, B.K.: Text preprocessing on extracted text from audio/video using R. Int.
J. Comput. Intell. Inform. 6, 267–278 (2017)
22. He, Y., Kayaalp, M.: A comparison of 13 tokenizers on MEDLINE. Lister Hill National Center for Biomedical Communications, Bethesda, MD, 48 (2006)
23. Alami, N., Meknassi, M., Ouatik, S.A., Ennahnahi, N.: Impact of stemming on Arabic text
summarization. In: Colloquium in Information Science and Technology, CIST, pp. 338–343
(2017)
24. Singh, T., Kumari, M.: Role of text pre-processing in twitter sentiment analysis. Procedia
Comput. Sci. 89, 549–554 (2016). https://doi.org/10.1016/j.procs.2016.06.095
25. Katariya, N.P., Chaudhari, M.S.: Text preprocessing for text mining using side information.
Int. J. Comput. Sci. Mob. Appl. 3, 3–7 (2015)
26. Althobaiti, M., Kruschwitz, U., Poesio, M.: AraNLP: a Java-based library for the processing
of Arabic text. In: Proceedings of the Ninth International Conference on Language
Resources and Evaluation (LREC 2014), pp. 4134–4138 (2014)
27. Twitter Inc: Search Tweets. https://developer.twitter.com/en/docs/tweets/search/api-
reference/get-search-tweets.html
28. RStudio: Take control of your R code. https://www.rstudio.com/products/rstudio/
29. RapidMiner GmbH: RapidMiner Documentation
Automatic Visual Recommendation for Data
Science and Analytics

Manoj Muniswamaiah, Tilak Agerwala(&),


and Charles C. Tappert(&)

Seidenberg School of CSIS, Pace University, White Plains, New York, USA
{mm42526w,tagerwala,ctappert}@pace.edu

Abstract. Data visualization is used to extract insight from large datasets. Data scientists repeatedly generate different visualizations from the datasets to test their hypotheses. Analyzing datasets that have many attributes can be a cumbersome process and lead to errors. The goal of this research paper is to automatically recommend interesting visualization patterns using optimized datasets from different databases. This reduces the time spent on low-utility visualizations and displays recommended patterns.

Keywords: Big data · Database · Analytical query · Query optimizer · Data science · Data visualization · Data analyst

1 Introduction

Data visualization tools have been used increasingly by data scientists and analysts. They load different datasets and examine their hypotheses using visualization tools; this process is repeated several times until they find an interesting pattern. Data scientists need to derive insights using this trial-and-error method, which is tedious. The main goal of this research paper is to find interesting patterns in large datasets across different databases. When a user issues a query, the optimizer substitutes the query with an optimized copy and returns the results, from which the recommended visualizations are displayed automatically without manual intervention. The optimized SQL framework supports different kinds of databases used in an organization and provides an extension to SeeDB [1].
Data scientists need to build different visualizations from new datasets in order to
find various patterns and anomalies. When the dataset has high dimensionality, finding such patterns becomes a tedious task. Determining relationships between attributes
and their subsets is required for analysis of the data. Visualizations are likely to display
interesting patterns if the plotted data deviates largely from the reference points or
historical data. Even for a small dataset the number of visualizations that can be
generated is large. Also, the visualizations should be displayed at interactive speed with
quicker response time to the users.
Today, data is stored in different databases whose storage models have been customized to their needs. When data needs to be analyzed, it must be retrieved from various sources. Healthcare datasets such as MIMIC-III (Medical Information Mart for Intensive Care), which is publicly available, contain de-identified data collected from Intensive Care Unit (ICU) patients [2]. MIMIC-III contains both structured and unstructured data on medications, doctor reports, and streaming data from medical devices. These varied data formats are stored in different databases. Structured data is stored in relational databases and unstructured data in NoSQL data stores, in order to gain the performance advantage of the native databases specifically designed to handle them.
MIMIC-III is a data repository containing information about patients admitted to hospitals. It contains details of vital signs, medical device readings, doctor notes, and patient admission data. The data is released to researchers after de-identifying the patient information [2]. Data federation among these different databases is required to build a visualization recommendation tool. This would help data scientists and analysts analyze their hypotheses on different datasets. Today this process is manual: the user has to gather the data of interest and go through all the visualizations, which is a cumbersome task.
Graphs can be plotted from de-identified patient information collected from different sources. It includes patient admission date, gender, doctor notes, and time-series data from medical devices. The number of visualizations grows significantly depending upon the number of points of interest to the user. Tracking all the visualizations
generated becomes a difficult task. Users are interested in visualizations in which target
data shows large deviation from the referenced data. This data which a user wants to
analyze can be stored across different databases. Having a federated SQL framework
which retrieves the data quickly across databases helps them focus on their tasks.

Fig. 1. Shows the graph for heart related problems for admitted married and unmarried patients

Figure 1 shows that target data which deviates largely from the reference points helps in identifying outliers and anomalies in the data for further investigation.
2 Problem Statement

Data scientists analyze datasets which are retrieved from different databases. Dimen-
sional attributes represent the facts and measured attributes are derived from the
aggregate functions. These two attributes are used in the visualization tools [3]. This
research extends SeeDB [1] which recommends visualizations for a query using a high
utility function that exhibits larger deviations by retrieving datasets from different
databases. The customized SQL framework used to derive the data also makes use of the optimized data created by the database administrators. Healthcare and medical devices generate data in different formats, which need to be stored in different databases [4]. The customized SQL framework queries the data from any of the registered databases and, during runtime, the query optimizer substitutes the optimized data when applicable and retrieves the data quickly; this data is later used to build the recommended visualization. Dimensional attributes D represent the group-by attri-
butes of the query. The dimensional attributes are quantified using the measure attri-
butes M and a set of aggregate functions A. These queries are executed against a set of
registered databases S. We can group dimensional attributes D and aggregate them
based on the measure attributes M. This results in a two dimensional table which can be
used for visualization. Recommended visualizations have a high utility factor and can be obtained by executing aggregation over the group-by attributes on the registered databases, represented by the function (d, m, a) where d ∈ D, m ∈ M, a ∈ A. T(S) represents the grouping of data in the set of registered databases S for the target data. R(S) represents the data for the reference datasets. The query results Q(target) and Q(reference) determine the high-utility visualizations to be displayed [1].

Q(target) = SELECT d, a(m) FROM T(S) GROUP BY d

Q(reference) = SELECT d, a(m) FROM R(S) GROUP BY d

The utility factor is calculated from the views of Q(target) and Q(reference). The difference between the views is used to determine the recommended visualizations to be displayed (see Figs. 2 and 3; a sketch of such a deviation computation follows the example queries).

SELECT sex, count (diagnosis) FROM admission_married GROUP BY sex;


SELECT sex, count (diagnosis) FROM admission_unmarried GROUP BY sex;
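As a hedged sketch (our own code, not the framework's implementation) of how a deviation-based utility could be computed from the results of the two queries above, the function below compares the normalized group-by distributions of the target and reference views with an L1 distance; the sample rows are illustrative.

from collections import Counter

def view_utility(target_rows, reference_rows):
    # Each argument is a list of (group_value, aggregate_value) pairs,
    # i.e. the result of SELECT d, a(m) ... GROUP BY d on the target or reference data.
    def normalize(rows):
        totals = Counter()
        for group, value in rows:
            totals[group] += value
        total = sum(totals.values()) or 1.0
        return {g: v / total for g, v in totals.items()}

    p, q = normalize(target_rows), normalize(reference_rows)
    groups = set(p) | set(q)
    # Larger distance between the two distributions = more interesting (higher-utility) view.
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in groups)

target = [("M", 120), ("F", 80)]       # e.g. diagnosis counts for married patients
reference = [("M", 90), ("F", 110)]    # e.g. diagnosis counts for unmarried patients
print(round(view_utility(target, reference), 3))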
[Figure 2 diagram: client with a Python SQL tool displaying recommended visuals; server with a view generator and the customized SQL framework over HTAP, NoSQL, and RDBMS databases]

Fig. 2. Architectural overview of the visual recommendation framework

Fig. 3. Recommended visualization graphs



3 Architecture

The front end component generates visualizations based on the data from different
databases. It interacts with a customized SQL framework which acts like a federated
query layer for retrieving the data. During run time, the framework rewrites the query to the optimized copy and returns the result set. The incoming query is parsed and
validated for syntax. The query optimizer applies various transformation rules to the
relational node to obtain the optimized node which has the same semantic meaning as
the original one and with reduced query cost. The query optimizer transforms the
relational node by substituting it with whole or partial rules which matches the pattern.
The metadata information regarding different registered databases is stored in a catalog
which is utilized during runtime by the query optimizer. It provides information about
the overall execution cost of the query, the data size in tables, memory and CPU usage
details for the query execution.
Sharing-Based Visualization Optimizations
Aggregate queries which have the same group-by attributes are combined into a single view, and later multiple group-bys are integrated. This results in improved query latency and better performance [1].
SELECT diagnosis, occupation, count(diagnosis), sum(age) FROM admission_married GROUP BY GROUPING SETS ((diagnosis), (occupation));
Pruning-Based Visualization Optimizations
It implements interval pruning based on the confidence level of the utility scores
and discards all the visualizations which have the utility score upper bound below the
least lower bound in the top-k visualizations [1].
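As an illustration of this idea (the view names and interval bounds below are hypothetical; the actual pruning in [1] maintains running confidence intervals over partitions of the data), a view can be discarded once its utility upper bound falls below the k-th largest lower bound.

def prune_views(views, k):
    # `views` maps a view id to a (lower_bound, upper_bound) confidence interval on its utility.
    kth_lower = sorted((lo for lo, _ in views.values()), reverse=True)[k - 1]
    # Keep only views whose upper bound can still reach the current top-k.
    return {v: bounds for v, bounds in views.items() if bounds[1] >= kth_lower}

candidates = {"viewA": (0.40, 0.55), "viewB": (0.10, 0.20), "viewC": (0.30, 0.45)}
print(prune_views(candidates, k=2))   # viewB is pruned: its upper bound 0.20 < 0.30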

4 Evaluation

Evaluation was performed to display the visualizations which had the best utility factor
among the top views, better accuracy and reduced response time.
Fig. 4. Recommended visualizations display for admitted patients diagnosed with respiratory
related issues.

Fig. 5. Recommended visualizations display for admitted patients diagnosed with heart related
issues.
Fig. 6. Recommended visualizations display for admitted patients diagnosed with hypotension
related issues

All the experiments were conducted on a 64-bit Linux machine with a 3 GHz Intel Xeon processor and 16 GB of RAM. PostgreSQL [5] was used to store patient-related information. The medical device data was stored in Splice Machine [6] and text notes
in MongoDB [7]. As shown in Figs. 4, 5 and 6, dataset1 is the target dataset and
dataset2 is the reference dataset. Visualizations which had a high utility factor deviating
from the reference points are displayed as the recommended visualizations.

5 Related Work

There are various visualization tools that have been developed and introduced in the
market [8]. These tools require users to manually specify graphs which is a tedious
task. The visualization tools should support the datasets which are large in volume and
also have better response time. Some of these data are volatile and not completely
preprocessed for display. There has been active research on developing recommended visual displays, such as VISO and VizDeck [9]; the problem with these approaches is that they quickly become intractable as the number of dataset attributes to be analyzed increases. In order to increase scalability and response time, in-memory caching and sampling of the datasets have been used. Data materialization has been used by database administrators to help reduce run-time computation while analyzing these datasets [10]. Visualization tools must be interactive in displaying the data. They must also support relationships between different views, filter down
on the data required for display, and focus on the information of interest. Visualization of datasets which are both structured and unstructured is challenging. Data reduction techniques like binned aggregation have been used effectively for visualization. Also, pre-computed data and data parallelization are techniques adopted by many visualization tools on the market to reduce the latency of displaying the visualizations. Data cubes and nanocubes are built from the tables and perform aggregation over the table dimensions for scalability.
There are different types of tools which support visualization through service, libraries
and platforms [11].

6 Conclusion

This research implements a visualization analytics tool along with a query optimizer that helps in automatically recommending interesting visualizations from different databases. This work helps improve interactive data exploration for data scientists and analysts. A further extension of this work would be to integrate it with cloud databases.

References
1. Vartak, M., Madden, S., Parameswaran, A., Polyzotis, N.: SeeDB: automatically generating
query visualizations. Proc. VLDB Endow. 7(13), 1581–1584 (2014)
2. Johnson, A.E., Pollard, T.J., Shen, L., Li-wei, H.L., Feng, M., Ghassemi, M., Moody, B.,
Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database.
Sci. Data 3, 160035 (2016)
3. Waller, M.A., Fawcett, S.E.: Data science, predictive analytics, and big data: a revolution
that will transform supply chain design and management. J. Bus. Logistics 34(2), 77–84
(2013)
4. Muniswamaiah, M., Agerwala, T., Tappert, C.C.: Context-aware query performance
optimization for big data analytics in healthcare. In: 2019 IEEE High Performance Extreme
Computing Conference (HPEC-2019), pp. 1–7 (2019)
5. https://www.postgresql.org/
6. https://www.splicemachine.com/
7. https://www.mongodb.com/
8. Keim, D., Qu, H., Ma, K.-L.: Big-data visualization. IEEE Comput. Graph. Appl. 33(4), 20–
21 (2013)
9. Perry, D.B., et al.: VizDeck: streamlining exploratory visual analytics of scientific data
(2013)
10. Fisher, D., et al.: Interactions with big data analytics. interactions 19(3), 50–59 (2012)
11. Wang, L., Wang, G., Alexander, C.A.: Big data and visualization: methods, challenges and
technology progress. Digit. Technol. 1(1), 33–38 (2015)
A Novel Recommender System for Healthy
Grocery Shopping

Yadagiri Bodike, David Heu, Bhavishya Kadari,


Brandon Kiser, and Matin Pirouz(&)

California State University, Fresno, CA 93740, USA


mpirouz@csufresno.edu

Abstract. Given an anonymized dataset of online grocery purchases from users, we present a recommender system framework to predict future purchases. We describe the method of constructing a utility matrix to run a collaborative filtering algorithm that pairs similar and dissimilar users and ultimately provides recommendations. Given those recommendations, we further our
collaborative filtering algorithm to pair similar and dissimilar users and ulti-
mately provide recommendations. Given those recommendations, we further our
analysis by proposing a method using natural language processing to determine
the nutritional value of a food product to further improve recommendations. The
results provide recommendations for the healthiest options based on historical
purchase data.

Keywords: Recommender systems · Collaborative filtering · e-Commerce · Natural language processing

1 Introduction

With the growth of online shopping, companies are scrambling to figure out new
methods to improve growth, obtain profits, and increase customer retention. Due to the
large amount of dynamic, growing data, these methods must provide accurate, efficient,
and quick answers. With this in mind, we investigate the broad topic of recommender
systems, methods and techniques that are used to predict preferences a user has for
certain items. By using recommender systems, businesses can provide only relevant
products to users and therefore expand their total inventory of products [6].
In this paper, we analyze a specific implementation of a recommendation system:
namely, user-based collaborative filtering [9]. To test this implementation, we analyze
an Instacart dataset containing grocery order information from over 200,000 users. We
utilize user-based collaborative filtering to accomplish two tasks: first, we calculate the
similarity of users in order to provide product recommendations, and second, we cal-
culate the similarity of orders to investigate trends and classify them based on their
nutritional value [18]. Finally, we propose a method to improve user-based collaborative
filtering product recommendations by utilizing the order classification method [5].
To accomplish both tasks, we implement utility matrices to map user and order
preferences to items, and similarity matrices to store calculations and locate the closest
users and orders in the dataset. Furthermore, we implement a natural language pro-
cessing algorithm to evaluate the nutritional value of food products by comparing them

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 133–146, 2020.
https://doi.org/10.1007/978-3-030-39442-4_12
134 Y. Bodike et al.

to a USDA product dataset [7]. To determine similarity measures between users and
between orders, we utilize the cosine similarity and the Pearson correlation coefficient
equations.

2 Related Works

In this section, we present various related works of recommendation systems, and other
studies specifically relating to the Instacart dataset. We also provide some examples of
real use cases of recommender systems, as in the case of Netflix.
A Stanford report [4] shows a study of the Instacart dataset and describes the
method of extracting features, usage of market basket analysis, and logistic regression. The report describes a model that initially recommends some random items to users and then, given another model, recommends items based on predictions. This report demonstrates the usage of a recommender method based on a binary classification problem. Another related method for recommender systems is to use a collaborative filtering algorithm [9]. A collaborative filtering algorithm is a generic process to obtain similar and dissimilar users and base recommendations on those classifications [18]. However, that work uses an item-based collaborative filtering algorithm and draws the conclusion that item-based filtering is much more accurate for sparse net-
works [16].
Lastly, we see recommender systems in large corporations ranging from Amazon to
Netflix in their usage of recommending products or movies. We know that recom-
mender systems are being used; for example, in 2007, Netflix [1] created a challenge to see if
anyone could develop a more accurate recommender algorithm given a subset of their
movie ratings dataset. As we see, recommender systems are being vastly researched
and improved upon and have always been at the forefront of ecommerce usage.

3 Datasets

Instacart is an ecommerce grocery delivery service where users can order online from
various grocery stores to be delivered by a personal shopper. Our dataset was provided
freely by Instacart through a Kaggle competition. The dataset contains an ample col-
lection of anonymized users, orders, and product types from Instacart. Specifically, it
contains over three million grocery orders from over 200,000 users.

Table 1. Dataset complexity


Dataset Size Attribute
aisle 134 2
departments 21 2
orders_prior 32434489 4
orders_train 1384617 4
orders 3214874 7
products 49688 4
As shown in Table 1, our dataset is compiled into various files that are related by
identifiers; we also show the number of entries each file has and the number of attri-
butes. The structure of each file acts like a table in a relational database. For example,
orders contain a listing of product identifiers which relate to the products file.
Specifically, each order contains the user associated, a listing of products (in order),
and the date and time that the order was made. Furthermore, we are given information
on whether a product in an order has been previously purchased. The dataset also
provides an extensive listing of departments, aisles, and names for each of the products.
That is, names provided are actual branded products of Instacart with generic categories
(department, aisle).
Lastly, the data does not have any identifying material, nor do we have any sen-
sitive details of any user or order. The dates are given on a relative basis rather than an
absolute basis.

4 The Proposed Method

In this section, we provide detailed descriptions of our implementation of various algorithms and explain their usability in relation to our analysis.

4.1 Utility Matrix


A recommendation system comprises two classes: users and items. Each entity in the users class has some level of preference (represented numerically) for entities in
the items class [6]. For instance, preference can be measured by the number of times an
Instacart member has purchased a particular item [17].
The relationship between users and items is presented in the form of a matrix called
the utility matrix. The rows and columns of the matrix are represented by entities from
the users and items classes, respectively. This structure allows for the users and items to
be represented as vectors and therefore be used in similarity calculations [6, 14].
The only measure of preference between users and products provided in the
Instacart dataset is the number of times a user has purchased a particular product.
Therefore, we define the first utility matrix M1 as follows: for xij ∈ M1, user i purchased item j a total of xij times. An example is provided in Fig. 1.

Fig. 1. Example utility matrix M1

Thus, M1 will be an n × m matrix where n is the number of users and m is the number of products. Since a sample of users will likely not purchase every product
available in the products table, m will depend on the users that comprise n. Thus, each
column in M1 is guaranteed to sum to at least one and therefore the sparsity of the
matrix is reduced.
This implementation of user-item preference may overweigh the quantity of pro-
duct purchases when compared to the selection of products. For instance, a user who
purchased 80 strawberries when their average purchase count is less than five is defined
solely by their purported exclusive preference for strawberries. Therefore, we define the
second utility matrix M2 equivalent to M1 , except we enforce an upper limit of six on
the value of x. An example of this normalization is provided in Fig. 2.

Fig. 2. Utility matrix normalization from M1 to M2

Utility matrix M2 prioritizes item selection rather than item quantity when deter-
mining item preference. Ideally, user preference for products would not be defined as
ambiguously as product purchase count. A user may purchase a product less than their
average purchase count and still enjoy the item.
The decision to select the upper limit for M2 as six was arbitrary. The most common
product purchase count for most users was one, whereas outliers include quantities
surpassing 80. Therefore, a limit of six emphasizes product selection by diminishing
the effect of outliers in the dataset.
Finally, to calculate order similarity, we define utility matrix M3 as follows: for xij ∈ M3, if order i includes item j, then xij = 1, else xij = 0. M3 is similar in nature to M1 except the rows represent orders instead of users. An example is given in Fig. 3.

Fig. 3. Utility matrix M3
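A minimal pandas sketch of building M1, M2, and M3; the toy rows and the column names (user_id, order_id, product_id) are illustrative assumptions about the joined Instacart tables, not the actual implementation.

import pandas as pd

# Toy stand-in for the joined orders/products data.
rows = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2],
    "order_id":   [10, 10, 11, 20, 20],
    "product_id": ["banana", "milk", "banana", "milk", "cake"],
})

# M1: users x products, number of times each user purchased each product.
M1 = pd.crosstab(rows["user_id"], rows["product_id"])

# M2: identical to M1 but purchase counts are capped at six to emphasize selection over quantity.
M2 = M1.clip(upper=6)

# M3: orders x products, 1 if the order contains the product, else 0.
M3 = (pd.crosstab(rows["order_id"], rows["product_id"]) > 0).astype(int)

print(M1, M2, M3, sep="\n\n")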

4.2 User-Based Collaborative Filtering Similarity Calculations


User-based collaborative filtering is a method to provide recommendations to a user
based on similarity measures between other users. In essence, user interests are pre-
dicted by collaborating with other users. For example, when users rate movies [3], we
can predict what movies users may enjoy based on other similar users. If user A enjoys
the same movies as user B, then we can recommend or predict a movie that B may
enjoy that A has seen and B has not seen.
User-based collaborative filtering makes use of a utility matrix by calculating the
similarity between every user pair in the matrix. Since each row designates a single
user, we can represent each user as a vector. This enables us to utilize similarity
equations to determine the difference between users.
To calculate the similarity between Instacart users, we utilize the cosine similarity
function [12]. This equation calculates the cosine between two vectors which can be
used to determine the correlation between them. The cosine similarity between vectors
u and v is as follows:
$\mathrm{sim}(u, v) = \cos(\theta) = \dfrac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$  (1)

Because all the values in each vector are positive (it is impossible to purchase a
negative number of products), the cosine similarity between any two users will range
between 0.0 and 1.0. A cosine similarity of 0.0 means that there exists zero correlation
between the two vectors. Conversely, a similarity of 1.0 means that the two vectors are
equivalent.
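A minimal NumPy sketch of Eq. (1); the two user vectors are illustrative purchase counts, not rows taken from the actual dataset.

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two user vectors: 0.0 = no overlap, 1.0 = identical direction.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

user_a = [3, 0, 1, 6]   # purchase counts per product
user_b = [2, 0, 1, 4]
print(round(cosine_similarity(user_a, user_b), 3))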
To calculate the similarity between Instacart orders, we utilize another correlation
calculation known as the Pearson correlation coefficient [8]. The correlation coefficient
r between two vectors X and Y is defined as follows:
$r = \dfrac{\sum xy - \frac{\sum x \sum y}{N}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{N}\right)\left(\sum y^2 - \frac{(\sum y)^2}{N}\right)}}$  (2)

Similar to cosine similarity, the correlation coefficient r will be a value between 0.0
and 1.0, where values near 0.0 mean the vectors are dissimilar and 1.0 means they are
similar. Note that the Pearson correlation coefficient is equivalent to the cosine simi-
larity except it normalizes each vector by subtracting its mean from each value.
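Likewise, a sketch of Eq. (2) on two illustrative order vectors; the result matches the cosine similarity of the mean-centered vectors and agrees with np.corrcoef.

import numpy as np

def pearson_similarity(x, y):
    # Pearson correlation: cosine similarity of the mean-centered vectors.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

order_a = [1, 0, 1, 1, 0]
order_b = [1, 1, 0, 1, 0]
print(round(pearson_similarity(order_a, order_b), 3))
print(round(np.corrcoef(order_a, order_b)[0, 1], 3))   # same value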

4.3 Similarity Matrix


Given the utility matrix described in section A, we generate another matrix called the
similarity matrix by utilizing the methods in section B to store the similarity value
between each entity in the users class. This structure enables fast lookup (an O(1) operation) for similarity computations using a hash table [15]. We define an abstract similarity matrix S as follows: for xij ∈ S, user class entity i has a correlation value of xij with user class entity j. An example similarity matrix for utility matrix M3 is
presented in Fig. 4.

Fig. 4. Example similarity matrix mapping order correlations

Using this abstract similarity matrix definition, we compute a similarity matrix for
each of the three utility matrices defined in section A: S1 for M1 , S2 for M2 , and S3 for M3 .
We add a stipulation to the Instacart order similarity calculations in our research; we define the notion of a threshold when determining if a pair of orders is deemed
“similar enough”. This threshold is 0.1–0.5, meaning that if a pair of orders has a
similarity value between 0.1–0.5, then we deem them similar. This is because our
dataset is very sparse, meaning that orders have an incredibly large pool of products to
choose from, so orders containing exactly the same products are exceedingly rare. Note that no
two orders in the dataset exceed a similarity value of 0.5.
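An illustrative sketch (our own, with a toy M3) of filling the order similarity matrix S3 and applying the 0.1 to 0.5 "similar enough" band described above.

import numpy as np

def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.linalg.norm(xc) * np.linalg.norm(yc)
    return 0.0 if denom == 0 else float(xc @ yc / denom)

# Toy binary order x product utility matrix (rows = orders).
M3 = np.array([[1, 1, 0, 0, 1],
               [1, 0, 0, 1, 1],
               [0, 0, 1, 1, 0]], dtype=float)

n = len(M3)
S3 = np.array([[pearson(M3[i], M3[j]) for j in range(n)] for i in range(n)])

# Keep only order pairs whose similarity falls inside the 0.1-0.5 band.
similar_pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if 0.1 <= S3[i, j] <= 0.5]
print(similar_pairs)   # [(0, 1)]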

4.4 Natural Language Processing


Natural language processing (NLP) is the analysis of large amounts of language data
(words, sentences, phrases, etc.) [2]. In our case, we use NLP to classify whether a
product is healthy or unhealthy given the name of the product. After classifying
products, we can make a generalized statement of the entire order whether an order is
considered healthy or unhealthy overall. Then, given the statement, we can further our
recommendations solely based on the nutritional value of the product. NLP is not the
focus for our research so we will briefly explain our method here.
We ran a simple NLP algorithm on a reference dataset (the USDA product dataset [7])
that allows our model to learn what is healthy or unhealthy given the name of a
product. With this basic understanding in place, we run the algorithm again (using the
knowledge gained from that dataset) on our Instacart dataset to obtain a binary
classification {0, 1} of whether each product is healthy or unhealthy.
We found that products containing “chocolate”, “sweetened”, or “cake” in their name
were classified as unhealthy, while products containing “whole”, “banana”, or “organic”
were classified as healthy. However, we must remember that the nutritional value of
products (and what is healthy and unhealthy) will vary substantially between people [10].
Therefore, our classification cannot be taken as a concrete source of fact. Our findings
and usage of NLP serve solely to generalize orders so that we can recommend users
specific products.
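Because the paper does not give its NLP implementation, the sketch below only illustrates the kind of keyword-based labelling suggested by the findings above; the word lists are taken from the examples in the text and are far from a complete model.

    UNHEALTHY_WORDS = {"chocolate", "sweetened", "cake"}   # examples from the text
    HEALTHY_WORDS = {"whole", "banana", "organic"}

    def classify_product(name):
        """Return 1 if the product name looks healthy, 0 if unhealthy (rough heuristic)."""
        tokens = set(name.lower().split())
        healthy_hits = len(tokens & HEALTHY_WORDS)
        unhealthy_hits = len(tokens & UNHEALTHY_WORDS)
        return 1 if healthy_hits >= unhealthy_hits else 0

    def classify_order(product_names):
        """An order is 'healthy' (1) if the majority of its products are classified healthy."""
        labels = [classify_product(p) for p in product_names]
        return 1 if sum(labels) >= len(labels) / 2 else 0

    print(classify_order(["Organic Banana", "Sweetened Chocolate Cake Mix"]))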

4.5 Construction of Recommendations


Recommendation algorithms attempt to accurately predict empty values in the utility
matrix. Since the utility matrix as described in Section A measures preference by the
number of product purchases rather than product ratings, we can only predict which
new item a user will purchase next.
Due to the sparsity of the data available, we construct a list of recommended items
according to product departments. The list will only consist of items that the user has
not previously purchased and will include no more than 10 items.

Table 2. Example recommendation list and result


Recommended items         Actual purchase
Creamy peanut butter      Light brown sugar
Light brown sugar
Squeeze tomato ketchup

The most similar users are calculated using (1) as given in section B. Of the most
similar users, we recommend the most popular products that they purchased. For
example, Table 2 displays a recommended items list for the department “pantry.” Each
item in the list is found in a previous order of the most similar users and is among the
most popular items from the department.
For a new product purchase, we check if the product is found in the recommen-
dation list. If it is, we classify the recommendation as a success. Therefore, the rec-
ommendation list may include otherwise unfavorable items but so long as the user
purchases an item from the list, the recommendation is considered successful, which
makes intuitive sense [13].
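A simplified sketch of the recommendation step just described, assuming the most similar users' order histories are already available; all names and data are hypothetical, and the real pipeline additionally restricts candidates to a single department.

    from collections import Counter

    def recommend(target_history, similar_users_histories, max_items=10):
        """Recommend up to max_items popular products from similar users that the
        target user has not bought yet."""
        counts = Counter(p for hist in similar_users_histories for p in hist)
        seen = set(target_history)
        return [p for p, _ in counts.most_common() if p not in seen][:max_items]

    def is_successful(recommendations, new_purchase):
        """A recommendation is a success if the newly purchased item is on the list."""
        return new_purchase in recommendations

    recs = recommend({"creamy peanut butter"},
                     [{"light brown sugar", "creamy peanut butter"},
                      {"squeeze tomato ketchup", "light brown sugar"}])
    print(recs, is_successful(recs, "light brown sugar"))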

5 The Proposed Method

In this section, we provide a general description of our dataset and the various data
visualizations that were used to gain insight into the Instacart dataset.
The dataset contains between four and 100 orders for each user, with the sequence
of products purchased in each order. It also provides the day of the week and hour of the
day each order was placed and a relative measure of time between orders. The
exploratory analysis performed on the dataset revealed the following.

Fig. 5. Heat map for frequency of day in a week vs hour of the day

Figure 5 is a heat map describing order frequency by hour of the day versus day of
the week. It is observed that the maximum order counts, which are above 50,000, occur
between 1 PM and 3 PM on Sunday (labeled as 0), while on Monday the highest
frequency was observed from 10 AM to 11 AM. There were almost no orders from
midnight to early morning (2 AM to 6 AM) on any day. Surprisingly, a very large
number of orders is placed on Monday morning, even though it is generally a working
day. On the other hand, Saturday, which is a day off for most, has an order frequency
similar to other working days. It is observed that most orders are placed between
9 AM and 4 PM on any given day.
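For reference, the aggregation behind such a heat map can be reproduced with pandas roughly as below. The column names order_dow and order_hour_of_day are assumptions about the dataset's schema rather than something stated in this paper.

    import pandas as pd

    # orders.csv from the Instacart release (column names assumed).
    orders = pd.read_csv("orders.csv")
    heat = (orders.groupby(["order_dow", "order_hour_of_day"])
                  .size()
                  .unstack(fill_value=0))   # rows: day of week, columns: hour of day
    print(heat.max().max())                 # peak hourly order count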

Fig. 6. Bar chart showing frequency distribution by days since prior order

Figure 6 is a bar chart showing the frequency distribution of orders by days
since the prior order. It shows a spike in order frequency at 7 days, followed by relatively
small peaks at days 14, 21 and 28, indicating that order frequency increases every seven
days. Presumably, users prefer to make an order on a specific day of the week. Finally,
there is a huge peak at the end of the range, indicating that most users make purchases
on a monthly basis rather than a weekly or bi-weekly basis.
Figure 7 is a bar plot comparing the frequency of orders for the top 5 aisles. It
is observed that fresh fruits and fresh vegetables have a high frequency of orders,
around 3.5 million. Fresh fruits and fresh vegetables, based on their nutritional
value, are considered healthy food, whereas packaged cheese (whose order
frequency is about 1 million) is considered unhealthy food. From this, we can
conclude that healthy food items are exceptionally common and will appear in a
majority of orders, even when an order may also include unhealthy products. The
sparsity of the utility matrices M1 and M2 is best illustrated by Fig. 8 below: 11% of
the users had approximately five total orders, with the probability rapidly decreasing
as the number of orders increases.

Fig. 7. Bar chart showing Aisle wise frequency for products

Fig. 8. Partial histogram showing number of orders for each user



Figure 8 presents a partial histogram illustrating the number of orders for each user.
Note that order counts exceeding 50 exist but are exceptionally rare. Figure 9 is a heat
map describing the correlation matrix for different attributes of the datasets. We found
that there exists a positive correlation of 0.25 between the reordered and order_number
attributes. The later an order appears in a user's order history, the more likely he or she
is to reorder a previous product. Therefore, as order_number increases, the more likely
it is that reordered will equal one. Furthermore, there is a negative correlation of −0.36
between days_since_prior_order and order_number, meaning that users with a more
extensive order history are less likely to wait many days before ordering again. Thus,
as days_since_prior_order increases, order_number decreases.

Fig. 9. Correlation matrix for various attributes merged from the datasets

6 Results and Analysis

In this section, we report the results we obtained given the implementations that we
described in the methods section. We show the results of user-based collaborative
filtering on providing recommendations. Additionally, we report the trend consistency
of orders that are healthy and unhealthy and propose an improvement to our recom-
mendation algorithm.

6.1 Item Recommendation


To measure the efficacy of user-based collaborative filtering, we randomly sample
1,000 users from the approximately 200,000 users available in the dataset. Larger
samples face computational bottlenecks due to the complexity of the cosine similarity
calculations, as computing the similarity matrix is an O(n²) operation. Each user has their
order history split into a training set and a testing set. Every order except the last is used
in the training set and populates the user-product utility matrices M1 and M2.
Thus, each user's final order is used in the testing set.
For the testing set, we filter out all orders that do not contain a new product
purchase. This restriction results in 821 testable users (i.e., 821 users in the original
sample purchased a new product in their final order).
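A sketch of this leave-last-order-out split, assuming each user's orders are available in chronological order; the helper names are hypothetical.

    def split_user_orders(orders_by_user):
        """Split each user's chronologically ordered list of orders into
        training orders (all but the last) and a single test order (the last)."""
        train, test = {}, {}
        for user, orders in orders_by_user.items():
            if len(orders) < 2:
                continue                      # need at least one order on each side
            train[user] = orders[:-1]
            test[user] = orders[-1]
        return train, test

    def has_new_product(train_orders, test_order):
        """Keep only users whose final order contains a product not seen before."""
        seen = {p for order in train_orders for p in order}
        return any(p not in seen for p in test_order)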

Table 3. Utility matrix results


Utility matrix Success ratio Average closest cosine similarity
M1 0.16 0.125
M2 0.11 0.111

Both utility matrices produced inconsistent results as they provided unsuccessful


recommendations at a relatively high rate. The normalized utility matrix M2 performed
consistently worse than M1 , likely due to the information lost when limiting quantities
to six. This lost information could result in less accurate similarity calculations.
In Table 3, we define the closest cosine similarity for a given user A as the cosine
similarity to the closest user B who has previously purchased the new product that A
has in the testing set. We define this notion as the “optimal” closest user since rec-
ommendations can only come from the closest users, but the closest users often have
not previously purchased the product that the target user will. Again, M1 outperforms
M2, yielding a greater average closest cosine similarity, albeit by a minute amount.

Table 4. Example utility matrix


P1 P2 P3 P4
U1 80 0 0 0
U2 4 8 1 6
U3 5 25 20 30

To illustrate how M2 can result in less accurate recommendations, consider the
example utility matrix depicted in tabular form in Table 4 above. In this example, we
intend to provide a recommendation to U1. Using the utility matrix as defined and
calculating correlation using cosine similarity, we find that sim(U1, U2) = 0.37 and
sim(U1, U3) = 0.11. This makes intuitive sense because U3 prioritizes items that U1
has no interest in and purchased P1 only one more time than U2. Yet, if we normalize
Table 4 to adhere to M2 form, we find that sim(U1, U2) = 0.42 and
sim(U1, U3) = 0.43. These results are counterintuitive and would therefore result in
less accurate recommendations.
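The numbers above can be checked directly; the snippet below recomputes the similarities for Table 4 with raw counts and with counts capped at six, which is how we interpret the M2 normalization described earlier.

    import numpy as np

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    U1, U2, U3 = np.array([80, 0, 0, 0.]), np.array([4, 8, 1, 6.]), np.array([5, 25, 20, 30.])
    print(round(cos(U1, U2), 2), round(cos(U1, U3), 2))                 # 0.37 0.11

    capped = [np.minimum(v, 6) for v in (U1, U2, U3)]                   # M2-style cap at six
    print(round(cos(capped[0], capped[1]), 2), round(cos(capped[0], capped[2]), 2))  # 0.42 0.43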
Next, we found that utilizing user-based collaborative filtering to provide recom-
mendations presents scalability issues [12]. A popular website will involve many
dynamic users that add or change item preferences over time. Thus, similarity calcu-
lations will need to be made between every single user per recommendation. Item-
based collaborative filtering – by only calculating and storing similarity measures
between static items – can offer a solution to scalability issues and thus enable larger
user sample sizes [9]. Furthermore, advanced methods such as clustering and
smoothing can substantially improve the performance of user-based collaborative fil-
tering [11].
Finally, the measure of preference must be accurate when constructing the utility
matrix. The Instacart dataset provided no product rating measures and thus preference
had to be estimated by the number of times a user purchased a product. Although
preference can be inferred (for example, a user who has a large average purchase count
but only purchased a product once was likely unhappy with the product), it requires
even more calculations which further exacerbates scalability issues.

6.2 Healthy/Unhealthy Trend


Given the results of the similarity matrix, we found that the users identified as most
similar shared a trend of being classified as healthy or unhealthy. If user A is similar to
user B, there was a 65% chance that both would be classified as healthy or both as
unhealthy. By exploiting this trend, we can extend our recommendations to products
that are simply healthy or unhealthy, regardless of whether user A or B has bought
that product.

Table 5. Sample selection of healthy/unhealthy trend


Sample size Similarity Consistency
30,563 44% (13,338/30,563) 65% (8,694/13,338)

As shown in Table 5, we ran the similarity matrix algorithm on a random sample
of 30,563 orders. Of those orders, 44% fell within our similarity threshold. Of those
similar orders, we found that 65% followed a trend of being either unhealthy or
healthy. Given the unhealthy or healthy nature of similar orders, our analysis motivates
further investigation into creating recommendations based on nutritional value.
This analysis is important because products not found in either user's cart can be
recommended. Since collaborative filtering algorithms depend on products and items
that the compared users have shown interest in, they provide no insight into products
outside that domain. Using this healthy/unhealthy trend is therefore valuable, as
products outside the domain can now be explored.

7 Conclusion and Future Work

User-based collaborative filtering provided generally inconsistent recommendations.
Due to the sparsity of the data and the dynamic nature of user databases, the ideally
closest users were found to have an average cosine similarity of only 0.125. That is to
say, with a large set of users and a large set of products, finding users who have bought
the same items is relatively rare. Additionally, our usage of the utility matrix only
considers the quantity of products bought. This is problematic, as the fact that a user
bought a product once does not indicate whether the user enjoyed the product (or was
left with a negative impression).
A content-based system that incorporates item features during cosine similarity
calculations would likely be more efficient and ultimately more accurate. Additionally,
taking into consideration an unequivocal level of preference (e.g., an explicit product
rating on a scale from one to five) is essential when attempting to compute similar users
accurately.
Lastly, we found that similar orders were found to follow a trend of being either
“healthy” or “unhealthy.” This pattern was found to be approximately 65% consistent.
We conclude that a classification model that utilizes a mathematical representation of
health and nutrition can be implemented to predict the quality of an order and therefore
provide more accurate recommendations.

Acknowledgment. This research is partially supported by a grant from Amazon Web Services.

References
1. Bell, R.M., Koren, Y.: Lessons from the Netflix prize challenge. SiGKDD Explor. 9, 75–79
(2007)
2. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing with the
Natural Language Toolkit. O’Reilly Media, Sebastopol (2009)
3. Chen, J., Huang, Y., Cohen, I.: Pearson correlation coefficient. In: Springer Topics in Signal
Processing, vol. 2, pp. 1–4. Springer, Heidelberg (2009)

4. Flores-Lopez, A., Perry, S., Bhargava, P.: What’s for Dinner? Recommendations in online
grocery shopping. Final report, Stanford University (2017)
5. Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J.: An algorithmic framework for
performing collaborative filtering. Association for Computing Machinery, pp. 230–237
(1999)
6. Leskovec, J., Rajaraman, A., Ullman, J.: Mining of Massive Datasets. Stanford University
(2015)
7. National Agricultural Library: USDA National Nutrient Database for Standard Reference,
April 2019
8. Newman, M.E.J.: Networks: An Introduction. Oxford University Press, Oxford (2010)
9. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering
recommendation algorithms, pp. 285–295 (2001)
10. Souci, S.W., Fachmann, W., Kraut, H.: Food Composition and Nutrition Tables. Medpharm
GmbH Scientific Publishers, Stuttgart (2007)
11. Xue, G.R., Lin, C., Yang, Q., Xi, W., Zeng, H.J., Yu, Y., Chen, Z.: Scalable collaborative
filtering using cluster-based smoothing. In: Proceedings of the 28th Annual Interna-
tional ACM SIGIR Conference on Research and Development in Information Retrieval,
pp. 114–121. ACM, August 2005
12. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for
collaborative filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in
Artificial Intelligence, pp. 43–52. Morgan Kaufmann Publishers Inc., July 1998
13. Drineas, P., Kerenidis, I., Raghavan, P.: Competitive recommendation systems. In:
Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing,
pp. 82–90. ACM, May 2002
14. Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative
filtering. IEEE Internet Comput. 7(1), 76–80 (2003)
15. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT
Press, Cambridge (2009)
16. Li, D., Zhao, G., Wang, Z., Ma, W., Liu, Y.: A method of purchase prediction based on user
behaviour log. In: Proceedings of the IEEE International Conference on Data Mining
Workshop, Atlantic City, NJ, USA, 14–17 November 2016
17. Ye, F., Zhang, H.: A collaborative filtering recommendation based on users’ interest and
correlation of items. In: Proceedings of the 2016 International Conference on Audio,
Language, and Image Processing (ICALIP), Shanghai, China, 11–12 July 2016
18. Polatidis, N., Georgiadis, C.K.: A dynamic multi-level collaborative filtering method for
improved recommendations. Comput. Stand. Interfaces 51, 14–21 (2017)
Using Topic Modelling to Correlate
a Research Institution’s Outputs
with Its Goals

Nicholas Chamansingh(B) and Patrick Hosein

The University of the West Indies, St. Augustine, Trinidad


nicholas.chamansingh@gmail.com, patrick.hosein@sta.uwi.edu

Abstract. With the increasing pressure on private research organiza-


tions and universities to convert research output into innovative products
and services that can lead to revenue streams, it has become even more
important to ensure that research performed matches research goals set.
If an institution could quantitatively compare their research output with
other highly successful institutions that have similar goals (e.g., their
respective countries have similar needs) then this would allow them to
make appropriate organizational and personnel changes. We address this
problem and demonstrate the approach taken by looking at universities
from countries with similar characteristics and comparing their research
outputs. This is achieved by forming topic clusters using Latent Dirichlet
Allocation and then using a proposed metric for comparison of abstracts
with topic clusters to quantify closeness. We determine an upper bound
on this metric by comparing abstracts that were used to generate the
topic clusters and a lower bound by generating a dataset of randomly
chosen abstracts. We also investigate trending of this comparison over
time by splitting datasets based on temporal information.

Keywords: Topic Modelling · Latent Dirichlet Allocation · Topic


Coherence · Document Context Distance

1 Introduction
Topic modelling looks at the structure of text within a document or body of
text in topically coherent portions. In other words, a topic model is the prob-
ability distribution over a fixed vocabulary [1]. Given a number of documents,
which will serve as the vocabulary, a number of topics can be derived and used
to associate documents that share the probability distribution of text with a
particular topic. As a result, documents with similar contents will be associated
with the same topic distributions. Several researchers have investigated the area
of topic similarity, topic modeling and document similarity based on topics. Var-
ious methods have been proposed by [2–7], where the main focus is to determine
similar documents to a given document or group of documents based on the

topic(s) of that particular document. These methods can be used in recommen-


dation systems to find similar books or consumer reviews for a particular item
(or similar items) [8–10].
In universities across the world research is conducted in many different areas.
These research areas can vary from cutting edge technologies to technologies
that are no longer relevant or even areas which are overly saturated. In the
latter case, it may be difficult to tell when to move from a particular research
area or on which areas to focus. This research aims to solve this problem by
analyzing research over a period of time that is being done by several “successful”
universities with many publications in cutting edge research areas and comparing
it to another university that may have been focused too much on traditional
areas.
Our objective is to use topic modelling to determine how closely the research
being done by different universities are, based on their publications. For this
research we compared the publications from STEM related departments from
3 universities in one country to those of comparable departments from another
country. For convenience, we will refer to the collection of Universities in the
successful country as SU while we will refer to the target University as TU. Specif-
ically, using Topic Modelling we compared the abstracts of the publications from
TU with those of SU over the course of 21 years, 1997 to 2018. This quantitative
comparison can be used to determine the degree by which they differ over time
and hence what actions should be taken by TU. Additionally, we compared one
university that made up the SU dataset to the TU dataset to illustrate their dif-
ference. We believe that this is the first time that such a comparison is made and
also believe that the approach can prove to be useful to universities in developing
countries that are trying to compete with those in developed countries.

2 Related Work
Topic Modeling can be applied to various document related problems. [11] pro-
posed a collaborative Web recommendation scheme based on Latent Dirichlet
Allocation (LDA). The model they proposed made associations between user
sessions and the topics of each web page visited. They used a variational proba-
bility inference technique to estimate the association between user sessions and
multiple topics and the associations between topics and Web page space. Their
work resulted in the discovery of a user’s navigational preferences distribution
over a topic space. Their results have shown that their approach achieved better
recommendations when compared to prior techniques.
The authors in [12] proposed a method to improve standard Collaborative
Filtering based recommendation systems that utilized contextual data avail-
able in the form of text descriptions of items. The features of their model were
derived from latent features of users and items through topic modeling. Their
method provided a hybrid similarity score to refine the neighbourhood forma-
tion which aided in mitigating the sparsity because it allowed the calculation of
the similarity between users if there was no overlap in the ratings among items.

Their method produced high quality recommendations in terms of precision,


recall and f-measure, when compared with standard User-Based and Item-Based
Collaborative Filtering.
The researchers from [5] applied Kuhn’s insight, [13], which conjectures that
changes in scientific studies are non-cumulative developmental episodes in which
an older paradigm is replaced in whole or in part by an incompatible new one.
They accomplished this by applying LDA to the ACL Anthology which contained
all papers in the Computational Linguistics journal and the conferences and
workshops associated with the ACL, COLING and EMNLP. They investigated
the differences and similarities among COLING, ACL, and EMNLP by proposing
to measure the breadth of a conference by using a measure they called topic
entropy which is the conditional entropy of the conference topic distribution.
The research done by [14] focused on clustering news articles coming from two
different language sources. They extended the LDA algorithm to learn two sets
of topics from each language simultaneously resulting in topic distributions for
the documents in both languages. The researchers conducted their experiments
using English and Dutch languages and showed that their method can cluster
documents on the same events across the two languages without relying on trans-
lations or dictionaries. In summary, the previous work on topic modelling using
Latent Dirichlet allocation are mainly focused on document similarity, document
topic trends and recommendations with topic modelling and document cluster-
ing. They illustrated useful and innovative applications of topic modelling based
on the LDA algorithm. Our approach capitalizes on these ideas and also goes
a step further by providing a unique method to compare documents to topics
derived from a different set of documents to determine closeness between them.

3 The Basic Concept

We assume that if a publication relates to a particular topic, say Machine Learn-


ing, then the words used in the abstract to describe the research will contain
words that are usually found together with that particular topic. Now con-
sider another publication which describes classification using a Support Vector
Machine then the words that are used in the abstract should be similar to the
first publication, that is, they should be highly correlated and share the topic of
Machine Learning. In other words, words that appear in the same context share
semantic meaning. As a result, if we were to obtain a number of publications
within a particular field of study, say (Computer Science, Electrical Engineering,
Data Science, etc.) from one university and obtain the publications from another
university with the same field of study then we can determine how closely their
research topics match.
Note that, if we compare documents with the set of topic clusters that were
generated from them then this should provide a value for the optimal match-
ing. Therefore we use the topic clusters generated for SU and compare the SU
documents with this set. This will provide our upper bound. Next we obtain a
lower bound as follows. We obtain a random set of research publications and

compare these with the topic clusters for SU. Since the target in this case is
essentially random then it provides a reasonable lower bound. Therefore as the
metric value obtained for the TU varies from that of the lower bound to that of
the upper bound we get an idea of how closely the TU research output matches
that of the SU research output. One additional application of this research is to
compare the publications from one of the universities that SU is comprised of
and determine how close the target set is to it. This comparison between the
universities can also be done over a period of time to get an idea of the research
output closeness trends. Naturally this approach can be extended to other prob-
lems. In particular, the initial objective of this work was to compare the research
performed at a particular university with the research goals of the university’s
country. However, we were unable to get sufficient information on the research
goals of the country.

4 Topic Modelling

In this section we provide a more detailed description of the approach. Topic


Modelling is an unsupervised learning approach for grouping documents (or
bodies of text) in clusters according to topics based on their content. Similar
to other popular unsupervised learning methods such as K-Means, the individ-
ual words from each document are processed to obtain the topics. A probability
is then assigned to the topics based on the distribution of the words within that
particular topic [15].
For example, say we have an abstract from a paper regarding Support Vector
Machines. We first determine how many clusters are to be formed. Each of these
clusters represents some topic. In this example suppose that there are 3 topics
associated with the abstract. Each topic will have a probability value as it relates
to that abstract, that is (topic 10, 0.72), (topic 7, 0.17) and (topic 4, 0.09), where
the topic numbers are arbitrary numbers assigned by the algorithm. It can be
concluded that topic 10 generally describes the abstract because the probability
was the highest among the 3 topics that were generated.
Latent Dirichlet Allocation is a probabilistic model [16]. Cluster assignments
are obtained by using two probability values: the probability that a word appears
in a topic and the probability of a topic appearing in a document, that is,
P (word|topic) and P (topic|documents). The initial allocations are determined
randomly. This process is repeated for each word in each document in order
to obtain the probability distribution of words within topics and the probabil-
ity distribution of topics within documents. The process is repeated until the
algorithm converges.
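A minimal sketch of fitting LDA with gensim, one common implementation (not necessarily the one used here); the tokenised abstracts and the number of topics are placeholders.

    from gensim import corpora
    from gensim.models import LdaModel

    # Tokenised abstracts (placeholder data; real preprocessing would remove stop words, etc.).
    texts = [["topic", "model", "probability", "word"],
             ["support", "vector", "machine", "classification"],
             ["topic", "distribution", "document", "word"]]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=0)
    print(lda.get_document_topics(corpus[0]))   # (topic id, probability) pairs for one abstract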
Topic Coherence is the measure of a topic and its degree of semantic similarity
between words in a document and a topic. This measure can help highlight topics
that are semantically relevant and topics that are a result of statistical inference
[17–19]. For example, a set of statements about a particular document is said
to be coherent if the statements support the document. That is, the statements
can be interpreted in such a way that it summarizes the document. Practically, a

coherent statement can be “fishing is relaxing” or “fishing allows you time to


think”. This will yield a high coherence score when referring to a document on
fishing and a low score for a document about driving a sports car [20].
The coherence measure that was used for this research is the CV measure
described by [20]. In their paper they described the CV as a single framework
that in the space of topic coherence measures allows the combination of all the
main ideas in the context of coherence quantification. The authors combined the
segmentation of a given word subsets W to derive a set of pairs of subsets of
the words in W with the probability estimation of the joint probability of the
word pairs to estimate the number of documents containing both words divided
by the total number of documents. This is followed by the confirmation measure
which can be described as a measure which takes a single pair Si = (W • , W ∗ )
of words or word subsets as well as the corresponding probabilities to compute
how strong the conditioning word set W ∗ supports W • . Finally all confirmation
scores are aggregated to a single coherence score of all subset pairs Si .
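For reference, the CV coherence of a fitted model can be obtained in gensim roughly as below; this assumes the lda, texts and dictionary objects from the previous sketch.

    from gensim.models import CoherenceModel

    # Assumes lda, texts and dictionary from the LDA sketch above.
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    print(cm.get_coherence())   # higher values indicate more coherent topics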

5 Data Description and Pre-processing


For this study we look at the top publications produced from research performed
at three universities collectively called SU. We also obtained the top publications
in similar departments at the university that is being evaluated, the target Uni-
versity or TU. As mentioned in Sect. 3 the SU dataset was compared with itself
and this was used as the upper bound. We obtained a lower bound by creat-
ing a dataset consisting of randomly chosen publications. These publications
were obtained by using letters from the alphabet as the search term on Google
Scholar; for each search term, the first 10 articles were added to the dataset. We
denote this set by RU (random university).
The data was automatically scraped from Google Scholar; the author's name
was used as the query term when searching for publications for the SU and TU
datasets. For the random university dataset, letters from the alphabet were used
as the query term, in alphabetical order. Each publication contained the authors
of the paper, abstract, title, institute and year of being published. We used a
total of 18177 publications. This consisted of 17601 documents for SU, 476 for
TU and 100 for RU.
To demonstrate the effectiveness of our approach we created intervals of 100
publications from the TU set. For each interval of 100, starting from 2018 and going
backwards to 1997, we obtained the respective publications for those years from SU
and RU to create the datasets for that particular interval. In other words, for the first
100 publications from TU the time interval was 2014–2018, therefore we took all the
publications from SU and RU for those years. For each interval the topic clusters were
generated using SU; then the publications of SU were compared to the topic clusters
to create the upper bound, followed by comparing RU to obtain the lower bound for
that interval. We then compared the publications from the TU set. We also
investigated the differences between one of the universities of which SU is comprised
and TU. This is denoted by SU1. The publications used for a particular interval were
obtained in a similar way as described before.

6 Numerical Results
At each interval the first step involved using LDA to generate topic clusters for
the SU corpus. We started with 3 topics and increased the number of topics by
factors of 2. At each interval the coherence value was computed and compared to
the previous value. This was repeated until the coherence value did not increase
for four consecutive iterations. The number of topics that produced the highest
coherence value was used as the topic distribution for the abstracts. We refer to
this as the base topics, T .
Given the set of topics and the words for each topic we next compared the
documents for TU, RU, SU1 , and SU to determine their closeness to the top-
ics identified from the SU corpus. This is done as follows. Let us consider the
abstracts for TU. For each topic in T we determine the sum of the probabilities of
all words that belong to both the topic as well as the abstract. This is repeated
for all topics and we then choose the topic with the highest probability. This
represents the topic that contains words that are common with the abstract
and also have high probabilities of being in that topic. Hence it represents some
degree of closeness between the abstract and the topic. Finally we take the aver-
age of this highest topic probability over all abstracts in the TU dataset. This is
the metric used as the degree of closeness and will be denoted by CT U , CRU and
CSU for the sets TU, RU and SU respectively. For each set of publications from
the individual universities that make up SU, we will refer to them as CSUi for
university i when computing the closeness.
Note that since the abstracts in SU are the ones that were used to generate
T then there should be significant overlap in abstract words and words in a
specific topic and so this value would form the ideal case. In the case of RU
there is some small probability of overlap but in general this would be low and
hence this forms the lower bound. Our aim is to determine how closely the TU
set of abstract matches those of the ideal set SU. In the case of the publications
from the universities that make up SU the closeness factor will indicate which
universities performs best among them. We therefore define the following metric:
ρ_TU = (C_TU − C_RU) / (C_SU − C_RU)                                            (1)
Note that this metric roughly lies between 0 and 1 with ρRU = 0 and ρSU = 1.
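A sketch of how the closeness values and the metric in (1) might be computed from a topic model's word probabilities; the helper names are hypothetical and only outline the procedure described above.

    def closeness(abstract_tokens_list, topic_word_probs):
        """Average, over abstracts, of the highest per-topic sum of probabilities of
        words shared by the abstract and the topic.

        topic_word_probs: list of dicts, one per topic, mapping word -> probability.
        """
        scores = []
        for tokens in abstract_tokens_list:
            words = set(tokens)
            best = max(sum(p for w, p in topic.items() if w in words)
                       for topic in topic_word_probs)
            scores.append(best)
        return sum(scores) / len(scores)

    def rho(c_tu, c_ru, c_su):
        """Closeness factor from Eq. (1), roughly in [0, 1]."""
        return (c_tu - c_ru) / (c_su - c_ru)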
Using the values that were computed for the datasets described above we
computed the various values for each of the respective sets of abstracts over
time and obtained the following results illustrated by Fig. 1. As expected, ρSU1
is very high since we are comparing the documents from a university that was
used to generate the topics. Furthermore this value does not vary significantly
over time. However, when we look at TU we see that its value is significantly
less. We also can see how this closeness varies over time. It appears that from
2007 onward the research performed at TU gradually improved and seems to be
approaching that of the best university in the set that made up SU.
Next we look at the contents of the generated clusters. In Table 1 we show
the top 15 words for each of the top topic clusters, that is, the top topic cluster

Fig. 1. Closeness factor over time (y-axis: ρ; series: SU1 and TU; periods: 1997–2007, 2007–2010, 2010–2014, 2014–2018)

that contained the most abstracts, over the period 1997 to 2018. We also look
at the respective top cluster words for the TU set. Here we can clearly see the
change in research focus over time. For each period, the number of topics that
produced the highest topic coherence value were (1997–2007, 25) (2007–2010,
25) (2010–2014, 20) (2014–2018, 15).

7 Discussion
For each interval, the number of publications in each set (SU1 and SU) varied,
and this resulted in different numbers of topics. For example, for the period
1997–2007, 25 topics produced the highest topic coherence value, while for
2014–2018, 15 topics produced the highest value. Table 1 illustrates the associated
top topics over the period 1997–2018. The topics are represented as lists of words.
Based on the words that represent the topics, there were no consistent sets of top
topics across the periods, and we can see a change in research focus over the years.
A by-product of our research was that we were able to compare the closeness
between a university that made up the SU set and the target set. Because
these datasets were used to generate the topic models, it was expected that the
closeness values would be much higher than that of the TU dataset. This was
indeed so, as illustrated by Fig. 1. The SU1 set was seen to have a high value
when compared to the TU set, which implies that the top topics shown in
Table 1 are closely related to SU1 but have a weaker relationship with TU.
Our initial aim of this research was to compare TU to SU and determine the
closeness of the research between the two sets. Our metric lies between 0 and 1

Table 1. Top 15 words for each top topic cluster from 1997–2018, SU & TU

2014–2018
SU top topic: Feature data image system based learning real world model propose
approach deep learning method first large scale performance
TU top topic: Image network based graph user data model proposed propose different
problem system scheme video experiment

2010–2014
SU top topic: Temperature film interface electron layer dimensional electron property
room temperature energy graphene surface effect observed high magnetic laalo srtio
TU top topic: Algorithm service problem network node near duplicate performance
annealing show based result proposed optimization systems

2007–2010
SU top topic: Registration image data scheme experimental result mirror based
information method result atlas proposed oxide
TU top topic: Network system design based result real time model time printed
information scheme data function task propose problem timed

1997–2007
SU top topic: System coupling tool support wireless communication repudiation atomic
language query exchange data application programming language peer peer speech
recognition repudiation protocol
TU top topic: System model based data sensor approach network method feature design
device first notation present scheme

and results show that over the years TU has fluctuated between higher and lower
values when compared to SU1. In the period 1997–2007 TU had a value closer
to SU1, then dropped for the next period and rose again for the subsequent
intervals. This may imply that TU takes a while to change research focus and
direction; that is, the periods with lower values suggest that the research areas
were becoming outdated. This is further illustrated in Table 1, where we can see,
based on the top topic words, that there are topics from SU and TU that overlap
in 2014–2018.
Over the years there is a clear distinction between the top topics from
the period 1997–2007 and the period 2014–2018. These consisted of words
like “system coupling tool support wireless communication repudiation atomic
language query exchange data application programming language peer peer
speech recognition repudiation protocol” for the period 1997–2007 and words
such as “feature data image system based learning real world model propose
approach deep learning method first large scale performance” for the period
2014–2018.
To determine the sensitivity of our approach we varied the number of
abstracts used from TU and compared it to the entire SU set. We sorted the
TU set by year and took the first 100 abstracts and subsequently added publica-
tions by 100 at a time until we reached the total number of TU publications and
calculated the closeness at each iteration. The results were: (100, 0.53), (200,
0.63), (300, 0.68), (398, 0.67). From the results we find that the closeness factor

does not vary too much with the number of abstracts as long as at least 200 are
used. In our numerical results we had to use 100 because of the dataset size, but
our objective there was relative rather than absolute performance.

8 Conclusion and Future Work


Our research shows a unique way of determining research closeness between two
datasets of different types of publications. In our experiments, we tested with
two datasets TU and SU as well as the universities that made up SU, one consist-
ing of many publications in cutting edge areas and the other with publications
in general areas. The results showed that there is a gap between the TU and SU
set throughout the years; however, in the period 2014–2018 there was a reason-
able increase. Even though the TU value for the period 2014–2018 has increased
implying that research has started shifting to leading edge research, more focus
should be placed on the dominant topics identified from SU. For future work we
plan on using this research to determine the closeness of the universities of a
country with the goals of the country.

References
1. Lin, L., Tang, L., Dong, W., Yao, S., Zhou, W.: An overview of topic modeling
and its current applications in bioinformatics. SpringerPlus 5(1), 1608 (2016)
2. Chen, Y., Bordes, J.-B., Filliat, D.: An experimental comparison between NMF
and LDA for active cross-situational object-word learning. In: 2016 Joint IEEE
International Conference on Development and Learning and Epigenetic Robotics
(ICDL-EpiRob), pp. 217–222. IEEE (2016)
3. Purushotham, S., Liu, Y., Kuo, C.-C.J.: Collaborative topic regression with social
matrix factorization for recommendation systems. arXiv preprint arXiv:1206.4684
(2012)
4. Agarwal, D., Chen, B.-C.: fLDA: matrix factorization through latent dirichlet allo-
cation. In: Proceedings of the Third ACM International Conference on Web Search
and Data Mining, pp. 91–100. ACM (2010)
5. Hall, D., Jurafsky, D., Manning, C.D.: Studying the history of ideas using topic
models. In: Proceedings of the Conference on Empirical Methods in Natural Lan-
guage Processing, pp. 363–371. Association for Computational Linguistics (2008)
6. Jacobi, C., van Atteveldt, W., Welbers, K.: Quantitative analysis of large amounts
of journalistic texts using topic modelling. Digit. J. 4(1), 89–106 (2016)
7. Wang, C., Blei, D.M.: Collaborative topic modeling for recommending scientific
articles. In: Proceedings of the 17th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 448–456. ACM (2011)
8. Liu, Y., Niculescu-Mizil, A., Gryc, W.: Topic-link LDA: joint models of topic and
author community. In: Proceedings of the 26th Annual International Conference
on Machine Learning, pp. 665–672. ACM (2009)
9. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Nat. Acad. Sci.
101(suppl 1), 5228–5235 (2004)
10. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for
authors and documents. In: Proceedings of the 20th Conference on Uncertainty in
Artificial Intelligence, pp. 487–494. AUAI Press (2004)

11. Xu, G., Zhang, Y., Yi, X.: Modelling user behaviour for web recommendation using
LDA model. In: IEEE/WIC/ACM International Conference on Web Intelligence
and Intelligent Agent Technology, WI-IAT 2008, vol. 3, pp. 529–532. IEEE (2008)
12. Wilson, J., Chaudhury, S., Lall, B.: Improving collaborative filtering based rec-
ommenders using topic modelling. In: Proceedings of the 2014 IEEE/WIC/ACM
International Joint Conferences on Web Intelligence (WI) and Intelligent Agent
Technologies (IAT), vol. 01, pp. 340–346. IEEE Computer Society (2014)
13. Kuhn, T.S.: The Structure of Scientific Revolutions, vol. 2. University of Chicago
Press, Chicago (1963)
14. De Smet, W., Moens, M.-F.: Cross-language linking of news stories on the web
using interlingual topic modelling. In: Proceedings of the 2nd ACM Workshop on
Social Web Search and Mining, pp. 57–64. ACM (2009)
15. Nikolenko, S.I., Koltcov, S., Koltsova, O.: Topic modelling for qualitative studies.
J. Inf. Sci. 43(1), 88–102 (2017)
16. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
17. Stevens, K., Kegelmeyer, P., Andrzejewski, D., Buttler, D.: Exploring topic coher-
ence over many models and many topics. In: Proceedings of the 2012 Joint Confer-
ence on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning, pp. 952–961. Association for Computational Linguis-
tics (2012)
18. Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic
coherence. In: Human Language Technologies: The 2010 Annual Conference of the
North American Chapter of the Association for Computational Linguistics, pp.
100–108. Association for Computational Linguistics (2010)
19. Mimno, D., Wallach, H.M., Talley, E., Leenders, M., McCallum, A.: Optimizing
semantic coherence in topic models. In: Proceedings of the Conference on Empirical
Methods in Natural Language Processing, pp. 262–272. Association for Computa-
tional Linguistics (2011)
20. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence mea-
sures. In: Proceedings of the Eighth ACM International Conference on Web Search
and Data Mining, pp. 399–408. ACM (2015)
Long Period Re-identification Approach
to Improving the Quality of Education:
A Preliminary Study

Irina Arhipova(&) , Gatis Vitols , and Inga Meirane

WeAreDots Ltd., Riga LV 1050, Latvia


irina.arhipovasalajeva@gmail.com,
gatis.vitols@llu.lv, inga.meirane@wearedots.com

Abstract. Early school leaving is one of the most frequently mentioned reasons
to social exclusion later in life. In order to reduce the risk of early school
leaving, it is necessary to automate the process of entering unjustified lessons’
delays in school management system. A person’s re-identifying (Re-ID) is a
complex automated process, where most studies use an approach to analyze the
descriptors of clothing and appearance that are intended for the use of short-
period Re-ID. In contrast, there is not much research in the real-time long-term
Re-ID process, when images or videos are taken at intervals of several days or
months in an uncontrolled environment. In this case descriptors characterizing a
person’s biometric identity based on unique features, such as a facial digital
image, are required. The objective of this research is to develop a real-time
person’s long-term re-identification approach for accounting of non-attended
lessons in educational institutions. The proposed Re-ID mechanism includes
face identification and new method of using multiple face etalon versions and
multiple versions of descriptors for a single person. This allows Re-ID of a
person in different clothing and appearance from different camera angles in a
long term.

Keywords: Early school leaving · Face recognition · Re-identification

1 Introduction

Early school leaving is one of the most frequently mentioned reasons for social
exclusion later in life [1], and one of the known contributing factors is regular non-
attendance of school classes by pupils. According to the regulations of the Cabinet of
Ministers of Latvia, the head of the educational institution determines the procedure
for registering each pupil's attendance or non-attendance at school every day. For
example, pupils' delays are recorded in the study e-journal by the class teacher; every
Friday the class teacher summarizes the delays for the week, notes the reason for each
delay, and informs the pupils' parents about unjustified delays. Real-time registration
of lesson delays is not provided for in existing procedures, and as a result there is a
risk to data quality. Because data are entered manually, errors can be made: a teacher
who replaces another teacher during classes is not familiar with all pupils and is
guided only by the total number of pupils, or at the beginning of the school term the
teacher is not yet familiar with all the pupils in his or her lessons.
Checking and marking pupils' attendance must be done during the lesson and takes
extra time away from it. Problems with registration also arise when a pupil comes to
the lesson late or leaves the classroom for various reasons, or when the pupil is on the
school territory but not in the room where the lesson takes place.
According to the Law on Education in Latvia, the head of an educational institution
is responsible for security at a secondary school. The set of measures to be implemented
by the heads of educational institutions and municipalities may differ, taking into
account the requirements for the safety of pupils. These requirements are derived from
the potential hazards at each educational institution, and the threats mainly depend on
the location of the educational institution and its social environment. For example,
unauthorized persons arriving at an educational institution must report to a duty
attendant, stating their name, surname, the purpose of the visit and the person they have
come to see. Pupils are forbidden to bring objects that are not necessary for study work.
Registration of personal identity data is not automated, which affects pupils' safety and
leads to problems such as the inability to:
• indicate the exact number of persons present during evacuation or emergency
situations;
• identify pupils who are in the educational institution but not attending classes;
• control pupils staying in the yard of the educational institution without permission;
• control pupils' behaviour;
• control unauthorized objects brought in by pupils;
• control and register the stay of unauthorized persons on the territory of the
educational institution.
In order to reduce the risk of early school leaving, as well as to increase pupils' safety,
facilitate the administrative work of the educational institution and improve the
reliability and timeliness of unjustified-delay records, it is necessary to automate the
process of entering unjustified delays into the school management system.
Identifying or re-identifying a person is essentially the task of establishing a person's
identity and location by retrieving data from multiple devices. Re-ID is the process of
matching images captured by multiple devices; it is used to determine whether digital
images from multiple devices belong to the same person. A person's re-identification
is a difficult automated process for several reasons, such as the person's differing
appearance as captured by different devices in different locations. Short-term Re-ID
establishes correspondence between images of a person taken by different devices
within a few minutes, whereas long-period Re-ID is the case when the images are
taken several months apart [2].
The objective of this research is to develop a real-time long-term person’s re-
identification (Re-ID) approach for accounting of non-attended lessons in educational
institutions, increasing pupils’ safety and reducing administrative burdens.
Proposed tasks and solutions for improving the quality of education are given in
Table 1.

Table 1. Proposed tasks and solutions for the improvement of the quality of education.
1. Task: Increasing security and safety. Solution: Identification of the physical person
(face digital image). Result: Reducing the risks of potential hazards at a school.
2. Task: Improving the quality of the implementation of the education program.
Solution: Automated input of non-attended lessons' hours into the school management
system. Result: Reducing the risk of early school leaving in an educational institution;
pupils' parents are informed in real time about non-attended lessons.
3. Task: Reducing administrative burdens. Solution: Automated input of lesson
attendance data into school management systems. Result: Automated tracking of
attendance at lessons.

2 Factors of the Long-Term Re-identification Process

Descriptors describing a person’s identity are obtained with several devices with dif-
ferent technical parameters. The differences in the technical parameters of the devices
make the Re-ID process more difficult. If two images are taken a few minutes or
hours apart, one can assume that the person's visual appearance, for example the
clothing, will be the same. Such a re-identification scenario is called short-period Re-ID. If
images or videos are taken at intervals of several days or months, the person’s re-
identification process is a long-period Re-ID [2]. Distribution of images over a long
period of time (in the long run) is one of the complexity factors of Re-ID.
In 2018 alone there were 380 articles on Re-ID indexed by SCOPUS in computer
science related publications. Most studies in the Re-ID area analyze descriptors of
clothing and appearance, which are intended for short-period Re-ID. In contrast, there
is not much research on the long-term Re-ID process. If a person's re-identification is
intended to work over a long period of time, descriptors characterizing a person's
biometric identity based on unique features, such as a facial digital image, are
required.
Modern face recognition methods can successfully identify many individuals. However,
these results are obtained with high quality (resolution) data obtained in a controlled
environment (lighting, posture, settings). The use of biometric data for re-identification
is problematic as Re-ID data is generated in an uncontrolled environment. Automatic
face recognition with low resolution images, considering changes in posture, age,
lighting, is still a problem. Thus, the use of biometric information in Re-ID is theo-
retically possible but has practical implementation problems [2].
Scalability is a key factor for the Re-ID process. It is necessary to ensure the
technology’s ability to adapt to changing factors while maintaining performance. The
following scalability issues are topical:
• in real applications, the gallery size is large and constantly increasing; as a result,
general methods based on coexistence ranking are not effective;
• to increase uniqueness, descriptors are large in size and expensive to obtain, which
affects the complexity of the Re-ID process;
160 I. Arhipova et al.

• effective data analysis requires a large amount of memory and computing resources;
• automatic video analysis can be simplified by using data-processing devices (smart
cameras) and communication between cameras; however, Re-ID systems that are
intensive in memory and computing resources cannot be scaled simply by working
with low-power processors and narrow bandwidth transmission channels [2].
Thus, the actual problems in practical Re-ID systems are scalability and compu-
tational complexity. Available solutions to resolve this issue are limited because dis-
tribution of images over a long period of time (in the long run) is one of the complexity
factors of Re-ID. The use of biometric data for re-identification is problematic as Re-ID
data come from an uncontrolled environment. There are practical scalability and
computational complexity problems in the Re-ID system.

2.1 External Factor Impact on Long Term Re-identification


There are various attempts to address the complexity of, for example, face recognition.
One such solution [3] assumes that people tend to have a limited wardrobe and that an
estimated wardrobe can help with long-term person re-identification. Another option
is to look at pedestrian recognition solutions. Pedestrian recognition deals with lower
quality images which are obscured and often blurry. If such a dataset is used to train,
for example, deep neural networks, then there is a risk of over-fitting, which leads to
poorly trained models. There are attempts [4] to improve the quality of pedestrian
images in order to provide better training sets for such algorithms.
Promising research [5] proposes a context-aware method for long-term re-
identification in a walking scenario, which takes into consideration gait and the
proportions of a person's body, considered soft-biometric features. However, at
present a Kinect sensor is used for identification, and data quality depends heavily on
the posture of the person towards the camera, which makes this solution less reliable
for practical application of Re-ID. There is an extra constraint that must be considered
when performing re-identification for pupils and youngsters: their rapid growth and
changes in face structure, bone length, hair, eye and facial colour, etc.

2.2 Application of Biometric Templates for Person Face Re-Identification in Various Age Groups
A typical scenario for face identification is that an image of a person is taken, scanned,
and compared to a set of images pre-stored in a gallery, which is considered a
one-to-many scenario. Depending on the type of image, various methods can be applied.
For example, when comparing an image to a sketch or an infrared image, a heterogeneous
face recognition method can be applied [6]. However, recognition is typically per-
formed on images created by security cameras, airplane boarding gate cameras, or
consumer cameras. The problem with images of persons is that the recognition process
involves various uncontrolled factors, such as
• variation in facial expressions;
• changes in facial hair;
• pose variations;
• light variations;
• makeup, scars, and damage to the face; and
• aesthetic surgeries and aging.
Not to mention other constraints relating to hardware, blurry images, low-resolution
images, etc. For each constraint there are various hardware solutions and a diversity
of methods that can possibly be applied [7]. Algorithms applied for face recognition
use various parameters for identification, such as the length of the jaw line, the depth
of the eye sockets, the width of the nose, the distance between the eyes, the shape of
the cheekbones, and others.
One of the challenges arising in face recognition, where the limits of current recognition
methods are reached, relates to rapid changes of the face, such as after an accident, an
aesthetic surgery, or the facial changes of infants and youngsters. According to a report
from The American Society for Aesthetic Plastic Surgery, in the US alone people spend
over 8 billion US dollars on aesthetic procedures each year, and among the most
demanded procedures relating to facial appearance are chin augmentation, nose surgery,
and eyelid and ear aesthetic surgeries [8]. For example, South Korea leads the list in the
frequency of aesthetic face procedures, with 13.1 procedures per 1,000 capita. Studies
show that more than 40% of female college students undergo some cosmetic surgery [9].
These changes can significantly impact face recognition.
Research shows [10] that after aesthetic face surgeries, such recognized algorithms
as Principal Component Analysis, Fisher Discriminant Analysis, Geometric Features,
Local Feature Analysis, Local Binary Patterns, and Neural Network Architectures cannot
be applied successfully, especially if the surgery changes various markers of the face,
e.g. the nose, skin pattern, or chin. Aging is another factor in face recognition.
Observational studies [11] show that people have personalized aging patterns that
depend on such factors as lifestyle, genetics, environment, and stress level. The authors
of [12] conclude that "the variations in the shape of a face are more prominent while in the
later stages of life, texture variations such as wrinkles and pigmentation are more
visible".
There are research papers on estimating a person's age from facial recognition, with
fields of application such as checking whether a person is allowed to buy alcohol and is
not under the legal age limit. Results show that age prediction is easiest for the infant
and toddler group, where other factors such as ethnicity do not play a major role. More
challenges come in later age groups, with face recognition for people above 60 years old
being the most challenging [12].
Baby-face versus adult-face recognition solutions are available and work with
improved precision [13]. An algorithm has been developed to identify newborns, toddlers,
and pre-school children aged 0–4 by extracting unique features of children. The proposed
algorithm uses a deep learning model that applies class-based penalties while
learning the filters of a convolutional neural network [14].
The authors of [12] identify two aspects of an age-invariant face recognition system: facial
age estimation and age-separated face recognition. It is concluded that the eye region is one
of the most important for age prediction. There have been successful solutions for
determining not only the age of a person but also gender and race [15], which shows the expansion of
possibilities with face recognition. The identification of children's faces raises additional
challenges in practical applications [16]:
• children's faces are smaller than adults', so distinctiveness is reduced;
• children are less cooperative and more active, a factor that makes capturing the
face problematic; and
• children's faces change rapidly.
The analysis of children's behavior in the classroom is done using security cameras in the
classroom and analyzing the images from those cameras [17]. To detect whether children pay
attention to the designated spatial area, the distribution of the focus points in two dimensions
is computed, after which anomalous points are identified.

3 Results

The proposed Re-ID process excludes the use of appearance descriptors in face identification
[18] and instead relies on descriptors of the human face in the form of a biometric template
(a set of values rather than a mathematical formula of the face), with a version supervision
process and an independent evolution option. The proposed mechanism consists of M × N,
where M is an etalon version of the biometric pattern and N is a version of a combination of
descriptors from the biometric template with exact values by which a person can be
identified with high precision (see Fig. 1).

Fig. 1. Reliability of the biometric and appearance descriptors in a long term.

The number of combinations of biometric descriptor values that uniquely identify a person
decreases over the years. The most significant changes in biometric descriptors occur during
childhood. In contrast, rapid changes in descriptors of appearance occur every day; these
changes are unpredictable and cannot be used to identify a person with high precision even
in the short term.
3.1 Overview of the Proposed Mechanism


The proposed M × N mechanism includes updating the etalon version of the biometric
pattern and creating a new etalon version when the similarity threshold starts to
decrease and a new candidate with a high-quality biometric pattern is identified. The aim
is to identify a person with high precision despite the multiple "correct" faces of an
individual person.
Each of these faces may differ heavily from the others, i.e. one with makeup and one
without, one with some sort of mask and one without, etc. Nevertheless, the identification of
a person with a fully covered face is not a part of this research: a part of the face must remain
visible in order to extract the biometric template from the photo that is obtained from the
video feed.
Further investigation in this research will include the identification of the different factors
that are critical to successfully identifying a person with high precision. These factors
include the determination of a minimal set of dimensions from the descriptors of the
biometric pattern and the identification of equivalence classes in order to reach high
precision in the person identification process. The proposed mechanism consists of three steps:
• Video feed processing in order to identify the object in the crop – consists of processing
the video feed, tracking and localizing the object in the frame, and processing the
crop.
• Creation of the biometric pattern and identification of the person – consists of extracting
the feature vector, creating the biometric pattern, and identifying the
person.
• Elaboration of the knowledge base – this step involves activating two separate
processes, a supervised and an unsupervised learning process.

3.2 Decomposition of Proposed Mechanism


The high-level steps are logically associated with business processes that result in the
identified object (face) in the crop, which is used to create the biometric pattern and identify
the person by using this template. Thereafter, the knowledge base is elaborated by
using supervised or unsupervised learning (see Fig. 2).
Video Feed Processing in Order to Get the Identified Object in a Separate Crop
The first series of actions is to process the video feed, track the object, and extract the
frame in real time. Object tracking can be realized by using existing algorithms, for
example VGGFace net [19], FaceNet [20], Single Shot multibox Detection (SSD),
YOLO, the Fast Region-based Convolutional Network method (Fast R-CNN), or the Viola-
Jones object detection framework. Detection-based multi-object tracking algorithms
create links between instances of an object (person) across frames. The subsequent actions
are the localization and extraction of the frame, the localization of all faces, and the
cropping of the faces one by one out of the frame. Face cropping can be realized by using
algorithms such as Deep SORT [21] and TracTrac [22]. The Deep SORT algorithm enables
object re-identification in several frames that are not sequential.
There are differences among the mentioned algorithms. For example, Fast R-CNN
performance indicators are low, but its object detection accuracy is proven on small
objects. Viola-Jones performance indicators are high, but its precision rate is low. SSD
offers more configuration options. Therefore, the SSD algorithm will be included in the
further investigation due to its configuration options, even though the YOLO performance
indicators and object detection accuracy rates are similar.
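To make the first step concrete, the following is a minimal sketch of frame extraction and face localization, using OpenCV's Haar cascade (Viola-Jones) detector as a stand-in for the detectors listed above. The video path, the sampling rate, and the helper name face_crops are illustrative assumptions, not part of the proposed mechanism.

```python
# Minimal sketch of the video-feed processing step: read frames, localize faces
# with OpenCV's Viola-Jones (Haar cascade) detector, and cut out one crop per face.
# The video path and the sampling rate are illustrative assumptions.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_crops(video_path, every_nth_frame=10):
    """Yield (frame_index, face_crop) pairs from a video file or camera stream."""
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_nth_frame == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
                yield index, frame[y:y + h, x:x + w]
        index += 1
    capture.release()

# Example usage (assumes a local file "classroom.mp4"):
# for i, crop in face_crops("classroom.mp4"):
#     cv2.imwrite(f"crop_{i}.png", crop)
```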

Fig. 2. Person long-period re-identification model.

Creation of the Biometric Pattern and Identification of a Person
Feature vectors are extracted from the obtained crop with the detected object. Further
research will be done in order to evaluate whether a Siamese Neural Network complies with
the requirements for processing the feature vector so as to achieve a high face identification
accuracy rate in the long term along with adequate performance indicators.
A feature vector is created from the crop and processed through the Siamese Neural
Network. The completeness of the feature vector depends on the quality of the crop. Therefore,
an object detection algorithm with rich configuration options (SSD) is important in order
not to lose details that are necessary to achieve a high identification rate. The biometric
pattern is created from the feature vector. The biometric pattern is a unique set of values, rather than a
formula of the face, that defines a person's identity. This set of values includes infor-
mation on descriptors of face anatomy that are not influenced by descriptors of
appearance.
Thereafter, the biometric pattern is used to identify the person and can be used to re-identify
the person in the long term, due to its minor changes on an everyday basis. The identification of
a person includes a search for similar biometric patterns in the knowledge base in order
to find the most suitable biometric pattern among the existing ones. Further research is needed
to determine and define similarity-threshold equivalence classes that describe high precision,
low precision, critical points, etc., to estimate the result of the biometric pattern
comparison. If there is a similar biometric pattern in the knowledge base that is compliant
with the necessary similarity threshold, the person's identification parameters can be
output for further processing.
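To illustrate the threshold-based search described above, the following is a small sketch of matching an extracted pattern against stored etalon versions. The cosine-similarity measure, the 0.8 threshold, and the data layout are assumptions, since the similarity equivalence classes are left to further research.

```python
# Illustrative sketch of matching a freshly extracted biometric pattern against a
# knowledge base of stored etalon versions. The cosine-similarity measure and the
# 0.8 threshold are assumptions, not values prescribed by the proposed mechanism.
import numpy as np

def identify(pattern, knowledge_base, threshold=0.8):
    """Return (person_id, similarity) of the best match, or (None, best) if below threshold.

    pattern:        1-D feature vector of the detected face
    knowledge_base: dict mapping person_id -> 2-D array of stored etalon versions
    """
    best_id, best_sim = None, -1.0
    p = pattern / np.linalg.norm(pattern)
    for person_id, versions in knowledge_base.items():
        v = versions / np.linalg.norm(versions, axis=1, keepdims=True)
        sim = float(np.max(v @ p))          # best similarity over all etalon versions
        if sim > best_sim:
            best_id, best_sim = person_id, sim
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)
```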
Elaboration of the Knowledge Base by Using Supervised or Unsupervised Learning
The data in the knowledge base and the continuous training of the Siamese Neural Network are
the keys to successful long-term person identification and are the main objective of this
research. In order to achieve a high identification rate, it is necessary to fill the knowledge
base with high-quality biometric patterns by using supervised learning. Several high-quality
photos should be used, taken from different angles, for example full frontal face and right or
left profile, and processed through the proposed mechanism in a supervised learning manner
in order to create a pattern etalon version and link it to the other pattern etalon versions.
If a person's biometric template consists of several linked template versions that are made
at one time, the patterns can be blended together or used independently during the detection
of the best-matching pattern. High-quality etalon versions are significant for a person's
long-term identification and for the unsupervised learning process.
There is an alternative method to elaborate the data in the knowledge base in an
unsupervised learning process. The unsupervised learning process starts with an evaluation
of the tracking results. The result consists of a conclusion on whether the same person has
been identified successfully, or not identified, in several non-sequential frames. During the
unsupervised learning process (without human interaction), an option can be identified and
proposed to elaborate an existing pattern etalon version if there is a high-quality candidate
for a new pattern etalon version or for a version of an existing pattern etalon.
A candidate for an identified person can be selected during the unsupervised learning
process by continuously elaborating a biometric pattern etalon version that consists of
more than one version. For example, a person is detected in the video feed. The face is
fixed in a crop. The crop is processed through the neural network. The quality of the crop is
sufficient to build a biometric pattern. A few seconds later, the face is fixed in another crop,
and so on. The object tracking algorithm provides an option to generate pairs and triplets that
consist of object instances detected in several crops and can be used to train the
Siamese Neural Network. During the video streaming process, several persons can be
detected without a defined sequence.
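As an illustration of how tracker-generated triplets could be used, the following is a minimal PyTorch sketch of a triplet-loss training step for an embedding network. The architecture, embedding size, and margin are illustrative assumptions and not the network used in this work.

```python
# Minimal PyTorch sketch of training an embedding network on (anchor, positive,
# negative) crops harvested by the tracker, as described above. The architecture,
# embedding size, and margin are illustrative assumptions.
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, x):
        # L2-normalized embeddings make the distance-based loss better behaved
        return nn.functional.normalize(self.features(x), dim=1)

net = EmbeddingNet()
loss_fn = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

def training_step(anchor, positive, negative):
    """One update on a batch of triplets produced by the multi-object tracker."""
    optimizer.zero_grad()
    loss = loss_fn(net(anchor), net(positive), net(negative))
    loss.backward()
    optimizer.step()
    return loss.item()
```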
There should be a mechanism that can receive data about the compatibility among
persons detected in several crops and blend the information together in order to create a
complete biometric pattern that may consist of one or more pattern etalon versions.
The Siamese Neural Network (a part of the proposed mechanism) should be trained to be
able to take a decision on the basis of crops that are obtained in the short term. Compatibility
among pattern etalon versions that are not made sequentially at one time can be detected in
an evolutionary process. Determining the short-term equivalence class is a further field of
research. The evolution of the blending of data about several persons from the tracking
algorithm can be done in one or more steps. Each step elaborates the links between one or
more versions. During the biometric pattern creation process, a search for the best match
can be applied in the knowledge base. Thereafter, it is possible to assign identities to all
persons detected in the crop. Each crop is made in a determined place and at a determined
time.

4 Conclusions

In order to reduce the risk of early school leaving, as well as to increase pupils' safety,
facilitate the administrative work of the educational institution, and promote the reliable
and prompt recording of unjustified delays, it is proposed to automate the process of
entering unjustified delays into the school management system.
A real-time long-term person re-identification (Re-ID) approach for the accounting of
non-attended lessons in educational institutions has been developed. The authors propose a
mechanism M × N, where M is an etalon version of the biometric pattern and N is a version
of a combination of descriptors from the biometric template with exact values by which a
person can be identified with high accuracy. Descriptors of the biometric pattern are more
reliable for a person's Re-ID in the long term because they change evenly over a period of time.
Automatic face recognition in a real-world setting must be performed with lower-
resolution images, as the majority of images are taken with consumer-level equipment; thus,
taking into consideration changes in posture, age, and lighting is one of the main issues that
must be addressed.
Rapid changes in an individual's face can significantly decrease the efficiency of existing
face recognition algorithms, especially in the groups of infants, kids, seniors (above age 65),
and people after facial damage or certain aesthetic surgeries.
The developed mechanism consists of video feed processing in order to track and
localize the object in the crop, the creation of the biometric pattern and identification of the
person, and the elaboration of the knowledge base by using supervised and unsupervised
learning processes. The supervised learning process includes manual human interaction in
order to create links between one or more etalon pattern versions.
The unsupervised learning process includes the continuous elaboration of the knowledge
base by evaluating the results from the tracking, searching for the etalon pattern version with
the best match, creating a new etalon pattern or its version, and blending one or more etalon
pattern versions into one biometric etalon pattern version in order to create a complete
biometric pattern and to be able to link the created biometric pattern (which may consist of
several etalon pattern versions created from events fixed in two separate video feeds) to the
identified person's data without human interaction.
Limitations of this study include that we do not evaluate the optimal technological solution
for the task, such as deep neural network architectures or the tuning of hyperparameters,
e.g. selecting appropriate activation functions.
A future research direction is finding the most appropriate technological solution for a
successful implementation, as well as organising experiments in classrooms with students.
Possible challenges are the data privacy of pupils, legal procedures (permissions, internal
documentation of the school, etc.), as well as the selection of an appropriate training data
set and model. For person re-identification, it is planned to investigate the application of a
Siamese triplet model as well as benchmark data sets such as i-LIDS and VIPeR. The
proposed solution will raise additional challenges in non-controlled environments, for
example walking streets and non-seated concert halls, mainly relating to the quality of the
captured data, e.g. a person's face, gait, and other biometric markers.

Acknowledgments. The research leading to these results has received funding from the project
“Competence Centre of Information and Communication Technologies” of EU Structural funds,
contract No. 1.2.1.1/18/A/003 signed between IT Competence Centre and Central Finance and
Contracting Agency, Research No. 2.1 “Person long-period re-identification (Re-ID) solution to
improve the quality of education”.

References
1. Nevala, A.M., Hawley, J., Stokes, D., Slater, K., Otero, M.S., Santos, R., Duchemin, C.,
Manoudi, A.: Reducing early school leaving in the EU. European Parliament, Brussels
(2011)
2. Bedagkar-Gala, A., Shah, S.K.: A survey of approaches and trends in person re-
identification. Image Vis. Comput. 32(4), 270–286 (2014)
3. Lee, K.W., Sankaran, N., Setlur, S., Napp, N., Govindaraju, V.: Wardrobe model for long
term re-identification and appearance prediction. In: 15th IEEE International Conference on
Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, pp. 1–6
(2018)
4. Ding, Y.: Pedestrian re-identification based on image enhancement and over-fitting solution
strategies. In: 5th International Conference on Systems and Informatics (ICSAI), Nanjing,
China, pp. 745–750. IEEE (2018)
5. Nambiar, A., Bernardino, A.A.: Context-aware method for view-point invariant long-term
re-identification. In: Cláudio, A., et al. (eds.) Computer Vision, Imaging and Computer
Graphics – Theory and Applications, VISIGRAPP 2017. Communications in Computer and
Information Science, vol. 983, pp. 329–351. Springer, Cham (2017)
6. Kamalakumari, J., Muthuraman, V.: Recognizing heterogeneous faces-a study. Int. J. Pure
Appl. Math. 118(8), 661–663 (2018)
7. Hassaballah, M., Aly, S.: Face recognition: challenges, achievements, and future directions.
IET Comput. Vis. 9(4), 614–626 (2015)
8. The American Society for Aesthetic Plastic Surgery. https://www.surgery.org/sites/default/
files/ASAPS-Stats2018_0.pdf. Accessed 01 June 2019
9. Kim, Y.A., Cho Chung, H.I.: Side effect experiences of South Korean women in their
twenties and thirties after facial plastic surgery. Int. J. Women’s Health 10, 309–316 (2018)
10. Singh, R., Vatsa, M., Noore, A.: Effect of plastic surgery on face recognition: a preliminary
study. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Workshops, pp. 72–77 (2009)
11. Mayes, A.E., Murray, P.G., Gunn, D.A., Tomlin, C.C., Catt, S.D., Wen, Y.B., Zhou, L.P.,
Wang, H.Q., Catt, M., Granger, S.P.: Environmental and lifestyle factors associated with
perceived facial age in Chinese women. PLoS ONE 5(12), 1–7 (2010). e15270
12. Yadav, D., Singh, R., Vatsa, M., Noore, A.: Recognizing age-separated face images: humans
and machines. PLoS ONE 9(12), 1–22 (2014). e112234
13. Wen, D., Fang, C., Ding, X., Zhang, T.: Development of recognition engine for baby faces.
In: 20th International Conference on Pattern Recognition, Istanbul, Turkey, pp. 3408–3411
(2010)
14. Siddiqui, S., Vatsa, M., Singh, R.: Face recognition for newborns, toddlers, and pre-school
children: a deep learning approach. In: 24th International Conference on Pattern Recognition
(ICPR), pp. 3156–3161. IEEE (2018)
15. Han, H., Otto, C., Liu, X., Jain, A.K.: Demographic estimation from face images: human vs.
machine performance. IEEE Trans. Pattern Anal. Mach. Intell. 37(6), 1148–1161 (2015)
16. Mandal, B.: Face recognition: perspectives from the real world. In: 14th International
Conference on Control, Automation, Robotics and Vision (ICARCV), pp. 1–5. IEEE (2016)
17. Rothoft, V., Si, J., Jiang, F., Shen, R.: Monitor pupils’ attention by image super-resolution
and anomaly detection. In: International Conference on Computer Systems, Electronics and
Control (ICCSEC), pp. 843–847. IEEE (2017)
18. Hendel, R.K., Starrfelt, R., Gerlach, C.: The good, the bad, and the average: characterizing
the relationship between face and object processing across the face recognition spectrum.
Neuropsychologia 124, 274–284 (2019)
19. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision
Conference (BMVC), pp. 41.1–41.12. BMVA Press (2015)
20. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition
and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, Boston,
pp. 815–823 (2015)
21. Zhuang, N., Zhang, Q., Pan, C., Ni, B., Xu, Y., Yang, X., Zhang, W.: Recognition oriented
facial image quality assessment via deep convolutional neural network. Neurocomputing
358, 109–118 (2019)
22. Heyman, J.: TracTrac: a fast multi-object tracking algorithm for motion estimation. Comput.
Geosci. 128, 11–18 (2019)
A Quantum Annealing-Based Approach
to Extreme Clustering

Tim Jaschek1,2, Marko Bucyk1, and Jaspreet S. Oberoi1,3

1 1QB Information Technologies (1QBit), Vancouver, BC, Canada
tim.jaschek@1qbit.com
2 Department of Mathematics, University of British Columbia, Vancouver, BC, Canada
3 School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada

Abstract. Clustering, or grouping, dataset elements based on similarity


can be used not only to classify a dataset into a few categories, but also
to approximate it by a relatively large number of representative elements.
In the latter scenario, referred to as extreme clustering, datasets are enor-
mous and the number of representative clusters is large. We have devised
a distributed method that can efficiently solve extreme clustering prob-
lems using quantum annealing. We prove that this method yields optimal
clustering assignments under a separability assumption, and show that
the generated clustering assignments are of comparable quality to those
of assignments generated by common clustering algorithms, yet can be
obtained a full order of magnitude faster.

Keywords: Extreme clustering · Distributed computing · Quantum computing · Maximum weighted independent set · Unsupervised learning

1 Introduction
Traditionally, clustering approaches have been developed and customized for
tasks where the resultant number of clusters k is not particularly high. In such
cases, algorithms such as k-means++ [1], BIRCH [2], DBSCAN [3], and spectral
clustering produce high-quality solutions in a reasonably short amount of time.
This is because these traditional algorithms scale well with respect to the dataset
cardinality n. However, in most cases, the computational complexity of these
algorithms, in terms of the number of clusters, is either exponential or higher-
order polynomial. Another common issue is that some of the algorithms require
vast amounts of memory.
The demand for clustering algorithms capable of solving problems with larger
values of k is continually increasing. Present-day examples involve deciphering
T. Jaschek and J. S. Oberoi have contributed equally to this research.

the content of billions of web pages by grouping them into millions of labelled
categories [4,5], identifying similarities among billions of images using nearest-
neighbour detection [6–8]. This domain of clustering, where n and k are both
substantially large, is referred to as extreme clustering [9]. Although there is great
value in perfecting this type of clustering, very little effort towards this end has
been made by the machine learning community. The proposed algorithm is, in
fact, such an effort. Its output is a clustering tree, which can be used to generate
multiple clustering assignments (or “levels”) with varying degrees of accuracy
(i.e., coarseness or fineness) of the approximation. Generating such a tree is
not uncommon for clustering algorithms. Consider, for example, hierarchical
clustering algorithms which generate binary clustering trees. Clustering trees
are useful tools for visualizing real-world data. Our algorithm, the Big Data
Visualization Tool, or BiDViT, provides this functionality.
BiDViT employs a novel approach to clustering problems, which is based on
the maximum weighted independent set (MWIS) problem in a graph induced
by the original dataset and a parameter we call the radius of interest, which
determines a relation of proximity. The use of such a parameter has been suc-
cessfully employed in density-based spatial clustering of applications with noise
(DBSCAN) [3]. The MWIS problem can be transformed into a quadratic uncon-
strained binary optimization (QUBO) problem, the formulation accepted by a
quantum annealer. An alternative way to address the underlying problem is
to use a heuristic algorithm to approximate solutions to the MWIS problem.
Quantum annealing and simulated annealing have been applied in centroid-based
clustering [10,11] and in density-based clustering [12]. However, the approaches
studied are not capable of addressing problems in the extreme clustering domain.
The document is structured as follows. Sect. 2 introduces a novel approach
to clustering problems and proves that, under a separability assumption, the
method identifies the ground truth labels when parameters are selected that are
within the bounds determined by that assumption. Sect. 3 outlines the way in
which the underlying optimization problem can be solved efficiently, and shows
how it can be algorithmically employed in extreme clustering. Runtime and solu-
tion quality values are provided for BiDViT with respect to internal evaluation
schemes such as the Calinski–Harabasz and the Davies–Bouldin scores in Sect. 4.
The results suggest that BiDViT yields clustering assignments of a quality com-
parable to that of assignments generated by common clustering algorithms, yet
does so a full order of magnitude faster.

2 The Coarsening Method

Our algorithm is based on a combinatorial clustering method we call coarsening.


The key idea behind coarsening is to approximate a set X ⊂ Rd by a subset
S ⊆ X such that, for any point x ∈ X, there exists a y ∈ S such that
‖x − y‖₂ < ε, for some parameter ε > 0. In this case, we say that S is ε-dense in
X and call ε the radius of interest. This concept is not restricted to subsets of
Euclidean spaces and can be generalized to an arbitrary metric space (M, d).
Fig. 1. Visualization of chunk collapsing (left) and data partitioning (right). Left) A
maximal ε-separated subset (red dots) of a dataset (red dots and blue dots). The circles
have a radius equal to the radius of interest ε. The weights of the red points are updated
according to the number of blue points within a distance of ε. The yellow borders are
a Voronoi partition of the dataset indicating the clustering assignment. Right) Data
partitioning of a dataset along the axes of maximum variance. In this example, there
are s = 5 partitioning steps, resulting in 25 = 32 chunks.

For example, the coarsening method can be used for clustering assignments on
finite subsets of Riemannian manifolds with respect to their geodesic distance,
for instance, in clustering GPS data on the surface of the Earth when analyzing
population density. In what follows, we assume that X = {x(1) , . . . , x(n) } is a
dataset consisting of n d-dimensional data points, equipped with a metric d : X ×
X → [0, ∞). Finding an arbitrary ε-dense subset of X does not necessarily yield
a helpful approximation. For example, X itself is always ε-dense in X. However,
enforcing the additional constraint that any two points in the subset S must
be separated by a distance of at least ε yields more-interesting approximations,
often leading to a reduction in the number of data points (one of our primary
objectives). We call such a set ε-separated. Figure 1 shows a point cloud and
an ε-dense, ε-separated subset. The theorem that follows shows that a maximal
ε-separated set S of X is necessarily ε-dense in X. Let B(x, r) denote the open
metric ball with respect to d, with centre x and radius r.
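The two defining properties can be checked directly; the following is a small NumPy sketch, assuming the Euclidean metric.

```python
# Small NumPy sketch checking the two defining properties of a candidate subset S
# of a dataset X, assuming the Euclidean metric d = ||.||_2.
import numpy as np

def is_eps_dense(X, S, eps):
    """True if every point of X lies within distance eps of some point of S."""
    dists = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2)  # |X| x |S|
    return bool(np.all(dists.min(axis=1) < eps))

def is_eps_separated(S, eps):
    """True if any two distinct points of S are at least eps apart."""
    dists = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    return bool(np.all(dists >= eps))

# Example: S = X is always eps-dense but rarely eps-separated.
X = np.random.rand(200, 2)
print(is_eps_dense(X, X, 0.1), is_eps_separated(X, 0.1))
```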

Theorem 1. Let S be a maximal ε-separated subset of X in the sense of set inclusion. Then the following properties must be satisfied.

(i) It holds that X ⊆ ⋃_{x∈S} B(x, ε).
(ii) For every y ∈ S, it holds that X ⊈ ⋃_{x∈S\{y}} B(x, ε).
(iii) The sets B(x, ε/2) for x ∈ S are pairwise disjoint.
In particular, S is a minimal ε-dense subset of X.

Proof. Note that (i) is equivalent to S being ε-dense in X and that, in combina-
tion with (ii), is equivalent to S being a minimal with respect to this property.
To prove (i), let S be a maximal ε-separated subset of X and assume, in con-
tradiction, that S is not ε-dense in X. Then one could find x ∈ X such that
d(x, y) ≥ ε, for every y ∈ S. Hence, S ∪ {x} would be ε-separated, which is
in contradiction to the maximality of S. To prove (ii), we fix a point x ∈ S.
Since S is ε-separated, d(x, y) ≥ ε for any y ∈ S and, thus, S \ {x} is not ε-dense
in X. Property (iii) follows from the triangle inequality.
Note that a maximal ε-separated subset does not refer to an ε-separated
subset with fewer than or equally as many elements as all other ε-separated
subsets but, rather, to an ε-separated subset that is no longer ε-separated when
a single data point is added. Contrary to Theorem 1, a minimal ε-dense subset
does not need to be ε-separated. Consider the set X = {1, 2, 3, 4} ⊂ R, and
let d be the Euclidean distance on R. Then, S = {2, 3} is 3/2-dense in X but
not 3/2-separated. Also note that an ε-separated subset is not necessarily an ε-
coreset, which is a weighted subset whose weighted k-means cost approximates
the k-means cost of the original set with up to an accuracy of ε [13,14].
In the following, we assume that X is equipped with a weight function w :
X → R+ . We call wi = w(x(i) ) the weight of x(i) and gather all weights in a
weight vector w ∈ Rn+ . It will be clear from the context whether we refer to
a weight function or a weight vector. The weight of a set S ⊆ X is given by
ω(S) = ∑_{x∈S} w(x). We have already argued that maximal ε-separated subsets
yield reasonable approximations. However, such subsets are not unique. We are
thus interested in finding an optimal one, that is, one that captures most of the
weight of the original dataset. In other words, we are interested in solving the
optimization problem
$$\max_{S\subseteq X}\ \omega(S) \quad \text{subject to}\quad S \text{ is } \varepsilon\text{-separated}. \tag{P1}$$

When imposing unit weights, the solution set to (P1) will consist of the
maximal ε-separated subsets of X with a maximum number of elements among
all such subsets. The term “maximal” refers to set inclusion and the “maxi-
mum” refers to set cardinality. Since w(x) > 0 for all x ∈ X, a solution S ∗ to
(P1) will always be a maximal ε-separated subset and, therefore, by Theorem 1,
ε-dense. In Sect. 3.6, we show that this problem is equivalent to solving an MWIS
problem for a weighted graph Gε (X, E ε , w), depending solely on the dataset X,
the Euclidean metric d, and the radius of interest ε. Thus, the computational task
of finding a maximal ε-separated subset of maximum weight is NP-hard [15,16].
Every U ⊂ X gives rise to a clustering assignment C = {Cx }x∈U , where
Cx = {y ∈ X : d(x, y) ≤ d(x′, y) for all x′ ∈ U}. (1)
Data points that are equidistant to multiple representative points are assigned
to only one of them, uniformly at random. Typically, larger values of ε result
in smaller cardinalities of C. The following corollary summarizes properties of C
when U is ε-separated, and can be readily verified.
Corollary 1. Let C be the clustering assignment generated from a maximal ε-
separated set S ⊂ X. Then, the following properties are satisfied:
(i) The clusters in C are non-empty and pairwise disjoint.
(ii) The cluster diameter is uniformly bounded by 2ε, i.e., supx∈S diam(Cx ) ≤
2ε.
(iii) For all x ∈ S, it holds that maxy∈Cx d(x, y) < ε.
Notice that these properties are not satisfied by every clustering assignment, for
example, the ones generated by k-means clustering. They are desirable in specific
applications, such as image quantization, where a tight bound on the absolute
approximation error is desired. However, they are undesirable if the ground truth
clusters have diameters larger than 2ε. More details on the clustering assignment
are provided in Sect. 3.
One could argue that prior to identifying a maximum weighted independent
set and using it to generate a clustering assignment, a dataset should be nor-
malized. However, normalization is a transformation that would result in chunks
not being defined by metric balls, but rather by ellipsoids. In particular, such a
transformation would change the metric d. We assume that the metric d already
is the best indicator of proximity. In general, one can apply any homeomorphism
f to a dataset X, apply a clustering algorithm to the set f (X), and obtain a
clustering assignment by applying f −1 to the individual clusters.
A common assumption in the clustering literature is separability—not to be
mistaken with ε-separability—of the dataset with respect to a clustering C. The
dataset X is called separable with respect to a clustering C = {C1 , . . . Ck } if

$$\max_{\substack{x,y\in C_i \\ 1\le i\le k}} d(x,y) \;<\; \min_{\substack{x\in C_i,\ y\in C_j \\ 1\le i\ne j\le k}} d(x,y), \tag{2}$$

that is, if the maximum intra-cluster distances are strictly smaller than the
minimum inter-cluster distances. The following theorem shows that, if ε is chosen
correctly, the proposed coarsening method yields the clustering assignment C.

Theorem 2. Let X be separable with respect to a clustering C = {C1, . . . , Ck}. Then, for any
$$\varepsilon \in \left( \max_{\substack{x,y\in C_i \\ 1\le i\le k}} d(x,y),\; \min_{\substack{x\in C_i,\ y\in C_j \\ 1\le i\ne j\le k}} d(x,y) \right], \tag{3}$$

the proposed coarsening method yields the correct clustering assignment.

Proof. To simplify notation, the lower and upper bounds of the interval in (3)
are denoted by l and r, respectively. By the separability assumption, this interval
is non-empty. One can see that, for any admissible choice of ε, any two points
from different clusters are ε-separated. Indeed, for x ∈ C and y ∈ C′, it holds
that d(x, y) ≥ r ≥ ε. Furthermore, if a point x in a cluster C is selected, then no
other point y in the same cluster can be selected, as d(x, y) ≤ l < ε. Therefore,
every solution S ⊆ X to (P1) is a union of exactly one point from each cluster.
Using the separability of X with respect to C, one can see that the clustering
assignment induced by (1) is coincident with C.

In practice, the separability assumption is rarely satisfied, and it is challeng-


ing to select ε as above (as this assumes some knowledge about the clustering
assignment). However, Theorem 2 shows that the presented coarsening method
is of research value, and can yield optimal clustering assignments.
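Under the separability assumption, the admissible interval in (3) can be computed directly from a labelled dataset. The following is a short sketch assuming Euclidean distances and synthetic labels.

```python
# Sketch computing the interval (3) of admissible radii under the separability
# assumption, for a labelled dataset with Euclidean distances. Any eps in this
# interval makes the coarsening method recover the ground-truth clusters.
import numpy as np

def admissible_radius_interval(X, labels):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    same = labels[:, None] == labels[None, :]
    max_intra = D[same].max()          # includes the zero diagonal, which is harmless
    min_inter = D[~same].min()
    return max_intra, min_inter        # the interval (max_intra, min_inter]

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 20.0])
labels = np.array([0] * 50 + [1] * 50)
low, high = admissible_radius_interval(X, labels)
print(f"choose eps in ({low:.2f}, {high:.2f}]")
```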
We have developed two methods, which we refer to as the heuristic method


and the quantum method, to address the NP-hard task of solving (P1). The
heuristic method loosens the condition of having a maximum weight; it can be
seen as a greedy approach to (P1). In contrast, the quantum method explores
all different maximal ε-separated subsets simultaneously, yielding one that has
maximum weight. The quantum method is based on the formulation of a QUBO
problem, which can be solved efficiently using a quantum annealer like the D-
Wave 2000Q [17] or a digital annealer such as the one developed by Fujitsu [18].

3 The Algorithm
Let X = {x(1) , . . . , x(n) } ⊂ Rd denote a dataset of n d-dimensional data points.
Note that, mathematically speaking, a dataset is not a set but rather a multiset,
that is, repetitions are allowed. BiDViT consists of two parts: data partitioning
and data coarsening, the latter of which can be further subdivided into chunk
coarsening and chunk collapsing.

3.1 Data Partitioning


In general, the computational complexity of distance-based clustering methods
is proportional to the square of the dataset cardinality, as all pairwise distances
must be computed. This bottleneck can be overcome by dividing the dataset and
employing distributed approaches [13,14,19], yielding a result different from the
one we would obtain when applying clustering methods on the entire dataset.
However, its slight imprecision results in a significant computational speed-up.
A partition P of X is a collection of non-empty disjoint sets P1, . . . , Pk ⊂ X
such that X = ⋃_{P∈P} P. Elements of partitions are typically referred to as blocks,
parts, or cells; however, we refer to them as chunks. The partitioning is intended
to be homogeneous: every extracted chunk has an equal number of data points
(there might be minor differences when the cardinality of the chunk to be divided
is odd). The upper bound on the number of points desired in a chunk is referred
to as the maximum chunk cardinality κ. To determine κ, one should take into
account the number of available processors, their data handling capacity, or, in
the case of a quantum annealer, the number of fully connected qubits.
To break the data into chunks, we employ a modified version of the well-
known “median cut” algorithm, which is frequently used in colour quantization.
First, we select an axis of maximum variance, say ℓ. We then bisect the dataset along
the selected axis ℓ at the median m_ℓ of {x_ℓ^(1), . . . , x_ℓ^(n)} in such a way as
to obtain two data chunks P1 and P2 whose cardinalities differ by at most one
(in the case where n is odd) and which satisfy P1 ⊆ {x ∈ X : x_ℓ ≤ m_ℓ} and
P2 ⊆ {x ∈ X : x_ℓ ≥ m_ℓ}. We cannot simply assign P1 = {x ∈ X : x_ℓ ≤ m_ℓ} and
P2 = X \ P1, as these sets might differ drastically in cardinality. For example,
when x_ℓ^(1) = . . . = x_ℓ^(n), this assignment would imply that P1 = X and P2 = ∅.
By using P1 and P2 in the role of X, this process can be repeated iteratively,
until the number of data points in the chunk to be divided is less than or equal to
the maximum chunk cardinality κ, yielding a binary tree of data chunks. After
s iterations, this leaves us with 2^s chunks P_k^(s) such that X = ⋃_{1≤k≤2^s} P_k^(s),
where the union is disjoint. Figure 1 provides a visualization.
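The following is a simplified NumPy sketch of this partitioning step: it splits along the axis of maximum variance at the median, but uses sorting instead of the linear-time median-of-medians routine, so it is an illustration rather than the implementation.

```python
# Sketch of the data-partitioning step: recursively split along the axis of
# maximum variance at the median until every chunk has at most kappa points.
# Using argsort for a balanced split is a simplification of the "median of
# medians" approach described above.
import numpy as np

def partition(X, kappa):
    """Return a list of index arrays, one per chunk, each of size <= kappa."""
    def split(idx):
        if len(idx) <= kappa:
            return [idx]
        axis = int(np.argmax(X[idx].var(axis=0)))     # axis of maximum variance
        order = idx[np.argsort(X[idx, axis], kind="stable")]
        half = len(order) // 2
        return split(order[:half]) + split(order[half:])
    return split(np.arange(len(X)))

X = np.random.rand(10_000, 3)
chunks = partition(X, kappa=500)
print(len(chunks), max(len(c) for c in chunks))
```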

3.2 Chunk Coarsening


The goal of a data coarsening step is, for each chunk, to find representative
data points such that their union can replace the original point cloud, while
maintaining the original data distribution as accurately as possible.
Let P = {x(1) , . . . , x(n) } be a chunk and ε > 0 be the radius of interest. In
what follows, we assume that all the data points are pairwise different. Prac-
tically, this can be achieved by removing duplicates and cumulatively incre-
menting the weight of the representative point we wish to keep by the weight
of the discarded duplicates. The radius of interest ε induces a weighted graph
Gε = (P, E ε , wP ), where P is the vertex set, the edge set E ε is given by the
relation ∼ε defined by x ∼ε y if and only if d(x, y) < ε for all x, y ∈ P , and the
weight function wP : P → R+ is the restriction of w to P . For each data point
x(i) , we denote its weight wP (x(i) ) by wi .
For each data point x(i) , we introduce a binary decision variable si that
encodes whether x(i) is used in a possible set S ∗ . Furthermore, we define the
neighbourhood matrix N^(ε) (or similarity matrix) of the graph Gε = (P, E ε, wP)
by N^(ε)_{ij} = 1 if x^(i) ∼ε x^(j), and N^(ε)_{ij} = 0 otherwise. Problem (P1) can then be
posed as a quadratically constrained quadratic program (QCQP) given by

$$\max_{s\in\{0,1\}^n}\ \sum_{i=1}^{n} s_i w_i \quad \text{subject to} \quad \sum_{i=1}^{n}\sum_{j>i} s_i N^{(\varepsilon)}_{ij} s_j = 0. \tag{P2}$$

Here, the inner summation of the constraint does not need to run over all indices,
due to the symmetry of N^(ε). The matrix form of (P2) is given by maximizing
$s^T w$ subject to the constraint $s^T \overline{N}^{(\varepsilon)} s = 0$, where $\overline{N}^{(\varepsilon)}$ is the upper triangular
matrix of N^(ε) having all zeroes along the diagonal. As explained in Sect. 3.6,
(P2) is equivalent to the NP-hard MWIS problem for Gε = (P, E ε , wP ), and
thus is computationally intractable for large problem sizes. Note that (P2) can
be written as the 0–1 integer linear program (ILP)

$$\max_{s\in\{0,1\}^n}\ \sum_{i=1}^{n} s_i w_i \quad \text{subject to} \quad s_i + s_j \le 1 \ \text{ for } i, j \text{ such that } \overline{N}^{(\varepsilon)}_{ij} = 1. \tag{P3}$$

We present two methods we have devised to address (P2).

The Heuristic Method. We wish to emphasize that the heuristic method


does not provide us with a solution to (P2). Rather, the aim of this method is to
obtain an ε-separated subset S with a high—but not necessarily the maximum—
weight ω(S). The seeking of approximate solutions to the MWIS problem is a
Algorithm 1. Greedy(P, w, N (ε) )


input : data chunk P ; weight function w; neighbourhood matrix N (ε)
output: ε-separated subset with high (but not necessarily maximum) weight S ∗
1 S∗ ← ∅
2 while P ≠ ∅ do
3 select x ∈ arg minv∈P degw (v) uniformly at random
4 use N (ε) to determine Nx
5 remove x and its neighbours Nx from P
6 S ∗ ← S ∗ ∪ {x}
7 end
8 return S ∗

well-studied subject [20–22]. Typically, researchers employ greedy algorithms,


LP-based algorithms (using the relaxation of (P3)), or semi-definite program-
ming (SDP) algorithms; see [22] for an analysis.
We employ a classic greedy algorithm due to its simplicity and low compu-
tational complexity. In each step, we add the data point that locally is the best
choice in the sense that the ratio of the weight of its neighbourhood to its own
weight is as small as possible. Prior to the execution of the step, we remove the
point and its neighbours from the set of candidates. Pseudocode of the greedy
algorithm is provided in Algorithm 1. Before we state a theoretical result on the
approximation ratio of this algorithm, we define the weighted degree degw (v) of
a vertex v in a weighted graph G = (V, E, w) and the weighted average degree of
G as deg_w(v) = ω(N_v)/w(v) and deg_w(G) = ∑_{v∈V} w(v) deg_w(v)/ω(V), respectively,
where N_v = {u ∈ V : u ∼ v} is the neighbourhood of vertex v [22].

Theorem 3. Algorithm 1 has an approximation ratio of deg_w(G) + 1, i.e.,
$$\omega(S) \ \ge\ \big(\deg_w(G) + 1\big)^{-1}\,\omega(S^*), \tag{4}$$

for any solution S ∗ to (P1) and any output S of the algorithm. Moreover, the
bound in (4) is tight.

Proof. A proof is given in [22, Thm. 6].
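For reference, the following is a direct Python transcription of Algorithm 1, assuming the neighbourhood matrix is given as a boolean NumPy array and the weights as a NumPy vector; ties in the weighted degree are broken uniformly at random, as in line 3 of the pseudocode.

```python
# Python transcription of Algorithm 1 (greedy MWIS heuristic). N is the boolean
# neighbourhood matrix N^(eps) of a chunk and w the weight vector.
import numpy as np

def greedy_mwis(N, w, rng=None):
    """Return indices of an eps-separated subset with high (not necessarily maximum) weight."""
    rng = rng or np.random.default_rng()
    N = np.asarray(N, dtype=bool).copy()
    np.fill_diagonal(N, False)                      # a point is not its own neighbour
    w = np.asarray(w, dtype=float)
    remaining = np.ones(len(w), dtype=bool)
    selected = []
    while remaining.any():
        idx = np.flatnonzero(remaining)
        deg = np.array([w[N[i] & remaining].sum() / w[i] for i in idx])  # weighted degrees
        best = idx[deg == deg.min()]
        x = int(rng.choice(best))                   # uniform tie-breaking, as in line 3
        remaining[x] = False
        remaining[N[x]] = False                     # drop x and its neighbours
        selected.append(x)
    return np.array(selected)
```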

The Quantum Method. In contrast to the heuristic method, the QUBO app-
roach provides an actual (i.e., non-approximate) solution to (P2). We reformulate
the problem by transforming the QCQP into a QUBO problem.
Using the Lagrangian penalty method, we incorporate the constraint into
the objective function by adding a penalty term. For a sufficiently large penalty
multiplier λ > 0, the solution set of (P2) is equivalent to that of

$$\max_{s\in\{0,1\}^n}\ \sum_{i=1}^{n} s_i w_i \ -\ \lambda \sum_{i=1}^{n}\sum_{j>i} s_i N^{(\varepsilon)}_{ij} s_j. \tag{P4}$$

One can show that, for λ > max_{i=1,...,n} w_i, every solution to (P4) satisfies
the separation constraint [23, Thm. 1]. Instead, we use individual penalty terms
λij , as this may lead to a QUBO problem with much smaller coefficients, which
results in improved performance when solving the problem using a quantum
annealer. Expressing (P4) as a minimization, instead of a maximization, problem
and using matrix notation yields the problem

$$\min_{s\in\{0,1\}^n}\ s^T Q s, \tag{P5}$$
where $Q_{ij} = -w_i$ if $i = j$, $Q_{ij} = \lambda_{ij}$ if $N^{(\varepsilon)}_{ij} = 1$ and $i < j$, and $Q_{ij} = 0$
otherwise. Solutions to (P5) can be approximated using heuristics such as simulated
annealing [24], path relinking [25], tabu search [25], and parallel tempering [26].
Before solving (P5), it is advisable to reduce its size and difficulty by making use
of logical implications among the coefficients [27]. This involves fixing every vari-
able that corresponds to a node that has no neighbours to one, as it necessarily
is included in an ε-dense subset.
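The following sketch assembles the QUBO matrix Q of (P5) with individual penalties λ_ij = max{w_i, w_j} + δ, as licensed by Theorem 4, and solves a tiny instance by exhaustive search. In practice the matrix would be passed to a quantum or digital annealer; the brute-force solver is only for checking small examples.

```python
# Sketch constructing the QUBO matrix Q of (P5) from the weights w and the
# neighbourhood matrix N^(eps). The brute-force solver is exponential and meant
# only for very small instances.
import itertools
import numpy as np

def build_qubo(w, N, delta=1e-3):
    n = len(w)
    Q = np.diag(-np.asarray(w, dtype=float))
    for i in range(n):
        for j in range(i + 1, n):
            if N[i, j]:
                Q[i, j] = max(w[i], w[j]) + delta   # lambda_ij > max(w_i, w_j)
    return Q

def brute_force(Q):
    n = len(Q)
    best_s, best_val = None, np.inf
    for bits in itertools.product((0, 1), repeat=n):
        s = np.array(bits)
        val = s @ Q @ s
        if val < best_val:
            best_s, best_val = s, val
    return best_s, best_val
```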
The following theorem shows that (P2) is equivalent to (P5) for a suitable
choice of λij , for 1 ≤ i < j ≤ n.

Theorem 4. Let λij > max{wi , wj } for all 1 ≤ i < j ≤ n. Then, for any
solution s ∈ {0, 1}n to (P5), the corresponding set S ⊆ X is ε-separated. In
particular, the solution sets of (P2) and (P5) coincide.

Proof. We generalize the proof of [23, Thm. 1] and show that every solution $s$
to (P5) satisfies the separation constraint $\sum_{i=1}^{n}\sum_{j>i} s_i N^{(\varepsilon)}_{ij} s_j = 0$. Assuming,
in contradiction, that the opposite were to be the case, we could find a solution
$s$ and indices $k$ and $\ell$ such that $1 \le k < \ell \le n$ and $s_k = s_\ell = N^{(\varepsilon)}_{k\ell} = 1$. Let $e_k$
denote the k-th standard unit vector, and let v = s − ek . Then,

$$v^T Q v = s^T Q s - \sum_{j>k} s_j Q_{kj} - \sum_{i<k} s_i Q_{ik} - Q_{kk} \tag{5}$$
$$= s^T Q s - \sum_{i \ne k} s_i \lambda_{\sigma(i,k)} N^{(\varepsilon)}_{ik} + w_k, \tag{6}$$

where $\sigma : \mathbb{N}^2 \to \mathbb{N}^2$, defined by $\sigma(i, k) = (\min(i, k), \max(i, k))$, orders the index
accordingly. This technicality is necessary, as we defined $\lambda_{ij}$ only for $1 \le i < j \le n$.
As $N^{(\varepsilon)}_{k\ell} = s_\ell = 1$, we have $\sum_{i\ne k} s_i \lambda_{\sigma(i,k)} N^{(\varepsilon)}_{ik} \ge \lambda_{\sigma(\ell,k)}$, and thus
$$v^T Q v \le s^T Q s - \lambda_{k\ell} + w_k. \tag{7}$$
Therefore, as $\lambda_{k\ell} > \max\{w_k, w_\ell\} \ge w_k$, it holds that $v^T Q v < s^T Q s$, which is
absurd, as, by assumption, $s$ is a solution to (P5).
We now show that the solution sets of (P2) and (P5) coincide. Note that
(P2) is equivalent to the optimization problem
$$\min_{s\in\{0,1\}^n}\ -s^T w \quad \text{subject to} \quad s^T \overline{N}^{(\varepsilon)} s = 0. \tag{P6}$$
Let $s_1$ and $s_2$ be solutions to (P6) and (P5), respectively. We denote the objective
functions by $p_1(s) = -s^T w$ and $p_2(s) = -s^T w + s^T (\Lambda \circ \overline{N}^{(\varepsilon)}) s$, where $\Lambda$ is the
matrix defined by $\Lambda_{ij} = \lambda_{ij}$ for $1 \le i < j \le n$, and zero otherwise, and the term
$\Lambda \circ \overline{N}^{(\varepsilon)} \in \mathbb{R}^{n\times n}$ denotes the Hadamard product of the matrices $\Lambda$ and $\overline{N}^{(\varepsilon)}$, given
by element-wise multiplication. Then, as $\lambda_{ij} > \max\{w_i, w_j\}$ for $1 \le i < j \le n$,
by the observation above, both $s_1$ and $s_2$ satisfy the separation constraint. Since
$s$ and $\overline{N}^{(\varepsilon)}$ are coordinate-wise non-negative and $\lambda_{ij} > \min_{k=1,\ldots,n} w_k > 0$ for
$1 \le i < j \le n$, it holds that
$$s^T \overline{N}^{(\varepsilon)} s = 0 \;\Longleftrightarrow\; s^T (\Lambda \circ \overline{N}^{(\varepsilon)}) s = 0, \tag{8}$$

thus, if s satisfies the separation constraint, then p2 (s) = p1 (s). Using this obser-
vation, and that s1 and s2 minimize p1 and p2 , respectively, we have

p1 (s1 ) ≤ p1 (s2 ) = p2 (s2 ) ≤ p2 (s1 ) = p1 (s1 ). (9)

Hence, the inequalities in (9) must actually be equalities; thus, the solution sets
of the optimization problems coincide.

Problem (P5) can be transformed to an Ising spin model by mapping s to


2s − 1. This form is desirable because the ground state of the Hamiltonian of an
Ising spin model can be determined efficiently with a quantum annealer.

3.3 Chunk Collapsing


Having identified a maximal ε-separated subset S ⊆ P , we collapse the vertices
P \ S into S, meaning we update the weight of each x ∈ S according to the
weights of all y ∈ P \ S that satisfy x ∼ε y. We aim to assign each y ∈ P \ S
to a unique x ∈ S by generating a so-called Voronoi decomposition (depicted in
Fig. 1) of each chunk P , which is a partition, where each point x ∈ P is assigned
to the closest point within a subset S. More precisely, we define the sets Cx as
in (1), for each x ∈ S. By construction, Cx contains all vertices that will be
collapsed into x, in particular, x itself. We then assign the coarsened chunk S a
new weight function wS defined by

wS (x) = ω(Cx ) = w(y). (10)
y∈Cx

In practice, to prevent having very large values for the individual weights, one
might wish to add a linear or logarithmic scaling to this weight assignment. In
our experiments, we did not add such a scaling.
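The collapsing step can be sketched in a few lines of NumPy, assuming Euclidean distances; the function name is illustrative.

```python
# Sketch of chunk collapsing: every point of a chunk P is assigned to its nearest
# representative in S (a Voronoi-style assignment), and the representatives'
# weights are updated according to Eq. (10). Euclidean distance is assumed.
import numpy as np

def collapse_chunk(P, w, rep_idx):
    """P: (m, d) chunk, w: (m,) weights, rep_idx: indices of the representatives.

    Returns (labels, new_weights), where labels[i] is the representative each
    point collapses into and new_weights are the accumulated weights w_S.
    """
    S = P[rep_idx]
    dists = np.linalg.norm(P[:, None, :] - S[None, :, :], axis=2)   # m x |S|
    labels = dists.argmin(axis=1)                                   # nearest representative
    new_weights = np.zeros(len(rep_idx))
    np.add.at(new_weights, labels, w)                               # w_S(x) = sum of collapsed weights
    return labels, new_weights
```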

3.4 Iterative Implementation of BiDViT


BiDViT repeats the procedure of data partitioning, chunk coarsening, and chunk
collapsing with an increasing radius of interest, until the entire dataset collapses
Algorithm 2. BiDViT
input : data set X; initial radius ε0 ; maximum chunk cardinality κ; radius increase rate α
output: tree structure that encodes the hierarchical clustering T
1 T ← create_node_list(X)
2 ε ← ε0
3 while length(T ) > 1 do
4 P ← partition(T , κ)
5 T ←∅
6 for P ∈ P do
7 compute neighbourhood matrix N (ε) for P
8 identify representative data points by solving MWIS for P, N (ε) , and w
9 compute Voronoi partition of P with respect to representative points
10 compute centroids of the cells of the Voronoi partition
11 for x ∈ P do
12 ind ← closest_centroid(x, centroids)
13 centroids[ind].weight += x.weight
14 centroids[ind].parents.append(x)
15 end
16 T .append(centroids)
17 end
18 ε ← αε
19 end
20 return T

to a single data point. We call these iterations BiDViT levels. The increase of ε
between BiDViT levels is realized by multiplying ε by a constant factor, denoted
by α and specified by the user. In our implementation we have introduced a
node class that has three attributes: coordinates, weight, and parents. We
initialize BiDViT by creating a node_list containing the nodes corresponding
to the weighted dataset (if no weights are provided then the weights are assumed
to be the multiplicity of the data points). After each iteration, we remove the
nodes that collapsed into representative nodes from the node_list and keep only
the remaining representative nodes. However, we append the removed nodes to
the parents of the representative node. The final node_list is a data tree, that
is, it consists of only one node, and we can move upwards in the hierarchy by
accessing its parents (and their parents and so on); see Fig. 2. Two leaves of
the data tree share a label with respect to a specific BiDViT level, say m, if
they have collapsed into centroids which, possibly after multiple iterations, have
collapsed into the same centroid at the m-th level of the tree. For the sake of
reproducibility, we provide pseudocode (see Algorithm 2).
It is worth noting that, at each level, instead of proceeding with the identified
representative data points, one can use the cluster centroids, allowing more-
accurate data coarsening and label assignment.
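The node bookkeeping described above can be captured by a small class with the three attributes named in the text; the following sketch is illustrative and omits the coarsening logic itself.

```python
# Sketch of the node bookkeeping used to build the clustering tree: each node
# stores coordinates, a weight, and the list of parent nodes that collapsed
# into it at the previous BiDViT level.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    coordinates: tuple
    weight: float = 1.0
    parents: List["Node"] = field(default_factory=list)

def collapse(children, centroid_coords):
    """Merge a group of nodes into a single representative node for the next level."""
    return Node(coordinates=centroid_coords,
                weight=sum(c.weight for c in children),
                parents=list(children))

leaves = [Node((0.0, 0.0)), Node((0.1, 0.2)), Node((5.0, 5.1))]
level1 = [collapse(leaves[:2], (0.05, 0.1)), collapse(leaves[2:], (5.0, 5.1))]
root = collapse(level1, (2.5, 2.6))
print(root.weight, len(root.parents))
```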

3.5 Complexity Analysis

Our analysis shows that every iteration of the heuristic version of BiDViT has
a computational complexity of O(dn log(n/κ) + dnκ). Note that κ ≪ n.
The order of complexity of the partitioning procedure is O(dn log(n/κ)). To
see this, note that there are at most log2 (n/κ) partitioning stages and in the
s-th stage we split 2^{s−1} chunks Pi, where i = 1, . . . , 2^{s−1}. Let ni denote the
number of data points in chunk Pi . Finding the dimension of maximum variance
has a complexity of O(dni ) and determining the median of this dimension can
be achieved in O(ni ) via the “median of medians” algorithm. Having computed
the median, one can construct two chunks of equal size in linear time. Since
∑_{1≤i≤2^{s−1}} ni = n, a partitioning step is O(dn log(n/κ)). Any division of a chunk
is independent of the other chunks at a given stage; thus, this procedure can
benefit from distributed computing.

[Figure 2 (dendrogram): leaves 1–9 collapse into centroids c1,1–c1,4 at BiDViT level 1, then into c2,1–c2,3 at BiDViT level 2, and finally into c3,1.]

Fig. 2. Dendrogram representing the output tree of BiDViT and the encoding of the
clustering assignment. The original dataset is represented by the leaves, which collapse
into a single centroid after three BiDViT iterations. The first iteration (“BiDViT level
1”) results in four centroids, each corresponding to a cluster consisting of the nodes
that collapsed into it. At the next iteration, the algorithm merges the clusters of the
centroids. For example, c1,3 and c1,4 are merged into c2,3 at the next level.

The order of complexity for the collapsing process is O(dnκ), as computing


the neighbourhood matrix of a chunk is O(dκ2 ) and the heuristic selection pro-
cedure is O(κ2 ). The number of chunks is bounded from above by n/κ. This
yields a complexity of O((n/κ)(dκ2 + κ2 )) = O(dnκ). As data coarsening in each
chunk is independent, with n/κ parallel processors available the complexity
reduces to O(dκ2 ).

3.6 Relation to the MWIS Problem

The process of identifying a maximal ε-separated set of maximum weight


is equivalent to solving the MWIS problem for the weighted graph Gε =
(P, E ε , wP ). Let G = (V, E, w) be a weighted graph. A set of vertices S ⊆ V
is called independent in G if no two of its vertices are adjacent or, equivalently,
if S is a clique in the complement graph. This corresponds to the separation
constraint mentioned earlier, where two vertices are adjacent whenever they are
less than a distance of ε apart. The MWIS problem can be expressed as

$$\max_{S\subseteq V}\ \omega(S) \quad \text{subject to}\quad S \text{ is independent}, \tag{P7}$$

and is NP-complete for a general weighted graph [16], yet, for specific graphs,
there exist polynomial-time algorithms [28,29]. Note that the QUBO formulation
of the MWIS problem in [23,30] is related to the one in (P5).
If all weights are positive, a maximum weighted independent set is necessarily
a maximal independent set. A maximal independent set is a dominating set, that
is, a subset S of V such that every v ∈ V \ S is adjacent to some w ∈ S. This
corresponds to our observation that every maximal ε-separated subset is ε-dense.

4 Results
The datasets used to demonstrate the efficiency and robustness of the proposed
approach are the MNIST dataset of handwritten digits [31], a two-dimensional
version of MNIST obtained by using t-SNE [32], two synthetic grid datasets, and
a dataset called Covertype containing data on forests in Colorado [33]. The syn-
thetic grid datasets are the unions of 100 samples (in the 2D case) and 1000 sam-
ples (in the 3D case) drawn from N (μij , σ 2 ) with means μij = (10i + 5, 10j + 5)
and a variance of σ2 = 4 for 0 ≤ i, j ≤ 9 in the 2D case and the natural extension
in the 3D case. Dataset statistics are provided in Table 1. In the following sec-
tions, our technical experiments are explained in detail; a practical application
of BiDViT for image quantization is illustrated in Fig. 7. All experiments were
performed using a 2.5 GHz Intel Core i7 processor and 16 GB of RAM.
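For reference, the 2D synthetic grid dataset described above can be reproduced with a few lines of NumPy; the random seed is arbitrary.

```python
# Sketch reproducing the 2D synthetic grid dataset described above: 100 samples
# from each of the 10 x 10 Gaussians with means mu_ij = (10i + 5, 10j + 5) and
# variance sigma^2 = 4, for 0 <= i, j <= 9.
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                                   # sigma^2 = 4
grid_2d = np.vstack([
    rng.normal(loc=(10 * i + 5, 10 * j + 5), scale=sigma, size=(100, 2))
    for i in range(10) for j in range(10)
])
print(grid_2d.shape)                          # (10000, 2)
```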

4.1 Low-Range Clustering Domain


Although BiDViT has been specifically designed for extreme clustering, it yields
accurate assignments for low values of k. Figure 3 shows the clustering assign-
ment of BiDViT on the 2D grid dataset and on MNIST. The results are obtained
by manually selecting a BiDViT level. In the grid dataset, every cluster is iden-
tified correctly. In the MNIST dataset, all clusters are recognized, except one.
However, as BiDViT is based on metric balls, and some datasets might not con-
form to such categorization, there are datasets for which it cannot accurately
assign clusters. This is true for most clustering algorithms, as they are able to
recognize only specific shapes.

4.2 Extreme Clustering Capability


To evaluate the performance of BiDViT on high-dimensional datasets in the
extreme clustering range, we used the Calinski–Harabasz score [34] and the
Davies–Bouldin score [35]. These clustering metrics are internal evaluation
schemes, that is, their values depend solely on the clustered data, not requiring

the ground truth label assignment for the dataset. Such schemes must be viewed
as heuristic methods: their optimal values do not guarantee optimal clusters
but provide a reasonable measure of clustering quality. Detailed analyses have
been conducted on the advantages and shortcomings of internal clustering mea-
sures [36,37]. In the extreme clustering scenario, where the objective is to obtain
an accurate approximation of the entire dataset instead of categorizing its ele-
ments, no true labels are given and thus external evaluation schemes (ones based
on the distance to a ground truth clustering assignment) do not qualify as success
measures.
Let C1 , . . . , Cnc denote a total of nc detected clusters within a dataset X
with n data points. The Calinski–Harabasz score SCH of a clustering is defined
as a weighted ratio of the inter-cluster squared deviations to the sum of the
intra-cluster squared deviations. More precisely, SCH is given by
$$S_{\mathrm{CH}}(C_1, \ldots, C_{n_c}) = \frac{n-1}{n_c - 1} \cdot \frac{\sum_{k=1}^{n_c} |C_k| \, \lVert c_k - c \rVert_2^2}{\sum_{k=1}^{n_c} \sum_{x \in C_k} \lVert x - c_k \rVert_2^2}, \qquad (11)$$

where ck , for k = 1, . . . , nc are the cluster centroids, and c is their mean. High
values of SCH are indicative of a high clustering quality. The Davies–Bouldin
score SDB is the average maximum value of the ratios of the pairwise sums of
the intra-cluster deviation to the inter-cluster deviation. The score is defined as


Fig. 3. Performance of BiDViT in the non-extreme clustering domain. The left-hand-


side figures show the original datasets (blue) with cluster centroids (orange) determined
by BiDViT. On the right, colours correspond to assigned labels. The figures can be
reproduced by using the parameters κ = 10³, α = 1.3, and ε0 = 2.0, and using BiDViT
level 18 for the MNIST dataset (bottom), and κ = 10³, α = 1.3, and ε0 = 1.0, and
BiDViT level 10 for the synthetic grid (top).

$$S_{\mathrm{DB}}(C_1, \ldots, C_{n_c}) = \frac{1}{n_c} \sum_{k=1}^{n_c} \max_{j \neq k} \frac{S_k + S_j}{\lVert c_k - c_j \rVert_2}, \qquad (12)$$

where $S_i = \frac{1}{|C_i|} \sum_{x \in C_i} \lVert x - c_i \rVert$. Low values of SDB indicate accurate clustering.
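Both scores have off-the-shelf implementations; the short sketch below (a usage example on synthetic data, not the authors' evaluation code) computes them with scikit-learn for a k-means labelling.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
# Synthetic 2D data: three Gaussian blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))  # higher is better
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))     # lower is better
```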
Figure 4 shows SCH and SDB of clustering assignments obtained with BiD-
ViT and Mini Batch k-means clustering [38] for different values of k on the
Covertype dataset. Due to their high computational complexity with respect to
k, many common clustering algorithms could not be applied. Remarkably, SCH
values are quite similar, indicating that the cluster assignments generated by
BiDViT are of comparable quality even though the runtime of BiDViT is signif-
icantly shorter. For SDB , BiDViT outperforms the others for lower values of k,
and is comparable for large values. One explanation for the slightly weaker per-
formance of BiDViT with respect to SCH is that BiDViT aims to minimize the


Fig. 4. SCH (left) and SDB (right) of clustering assignments on the Covertype dataset
generated by the heuristic BiDViT algorithm (κ = 10³, α = 1.5, and ε0 = 10²) and
Mini Batch k-means (batch size = 50, max_iter = 10³, tol = 10⁻³, and n_init = 1).
Whereas a higher value of SCH indicates better clustering, the opposite holds for SDB .


Fig. 5. Time to solution (left) and SDB (right) of common clustering algorithms and
BiDViT (κ = 10³, α = 1.3, and ε0 = 16.0) on a subset of the Covertype dataset for
different numbers of clusters. For k-means++ and Mini Batch k-means clustering, we
modified the number of initializations, and for Birch clustering, it was the branching
factor. These parameters resulted in a speed-up with a minimum loss of quality; their
values are indicated in the legend.

non-squared distances, whereas SCH rewards clustering methods that minimize


squared distances. Similarly, this explains BiDViT’s advantage for SDB .

4.3 Runtime Comparison


In our experiments, we observed that, with respect to the total runtime, even
the heuristic version of BiDViT restricted to a single core outperforms common
clustering methods in the extreme clustering domain. Figure 5 shows the runtime
required by different clustering algorithms for the Covertype dataset. For the
implementation of methods other than BiDViT, we used the publicly available
sklearn.cluster module for Python. To generate the plots, we ran the
entire BiDViT procedure, then applied classical algorithms for the same values
of k. The results suggest that, in the extreme clustering domain, the runtime of
BiDViT is an order of magnitude faster than that of the agglomerative methods
against which it was compared, and multiple orders of magnitude faster than
that of k-means and Mini Batch k-means clustering. The dataset cardinality
was restricted to 20,000 points to obtain results for other methods, whereas
BiDViT is capable of handling the entire dataset comprising 581,000 points.
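For reference, the classical baselines in this comparison can be timed with a few lines of scikit-learn. The sketch below is our own (the subset size and the value of k are assumptions chosen to mirror the setting of Fig. 5), not the benchmark script used for the reported results.

```python
import time
import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.cluster import MiniBatchKMeans, Birch

X = fetch_covtype().data[:20000]          # downloads Covertype on first use
k = 5000                                  # an extreme-clustering value of k

for name, model in [
    ("MiniBatchKMeans", MiniBatchKMeans(n_clusters=k, batch_size=50, n_init=1)),
    ("Birch", Birch(n_clusters=k, branching_factor=100)),
]:
    start = time.perf_counter()
    model.fit(X)
    print(f"{name}: {time.perf_counter() - start:.1f} s")
```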
We then compared the runtime of BiDViT to PERCH (“Purity Enhancing
Rotations for Cluster Hierarchies”), a hierarchical algorithm for extreme clus-
tering [9], to our knowledge the only other algorithm designed to solve extreme
clustering problems. Table 1 shows that BiDViT performs an order of magnitude
faster than PERCH. However, they solve somewhat different problems: whereas
BiDViT aims to gradually coarsen a dataset by finding ε-separated, ε-dense
subsets, PERCH maximizes the dendrogram purity, a measure of the cluster-
ing tree’s consistency [9]. The clustering tree generated by PERCH is binary
and thus enormous, allowing for much finer incremental distinctions between
clustering assignments. In contrast, the tree generated by BiDViT is more com-
pact, as multiple data points can collapse into the same representative point.
When comparing dendrogram purities, we expect PERCH to outperform BiD-
ViT; when comparing Davies–Bouldin scores at a given level, we expect the
opposite. We did not test these hypotheses, as dendrogram purity is an external
evaluation scheme, that is, it requires a clustering assignment to use for compar-
ison, which is not available in unsupervised machine learning. Both algorithms
were restricted to using a single core.

4.4 Results for the Quantum Version of BiDViT


We tested a prototype of BiDViT on a D-Wave 2000Q quantum annealer, a
machine that has 2048 qubits and 5600 couplers. According to D-Wave Sys-
tems, the computer uses 128,000 Josephson junctions and was the most complex
superconducting integrated circuit built to date when introduced in January of
2017 [17].
Before solving the QUBO problems, we applied preprocessing techniques,
reducing their size and difficulty [27]. This proved effective and eliminated a
great many variables. In most cases, size reductions of over 60% can be observed.
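The exact tool chain for submitting the reduced QUBO problems is not detailed here. As one possible route, the D-Wave Ocean SDK accepts a QUBO in dictionary form; the following is a minimal sketch under the assumption that Ocean is installed and solver access is configured, with a tiny placeholder QUBO (signs flipped relative to the maximization in (P7), since the sampler minimizes energy) rather than a real instance from the experiments.

```python
# Minimal sketch, assuming the D-Wave Ocean SDK (dimod, dwave-system) is
# installed and an API token / solver access has been configured.
import dimod
from dwave.system import DWaveSampler, EmbeddingComposite

# Placeholder QUBO in dictionary form: Q[(i, j)] is the coefficient of x_i * x_j.
# Diagonal terms are negated weights, off-diagonal terms are penalties.
Q = {(0, 0): -3.0, (1, 1): -2.0, (2, 2): -4.0,
     (0, 1): 5.0, (1, 2): 5.0}

bqm = dimod.BinaryQuadraticModel.from_qubo(Q)

# Exact classical check on the toy instance
print(dimod.ExactSolver().sample(bqm).first)

# Submission to a quantum annealer (requires solver access)
sampler = EmbeddingComposite(DWaveSampler())
sampleset = sampler.sample(bqm, num_reads=100)
print(sampleset.first)
```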

Table 1. Dataset statistics (left) and runtime comparison of extreme clustering algo-
rithms in seconds (right). PERCH-C (“collapsed-mode”) was run, as it outperforms
standard PERCH. The parameter L sets the maximum number of leaves (see [9] for an
explanation). BiDViT selected the values ε0 = 30 and ε0 = 0.5, such that a percent-
age of the nodes collapsed in the initial iteration, for the Covertype and the MNIST
datasets, respectively. The mean and standard deviation were computed over five runs.

Dataset statistics:

Name       | Description             | Cardinality | Dimension
MNIST      | handwritten images      | 60 K        | 784
MNIST-2D   | t-SNE of the above      | 60 K        | 2
Covertype  | forest data             | 581 K       | 54
grid-2D    | synthetically generated | 100 K       | 2
grid-3D    | synthetically generated | 1000 K      | 3

Runtime on dataset (seconds):

Algorithm (specified parameters)       | Covertype       | grid-3D
PERCH-C (L = Inf)                      | 1616.45 ± 20.37 | 1588.10 ± 41.46
PERCH-C (L = 50,000)                   | 1232.53 ± 53.61 | 1280.30 ± 15.03
PERCH-C (L = 10,000)                   | 928.82 ± 47.00  | –
BiDViT (heuristic), κ = 2000, α = 1.1  | 301.36 ± 10.01  | 152.50 ± 0.86
BiDViT (heuristic), κ = 500, α = 1.2   | 56.26 ± 0.62    | 75.22 ± 0.95

For the quantum version of BiDViT, one can observe higher-quality solutions
and a significant speed-up, when compared to common clustering methods. Both
observations are based on results shown in Fig. 6.

Fig. 6. Runtime and quality of results for the quantum version of BiDViT obtained
using a D-Wave 2000Q quantum annealer. Left) Computational time for the 3D grid
dataset. Right) Comparison of the Calinski–Harabasz score of the quantum version
of BiDViT and of k-means clustering on a subset of the MNIST dataset for different
numbers of clusters. The orientation of the abscissae has been inverted to illustrate that
at low BiDViT levels there are many clusters and at high levels only a few remain.

However, the heuristic version of BiDViT and the common clustering algo-
rithms were executed on a classical device that has a limited computational
capacity, whereas the D-Wave 2000Q is a highly specialized device. Running
these algorithms on a high-performance computer might lead to an equivalent
degree of speed-up.

[Fig. 7 panels: the test image quantized at BiDViT level 8 (40,700 colours), level 16 (1,063 colours), level 18 (367 colours), and level 24 (19 colours), alongside RGB colour-space scatter plots (R, G, and B values) at BiDViT levels 16 and 18.]

Fig. 7. Image quantization via clustering in the colour space of a standard test image.
The original image has 230,427 colours. BiDViT is particularly fast at reducing its
colours to a number on the order of 10⁴, as this falls into the extreme clustering range.
Here, the k-means clustering algorithm faces its computational bottleneck. A commonly
employed algorithm for such problems is the median cut algorithm. Naturally, it is
faster than BiDViT—as BiDViT employs the median cut algorithm in its chunking
procedure—but BiDViT produces a more accurate colour assignment.
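For comparison, colour quantization by clustering in RGB space can be reproduced with a generic algorithm. The sketch below uses Mini Batch k-means rather than BiDViT and a hypothetical file name, so it illustrates the task rather than the method evaluated above.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import MiniBatchKMeans

img = np.asarray(Image.open("test_image.png").convert("RGB"))  # hypothetical file
pixels = img.reshape(-1, 3).astype(float)

# Cluster pixel colours and replace each pixel by its cluster centroid
km = MiniBatchKMeans(n_clusters=64, n_init=3).fit(pixels)
quantized = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)

Image.fromarray(quantized).save("test_image_64_colours.png")
```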

5 Conclusion
We have developed an efficient algorithm capable of performing extreme clus-
tering. Our complexity analysis and numerical experiments have shown that, if
the dataset cardinality and the desired number of clusters are both large, the
runtime of BiDViT is at least an order of magnitude faster than that of classi-
cal algorithms, while yielding a solution of comparable quality. With advances in
quantum annealing hardware, one can expect further speed-ups in the algorithm
and the size of dataset that it can process.
Independent of BiDViT, the proposed coarsening method, based on identi-
fying a maximal ε-separated subset, is valuable in its own right—it is a novel
approach to clustering which is not limited solely to extreme clustering.
Further investigation of the proposed coarsening approach is justified, as we
have identified a domain for the radius of interest (in Theorem 2) such that,
under a separability assumption, every solution to (P1) (i.e., every maximum
weighted ε-separated subset) yields the optimal clustering assignment. Potential
paths for future research include the development of selection procedures for the
initial radius of interest and investigating the clustering quality of the coarsening
method (without data chunking) in the non-extreme domain.

Acknowledgments. We thank Saeid Allahdadian, Nick Condé, Daniel Crawford, and


Austin Wallace for their contributions to an earlier version of the algorithm. We thank
Maliheh Aramon, Pooja Pandey, and Brad Woods for helpful discussions on optimiza-
tion theory. The implementation of the QUBO preprocessing techniques was conducted
jointly with Brad Woods and Nick Condé. Inderpreet Singh contributed to the figure
on image quantization. Victoria Wong assisted with graphical editing of individual fig-
ures. Partial funding for this work was provided by the Mitacs Accelerate internship
initiative.

References
1. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In:
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algo-
rithms, pp. 1027–1035. SIAM (2007)
2. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering
method for very large databases. In: ACM Sigmod Record, vol. 25, pp. 103–114.
ACM (1996)
3. Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for
discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp.
226–231 (1996)
4. Nayak, R., Mills, R., De-Vries, C., Geva, S.: Clustering and labeling a web scale
document collection using Wikipedia clusters. In: Proceedings of the 5th Interna-
tional Workshop on Web-scale Knowledge Representation Retrieval & Reasoning,
pp. 23–30. ACM (2014)
5. de Vries, C.M., de Vine, L., Geva, S., Nayak, R.: Parallel streaming signature EM-
tree: a clustering algorithm for web scale applications. In: Proceedings of the 24th
International Conference on World Wide Web, pp. 216–226. International World
Wide Web Conferences Steering Committee (2015)
6. Wang, X.J., Zhang, L., Liu, C.: Duplicate discovery on 2 billion internet images. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, pp. 429–436 (2013)
7. Liu, T., Rosenberg, C., Rowley, H.A.: Clustering billions of images with large scale
nearest neighbor search. In: Proceedings of the 8th IEEE Workshop on Applications
of Computer Vision, WACV 2007, p. 28. IEEE Computer Society, Washington
(2007)
8. Woodley, A., Tang, L.X., Geva, S., Nayak, R., Chappell, T.: Parallel K-Tree: a
multicore, multinode solution to extreme clustering. Future Gener. Comput. Syst.
99, 333–345 (2018)
9. Kobren, A., Monath, N., Krishnamurthy, A., McCallum, A.: A hierarchical algo-
rithm for extreme clustering. In: Proceedings of the 23rd ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 255–264. ACM
(2017)
10. Kumar, V., Bass, G., Tomlin, C., Dulny, J.: Quantum annealing for combinatorial
clustering. Quantum Inf. Process. 17(2), 39 (2018)
11. Merendino, S., Celebi, M.E.: A simulated annealing clustering algorithm based on
center perturbation using Gaussian mutation. In: The 26th International FLAIRS
Conference (2013)
12. Kurihara, K., Tanaka, S., Miyashita, S.: Quantum annealing for clustering.
arXiv:1408.2035 (2014)

13. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In:
Proceedings of the 26th Annual ACM Symposium on Theory of computing, pp.
291–300. ACM (2004)
14. Balcan, M.F., Ehrlich, S., Liang, Y.: Distributed k-means and k-median clustering
on general topologies. In: Advances in Neural Information Processing Systems, pp.
1995–2003 (2013)
15. Lucas, A.: Ising formulations of many NP problems. Front. Phys. 2, 5 (2014)
16. Karp, R.M.: Reducibility among combinatorial problems. In: Complexity of Com-
puter Computations, pp. 85–103. Springer (1972)
17. D-Wave Systems Inc.: The D-Wave 2000Q Quantum Computer: Tech-
nology Overview (2017). https://www.dwavesys.com/sites/default/files/D-Wave
%202000Q%20Tech%20Collateral 0117F.pdf. Accessed 13 Feb 2019
18. Fujitsu Ltd.: Digital Annealer Introduction: Fujitsu Quantum-inspired Com-
puting Digital Annealer (2018). http://www.fujitsu.com/global/documents/
digitalannealer/services/da-introduction.pdf. Accessed 13 Feb 2019
19. Malkomes, G., Kusner, M.J., Chen, W., Weinberger, K.Q., Moseley, B.: Fast dis-
tributed k-center clustering with outliers on massive data. In: Advances in Neural
Information Processing Systems, pp. 1063–1071 (2015)
20. Balaji, S., Swaminathan, V., Kannan, K.: Approximating maximum weighted inde-
pendent set using vertex support. Int. J. Comput. Math. Sci. 3(8), 406–411 (2009)
21. Hifi, M.: A genetic algorithm-based heuristic for solving the weighted maximum
independent set and some equivalent problems. J. Oper. Res. Soc. 48(6), 612–622
(1997)
22. Kako, A., Ono, T., Hirata, T., Halldórsson, M.: Approximation algorithms for the
weighted independent set problem in sparse graphs. Discrete Appl. Math. 157(4),
617–626 (2009)
23. Abbott, A.A., Calude, C.S., Dinneen, M.J., Hua, R.: A hybrid quantum-classical
paradigm to mitigate embedding costs in quantum annealing. arXiv:1803.04340
(2018)
24. Nolte, A., Schrader, R.: A note on the finite time behavior of simulated annealing.
Math. Oper. Res. 25(3), 476–484 (2000)
25. Lü, Z., Glover, F., Hao, J.K.: A hybrid metaheuristic approach to solving the
UBQP problem. Eur. J. Oper. Res. 207(3), 1254–1262 (2010)
26. Zhu, Z., Fang, C., Katzgraber, H.G.: borealis – a generalized global update algo-
rithm for Boolean optimization problems. arXiv:1605.09399 (2016)
27. Glover, F., Lewis, M., Kochenberger, G.: Logical and inequality implications for
reducing the size and difficulty of quadratic unconstrained binary optimization
problems. Eur. J. Oper. Res. 265(3), 829–842 (2018)
28. Mandal, S., Pal, M.: Maximum weight independent set of circular-arc graph and
its application. J. Appl. Math. Comput. 22(3), 161–174 (2006)
29. Köhler, E., Mouatadid, L.: A linear time algorithm to compute a maximum
weighted independent set on cocomparability graphs. Inf. Process. Lett. 116(6),
391–395 (2016)
30. Hernandez, M., Zaribafiyan, A., Aramon, M., Naghibi, M.: A novel graph-based
approach for determining molecular similarity. arXiv:1601.06693 (2016)
31. LeCun, Y., Cortes, C., Burges, C.J.: MNIST handwritten digit database. AT&T
Labs (2010). http://yann.lecun.com/exdb/mnist
32. Maaten, L.V.D., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res.
9, 2579–2605 (2008)
33. Blackard, J.A.: UCI Machine Learning Repository (2017). http://archive.ics.uci.
edu/ml. Accessed 13 Feb 2019

34. Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat.
Theory Methods 3(1), 1–27 (1974)
35. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern
Anal. Mach. Intell. 2, 224–227 (1979)
36. Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J.: Understanding of internal clustering
validation measures. In: 2010 IEEE International Conference on Data Mining, pp.
911–916 (2010)
37. Jain, R., Koronios, A.: Innovation in the cluster validating techniques. Fuzzy
Optim. Decis. Making 7(3), 233 (2008)
38. Sculley, D.: Web-scale k-means clustering. In: Proceedings of the 19th International
Conference on World Wide Web, pp. 1177–1178. ACM (2010)
Clustering and Classification
to Evaluate Data Reduction
via Johnson-Lindenstrauss Transform

Abdulaziz Ghalib, Tyler D. Jessup, Julia Johnson(B) ,


and Seyedamin Monemian

Department of Mathematics and Computer Science, Laurentian University,


Sudbury, ON P3E2C6, Canada
{as ghalib,td jessup,jjohnson,smonemian}@laurentian.ca

Abstract. A dataset is a matrix X with n × d entries, where n is the


number of observations and d is the number of variables (dimensions).
Johnson and Lindenstrauss assert that a transformation exists to achieve
a matrix with n × k entries, k << d, such that certain geometric prop-
erties of the original matrix are preserved. The property that we seek is
that if we look at all pairs of points in matrix X, the distance between
any two points should be the same within a given small acceptable level
of distortion as the corresponding distance between the same two points
in the reduced dataset. Does it follow that semantic content of the data is
preserved in the transformation? We can answer in the affirmative that
meaning in the original dataset was preserved in the reduced dataset.
This was confirmed by comparison of clustering and classification results
on the original and reduced datasets.

Keywords: Data reduction · High dimensional data · Clustering ·


Classification · Johnson-Lindenstrauss Transform

1 Introduction
Each observation (row) of matrix X is represented by a point x in d-dimensional
space in Rd . A dataset can include a massive number of dimensions. However, it
is not always useful to have such a large number of attributes because oftentimes
many contain irrelevant and redundant information. Also, it is common that by
shrinking the dimensions, we could gain additional crucial information.
There are two types of dimensionalities, one is called the extrinsic dimension-
ality and the other is called the intrinsic dimensionality. The extrinsic dimen-
sionality of a matrix X indicates the dimensionality in which its data points are
observed. So we may say that there are d extrinsic dimensions in X. The other
type of dimensionality, which is significant, is known as intrinsic dimensionality.
The intrinsic dimensionality of matrix X indicates the number of dimensionali-
ties that are important and required to return a useful response to a given query
of our choice.

There is an essential need to use techniques to find this intrinsic dimension-


ality to reduce the size of datasets and analyze them in such a manner as to
achieve desired results. Available studies have shown the effects of using feature
selection methods and feature extraction methods such as principal component
analysis (PCA) in reducing the number of dimensions and improving the per-
formance when conducting data analysis [1]. A new branch of statistics to deal
with massive amounts of data referred to as high-dimensional data analysis has
yielded the Johnson and Lindenstrauss (JL) lemma.
Dimensionality reduction is a procedure for mapping, or transforming, a high
dimensional dataset into a lower dimensional dataset, and nearly preserving the
intrinsic structure or important information of the original data in the trans-
formed data. How one defines intrinsic structure, or important information, can
vary, based upon the chosen metrics. We explore here, a type of a dimension-
ality reduction technique named the Johnson-Lindenstrauss Transform (JLT),
or Johnson-Lindenstrauss Embedding. The JLT conforms to the constraints of
a mathematical lemma, named The Johnson-Lindenstrauss (JL) lemma. This
technique preserves intrinsic structure or important information based on the
metric of Euclidean Norm, and the technique does so by noting the pairwise
distances between all random points in the higher dimensional dataset and pre-
serving them in the corresponding lower dimensional dataset.
In general, the JLT relates points in high dimensional space to points in lower
dimensional space, while keeping them in comparison as similar as possible. In
a finer grained definition towards the notion of similarity, the JLT transforms
a high dimensional data set into a lower dimensional dataset based on a metric
for preserving pairwise distance between points (i.e. Euclidean Norms) through
a linear transform known as a random projection. After the projection, the com-
parable pairwise distances (Euclidean Distances) between the original points in
the original data vs. the reduced data are known based on the JL lemma. All
randomly chosen pairwise points are nearly preserved within a constant stretch
factor, or distortion/error parameter. Both the constraints on the pairwise dis-
tance preservation and the bound k of reduced dimensional space based on the
distortion parameter can be seen in the lemma. The existence of a suitable random
projection matrix for the linear transform is stated and proven, yet an exact
construction has never been given.
Over many evolutions of the JLT, the linear transformation or more specif-
ically the random projection matrix within the linear transformation has been
defined in multiple ways. It has been shown, over time, that the particular choice
of random projection matrix plays an important role and impacts both on the
lemma’s precision in preserving information and on the lower limiting function
which defines the lemma’s lower bounds k on the reduced space’s dimension.
A particular choice of random projection matrix has a direct influence on k.
Two facts drive this effect: first, the coordinates of each random projection
vector in the chosen matrix must conform to statistical expectations, in particular
that certain series or sums have known convergent values; second, the lower
limiting function of the lemma, and hence also k, depends heavily on these
properties. In many of the lemma's evolutions and improvements, the choice of
random projection matrix and its well-defined characteristics are integrated into
the proof and derivation, and they are usually the largest influence on the
lemma's lower bound k.
How, then, has the random projection matrix been defined over the evolutions
and improvements of the JLT and the JL lemma? Observed from many proofs of
improvements of the lemma, the definitions fall into three distinct classes, and
only these three:

1. The optimal random projection matrix adhering to fundamental constraints
(spherical symmetry, orthogonality, and unit norms of the random projection
vectors).
2. The Probabilistic Approach (Gaussian coordinates of the random projection
vectors).
3. The Sparse and Discrete Approach (coordinates of the random projection
vectors drawn from {−1, 0, 1}, for example).

All three encompass specific definitions of the random projection matrix, and
different classes of evolutions and improvements of the original lemma’s proof.
The JL lemma has been improved over time. Many researchers were looking
to enhance the lemma’s proof, by reducing the bound k and simplifying the proof
of the lemma. The improvements focused on how to make the lower bound of
the k-dimensional space to be tighter. Also, research has centered on how to
make the transformation T to be efficient by reducing the required number of
matrix operations and the amount of space to compute the transformation. Even
though there are various improved versions of the lemma, all of them still adhere
to the condition of pairwise distance preservation.
Two approaches to construction of JL lemma are of interest in this study:
the first approach uses sub-Gaussian projection and the second uses sparse pro-
jection matrices. We investigate the effects of dimensionality reduction using the
two different techniques on real-world data for Attention Deficit Hyperactivity
Disorder (ADHD).

1.1 Motivation
The authors of [2] used simulated, not natural, data to perform data reduction based
on the JL lemma, and statistical measures to judge the quality of the reduction.
In this study we use natural neuroimaging data and clustering and classifica-
tion techniques to judge the quality of the JL lemma reduction. A question of
paramount importance is whether our original high dimensional dataset gives
the same analytic results as the reduced dataset. For instance, after performing
clustering on both original and reduced datasets, would they produce the same
classification accuracy? Even though some papers [3–6] have conducted research
on the topic of Johnson-Lindenstrauss lemma to reduce high dimensional data

Fig. 1. Novelty of this research. The dotted lines show work that has previously been
done. Clustering followed by classification is new.

for classification or clustering (see the dotted lines in Fig. 1), to the best of our
knowledge, comparison of the classification via clustering results (as illustrated
using the bold solid black arrows in Fig. 1) has not been done.

1.2 Research Question

The research question is whether corresponding original and reduced datasets


contain approximately the same information. The JL lemma states that for
every pair of points (xi , xj ) in the original dataset and corresponding pair in
the reduced set, the distance between xi and xj is approximately preserved.
By virtue of pairwise distances being preserved, does this guarantee that the
semantic content of the data has been preserved as well? To that end we will be
pursuing answers to the following questions:

1. Does the Johnson-Lindenstrauss Transform (JLT) in fact protect the reduced


data from major loss of important characteristics by virtue of the fact that
it preserves the pairwise distance structure?
2. Does the reduced dataset by JLT produce the same accuracy (or almost the
same) of classification models as the original dataset?
3. Which one of the two JL algorithms produces more reduction?
4. What is the optimum number of clusters for labeling the ADHD dataset?

An outline of the paper follows: Sect. 2 gives related work, which includes
an introduction to the data set under consideration, that of fMRI measures of
people experiencing Attention Deficit Hyperactivity Disorder (ADHD), review
of data reduction strategies, in general, and the JL lemma, in particular, and
clustering and machine classification algorithms. Section 3 describes the ADHD
dataset in greater detail and shows how it was used for performing data reduc-
tion and classification via clustering analysis to illuminate the essential meaning
in the data. Also, in that section the method is explained that was used to imple-
ment and compare different versions of the JL lemma, and the development of
a general tool based on the JL lemma is described that makes it easy to accom-
plish data reduction tasks. In Sect. 4 we show implementation of the tool for

comparing different JL reduction techniques and also show the results of exe-
cution of the clustering and classification algorithms on our dataset. Section 5
shows analysis and comparison of results of running two different versions of
the lemma and examines the amount of reduction done by each. Also, it shows
the results of comparing machine classification outcomes for the original high
dimensional dataset and the reduced ones. Section 6 summarizes the outcomes
from our experiments and suggests future research directions.

2 Background and Related Work

This section provides a review of several aspects of data reduction strategies


and machine classification techniques. Also, it provides a review of the literature
related to the Johnson and Lindenstrauss (JL) lemma and shows its applications.
We will begin by describing Attention Deficit Hyperactivity Disorder (ADHD).
People experiencing ADHD exhibit three symptoms: inattention, hyperactivity,
and impulsivity. ADHD is generally experienced among children, but adults may
also have it, and it is more prevalent in males than in females. Three measurements
of the human brain can help identify individuals with ADHD: brain size, the
cerebellar vermis, and the frontal cortices.
One of the most important and recognized studies of the ADHD symptoms
is “Subcortical brain volume differences in participants with attention deficit
hyperactivity disorder in children and adults: a cross-sectional mega-analysis”
[7], which investigates the brain volumes of more than 3000 individuals. The
objective of another study [8] conducted for ADHD was to find a way to create
classification models that help medical practitioners so that they will be able to
recognize individuals with ADHD.

2.1 Data Reduction Strategies

Currently, data are growing rapidly in all science and business domains making
it complex to process such large amounts of data. Therefore, there is a real
necessity to consider data reduction strategies to help solve many real world
problems such as crime prevention, reducing unemployment, or making decisions
for medical treatment. Also, there are other goals for data reduction techniques,
such as the minimization of the amount of data that needs to be stored in
a data storage environment, or for the efficient analysis of big data, such as
healthcare dataset, which contains a massive amount of data like the health
status of patients. Many real world problems can be solved after preprocessing
big data. Many organizations in different industries such as banking, government,
manufacturing, education, health care, and retail are using big data.
During data reduction, we perform techniques on our original dataset to
obtain a reduced representation of it. Indeed, the reduced representation should
be much smaller in volume and produce the same (or almost the same) analytical
outcomes. Therefore, using a reduced representation of our dataset would be

more efficient due to the fact that the space and computational complexity are
reduced. Dimensionality reduction [9], numerosity reduction (parametric methods
such as log-linear models [10] and regression [11], and non-parametric methods [12]),
and data compression [13] are three types of data reduction strategies. Choosing the right model is
essential for the given type of dataset.
The singular value decomposition (SVD) or what is also known in different
fields of study as the principal component analysis (PCA) or Latent Semantic
Indexing or Karhunen-Loeve (KL) transform, is one of the effective methods for
dimensionality reduction. [21] utilizes SVD as a dimensionality reduction method
combined with the K-means clustering algorithm in order to yield a more accu-
rate recommender system. In [22], the authors proposed Higher Order Singular
Value Decomposition (HOSVD) as a robust dimensionality reduction method
to overcome the sparsity problem of high dimensional matrices. The author of [23]
proposed a semi-supervised data reduction method, Semi-supervised Linear Dis-
criminant Analysis (SLDA), which can use a limited number of labeled data points
and a quantity of unlabeled ones for training.
Theoretically, JL Lemma and SVD solve different problems in Data Reduc-
tion. JL Lemma promises a low dimensional space that captures the distances
up to a given error, while SVD gives the best possible embedding according to a
given dimension. In other words, JL Lemma yields a worst-case pairwise guaran-
tee while SVD computes a projection of the input points in a lower dimensional
space where the sum of the errors is minimized.
For PCA we were unable to scale it up to large datasets so it was not con-
sidered for our current study. PCA and Johnson and Lindenstrauss approach
(the focus of this study) are data extraction techniques that distort the values
of data in columns in contrast with feature selection techniques that leave the
column values of the selected features intact.
To select the best algorithm for our problem, we have to take into consider-
ation these questions: What kind of problem are we dealing with? For example,
are we dealing with classification, regression, or clustering? Does our data consist
of a very high dimensionality? Is our data labeled or unlabeled? Then we can
decide on answers to the following questions: Are we interested in performing
feature selection or extraction? Are we going to use a supervised or unsuper-
vised method? Do we want to use a very computationally intensive method or
an inexpensive one?

2.2 Johnson and Lindenstrauss Lemma


We consider the representation of data in a lower dimensional space. An efficient
result that promises to take care of this matter is Johnson and Lindenstrauss
lemma [24] which ensures that the pairwise distance between each pair of data
points is distorted by a given small factor.
The Johnson and Lindenstrauss lemma or JL lemma for short is a classi-
cal mathematical dimension-reduction tool from 1984. The tool named after
William B. Johnson and Joram Lindenstrauss concerns low-distortion embed-
ding of points from high dimensional space into a low dimensional space. The

lemma aims to preserve all the pairwise distances in the Euclidean space of n
vectors of length d. The idea is that we want to embed all of the vectors into
a space of dimensionality k << d. The JL lemma was first used as an initial
geometric lemma to prove the results of Lipschitz mappings to Hilbert spaces.
In the original paper [24], the JL lemma was stated as follows: Given a set
of n points in ℝ^d, for some n, d ∈ ℕ, and given ε ∈ (0, 1), there exists k₀ =
O(ε⁻² log n) such that, if k ≥ k₀, there exists a linear mapping T : ℝ^d → ℝ^k
such that, for any two points u, v,

$$(1 - \varepsilon)\,\lVert u - v \rVert_2 \;\leq\; \lVert T(u) - T(v) \rVert_2 \;\leq\; (1 + \varepsilon)\,\lVert u - v \rVert_2. \qquad (1)$$

Here ‖T(u) − T(v)‖₂ is the projected distance in k-dimensional space and ‖u − v‖₂ is
the original distance in d-dimensional space (where u, v are any rows of our
dataset). The mapping T is called a JL embedding.
Johnson and Lindenstrauss are claiming that a high dimensional dataset can
be projected to significantly lower dimension while preserving internal structure
of the data depending upon the amount of distortion we consider acceptable
(smaller k has more distortion).
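The bound k₀ can be evaluated numerically; for example, scikit-learn ships a helper implementing the standard form of the bound (the sample size of 216 below matches the number of subjects used later, and the ε values are our own choices).

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Minimum safe target dimension k for n = 216 points at several distortions
for eps in (0.1, 0.3, 0.5, 0.9):
    k = johnson_lindenstrauss_min_dim(n_samples=216, eps=eps)
    print(f"eps = {eps}: k >= {k}")
```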
Generally speaking, when pursuing dimensionality reduction of a given high
dimensional dataset, many of the available techniques apply linear transforma-
tion in order to obtain the intrinsic dimensionality. Principal component analysis
(PCA) and random projection (RP) are two of those that use linear transforma-
tion as illustrated in Fig. 2.

Fig. 2. Linear transformation T(X) to obtain intrinsic dimensionality of a given high


dimensional dataset X.

Steps in the linear transformation are as follows:

– We construct random matrix R of dimension d × k.


– We project our original matrix onto k-dimensions by multiplying it by R.

To construct the random matrix R, we used the Gaussian distribution for the first
experiment and a sparse random projection matrix for the second, as was done
in [2]. However, their experiments tested the two implementations of the JL lemma
on simulated data using statistics, while in this work the
two versions were tested on natural data using machine classification techniques.
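As an illustrative stand-in for those two constructions (the actual reductions in this work were computed with MatLab programs, described in Sect. 4), scikit-learn exposes both a Gaussian and a sparse random projection; a minimal sketch on a synthetic matrix matching the 216 × 37,152 shape of permutation 1 follows.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

# Synthetic stand-in with the same shape as permutation 1 of the ADHD data
X = np.random.default_rng(0).normal(size=(216, 37152))

for name, proj in [("Gaussian", GaussianRandomProjection(eps=0.5)),
                   ("Sparse",   SparseRandomProjection(eps=0.5))]:
    X_reduced = proj.fit_transform(X)
    print(f"{name}: d = {X.shape[1]} -> k = {proj.n_components_}")
```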

2.3 Classification and Clustering

Classification and clustering are two different approaches to machine learning.


Most of the time, classification algorithms are called “supervised learning algo-
rithms” and clustering algorithms are called “unsupervised learning algorithms”.
Supervised methods include neural networks [14], K-nearest neighbor (K-NN)
classifier [15,16], support vector machines [17] and decision trees [18]. Also, in
[19], Naı̈ve Bayes has been used as a classifier on a reduced dataset via feature
selection method for intrusion detection application. Naı̈ve Bayes is considered
to be simple to build [15].
In unsupervised learning, the aim is to model the essential structure or dis-
tribution in the data. By contrast with supervised learning, there are no clear
target outputs or teacher. A clustering algorithm, which is an "unsupervised learning
algorithm", aims to find "clusters", that is, groups, within a given dataset.
The main goal of clustering is to group similar objects in a dataset into collections
of data [11]. This type of learning supports "unlabeled" data.
The K-means is one of the most popular partitioning methods, which is used
to split a dataset into k groups (clusters) based on similarity. There are many
application areas that take advantage of the K-means algorithm such as in the
field of social science, manufacturing, and chemical analysis. For each group
(cluster), the center is represented by the mean value (centroid) of the objects in
that cluster. This clustering algorithm uses the Euclidean distance as a measure
to assign the objects to their closest cluster center [20].
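To make the centroid mechanics explicit, one K-means iteration can be written directly in NumPy; this is a didactic sketch of the assignment and update steps, not code used in this study.

```python
import numpy as np

def kmeans_step(X, centres):
    """One K-means iteration: Euclidean assignment, then centroid update."""
    # Assign each point to the nearest centre (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centre as the mean of its assigned points
    new_centres = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centres))])
    return labels, new_centres

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centres = kmeans_step(X, centres=np.array([[0.0, 0.0], [5.0, 5.0]]))
print(labels, centres, sep="\n")
```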

3 Methodology
An application programming interface (API) was implemented that enabled two
different versions of the JL lemma to be performed. Quality of the reduction was
judged by the amount of reduction done and by the extent to which the reduced
data capture the shape of the original data. The first criterion was determined by
looking at the value k of the reduced datasets. The second stage was to examine
the accuracy and performance of clustering and classification techniques applied
to the reduced datasets.

3.1 The Dataset

A neuroimaging dataset was analyzed that describes subjects with Attention


Deficit Hyperactivity Disorder (ADHD) (NYU site resting-state fMRI data
[25]). Clustering and classification were applied to the original data set as well

Fig. 3. Method by which JL lemma was tested to see if it preserves information content
in the reduction as well as geometric properties of the data.

as to the reduced datasets. Refer to Fig. 3 for an overview of how the quality of
the reduction was judged.
The ADHD data can be viewed in several ways. For example, subjects could
appear as rows and Time series/features (also called regions of interest) as
columns, or features could appear as rows and the other two as columns. In
all there are six possible permutations as illustrated in Table 1.

Table 1. Permutations of the ADHD data to aid flexibility of analysis.

Permutation Row Column


Permutation 1 Features Subjects [Time series]
Permutation 2 Time series Subjects [Features]
Permutation 3 Features Time series [Subjects]
Permutation 4 Time series Features [Subjects]
Permutation 5 Subjects Features [Time series]
Permutation 6 Subjects Time series [Features]

The original subject files contained numeric codes in their first row that
describe each column. The codes describe the regions of interest (ROIs) in the
brain. A numeric code of each ROI is associated with a narrative abbreviation.
We have provided six different permutations of the ADHD dataset each of
which contains ≥19,952 dimensions. However, only one of the permutations was
used during our experiment (permutation1), which has 37,152 dimensions. We
have developed an API to construct these permutations as well as apply the imple-
mentations of the JLT algorithms with error tolerances from 0.1 through 0.9.
Let us discuss permutation1 with subjects along the rows and feature-time
along the columns. The structure consists of all of the 116 ROIs each with its

related 172 time series. In the first row (i.e. ROI number 1-time 1, ROI number 1-
time 2, and so on) the features and time-series will be given. Then, the following
rows would hold the record for each subject (i.e. row 2 will have subject number
1, row 3 will have subject number 2, and so on).
The API for running the JL-lemma construction provides flexibility for reor-
ganizing the input dataset to view it from different angles of interest.

3.2 Classification via Clustering in WEKA

WEKA is a powerful tool that allows us to perform many machine learning (ML)
algorithms and to analyze their results. Subjects were grouped based on their
brain signals measurements. Therefore, we applied clustering technique using
the K-means algorithm. Then we used the ‘AddCluster’ filter that is available
in WEKA, which allowed us to know which instances are in which cluster. This
resulted in having a new attribute called ‘cluster’, that acts as a class label
in the dataset that tells the cluster number for each instance. Then different
classification algorithms were performed as the data in each cluster were labeled
by the cluster’s number. The so labeled ADHD dataset was used to test the
accuracy and performance of classification algorithms. In our experiment, we
used ten-fold cross-validation for training and testing purposes for the ADHD
dataset.
Datasets can be divided into two separate sets. One is called training set and
the other is called testing set. The training set is intended for building a model.
On the other hand, the testing set is intended for validating that built model.
Note that the data points from the training set are excluded from the testing
set. Some available evaluation measurements in WEKA include cross validation,
confusion matrix, error rate, accuracy, precision, recall, and ROC curve.
To be able to evaluate our predictive models, we used cross-validation tech-
nique for all of our classifiers. This supervised learning evaluation technique
allowed us to divide our dataset into two sets (training set and testing set). We
used 10-fold cross-validation which means that the dataset will be sliced into 10
equal sized sets. Each one of these ten sets were divided into two parts (90%
training set and 10% testing set). The selected classification algorithm was run
on each of the ten sliced sets. At the end the cross-validation technique returns
the average accuracy out of these ten models. The selection of 10-fold was based
on theoretical evidence [27,28] as well as on empirical evidence of using different
datasets with different learning techniques. The studies show that 10-fold is the
optimal number of folds for having a good estimator error.
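The same clustering-then-classification workflow can be expressed compactly outside WEKA. The Python sketch below is an analogue using scikit-learn, with a random placeholder matrix standing in for the ADHD data; it labels the rows by K-means cluster and then scores a classifier with 10-fold cross-validation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.default_rng(0).normal(size=(216, 500))   # placeholder for the ADHD matrix

# Step 1: unsupervised labelling (analogue of WEKA's AddCluster filter)
labels = KMeans(n_clusters=3, n_init=50, random_state=0).fit_predict(X)

# Step 2: supervised evaluation with 10-fold cross-validation
scores = cross_val_score(RandomForestClassifier(random_state=0), X, labels, cv=10)
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```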
Using data reduction techniques will result in reducing the computational
complexity of the K-means clustering algorithm as well as classification tech-
niques. We have tried performing PCA on the ADHD dataset in WEKA and R
software for the sake of comparison. However, attempts were unsuccessful due
to the computational complexity of the PCA with the high dimensional ADHD
dataset. In fact, computation of PCA would have required an upgraded machine
environment.

4 Implementation
We have implemented the JL Embedding Tool, which acts like an API for using
two different MatLab programs to compute two different versions of the JL
lemma: JL lemma using sub-Gaussian projection and JL lemma using sparse
matrix projection. Mtlb_t_4_2 and mtlb_t_4_3 are abbreviations used for the
different versions of the JL lemma because in [2] the two approaches were referred
to as Theorems 4.2 and 4.3.
Let us describe the architecture of the input/output of the JL Embedding
Tool that did the reduction (Fig. 4). The tree shows the original dataset at the root
as an input data file. The second layer contains all six permutations (pre-
embedded) of that input data file. The third layer (post-embedded) produces the
output of using the two different versions of the JL lemma algorithms. For each
permutation with each algorithm, there will be nine different result files as arising
from  = 0.1 to 0.9. That would result in having (6 permutations × 2 algorithms
× 9 epsilon values) = 108 ‘.csv’ output files. Figure 4 shows an overview of the
input output structure of the JL Embedding Tool that was used. Note that this
was the main tool for our experiments.

Fig. 4. The input/output of the JL Embedding Tool.



In summary, the steps for conducting the experiment follow:

1. The API was used to reconstruct the ADHD data files for 216 subjects, which
will produce six different permutations. Also, the API will apply two different
JL transformations (for error tolerance = 0.1, . . . , 0.9 per permutation per
JL algorithm) on these six permutations.
2. Choose one permutation (Only permutation 1 was chosen to conduct the
experiment).
3. Load the dataset into R programming language to obtain its transpose.
4. Write an R program script to apply the elbow method on the dataset so
that we will have insight on the optimal number of K needed for K-means
clustering algorithm.
5. Load the ‘.csv’ data file in WEKA and save it as ‘.arff’ WEKA file.
6. Apply the K-means clustering algorithm with K = 3 (from step 4, it turns out
that 3 is the optimal value of K) on the dataset and then add a new
feature that represents the cluster name for each instance in the dataset
using WEKA's filter 'AddCluster'.
7. Apply classification algorithms and observe the accuracy and performance
of each algorithm.
8. Cluster the two reduced datasets (JL 4.2 and 4.3 for error tolerance = 0.1,
0.5, 0.9) using the K-means clustering algorithm with K = 3. Then add a new
feature to the datasets that represents the cluster name for each instance.
9. Apply classification algorithms on the datasets in step 8 and observe the
accuracy and performance of each algorithm.
10. Compare the accuracy and performance of the original data and the reduced
datasets.
11. Compare reduced dimension of the datasets (JL 4.2 and JL 4.3 for error
tolerance = 0.1, . . . , 0.9).

5 Results and Discussion

Two machine learning techniques were studied, clustering (unsupervised learn-


ing) and classification (supervised learning). For clustering the ADHD dataset,
we used the K-means algorithm. For classification, we used five different algo-
rithms, which are the decision tree, the K-NN, the Naı̈ve Bayes, the random
forest, and the ZeroR.

5.1 Amount of Reduction Done (Sub-Gaussian Vs. Sparse Matrix)

In Fig. 5, we are comparing two different algorithms of JL lemma, one is using


sub-Gaussian projection and the other one is using sparse matrix projection.
As we can see from the graph, the sub-Gaussian projection performs more
reduction than the sparse matrix approach on the ADHD data, except when
epsilon equals 0.9. In fact, this is the same graph for all of the six permutations

Fig. 5. Comparison of sub-Gaussian and sparse matrix JL algorithms applied to the


ADHD data with different error tolerances.

that we have tested for both sub-Gaussian and sparse matrix projections. The x
axis represents the value of error tolerance and the y axis represents the number
of reduced attributes (the dimension k).
Now we can answer research question three ‘Which one of the two JL algo-
rithms produces more reduction?’ The answer is JLT 4.2 using the sub-Gaussian
approach.

A Note on Efficiency: A generalized statement of the JL lemma using sub-


Gaussian tails has already been provided by previous researchers [26], while a
variant of the lemma that contains an efficient construction using sparse projec-
tion matrices has been included in current available algorithms. Sparse matrices
can be dealt with more efficiently than dense matrices due to the fact that most
of the elements of a sparse matrix are zero, whereas in a dense matrix most of
the elements are non-zero. Thus, performing operations such as matrix multipli-
cation using sparse matrix is less computationally expensive.
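The efficiency gap can be observed directly with SciPy's sparse matrices. The toy timing below is our own sketch (the matrix sizes and the 1% density are arbitrary choices), comparing a dense with a sparse random projection multiplication.

```python
import time
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10000))                  # n x d data matrix
R_dense = rng.normal(size=(10000, 500))             # dense d x k projection matrix
R_sparse = sparse.random(10000, 500, density=0.01,  # sparse d x k projection matrix
                         format="csr", random_state=0)

start = time.perf_counter()
Y_dense = X @ R_dense
dense_t = time.perf_counter() - start

start = time.perf_counter()
# Keep the sparse operand on the left (sparse @ dense is well supported),
# then transpose back: (R^T X^T)^T == X R.
Y_sparse = (R_sparse.T @ X.T).T
sparse_t = time.perf_counter() - start

print(f"dense projection: {dense_t:.3f} s, sparse projection: {sparse_t:.3f} s")
```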

5.2 Selecting the Optimal Value of ‘K’ in K-means


Visualizing the clusters was difficult due to the massive number of attributes
in the ADHD dataset. However, there are other ways to evaluate clusters such
as the number of iterations and the sum of squared error (SSE). The number
of iterations was used to assign the number of runs for the K-means clustering
algorithm. Different runs may result in different clusters.
Besides assigning the number of iterations, we considered assigning different
centroid starting conditions to select the best one. In our experiment we selected

the number of iterations to be 15 and the number of initial configurations to be


50 for centroid starting points.
The SSE within a cluster is the sum of squares from points to the assigned
cluster centers. We can test the SSE when selecting different values of K. If the
SSE decreases that means it is a better clustering model than the previous one.
The goal is to minimize total intra-cluster variance. Basically, the error for each
point is the distance to the nearest cluster.
We can now answer research question number four ‘What is the optimum
number of clusters for labeling the ADHD dataset?’. We choose to use the
K-means algorithm because it is a simple and fast clustering technique, espe-
cially with numeric data. This clustering algorithm requires the user to pick the
number of clusters prior to running the algorithm. To be able to select the right
number of clusters ‘K’, we have used the elbow method because it is effective
in selecting the optimal number of clusters. Basically, we plot the comparison
between the values of SSE of each cluster versus the number of ‘K’ clusters.
Then we inspect the plotted graph for the elbow shape.
The method shows that at K = 3 we should stop dividing the data into more
clusters. We have concluded this by writing an R programming language script,
which enabled us to run the K-means algorithm with different values of ‘K’ and
calculate the SSE within clusters. After applying K-means clustering algorithm
on the dataset, a new feature was added to the data called ‘cluster’. This new
feature will enable classification to be performed later.
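The elbow computation itself is short. The authors used an R script; the Python sketch below is an equivalent stand-in (with a random placeholder matrix), recording the within-cluster SSE reported by K-means for a range of K values, using the iteration and initialization settings mentioned above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(216, 500))   # placeholder data matrix

ks = range(1, 11)
# n_init / max_iter mirror the settings reported in the text (50 and 15)
sse = [KMeans(n_clusters=k, n_init=50, max_iter=15, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K")
plt.ylabel("within-cluster SSE")
plt.title("Elbow method")
plt.show()
```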

5.3 Preservation of Meaning (Clustering)

Table 2 shows the amount of retained cluster labels from the original ADHD
dataset in the reduced dataset using JLT 4.2. The table shows that with error
tolerance 0.1, the reduced dataset was able to capture 67% of the labels. With
error tolerance 0.5, the dataset was able to capture 55% of the labels. However,
with reduction using error tolerance 0.9, the reduced dataset captured 37% of
the labels.

Table 2. Retained amount of clusters in the reduced ADHD dataset using JLT 4.2

Error tolerance (ε) 0.1 0.5 0.9


Sub-Gaussian projection (JLT 4.2) 67% 55% 37%

Table 3. Retained amount of clusters in the reduced ADHD dataset using JLT 4.3

Error tolerance (ε) 0.1 0.5 0.9


Sparse matrix projection (JLT 4.3) 82% 31% 38%

Table 3 shows the amount of retained cluster labels from the original ADHD
dataset in the reduced dataset using JLT 4.3. As we saw on the previous table,
the amount of retention was significant with error tolerance 0.1 while with large
error tolerance (e.g. 0.5 and 0.9) the retention rate decreases.
Tables 2 and 3 answer research question one which was ‘Does the Johnson-
Lindenstrauss Transform (JLT) in fact protect the reduced data from major loss
of important characteristics by virtue of the fact that it preserves the pairwise
distance structure?’ and the answer is it depends on the value of error tolerance.

5.4 Preservation of Meaning (Classification)

In our experiment, we used 10-folds cross-validation as the method to split our


ADHD dataset into training set and testing set. The 10-folds cross-validation
method helped us to have the best accuracy rate of our classification algorithms.
We strive for the highest possible quality.

Table 4. Accuracy rates of JL 4.2 and JL 4.3 each with ε = 0.1, 0.5, and 0.9 for each
of the classification algorithms.

Algorithm ADHD dataset


Original 4.2-0.1 4.2-0.5 4.2-0.9 4.3-0.1 4.3-0.5 4.3-0.9
Decision tree 91% 74% 82% 98% 87% 84% 98%
K-NN 86% 89% 92% 99% 92% 88% 100%
Naı̈ve Bayes 93% 96% 93% 98% 95% 93% 97%
Random forest 90% 91% 89% 99% 94% 91% 100%
ZeroR 56% 53% 42% 43% 56% 40% 59%

The accuracy rates are summarized in Table 4 for several different classifica-
tion algorithms. The accuracy rates are good except for ZeroR which is known
to be a poor classifier because it relies on the target and ignores all predictors,
simply predicting the majority category.
Overall, the K-NN and random forest achieve the highest accuracy rate among
all classification algorithms, namely 100%, when used to predict the model of the
reduced ADHD dataset that uses JLT 4.3 with error tolerance 0.9. Notice that we
have used the same settings for training set and testing set for all classifiers. By
calculating the average accuracy of all models for each classification algorithm,
the decision tree has 88%, the K-NN has 92%, the Naı̈ve Bayes has 95%, the
random forest has 93%, and the ZeroR has 50%. The highest average achieved
by the Naı̈ve Bayes algorithm that is 95%. However, we cannot just rely on the
correctly classified instances (True rate/accuracy rate) because we might have a
high accuracy rate but not a good model.
After clustering the ADHD dataset, both the original and the reduced datasets
contain unbalanced class labels. A balanced labelling has 50% of the instances in
each class (i.e., 50% in cluster 1 and 50% in cluster 2), while an unbalanced
labelling has more instances in one class than in the other. For example, if 90% of
our instances are in one class, a classifier can simply assign everything to that
class; it will report 90% accuracy most of the time, yet the prediction model
performs poorly. Therefore, it is important to report other metrics such as the ROC
area under the curve (AUC) to check our classifiers' performance. This metric
each class in a dataset and give the percentage of the time that the classifier is
correctly predicting that item’s class.
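A small, assumed illustration of this check with scikit-learn (out-of-fold probabilities scored with ROC AUC alongside plain accuracy); the classifier choice and data here are placeholders, not the authors' setup.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_predict
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score, roc_auc_score

    X = np.random.rand(300, 40)          # placeholder features
    y = np.random.randint(0, 2, 300)     # placeholder (possibly imbalanced) labels

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    proba = cross_val_predict(GaussianNB(), X, y, cv=cv, method="predict_proba")[:, 1]
    print("accuracy:", accuracy_score(y, proba > 0.5))
    print("ROC AUC :", roc_auc_score(y, proba))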
The above answers research question two that asks ‘Does the reduced dataset
by JLT produce the same accuracy or almost the same of classification models
as the original dataset?’ and the answer is that it depends on the classification
algorithm used and the value of error tolerance.

Table 5. Comparison of the highest ROC AUC per classification algorithm.

Algorithm ADHD dataset Precision Recall F-measure ROC AUC


Decision tree 4.2-0.9 0.983 0.983 0.983 0.986
K-NN 4.3-0.9 1 1 1 1
Naïve Bayes 4.3-0.9 0.975 0.974 0.974 1
Random forest 4.2-0.9 0.992 0.991 0.991 1
Random forest 4.3-0.9 1 1 1 1
ZeroR 4.2-0.9 N/A 0.431 N/A 0.476

In Table 5, we compare the highest ROC AUC rates of the different classification
algorithms. The table shows that K-NN, Naïve Bayes, and random forest achieve the
highest ROC AUC of 1 on the reduced ADHD datasets, which indicates that their
models are perfect on this data. Although Naïve Bayes was not previously selected
as having the highest accuracy rate, its ROC AUC shows that it is as good a
prediction model as K-NN and random forest. Note that this Naïve Bayes model,
built on the ADHD dataset reduced by JLT 4.3 with distortion rate 0.9, has a 97%
accuracy rate, whereas K-NN and random forest reach 100% accuracy with JLT 4.3
and distortion rate 0.9. Random forest with JLT 4.2 and distortion rate 0.9 also
represents a perfect model, with an accuracy rate of 99%.

6 Summary and Conclusions


The immense number of entries within modern datasets makes it difficult to expose
their important information. Available analytical methods such as principal
component analysis can be used before applying machine classification algorithms
so that the tremendous amount of data to be analyzed is reduced. Even though this
method can help reduce data to a more condensed representation of its original
state, it was found to be unusable on the ADHD dataset due to its computational
intensity.
We therefore sought a better way to capture the intrinsic dimensionality of data.
For that reason, we turned to a mathematical result from 1984, the Johnson and
Lindenstrauss lemma (JL lemma), which states that we can embed high-dimensional
data points into a lower-dimensional representation while approximately preserving
the pairwise distance between each pair of points. Indeed, it states that any set
of n points in d-dimensional Euclidean space can be embedded into k-dimensional
Euclidean space while preserving the pairwise distances between all points within
a factor of (1 ± ε), where k << d and k is logarithmic in n and independent of d.
Two different approaches were used to construct the JL embedding: the first uses
sub-Gaussian projection (JLT 4.2) and the second uses sparse projection matrices
(JLT 4.3). Both approaches were implemented in an API, and the sub-Gaussian
approach was found to be superior for reduction of the ADHD data
(research question three).
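The two constructions can be sketched with scikit-learn's random projection module; this is only an assumed illustration of the idea, not the API developed for the study, and the data are placeholders.

    import numpy as np
    from sklearn.random_projection import (
        GaussianRandomProjection,       # sub-Gaussian style projection (cf. JLT 4.2)
        SparseRandomProjection,         # sparse projection matrix (cf. JLT 4.3)
        johnson_lindenstrauss_min_dim,
    )

    n, d = 1000, 10000
    X = np.random.rand(n, d)            # placeholder high-dimensional dataset

    for eps in (0.1, 0.5, 0.9):
        k = johnson_lindenstrauss_min_dim(n_samples=n, eps=eps)
        print(f"eps={eps}: target dimension k >= {k}")

    X_dense = GaussianRandomProjection(eps=0.5, random_state=0).fit_transform(X)
    X_sparse = SparseRandomProjection(eps=0.5, random_state=0).fit_transform(X)
    print(X_dense.shape, X_sparse.shape)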
The accuracy and performance of classifying the original and reduced ADHD
datasets were compared using different classifiers. The K-means clustering
algorithm was used to partition the brain signals of the subjects into different
groups. The number of partitions was selected using the elbow method, which
provided the optimal number of partitions (k = 3) (research question four). Both
the original and the reduced ADHD datasets thereby became labeled, so
classification algorithms could be applied to them.
It was illustrated that using dimension reduction by the JLTs as a preprocessing
step before clustering and classification yields the best prediction models in
terms of both accuracy and performance. This shows that the reduced datasets have
a positive impact on the accuracy and performance of those techniques, which is to
be expected. However, the clustering and classification results themselves help to
answer the research questions of this study concerning whether information is
preserved by the Johnson-Lindenstrauss Transform (JLT).
The clustering results answer research question one: 'Does the Johnson-
Lindenstrauss Transform (JLT) in fact protect the reduced data from major loss
of important characteristics by virtue of the fact that it preserves the pairwise
distance structure?'. The best results were obtained for error tolerance 0.1,
where the amount of retained cluster labels was 67% for JLT 4.2 and 82% for
JLT 4.3.
The K-NN and random forest algorithms were found to produce the best predictive
models in terms of accuracy rate; both reach 100% accuracy when predicting the
reduced ADHD dataset obtained with JLT 4.3 and error tolerance 0.9. In terms of
performance, K-NN with JLT 4.3 and error 0.9, Naïve Bayes with JLT 4.3 and error
0.9, and random forest with both JLT 4.2 and 4.3 at error 0.9 yield perfect
models, with ROC AUC = 1. These prediction models outperform the other models.
The worst prediction models of all were obtained with the ZeroR classification
algorithm, which showed poor accuracy and performance rates; the ZeroR classifier
was used only as a baseline against which to compare the other classification
algorithms.
Hence, research question two ‘Does the reduced dataset by JLT produce
the same accuracy or almost the same of classification models as the original
dataset?’ can be answered in the affirmative depending on the classification
algorithm used and the value of error tolerance.
Previous conceptualizations of ADHD have been limited by the lack of tools to
analyze huge stores of complex neurological data. It seems likely that there is a
complex typology of the disorder; for example, different people may have different
brain regions involved in the symptoms of the disease, and those brain regions
might be the foundation for the typology. An initial step towards developing tools
for analyzing ADHD data has been provided.
Finally, it was observed from the proofs of the JL lemma and the JLT that the
linear transformation can be computed in polynomial time for high-dimensional
datasets without any assumptions on the original data. In contrast, other
dimensionality reduction techniques such as principal component analysis are only
practical for datasets constrained to lower-dimensional spaces. That is, the
technique explored here is not constrained by assumptions about the dimension of
the data, and high dimensionality imposes no prohibitive computational cost on its
operation.
Our description of the Johnson-Lindenstrauss Transform/Embedding is an extremely
top-down, coarse-grained description, and only a description. It was made in an
attempt to be simple, clear and straightforward, in plain English, avoiding the
mathematical complexities imposed by the Johnson-Lindenstrauss lemma's formulation
and proof. Further work is in development towards understanding the implicit
geometrical nature of the lemma's construction and the implications for
relativity, proportionality, and space in general. Development towards these
objectives will supersede the high-level description discussed here in further
work.
We are in the process of creating a single application that is a hybrid of a
variety of JL techniques. This application will be a single pipeline, from one
black box to another, forming a chain from input to output. UML and architectural
diagrams of the design and implementation are forthcoming. Refactoring will be
undertaken to adhere to a solution stack in an attempt to decouple the application
and perhaps provide it as an open-source library for JL.

Application of Statistical Learning
in Ferro-Titanium Industry

Mahan Balal Pour1, Vahid Partovinia2, and Robert Pellerin1(&)


1
Polytechnique Montreal, Montreal, QC H3T 1J4, Canada
{mahan.balal-pour,robert.pellerin}@polymtl.ca
2
Huawei Technologies and Polytechnique Montreal,
Montreal, QC H3T 1J4, Canada
vahid.partovinia@polymtl.ca

Abstract. Although statistical control methods are extensively explored in the
literature, published research on the ferrotitanium industry is rare.
Ferrotitanium is used by steelmakers as a stabilizer to prevent chromium carbide
from forming at grain boundaries and in the production of low-carbon steels.
Steels with relatively high titanium content include interstitial-free, stainless
and high-strength low-alloy steels. Ferrotitanium is lighter and stronger than
iron and has a higher resistance to corrosion. The main statistical method applied
in the ferrotitanium industry is univariate statistical process control, which
ignores the correlations between the chemical components when determining the main
predictor variables for each response variable. In this paper, by applying
supervised learning methods, we identify the possible correlations between the
main alloys and prioritize them in the production process, providing guidance for
predicting the quality of production and decreasing the manufacturing cost caused
by out-of-range products.

Keywords: Ferrotitanium · Ferroalloy · Statistical process control · Supervised
learning · Linear regression · Random forest

1 Introduction

The need to improve productivity and efficiency, reduced management levels and
process security has led to increased research activity on fault detection and isolation
[2].
From the clients' side, quality and price are the essential requirements that they
consider above all other items when making a purchase or signing a contract with a
producer. Therefore, for producers, it is vital to have a rigid quality plan that
controls critical processes and warns process owners before the limit points are
reached. One of the main industries, which can fairly be called the root of many
other industries, is steel production. Steel producers compete with each other to
stay ahead in the quality of their products, and to improve quality they apply
purifying chemical compounds; one of the main such compounds is the family of
ferroalloys. A ferroalloy is a combination of iron with a high amount of alloying
elements such as vanadium, titanium, silicon, aluminum, etc. Each of these
ferroalloys has
a constructive effect on the performance of the produced steel in different
applications such as aerospace, automotive, medical part production and other
industries [3].
This research was carried out in a ferrotitanium company in North America to
investigate the quality data of its melting process. Ferrotitanium is used by
steelmakers as a stabilizer to prevent chromium carbide from forming at grain
boundaries and in the production of low-carbon steels. Titanium reacts very
quickly with nitrogen, oxygen, carbon, and sulfur, forming insoluble compounds. It
is lighter and stronger than iron and has a higher resistance to corrosion [7].
The company selected for this research has been the leading manufacturer of
ferrotitanium in Canada over the last twenty years. It specializes in the crushing
of FeTi of different grades (standard, low carbon, low aluminum, and low vanadium)
and sizes ranging from 10 to 50 mm, 10 to 30 mm, and 6 to 12 mm, down to 0 to 2 mm
ferrotitanium powder.
Currently, this company uses univariate statistical process control charts to
control the chemical components required by its clients. Statistical process
control (SPC) is a statistical method that monitors the stability and capability
of a single variable of a process. The most popular tools in SPC are control
charts, especially Shewhart control charts, and capability indices. In most
processes, however, there are several variables which should be controlled and
supervised simultaneously [10].
In order to extract useful information from process data and apply it to monitor
the process, multivariate statistical process control (MSPC) has been developed
[5]. Multi-way principal component analysis is one of the most popular
unsupervised methods for implementing multivariate SPC [6]. The interpretation of
MSPC control charts needs some statistical expertise and is much more complicated
than that of univariate SPC control charts, and the statistical algorithms used
for MSPC are more complex than those of SPC. Based on the reviewed literature, the
implementation of univariate SPC is much easier than that of multivariate SPC. The
comparison of univariate and multivariate statistical process control methods is
given in Table 1. Based on this comparison, the ease of use, implementation, and
interpretation make univariate SPC very popular in practice. The main weakness of
multivariate SPC control charts is that it is not possible to easily determine
which variable is out of range when the system signals that the process is out of
control; in that case, all the univariate control charts must be checked to see
which variable is outside its range [9].
By using some standard recipes, this company produces high-quality products,
except when instabilities appear in the chemical components. Off-grade products
generate considerable cost for this industry because the only way to recover them
is to reprocess them, which is very expensive and time-consuming. For this
industry, finding a method or algorithm to identify the element(s) that affect the
essential and targeted chemical alloys would change the game, lead to stable
quality, and dramatically reduce production cost.
Here we use a dataset collected from the quality department of the aforementioned
ferrotitanium company, containing the chemical results of every melt during 2018,
to investigate whether a statistical model can be developed to forecast the
behavior of the different chemical alloys before the melting process.

Table 1. Pros and cons of SPC and MSPC.

Measure | Univariate control charts | Multivariate control charts
Acceptance by industry | Very popular | Rarely used
Implementation | There are many guides and manuals; easy to apply | Lack of instruction and clear methodology; very difficult to apply
Software exigence | Not a requirement but advisable | Necessary
Relationship between variables | Not considered | Considered
Multiple variable measurements | It is hard to monitor several univariate control charts; very complicated for the quality operator | Multiple variables can be controlled on the same control chart
Significance level | Every control chart has its own significance level; very difficult to calculate | All principal components share the same significance level
Out-of-control signal interpretation | Very easy and straightforward | Interpretation is practically impossible without the use of special methods
Unit of controlled statistic | The controlled statistic is in the unit of the controlled variable | The controlled statistic has no unit
Understanding the mechanism of functioning | Very easy | Complex

Plotting these chemical alloys shows that there are correlations between some of
them, which we investigate further in this research using supervised machine
learning methods to determine the linear and non-linear correlations between the
chemical compounds. Therefore, the specific objective of this work is to determine
the relationships between the dependent and independent variables in the
ferrotitanium dataset and to form a stable algorithm for predicting these
relations in the production process of ferrotitanium.
In order to reach the goal of this research, the following experiments are
performed:
• Visualizing the collected dataset to detect potential correlations;
• Cleansing the data and removing samples generated by human errors during the
sampling steps;
• Determining the dependent (response) and independent (predictor) variables;
• Exploring robust patterns between the response and predictor variables;
• Applying multiple regression and random forests as the main supervised learning
methods;
• Training the models on the general dataset collected from the quality department
of a titanium producer.

The result is a roadmap for the quality, production, and engineering departments
of this industry for predicting a suitable combination of the different materials
that form the main recipe of the production process. The results can assist the
company in improving its operations, thereby enabling better planning and guidance
for high-quality products.

2 Research Methodology

This research starts by collecting and sorting data from the quality department of
a ferrotitanium producer and continues by applying supervised learning methods to
the collected data. The target of this research is to analyze the main predictor
variables affecting the quality of the response variables of ferrotitanium
producers. Some of these companies use the statistical process control method to
control the main chemical alloys required by their clients; they use several
control charts to control the different alloys by setting different upper control
limits (UCL) and lower control limits (LCL). Figure 1 presents the SPC control
chart of vanadium in the studied company.

Fig. 1. SPC control chart for vanadium.
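As a simplified, assumed illustration (the company's actual SPC software and constants are not described here), the centre line and control limits of such an individuals chart can be set at the mean plus or minus three standard deviations; the vanadium values below are placeholders.

    import numpy as np

    vanadium = np.array([3.1, 2.9, 3.4, 3.0, 3.2, 2.8, 3.3, 3.1])  # placeholder %V values

    centre = vanadium.mean()
    sigma = vanadium.std(ddof=1)
    ucl, lcl = centre + 3 * sigma, centre - 3 * sigma
    print(f"CL={centre:.2f}, UCL={ucl:.2f}, LCL={lcl:.2f}")

    out_of_range = vanadium[(vanadium > ucl) | (vanadium < lcl)]
    print("out-of-range samples:", out_of_range)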

Despite being easy to implement, the univariate statistical process control method
ignores the correlations between variables, and monitoring all the control charts
at the same time is very hard for quality operators. Moreover, every control chart
has its own significance level and control limits, which makes the variables
incomparable.
The summary of the research strategy is shown in Fig. 2.

Fig. 2. Research steps:
• Data collection: results for 15 chemical elements from 2018 are collected from
the QC department; dependent and independent variables are determined and
segregated.
• Applying linear regression: multiple linear regression is applied to assess the
relation between the dependent and independent variables.
• Applying random forest: using random forest, the result of the linear regression
method is analysed and the main predictive variables affecting each dependent
variable are determined.
• Result comparison: the main elements determined by multiple linear regression
are compared with the random forest results, and the calculated R-squared is
checked to see which trained model is more reliable.
• Conclusion: results are presented and new research opportunities are identified.

In this research, as shown in Fig. 2, the 2018 dataset of a ferroalloy producer in
Canada was collected for statistical process control of the production line. This
company started using the statistical process control method at the beginning of
2016 to control the main chemical alloys using SPC control charts.
The dataset generated during 2018 to control 15 different alloys of the daily
production is the most reliable data, as the company fully expanded its laboratory
machinery at the beginning of 2018 and the quality control team analyzed every
melt sample after daily calibration. Overall, 2018 was the most consistent year
for data analysis. The primary dataset was converted to a CSV (comma-separated
values) file containing 2084 rows and 17 columns.
After collecting and sorting the data from the quality department of the
ferrotitanium producer, we used the viewpoint of the quality engineers to
segregate the dependent variables from the independent ones. The dependent
variables are the chemical alloys for which limit ranges are set in all contracts.
A supervised method, multiple linear regression, is used to analyze the linear
correlation between the response and predictor variables, and the random forest
method accounts for non-linear interactions between the predictor and response
variables. In the result comparison step, we compare the main elements determined
by multiple linear regression with the random forest results. Finally, we present
the results and recommendations for future researchers in the conclusion section.

3 Data Analysis
3.1 Dependent (Response) and Independent (Predictor) Variables
In this research, after investigating all sales contracts of 2018, it was found
that there are five main alloys requested by clients, and quality officers control
them for every sample result using SPC control charts to meet the customer
requirements. In this paper, the results for vanadium, as a critical response
variable, are presented, and the results for the other four response variables are
given in the appendix.
To control the key elements required by clients, the technical department prepares
a special recipe for blending and melting the different materials to meet the
customer's requirements. This recipe is prepared according to the types of
material available in the inventory system of the company and the description, in
the job order, of each alloy's boundary coming from the sales contract. Therefore,
we consider these five main alloys as our dependent (response) variables for the
data analysis, in order to see which alloys have more effect on them. Table 2
displays the independent variables which are analyzed to find their effect on
vanadium as the response variable.

Table 2. Dependent and independent variables.


Variables | Response | Predictor
Mo |  | √
Si |  | √
Sn |  | √
Zr |  | √
Mn |  | √
Cr |  | √
V | √ |
Fe |  | √
Ni |  | √
Cu |  | √
N |  | √

3.2 Data Visualization


After depicting the dataset variables scatter plot, shown in Fig. 3, it shows that there is
a linear regression between some of the variables of the general dataset.
To analyze the correlation between dependent and independent variables, using the
supervised learning methods we analyze these relationships.

Fig. 3. Scatter plot of the relationship between predictors.

4 Supervised Learning

Statistical learning methods are divided into two general categories: supervised
and unsupervised. In supervised learning, for every observation of the predictor
measurement(s) x_i, i = 1, 2, ..., n, there is an associated answer y_i, called
the response. A model is fitted for forecasting and predicting responses, and a
simple statistical model may help to understand the relationship between
predictors and response [4].
Here, supervised learning models are built to predict the relation between the
response and the predictors using multiple regression as a linear method and
random forests as a non-linear method.

4.1 Multiple Linear Regression


When dealing with a single predictor variable, simple linear regression is a
suitable method for predicting the response. In real life, there are often several
predictors, which may affect each other. The multiple linear regression method
determines a separate slope coefficient for each predictor to quantify the
relationship between the response and the predictors [1].
We use the p-value to determine the statistical significance of each relationship,
with a threshold level of 0.05. The fitted model takes the following form for
every response chemical component:

Y = β0 + β1 mo + β2 si + β3 sn + β4 zr + β5 mn + β6 cr + β7 fe + β8 ni + β9 cu + β10 n    (1)

In Table 3, the coefficients of the predictor variables for vanadium as the
response variable are reported. For response variables with a reported R-squared
of more than 50%, the model fits the data adequately and multiple linear
regression is a suitable tool for predicting the results. On the other hand, for
response variables with an R-squared lower than 50%, the trained model cannot be
used on its own to predict the response variable on the dataset.
For vanadium, the fitted regression model takes the following form:

v = β0 + β1 mo + β2 si + β3 sn + β4 zr + β5 mn + β6 cr + β7 fe + β8 ni + β9 cu + β10 n    (2)

Table 3. Multiple linear regression results.


Predictor Estimate Std-error T-value P-value Signif. code
Mo −0.53 0.04 −11.95 <2e−16 ***
Si −0.02 0.12 −0.16 0.868205
Sn −0.43 0.13 −3.20 0.001372 **
Zr −0.20 0.05 −3.53 0.000412 ***
Mn 5.04 0.30 16.74 <2e−16 ***
Cr −0.26 0.04 −6.10 1.26e−09 ***
Fe −0.14 0.01 −29.36 <2e−16 ***
Ni 0.06 0.02 2.47 0.013476 *
Cu −0.78 0.46 −1.68 0.091619 .
N 0.50 0.07 6.33 2.98e−10 ***

The results from the fitted model on the training set are shown below.

These results1 indicate that two chemical components, si and cu, are not
significant. This means that the model can be simplified by removing si and cu. We
therefore regenerate the regression model after removing these two alloys; the
results are shown in Table 4.
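A minimal, assumed sketch of this backward-elimination step with statsmodels is given below; the file name and column names are placeholders for the company's (non-public) dataset, not the authors' actual code.

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("ferrotitanium_2018.csv")        # assumed file name
    predictors = ["mo", "si", "sn", "zr", "mn", "cr", "fe", "ni", "cu", "n"]

    X = sm.add_constant(df[predictors])
    model = sm.OLS(df["v"], X).fit()
    print(model.summary())                            # coefficients, p-values, R-squared

    # Drop predictors whose p-value exceeds 0.05 (here si and cu) and refit.
    kept = [p for p in predictors if model.pvalues[p] <= 0.05]
    model2 = sm.OLS(df["v"], sm.add_constant(df[kept])).fit()
    print(model2.rsquared_adj)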

Table 4. Final multiple linear regression results of vanadium.


Predictor Estimate Std-error T-value P-value Signif. code
Mo −0.53 0.04 −11.91 <2e−16 ***
Sn −0.44 0.13 −3.27 0.001073 **
Zr −0.20 0.05 −3.48 0.000501 ***
Mn 4.95 0.29 16.84 <2e−16 ***
Cr −0.27 0.04 −6.30 3.61e−10 ***
Fe −0.14 0.00 −29.52 <2e−16 ***
Ni 0.06 0.02 2.49 0.012749 *
N 0.52 0.07 6.64 3.93e−11 ***

After removing si and cu from the trained model, all remaining predictors are
statistically significant, as shown in Table 4; this indicates that all predictors
except si and cu affect the response. The R-squared reported by the model is
around 55%, so the model fits the data adequately. Therefore, this trained model
can be used to predict the vanadium content on the dataset. Fitted models of the
same form are obtained for the other response variables, such as the carbon
content; Table 5 summarizes the predictors retained for vanadium over the general
dataset.

Table 5. Summary of regression result over the general dataset.


Response | R-squared % | Mo | Si | Sn | Zr | Mn | Cr | Fe | Ni | Cu | N
V | 55 | √ |  | √ | √ | √ | √ | √ | √ |  | √

4.2 Random Forest


A random forest is a practical nonlinear supervised machine learning method that
uses a large number of random decision trees to examine the importance of sets of
variables. We chose the random forest as a powerful ensemble classifier to achieve
high accuracy. A random forest generates several trees in place of a single tree.
To categorize a new sample from a specified input, each tree in the forest is
provided with the same input and produces a classification; each classification
counts as a 'vote' for a class, and the classification with the maximum number of
votes is selected. Increasing the similarity between the trees increases the
forest error rate, whereas trees with lower individual error rates make the
ensemble a stronger classifier. Decreasing the number of attributes reduces both
the correlation and the strength, whereas increasing it increases both [8].

1
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.
Multiple R-squared: 0.5526, Adjusted R-squared: 0.5504.

In this section, using the random forest method, we verify the results of the
multiple linear regression with a nonlinear extension of multiple regression, and
we also identify the most important elements for predicting vanadium as the
response variable. The random forest algorithm can be used for both classification
and regression problems [1].

Table 6. Random forest results for vanadium.


%IncMSE IncNodePurity
Mo 32.8 102.5
Si 15.7 30.8
Sn 34.4 85.2
Zr 34.9 107.5
Mn 48.9 53.9
Cr 25.8 43.5
Fe 73.1 122.2
Ni 17.8 31.2
Cu 12.1 15.3
N 18.7 37.3

The results are depicted in Table 6 and indicate that the main chemical components
for predicting the vanadium content are fe, mn, zr, and sn, in that order.
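The %IncMSE and IncNodePurity columns appear to be the importance measures reported by R's randomForest package; a rough Python analogue (an assumption, not the authors' code) combines permutation importance and impurity-based importance in scikit-learn, again with placeholder file and column names.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance

    df = pd.read_csv("ferrotitanium_2018.csv")        # assumed file name
    predictors = ["mo", "si", "sn", "zr", "mn", "cr", "fe", "ni", "cu", "n"]

    rf = RandomForestRegressor(n_estimators=500, random_state=1)
    rf.fit(df[predictors], df["v"])

    perm = permutation_importance(rf, df[predictors], df["v"],
                                  n_repeats=20, random_state=1)
    report = pd.DataFrame({
        "permutation_importance": perm.importances_mean,  # analogous to %IncMSE
        "impurity_importance": rf.feature_importances_,   # analogous to IncNodePurity
    }, index=predictors).sort_values("permutation_importance", ascending=False)
    print(report)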

4.3 Linear Regression and Random Forest Result Comparison


To compare the results of the linear and nonlinear tools, we use the absolute
t-value of each predictor in the multiple linear regression to rank the first four
main elements and check whether the multiple linear regression and random forest
methods agree. The results are depicted in Table 7.

Table 7. Comparison of linear regression (LR) and random forest (RF) results for Vanadium.
Response: Vanadium
Predictor | Random forest | Linear regression
Mo |  | 3
Si |  |
Sn | 4 |
Zr | 3 |
Mn | 2 | 2
Cr |  |
Fe | 1 | 1
Ni |  |
Cu |  |
N |  | 4
Priority results: the first two items have the same priority.

5 Conclusion

In this research, we have shown the results for vanadium as one of the main
response variables; the results for the other four response variables, aluminum
(Al), carbon (C), oxygen (O), and titanium (Ti), are given in the appendix in
Tables 8, 9, 10, 11, 12 and 13. After plotting and visualizing the collected
dataset, some correlations between the variables were observed, which encouraged
us to fit multiple linear regression models to the dataset. The multiple linear
regression results show that there are linear relationships between some of the
variables. The results show that the linear regression model fits the data well
enough to predict titanium and vanadium, because the R-squared is more than 50%
for each of these models. For the other responses, namely aluminum, carbon, and
oxygen, linear regression does not fit the data effectively and is not an
efficient statistical model, because the R-squared is lower than 50%.
To find out whether there are any nonlinear relationships between the variables,
and also to prioritize the most important predictors of every response, we ran the
random forest on the general dataset. The results show that the first two main
predictors from multiple linear regression and random forest are the same for
titanium, vanadium, and oxygen, and that the first main predictor is the same for
carbon and aluminum over the general dataset. The main predictors for the titanium
content are iron and nickel, for the vanadium content iron and manganese, and for
oxygen nitrogen and silicon. Therefore, looking over the analyzed general dataset
results, we can say that linear regression and random forest agree on the first
two main predictors of the titanium content (iron and nickel), the vanadium
content (iron and manganese), and the oxygen content (nitrogen and silicon). For
the other responses, more research is needed to determine the main predictors.
Moreover, we only studied the chemical behavior of all the different products
together, under the name of the general dataset; we recommend that future
researchers continue this study for every single product, such as low vanadium,
low carbon, low oxygen, etc., to see whether the results are the same. Working on
the other products and preparing a prediction algorithm based on the analysis of
historical data is also recommended.
Based on these initial results, we are planning to apply unsupervised learning
methods to the general dataset as well; the results will be published in future
papers. The knowledge generated during this research by analyzing the historical
data of a ferrotitanium producer can be a hint for metallurgists and materials
science researchers to carry out complementary studies on the technical reasons
for each correlation found in this research.
The structure of this research can serve as a guide to be extended by future
researchers to other ferroalloy producers, providing a robust understanding of
material reactions based on historical data observations and statistical
calculations.

Appendix

Table 8. All dependent and independent variables over the general dataset.
Variables | Response | Predictor
Al | √ |
Mo |  | √
Si |  | √
Sn |  | √
Zr |  | √
Mn |  | √
Cr |  | √
V | √ |
Fe |  | √
C | √ |
Ni |  | √
Cu |  | √
N |  | √
O | √ |
Ti | √ |

Table 9. Summary of regression result over the general dataset.


Predictors R-squared Mo Si Sn Zr Mn Cr Fe Ni Cu N
Responses Ti 62% √ √ √ √ √ √ √ √ √ √
Al 23% √ √ √ √ √ √ √ √ √ √
V 55% √ √ √ √ √ √ √ √
C 12% √ √ √ √ √ √ √
O 38% √ √ √ √ √ √ √

Table 10. Comparison of linear regression (LR) and random forest (RF) results for titanium.
Response: Titanium
Predictor | Random forest | Linear regression
Mo |  |
Si |  |
Sn |  |
Zr |  |
Mn | 3 | 3
Cr |  |
Fe | 1 | 1
Ni | 2 | 2
Cu |  |
N | 4 | 4
Priority results: same priorities.

Table 11. Comparison of linear regression (LR) and random forest (RF) results for aluminum.
Response: Aluminum
Predictor | Random forest | Linear regression
Mo | 4 | 4
Si |  |
Sn |  |
Zr | 3 |
Mn |  | 2
Cr |  |
Fe | 1 | 1
Ni |  |
Cu | 2 | 3
N |  |
Priority results: not the same for the second and third items.

Table 12. Comparison of linear regression (LR) and random forest (RF) results for Carbon.
Response Carbon
Method Random forest Linear regression
Predictors Mo
Si 1 1
Sn
Zr 3
Mn
Cr 2
Fe 4
Ni
Cu 4 2
N 3
Priority results Just first item has the same priority in
both methods

Table 13. Comparison of linear regression (LR) and random forest (RF) results for Oxygen.
Response: Oxygen
Predictor | Random forest | Linear regression
Mo |  |
Si | 2 | 2
Sn |  |
Zr |  |
Mn | 3 | 4
Cr | 4 |
Fe |  | 3
Ni |  |
Cu |  |
N | 1 | 1
Priority results: the first two items are the same.

References
1. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
2. Chen, Q., Wynne, R.J., Goulding, P., Sandoz, D.: The application of principal component
analysis and kernel density estimation to enhance process monitoring. Control Eng. Pract.
8(5), 531–543 (2000)
3. Holappa, L.: Towards sustainability in ferroalloy production. J. South Afr. Inst. Min. Metall.
110(12), 703–710 (2010)
4. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning, vol.
112. Springer (2013)
5. Kano, M., Hasebe, S., Hashimoto, I., Ohno, H.: A new multivariate statistical process
monitoring method using principal component analysis. Comput. Chem. Eng. 25(7–8),
1103–1113 (2001)
6. Liu, Y.-J., André, S., Saint Cristau, L., Lagresle, S., Hannas, Z., Calvosa, É., Devos, O.,
Duponchel, L.: Multivariate statistical process control (MSPC) using Raman spectroscopy
for in-line culture cell monitoring considering time-varying batches synchronized with
correlation optimized warping (COW). Anal. Chim. Acta 952, 9–17 (2017)
7. Panigrahi, M., Paramguru, R.K., Gupta, R.C., Shibata, E., Nakamura, T.: An overview of
production of titanium and an attempt to titanium production with ferro-titanium. High
Temp. Mater. Processes (Lond.) 29(5–6), 495–514 (2011). https://doi.org/10.1515/HTMP.
2010.29.5-6.495
8. Patel, S.V., Jokhakar, V.N.: A random forest based machine learning approach for mild steel
defect diagnosis. In: 2016 IEEE International Conference on Computational Intelligence and
Computing Research (ICCIC), pp. 1–8. IEEE (2016)
9. Rogalewicz, M.: Some notes on multivariate statistical process control. Manag. Prod. Eng.
Rev. 3(4), 80–86 (2012)
10. Rogalewicz, M., Poznańska, P.: The methodology of controlling manufacturing processes
with the use of multivariate statistical process control tools. J. Trends Dev. Mach. Assoc.
Technol. 17(1), 89–93 (2013)
Assessing the Effectiveness of Topic
Modeling Algorithms in Discovering
Generic Label with Description

Shadikur Rahman, Syeda Sumbul Hossain(B) , Md. Shohel Arman,


Lamisha Rawshan, Tapushe Rabaya Toma, Fatama Binta Rafiq,
and Khalid Been Md. Badruzzaman

Daffodil International University, Dhaka, Bangladesh


{shadikur35-988,syeda.swe,arman.swe,lamisha.swe,toma.swe,
fatama.swe}@diu.edu.bd, khalid@daffodilvarsity.edu.bd

Abstract. Analyzing short texts or documents using topic modeling has become a
popular solution for the increasing number of documents produced in everyday life.
For handling large collections of documents, many topic modeling algorithms are
used, e.g. LDA, LSI, pLSI, and NMF. In this study, we use LDA, LSI, and NMF,
together with the lexical database WordNet and its synsets, to obtain candidate
labels for topic labeling, and we then compare the effectiveness of these topic
modeling algorithms on short documents. Among them, LDA gives the best result in
terms of WUP similarity. This study will help in selecting the proper algorithm
for labeling topics so that the meaning of topics can be identified easily.

Keywords: Topic modeling · LDA · NMF · LSI · Topic labeling

1 Introduction

With the ever increasing number of electronic documents, it is becoming a
challenging job to pull out the exact topics without reading the whole documents.
Topic modeling is employed to extract the latent topics of documents, giving
readers an advantage in finding out what is occurring in a document. But for a
polynomial topic, it is terribly hard for a user to grasp its essence; if it is
properly labeled, it becomes far easier to understand.
There are several algorithms for topic modeling; the most popular are LDA [4],
LSA [7], NMF [13], pLSA [8], and HDP [17]. These algorithms have already been used
by others for generating the exact label of a topic. Hence, there are few studies
that figure out the best models for generating the most effective labels for
topics. A probabilistic approach is presented in [14] to generate effective labels
and discover topic models automatically, where the models can be learned through
pLSA and LDA. For measuring the semantic meaning of inferred topics, a
quantitative method is presented in [6]. Some other works are discussed in
Sect. 2.

In our previous study [9], we proposed a model to find a generic label for the
polynomial topics of text documents, in which only the LDA model was used to
generate the topic models. Correspondingly, here we use the same model for
generating the generic label, but train it with LDA, LSI, and NMF. We run these
models over short text documents and measure the WUP similarity of each label. By
comparing the WUP similarities of the models, we come to the conclusion that the
exact labels can be found using the LDA model. Although we carried out this
experiment on short documents, it can also be done on large texts or document
collections.
The rest of the paper is organized as follows: related work is discussed in
Sect. 2, followed by the research methodology and the results and discussion in
Sects. 3 and 4, respectively. Finally, Sect. 5 concludes the paper.

2 Related Work
The task of topic labelling was introduced by [14] for LDA topics as an
unsupervised approach. The authors of [12] presented an approach for topic
labelling: first, they generated a set of candidate labels from the top-ranking
topic terms, the titles of Wikipedia articles containing the top-ranking topic
terms, and sub-phrases extracted from the Wikipedia article titles; then they
ranked the label candidates using a combination of association measures, lexical
features and an information retrieval feature. Natural language processing (NLP)
is also used for topic modeling and labeling. Using Word2Vec word embeddings, [1]
labeled online news considering the article terms and comments. The authors of
[16] proposed a topic2vec approach for representing topics with words. Using
Wikipedia document titles as label candidates, the authors of [3] presented an
approach to select the most relevant labels for topics by computing neural
embeddings for documents and words. A graph-based method for topic labelling was
developed in [10] using the structured data exposed by DBpedia. In [2], the
authors introduced a framework which applies summarization algorithms to generate
topic labels. Using embeddings and letter trigrams, a novel method for topic
labelling was proposed by [11].
Although different approaches have been proposed in several studies, to our
knowledge there is no evidence assessing which models generate the best topic
labels.

3 Research Methodology

The overall process of our research activities is described in this section. First
of all, we chose our dataset1. For the dataset, we selected some online documents
on which to perform our research. To do so, we follow several steps: step-by-step
pre-processing, noun phrase choosing,
1
https://github.com/sadirahman/Effectiveness-of-Topic-ModelingAlgorithms-in-
Discovering-Generic-Label-withDescription.

N-gram generation, model training, label processing, and label description with
the help of WordNet synsets. We then find the topic labels based on the clustering
results of our topic models and compare three topic representations: (1) the
clustering result, (2) the topic labels, and (3) the label descriptions. Figure 1
shows an overview of our research experiment.

Fig. 1. Research Methodology

3.1 Text Pre-processing

Data originating from various sources have different characteristics. In this
research, we used online articles. In most cases, text data is not completely
clean, so text pre-processing is needed. The pre-processing follows several steps:
tokenization, stop word removal, lemmatization, and punctuation removal.

Text Tokenization: Tokenization is the method of splitting the provided text into
smaller portions called tokens. Words, numbers, punctuation marks, and other units
can be recognized as tokens.

Stop Words Removing: Stop words are the most common words in a language, such as
"about", "an", "the", "is", and "into". These words do not carry important meaning
and are removed from the text documents of our dataset.

Lemmatizing: Lemmatizing is the process of reducing words to their lemma (base or
root form), for example roads → road, loved → love.

Removing Punctuation: Removing punctuation is needed when punctuation marks are
not relevant to the text corpus. Normally, regular expressions are used to remove
the set of punctuation symbols.
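A hedged NLTK-based sketch of this pre-processing pipeline is given below; the paper does not show its exact implementation, so treat this only as an illustration.

    import string
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    # requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")

    def preprocess(text):
        tokens = word_tokenize(text.lower())                          # tokenization
        tokens = [t for t in tokens if t not in string.punctuation]   # remove punctuation
        stops = set(stopwords.words("english"))
        tokens = [t for t in tokens if t not in stops]                # remove stop words
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(t) for t in tokens]              # lemmatizing

    print(preprocess("The roads were loved by the travellers."))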

3.2 Noun Phrase Choosing

After pre-processing, we keep only the nouns and proper nouns from the
pre-processed result. With this approach, the topic is represented by the top
nouns with the largest frequency in the text corpus. For noun phrase choosing, the
tokenized text is tagged with parts of speech such as NN (noun), NNP (proper
noun), VB (verb) and JJ (adjective); the parts-of-speech (POS) tagging is done
before lemmatizing and stop word removal, and the stop words are removed after POS
tagging. In the final stage, the words, together with their tags and counts, are
put in a hash table, and the most frequent nouns are taken from it to create a
heading for a text. The result of the noun phrase words is presented in Table 1.

Table 1. Result of an Noun Phrase words

POS tagging words Only Noun words


Way, generally, achieve Way, travel, transportation
Travel, transportation Transportation
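An illustrative re-implementation of this noun selection step with NLTK POS tagging (assumed, not the authors' code); the token list is a placeholder.

    import nltk
    from collections import Counter
    # requires: nltk.download("averaged_perceptron_tagger")

    def top_nouns(tokens, n=5):
        tagged = nltk.pos_tag(tokens)                     # e.g. ('road', 'NN')
        nouns = [w for w, tag in tagged if tag in ("NN", "NNS", "NNP", "NNPS")]
        return [w for w, _ in Counter(nouns).most_common(n)]

    tokens = ["road", "travel", "transportation", "road", "jam", "move"]
    print(top_nouns(tokens))   # the most frequent nouns form the topic heading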

3.3 N-Gram

An N-gram is a sequence of N words, and an N-gram model computes p(w|h), the
probability of a word w given a history h [5]. We use an N-gram model with the
same training text and held-out data as the word-based natural language processing
model discussed above. The purpose of using N-grams is to maintain the sequence of
candidate labels in our training datasets. For example, consider the text "Please
enter your document"; its 2-gram and 3-gram words are presented in Table 2.

Table 2. Process of an N-gram words

2-gram (bigram) 3-gram (trigram)


“Please enter” “Please enter your”
“Enter your” “Enter Your document”
“Your document”
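A small sketch of this bigram/trigram generation, assuming NLTK's ngrams helper (the paper does not state which implementation was used):

    from nltk.util import ngrams

    tokens = "Please enter your document".split()
    bigrams = [" ".join(g) for g in ngrams(tokens, 2)]
    trigrams = [" ".join(g) for g in ngrams(tokens, 3)]
    print(bigrams)   # ['Please enter', 'enter your', 'your document']
    print(trigrams)  # ['Please enter your', 'enter your document']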

3.4 Training Model


In this research, we have used most popular topic modeling algorithms e.g. LSI,
LDA, and NMF. For training our models, first we need to build corpus data
set with documents. Every topics model are based on the same basic imperson-
ation: each document consists of a mixture of topics and each topic consists of a
collection of words. As a result, the objective of topic modeling algorithm is to
uncover these latent topics variables that shape the meaning of our document
and training corpus data. Figure 2 shows an overview of our research training
models.
Latent Dirichlet Analysis is a probabilistic model, and to obtain cluster
assignments. LDA utilizes dirichlet priors for the document-topic and word-topic
distributions it utilizes two probability values: P(word—topics) and P(topics—
documents) [4].
To perform a LDA model begins by defining the number of ‘topics’ that are
begun in our set of training documents. Now we will present the model output
below: here we have chosen the number of topics = 3 and Number of Words = 3.
We have also set the random state = 1 which is enough for our text documents.
Table 3 presents the result of training LDA model after executing Document 2.

Table 3. Result of an example Document 2 in LDA model

Topic 1 Topic 2 Topic 3


0.046 * “sugar” 0.085 * “father” 0.066 * “driving”
0.046 * “sister” 0.082 * “sugar” 0.062 * “pressure”
0.046 * “health” 0.081 * “sister” 0.062 * “may”
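A hedged gensim sketch of this LDA training step (num_topics = 3, three words per topic, random_state = 1, as stated above); the documents here are placeholders for the pre-processed articles.

    from gensim import corpora
    from gensim.models import LdaModel

    docs = [["sugar", "sister", "father", "health"],
            ["driving", "pressure", "stress", "sugar"],
            ["father", "sister", "sugar", "may"]]

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, random_state=1)
    for topic_id, words in lda.show_topics(num_topics=3, num_words=3, formatted=False):
        print(topic_id, words)    # (word, weight) pairs, as in Table 3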

LSA is one of the fundamental techniques in topic modeling. LSA models typically
substitute raw counts in the document-term matrix with tf-idf scores [7]. Here we
chose the number of topics = 3 and the number of words = 3, and also set
onepass = true, which is sufficient for our documents. Table 4 presents the result
of training the LSA model after executing Document 2.

Table 4. Result of an example Document 2 in LSI model

Topic 1 Topic 2 Topic 3


0.426 * “sister” 0.534 * “sugar” 0.363 * “driving”
0.426 * “father” 0.298 * “pressure” 0.224 * “may”
0.261 * “sugar” 0.207 * “bad” 0.224 * “cause”

Fig. 2. Train model

NMF is a linear-algebraic model that factors high-dimensional vectors into a
low-dimensional representation: given a non-negative matrix V, it finds (usually)
two non-negative matrices W and H such that V = WH [13]. Here we chose the number
of topics = 3 and the number of words = 3. Table 5 presents the result of training
the NMF model after executing Document 2.

Table 5. Result of an example Document 2 in NMF model

Topic 1 Topic 2 Topic 3


Time Say Stress
Father Good Suggest
Sister Sugar Increased
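An assumed scikit-learn sketch of the factorisation V ≈ WH applied to a tf-idf matrix of placeholder documents (the authors' exact NMF pipeline is not given in the paper):

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["father says sugar is bad for my sister",
            "driving under pressure may increase stress",
            "sugar and stress suggest increased pressure"]

    tfidf = TfidfVectorizer()
    V = tfidf.fit_transform(docs)                 # documents x terms

    nmf = NMF(n_components=3, init="nndsvd", random_state=1)
    W = nmf.fit_transform(V)                      # documents x topics
    H = nmf.components_                           # topics x terms

    terms = tfidf.get_feature_names_out()
    for topic_id, row in enumerate(H):
        top = row.argsort()[::-1][:3]
        print(topic_id, [terms[i] for i in top])  # top three words per topic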

3.5 Label Processing

In this section, after training the models, we take the topics of each document
and keep only the top-weighted word of each topic, since it is the most
representative of its topic set. We then look up the semantic definitions of that
word in WordNet, the lexical database of English parts of speech [15], and choose
the definition that is most appropriate among its senses. After pre-processing the
chosen definition, we prepare the candidate labels, and each candidate label is
compared with the main topic word using the WUP [18] measure of conceptual
semantic relatedness. Figure 3 shows the WUP similarity process used for labeling
our topics.
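A minimal WordNet/WUP sketch of this label-selection step with NLTK (an assumed implementation detail; for simplicity it takes the first noun synset of each word):

    from nltk.corpus import wordnet as wn
    # requires: nltk.download("wordnet")

    def wup(word_a, word_b):
        sa = wn.synsets(word_a, pos=wn.NOUN)
        sb = wn.synsets(word_b, pos=wn.NOUN)
        if not sa or not sb:
            return 0.0
        return sa[0].wup_similarity(sb[0]) or 0.0

    topic_word = "road"
    candidates = ["way", "travel", "transportation"]
    scores = {c: wup(topic_word, c) for c in candidates}
    label = max(scores, key=scores.get)
    print(scores, "->", label)

    # The chosen label's description and example come from the same synset:
    syn = wn.synsets(label, pos=wn.NOUN)[0]
    print(syn.definition(), "|", syn.examples())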

3.6 Label Description


After selecting the label, we put a description based on topic label from WordNet
synset and also a synset example. Using the description and example, readers
can easily understand the each topic words clearly. Figure 4 shows the process
of finding label description.

4 Results and Discussion

In this section, we discuss the overall results of our research. We selected the
top three words for each model and identified the top word with the highest
weight. We then obtained the description of the top-weighted word from the WordNet
synsets, pre-processed the discovered description, and chose the candidate labels
of each description using noun phrases and N-grams.
Finally, we chose the label by comparing the WUP similarity between each candidate
label and the top-weighted word. Tables 6, 7, and 8 show the details of selecting
the labels of three documents using LSI, LDA and NMF, respectively.
For choosing the final label of each topic, the WUP similarity values are used.
The WUP similarity values are shown in Tables 9, 10 and 11 for the LSI, LDA and
NMF models, respectively.

Fig. 3. Labeling processing

Fig. 4. Label description



Table 6. Label of selected 3 documents with LSI

Document(s) | Topics | Topic word | Top weighted word | Candidate labels | Label
D1 | Topic 1 | Traffic, causes, jam | Traffic | Aggregation, thing, vehicle, locality, period, time | Aggregation
D1 | Topic 2 | Vehicle, road, traffic | Vehicles | Conveyance, transport, people | Transport
D1 | Topic 3 | Road, move, jam | Road | Way, travel, transportation | Transportation
D2 | Topic 1 | Sister, father, health | Sister | Person, parent, carbohydrate | Person
D2 | Topic 2 | Sugar, pressure, bad | Sugar | Crystalline carbohydrate | Carbohydrate
D2 | Topic 3 | Drive, stress, increase | Drive | Mechanism, action, power | Action
D3 | Topic 1 | Drug, addiction, without | Drug | Substance, medicine | Substance
D3 | Topic 2 | Feel, like, time | Feel | Situation, awareness | Awareness
D3 | Topic 3 | Like, use, time | Like | Chance, case, prospect | Case

Table 7. Label of selected 3 documents with LDA

Document(s) | Topics | Topic word | Top weighted word | Candidate labels | Label
D1 | Topic 1 | Road, jam, traffic | Road | Way, travel, transportation | Transportation
D1 | Topic 2 | Traffic, rule, cause | Traffic | Aggregation, thing, vehicle, locality, period, time | Aggregation
D1 | Topic 3 | Vehicle, cause, jam | Vehicle | Conveyance, transport, people | Transport
D2 | Topic 1 | Sugar, sister, health | Sugar | Crystalline, carbohydrate | Carbohydrate
D2 | Topic 2 | Father, sugar, sister | Father | Parent, founder, family | Parent
D2 | Topic 3 | Driving, pressure, may | Driving | Mechanism, action, power | Action
D3 | Topic 1 | Addiction, drug, persistent | Addiction | Craving, drug | Craving
D3 | Topic 2 | Chore, even, consume | Chore | Spending, eating | Spending
D3 | Topic 3 | Feel, drug, addiction | Feel | Situation, awareness | Awareness

Table 8. Label of selected 3 documents with NMF

Document(s) | Topics  | Topic words                | Top weighted word | Candidate labels                                     | Label
D1          | Topic 1 | Traffic, causes, jam       | Traffic           | Aggregation, thing, vehicle, locality, period, time  | Aggregation
D1          | Topic 2 | Vehicle, road, traffic     | Vehicles          | Conveyance, transport, people                        | Transport
D1          | Topic 3 | Road, move, jam            | Road              | Way, travel, transportation                          | Transportation
D2          | Topic 1 | Feel, father, sister       | Feel              | Situation, awareness                                 | Awareness
D2          | Topic 2 | Say, good, sugar           | Say               | Speak, chance, express, word                         | Chance
D2          | Topic 3 | Stress, suggest, increased | Stress            | Prominence, note, pitch                              | Note
D3          | Topic 1 | Drug, persistent, addiction| Drug              | Substance, medicine                                  | Substance
D3          | Topic 2 | Feeling, chores, family    | Feeling           | State, idea, confidence                              | Confidence
D3          | Topic 3 | Like, longer, use          | Like              | Chance, case, prospect                               | Case

Table 9. WUP similarity between topic and label with LSI

Document(s) Topics Label WUP similarity Average WUP


Document 1 Traffic Aggregation 0.88 0.84
Vehicle Transport 0.93
Road Transportation 0.71
Document 2 Sister Person 0.66 0.57
Sugar Carbohydrate 0.31
Drive Action 0.76
Document 3 Drug Substance 0.60 0.63
Feel Awareness 0.94
Like Case 0.37

After choosing the topic labels, we attach a description based on each topic label
from its WordNet synset, together with the synset definition and a synset example.
Table 12 shows the label word descriptions with example sentences.
After obtaining the WUP similarity values of each label, we average the WUP
similarity values per document. Table 13 shows that LDA gives the best accuracy
(71%) compared with the other two models.

Table 10. WUP similarity between topic and label with LDA

Document(s) Topics Label WUP similarity Average WUP


Document 1 Road Transportation 0.714 0.84
Traffic Aggregation 0.888
Vehicle Transport 0.933
Document 2 Sugar Carbohydrate 0.31 0.64
Father Parent 0.96
Driving Action 0.66
Document 3 Addiction Craving 0.57 0.67
Chore Spending 0.52
Feel Awareness 0.94

Table 11. WUP similarity between topic and label with NMF

Document(s) Topics Label WUP similarity Average WUP


Document 1 Traffic Preserve 0.30 0.48
Cause Justification 0.40
Patients Person 0.75
Document 2 Feel Awareness 0.94 0.71
Say Chance 0.93
Stress Note 0.26
Document 3 Drug Substance 0.60 0.51
Feeling Confidence 0.57
Like Case 0.37

Table 12. Result of labels description with examples

Label word | Synset definition                                             | Synset example
Preserve   | A domain that seems to be specially reserved for someone      | Medicine is no longer a male preserve
Person     | A human being                                                  | There was too much for one person to do
Substance  | The real physical matter of which a person or thing consists  | DNA is the substance of our genes
Case       | An occurrence of something                                     | It was a case of bad judgment

Table 13. WUP similarities difference among models

Models Document sets Average WUP Total average


LDA Document 1 0.84 0.71
Document 2 0.64
Document 3 0.67
LSI Document 1 0.84 0.68
Document 2 0.57
Document 3 0.63
NMF Document 1 0.48 0.56
Document 2 0.71
Document 3 0.51

5 Conclusion
The main objective of this research is to find the most relevant topic label together
with a description. We used LDA, LSI and NMF to train our models. For each model,
we determined the top three words and selected the top-weighted word. Using
WordNet synsets from the lexical database, we obtained the description of each
selected top word and generated candidate labels for it. By comparing the WUP
similarities between the candidate labels and the top words, we selected the most
appropriate label for each topic. After analyzing the WUP similarities of the
models, we found that LDA gives the most accurate topic labels. This work will help
others to choose the best model when labeling topics. The research can be extended
to find topics and labels for large documents and can also be adapted toward more
optimized solutions.

References
1. Aker, A., Paramita, M., Kurtic, E., Funk, A., Barker, E., Hepple, M., Gaizauskas,
R.: Automatic label generation for news comment clusters. In: Proceedings of the
9th International Natural Language Generation Conference, pp. 61–69 (2016)
2. Basave, A.E.C., He, Y., Xu, R.: Automatic labelling of topic models learned from
twitter by summarisation. In: Proceedings of the 52nd Annual Meeting of the
Association for Computational Linguistics (Volume 2: Short Papers), pp. 618–624
(2014)
3. Bhatia, S., Lau, J.H., Baldwin, T.: Automatic labelling of topics with neural
embeddings. arXiv preprint arXiv:1612.05340 (2016)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
5. Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based
n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992)
6. Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., Blei, D.M.: Reading tea
leaves: how humans interpret topic models. In: Advances in Neural Information
Processing Systems, pp. 288–296 (2009)

7. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Index-
ing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
8. Hofmann, T.: Probabilistic latent semantic analysis. In: Proceedings of the Fif-
teenth Conference on Uncertainty in Artificial Intelligence, pp. 289–296. Morgan
Kaufmann Publishers Inc. (1999)
9. Hossain, S.S., Ul-Hassan, R., Rahman, S.: Polynomial topic distribution with topic
modeling for generic labeling. In: Communications in Computer and Information
Science, vol. 1046, pp. 413–419. Springer (2019)
10. Hulpus, I., Hayes, C., Karnstedt, M., Greene, D.: Unsupervised graph-based topic
labelling using DBPedia. In: Proceedings of the Sixth ACM International Confer-
ence on Web Search and Data Mining, pp. 465–474. ACM (2013)
11. Kou, W., Li, F., Baldwin, T.: Automatic labelling of topic models using word
vectors and letter trigram vectors. In: AIRS, pp. 253–264. Springer (2015)
12. Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic
models. In: Proceedings of the 49th Annual Meeting of the Association for Com-
putational Linguistics: Human Language Technologies, vol. 1, pp. 1536–1545. Asso-
ciation for Computational Linguistics (2011)
13. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix fac-
torization. Nature 401(6755), 788 (1999)
14. Mei, Q., Shen, X., Zhai, C.X.: Automatic labeling of multinomial topic models. In:
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 490–499. ACM (2007)
15. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11),
39–41 (1995)
16. Niu, L., Dai, X., Zhang, J., Chen, J.: Topic2Vec: learning distributed representa-
tions of topics. In: 2015 International Conference on Asian Language Processing
(IALP), pp. 193–196. IEEE (2015)
17. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Sharing clusters among related
groups: hierarchical Dirichlet processes. In: Advances in Neural Information Pro-
cessing Systems, pp. 1385–1392 (2005)
18. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the
32nd Annual Meeting on Association for Computational Linguistics, pp. 133–138.
Association for Computational Linguistics (1994)
BeagleTM: An Adaptable Text Mining
Method for Relationship Discovery
in Literature

Oliver Bonham-Carter

Department of Computer Science, Allegheny College,


520 N. Main Street, Meadville, PA 16335, USA
obonhamcarter@allegheny.edu
http://www.cs.allegheny.edu/sites/obonhamcarter/

Abstract. Investigators in bioinformatics are often confronted with the


difficult task of connecting ideas, which are found scattered around the
literature, using robust keyword searches. It is often customary to iden-
tify only a few keywords in a research article to facilitate search algo-
rithms, which is usually completed in absence of a general approach that
would serve to index all possible keywords of an article’s characteristic
attributes. Based on only a handful of keywords, articles are therefore
prioritized by search algorithms that point investigators to only a partial
subset of the available knowledge. In addition, many articles escape
algorithmic search strategies because their keywords are vague or have
become unfashionable terms. In this case, the article, as well as its source
of knowledge, may be lost to the community. Owing to the growing size
of the literature, we introduce a text mining method and tool (BeagleTM)
for knowledge harvesting from papers in a literature corpus
without the use of article meta-data. Unlike other text mining tools that
only highlight found keywords in articles, our method allows users to
visually ascertain which keywords have been featured in studies together
with others in peer-reviewed work. Drawing from an arbitrarily-sized
corpus, BeagleTM creates visual networks describing interrelationships
between user-defined terms to facilitate the discovery of connected or par-
allel studies. We report the effectiveness of BeagleTM by illustrating its
ability to connect the keywords from types of PTMs (post-translational
modifications), stress-factors, and disorders together according to their
relationships. These relationships facilitate the discovery of connected
studies, which is often challenging to determine due to the frequently
unrelated keywords that were tied to relevant articles containing this
type of information.

Keywords: Text mining · Literature analysis · Relationship


networks · Relationship models

1 Introduction
When performing a literature review using search algorithms, locating articles
is difficult due to the obscure nature of the keywords required. For any project,
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 237–256, 2020.
https://doi.org/10.1007/978-3-030-39442-4_19

one must supply the literature’s search engines with keywords that are closely
associated to one’s research. Correct keywords are carefully selected to isolate
relevant works, however, there are three general problems inherent to locating
knowledge based on the non-uniform links that have been provided by the diverse
authors of the articles.
The first major problem is that the terms for the particular desired information
may not follow a popular convention of keyword naming. Some authors choose
words which are no longer current in their fields, so their articles are found
alongside other non-contemporary research due to an antiquated use of
language. In other cases, authors of seminal articles may
invent their own terms to describe the details of their work. This implies that
one must know an exact term or a particular usage of word(s), according to spe-
cific authors, to locate their articles. Simultaneously, other research teams may
be working along similar research themes, yet use entirely different keywords to
put their work in a scientific context. Therefore, when locating articles across
different researchers, multiple sets of specific keywords must be applied to search
engines to retrieve articles from a particular area of research.
The second general problem concerns the growth that many research areas
enjoy as a result of their popularity. As an area of research evolves, some of
its terminology, including keywords, may gradually be replaced by others as a
consequence. The natural evolution of a research field may cause disconnections
between current and former work, and to continue to locate new developments
one must be knowledgeable of the modernized keywords, in addition to the for-
mer ones. In a single field of research, we already see a widening gap between
different generations of knowledge as a function of its evolution: if former
keywords are queried by search engines, then one may find only the former
research while the latest literature remains undiscovered. In Fig. 1, we use
terms from network research to illustrate the phenomenon caused by keyword
obsolescence, as a result of field evolution.
The third general problem of searching for articles by keywords is that
although particular knowledge is likely to exist in the literature, specific insights
may be obtained in articles of completely alternative keywords. For instance,
a particular fact or detail may be briefly mentioned in one or several articles,
for which the associated keywords are irrelevant to one’s research interests. The
discovery of knowledge in seemingly unrelated articles is occasional and unpre-
dictable. Often, the researcher may read numerous articles of diverse keywords
to discover pieces of valuable knowledge. One becomes familiar with the articles
of many types of alternative research, from which to derive header and footnote
knowledge to weave together into parts of one’s informed literature review.
In bioinformatics, the keywords of articles often do not describe the total
wealth of information contained within the article. For instance, a gene or a pro-
tein may be often included in many types of articles but there may be no formal
keyword(s) to declare that they have been included in a particular article. This
lack of information necessitates one to be familiar with many types of articles,
in addition to those which directly concern one’s field of research. For instance,

Fig. 1. Forgotten networks: The drifting and replacement of keywords in research


fields may leave articles “locked” outside of the reach of the general research commu-
nity. Often, articles of non-contemporary research are still relevant to the field, yet are
associated with keywords which no longer carry the same meaning for contemporary
investigators. For example, shown are the keywords concerning a large, popular, com-
puter network whose terms have changed several times during the evolution of its own
study. We note that, “the Information Superhighway” (popular during the 1990’s) and
“the Cloud” (popular during the 2010’s) and the “Internet of Things” (popular as of
this writing) still share much of the same foundational knowledge but may be confused
as different areas of research.

an investigator in bioinformatics would have to include the entire corpus of the


supporting disciplines (i.e., biology, computer science, mathematics and others),
in order to have an opportunity of discovering discussion of an interesting gene
or protein. Unfortunately, finding relevant knowledge in the articles is frustrated
by the fact that the name of the gene or protein may never appear in article’s
keyword rubric or metadata.
Ontologies [1] have gained much popularity as they are able to bridge gaps
between alternative concepts. However, convenient as they may be for connecting
articles in the literature, the searching for knowledge is still likely to depend on
author-selected keywords. These keywords may be poorly selected, and/or dis-
used by their fields, resulting in the loss of relevant articles in literature reviews.
During the evolution of research in bioinformatics, as well as other disciplines,
new keywords are constantly being created, which causes further searching dif-
ficulties.
The body of published work in biomedical research, which makes up seemingly
all biomedical knowledge, is written by people who apply diverse styles and
language, is highly unstructured, and is expanding at an astounding rate. During
this expansion, these alternative styles introduce noise into the literature. In
the health sciences, biomedical knowledge may be located in noisy sets
of data and it is only by the use of computer-driven technology that meaningful
information may be harvested.
To help them locate and access the knowledge hidden in a corpus, researchers
turn to refining and exploring text mining algorithms [2] and the associated
techniques and tools [3, 4]. Text mining algorithms and tools have also been
developed for specific types of research in bioinformatics, as discussed in [5].

In this article, we present a text mining framework and method to assist inves-
tigators to locate articles, while countering the annoyances caused by the three
general problems discussed above that would otherwise persist to inhibit knowl-
edge discovery. We have applied this method to develop and create a tool called
BeagleTM which performs text mining for researchers without being limited to
the sometimes cryptic keywords of article meta-data. Our method processes the
abstracts sections of articles, provided by the PubMed, to locate information
that may be aggregated with that of other articles to infer relationships which
are outputted as visual networks. Also in this article, we provide the details of
our method and discuss how selected keywords are used to drive the network cre-
ation system. Finally, we explain how to read the resulting relationship networks
to obtain knowledge of the connections from the processed articles.

1.1 Text Processing

The text of an article’s abstract relates the goals and context of the information
in an article. Since there is limited space in abstract sections, this text is often
written exactly and unambiguously, and it is likely to be a better source of
information than what is inferred using keywords alone. Our approach is to
process the abstracts of the corpus articles to connect the words and concepts
to those of abstracts in other articles. The user inputs chosen terms into our
algorithm, which are then used to create focus points during the process to
create customized networks that describe relationships.
When multiple keywords are found together in the same abstract, one may
assume that the keywords have some common thread that runs between them to
connect them. In this article, we propose a text mining method called BeagleTM
which permits the investigator to locate the common threads of multiple key-
words found in article abstracts of PubMed literature. Furthermore, our method
creates relationship networks (i.e., graphical visualizations) of keywords to visu-
ally describe how they are associated to each other according to peer-reviewed
articles of the literature. In Sect. 1.2 we discuss our approach to text mining and
our method’s ability link terms.
Text mining tools have been successfully applied to extract information for
convenient use (text summarization, document retrieval), assess document simi-
larity (document clustering, key-phrase identification), extract structured infor-
mation (entity extraction, information extraction) [6], and social media
information extraction [7]. Additionally, text mining tools also exist as plug-ins
or libraries for programming languages, such as TM [8] and Rattle [9], and as
on-line tools such as [10].
While text mining tools fulfill important needs for the bioinformatics com-
munity, they are generally hosted by web sites and their automation in pipelines
may be problematic. Furthermore, many of these tools find specific details from
particular articles and do not infer associations between search terms. Tools such
as PubTator [11], PIE The Search [12], Meshable [13] are useful for bringing arti-
cles to the attention of the investigator where keywords have been highlighted,

however in this task, automation is bottle-necked because the researcher must


manually process the results.

1.2 A Text Mining Approach by BeagleTM

It is already clear that a major challenge in bioinformatics includes the manage-


ment of large volumes of data [14]. Text mining methods that simply highlight
keywords is, therefore, not likely to be fully beneficial. The tool of our method
was written in Python and is able to handle a corpus containing an arbitrary
number of articles since BeagleTM processes each article separately. BeagleTM
processes an entire corpus of articles available for download from PubMed [15]
(maintained by the National Library of Medicine) to track and group articles
where specific user terms are of interest. Although our method and tool provides
graphical representations of the keyword relationships, we stress that the details
regarding the causality shown by relationships must be further explored by the
researcher.
Our method centers around the notion that connected terms likely indi-
cate some form of shared context. Shown in Fig. 2 is a link between two terms
(in a model) indicating that a peer-reviewed study exists to connect them. In
this example, both keywords were found in the same abstract to suggest that
they share some common context. Using networks to describe relationships,
researchers may determine connections between keywords across disjoint arti-
cles to gain knowledge of context.
In Fig. 3, we illustrate that two keywords (nodes) are related (described by an
edge), according to the literature. The determination of how keywords are related
to others is shown visually in a plot that we call a relationship network. In this
example, all three keywords relating to PTMs (post-translational modifications),
stresses and the types of proteins which are featured in the same studies are
illustrated. Here we note that these terms were chosen for this discussion since
they are often featured in articles, yet their discussions and details are elusive
since these terms are not generally mentioned as keywords in articles where they
play prominent roles.

Fig. 2. Existing knowledge in the peer-reviewed literature: Two terms which


share an edge signify that there exists a peer-reviewed article to corroborate their
association by some scientific pursuit. Here, the edge symbolizes that both keywords
were found in the same abstract to suggest that they likely share some common context.

Fig. 3. Building networks: A summary of how BeagleTM builds relationship net-


works from text mining articles across the literature. The terms, PTM, stress and
(associated) proteins are often of interest in research but their presence is seldom
announced by the keywords of the articles in which they are featured.

1.3 Relationship Models


A relationship model in our work is an overarching summary of an article’s
contents using visual cues to show how the keywords (according to one’s selec-
tion) are connected to others (also user-selected) across the articles of the litera-
ture. Although these models could be used to describe the associations between
any types of user-selected keywords, we give an example in Fig. 3 of the rela-
tionships between proteins, stresses and PTMs, according to scientific citations
featured by NCBI’s PubMed server (https://www.ncbi.nlm.nih.gov/pubmed/).
The plots resemble network-models where their information has been taken from
the abstracts of articles and show that particular proteins have been connected
to types of stresses and PTMs. In other words, the occurrences of keywords in
the articles, as well as their relationships to other user-specified terms (shown
by edges) are displayed using a network model. These graphics illustrate links
between keywords to provide investigators with a way to quickly determine how
elements of their projects are acquainted in a guilt-by-association manner.
Furthermore, we note that a keyword which is connected to another by an edge
symbolizes a study in which both keywords have been found. Any edges out to
other keywords suggest that the studies may themselves be related in regard to
the context of their work.
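
To make the relationship-model idea concrete, the sketch below builds a small network of this kind with NetworkX from hypothetical (PMID, keyword) records; the PMIDs and keywords shown are placeholders for illustration only, not results from the paper.

```python
# Sketch: build a relationship network from (pmid, keyword) records.
# The records below are made-up placeholders.
from itertools import combinations
import networkx as nx

records = {
    "PMID:0000001": ["SOD1", "oxidative stress", "Parkinson's"],
    "PMID:0000002": ["SOD1", "acetylation"],
}

G = nx.Graph()
for pmid, keywords in records.items():
    G.add_node(pmid, kind="article")
    for kw in keywords:
        G.add_node(kw, kind="keyword")
        G.add_edge(pmid, kw)                 # keyword appears in this abstract
    for kw_a, kw_b in combinations(keywords, 2):
        G.add_edge(kw_a, kw_b, pmid=pmid)    # co-occurrence edge within a clique

print(list(G.edges(data=True)))
# nx.draw(G, with_labels=True) would render the relationship network.
```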

Links Between Alzheimer’s Disease, Tau Proteins and PTMs. To


describe our method, we used keywords from the literature to create relationship
networks manually. In the work of Marcelli et al. [16] the authors set a stage for a
discussion of age-related neurodegenerative disorders (i.e., Alzheimer’s disease)
where the primary actors are a set of proteins (i.e., APP, Aβ, Tau and BACE1 )
which are involved with the ailment. In their article, the interaction between
the proteins, a set of PTMs (i.e., ubiquitination, phosphorylation, SUMOyla-
tion, acetylation and nitrosylation) and stresses are explored. In particular, we
are interested to know which PTMs are linked to proteins and stresses, accord-
ing to Marcelli et al.. The manually produced relationship model that follows

Fig. 4. Manually-produced relationship network: We describe a guilt-by-


association scenario between proteins which are linked to neurodegenerative disorders
such as Alzheimer’s disease, and PTMs. By “Functional Groups” we imply that these
proteins are likely involved with the disorder. This model was created using the key-
words inherent to the article and work of Marcelli et al. [16]. The actual details to
explain the relationship between terms are not contained in this model and must be
obtained from the original articles.

from the author’s work is shown in Fig. 4. The models created from BeagleTM
describe the connections between the predefined terms automatically, and the
exact causality behind this relationship must be explored in the article by Mar-
celli et al.. Observing the created plot may help researchers to determine the
relevance of an article at a glance.

2 Methods
Once the keywords have been defined for a text mining operation, BeagleTM
scans all abstracts of the corpus to find their occurrence across articles as shown
in Fig. 5. A relationship implies that these keywords are relevant to a study
where they play a role. Several keywords found together in an article may likely
signify a central theme that binds them together. In bioinformatics, for example,
learning that a particular protein and a stress have been found in the same article
is very likely to suggest that the protein has been studied in some context of the
stress. If a type of PTM is also mentioned in the text, then there is reason to
suggest that it may be a part of the stress response for the protein, for example.
If several distinct articles are uncovered where these same keywords are found
together, or are linked by edges in relationship networks, then, again this may
point to a deeper relation or, perhaps, a common mechanism.
In general, by studying keywords and how they are found in the literature, we
may have strong evidence to suggest that they share some form of a relationship.

Fig. 5. Flowchart: The abstracts of each article are individually parsed for pre-
selected keywords. The results of this parsing are organized in specific database tables.

Although further exploration is necessary to determine the exact details of the


discovered relationship, this is not a limitation because all relationships, no mat-
ter their strengths, may be important parts of a rigorous review. Furthermore,
many discoveries have been suggested by simple guilt-by-association scenarios.
BeagleTM is built from open source software (Python https://www.python.
org/ and SQLite3 https://www.sqlite.org/). Our method provides convenient
customization and its output has been especially formatted to create input files
for populating a SQLite3 database. In time, we plan to release the tool’s source
code to the community.
The corpus data was provided from PubMed [15], maintained by the National
Library of Medicine (ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/), and is their most
recent compilation at the time of this writing. Uncompressed, there are about
134 GB of articles to process (over 1.7 million articles) that are saved in an
nxml format. One of the hardships of text mining is that each keyword must
be queried in the text of each article of the corpus. Approaches similar to TM
for R [8], where the entire corpus must be loaded into memory for an
operation, are undesirable for working with PubMed due to its size and growth.
Our method avoids the problem of expanding data sets and memory limi-
tations by loading each individual file to parse for keywords, then adding any
results to a central database. This operation is completed each time the list of

keywords changes. We note that working with such an enormous number of files
may introduce bottlenecks; however, our analysis was completed on solid-state
drives, which enabled elevated performance without compromising the execution
time. In addition, any lost time during text processing is recovered when
the database programming (discussed in Sect. 2.1) is applied.
Since each article from NCBI has a unique PMID number (identification
reference for PubMed citations) that acts as a primary key for the database pro-
gramming element of the method (discussed in Sect. 2.1), all encountered key-
words are recorded with the PMID number. The associations between keywords
are made by connecting their sources by these PMID numbers. Our method
connects these keywords to each other by finding the intersections according to
PMIDs and uses databases to manage this task.
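
A rough sketch of this extraction step is given below; it assumes the PMC nxml layout indicated in Fig. 6 (an article-id element with pub-id-type "pmid" and an abstract element) and uses an illustrative keyword list, so the tag paths and names should be treated as assumptions rather than the tool's exact implementation.

```python
# Sketch: extract the PMID and abstract from one PMC nxml file and record
# which user-defined keywords occur in the abstract (illustrative only).
import xml.etree.ElementTree as ET

KEYWORDS = ["sod1", "oxidative stress", "acetylation"]   # example terms

def scan_article(nxml_path):
    root = ET.parse(nxml_path).getroot()
    pmid = next((el.text for el in root.iter("article-id")
                 if el.get("pub-id-type") == "pmid"), None)
    abstract_el = root.find(".//abstract")
    abstract = " ".join(abstract_el.itertext()).lower() if abstract_el is not None else ""
    # One (pmid, keyword, occurrence count) row per matched keyword.
    return [(pmid, kw, abstract.count(kw)) for kw in KEYWORDS if kw in abstract]

# Each file is processed independently, so memory use stays flat regardless
# of the corpus size.
```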

2.1 Database Support

Each article of the PubMed corpus is in an nxml format, as shown in Fig. 6.


From this meta-data, our method is able to determine particular types of data
to be included in the database to discover connections by automatic queries.
For this task, we used SQLite3, which was chosen for its simplicity, power, open
source nature, and for the fact that an entire database may be stored as a single

Fig. 6. Article meta-data: Each article is downloaded from NCBI as an nxml for-
matted file. We note that BeagleTM parses each file for specific types of information
to be stored in its internal SQLite database. This information is shown by arrows (i.e.,
PMID, the title of journal, the title of the article, and the abstract block.) The text of
this abstract [17] is parsed for relevant information to the keywords.

Fig. 7. Extracted data: For each article where relevant terms are found, referential
details (such as the PMID, article sources, and the blurb of text in which the term
is found) are inserted into the database for further analysis with advanced queries.

file with which our tool works. SQLite3 also provides a convenient way to setup
and import data from BeagleTM processing by simple scripts to build the tool’s
database on seemingly any hardware. Our tool extracts information about each
occurrence of a keyword as it is encountered in an article (i.e., PMID number,
article references, the occurrence number, and its associated blurb of text) to be
inserted into the database as shown in Fig. 7.
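
As a sketch of this insertion step, one row per keyword occurrence might be written as follows, using the Functional schema shown later in Table 1; the database file name and the inserted values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect("beagletm.db")   # assumed database file name
conn.execute(
    "INSERT INTO Functional (pmid, funct, count, blurb, journal) "
    "VALUES (?, ?, ?, ?, ?)",
    ("0000001", "sod1", 2, "example blurb surrounding the keyword", "Example Journal"),
)
conn.commit()
conn.close()
```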
The database is used to perform stringent queries to locate keywords and
associated data according to matching PMID numbers. We have customized
our tool and database to perform analyses of articles for PTMs, stresses and
protein keywords, as discussed above, to help explain its function. Our method
has six main tables: Functional (containing functional origins of types of pro-
teins), MitoSymbols, PTMGeneral (general PTM names such as acetylation or
phosphorylation) and PTMSpecific (containing specific types of PTMs which
are actually subsets of more-general types of PTMs such as, phosphotyrosine
which is a type of phosphorylation, for example). The Stress table concerns the
stress factors to which the proteins had been exposed, according to the articles.
In Table 1, we provide the details of two of these tables, and note that all tables
have a similar construction.

2.2 Networks

Across all networks, edges between nodes signify that at least one peer-reviewed
study exists in which the keywords have been mentioned together in the same
study (i.e., the keywords share a common PMID number). Cliques in the net-
works represent that keywords originated from the same abstract. To find the
associations of keywords bound by cliques, BeagleTM queries all keywords hav-
ing the same PMID number. In our tool, this output is then relayed to the
NetworkX plotting tool [18] to create the relationship networks. In Sect. 3, we
will discuss the specific results of the networks that suggest interrelationships
according to our method. We note that sometimes when all these cliques are
shown together it may create some confusion when differentiating a particular

Table 1. Database schemas; Here we provide the SQLite3 code to create two of the
tables in our database. The creation code for the other tables is similar. The integrity
constraint NOT NULL was necessary to ensure that the relation for each article was
complete. PubMed’s PMID, an article reference number, was assigned by the NIH
National Library of Medicine and functions as a primary key.

CREATE TABLE Functional ( CREATE TABLE Stress (


pmid varchar PRIMARY KEY, pmid varchar PRIMARY KEY,
funct varchar NOT NULL, stress varchar NOT NULL,
count integer NOT NULL, count integer NOT NULL,
blurb text NOT NULL, blurb text NOT NULL,
journal text NOT NULL ); journal text NOT NULL);

clique from another. To ascertain the members for a specific clique, including the
article PMIDs, one may consult the non-graphical data (i.e., the output provided
by BeagleTM, not shown) from which networks are made.
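
Following the schemas in Table 1, a query of this kind could look roughly as follows; the join condition, output columns, and database file name are an assumed sketch rather than the tool's exact queries.

```python
# Sketch: find protein/stress keyword pairs that share a PMID (clique members),
# using the Functional and Stress tables defined in Table 1.
import sqlite3

conn = sqlite3.connect("beagletm.db")   # assumed database file name
rows = conn.execute(
    """
    SELECT f.pmid, f.funct, s.stress
    FROM Functional AS f
    JOIN Stress AS s ON s.pmid = f.pmid
    ORDER BY f.pmid;
    """
).fetchall()

for pmid, protein, stress in rows:
    print(pmid, protein, stress)        # each row becomes an edge of a network
conn.close()
```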

3 Results and Discussion


We obtained a listing of keywords taken from [19] and [20] to be used to demon-
strate the functionality of our method. A sample of these keywords is shown
in Table 2. We note that potentially any keywords may be used by researchers
for analysis with BeagleTM. We begin by discussing the relationship networks
that were created from our keywords. Some of the following networks have been
reduced for simplicity of discussion.
In Fig. 8, each protein clique contains red circles representing the article PMIDs
in which the other terms of the clique are found, blue squares representing PTMs,
yellow pentagons representing ailment types, and green triangles representing
stress-factors. Since the types of keywords change from model to model,
the node assignments may also change. In Figs. 9 and 10 (functional cliques),
the red circles, blue squares, yellow pentagons and green triangles represent
PMIDs, gene symbols, stress-types, and related ailments, respectively.
In Figs. 11 (acetylation) and 12 (glycosylation), we note the cliques describing
networks where PMIDs, stress-factors, and disorders have been linked by the
peer-reviewed literature to a single PTM. Since single PTMs have been observed
to operate in concert with others for the orchestration of diverse chaperone
functions [21], networks concentrating on a single PTM may not be
completely informative. In this case, it is suggested that networks describing
terms related to stress-factors, such as those of Fig. 13, also be studied for
potentially uncovering cross-talking PTMs in network cliques.
From the relationship networks, we examine cliques of proteins that are linked
to disorders and stresses according to the literature. Each keyword of a clique
may be found in the same abstract, denoted by the PMID of
the circle-nodes. In Fig. 8, by the literature, we note that the protein SOD1 has

Table 2. Sample keywords: Below are the four main rubrics for our curated key-
words, from which we built relationship networks. The total number of keywords is
current as of April 2018. We note that the number of terms increases in tandem with
the expansion of the PubMed corpus.

Rubric               | Sample                                                                                                   | Total keywords
Diseases-specific    | Acidosis, ageing, Alzheimer's, apoptosis, arthritis, Crohn's, diabetes, obesity, Parkinson's and others  | 46
Mt Gene Symbols      | Oat, pc, opa1, cs, mut, msra, phb, sod1, mtor, aldh2 and others                                          | 619
PTMs (general types) | Acetylation, glycosylation, methylation, oxidation, phosphorylation and others                           | 35
Stresses             | Hypoxia, oxidation, oxidative stress, ROS (reactive oxygen species), tolerance, toxin, unfolded protein response and others | 47

been linked to several diverse disorders such as, Alzheimer’s Parkinson’s, diabetes
and other neurodegenerative ailments as discussed in [22] (PMID: 22384126).
There are also links to types of stresses – ROS (reactive oxygen species),
oxidative stress, general stress, toxins and others, which have been introduced
by the articles. From this observation, we note that SOD1 may likely be involved
with these disorders and stresses since there is at least one peer-reviewed study
found in which these edge-connected nodes have been mentioned in the same
article.
From the network, we note the article by Milani et al. [23] (PMID: 23983902),
in which the authors discuss the involvement of induced oxidative damage by
ROS in Parkinson’s disease and amyotrophic lateral sclerosis. This suggests the
role played by these actors. Furthermore, the authors study SOD1 for its con-
nection to NRF2, a transcriptional factor and master regulator of the expression
of many antioxidant /detoxification genes. With the exception of the NRF2
protein and the discussion of amyotrophic lateral sclerosis, this relationship to
SOD1 may readily be observed from the network itself. We reserve judgment on
the NRF2 (an important neuroprotective protein in neurodegenerative diseases)
which may be deeply connected to the stresses and ailments in the network of
Fig. 8.
Also in Fig. 8, we explore the article (PMID: 25998424) by Collins et al. [24] and
note that the keywords of the network, {acidosis, oxidation, acetylation and SOD1},
share a commonality. According to the article, our network infers that its actors,
acidosis, SOD1 and ROS, share a relationship. From the simplicity of the network,
these relationships may be used to determine that SOD1 can be related to the
stress of ROS. In addition, since the authors mention ROS specifically, we may
infer that other oxidative stresses are likely to play roles through the discussion
of redox in the article. It is interesting to note that if, upon consultation of the
article, there were no discussion of oxidation, one could form the hypothesis that
such a relationship may eventually be discovered.

Fig. 8. Protein clique: Relationship model of SOD1 that has been found according
to the literature to share a relationship to Alzheimer’s, Parkinson’s disease, as well as
others. There are three types of nodes featured in this model: the square represents the
single protein to which each other node is related. The circles and pentagons denote
the PMIDs and stresses, the triangles denote the disorders that have documented
relationships to the other nodes. All edges denote that terms are connected by at least
one common article.

In Fig. 9, we note that ageing shares a relationship with stresses ROS, oxida-
tive stress, pollutants and others. A relationship is also shared with PTMs such
as thyroxine (the main hormone secreted into the bloodstream by the thyroid
gland), methionine sulfoxide, lactic acid, and triiodothyronine – a thyroid hor-
mone that plays vital roles in the body’s metabolic rate.
More specifically, when exploring the article (PMID: 27199942) by Bastard
et al. [25], we note that the actual clique for this article is composed of {ageing,
stress and tolerance}. The article concerns the Gram-positive bacterial species
Oenococcus oeni, which is used in the production of wine to reduce acidity, and
its ability to tolerate stresses through the formation of biofilms or planktonic
cells. The article provides examples of relationships where stress and tolerance
play major roles in the study.
In Fig. 10, which has been reduced from its full size to facilitate discussion,
we note that Alzheimer's is related to stress types such as {heat shock,
oxidation, oxidative stress, reactive oxygen species, (stress) tolerance, and
others}. We have customized our output to show only peer-reviewed journals in
which the relationship between Alzheimer's and the stress-factors is described.

Fig. 9. Functional clique: The red circles represent the PMID numbers for PubMed
articles, the blue squares indicate PTMs, the green triangles denote the stress-factors
and the mustard pentagons correspond to the ailment by name, to which all elements
are related by the literature. We note that all these terms are related by peer-reviewed
studies however, we must return to the PMID of each clique to determine the nature
of the relationship. We show the summary plot of text mining tasks for the keyword
aging.

For example, in Millian’s work [26] (PMID: 25364287) core pathophysiological


processes underlying Alzheimer’s have been studied where methylation, oxida-
tive stress and other factors were intimately involved.
In Fig. 11, the relationships between the PTM acetylation, stress-factors and
ailments are described. Among the stresses are {heat, hyperthermia,
hypoxia, microgravity, oxidative stress, ROS and others}, and among the ailments
we note {ageing, Alzheimer's, asthma, bone loss, Crohn's disease, diabetes,
epilepsy, and others}. In Ansari et al. [27] (PMID: 27686535) from the network,
SIRT3, a member of the sirtuin group, is studied for its role in regulating energy
demand during stress conditions such as fasting and exercise. We note that
SIRT3 regulates metabolism through the deacetylation and acetylation of
mitochondrial enzymes and is understood to be able to combat the effects of
ROS and to prevent cancer by initiating apoptosis. In their article, Ansari et al.
review the molecular functions of SIRT3 and its regulatory ability.
Before having to consult the article itself, its keywords (in absence of the
SIRT3 sirtuin group member) were exhibited graphically in the network Fig. 11.
Investigators undergoing literature reviews for articles containing these keywords
may consult these networks to begin some of their work. Due to multiple PTMs

Fig. 10. Functional clique: The red circles represent the PMID numbers for PubMed
articles, the blue squares indicate stress-factors, the green triangles denote the journal
names, and the mustard pentagon correspond to the ailment by name. This network
allows us to study which types of journals are featuring unique types of research. We
show the summary plot of text mining tasks for the keyword, Alzheimer’s Disease. This
relationship network is actually a subset of the entire network which was too populated
to be legible.

which are likely working together for a process for disorders such as Parkinson’s
[28] and discussed in [29], [30], [20], investigators may also wish to consult PTM
relationship networks, such as that of Fig. 12 (glycosylation), to gain a fuller
understanding of other PTMs that may work in tandem.
Some literature reviews may begin with a study of stress-factors and PTMs to
determine effects and/or the potential onset of disorders. In such a case,
researching stresses in conjunction with a particular PTM would lead the research
team to articles where potential disorders are explored and where stresses and
PTMs are integral components. To help determine some of the disorders which may
result from exposure to a particular stress and PTM, we created the relationship
networks of Fig. 13 (reduced to facilitate discussion). In
the relationship networks of the figure, we note that oxidative stress has been
linked to: {apoptosis, diabetes, heart disease, obesity and others}, in concert with
PTMs such as: methionine sulfoxide (oxidation), nitrated tyrosine (nitration),
thyroxine (iodination), and others. More information about the nature of each
PTM of this network is available from UniProt at http://www.uniprot.org/docs/
ptmlist.

Fig. 11. PTM clique: The red circles represent the PMID numbers for PubMed arti-
cles, the blue square indicates a PTM, the green triangles denote stresses, and the mus-
tard pentagon correspond to the ailment by name. This relationship network is from the
study of the keyword acetylation in light of stress-factors and associated disorders.

Fig. 12. PTM clique: The red circles represent the PMID numbers for PubMed arti-
cles, the blue square indicates a PTM, the green triangles denote stresses, and the mus-
tard pentagon correspond to the ailment by name. This relationship network is from the
study of the keyword glycosylation in light of stress-factors and associated disorders.

Fig. 13. Stress clique: The red circles represent the PMID numbers for PubMed
articles, the blue square indicates a stress, the green triangles denote ailments, and
the mustard pentagon correspond to the PTMs. This relationship network is from the
study of the keyword oxidative stress, in light of, PTMs and associated disorders.

4 Conclusion
Due to the problems associated with attaching keywords to articles, we noted
that text mining may be an appropriate remedy to help researchers find concepts
in a literature which seemingly has no uniform method for assigning keywords.
In addition, where keywords for articles do exist, they do not suggest the
full depth of knowledge that their articles contain. In this study, we discussed
examples where our method and tool, BeagleTM, was used to extract relationships
between PTMs, stress-factors, and proteins which may be involved with
disorders. We described how to read relationship networks that suggest
connections between keywords and allow researchers to obtain knowledge from the
literature.
During the discussion of the technicalities of the method itself, we discussed
how our method is able to process a corpus of arbitrary size since it parses one
article’s abstract at a time. Since abstracts are excellent representations of the
entire work, we used the articles’ abstracts as the inputs to our tool. However, our
method and tool will work similarly on any size of text, including a full article.
We discussed that the method of determining commonalities across keywords
revolves around the idea that each article in our corpus (supplied by PubMed)
is automatically given a PMID number. When a keyword is located, then the
keyword, its reference details, along with its PMID number are inserted into the
local SQL database. Robust SQL queries can then be utilized to determine data

to describe the relations that we require to create relationship networks. We


discussed how to read and understand relationship networks, where nodes and
edges represent keywords and the existence of articles to support a relationship,
respectively. Finally, we discussed how the use of relationship networks, a visual
representation of the actors in abstracts, may save the investigator time when
sifting through diverse abstracts while searching for specific types of studies.

4.1 Future Work


In the future, we intend to extend our BeagleTM tool to add statistical power
such as Bayesian inference and other methods to enable the prediction of new
keywords which are likely to be related to a particular type of disorder, protein or
stress-factor. This functionality would enable investigators to obtain meaningful
networks, in absence of a complete knowledge of necessary keywords for a subject
area. The addition of this analysis would also allow our tool to discern between
strong and weak types of relationships between keywords.
We intend to add a network interactivity layer to the tool so that researchers
are able to move and re-position the nodes of the relationship networks to aid in
productivity. Finally, after development, we plan to make our tool open source
and to release it to the bioinformatics community by Github or another cloud-
based development platform to allow for community-inspired development.

Acknowledgment. I would like to thank Janyl Jumadinova for her help in proofing
this manuscript.

References
1. Splendiani, A., Donato, M., Drăghici, S.: Ontologies for bioinformatics. In: Springer
Handbook of Bio-/Neuroinformatics, pp. 441–461. Springer, Heidelberg (2014)
2. Schouten, K., Frasincar, F., Dekker, R., Riezebos, M.: Heracles: a framework for
developing and evaluating text mining algorithms. Expert Syst. Appl. 127, 68–84
(2019)
3. Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B.,
Kochut, K.: A brief survey of text mining: classification, clustering and extraction
techniques. arXiv preprint arXiv:1707.02919 (2017)
4. Sharma, S., Srivastava, S.K.: Review on text mining algorithms. Int. J. Comput.
Appl. 134(8), 39–43 (2016)
5. Lamurias, A., Couto, F.M.: Text mining for bioinformatics using biomedical litera-
ture. In: Encyclopedia of Bioinformatics and Computational Biology, vol. 1 (2019)
6. Paynter, R., Bañez, L.L., Berliner, E., Erinoff, E., Lege-Matsuura, J., Potter, S.,
Uhl, S.: EPC methods: an exploration of the use of text-mining software in sys-
tematic reviews (2016)
7. Maynard, D., Roberts, I., Greenwood, M.A., Rout, D., Bontcheva, K.: A framework
for real-time semantic social media analysis. Web Seman.: Sci. Serv. Agents World
Wide Web 44, 75-88 (2017)
8. Feinerer, I.: Introduction to the tm package text mining in R (2017)
9. Williams, G.J., et al.: Rattle: a data mining GUI for R. R J. 1(2), 45–55 (2009)

10. Müller, H.-M., Van Auken, K.M., Li, Y., Sternberg, P.: Textpresso central: a cus-
tomizable platform for searching, text mining, viewing, and curating biomedical
literature. BMC Bioinform. 19(1), 94 (2018)
11. Wei, C.-H., Kao, H.-Y., Lu, Z.: PubTator: a web-based text mining tool for assisting
biocuration. Nucleic Acids Res. 44, gkt441 (2013)
12. Kim, S., Kwon, D., Shin, S.-Y., Wilbur, W.J.: PIE the search: searching PubMed
literature for protein interaction information. Bioinformatics 28(4), 597–598 (2011)
13. Kim, S., Yeganova, L., Wilbur, W.J.: Meshable: searching pubmed abstracts by
utilizing mesh and mesh-derived topical terms. Bioinformatics 32(19), 3044–3046
(2016)
14. Papadopoulou, P., Lytras, M., Marouli, C.: Bioinformatics as applied to medicine:
challenges faced moving from big data to smart data to wise data. In: Applying
Big Data Analytics in Bioinformatics and Medicine, pp. 1–25. IGI Global (2018)
15. Ncbi, R.C.: Database resources of the national center for biotechnology informa-
tion. Nucleic Acids Res. 45(D1), D12 (2017)
16. Marcelli, S., Corbo, M., Iannuzzi, F., Negri, L., Blandini, F., Nisticò, R., Feligioni,
M.: The involvement of post-translational modifications in Alzheimer’s disease.
Curr. Alzheimer Res. 15, 313–335 (2017)
17. Hunnicut, J., Liu, Y., Richardson, A., Salmon, A.B.: MsrA overexpression targeted
to the mitochondria, but not cytosol, preserves insulin sensitivity in diet-induced
obese mice. PloS One 10(10), e0139844 (2015)
18. Schult, D.A., Swart, P.: Exploring network structure, dynamics, and function using
networkX. In: Proceedings of the 7th Python in Science Conferences (SciPy 2008),
vol. 2008, pp. 11–16 (2008)
19. Bonham-Carter, O., Pedersen, J., Bastola, D.: A content and structural assessment
of oxidative motifs across a diverse set of life forms. Comput. Biol. Med. 53, 179–
189 (2014)
20. Bonham-Carter, O., Pedersen, J., Najjar, L., Bastola, D.: Modeling the effects of
microgravity on oxidation in mitochondria: a protein damage assessment across a
diverse set of life forms. In: IEEE Data Mining Workshop (ICDMW), pp. 250–257.
IEEE (2013)
21. Thygesen, C., Boll, I., Finsen, B., Modzel, M., Larsen, M.R.: Characterizing
disease-associated changes in post-translational modifications by mass spectrome-
try. Expert Rev. Proteomics 15(3), 245–258 (2018)
22. Li, Y., Chigurupati, S., Holloway, H.W., Mughal, M., Tweedie, D., Bruestle, D.A.,
Mattson, M.P., Wang, Y., Harvey, B.K., Ray, B., et al.: Exendin-4 ameliorates
motor neuron degeneration in cellular and animal models of amyotrophic lateral
sclerosis. PLoS One 7(2), e32008 (2012)
23. Milani, P., Ambrosi, G., Gammoh, O., Blandini, F., Cereda, C.: SOD1 and DJ-1
converge at Nrf2 pathway: a clue for antioxidant therapeutic potential in neurode-
generation. Oxidative Med. Cell. Longevity 2013 (2013)
24. Collins, J.A., Moots, R.J., Clegg, P.D., Milner, P.I.: Resveratrol and n-
acetylcysteine influence redox balance in equine articular chondrocytes under acidic
and very low oxygen conditions. Free Radical Biol. Med. 86, 57–64 (2015)
25. Bastard, A., Coelho, C., Briandet, R., Canette, A., Gougeon, R., Alexandre, H.,
Guzzo, J., Weidmann, S.: Effect of biofilm formation by Oenococcus oeni on malo-
lactic fermentation and the release of aromatic compounds in wine. Front. Micro-
biol. 7, 613 (2016)
26. Millan, M.J.: The epigenetic dimension of Alzheimer’s disease: causal, consequence,
or curiosity? Dialogues Clin. Neurosci. 16(3), 373 (2014)

27. Ansari, A., Rahman, M., Saha, S.K., Saikot, F.K., Deep, A., Kim, K.-H., et al.:
Function of the SIRT3 mitochondrial deacetylase in cellular physiology, cancer,
and neurodegenerative disease. Aging Cell 16(1), 4–16 (2017)
28. Ferrer, I.: Early involvement of the cerebral cortex in Parkinson’s disease: conver-
gence of multiple metabolic defects. Progress Neurobiol. 88(2), 89–103 (2009)
29. Stetz, G., Tse, A., Verkhivker, G.M.: Dissecting structure-encoded determinants
of allosteric cross-talk between post-translational modification sites in the Hsp90
chaperones. Sci. Rep. 8(1), 6899 (2018)
30. Bonham-Carter, O., Thapa, I., Bastola, D.: Evidence of post translational mod-
ification bias extracted from the tRNA and corresponding amino acid interplay
across a set of diverse organisms. In: Proceedings of the 5th ACM Conference
on Bioinformatics, Computational Biology, and Health Informatics, pp. 774–781.
ACM (2014)
Comparison of Imputation Methods
for Missing Values in Air Pollution Data:
Case Study on Sydney Air Quality Index

W. M. L. K. N. Wijesekara and Liwan Liyanage

School of Computing, Engineering and Mathematics,


Western Sydney University, Sydney, Australia
18570263@student.westernsydney.edu.au,
l.liyanage@westernsydney.edu.au

Abstract. Missing values in air quality data may lead to a substantial amount
of bias and inefficiency in modeling. In this paper, we discuss six methods for
dealing with missing values in univariate time series and compare their per-
formances. The methods we discuss here are Mean Imputation, Spline Inter-
polation, Simple Moving Average, Exponentially Weighted Moving Average,
Kalman Smoothing on Structural Time Series Models and Kalman Smoothing
on Autoregressive Integrated Moving Average (ARIMA) models. The perfor-
mances of these methods were compared using three different performance
measures: Mean Squared Error, Coefficient of Determination and the Index of
Agreement. Kalman Smoothing on Structural Time Series is the best method
among those considered for imputing missing values in the context of air quality
data under the Missing Completely at Random (MCAR) mechanism. The Kalman
Smoothing on ARIMA and Exponentially Weighted Moving Average methods also
perform considerably well. The performance of Spline Interpolation decreases
drastically with an increased percentage of missing values. Mean Imputation
performs reasonably well for smaller percentages of missing values; however, all
the other methods outperform Mean Imputation regardless of the number of
missing values.

Keywords: Imputation · Smoothing · Missing completely at random

1 Introduction

Air quality data is widely used in models for various purposes including assessing the
impact of air quality on health and wellbeing. It is common to expect a large number of
missing values in data sources collected using sensors. Missing values in air quality
data may lead to underestimating the associated health effects. In general, missing
values create problems by introducing a substantial amount of bias and reducing the
efficiency of analysis [1]. Although many methods have been developed for missing
value imputation, methods for time series data are still in their infancy. The inherent
nature of
ficult and challenging. However, it is necessary to impute missing values in cer-
tain situations to make more accurate predictions. Therefore, it is an important area of
research. In this study, we focus on the air pollution data in the Sydney region of
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 257–269, 2020.
https://doi.org/10.1007/978-3-030-39442-4_20

Australia. Figure 1 summarizes the percentage of missing data in air pollutant variables
at two monitoring stations (Liverpool and Rozelle) from 1994 to 2018.

Fig. 1. Missing value percentages of pollutant variables in two monitoring stations in Sydney

Substantial percentages of missing values are present in almost all pollutant vari-
ables. Liverpool station has lower percentages of missing data than Rozelle. Even
though only two stations are presented here, all other stations also showed a consid-
erable percentage of missing values which we cannot disregard. Therefore, a proper
mechanism to impute these missing values is essential before any type of modelling.
In this paper, we discuss six well established methods of dealing with missing
values in a univariate time series context and compare their performance on imputing
missing values for air quality data in the Sydney region. The methods discussed here
are Mean Imputation, Spline Interpolation, Simple Moving Average, Exponentially
Weighted Moving Average, Kalman Smoothing on Structural Time Series Models and
Kalman Smoothing on ARIMA models. The performances of these methods were
compared with three performance measures; Mean Squared Error (MSE), Coefficient of
Determination (R2) and Index of Agreement (d). The objective of this paper is to
identify the best available imputing method for air quality data in the Sydney region.

2 Literature Review

A variety of methods, ranging from simple techniques such as mean imputation to
advanced ones such as Long Short-Term Memory (LSTM) Recurrent Neural Networks,
have been applied to impute missing values in the context of air pollution data.
Mean imputation methods have performed well in most of the situations where the
percentage of missing values is as low as 5% and especially in single imputations [2, 3].
Other widely used methods include interpolations (linear, quadratic and cubic), Nearest
Neighbor (NN) [2, 4], Regression-based methods [4, 5], Self-Organizing Maps
(SOM) and Multi-Layer Perceptron (MLP) [4]. When the data can be formulated as a
multivariate normal time series, the Expectation-Maximization (EM) based methods
appeared to perform well [6]. Moreover, attempts have been made to combine the
power of neural networks and fuzzy logic in handling missing air quality data [7, 8]. These
methods have been recommended for nonlinear and complex phenomena. One such
method is the hybrid approach of Multiple Imputation (MI) and Adaptive Neuro-Fuzzy
Inference System (ANFIS). Recently, deep learning techniques such as LSTM Recurrent
Neural Networks have also been used for missing value imputation in air quality data [9].
However, the simplest technique, mean imputation, is still dominant in this area and is
considered the best method in some scenarios.
The most commonly used performance measures for comparing missing data
imputation methods are Mean Absolute Error (MAE), Root Mean Square Error
(RMSE) and the Coefficient of Determination (R2). The Index of Agreement (d) has
also been used in some studies. There is no single universal method to measure the
performance of imputation techniques; the choice depends largely on the nature of the
data and the distribution of missing values.
There are three types of missing data mechanisms, identified as Missing Completely
at Random (MCAR), Missing at Random (MAR) and Missing Not at Random
(MNAR) [10]. In the MCAR scenario, the missingness is independent of both
observable and unobservable parameters of interest. In MAR, there is a systematic
relationship between the propensity of a value to be missing and the observed data,
while in MNAR there is a relationship between the propensity of a value to be missing
and its unobserved value.

3 Methodology

3.1 Imputation Methods

Mean Imputation
This is the most commonly used single imputation technique where the missing values
are replaced with the mean value of the variable. The mean of a series of values
$y_1, y_2, \ldots, y_n$ is given by

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (1)$$

Spline Interpolation
For $n+1$ pairs of observations $\{(t_i, y_i) : i = 0, 1, \ldots, n\}$, the shape of the spline is
modeled by interpolating between each pair of consecutive observations $(t_{i-1}, y_{i-1})$ and
$(t_i, y_i)$ with polynomials

$$y = q_i(t), \quad i = 1, 2, \ldots, n \qquad (2)$$

Simple Moving Average


The simple moving average for a series $Y$ at time $t$ is given by the unweighted mean of
the previous $n$ observations:

$$y_{ma} = \frac{1}{n}\sum_{i=0}^{n-1} y_{t-(n-i)} \qquad (3)$$

Exponentially Weighted Moving Average


The exponentially weighted moving average for a series $Y$ at any time period $t$ may be
calculated recursively as

$$s_t = \begin{cases} y_1, & t = 1 \\ \alpha\, y_t + (1-\alpha)\, s_{t-1}, & t > 1 \end{cases} \qquad (4)$$

where $\alpha$ denotes the degree of weighting decrease.


Autoregressive Integrated Moving Average (ARIMA) Model
When the process is stationary, an Autoregressive Moving Average model ARMA
(p, q) can be defined as

$$y_t = c + \sum_{i=1}^{p} \varphi_i\, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} \qquad (5)$$

where the $\varphi_i$ are the autoregressive parameters, the $\theta_j$ are the moving average parameters to
be estimated, and the $\varepsilon_t$ are a series of unknown random errors that are assumed to follow
a normal distribution.
A time series which needs to be differenced to become stationary is said to be an
"integrated" version of a stationary series. In an ARIMA(p, d, q) model, the number of
autoregressive terms, the number of non-seasonal differences and the number of lagged
forecast errors in the prediction equation are denoted by p, d and q, respectively.
Structural Time Series Models
All linear time series have a state space representation. This representation relates the
disturbance vector $\{\varepsilon_t\}$ to the observation vector $\{y_t\}$ via a Markov process $\{\alpha_t\}$.
A convenient expression of the state space form is

$$y_t = Z_t \alpha_t + \varepsilon_t, \quad \varepsilon_t \sim N(0, H_t), \qquad (6)$$
$$\alpha_t = T_t \alpha_{t-1} + R_t \eta_t, \quad \eta_t \sim N(0, Q_t), \quad t = 1, \ldots, n$$

where $y_t$ is a $p \times 1$ vector of observations and $\alpha_t$ is an unobserved $m \times 1$ vector called
the state vector. The system matrices $Z_t$, $T_t$ and $R_t$ have dimensions $p \times m$, $m \times m$
and $m \times g$, respectively. The disturbance terms $\varepsilon_t$ and $\eta_t$ are assumed to be serially
independent and independent of each other at all time points. The matrix $H_t$ has
dimension $p \times p$ with rank $p$, and the matrix $Q_t$ has dimension $g \times g$ with rank
$g \leq m$ [11].

Kalman Smoothing
The Kalman filter calculates the mean and variance of the unobserved state, given the
observations. The filter is a recursive algorithm: the current best estimate is updated
whenever a new observation is obtained. Kalman smoothing takes the form of a
backwards recursion, and it can be used to compute smoothed estimators of the
disturbance vector [11].
The R package "imputeTS" was used for the Kalman smoothing and moving average
imputations [12].
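To make the six methods concrete, the following minimal sketch shows how a series with missing values could be filled. It is an illustrative stand-in only: the study itself used the R package imputeTS, so pandas and statsmodels here (together with the window size and smoothing weight) are assumptions of this sketch, not the authors' code.

```python
# Illustrative sketch only: the study used the R package imputeTS; pandas and
# statsmodels are assumptions of this sketch, not the authors' code.
# `y` is a numeric series in which missing values are encoded as NaN.
import pandas as pd
import statsmodels.api as sm


def impute_all(y: pd.Series, window: int = 4, alpha: float = 0.3) -> dict:
    out = {}
    # Mean imputation, Eq. (1): every gap is replaced with the series mean
    out["mean"] = y.fillna(y.mean())
    # Spline interpolation, Eq. (2): a cubic spline fitted through the observed points
    out["spline"] = y.interpolate(method="spline", order=3)
    # Simple moving average, Eq. (3): unweighted mean over a local window
    out["sma"] = y.fillna(y.rolling(window, min_periods=1).mean())
    # Exponentially weighted moving average, Eq. (4)
    out["ewma"] = y.fillna(y.ewm(alpha=alpha, ignore_na=True).mean())
    # Kalman smoothing on a structural (local linear trend) model, Eq. (6):
    # the state space filter skips NaN observations, and the smoothed level
    # component is used to fill the gaps
    res = sm.tsa.UnobservedComponents(y, level="local linear trend").fit(disp=False)
    out["kalman_structural"] = y.fillna(
        pd.Series(res.smoothed_state[0], index=y.index))
    # A Kalman-smoothing ARIMA variant can be built analogously from a fitted
    # state space ARIMA model.
    return out
```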

3.2 Performance Measures


Mean Squared Error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n} (P_i - O_i)^2 \qquad (7)$$

Coefficient of Determination ($R^2$):

$$R^2 = \left[ \frac{1}{n}\, \frac{\sum_{i=1}^{n} (P_i - \bar{P})(O_i - \bar{O})}{\sigma_P\, \sigma_O} \right]^2 \qquad (8)$$

Index of Agreement ($d$):

$$d = 1 - \left[ \frac{\sum_{i=1}^{n} (P_i - O_i)^k}{\sum_{i=1}^{n} \left( |P_i - \bar{O}| + |O_i - \bar{O}| \right)^k} \right] \qquad (9)$$

where $n$ is the number of imputations, $O_i$ is the observed data point, $P_i$ is the
imputed data point, $\bar{O}$ and $\bar{P}$ are the means, and $\sigma_O$, $\sigma_P$ are the standard deviations of the
observed and imputed data, respectively [4].
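A minimal sketch of how these three scores could be computed is given below, assuming the imputed values P and the withheld observed values O are available as arrays; taking k = 2 in the Index of Agreement is an assumption of this sketch (the text above leaves k generic).

```python
# Minimal sketch of Eqs. (7)-(9). P holds the imputed values and O the withheld
# observed values at the artificially deleted positions.
import numpy as np


def mse(P, O):
    P, O = np.asarray(P, float), np.asarray(O, float)
    return np.mean((P - O) ** 2)


def r2(P, O):
    # Squared-correlation form of Eq. (8)
    P, O = np.asarray(P, float), np.asarray(O, float)
    cov = np.mean((P - P.mean()) * (O - O.mean()))
    return (cov / (P.std() * O.std())) ** 2


def index_of_agreement(P, O, k=2):
    # k = 2 is an assumption of this sketch
    P, O = np.asarray(P, float), np.asarray(O, float)
    num = np.sum(np.abs(P - O) ** k)
    den = np.sum((np.abs(P - O.mean()) + np.abs(O - O.mean())) ** k)
    return 1.0 - num / den
```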

3.3 Data and Approach


The data includes air pollutant variables measured hourly at each of the monitoring
stations in the Sydney region. Figure 2 depicts how the missing values of air pollutants at
the Liverpool monitoring station are distributed across the time period from 1994 to 2018.
The red coloured bars represent the percentage of missing values and the blue coloured
bars represent the percentage of observed values over a period of time. The x axis
represents the cumulative percentage of values in the time series. As can be seen, the
missingness is almost random across time, except for the first few years of PM2.5 (fine
particulate matter 2.5 micrometers or less in diameter) and CO. Therefore, in this study,
simulations were carried out considering the MCAR mechanism.

Fig. 2. Percentage of missing values of air pollutants at Liverpool station over the time from
1994 to 2018

In order to select a reference time series to carry out missing value simulations and
to use as the ground truth in the comparison of imputation methods, the Air Quality
Index (a standard index calculated by incorporating all the air pollutants) recorded at
each station was considered. Figure 3 shows the heat map of number of missing values
in the hourly air quality index for each monitoring station from 1994-01-01 01:00:00
AEST to 2018-12-31 24:00:00 AEST.

Fig. 3. Number of missing values of the air quality monitoring stations in Sydney from 1994 to
2018

As can be seen in Fig. 3, the dataset suffers from the problem of missing values.
The Liverpool, Richmond and Earlwood stations appeared to have fewer missing
values. Therefore, these three stations were further analyzed, and a subset of Earlwood
hourly air quality indices for a two-year period from 2014.01.01 01:00:00
AEST to 2015.12.31 24:00:00 AEST with no missing values (Fig. 4) was selected as
the reference series.
Missing values for this series were created by artificially deleting observations
under the Missing Completely at Random (MCAR) mechanism. Four scenarios were
created in which the percentages of missing values were 5%, 10%, 15% and 20%. The
missing values in each scenario were imputed using the six methods: Mean
Imputation, Spline Interpolation, Simple Moving Average, Exponentially Weighted
Moving Average, Kalman Smoothing on Structural Time Series Models and Kalman
Smoothing on ARIMA models. Then the performance of each method on each scenario
was assessed using the three performance measures: Mean Squared Error (MSE),
Coefficient of Determination (R2) and Index of Agreement (d).
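The simulation loop just described could be sketched as follows; it assumes a complete reference series aqi (no missing values) plus the impute_all, mse, r2 and index_of_agreement helpers sketched earlier, and the random seed is an arbitrary choice.

```python
# Sketch of the MCAR simulation described above (assumptions: `aqi` is the
# complete reference series, and the earlier sketched helpers are available).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2020)


def run_scenario(aqi: pd.Series, frac: float) -> pd.DataFrame:
    y = aqi.copy()
    holdout = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y.iloc[holdout] = np.nan                     # delete values completely at random
    rows = []
    for name, filled in impute_all(y).items():   # impute with each method
        P, O = filled.iloc[holdout], aqi.iloc[holdout]
        rows.append({"method": name, "MSE": mse(P, O),
                     "R2": r2(P, O), "d": index_of_agreement(P, O)})
    return pd.DataFrame(rows)


# One table of scores per scenario:
# results = {f: run_scenario(aqi, f) for f in (0.05, 0.10, 0.15, 0.20)}
```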

Fig. 4. Distribution of hourly air quality data from 2014.01.01 01:00:00 AEST to 2015.12.31
24:00:00 AEST in Earlwood

4 Results and Discussion

Figure 5 shows the positions of the missing values in the time series for each of the four
scenarios mentioned in Sect. 3.3. The vertical red lines indicate the positions of
missing values. Only the first 1,000 of the 17,517 observations are displayed for ease
of viewing.

Fig. 5. Distribution of missing values in the simulations for 5%, 10%, 15% and 20% of missing
values in the dataset

The six methods stated in Sect. 3.3 were used to impute the missing values which
were artificially deleted in the simulations. The performance of each method was
evaluated with the three measures MSE, R2, and d.
Table 1 shows the performance of the six methods measured by MSE. Since the
performance of the Mean Imputation method was poor compared to the other methods,
Fig. 6 compares the performance of the other five methods, excluding Mean Imputation.

Table 1. MSE measures

Method                          5%      10%     15%     20%
Spline Interpolation            0.406   0.954   1.128   2.020
Kalman Smoothing Structural TS  0.250   0.600   0.790   0.980
Kalman Smoothing ARIMA          0.273   0.648   0.789   0.979
Simple MA                       0.479   0.926   1.321   1.777
Exp MA                          0.297   0.664   0.931   1.199
Mean                            12.915  25.936  37.624  48.378

Fig. 6. Comparison of MSE measures of each method for the 5%, 10%, 15% and
20% missing value scenarios

The Kalman Smoothing on Structural Time Series method appeared to be the best
while Mean Imputation appeared to be the worst. When the percentage of missing
values increases, performance of all the methods decreases. Kalman Smoothing on
ARIMA models and Exponentially Weighted Moving Averages perform well for small
percentages of missing values.
Table 2 shows the performance of the six methods measured by R2. Figure 7
displays the performance of methods except Mean Imputation.

Table 2. R2 measures

Method                          5%     10%    15%    20%
Spline Interpolation            0.999  0.998  0.998  0.996
Kalman Smoothing Structural TS  0.999  0.999  0.998  0.998
Kalman Smoothing ARIMA          0.999  0.999  0.998  0.998
Simple MA                       0.999  0.998  0.997  0.996
Exp MA                          0.999  0.999  0.998  0.997
Mean                            0.973  0.944  0.918  0.893

Fig. 7. Comparison of R2 measures of each method for the 5%, 10%, 15% and 20%
missing value scenarios

Again, the Kalman Smoothing on Structural Time Series method appears to be the
best among the considered methods. The Kalman Smoothing on ARIMA and
Exponentially Weighted Moving Average methods also perform well. However, once
again the performance of all the methods decreases as the percentage of missing values
increases.
Table 3 gives the performance of methods measured by Index of Agreement (d).
Figure 8 shows the performance of methods excluding Mean Imputation in order to
compare the other models clearly.

Table 3. Index of Agreement measures

Method                          5%     10%    15%    20%
Spline Interpolation            0.999  0.999  0.998  0.997
Kalman Smoothing Structural TS  0.999  0.999  0.999  0.998
Kalman Smoothing ARIMA          0.999  0.999  0.999  0.998
Simple MA                       0.999  0.999  0.998  0.998
Exp MA                          0.999  0.999  0.999  0.998
Mean                            0.985  0.970  0.955  0.940

Fig. 8. Comparison of Index of Agreement measures of each method for the 5%, 10%,
15% and 20% missing value scenarios

The Kalman Smoothing on Structural Time Series models and on ARIMA models
perform equally well. Although Spline Interpolation performed well with a smaller
percentage of missing values, its performance decreases drastically with increasing
missing values. Except for Mean Imputation, all other methods show approximately
equal performances. It is clear that the performance of all the methods decreases when
the percentage of missing values increases.
Figure 9 exhibits the imputed values from the Kalman Smoothing on Structural
Time Series model for the four simulated scenarios. Red, green and blue represent the
imputed values, actual values and known values, respectively. Again, only the first
1,000 of the 17,517 observations are presented for ease of viewing.
It can be seen that this method has performed extremely well for the MCAR missing
mechanism in the air quality data. However, further studies must be carried out to
compare the performance of these methods under the MAR and MNAR missing
mechanisms. Also, here we have considered a subset of an observed series to artificially
create missing values. When there are large numbers of missing values, these methods
may produce sub-optimal results.

Kalman Smoothing on Structural Time Series Model Imputation

Fig. 9. Comparison of imputed data using Kalman Smoothing on Structural Time Series models
against the actual data for the 5%, 10%, 15% and 20% missing value scenarios

5 Conclusion and Recommendations

Among the six methods considered, Kalman Smoothing on Structural Time Series
models is the best method for imputing missing values in the context of air quality data
where the missing mechanism is MCAR. Kalman Smoothing on ARIMA and
Exponentially Weighted Moving Average methods also perform considerably well.
The performance of Spline Interpolation decreases drastically with an increased
percentage of missing values. Even though Mean Imputation performs reasonably well
for smaller percentages of missing data, all the other five methods outperform it
regardless of the number of missing values. The six methods can be ranked from best to
worst as: Kalman Smoothing on Structural Time Series Models, Kalman Smoothing on
ARIMA models, Exponentially Weighted Moving Average, Simple Moving Average,
Spline Interpolation and Mean Imputation. However, the need for an improved
imputation method to deal with higher percentages of missing values still persists.
These methods also need to be studied under other missing data mechanisms
such as Missing at Random (MAR) and Missing Not at Random (MNAR).

References
1. Nakagawa, S., Freckleton, R.P.: Missing inaction: the dangers of ignoring missing data.
Trends Ecol. Evol. 23(11), 592–596 (2008)
2. Norazian, M.N., et al.: Estimation of missing values in air pollution data using single
imputation techniques. ScienceAsia 34(3), 341–345 (2008)
3. Zakaria, N.A., Noor, N.M.: Imputation methods for filling missing data in urban air pollution
data for Malaysia. Urbanism 9(2), 159–166 (2018)
4. Junninen, H., et al.: Methods for imputation of missing values in air quality data sets. Atmos.
Environ. 38(18), 2895–2907 (2004)
5. Wyzga, R.E.: Note on a method to estimate missing pollution data. J. Air Pollut. Control
Assoc. 23(3), 207–208 (1973)
6. Junger, W.L., de Leon, A.P.: Imputation of missing data in time series for air pollutants.
Atmos. Environ. 102, 96–104 (2015)
7. Lei, K.S., Wan, F.: Pre-processing for missing data: a hybrid approach to air pollution
prediction in Macau. In: Proceedings of the 2010 IEEE International Conference on
Automation and Logistics (2010)
8. Shahbazi, H., et al.: A novel regression imputation framework for Tehran air pollution
monitoring network using outputs from WRF and CAMx models. Atmos. Environ. 187, 24–
33 (2018)
9. Yuan, H.W., et al.: Imputation of missing data in time series for air pollutants using long
short-term memory recurrent neural networks. In: Proceedings of the 2018 ACM
International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings
of the 2018 ACM International Symposium on Wearable Computers (UbiComp/ISWC 2018
Adjunct), pp. 1293–1300 (2018)
10. Rubright, J.D., Nandakumar, R., Glutting, J.J.: A simulation study of missing data with
multiple missing X’s. Pract. Assess. Res. Eval. 19(10) (2014)
11. Abril, J.C.: Structural time series models. In: Lovric, M. (ed.) International Encyclopedia of
Statistical Science, pp. 1555–1558. Springer, Heidelberg (2011)
12. Moritz, S., Bartz-Beielstein, T.: imputeTS: time series missing value imputation in R. R J. 9
(1), 207–218 (2017)
BERT Feature Based Model for Predicting
the Helpfulness Scores of Online
Customers Reviews

Shuzhe Xu, Salvador E. Barbosa, and Don Hong

Computational Science Program, Middle Tennessee State University,
Murfreesboro, TN 37132, USA
sx2b@mtmail.mtsu.edu

Abstract. Online product reviews help consumers make purchase decisions
when shopping online. As such, many computational models have been
constructed to automatically evaluate the helpfulness of customer product reviews.
However, many existing models are based on simple explanatory variables,
including those extracted from low quality reviews, which can be misleading and
lead to confusion. Quality feature selection is essential for predicting the
helpfulness of online customer reviews. The Bidirectional Encoder Representations
from Transformers (BERT) is a recently developed language representation
model which attains state-of-the-art results on many natural language processing
tasks. In this study, a predictive model for determining helpfulness scores of
customer reviews, based on the incorporation of BERT features with deep learning
techniques, is proposed. The application analyzes the Amazon product reviews
dataset, and uses a BERT features based algorithm expected to be useful in helping
consumers make better purchase decisions.

Keywords: BERT · Online review · Review helpfulness · Neural network · Data mining · Prediction model

1 Introduction
Online shopping has grown exponentially across the world, and customer reviews play
an increasingly important role in helping shoppers make purchase decisions. These
reviews usually contain detailed information about the product, including product
descriptions, user experiences, and personalized suggestions. Given the large number
of reviews for specific products, an automated recommendation system incorporating
reviewer comments can be of great importance to consumers.
A lot of research work along this direction has been done, and many different tech-
niques, including regression methods and machine learning algorithms, have been used
to evaluate the helpfulness of product reviews. Liu et al. [1] developed a non-linear
model by incorporating linguistic features such as writing style and by using radial
basis function for predicting helpfulness. Zhang et al. [2] proposed a regression model
using lexical subjective clues, lexical similarity features, and shallow syntactic features
to evaluate helpfulness. Mudambi and Schuff [3] built a regression model based on
their hypotheses that some factors in the review influenced the review helpfulness as
explanatory variables in the regression model. Those explanatory variables, including
such things as word count, can be unreliable in representing the review’s helpfulness.
Park [4] analyzed psychological and linguistic features of reviews as explanatory vari-
ables in a linear model. However, linear and non-linear regression models based on
limited explanatory variables have limited reliability and often fail to consider the
complete review contents. Additionally, some of the models may require
time-consuming pre-processing of data. Due to uncertainty in many such factors, the
prediction of product review helpfulness becomes a difficult and challenging research
problem. Using machine learning techniques to analyze customer reviews as a Natural
Language Processing (NLP) task can lead to the identification of linguistic indicators
between the review texts and their helpfulness. Some researchers [5–7] have success-
fully used machine learning methods to extract features for opinion mining, semantic
classification and sentiment classification. They showed that machine learning tech-
niques as well as traditional statistical methods can solve NLP tasks. With modern
computer technology, more information from the data can be extracted to reduce uncer-
tainty and to more fully consider available data during modeling and predicting.
The Bidirectional Encoder Representations from Transformers (BERT) is a machine
learning based NLP tool released by Google in late 2018 [8]. BERT obtains new state-
of-the-art results on many NLP tasks such as General Language Understanding Evalu-
ation (GLUE), as a pre-trained language representation model that requires fine-tuning
to process different NLP downstream tasks. Researchers have already shown that BERT
can process product review data with high accuracy on tasks such as review reading
comprehension and aspect-based sentiment analysis [9]. In that research, features
extracted from a review by BERT are used to represent the review text, avoiding heavy
pre-processing of the original data, in order to answer customers' questions based on
product reviews. Hence, BERT is expected to be useful for the goal of predicting scores
of review helpfulness.
In this study, a neural network (NN) based model is developed with BERT features,
instead of explanatory variables, and is used to rank the helpfulness of product review
data collected by Amazon.com, using the ratio of helpful votes to total votes for each
review. This NN based tool is used to analyze the product review data by incorporating
BERT features. The proposed model predicts the helpfulness of customer reviews with
a ranking score by analyzing the review text, its star rating, and the product type. The
prediction should help consumers to make a better purchase decision. The remainder
of this paper is organized as follows. In Sect. 2, a brief introduction of related previous
research work is provided. Data preprocessing and the details of the model are pre-
sented in Sect. 3. An analysis and comparison of the proposed model’s results with one
that uses explanatory variables are included in Sect. 4. In the last section, results and
possibilities that could help improve the model in future research are discussed.

2 Related Work
2.1 Factors Affecting Review Helpfulness
Predicting a helpfulness score as an NLP task is quite challenging. There are many
factors that determine the helpfulness score of a review. Some of these are extractive
information, such as the overall star rating for each product obtained directly from the
data, and others are abstractive information like linguistic features that are more difficult
to extract from the review text.

Star Rating and Product Type. The star rating is intuitive and can be extracted from
data directly. The star ratings for online product reviews are usually numerical values
ranging from one to five stars. A one star rating reflects an extremely negative review;
conversely, a five star rating reflects an extremely positive one. For each product, the
star ratings indicate the attitude of customers.
Previous research has evaluated the relative diagnosticity of review extremity (those
tending toward one star or toward five stars) [3]. The relationship between the numerical
star ratings and the actual helpfulness scores is difficult to establish. Past research has
shown that a moderate star rating in the mid-range has great credibility [10], and is often
more helpful to customers than an extreme one. On the contrary, other researchers found
that an extreme star rating became more helpful than a moderate rating for eBay sellers
[11] and books [12]. These contradictory findings do not provide an exact answer to the
question of which one is more helpful.
Depending on the nature of the product type being reviewed, the relative value of
moderate ratings and extreme ratings can differ. In 1970, Nelson [13] defined search
goods as products that consumers are able to get product quality information about
before purchasing them, and experience goods as products that require user experience
to establish product quality. Past research has found that customers are more skeptical
of experience, or subjective, than of search, or objective product claims [14]. Mudambi
and Schuff [3] found that prior research failed to take into consideration product types
in assessing moderate ratings versus extreme ratings. They indicated that moderate
reviews were more helpful than either extremely positive or extremely negative reviews
for an experience good, in the decision making stage. For a search good, the extreme
ratings were seen as more helpful than moderate ones.

Linguistic Features, Content of Reviews and Other Factors. Review details can
increase information availability of the product, and help consumers to determine
whether the product is good or not. However, this is not always the case. Consumers
expect the information that they want to know about the product. A review could be
less helpful if there is not enough information for those consumers, even if the review
contains many sentences and words. There are many other similar factors that affect the
perceived helpfulness of reviews to consumers, depending on different situations.
Previous studies addressed some factors that affect review helpfulness. For the lin-
guistic aspect, a review with a high readability is likely to be accepted by customers [15–
17]. Mudambi and Schuff [3] indicated that review depth by word count has a positive
effect on review helpfulness depending on the product type. Ghose and Ipeirotis [16]
have shown that readability-based features affect the review helpfulness. Other research
[17] also supports the finding that readability has a higher importance for review
helpfulness than review length.
From the content aspect, the meaning of reviews was extracted using latent seman-
tic analysis (LSA) in a previous study by Cao et al. [18]. They showed that the semantic
BERT Model for Predicting Scores of Customer Reviews 273

features of a review have a greater effect on review helpfulness. Some research [12, 16]
found that reviews containing both subjective and objective information are more
helpful to customers than reviews containing only subjective sentences or only objective
sentences. Sentiment features, such as words expressing positive or negative
emotions, are also indicated as important elements of review helpfulness. Pan and Zhang
[15] found that positive reviews are more likely to receive helpful votes than negative
reviews. However, negative reviews also have a large impact on customers' decision
making. Kuan et al. [19] showed that negative reviews are more helpful than positive
reviews. Other research discussed sentiment features in more detail; [20, 21] found that
sentiment features affect review helpfulness.

Previous Analysis Models for Predicting Review Helpfulness. Based on a selection


of factors that influence review helpfulness shown in previous studies, it is possible to
build a mathematical model to represent the relationship between the helpfulness score
and the variables of those factors that contribute to review helpfulness.
Mudambi and Schuff [3] developed a regression model to show the relationship
from review helpfulness to explanatory variables that included review extremity, review
depth, and product type. Review extremity is the numerical star rating value. Review
depth is measured by the length of the review text. Product type is determined by pre-
vious research as a binary value, 0 for search goods and 1 for experience goods. The
regression model is designed based on these three basic terms plus some interaction
terms, since the basic terms have some correlation with each other. The model clearly
shows the relationship between the helpfulness score and the three basic terms;
however, those three basic terms may not be appropriate to represent review
helpfulness.
In 2018, Park [4] designed a linear regression model with more explanatory
variables. The model contains 11 explanatory variables from 3 categories; 7 of them,
called psychological variables, are Analytic, Clout, Authentic, CogProc, Percept, PosEmo
and NegEmo. The Analytic variable represents the level of logical thinking. Clout is a
measurement of professional expressions. Authentic is the level of personal thinking
and disclosure. CogProc represents the number of cognitive process words. Percept
is the number of perceptual process words. All of these 5 variables influence review
helpfulness. PosEmo and NegEmo are sentiment features, which have been discussed
in [15, 19–21]; both of them affect review helpfulness. Other explanatory variables are
WC, the number of words in a review [3, 15], and WPS, the average number of words
in each sentence, both of which have been shown to be useful in determining helpfulness
in prior research [12, 16, 17], and Compare, the number of comparison words
like "greater" and "higher". Reviews with comparison words appear more objective
to customers, and thus this term affects the helpfulness score. The last explanatory
variable is the star rating, as a metadata variable [3]. Compared to the previous model,
which contains only 3 explanatory variables [3], this model with 11 variables is more
appropriate for measuring review helpfulness. However, this model is still limited
because its variables are too simple compared to the real situation. In the real
world, review helpfulness is determined by many factors, and most of them may
not even be measurable by a simple numerical value.

In the past few years, researchers have tended toward using machine learning based
techniques, instead of traditional statistical based methods, for extracting features. The
study in [22] used a deep learning technique to measure the helpfulness of hotel reviews
with user-provided photos. In [23], a convolutional neural network was applied to
extract information from product reviews with auxiliary domain discriminators.

2.2 BERT
BERT is a recently developed, fine-tuning based language representation model. It
obtains state-of-the-art results on many NLP tasks such as the General Language
Understanding Evaluation (GLUE) benchmark, Natural Language Inference (NLI), and
the Corpus of Linguistic Acceptability (CoLA) [8]. The basic idea of BERT is to pre-train
the language model on large-scale corpora based on a transformer model [24]. The model
is trained bidirectionally through multiple layers. Compared to other language models
[25, 26], BERT is adapted to different end tasks through a fine-tuning approach. Thus,
BERT can feed any of a number of different downstream tasks without changing its
pre-trained language model. BERT has two primary model sizes with different
parameters:

BERT-Base: 12 layers, 768 hidden dimensions, 12 self-attention heads and 110 million
total parameters.
BERT-Large: 24 layers, 1024 hidden dimensions, 16 self-attention heads and 340
million total parameters.

Compared to other machine learning based methods, BERT has the best performance
in many different downstream task domains [8], and almost all of them are
related to feature extraction. Feature extraction is the most important task in review
helpfulness analysis. Previous studies showed that linguistic features and psychological
features matter in influencing review helpfulness, and those features are difficult to
measure with a limited number of explanatory variables. In this research, BERT is used
as a front-end model to extract linguistic, psychological, and other features that are
passed to the downstream task of determining review helpfulness.

3 Model with BERT Features


3.1 Data Collection and Pre-processing
In this study, 2 datasets are generated for the experiment. The first dataset is for testing
the model. It is collected by Amazon.com and is downloadable at
https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt.
It contains 2 product categories (cameras and video games) with different product
types (search goods or experience goods) that were used in prior research [13, 27, 28].
The original data has 15 columns, and this model focused on 4 factors, which are listed
below:

Star rating: The 1 to 5 star rating of the review.


Helpful votes: Number of helpful votes the review received.
Total votes: Number of total votes the review received. This is equal to the number of
helpful votes plus the number of not helpful votes.
Review body: The review text.

5000 reviews of each category are selected from the original data set by letting the
total number of votes be a control variable to filter the reviews. The helpfulness score
denoted by s is a percentage defined as follows.
$$s = \frac{\text{helpful votes}}{\text{total votes}} \qquad (1)$$
If the number of total votes is small, the helpfulness score can be more unstable
than one with a large number of total votes, and thus the helpfulness score with a
small number of total votes could be biased.
10000 reviews were selected randomly (from those where the number of total votes
was greater than 30) from 2 different categories as listed in Table 1.

Table 1. Number of data of each category

           Camera  Video game
Data size  5000    5000

After filtering the data, the extracted data is divided into three subsets: 80% of the
data as the training set, 10% as the development set and another 10% as the testing set.
Therefore, for each category, there are 4000 reviews to train the regression model, 500
reviews to optimize the parameters of the model (development set), and 500 reviews for
testing. The second dataset is used for comparison with previous research, and will
be discussed in Sect. 4.2.
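A minimal sketch of this data preparation step is shown below. The column names (total_votes, helpful_votes) follow the public Amazon reviews TSV release and are assumptions of this sketch rather than something taken from the paper.

```python
# Sketch of the filtering, scoring and 80/10/10 split described above.
import pandas as pd


def prepare(df: pd.DataFrame, min_votes: int = 30, n_per_cat: int = 5000,
            seed: int = 0):
    df = df[df["total_votes"] > min_votes].copy()           # control-variable filter
    df["score"] = df["helpful_votes"] / df["total_votes"]   # helpfulness score, Eq. (1)
    df = df.sample(n=min(n_per_cat, len(df)), random_state=seed)
    train = df.sample(frac=0.8, random_state=seed)           # 80% training
    rest = df.drop(train.index)
    dev = rest.sample(frac=0.5, random_state=seed)           # 10% development
    test = rest.drop(dev.index)                              # 10% testing
    return train, dev, test
```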

3.2 Neural Network Based Model


In this study, the proposed NN based model is formed by combining the BERT
pre-trained model with one additional output layer to predict the helpfulness score. The
regression model contains 3 main terms: BERT features, star rating, and product type.
The BERT features are obtained from the BERT pre-trained model as vectors [8].
Though the vectors are difficult to explain and understand directly, they are data driven
outcomes and hence the BERT features are reliable. The star rating can be extracted
from the data directly, and the product type is determined by previous research
[13, 27, 28]. The model is designed as:

$$s = f(w^T x + b) \qquad (2)$$

Here $s$ is the helpfulness score, $f$ is the ReLU activation function, $x$ is the input vector
and $w$ is the corresponding weight vector, where the lengths of $x$ and $w$ are determined
by the length of the input. $b$ is the bias, and it is a single value since there is only one
output, the predicted helpfulness score. Equation (2) can be written in detail as:

$$s = f(w_1^T\, Features_{BERT} + w_2^T\, StarRating + w_3^T\, ProductType + b) \qquad (3)$$

where $w_1$, $w_2$ and $w_3$ are weight vectors corresponding to the specific terms.


The loss function of this NN based model is the mean squared error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n} (s_i - \hat{s}_i)^2 \qquad (4)$$

where $s_i$ represents the real helpfulness score from the original dataset, $\hat{s}_i$ is the
corresponding predicted helpfulness score, and $n$ is the number of reviews in the dataset.
The inputs of the NN are the elements of $x$. Each element of $x$ is multiplied by its
corresponding weight in the hidden layer, and the bias $b$ is added. The output of the NN
is the summation of the outputs that come from the activation function, as shown in Fig. 1.

Fig. 1. Regression process in neural network
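As an illustration of Eqs. (2)-(4), the sketch below builds the single-output regression head in Keras. It is a stand-in under stated assumptions (a 768-dimensional BERT-Base feature vector concatenated with the star rating and the binary product type), not the authors' released code.

```python
# Sketch of the output layer in Eqs. (2)-(4); Keras is used here as an
# illustrative stand-in for the authors' TensorFlow implementation.
import tensorflow as tf


def build_head(bert_dim: int = 768) -> tf.keras.Model:
    x = tf.keras.Input(shape=(bert_dim + 2,))            # [BERT features, star, type]
    s = tf.keras.layers.Dense(1, activation="relu")(x)   # s = ReLU(w^T x + b), Eq. (2)
    model = tf.keras.Model(x, s)
    model.compile(optimizer="adam", loss="mse")          # MSE loss, Eq. (4)
    return model


# Hypothetical usage, assuming the feature matrix has already been assembled:
# x = np.hstack([bert_features, star_rating[:, None], product_type[:, None]])
# build_head().fit(x, scores, validation_data=(x_dev, scores_dev),
#                  epochs=3, batch_size=16)
```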

3.3 Algorithm Procedure


The program is based on the original BERT release. The source code is downloadable
at https://github.com/google-research/bert#fine-tuning-with-bert.

Data Pre-processing. The files are generated for testing and for the comparison. The
process is listed below:

Step 1: Select data matching the condition, for example, that the number of total votes is
greater than 30.
Step 2: Convert the data file to a .tsv file.
Step 3: Randomly divide the data into training, development and test sets.
Step 4: Repeat Step 3 until enough data is collected.

Neural Network Incorporating BERT. The general steps to run the code on Tensor-
Flow are:

Step 1: Extract the BERT features from each dataset.
Step 2: Combine the BERT features with star rating and product type as the input to
the NN.
Step 3: Train the model in NN using training data and optimize the parameters with
the development set.
Step 4: Test the model and calculate the error.
Step 5: Repeat Steps 1 to 4 on 10 random data subsets, and calculate the average
error and standard deviation.
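Steps 1 to 5 could be wired together as in the following sketch, which averages the test error over repeated random splits. Here extract_bert_features is a hypothetical placeholder for the feature-extraction step of the released BERT code (assumed to return the concatenated input matrix), and prepare and build_head are the earlier sketches.

```python
# Sketch of Steps 1-5: repeat the split/extract/train/test cycle and report the
# average error and its standard deviation over the runs.
import numpy as np


def evaluate(df, n_runs: int = 10):
    errors = []
    for seed in range(n_runs):
        train, dev, test = prepare(df, seed=seed)          # random 80/10/10 split
        # extract_bert_features is a hypothetical stand-in for the BERT
        # feature-extraction step; assumed to return the concatenated inputs.
        x_tr, x_dev, x_te = (extract_bert_features(d) for d in (train, dev, test))
        model = build_head()
        model.fit(x_tr, train["score"].to_numpy(),         # train and tune
                  validation_data=(x_dev, dev["score"].to_numpy()),
                  epochs=3, batch_size=16, verbose=0)
        pred = model.predict(x_te, verbose=0).ravel()
        errors.append(np.mean((pred - test["score"].to_numpy()) ** 2))  # per-run MSE
    return float(np.mean(errors)), float(np.std(errors))   # average and std. dev.
```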

4 Results
4.1 Prediction
The BERT pre-trained model was fine-tuned to extract BERT features by setting the
values of hyper parameters. The model was tested on a GeForce GTX 1080 GPU by
adjusting the hyper parameters to avoid out-of-memory issues. The hyper parameters
used are listed in Table 2. The regression model was tested 10 times for each category
with random subsets (training, development, and testing).

Table 2. Adjusted fine-tuning hyper parameters

Hyper parameter    Value
max_seq_length     144
train_batch_size   16
Model type         BERT-Base
Optimizer          Adam

Table 3. Prediction MSE of Video Game products

Data set   MSE (NN)
1          0.05619
2          0.05368
3          0.05824
4          0.06105
5          0.06281
6          0.05478
7          0.05458
8          0.04820
9          0.05474
10         0.05135
Average    0.05556
Std. dev.  0.00432

Table 4. Prediction MSE of Camera products

Data set   MSE (NN)
1          0.03841
2          0.03214
3          0.03178
4          0.03815
5          0.03403
6          0.03828
7          0.04468
8          0.03478
9          0.04041
10         0.03623
Average    0.03689
Std. dev.  0.00395

First, the model was trained and optimized with the training set and development set;
then the helpfulness scores were predicted for the test set. The prediction results of each
test are listed in Tables 3 and 4. From the results, it can be seen that the BERT features
worked well as variables in the regression model. The MSE differs across product
categories. A comparison of these results with those from an explanatory variable based
regression model is included in the next section.

4.2 Comparison

In this section, the exact same data as in [4] is used for comparison, and the best results
(best average MAE and smallest standard deviation) in that paper are compared.
is collected from http://jmcauley.ucsd.edu/data/amazon/ and reviews with more than 10
total votes are selected, to match what that paper did (this data is different from the
testing data used in Sect. 3). Ten different data sets are generated randomly by using
these selected data. For each data set, 80% is chosen as the training set, 10% as the
development set, and another 10% as the test set. There are 520 reviews for each test
of Cellphone products and 836 reviews for each test of Beauty products, as was done
in [4], and the proposed model does not contain the product type (experience or search
goods) for this comparison. The mean absolute error (MAE) is given as follows.
$$MAE = \frac{1}{n}\sum_{i=1}^{n} |s_i - \hat{s}_i| \qquad (5)$$

The results of comparison are listed in Tables 5 and 6 and Tables 7 and 8, respectively.

Table 5. MAE of test dataset (Cellphone products)

Data set   MSE (NN)
1          0.05619
2          0.05368
3          0.05824
4          0.06105
5          0.06281
6          0.05478
7          0.05458
8          0.04820
9          0.05474
10         0.05135
Average    0.05556
Std. dev.  0.00432

Table 6. Test results in Park's paper (Cellphone products)

Data set   MAE (SVR)   MAE (M5P)
1          11.1354     12.1070
2          11.5486     12.5098
3          10.4736     11.7598
4          10.3240     10.6372
5          11.8700     12.2581
6          11.5203     11.8930
7          11.9216     12.4486
8          12.1916     11.8303
9          13.3505     12.2984
10         12.9721     12.6732
Average    11.7308     12.0415
Std. dev.  0.9178      0.5495

Table 7. MAE of test dataset (Beauty products)

Data set   MAE (NN)
1          11.2604
2          12.0296
3          11.0899
4          11.9005
5          11.5507
6          11.4561
7          11.1970
8          10.5708
9          11.9430
10         11.5520
Average    11.3550
Std. dev.  0.4382

Table 8. Test results in Park's paper (Beauty products)

Data set   MAE (SVR)   MAE (RandF)
1          11.6135     11.9215
2          12.2497     12.6919
3          13.1854     13.2427
4          11.9664     12.2743
5          10.9451     11.5257
6          11.1959     11.8278
7          11.5863     12.1718
8          10.8911     11.3696
9          11.2281     12.1587
10         12.3415     12.5452
Average    11.7203     12.1729
Std. dev.  0.6862      0.5300

Comparisons of the results are shown in Tables 5 and 6. Although the BERT features
based NN model did not achieve a better MAE on every subset tested, compared to the
best support vector regression (SVR) explanatory variables regression model, the
proposed NN model has a better average MAE across all 10 subsets, with nearly 9%
improvement. The difference in the standard deviations also shows that the NN model is
much more stable than the SVR model across different test datasets. The standard
deviation of this model is even smaller than that of the explanatory variables based M5P
model, which yielded the best standard deviation in Park's results.
Tables 7 and 8 show a similar comparison using another dataset from a different
product category. The proposed NN model still resulted in the best average MAE and
the smallest standard deviation.

5 Conclusion and Future Work


In this study, a BERT features approach was investigated to capture linguistic and
psychological information and incorporate it into an NN model for predicting the
helpfulness scores of online reviews. The comparison with results from a model with a
limited set of explanatory numeric variables in prior research shows that the predictive
results of the proposed model are not only better in terms of MAE, but also yield a
smaller standard deviation. This means that the BERT based NN model is more stable
than models with limited explanatory variables across different product categories.
These results show that the model is reliable in predicting the helpfulness scores of
customer reviews, while avoiding limitations such as the time-consuming process
associated with selecting and extracting explanatory variables. Due to limited computing
power in the test environment, the tests could not yet be run using the best hyper
parameters. Doing so could improve the prediction accuracy attained in the current
data analysis.

Much additional work can be done in the future to improve the prediction accuracy
of this study. First, the model can be run on a TPU server with better hyper parameters
to check for improvements. Second, a post-processing procedure can be considered
to optimize the weights of the BERT features based on the specific product category of
the reviews. Lastly, BERT features can be modeled with more advanced statistical
computing techniques for use in review helpfulness prediction.

Acknowledgment. The authors are indebted to the anonymous reviewers for providing constructive
comments and suggestions which have resulted in improvements to both the readability and quality of
the paper.

References
1. Liu, Y., Huang, X., An, A., Yu, X.: Modeling and predicting the helpfulness of online
reviews. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 443–452. IEEE
(2008)
2. Zhang, Z., Varadarajan, B.: Utility scoring of product reviews. In: Proceedings of the 15th
ACM International Conference on Information and Knowledge Management, pp. 51–57.
ACM (2006)
3. Mudambi, S.M., Schuff, D.: What makes a helpful review? A study of customer reviews on
Amazon.com. MIS Q. 34(1), 185–200 (2010)
4. Park, Y.J.: Predicting the helpfulness of online customer reviews across different product
types. Sustainability 10(6), 1735 (2018)
5. Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: opinion extraction and
semantic classification of product reviews. In: Proceedings of the 12th International Confer-
ence on World Wide Web, pp. 519–528. ACM (2003)
6. Read, J.: Using emoticons to reduce dependency in machine learning techniques for sen-
timent classification. In: Proceedings of the ACL Student Research Workshop, pp. 43–48.
Association for Computational Linguistics (2005)
7. Ye, Q., Zhang, Z., Law, R.: Sentiment classification of online reviews to travel destinations
by supervised machine learning approaches. Expert Syst. Appl. 36(3), 6527–6535 (2009)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
9. Xu, H., Liu, B., Shu, L., Yu, P.S.: BERT post-training for review reading comprehension and
aspect-based sentiment analysis. arXiv preprint arXiv:1904.02232 (2019)
10. Eisend, M.: Two-sided advertising: a meta-analysis. Int. J. Res. Mark. 23(2), 187–198 (2006)
11. Pavlou, P.A., Dimoka, A.: The nature and role of feedback text comments in online market-
places: implications for trust building, price premiums, and seller differentiation. Inf. Syst.
Res. 17(4), 392–414 (2006)
12. Forman, C., Ghose, A., Wiesenfeld, B.: Examining the relationship between reviews and
sales: the role of reviewer identity disclosure in electronic markets. Inf. Syst. Res. 19(3),
291–313 (2008)
13. Nelson, P.: Information and consumer behavior. J. Polit. Econ. 78(2), 311–329 (1970)
14. Ford, G.T., Smith, D.B., Swasy, J.L.: Consumer skepticism of advertising claims: testing
hypotheses from economics of information. J. Consum. Res. 16(4), 433–441 (1990)
15. Pan, Y., Zhang, J.Q.: Born unequal: a study of the helpfulness of user-generated product
reviews. J. Retail. 87(4), 598–612 (2011)
16. Ghose, A., Ipeirotis, P.G.: Estimating the helpfulness and economic impact of product
reviews: mining text and reviewer characteristics. IEEE Trans. Knowl. Data Eng. 23(10),
1498–1512 (2010)
17. Korfiatis, N., Garcı́a-Bariocanal, E., Sánchez-Alonso, S.: Evaluating content quality and
helpfulness of online product reviews: the interplay of review helpfulness vs. review con-
tent. Electron. Commer. Res. Appl. 11(3), 205–217 (2012)
18. Cao, Q., Duan, W., Gan, Q.: Exploring determinants of voting for the helpfulness of online
user reviews: a text mining approach. Decis. Support Syst. 50(2), 511–521 (2011)
19. Kuan, K.K., Hui, K.L., Prasarnphanich, P., Lai, H.Y.: What makes a review voted? An empir-
ical investigation of review voting in online review systems. J. Assoc. Inf. Syst. 16(1), 48
(2015)
20. Yin, D., Bond, S., Zhang, H.: Anxious or angry? Effects of discrete emotions on the perceived
helpfulness of online reviews. MIS Q. 38(2), 539–560 (2014)
21. Ahmad, S.N., Laroche, M.: How do expressed emotions affect the helpfulness of a product
review? Evidence from reviews using latent semantic analysis. Int. J. Electron. Commer.
20(1), 76–111 (2015)
22. Ma, Y., Xiang, Z., Du, Q., Fan, W.: Effects of user-provided photos on hotel review helpful-
ness: an analytical approach with deep learning. Int. J. Hosp. Manag. 71, 120–131 (2018)
23. Chen, C., Yang, Y., Zhou, J., Li, X., Bao, F.S.: Cross-domain review helpfulness prediction
based on convolutional neural networks with auxiliary domain discriminators. In: Proceed-
ings of the 2018 Conference of the North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies (Short Papers), vol. 2, pp. 602–607
(2018)
24. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing
Systems, pp. 5998–6008 (2017)
25. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.:
Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
26. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv
preprint arXiv:1801.06146 (2018)
27. Bragge, J., Storgårds, J.: Utilizing text-mining tools to enrich traditional literature reviews.
case: digital games. In: Proceedings of the 30th Information Systems Research Seminar in
Scandinavia IRIS, pp. 1–24 (2007)
28. Weathers, D., Sharma, S., Wood, S.L.: Effects of online communication practices on con-
sumer perceptions of performance uncertainty for search and experience goods. J. Retail.
83(4), 393–401 (2007)
Evaluating Taxonomic Relationships
Using Semantic Similarity Measures
on Sensor Domain Ontologies

Mireya Tovar Vidal1, Aimee Cecilia Hernández García1,
José de Jesús Lavalle Martínez1, José A. Reyes-Ortiz2,
and Darnes Vilariño Ayala1

1 Faculty of Computer Science, Benemérita Universidad Autónoma de Puebla,
14 sur y Av. San Claudio, C.U., Puebla, Puebla, Mexico
mtovar@cs.buap.mx, hernandez.aimee@outlook.com, jlavallenator@gmail.com, dvilarinoayala@gmail.com
2 Universidad Autónoma Metropolitana,
Av. San Pablo Xalpa 180, Azcapotzalco, 02200 Mexico City, Mexico
jaro@correo.azc.uam.mx

Abstract. The importance of sensors nowadays comes from the boom
of the Internet of Things. Sensors continuously produce a mass of heterogeneous
data, and just like the data produced on the web, sensor data lack
semantic information. This problem can be overcome with semantic web
technologies by designing ontologies that provide a semantic structure for
sensor data as well as machine readable data, improving interoperability.
Those ontologies must be evaluated to verify their semantic quality,
and this is where semantic similarity plays its role. Semantic similarity
is a metric used to quantify the degree of similarity between two concepts
in an ontology. In this research, we propose a system which evaluates
taxonomic relationships in ontologies using semantic similarity, through
an algorithm and the accuracy measure. The applied semantic similarity
measures are classified into four categories: structure-based, feature-based,
information content and hybrid measures. We evaluate sensor domain
ontologies using these semantic similarity measures and obtain promising
results in the evaluation of the taxonomic relationships.

Keywords: Semantic similarity · Taxonomic relationships · Domain ontologies

1 Introduction
The information on the current web grows day by day, and much of this information
is generated without a structure that can be understood by both machines
and humans, which makes it difficult to process.


The semantic web, proposed by Berners-Lee [3], seeks to give structure and
knowledge to the conventional web. Ontologies are used to represent knowledge
in a structured way on the semantic web.
An ontology is defined as "an explicit specification of a conceptualization"
[9]. Conceptualization refers to an abstract model of some phenomenon of the
world; the specification is explicit because the concepts and constraints used are
explicitly defined. An ontology is thus understood as a formal model that describes
a particular domain and specifies the concepts within the domain and the
relationships between them.
Ontologies in the domain of sensor networks are mainly used to have an
accepted language that represents the definitions of sensors, properties, tax-
onomies and thus to improve data fusion and interoperability [2].
They are also used as a dictionary of terms and for the analysis of observed
data. It is for these reasons that several researchers have designed ontologies for
the semantic representation of sensor networks.
The evaluation of an ontology is an essential part of its construction process.
Whether the ontology has been designed manually or automatically, it is
necessary to evaluate its quality.
In this area, the proposals in the literature are classified according to the
form of evaluation used: comparing the ontology with a gold standard,
applying the ontology in an application and evaluating the results, comparing it
against source data of the ontology's domain, and finally evaluation by
humans to determine which established criteria the ontology satisfies [4]. On
the other hand, Gómez-Pérez [8] presents two terms for ontology evaluation:
verification and validation. Verification ensures that the definitions meet the
requirements correctly. Validation ensures that the meaning of the definitions
correctly models the phenomena of the world.
There are tools for the design of ontologies that include the use of reasoners
for their evaluation, such as those integrated with the Protégé software1. The
reasoners allow evaluating the consistency criterion of an ontology.
If the ontology is large, the evaluation becomes an exhaustive task that
requires more man-hours from the engineer or knowledge expert. Given this
problem, automatic methods have been proposed to carry out the evaluation of
ontologies. One aspect that is considered in the evaluation of ontologies and that
can be automated is the verification of taxonomic relationships. Semantic
similarity is used to measure the semantic closeness of one concept to another
using a mathematical metric for its calculation.
In this work, semantic similarity measures are applied to evaluate the
taxonomic relationships of ontologies in the sensor or IoT domain. The aim of the
experiment is to decide whether the evaluation of taxonomic relationships using
similarity measures is an effective method to evaluate the quality of ontologies in
this domain. For this experiment, four ontologies are used: the MMI Device
Ontology [22], OntoSensor [10], the CSIRO Sensor Ontology [17], and Intelligent
Environment [5, 6].
1 https://protege.stanford.edu/.

The first three ontologies are based on the SSN ontology2, an ontology created
by the W3C Semantic Sensor Network Incubator group (SSN-XG) that
describes sensors and their observations. The Intelligent Environment ontology
was considered because it is an ontology for the semantic representation of
objects, people and interactions that occur in academic environments enabled
by IoT. All are available for use.
The content of the present paper is organized as follows. In Sect. 2, the state
of the art on semantic similarity measures based on information content is briefly
discussed. Section 3 shows the similarity measures used for evaluating taxonomic
relationships. Section 4 shows the algorithm to evaluate taxonomic relationships
in domain ontologies. In Sect. 5, the data set considered and the evaluation of our
approach are presented. Finally, conclusions and future work are presented in
Sect. 6.

2 Related Work
Some works related to this research are presented below.
In [7], an ontology is designed and implemented to describe the concepts
and relationships of data in sensor networks. This ontology aims to improve the
accuracy and recall of the sensor data sought. After constructing the proposed
ontology, an experimental evaluation was carried out using two tests: a
subsumption test and a consistency check. For this task, the RacerPro reasoner
was used to verify the logical consistency of the ontology.
In [12], a tool was designed to allow ontologies to be validated against the
concepts and properties used in the SSN (Semantic Sensor Network) model of
W3C. It also generates validation reports and collects statistics regarding the
terms or concepts most used in ontologies. The authors used the tool to validate
a set of ontologies in which the SSN ontology of the W3C was used as a basis, the
evaluation includes discovering noise errors, inconsistency, checking syntax and
similarity between the concepts of SSN ontology of W3C and the concepts of the
other ontologies. This tool can help developers identify which parts of the SSN ontology are most used and thus focus on the most commonly used functions to create tools that improve interoperability between systems or applications in the sensor or IoT domain.
In [1], an ontology is proposed to allow semantic interoperability in several
fragmented test benches in the IoT domain. This ontology reuses basic concepts
of several popular ontologies and taxonomies, such as SSN, M3-Lite, WGS84,
IoT-Lite, and DUL. The ontology is supported by annotation of references and a
validation tool called Annotation Validator Tool (AVT), as well as best practices
and guides. AVT tries to adopt the SSN validator by applying the necessary
changes and checking semantic and syntactic problems together, compared to
the SSN validator that only makes a syntactic validation. The elements that
AVT reviews are: the inheritance relations of classes and properties, cardinality,
and unexpected domains and ranges.
2 https://www.w3.org/2005/Incubator/ssn/ssnx/ssn.

In this paper, an algorithm to evaluate ontologies is presented; it evaluates the taxonomic relations between pairs of concepts in sensor domain ontologies. This is done with different similarity measures computed over the structure of the ontology, together with the accuracy measure. The main goal is to give an automatic quality judgment of the domain ontologies.

3 Semantic Similarity Measures

A system is used to evaluate the taxonomic relationships of sensor domain ontologies. It uses knowledge-based semantic similarity measures, that is, measures that consider only the structure of the ontologies, without the need for a textual corpus for the evaluation.
These knowledge-based measures have categories that depend on the char-
acteristics they consider. These categories are presented below:

– Structure-based measures: estimate the similarity according to the distance of two concepts in an ontology. The measures of Rada (1), Wu and Palmer
(2), and Li (3) are in this category.
– Measures based on characteristics: these measures consider the set of ances-
tors of each concept compared. The measures of Maedche and Staab (4),
Rodriguez and Egenhofer (5), and Sánchez (6) are in this category.
– Measures based on information content: these measures estimate the similar-
ity of the pair of concepts depending on the amount of information they share
or differ. The measures of Resnik (8), Lin (9), and Jiang and Conrath (10)
are in this category.
– Hybrid measures: these types of measures are a combination of any of the
categories mentioned above. The measures of Zhou (13) and Razan et al. (14)
are in this category.

The similarity measures used by the system to calculate the similarity of the
concepts that integrate semantic relations are detailed below.

Rada [19] proposed the semantic similarity measure Path, which uses the shortest path (length) between two concepts $c_u$ and $c_v$ in a taxonomy.

$$sim_{path}(c_u, c_v) = \frac{1}{1 + length(c_u, c_v)} \qquad (1)$$

Wu and Palmer [26] defined the measure of Wu and Palmer that measures the
similarity of two concepts cu and cv using the shortest distance of each concept
cu and cv with the root concept and the distance of the lca (least common
ancestor) of each concept cu and cv with the root concept.

$$sim_{wup}(c_u, c_v) = \frac{2 \cdot depth(c_{lca})}{depth(c_u) + depth(c_v)} \qquad (2)$$

Li [13] formulated Eq. 3, where the depth of both concepts $c_u$ and $c_v$ and the lca ($c_{lca}$) of the concepts are used to calculate their similarity.

$$sim_{li}(c_u, c_v) = e^{-\alpha\, length(c_u, c_v)} \cdot \frac{e^{\beta\, depth(c_{lca})} - e^{-\beta\, depth(c_{lca})}}{e^{\beta\, depth(c_{lca})} + e^{-\beta\, depth(c_{lca})}} \qquad (3)$$
Where α is a parameter that contributes to the length of the path and β
is the parameter for the depth of the path. According to the work of [13], the
optimal parameter for α is 0.2 and for β is 0.6.
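For illustration, the three structure-based measures can be computed directly from a taxonomy represented as a child-to-parent map. The sketch below is only a minimal illustration under that assumption (the experiments in this paper use the SML library instead); the toy PARENT taxonomy and the helper functions are hypothetical.

```python
import math

# Hypothetical toy taxonomy, given as child -> parent (the root maps to None).
PARENT = {
    "Thing": None,
    "Device": "Thing",
    "Sensor": "Device",
    "TemperatureSensor": "Sensor",
    "Actuator": "Device",
}

def ancestors(c):
    """Chain [c, parent(c), ..., root] for concept c."""
    chain = [c]
    while PARENT[chain[-1]] is not None:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(c):
    """Number of edges from c to the root."""
    return len(ancestors(c)) - 1

def lca(u, v):
    """Least common ancestor: the deepest concept shared by both ancestor chains."""
    anc_v = set(ancestors(v))
    return next(a for a in ancestors(u) if a in anc_v)

def length(u, v):
    """Shortest path between u and v, going through their lca."""
    a = lca(u, v)
    return (depth(u) - depth(a)) + (depth(v) - depth(a))

def sim_path(u, v):                      # Eq. (1), Rada et al.
    return 1.0 / (1.0 + length(u, v))

def sim_wup(u, v):                       # Eq. (2), Wu and Palmer
    return 2.0 * depth(lca(u, v)) / (depth(u) + depth(v))

def sim_li(u, v, alpha=0.2, beta=0.6):   # Eq. (3); the fraction equals tanh(beta * depth(lca))
    return math.exp(-alpha * length(u, v)) * math.tanh(beta * depth(lca(u, v)))

print(sim_path("TemperatureSensor", "Actuator"),
      sim_wup("TemperatureSensor", "Actuator"),
      sim_li("TemperatureSensor", "Actuator"))
```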

Maedche and Staab. The measure proposed in [15], based on the Jaccard index,
considers the characteristics or set of ancestors of a concept.

$$sim_{CMatch}(u, v) = \frac{|A(u) \cap A(v)|}{|A(u) \cup A(v)|} \qquad (4)$$

Rodriguez and Egenhofer. The measure proposed by [21] is a normalization of the model of [25] using set-theoretic operations, such as the intersection (∩) and the difference (\).

$$sim_{RE}(u, v) = \frac{|A(u) \cap A(v)|}{\gamma\,|A(u) \setminus A(v)| + (1 - \gamma)\,|A(v) \setminus A(u)| + |A(u) \cap A(v)|} \qquad (5)$$
Where γ is a parameter to adjust the symmetry of this measure, γ in [0, 1],
A(u) is the set of ancestors of the concept u, and A(v) is the set of ancestors of
the v concept.

Sánchez. A measure proposed in [24]. In this measure the taxonomic distance of two concepts is calculated as the sum of the cardinalities of the differing feature sets of both concepts, divided by the sum of the cardinalities of the differing and common taxonomic feature sets. This measure is shown in Eq. 6. The result is in the range [0, 1].

$$dist_{SAN}(u, v) = \log_2\left(1 + \frac{|A(u) \setminus A(v)| + |A(v) \setminus A(u)|}{|A(u) \setminus A(v)| + |A(v) \setminus A(u)| + |A(u) \cap A(v)|}\right) \qquad (6)$$
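Under the same toy-taxonomy assumptions as the previous sketch (the ancestors helper and PARENT map are hypothetical, and whether a concept is included in its own feature set A(c) is a modelling choice), the three feature-based measures reduce to simple set operations:

```python
import math

def feature_set(c):
    """A(c): here taken as the set of ancestors of c, including c itself."""
    return set(ancestors(c))             # ancestors() from the previous sketch

def sim_cmatch(u, v):                    # Eq. (4), Maedche and Staab (Jaccard index)
    A, B = feature_set(u), feature_set(v)
    return len(A & B) / len(A | B)

def sim_re(u, v, gamma=0.5):             # Eq. (5), Rodriguez and Egenhofer
    A, B = feature_set(u), feature_set(v)
    inter = len(A & B)
    return inter / (gamma * len(A - B) + (1 - gamma) * len(B - A) + inter)

def dist_san(u, v):                      # Eq. (6), Sanchez et al.; a distance in [0, 1]
    A, B = feature_set(u), feature_set(v)
    diff = len(A - B) + len(B - A)
    return math.log2(1 + diff / (diff + len(A & B)))
```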
The following measures use information content, specifically the formulation proposed by Zhou, which is based on the depth of the structure and the number of descendants of the concept under consideration to compute its information content. The formula is defined in Eq. (7).

$$iIC_{Zhou}(c) = k\left(1 - \frac{\log(|D(c)|)}{\log(|C|)}\right) + (1 - k)\,\frac{\log(depth(c))}{\log(depth(GT))} \qquad (7)$$

Where $|C|$ is the number of concepts defined in the taxonomy, $D(c)$ are the descendants of the concept $c$, $depth(c)$ is the depth of the concept $c$, $depth(GT)$ is the maximum depth of the taxonomy, and $k \in [0, 1]$ is a parameter to adjust the weight of the two terms in the equation; here $k = 0.6$. In the following definitions of semantic similarity, IC refers to $iIC_{Zhou}$.

Resnik. In [20] the semantic similarity between a pair of concepts (u, v) is


computed by taking the set of common ancestors for concepts u and v, then IC
is computed for each member in that set. Finally, the similarity is defined as the
maximum IC (MICA, Most Informative Common Ancestor).

$$sim_{Resnik}(u, v) = IC(MICA(u, v)) \qquad (8)$$

Lin. Lin [14] proposed a measure based on the work of Resnik [20], but Lin also
took into account the information content of the pair of concepts (u, v). The
measure is defined in Eq. (9).

$$sim_{Lin}(u, v) = \frac{2 \cdot IC(MICA(u, v))}{IC(u) + IC(v)} \qquad (9)$$

Jiang and Conrath. The semantic similarity measure defined in [11] is derived
from the proposal based on edges, adding information content to its decision
factor. To compute this similarity the following is used: the information content
of both, the parent concept and the child concept; the shortest path between
concepts u and v; and the MICA of them. This similarity is defined in Eq. (10).

$$dist_{JC}(u, v) = IC(u) + IC(v) - 2 \cdot IC(MICA(u, v)) \qquad (10)$$


In the work of [23] the Eq. 10 is normalized and a linear transformation is
applied to transform it into a similarity function, being defined as shown in
Eq. 11.

$$sim_{JC}(u, v) = 1 - \frac{IC(u) + IC(v) - 2 \cdot IC(MICA(u, v))}{2} \qquad (11)$$

Mazandu and Mulder. This measure is proposed in [16]; the semantic similarity between two concepts is obtained by computing the IC of their MICA and dividing it by the maximum IC of the two concepts, as shown in Eq. (12).

$$sim_{Maz}(u, v) = \frac{IC(MICA(u, v))}{\max(IC(u), IC(v))} \qquad (12)$$
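Continuing the same toy setting, and assuming the taxonomy is a tree so that the MICA of two concepts is the IC of their lca, the information-content measures can be sketched as follows; the small guards against log(0) for leaves and for the root are additions of this sketch, not part of the original formulas.

```python
import math

CONCEPTS = list(PARENT)                       # taxonomy from the first sketch
MAX_DEPTH = max(depth(c) for c in CONCEPTS)

def descendants(c):
    """D(c): concepts that have c on their ancestor chain (c itself excluded)."""
    return [x for x in CONCEPTS if x != c and c in ancestors(x)]

def ic_zhou(c, k=0.6):                        # Eq. (7), with +1 / max() guards added here
    d_term = math.log(len(descendants(c)) + 1) / math.log(len(CONCEPTS))
    depth_term = math.log(max(depth(c), 1)) / math.log(max(MAX_DEPTH, 2))
    return k * (1 - d_term) + (1 - k) * depth_term

def ic_mica(u, v):
    """Most informative common ancestor; in a tree this is the lca."""
    return ic_zhou(lca(u, v))

def sim_resnik(u, v):                         # Eq. (8)
    return ic_mica(u, v)

def sim_lin(u, v):                            # Eq. (9)
    return 2 * ic_mica(u, v) / (ic_zhou(u) + ic_zhou(v))

def sim_jc(u, v):                             # Eq. (11), normalized Jiang-Conrath
    return 1 - (ic_zhou(u) + ic_zhou(v) - 2 * ic_mica(u, v)) / 2

def sim_maz(u, v):                            # Eq. (12), Mazandu and Mulder
    return ic_mica(u, v) / max(ic_zhou(u), ic_zhou(v))
```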

Zhou. This hybrid measure, proposed by Zhou et al., combines the proposal based on the information content of [11] with the calculation of the shortest distance of the structure-based measures.

$$sim_{Zhou}(u, v) = 1 - k \cdot \frac{\log(length(u, v) + 1)}{\log(2 \cdot maxDepth(c) - 1)} - (1 - k) \cdot \frac{dist_{JC}(u, v)}{2} \qquad (13)$$

Razan et al. The HSS3 measure proposed in [18] introduces smoothing param-
eters such as the constant K, where K = 1/N D (N D is the total number of
disorders in the original proposal, in this case, it was taken as the total number
of taxonomic relationships). This measure also combines the proposal of mea-
sures based on information content and measures based on structure. In Eq. 14
this measure is defined.
$$sim_{HSS3}(u, v) = \frac{L_D \cdot (IC(MICA(u, v)) + K)}{length(u, c_{lcs}) + length(v, c_{lcs})} \qquad (14)$$
The HSS4 measure also proposed in [18] is similar to the HSS3, but with
the difference that this measure also considers the information content of each
concept compared. This measure is defined in Eq. 15.
$$sim_{HSS4}(u, v) = \frac{L_D \cdot \left(IC(MICA(u, v)) + \frac{IC(u) \cdot IC(v)}{IC(u) + IC(v)}\right)}{length(u, c_{lcs}) + length(v, c_{lcs})} \qquad (15)$$

4 Proposed Algorithm

The proposed algorithm uses each category of semantic similarity measure: based on structure, based on characteristics, based on information content, and hybrid. The algorithm steps are presented below.

1. Preprocessing. In this phase, through the use of Apache Jena, the concepts and
taxonomic relationships of the input ontology in OWL format are extracted.
2. Semantic Similarity Computation. For each pair of concepts (u, v) the com-
putation of the semantic similarity is applied using the Semantic Measure
Library3 (SML).
3. Thresholds Computation. The computation of thresholds is obtained by per-
forming an average of the similarity results obtained for each similarity mea-
sure applied. RT refers to the total number of taxonomic relationships in the
ontology.

$$thld_{sim_i} = \frac{\sum_{(u,v)} res_{sim_i}(u, v)}{RT} \qquad (16)$$
The average threshold is obtained by applying Eq. 17, where N is the total
of the measures of the category used, for example, if the similarity measures
based on the structure are being used, N = 3 which are the three structure-
based measures used in this work.

$$thld_{avg} = \frac{\sum_{i} thld_{sim_i}}{N} \qquad (17)$$

3 http://www.semantic-measures-library.org/sml/.

4. Evaluation. This phase is the most important of the algorithm.


The concepts u and v of the ontology and the thresholds are used in the evaluation process. If $res_{sim_i}(u, v) \geq thld_{sim_i}$, then the pair forms a taxonomic relation (a correct case); otherwise, it is an incorrect case. Next, the accuracy metric is used to evaluate the performance of the proposed system. The equation of accuracy is shown in Eq. (18).

$$Accuracy = \frac{\text{Quantity of Correct Cases}}{\text{Total of Cases}} \qquad (18)$$
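Steps 2-4 can be summarised in a short routine. The sketch below assumes a dictionary of similarity functions (for example those from the earlier sketches) and a list of the taxonomic relation pairs extracted in the preprocessing step; the names are illustrative and this is not the authors' implementation.

```python
def evaluate_ontology(pairs, measures):
    """pairs: list of (u, v) taxonomic relations; measures: name -> similarity function.

    Returns per-measure thresholds (Eq. 16), the average threshold (Eq. 17) and
    the per-measure accuracy (Eq. 18), where a pair counts as a correct case when
    its similarity reaches the threshold of that measure.
    """
    rt = len(pairs)                                                   # RT
    scores = {name: [f(u, v) for u, v in pairs] for name, f in measures.items()}
    thresholds = {name: sum(s) / rt for name, s in scores.items()}    # Eq. (16)
    thld_avg = sum(thresholds.values()) / len(measures)               # Eq. (17)
    accuracy = {name: sum(v >= thresholds[name] for v in s) / rt      # Eq. (18)
                for name, s in scores.items()}
    return thresholds, thld_avg, accuracy

# Example with the structure-based measures of the first sketch:
pairs = [("TemperatureSensor", "Sensor"), ("Sensor", "Device"), ("Actuator", "Device")]
print(evaluate_ontology(pairs, {"path": sim_path, "wup": sim_wup, "li": sim_li}))
```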

5 Experimental Results

In this section, the results obtained by applying the approach to the ontolo-
gies of the sensor domain are presented. The proposed algorithm was imple-
mented in the Python programming language. Semantic similarity measures based on features and hybrid measures were implemented in the Java programming language using the SML library (https://www.semantic-measures-library.org/sml/); the measures based on structure and those based on information content are already implemented in this library.
Table 1 shows the total number of classes, the number of taxonomic relation-
ships (TR) and the maximum depth of each ontology.

Table 1. Sensors ontologies

Ontology Classes TR Depth


Intelligent Environment 81 27 4
Onto Sensor 289 252 12
Sensor Ontology 71 64 5
Device Ontology 56 40 5

5.1 Results of Similarity Measures Based on Structure

This subsection presents the results obtained with the structure-based measures, computed with the SML library by applying Eqs. 1, 2 and 3 to the ontologies in the sensor domain. Table 2 shows the average threshold, calculated using Eq. 17, and the thresholds obtained for each measure and ontology by applying Eq. 16.
Table 3 shows the results obtained with the measure of accuracy, and each
measure of similarity applied to each ontology. As noted, the ontology with the
highest value in the overall average of the approach is Intelligent Environment
followed by Onto Sensor.

Table 2. Thresholds for sensor domain ontologies using structure-based measures

Ontology Path WUP Li Average Threshold


Intelligent Environment 0.500 0.775 0.632 0.636
Onto Sensor 0.500 0.794 0.668 0.654
Sensor Ontology 0.500 0.442 0.355 0.433
Device Ontology 0.500 0.757 0.601 0.619

Table 3. Results of the accuracy measure obtained with each structure-based semantic
similarity measure

Ontology Path WUP Li General Average


Intelligent Environment 1.0 0.852 0.852 0.852
Onto Sensor 1.0 0.755 0.755 0.755
Sensor Ontology 1.0 0.578 0.578 0.578
Device Ontology 1.0 0.600 0.600 0.600

5.2 Results of Similarity Measures Based on Characteristics


This subsection shows the results obtained for the measures based on charac-
teristics that take into account the ancestors of the concepts. Table 4 shows the
thresholds obtained for each measure of similarity for these ontologies.

Table 4. Thresholds for sensor domain ontologies using measures based on character-
istics

Ontology CMatch RE SAN Average Threshold


Intelligent Environment 0.752 0.858 0.318 0.643
Onto Sensor 0.786 0.875 0.273 0.645
Sensor Ontology 0.634 0.769 0.444 0.616
Device Ontology 0.726 0.840 0.348 0.638

Table 5 shows the results obtained with the measure of accuracy, and each
measure of similarity applied to each sensor ontology.
For this type of similarity measure applied to the sensor ontologies, the ontology with the highest accuracy value is again Intelligent Environment, followed by Device Ontology.

5.3 Results of Similarity Measures Based on Information Content


In the category of measures based on information content, the information on
each concept is taken into account (depth of the concept, depth of the taxon-
omy, etc.). The results obtained by the proposed system are presented in this
subsection.

Table 5. Results of the accuracy measure obtained with each semantic similarity
measure based on characteristics

Ontology CMatch RE SAN General Average


Intelligent Environment 0.296 0.296 0.704 0.852
Onto Sensor 0.566 0.566 0.434 0.570
Sensor Ontology 0.578 0.578 0.422 0.578
Device Ontology 0.600 0.600 0.400 0.600

Table 6 shows the thresholds obtained for each measure of similarity for these
ontologies.

Table 6. Thresholds for sensor domain ontologies using measures based on information
content

Ontology Lin Maz JC Res Average Threshold


Intelligent Environment 0.864 0.764 0.895 0.698 0.806
Onto Sensor 0.842 0.750 0.908 0.592 0.773
Sensor Ontology 0.558 0.454 0.803 0.376 0.548
Device Ontology 0.846 0.735 0.889 0.614 0.771

Table 7 shows the results obtained with the measure of accuracy, and each
measure of similarity applied to each sensor ontology.

Table 7. Results of the accuracy measure obtained with each semantic similarity
measure based on information content

Ontology Lin Maz JC Res General Average


Intelligent Environment 0.593 0.444 0.481 0.444 0.444
Onto Sensor 0.642 0.608 0.585 0.532 0.581
Sensor Ontology 0.578 0.578 0.578 0.578 0.780
Device Ontology 0.300 0.300 0.350 0.450 0.475

As can be seen in Table 7, the ontology with the best accuracy value in the
general average is Sensor Ontology, followed by Onto Sensor.

5.4 Results of Hybrid Semantic Similarity Measures

Hybrid semantic similarity measures are a combination of the measures based on


the structure and the measures based on information content. In this subsection,

Table 8. Thresholds for sensor domain ontologies using hybrid measures

Ontology Zhou HSS3 HSS4 Average Threshold


Intelligent Environment 0.126 0.415 0.615 0.385
Onto Sensor 0.065 0.190 0.292 0.182
Sensor Ontology 0.059 0.143 0.221 0.141
Device Ontology 0.102 0.243 0.368 0.238

the results obtained from the system are presented, when these types of measures
are applied to the ontologies of the sensor domain. Table 8 shows the thresholds
obtained for each measure of similarity for these ontologies.
Table 9 shows the results obtained with the measure of accuracy, and each
measure of similarity applied to each sensor ontology.

Table 9. Results of the accuracy measure obtained with each hybrid semantic similarity
measure

Ontology Zhou HSS3 HSS4 General Average


Intelligent Environment 0.481 0.296 0.296 0.296
Onto Sensor 0.585 0.370 0.370 0.381
Sensor Ontology 0.578 0.359 0.359 0.375
Device Ontology 0.350 0.600 0.600 0.600

For this category of measures, the ontology with the highest value in the
overall system average is Device Ontology, followed by Onto Sensor.

5.5 Final Results

Table 10 shows the general averages obtained in each category (structure-based


measures (GAStr ), characteristics-based measures (GAF t ), measures based on
information content (GAIC ) and hybrid measures (GAH )). The average for each
ontology of the general averages (GA) over all categories, GAaverage, is also shown.

Table 10. Accuracy results obtained by all measures of semantic similarity

Ontology GAStr GAF t GAIC GAH GAaverage


Intelligent Environment 0.852 0.852 0.444 0.296 0.611
Onto Sensor 0.755 0.570 0.581 0.381 0.571
Sensor Ontology 0.578 0.578 0.780 0.375 0.577
Device Ontology 0.600 0.600 0.475 0.600 0.568
Average 0.696 0.650 0.570 0.413 0.582

As shown in Table 10, Intelligent Environment has on average the best results in the evaluation of taxonomic relationships, with a GAaverage of 61%, although with the measures based on information content and the hybrid measures it obtains its lowest GA values, 44.4% and 29.6%, respectively.

6 Conclusions

In this work, an algorithm to evaluate the taxonomy of a domain ontology is


presented. The evaluation uses four categories of semantic similarity measures.
The ontologies in the sensor domain used in the experiments are small with
less than 100 classes, with the exception of Onto Sensor that has more than 200
classes.
From the experimental results, it is concluded that for sensor ontologies of a size similar to Intelligent Environment, the measures based on structure and those based on characteristics are recommended for the evaluation of taxonomic relationships. When the ontologies are larger, as Onto Sensor, it is advisable to use the measures based on information content, since more information is available for each concept. The information content is computed taking into account the depth of the structure of the taxonomy and the descendants of the concepts.
As future work, similarity measures based on fuzzy logic will be implemented.
Another line of work is to consider other ontologies as those in the medical
domain.

Acknowledgment. This work is supported by the Sectoral Research Fund for Educa-
tion with the CONACyT project 257357, and partially supported by the VIEP-BUAP
project.

References
1. Agarwal, R., Fernandez, D.G., Elsaleh, T., Gyrard, A., Lanza, J., Sanchez,
L., Georgantas, N., Issarny, V.: Unified IoT ontology to enable interoperability
and federation of testbeds. In: 2016 IEEE 3rd World Forum on Internet of Things
(WF-IoT), pp. 70–75, December 2016
2. Ali, S., Khusro, S., Ullah, I., Khan, A., Khan, I.: Smartontosensor: ontology for
semantic interpretation of smartphone sensors data for context-aware applications.
J. Sensors 2017, 8790198:1–8790198:26 (2017)
3. Berners-lee, T., Hendler, J.: The semantic web. Sci. Am. 284, 34–43 (2001)
4. Brank, J., Grobelnik, M., Mladenić, D.: Automatic evaluation of ontologies, pp.
193–219. Springer, London (2007)
5. Bravo, M., Reyes, J., Cruz-Ruiz, I., Gutiérrez-Rosales, A., Padilla-Cuevas, J.:
Ontology for academic context reasoning. Procedia Comput. Sci. 141, 175–182
(2018)
6. Bravo, M., Reyes-Ortiz, J.A., Cruz, I.: Researcher profile ontology for academic
environment. In: Arai, K., Kapoor, S. (eds.) Advances in Computer Vision, pp.
799–817. Springer, Cham (2020)

7. Eid, M., Liscano, R., Saddik, A.E.: A novel ontology for sensor networks data. In:
2006 IEEE International Conference on Computational Intelligence for Measure-
ment Systems and Applications, pp. 75–79, July 2006
8. Gómez-Pérez, A.: Ontology evaluation, pp. 251–273. Springer, Heidelberg (2004)
9. Gruber, T.R.: Toward principles for the design of ontologies used for knowledge
sharing? Int. J. Hum.-Comput. Stud. 43(5), 907–928 (1995)
10. Russomanno, D.J., Kothari, C., Thomas, O.A.: Building a sensor ontology: a prac-
tical approach leveraging ISO and OGC models, pp. 637–643, January 2005
11. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and
lexical taxonomy. CoRR, cmp-lg/9709008 (1997)
12. Kolozali, S., Elsaleh, T., Barnaghi, P.M.: A validation tool for the W3C SSN ontol-
ogy based sensory semantic knowledge. In: TC/SSN@ISWC (2014)
13. Li, Y., Bandar, Z.A., Mclean, D.: An approach for measuring semantic similarity
between words using multiple information sources. IEEE Trans. Knowl. Data Eng.
15(4), 871–882 (2003)
14. Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the
Fifteenth International Conference on Machine Learning, ICML 1998, pp. 296–
304. Morgan Kaufmann Publishers Inc., San Francisco (1998)
15. Maedche, A., Staab, S.: Measuring similarity between ontologies. In: Gómez-Pérez,
A., Benjamins, V.R. (eds.) Knowledge Engineering and Knowledge Management:
Ontologies and the Semantic Web, pp. 251–263. Springer, Heidelberg (2002)
16. Mazandu, G., Mulder, N.: Information content-based gene ontology semantic sim-
ilarity approaches: toward a unified framework theory. BioMed Res. Int. 292063,
2013 (2013)
17. Neuhaus, H., Compton, M.: The semantic sensor network ontology: a generic lan-
guage to describe sensor assets (2009)
18. Paul, R., Groza, T., Zankl, A., Hunter, J.: Semantic similarity-driven decision
support in the skeletal dysplasia domain, November 2012
19. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a
metric on semantic nets. IEEE Trans. Syst. Man Cybern. 19(1), 17–30 (1989)
20. Resnik, P.: Semantic similarity in a taxonomy: an information-based measure and
its application to problems of ambiguity in natural language. CoRR, abs/1105.5444
(2011)
21. Rodriguez, M.A., Egenhofer, M.J.: Determining semantic similarity among entity
classes from different ontologies. IEEE Trans. Knowl. Data Eng. 15(2), 442–456
(2003)
22. Rueda, C., Galbraith, N., Morris, R., Bermudez, L., Arko, R., Graybeal, J.: The
MMI device ontology: enabling sensor integration. In: American Geophysical Union
Fall Meeting – Session, vol. 16, pp. 44–48, January 2010
23. Seco, N., Veale, T., Hayes, J.: An intrinsic information content metric for semantic
similarity in wordnet. In: Proceedings of the 16th European Conference on Artificial
Intelligence, ECAI 2004, pp. 1089–1090. IOS Press, Amsterdam (2004)
24. Sánchez, D., Batet, M., Isern, D., Valls, A.: Ontology-based semantic similarity: a
new feature-based approach. Expert Syst. Appl. 39(9), 7718–7728 (2012)
25. Tversky, A.: Features of similarity. Psychol. Rev. 84, 327–352 (1977)
26. Wu, Z., Palmer, M.: Verb semantics and lexical selection. CoRR, abs/cmp-
lg/9406033 (1994)
Trained Synthetic Features in Boosted
Decision Trees with an Application
to Polish Bankruptcy Data

Thomas R. Boucher(B) and Tsitsi Msabaeka

Texas A&M University-Commerce, Commerce, TX 75428, USA


thomas.boucher@tamuc.edu, tboucher1@gmail.com

Abstract. Trained synthetic features are created through the applica-


tion of optimized learning criteria to the collection of predictors. The
trained synthetic features are added to the training data and used to
create a decision tree. Boosting creates an ensemble of decision trees;
at each boosting iteration, the newly weighted observations lead to a
new collection of trained synthetic features. As a result, the trained syn-
thetic features evolve along with the ensemble to explore regions of the
predictor space containing difficult to classify observations.

Keywords: Classification trees · Synthetic features · Boosting

1 Introduction
In the classification problem, a training set consists of n observations of a binary
categorical target Y and p predictors X1 , . . . , Xp . The training set is used to
build a model that predicts Y based on X1 , . . . , Xp . For example, in a financial
application the training set could consist of observations on n companies and Y
could denote whether or not a company declared bankruptcy during the obser-
vation period. The numeric predictors X1 , . . . , Xp could be financial indicators
taken from the companies’ financial statements. This data can then be used to
build a model that will predict the bankruptcy status of companies Y given
observations of their financial indicators X1 , . . . , Xp .
One approach to building such a model is to augment the data set by com-
bining the predictors X1 , . . . , Xp into new variables called synthetic features and
adding these synthetic features to the collection of predictors. Standard model-
building procedures such as decision trees or ensembles of boosted trees can then
be applied to the augmented data set. Zieba et al. [11] follow this approach to create
ensembles of boosted classification trees built using synthetic features. Two pre-
dictor variables are selected by variable-importance weighted random selection;
the predictors are then combined using a randomly chosen binary arithmetic
operation. Melville and Mooney [6] create artificial data by randomly sampling
values of predictors and then sampling a class value inversely proportional to the
current ensemble’s predictions. Rodriguez et al. [8] create synthetic features by
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 295–309, 2020.
https://doi.org/10.1007/978-3-030-39442-4_23

performing principal components analysis on each of K non-overlapping subsets


of the features. Wang and Wu [10] use data reduction and manifold learning
algorithms/kernel-based methods for feature subset selection. Zieba and Hardle
[12] create artificial data by resampling instances within each target variable
class using separate Beta-Binomial distributions for each class.
In this paper synthetic features are created not by resampling or randomly
combining variables from the training set, but rather through using statistical
methods for modeling a response variable as a function of a set of predictor
variables. Because these features are the result of an optimized learning criterion
applied to the collection of predictors, they can be termed trained synthetic
features. Additionally, weights applied to the observations can be incorporated
in training these synthetic features. Boosting will be used to create an ensemble
of decision trees, with the newly weighted observations used to create a new
collection of trained synthetic features at each boosting iteration. This allows the
trained synthetic features to evolve along with the ensemble to explore regions
of the predictor space containing difficult to classify observations.

2 Methods

2.1 Decision Trees and Boosting

Boosting [5] is a well-known method (or collection of methods) for combining


weak classifiers into an effective ensemble classifier. A popular boosting algorithm
is AdaBoost.M1 [2], also known as discrete boosting. As a standard in machine
learning practice, details of its operation are widely known so these details will
be only briefly touched on here. A more complete account can be found in [5], for
example, whose notation will be largely followed here. Let n denote the size of the
training sample, let y1 , . . . , yn denote the observations on the response Y and let
x1 , . . . , xn denote the n vectors of observed values of the predictor variables x =
(X1 , . . . , Xp ) in the training sample. In AdaBoost.M1 the observed responses
y1 , . . . , yn are coded so that the values are either −1 or 1 depending on the class
value of yi . Boosting decision trees iteratively reweights the data and produces a
decision tree for the reweighted data at each of m iterations, creating a collection
of trees T1 , . . . , Tm that comprise the boosted ensemble of decision trees. The
weights are initialized as $w_i^{(1)} = 1/n$ for $i = 1, \ldots, n$ and a decision tree $T_1(x)$ is grown on the training set using the weighted data. The weighted misclassification error for the decision tree is $err_1 = \sum_{i=1}^{n} w_i I(y_i \neq T_1(x_i)) / \sum_{i=1}^{n} w_i$ and the weight for the tree in the ensemble is $\alpha_1 = \log((1 - err_1)/err_1)$. The observation weights are updated for the second iteration as $w_i^{(2)} = w_i^{(1)} \exp(\alpha_1 I(y_i \neq T_1(x_i)))$, $i = 1, \ldots, n$, and the process is repeated for each of the $m$ boosting iterations. The ensemble is then $T(x, m) = \sum_{j=1}^{m} \alpha_j T_j(x)$ and a weighted vote is taken of the ensemble to produce the prediction $\mathrm{sgn}\left(\sum_{j=1}^{m} \alpha_j T_j(x)\right)$. It has been shown [3] that discrete boosting approaches the minimum of the expected exponential loss $E[e^{-yT(x,m)}]$ of the classifier $T(x, m)$ using iterative stepwise updates $T(x, m) = T(x, m-1) + \alpha_m T_m(x)$, with $\alpha_m = \log((1 - err_m)/err_m)$ minimizing the expected exponential loss at each step $m$; that is, $\alpha_m = \arg\min_{c} E_{w^{(m)}}[e^{-c\,y\,T_m(x)}]$.
Another boosting variant is gentle boosting, which updates the tree weights
αm in a more gradual fashion as αm = 1 − 2 × errm . Gentle boosting has
advantages over discrete boosting when the updates of αm in discrete boosting
are numerically unstable. In discrete boosting there can be large swings in the
value of αm in regions with a high purity (errm close to 0 or 1) due to the use
of the log-ratios when calculating αm = log((1 − errm )/errm ) [3].
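As a concrete illustration of the two weight-update schemes, discrete and gentle boosting of stumps can be sketched as follows. The paper's experiments were carried out in R with rpart; the Python snippet below, with a hand-rolled weighted stump, is only illustrative and not the authors' code, and the gentle variant here simply swaps the formula for αm, which is one reading of the description above.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: the single split X[:, j] < t minimizing weighted error."""
    best = (np.inf, 0, 0.0, 1)                       # (error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] < t, sign, -sign)
                err = np.sum(w * (pred != y)) / np.sum(w)
                if err < best[0]:
                    best = (err, j, t, sign)
    err, j, t, sign = best
    return lambda Z: np.where(Z[:, j] < t, sign, -sign), err

def boost(X, y, m=20, scheme="discrete"):
    """AdaBoost.M1-style ensemble of stumps; y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # initial weights 1/n
    trees, alphas = [], []
    for _ in range(m):
        tree, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        if scheme == "discrete":
            alpha = min(np.log((1 - err) / err), 4)  # upper bound of 4 as in Sect. 2.3
        else:                                        # gentle variant: alpha = 1 - 2*err
            alpha = 1 - 2 * err
        w = w * np.exp(alpha * (tree(X) != y))       # upweight misclassified cases
        trees.append(tree)
        alphas.append(alpha)
        if err <= 1e-10:                             # a perfect stump: stop boosting
            break
    return lambda Z: np.sign(sum(a * t(Z) for a, t in zip(alphas, trees)))

# Tiny usage example on synthetic two-class data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
model = boost(X, y, m=10)
print("training accuracy:", np.mean(model(X) == y))
```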
Decision trees can be viewed as methods of partitioning the predictor space

$\mathbb{R}^p = (X_1, \ldots, X_p)$ so that each subset of the predictor space in the partition
produces a single predicted class for all cases in that subset. Subsets of the
predictor space in the partition are the result of combinations of splits of the
form Xi < c and Xi ≥ c for i = 1, . . . , p, where c is a scalar. While effective, these
combinations of splits form a limited set of available predictor space partitioning
operations and it is expected that a more extensive set of available partitioning
operations would result in a more accurate classifier. One way to extend the
set of partitioning operations in a decision tree is to augment the training set
with new features called synthetic features that are functions of the predictors.
Splitting on these synthetic features creates a more flexible partitioning of the
predictor space by allowing splitting on more general functions of the predictors.
This leads to the creation of more accurate classifiers.

2.2 Trained Synthetic Features


Define trained synthetic features to be functions g(X1 , . . . , Xp ) fitted to the
predictors using statistical criteria. It is a good practice to include both linear and
nonlinear functions g in the set of trained synthetic features to create a flexible
set of partitions that can be applied to a variety of data sets. The following
trained synthetic features are applied here.

Zero-One Regression: Let the values of Y be in {A, B}, define

$$Y_{new} = \begin{cases} 1, & Y = A \\ 0, & Y = B \end{cases}$$

and perform ordinary or weighted least-squares regression of Ynew on X1 , . . . , Xp .


The rationale is that E[Ynew ] = P (Y = A) is modeled as a linear function of the
predictors. If necessary, variable selection procedures for finding the best fitting
regression of Ynew on subsets of X1 , . . . , Xp can be performed and the best-
fitting model returned. When variable selection procedures are used, a popular criterion for selecting the best-fitting model from a collection of regression models is Akaike's Information Criterion (AIC), where $AIC = 2k + n\ln(RSS)$, $k$ is the number of parameters in the regression model and $RSS = \sum_i (y_i - \hat{y}_i)^2$. When regression of weighted observations is conducted the residual sum of squares incorporates the weights as $RSS = \sum_i w_i (y_i - \hat{y}_i)^2$.

Least squares regression, which minimizes the RSS, can be replaced with
variants like robust regression with a psi function, or least-trimmed squares
regression. One concern with these alternatives to least-squares regression is
that computational time is increased; for example, robust regression models are
fit using iteratively reweighted least squares. Whichever method is used, the
synthetic feature is the value of β0 + β1 x1 + . . . + βp xp returned by the fitted
model, which is the estimate of P (Y = A) given the data.
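As an illustration, a weighted 0-1 regression feature can be produced with ordinary numpy least squares by rescaling with the square roots of the weights. This is only a sketch of the idea, with the AIC-driven variable selection and the robust variants omitted; it is not the code used in the experiments.

```python
import numpy as np

def zero_one_regression_feature(X, y, w):
    """Trained synthetic feature from weighted 0-1 regression.

    X: (n, p) predictors; y: class labels coded 1 (class A) / 0 (class B);
    w: observation weights. Returns the fitted values b0 + b1*x1 + ... + bp*xp,
    the estimate of P(Y = A), evaluated on the training data.
    """
    Xd = np.column_stack([np.ones(len(y)), X])       # add an intercept column
    sw = np.sqrt(w)                                  # weighted LS via rescaling
    beta, *_ = np.linalg.lstsq(sw[:, None] * Xd, sw * y, rcond=None)
    return Xd @ beta                                 # the synthetic feature values

# Example: the feature is appended to the predictors before growing the stump.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(float)
w = np.full(100, 1 / 100)
X_aug = np.column_stack([X, zero_one_regression_feature(X, y, w)])
print(X_aug.shape)
```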

Principal Components Analysis: Often used for data reduction, princi-


pal components analysis performs an eigenvalue/eigenvector decomposition on
either the covariance or correlation matrix of X = (X1 , . . . , Xp ), returning eigen-
value/eigenvector pairs (λi , ei ), i = 1, . . . , p. The principal components are the
vectors $X \cdot e_i$, for $i = 1, \ldots, p$. The number of principal components selected is the smallest integer $j \leq p$ so that $\sum_{i=1}^{j} \lambda_i / \sum_{i=1}^{p} \lambda_i$ exceeds a user-defined threshold. The quantity $\sum_{i=1}^{j} \lambda_i / \sum_{i=1}^{p} \lambda_i$ is regarded as the proportion of generalized
variance/correlation accounted for by the first j principal components; in this
sense the principal components summarize the set of predictors. Each selected
principal component is used as a new synthetic feature. If the observations are
weighted unequally the eigenvalue/eigenvector decomposition is applied to the
covariance/correlation matrix of the weighted observations.

Logistic Regression: Letting p = P (Y = A), logistic regression models the


logistic function of p as a linear function of the predictors

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \epsilon$$

As with 0–1 regression, logistic regression can be performed with or with-


out observation weighting, and AIC can be combined with variable selec-
tion procedures to find an optimal model. The synthetic feature is the value
of β0 + β1 x1 + . . . + βp xp returned from the fitted model, which is the
estimate of

$$\log\left(\frac{P(Y = A)}{1 - P(Y = A)}\right)$$
given the data. Equivalently, one could write the estimate of P (Y = A) as

$$P(Y = A) = \frac{e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p}}$$

Binary Synthetics: Two predictors Xi , Xj are selected from X1 , . . . , Xp and


are combined using a binary operation ◦ selected at random from +, −, ×, ÷, ∧.
The predictors can be sampled with probabilities reflecting their correlation with
the 0–1 encoded variable Ynew as in 0–1 regression. The probability that Xi is


selected is then |Corr(Xi , Ynew )|/ |Corr(Xi , Ynew )|. If the observations are
weighted unequally, the weighted correlation Corrw will be used where
n
i=1 wi (Xi − X̄)(Yi − Ȳ )
2
Corrw (X, Y ) = n n .
i=1 wi (Xi − X̄) i=1 wi (Yi − Ȳ )
2 2 2 2

This is similar to the methods used in [11] but with weighted correlations used
here in place of variable importance. Weighted correlations are preferred to vari-
able importance in boosting as this will encourage the algorithm to explore new
regions of the predictor space as the weights evolve, rather than favor predic-
tor variables by the frequency with which they have already been selected. The
synthetic feature is the value of Xi ◦ Xj.
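A sketch of this construction (not the authors' implementation): the weighted correlation below centres by the weighted means, and the guards on division and exponentiation are added here purely for numerical safety; the operation set and the sampling of predictors follow the description above.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted Pearson correlation (a sketch of Corr_w; centring uses weighted means)."""
    xc = x - np.average(x, weights=w)
    yc = y - np.average(y, weights=w)
    return np.sum(w * xc * yc) / np.sqrt(np.sum(w * xc**2) * np.sum(w * yc**2))

def binary_synthetic_feature(X, y01, w, rng):
    """Pick two predictors with probability proportional to |Corr_w(X_i, Y_new)| and
    combine them with a randomly chosen binary arithmetic operation."""
    p = X.shape[1]
    corrs = np.abs([weighted_corr(X[:, j], y01, w) for j in range(p)])
    i, j = rng.choice(p, size=2, replace=False, p=corrs / corrs.sum())
    ops = {"+": np.add, "-": np.subtract, "*": np.multiply,
           "/": lambda a, b: a / np.where(b == 0, 1e-12, b),                # safety guard
           "^": lambda a, b: np.sign(a) * np.abs(a) ** np.clip(b, -3, 3)}   # safety guard
    name = rng.choice(list(ops))
    return ops[name](X[:, i], X[:, j]), (i, j, name)

# Example:
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y01 = (X[:, 0] > 0).astype(float)
w = np.full(50, 1 / 50)
feature, chosen = binary_synthetic_feature(X, y01, w, rng)
print(chosen, feature[:3])
```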

Fisher’s Linear Discriminant: Letting X̄A denote the mean vector of X =



(X1 , . . . , Xp ) for those observations with Y = A, define X̄B similarly and let S
denote the sample covariance matrix of X. The synthetic feature is the linear

combination $(\bar{X}_A - \bar{X}_B)'\, S^{-1} X$ of $X_1, \ldots, X_p$ as defined by Fisher's linear dis-
criminant. If the observations are weighted unequally the weighted observations
are used in calculating X̄A , X̄B and S.
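A minimal numpy sketch of this feature, assuming the weighted sample covariance of X is used for S (the weighting convention for S is not fully specified above, so this is one plausible choice):

```python
import numpy as np

def fisher_lda_feature(X, y01, w):
    """Fisher's linear discriminant (mean_A - mean_B)' S^{-1} x as a synthetic feature."""
    A, B = y01 == 1, y01 == 0
    mean_a = np.average(X[A], axis=0, weights=w[A])
    mean_b = np.average(X[B], axis=0, weights=w[B])
    Xc = X - np.average(X, axis=0, weights=w)
    S = (w[:, None] * Xc).T @ Xc / w.sum()            # weighted covariance matrix of X
    direction = np.linalg.solve(S, mean_a - mean_b)   # S^{-1} (mean_A - mean_B)
    return X @ direction
```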

Naive Bayes: Naive Bayes applies Bayes’ rule to express P (Y = A|X) as pro-
portional to the product f (X|Y = A) × π(Y = A), where π(Y = A) is the
prior probability of class A and f (X|Y = A) is the joint likelihood of X =

(X1 , . . . , Xp ) given that Y = A. The naive assumption is that the Xi are inde-
pendent given the value of Y so that the joint conditional likelihoodp equals the
product of the conditional marginals; that is, f (X|Y = A) = i=1 f (Xi |Y = A),
with f (Xi |Y = A) being the conditional marginal distribution of Xi given that
Y = A. In practice the f (Xi |Y = A) are not known and so are assumed. For
example, each Xi given Y = A is often assumed to follow a Normal distribution
with mean X̄iA and standard deviation SiA , these statistics being calculated for
the Xi values for which Y = A. P (Y = B|X) is calculated similarly; the constant
of proportionality is handled by normalizing and returning

$$P(Y = A|X) = \frac{\prod_{i=1}^{p} f(X_i|Y = A)\,\pi(Y = A)}{\prod_{i=1}^{p} f(X_i|Y = A)\,\pi(Y = A) + \prod_{i=1}^{p} f(X_i|Y = B)\,\pi(Y = B)}$$

as the synthetic feature. If the observations are weighted unequally the weighted
observations are used in calculating X̄iA , SiA , X̄iB , SiB , i = 1, . . . , p. This will
have the effect of shifting the modes of the conditional marginal Normal densities
towards the more heavily weighted observations.
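A sketch of the weighted Normal naive Bayes feature; the class priors are taken here as the weighted class proportions, which the text does not specify, and a small constant guards the variances.

```python
import numpy as np

def naive_bayes_feature(X, y01, w, eps=1e-12):
    """P(Y = A | X) under the Normal naive Bayes model with weighted estimates."""
    posteriors = []
    for label in (1, 0):                               # class A, then class B
        m = y01 == label
        mu = np.average(X[m], axis=0, weights=w[m])
        var = np.average((X[m] - mu) ** 2, axis=0, weights=w[m]) + eps
        prior = w[m].sum() / w.sum()                   # assumed weighted class prior
        dens = np.exp(-0.5 * (X - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        posteriors.append(dens.prod(axis=1) * prior)   # naive product of marginals
    return posteriors[0] / (posteriors[0] + posteriors[1])
```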

Spline: As in 0–1 regression define

$$Y_{new} = \begin{cases} 1, & Y = A \\ 0, & Y = B \end{cases}$$

and create synthetic features with cubic smoothing splines [5] of each Xi as a
predictor of Ynew . Multivariate splines can be applied to collections of Xi as
predictors of Ynew . The computational burden of fitting splines can be intense
but splines allow for nonlinear class boundaries. Observation weights can be
incorporated in the spline fitting.

2.3 Boosting with Trained Synthetic Features

In this paper a stump is used as the base learner. A stump is a classification


tree with a single split and only two terminal nodes. More complex classification
trees could be used as the base learner but at increased computational cost. More
complex trees are also generally less stable. On the other hand, simpler classifiers
like stumps may require more boosting iterations to achieve the same level of
performance as boosting a more complex tree. It is an open question as to where
the advantage lies or what the optimum balance between classifier complexity
and the number of boosting iterations is, or if such an optimum balance even
exists. These questions are beyond the scope of this paper.
At each boosting iteration, the trained synthetic features are created using
the current boosting weights and added to the set of predictors from the train-
ing sample. The trained synthetic features from the previous boosting itera-
tion are discarded. Both discrete and gentle boosting methods are applied sep-
arately to produce separate boosted ensembles. It is also important to point
out that because discrete and gentle boosting differ in how the observation
weights are updated at each iteration the sets of synthetic features created
at each boosting iteration will differ between discrete boosting and gentle
boosting. Where discrete boosting is applied an upper bound of 4 is used for
αm = log((1 − errm )/errm ) as per the recommendations in [3]. Boosting itera-
tions cease if a perfect stump is reached.
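Putting the pieces together, one way to organise this procedure is sketched below. It reuses fit_stump and the feature constructors from the earlier sketches, so it is illustrative rather than the authors' R implementation; predicting on new data would additionally require storing each iteration's fitted feature constructors so that the synthetic features can be recomputed for the new observations.

```python
import numpy as np

def boost_with_synthetic_features(X, y, m=20, scheme="discrete"):
    """Boosted stumps in which the trained synthetic features are rebuilt from the
    current observation weights at every iteration and the previous ones discarded.
    y is coded -1/+1; fit_stump and the feature builders come from earlier sketches."""
    n = len(y)
    y01 = (y == 1).astype(float)                      # 0-1 coding for the features
    w = np.full(n, 1.0 / n)
    trees, alphas = [], []
    for _ in range(m):
        synthetics = np.column_stack([                # regenerated with current weights
            zero_one_regression_feature(X, y01, w),
            fisher_lda_feature(X, y01, w),
            naive_bayes_feature(X, y01, w),
        ])
        X_aug = np.column_stack([X, synthetics])      # predictors + trained synthetics
        tree, err = fit_stump(X_aug, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = min(np.log((1 - err) / err), 4) if scheme == "discrete" else 1 - 2 * err
        w = w * np.exp(alpha * (tree(X_aug) != y))    # reweight hard-to-classify cases
        trees.append(tree)
        alphas.append(alpha)
        if err <= 1e-10:                              # perfect stump: stop boosting
            break
    return trees, alphas
```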
Unless stated otherwise, 10-fold cross-validation is used to assess model per-
formance. The mean prediction error and the standard deviation of the predic-
tion errors are calculated across the cross-validation runs. Variable importances are extracted from each stump, averaged across the boosting iterations, and normalized so that the sum of the average variable importances is equal to 1 to allow
for easier comparison among variables.
Table 1 contains a list of abbreviations used to denote the trained synthetic
features.
These abbreviations will be used to refer to synthetic features in the sections
that follow.

Table 1. Synthetic feature abbreviations.

Abbreviation Trained synthetic feature


zo zero-one regression
nb binary synthetic
ls logistic regression
pca principal components
lda Fisher’s linear discriminant
rlm robust linear regression
lts least-trimmed squares regression
nbay naive Bayes
Spl univariate splines
mSpl multivariate splines
SplmSpl zero-one regression on ‘spl’ and ‘mspl’
sdata.pca principal components replacing predictors in root node

3 Experiments
All work was done in the R environment [7]. R package rpart [9] was used for
creating the stumps at each boosting iteration. Original scripts were written in
R to perform the discrete and gentle boosting other than using rpart for stump
creation, as the new observation weights were needed at each boosting iteration
in order to create the new set of synthetic features. The synthetic features were
then added to the collection of predictor variables available for that iteration.

3.1 Data
Polish Companies Bankruptcy Data: The Polish companies bankruptcy
data set is described in [11] and publicly available through the UCI Machine
Learning Repository [1]. This is an interesting data set as it contains a large
number of observations, 64 predictors, many missing values, the predictor dis-
tributions are often skewed and contain outliers. The classes are also highly
unbalanced; for example, the data for Year 1 contains 7027 observations, 271 of
which represent bankrupted companies, and the remaining 6756 companies did
not bankrupt. Two of the variables, X37 and X27 are missing approximately
39% and 23% of their observations, respectively. The remaining 62 variables are
missing from 0% to 4.4% of their values. The missing values do not appear to be
missing at random so imputation should be applied cautiously. However, since
the data is being used to illustrate boosting with synthetic features, the missing
values will be imputed according to the following simple scheme. For a given
predictor Xi containing missing values, the most highly correlated predictor Xj ,
j ≠ i, is selected and simple linear regression of Xi on Xj is used to impute the
missing values of Xi . If Xj contains missing values so that some of the missing

Xi values cannot be imputed, then the next highest correlated predictor to Xi


is used to impute these missing values. The process is repeated until all missing
values in Xi are imputed. In practice it took 13 iterations to impute all missing
values in the data set.
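A sketch of this imputation scheme with pandas (illustrative only; tie-breaking, degenerate regressions and the exact iteration order of the authors' code are not reproduced, and a guard stops the loop if a pass imputes nothing):

```python
import numpy as np
import pandas as pd

def impute_by_best_correlated(df):
    """Impute each predictor from its most highly correlated available partner
    via simple linear regression, repeating passes until nothing is missing."""
    df = df.copy()
    while df.isna().any().any():
        before = int(df.isna().sum().sum())
        corr = df.corr().abs()                          # pairwise-complete correlations
        for col in df.columns[df.isna().any()]:
            partners = corr[col].drop(col).sort_values(ascending=False).index
            for other in partners:
                both = df[[col, other]].dropna()
                rows = df[col].isna() & df[other].notna()
                if len(both) < 2 or not rows.any():
                    continue
                slope, intercept = np.polyfit(both[other], both[col], 1)
                df.loc[rows, col] = intercept + slope * df.loc[rows, other]
                break
        if int(df.isna().sum().sum()) == before:        # nothing imputable this pass
            break
    return df
```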

Checkerboard Data: This is an artificially generated data set often used for
machine learning experiments. A black-and-white checkerboard pattern is super-
imposed over the unit square. Instances are generated by randomly sampling X
and Y values uniformly between 0 and 1. These (X, Y ) coordinates determine
which square in the black-and-white checkerboard pattern the instance falls into.
The color of the square determines the class of the instance. An example of
10,000 simulated instances is in Fig. 1. The classification problem involves pre-
dicting whether an instance belongs to the black-square class or the white-square
class given the X, Y coordinates.
Fig. 1. 10,000 simulated instances of checkerboard data. The black squares are one
group and the white squares are another group.
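A data-generating sketch for this set (the grid size of the board is an assumption, since it is not stated above):

```python
import numpy as np

def make_checkerboard(n, squares=4, seed=0):
    """Uniform points on the unit square, labelled by the colour of the
    checkerboard square they fall into (a squares-by-squares board)."""
    rng = np.random.default_rng(seed)
    xy = rng.uniform(size=(n, 2))
    cell = np.floor(xy * squares).astype(int)
    y = cell.sum(axis=1) % 2                 # 0/1 encode the two colours
    return xy, y

X, y = make_checkerboard(10_000)
print(np.bincount(y))                        # roughly balanced classes
```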

Spheres Data: This is another artificially generated data set often used for
machine learning experiments. Instances belong to one of two classes defined by
two spheres, one nested within the other. The first class is defined by a sphere
centered at the origin of radius 1/2. The second class is defined by a hollow sphere
centered at the origin having inner radius 1/2 and outer radius 1. Instances are
generated by randomly sampling X, Y and Z values from each of the two spheres.
An example of 1,000 simulated instances is in Fig. 2. The classification problem
involves predicting the class membership given the (X, Y, Z) coordinates.
Fig. 2. 1,000 simulated instances from two nested spheres. Classes are indicated by
plotting symbol.
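A data-generating sketch for this set (uniform-in-volume sampling is an assumption; the text above only says the values are sampled from each of the two regions):

```python
import numpy as np

def make_spheres(n_per_class=500, seed=0):
    """Class 0: ball of radius 1/2; class 1: hollow shell with radii between 1/2 and 1."""
    rng = np.random.default_rng(seed)

    def sample(r_lo, r_hi, n):
        d = rng.normal(size=(n, 3))
        d /= np.linalg.norm(d, axis=1, keepdims=True)        # uniform directions
        u = rng.uniform(size=(n, 1))
        r = (r_lo**3 + u * (r_hi**3 - r_lo**3)) ** (1 / 3)   # uniform in volume
        return d * r

    X = np.vstack([sample(0.0, 0.5, n_per_class), sample(0.5, 1.0, n_per_class)])
    y = np.repeat([0, 1], n_per_class)
    return X, y

X, y = make_spheres()
print(X.shape, np.bincount(y))
```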

3.2 Results
Polish Bankruptcy: Due to the size of the training set (7027 observations),
a stratified sample of size n = 200 was selected with classes represented in the
same proportions as found in the training set. The stratified sample was used
for model fitting. The 64 predictor variables were replaced with 16 principal
components which together accounted for over 90% of the generalized variance.
There were still too many variables to be able to compute multivariate splines
so this synthetic feature was omitted.
Using both discrete boosting and gentle boosting combined with 10-fold
cross-validation, models with a variety of boosting iterations were fit to the
stratified sample from the Polish data. The average and standard deviation of
the misclassification errors across the cross-validation runs were calculated. The
results are in Table 2.
The mean cross-validation classification errors should be compared with the
0.039 misclassification error rate for the naive majority class classification rule
which classifies every observation as the not bankrupt majority class (6756 of
the 7027 companies did not go bankrupt). Table 2 shows that for both discrete
and gentle boosting the mean cross-validation misclassification error decreases
as boosting iterations increase. The standard deviation of the cross-validation
misclassification errors decreases for gentle boosting and remains stable for dis-
crete boosting. Table 2 indicates discrete boosting resulted in slightly smaller
mean cross-validation misclassification error and smaller standard deviation of
cross-validation error than gentle boosting. Both boosting methods crossed the
0.039 naive misclassification error rate at about 20 boosting iterations.

Table 2. Results for Polish bankruptcy data. Average (‘MCVE’) and standard devia-
tion (‘SCVE’) of misclassification errors from 10-fold crossvalidation of boostrap ensem-
bles of sizes (‘Iterations’) 3, 5, 10, 15, 20, 25, 50, and 100 boosting iterations. Discrete
(‘Discrete’) and gentle (‘Gentle’) boosting applied.

Iterations Discrete MCVE Discrete SCVE Gentle MCVE Gentle SCVE


3 0.05 0.027 0.07 0.048
5 0.04 0.039 0.065 0.071
10 0.044 0.023 0.065 0.063
15 0.042 0.032 0.045 0.044
20 0.034 0.027 0.03 0.063
25 0.04 0.033 0.03 0.042
50 0.034 0.032 0.04 0.04
100 0.014 0.027 0.025 0.035

Variable importances from the discrete and gentle boosting models with
100 boosting iterations were averaged over the 10 cross-validation runs and
normalized to sum to 1 for easier evaluation. The top 5 variables for discrete
and gentle boosting by the resulting variable importance are in Table 3.
The variable importance for discrete boosting differ very little, suggesting
none dominate the others in their ability to predict the bankruptcy status of a
company. For discrete boosting the first 4 variables in importance are univariate
splines of principal components of the original predictor variables, suggesting
some nonlinearity in the partition between classes on these principal components.
The final variable in the top 5 is logistic regression with the principal components
as independent variables, a further suggestion of a nonlinear partition between
the classes. The variable importance using gentle boosting differ from those using
discrete boosting as the logistic regression synthetic feature strongly dominates
the other trained synthetic features. Univariate splines again appear in the top
5, though of a different collection of principal components. Recall that discrete

Table 3. Results for Polish bankruptcy data. Top 5 variables by variable importance in
decreasing order, 100 boosting iterations. Discrete and gentle boosting applied. ‘DVar’
and ‘DImp’ are variables and importances for discrete boosting, ‘GVar’ and ‘GImp’
are variables and importances for gentle boosting.

DVar DImp GVar GImp


data.pca8Spl 0.055 ls 0.405
data.pca10Spl 0.044 data.pca14Spl 0.066
data.pca5Spl 0.043 data.pca15Spl 0.046
data.pca4Spl 0.043 data.pca2Spl 0.042
ls 0.043 data.pca9Spl 0.040

and gentle boosting have differing observation weights at each boosting iteration,
leading to differing sets of synthetic features, explaining the difference between
variables and their importance.

Checkerboard Data: The training set consisted of 1,000 simulated instances,


510 of which were in the black-square group and 490 in the white-square group.
During experimentation it was noticed that replacing the predictors with their
principal components resulted in improved predictive performance and so the
principal components were included in the models described below. All synthetics
including multivariate spline were used.
Cross-validation results for boosting models with varying iterations and dis-
crete and gentle boosting are in Table 4.

Table 4. Results for checkerboard data. Average (‘MCVE’) and standard deviation
(‘SCVE’) of misclassification errors from 10-fold crossvalidation of boostrap ensembles
of sizes (‘Iterations’) 3, 5, 10, 15, 20, 25 boosting iterations. Discrete and gentle boosting
applied.

Iterations Discrete MCVE Discrete SCVE Gentle MCVE Gentle SCVE


3 0.069 0.033 0.129 0.111
5 0.063 0.048 0.137 0.112
10 0.069 0.041 0.137 0.120
15 0.045 0.031 0.125 0.115
20 0.058 0.041 0.111 0.092
25 0.075 0.036 0.097 0.066

The average cross-validation misclassification error and the cross-validation


standard error do not appreciably decrease with the number of boosting itera-
tions, suggesting a boosting model with as few as 3 boosting iterations is appro-
priate. The mean cross-validation errors and their standard deviations are larger
for gentle boosting than those obtained using discrete boosting. The performance
of each of the models compares favorably with the 0.490 misclassification error
rate for the naive majority class classification rule, which in this case would
classify all observations as being in the black-square class.
The top 5 variables by variable importance are in Table 5.
These variable importances are averaged over the cross-validation runs and
then normalized so that the sum of variable importances is equal to 1. Table 5
shows these variables comprise nearly all of the variable importance of the trained
synthetic features, the first 4 in particular. The presence of the splines reflects
the nonlinear partition between the classes, while the presence of 0–1 regression
is somewhat surprising.

Table 5. Results for checkerboard data. Top 5 variables by variable importance in


decreasing order, 100 boosting iterations. Discrete and gentle boosting applied. ‘DVar’
and ‘DImp’ are variables and importances for discrete boosting, ‘GVar’ and ‘GImp’
are variables and importances for gentle boosting.

DVar DImp GVar GImp


SplmSpl 0.347 SplmSpl 0.284
data.pca1Spl 0.271 zo 0.237
data.pca2Spl 0.220 data.pca1Spl 0.194
zo 0.043 data.pca2Spl 0.182
ls 0.040 ls 0.031

Spheres Data: The training set consisted of 1,000 instances, 500 in each group.
The predictors were not replaced with their principal components as this yielded
no appreciable increase in performance and the predictors were few in num-
ber so no computational improvement would result. Table 6 shows the mean
cross-validation misclassification error and the standard deviation of the cross-
validation misclassification error across boosting iterations, using both discrete
and gentle boosting to update the observation weights.

Table 6. Results for spheres data. Average (‘MCVE’) and standard deviation (‘SCVE’)
of misclassification errors from 10-fold crossvalidation of boostrap ensembles of sizes
(‘Iterations’) 3, 5, 10, 15, 20, 25 boosting iterations. Discrete and gentle boosting
applied.

Iterations Discrete MCVE Discrete SCVE Gentle MCVE Gentle SCVE


3 0.018 0.018 0.016 0.016
5 0.025 0.023 0.005 0.008
10 0.021 0.024 0.009 0.014
15 0.018 0.024 0.001 0.003
20 0.002 0.022 0.000 0.000

While both methods perform well, the results for gentle boosting are even
more impressive than those for discrete boosting as the corresponding cross-
validation average misclassification errors and standard deviation of these errors
are smaller than those obtained from discrete boosting. For both discrete and
gentle boosting, the average cross-validation misclassification error and standard
deviation of cross-validation misclassification error are low for a small number
of boosting iterations and vary little as boosting iterations change. The models
clearly outperform the naive majority class classification rule/coin toss with
its 0.5 misclassification error rate. It was also common to get to a perfectly
performing boosting model in fewer than the maximum 20 boosting iterations.

Variable importances from the discrete and gentle boosting models with 100
boosting iterations were averaged over the 10 cross-validation runs and normal-
ized to sum to 1 for easier evaluation. The top 5 variables by variable importance
are in Table 7.

Table 7. Results for spheres data. Top 5 variables by variable importance in decreasing
order, 100 boosting iterations. Discrete and gentle boosting applied. ‘DVar’ and ‘DImp’
are variables and importance for discrete boosting, ‘GVar’ and ‘GImp’ are variables
and importance for gentle boosting.

DVar DImp GVar GImp


nbay 0.247 nbay 0.218
SplmSpl 0.188 SplmSpl 0.210
mSpl 0.187 mSpl 0.210
data.pca2Spl 0.116 data.pca1Spl 0.107
data.pca3Spl 0.095 data.pca2Spl 0.102

The variable importances are similar for discrete and gentle boosting. Naive
Bayes comes through as the most important variable; this was also noted in
[4] when performing experiments on randomly generated instances of this data.
The remaining synthetic features are nonlinear splines, suited to capturing the
nonlinear boundary between the classes. Taken together the 5 variables account
for nearly 90% of variable importance.
The cross-validation estimate of model performance can be overly optimistic,
since the final model selection (including deciding the number of boosting itera-
tions, and gentle or discrete boosting) is based on the cross-validation results. As
a check, a run with gentle boosting and 20 boosting iterations on the entire train-
ing set then applied to another test set of 1,000 simulated instances of sphere
data yielded a misclassification error of 1.3% and confusion matrix in Table 8.

Table 8. Confusion matrix for model with 20 gentle boosting iterations, performance
on new simulated test set of 1,000 instances.

Sphere Inner Outer


Inner (actual) 495 5
Outer (actual) 8 492

A test run using discrete boosting and 20 boosting iterations on the same
new set of 1,000 simulated instances of sphere data yielded a misclassification
error of 2.6% and confusion matrix in Table 9.
In both cases, the model performance on the test set falls short of the cross-
validation estimate but is still very good.

Table 9. Confusion matrix for model with 20 discrete boosting iterations, performance
on new simulated test set of 1,000 instances.

Sphere Inner Outer


Inner (actual) 500 0
Outer (actual) 26 474

4 Conclusions

Trained synthetic features dominated the variable importance, scoring in the


top five in variable importance for each of the data sets. This indicates the
advantage of including trained synthetic features among the predictors as none of
the original features were in the top five in variable importance for any of the data
sets. The variable importances also illustrate the advantage of including nonlinear
trained synthetic features along with traditional linear trained synthetic features
as splines scored high in variable importance along with synthetic features from
0–1 regression and logistic regression.
As expected, mean cross-validation errors and their standard deviations generally
decreased as the number of boosting iterations increased, but it is debatable
whether the slight improvements are worth the added computational cost of
additional boosting iterations. For the spheres and checkerboard data sets the
model performance quickly exceeded naive classification. The Polish bankruptcy
data only saw an improvement in model performance beyond naive classification
for a large number of iterations. This may be due to inadequate size of the
stratified sample used as the training set, the complex nature of the data set,
deficiencies in the method used for imputing missing values, the highly unbalanced
nature of the response variable, or other factors.
There were differences in model performance and variable importance
between discrete and gentle boosting. Discrete boosting outperformed gentle
boosting in some cases but was outperformed by gentle boosting in others, indi-
cating that both are valuable methods for updating weights in the boosting algo-
rithm. Differences in observation weighting between discrete and gentle boosting
would lead to different sets of trained synthetic features, explaining the difference
in variable importance between discrete and gentle boosting.
Overall, the trained synthetic features produced effective models for a moder-
ate number of boosting iterations performed using a very simple weak classifier, a
stump. Relatively few boosting iterations combined with a very simple weak clas-
sifier help compensate for the added computational complexity of the synthetic
features. Some of the linear synthetic methods produce redundant information
and could be omitted to speed up computation, particularly for larger data sets.

References
1. Dua, D., Karra Taniskidou, E.: UCI Machine Learning Repository. University of
California, School of Information and Computer Science, Irvine (2017). http://
archive.ics.uci.edu/ml
2. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning
and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
3. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical
view of boosting. Ann. Stat. 28(2), 337–374 (2000)
4. Frank, E., Hall, M., Pfahringer, B.: Locally weighted Naive Bayes. In: Proceedings
of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pp. 249–256
(2003)
5. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edn. Springer,
New York (2009)
6. Melville, P., Mooney, R.J.: Creating diversity in ensembles using artificial data.
Inf. Fusion 6(1), 99–111 (2005). https://doi.org/10.1016/j.inffus.2004.04.001
7. R Core Team: R: a language and environment for statistical computing. R Foun-
dation for Statistical Computing, Vienna, Austria (2016). https://www.R-project.
org/
8. Rodriguez, J.J., Kuncheva, L.I., Alonso, C.J.: Rotation forest: a new classifier
ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 28(10), 1619–1630
(2006). https://doi.org/10.1109/TPAMI.2006.211
9. Therneau, T., Atkinson, B., Ripley, B.: rpart: Recursive partitioning and
regression trees. R package version 4.1-10 (2015). https://CRAN.R-project.org/
package=rpart
10. Wang, L., Wu, C.: Business failure prediction based on two-stage selective ensem-
ble with manifold learning algorithm and kernel-based fuzzy self-organizing map.
Knowl.-Based Syst. 121(2017), 99–110 (2017)
11. Zieba, M., Tomczak, S.K., Tomczak, J.M.: Ensemble boosted trees with synthetic
features generation in application to bankruptcy prediction. Expert Syst. Appl.
58, 93–101 (2016)
12. Zieba, M., Hardle, W.: Beta-boosted ensemble for big credit scoring data. Expert
Systems With Applications, SFB 649 Discussion Paper 2016-052 (2016). Avail-
able at SSRN: https://ssrn.com/abstract=2875664 or https://doi.org/10.2139/
ssrn.2875664
UI Design Patterns for Flight
Reservation Websites

Zeeshan Haider Malik(&), Tayyab Munir, and Mesan Ali

Forman Christian College (A Chartered University), Lahore, Pakistan


zeeshanmalik@fccollege.edu.pk

Abstract. Flight Reservation Websites have become one of the most frequently
used online booking systems in today's technology-orientated world. In order to
ensure that such systems provide a good user experience, a study has been
conducted to analyze Flight Reservation Websites and to formulate User Interface
Design Patterns for such systems to enhance their User Experience. A systematic
approach has been adopted for the user study to come up with UI Design Patterns
which will help system developers design systems in a way that ensures a better
User Experience and addresses diverse user needs.

Keywords: UI Design Patterns · User experience · Usability · User study ·
Human Computer Interaction · Flight Reservation Websites

1 Introduction

To survive in today’s world, understanding technology and how it works is very
important. Hence, all kinds of users use the latest applications and websites to connect
with the rest of the world. Be it social media, online banking, purchasing or reservation,
new websites with new designs show up every day. Users, experienced or inexperienced,
take time and face difficulties getting used to these new designs. A systematic
approach has been used to identify and solve the problems.
Flight reservation websites face a similar scenario where users have to understand
new designs every time they book flights on different airlines. This is where User
Interface Design Patterns can play a vital role. The idea is to observe and evaluate users
booking flights using five top airline websites and interview them about their experience
through the process. This provides details of where inexperienced users cannot
figure out what to do and where experienced users have been facing problems. UI Design
Patterns have been formulated based on the results of these observations and interviews.
These UI Design Patterns will help improve the usability of flight reservation websites.

2 Literature Review

The research [1] mainly focuses on creating successful interactive systems. User
interfaces need to be developed in cooperation between developers, designers and
experts. This group of people exchanges the ideas and terminology used to enhance
the design and usability of the project. This paper also

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 310–327, 2020.
https://doi.org/10.1007/978-3-030-39442-4_24
UI Design Patterns for Flight Reservation Websites 311

presents an approach that uses pattern languages in software development, HCI and
application domains. One of the sections in this paper explains where the pattern
idea comes from and how it has been adapted to other disciplines, for example Patterns
in Urban Architecture, Patterns in Software Engineering, Patterns in Human Computer
Interaction and Patterns in the Application Domain.
One pattern language in interdisciplinary design is the Formal Hypertext Model
of a Pattern Language. This is a form of description of patterns which makes it less
ambiguous for the parties involved to decide what a pattern is supposed to look like in
terms of structure and content. The most important work for the developer is to use
patterns in the Usability Engineering Lifecycle. There are eleven major points in these
patterns, such as “Know the user”, “competitive analysis”, “setting usability goals”, “parallel
design”, “participatory design”, “coordinated design of the total interface”, etc. Focus on
HCI is particularly important for exhibits, but it is of equal importance to “kiosk”
and similar public-access systems where typical users are first-time and one-time users
with short interaction times and no time for learning.
Regarding software design patterns, they are important to supplement the
general ones. The main focus in this text is on those patterns that relate specifically to
software design, for example music exhibits. The purpose of the text is to apply the
pattern technique to entirely new application domains, which could further strengthen
the authors' argument that structuring design knowledge into patterns is generally a
valid approach. A more intensive use of HCI patterns in user interface design courses
will hopefully give us more detailed findings about their usefulness in education [1].
Another research [2] focused on how UI Design Patterns are introduced to HCI
students and the students’ reviews of them. The author talks about UI Design Patterns
and how these patterns are collected to form pattern languages. Patterns provide the reader
with a solution to a known problem within a defined context.
Fourteen students were asked to develop a pattern-based model for each of the two
given interfaces after they attended a lecture on how to develop UI Design Patterns.
They were also given a six-step generative process to follow, which was another
method of developing UI Design Patterns. Once the students had enough knowledge
about the development process, they were observed and asked to fill in questionnaires about
their experience. The results were more positive than negative. Most of the students
thought that design patterns should be used. This paper tells what approach should be
used to form UI Design patterns and how one should start learning in this field [2].
Another research [3], about a one-day workshop, analyzed how UI designers are using
patterns today. Ten possible topics within this scope were listed and explained. These
topics were divided into two categories: topics related to writing patterns and topics
related to using existing patterns. It was explained why writing patterns is important, but
there was no discussion about how to write a pattern, which makes this workshop more
theoretical and less practical. The paper claimed that there is very little work done on
reading and using existing patterns, which could be true because we have not found any
similar work for our literature review. The topics related to writing patterns are basically
different approaches which can be used while writing a pattern; these ideas can be
helpful when one is writing a pattern of one's own [3].
Another qualitative study about design patterns, this research [4] refers to problems
in the usability of existing interaction patterns. The existing patterns do not fulfill their
purpose, and after using several patterns the problems faced were listed. Among the
problems, one was that none of the pattern collections covered all of the
problem cases. Furthermore, a number of the pattern collections used actually tend to focus
on different aspects of the design process. In some cases, the naming of interaction
patterns seemed inconsistent and difficult to learn.
After demonstrating that problems exist when it comes to using design patterns, the
proposed solutions included organizing the patterns using problem-based or platform-based
grouping, using graphical references to help the developer understand the core of the
solution, and adopting systematic and standardized naming conventions to help in
identifying and remembering patterns [4].
Another research [5] talks about how UI designers and researchers can evaluate the
existing design patterns before using them. The goal was to form a tool to manage
pattern collections. At first, the problems faced while forming a tool were addressed,
which included standardizing a common pattern form, customizing patterns, relating
patterns, etc. After this, the existing patterns were investigated in a survey and it turned
out that some of the functions were already supported in some of the tools while others
were given very little support. There are details about which pattern holds which
specific function. This information was further used to identify six activities which
require support from a tool when managing pattern collections. The details about
which functions a pattern tool holds were important for us because they tell how
patterns differ and which features are important, and we will need this
information while forming our UI Design Patterns [5].
Another research talks about building complex analysis patterns from the combination
of simpler analysis patterns in the context of flight reservations. The requirements, as
a common context, were several cases related to flight reservation: a flight is defined
by a number and a date; a plane is assigned to a flight and contains a set of numbered
seats; flights are connected to a route, and a route consists of the origin and
destination of a flight (span). Different use cases of flight reservation describe how
different passengers book a flight according to their plans and choose different routes
for the same destination.
Then come the component patterns; these patterns are basically the building blocks
for the complex patterns. Each pattern has an intent and a problem, and its solution
is described with the approach of Object-Oriented Analysis and Design. These different
patterns include Travel Ticket, Seat Assignment, Collection of Seats, Self-Connection
pattern, Flight - Route pattern and Airport Role pattern. Then finally there is a flight
reservation pattern which describes the placement of an order for a series of tickets.
Different examples with different problems have been discussed in this paper in detail
with their solutions. The authors' approach involves the use of object-oriented methods and
semantic analysis patterns, and by solving the problems using object-oriented methods
the benefits they got are reusability, extensibility and conceptual abstraction. In this
paper, the ability of Semantic Analysis Patterns to compose patterns to build complex
patterns, or complex models in general, has been illustrated through a case study. The
specific problem that the authors used as a case study is of intrinsic interest because of
its economic importance. Software of this kind has been designed either by the
procedural approach (most likely) or by object-oriented methods (in most of the cases).
However, their search did not yield any complete examples, but the use of analysis
patterns can help designers who have little experience build good conceptual
models [6].
Another research [7] is about design patterns for ubiquitous computing. Design
patterns are a format for capturing and sharing design knowledge. The overall goal of this
work was to aid practice by speeding up the diffusion of new interaction techniques and
evaluation results from researchers, presenting the information in a form more usable to
practicing designers. Design patterns were first developed by Christopher Alexander for
architectural purposes; he developed 253 patterns. The pattern language for ubiquitous
computing uses the definition generated at INTERACT 99: “The goals of an HCI pattern
language are to share successful HCI design solutions among HCI professionals”. This
paper tells that patterns offer solutions to specific problems instead of providing high-level
suggestions. Pattern languages started in architecture and have been emerging
since then for UI design. Patterns have seen success in the area of software design and
among the software development community. The authors developed their own pattern
language using an iterative process, and they believe that more work can be done on this
language following another iteration. The patterns formed in this paper were evaluated by
practicing designers and the evaluations were used to improve the patterns [7].
When it comes to validating a UI pattern language, three types of validation need to
be considered: the validity of the individual pattern, the internal validation of the
pattern and the external validation of the pattern. The elements which characterize a
pattern and pattern language were identified. Six tests or questions were developed in
the paper [8] to determine internal validity of a pattern language. The same validation
technique has been used to validate the UI Design Patterns of the current paper.

3 User Study

To conduct this study, 35 participants were selected of diverse age groups and
experience in terms of travelling and using online booking systems to book flights.
Moving on to specifics, out of the 35, 20 participants had prior experience of online
flight booking and the remaining 15 had never been interested in booking their flights
through an online booking system; they preferred to contact a travel agent to book
flights for them. Moreover, 27 participants were male and 8 were female.
To conduct this study, the five top airline websites of 2017 were selected and used
to book flights [9]. The five airlines were:
1. American Airlines (www.aa.com)
2. Delta Airlines (www.delta.com)
3. Southwest Airlines (www.southwest.com)
4. United Airlines (www.united.com)
5. Ryan Air (www.ryanair.com)
Within-Subject Design [10] was used to conduct the experiment. This technique
was used in order to test each website with each participant. Within-Subject Design also
helped the participants to analyze the websites and come up with a good comparison
between the different websites.

In order to avoid fatigue and learning bias, the Latin square method [11] was used. “In
general, a Latin square is an N × N table filled with N different symbols positioned
such that each symbol occurs exactly once in each row and each column” [12]. This
method makes sure that each airline website gets an unbiased spot from first to last to
create variation.
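A minimal sketch of one way such an ordering can be generated is shown below; the cyclic
construction and the website labels are illustrative assumptions, not the exact ordering
used in the study.

```python
# Generate a Latin square ordering for the five airline websites, so that
# each website appears exactly once in each presentation position across
# a group of five participants.
websites = ["aa.com", "delta.com", "southwest.com", "united.com", "ryanair.com"]

def latin_square(items):
    n = len(items)
    # Cyclic construction: row i is the list rotated by i positions, so every
    # item occurs exactly once in every row and every column.
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]

for participant, order in enumerate(latin_square(websites), start=1):
    print(participant, order)
```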

4 Experiment

To make the experiment and the interviews of the participants smooth, a pilot test was
performed on each of the five airline websites by the interviewers, and each step of that
pilot test was noted down from the beginning to the end of the procedure of booking a
flight. After pilot testing, each task was transformed into flash cards of five different
colors for the five airline websites on which the pilot test was performed. It was made
sure that during the process the interviewees were able to perform every task on
their own without anyone's help.
Retrospective Testing [13] was selected to conduct the study. The user was handed
the tasks at first and instructed briefly about what he had to do. For each task, the time
taken to perform the task was recorded. In addition, during the interview a camera was
placed at such an angle that the participant's facial expressions could be recorded for
each task, and the screen was also recorded so that it could be used for analysis
afterwards. The website on which the user was going to perform the task was
already open on the screen. One of the interviewers sat right next to the participant
so that any technical problem or question from the participant could be handled
immediately.
The interviewer also took notes during each task so that the problems the
participant faced could be collected as data and later used to come up with solutions
to make the system better. The user proceeded to book his flight as instructed on the
tasks and the recordings were stopped when the purchase button was clicked without
any error in the form.
After performing each task, five tasks per participant, the participant was asked to
fill in a usability evaluation form. After that, the interviewer asked a predetermined set
of questions about the task that had just been performed. All the responses from the
participants were recorded.
All the data collected from this experiment was stored in five different spreadsheet
files. Every sheet consisted of the user's name and ID, the airline's usability score from
every user, the user's prior flight booking and travelling experience, and the time taken
by each user to book a flight. Rankings of the airlines were recorded on a separate sheet
based on average time and score.

5 Analysis

After conducting the experiment, the analysis showed that there are several usability
problems that can be easily handled with some UI Design Patterns.
A series of questions was asked of the participants in the form of an interview about the
website on which they performed the task. Those questions covered their prior experience
of booking online flights and travelling, and their first impression of the particular site
when they started the task. General questions about the User Interface were also included
to get more feedback about their experience. One of the optional questions in that
interview was about the participants' suggestions for adding additional features or any
kind of change that they felt should be in a particular flight booking website. Some UI
design patterns are designed in such a way that they play the role of a sub-pattern of a
particular main pattern. Most of the participants gave suggestions about the design and
other aspects related to layout, colors or main functionalities like date selection for the
travel. Some of the users got confused about the date: the issue was that if a passenger
wanted to travel on a one-way flight, at the instant of selecting a date for departure the
user was asked to enter two dates, one for departure and one for return. This happened
because when a user opens an airline's website for flight booking, the default Trip Type
is “Round-Trip”, and the icons or radio buttons vary between different airline websites.
They were so small in font and were placed so inappropriately that a normal or new user
would not see and select the option accordingly. So, a pattern is designed to eliminate
such problems.
Another major issue that was noticed was that all the websites share almost a
similar type of layout for the procedure of booking a flight, and most users faced
difficulty or their time for performing the task got delayed. Therefore, a new pattern
for flight booking is suggested that has a simple, efficient and easy-to-operate layout.
This particular layout has an appropriate font size and a color scheme that is visible
to all users. The booking a flight pattern contains five different sub-patterns, and all
of these patterns can solve basic design problems that a typical user faces. These
patterns include the “Flight Menu” pattern followed by the “Airport Selection”
pattern, which is obviously the main functionality of a flight booking procedure and
gives you enough options to select a preferred airport for the departure and arrival of
a passenger. The next is the “Trip Type Selection” pattern, which addresses the issue
where most of the users got confused when they were asked to enter or select multiple
dates for flight booking; it requires selecting the Round-Trip, One-Way or Multi-City
trip type in a drop-down menu. This makes sure that the user does not have to enter
or select multiple dates if he/she is travelling on a one-way trip. Next comes the
“Flight Date Selection” pattern, in which a monthly calendar appears with navigation
buttons to scroll to the month in which a traveler wants to travel. Then the final
sub-pattern for flight booking is the “Passenger Selection” pattern, which is simple and
provides enough options for a traveler to select from multiple age groups.
Another pattern, “Travel Date Flexibility”, is formed as per users' suggestion; it
provides multiple travelling dates so that the traveler can decide between feasible
dates, which can also vary the price range of a particular flight on a particular date.
This pattern leads to the formation of another pattern which was not up to the mark
in terms of layout, positioning and other aspects of design. One of the patterns formed
is the “Sort Flight Options” pattern, which provides numerous categories to sort the
flight reservation details, e.g. if the traveler wants to have a direct flight to his
destination then he will select the option of “number of stops” in a dropdown menu.

Further on, another pattern, “Filter Flight Options”, is formed so that the traveler can
also filter through the options for his departure/arrival based on his budget; this
option also provides the possible departure and arrival airports of a city from or
to which a traveler is travelling.

6 UI Design Patterns

Following are the patterns which are formed after conducting a series of interviews,
getting questionnaires filled, managing tasks, compiling the data retrieved from par-
ticipants and completing the overall analysis.
6.1.1. Pattern Name
Flight Menu.
6.1.2. Pattern Description
The very first menu on a flight reservation website.
6.1.3. Problem Statement
Some websites present these options in a random order.
6.1.4. Solution
Options in the right order as per the user's needs, with a design that visibly shows which
option is selected.
6.1.5. Use When
Shall be used as main menu when making a flight reservation website.
6.1.6. Example
Example of the UI Design Pattern is illustrated in Fig. 1.

Fig. 1. Initial menu to start booking your flight.



6.2.1. Pattern Name


Trip Type Selection.
6.2.2. Pattern Description
A drop-down list to decide if your trip will be one way or a round trip.
6.2.3. Problem Statement
Most websites use radio buttons for this option on the top left side, which are often not
easily visible.
6.2.4. Solution
The choice is easily visible to the user as it is now part of the horizontal booking flow
(a sketch of the underlying logic follows Fig. 2).
6.2.5. Use When
Can be used when booking a trip and there is a need to tell whether it is going to be
one-way or round trip.
6.2.6. Example
Example of the UI Design Pattern is illustrated in Fig. 2.

Fig. 2. A drop-down menu to choose your trip type between 3 options.
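The following sketch illustrates the logic behind this pattern: the trip type chosen in the
drop-down determines which date fields the form requires. The function and field names
are assumptions for the example, not taken from any airline's implementation.

```python
# Which date inputs a booking form should require for each trip type.
def required_date_fields(trip_type: str) -> list[str]:
    trip_type = trip_type.strip().lower()
    if trip_type == "one-way":
        return ["departure_date"]
    if trip_type == "round-trip":
        return ["departure_date", "return_date"]
    if trip_type == "multi-city":
        return ["departure_date_per_leg"]
    raise ValueError(f"Unknown trip type: {trip_type}")

print(required_date_fields("One-Way"))     # ['departure_date']
print(required_date_fields("Round-Trip"))  # ['departure_date', 'return_date']
```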

6.3.1. Pattern Name


Trip Date Selection.
6.3.2. Pattern Description
The calendar that will be used when a passenger needs to select dates for his trip.
6.3.3. Problem Statement
Mostly, there are different calendars to select the departure and return date.
6.3.4. Solution
This pattern makes it easier and faster with options for both dates in a single calendar.
6.3.5. Use When
Can be used for selecting dates for any trip.

6.3.6. Example
Example of the UI Design Pattern is illustrated in Fig. 3.

Fig. 3. A single menu with two calendars to select departure and return date.

6.4.1. Pattern Name


Passenger selection.
6.4.2. Pattern Description
Helps select the number of passengers as well as identify whether the passenger is an adult
or a child.
6.4.3. Problem Statement
Usually, websites only use number of passengers as an option, but there should be
different options for children and adults as the price package is different for them.
6.4.4. Solution
Providing the option to select from children, adults, seniors and infants.
6.4.5. Use When
When input is required for the number of passengers for an air trip.
6.4.6. Example
Example of the UI Design Pattern is illustrated in Fig. 4.

Fig. 4. Menu to select the number of passengers of all age groups.

6.5.1. Pattern Name


Travel Date Flexibility.
6.5.2. Pattern Description
Offers flexible dates with similar flights that you initially select for your search.
6.5.3. Problem Statement
Flexible dates are not offered on every website, or in some cases only a one-day window
is shown.
6.5.4. Solution
Offer flexible dates with a window of a few days (a sketch of building such a window
follows Fig. 5).
6.5.5. Use When
After a user searches his/her flight, he/she should be shown this with the list of flights
from which they are selecting.
6.5.6. Example
Example of the UI Design Pattern is illustrated in Fig. 5.

Fig. 5. Available dates for the searched flight are shown.
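A minimal sketch of how such a window of alternative dates could be built around the
searched date is shown below; the ±3-day window and the function name are assumptions
for the example.

```python
from datetime import date, timedelta

# Build a small window of alternative dates around the date the user
# searched, which can then be priced and shown alongside the results.
def flexible_dates(selected: date, window_days: int = 3) -> list[date]:
    return [selected + timedelta(days=offset)
            for offset in range(-window_days, window_days + 1)]

for d in flexible_dates(date(2020, 3, 15)):
    print(d.isoformat())
```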



6.6.1. Pattern Name


Flight Selection Overview.
6.6.2. Pattern Description
How the details of flights are shown for the user to select their preference.
6.6.3. Problem Statement
Mostly, there is too much information presented in a very haphazard manner, which
confuses the user.
6.6.4. Solution
Well organized, with all details covered and easily understandable through spacing and
sufficient font size.
6.6.5. Use When
Showing a list of flights to user for selection.
6.6.6. Example
Example of the UI Design Pattern is illustrated in Fig. 6.

Fig. 6. Details of the particular flight including stops.

6.7.1. Pattern Name


Sort Flight Options.
6.7.2. Pattern Description
Options to sort the list of flights as per the user's requirements.
6.7.3. Problem Statement
Not enough sorting options in most cases.
6.7.4. Solution
All possible sorting options in ascending or descending order (a sketch follows Fig. 7).

6.7.5. Use When


Should be used as an option of sorting trips.
6.7.6. Example
Example of the UI Design Pattern is illustrated in Fig. 7.

Fig. 7. Drop-down menu showing the list of all sorting options for flights.
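The sketch below illustrates the sorting logic behind this pattern; the flight fields and
criteria names are assumptions for the example.

```python
# Sort a flight list by any offered criterion, ascending or descending.
flights = [
    {"departure": "07:30", "duration_min": 310, "stops": 1, "price": 420.0},
    {"departure": "09:15", "duration_min": 255, "stops": 0, "price": 510.0},
    {"departure": "13:40", "duration_min": 400, "stops": 2, "price": 355.0},
]

def sort_flights(flight_list, key, descending=False):
    return sorted(flight_list, key=lambda f: f[key], reverse=descending)

cheapest_first = sort_flights(flights, "price")
fewest_stops_first = sort_flights(flights, "stops")
print(cheapest_first[0]["price"], fewest_stops_first[0]["stops"])
```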

6.8.1. Pattern Name


Filter Flight Options.
6.8.2. Pattern Description
Helps the user to filter the searched flights according to certain needs.
6.8.3. Problem Statement
This option is barely available on sites but it can be very helpful and useful.
6.8.4. Solution
An extra button which is not a hindrance in the usual booking process, but does offer to
serve the extra need just in case (a sketch of the filtering logic follows Fig. 8).
6.8.5. Use When
Shall be used with flight list to further filter the flight search.
6.8.6. Example
Example of the UI Design Pattern is illustrated in Fig. 8.

Fig. 8. Flight search filters as an extra option to fulfill additional needs.
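The sketch below illustrates the filtering logic behind this pattern, narrowing results by
budget, number of stops and departure airport; the field names and criteria are assumptions
for the example.

```python
# Narrow the searched flights according to the user's extra needs.
def filter_flights(flight_list, max_price=None, max_stops=None, departure_airports=None):
    result = []
    for f in flight_list:
        if max_price is not None and f["price"] > max_price:
            continue
        if max_stops is not None and f["stops"] > max_stops:
            continue
        if departure_airports is not None and f["origin"] not in departure_airports:
            continue
        result.append(f)
    return result

flights = [
    {"origin": "LHE", "stops": 0, "price": 510.0},
    {"origin": "LHE", "stops": 2, "price": 355.0},
    {"origin": "ISB", "stops": 1, "price": 420.0},
]
print(filter_flights(flights, max_price=450, departure_airports={"LHE"}))
```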

6.9.1. Pattern Name


Airport Selection.
6.9.2. Pattern Description
To search the airport for your departure or your arrival.
6.9.3. Problem Statement
Not very accurate; in some cases only the abbreviations for airports are used, which are
not understandable for all users.
6.9.4. Solution
A drop-down list with abbreviations as well as the full name of the airport and city
(a sketch of matching on both follows Fig. 9).
6.9.5. Use When
Wherever there is a need for an option to search airports.
6.9.6. Example
Example of the UI Design Pattern is illustrated in Fig. 9.

Fig. 9. List of airports with abbreviations and full airport names
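The sketch below illustrates matching a query against both the airport abbreviation and the
full airport/city name, as the solution suggests; the sample entries are assumptions for the
example.

```python
# Match the user's input against the abbreviation and the full name,
# so either form can be typed into the airport search field.
airports = [
    ("LHE", "Allama Iqbal International Airport, Lahore"),
    ("JFK", "John F. Kennedy International Airport, New York"),
    ("DXB", "Dubai International Airport, Dubai"),
]

def search_airports(query, entries=airports):
    q = query.strip().lower()
    return [
        f"{code} - {name}"
        for code, name in entries
        if q in code.lower() or q in name.lower()
    ]

print(search_airports("dxb"))     # matches by abbreviation
print(search_airports("lahore"))  # matches by city name
```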

6.10.1. Pattern Name


Online booking Payment.
6.10.2. Pattern Description
This specific UI pattern is for online payments after booking a flight.
6.10.3. Problem Statement
Mostly, such forms ask for unnecessary information, for example asking about the
card type, whereas it can be identified when the card number is entered, or asking about
the last name on the card as a required field when credit cards can be issued without
last names on them.
6.10.4. Solution
Take input only for the important information, create a single field for the name, and
make sure your system is smart enough to identify the card type when the number is
entered (a sketch of such identification follows Fig. 10).
6.10.5. Use When
This pattern can be used for any online payment form but it is specifically designed for
flight booking systems.
6.10.6. Example
Example of the UI Design Pattern is illustrated in Fig. 10.

Fig. 10. Payment method for flight booking with necessary information only.
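The sketch below illustrates identifying the card type from the entered number so that no
separate card-type field is needed; the prefix rules shown are simplified assumptions rather
than a complete issuer table.

```python
# Identify the card type from the number the user enters, removing the
# need for a separate "card type" input on the payment form.
def detect_card_type(card_number: str) -> str:
    digits = card_number.replace(" ", "")
    if digits.startswith("4"):
        return "Visa"
    if digits[:2] in {"51", "52", "53", "54", "55"}:
        return "MasterCard"
    if digits[:2] in {"34", "37"}:
        return "American Express"
    return "Unknown"

print(detect_card_type("4111 1111 1111 1111"))  # Visa
print(detect_card_type("5500 0000 0000 0004"))  # MasterCard
```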

6.11.1. Pattern Name


Passenger details.
6.11.2. Pattern Description
This pattern is for collecting details about a person travelling through your airline. This
can also be used for collecting user’s information for almost any website.
6.11.3. Problem Statement
Most forms ask for a lot of unnecessary information and take up a lot of the user's
time, which makes the user less interested in giving his/her information.
6.11.4. Solution
A pattern that precisely sticks to the information that is necessary so the user can quickly
fill it in.
6.11.5. Use When
When you need user’s information, for example, his name, gender, date of birth, etc.
6.11.6. Example
Example of the UI Design Pattern is illustrated in Fig. 11.

Fig. 11. Passenger details with minimum information to input.

6.12.1. Pattern Name


Contact Method.
6.12.2. Pattern Description
After the passenger/user enters his details, there is a need for his contact information.
6.12.3. Problem Statement
Mostly, a user is asked to enter his email ID and contact number for contact details
which becomes too much information to enter.
6.12.4. Solution
The user is given a choice where he can select his preferred method of contact and has to
enter information according to that choice only. For example, the user can select Email as
his preferred method of contact and then has to enter his Email ID only (a sketch follows
Fig. 12).
6.12.5. Use When
Should be used when you are looking for a way to contact the user in future.
6.12.6. Example
Example of the UI Design Pattern is illustrated in Fig. 12.

Fig. 12. Select suitable contact method from two possible options.
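The sketch below illustrates the conditional input behind this pattern: only the field
matching the preferred contact method is required; the field names are assumptions for the
example.

```python
# Only the field for the preferred contact method is required.
def required_contact_fields(preferred_method: str) -> list[str]:
    method = preferred_method.strip().lower()
    if method == "email":
        return ["email_address"]
    if method == "phone":
        return ["phone_number"]
    raise ValueError(f"Unsupported contact method: {preferred_method}")

print(required_contact_fields("Email"))  # ['email_address']
print(required_contact_fields("Phone"))  # ['phone_number']
```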

6.13.1. Pattern Name


Flight Seat Selection.
6.13.2. Pattern Description
This is basically a seat map for an airplane where the user can select a seat of his choice
and is notified about the additional charges.
6.13.3. Problem Statement
Seat maps used by most of the airline websites are not easily understandable, do not give
enough details, and the preview is not concise.
6.13.4. Solution
This pattern, used by United Airlines, covers every minor detail that a user is looking for:
seat number, availability/unavailability of any seat, which class a seat belongs to,
etc.
6.13.5. Use When
Can be used in an airline reservation system when giving the user an option to choose a
seat of their choice.
6.13.6. Example
Example of the UI Design Pattern is illustrated in Fig. 13.

Fig. 13. A map of the airplane to select the seat of your choice.

7 Conclusion

The paper provides UI Design Patterns for Flight Reservation Websites. These UI
Design Patterns will prove to be quite beneficial in terms of providing guidance to the
future developers of such websites and will serve as the basis for design decisions
for flight reservation websites. In light of these UI Design Patterns, flight reservation
websites will provide a better user experience and will cater to diverse user needs.

References
1. Borchers, J.O.: A pattern approach to interaction design, pp. 1–10. ACM (2000)
2. Todd, E.G., Kemp, E.A., Phillips, C.P.: Introducing students to UI patterns, pp. 37–40. ACM
(2009)
3. Van Welie, M., Mullet, K., McInerney, P.: Patterns in practice: a workshop for UI designers,
pp. 908–909. ACM (2002)
4. Segerståhl, K., Jokela, T.: Usability of interaction patterns, pp. 1301–1306. ACM (2006)
5. Deng, J., Kemp, E., Todd, E.G.: Managing UI pattern collections, pp. 31–38. ACM (2005)
6. Jiang, Z., Fernandez, E.B.: Composing analysis patterns to build complex models: flight
reservation. ACM (2009)
7. Chung, E.S., Hong, J.I., Lin, J., Prabaker, M.K., Landay, J.A., Liu, A.L.: Development and
evaluation of emerging design patterns for ubiquitous computing, pp. 233–242. ACM (2004)
8. Todd, E., Kemp, E., Phillips, C.: Validating user interface pattern languages, pp. 125–126.
ACM (2003)
9. World Airline Rankings (2017). https://www.flightglobal.com/news/articles/insight-from-
flightglobal-worldairline-rankings-20-439587
10. Malik, Z.H., Arfan, M.: Evaluation of accuracy: a comparative study between touch screen
and midair gesture input, pp. 448–462. Springer (2019)
11. Malik, Z.H.: Usability evaluation of ontology engineering tools, pp. 567–584. IEEE (2017)
12. MacKenzie, S.: Human-computer interaction: an empirical research perspective, pp. 177–
188. Morgan Kaufmann Publishers (2013)
13. Malik, Z.H., Farzand, H., Shafiq, Z.: Enhancing the usability of android application
permission model, pp. 236–255. Springer (2018)
Conceptual Model for Challenges
and Succession Opportunities for Virtual
Project Teams in the GCC

Rasha Abou Samra1(&), Nizar Al Sharari1,2, and Salem AlTunaiji1,2
1 Faculty of Business, Higher Colleges of Technology, Abu Dhabi, UAE
rabousamra@hct.ac.ae
2 Faculty of Arabic Language Studies, Higher Colleges of Technology, Abu Dhabi, UAE
http://www.hct.ac.ae

Abstract. This is a systematic review for designing a conceptual model of the
reinforcement factors and the inhibitors of the success of using virtual project
teams among geographically and/or organizationally dispersed teams. The GCC
countries have common traditions and cultural aspects. The study tries to answer
one main question about the challenges and opportunities for success of depending
on virtual project teams in managing GCC projects. Two sub-questions are
derived from the main question: what factors do GCC countries have to make
virtual project teams succeed? What benefits and obstacles may the managers of
virtual project teams have in the GCC zone? In the first part of this
paper, the researchers critically analyze the state-of-the-art literature in the
field of virtual project teams. In the second part of this paper, the researchers
compare the findings of previous research with the published GCC reports
issued by official organizations. The objective of this study is to come up with
recommendations about the future practical research variables that measure the
success or failure of virtual project teams in the GCC context. Researchers
discussed challenges like resource allocation and decentralization of
responsibility as well as opportunities for success like having better cost-controlled
projects and more efficient time management. The purpose of the study is to
design a tailored, customized conceptual model for the context of the GCC to
conceptualize the possible positive and negative relationships that determine the
potential of success of VPT in this region. The problem of the research is a
scarcity problem. It is focused on the scarcity of research contributions to the
conceptualization of a model for VPM succession; however, the documents
reveal that there is a real need for VPM in general and especially for the GCC
region. The solution to this problem in this research is represented in the
generation of a new model which is customized to the characteristics of the GCC
and how these characteristics are related in a way that motivates further virtual
cooperation and succession. The paper provides a conceptualization of possible
relationships between a number of major constructs like the efficiency of virtual
communication and coordination, cultural rapprochement, and the cost of VPT
functioning. The economic and political instability of the GCC environment is a
real motive to use the results of this study for further investigation of the
potential of success of VPT in the GCC.


Keywords: Virtual project teams · GCC · Culture · Virtual communication ·
Innovation capabilities

1 Introduction and Relevance

Much research has focused on the fields of project management (Kelly, Edkins, Smyth,
Konstantinou, 2013 [28]; Sommerville, Craig, Hendry, 2010 [16]; Smith [3]; Paton,
Hodgson, Cicmil, 2010 [37]) and project teams (Rezania, Lingham [4]; Sackmann,
Friesl, 2007 [29]; Mueller, 2012 [19]) at work places. However, the higher tendency
toward connectivity and virtual contexts of project teams' work has not received the
same research effort. Virtual project teams are increasingly used for the types
of projects that need high knowledge levels from different cultures. Virtual teams
solve many problems like time differences and geographical constraints. The
formation of organizational project teams needs a cooperative background among team
members [15, 20, 21].
Globalization and advances in digital communication facilitated the creation of
International virtual project teams (IVPTs), allowing cross-cultural and multinational
collaboration between team members regardless of cultural, historical, socio-political,
and educational differences [1]. On the contrary, virtual teamwork is not always a
guaranteed success, and can be a bumpy ride. This indicates the increasing importance
as well as usage of virtual project teams nowadays and in the future. This also means that
the importance of research in this area will increase. Traditionally, virtual team members
are chosen because of having certain knowledge, not because they are able to cooperate
well with other styles of personalities in the virtual team. Mehtab, Ishfaq [25] stated
that virtual leadership is one of the major challenges in modern work. From the
organizational point of view virtual leadership is important because it gives the
organization a high level of flexibility and responsiveness (Potter, Balthazard and Cooke,
2000) [34]. Potter and others, 2000, found that the manager of the team will be able to
have higher control over the team members if the size of the virtual team is small and
the time needed for the project is relatively short. Communication and controlling
long-term virtual projects is a challenge of the modern work environment, in addition to
large-sized virtual project teams, as stated by Morgan, Paucar-Caceres and Wright, 2014 [26].
From the team members’ point of view, they can join more than one team in more
than one organization; they can also have higher flexibility and knowledge diversifi-
cation. Potter, 2000 also found that virtual team members can participate and contribute
their knowledge and opinions simultaneously, without waiting for a dominating
member to stop speaking. This conclusion leads us to the expectation of having lower
importance of the role of the virtual team manager over time. There is an opportunity
that self-managed virtual team members will arise intensively in the coming decade. Holton,
2001 [18] found that the organizational effectiveness may be affected by the progress of
virtual project team performance if there is no deep dialogue to cultivate a set of shared
values among virtual team members.
However, in the GCC countries such a factor may represent a positive point that may
highly affect the success of virtual project teams in the Arabian Gulf Area.
Collaboration enhances organizational learning. In the GCC charter, it is clear that
the GCC countries have common cultural factors like language, because they are all Arabic
nations. In addition, they are all Islamic ones. They have strong historical bonds.
Moreover, those are the main reasons for the establishment of the Gulf Cooperation
Council (Charter 1). The GCC countries have cooperation values among them according to
this charter; however, research on the factors affecting virtual cooperation among the
GCC is rare. A window is open for both practitioners and researchers to take further steps
in the field of virtual project teams' potential of success in the GCC. In the following
parts of the study we will systematically review the contribution of the literature to the
relationships between factors related to cultural rapprochement and the efficiency of
virtual communication and coordination. The cost of VPT and the efficiency of
communication and coordination may hinder the succession of VPT in the GCC;
however, the innovation capabilities will play a positive role in reinforcing the
succession of VPT in the GCC countries, as we will discuss in the coming parts of this study.

2 Literature Review

Having shared values among virtual project teams in the Gulf countries may highly
affect the effectiveness of their organizations. This will also achieve the GCC countries'
objectives of reinforcing links and strengthening cooperation. This research is
important because it helps in a way to achieve the GCC countries' integration with
higher flexibility, lower geographical constraints, and better cultural understanding
among virtual team members. David J. Pauleen from Victoria University of Wellington
in New Zealand, 2002 [6] defined virtual project teams as “the dispersed groups
of coworkers that use telecommunication technologies to accomplish organizational
tasks”. This definition highlights the importance of virtual and global project teams.
Another definition for virtual project teams classifies virtual teams into geographically
dispersed and organizationally dispersed ones.
These teams have a goal to accomplish, and it is possible to accomplish it using
information technologies and telecommunication (de Jong, Schalk and Curşeu, 2008
[33]). In conclusion, virtual project teams use technologically mediated communication
to perform their tasks. By using virtual project teams, the organization will be able to
become more dynamic and to be involved in projects that are more complex. The
complexity and ambiguity of boundary-less virtual project teams tend to be the main
research point in most of the virtual project team research (Peters and Manz,
2007 [17, 22]; Denton [5]; Oertig and Buergi, 2006 [24]; Morris, 2008 [36]; de Jong,
Schalk and Curşeu, 2007 [33]). However, the nature of the GCC countries has a positive
impact on minimizing the ambiguity of boundary-less virtual project teams, and yet
there is still more research needed on the area of challenges and potential of success in
GCC countries.

2.1 GCC Countries Need Virtual Project Teams


Population density (people per sq. km; population density is the midyear population
divided by land area in square kilometers) in the United Arab Emirates reached 111 per
sq. km in 2016. The national institutes in the Arab Gulf countries estimated that the total
population of the GCC countries reached 51 million in 2016, and the population density
per km2 of the GCC countries is increasing at higher percentages in 2009 compared to
2007 and 2008; in Saudi Arabia it increased 700% from 1961 to 2017 [27, 30]. Statistics
show the share of the population in the Gulf Cooperation Council region that will reside
in an urban setting. The following chart (Fig. 1) shows the share of urban population in
the Gulf Cooperation Council region from 2005 to 2030.

Fig. 1. Growth in the GCC population from 2005 to 2030 [9–12].

This increase gives higher priority to researchers to find more cost-controlled methods
of conducting business. Research has shown that virtual project teams are cost
controlled (Potter, Balthazard and Cooke, 2000 [34]).
Foreign direct investment in the GCC is rapidly increasing. The development
ranged from less than 50,000 million US dollars in the year 2000 to almost 300,000
million US dollars in the year 2009. The main GCC country attracting foreign
investments is KSA. Regarding the GCC's foreign direct investment in other countries,
it increased from 1,100,000 million US dollars in the year 2000 to 13,500,000 million US
dollars in the year 2009. The main GCC countries having foreign direct investments in
other countries are the UAE (39.7%) and KSA (29.9%) [31, 32].
This makes it more important to study how to increase the effectiveness of
conducting high-performance virtual project teams, whether among the GCC or between
the GCC countries and the rest of the world. Accordingly, the researchers are applying
their research to these two countries, KSA and UAE. The following table shows the
size of the KSA and UAE intra-trade for the years 2007-2009 (numbers
are in million dollars):

Table 1. Intra-trade between KSA and UAE and the GCC from 2007 to 2009

Country  2009    2008    2007
U.A.E    14.933  15.727  12.558
K.S.A    23.757  27.039  22.818

Reference: https://www.statista.com/statistics/957368/gcc-urban-population/ [14, 23]

Table 1 shows that the UAE and KSA were not able to increase their intra-trade size
continuously; it declined in the year 2009. They must find tools to facilitate intra-trade
between each other. One of the main objectives of GCC countries is mentioned in the GCC
charter as:
“To stimulate scientific and technological progress in the fields of industry,
mining, agriculture, water and animal resources; to establish scientific research;
to establish joint ventures and encourage cooperation by the private sector for the
good of their peoples” [35].
This objective is another reason for researchers to focus on technological solutions.
Using technological advances in establishing joint ventures and encouraging
cooperation among the GCC, mainly in the private sector, is very important. Working
very early or working very late according to time differences is one component of the
virtual aspects of communication. This challenge does not exist in the virtual project
teams in the GCC if they do not include members from outside the GCC in their teams.
Morris, 2008 [36], found that project managers in many cases treat virtual teams the
same as co-located teams. This indicates that a study is needed to investigate the
human impact versus the virtual working impact on the success of the project. Some
practices are affected by cultural retention and traditions; however, these practices are
not tested against productivity and success if applied virtually. A virtual work
environment may increase stress indicators like tensions and conflicts. Disengagement
in the virtual environment is higher than disengagement in the co-location environment,
as found by Morris, 2008 [36]. This shows that cultural differences are still not well
managed for virtual project teams and that the tendency toward managing virtual project
teams in similar cultures will be more successful and effective [38].

2.2 Virtual Project Team Performance Success


According to previous research, virtual project teams are a technological facilitator for
any organization to implement project tasks. As long as the increase in intra-trade size
is one of the GCC objectives, more focus is needed on studying the role of virtual project
teams in achieving this goal. Synchronization, or the lack of it, in communications among
virtual project team members is one of the characteristics that overcome time limitations.
David (2002) [6] found in his qualitative research that cultural differences led virtual
project team members from Australia and Thailand to build trust first by using telephone
communications and then moving to emails and teleconferences for faster communication.
He also found that in Thailand they respect managerial differences and layers in speaking.
Instead of saying “hi” they use “sir”, which is not the case for the Australian members.
These environmental barriers may be common in other virtual contexts. In the Gulf
countries, they share a common language and common traditions. They have higher chances
to repeat co-location communications due to the short geographical distance among GCC
countries. Holton, 2001 [18] used both inductive and deductive research approaches to
find out how to increase trust and collaboration among virtual team members as a major
factor for organizational success. He focused on the team climate and the opportunity for
regular communication to create mutual trust and cohesiveness [39].
Beranek and Martz, 2005 [40] conducted research on how to make virtual teams
more effective by improving relational links using training tools. They concluded that
they could increase cohesiveness and the ability to exchange information, which
positively affects team performance. In the GCC countries, other reasons for
cohesiveness are available as a cultural background before the process of virtual team
formation. The point is the level of knowledge, which represents another important
factor affecting virtual project team performance.
The GCC countries are still in the process of importing external experts in different
fields, which may decrease the desire to formulate all virtual project team members from
the GCC only. The advantage of facilitated cultural communication may increase team
learning, especially if a higher level of knowledge is one of the conditions in the stage
of virtual project team formation. Kuruppuarachchi, 2006 [41] was trying to find out
how to maximize the performance of virtual project teams. He emphasized arranging
periodic co-location face-to-face meetings among team members on both a monthly
basis and a weekly basis. This is more convenient if the team members are living in
neighboring countries [16].
Giving more freedom to the virtual team rather than counting on old control
management styles, and raising the skills of members, were part of the conclusions of
Kuruppuarachchi's 2006 [41] research. This supports the idea that for the successful
application of virtual project teams in the Arabian Gulf Area it is necessary to minimize
the role of manager control and to raise the learning effort of organizational members.
This may lead to better performance of GCC virtual project teams. This conclusion
matches the findings of Denton's work [5]. Increasing intranet learning through comparing
results with plans and creating feedback loops increases the level of virtual project
teams' performance. Denton [5] found that rapid and clear feedback loops
encourage flexibility in controlling virtual teams' self-directed tasks.
This also supports the idea of minimizing the need for traditional management
control and maximizing virtual project members' self-control. Keeping on track by
repetitive, regular co-location face-to-face communication is also mandatory for
effective task performance. The environmental context of GCC virtual project teams is
another positive component of the potential of success of using them in GCC countries [8].
Oertig and Buergi, 2006 [24] talked about the challenges of managing cross-cultural
virtual project teams. Through their qualitative research, which took the form of a
thematic analysis, they were able to find out that the challenges include managing
language and cultural issues among virtual project team members. Consequently,
managing virtual aspects of communication and building trust are other challenges.
As long as there is a unified language and a semi-unified culture among the GCC
countries, it is expected that building trust among virtual project team members
will be easier, faster, and more effective.

Peters and Manz, 2007 [22] were trying to find out how the depth of relationships,
trust, and shared understandings among team members feed into a team's collaborative
ability. They initiated a conceptual model for identifying antecedents of virtual project
teams' collaboration. They conceptualized an effect of the depth of the relationship on
the degree of virtual collaboration. In addition, they expected that the depth of
relationships positively affects the level of trust and speeds up reaching shared
understanding. Consequently, this increases the potential of success of the degree of
virtual collaboration. If those four factors are reinforced in the virtual project team,
then the researchers expected that more innovation in performance might occur.
The internal feeling of collaboration by virtual project team members is another
point that is highly expected among GCC citizens due to the agreements they have and
the GCC charter that has the same goal. More focus on the task rather than on
relationship building may lead to higher competition or sometimes miscommunication
(Oertig and Buergi, 2006) [24]. This will not be the case with people living next to each
other who have the same economic and social goals, as in the case of the GCC. For
having a deeper relationship and keeping all team members in the loop, it was found
that face-to-face communication is a must. In addition, it is recommended that the team
leader be located in one country while the team manager is located in another country,
with almost the same number of virtual project team members with each (Oertig and
Buergi, 2006) [24]. In their qualitative research, Oertig and Buergi found by
interviewing managers that building trust in virtual project teams takes between three
and nine months. This trust encourages virtual team members to report problems to the
project leader even before taking the formal action to report them. This leads to higher
performance efficiency.
This aspect of trust will be related, from the researchers' point of view, to the task
itself and comes after agreement on the relationship itself. Informal contacts among
virtual team members and between the team manager and the team leader are reinforcing
factors for reaching better levels of trust more quickly. Keeping everybody at the same
level of information is another very important aspect. This aspect may be part of the
role of the virtual project manager and leader.
Equality of participation without dominance and overload understanding are main
characteristics of managing virtual project teams. Margaret [24] found in her research
that between the USA, Europe, and Japan, using the English language involved cultural
barriers that significantly affected the level of trust. Nevertheless, in the case of the
GCC, using the Arabic language will remove barriers and increase trust, especially since
this language is correlated with the same zone's traditions and religion. Regional
stereotyping is minimal among the GCC, because they have many shared factors like
weather, language, level of income, religion, history, norms and traditions. They also
have interrelationships that are revealed by the intra-travelling ratio that is increasing
among the GCC.
Regarding the management of virtual project team tasks, the use of matrix virtual
project teams also has its challenges. The functional loop of information has to feed all
functions, alongside project information, to avoid isolation. That is why it is recommended
that the virtual project team manager continue co-located face-to-face communication for
better relationship building. The cost of continuous
intercultural communication skills training is saved, or is at least lower for GCC
virtual project teams than for other culturally dispersed virtual project teams.
High turnover, one of the main characteristics of virtual project teams, represents one
of the barriers to building trust. It is therefore highly recommended that virtual project
teams be formed on the basis of a shared background, so that each team starts with an
acceptable level of trust, which is exactly the case for GCC virtual project teams. This may
increase performance speed and efficiency compared with virtual project teams based on
different cultures. It also reinforces the idea of assuming responsibility for the team's
goals rather than only for one's own contribution, which was one of the barriers to creating
collaboration identified in Linda Peters's 2007 research [17]. In that research, she also
recommended forming informal personal relationships for better functioning and collaboration;
this is readily available in the GCC, where statistics show that citizens have extended
families in other GCC states and are increasingly travelling to one another's countries.
Reliance on team members' knowledge is the aspect that needs more in-depth research in
comparison with multicultural virtual project teams. de Jong, Schalk and Curşeu (2008) [33]
found that higher levels of virtuality increase the effect of perceived task conflict on
virtual team performance, and vice versa. By level of virtuality, de Jong and colleagues
meant three dimensions: the degree of synchronization, the presence of nonverbal and
para-verbal cues, and the extent of use of communication media. Regarding nonverbal
communication among GCC virtual project teams, a high degree of understanding and clear
interpretation of these cues can be expected. This will lessen the manager's effort in
resolving conflicts arising from misunderstandings of those cues. de Jong and colleagues
also showed that relationship conflict has a negative impact on perceived virtual project
team performance [33].
Task conflict, by contrast, has a positive impact on perceived virtual project team
performance. In that research, increased relationship conflict decreases the perceived
performance of the virtual project team. Relationship conflict in the GCC is lower than in
multicultural virtual project teams due to the informal nature of relationships among GCC
citizens, their interrelationships, and related families [13].

2.3 Virtual Project Team Performance Challenges


Schenkel and Garrison (2009) [42] found that team efficacy is determined by both
cognitive capital and entrepreneurial orientation, and that virtual entrepreneurial team
performance is in turn determined by team efficacy. According to these findings,
private-sector entrepreneurs in the GCC have a higher probability of success when their
virtual project team members possess strong cognitive capital. This represents one of the
main challenges the GCC must focus on: preparing the region's cognitive capital to invest
in the private sector through high-efficacy virtual project teams. Schenkel and colleagues
(2009) [42] also found that relational capital and cognitive capital, along with
entrepreneurial orientation, increase virtual team efficacy, and that increased virtual
team efficacy increases virtual project team performance.
In that research, the cognitive ability of virtual team members was measured by four
variables: the breadth of perspectives on the problem at hand; the size of the pool of
potential solutions to examine; the extent of innovative ideas present among members; and
the variety of criteria used by members to evaluate possible solutions. All of these
variables may represent training needs for virtual team members in the GCC countries,
which constitutes one of the main challenges.
Innovation in virtual team performance will depend heavily on the cognitive preparation
of virtual project team members. This is another cost item that has to be controlled during
team formation, and the return on investment of this training, in terms of virtual project
team performance, has to be measured and evaluated. Behrend and Erwee [2] conducted research
in the same area, mapping knowledge flows in virtual teams. They found that knowledge
management in virtual environments is more complex than common business practice suggests,
which further raises the value of training prior to the formation of the virtual project
team. They concluded that organizations often launch new initiatives without understanding
the inner workings of the formal and informal networks involved, relying on the philosophy
that more communication and collaboration are better.
This keeps an initial understanding of cognitive capital essential. It also shows that
strong communication and collaboration are success factors but are not enough on their own
without an understanding of the dimensions of virtual knowledge sharing. Resource allocation,
coordination, and communication support systems were found to be further challenges that
virtual project team management will have to deal with, according to the findings of Drouin,
Bourgault and Gervais [7]. Innovative performance results from both knowledge exploitation
and access to knowledge in virtual project teams.
Innovation capabilities are another challenge facing GCC-localized virtual project teams,
because these teams aim to overcome cultural communication problems without sacrificing high
levels of team performance. This conclusion matches the conceptual model of Gressgård (2011)
[42], who found that technological development improves communication characteristics, and
that communication characteristics improve two innovation capabilities: knowledge
exploitation and knowledge access (which researchers may refer to as knowledge exploration).
Both, in turn, improve the innovative performance of the virtual project team.
Although ease of building trust is one of the success factors for GCC virtual project
teams, it was found that carefully choosing the communication medium affects virtual team
members' perception of trust, as shown by Joel Olson and Linda Olson in 2012 [17]. The last
challenge is the need for decentralized responsibility to enhance sharing. Charismatic and
participative leadership is needed, but at the same time project goals and decisions are
negotiated at the level of the virtual team members. These are the findings of Muganda and
Pillay (2013) [29], who studied forms of power, politics, and leadership in asynchronous
virtual project environments.
3 Conclusion

This research investigates the specific environmental requirements for the applicability
of virtual project teams in the GCC countries. The main research question is: "What are the
main challenges and succession opportunities for virtual project teams in the GCC?"
To answer this question, the researchers conducted a systematic review of the literature
and developed a conceptual model (Fig. 2) that represents a starting point for validating
and measuring the successful performance of GCC virtual project teams:

Efficiency of virtual
communication and
coordination:

- Choosing the
communication
medium
- Easiness of co-
location
Cultural rapprochement Innovation capabilities of VPT:
communication

- Language - Self-managed
Traditions knowledgeable teams
- Cost of VPT functioning:
- History - Cognitive capital
- Religion and - building trust - Knowledge sharing
values - resolving relationship techniques
- Families relations conflict - Decentralization
- sharing knowledge
- VPT leadership Success of VPT in the GCC
- Resources Allocation

Fig. 2. The conceptual model of the research. (Source: Researchers’ conceptualization)

In fact, virtual project teams in the GCC can be considered organizationally dispersed
rather than culturally dispersed because of their cultural rapprochement. Virtual project
teams have a high potential for success in the GCC countries because better communication
is available, including better understanding of non-verbal communication cues and the ease
of repeated co-located face-to-face communication. Shared values, traditions, and a common
language will also minimize the difficulty of resolving virtual project team conflicts.
Private-sector entrepreneurs have a higher chance of being supported by the GCC in starting
their virtual project teams, which are considered a cost-controlled method of performing
projects. Time limitations do not represent a challenge for GCC virtual project teams.
Some challenges do remain regarding virtual project team performance in the GCC area.
They mainly concern knowledge-sharing techniques, the training needed to raise the cognitive
capital of virtual project members, the careful choice of communication
medium, and the successful decentralization of responsibility, with an appropriate
management role in resource allocation, knowledge sharing, and coordination.

4 Discussion, Limitations, and Perspectives for Future Work

The research depends on innovation capabilities to achieve the success of the VPTs in
the GCC, as revealed and explained earlier in the conceptual model. However, the researchers
expect a weakness, at least in the short run, in the innovation capabilities of the GCC
countries because of the centralization of authority, especially in the region's public
sectors [43]. Decentralization does not dominate organizational policies in these countries,
which may hinder work based on self-managed knowledge-creation teams. On the other hand,
cultural similarity may represent both an advantage and a disadvantage for the success of
the VPT. As an advantage, it may reinforce knowledge sharing; there is a relationship
between the cultural effectiveness of communication and the effectiveness of knowledge
sharing (Abousamra, 2015) [33]. Cultural rapprochement may become a disadvantage when
considering the need for diversity to achieve better innovation outcomes. Cultural
differences may lead to better innovation results, but at the same time may lead to higher,
unproductive levels of team conflict. From the researchers' point of view, knowledge-sharing
capabilities are the focal point in turning the negative effect of cultural rapprochement
into a positive one.

The study did not consider possible impacts of, and relationships with, economic and
political factors [45]; its main focus is on the possible relationships between
technological and cultural variables in the context of the study. Further research is
needed to study the effect of economic and political instability on the potential success
of VPTs in the GCC region. The conceptual model also did not tailor the relationship between
cultural and technological variables to a classification of projects by size, industry, or
sector, even though relating these two aspects to innovation needs and capabilities is, in
the researchers' view, relatively more important. Further research may reveal the need for
more detailed conceptual models for each sector or project size.

This paper is a systematic review of the literature related to the success of virtual
project teams. Its main contribution is that it tailors the contributions of the literature
to the documented contextual variables of the GCC; however, there is a shortage of research
on the potential success of virtual project teams in the Gulf area. This is a gap in both
knowledge and empirical research. Future work on this gap is expected to focus on measuring
contextual constructs to validate the conceptual model of this study and on applying it to
forecast success relationships and impact coefficients among the model's variables. This
study is thus an introduction to quantitative studies on quantifying the positive and
negative relationships in the model. The determinants of VPT success in the GCC represent
a real need for GCC countries in the current era, especially during global financial crises
accompanied by instability in the MENA context. System dynamics modeling research is needed
to measure the dynamics of delays and reinforcement factors
affecting this model, especially when taking into consideration the effects of economic
and political factors in the region.

References
1. A strategic direction. 2(5), 22–24 (2011)
2. Behrend, F., Erwee, R.: Mapping knowledge flows in virtual teams with SNA. J. Knowl.
Manag. 13, 99–114 (2009). https://doi.org/10.1108/13673270910971860
3. Smith, C.: Understanding project manager identities: a framework for research. Int.
J. Manag. Proj. Bus. 4(4), 4 (2011)
4. Rezania, D., Lingham, T.: Coaching IT project teams: a design toolkit. Int. J. Manag. Proj.
Bus. 2(4), 577–590 (2009)
5. Denton, D.K.: Using intranets to make virtual teams effective. Team Perform. Manag. 12
(7/8), 253–257 (2006)
6. Pauleen, D.J.: Leadership in a global virtual team: an action learning approach. Leadersh.
Organ. Dev. J. 24(3), 153–162 (2003)
7. Drouin, N., Bourgault, M., Gervais, C.: Managing virtual project teams: recent findings
(2009)
8. http://sites.gcc-sg.org/Statistics/Files/1300866234.pdf. Accessed 23 Jan 2019
9. http://sites.gcc-sg.org/Statistics/Files/1306750608.pdf. Accessed 5 Dec 2018
10. http://sites.gcc-sg.org/Statistics/Files/1306750608.pdf. Accessed 19 Dec 2018
11. https://www.statista.com/statistics/1005526/gcc-population-growth/. Accessed 12 Nov 2019
12. https://data.worldbank.org/indicator/EN.POP.DNST. Accessed 2 Feb 2019
13. https://www.statista.com/statistics/957368/gcc-urban-population. Accessed 15 Jan 2019
14. Sommerville, J., Craig, N., Hendry, J.: The role of the project manager: all things to all
people? Struct. Surv. 28(2), 132–141 (2010)
15. Olson, J., Olson, L.: Virtual team trust: task, communication and sequence. Team Perform.
Manag. 18(5/6), 256–276 (2012)
16. Holton, J.A.: Building trust and collaboration in a virtual team. Team Perform. Manag. Int.
J. 7(34), 36–47 (2001)
17. Mueller, J.: Knowledge sharing between project teams and its cultural antecedents. J. Knowl.
Manag. 16(3), 435–447 (2012)
18. Kuruppuarachchi, P.: Managing virtual project teams: How to maximize performance.
Handb. Bus. Strategy 6, 71–87 (2006)
19. Jarle Gressgård, L.: Virtual team collaboration and innovation in organizations. Team
Perform. Manag. 17(1/2), 102–119 (2011)
20. Peters, L.M., Manz, C.C.: Identifying antecedents of virtual team collaboration. Team
Perform. Manag. 13(3/4), 117–129 (2007)
21. Lee-Kelley, L.: Situational leadership managing the virtual project team. J. Manag. Dev. 21
(6), 461–476 (2002)
22. Oertig, M., Buergi, T.: The challenges of managing cross-cultural virtual project teams.
Team Perform. Manag. 12(1/2), 23–30 (2006)
23. Mehtab, K., Rehman, A., Ishfaq, S., Jamil, R.: Virtual leadership: a review paper. Mediterr.
J. Soc. Sci. 8, 183–193 (2018)
24. Morgan, L., Paucar-Caceres, A., Wright, G.: Leading effective global virtual teams: the
consequences of methods of communication. Syst. Pract. Action Res. 27, 607–624 (2014)
25. Drouin, N., Bourgault, M., Gervais, C.: Effects of organizational support on components of
virtual project teams. Int. J. Manag. Proj. Bus. 3(4), 625–641 (2010)
26. https://www.statista.com/statistics/1005511/gcc-population-density/. Accessed 12 Nov 2019
27. Kelly, N., Edkins, A.J., Smyth, H., Konstantinou, E.: Reinventing the role of the project
manager in mobilising knowledge in construction. Int. J. Manag. Proj. Bus. 6(4), 654–673
(2013)
28. Muganda, N., Pillay, K.: Forms of power, politics and leadership in asynchronous virtual
project environment an exploratory analysis in South Africa. Int. J. Manag. Proj. Bus. 6(3),
457–484 (2013)
29. Paton, S., Hodgson, D., Cicmil, S.: Who am I and what am I doing here? becoming and
being a project manager. J. Manag. Dev. 29(2), 157–166 (2010). https://doi.org/10.1108/
02621711011019297
30. Samra, R.A., Shaalan, K.: The relationship between knowledge sharing climate and conflict
resolution styles. In: International Conference on Knowledge Management (2015)
31. de Jong, R., Schalk, R., Curşeu, P.L.: Virtual communicating, conflicts and performance in
teams. Team Perform. Manag. 14(7/8), 364–380 (2008)
32. Potter, R.E., Balthazard, P.A., Cooke, R.A.: Virtual team interaction: assessment,
consequences, and management. Team Perform. Manag. Int. J. 6(7/8), 131–137 (2000)
33. Ellis, R.C., Wood, G.D., Thorpe, T.: Technology-based learning and the project manager.
Eng. Constr. Archit. Manag. 11(5), 358–365 (2004)
34. Sackmann, S.A., Friesl, M.: Exploring cultural impacts on knowledge sharing behavior in
project teams–results from a simulation study. J. Knowl. Manag. 11(6), 142–156 (2007)
35. Morris, S.: Virtual team working: making it happen. Ind. Commer. Train. 40(3), 129–133
(2008)
36. Thinkwithgoogle Homepage. https://www.thinkwithgoogle.com/intl/en-gb/marketing-
resources/content-marketing/virtual-teams-drivers-license. Accessed 6 Feb 2019
37. https://www.statista.com/statistics/957368/gcc-urban-population/. Accessed 12 Nov 2019
38. Beranek, P.M., Martz, B.: Making virtual teams more effective: improving relational links.
Team Perform. Manag. Int. J. 11(5/6), 200–213 (2005)
39. Kuruppuarachchi, P.: Managing virtual project teams: how to maximize performance.
Handb. Bus. Strategy 7, 71–78 (2006). https://doi.org/10.1108/10775730610618648
40. Schenkel, M., Garrison, G.: Exploring the roles of social capital and team-efficacy in virtual
entrepreneurial team performance. Manag. Res. News 32, 525–538 (2009). https://doi.org/
10.1108/01409170910962966
41. Jarle Gressgård, L.: Virtual team collaboration and innovation in organizations. Team
Perform. Manag. 17, 102–112 (2011)
42. Tan, C.K., Ramayah, T., Teoh, A.P., Cheah, J.H.: Factors influencing virtual team
performance in Malaysia. Kybernetes (2019). ISSN: 0368-492X
43. Srivastava, M., Rogers, H., Lettice, F.: Team performance: past, current, and future trends
(2013). ISSN: 1352-7592
Virtual Construction: Interactive Tools
for Collaboration in Virtual Reality

Juha Ojala, Jukka Selin, Timo Partala(&), and Markku Rossi

South-Eastern Finland University of Applied Sciences,


Patteristonkatu 3, FI-50100 Mikkeli, Finland
{juha.ojala,jukka.selin,timo.partala,
markku.rossi}@xamk.fi

Abstract. Virtual technologies and game engines provide new possibilities for
collaborative virtual design within digital building models. The current paper
describes an approach, in which computer-aided design (CAD) models of
buildings are transferred into a game engine based environment, where they can
be reviewed and further designed collaboratively. Following a user-centered
design (UCD) process based on interviews and iterative interactions with
designers and architects, the prototype of Virtual Construction─a game engine
based platform for collaborative virtual design meetings─was designed and
implemented using Unreal Engine 4. The interactive tools developed can be
used both in full immersive virtual reality and using traditional devices (e.g.
laptop or desktop computers). Based on identified user needs, interaction
techniques were implemented for moving, rotating, and aligning objects, adding
and resizing shapes and objects, as well as moving and measuring distances in
the three-dimensional (3D) building model. In addition, the communication
techniques implemented based on user needs included synchronous features
such as voice communication, text chat, pointing, and drawing, and asyn-
chronous features such as leaving messages and feedback augmented with
screenshots to exact virtual locations. Other implemented scenarios included
different lighting scenarios, an evacuation scenario and crowdsourced voting
between different designs.

Keywords: Virtual Construction · Collaborative design · User needs · Game engine

1 Introduction

Historically, building designs were visualized by producing two-dimensional architectural
and technical drawings. They were followed by CAD tools, which enabled the
creation of interactive 3D visualizations of buildings. The currently prevailing trend,
Building Information Modelling (BIM), extends 3D building models by adding
dimensions such as time (schedules), quantities, costs, and sustainability and risk
analyses. BIM allows following the principles of Virtual Design and Construction
(VDC), which among other things emphasizes the importance of visualization and
collaboration during the building design phase [1, 2].

According to existing research, presenting building 3D models visually is a crucial
part of planning, construction and maintenance phases in terms of collaboration and
understanding [3]. Especially in complex or large-scale models, immersion is one of the
key factors for being able to intuitively perceive all aspects of the scene. Detailed
walk-in virtual models can quite accurately simulate the user experience of real,
completed buildings after they have been taken into use [4]. While the visualization
features in CAD and BIM software have improved, they still have some drawbacks,
when it comes to collaborative visual building design.
For example, Kosmadoudi et al. [5] reviewed existing literature on design in CAD
environments and presented examples of limitations associated with using CAD soft-
ware including limited efficiency, limited creativity, potential lack of motivation, and
limited possibilities for interaction. They suggested that these issues may be mitigated
by using game mechanics to provide more engaging and intuitive environments, which
explored 3D architectural and engineering design tools from a usability point of view
and concluded that CAD systems have become overly complex since they are com-
posed of several hundred menu items, which causes too much cognitive load on the
users. The best practices they suggested for 3D design environments included maxi-
mization of workspace, graphical richness, direct manipulation, familiarity, and mini-
malistic design.
Game engines seem to offer solutions for minimizing the above-mentioned limi-
tations and promoting best practices in building design. For example, games have been
shown to generate cognitive engagement in engineering design due to their inherent
interactivity, and they may even inculcate confidence [5]. Game engines allow for the
development of real-time walk-through applications, using which different users can
freely inspect different versions of building designs. This can add to the realism and
immersion. Buildings can also be displayed under different lighting conditions, which
is important, as there is evidence that buildings can be difficult to even recognize under
different lighting conditions and methods of presentation [7, 8]. Game engines also
have built-in multiplayer features with avatars and collision detection, which can be
utilized in developing real-time collaborative systems for building design with com-
munication and object manipulation features.
Game engines also offer advanced possibilities for creating virtual reality
(VR) applications. Despite their potential in diverse tasks, VR systems have been used
in construction mostly to explore finished designs rather than to create new ideas [9].
However, Moloney and Amor [10] suggested that game engine-based collaborative
virtual environments are suitable for supporting the early stages of design, where teams
can collaborate and evaluate iterations at a relatively low level of detail. They
emphasized the possibility of both synchronous and asynchronous communication, support
for participatory and iterative design, and intuitive design decision-making as the main
motivators for using a collaborative virtual environment for building design. Recently,
Lin et al. [11] studied VR-based design in practice and
demonstrated that a BIM/VR based solution could increase communication efficiency,
facilitate visual interactions, and ease decision-making in hospital design. Systems for
collaborative multi-user VR such as Glue [12] and Fake [13] have entered the market, but
the related user needs in the context of construction remain largely unexplored in the
literature.
The current paper contributes by presenting a user needs based approach for
designing tools for interaction and virtual design and review meetings for the con-
struction industry. A process for utilizing a game engine (Unreal Engine) in design and
review of buildings is suggested. Based on the approach and a user-centered design
process, a prototype platform and a set of tools for collaboration in building design
were developed based on interviews and regular discussions with architects and
designers.

2 An Approach for Bringing Building Information Models to Game Engines

We have developed a process and methods for bringing building information models to
game engines in order to be able to view and manipulate them in virtual reality. In our
process, there are two ways to bring the building information model with its metadata
(e.g. manufacturer, material, and price for each component) into a game engine. If the
Unreal Engine is used, the Datasmith add-on for Unreal Engine can be used to import a
building information model or its parts. Datasmith is a collection of tools and plugins,
which directly supports bringing content from more than 20 modeling formats into
Unreal Engine 4. For example, if a model or a part of it has been created with
Autodesk’s Revit software, it can be imported into the Unreal Engine with metadata
directly in Revit’s native format. Figure 1 below illustrates the proposed process for
bringing building models into game engines.

Fig. 1. A process for bringing building information models into game engines.

For a universal solution working with all the game engines and CAD software, the
Industry Foundation Classes (IFC) format is the only possible option. The IFC format
is the standard data storage and data transfer format for BIM. The IFC format is not
based on meshes (polygon structures), and game engines do not support it as such for
performance reasons. Thus, IFC models must first be converted into a mesh format
supported by game engines. The IFC format also supports the inclusion of metadata,
while pure mesh-based models do not contain metadata. In the process we have
developed, the IFC sub-models of the building data model are converted into the mesh-
based Wavefront OBJ format with the IfcOpenShell open source software. In addition,
the IFC models are translated into Extended Markup Language (XML) using the same
software. Then the metadata they contain can be read directly into the game engine
with an XML parser along with the data model. In the game engine, the models and
associated metadata are linked to each other using common identifier fields (e.g. id or
name). We have developed a computer program to support this process and convert
data models into a format that is supported by game engines.
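As a rough illustration of this conversion step, the sketch below uses the IfcOpenShell
toolkit (the IfcConvert command-line tool for the geometry and the ifcopenshell Python
package for the metadata). The file names are placeholders, and the metadata is written to
JSON here purely for brevity; the process described above exports XML instead.

import json
import subprocess

import ifcopenshell

# Assumptions: the IfcConvert command-line tool is on PATH and the
# ifcopenshell Python package is installed; file names are illustrative.
IFC_PATH = "building_model.ifc"

# 1) Geometry: convert the IFC sub-model into the mesh-based Wavefront OBJ
#    format that game engines can import (IfcConvert picks the output format
#    from the file extension).
subprocess.run(["IfcConvert", IFC_PATH, "building_model.obj"], check=True)

# 2) Metadata: read the same IFC file and collect per-element attributes,
#    keyed by GlobalId so the game engine can later link each mesh to its
#    metadata through a common identifier field.
model = ifcopenshell.open(IFC_PATH)
metadata = {}
for element in model.by_type("IfcProduct"):
    metadata[element.GlobalId] = {
        "ifc_type": element.is_a(),
        "name": element.Name,
    }

# 3) Persist the metadata so it can be parsed inside the engine.
with open("building_metadata.json", "w") as handle:
    json.dump(metadata, handle, indent=2)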

3 Platform and Tools for Collaborative Design Meetings in VR

3.1 User-Centered Design


To support the development of the current platform, called Virtual Construction, a user
centered design process was followed. First, the context of use was specified by
identifying the most important stakeholder groups, including architects, different
groups of designers, as well as the clients and the end users of the building. Next, user
needs were investigated. In total, 10 employees from four different companies forming
an alliance in the construction and architecture industry participated in the iterative
process focusing on user needs. The participants included both senior and junior
personnel, as well as manager level personnel and hands-on designers.

Table 1. Summary of identified user needs for a virtual building design platform

General:
- Need for an integrated communication and visualization solution
- Need for real-time information, plans always updated and shared

Visualization:
- Immersion and walk-in (to examine the model like a real building)
- Easy navigation and wayfinding within the entire 3D model
- Sketching by drawing and adding simple objects
- Changing textures and colors "on the fly"
- Seeing behind structures (e.g. locations of pipes)
- Decoration by easily adding and moving stock objects
- Simulation of lighting options and different times of day and year

Communication:
- Synchronous design meetings and asynchronous communication
- Communicate in the model using voice and chat ("like in Skype")
- Awareness of others and their locations in the model
- Pointing and highlighting parts of the 3D model
- Leaving feedback messages to 3D objects

Access control and crowdsourcing:
- Default user group of all project members and private groups
- Involving end users of the building, "design your own work room"
- Crowdsourcing the design of public buildings
- Access control with high information security
- Easy versioning and version control

Special needs:
- Supporting design for accessibility
- Scenarios and simulations for building evacuation

First, a design meeting focusing on the requirements of each company for col-
laborative virtual technology was arranged. Next, four in-depth user interviews were
conducted. The interviewed persons were: a senior architect, a junior architect, a senior
building designer, and a designer of industrial structures. The semi-structured inter-
views concentrated on understanding their work processes, communication during the
processes, current use of tools and technology, and ideas for improved or new tools.
The identified requirements and ideas were listed and categorized. The most central
need categories and needs are summarized in Table 1 above. Interactive tools were
designed based on the requirements. The designs were refined iteratively in the context
of monthly meetings and special events (e.g. live demonstrations) over the course of
more than one year.

3.2 Technology
The frontend application of the Virtual Construction platform was implemented using
Unreal Engine 4 due to its superior visualization capabilities and more straightforward
compatibility with CAD software. The server side backend was programmed using
Node.js. HTTP calls and web sockets are used for communication between the frontend
application and the backend. The system uses Vivox voice services for voice chat. The
system has been tested with both HTC Vive Pro and Oculus Rift virtual reality
headsets.
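To make the client/server synchronization concrete, the following hypothetical sketch
illustrates how state updates could be relayed between connected clients over web sockets.
The actual backend is written in Node.js; Python and its third-party websockets package are
used here only to illustrate the idea, and the message format is invented for the example.

import asyncio
import json

import websockets  # third-party package; the real backend uses Node.js

CONNECTED = set()

async def handler(websocket):
    # Recent versions of the websockets package call the handler with just the
    # connection object. Register the client, then relay every update it sends
    # (e.g. "object X moved") to every other connected client.
    CONNECTED.add(websocket)
    try:
        async for raw in websocket:
            update = json.loads(raw)  # e.g. {"object": "chair_01", "pos": [1.0, 0.0, 2.5]}
            for other in CONNECTED - {websocket}:
                await other.send(json.dumps(update))
    finally:
        CONNECTED.discard(websocket)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())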

3.3 Interactive Tools


In virtual reality, the Virtual Construction application is optimized for the HTC Vive
headset and its two motion-tracked handheld controllers (HTC Vive Wands). The
viewpoint is changed by head movements, and basic pointing and selecting is carried
out with the dominant hand controller using raycasting. The user points the ray from a
virtual raygun towards the object to be selected and presses the primary button (HTC
Vive Wand trigger) with the index finger. If an action with the selected tool can be
performed on the object, its outlines are shown when pointed at, allowing the user to
rapidly browse objects. Dragging objects is possible by pointing and selecting the
object, holding the trigger down, moving the controller and releasing the trigger. In the
default mode, virtual hands for moving objects are displayed.
The different tools implemented can be accessed from a radial menu activated by
pressing the menu button of the controller. The menus and dialogs are operated using two
hands so that the 2D menu or dialog can be moved with the non-dominant hand, and
pointing and selecting items is carried out with raycasting as in interaction with the 3D
objects. The default method for moving both shorter and longer distances in virtual
reality is to teleport to visible locations. This is achieved by pressing the controller
touchpad button, moving the controller so that the ray ends at a desired position on
ground or floor, and releasing the touchpad button. Thus, the technique used for
moving is a type of specified coordinate movement [14].
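As an engine-agnostic illustration of the teleport technique, the sketch below shows the
underlying geometry of finding where the controller ray meets the floor. A flat floor at
height zero is assumed for simplicity; in the actual application the game engine's own
raycasting is used, and the names here are ours.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Vec3:
    x: float
    y: float
    z: float

def teleport_target(origin: Vec3, direction: Vec3) -> Optional[Vec3]:
    """Return the point where the controller ray hits the floor plane (y == 0),
    or None if the ray points parallel to or away from the floor."""
    if direction.y >= 0.0:
        return None
    t = -origin.y / direction.y  # ray parameter at which the height reaches zero
    return Vec3(origin.x + t * direction.x, 0.0, origin.z + t * direction.z)

# Example: controller held 1.2 m above the floor, pointing forward and downward.
print(teleport_target(Vec3(0.0, 1.2, 0.0), Vec3(0.0, -0.5, 1.0)))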
Interaction and Communication Techniques. The different interaction and com-
munication techniques designed and implemented for VR are displayed in Figs. 2 and
3 below. Fairly detailed descriptions are given for each technique to give the readers
the possibility to adopt the techniques in their VR developments.
Floor Plan and Guidance. Using the map tool, a 2D architectural floor plan of the
premises appears on a sign on the user’s non-dominant hand (Fig. 4, left). The
architectural plan can help in understanding the overall design of the building. It also
acts as a map, in which the user’s current location is indicated, and the user can point
and select a destination on the plan with the raygun and the trigger. The user gets
guidance to the selected location. The user can also see the locations of other users in
the same virtual model. The guidance is drawn as arrows on the floor on walkable
routes (Fig. 4, right). If the user points the map with the raygun and selects a location
with the touchpad button, he/she is directly teleported to that location.
Crowdsourcing and Voting. Opinions and votes can be gathered from the end users
of the building or the general public in the case of public buildings. After completing a
registration, end users can inspect the model freely, leave location-specific feedback
(see Fig. 3), and participate in voting. Voting buttons can be added to rooms (Fig. 5).
After selecting the voting button, the user can toggle between predefined alternatives
for, for example, furniture, layouts, materials and colors, or lighting by selecting the
number of the option in the voting dialog, and give her/his vote after inspection.
Scenarios for Visualization. The developed platform allows the selection of one of
the (currently eight) available visualization scenarios to display 3D models of buildings
under different lighting conditions from the menu. An example of visualizing a
building under normal lighting conditions and in an evacuation mode is presented in
Fig. 6.
Adding, resizing, and erasing objects and shapes. The item catalog tool displays a menu
of available 3D design objects. When pointed
and selected, the added objects appear in front
of the user and the move tool becomes active.
Basic shapes (e.g. cubes, spheres, cylinders)
can be added similarly using the add shape tool
and their dimensions and colors can be
changed in a dialog. Objects can be erased
using the erase tool simply by pointing and
selecting the object.

Manipulating objects. Using the raycasting based move tool, objects can be grabbed and
moved horizontally or lifted vertically by
moving the controller. In the depth dimension,
they can be moved by pressing up and down on
the main controller touchpad when grabbed.
Light objects can also be moved and lifted in
the default mode (virtual hands) by dragging
them with one of the controllers. Heavy objects
can be dragged along the floor in this mode by
grabbing them with the controller and using the
four touchpad buttons to walk in four dimen-
sions with the object.

Snapping, aligning, and rotating. When moving objects, they can be snapped to and
aligned orthogonally with any surface of the
building (e.g. floor, wall, ceiling) by moving
them close to the surface using the respective
snap and align tools. While being moved,
objects can also be rotated along the floor by
pressing left or right from the controller
touchpad or rotated on all three axes in a
specific dialog using sliders.

Measuring distances. Distances can be measured between two points by pointing and select-
ing starting and ending points on surface. Multi-
ple distances between points can be measured in
a row similarly using the measure path tool.
Using the freehand measurement tool, the user
points a starting point with the tip of the ray gun
and start dragging, while the system shows the
distance in meters. Using the measure normal
tool, the user points and selects a single point on
a surface to measure against surface normal (e.g.
wall to wall or floor to ceiling).

Hiding and unhiding objects. A common need is to see through surfaces or objects
blocking a view. Using the hide tool, surfaces
and objects can be hidden by pointing and
selecting them with the primary controller. All
the hidden objects can be unhidden at once by
selecting the unhide option from the menu.

Fig. 2. Descriptions and illustrations of the developed interaction techniques.


Avatars, chat, and voice. Users are represented by simple avatars, whose appearances can
be tailored for different users or user groups.
Designers are used to using chat and voice
communications in design meetings and ex-
pressed a need for these features for synchro-
nous communication within the 3D model.
These features can be activated and deactivated
from the tools menu, and the messages are
visible/audible to users within the entire 3D
model.

Pointing. A typical need was to be able to point at specific locations in the 3D model
while communicating with others. Using the
pointer tool, a spherical marker is displayed at
the end of a ray to grab other users’ attention.
To support awareness of other users’ activities,
raycasting is also always visible to other users
when the other tools are used.

Drawing on a surface or to air. Drawing can be used as a tool in synchronous or asynchro-
nous communication. Using the draw to sur-
face tool the user points the ray towards a
surface, moves the controller, and the system
draws, when the primary button is pressed. The
draw to air tool is used similarly, but instead of
a surface, the drawing appears to the location
of the tip of the virtual ray gun.

Leaving messages and feedback. Asynchronous communication between users emerged as
a user need. The user selects the relevant loca-
tion from a surface by pointing and selecting
and gives structured or free form feedback
using a 2D dialog. A marker (e.g. a 3D excla-
mation mark) appears next to the surface to
mark the message location. Messages can be
read by pointing the marker when using the
show object info tool.

Taking screenshots. Using this tool, a preview of the screenshot contents appears above the
ray gun and the user can refine the view by
moving the dominant hand controller and press
the trigger to take a screenshot. The screenshot
tool can be launched from the message dialog
and screenshots can be attached to messages
and feedback. There is also a similar
standalone screenshot tool for taking screen-
shots rapidly and saving them to disk.

Fig. 3. Descriptions and illustrations of the developed communication techniques.


Fig. 4. The floor plan (left) and route instructions to a destination (right)

Fig. 5. Buttons can be added to rooms for toggling between alternatives and voting.

Fig. 6. A building interior in normal lighting and in an evacuation scenario

4 Conclusion

The introduction of virtual design meetings and asynchronous communication within
3D models has potential for significantly improving visual design and communication
during construction projects, resulting in more desirable buildings. This article
described our approach for bringing building information models to game engines and
utilizing the advanced features of the Unreal 4 game engine to develop a platform with
tools for collaboration and visualization of buildings.
The Virtual Construction platform was developed based on the principles of user-
centered design, which ensured that the platform was developed based on real needs of
companies operating in the construction industry. It was noticed early that different
stakeholders in the construction industry have different user needs for a virtual col-
laboration platform. The user needs were found out by involving 10 senior and junior
members of staff from the participating companies in an iterative development process
with monthly group discussions and four in-depth interviews.
The identified user needs for a virtual platform for construction can be roughly
categorized to needs for advanced immersive visualizations, needs for advanced virtual
tools for both synchronous and asynchronous communication, needs for access control
and crowdsourcing, and special needs including designing for accessibility and evac-
uation, which are obligatory design issues in construction projects. The participants
also expressed the general need for an integrated solution for communication, visual-
ization and modification of plans collaboratively in real time.
The current solution already implements many features based on the most impor-
tant needs expressed by the participants of the current study. The tools for virtual
reality were designed from scratch based on the user needs and capabilities made
possible by Unreal Engine 4, and they are reported in this article in detail. Naturally,
existing ideas have also been utilized. For example, drawing and annotations have been
previously found to enhance collaboration in design review meetings [15]. We suggest
that the current approach and tools developed based on matching user needs to features
of a game engine offer a noteworthy alternative to existing systems and developments.
The logical next step is to test the developed tools in a study involving real
construction projects. In the future, the current platform can be extended into a more
holistic system for managing digital twins of buildings. In addition to the design phase,
game engine based visual solutions bring potential benefits in the construction and
maintenance phases. For example, the current platform could also support augmented
reality based viewing of building 3D models in the construction phase and visualizing
measurement data from completed buildings in the maintenance phase.

Acknowledgments. The authors would like to thank everybody who participated in the current
study. This research was funded by Business Finland from the European Regional Development
Fund (project A73293) and by the participating companies.

References
1. Kunz, J., Fischer, M.: Virtual design and construction: themes, case studies and
implementation suggestions. Working Paper #097, Center for Integrated Facility Engineer-
ing (CIFE), Stanford University (2012)
2. Waly, A.F., Thabet, W.Y.: A virtual construction environment for preconstruction planning.
Autom. Constr. 12(2), 139–154 (2003)
3. Hilfert, T., König, M.: Low-cost virtual reality environment for engineering and
construction. Vis. Eng. 4(2), 1–18 (2016)
4. Kuliga, S.F., Thrash, T., Dalton, R.C., Hölscher, C.: Virtual reality as an empirical research
tool—exploring user experience in a real building and a corresponding virtual model.
Comput. Environ. Urban Syst. 54, 363–375 (2015)
5. Kosmadoudi, Z., Lim, T., Ritchie, J., Louchart, S., Liu, Y., Sung, R.: Engineering design
using game-enhanced CAD: the potential to augment the user experience with game
elements. Comput. Aided Des. 45(3), 777–795 (2013)
6. Lee, G., Eastman, C.M., Taunk, T., Ho, C.H.: Usability principles and best practices for the
user interface design of complex 3D architectural design and engineering tools. Int. J. Hum
Comput Stud. 68(1–2), 90–104 (2010)
7. Partala, T., Nurminen, A., Vainio, T., Laaksonen, J., Laine, M., Väänänen, J.: Salience of
visual cues in 3D city maps. In: Proceedings of the 24th BCS Interaction Specialist Group
Conference, pp. 428–432 (2010)
8. Partala, T., Salminen, M.: User experience of photorealistic urban pedestrian navigation. In:
Proceedings of the International Working Conference on Advanced Visual Interfaces, AVI
2012, pp. 204–207 (2012)
9. de Klerk, R., Duarte, A.M., Medeiros, D.P., Duarte, J.P., Jorge, J., Lopes, D.S.: Usability
studies on building early stage architectural models in virtual reality. Autom. Constr. 103,
104–116 (2019)
10. Moloney, J., Amor, R.: StringCVE: Advances in a game engine-based collaborative virtual
environment for architectural design. In: Proceedings of CONVR 2003 Conference on
Construction Applications of Virtual Reality, pp. 156–168 (2003)
11. Lin, Y.C., Chen, Y.P., Yien, H.W., Huang, C.Y., Su, Y.C.: Integrated BIM, game engine
and VR technologies for healthcare design: a case study in cancer hospital. Adv. Eng.
Inform. 36, 130–145 (2018)
12. Glue – Universal Collaboration Platform. https://glue.work/. Accessed 30 Aug 2019
13. Fake multi-user VR/AR – Next level meetings and collaboration. http://www.fake.fi/
multiuser. Accessed 30 Aug 2019
14. Mackinlay, J.D., Card, S.K., Robertson, G.G.: Rapid controlled movement through a virtual
3D workspace. Comput. Graph. 24(4), 171–176 (1990)
15. Bassanino, M., Fernando, T., Wu, K.C.: Can virtual workspaces enhance team communi-
cation and collaboration in design review meetings? Archit. Eng. Des. Manag. 10(3–4), 200–
217 (2014)
Implementing Material Changes in Augmented
Environments

Adam Pike and Sudhanshu Kumar Semwal(&)

University of Colorado, Colorado Springs, USA


{apike,ssemwal}@uccs.edu

Abstract. Augmented and virtual reality technologies are becoming mature,
and headset prices are continuing to come down. Having the functionality to
improve people's lives and make them more productive could help create a
demand. This paper looks at one such practical application that could help
designers, architects, and home buyers to better visualize a home setting. This
application’s focus is how to change the material of an object from one
appearance to another from within the application.

Keywords: Augmented reality · Virtual reality · Augmented environment

1 Introduction

Today, Computer Aided Design (CAD) applications are a standard in the design and
construction industries. With new technology innovations comes improvements to how
CAD is used and implemented. AR (Augmented Reality) and VR (virtual reality) are
two examples of how designers are taking the next steps in the design process that
create immersive experiences. In this paper, an application will be discussed that uses
augmented reality to allow the user to change the material that make up an object. The
application is used on a mobile device (smartphone or tablet). The application creates
an augmented reality house and then can choose to change the color or material of the
house.

2 Previous Work

The primary usage of this application is intended for those in the fields of architecture
and design, including building design, landscape architecture, and environmental planning.
As a subarea of building design, remodeling is also a key industry that can benefit from
this application of augmented reality. For the growing field of mixed reality applications,
there is a subtle question that helps draw the line for the promise of an application: how
real does it have to be? Our answer is that it depends on the type of application.
Applications in the health sciences or those that work with patients need to be highly
realistic and multi-sensory [1]; examples include remote surgery and post-traumatic stress
therapy. On the other hand, fine detailing may not be as necessary for landscape design.

The level of realism will greatly depend on the level of immersion that the user expects
and on the interactivity that can be provided. Finally, the level of realism can only be
expected to improve for all applications as it becomes easier to design and create 3D models.
VR/AR is being used to enhance collaboration among design teams [2]. Virtual
environments, for example, have been used to bring remotely located members into a common
space to co-design [3]. This helps the designer-customer relationship as well: in the
immersive environments of mixed reality applications, designers can bring the design to
the client and work with the client's imagination.
Another advantage of mixed-reality applications is the testing of spatial concepts. The
controlled environments of VR enable the testing of hypothetical designs and practices [4].
These safe spaces, which can be easily manipulated, allow designers to test and survey
different design ideas and gather user input before a physical structure is built. Spatial
concepting can help create real-world perspectives that are not possible on computer screens
or with physical models. For the application developed here, material changes could be
applied to interior modeling as well, including floors, walls, countertops, and even
furniture. To a further extent, the application could be used in fashion, letting a user
try on different clothes or shoes in different colors. This application helps take the
guesswork and uncertainty out of the picture.
What makes this application useful is the level of realism achievable with 3D modeling.
With the modeling capabilities of programs such as 3DS Max and Unity, the model home can
be given great detail and realism, as shown in Fig. 1.

Fig. 1. 3D model showing realism.


3 Changing Materials in Augmented Environment

There are two ways to change a material in augmented reality. The first is to identify
the appropriate polygons that make up the object and apply a new material to them. The
second is to recreate the same object with the selected material. This paper focuses on
the latter and discusses possible ways to accomplish the former. Augmented objects are
created using various objects and game components that are grouped together to form the
needed object. The difficulty in changing the materials of an object lies in identifying
the specific polygon or group of polygons that need to be changed. This becomes a challenge,
at least in our opinion, because augmented objects are stored in the application as a
prefabricated object (prefab), which generally does not get ungrouped or changed. If the
objects are grouped, then materials can be changed as shown in Fig. 2.

Fig. 2. 3D model homes with different materials

Changing the material in this application was accomplished by creating several identical
houses that had different exterior materials. Three models were created: a blue house with
an asphalt roof, a white house with burgundy trim and an asphalt roof, and a plaster house
with a clay tiled roof (Fig. 2).
After the user has placed the home in a desired real-world space, they can use the
buttons on the screen of their smartphone or tablet to switch between the different color
options for the house. The user can also decide to delete the house and start over. Behind
the scenes, the house object is destroyed and a new house with the chosen material is
placed at exactly the same position and rotation as the original. This method works fine
if there are only a few options and elements, but what if we wanted 30 different color
choices? This approach would then need 30 different prefabricated models to be swapped out,
which could clearly become memory heavy. To avoid this, the original house model has to be
adjusted. First, the model of the house needs to be broken down into parts grouped by what
would be changed together, e.g. all the exterior walls or all the trim pieces. The model
house is then no longer a single prefab but several prefabs. When the house is placed in
the scene, each of its components is sent separately. Although they are separate objects,
they will still be set in the right place based on the transform coordinates from when they were
converted to prefabs. Next, a script is added that changes the objects based on the
selected color. This can be done using an array of materials and linking each selection
button to the array index for that material. In this way, one model is created that can
offer many different material options.
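As an engine-agnostic illustration of this material-array pattern, the sketch below mirrors
the data flow in plain Python. In the actual application this logic would live in a Unity
script attached to the house components; the names, tags, and material lists here are
hypothetical.

# One list of material names per changeable category (the "array of materials").
MATERIALS = {
    "ExtWalls": ["blue_paint", "white_paint", "plaster"],
    "Roof": ["asphalt_shingle", "asphalt_shingle", "clay_tile"],
}

# The house is no longer a single prefab: each part carries a category tag.
house_parts = [
    {"name": "wall_front", "tag": "ExtWalls", "material": "blue_paint"},
    {"name": "wall_back", "tag": "ExtWalls", "material": "blue_paint"},
    {"name": "roof_main", "tag": "Roof", "material": "asphalt_shingle"},
]

def on_color_button(option_index: int) -> None:
    """Called by a button press; option_index is the position in each material
    array (what the On Click () event would pass in the real application)."""
    for part in house_parts:
        options = MATERIALS.get(part["tag"])
        if options is not None:
            # Swap only the material reference; the geometry is left untouched.
            part["material"] = options[option_index]

# Example: switch the whole house to the third option (plaster walls, clay roof).
on_color_button(2)
print(house_parts)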

4 State of the Art and Implementation Considerations

Some of the recent success of augmented reality can be linked to Google™ and Apple™ and
their augmented reality platforms, ARCore™ and ARKit™, respectively. These platforms have
made augmented systems easily available on mobile devices. Google's ARCore has its origins
in the now-terminated Project Tango platform, started as a division of Google in 2014. That
platform used computer vision to enable mobile devices to detect their position relative to
the world around the user, enabling various forms of 3D mapping and environmental
recognition. The software worked by integrating motion tracking, area learning, and depth
perception. Like Project Tango, ARCore uses three technologies to implement its augmented
reality functions: motion tracking, environmental recognition, and light estimation. ARCore
provides Software Development Kits (SDKs) for development platforms such as Android, Unity,
and Unreal, which allowed us to create this AR application.

4.1 Environment Modeling in AR Environments and Implementation


One of the biggest drawbacks of AR is occlusion. Occlusion in AR means hiding virtual
objects behind real-world objects, and AR is currently lacking in occlusion. So, for
example, we could have a roof occluding the inside of a house (Fig. 3).

Fig. 3. Non-occluded image [5].

There are ways to counter this. These solutions include sensing the 3D structure of the
real world, creating a digital 3D model of the real-world structure, and rendering that
model as a transparent mask that hides virtual objects, or integrating the model into the
3D representation of the real world. Sensing the environment can be done using a light
sensor, a time-of-flight sensor, or stereo cameras. Light sensors project an IR light
pattern onto a 3D surface and use the distortion to reconstruct the surface contours.
Time-of-flight sensors also use IR light, but instead reflect it off objects in the field
of view and use the delay of the reflected light to calculate depth. Stereo cameras simulate
human binocular vision by measuring the displacement between pixels captured by two cameras.
Unfortunately, each of these sensors has its limitations. Some of us feel that devices are
not yet able to understand the environment well enough or quickly enough to make this work
for real-time augmented reality. The main issues stem from the poor range of the sensors,
low resolution, and slow mesh reconstruction of the 3D scene. Some early attempts have shown
that a convolutional neural network architecture can create feature maps representing
components of a simplified histogram of oriented depth [6].
For our application, a skybox was created around the model that would simulate a
blue sky outside the windows. This was made optional so that the user could place the
home in the actual location where the home would be and get the real view of what
they would see. To overcome the movement problem, a feature is needed that allows
the user to move around in the environment without the need for physical movement.
A set of buttons programmed to move the environment around the user can solve this
issue: each button fires event triggers on button down and button up, which call methods
in a script, as sketched below.
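A minimal sketch of such a movement script is given below, assuming UI buttons whose EventTrigger PointerDown and PointerUp events call the two public methods; the class, method and field names are illustrative rather than taken from the project.

using UnityEngine;

public class MoveEnvironment : MonoBehaviour
{
    public Transform environmentRoot;  // parent transform of all placed house objects
    public float speed = 0.5f;         // movement speed in metres per second
    private Vector3 direction = Vector3.zero;

    // Wired to the forward button's PointerDown event trigger.
    public void OnForwardDown() { direction = Camera.main.transform.forward; }

    // Wired to the button's PointerUp event trigger.
    public void OnButtonUp() { direction = Vector3.zero; }

    void Update()
    {
        // Translating the environment opposite to the walk direction makes it
        // appear that the user is moving through the model without walking.
        environmentRoot.Translate(-direction * speed * Time.deltaTime, Space.World);
    }
}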
To implement this project, first the models that the application would project as
augmented images needed to be created. These 3D models were created in 3DS Max.
Next, the application that runs the augmented reality environment was created using
Unity, a game engine. As part of the application, ARCore (as mentioned earlier) was
used, which allows for plane detection and object dropping. ARCore is an SDK from
Google that provides the tools to allow users to create their own augmented reality
applications using mobile smart devices. Finally, scripts needed to be written to
accomplish the stated features.
ARCore™ helps make augmented applications possible by allowing Unity™ to use
certain elements that detect and track planes within the camera's field of view. By
adding the ARCore™ library to a Unity script, it can be used to estimate where planes
are, exposing them as trackedPlanes and grouping them into a List. A separate script and
prefab provided with ARCore can then be used to display these planes as a grid of
triangles so that the user can see which areas the program has identified as planes. If a
plane is detected, the user can touch a point on that plane to use as a tracking point for
where an object can be placed. This touched point, or hit, creates an anchor at which an
object is then instantiated, as in the sketch below.
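The sketch below illustrates this flow. It assumes the (now archived) GoogleARCore SDK for Unity that was current at the time of this project; type names such as DetectedPlane varied between SDK releases, and the class name here is ours.

using System.Collections.Generic;
using GoogleARCore;
using UnityEngine;

public class PlacementController : MonoBehaviour
{
    public GameObject housePrefab;  // prefab (or prefab group) to place
    private readonly List<DetectedPlane> newPlanes = new List<DetectedPlane>();

    void Update()
    {
        // Collect planes detected since the last frame (the tracked-planes list).
        Session.GetTrackables<DetectedPlane>(newPlanes, TrackableQueryFilter.New);

        if (Input.touchCount < 1) return;
        Touch touch = Input.GetTouch(0);
        if (touch.phase != TouchPhase.Began) return;

        // Raycast the touch against detected planes; the hit becomes the placement point.
        TrackableHit hit;
        if (Frame.Raycast(touch.position.x, touch.position.y,
                          TrackableHitFlags.PlaneWithinPolygon, out hit))
        {
            GameObject house = Instantiate(housePrefab, hit.Pose.position, hit.Pose.rotation);
            // Parenting to an anchor keeps the object fixed to the plane as tracking improves.
            Anchor anchor = hit.Trackable.CreateAnchor(hit.Pose);
            house.transform.parent = anchor.transform;
        }
    }
}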
For future work, it will be necessary to have a mix of what is occluded and what is
not. This will have to be done computationally. If using AR movement on a headset,
the user would need to use their hands to manipulate the scene while the real-world
environment of desks and tables stays hidden. To do this, a new approach would be
needed that uses hand pose recognition in combination with model-based methods for
estimating occlusion [7]. By using a sensing device like the Leap Motion, one can track
the hand's estimated position and calculate the mask of the occluded portion. In our
case, materials can be changed in the AR environment by using buttons displayed on
top of the window shown on the mobile phone or tablet (Fig. 4). Each material button
calls a method, using an On Click () event, that inspects the created object and finds the
appropriate tags. In the model, certain elements were tagged as "ExtWalls", "Floors",
etc. The method finds each of these elements and changes its material to the newly
selected one. Exterior elements are changed in groups, as in the code snippet shown in
the figures (Figs. 4, 5, 6 and 7) and sketched below.
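A hedged sketch of that group change follows; the tags "ExtWalls" and "Floors" come from the model described above, while the class name, method names and the exact grouping are illustrative.

using UnityEngine;

public class ExteriorMaterialChanger : MonoBehaviour
{
    public Material[] exteriorMaterials;  // material options assigned in the Inspector

    // Wired to a material button's On Click () event with that material's index.
    public void ChangeExteriorWalls(int index) { ApplyToTag("ExtWalls", exteriorMaterials[index]); }
    public void ChangeFloors(int index)        { ApplyToTag("Floors", exteriorMaterials[index]); }

    private static void ApplyToTag(string tag, Material mat)
    {
        // Every object sharing the tag is re-skinned together, so a whole
        // group of elements changes with a single button press.
        foreach (GameObject part in GameObject.FindGameObjectsWithTag(tag))
        {
            part.GetComponent<Renderer>().material = mat;
        }
    }
}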

4.2 Scaling
This initial test of the applicability of the material changer was a success, and the next
steps are ready to be taken. Next, the method for changing materials needs to be
updated so that only one model is used and a larger number of material options are
available (Fig. 7).

Fig. 4. Using Instantiation, Skybox, and interaction to move in augmented reality environment
[10]

Fig. 5. Material changes using button Interaction [10]



Fig. 6. Interaction and On click button calls method to provide the interaction [10].

Fig. 7. Scaling and effects of interaction and navigation [10].

Fig. 8. Extending walls and its effects [10].

5 Conclusions and Further Research

With the promising results of this project, further development of the application could
strengthen the role of AR in design and architecture [8, 9]. These future works include
both improvements to existing features and additional features that would add more
benefits. Some ideas would not fit within this application but could merit applications
of their own. Currently, the application works if there is only one floor; split floors or
multiple stories are not navigated in our implementation at this time, although they are
certainly feasible since a person could walk up the stairs in the 3D world. To facilitate
this, functionality to change floors could be added with a button in our implementation,
switching the scene displayed from one model to another. To push the material changer
idea a bit further, a user could be given the ability to change between different home
options, like an extended deck or a 3-car garage. Instead of just changing the simple
look of an object, the user would be able to change the entire layout of the model. A
step beyond that, an object manipulation feature could be added that allows the user to
change individual components, like a single wall or window, without breaking the
model up. In this way, the user could play around with the idea of expanding their
house or moving walls around.
The latter idea could be accomplished by linking objects' data points. For example, if a
user wanted to move the blue wall out several feet, it would then leave a gap in the
adjacent red wall, as seen in Fig. 8. To account for this, a script in our application
would need to take the vertex points of the red wall connected to the blue wall and
move those points the same distance, and in the same direction, as the blue wall moved;
a sketch of this idea follows below. An algorithm following this premise could redesign
a model in-app while maintaining the model's completeness. Another helpful feature
would be the adding and removing of objects, e.g. furniture; several companies, such as
IKEA™, already offer similar features.
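As a speculative sketch of this future-work idea (not implemented in the project), the adjacent wall's mesh vertices that coincide with the moved wall could be dragged by the same offset; every name and the tolerance value below are illustrative assumptions.

using UnityEngine;

public class LinkedWall : MonoBehaviour
{
    public MeshFilter adjacentWall;       // e.g. the red wall that must stay attached
    public float weldTolerance = 0.001f;  // how close two vertices must be to count as shared

    // Call with the blue wall's MeshFilter before the offset is applied to the blue wall,
    // so shared vertices can still be matched by their current world positions.
    public void DragSharedVertices(MeshFilter movedWall, Vector3 offset)
    {
        Vector3[] movedVerts = movedWall.mesh.vertices;
        Mesh mesh = adjacentWall.mesh;
        Vector3[] verts = mesh.vertices;

        for (int i = 0; i < verts.Length; i++)
        {
            Vector3 world = adjacentWall.transform.TransformPoint(verts[i]);
            foreach (Vector3 v in movedVerts)
            {
                if (Vector3.Distance(world, movedWall.transform.TransformPoint(v)) < weldTolerance)
                {
                    // A shared vertex: move it the same distance and direction as the
                    // blue wall, closing the gap illustrated in Fig. 8.
                    verts[i] = adjacentWall.transform.InverseTransformPoint(world + offset);
                    break;
                }
            }
        }

        mesh.vertices = verts;
        mesh.RecalculateBounds();
    }
}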
Finally, an issue when touring a potential home is that you only see it at a given time of
day and on a particular day of the year. Narrowing the time of viewing does not allow
the buyer (or designer) to know how the home will look at other times of the day. For
example, a kitchen that gets a lot of light during the morning may not get any light
during the rest of the day, or a driveway may stay icy during the winter because it does
not get enough sun. Linking a home's prospective geographic location to a database
that tracks sun paths could allow a simulation of the home's natural lighting. With this
feature, the user could be better informed about all the conditions that may affect the
home. This project looked at the creation of an augmented reality application that
displays an interactive 3D model home. The results show working models that were
easy to tour through and manipulate. Although the realism of the augmented image was
passable, there is still great room for improvement and refinement. Based on the
promising results found in this application, we believe there is room for growth for
augmented and virtual reality in design and architecture, as well as for consumer usage
in finding, touring, and remodeling homes. This application is merely scratching the
surface of what is possible with the new AR and VR technology available. As consumer
adoption and understanding of mixed reality increase and hardware capabilities
advance, so too will the potential for this application.

Acknowledgments. This paper is based on an independent study which the first author
undertook during the Spring 2019 with the second author. The independent study paper was
entitled: Material Changes in Augmented Environments, Adam Pike, pp. 1–5 (Spring 2019).

References
1. Portman, M.E., Natapov, A., Fisher-Gewirtzman, D.: To go where no man has gone before:
virtual reality in architecture, landscape architecture and environmental planning. Comput.
Environ. Urban Syst. 54, 376–384 (2015)
2. Wang, X.: Mutually augmented virtual environments for architectural design and
collaboration. In: Dong, A., Moere, A.V., Gero, J.S. (eds.) Computer-Aided Architectural
Design Futures (CAAD Futures) 2007. Springer, Dordrecht (2007)
3. Gu, N., Kim, M.J., Maher, M.L.: Technological advancements in synchronous collaboration:
the effects of 3D virtual worlds and tangible user interfaces on architectural design. Autom.
Constr. 20, 270–278 (2011)
4. Parush, A., Berman, D.: Navigation and orientation in 3D user interfaces: the impact of
navigation aids and landmarks. Int. J. Hum. Comput. Stud. 61, 375–395 (2004)
5. Image taken from website: https://hackernoon.com/why-is-occlusion-in-augmented-reality-
so-hard-7bc8041607f9
6. Höft, N., Schulz, H., Behnke, S.: Fast semantic segmentation of RGB-D scenes with GPU-
accelerated deep neural networks. In: Proceedings of 37th German Conference on Artificial
Intelligence (KI) (2014)
7. Feng, Q., Shum, H.P.H., Morishima, S.: Occlusion for 3D object manipulation with hands in
augmented reality. In: MIRU 2018: Proceedings of the 2018 Meeting on Image Recognition
and Understanding, Sapporo, Japan, August 2018
8. Image taken from website: https://www.mecanoo.nl/Media/Model-Workshop-2017-11-22
9. Image taken from website: https://video.architecturaldigest.com/watch/this-hologram-table-
could-revolutionize-architecture-2017-11-22
10. Pike, A.: Architectural Modelling in Augmented Reality. MS Project Advisor: Dr. SK
Semwal, Department of Computer Science, pp. 1–31, Summer 2019
Using Activity Theory and Task Structure
Charts to Model Patient-Introduced Online
Health Information into the Family
Physician/Patient Examination Process

Beth Ellington

Appalachian State University, Boone, NC 28608, USA


ellingtonve@appstate.edu

Abstract. This research study was undertaken to gain a richer understanding of
the use of patient-introduced online health information during the physician/
patient examination and communication process. Utilizing qualitative data
obtained from ten family physician interviews and workflow modeling of the data
using activity diagrams and task structure charts, this study uncovered the fre-
quency of patient-introduced online health information, physician suggested
online resources, use of email for physician/patient communication, use of elec-
tronic medical records, along with tasks involved and methods used by the
physicians to work the online health information into the physician/patient
examination and knowledge transfer process.

Keywords: Patient-introduced online health information · Physician productivity · Work flow modeling · Engeström's Activity Theory

1 Research Objectives
1.1 Introduction
Online health information may be found from a variety of sources including govern-
ment, educational institution, medical, non-profit and commercial web sites. Research
has shown that with the increased access to health information provided via the
Internet, not only patients, but their healthcare providers, their caregivers and even
healthy people are increasingly seeking health information online [18]. Health infor-
mation seeking is not a new patient practice but one that has been enhanced through the
24/7 convenient availability of online health information. The ability to access this
information from home, work or school and with mobile devices, such as smartphones
and tablets, has magnified the access limitations of traditional sources of health
information by providing the technology for collaboration between patients, doctors
and caregivers via websites and email, and enabled patients to self-educate and form
online support communities [20].
One factor that may be contributing to the rise in patients’ searching for online
health information is the shortage of primary care providers, including family physi-
cians and pediatricians, existing today in the United States [19]. Primary care provider
shortages limit the time available for physician/patient examinations, knowledge
transfer and communication about diseases or conditions. Knowledge transfer is needed
to empower patients to make better decisions about their healthcare. Since a physician's
time is a finite resource, after completing the necessary patient care activities, the
scarcity of this commodity can restrict the physician’s ability to thoroughly discuss
online health information brought to the examination by the patient, thus inhibiting
knowledge transfer [11].
This rise in patients seeking online health information may also be attributed to the
increased integration of online information into our everyday lives coupled with the
patient’s thirst for knowledge about a condition or treatment options affecting their
family [2]. Whatever the reasons, it appears patients are turning to the Internet to find
the answers to their health questions and rapidly displacing traditional health infor-
mation sources by introducing online health information into the physician/patient
examination process [4].
Finding a way to utilize this online health information, to increase the effectiveness,
and thus the quality, of the knowledge transfer process between patients and physi-
cians, could be a major step in optimizing examination and consultation productivity
[4]. Studies have shown that increasing productivity increases the capacity to see more
patients and collect additional revenue. This strategy for increasing patient volume in
the “fixed cost” world of primary care practices can prove to be a highly profitable
strategy for physicians to increase their revenues [25].
Although online health information, in theory, was created to empower the patient
in their healthcare decision making, can it not also be utilized to empower physicians
and enhance their knowledge transfer processes? So how does patient-introduced
online health information fit into the physician/patient examination activity? Does
patient-introduced online health information efficiently adapt to the physician’s
workflow? Or does it impede the knowledge transfer process due to varying levels of
the reliability, trustworthiness and quality of the information patients are discovering
online? Is this information wasting valuable physicians’ time and potentially under-
mining the patients’ medical treatment by causing the physicians to explain how it
relates or does not relate to their disease or condition? Where is the niche for patient-
introduced online health information in the medical treatment process? How can these
questions best be answered? The answers may best be found by asking the physicians
how it fits into their clinical workflow.
Workflow, in the workplace, is generally defined as the process by which tasks are
done, by whom, in what order and how quickly. Results of studies analyzing medical
practice workflow suggest that the physician’s time is best spent performing tasks that
only the physician can do, such as treating patients. Other tasks are best delegated to
staff in order to maximize medical practice workflow [1]. The physician/patient
examination is one of those tasks that only the physician should do and during the
examination is where physician/patient knowledge transfer is most likely to occur. The
physician/patient knowledge transfer process supports the communication of vital
information about the patient’s health concern from both the physician to the patient
and the patient to the physician to improve patient health literacy.
Other techniques utilized to evaluate work processes are socio-technical methods,
activity diagrams and task structure charts. Socio-technical methods study people,
technology and organizations from a single theoretical framework using ethnographic
and participatory action methods [27]. Activity diagrams model the three-way inter-
action between those who perform an activity, the person place or thing to which the
activity is directed and their community of co-workers. Activity diagrams provide a
descriptional framework for analysis of work activity and the evaluation of technolo-
gies used in work settings [14]. Task structure charts are used to analyze work by
breaking down tasks needed to accomplish an activity into subtasks and sub-subtasks.
This method is used to visualize steps in a work process and the order in which they are
performed; once visualized, the process is then analyzed for process improvement
opportunities [24].
When using these methods to study clinical workflow and work processes, theo-
retically, the proper identification and correction of workflow and process issues found,
should improve the physician’s efficiency and effectiveness by improving patient flow
and minimizing physician downtime [9]. Activity diagrams are often used [15] to study
work redesign in the healthcare field. Activity diagrams were used in this study in
conjunction with task structure charts because they provided a theoretical framework to
better analyze the physician examining patient activity.
The concepts of Activity Theory utilized in Engeström’s model (see Fig. 1) are
subject, object, tools, community, rules and division of labor. Subject is defined as an
agent or group who acts. Object is defined as a person, place or thing to which the
activity is focused or directed. Tools may include tangible tools (e.g., hammer) or
intangible tools (e.g., knowledge) supporting the activity process. Community is
defined as other agents that support the activity. Rules regulate the activity within the
community. Division of Labor is defined as relationships and interactions within the
community that affect the completion of the activity [14].

Fig. 1. Engeström’s Activity Theory Model

The activity diagram illustrates a three-way interaction between subjects, objects
and community with each of these interactions mediated by tools, rules and division of
labor. These interactions support the basic “learn by doing” premise of Activity Theory
[14]. By modeling the physician examining patient activity through defining the sub-
ject, object, community, tools, rules, and division of labor of the activity, the contra-
dictions in the process should emerge [15]. Activity modeling, therefore, provides a
better understanding of the best fit for online health information introduction by the
patient during the physician/patient examination and knowledge transfer process.
This study was conducted to define the patient-introduced online health information
niche in the physician examining patient activity by analyzing physician interview
transcripts, activity diagrams and task structure charts to answer the following research
questions:
1. How does the introduction of patient-introduced online health information into the
family physician/patient examination process impact clinical workflow?
2. What are the potential barriers, challenges or improvements to physician/patient
examination and communication effectiveness created by patient-introduced online
health information introduction?
3. What process improvements or best practices may be developed to better manage
patient-introduced health information that could enhance the productivity of the
physician examining patient activity?

2 Methods

2.1 Interviews
This research employed the interview method because it allowed data collection for the
study with an individual activity as the unit of analysis. Interviews can be used for
descriptive, explanatory and exploratory purposes [17]. Interviews provide the
researcher with a mechanism for collecting subjective, objective, qualitative and
quantitative data with the advantages of greater flexibility in sampling and fewer
misunderstood questions. Interviews are also more effective for gathering data by tape
recording responses and analyzing those transcribed responses for complicated issues
such as physician/patient communication and clinical workflow processes [5].
The interview design utilized both closed-ended and open-ended questions (see
Appendix A). The interview questions were designed to specifically answer questions
to complete the elements of a preliminary activity diagram (see Fig. 1). Quantitative
data was gathered via questions 4, 5, 21 and 22, which collected data through closed-
ended questions, (e.g., How many days a week do you schedule patient appointments?
or How many patients do you see each week?). Qualitative data was gathered via
questions 1–3, 6–20, 23 and 24, which collected data through a combination of closed-
ended and open-ended questions, (e.g., List the steps you follow when interacting with
a patient from the time you enter the examination room until you exit the room or What
does the phrase “patient health literacy” mean to you?)

2.2 Data Analysis


Evaluation of each interview included quantitative and qualitative analysis methods
and modeling of the individual activity of physician examining patient for each
physician. The analysis included breaking down each physician’s examination steps
into tasks and subtasks using Hierarchical Task Analysis task structure charts to better
understand the interaction of the physician with the patient during the examination.
Activity diagrams were also created for each physician examining patient activity in
conjunction with task structure charts. Modeling of work activities provided both a big
picture of the community activity systems and individual activity systems of each
physician while examining patients. Figure 9 was created to visualize and summarize
all physician examining patient activity components into one activity diagram.
Data, obtained from the interview transcripts, were used to calculate each physi-
cian’s patients/hour productivity value. Productivity values were derived by dividing
number of patients seen per week by number of days worked per week divided by
8 hours per day. Figures were created to analyze the physicians’ recommended web sites
(see Fig. 5), definition of patient health literacy (see Fig. 7), comparison of productivity
and medical practice type (see Fig. 10), and comparison of productivity and division of
labor (see Fig. 11). Additional figures were created from information obtained from
analysis of the interview transcripts. Those figures analyze the frequency of patient-
introduced online health information (see Fig. 3), physician recommended health
information formats (see Fig. 4), use of email for physician/patient communication (see
Fig. 6), and the object of the physician examining patient activity (see Fig. 8).

3 Results

3.1 Characteristics of Family Physician Participants


Of the ten North Carolina family physicians interviewed, six were Caucasian males,
three were Caucasian females and one was an African-American female. Their ages
ranged from 34 to 60 years of age with a median age of 48.5 years. The physicians had
been practicing medicine from five to 34 years. All had graduated from medical school
in the United States with four from the University of North Carolina at Chapel Hill, and
one each from Duke University, East Carolina University, Temple University,
University of Cincinnati, University of Colorado and University of Maryland.
Four of the physicians were in family practice clinics owned by medical centers,
one was in a community clinic owned by a medical center, four were in private practice
(two in solo private practice and two in group private practice), and one was in a
privately owned urgent care center (see Fig. 2).

Fig. 2. Physician medical practice type (medical center owned clinic: 5; solo private practice: 2; group private practice: 2; urgent care center: 1).

Four of the physicians described their practice of medicine in traditional terms such
as private practice or urgent care center. Others defined their practice by describing
those they worked with, specific populations served or providing detailed information
about their practice.

3.2 Physician/Patient Communication and Online Health Information


The introduction of online health information by the patient during the examination
activity was experienced by six physicians within the last seven days, one physician
within the last 14 days and three physicians within the last 30 days. Figure 3 below
demonstrates the frequency of patient-introduced online health information.

Fig. 3. Frequency of patient-introduced online health information (within the last 7 days: 6 physicians; last 14 days: 1; last 30 days: 3).



Some physicians felt that the introduction of online health information by the
patient was disruptive and time-consuming. Other physicians just seemed to accept
patient-introduced health information as part of their normal routine and fit it into their
examination workflow.
Evaluation of the task structure charts, using Hierarchical Task Analysis tech-
niques, demonstrated that all physicians reported that they performed the same set of
six subtasks during the patient examination. The subtasks performed by all physicians
were (1) Enters examination room, (2) Communicates with patient, (3) References
medical record, (4) Examines patient, (5) Communicates with patient and (6) Leaves
examination room. The introduction of online health information by the patient
occurred during one of the “Communicates with patient” subtasks, either at the
beginning or the end of the examination depending upon when it was introduced by the
patient.
The physicians who were utilizing electronic medical records systems that included
patient educational material often printed out health information for their patients.
Other physicians referred their patients to web sites or provided brochures with links to
web sites. One physician recommended web sites when they needed to counteract bad
information their patients had found online. Figure 4 below demonstrates the various
types of health information provided to patients by the physicians.

Fig. 4. Physician recommended health information (frequency by format: web sites 7, printouts 5, brochures 1).

Some physicians suggested their patients visit specific web sites for additional
health information. Some physicians had developed criteria used to select web sites
they described as known, trusted, accurate, reliable and high quality, or whose URL they
already knew. One physician recommended that their patients not go to what he described
as "junk" sites. The physicians recommended thirteen specific web sites to their patients.
Figure 5 shows the actual web sites recommended to patients by the physicians and
indicates which of those sites contain advertisements or are commercial sites.

Fig. 5. Physician recommended web sites (number of physicians recommending each; commercial sites are flagged in the original chart): American Academy of Family Physicians 2, American Cancer Society 1, American Diabetes Association 1, American Heart Association 1, eMedicine Health 1, Family Doctor 4, Healthfinder 1, Mayo Clinic 1, MedicineNet 1, MedlinePlus 1, Medscape 1, UpToDate for Patients 1, WebMD 3.

The physicians expressed both legitimate concerns and practical reasons for not
communicating with patients via email. Physicians were concerned about HIPAA
violations when communicating via unencrypted servers, and personal privacy issues.
The physician working in the urgent care center stated that she had no relationship with
her patients, and therefore, had no reason to communicate via email with them.
Two physicians encouraged their patients to communicate via their patient portals.
One physician expressed a desire to communicate with his patients via email but was
unable to do so due to lack of office technology. One physician communicated with her
patients via email but stated it was very time-consuming because they sent them at all
hours of the day and she spent hours answering patient emails.
The reason categories for not communicating via email with patients included: lack
of office technology, lack of encryption resulting in potential HIPAA violations, lack of
time, lack of patient relationship, personal preference, personal privacy and preference
for using their patient portal (see Fig. 6).

Fig. 6. Reasons for patient email communication avoidance (frequency): lack of office technology 2, lack of encryption/potential HIPAA violation 3, lack of time 2, lack of patient relationship 1, personal preference 1, personal privacy 2, prefers patient portal 2.

3.3 Physicians’ Definitions of Patient Health Literacy Comparisons


Health Literacy is defined by the Centers for Disease Control and Prevention (CDC), as
the capacity to obtain, process, and understand basic health information and services to
make appropriate health decisions [10]. The chart below compares the components of
the physicians’ definitions of patient health literacy to the components of the CDC’s
definition of health literacy. As shown in Fig. 7, two components of the CDC’s defi-
nition do not appear in the physicians’ definitions. Those were the patient’s capacity to
obtain basic health information and the patient’s capacity to obtain basic health services.

Fig. 7. Physician components of CDC's health literacy definition (number of physician responses): obtain basic health information 0, process basic health information 9, understand basic health information 10, obtain basic health services 0, process basic health services 8, understand basic health services 10.



3.4 Components of Physician Examining Patient Activity


Activity Theory was used as the framework for modeling the physician examining
patient activity because activity diagrams illustrate the organizational needs of the
physician’s medical practice, define the cultural, cognitive and social aspects of online
health information in the process and provide a picture of potential problems and
bottlenecks created by the patient’s introduction of online health information into the
physician’s examination workflow.
The components of an activity diagram include the subject, object, tools, com-
munity, division of labor, rules and outcome. The subject of each activity in this study
was the family physician and was defined as the person who acts [14]. The object of the
examination activity is defined as a person, place or thing to which the activity is
focused or directed [14]. The physician defined object categories were patient/patient’s
health, diagnosis of the problem and listening to the patient (see Fig. 8).

Fig. 8. Object of patient examination process (categories: patient/patient's health, diagnosis of problem, listening to patient; one category was named by six physicians and the other two by two physicians each).

Tools are used to mediate an activity and include both tangible and intangible tools
which support the activity of examining the patient [14]. Tools utilized by the physi-
cians during the examination process included the physician’s medical knowledge, the
patient’s knowledge, patient-introduced online health information, physician suggested
health information, standard medical instruments, electronic medical records, non-
electronic medical records, vital statistics, lab and/or diagnostic test results, prescription
and/or medication information, computers and smartphone. Patient-introduced online
health information is an intangible tool used by the physicians during the examination
process.
Nine of the physicians had implemented electronic medical records systems in their
practice and one physician was using a smartphone to access prescription information.
Computer technology was used by nine of the physicians in the examination room.
Examples of using a smartphone and electronic medical records in the examination
room are in the physicians’ responses below.

Most physicians with access to electronic medical records systems in their practices
perceived them simply as another “tool” in the physician’s toolbox. Electronic medical
records systems were used for patient file storage and as communication tools within
practices and between practices. However integration with other systems was men-
tioned frequently as a barrier to full implementation of electronic medical records
systems and its use as a communication tool for patients and other physicians.
Community is defined as the office staff that supports the activity of examining the
patient [14]. Community in the physicians’ practices included the following titles:
Administrative Staff, Business Office Staff, Certified Nurse’s Aides, Front Desk
Supervisor, Front Office Personnel, Practice Managers, Clinical Care Coordinators,
Instructional Assistants, Laboratory Technicians, Licensed Practical Nurses, Medical
Assistants, Medical Residents, Nurse Practitioners, Nursing Supervisor, Office Man-
agers, Phlebotomists, Physician Assistants, Radiology Technicians, Receptionists,
Referral Clerks, Registered Nurses and Schedulers. The size of the physician’s com-
munity ranged from three to thirty employees. Size of community was generally
dependent upon practice type, for example, the physicians who worked in major
medical center clinics had larger communities compared to the private practices and the
urgent care center.
Division of labor is defined as relationships and interactions within the community
that affect the completion of the activity of examining the patient [14]. Division of labor
was obtained by analyzing the physicians’ answers to the interview question: How do
these other employees support the activity of examining patients? to determine which
employees directly supported the physician examining patient activity. The results of
the analysis showed division of labor was comprised of Registered Nurses, Licensed
Practical Nurses, Nurse’s Aides and Medical Assistants directly supporting the activity
of examining patients.
Rules regulate the activity within the community [14]. Common rules and laws that
govern the practice of medicine in North Carolina, in addition to local, state and federal
laws, include the Chaperone Rule, Clinical Laboratory Improvement Amendments of
1988 (CLIA), Health Information Technology for Economic and Clinical Health Act,
Health Insurance Portability and Accountability Act of 1996 Privacy and Security
Rules (HIPAA), Medical Malpractice Liability, and the Patient Protection and
Affordable Care Act of 2010. Regulatory agencies that govern medical practices in
addition to the North Carolina Medical Board are the Centers for Disease Control and
Prevention (CDC), U. S. Department of Health & Human Services, Drug Enforcement
Administration (DEA), Occupational Safety & Health Administration (OSHA) and
Recovery Audit Contractors (RAC).
There were no practice specific policies or procedures that the family physicians
were required to follow during the examination activity mentioned in the interview
responses. Many of the physicians expressed a sense of autonomy with fewer regu-
lations and guidelines to follow in the organization of their work in both medical center
owned clinics and private practices. However all physicians seemed to be keenly aware
of the risk management implications for not following rules, laws and regulations to
minimize liability, litigation and malpractice claims for their clinical practices such as
the Chaperone Rule and HIPAA. Family Physicians Two and Nine also mentioned
other regulatory agencies such as OSHA, DEA and CLIA or RAC auditors and private
medical insurance company inspectors.
Outcome in an activity diagram is defined as the product of the examination activity
[14]. The physician defined outcomes were based upon the main result or outcome they
hoped to have achieved when they exited the patient examination room. The intended
outcomes the physicians hoped to have achieved when they left the examination room
are combined in the activity diagram in Fig. 9 below. Activity diagrams provided a
visual representation to better identify the online health information niche in the
physician examining patient activity and to better understand the interactions that
occurred during the activity. Figure 9 also summarizes all the subject, objects, tools,
community, division of labor and rules for the ten physician examining patient activities.

Fig. 9. Combined family physician examining patient activity diagram

3.5 Physician Productivity Comparisons


Physician productivity was calculated by dividing number of patients seen per week by
number of days worked per week divided by 8 hours per day. All of the data for
productivity calculations was self-reported by physicians and the eight hour workday
scale was assumed by the interviewer. The physicians saw patients an average of 3.65
days per week, on an eight hour workday scale, and examined an average of 76.8
patients per week or 21.04 patients per day. The chart below compares physician
productivity with physician practice type. The three physicians with the highest pro-
ductivity worked in a solo family practice, an urgent care center and a medical center
owned clinic respectively (see Fig. 10).
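For concreteness, the productivity ratio defined earlier (patients seen per week divided by days worked per week and by an eight hour workday) can be written out and evaluated with the averages reported above; this is only a restatement of the paper's own figures, not new data:

\[ \text{productivity} = \frac{\text{patients seen per week}}{\text{days worked per week} \times 8\,\text{h/day}}, \qquad \frac{76.8}{3.65 \times 8} \approx 2.63\ \text{patients per hour}. \]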

Fig. 10. Physician productivity (patients/hour) and medical practice type (individual values 4.17, 3.13, 3.00, 2.68, 2.66, 2.33, 2.25, 2.22, 2.08 and 1.50, coloured by practice type: urgent care center, private solo practice, private group practice, medical center owned clinic).

The physician with the highest productivity value worked directly in the exami-
nation room with a Nurse Aide only, the second highest worked directly with a Nurse
only, and the third highest worked directly with a Medical Assistant and a Nurse.
Figure 11 provides a comparison of physician productivity with division of labor.

Fig. 11. Physician productivity (patients/hour) and division of labor: LPN and Medical Assistant 1.50; Nurse Assistant 2.08; LPN and RN 2.22; LPN and CNA 2.25; Medical Assistant 2.33; LPN and CNA 2.66; Nurse and Medical Assistants 2.68; Nurse 3.00; Nurse and Medical Assistant 3.13; Nurse Aide 4.17.

These results indicate that neither type of practice nor the number of employees
directly supporting the physician examining patient activity influence higher physician
productivity values. Family Physician Nine, with the highest productivity value, works
in a solo private practice where patient appointments are scheduled three at one time,
and Nurse Aides assist in the examination room. This physician utilizes a tablet PC in
the examining room and inputs patient data into the electronic medical record during
the examination. He also uses voice recognition software for dictation. This physician
has also linked suggested web sites for his patients to his practice web site, and either
he or his Nurse Aides recommend those web sites to patients.
Family Physician One had the third highest productivity value and worked directly
with a Nurse in the examining room. His community contained not only Nurses, but
Medical Office Assistants, Instructing Assistants, Administrative Clerks and Medical
Residents. The facility where this physician worked may have directly influenced his
higher productivity value since according to its web site description, “Our center has 58
exam rooms, with areas for X-rays and minor procedures.” Fifty eight exam rooms
would accommodate a higher patient volume capacity by allowing the practice to
schedule 58 appointments at one time. Family Physician One was also the only
physician who mentioned using a smartphone in the examination room, in addition to
his use of electronic medical records and a laptop.
These results indicate that higher physician productivity values are influenced by
multiple factors. These factors include: efficient utilization of support staff, organization
of workflow, practice business model and use of technology such as electronic medical
records, computers and smartphones in the examination room.
Compared to the physicians with higher productivity values, the physicians with
lower productivity values had other factors adversely affecting their productivity val-
ues. Family Physician Six, with the lowest productivity value, works in a medical
center owned clinic where a Medical Assistant and a Licensed Practical Nurse assist in
the examination room. This physician utilizes electronic medical records and a com-
puter in the examination room but he is currently using two electronic medical records
systems that are not integrated. This physician also mentioned that he is a practicing
geriatrician, so his patients are most likely over 65 with chronic health conditions that
require more time to examine, diagnose and treat. In addition, he stated that he had only
been in that practice for about two years and was in the process of building the practice
resulting in patient output being below capacity. Therefore, his productivity may be
adversely affected by the use of two nonintegrated electronic medical records systems
while trying to build the practice to full capacity along with the extra time needed to
care for his elderly patients.
These results indicate that lower physician productivity values are influenced by
lack of office technology, inefficient use of electronic medical records systems, low
patient output due to practice not being at full capacity and characteristics of patient
populations served creating inefficiencies in workflow.

4 Discussion of Results

4.1 Analysis of Results and Discussion


This study was conducted to define the patient-introduced online health information
niche in the physician/patient examination activity and knowledge transfer process.
An analysis of the results data obtained from the physicians’ interview transcripts, task
structure charts and evaluation of the activity diagrams modeled from the interview
data, revealed that all physicians experienced patient introduction of online health
information during the previous thirty days. The findings indicate that patient-
introduced online health information has an established niche within the
physician/patient examination process (see Fig. 3).
There were no indications from the results of this study that these family physicians
were “unprepared” to deal with the health information introduction, nor that they
experienced anxiety when it was introduced by the patient as in previous studies [3].
This may be attributed to the family physician/patient relationship being a longitudinal
relationship which better enables management of the situation when the introduction of
online health information occurs. The exception was Family Physician Three who was
employed by an urgent care center and self-reported that she had no relationship with
her patients. However, many of the physicians perceived online health information as
problematic, generating patient misinformation and increasing a physician’s workload
as found in other studies (Kim and Kim [21]; Ahmad et al. [4]).
Nine of the ten physicians suggested health information to their patients by pro-
viding printouts and brochures, or recommending web sites (see Fig. 4). In addition
eight of the physicians had learned to use the Internet as their ally by recommending
online resources to their patients. This was a suggested strategy from previous studies
(Ball and Lillis [6]; Kim and Kim [21]). Of the thirteen physician suggested web sites,
four were commercial web sites that contained advertisements for various products
including pharmaceuticals and prescription drugs. This indicates that those physicians
do not consider bias or conflict of interest to be criteria for exclusion when suggesting
online health resources to patients or determining the “quality” of the resource as
suggested in earlier studies (Brann and Anderson [8]). Physicians may not be aware
that their patients consider bias, especially from pharmaceutical advertising (Fox and
Rainie [18]; Elkin [13]), when determining the reliability and trustworthiness of online
health information (see Fig. 5).
Since health literacy is an emerging clinical concept in health communication, a
question was included in the interview asking physicians to define the term “patient
health literacy.” The physicians were able to partially define it compared to the Centers
for Disease Control and Prevention’s (2011) [10] definition, indicating that they have
experienced some degree of patient health “illiteracy” when examining patients (see
Fig. 7). Most physicians defined it in terms of language, educational literacy or level of
understanding without taking into account the patient’s capacity to obtain health
information and health services. According to the National Action Plan to Improve
Literacy, quality of clinician–patient communication can affect patient health out-
comes, including how well patients follow instructions from clinicians but few health
care professionals receive formal training in communication, particularly in working
with people with limited literacy. Therefore, these results are not uncommon [28].
However, the analysis of the quality of the physician recommended web sites did find
six of the thirteen sites contained low health literacy, easy to read or resources in a
language other than English, so some of the physicians were actually providing these
types of resources to patients similar to the targeted health communication models
suggested in the Kreuter and McClure [22] study. Ironically only three physicians
expressed concern for patient health literacy among their patients when defining it, and
of those three, only one mentioned recommending Family Doctor, which contains
health information in Spanish, to their patients.
The use of electronic medical records, computers in the examination room and
direct input of data by the physician appears to enhance physician productivity;
however, technology was mostly used by the physicians for computer-supported work
methods rather than as a physician/patient communication tool. All of the physicians
maintained personal email accounts, but none chose to actively communicate with their
patients via professional email, instead expressing a legitimate concern about the level of
encryption in their email systems and those of their patients' email accounts (see
Fig. 6). Other reasons for not communicating with patients via email were time
required to read and answer email, and the desire to maintain personal boundaries by
not being accessible to patients 24/7. This indicates that email would not be a feature
that family physicians would utilize in an electronic medical records system but would
prefer a more secure form of professional communication with patients such as an
encrypted patient portal with HIPAA compliant levels of encryption.
Nine of the physician practices were utilizing electronic medical records systems.
However some had invested in earlier versions and were in the process of upgrading
their systems due to the compliance requirements of the Patient Protection and
Affordable Care Act and the Health Information Technology for Economic and Clinical
Health Act [7]. The physicians mentioned problems with integration of systems when
referring patients to specialists in other hospitals and having to “wait for the mail” but
those who worked in medical center owned clinics or large private practices valued the
ability to communicate patient information with other healthcare “co-workers”
electronically.
Evaluation of the task structure charts demonstrated the steps involved in the
physician examining patient activity. All physicians expressed a sense of autonomy in
their organization of work during the examination process in regard to their use of
tools, adherence of rules and division of labor. However, they independently self-
reported the same “set” of six subtasks performed during the physician examining
patient activity, with the only difference being the order of subtask completion. This
was true regardless of the physician’s age, gender, years practicing medicine, type of
practice or whether they utilized electronic medical records, computers in the exami-
nation room, traditional medical charts, low tech medical instruments or high tech
medical instruments during the examination. Of interest from the analysis of the data
and the evaluation of the diagrams was the recurring "sameness" of the physicians'
activities. According to the physicians’ data obtained from the North Carolina Medical
Board [23], all attended medical school in the United States with six of them attending
medical school in North Carolina. This indicates that this “sameness” may be attributed
to the physicians’ medical school physician/patient examination and/or relationship
training, which is one of the most commonly assessed qualities of students in medical
schools [12]. This “sameness” behavior may also be attributed to the absence of
observation in the data collection process creating an inability to detect possible
variations in actual behaviors.
Productivity is a measure of effective use of resources and is expressed as a ratio of
output to input. In this study the output was defined as number of patients examined
and the input was defined as number of physician labor hours spent examining patients.
In order to better evaluate productivity value comparisons you must also evaluate the
characteristics of the workplace that effect productivity [26]. In the case of the family
physician practices those factors that influence productivity values were efficient uti-
lization of support staff, organization of workflow, type of business model and use of
technology. The analysis of the work organization of the physicians with the three
highest productivity values demonstrated how those factors influenced their produc-
tivity values.
There was no recognizable difference in productivity between physicians in regard
to type of medical practice or years practicing medicine. However there emerged
recognizable characteristics of the physicians with higher productivity such as:
(1) highly organized office support staff; (2) utilization of electronic medical records;
(3) utilization of either a laptop or tablet computer in the examination room;
(4) physician’s direct input of data into the electronic medical record during the
examination; (5) physician suggested online health resources linked directly to practice
web site or specific sites suggested routinely for chronic disease management. These
characteristics that emerged were similar to findings from the Wensing et al. [29] study
suggesting methods to improve knowledge management and patient outcomes.
There were 17 intended outcomes the physicians hoped to have achieved when they
left the examination room, that were combined in Fig. 9 from the individual physician
examining patient activity diagrams. The intended outcomes were: (1) resolution of
patient’s concerns; (2) make the correct diagnosis; (3) develop a plan to cure the
problem; (4) patient satisfaction; (5) determine or address the patient’s specific prob-
lem; (6) patient’s understanding of their treatment plan; (7) addressed the patient’s
questions; (8) helped the patient; (9) patient is more confident; (10) patient is informed
about their condition; (11) patient’s understanding; (12) answers patient’s questions;
(13) improve the patient’s health; (14) modify the patient’s disease behavior; (15) al-
leviate the patient’s suffering; (16) diagnose the problem; and (17) set up treatment for
the patient. These responses indicate that the physicians are more interested in the
quality versus the quantity of their patients’ examination outcomes since these mea-
sures are more qualitative than quantitative or quantifiable.

5 Recommendations and Conclusions

5.1 Recommendations
In addition to defining the patient-introduced online health information niche, this
study was undertaken to answer the following questions:
1. How does the introduction of patient-introduced online health information into the
family physician/patient examination process impact clinical workflow?
2. What are the potential barriers, challenges or improvements to physician/patient
examination and communication effectiveness created by patient online health
information introduction?
3. What process improvements or best practices may be developed to better manage
patient-introduced online health information that could enhance the productivity of
the physician/patient examination process?
Even though this study was limited due to interviewing a convenience sample of
North Carolina family physicians, and the values used to calculate physician produc-
tivity were self-reported, the findings still indicate that physicians have developed
methods to integrate patient-introduced online health information into the
physician/patient knowledge transfer process and it is clearly a tool utilized during the
physician examining patient activity as evident by the activity diagrams and its
inclusion as a subtask in the task structure charts. It has mainly impacted clinical
workflow by adding another sub-subtask to the “Communicates with patient” subtask
in the physician examining patient activity by creating an “involuntary” tool for the
physician to use and address during the examination process.
The potential barriers and challenges to physician/patient examination effectiveness
created by the introduction of online health information are the subtraction of
physician/patient quality time discussing information about symptoms, medication side
effects and diseases unrelated to the patient’s health concern. The potential improve-
ments to physician/patient examination effectiveness created by patient online health
information introduction are the physicians’ development of new methods for dis-
tributing and suggesting health information to their patients. However, no systematic
guidelines, policies or procedures for recommending online health information to
patients emerged from the physicians’ responses but rather an ad hoc combination of
linking online resources to their practice web sites, verbally suggesting web sites,
distributing printouts and brochures, and referring patients to vendor recommended
licensed content linked to electronic medical records systems.
Process improvement areas or opportunities to develop best practices for managing
patient-introduced online health information that emerged from the study were the
division of labor between the physician and their staff supporting the activity of
examining patients. After modeling the physicians’ workflow with task structure charts
and activity diagrams the support staff’s main role emerged as to enhance practice
workflow efficiency while the physician’s main role that emerged was to improve the
overall effectiveness and quality of the physician/patient examination. This was indi-
cated by the qualitative nature of the objects and outcomes categories of the exami-
nation activity. The object categories of patient/patient’s health, diagnosis of the
problem and listening to the patient were more qualitative than quantitative objectives.
The outcome categories were patient understanding/knowledge, disease management/
treatment, patient well-being and correct diagnosis which are more qualitative than
quantitative measures. Of interest is that diagnosis was mentioned as both an object and
an outcome by the physicians, indicating that problem resolution may be the overall
goal of the patient examination for these physicians.
The physicians’ responses contained great detail explaining how support staff was
responsible for placing the patient in the examination room, checking their vital signs,
taking their initial history, checking for immunizations needed, prepping patients for
suturing/minor outpatient surgery and taking care of patient follow-up. All of these
tasks can potentially increase patient volume and examination efficiency. This support
enabled the physician to spend “quality” time with the patient thus increasing the
effectiveness of the examination, diagnosis and treatment.
One issue the physicians mentioned was the amount of time spent providing
information for audits to insurance companies and government entities such as
Medicare. Electronic medical records systems provided ease of storage and retrieval of
documentation needed for these types of audits for the practices that had implemented
electronic medical records. All physicians should utilize their electronic medical
records systems for this type of information storage and management to facilitate audits
by regulatory agencies and compliance with rules, laws and regulations.
Other interesting practice business model issues mentioned by the physicians were:
(1) the current trend of hospitals buying private practices, noted by Family Physician Nine; and (2) the institutional guidelines and goals being developed for monitoring physician response time for patients through the medical center admissions group, described by Family Physician Six. If the business model for private practices is indeed changing and
medical center owned clinics are monitoring response time for process improvement
then physicians should be utilizing their electronic medical records systems to document and justify to the hospitals their practice value and the efficiency goals they have attained.

5.2 Conclusions
This study indicates that physician workflow and process efficiency improvements may
be gained by moving the patient-introduced online health information niche currently
residing in the physician/patient examination activity, and the recommending of online
health information sites that coincides with it, to the office support staff activities. This
workflow model is similar to the model currently in the urgent care center described by
Family Physician Three.
This could be accomplished by simply tagging patients as online health information
seekers in their medical chart/electronic medical record, to remind the staff to discuss
this type of information during the staff/patient encounter, prior to physician/patient
examination. This method of tagging could also be utilized for patients with low health literacy, limited language comprehension or cultural issues, to alert the support staff to direct the patient to “easy to read” resources and resources in their native language. This
could be accomplished with minimal support staff training in the areas of health lit-
eracy, evaluation of online health information and electronic medical records systems.
Utilizing these changes in process may minimize online health information introduc-
tion’s effect on the examination and knowledge transfer effectiveness by ensuring that
only “quality” online health information is discussed during the physician/patient
examination.
This study indicates a need for the development of policies, procedures and best
practices for integrating health information into medical practice workflow to replace
the ad hoc methods currently being utilized. Developing these types of guidelines has
the potential to improve operational efficiencies for the medical practice. Improving
operational efficiencies could improve physician productivity and enhance quality of
patient care by optimizing time spent with the patient during the physician/patient
examination activity. Additional studies should also be conducted to determine best
practices for integrating online health information with electronic medical records and
utilizing e-Patient portals for secure, HIPAA compliant, physician/patient communi-
cation to enhance workflow efficiency optimization and knowledge transfer effective-
ness. This can best be accomplished through continuing education for both the
physicians and their support staff, work studies to evaluate redundancy in the physician
examining patient activity and moving the online health information tool from the
physician’s activity tool kit to the support staff’s activity tool kit.
In addition, this study may offer more opportunities for research in Information
Science, specifically work studies involving human/computer/information interaction.
With the implementation of electronic medical records systems occurring in physi-
cians’ clinical practices and their subsequent integration with pharmacy and hospital
systems, this type of research may offer insight into new design methods for integrating
online health information with patient electronic medical records systems to support
physician/patient examination process improvement. Future researchers interested in
online health information and physician/patient communication might examine the role
of new or next generation technologies and their impact on clinical workflow, process
improvement or early adopters in the medical field.
Patients are continuing to seek health information online [16] and family physicians
have managed to integrate its introduction into their work methods. However, this
study pinpointed three strategic areas where process changes may improve the physician/patient knowledge transfer process: (1) utilizing elec-
tronic medical records, computers in the examining room and input of data by
physicians during examination to improve physician productivity and reduce human
error; (2) delegating to support staff the discussion of online health information and the
linking of web sites to patient’s electronic medical records, e-Patient portals and
medical practice web sites to increase physician/patient “quality” time during the
examination activity; (3) developing best practices for annually evaluating resources
suggested by physicians or linked to physician’s medical practice web sites for any
changes in content, sponsorship of sites and ability to support patient health literacy.
By utilizing technology, delegating duties and developing best practices, physicians should gain more time for practicing medicine and spend less time choosing and critiquing health-related web sites, which could help achieve the physician’s examination goal of providing better patient health outcomes.

Appendix A: Physician Interview Questions

Thank you for participating in this interview today. This interview should take
approximately 30 min to complete. I would like to assure you that all of your responses
will remain confidential. You will be assigned a participant code that will be used to
maintain your anonymity. Your participant code for this study is FP###.

(Hand the participant the confidentiality agreement to read and sign with their
participant code already entered on the form.)
By signing the confidentiality agreement you have agreed that your responses may
be recorded on audiotape and you are guaranteed that no personally identifiable
information will be linked to your recorded responses. (Turn on tape recorder.)
Interview Questions
1. How would you describe your practice of medicine? Probe if necessary: for
example private practice, hospital, medical school
2. Do you communicate with your patients via email? Probe: Why or why not?
3. Does your practice use electronic medical records? Probe: Why or why not?
4. How many days a week do you schedule patient appointments?
5. How many patients do you see each week?
6. What does the phrase “patient health literacy” mean to you? Probe: Are you
concerned about your patients’ health literacy?
7. List the steps you follow when interacting with a patient from the time you enter
the examination room until you exit the room. Probe for tools: Do you use a
computer in the examining room? Do you use medical instruments such as blood
pressure monitor, stethoscope, ear scope, tongue depressor? Do you reference
their medical record, lab results, diagnostic tests results? Do you talk to the patient?
Do you talk to the patient’s family if they are present? Do you record information
into their medical record?
8. What is the main focus of your activity during the patient’s examination in the
steps you listed above? Probe if needed: the patient, the patient’s health, diagnosis
of the problem, other?
9. What is the main result or outcome you hope to have achieved when you exit the
patient examination room? Probe: If more than one is mentioned.
10. In the past 7 days have any of your patients brought health information they found
on the internet to their examination? (If no, then past 30 days? If no, then past 60
days? If no then omit questions 11 and 12.)
11. (If yes in #10 then ask), Was the health information your patient found on the
internet directly related to their disease or health condition?
12. Did you discuss the information with your patient?
13. (If yes in #12 then ask), Where did the discussion occur in the steps outlined in the
generic patient examination activity above?
14. How many employees other than physicians do you work with in your practice?
15. What are the titles of these employees? (Probe: i.e. nurses, nurse practitioners,
physician’s assistants, administrative assistants, medical technologists?)
16. How do these other employees support the activity of examining patients?
17. Does your practice have a policy to refer patients to internet health information? (If
no, omit questions 18 and 19.)
18. (If yes in #17 then ask), Who is designated to refer the patient to internet health
information in your practice? Probe: Where/How does this occur? During exam-
ination, after examination, follow-up visit, sent to patient later?
19. In what format do they give the suggested resources to the patient? (Probe: word
document?, brochure?, email attachment?, information prescription?)
20. Other than local, state, HIPAA and other federal laws what additional rules,
guidelines, policies or procedures are you expected to follow when examining
patients?
I would also like to ask you a few more questions to allow me to better understand
the characteristics of my interviewees.
21. In what year were you born?
22. In what year did you start practicing medicine?
(Questions for the interviewer to answer by observation if possible)
23. What is the interviewee’s gender?
24. What is interviewee’s race?
Thank you for agreeing to participate in this interview. (Ask if they would be
willing to complete a brief online survey in the future relating to internet health
information. If so, then ask them for their email address to send them the survey link
or give them a printed copy of the survey link on the signed confidentiality form.)

References
1. Aarts, J., van der Sijs, H.: CPOE, alerts and workflow: taking stock of ten years research at
Erasmus MC. Stud. Health Technol. Inform. 148, 165–169 (2009)
2. Ahern, D.: Challenges and opportunities of eHealth research. Am. J. Prev. Med. 32, 75–85
(2007)
3. Ahluwalia, S., Murry, E., Stevenson, F., Kerr, C., Burns, J.: ‘A heartbeat moment’:
qualitative study of GP views of patients bringing health information from the internet to a
consultation. Br. J. Gen. Pract. 60, 88–94 (2010)
4. Ahmad, F., Hudak, P., Bercovitz, K., Hollenberg, E., Levinson, W.: Are physicians ready for
patients with Internet-based health information? J. Med. Internet Res. 8, e22 (2006)
5. Babbie, E.: The Practice of Social Research. Thomson Wadsworth, Belmont (2007)
6. Ball, M., Lillis, J.: E-health: transforming the physician/patient relationship. Int. J. Med.
Inform. 61, 1–10 (2001)
7. Blumenthal, D.: Implementation of the federal health information technology initiative.
N. Engl. J. Med. 365, 2426–2431 (2011)
8. Brann, M., Anderson, J.: E-Medicine and health care consumers: recognizing current
problems and possible resolutions for a safer environment. Health Care Anal. 10, 403–415
(2002)
9. Brooks, L., Griffin, T.: Is it time for a new practice environment? An operational look at your
practice. J. Med. Pract. Manag. 25, 307–310 (2010)
10. Centers for Disease Control and Prevention: Health Literacy (2011). http://www.cdc.gov/
HealthLiteracy/. Accessed 9 Oct 2011
11. Dugdale, D., Epstein, R., Pantilat, S.: Time and the patient-physician relationship. J. Gen.
Intern. Med. 14, S34–S40 (1999)
12. Elcin, M., Odabasi, O., Gokler, B., Sayek, I., Akova, M., Kiper, N.: Developing and
evaluating professionalism. Med. Teach. 28, 36–39 (2006)
13. Elkin, N.: How America Searches: Health and Wellness. iCrossing, Inc., pp. 1–17 (2008)
14. Engeström, Y.: Learning by Expanding: An Activity-Theoretical Approach to Develop-
mental Research. Orienta-Konsultit, Helsinki (1987)
15. Engeström, Y.: Activity theory as a framework for analyzing and redesigning work.
Ergonomics 43(7), 960–974 (2000)
16. Eudy, K.: Google second only to doctors as source of health information. Capstrat and
Public Policy Polling (2010). http://www.capstrat.com/news/google-second-only-doctors-
source-health-information/. Accessed 31 May 2010
17. Fowler, F.J.: Survey Research Methods. Sage, Los Angeles (2009)
18. Fox, S., Rainie, L.: The online health care revolution: how the Web helps Americans take
better care of themselves. Pew Internet & American Life Project (2000). http://www.
pewinternet.org/pdfs/PIP_Health_Report.pdf. Accessed 30 Nov 2008
19. Freed, G., Stockman, J.: Oversimplifying primary care supply and shortages. J. Am. Med.
Assoc. 301, 1920–1922 (2009)
20. Keselman, A., Logan, R., Smith, C., Leroy, G., Zeng-Treitler, Q.: Developing informatics
tools and strategies for consumer-centered health communication. J. Am. Med. Inform.
Assoc. 15, 473–483 (2008)
21. Kim, K., Kim, S.: Physicians’ perception of the effects of Internet health information on the
doctor-patient relationship. Inform. Health Soc. Care 34, 136–148 (2009)
22. Kreuter, M., McClure, S.: The role of culture in health communication. Annu. Rev. Public
Health 25, 439–455 (2004)
23. NC Medical Board (North Carolina Medical Board) (2011). http://www.ncmedboard.org.
Accessed 19 Nov 2011
24. Preece, J., Rogers, Y., Sharp, H.: Interaction Design: Beyond Human-Computer Interaction.
Wiley, New York (2002)
25. Rauh, S., Wadsworth, E., Weeks, W., Weinstein, J.: The savings illusion – why clinical
quality improvement fails to deliver bottom-line results. N. Engl. J. Med. e48, 1–3 (2011)
26. Stevenson, W.: Operations Management. McGraw-Hill, Boston (2012)
27. Trist, E., Murray, H.: Historical overview. In: The Social Engagement of Social Science: A
Tavistock Anthology. University of Pennsylvania Press, Philadelphia, pp. 1–34 (1990)
28. U. S. Department of Health and Human Services: National Plan to Improve Health Literacy
(2010). http://www.health.gov/communication/HLActionPlan/pdf/Health_Literacy_Action_
Plan.pdf. Accessed 9 Oct 2011
29. Wensing, M., Wollershiem, H., Grol, R.: Organizational interventions to implement
improvements in patient care: a structured review of reviews. Implement. Sci. 1, 2 (2006)
Naming Anonymous Processes Using
an Optimal Number of Test-and-Set
Registers

Layla S. Aldawsari
1 Department of Computer Science and Engineering, University of Colorado Denver,
Denver, CO, USA
layla.aldawsari@ucdenver.edu
2 College of Computer and Information Sciences,
Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia
lsaldossary@pnu.edu.sa

Abstract. Anonymous distributed systems consist of processes with-
out names to identify them, and the goal is to assign unique names to
them using a distributed algorithm. Synchronous and asynchronous com-
munication models are considered with eight different categories based
on number of test-and-set (TAS) registers available and knowledge of
number of processes. Processes can only communicate through shared
read/write and TAS registers. In this paper, I have developed two deter-
ministic algorithms and one randomized algorithm for naming anony-
mous processes in all variations of the problem model. Proof of correct-
ness for each algorithm is presented along with the analysis. The devel-
oped algorithms are optimal in both shared memory size requirements
and namespace size. The Counting and Global Counting algorithms have a time complexity of O(n²) steps with a space complexity of O(1) and O(2) shared registers, respectively, while the Segment Shuffling algorithm has a time complexity of O(n log n) steps with a space complexity of O(2⌈log n⌉) shared registers.

Keywords: Distributed computing · Anonymous processes ·
Test-and-set register · Naming algorithm

1 Introduction
Anonymous process naming is one of the most fundamental problems in dis-
tributed systems. It can be seen as a basis for solving other distributed problems because many solutions to distributed systems problems, such as leader election, rely on processes having unique names. A system of anonymous processes is con-
sidered where processes communicate through shared memory based on a special
object type known as a test-and-set (TAS) register in addition to read/write reg-
isters. This system is in a symmetric state, where each process is indistinguish-
able from other processes. Thus, breaking the symmetry is necessary to allow a
naming solution to assign unique names.
In this paper, anonymous process naming is studied, where processes commu-
nicate by writing to and reading from the shared memory. Anonymous processes
start with no names assigned to them, and the processes cannot perform any use-
ful computation before having unique names assigned to them. Two categories
are studied in both synchronous and asynchronous communication models where
one category specifies the number of shared registers available, while the other
category defines whether the number of processes can be used as part of the algorithm. A shared register can be either a normal read/write register or a special
object type known as a TAS register.
A TAS register or object is one of the most primitive objects that has been
used in the literature for solving several distributed problems, such as consensus
and renaming [1–3,7,10,14]. A TAS register stores a value of 0 or 1, where a
value of 0 represents the initial state. It has two operations, which are TAS()
and reset(). A TAS() operation returns the current value stored in the register
and then sets the register value to 1. Any subsequent TAS() call to a TAS register
with value of 1 always returns the value of 1. The reset operation changes the
register value to 0, which is the initial state. In the course of this paper, a process
is said to be a winner of a TAS register when the return value is 0. Otherwise,
a process is a non-winner when the return value is 1.
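To make these semantics concrete, the following minimal Python sketch emulates a TAS register with a lock; the class name TestAndSet and the lock-based atomicity are illustrative assumptions of the sketch, since the model itself treats the register as a primitive shared object.

import threading

class TestAndSet:
    """Illustrative shared TAS register; holds 0 (initial state) or 1."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # stands in for hardware atomicity in this sketch

    def tas(self):
        """Atomically return the current value and then set the register to 1."""
        with self._lock:
            old = self._value
            self._value = 1
            return old  # 0 means the caller is the winner; 1 means non-winner

    def reset(self):
        """Return the register to its initial state 0."""
        with self._lock:
            self._value = 0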
Existing naming algorithms that use TAS registers for naming anonymous
processes require the knowledge of the number of processes [3,21]. When the
number of processes is very large, these naming algorithms require a large mem-
ory size. Since the memory size requirement of these algorithms is dependent on
the number of processes, these naming algorithms are not suitable to be used in
systems with very limited resources.
In this paper, I have developed three algorithms for naming anonymous pro-
cesses. The first algorithm, the Counting algorithm, is developed for the synchronous communication model and has an optimal shared memory requirement of O(1) shared register with an optimal namespace size of n, even when the number of processes is unknown. The second algorithm, the Global Counting algorithm, is developed for the asynchronous communication model and has an optimal shared memory size requirement of O(2) shared registers with an optimal namespace size of n. The third algorithm is a randomized naming solution known as Segment Shuffling, which is developed for asynchronous communication models and has a time complexity of O(n log n) steps. Unlike existing naming solutions
that use TAS registers, the developed deterministic algorithms can work even
when the number of processes is unknown and can assign unique names for any
number of processes. Naming and renaming solutions that have been presented
in [1–3,7,10] require at least n shared registers, while the presented algorithms
use minimal number of shared registers. Therefore, the presented solutions are
very suitable to be implemented in systems with very limited resources.
The rest of paper is organized as follows. In Sect. 2, related work from the
literature is reviewed. Then in Sect. 3, the model and problem statement are
defined. In Sect. 4, all developed naming algorithms are presented. Proof of cor-
rectness for the naming algorithms are shown in Sects. 5 and 6, and finally a
conclusion is presented in Sect. 7.
2 Previous and Related Work
Anonymous processes were first considered by Angluin in 1980 [4], where she
proposed message-passing-based networking solutions for anonymous hosts and
entities in a distributed paradigm. She studied anonymous processes on symmet-
ric system communications using a communication network. It was shown that
randomization was required in the case where the system is fully symmetric,
specifically for the naming problem.
In the literature, solutions developed for naming processors have used ran-
domization such as the work by Lipton and Park [17] for a shared memory model
where they developed a solution that requires a space size of O(L ∗ n²) bits. Sev-
eral improved solutions for the same model have been developed in [16,22].
Boldi et al. [6] proposed that breaking symmetry in distributed networks is
dependent on the type of computation used. Thus, they introduced a determin-
istic algorithm for leader election in an anonymous network of processors and
proved that every network has a weak election algorithm, where all processors
detect the impossibility of leader election.
Eğecioğlu and Singh [11] have developed a naming solution that uses ran-
domization and shared registers for both synchronous and asynchronous com-
munication models. The developed solution has a time complexity of O(n²).
The first Las Vegas algorithm for solving the naming problem was introduced
by Kutten et al. [15]. They developed an algorithm for the asynchronous model
with a shared read-write memory that has an expected time of O(log n) with
the space size of O(n), where n is the number of processes.
Alistarh et al. [3] introduced two randomized naming algorithms, where one of them ensures a namespace size of n and has a step complexity of O(n log⁴ n), while the second algorithm achieves a namespace size of k(1 + ε) with a step complexity of O(k log⁴ k / log²(1 + ε)).

A wait-free algorithm was introduced by Panconesi et al. [21] for the naming
problem where processes can experience crashes. The developed algorithm uses
single-writer multi-reader atomic registers. Each process has a private register
that can be read by all other processes. However, a specific private register is addressed differently by different processes. The algorithm has a running
time of O(n log n log log n) and space size of (1 + ε)n with a probability of 1 − o(1), where n is the number of processes and ε > 0. The developed algorithm uses an
α-Test and Set-Once object to assign a name randomly among all contending
processes with a minimum probability of α against a dynamic adversary.
Chlebus et al. [8] presented solutions for naming anonymous process when
they can only communicate using beeps that are sent over a communication
channel. The channel is assumed to be a shared communication channel such
that any beep sent over the channel is delivered to all processes connected to
that channel. The channel has only two possible types of feedback, which is either
a beep or silence. Chlebus et al. developed a Las Vegas algorithm and a Monte
Carlo algorithm for the models where n is known or unknown. The presented
solution gives processes contiguous integer names starting from 1. The algorithms
have been proved to be optimal with an expected run time of O(n log n) steps
and O(n log n) used bits.
The main task of the work proposed by Chlebus et al. [9] is that processors
are able to assign names to themselves in the anonymous distributed environ-
ment. A Monte Carlo algorithm was introduced that has an optimal running
time and uses O(n log n) random bits. The developed algorithms work with two
different scenarios to solve the naming problem. One is where the number of
memory cells is considered constant, while in the other, the number of mem-
ory cells is not restricted, yet the amount of memory available to the algorithm
depends on the number of processors.
Anonymous distributed systems have been considered in several other dis-
tributed system problems such as leader election and consensus [5,7,12,13,18].
Some of this research focuses on showing the power of a distributed system in
computing certain functions such as SUM, AND and Orientation [5] in anony-
mous systems, where the authors developed solutions for these functions that
have a complexity of O(n log n) messages. Moreover, Guerraoui and Ruppert [13]
considered anonymity as a way for protecting the identity of processes where they
developed algorithms for solving problems such as time-stamping, snapshots and
consensus.

3 Preliminaries
It is assumed that the system consists of anonymous processes that start with
no names assigned to them, and they can only communicate through shared
registers. The processes are assumed to be failure free, and the objective is to
assign unique names to them with or without randomization. Synchronous and
asynchronous communication models are considered in which n processes com-
municate through shared atomic multiple-reader/single-writer registers. Several
categories of the problem model are considered based on number of TAS registers
available and knowledge of n.
Coarse-grain atomicity is assumed regarding shared memory such that more
than one operation can be performed in a single step, which allows serialization
of access to shared registers. Some shared registers are implemented as TAS
registers. It is assumed that the initial value for all TAS registers is 0. A process
that is calling operation TAS() on a TAS register to obtain either winner or
non-winner status is said to be querying the TAS register. Simultaneous queries
to a TAS register by multiple processes will result in exactly one process with a
winner status.

4 Naming Algorithms
Two deterministic algorithms have been implemented for both synchronous and
asynchronous communication models with an optimal number of shared registers
using an optimal namespace size. The first algorithm, developed for synchronous
communication models, known as the Counting algorithm, works by having each
process keep an internal counter variable that tracks the number of processes with
assigned unique names. The second deterministic algorithm, which is developed
for asynchronous communication models, uses a global shared counter instead
of a private variable. Finally, a randomized algorithm was developed that uses ⌈log n⌉ shared TAS registers and ⌈log n⌉ shared read/write registers that are used as counters.

4.1 Counting Algorithm

The Counting algorithm is developed for synchronous communication models
where a single shared TAS register is used as seen in Fig. 1. The pseudo code of
the algorithm is shown in Fig. 2. This algorithm requires the use of the reset()
operation to reset the value of the TAS register to the initial state. In addition,
each process has an internal variable counter that is initialized to 0, which repre-
sents the number of processes that have already won the TAS register. Execution
of the Counting algorithm is performed in phases, where a phase terminates when
the winner process resets the TAS register. In each phase, a process queries the
TAS register and checks whether the return status implies being a winner or
a loser. Assuming there are n processes, there will be a single winner process
and n − 1 losing processes that have failed to win the TAS register. A winner
process assigns the current counter value as a name, and in the following round,
it immediately resets the TAS register before terminating and ending the phase.
A process that has failed to obtain a winning status increments the counter
value by 1 and skips the following round because the TAS register has not been
reset by the winning process. This marks the end of a phase for both the winner
process and losing processes, where the remaining n − 1 processes begin a new
phase of the algorithm with a new counter value.

Fig. 1. System overview for Counting algorithm in synchronous communication model (processes 1 through n, each with a private counter, all calling TAS() on a single shared TAS register)

Algorithm 1: Counting
Initialization;
counter ← 0;
repeat
    v ← test-and-set(T);
    if v = 0 then
        name ← counter;
        reset(T);
    else
        counter ← counter + 1;
    end
until v = 0

Fig. 2. Pseudo code for Counting algorithm. T is a shared TAS register
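As an illustration, the phase structure of the Counting algorithm can be simulated sequentially in Python as follows; the round-by-round simulation, the function name counting_simulation and the random choice of the per-phase winner are assumptions of the sketch and not part of the algorithm itself.

import random

def counting_simulation(n):
    """Round-by-round simulation of the Counting algorithm for n anonymous processes."""
    tas_register = 0        # the single shared TAS register, initially 0
    counters = [0] * n      # each process's private counter
    names = [None] * n      # names obtained (list indices are simulation bookkeeping only)

    unnamed = list(range(n))
    while unnamed:
        # Phase: every remaining process queries the TAS register in the same round;
        # the register guarantees exactly one winner, picked arbitrarily here.
        winner = random.choice(unnamed)
        tas_register = 1
        names[winner] = counters[winner]          # the winner adopts its counter value as its name
        for p in unnamed:
            if p != winner and tas_register == 1:
                counters[p] += 1                  # every loser increments its private counter
        tas_register = 0                          # next round: the winner resets and terminates
        unnamed.remove(winner)
    return names

Running counting_simulation(5), for example, returns the five distinct names 0 through 4, mirroring the fact that all unnamed processes hold identical counter values in every phase.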

4.2 Global Counting Algorithm

The Global Counting Algorithm is developed for asynchronous communication
models where the pseudo code is shown in Fig. 3. It uses a shared read/write
register and a shared TAS register as shown in Fig. 4. The idea of the algorithm
is to keep a counter of winning processes in a shared register, which is accessed
exclusively by the winning process. This counter is saved in a shared read/write
register that is modified by the winning process. A winning process assigns itself
the current value stored in the shared counter as a name before incrementing the
counter’s value and resetting the TAS register. Thus, each process is guaranteed
to be assigned a unique name because no two processes will ever use the same
counter value. The shared counter variable is initialized to 0, which is then
incremented by 1 after each winning process assigns itself the current counter

Algorithm 2: Global Counting
repeat
    v ← test-and-set(T);
    if v = 0 then
        name ← global_counter;
        global_counter ← global_counter + 1;
        reset(T);
        Terminate();
    else
        No Operation();
    end
until v = 0

Fig. 3. Pseudo code for Global Counting algorithm. T and global_counter are shared
registers where T is a TAS register
Fig. 4. System overview for Global Counting algorithm in asynchronous communication model (processes 1 through n read and increment a shared Global_Counter register and call TAS() on a single shared TAS register)

value. The Global Counting algorithm works by having each process query a
TAS register until a winning status is returned. Each process is guaranteed to
win the TAS register at least once before terminating because processes do not
experience a fault, so the TAS register is always reset by the winning process.
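For illustration, the following Python sketch runs the Global Counting algorithm with threads standing in for anonymous processes under an arbitrary asynchronous schedule; the lock-emulated TASRegister class and the helper function global_counting are assumptions of the sketch rather than part of the algorithm's specification.

import threading

class TASRegister:
    """Minimal lock-emulated TAS register for this sketch (0 is the initial state)."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()
    def tas(self):
        with self._lock:
            old, self._value = self._value, 1
            return old
    def reset(self):
        with self._lock:
            self._value = 0

def global_counting(n):
    """Run n anonymous 'processes' (threads) and return the names they acquire."""
    T = TASRegister()                  # the single shared TAS register
    shared = {"global_counter": 0}     # the shared read/write counter register
    names = []                         # collected only to inspect the outcome

    def process():
        while True:
            if T.tas() == 0:                          # winner of the TAS register
                name = shared["global_counter"]       # adopt the current counter value
                shared["global_counter"] = name + 1   # increment before releasing
                names.append(name)
                T.reset()                             # reset the register and terminate
                return
            # non-winner: retry; the busy loop models an arbitrary asynchronous schedule

    threads = [threading.Thread(target=process) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(names)    # e.g. global_counting(8) -> [0, 1, 2, 3, 4, 5, 6, 7]

Note that in this sketch the shared counter is read and written only between winning the TAS register and resetting it, which is exactly the mutual exclusion argument used in the correctness proof below.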

4.3 Segment Shuffling Algorithm

The Segment Shuffling Algorithm is a randomized algorithm for naming anony-
mous processes which is presented and described in this section where the system
overview can be seen in Fig. 5. The idea of the algorithm is to divide a names-
pace size of n into ⌈log n⌉ segments where each segment represents a sub-range with a size of ⌈n/log n⌉. There are ⌈log n⌉ TAS registers that are initialized to 0 in addition to ⌈log n⌉ read/write registers. Each TAS register is associated with a
specific segment, while the value of the read/write register represents the number
of names that has been taken from that sub-range namespace. Each read/write
register can only be modified by the winning processes of the segment’s TAS
register.
The algorithm has a probability of 1 of terminating with unique names and a probability of at least 1/⌈log n⌉ of choosing an unused TAS register for each query attempt.
From the pseudo code shown in Fig. 6, the algorithm works by having each
process choose a random number r uniformly at random from the range (1, k),
where k is the number of unfinished segments, such that the process attempts
to win the TAS register at index r. A process that obtains a winning status
from the TAS register assigns itself the value ⌈n/log n⌉ ∗ r + global_counter[r] as a name and increments global_counter[r] by 1 before resetting the TAS register at index r and terminating. On the other hand, a process that has failed to obtain a winning status chooses a new random number r from the range (1, k) and attempts to win the register at index r. This process continues until a process wins
a TAS register.
Fig. 5. System overview for Segment Shuffling algorithm in asynchronous communication model (processes 1 through n read and increment counter registers 1 through ⌈log n⌉ and call TAS() on TAS registers 1 through ⌈log n⌉)

Algorithm 3: Segment Shuffling
Initialization;
k ← ⌈log n⌉;
name ← ∅;
L ← [1, ..., k];
MaxCounter ← ⌈n/log n⌉;
repeat
    r ← random(1, k);
    selected ← L[r];
    v ← TAS(T[selected]);
    if v = 0 then
        if global_counter[selected] < MaxCounter then
            name ← (selected ∗ MaxCounter) + global_counter[selected];
            global_counter[selected] ← global_counter[selected] + 1;
            Reset(T[selected]);
        else
            L ← L − {r};
            k ← k − 1;
            Reset(T[selected]);
        end
    end
until name ≠ ∅

Fig. 6. Pseudo code for Segment Shuffling algorithm. It uses an array of shared TAS
registers T and an array of shared read/write registers global_counter. Array L con-
tains the index values of unfinished segments
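For illustration, the sketch below simulates Segment Shuffling under a sequential schedule in which each process completes its attempt before the next one acts, so winning a segment's TAS register collapses to simply selecting that segment; the function name segment_shuffling and the use of ceil(log2 n) segments are assumptions of the sketch, and the resulting namespace is k · ⌈n/k⌉, which may slightly exceed n because of rounding.

import math
import random

def segment_shuffling(n):
    """Sequential sketch of Segment Shuffling: n processes draw names from k segments."""
    k = max(1, math.ceil(math.log2(n)))   # number of segments (and of TAS registers)
    max_counter = math.ceil(n / k)        # names available inside each segment
    counters = [0] * k                    # one shared read/write counter per segment
    names = []

    for _ in range(n):                    # each anonymous process, one at a time
        live = list(range(k))             # segments this process still considers unfinished
        while True:
            seg = random.choice(live)     # pick a segment uniformly at random; here this
                                          # also stands for winning that segment's TAS register
            if counters[seg] < max_counter:
                names.append(seg * max_counter + counters[seg])
                counters[seg] += 1
                break                     # the process terminates with its name
            live.remove(seg)              # segment exhausted: drop it and retry

    assert len(set(names)) == n           # all assigned names are distinct
    return names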
5 Synchronous Communication Models
This section investigates all synchronous communication models, where execu-
tion is divided into equal time slots, and presents a naming solution for each
model. The naming solution is described along with proof of correctness and
analysis for all variations of synchronous communication models, which are based
on the two categories. There are four variations of the model based on the num-
ber of TAS registers, which is either finite or infinite, and the knowledge of the
number of processes n, which is either known or unknown. Using the Counting
algorithm, anonymous processes are assigned unique names in both finite and
infinite number of TAS registers categories because it requires only a single TAS
register. In addition, because the algorithm does not use n as part of the code,
it is also a naming solution for both known n and unknown n models.

5.1 Proof of Correctness for Counting Algorithm

In this section, proof of correctness for the Counting algorithm is presented
along with analysis of the algorithm regarding the number of registers required,
namespace size, and number of steps. For each category of the model, correctness
of the Counting algorithm is demonstrated.
Infinite Number of TAS Registers and Known Number of Processes
The proof of correctness must show that each process is assigned a unique name.
In the Counting algorithm, because each process is assigned a name based on
the current value of the counter that is private to each process, the counter value
must be identical for all processes in every phase during the whole execution of
the algorithm.

Lemma 1. The internal counter value is modified identically in all non-winner
processes.

Proof. The proof for this is through induction. The value of the internal counter
is first initialized to 0 for all processes before starting the algorithm. Assuming
that the value of the counter is k, the proof must show that it is incremented
identically in all remaining n − k processes. Because all processes check the same
TAS register, exactly one process is a winner of the TAS register, while the
remaining n − k − 1 processes fail to win the TAS register. The non-winner
processes increment the counter value to k + 1 and wait for the end of the phase,
which occurs immediately after the winning process resets the TAS register.
Thus, all remaining n − k − 1 processes begin a new phase with an identical
counter value of k + 1, which concludes the proof.
Next, all processes eventually terminate with a unique name. In each phase,
exactly one process becomes a winner of the TAS register and assigns itself the
value of the counter as a name, after which it resets the TAS register. Thus,
after n phases, all processes have terminated with a unique name. The space
complexity of the algorithm is O(1) shared registers since exactly one shared
TAS register is used by all processes, while the time complexity of the Counting
algorithm is O(n²) steps, because there are n processes performing at most n
queries on the TAS register.
Infinite Number of TAS Registers and Unknown Number of Processes
In the Counting algorithm, the knowledge of the number of processes n is not required
since each process uses a counter to identify the number of processes that have
already obtained names and after getting a name the process immediately ter-
minates. The assigned name is unique and is based on the counter’s value, which
is dynamically modified during the whole execution of the algorithm. It is necessary in this algorithm to have the counter’s value modified identically by all processes in each phase. The proof of correctness is shown in two parts,
where the first part shows consistency of the counter’s value among non-winner
processes in each phase.

Lemma 2. The internal counter variable is identical for all non-winner pro-
cesses in every phase of the algorithm.

Proof. The proof is shown by induction. At the beginning of execution, all
processes start with the same initial value for the counter variable. Assuming
that the counter value is k for all processes, it is shown that the new value for
the counter becomes k + 1 and is identical for all processes. Exactly one process
wins the TAS register and uses the current value of the counter as a name, while
all other processes fail to win the TAS register. Other processes that have failed
increment the counter value to k + 1 and wait for the end of the phase before
attempting to win the TAS register. Thus, the remaining processes begin a new
phase with counter = k + 1 that is identical for all processes, which concludes the
proof. Next, the proof must show that no two processes obtain the same name.

Theorem 1. On termination of the Counting algorithm, each process assigns itself a
unique name.

Proof. From Lemma 2, it was shown that the value of the counter is identical for
all processes in every phase. Because all processes query a single TAS register,
exactly one winner process exists in every phase, where a process uses the counter
value as a name and terminates after resetting the TAS register. Thus, each
counter value is used as a name by no more than one process, which shows that
all processes obtain unique names.
In this category of the model, the step complexity of the algorithm is unknown
because it depends on the number of processes, which is unknown, while the space
complexity is O(1) shared register as exactly one shared TAS register is used by
all processes to completely assign unique names for all processes.
Finite Number of TAS Registers and Known Number of Processes
Regardless of whether the model has a finite or infinite number of TAS registers,
the Counting algorithm needs exactly one TAS register. This is shown by the
fact that the single TAS register is reused by calling reset operation, so that a
new unique name is obtained and assigned by a new process. Thus, the Counting
algorithm can be used as a naming solution for anonymous processes whether
the number of processes is known or not.
To prove that unique names are assigned to every process, the proof must
show that the value of the counter is exactly the same for all processes. From
Lemma 1, it is implied that the internal counter variable is modified identically
on all non-winner processes, which can be proved by induction as follows.
Proof. It is already known that the counter value is initialized to 0 for all
processes, so the proof must show that, when the counter value is k for all
processes, it is identically incremented for all non-winner processes. Exactly one
process wins the TAS register while the remaining n − k − 1 processes increment the
counter value to k + 1. Thus, at the end of the phase, all remaining n − k − 1
processes begin a new phase with an identical value of k + 1, which concludes
the proof.
Next, to prove the correctness of the algorithm, it should be shown that all
processes are assigned unique names. It can be easily proven that each process is
assigned a unique name because a TAS register guarantees one winner process.
In addition, based on the fact that each process’s counter value is identical to
other processes, it is guaranteed that each counter value is assigned as a name
to at most one process. From the definition of the algorithm, each process keeps
querying the single shared TAS register at the beginning of each phase, which
implies that all processes will eventually obtain unique names.
The time complexity of the Counting algorithm is O(n²) steps, as each pro-
cess performs at most n queries to the TAS register before being assigned a
unique name, and there are a total of n processes performing n queries.
Finite Number of TAS Registers and Unknown Number of Processes
To prove that the algorithm is correct in this category of the model, it must be
shown that the internal counter is consistent among all non-winner processes.
Even though there are finite number of TAS registers, correctness of the algo-
rithm is proved using the same proof for Lemma 2, which shows consistency
of the internal counter value. The proof must also show that each process is
assigned a unique name. Even though there are finite number of TAS registers,
only one TAS register is needed. Thus, the proof from Theorem 1 can be used
to prove correctness of the Counting algorithm, which shows that all processes
terminate with assigned unique names.

6 Asynchronous Communication Models

In the asynchronous communication model, the developed Global Counting algo-
rithm can assign unique names to anonymous processes in all categories of the
model. Also, the Segment Shuffling algorithm can assign unique names to anony-
mous processes in the asynchronous communication model with known number
of processes, which has a lower time complexity compared to the deterministic
algorithms.
6.1 Proof of Correctness for Global Counting Algorithm

Infinite Number of TAS Registers and Known Number of Processes
Each process executes the Global Counting algorithm. Because it is an asyn-
chronous communication model, the order of events is unknown, where some
processes may access the TAS register earlier than others. Regardless of the
order of accesses to the TAS register, it is guaranteed that no two processes
will choose the same name because the shared global counter is only accessed
and modified by the winning process. Thus, a process simply queries the TAS
register forever until it wins the TAS register, after which it assigns itself the
global counter value as a name.
Naming in the Global Counting algorithm is based on the shared global
counter, where the single TAS register is used as a mutual exclusion to guaran-
tee a synchronized counter value among all processes and prevent two or more
processes from using the same global counter value as a name. The correctness
of the algorithm must show that the global counter value is never used more
than once and that each process wins the TAS register at least once.

Lemma 3. At most one process uses each value of the global counter register.

Proof. The proof is shown by induction. Assuming that the current value of
the counter is y and that a process has obtained a winning status, it will assign
the value y as a name to itself. When the next process wins the TAS register
after it has been reset by the previous process, it will assign itself the current
global counter value y′. However, y′ is guaranteed to be a different value because
each winning process increments the global counter by 1 after assigning the old
value as a name. The key idea of the proof is that no more than one process can
use and alter the global counter variable by using the TAS register for mutual
exclusion.
The second part of the correctness proof is to show that each process wins
the TAS register at least once before termination.
Theorem 2. Each process assigns itself a unique name after winning the TAS
register at least once.

Proof. Assuming that k processes have already won the TAS register once and
that all of them have terminated with unique names assigned to them, this
implies that global counter = k. The next process to win the TAS register will
assign the value k as a name and set global counter = k + 1 before termination.
Since each process is fault free and each of them terminates after resetting the
TAS register, it is guaranteed that each of the n processes wins the TAS register
at least once. Based on Lemma 3 and the fact that each process wins the TAS
register at least once, each process assigns itself a unique name before termination.
To analyze the time complexity, it is assumed that the steps for all processes
are bounded by t, where, in a step, processes can either read or write to a register
and can perform some computations. The number of steps taken by a winning
process to finish assigning itself a name and resetting the TAS register is five
steps. In addition, n processes will perform at most 5n query attempts before
winning the TAS register. Thus, the time complexity is O(n²) steps. The space
complexity of the Global Counting algorithm is O(2) shared registers since it
uses two shared registers where one of the registers is a read/write register while
the other one is a TAS register. Comparing this presented algorithm to existing
naming solutions, it uses the minimum number of shared registers.
Infinite Number of TAS Registers and Unknown Number of Processes
The Global Counting algorithm does not require the knowledge of the number
of processes, so it is a valid naming solution for anonymous processes in the
model of asynchronous communication with an unknown number of processes.
The proof must show that each value of the global counter is used by at most one
process and that all processes terminate with unique names assigned to them.
Although the number of processes is unknown, using the proof for Lemma 3
it can be proved in this category that any global counter value is used by at
most one process. Similarly, the proof for Theorem 2 is used to prove that each
process terminates after assigning itself a unique name. Whether n is known
or unknown, each process is guaranteed to win the TAS register at least once
before termination. Thus, after finishing execution of Global Counting algorithm,
all processes are assigned unique names. The shared space requirement of the
algorithm is O(2) shared registers as it uses only two shared registers.
Finite Number of TAS Registers and Known Number of Processes
Even though there are a finite number of TAS registers, the Global Counting
algorithm can still be used to assign unique names in this category because
it requires exactly one TAS register and one shared read/write register. The
correctness of the Global Counting algorithm is proved by showing that a global
counter value is used by no more than one process. Using the proofs for Lemma 3
and Theorem 2 the Global Counting algorithm can be used to assign unique
names in this category. The proofs are valid since the algorithm requires exactly
two shared registers which is finite. The time complexity of the algorithm is
O(5n²) steps since each process takes a total of five steps after winning and
resetting the TAS register. Similarly, the algorithm has a space requirement of
O(2) shared registers.
Finite Number of TAS Registers and Unknown Number of Processes
The Global Counting algorithm is a valid naming solution for anonymous pro-
cesses in this category because the algorithm does not require the knowledge
of the number of processes. The proof of correctness is based on Lemma 3 and
Theorem 2, which implies that each value of the global counter register is used
by at most one process and that each process wins the TAS register at least
once and is assigned a unique name before termination. The proofs are valid in
this category, since the knowledge of n is not required for the algorithm to be
correct.
6.2 Proof of Correctness for Segment Shuffling Algorithm

Infinite Number of TAS Registers and Known Number of Processes
The correctness of the Segment Shuffling algorithm is proved by showing that
all processes eventually obtain a unique name with a probability of 1. It is easily
proven that a process will obtain a unique name because a TAS register can be
won by exactly one winner. Next, it is shown that, when a process randomly
checks all log n TAS registers, it will guarantee termination.

Theorem 3. Each process terminates after assigning a unique name with a
probability of 1.

Proof. Assuming n − 1 processes have acquired names, then at most ⌈log n⌉ − 1 global counter registers have reached their maximum value of ⌈n/log n⌉. It is also certain that a process resets a TAS register before terminating, which implies
that all TAS registers can be won by the remaining process. In the worst case,
the remaining single process executes the Segment Shuffling algorithm where it
randomly chooses a TAS register with an associated global counter value that
has reached the maximum value. The selected TAS register is then removed
from the list of unused segments. Eventually, the process will choose one of the
remaining TAS registers with an associated global counter value that has not
reached the maximum value. The process assigns itself the global counter value
as a name, which is unique because no other process has ever used it, and then
it finally terminates. The time complexity of the algorithm is O(n log n) based on the coupon collector process [19,20]. The algorithm requires a shared memory size of O(2⌈log n⌉) shared registers.
Finite Number of TAS Registers and Known Number of Processes
Because there is a finite number of TAS registers and the number of processes is
known, the Segment Shuffling algorithm is a valid naming solution when there
are at least ⌈log n⌉ TAS registers. Proof of correctness is based on Theorem 3
which shows that each process terminates after assigning itself a unique name
with a probability of 1. To prove that this theorem is correct, it must be shown that no more than ⌈log n⌉ TAS registers are used by a process.
Proof. Each global counter variable has a maximum value of ⌈n/log n⌉. By the
definition of the algorithm, a process increments a global counter value by 1 and
no other global counter is modified by the same process because it terminates
immediately after winning one TAS register. In total, the sum of all increments
is n, which is at most ⌈n/log n⌉ ∗ ⌈log n⌉. Thus, only 2⌈log n⌉ shared registers are required for the Segment Shuffling algorithm to assign unique names for all
processes.

7 Conclusion
In this work, I have designed several models for naming anonymous processes
with different categories where the communication model is either synchronous
or asynchronous and processes can only communicate using shared registers.
The purpose of this work was to develop efficient distributed algorithms, which
require minimum number of shared registers and have optimal namespace, for
naming anonymous processes in both synchronous and asynchronous communi-
cation models. In this paper, I developed three algorithms for naming anonymous
processes which are optimal in terms of shared memory size and namespace
size. These algorithms are the Counting algorithm, the Global Counting algorithm and the Segment Shuffling algorithm, which have time complexities of O(n²), O(n²) and O(n log n) steps with space complexities of O(1), O(2) and O(2⌈log n⌉) shared registers, respectively. This work proved that the developed algorithms are correct for the
designed models and are optimal in terms of shared memory size requirement.
In addition, the developed algorithms have an advantage over existing naming
algorithms, since they do not require knowledge of the number of processes. This
presented work sheds new light on developing naming algorithms for systems with very limited resources, which can have a significant impact on distributed computing, as it presents new insights for developing distributed algorithms.

References
1. Alistarh, D., Aspnes, J., Censor-Hillel, K., Gilbert, S., Zadimoghaddam, M.:
Optimal-time adaptive strong renaming, with applications to counting. In: Pro-
ceedings of the 30th Annual ACM Symposium on Principles of Distributed Com-
puting, PODC 2011, San Jose, CA, USA, 6–8 June 2011, pp. 239–248 (2011)
2. Alistarh, D., Aspnes, J., Gilbert, S., Guerraoui, R.: The complexity of renaming.
In: IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS
2011, Palm Springs, CA, USA, 22–25 October 2011, pp. 718–727 (2011)
3. Alistarh, D., Attiya, H., Gilbert, S., Giurgiu, A., Guerraoui, R.: Fast randomized
test-and-set and renaming. In: Lynch, N.A., Shvartsman, A.A. (eds.) Distributed
Computing, pp. 94–108. Springer, Heidelberg (2010)
4. Angluin, D.: Local and global properties in networks of processors (extended
abstract). In: Proceedings of the Twelfth Annual ACM Symposium on Theory
of Computing, STOC 1980, pp. 82–93. ACM (1980)
5. Attiya, H., Snir, M., Warmuth, M.K.: Computing on an anonymous ring. J. ACM
35(4), 845–875 (1988)
6. Boldi, P., Shammah, S., Vigna, S., Codenotti, B., Gemmell, P., Simon, J.: Sym-
metry breaking in anonymous networks: characterizations. In: ISTCS, pp. 16–26
(1996)
7. Buhrman, H., Panconesi, A., Silvestri, R., Vitanyi, P.: On the importance of hav-
ing an identity or, is consensus really universal? Distrib. Comput. 18(3), 167–176
(2006)
8. Chlebus, B.S., De Marco, G., Talo, M.: Naming a channel with beeps. Fundamenta
Informaticae 153(3), 199–219 (2017)
9. Chlebus, B.S., De Marco, G., Talo, M.: Anonymous processors with synchronous
shared memory: Monte Carlo algorithms. In: 21st International Conference on Prin-
ciples of Distributed Systems (OPODIS 2017). Schloss Dagstuhl-Leibniz-Zentrum
fuer Informatik (2018)
10. Eberly, W., Higham, L., Warpechowska-Gruca, J.: Long-lived, fast, waitfree renam-
ing with optimal name space and high throughput. In: Proceedings of 12th Interna-
tional Symposium on Distributed Computing, DISC 1998, Andros, Greece, 24–26
September 1998, pp. 149–160 (1998)
11. Eǧecioǧlu, O., Singh, A.K.: Naming symmetric processes using shared variables.
Distrib. Comput. 8(1), 19–38 (1994)
12. Glacet, C., Miller, A., Pelc, A.: Time vs. information tradeoffs for leader election
in anonymous trees. ACM Trans. Algorithms 13(3), 31 (2017)
13. Guerraoui, R., Ruppert, E.: What can be implemented anonymously? In: Interna-
tional Symposium on Distributed Computing, pp. 244–259. Springer (2005)
14. Kruskal, C.P., Rudolph, L., Snir, M.: Efficient synchronization on multiprocessors
with shared memory. ACM Trans. Program. Lang. Syst. 10(4), 579–601 (1988)
15. Kutten, S., Ostrovsky, R., Patt-Shamir, B.: The Las-Vegas processor identity prob-
lem (how and when to be unique). J. Algorithms 37(2), 468–494 (2000)
16. Lim, L., Park, L.: Solving the processor identity problem in O(n) space. In: Pro-
ceedings of the Second IEEE Symposium on Parallel and Distributed Processing
1990, pp. 676–680. IEEE (1990)
17. Lipton, R.J., Park, A.: The processor identity problem. Inf. Process. Lett. 36(2),
91–94 (1990)
18. Matias, Y., Afek, Y.: Simple and efficient election algorithms for anonymous
networks. In: International Workshop on Distributed Algorithms, pp. 183–194.
Springer (1989)
19. Mitzenmacher, M., Upfal, E.: Probability and Computing - Randomized Algo-
rithms and Probabilistic Analysis. Cambridge University Press, New York (2005)
20. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press,
New York (1995)
21. Panconesi, A., Papatriantafilou, M., Tsigas, P., Vitányi, P.: Randomized naming
using wait-free shared variables. Distrib. Comput. 11(3), 113–124 (1998)
22. Teng, S.-H.: Space efficient processor identity protocol. Inf. Process. Lett. 34(3),
147–154 (1990)
Development Trends of Information Technology Industry in Armenia

Ashot Davoyan

Economic Faculty of Yerevan State University, 52 Abovyan Street, Yerevan, Armenia
ashot_davoyan@yahoo.com

Abstract. This article presents the current state of the Information Technology (IT) industry in Armenia, where this industry is considered one of the major forces behind economic development and growth. It presents an overview of the IT industry in Armenia and sheds light on the major elements behind its development. Nowadays, many Armenians choose this discipline due to the high income, mobility and the multitude of job opportunities it offers. Over 17,000 specialists, or 2.5% of the total employed, work in this sector with an average annual salary of about 12,000 USD. Due to infrastructure development, the number of IT companies has grown in other parts of the country. Armenia-based IT companies specialize in software development, semiconductor design, multimedia and web design. The steady growth of the IT industry increasingly attracts new players. Investors acknowledge Armenia as a hotspot for high-quality IT product development and start-ups. However, some disadvantages still remain, including a disintegrated regional ecosystem, lack of financing, and corruption.

Keywords: Economy · IT · Start-ups · Armenia · Development · Growth forces

1 Introduction

Armenia is a landlocked country in the South Caucasus region that neighbours Azerbaijan, Georgia, Turkey and Iran. The country covers an area of 29,800 square kilometres. Three million Armenians live in the country, and three times that number make up the Armenian Diaspora, with major populations living in the USA, France and Russia. The Armenian economy grows at a healthy rate in spite of political upheavals; in fact, it is estimated that the economy will grow at a rate of 4.5% in 2019. The Information Technology industry, hereafter the IT industry, has high potential in the context of competitiveness enhancement, economic development and growth. It comprises 6.25% of GDP, with over 650 companies operating in this industry at an average annual growth rate of 20%. The majority of IT companies operate in the capital of Armenia.
Over 17,000 specialists or 2.5% of the total employed work in this sector with an
average annual salary of about 12,000 USD. Due to infrastructure development, the
number of IT companies has grown in other parts of the country. Armenia-based IT
companies specialize in software development, semiconductor design, multimedia and

web design. The total export of IT products from Armenia amounted to 338 million
USD in 2017 [1]. The main export destinations are the USA and Canada (45%),
European Union (25%), Russia (10%) and Asian countries (10%). The rest of the
products are consumed by the former Soviet republics, where 3D modelling, animation, games and mobile apps dominate. With artificial intelligence still at an early stage, the market for technological solutions is practically limitless, and Armenia, in spite of its small IT community, is equipped with the necessary resources to offer and export such solutions. Its high potential for growth is influenced by the following factors:
– High-quality IT programs are established in universities as a result of collaboration among universities, the IT industry and the State.
– Availability of highly skilled specialists with relevant educational background and
knowledge of English language.
– Collaboration between local companies and diaspora creates synergistic effects.
– Availability of a competitive IT workforce and low operating costs.
– A large number of multinationals have set up branches, including Cisco, Synopsys, National Instruments, etc.
Foreign investors can benefit from the following advantages:
– IP protection regulations.
– Free economic zones (FEZs). Residents of FEZs are completely exempted from
profit tax, VAT, property tax and customs duty. Services on behalf of the state bodies are delivered on a “one stop shop” basis.
– Right of 100% property ownership.
– No restriction on staff recruitment.
– Duty free import of personal goods.
– Armenia is a member of the Eurasian Economic Union and enjoys Generalized
Scheme of Preferences status with USA, Canada, Japan, Switzerland and Norway in
addition to the European Union states. Armenia implements an open-door policy as
a result of a positive attitude towards investments from overseas.
Table 1 below shows the key economic indicators.

Table 1. Key economic indicators

| Year | Economic growth (% change of GDP) | Nominal GDP (USD bln.) | GDP per capita (USD) | Inflation (CPI, annual variation in %) | Investment (% change from previous year) | Unemployment (% of active population) |
|------|------|------|-------|------|-------|------|
| 2014 | 3.6  | 11.8 | 3,966 | 3.0  | −2.2  | 17.8 |
| 2015 | 3.2  | 10.7 | 3,574 | 3.7  | 2.5   | 19.5 |
| 2016 | 0.2  | 10.7 | 3,569 | −1.4 | −11.4 | 17.1 |
| 2017 | 7.5  | 11.6 | 3,862 | 1.0  | 7.7   | 18.0 |

Source: Economic Outlook by FocusEconomics [2]
The economy grew by 7.5% in 2017 and reached approximately 11.60 billion USD,
while the per capita GDP reached 3,862 USD. Stable high growth is positive in terms
of attracting new investors who acknowledge the country as a hotspot for high-quality
IT product development and start-ups. However, due to varying conditions, this is not
always an attainable goal taking into consideration the volatility of the South Caucasus
region.

2 IT Industry Development Life Cycle

[Figure: The IT industry development life cycle, connecting academia, the State and the IT industry with global markets, investors, support organizations, coworking spaces, and talents, IT products and start-ups.]

2.1 Global Markets


The world has become smaller and life has become more comfortable due to improvements in different sectors of economies, where the demand for technology is at an unprecedented level. The global IT industry is estimated at about 5 trillion USD in 2019 [2]. The share of the US in the global IT industry is 31%; China and the European Union respectively account for 26% and 19%, Japan for 7%, and Russia and the former Soviet states for 3%. Despite economic ups and downs, including the recent financial and economic crisis, IT trade remains at a healthy level and lets consumers enjoy the benefits of economic value creation. The increased demand for IT products will come from areas such as IoT integration, machine learning and robotic process automation. Global businesses, jobs, and daily life are becoming increasingly digitized year by year, which means there is a growth expectation for the IT industry. In 2019, the IT industry in Armenia is expected to grow at a rate of 25%, compared to a 4.0% global growth rate, according to Export.gov.
2.2 Investors
Armenia has an open and favourable policy towards foreign investment. The government continually carries out reforms aimed at improving the business and investment climate. According to the World Bank ‘Doing Business 2019’ report, Armenia ranked 41st among 190 countries. As the former center of the Soviet states for software development, semiconductor and electronics production, and industrial computing, Armenia has managed to keep its capacity as a regional IT hub. Table 2 below shows a comparison of several key indicators to consider before investing in Armenia.

Table 2. Key investment indicators

| Indicator | Armenia | Europe and Central Asia | OECD |
|-----------|---------|-------------------------|------|
| Time required for starting a business | 4.5 days | 10.1 days | 8.6 days |
| Cost for starting a business as % of income per capita | 0.9% | 4.2% | 3.2% |
| Time required for enforcing contracts | 570 days | 478 days | 558 days |
| Cost for enforcing contracts as % of claim | 16% | 25.5% | 22.2% |
| Total tax rate as % of profit | 18.5% | 34.2% | 40.7% |

Source: EIF, State of the Industry Report [1]

A number of multinationals, including Oracle, Microsoft, Cisco, Synopsys, National Instruments, D-Link, Mentor Graphics, and VMware, have different levels of presence in Yerevan. In terms of skilled and affordable IT professionals, Yerevan competes with cities such as Kiev, Moscow, Tel Aviv, Chennai, Bangalore, Dublin and Montreal. Granatus Ventures, the first VC fund, was established with government sponsorship in 2013. It provides investment, expertise and networks to start-ups which leverage Armenia’s potential as an emerging IT hub. The government considers it extremely important to develop the infrastructure for IT development in different regions of Armenia. In this framework, it has recently signed an MoU to establish a technological city and has founded two technological centers outside Yerevan.

2.3 Coworking Spaces


Coworking spaces are a relatively recent phenomenon in Armenia as well as in the South Caucasus region. Armenian start-ups and entrepreneurs enjoy many benefits of coworking spaces, including mentorship, networking opportunities and structure for work. The idea of contemporary coworking belongs to the software engineer and open-source enthusiast Brad Neuberg, who started the coworking campaign in 2005 in San Francisco, seeking to support the community and create a structure for gathering in new and unprecedented ways. However, coworking spaces as a functional idea date back to the 15th century, when sculptors, painters, architects, engineers and others came together under one roof to work with each other in the Renaissance “bottegas” in Florence [3]. “Bottegas”, or workrooms, translated from Italian, brought together different types of talents to improve their skills, often under the supervision of a master teacher whose role was similar to that of a mentor. “Bottegas” encouraged interaction and helped participants turn ideas into reality, which led to an overall higher level of creativity. Coworking spaces play a vital role in creating localized (town/city/region/state) innovation processes through tailored programs, diversification and collaboration. They typically charge a service fee in exchange for a chair at a desk, which includes the use of office equipment, a locker, a business address and other services, depending on the selected plan. The environment in these coworking spaces usually leads to improved performance due to the high level of independence and flexibility. It is noteworthy that members at the local coworking spaces are predominantly well-educated and hold jobs in IT and other related industries. Currently, there are up to a dozen coworking spaces in Armenia:
1. Platform Coworking Space is a collaborative workspace with a strong sense of
community.
2. AEON Co-Work is a community of forward-thinking, innovative freelancers, co-workers, and start-up professionals, giving them the opportunity not only to work independently but also to network. This anti-café does not offer memberships; visitors simply pay for the hours they spend there.
3. PMC is a shared workspace which offers a coworking office, private offices and
meeting rooms. Members can also book facilities for events.
4. Mydesk accommodates 12 residents and has a fully equipped conference room. Its convenient location and good infrastructure make Mydesk a perfect place for young businesses that want to be at the center of events.
5. Coworking Armenia is a coworking space created by Startup Armenia Fund. It
provides opportunities for professionals and startups from all over the world.
6. Impact Hub Yerevan is a shared community which incorporates an inspiring
workspace, a social enterprise community center, an opportunity for education as
well as a global network of like-minded people, rather than a simple coworking
space. Its membership is valid across 81 hubs operating across the world.
7. Loft is a self-development and leisure multifunctional center where everything is
free except time.
8. Utopian Lab is a shared workspace environment made to thrive alongside a moti-
vated community in Armenia.
9. Innovative Solutions and Technologies Center (also known as ISTC) is a co-
working space founded by IBM, USAID, Government of Armenia, and Enterprise
Incubator Foundation.

2.4 Talents, IT Products and Start-Ups


Demand for the IT workforce is rising steadily. Over the last decade, developed countries have faced talent shortages. This problem is even more severe in new markets such as AI, IoT and robotic process automation, where there is high competition for qualified employees. Technology creates new vacancies rather than replacing humans. Some 10,000 students majoring in IT study at specialized universities, which are
currently six in number: The State Engineering University of Armenia, Yerevan State
University (YSU), American University of Armenia (AUA), European Regional
Educational Academy and Russian-Armenian Slavonic University (RAU). Russian-
Armenian Slavonic University and American University of Armenia provide degree
programs in Russian and English languages, respectively. Developers in Armenia are
still considered low-cost service providers, capable of producing IT products and
services that meet international standards. They mainly specialize in software devel-
opment, semiconductor production, systems integration, web design and multimedia.
Demand for IT services is also growing in local markets from banks, enterprises,
universities and other entities. Independent developers serve global markets through
support organizations or through their own contacts. Armenia’s start-up scene is fuelled by well-thought-out initiatives and tax breaks aimed at boosting the industry. Over the last decade, the Armenian start-up ecosystem has flourished in terms of both quality and quantity. Armenia is home to some 200 tech start-ups, among which the successful ones include:
– Picsart – It is regarded as the number 1 photo and video editing app, powered by a
community of 100M+ monthly active users.
– Sololearn – It is an app offering mobile code and programming tutorials, as well as
specialized AI and machine learning content.
– ServiceTitan – It is a software platform, allowing home-service businesses to
manage businesses and improve customer service.
– Teamable – It is an employee referral and diversity hiring platform transforming
social networks into high-performance talent pools.
– Joomag – It is an all-in-one platform, offering integrated solutions for content
marketing, corporate communications, digital publishing and sales.
It is widely believed that the IT industry will be a dominant player for the Armenian
economy, which will create new wealth for generations to come.

2.5 Established IT Companies


Currently, 650 established IT companies operate in Armenia, of which 202 are foreign owned [4]. The majority of the foreign-owned companies are product development centers for their head offices. A number of established IT companies have a presence in Armenia, including National Instruments, Cisco, Synopsys, Inc. (NASDAQ: SNPS), VMware, TeamViewer and Oracle, among many other internationally reputed companies. In general, these companies serve global markets through the export channels of their parent companies.

2.6 Support Organizations


The Government of Armenia established EIF, one of the two state-promoted IT development organizations, in 2002. The goal of this organization is to implement and co-ordinate public policy for IT development. The organization also functions as a bridge between a number of local start-ups, entrepreneurs, IT companies and global markets.
UITE is a membership-based business association which was established in 2000. The aim of this organization is to protect the interests of the IT sector as well as to promote, locally and globally, the IT services provided by member SMEs, start-ups and tech-entrepreneurs.

2.7 IT Industry, Academia and Government


According to the Global Innovation Index, Armenia is ranked 61st among 126 countries with a score of 32.80. Armenia lags behind Georgia (35.00) and Russia (37.90) but is ahead of Kazakhstan (31.40) and Azerbaijan (30.20). Collaboration among the IT industry, universities and the government creates synergy for students as well as employers. The concept of collaboration among these institutions resembles the triple-helix model, which views innovation and economic development as resting largely on the role of the university and on the harmonisation of elements from university, industry and government to generate new social and institutional formats [5]. It is worth mentioning that Armenia is home to six universities that provide IT education and hosts a number of development centers, including the Innovative Solutions and Technologies Center, Samsung Learning Center, Microsoft Innovation Center, National Engineering Laboratories and the Armenian-Indian Center for Excellence, among others.

2.8 Obstacles
A disintegrated regional ecosystem, lack of access to financing and corruption create obstacles for the development of start-ups and IT companies in Armenia, which is in large part due to regional conflicts and an inefficient economic system. However, the difficult environment does not prevent the IT industry from growing. The number of IT companies, including start-ups, has increased 250 times since the independence of Armenia, growing to 800 companies with a total turnover of 730.2 million USD, excluding turnover generated by ISPs. In 2017, the IT industry grew by 25%, reaching 730.2 million USD or 6.25% of total GDP in Armenia.

3 Conclusion

This article provided a brief description of the Armenian IT industry, in which 2.5% of the total workforce is employed. The IT industry is considered a priority by the Government of Armenia, which has taken effective steps to improve the quality of specialized education and to develop infrastructure for local and foreign IT companies as well as start-ups. In order to support its IT industry, the government has defined a 0% profit tax for the first 3 years of operation. Given the availability of a high-quality workforce along with improvements in the investment climate, this industry promises a high return and can spur the development of other industries. The country exported IT products and services worth $338 million to the USA, the EU, Russia and other countries in 2017. Armenian IT companies mainly specialize in software development, semiconductor design, multimedia and web design. In spite of being recognized as a lucrative place for the development of IT products and services, a number of challenges still remain as a result of the country’s location and geopolitics.
References
1. EIF: State of the Industry Report
2. FocusEconomics: Armenia Economic Outlook
3. IDC Research Consultancy
4. Formica, P.: The innovative coworking spaces of 15th-century Italy. Harvard Bus. Rev. (2016)
5. Etzkowitz, H.: Triple Helix Conference, Daegu, Korea (2017)
A Study on the Inspection of Fundus Retinal Picture

K. Sundaravadivu, N. Hariprasad, A. Sadeesh Kumar, and N. Siva Balan

Department of Electronics and Instrumentation Engineering, St. Joseph’s College of Engineering, Chennai 600 119, Tamilnadu, India
hodeiestaffaffairs@stjosephs.ac.in

Abstract. The eye is an essential organ in humans, responsible for collecting and translating light signals into sensory pictures. Illness in the eye usually affects the vision scheme; therefore, the illness should be detected and cured at an early stage. At the medical level, illness in the eye is assessed with Retinal Pictures. Because of its importance, a notable number of computer-based evaluation tools have been proposed and executed to inspect the FRP in order to review a class of retinal pictures. This work proposes an image-processing technique to examine illness in the optic disc with a combination of thresholding and segmentation procedures. The thresholding is executed with Enhanced-Firefly-Algorithm based Otsu’s, and the segmentation is realized using the Distance-Regularized-Level-Set (DRLS). This procedure is implemented and validated on the Rim-One database with various levels of Optic-Disc (OD) pictures recorded with a Fundus-Camera. The advantage of the proposed method is confirmed with a relative examination between the mined OD and the Ground-Truth (GT). The results of this study verify that the implemented technique offers improved outcomes on Rim-One with better Picture-Quality-Parameters (PQP).

Keywords: Firefly-algorithm · Retinal image · Optic disc · DRLS · Assessment

1 Introduction

Due to its significance, a considerable number of medical picture appraisal procedures have recently been discussed by the research society. The new techniques are tested on real-time clinical datasets as well as on benchmark datasets; a procedure that works on a benchmark database will also work well on clinical pictures. Recently, considerable efforts have been taken by researchers to propose accurate solutions for bio-medical image assessment. The bio-imaging practice plays an essential function in the assessment of crucial interior organs and in recognizing the infected sector precisely in order to plan the probable handling actions [1–5].
Due to its function in illness forecasting, a selection of picture evaluation techniques has been proposed and executed by experts [6–8]. Inspection of a gray-scale image is fairly uncomplicated and needs far fewer computational efforts due to its straightforward pixel allocation. However, inspection of an RGB-scale image
always requires multifarious computations due to its Red, Green and Blue pixel groups. Because of this complication, a selection of procedures is employed by researchers to pre-process RGB images and improve the information available throughout the appraisal. The current work aims to employ an Image-Evaluation-System (IES) to analyse the retinal optic-disc (OD) section from a fundus image. Usually, to categorize a defect in the eye, an imaging practice is followed. This system records the eye sections with a specialized camera, called the Fundus-camera, and these images are then examined by the doctor or by the dedicated tool existing in eye-clinics to recognize the eye abnormality and plan an appropriate treatment procedure [9, 10].
In this work, assessment of the retinal OD is considered and the benchmark OD database called Rim-One [11, 12] is adopted to check the performance of the implemented picture-processing process. In the literature, a number of conventional and recent soft-computing based retinal OD assessment procedures have previously been discussed and executed [7, 8]. The current works on OD inspection verify that a two-step process will always assist in attaining an improved outcome compared to a single-step process. Hence, in this paper, the amalgamation of RGB multi-thresholding and segmentation is incorporated to improve the outcome throughout the retinal OD assessment. The RGB multi-thresholding is applied with the Modified-Firefly-Algorithm (MFA) [13, 14] based Otsu’s between-class variance, and the segmentation is executed with the DRLS segmentation [15].
The proposed investigational task is implemented with Matlab 7 and the results of this approach are confirmed by considering the vital Picture-Quality-Parameters (PQP), namely Sensitivity (SE), Specificity (SP), Accuracy (AC) and Precision (PR). These standards are calculated by evaluating the extracted OD segment against the OD of the Ground-Truth (GT). This database includes a sum of five GTs for each test image. Ultimately, the average of these PQP values is considered for the verification of the performance of the IES. Here, the implemented imaging system is very proficient and offers improved values of PQP, such as SE (98.11%), SP (97.89%), AC (98.22%) and PR (98.57%).
The other sections are arranged as follows: Sect. 2 presents the methodology, Sect. 3 outlines the results and discussion, and Sect. 4 provides the conclusion of the proposed approach.

2 Methodology

In the literature, considerable methods are available to execute multi-thresholding and segmentation tasks. In this paper, the variety of phases depicted in Fig. 1 is considered to investigate the retinal OD segment of Rim-One.
2.1 Rim-One Database


The database is accessible from [11]; a solitary retinal OD is obtainable per image in edition-1, and two ODs are presented in edition-2. In this work, the OD of edition-1 is considered for the assessment, and this database has ODs registered in a variety of categories, namely Deep, Early, Moderate, Normal and OHT [12]. Every image is linked with five GTs, namely GT1 to GT5. Due to its scientific importance, this dataset has been widely used by researchers to appraise their image-examination trials [7, 8].

2.2 Pre-processing
Otsu’s multi-thresholding was introduced in 1979 to improve gray-scale images [16]. This process assists in finding the finest thresholds by maximizing the Between-Class-Variance (BCV) of the picture pixels. Due to this advantage, the scheme has been widely adopted to improve RGB images [17].
In the RGB class, let L indicate the total number of intensity levels [0, 1, 2, …, L−1]. Then, the probability distribution $AC_i$ can be written as:

$$AC_i = \frac{h_i^C}{N}, \qquad \sum_{i=0}^{L-1} AC_i = 1 \qquad (1)$$

where $i$ is a particular intensity level in the range $0 \le i \le L-1$ for the RGB component $C = \{R, G, B\}$, $N$ is the total number of pixels, and $h_i^C$ is the number of pixels at intensity level $i$ in component $C$. A complete discussion of Otsu's function is available in [18]. Otsu's BCV of each component is defined as:

$$\left(\sigma_B^C\right)^2 = \sum_{j=1}^{m} w_j^C \left(\mu_j^C - \mu_T^C\right)^2 \qquad (2)$$

where $w_j^C$ is the probability of occurrence of class $j$. The Th-level thresholding is reduced to an optimization task of searching for the thresholds $t_j^C$ that maximize $J_{\max}$ of each picture component $C$, defined as:

$$J_{\max} = \varphi^C = \max_{1 < t_j^C < L-1} \left(\sigma_B^C\right)^2\!\left(t_j^C\right) \qquad (3)$$
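To make the objective of Eqs. (1)–(3) concrete, the following is a minimal Python/NumPy sketch of the between-class variance for one colour channel; the function name, the 256-bin histogram input and the handling of empty classes are our illustrative assumptions and not the authors' Matlab code. The firefly search described next would maximize this value over candidate threshold triples.

    import numpy as np

    def between_class_variance(hist, thresholds):
        """Otsu between-class variance (Eq. 2) for one colour channel.
        hist: histogram of the channel; thresholds: sorted cut points,
        e.g. three values for tri-level thresholding."""
        prob = hist.astype(float) / hist.sum()          # Eq. (1): AC_i
        levels = np.arange(len(hist))
        mu_T = (prob * levels).sum()                    # global mean of the channel
        edges = [0] + list(thresholds) + [len(hist)]
        sigma_b = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            w_j = prob[lo:hi].sum()                     # class weight w_j
            if w_j > 0:
                mu_j = (prob[lo:hi] * levels[lo:hi]).sum() / w_j
                sigma_b += w_j * (mu_j - mu_T) ** 2     # w_j (mu_j - mu_T)^2
        return sigma_b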

[Fig. 1. Different steps involved in the proposed tool: fundus retinal picture → EFA + Otsu's thresholding → DRLS segmentation → comparison of OD with GT → assessment of PQP and validation.]

In this paper, the identification of Otsu's $J_{\max}$ is completed by means of the Brownian-walk FA discussed in [17]. In this FA, the exploration procedure is guided by the Brownian-walk (BW) policy, and its mathematical model is given below:

$$X_i^{t+1} = X_i^t + \beta_0\, e^{-\gamma d_{ij}^2}\left(X_j^t - X_i^t\right) + \alpha_1 \cdot \operatorname{sign}(\mathrm{rand} - 1/2) \otimes \mathrm{BW} \qquad (4)$$

Details of the FA considered in this paper can be found in [17]. The FA parameters are assigned as follows: total fireflies = 30, iteration number = 2000, search dimension = 3, and stopping parameter = $J_{\max}$. The above process is used as the pre-processing practice, which improves the visibility of the OD based on Otsu's three-level thresholding procedure [7].
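As an illustration of how the Brownian-walk update of Eq. (4) can be realized, the Python sketch below performs one swarm update; the parameter defaults, the Gaussian model of the BW term and the function name are assumptions made for illustration, not the authors' implementation.

    import numpy as np

    def firefly_step(X, brightness, beta0=1.0, gamma=1.0, alpha1=0.1, L=255, rng=None):
        """One Brownian-walk firefly update (Eq. 4) over a swarm X of shape
        (n_fireflies, dim); brightness holds the Otsu objective of each firefly.
        Parameter values here are illustrative defaults."""
        rng = np.random.default_rng() if rng is None else rng
        n, dim = X.shape
        X_new = X.copy()
        for i in range(n):
            for j in range(n):
                if brightness[j] > brightness[i]:       # move firefly i towards brighter j
                    d2 = np.sum((X[j] - X[i]) ** 2)     # squared distance d_ij^2
                    beta = beta0 * np.exp(-gamma * d2)  # attractiveness term
                    bw = rng.standard_normal(dim)       # Brownian-walk perturbation
                    step = alpha1 * np.sign(rng.random(dim) - 0.5) * bw
                    X_new[i] = X_new[i] + beta * (X[j] - X[i]) + step
        return np.clip(X_new, 1, L - 1)                 # keep thresholds inside [1, L-1]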

2.3 DRLS Segmentation

This work employs the DRLS segmentation of Li et al. [15]. DRLS works based on an energy minimization procedure, articulated as:

$$\mathcal{R}_e(\phi) = \int_{\Omega} e\left(|\nabla \phi|\right)\, d\Omega \qquad (5)$$

where $e$ is the energy density function with $e : [0, a] \rightarrow \mathbb{R}$.



In this work, an adaptable contour is allowed to recognize all probable pixel values connected to the irregular segment existing in the picture. After identifying all probable pixel clusters, it extracts the section within the converged contour for the evaluation practice.

2.4 Assessment

The merit of the IES is established by computing the PQP as discussed in [4]. In this paper, PQPs such as SE, SP, AC and PR are used to appraise the advantage of the proposed method [19, 20]. The necessary PQP values are calculated by comparing the extracted OD against the five GTs obtainable in Rim-One. Later, the average values of these PQPs are recorded individually for each picture group, and based on these values the advantage is established.
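For reference, a minimal Python sketch of how SE, SP, AC and PR can be computed from a segmented OD mask and one GT mask is given below; the function name is illustrative, and in the paper the values are averaged over the five GTs of each image.

    import numpy as np

    def picture_quality_parameters(segmented, ground_truth):
        """SE, SP, AC and PR computed from two binary OD masks of equal shape."""
        seg = segmented.astype(bool)
        gt = ground_truth.astype(bool)
        tp = np.sum(seg & gt)        # OD pixels found in both masks
        tn = np.sum(~seg & ~gt)      # background agreed by both masks
        fp = np.sum(seg & ~gt)       # OD predicted where GT says background
        fn = np.sum(~seg & gt)       # OD pixels missed by the segmentation
        se = tp / (tp + fn)                    # sensitivity
        sp = tn / (tn + fp)                    # specificity
        ac = (tp + tn) / (tp + tn + fp + fn)   # accuracy
        pr = tp / (tp + fp)                    # precision
        return se, sp, ac, pr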

3 Results and Discussion

This section presents the experimental outcome attained with the proposed IES. All these works are realized with Matlab 7.
Figure 2 shows the example test images considered for the initial assessment. They are from the Deep (D) class; Fig. 2(a) presents the pseudo-name and Fig. 2(b) shows the examination image. Later, a tri-level threshold with the MFA and Otsu is implemented on this imagery, and the pre-processed test images are shown in Fig. 2(c). In these images, the OD segment is separated from the blood vessels and the background due to the pre-processing procedure. The DRLS is then implemented to extract the OD section.
The OD examination consists of procedures such as test-image thresholding, DRLS-based image segmentation and validation of the OD against the GT picture. The pre-processing step enhances the test image with a suitable image-processing approach. Later, the OD section is extracted using the DRLS segmentation. Finally, a relative assessment between the extracted OD and the GT is performed to confirm the advantage of the proposed technique.
Figure 3 presents the example test images obtainable in Rim-One for the other illness cases considered in this manuscript, in which Fig. 3(a) names the sickness and Fig. 3(b)–(d) show the sample images. Similar pre-processing and segmentation techniques are implemented for these images, and the necessary PQP are calculated through a comparative study with the GTs.
Figure 4 presents the outcome attained with the DRLS process. Initially, a bounding box is placed on the OD segment of the pre-processed image manually by choosing the X- and Y-axis values. As the iterations progress, the box is permitted to converge towards the OD, and finally it extracts the OD segment. Later, the OD fragment is compared against each GT shown in Figs. 4(f) to 4(j), the required PQP are computed, and the corresponding values are presented in Table 1. A similar process is then implemented for the other test pictures, and the corresponding average PQP for each illness class are presented in Table 2.

[Fig. 2. Example test images of the Deep case (D13, D38, D48, D69, D103): (a) pseudo-name, (b) test image, (c) pre-processed image.]

From the above table, it can be noted that the average PQP values obtained for the Rim-One dataset are as follows: SE (98.06%), SP (98.89%), AC (97.67%) and PR (98.78%). From these results, it is obvious that these values are >97.67%, which substantiates that the proposed approach is highly capable for the assessment of fundus retinal images. In future, this technique can be used to inspect clinical images. Further, instead of the proposed technique, suitable Deep-Learning and Machine-Learning procedures can be implemented to examine the fundus retinal pictures.

[Fig. 3. Example test images of a variety of disease classes (Early, Moderate, Normal, OHT).]

[Fig. 4. Result attained through the DRLS segmentation: (a) pre-processed D13 picture, (b) initial bounding box, (c) converged curve, (d) extracted OD, (f)–(j) GT1–GT5.]

Table 1. PQP values obtained for the D13 picture

| Image   | SE     | SP     | AC      | PR     |
|---------|--------|--------|---------|--------|
| GT1     | 0.9718 | 0.9716 | 0.9627  | 0.9621 |
| GT2     | 0.9638 | 0.9722 | 0.9552  | 0.9704 |
| GT3     | 0.9503 | 0.9694 | 0.9594  | 0.9727 |
| GT4     | 0.9599 | 0.9732 | 0.95605 | 0.9274 |
| GT5     | 0.9726 | 0.9704 | 0.9644  | 0.9718 |
| Average | 0.9637 | 0.9714 | 0.9596  | 0.9609 |

Table 2. PQP computed for the Rim-One database

| Case     | SE     | SP     | AC     | PR     |
|----------|--------|--------|--------|--------|
| Deep     | 0.9789 | 0.9933 | 0.9745 | 0.9932 |
| Early    | 0.9804 | 0.9904 | 0.9638 | 0.9864 |
| Moderate | 0.9844 | 0.9864 | 0.9825 | 0.9855 |
| Normal   | 0.9805 | 0.9855 | 0.9806 | 0.9837 |
| OHT      | 0.9786 | 0.9887 | 0.9821 | 0.9904 |
| Average  | 0.9806 | 0.9889 | 0.9767 | 0.9878 |

4 Conclusion

This work implements a modern imaging approach based on MFA and Otsu based pre-processing and DRLS based segmentation. The standard retinal OD image database called Rim-One is considered for the examination, and the investigational work is implemented with Matlab 7. In this work, the Rim-One images of the Deep, Early, Moderate, Normal and OHT classes are separately examined, and the PQP values such as SE, SP, AC and PR are then computed based on the relative examination between the extracted OD and the GTs existing in the dataset. The results of this study confirm that this imaging system is competent and offers an average PQP of >97.67%.

References
1. Dey, N., Rajinikanth, V., Ashour, A.S., Tavares, J.M.R.S.: Social group optimization
supported segmentation and evaluation of skin melanoma images. Symmetry 10(2), 51
(2018). https://doi.org/10.3390/sym10020051
2. Raja, N.S.M., Kavitha, N., Ramakrishnan, S.: Analysis of vasculature in human retinal
images using particle swarm optimization based Tsallis multi-level thresholding and
similarity measures. Lecture Notes in Computer Science, vol. 7677, pp. 380–387 (2012)

3. Vaishnavi, G.K., Jeevananthan, K., Begum, S.R., Kamalanand, K.: Geometrical analysis of
schistosome egg images using distance regularized level set method for automated species
identification. J. Bioinform. Intell. Control 3(2), 147–152 (2014)
4. Rajinikanth, V., Raja, N.S.M., Kamalanand, K.: Firefly algorithm assisted segmentation of
tumor from brain MRI using Tsallis function and Markov Random Field. J. Control Eng.
Appl. Inform. 19(3), 97–106 (2017)
5. Rajinikanth, V., Kamalanand, K.: Advances in Artificial Intelligence Systems. Nova Science
Publishers Inc., USA
6. Khan, M.W., Sharif, M., Yasmin, M., Fernandes, S.L.: A new approach of cup to disk ratio
based glaucoma detection using fundus images. J. Integr. Des. Process Sci. 20(1), 77–94
(2016). https://doi.org/10.3233/jid-2016-0004
7. Shree, T.D.V., Revanth, K., Raja, N.S.M., Rajinikanth, V.: A hybrid image processing
approach to examine abnormality in retinal optic disc. Procedia Comput. Sci. 125, 157–164
(2018). https://doi.org/10.1016/j.procs.2017.12.022
8. Sudhan, G.H.H., Aravind, R.G., Gowri, K., Rajinikanth, V.: Optic disc segmentation based
on Otsu’s thresholding and level set. In: International Conference on Computer Commu-
nication and Informatics (ICCCI), pp. 1–5. IEEE (2017). https://doi.org/10.1109/iccci.2017.
8117688
9. Dey, N., Roy, A.B., Das, A., Chaudhuri, S.S.: Optical cup to disc ratio measurement for
glaucoma diagnosis using Harris corner. In: Third International Conference on Computing
Communication & Networking Technologies (ICCCNT). IEEE (2012). https://doi.org/10.
1109/icccnt.2012.6395971
10. Almazroa, R., Burman, K., Raahemifar, K., Lakshminarayanan, V.: Optic disc and optic cup
segmentation methodologies for glaucoma image detection: a survey. J. Ophthalmol. 2015
(2015). https://doi.org/10.1155/2015/180972. Article ID 180972, 28 pages
11. Fumero, F., Alayon, S., Sanchez, J.L., Sigut, J., Gonzalez-Hernandez, M.: RIM-ONE: an
open retinal image database for optic nerve evaluation. In: 24th International Symposium on
Computer-Based Medical Systems (CBMS), pp. 1–6. IEEE (2011). https://doi.org/10.1109/
cbms.2011.5999143
12. http://medimrg.webs.ull.es/research/retinal-imaging/rim-one/
13. Roopini, I.T., Vasanthi, M., Rajinikanth, V., Rekha, M., Sangeetha, M.: Segmentation of
tumor from brain MRI using Fuzzy entropy and distance regularised level set. In: Nandi, A.,
Sujatha, N., Menaka, R., Alex, J. (eds.) Computational Signal Processing and Analysis.
Lecture Notes in Electrical Engineering, vol. 490, pp. 297–304. Springer, Singapore (2018)
14. Rajinikanth, V., Satapathy, S.C., Dey, N., Vijayarajan, R.: DWT-PCA image fusion
technique to improve segmentation accuracy in brain tumor analysis. In: Anguera, J.,
Satapathy, S., Bhateja, V., Sunitha, K. (eds.) Microelectronics, Electromagnetics and
Telecommunications. Lecture Notes in Electrical Engineering, vol. 471, pp. 453–462.
Springer, Singapore (2018)
15. Li, C., Xu, C., Gui, C., Fox, M.D.: Distance regularized level set evolution and its
application to image segmentation. IEEE Trans. Image Process. 19(12), 3243–3254 (2010).
https://doi.org/10.1109/TIP.2010.2069690
16. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man
Cybern. 9(1), 62–66 (1979)
17. Rajinikanth, V., Couceiro, M.S.: RGB histogram based color image segmentation using
firefly algorithm. Procedia Comput. Sci. 46, 1449–1457 (2015)

18. Raja, N.S.M., Rajinikanth, V., Latha, K.: Otsu based optimal multilevel image thresholding
using firefly algorithm. Model. Simul. Eng. 2014 (2014). https://doi.org/10.1155/2014/
794574. Article ID 794574, 17 pages
19. Rajinikanth, V., Satapathy, S.C., Fernandes, S.L., Nachiappan, S.: Entropy based
segmentation of tumor from brain MR images–a study with teaching learning based
optimization. Pattern Recogn. Lett. 94, 87–95 (2017). https://doi.org/10.1016/j.patrec.2017.
05.028
20. Grgic, S., Grgic, M., Mrak, M.: Reliability of objective picture quality measures. J. Electric.
Eng. 55(1–2), 3–10 (2004)
Accelerating Block-Circulant Matrix-Based Neural Network Layer on a General Purpose Computing Platform: A Design Guideline

Krittaphat Pugdeethosapol, Zhao Jin, Daniel Rider, and Qinru Qiu

Syracuse University, Syracuse, NY 13244, USA
{kpugdeet,zjin04,dprider,qiqiu}@syr.edu

Abstract. Deep neural networks (DNNs) have become a powerful tool and enabled state-of-the-art accuracy on many challenging tasks. However, large-scale DNNs consume large amounts of both computational time and storage space. To optimize and improve the performance of the network while maintaining the accuracy, the block-circulant matrix-based (BCM) algorithm has been introduced. BCM utilizes the Fast Fourier Transform (FFT) with block-circulant matrices to compute the output of each layer of the network. Unlike conventional pruning techniques, the network structure is maintained while using the BCM. Compared to a conventional matrix implementation, the BCM reduces the computational complexity of a neural network layer from O(n^2) to O(n^2/k), and it has been proven to be highly effective when implemented using customized hardware, such as FPGAs. However, its performance suffers from the overhead of FFT and matrix reshaping on general purpose computing platforms. In certain cases, using the BCM does not improve the total computation time of the networks at all. In this paper, we propose a parallel implementation of the BCM layer, and guidelines that generally lead to better implementation practice are provided. The guidelines run across popular implementation languages and packages including Python, numpy, intel-numpy, tensorflow, and nGraph.

Keywords: Block-circulant matrix · Deep learning · Acceleration · Parallel computing

1 Introduction

In the past few years, driven by increasing amounts of data and processing speed, Deep Neural Networks (DNNs) have been able to deliver impressive results for many complex and challenging problems. Particularly, large-scale DNNs have significantly enhanced object recognition accuracy and led a revolution in many real-world applications, such as automatic machine translation [1], self-driving systems [2], and drug discovery [3]. The resurgence of neural networks has attracted both academia and industry in evaluation, improvement and promotion.
Deep neural networks consist of multiple layers with various parameters and thousands of neurons. Recent research has proven that the depth of the DNN structure is crucial to achieving outstanding accuracy [4]. As a result, large-scale DNNs require remarkable amounts of computation and memory. Driven by this challenge, more and

more techniques have been proposed to compress deep neural network size with a
negligible accuracy loss. One strategy is the block-circulant matrix-based (BCM) al-
gorithm [5], a principled approach utilizing Fast Fourier Transform (FFT) and block-
circulant matrices to reduce both computational and memory complexity. Compared to
other compression techniques such as weight pruning [9], BCM algorithm has three
main advantages. First, it allows us to derive a tradeoff between accuracy and accel-
 
n2
eration. Second, the BCM algorithm reduces storage complexity from Oðn2 Þ to O k
by compressing the weight matrix into k dense vector, whereas conventional weight
pruning gives a sparse weight matrix that requires additional memory footprint for
indexing. Lastly, BCM algorithm maintains the regular network structure and retains a
rigorous mathematical foundation on a compression ratio and accuracy [5].
In prior work, the BCM algorithm has only been evaluated on embedded platforms due to their portability, versatility, and energy efficiency [5]. We aim to answer two remaining questions. First, can the BCM algorithm be implemented efficiently on software-based platforms, especially in Python, which is the most popular programming language used for deep learning? Second, how should the BCM algorithm be configured to balance the tradeoff between accuracy and compression/acceleration?
In this paper, we propose guidelines to help users implement the BCM algorithm and achieve the best performance. To answer the two questions, we evaluate the performance of the algorithm in Python using the numpy, intel-numpy, tensorflow, and nGraph packages. Additionally, we design a parallel BCM algorithm that effectively utilizes multiple cores in the target systems.

2 Related Works

In the past decade, numerous techniques have been proposed to compress neural networks. These include structured weight matrices [6, 7], parameter pruning [8–10] and quantization [11, 12]. Recently, weight pruning methods have become more and more popular. Although weight pruning can achieve an amazing compression ratio, the network structure and weight storage after pruning become irregular; hence, indexing is required, which weakens the performance improvement. Specifically, when implemented in an embedded system, it requires customized hardware capable of loading sparse matrices and/or performing sparse matrix-vector operations [13]. Inherently, irregular memory access and the extra storage footprint reduce the speed of weight pruning.
Frequency-domain operation was first proposed by LeCun to accelerate the computations of the convolution layer by replacing the convolution operation with element-wise multiplication in the frequency domain [14]. No weight compression was considered in [14]. The circulant weight matrix was first proposed in [6] in 2015 as a means to reduce the storage complexity of fully connected neural networks. By compressing a weight matrix into a circulant weight matrix, it reduces the space complexity from O(d^2) to O(d). As a property of the circulant matrix, the matrix-vector multiplication (between the weights and the inputs) can be done as element-wise vector-vector multiplication in the frequency domain, and hence reduces the time complexity from O(d^2) to O(d). Unlike conventional weight pruning, the circulant weight matrix has a dense structure and it
could be used to optimize both speed and space. FFT is used to transform weights and
inputs to frequency domain.
For very large weight matrices, the circulant matrix approach provides a very significant compression ratio, but it also leads to considerable quality degradation of the neural network. The block-circulant weight matrix was first proposed by Ding et al. [5] as a way to balance the storage complexity and neural network quality during the compression. The authors also proposed CirCNN to implement BCM-based Deep Neural Networks on hardware, such as ASICs and FPGAs. With the customized pipeline structure, the FFT and element-wise operations achieve their best performance on the customized hardware implementation. However, the efficiency of the BCM on general purpose computing platforms, which are still the norm in the machine learning community, has not been studied.
In this paper, we consider all the potential overheads in software and propose a parallel design of the block-circulant matrix-based algorithm for general purpose computing platforms. We evaluate its performance on popular deep learning frameworks/packages and provide guidelines that can generally lead to better implementations.

3 Background of Block-Circulant Weight Matrix

The block-circulant matrix-based algorithm can be applied to both Fully Connected (FC) and Convolutional (CONV) layers. Since the benefit of compression is more noticeable for FC layers, in this work we focus our discussion on FC layers. Similar results can be extended to CONV layers as well.
In FC layers, a weight matrix $W \in \mathbb{R}^{m \times n}$ of size $m \times n$ is partitioned into 2D blocks of square submatrices, where each submatrix is a circulant matrix. After partitioning, there are $p \times q$ blocks, where $p = m \div k$ and $q = n \div k$, and $k$ represents the size of the square submatrices (the block size). The input vector $X \in \mathbb{R}^{l}$ of size $l$ is also partitioned into $r$ blocks, where $r = l \div k$. Figure 1 shows an example of the partitioned input and weight matrices where $m = 6$, $n = 9$, $l = 9$, and $k = 3$.
Following the partitioning, the weight matrix becomes $W = [W_{ij}]$, $i \in \{1,\dots,p\}$, $j \in \{1,\dots,q\}$, and the input becomes $X = [x_1^T, x_2^T, \dots, x_q^T]^T$. As a result, the output of each block row can be calculated as:

$$a_i = \sum_{j=1}^{q} W_{ij}\, x_j \qquad (1)$$

Fig. 1. An illustration of the partitioned weight and input matrices in FC layer.

where $a_i \in \mathbb{R}^{k}$ is an output column vector. According to the circulant convolution theorem [15], the circulant submatrix $W_{ij}$ is fully defined by its first row, the vector $w_{ij}$, as shown in Fig. 1. The output of each block row can then be calculated as:

$$a_i = \mathrm{IFFT}\!\left(\sum_{j=1}^{q} \mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j)\right) \qquad (2)$$

where $\circ$ denotes element-wise multiplication and FFT denotes the Fast Fourier Transform. An illustration of the BCM calculation is shown in Fig. 2.
By using the BCM algorithm, we can reduce the storage complexity from O(mn) to O(pqk). Since we only need to store FFT(w_ij) for each submatrix, this is equivalent to O(n) for small p and q values. Additionally, the computational complexity of an FC layer is reduced from O(n^2) to O(n log n), and that of a CONV layer from O(WHr^2CP) to O(WHQ log Q), where Q = max(r^2C, P).

Fig. 2. An illustration of the block-circulant matrix-based calculation [5].
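As a concrete illustration of Eq. (2) for a single block, the Python/NumPy sketch below multiplies one circulant block by a vector in the frequency domain and checks the result against the explicit matrix product. Here the defining vector is used as the first column of the block, so the indexing convention may differ slightly from Fig. 1; the code is our own sketch, not the paper's implementation.

    import numpy as np

    def circulant_block_matvec(w, x):
        """Frequency-domain product of a k x k circulant block with a length-k
        vector x, in the spirit of Eq. (2). w is the defining vector of the
        block (taken here as its first column)."""
        return np.fft.irfft(np.fft.rfft(w) * np.fft.rfft(x), n=len(x))

    # Sanity check against the explicit circulant matrix (column-wise construction).
    k = 8
    rng = np.random.default_rng(0)
    w, x = rng.standard_normal(k), rng.standard_normal(k)
    W = np.stack([np.roll(w, c) for c in range(k)], axis=1)  # column c = w shifted by c
    assert np.allclose(W @ x, circulant_block_matvec(w, x))

Only the non-redundant half of the spectrum (k/2 + 1 values from the real FFT) needs to be stored, which is what makes the O(pqk) storage figure above attainable in practice.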



3.1 Acceleration for General Purpose Computing Platforms


Despite its success in hardware implementation, the BCM based approach has not been
widely adopted by machine learning community that works mainly on general purpose
computing platforms. This is because, compared to matrix multiplication, which is
highly optimized for multi-core systems in many programming paradigms, FFT, IFFT,
and element-wise multiplication are not nearly optimized. This significantly affects the
performance of the BCM algorithm. In our investigations, we ran experiments com-
paring matrix multiplication with the BCM algorithm by using different numbers of
CPU cores.
Our implementation is based on Python programming language with numpy, intel-
numpy, tensorflow, and nGraph packages. We ran the experiments with one FC layer
which contains 4,096 hidden neurons. A batch size of 1024, and a block size of 128.
Figure 3 shows the results of matrix multiplication and BCM algorithm with different
number of CPU cores.

Fig. 3. Total time used with different numbers of CPU cores. Left panel: matrix multiplication results. Right panel: block-circulant matrix-based algorithm results. The y-axis displays the time used in milliseconds, the x-axis shows the number of CPU cores (the maximum is 8 since the machine has 4 physical cores and 8 threads), and each label represents a different package.

As shown in Fig. 3, the time used in matrix multiplication decreases as the number of cores increases. This can be explained by the utilization of multiple cores in each package through either multiprocessing or multithreading. In contrast, the time used by the block-circulant matrix-based algorithm decreases only slightly in tensorflow and tensorflow+nGraph, while remaining stable in numpy and intel-numpy.
Therefore, we design a parallel block-circulant matrix-based algorithm to accelerate the computations. The key idea is to separate each computation block and run it in a different process, as each block can be calculated independently. Figure 4 illustrates the parallel block-circulant matrix-based algorithm: we partition the block-circulant matrix by row and run each partition in a different process. Once the calculations are completed, we combine and convert the matrices to get the final output.
424 K. Pugdeethosapol et al.

Fig. 4. An illustration of parallel block-circulant matrix-based algorithm.

In case there are multiple inputs, we can use data parallelism to distribute these inputs across processes such that each portion of data is assigned to a different process. The portion of data is defined as input size / number of CPU cores.
In terms of implementation, we initially used the native multithreading and multiprocessing provided in Python. However, Python has a Global Interpreter Lock (GIL) that only allows one thread to hold control of its interpreter, creating a performance bottleneck in multithreading. In contrast to multithreading, multiprocessing uses subprocesses to bypass the GIL, which allows the program to utilize multiple cores in a given machine. Nevertheless, there is overhead in spawning processes and sending data.
To improve the performance, we use Ray [16]: a simple framework that has been proven to be faster than native multiprocessing and multithreading. Additionally, Ray can be easily integrated with Python. Figure 5 presents basic sample code showing how Ray can be used to accelerate the parallel block-circulant matrix-based algorithm.

Basic Python:

    # Execute f serially.
    def f():
        time.sleep(1)
        return 1

    results = [f() for i in range(4)]

Distributed with Ray:

    # Execute f in parallel.
    @ray.remote
    def f():
        time.sleep(1)
        return 1

    ray.init()
    results = ray.get([f.remote() for i in range(4)])
Fig. 5. Example use of Ray in Python implementation [16].
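Building on the pattern of Fig. 5, the following is a condensed sketch of how Ray remote tasks could evaluate row-blocks of a BCM layer in parallel for a single input vector; the function names, tensor layout and chunking strategy are our illustrative assumptions, not the exact implementation used in the experiments.

    import numpy as np
    import ray

    @ray.remote
    def bcm_rows(w_fft_rows, x_fft, k):
        """Compute the outputs of a slice of row-blocks.
        w_fft_rows: (rows, q, k//2 + 1) pre-computed FFTs of the weight blocks.
        x_fft:      (q, k//2 + 1) FFTs of the partitioned input vector."""
        acc = np.sum(w_fft_rows * x_fft[np.newaxis, :, :], axis=1)  # sum over j (Eq. 2)
        return np.fft.irfft(acc, n=k, axis=-1)                      # one IFFT per row-block

    def bcm_layer_parallel(w_fft, x, k, n_workers=4):
        """w_fft: (p, q, k//2 + 1) FFTs of all circulant blocks; x: length q*k input."""
        x_fft = np.fft.rfft(x.reshape(-1, k), axis=-1)              # FFT of each input block
        chunks = np.array_split(w_fft, n_workers, axis=0)           # partition by row-block
        futures = [bcm_rows.remote(c, x_fft, k) for c in chunks]
        return np.concatenate(ray.get(futures), axis=0).reshape(-1) # (p*k,) output vector

    # Usage sketch: ray.init(); y = bcm_layer_parallel(w_fft, x, k=128)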



Using the parallel design and Ray, we can achieve better performance than the original block-circulant matrix-based algorithm when increasing the number of CPU cores. Figure 6 shows the results of our parallel version versus the previous version. The new parallel version achieves a stable speedup ratio with up to 4 cores, which is the number of physical cores.

Fig. 6. Comparing the original and our Ray-parallel implementation. Top-left panel: numpy. Top-right panel: intel-numpy. Bottom-left panel: tensorflow. Bottom-right panel: tensorflow+nGraph. A batch size of 1024 and a block size of 128 are used in these experiments. The y-axis displays the time used in milliseconds, and the x-axis shows the number of CPU cores.

4 Design Space Exploration of BCM

In this paper, the block-circulant matrix-based algorithm is applied to the model during the inference phase. In terms of computational complexity, the block-circulant matrix-based algorithm is faster than matrix multiplication. However, when it comes to implementation, we need to examine the overhead from IFFT, FFT, and matrix reshaping. Design parameters such as the batch size, block size, and number of CPU cores will all affect the calculation time. For some combinations, matrix multiplication may be faster than the BCM, while for others the BCM may be more effective than matrix multiplication. While increasing the block size always reduces storage and computing complexity, it also lowers the model capacity of the neural network and hence may lead to larger prediction error. It has been shown in [5] that with a compression ratio of up to 30-50x, sometimes the loss may be negligible, and the compressed models may even outperform the baseline models. However, in some cases the loss is noticeable. In general, the loss increases monotonically with the compression ratio. When focusing on speed during the inference phase, users must decide whether the accuracy loss is acceptable.

In order to choose the best model with the most efficient configuration and acceptable accuracy without exhaustively exploring the entire design space, the designer needs to know how the performance is affected by design parameters including the batch size, block size, and number of CPU cores. In this work, we designed a set of benchmark programs that characterize the performance of different configurations of the BCM in comparison to the matrix-based implementation. Guidelines for choosing the configuration of the BCM were derived from the results. These guidelines will help designers choose a configuration without having to attempt all combinations.
The study is performed on an Intel(R) Xeon(R) W-2123 CPU @3.6 GHz, which has 4 physical cores and 8 threads. Matrix multiplication and the BCM algorithm are implemented using the Python programming language with various packages. The following lists the possible choices of hardware/software configurations and design parameters that were evaluated:
• Packages: numpy, intel-numpy, tensorflow, and tensorflow+nGraph
• Number of CPU cores: 1, 2, 4, and 8
• Block size (M): 128, 256, 512, 1024, and 2048
• Batch size (N): 128, 256, 512, 1024, 2048, 4096, and 8192
The evaluation results reported in this paper are platform specific; however, the
benchmarks and methodologies can be applied to other platforms. The model that we
considered is a fully connected layer with 4,096 hidden neurons. The weights and
inputs have the size of (4096, 4096), and (number of batches, 4096), respectively.
Table 1 shows the size of inputs, blocks, and weights used in the experiments.

Table 1. Size of inputs, blocks, and weights before and after partitioning & FFT.

| Input size | Weight size before partitioning & FFT (X, Y) | Block size (M) | Weight size after partitioning & FFT (1, X/M, Y/M, M/2 + 1) |
|------------|------------------------------------------------|----------------|------------------|
| (N, 4096)  | (4096, 4096)                                   | 128            | (1, 32, 32, 65)  |
|            |                                                | 256            | (1, 16, 16, 129) |
|            |                                                | 512            | (1, 8, 8, 257)   |
|            |                                                | 1024           | (1, 4, 4, 513)   |
|            |                                                | 2048           | (1, 2, 2, 1025)  |

We assume that the block-circulant weight matrix has already been trained and the FFT of each block has been calculated; please refer to [5] for how to train a block-circulant weight matrix. For a fully connected layer whose input size and output size are X and Y, let M denote the block size; then there will be (X/M) × (Y/M) circulant blocks. Each block, after FFT, will be represented as a vector of size M/2 + 1, since we only compute the real part of the discrete Fourier transform. Overall, we represent the weight as a 4D tensor of size (1, X/M, Y/M, M/2 + 1). In Table 1, N represents the number of batches. The weight size after partitioning & FFT has 4 dimensions, namely 1, X/M, Y/M, and the size after FFT, where the 1 represents an additional dimension that helps match the number of batches in the input size during element-wise multiplication.
In each experiment, we record the time used starting from the initial step until we receive the output from the algorithm. The algorithm consists of the following six steps (a minimal code sketch of these steps is given below):

1. Reshape the input X into 4 dimensions, (N, X/M, 1, M), to match the size of the weight tensor.
2. Calculate the FFT of the input X from step 1, FFT(X); the size after this step becomes (N, X/M, 1, M/2 + 1).
3. Calculate the element-wise multiplication FFT(W) ∘ FFT(X). The output size becomes (N, X/M, Y/M, M/2 + 1).
4. Sum the output from step 3 using the formula $\sum_{i=1}^{\lceil X/M \rceil} \mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_i)$, where j indexes each of the Y/M output blocks. The output size becomes (N, Y/M, M/2 + 1).
5. Calculate the IFFT along the third dimension of the tensor from step 4 to get the output Y. The output size becomes (N, Y/M, M).
6. Reshape the output into (N, output size), where the output size is 4,096.
The dimensions that have been set to 1 are used for broadcasting, which is available in all the packages we use in our experiments, reducing and simplifying the code.
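A minimal NumPy sketch of these six steps, assuming the weight FFTs are stored in the 4D layout of Table 1, is shown below; the function and variable names are ours, and the sketch omits the multi-core parallelism discussed in Sect. 3.1.

    import numpy as np

    def bcm_fc_layer(x, w_fft, M):
        """Batched BCM fully connected layer following the six steps above.
        x:     (N, X) input batch.
        w_fft: (1, X//M, Y//M, M//2 + 1) pre-computed FFTs of the circulant blocks."""
        N = x.shape[0]
        x4d = x.reshape(N, -1, 1, M)               # step 1: (N, X/M, 1, M)
        x_fft = np.fft.rfft(x4d, axis=-1)          # step 2: (N, X/M, 1, M/2+1)
        prod = w_fft * x_fft                       # step 3: broadcast to (N, X/M, Y/M, M/2+1)
        summed = prod.sum(axis=1)                  # step 4: sum over input blocks -> (N, Y/M, M/2+1)
        y = np.fft.irfft(summed, n=M, axis=-1)     # step 5: (N, Y/M, M)
        return y.reshape(N, -1)                    # step 6: (N, Y)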

4.1 Impact of Block Size on Performance


Varying the block size results in different compression ratios and accuracies. In this experiment, we set the block size to 128, 256, 512, 1024, and 2048 to observe the performance relative to increasing block size under different configurations (i.e. number of CPU cores and batch sizes).
Single Core. Even though multiple cores/GPUs may be readily available, some specific embedded systems/machines may only have a single core. Therefore, running a deep model on such a system may require a significant amount of time and memory. The BCM-based approach is especially effective for this type of resource-constrained platform. Applying the block-circulant matrix-based algorithm reduces the amount of memory used and increases the speed. Figures 7, 8, 9 and 10 show how the inference time reduces as the block size increases.

Fig. 7. Total time used with different block sizes with a single core in the numpy package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block sizes.

Fig. 8. Total time used with different block sizes with a single core in the intel-numpy package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block sizes.

Fig. 9. Total time used with different block sizes with a single core in the tensorflow package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block sizes.

Fig. 10. Total time used with different block sizes with a single core in the tensorflow+nGraph package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block sizes.

In all packages, increasing the block size reduces the time used. However, this correlation is not linear; doubling the block size does not halve the time. Once the block size reaches 1024, the speedup ratio remains stable.
In numpy, the BCM algorithm is faster than matrix multiplication for all block sizes and for the small, medium, and large batch sizes. Meanwhile, in intel-numpy the BCM algorithm is faster when the block size is larger than 128. The same result applies to all batch sizes, with a larger difference observed at the large batch size.
In contrast to numpy and intel-numpy, tensorflow and tensorflow+nGraph are slower with the BCM algorithm than with matrix multiplication. This applies to the small batch size when the block size is less than 256, and to all block sizes for the medium and large batch sizes. The results can be explained by the overhead of computing the FFT and IFFT and of creating a session to run the calculation.
Multiple Cores. Exploiting the resources of multiple cores of the system/machine helps increase the overall performance. As mentioned earlier, we use the Ray library to implement the parallel block-circulant matrix-based algorithm. In this experiment, we set the number of cores to 4 and ran the experiments using the small, medium, and large batch sizes. Figures 11, 12, 13 and 14 show the time used as the block size increases in each package using multiple cores.
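The text does not spell out the exact decomposition used with Ray; one plausible sketch, assumed here for illustration only, splits the batch across worker processes and reuses the single-core routine from the earlier sketch.

```python
import numpy as np
import ray

ray.init(num_cpus=4)          # match the 4-core configuration used in this experiment

@ray.remote
def bcm_chunk(x_chunk, w_fft, M, out_size):
    # Each worker runs the single-core routine (bcm_forward, sketched earlier)
    # on its slice of the batch.
    return bcm_forward(x_chunk, w_fft, M, out_size)

def bcm_forward_parallel(x, w_fft, M, out_size, num_workers=4):
    chunks = np.array_split(x, num_workers, axis=0)   # split along the batch dimension
    w_ref = ray.put(w_fft)                            # share the weights via the object store
    futures = [bcm_chunk.remote(c, w_ref, M, out_size) for c in chunks]
    return np.concatenate(ray.get(futures), axis=0)   # reassemble in batch order
```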

Fig. 11. Total time used with different block sizes with multiple cores in the numpy package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block sizes.

Fig. 12. Total time used with different block sizes with multiple cores in the intel-numpy package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block sizes.

Fig. 13. Total time used with different block sizes with multiple cores in the tensorflow package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block sizes.

Fig. 14. Total time used with different block sizes with multiple cores in the tensorflow+nGraph package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block sizes.

Because each package parallelizes across multiple cores differently, different results are expected. In numpy, the BCM algorithm is faster than matrix multiplication once the block size is above 128 when the small batch size is used, and it is faster for all block sizes when the medium or large batch size is used.
With intel-numpy, matrix multiplication is always faster than the BCM algorithm for a block size of 128 and below; BCM outperforms matrix multiplication only for relatively larger blocks. This outcome can be explained by the highly optimized matrix multiplication in the Intel Math Kernel Library (MKL), which is designed specifically for Intel CPUs.
With tensorflow and tensorflow+nGraph, the break-even block size at which BCM and matrix multiplication have similar performance shifts toward larger blocks, so even larger blocks are needed to outperform matrix multiplication. As in the single-core system, BCM performance still favors smaller batch sizes: the break-even block size gets bigger when the batch size increases.

4.2 Impact of Number of CPU Cores on Performance


The parallel block-circulant matrix-based algorithm utilizes multiple cores by spawning
multiple processes. However, the amount of speed up is not linearly proportional to the
hardware resources. Increasing the number of cores beyond a certain point will cause the program to run slower because of the communication bottleneck: each process spends more time in synchronization, such that the increase in communication time outweighs the computing time saved by adding more cores. With some configurations, applying matrix multiplication is more advantageous.
In this experiment, the number of cores is set to 1, 2, 4 and 8 by using the OMP_NUM_THREADS, MKL_NUM_THREADS, and intra_op_parallelism_threads flags in each package. We ran the experiments using small, medium, and large batch sizes of 128, 1024, and 8192, respectively, in numpy, intel-numpy, tensorflow, and tensorflow+nGraph. Figures 15, 16, 17 and 18 display the performance gains as the number of cores increases in each package, using different block sizes.
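For reproducibility, these core counts are typically pinned before the math libraries are imported. The snippet below is a hedged example of how such flags can be set; the TensorFlow part follows the 1.x session API implied by the text, and the inter-op value of 1 is an illustrative choice, not the authors' setting.

```python
import os

num_cores = 4
# numpy / intel-numpy thread pools (must be set before importing numpy)
os.environ["OMP_NUM_THREADS"] = str(num_cores)
os.environ["MKL_NUM_THREADS"] = str(num_cores)

import tensorflow as tf  # TensorFlow 1.x style, matching the session-based runs

config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,
                        inter_op_parallelism_threads=1)
sess = tf.Session(config=config)
```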
In general, using the BCM algorithm yields a stable speedup ratio up to 4 cores, while in some cases it becomes slower when using 8 cores as a result of parallel slowdown: each process spends more time on communication and process spawning than the additional processing power it gains.

Fig. 15. Total time used with different numbers of CPU cores in the numpy package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, the x-axis shows different numbers of CPU cores, and the legend represents matrix multiplication and the different block sizes.

Fig. 16. Total time used with different numbers of CPU cores in the intel-numpy package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, the x-axis shows different numbers of CPU cores, and the legend represents matrix multiplication and the different block sizes.

Fig. 17. Total time used with different numbers of CPU cores in the tensorflow package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, the x-axis shows different numbers of CPU cores, and the legend represents matrix multiplication and the different block sizes.

Fig. 18. Total time used with different numbers of CPU cores in the tensorflow+nGraph package. Left panel: Small batch size (128). Middle panel: Medium batch size (1024). Right panel: Large batch size (8192). The y-axis displays the time used in milliseconds, the x-axis shows different numbers of CPU cores, and the legend represents matrix multiplication and the different block sizes.

Meanwhile, in numpy, the BCM algorithm is faster than matrix multiplication; however, when a small batch size is used with a block size of 128 and 4 cores, or a block size of 256 and 8 cores, matrix multiplication is faster. In intel-numpy, regardless of the number of cores and batch size, block-circulant is faster than matrix multiplication when the block size is larger than 128.
For tensorflow and tensorflow+nGraph, the BCM algorithm is slower than matrix multiplication when a larger batch size is used, although at the small batch size it is faster when the block size is set greater than 128.

4.3 Impact of Batch Size on Performance


Since the batch size affects the computational speed, choosing an appropriate batch size for a particular system significantly improves performance. For instance, using a larger batch size improves the computation speed on a multi-core system/machine. However, there is a saturation point beyond which increasing the batch size no longer improves the computation speed.
In contrast, using a smaller or larger batch size with a single core does not significantly affect the computational speed, since the data has to be processed one sample at a time. Figure 19 displays the total compute time as the batch size increases, while Fig. 20 shows the compute time per sample. Block sizes of 128 and 256 and CPU core counts of 1 and 4 are used in this experiment.
As shown in Fig. 19, a near-linear positive relationship can be observed between the batch size and the total computational time. However, the amount of time differs across configurations, as reflected in the different slopes.
As shown in Fig. 20, the computation time per sample does not keep decreasing as the batch size increases. However, it still decreases slightly at the beginning because of the overhead of creating the session and graph when using tensorflow or tensorflow+nGraph, and of spawning processes and sending data when using multiple cores.

Fig. 19. Total computation time. Top-left panel: 1 core with 128 block size. Top-right panel: 4
cores with 128 block size. Bottom-left panel: 1 core with 256 block size. Bottom-right panel: 4
cores with 256 block size. The y-axis displays total time used in milliseconds, and the x-axis
shows the batch size.

Fig. 20. The time used per sample. Top-left panel: 1 core with 128 block size. Top-right panel: 4 cores with 128 block size. Bottom-left panel: 1 core with 256 block size. Bottom-right panel: 4 cores with 256 block size. The y-axis displays the time used per sample in milliseconds, and the x-axis shows the batch size.

4.4 Summary of Design Guidelines


In general, using the block-circulant matrix-based algorithm in the intel-numpy package is the best choice, since intel-numpy benefits from Intel MKL, which is highly optimized for mathematical operations. However, there are certain cases, such as when the batch size is large enough and the block size is 128, where matrix multiplication is more beneficial. Another case is a single core with a small batch size, where the block-circulant algorithm in numpy is the fastest only when the block size is less than 256. Because the evaluation focuses on the inference phase, tensorflow and tensorflow+nGraph always perform slower than numpy and intel-numpy when applying the BCM algorithm, as they require time to initialize the graph and session before computing the output. When using multiple cores, a larger batch size makes better use of the parallelization of the algorithm, which speeds up the overall compute time for all samples in the batch.
Although the guideline provides the best possible combination of block size, number of cores, and batch size to achieve optimal performance, it focuses mainly on the time used to compute the algorithm. The accompanying accuracy reduction must be addressed manually in exchange for the increased compute speed.

5 Conclusion

This paper proposed a parallel design of the block-circulant matrix-based algorithm and demonstrated that this new design can achieve better performance than the previous version of the algorithm. We also provided guidelines on how to select the block size, batch size, and number of cores in certain situations in order to achieve optimal performance in the least amount of time. The guidelines cover popular implementation languages and packages, including Python, numpy, intel-numpy, tensorflow, and nGraph.

References
1. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural
networks with multitask learning. In: Proceedings of the 25th International Conference on
Machine Learning, pp. 160–167. ACM (2008)
2. Huval, B., Wang, T., Tandon, S., Kiske, J., Song, W., Pazhayampallil, J., Andriluka, M.,
Rajpurkar, P., Migimatsu, T., Cheng-Yue, R., et al.: An empirical evaluation of deep
learning on highway driving. arXiv preprint arXiv:1504.01716 (2015)
3. Burbidge, R., Trotter, M., Buxton, B., Holden, S.: Drug design by machine learning: support
vector machines for pharmaceutical data analysis. Comput. Chem. 26(1), 5–14 (2001)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–
778 (2016)
5. Ding, C., Liao, S., Wang, Y., Li, Z., Liu, N., Zhuo, Y., Wang, C., Qian, X., Bai, Y., Yuan,
G., Ma, X.: CirCNN: accelerating and compressing deep neural networks using block-
circulant weight matrices. In: Proceedings of the 50th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 395–408. ACM, October 2017

6. Cheng, Y., Yu, F.X., Feris, R.S., Kumar, S., Choudhary, A., Chang, S.F.: An exploration of
parameter redundancy in deep networks with circulant projections. In: Proceedings of the
IEEE International Conference on Computer Vision, pp. 2857–2865 (2015)
7. Saxe, A.M., Koh, P.W., Chen, Z., Bhand, M., Suresh, B., Ng, A.Y.: On random weights and
unsupervised feature learning. In: ICML, vol. 2, no. 3, p. 6, June 2011
8. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural
networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440, 3 (2016)
9. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient
neural network. In: Advances in Neural Information Processing Systems, pp. 1135–1143
(2015)
10. Luo, J.H., Wu, J., Lin, W.: ThiNet: a filter level pruning method for deep neural network
compression. In: Proceedings of the IEEE International Conference on Computer Vision,
pp. 5058–5066 (2017)
11. Cai, Z., He, X., Sun, J., Vasconcelos, N.: Deep learning with low precision by half-wave
Gaussian quantization. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 5918–5926 (2017)
12. Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using
vector quantization. arXiv preprint arXiv:1412.6115 (2014)
13. Zhu, M., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning for model
compression. arXiv preprint arXiv:1710.01878 (2017)
14. Mathieu, M., Henaff, M., LeCun, Y.: Fast training of convolutional networks through FFTs.
arXiv preprint arXiv:1312.5851 (2013)
15. Pan, V.Y.: Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer,
Boston (2012)
16. Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z.,
Paul, W., Jordan, M.I., Stoica, I.: Ray: a distributed framework for emerging AI applications.
In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI
2018), pp. 561–577 (2018)
Energy Aware Next Fit Allocation Approach
for Placement of VMs in Cloud Computing
Environment

Jyotsna Sengupta1(&), Pardeep Singh1, and P. K. Suri2


1 Department of Computer Science, Punjabi University, Patiala 147002, India
jyotsna.sengupta@gmail.com
2 Department of Computer Science and Applications, Kurukshetra University, Kurukshetra 136119, India

Abstract. Cloud computing enables IT giants to outsource their infrastructure by providing a sharable pool of computing resources. These resources consume a huge amount of energy, which not only increases running expenses but also produces CO2 emissions in the environment. Therefore, the main issue is to manage and optimize the available resources to save energy. This can best be done by dividing the physical machines into virtual machines and maintaining the number of active machines according to the dynamic workload. This process of server consolidation includes finding the overloaded hosts, selecting VMs from the hosts with excess or under load and, finally, placing them all over the available physical hosts dynamically. In this context, a novel approach for placing virtual machines has been proposed that aims to reduce energy consumption and SLA violation. Inspired by the bin packing problem, the next-fit allocation policy is tested for placing a VM over the available hosts. The suitability of a host is defined primarily on the basis of the minimum energy consumption by a VM on that host during placement; however, the search for hosts is optimized using the next-fit policy. Experiments are performed in the CloudSim simulator and the results are compared with the existing best-fit policy. The proposed approach has shown better results for the various performance metrics considered during the experiments.

Keywords: Next-Fit approach · SLA violation · Data center management

1 Introduction

As the data demands of living beings grow day by day, it has become compulsory to sustain the quality of data resources, for which high-performance computing equipment and a great deal of processing energy are required. This would not only be highly expensive but also sometimes unavailable. To fulfil these requirements, various cloud computing services are being deployed by the industry that provide features such as affordability, mobility and proper utilization of expensive infrastructure at a reasonable price.

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 436–453, 2020.
https://doi.org/10.1007/978-3-030-39442-4_33

Service providers have to maintain large-scale datacenters comprising millions of computing servers. These servers consume a high amount of electrical energy. In the year 2013, 91 billion kWh of electricity was consumed by data center equipment in the United States, which is approximately equivalent to the output power generated by 34 coal-fired power plants in a year. This consumption is projected to reach approximately 140 billion kWh per year by 2020, with an estimated price of around $13 billion per year [1]. Moreover, the various equipment in the data centers also produces heat, for which cooling devices are required; these cooling devices add around 2–5 million annually to the cost of running data centers.
The primary reason behind this huge energy consumption is improper management of the available resources in the cloud environment, such as servers, storage devices, and the network. Energy consumption has a direct relationship with server utilization, as it increases linearly with the increase in server utilization from 0% to 100%. So, the main concept behind reducing energy consumption is optimizing CPU utilization. This relationship between energy and CPU utilization can be understood from the following equations:

E ¼ PT 0 ð1Þ
 
PðU Þ ¼ Pi þ Pf þ Pi U ð2Þ

Where,
E = Energy
T′ = Time
P = Power function of utilization
Pi = Power consumption when CPU is idle i.e. 0% utilization
Pf = Power consumption while full utilization of CPU
U = current value of CPU utilization
There is one another empirical nonlinear power model defined by [2] as:
  
PðU Þ ¼ Pi þ Pf  Pi 2U  U R ð3Þ

Where, R ¼ Calibration parameter for reducing square error


For different configuration some set of experiments, need to be performed to find
best value of calibration parameter. Also, the value of Pi is approximately 0.7 Pf , i.e.
even when CPU is idle it consumed 70% of their highest energy [2]. Therefore, keeping
the minimum number of servers active with maximum utilization is the main challenge
to be handled for reducing energy consumption.
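As a worked illustration of Eqs. (1)–(3), the following Python functions evaluate the linear and nonlinear power models; the function and parameter names are illustrative only.

```python
def power_linear(u, p_idle, p_full):
    # Eq. (2): P(U) = P_i + (P_f - P_i) * U, with utilization U in [0, 1]
    return p_idle + (p_full - p_idle) * u

def power_nonlinear(u, p_idle, p_full, r):
    # Eq. (3): P(U) = P_i + (P_f - P_i) * (2U - U^R)
    return p_idle + (p_full - p_idle) * (2 * u - u ** r)

def energy(power_watts, seconds):
    # Eq. (1): E = P * T', here in joules (watt-seconds)
    return power_watts * seconds

# Example: an HP ProLiant ML110 G5 host (93.7 W idle, 135 W full, cf. Table 2 later
# in the paper) at 50% utilization gives power_linear(0.5, 93.7, 135.0) ≈ 114.4 W,
# close to the measured SPECpower value of 116 W.
```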

1.1 VM Consolidation
Virtualization can be used to handle the server utilization issue in a better way by dividing servers into various virtual machines. It not only facilitates the sharing of hardware among users but also manages large amounts of user data separately. Each instance of a virtual machine (VM) defined over a server possesses a virtual hardware and software package. All VM instances can be homogeneous or heterogeneous depending upon their application requirements.

Fig. 1. Layers in cloud architecture [3].

The architecture of the cloud environment is exhibited in Fig. 1 and has three main interfaces. At the lowest level, users interact with the tasks; at the middle level, the data center broker is responsible for defining VMs to perform these tasks; and at the topmost level, the management of VMs over the various available hosts from different data centers is performed by the VM allocation policy. When a client needs to perform a task, the VM Allocator, on the basis of the VM allocation policy, creates the VM on a suitable host and assigns the task to that VM. This policy is part of the VM consolidation process that performs dynamic load balancing among the hosts [4]. The virtualization process creates the VMs, and the VMs are managed by the consolidation process, which allows VMs to be switched among hosts without any disturbance to their running tasks. VMs are consolidated among the hosts based upon the current load carried by each host: VMs from heavily loaded hosts are migrated towards lightly loaded hosts. VM consolidation considers the suitability of a host for a VM based upon its capacity to fulfil the resource requirements and the constraints of a decision algorithm.
This dynamic management of VM placements involves the live migration of VMs, where VMs are shifted at runtime from one host to another without affecting the services availed by users through these VMs [5]. On one side, shifting workloads in the form of VMs away from overloaded hosts reduces the probability of SLA violation, because overloaded hosts put VMs on hold and increase the waiting time. On the other side, migration helps make under-loaded hosts idle by migrating all their VMs to other vacant hosts; this dynamic process of freeing hosts can then save energy by shutting those hosts down. So, dynamic or live migration of VMs is another important aspect that needs to be handled for optimal resource management [6]. The decision algorithm includes the decision of placing a VM on the physical server that will host it. This process of deciding the hosts for all the available VMs is known as VM placement, which not only includes the initial placement of VMs over the hosts but is also responsible for managing the placement of VMs at regular intervals according to load variation. Various consolidation strategies have been defined by researchers, and are still being researched, for achieving different objectives, where an effective strategy can achieve more than one objective without affecting the others.

1.2 VM Placement as Bin Packing Problem


The phenomenon of placing the VMs over the hosts can be compared with the bin-packing problem (BPP), in which hosts can be represented as bins and VMs correspond to the items. In the classical BPP, a sequence I = {i_1, i_2, i_3, ..., i_n} of items is given, where each item i_k has a size S(i_k), and the items need to be packed into the minimum number of bins. Here, the capacity of the bins and the sizes of the items to be packed are the factors mainly used to decide the maximum number of items that can be packed [7]. Figure 2 shows that an optimal approach will definitely reduce the number of bins in use.
Similarly, in the VM placement problem, various factors can be considered while allocating the VMs over hosts. We have considered the amount of energy consumed by a VM on a host as the decision factor. In VM placement there is a collection of n VMs, represented as V = {v_1, v_2, v_3, ..., v_n}, that need to be mapped over a list of t hosts, H = {h_1, h_2, h_3, ..., h_t}. Assuming the energy consumed by a VM v_i on its host h_k is e_ik, the total energy consumed would be:

E = Σ_{i=1}^{n} e_ik                            (1)

Fig. 2. Bin packing problem.



For each VM v_i, where i ∈ {1, 2, 3, ..., n}, a suitable host h_k needs to be found such that E is minimized [8]. Much research has been done, and is still ongoing, to solve the BPP using Best Fit, First Fit, Worst Fit and Next Fit allocation approaches, as sketched below. Researchers have also applied these approaches to problems similar to the BPP, such as the VM placement problem [4, 8, 10]. As applied in [4], Best Fit searches for the minimum-power-consuming host for a VM, where hosts are arranged in order of decreasing CPU utilization. In this work, the Next Fit approach of the BPP is applied primarily to optimize VM placement for lower energy consumption.
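For reference, the classic next-fit heuristic for the BPP, which the proposed placement policy adapts, can be sketched in Python as follows; this is a generic illustration, not the VM placement algorithm itself.

```python
def next_fit(items, capacity):
    # Keep filling the current bin and open a new one only when the next
    # item does not fit; earlier bins are never revisited.
    bins = [[]]
    remaining = capacity
    for size in items:
        if size <= remaining:
            bins[-1].append(size)
            remaining -= size
        else:
            bins.append([size])
            remaining = capacity - size
    return bins

# next_fit([4, 8, 1, 4, 2, 1], capacity=10) -> [[4], [8, 1], [4, 2, 1]]
```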
In this paper, the VM placement algorithm is modified using the Next Fit allocation approach and the results are verified using basic performance metrics. The prime contributions of this research paper are:
• Introducing the Next Fit allocation approach for placing the VMs over hosts.
• Implementing the proposed approach by modifying the existing approach in CloudSim.
• Analyzing the effects of the proposed approach on improving energy efficiency and reducing SLA violations in data centers.
The rest of the document is organized in five parts. The next section describes existing work in the same realm, followed by a section explaining the problem statement. A preliminary section then introduces the baseline for this research, after which the proposed approach is elaborated. Following this, the experiment and result analysis section defines the experiment configuration and performance parameters and verifies the significance of the proposed approach through tables and graphs. Finally, the conclusion of the paper is presented along with the future scope of this research.

2 Related Work

Many researchers have paid attention to optimizing energy usage, because higher energy consumption both affects the environment and increases running costs. In addition, QoS is another important factor considered, as better QoS helps retain a better relationship with clients. Most research has focused on solving this problem by applying modifications to the VM consolidation scheme, some of which are reviewed below.
Chowdhury et al. [9] considered VM placement as a bin packing problem and introduced three variants of PABFD [4] that use modified worst fit, second worst fit and first fit allocation in place of best fit allocation. A clustering technique, a modified k-means algorithm, is also applied to create clusters of VMs based upon the values of CPU utilization and allocated RAM. The highest-density cluster, i.e. the cluster with the most VMs, has its VMs allocated first, then the next densest cluster, and so on, until all the VMs are allocated. The clustering technique developed in this work performed best with the modified first fit decreasing with decreasing host algorithm (FFHDVP_C).
Damodar and Koli [10] modify the VM consolidation process by proposing new overloaded and underloaded host detection approaches. For detecting overloaded hosts, they consider SLA violation as the basic parameter: according to their method, if the VMs on a host demand more resources than the available capacity, then the host will surely violate the SLA, so it is put into the list of overloaded hosts. After using the existing VM selection technique to select VMs from those overloaded hosts and placing them on other hosts using PABFD, they declare one host as underloaded based upon minimum utilization.
Kuo et al. [11] proposed a resource-based first fit algorithm (RFFA) for assigning VMs to hosts. In this procedure, the resource requirements of the requesting VM are analyzed before assignment, and the available resources of each host are updated after the termination of the VMs assigned to it. The results of the proposed approach are compared with the resource-based worst fit algorithm (RWFA) and the resource-based best fit algorithm (RBFA). The performance evaluation shows that the RFFA scheme performed better than RWFA and RBFA in decreasing energy consumption.
Mosa and Paton [12] addressed the issue of SLA violation while conserving energy in data centers by developing a new utility-based approach. In this approach, a design model consisting of Monitoring, Analysis, Planning and Execution (MAPE) is followed. The monitoring phase finds the CPU utilization of the active hosts; the analysis phase finds alternative VM-to-host mappings using genetic algorithms; the planning phase evaluates the alternatives and finds the best VM-to-host mapping based upon a utility formula; and the final execution phase performs the necessary VM migrations and shuts down idle hosts.
Castro et al. [13] worked on two sub-problems of the VM consolidation process. For underloaded host detection they define a novel algorithm named Underload Detection (UD), and for VM placement they denote their algorithm as CPU and RAM Energy Aware (CREW). In UD, an average CPU usage is calculated using an Exponentially Weighted Moving Average (EWMA); the average CPU utilization value is then used to identify underloaded hosts. CREW is a modified PABFD that checks both the available CPU and the available RAM on a particular host before placing a VM on it. In this way, their work demonstrates the role of RAM in energy consumption and provides a heuristic that considers RAM usage along with CPU for reducing energy and SLA violation.
Han et al. [14] defined two algorithms: one for underloaded host detection using the power efficiency (PE) value of a host, defined as the ratio between power consumption and the number of VMs running on that host; and a second algorithm for VM placement based upon remaining utilization, which considers the remaining available CPU resources of the PMs when placing the VMs. A new integrated VM consolidation algorithm, consisting of these two techniques, is applied to five different types of PlanetLab workload data using the CloudSim simulator. The results, compared with PABFD and UMC, show that the proposed algorithm has little impact on energy consumption; however, the SLA violation, SLATAH, number of VM migrations and ESV metrics are reduced significantly.
Mevada et al. [15] improved the VM placement algorithm as part of the VM consolidation process. They consider the utilization of the complete data center to determine the hosts that can be evacuated and further use this information to define a lower threshold value. Based on this value, underutilized hosts are detected and shut down after all their VMs are shifted to other vacant hosts.

Khoshkholghi et al. [16] researched all four sub-problems of VM consolidation defined by Beloglazov and Buyya [4] and generated a new consolidation scheme comprising all the proposed techniques. In this scheme, an overloaded host detection algorithm is developed that considers two utilization thresholds defined by an iterative weighted linear regression technique. A new VM selection algorithm is also generated, based upon three different policies, i.e. the maximum power reduction policy, the time and power tradeoff policy, and the violated MIPS–VMs policy. Then a two-phase algorithm for VM placement is proposed: in the first phase, the VMs selected from overloaded hosts are placed, and in the second phase all the VMs selected from underloaded hosts are allocated hosts. Further, a Multiple Resources Under-loaded Host Detection algorithm (MRUHD) is introduced; this algorithm defines an adaptive lower threshold value, and a host is considered underloaded when the utilizations of CPU, RAM and bandwidth are all below this threshold. The proposed consolidation scheme is evaluated using the CloudSim toolkit. The results depict that the newly proposed scheme outperforms the benchmarks and can reduce energy consumption by up to 28% and SLA violations by up to 87%.

3 Problem Statement

According to Beloglazov and Buyya [4], four sub-problems need to be researched during the VM consolidation process for optimal mapping of VMs and hosts. The first is to find the overloaded hosts, i.e. hosts that have been allocated more VMs than their serving capacity, because this would result in idle VMs and hence promote SLA violation. The second is selecting an adequate number of VMs from the overloaded hosts for migration, depending upon various parameters such as CPU utilization, the size of the VMs, VM correlation, etc. The third is detecting the under-loaded hosts, in which a threshold value is defined for some particular parameters and hosts below this threshold are considered under-loaded. The purpose of detecting under-loaded hosts is to find suitable hosts that can easily be shut down without affecting the other parameters, as all the VMs from under-loaded hosts can be migrated to other active hosts. Lastly, all the VMs selected for migration must be placed on suitable physical hosts in an optimal way. Placement here defines how the migrated VMs are allocated to the available resources of active physical hosts that are neither overloaded nor under-loaded. This mapping between hosts and VMs is not stable, because new VMs keep being generated to satisfy dynamic user demands and old VMs terminate progressively after the completion of the tasks assigned to them. In addition, it impacts all the performance measurements, such as the energy consumed by hosts; how many hosts may become overloaded or under-loaded after a specific interval; how many VM migrations will occur; and, last but not least, how much SLA violation will occur.
From the analysis of the related work, some research gaps have been identified that are helpful in designing the proposal of this research work. All the reviewed papers and related research have modified existing techniques or developed new policies for improvements in one or more of the performance metrics defined later. Researchers have developed many strategies for reducing energy consumption and, for this, have considered required or available resources, or the amount of energy consumed, when applying different allocation policies. The main concepts used for reducing energy are to allocate the VMs in such a way that the number of active PMs is reduced and to shut down the idle servers. However, there are various other existing solutions for NP-hard problems that have not been tested; to optimize an NP-hard problem, a solution for another problem of a similar type can be applied. This gap in the research leads us to concentrate on applying a technique for solving the bin packing problem to the VM placement procedure in a cloud environment; for this, it is necessary that both problems be scaled to a common format. It has also been observed that there is a tradeoff between SLA violation and energy consumption [9]: algorithms reducing energy consumption might impact SLA violation and, similarly, reducing SLA violation may consume more energy. So, a VM placement policy that maintains the balance between both parameters needs to be defined.

4 Preliminary

Out of the four sub-problems defined in the problem statement section, VM placement is the main sub-problem considered in this research. This section describes the base VM placement policy that is modified.

Fig. 3. Power Aware Best Fit Decreasing (PABFD): Algorithm 1 [4].

In [4], the power aware best fit decreasing technique is applied, defined in Algorithm 1 and shown in Fig. 3. This is the default algorithm for VM placement in CloudSim. In it, the VMs to be migrated are selected one by one from a list, and their suitability is checked against each and every available host. Power consumption is the main factor considered in this method for finding a suitable host for a VM. As shown in step 8 of Algorithm 1, the power is estimated for each VM on a particular host. Then, based upon the condition applied in step 9, the VM is allocated to a specific host only if it has the minimum power consumption on that host. Hence, VMs are allocated to the best-fit hosts available.
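The placement loop of Algorithm 1 can be sketched in Python as follows. The host/VM interface (cpu_utilization, is_suitable_for, power_increase_after, allocate) is illustrative only and does not correspond to the CloudSim API.

```python
def pabfd(hosts, vms_to_migrate):
    # Power Aware Best Fit Decreasing: scan every host for every VM and pick
    # the host with the minimum estimated power increase.
    vms = sorted(vms_to_migrate, key=lambda vm: vm.cpu_utilization, reverse=True)
    placement = {}
    for vm in vms:
        best_host, min_power = None, float("inf")
        for host in hosts:                      # best fit: every host is checked
            if not host.is_suitable_for(vm):
                continue
            power = host.power_increase_after(vm)
            if power < min_power:
                best_host, min_power = host, power
        if best_host is not None:
            best_host.allocate(vm)
            placement[vm] = best_host
    return placement
```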

5 Proposed Algorithm

In this section, the proposed algorithm, Power Aware Next Fit Decreasing (PANFD), is elaborated in Fig. 4. Next fit allocation is an advanced version of the first fit allocation policy: the next host for allocation is searched for starting from the location of the previously allocated host, under the condition that all the hosts are in sorted order. The order of sorting depends upon different parameters of the hosts as well as the VMs, such as the available number of MIPS, the amount of energy consumed, the current CPU utilization, etc. In this algorithm, the CPU utilization value of a VM is used to sort the VMs in ascending order, because utilization is directly proportional to energy consumption, and the available amount of MIPS per host is used to arrange the available hosts in ascending order. With this scheme, the VM with the required utilization is allocated to the first suitable host in the list, and, thereafter, for the next VM, the search continues from the position where the last host was selected. This could increase the time for searching the host because the hosts are sorted, but it utilizes the resources in an optimized way, resulting in less energy consumption with fewer migrations.

Fig. 4. Power Aware Next Fit Decreasing (PANFD): Algorithm 2.



This algorithm is based upon the assumption that if a host is not compatible with a VM, then it will also not be compatible with the next VM in the list, because both the hosts and the VMs are sorted in ascending order. So, it is better to skip those hosts and search only among the remaining ones. The modified algorithm shown in Fig. 4 defines a new pointer named "position" for searching for the next suitable host. Initialized to the first position, the pointer is updated after every allocation of a suitable host to the current VM in the list. This modification not only improves the performance of the algorithm but also affects the overall allocation of VMs in an optimized manner, leading to optimized performance, compared to the base approach, on the various factors discussed later.
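A minimal Python sketch of this next-fit search is given below. The power check of Algorithm 2 is abstracted into the illustrative is_suitable_for method, and the interface mirrors the PABFD sketch in the previous section; it is a sketch under those assumptions, not the authors' implementation.

```python
def panfd(hosts, vms_to_migrate):
    # VMs in ascending order of CPU utilization, hosts in ascending order of
    # available MIPS, as described in Sect. 5.
    vms = sorted(vms_to_migrate, key=lambda vm: vm.cpu_utilization)
    hosts = sorted(hosts, key=lambda h: h.available_mips)
    placement = {}
    position = 0                               # persists across VMs (next fit)
    for vm in vms:
        while position < len(hosts):
            host = hosts[position]
            if host.is_suitable_for(vm):
                host.allocate(vm)
                placement[vm] = host
                break                          # next VM resumes from this position
            position += 1                      # skip: ruled out for this and later VMs
    return placement
```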

6 Experiment and Performance Analysis

6.1 Performance Metrics


Various performance metrics can be used to assess the usefulness of the proposed VM placement mechanism compared to the standard mechanism. In this work, principally the energy consumption in data centers, the number of migrations taking place together with their influence on performance, and the SLA violations per host and overall have been investigated [4]. Each metric is detailed as follows:
Energy Consumption: Physical machines can be turned on or shut down depending upon the server consolidation mechanism. Energy consumption is therefore defined as the amount of energy consumed during an interval at running time by all the physical machines. A lower value of energy consumption means less expenditure, so an algorithm can be considered optimized only if it helps to reduce this parameter.
Number of VM Migrations: This parameter defines the number of times VMs are shifted among the various hosts. The more migrations there are, the greater the degradation in performance and the SLA violations; however, too few migrations would result in inadequate and improper allocation of resources. Hence, there must be a balance to handle this trade-off.
Performance Degradation due to Migration (PDM): Expressed using the following formula, PDM is the sum, over all VMs, of the ratio of the performance decrement caused by a VM's migration to that VM's total required CPU capacity.

PDM = Σ_{i=1}^{V} Pd_i / Pc_i                   (2)

where
V = total number of available VMs,
Pd_i = the estimated performance degradation due to the migration of the ith VM,
Pc_i = the total CPU capacity required by the ith VM.

SLA Violation Time per Active Host (SLATAH): Represents the percentage of time for which a CPU is 100% utilized; in other words, it defines the portion of time for which VMs have to wait because the CPU is fully utilized.

SLATAH = (1/H) Σ_{i=1}^{H} T_Ci / T_Ai          (3)

where H is the total number of hosts, T_Ci corresponds to the complete duration during which the CPU of the ith host remains 100% utilized (leading to SLA violation), and T_Ai is the time for which the ith host has remained in the active state. The importance of SLATAH lies in the fact that a VM cannot be assigned to a host at 100% CPU utilization, which forces the VM to wait for that amount of time and thus violates the SLA.
Overall SLA Violations (OSLAV): SLA violations occur due to over-utilization of hosts and migrations of VMs, because over-utilized hosts cannot be assigned more VMs and VM migration consumes extra CPU cycles to maintain the status of the migrations. The overall SLA violation is the result of all factors violating the SLA. It can be represented as a percentage and is defined by the following formula:

OSLAV = (totalRequestedMips − totalAllocatedMips) / totalRequestedMips      (4)

where totalRequestedMips is the sum of all the MIPS requested by the available VMs in the data center according to their resource requirements, and totalAllocatedMips is the sum of all the MIPS actually allocated to them.
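A compact Python sketch of Eqs. (2)–(4), assuming the per-VM and per-host measurements are available as plain lists (names are illustrative):

```python
def pdm(degradation, required_mips):
    # Eq. (2): sum over VMs of (degradation due to migration) / (requested CPU capacity)
    return sum(d / c for d, c in zip(degradation, required_mips))

def slatah(time_fully_utilized, time_active):
    # Eq. (3): mean fraction of active time each host spends at 100% CPU utilization
    ratios = [tc / ta for tc, ta in zip(time_fully_utilized, time_active)]
    return sum(ratios) / len(ratios)

def oslav(total_requested_mips, total_allocated_mips):
    # Eq. (4): fraction of requested MIPS that could not be allocated
    return (total_requested_mips - total_allocated_mips) / total_requested_mips
```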

6.2 Experimental Setup


To check the effectiveness of the algorithms, a simulation model of a cloud environment is required that can depict an IaaS model. In this work, we have selected the CloudSim tool as the simulation platform. The benefit of CloudSim is that researchers can focus on problems related to system design instead of worrying about the services and infrastructure related to a cloud implementation [17]. The default PABFD defined in CloudSim and the proposed algorithm PANFD are applied and compared. A total of sixteen experiments have been performed, combining four overloaded host detection algorithms, (1) MAD, (2) IQR, (3) LRR and (4) LR, with four basic VM selection techniques, namely (1) MMT, (2) MC, (3) RS and (4) MU. Each experiment name is defined so that it represents the overload host detection policy and the VM selection policy used in it; e.g. MAD_MMT depicts that the MAD overload detection policy is used with the MMT VM selection policy. In our experiments, 800 heterogeneous hosts are configured, of which half are HP ProLiant ML110 G4 servers (Intel Xeon 3040, dual-core 1860 MHz, 4 GB, 1 Gbps), while the remaining are HP ProLiant ML110 G5 servers (Intel Xeon 3075, dual-core 2660 MHz, 4 GB, 1 Gbps). Real energy consumption data defined by the SPECpower benchmark are considered [18]. The energy consumed by these two types of servers at various CPU utilization levels is given in Table 2. The reason for selecting this configuration is to measure the effectiveness of the proposed VM placement algorithm, as servers with less resource capacity can be overloaded easily by a light workload. Four types of virtual machines corresponding to Amazon EC2 instance types are created, as defined in Table 1 [4].

Table 1. VM instance types in Amazon EC2


Type CPU (MIPS) RAM (GB)
Extra large 2500 3.75
High CPU medium 2500 0.85
Small 1000 1.7
Micro 500 0.61


6.3 Workload Data


Traces of a real workload from the real environment of the CoMon system have been considered for the simulation, for a better understanding of the algorithm's application. The CoMon project developed a monitoring infrastructure for PlanetLab [19]. This data consists of CPU utilization values measured from a large number of VMs running on various physical servers around the world. The statistics of the workload are given in Table 3.

Table 2. Amount of energy consumed (in Watts) by hosts at various CPU utilization levels.
Server 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
HPG4 86.0 89.4 92.6 96.0 99.5 102.0 106.0 108.0 112.0 114.0 117.0
HPG5 93.7 97.0 101.0 105.0 110.0 116.0 121.0 125.0 129.0 133.0 135.0

Table 3. CPU utilization statistics of workload considered.


Date            Number of VMs   Mean value   Standard deviation   First quartile   Median   Third quartile
March 3, 2011   1052            12.31%       17.09%               2%               6%       15%

6.4 Experiment Results


The following tables show the experiment results for the five main performance metrics considered in this paper. Table 4 shows the results for both the existing and the modified scheme applied with the MAD overload detection policy and the four VM selection policies. Similarly, Table 5 describes the results of the experiments using IQR with the same selection policies, and Tables 6 and 7 give the values obtained from the experiments based on the LRR and LR policies, respectively.

Table 4. Performance metrics with MAD.


Experiment   Energy consumption   No. of VM migrations   PDM            SLATAH         OSLAV
             PABFD     PANFD      PABFD     PANFD        PABFD  PANFD   PABFD  PANFD   PABFD  PANFD
MAD_MMT 132.42 118.22 28105 24988 0.1 0.09 4.01 2.84 0.15 0.11
MAD_MC 135.5 118.76 24737 21428 0.13 0.1 4.8 3.11 0.19 0.13
MAD_RS 175.81 118.46 24026 21725 0.11 0.1 7.09 3.22 0.13 0.13
MAD_MU 133.6 116.85 48244 41452 0.13 0.11 5.33 3.36 0.22 0.17

Table 5. Performance metrics with IQR.


Experiment   Energy consumption   No. of VM migrations   PDM            SLATAH         OSLAV
             PABFD     PANFD      PABFD     PANFD        PABFD  PANFD   PABFD  PANFD   PABFD  PANFD
IQR_MMT 188.86 121.24 26476 25581 0.06 0.09 4.96 2.85 0.07 0.11
IQR_MC 137.23 121.44 24646 22505 0.13 0.1 4.75 3.1 0.18 0.12
IQR_RS 136.78 121.86 25312 22978 0.13 0.1 4.27 3.16 0.18 0.13
IQR_MU 134.28 130.65 26162 43903 0.12 0.12 4.58 6.02 0.18 0.21

Table 6. Performance metrics with LRR.


Experiment   Energy consumption   No. of VM migrations   PDM            SLATAH         OSLAV
             PABFD     PANFD      PABFD     PANFD        PABFD  PANFD   PABFD  PANFD   PABFD  PANFD
LRR_MMT 163.15 118.55 27632 11670 0.08 0.03 5.84 2.57 0.14 0.1
LRR_MC 150.33 118.99 23004 9794 0.1 0.04 6.97 2.43 0.16 0.1
LRR_RS 151.15 118.85 23028 9803 0.1 0.04 7.03 2.63 0.16 0.1
LRR_MU 174.24 116.98 29555 15966 0.07 0.03 8.18 3.82 0.17 0.19

Table 7. Performance metrics with LR.


Experiment   Energy consumption   No. of VM migrations   PDM            SLATAH         OSLAV
             PABFD     PANFD      PABFD     PANFD        PABFD  PANFD   PABFD  PANFD   PABFD  PANFD
LR_MMT 126.66 118.55 11624 11670 0.05 0.03 4 2.57 0.15 0.1
LR_MC 132.27 118.99 11587 9794 0.05 0.04 3.98 2.43 0.15 0.1
LR_RS 150.51 119.11 23825 9699 0.1 0.04 7.15 2.54 0.17 0.1
LR_MU 174.24 116.98 29555 15966 0.07 0.03 8.18 3.82 0.17 0.19

6.5 Result Analysis


In this section, we analyze the variation in performance when PABFD is replaced with PANFD in server consolidation, for all the above-mentioned parameters one by one.
Energy Consumption: The bar graph in Fig. 5 illustrates the amount of energy consumed, in kilowatt-hours, for all sixteen experiments performed. It can be seen from Fig. 5 that energy consumption is reduced by applying PANFD compared to PABFD in all the experiments. It is also observed that, with the proposed approach, MAD_MU consumed the minimum energy, 116.85 kWh, compared to the other combinations, whereas IQR_MMT showed the highest decrease of 35.8% when applying the proposed approach.

Fig. 5. Comparison for energy consumption for PABFD and PANFD.

Number of VM Migrations: Figure 6 illustrates the number of VM migrations occurring during the server consolidation process. PANFD performed best for LR_RS, in which migrations were reduced by 59.3%, and LR_RS also has the minimum value of 9699 migrations with the novel approach. The figure also reveals that with PANFD this attribute has been optimized in all cases except IQR_MU.

Fig. 6. Comparison for No. of VM migrations for PABFD and PANFD

Fig. 7. Comparison for PDM for PABFD and PANFD

Performance Degradation Due to Migration (PDM): This parameter captures the performance degradation caused by VM migrations in the consolidation process. Owing to the decrease in the number of VM migrations discussed earlier, PDM has also decreased with the proposed VM placement policy, as the bar graph in Fig. 7 shows. It is clear that PDM decreased in every experiment except IQR_MMT, and it has the lowest value of 0.03% for LRR_MC, LRR_RS, LR_MC and LR_RS. In addition, LRR_MU and LR_MU showed the maximum improvement in this factor, at 57.14%.

Fig. 8. Comparison for SLATAH for PABFD and PANFD

Fig. 9. Comparison for OSLAV for PABFD and PANFD

SLA Time Per Active Host: This factor mainly defines the level of QoS attained while delivering services to cloud users. As represented in Fig. 8, it has decreased dramatically in all the combinations except IQR_MU. Here, LRR_MC performed best, with a 65.13% improvement and the lowest SLATAH value of 2.43%. Another related quality factor considered for evaluation is the overall SLA violation (OSLAV), shown in Fig. 9. It shows that OSLAV did not improve for IQR_MMT, IQR_MU, LRR_MU and LR_MU; however, for LR_RS it was optimized by 41.17%. Also, IQR_MMT with the existing approach has the minimum OSLAV of 0.07%. Hence, PANFD produced mixed results for this factor and showed improvement to some extent.

7 Conclusion

In conclusion, the application of the Next-Fit allocation approach in the VM placement procedure has shown significant improvement in terms of reducing energy consumption and SLA violations. Compared to the existing approach, it achieved lower values of energy consumption, VM migrations, PDM and SLATAH for almost every experiment executed. Even OSLAV decreased in all the experiments except the case where the minimum CPU utilization value is considered for selecting the VMs to be migrated. The prime factor behind the better performance of the next fit approach is the sorting order of the hosts in the search space: it improved the utilization of hosts by packing the maximum number of VMs onto a host. In addition, next fit implies that if a host is not suitable for a VM then it will also not be suitable for the next VM, because the VMs are also sorted in increasing order of utilization. So, the proposed approach of placing the VMs using the next fit policy identifies an optimized way of allocation that not only saves energy but also reduces SLA violations by improving performance. Thus, the proposed approach is more powerful in managing the balance in the trade-off between energy consumption and SLA violation.
In this paper, the solution for the VM placement problem, i.e. the Next-Fit approach, is derived from solutions of the BPP. Hence, other approaches that have been identified, or are being researched, for solving the BPP can also be applied and tested for optimizing VM placement. There is also scope for applying similar solutions to the data center broker, which is mainly responsible for mapping tasks to VMs; depending upon the requirements of the tasks, VMs could be allotted in a bin-packing manner. This application would optimize the number of VMs required for the tasks and subsequently improve the energy efficiency in data centers.

References
1. Kepes, B.: Aligned energy changes the data center model. https://www.networkworld.com/
article/3025455/aligned-energy-changes-the-data-center-model.html
2. Fan, X., Weber, W.D., Barroso, L.A.: Power provisioning for a warehouse-sized computer.
ACM SIGARCH Comput. Archit. News 35(2), 13–23 (2007)
3. Singh, P., Sengupta, J., Suri, P.K.: A novel approach of virtual machine consolidation for
energy efficiency and reducing sla violation in data centers. Int. J. Innovative Technol.
Exploring Eng. 8, 547–555 (2019)
4. Beloglazov, A., Buyya, R.: Optimal online deterministic algorithms and adaptive heuristics
for energy and performance efficient dynamic consolidation of virtual machines in cloud data
centers. Concurrency Comput. Pract. Experience. 24, 1397–1420 (2011)
5. Silva Filho, M., Monteiro, C., Inácio, P., Freire, M.: Approaches for optimizing virtual
machine placement and migration in cloud environments: a survey. J. Parallel Distrib.
Comput. 111, 222–250 (2018)
6. Clark, C., Fraser, K., Hand S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I., Warfield, A.: Live
migration of virtual machines. In: Proceedings of the 2nd Conference on Symposium on
Networked Systems Design & Implementation, vol. 2, pp. 273–286 (2005)
7. Coffman, E.G., Garey, M.R., Johnson, D.S.: Approximation algorithms for bin packing: a
survey. In: Approximation Algorithms for NP-hard Problems, pp. 46–93 (1996)

8. Kumaraswamy, S., Nair, M.K.: Bin packing algorithms for virtual machine placement in
cloud computing: a review. Int. J. Electr. Comput. Eng. (IJECE) 9, 512 (2019)
9. Chowdhury, M., Mahmud, M., Rahman, R.: Implementation and performance analysis of
various VM placement strategies in CloudSim. J. Cloud Comput. 4, 20 (2015)
10. Pagare, M.J.D., Koli, N.A.: Performance analysis of an energy efficient virtual machine
consolidation algorithm in cloud computing. Int. J. Comput. Eng. Technol. (IJCET) 6(5),
24–35 (2015)
11. Kuo, C.F., Yeh, T.H., Lu, Y.F., Chang, B.R.: Efficient allocation algorithm for virtual
machines in cloud computing systems. In: Proceedings of the ASE BigData & SocialInfor-
matics, p. 48. ACM (2015)
12. Mosa, A., Paton, N.: Optimizing virtual machine placement for energy and SLA in clouds
using utility functions. J. Cloud Comput. 5, 17 (2016)
13. Castro, P., Barreto, V., Corrêa, S., Granville, L., Cardoso, K.: A joint CPU-RAM energy
efficient and SLA-compliant approach for cloud data centers. Comput. Netw. 94, 1–13
(2016)
14. Han, G., Que, W., Jia, G., Shu, L.: An efficient virtual machine consolidation scheme for
multimedia cloud computing. Sensors 16, 246 (2016)
15. Mevada, A., Patel, H., Patel, N.: Enhanced energy efficient virtual machine placement policy
for load balancing in cloud environment. Int. J. Cur. Res. Rev. 9(6), 50 (2017)
16. Khoshkholghi, M.A., Derahman, M.N., Abdullah, A., Subramaniam, S., Othman, M.:
Energy-efficient algorithms for dynamic virtual machine consolidation in cloud data centers.
IEEE Access 5, 10709–10722 (2017)
17. Calheiros, R., Ranjan, R., Beloglazov, A., De Rose, C., Buyya, R.: CloudSim: a toolkit for
modeling and simulation of cloud computing environments and evaluation of resource
provisioning algorithms. Softw. Pract. Experience. 41, 23–50 (2010)
18. Standard Performance Evaluation Corporation, “SPECpower_ssj2008”, Spec.org. https://
www.spec.org/power_ssj2008/results/res2011q1/power_ssj2008-20110124-00338.html.
https://www.spec.org/power_ssj2008/results/res2011q1/power_ssj2008-20110124-00339.
html
19. Park, K., Pai, V.: CoMon: a mostly-scalable monitoring system for PlanetLab.
ACM SIGOPS Operating Syst. Rev. 40, 65 (2006)
Multi-user Expert System for Operation and Maintenance in Energized Lines

Erika F. Moreno(&), Evelyn E. Pacheco, Víctor H. Andaluz, and Álvaro S. Mullo

Universidad de las Fuerzas Armadas ESPE, Sangolquí, Ecuador
{efmoreno1,eepacheco1,vhandaluz1,asmullo}@espe.edu.ec

Abstract. The article presents the development of a multi-user virtual reality application for the theoretical and practical training of electrical maintenance personnel in the operation of energized lines. The environment includes a safety briefing room in which users learn each of the rules and protocols applied when performing maintenance and operation maneuvers on energized lines, and a realistic work area of the electric power system that starts at hydroelectric generation and passes through transmission and subtransmission lines to reach the substations that step the voltage up or down for subsequent distribution. The environment has been created using photogrammetry techniques, WorldComposer, CAD design tools, Unity 3D, DigSilent Power Factory and Matlab to reproduce the behavior of the electrical system realistically, emulating failures and critical conditions caused by external and internal emergency events. Experimental tests show the effectiveness of the human-machine interaction generated by the system, in which operators interact with one another and with the environment; this immersion contributes to the development of their collaborative skills and abilities without risk.

Keywords: Virtual Reality · Maintenance on energized lines · Electrical power system · Photogrammetry · Unity 3D

1 Introduction

Maintenance in industrial systems is a strategic factor in guaranteeing a high level of safety and productivity [1]. The development of adequate maintenance policies guarantees the efficiency of production plants in terms of quality, minimizing costs and maximizing the availability and performance of fixed assets [2]. Several maintenance policies have been introduced, such as fault-based maintenance (FBM), usage-based maintenance (UBM), condition-based maintenance (CBM), design-out maintenance (DOM) and detection-based maintenance. Ideally, maintenance should be performed shortly before the asset fails, not too soon, but not too late either [1, 2]. The most common practice in industry is preventive maintenance: a piece of equipment is regularly serviced while in operation so that it does not cause unplanned downtime. Predictive maintenance, on the other hand, uses advanced analytics of the asset under real operating conditions [3]. The electric power industry in particular has moved toward a competitive and intelligent future, with the need to improve maintenance management tasks [4].

Electric utilities apply proactive methods based on reliability-centred asset management (RCAM), improving maintenance modeling by detailing the procedure for the transmission system, which is divided into components and subcomponents prioritized with weighting coefficients [4], in order to maintain a continuous and stable electricity supply. The appropriate procedure for transmission lines and protection systems is the preventive one [4]. When carrying out maintenance operations on energized electrical networks, the proximity between the lineman and the energized conductor requires the implementation of several shielding procedures. Shielding against the electric field is provided by the use of dielectrically shielded equipment, which is highly effective since the low-frequency electric field is relatively easy to shield against [5]. It is essential to follow an effective overhead power line safety program: OSHA establishes specific distances and requirements for this type of work, equipment and workers in OSHA 1926, Subpart V (power transmission and distribution); ANSI/IEEE C2, the National Electrical Safety Code (NESC); 1910.333(c) for work near exposed energized parts; and NFPA 70E-2015, the Standard for Electrical Safety in the Workplace, whose Section 130.8 establishes equipment requirements [6]. Line operators and electricians must coordinate their senses and actions as a team under proper and safe work practices. Therefore, operator learning and training is becoming a strategic task for utilities, incorporating immersive technologies, training systems that can be integrated into the work environment, and adaptive training methods for different levels [7].
Virtual Reality (VR) and Augmented Reality (AR) systems tend to be a safer
solution in terms of training methods, in which various procedures can be simulated by
providing representations of the real world with dangerous environments without
risking people and equipment [7, 8]. Most power system training applications are
associated with hazardous tasks. Within the training processes, several research projects can be mentioned, such as: (i) the implementation of a VR-based training program applied to the maintenance of medium-voltage overhead lines in electricity distribution networks, which has allowed a substantial reduction in electrical accidents [9]; (ii) a desktop VR environment for power line workers, whose training system aims to reinforce classroom training and bridge theoretical knowledge and field work training [10];
(iii) Interactive system of virtual training of maintenance of hydroelectric generation
equipment to improve the experience and effectiveness of the training, based on the
operating system of each part and component [11]; (iv) Architecture of an intelligent
training system based on virtual environments for electricity distribution substations
oriented to the training process of electricians in the area of electrical substations [12];
(v) An immersive Virtual Reality application for collaborative training of power system
operators to improve user immersion and a problem-based learning approach [13]; (vi)
Training in virtual environments for a hybrid plant oriented to the simulation of faults
and manoeuvres for the training of professionals in Electrical Systems [14].
The document presents the development of a VR training system oriented to operation and maintenance on energized lines in a real environment spanning the generation, transmission, subtransmission and distribution of electric energy, in which personnel can access the application according to their hierarchical level and be trained in an immersive, risk-free environment. The realism of the application is established from an electrical scheme that allows the simulation and analysis of faults occurring in the busbars of the electric power system, as well as the variation of the dielectric characteristics of the safety equipment itself, improving the skills of operators and their collaborative work.
This work is divided into six sections, including the introduction. Section 2 describes the structure of the proposed system. Section 3 includes the description of the virtual environment. The development of collaborative work using the multi-user system is described in Sect. 4. Section 5 analyses the results obtained. Finally, Sect. 6 presents the conclusions of the training system.

2 System Structure

In the electrical sector, group coordination among maintenance personnel is of vital importance due to the risk of working with high voltage; one of the recurring problems is the operators' low capacity for collaborative interaction [15]. Several studies related to the development of virtual applications based on interactive learning make it possible to develop comprehensive learning techniques for construction teams in operational processes and safety [16].
The development of an immersive multi-user expert system for operation and maintenance work on energized lines is presented, based on hierarchical levels according to the assigned activities: (i) chief of maintenance; (ii) overseer; (iii) group leader; and (iv) operator. The tasks to be performed in the work area depend on the avatar selected. The virtualization of the real environment is carried out through photogrammetry techniques and WorldComposer, an additional tool for creating height maps, while the simulation of the electric power system is developed in the DigSilent Power Factory and Matlab software, which provide realism by reproducing the behavior of the electrical system in the event of a failure or external critical situation that leads to an excessive increase in voltage and current, directly affecting elements of the system such as transformers, insulators, and transmission and subtransmission lines. The aim is to offer maintenance operators learning and interaction with the system that is as realistic as possible and to improve their hazard-detection skills. The architecture of the virtual training system is shown in Fig. 1.
The environment has two training scenes: (i) an information room, a training place on the mandatory use of personal protective equipment and the maintenance protocols corresponding to the maneuver to be carried out in subtransmission or distribution; and (ii) a work area, which presents a real environment of an electric power system with the respective step-up and step-down substations, transmission, subtransmission and distribution lines with the main feeders, to which operators have access after training in the information room.
Fig. 1. Virtual training system architecture

3 Virtual Environment

This section describes the application development methodology. The stages are: (i) Creation of the environment, subdivided into (a) virtualization of the environment, with the objective of making the training process for energized-line maintenance personnel as realistic as possible, and (b) 3D modeling, in which information is collected by means of photographs and georeferencing for the construction of the scene. (ii) Virtual environment, developed in the Unity 3D graphics engine, incorporating features into the CAD models and the environment; DigSilent Power Factory generates process realism from critical system conditions. There are two scenes in the environment: (a) the information room and (b) the work area. (iii) Multi-user: at this stage, a hierarchical multi-user system is implemented so that operators from different parts of the world can connect to the environment through a communication network and interact with each other, strengthening collaborative work. The structural diagram of the interrelation between components is shown in Fig. 2.

Fig. 2. Structural diagram of interrelation between components

3.1 Virtualization of the Environment


The real environment to be virtualized is part of the National Interconnected System of Ecuador, specifically the Illuchi 1 - El Calvario subtransmission line in Cotopaxi. The Illuchi 1 hydroelectric power station has a step-up substation that is interconnected with the ELEPCO system through a three-phase 22 kV line to the 13.8 kV El Calvario line (see Fig. 3).
Fig. 3. Description of the environment (Illuchi I-II, National Interconnected System of Ecuador, Cotopaxi; S/E El Calvario)

The creation of the environment starts from the survey of information (photographs and georeferencing). The tool used to extract data from real-world maps is WorldComposer; the exact location of the area requires latitude and longitude coordinates, facilitating the creation of 3D reliefs. The virtualization of the environment is described in Fig. 4.

Fig. 4. Description of the virtualization of the environment (Phase 1: 360° photography and georeferencing; Phase 2: WorldComposer; Phase 3: environment processing with map parameters, regions, heightmap export, image export and create terrain; Phase 4: real virtualized environment)

The first phase requires the survey of information on the exact place to be virtualized, together with its location coordinates, obtained through 360° photographs; each photograph carries as data the georeferencing of the real point at which it was taken. The WorldComposer tool makes it possible to extract height maps with images of a real location anywhere in the world; to initialize it, the options described below must be activated: map parameters, regions, heightmap export, image export and create terrain. Once these parameters are activated, the virtualized area is created directly in the Unity assets.

3.2 CAD Design


SolidWorks CAD modeling software facilitates the creation from design, import and
assembly of subtransmission and distribution structures, posts, insulated-res and
transformers of elevation and reduction substations to establish a scene as close to
reality as possible. Each design needs to be saved with a *.IGS extension compatible
with 3DSMax to import into Unity. The CAD design of certain elements is (shown in
Fig. 5).

Fig. 5. CAD design in SolidWorks: (a) subtransmission tower; (b) transformer

3.3 Electrical Diagram


The realism and behavior of the electrical system are obtained from the simulation and the electrical variables delivered by the DigSilent Power Factory software. For this, the data of the generators and transformers, as well as the electrical parameters of the transmission lines, loads, circuit breakers and disconnectors, must be entered in the DigSilent editor according to real values of active power, reactive power, resistance, impedance, conductor type, conductor length and other configurations. The software not only supports the creation of power flow diagrams to study the effects on load distribution when losses occur, but also allows simulating system failure phenomena: external overcurrents on different busbars, the incidence of these overcurrents on the components and, in the most critical case, the occurrence of such an event at the exact moment of a maintenance operation. The power flow diagrams and system faults are shown in Figs. 6 and 7.

Fig. 6. System power flow in DigSilent Power Factory

Fig. 7. Bar (E/F) failure at DigSilent Power Factory.
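The DigSilent model itself is not reproduced in the paper; as a rough, illustrative sketch of the kind of quantity such a fault simulation delivers to the virtual environment, the Python snippet below computes the symmetrical current of a bolted three-phase fault from a Thevenin equivalent at the faulted busbar. The voltage and impedance values are placeholders, not data from the Illuchi system.

```python
import math

def three_phase_fault_current(v_ll_kv, z_thevenin_ohm):
    """Symmetrical three-phase fault current in kA for a bolted fault at a bus with
    line-to-line voltage v_ll_kv (kV) and Thevenin impedance z_thevenin_ohm (ohm)."""
    v_phase_kv = v_ll_kv / math.sqrt(3)       # phase-to-neutral voltage
    return abs(v_phase_kv / z_thevenin_ohm)   # kV / ohm -> kA

# Placeholder values only: a 22 kV bus and an assumed 0.5 + j2.0 ohm source impedance
print(f"Fault current ~ {three_phase_fault_current(22.0, 0.5 + 2.0j):.2f} kA")
```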



3.4 Design of Operators According to Levels


The information room has a menu that allows selecting among four operator levels, in order to delimit the activities and responsibilities of each member of the team and provide a learning method focused on each of the maintenance and operation maneuvers on energized lines. Avatars are created and customized in the Adobe Fuse software (see Fig. 8). Each of them is equipped with its own personal protective equipment and safety accessories, such as shirt, work trousers, helmet, goggles and gloves.

Fig. 8. Avatar created in Adobe Fuse

Figure 9 presents the operator levels: (i) chief of maintenance (engineer with a master's degree, high level), who controls and coordinates maintenance maneuvers and energized-line operation with the group leader and ensures compliance with safety standards; (ii) overseer (engineer, high level), who issues work orders and delivers materials according to the operations to be performed in the work area; (iii) group leader (technologist, medium level), who coordinates, receives materials according to work orders and performs maintenance maneuvers on energized lines; and (iv) operator (technician, low level), who fulfills work orders and performs operations and maneuvers such as cleaning and changing insulators, changing and relocating crossarms, changing disconnectors, and assembling and disassembling transformers.
Fig. 9. Operator levels created in Adobe Fuse (chief of maintenance and overseer: advanced level; group leader: medium level; operator: low level)

4 Multi-user System

This section focuses on the development of the multi-user application, from the selection of avatars in the environment according to the established hierarchies to the interaction between operators in a collaborative environment.
Figure 10 shows the multi-user system, which follows a client-server architecture: the operator (client) sends and receives messages through the server, which in turn replicates the information to the other operators and stores it in the database.

Fig. 10. Multi-user system
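The paper implements the multi-user layer inside Unity; purely as an illustration of the client-server pattern described above (each operator sends updates to a server that replicates them to the other connected operators), the sketch below uses plain TCP sockets in Python. The port, message handling and function names are assumptions for illustration, and persistence to the database is omitted.

```python
import socket
import threading

HOST, PORT = "0.0.0.0", 9000          # illustrative endpoint
clients = []                           # sockets of currently connected operators
lock = threading.Lock()

def handle_operator(conn):
    """Receive state updates from one operator and replicate them to all the others."""
    with lock:
        clients.append(conn)
    try:
        while True:
            data = conn.recv(1024)     # e.g. an avatar pose or a maneuver event
            if not data:
                break
            with lock:
                for other in clients:
                    if other is not conn:
                        other.sendall(data)
    finally:
        with lock:
            clients.remove(conn)
        conn.close()

def run_server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen()
        while True:
            conn, _addr = srv.accept()
            threading.Thread(target=handle_operator, args=(conn,), daemon=True).start()

if __name__ == "__main__":
    run_server()
```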



4.1 Maintenance Operator Selection


The selection of maintenance operators on energized lines depends on the hierarchical levels, which are mainly characterised by the colour of the helmet and the work clothes. For this reason, the menu that starts the application lists the positions of each avatar representing the operators of the energized-line work team; when a position is selected, it is registered with a name and an ID that is unique to each operator.
Figure 11 presents the personnel selection system according to the positions and responsibilities of the operators: (i) chief of maintenance (engineer with a master's degree, high level); (ii) overseer (engineer, high level); (iii) group leader (technologist, medium level); (iv) operator (technician, low level).

Fig. 11. Personnel selection system (chief of maintenance, overseer, group leader, operator)

5 Results and Discussions

This section describes the results achieved by the multi-user virtual training expert system as an additional tool for learning and correct management in the operation and maintenance of energized lines. Given the dangerous events involved in such work and the high cost of line protection equipment, the virtual application offers the novel advantage of simulating different failure events in the electric power system, specifically on the main busbars, without directly exposing the operators, who thus receive integral theoretical and practical learning focused on efficient collaborative work between groups in case of emergencies in the system.
The virtual work environment offers two training modes (operator and visitor) that allow user interaction and immersion according to the learning requirements, as shown in Fig. 12.

Fig. 12. Training mode menu (maintenance on energized lines; visit to the substation)

Maintenance Mode on Energized Lines. In this mode the user can access the environment from any location by entering a name and ID, after which a selection menu of maintenance operators by hierarchical position, according to activities and responsibilities, is presented (see Fig. 13). After the selection, the operator enters the information room, in which visual and auditory instructions on safety standards, diagrams and maintenance protocols are displayed (see Fig. 14).

Fig. 13. Hierarchy selection menu



Fig. 14. Training and information room

Selection of Maneuver and Work Area. A menu is presented to select the type of maintenance in the subtransmission or distribution system, with maneuvers such as assembly and disassembly of a single-phase transformer, change and cleaning of insulators, and change of crossarms (see Fig. 15).

Fig. 15. Selection menu of the type of manoeuvre

Work Area. In this environment the operator is able to visualize and interact directly with the power system, the step-up and step-down substations, insulators, transmission lines, power transformers and other electromechanical components (see Fig. 16) in order to perform maintenance maneuvers on energized lines under safety standards and protocols.
(a) Maintenance of transformer (b) Inspection of medium-voltage lines

Fig. 16. Operation and maintenance maneuver in medium voltage

Multi-user. The training system offers the novel alternative of training operators on a multi-user platform, which allows the immersion and interaction of several users in the work area according to the complexity of the manoeuvre, as shown in Fig. 17.

Fig. 17. Manoeuvre of collaborative maintenance

Fault Model. Figure 18 displays a fault on busbar E of the system caused by an opening error in a disconnector, which opens the circuit and generates an electric arc (shown in Fig. 19); the circuit breakers of the Illuchi 1 step-up substation are directly affected. The programming provides the effect of visual and auditory realism at the exact moment of the electrical emergency.

Fig. 18. E-bar disconnector open

Fig. 19. Electric arc on E-bar

Visitor Mode. In this mode the user has access to the three substations, Illuchi 1, Illuchi 2 and El Calvario, in which students and new operators can visit and learn about the elements that make up the system through an audio guide that facilitates the tour and learning in the electrical field (see Fig. 20).
Fig. 20. Visit mode to El Calvario substation

The evaluation and validation of the multi-user training assistant provides a range of approaches to the effectiveness of the training tasks and maintenance maneuvers. Finally, the joint evaluation verifies the validity of the user-system interface in real situations, quantifying the results in terms of benefit (knowledge, cost, and an extra risk-free training tool). The usability of the expert system was evaluated by means of a SUS questionnaire [17]. A group of 15 people participated in this evaluation process: 10 students and 5 professors of the Electromechanical Engineering program, located in different facilities (Research Laboratory, Networks Laboratory and Industrial Process Control), the students being enrolled in the Maintenance Engineering and Electrical Systems courses (see Fig. 21). The evaluators issued criteria regarding the animations and the presentation of the maintenance procedures, for the analysis of the information received by those evaluated and of possible improvements in terms of system redesign. Subsequently, research was carried out whose main focus was the improvement of training, comparing two groups of people: users trained with the conventional method (theoretical knowledge) versus users trained with theoretical knowledge plus training in the Virtual Reality environment, showing a positive influence on the improvement of cognitive skills, collaboration of students and teachers, and retention of knowledge.
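The paper reports a SUS-based usability evaluation; for reference, the sketch below implements the standard SUS scoring rule for the canonical ten-item questionnaire on a 1-5 scale. This is only the conventional computation and is not taken from the paper: the questionnaire in Table 1 and the response scale used in this study differ from the canonical instrument, and the responses shown are hypothetical.

```python
def sus_score(responses):
    """Standard System Usability Scale score for ten 1-5 Likert responses.
    Odd-numbered items contribute (response - 1), even-numbered items (5 - response);
    the sum is scaled by 2.5 to give a 0-100 score."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# Hypothetical single respondent
print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))   # -> 85.0
```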

Fig. 21. Evaluation of the application with VR devices



Table 1 shows the items of the usability assessment for the application.

Table 1. Evaluation results.

No. Questions
1 Have you used the virtual reality devices HTC and GEAR VR?
2 Is the assistant for energized-line training easy to use and intuitive?
3 Does the virtual reality environment have all the elements of an electrical system for better familiarization?
4 Do you need the external help of a technician to use the application?
5 Can you easily perform the maintenance operations shown by the application?
6 Did the signage shown in the environment contribute to learning about safety issues?
7 Did the multi-user system allow you to interact collaboratively in different maintenance operations with several users at the same time?
8 Did the application help you improve your cognitive and collaborative skills?
9 Would you recommend the expert training system as an extra theoretical-practical training tool in the educational field?
10 Is learning through virtual reality technologies a novelty to be applied in education?
Fig. 22. Results of the evaluation of the virtual training system (number of people surveyed per question; responses: always, almost always, sometimes, almost never, never)

The results obtained are shown in Fig. 22, where 14 of the 15 people evaluated indicate that the virtual application for training operators on energized lines contributes to the learning and development of the cognitive skills of students and teachers in the area, since it is a novel and easily accessible tool for theoretical and practical support when interacting with the components of the electric power system.
6 Conclusions

The article presents the development of a multi-user training system based on virtual reality for the electrical area. It offers assisted guidance to the user in the recognition of the processes and electrical elements of subtransmission and distribution, contributing to the education and training of professionals by means of the immersion and interaction of operation and maintenance manoeuvres in a risk-free environment, optimizing economic resources, time and infrastructure.
The Unity 3D graphics engine features an easy-to-use interface that provides the realism of an electric power system through interaction between the operators of the multi-user system; the experimental results show the development of skills and abilities in a collaborative environment, renewing current training models.
The exchange of data between the electrical software and the Unity 3D graphics engine provides analysis of possible faults that can be critically caused by bad maneuvers or external conditions in the electric power system (substation busbars, distribution systems). The objective of the development of this application is to provide an extra training tool that facilitates learning the maneuvers and safety protocols in the operation and maintenance of energized lines.
The virtual environment is currently used by professors and students of the Electromechanical Engineering program at the Universidad de las Fuerzas Armadas ESPE, facilitating their theoretical and practical training. This immersive virtual reality training system can be easily installed on a computer and, through the Internet, allows interacting with the multi-user system and strengthening collaborative work.
In the future, it is intended to integrate an augmented reality system that facilitates maintenance not only on energized lines but also in electrical substations, starting from generation.

References
1. Faccio, M., Persona, A., Sgarbossa, F., Zanin, G.: Industrial maintenance policy
development: a quantitative framework. Int. J. Prod. Econ. 147, 85–93 (2014)
2. Van de Kerkhof, R., Akkermans, H., Noorderhaven, N.: Knowledge lost in data:
organizational impediments to condition-based maintenance in the process industry. In:
Zijm, H., Klumpp, M., Clausen, U., Hompel, M. (eds.) Logistics and Supply Chain
Innovation, pp. 223–237. Springer, Cham (2016)
3. Daily, J., Peterson, J.: Predictive maintenance: how big data analysis can improve
maintenance. In: Richter, K., Walther, J. (eds.) Supply Chain Integration Challenges in
Commercial Aerospace, pp. 267–278. Springer, Cham (2017)
4. Koksal, A., Ozdemir, A.: Improved transformer maintenance plan for reliability centred asset
management of power transmission system. IET Gener. Transm. Distrib. X(8), 1976–1983
(2017)
5. Barbosa, C., Nallin, F.: Corrosion detection robot for energized power lines. In: Proceedings
of the 2014 3rd International Conference on Applied Robotics for the Power Industry, pp. 1–
6. IEEE (2014)
6. Neitzel, D.K.: Electrical safety when working near overhead power lines. In: 2016 IEEE PES
13th International Conference on Transmission & Distribution Construction, Operation &
Live-Line Maintenance (ESMO), pp. 1–5. IEEE (2016)
7. Galvan, I., Ayala, A., Rodríguez, E., Arroyo, G.: Virtual reality training system for
maintenance of underground lines in power distribution system. In: Virtual Reality (2016)
8. Ayala, A., Galván, I., Pérez, G., Ramirez, M., Muñoz, J.: Virtual reality training system for
maintenance and operation of high-voltage overhead power lines. In: Third International
Conference on Innovative Computing Technology (INTECH 2013) (2013)
9. Perez-Ramirez, M., Arroyo-Figueroa, G., Ayala, A.: The use of a virtual reality training
system to improve technical skill in the maintenance of live-line power distribution
networks. Interact. Learn. Environ. 1–18 (2019)
10. Zayas, B., Perez, M.: An instructional design model for virtual reality training environments.
In: EdMedia+ Innovate Learning. Association for the Advancement of Computing in
Education (AACE), pp. 483–488 (2015)
11. Li, B., Bi, Y., He, Q., Ren, J., Li, Z.: A low-complexity method for authoring an interactive
virtual maintenance training system of hydroelectric generating equipment. Comput. Ind.
100, 159–172 (2018)
12. Hernández, Y., Pérez, M., Ramírez, W., Ayala, E., Ontiveros, N.: Architecture of an
intelligent training system based on virtual environments for electricity distribution
substations. Res. Comput. Sci. 129, 63–70 (2016)
13. Dos Reis, P., Matos, C., Diniz, P., Silva, D., Dantas, W., Braz, G., Araújo, A.: An immersive
virtual reality application for collaborative training of power systems operators. In: 2015
XVII Symposium on Virtual and Augmented Reality, pp. 121–126. IEEE (2015)
14. Chiluisa, M., Mullo, R., Andaluz, V.H.: Training in virtual environments for hybrid power
plant. In: International Symposium on Visual Computing, pp. 193–204. Springer, Cham
(2018)
15. Zhang, S., Ying, S., Shao, Y., Gao, W., Liang, Y., Peng, P., Luo, X.: Design and application
of electric power skill training platform based on virtual reality technology. In: 2018 Chinese
Automation Congress (CAC), pp. 1548–1551. IEEE (2018)
16. Cardoso, A., do Santos Peres, I., Lamounier, E., Lima, G., Miranda, M., Moraes, I.:
Associating holography techniques with BIM practices for electrical substation design. In:
International Conference on Applied Human Factors and Ergonomics, pp. 37–47. Springer,
Cham (2017)
17. Cai, L., Cen, M., Luo, Z., Li, H.: Modeling risk behaviors in virtual environment based on
multi-agent. In: 2010 The 2nd International Conference on Computer and Automation
Engineering (ICCAE) (2010)
The Repeatability of Human Swarms

Gregg Willcox1(&), Louis Rosenberg1, and Colin Domnauer2

1 Unanimous AI, San Francisco, CA, USA
gregg@unanimous.ai
2 University of California, Berkeley, CA, USA

Abstract. Swarm Intelligence (SI) is a natural phenomenon in which social


organisms amplify their decision-making abilities by forming real-time systems
that converge on optimized solutions. It has been studied extensively in schools
of fish, flocks of birds, and swarms of bees. In recent years, a new technology
called Artificial Swarm Intelligence (ASI) has enabled human groups to form
similar systems over computer networks. While “human swarms” have been
shown to be more accurate than traditional methods for tapping the intelligence
of human groups, the present study tests the repeatability of the answers that
human swarms generate. Ten groups of 20 to 25 participants were asked to give
subjective ratings on a set of 25 opinion-based questions. The groups answered
by working together in real-time, connected by swarming algorithms. The results
show that groups answering as swarms produce repeatable results, reaching the
same answer as other groups 67% of the time. Additional analysis found that the
repeatability of each swarm was significantly correlated with a Conviction Index
(CI) metric computed from the real-time swarming data (r2 = 0.33, p < 0.01).
For swarms that converged upon a solution with a Conviction Index (CI) > 85%,
the repeatability was found to be greater than 90% and the likelihood that another
swarm randomly sampled from a similar population would generate the same
response was greater than 95% (p < 0.05). This provides powerful guidelines for
groups using ASI technology to generate optimized forecasts, insights, and
decisions from human swarms sampled from general populations.

Keywords: Artificial Swarm Intelligence · Human swarms · Repeatability · Reliability · Market research

1 Introduction

When generating insights from human populations, market researchers, political


pollsters and other practitioners often use survey instruments to sample a sub-group and
extrapolate results to the full population [1, 3, 14]. As long as the surveyed sample is
representative of the full population, the repeatability—defined as the likelihood that
another random sample will reach the same answer—is well studied and rigorously
defined [2, 14, 15]. This repeatability is crucial to the practical utility of the insights
generated from human groups – the more repeatable the result, the less likely the result
will be inconsistent with the true population’s beliefs due to the random bias introduced
when sampling from a population.
In recent years, a new technology and methodology has been developed for cap-
turing the beliefs of populations not by aggregating data from isolated surveys, but
instead by connecting individuals together into real-time systems, their interactions


moderated by AI algorithms modeled on the natural principle of Swarm Intelligence.
Known as Artificial Swarm Intelligence (ASI) or simply “Human Swarming,” this
method has been shown in numerous studies to produce solutions that can better reflect
a group’s collective beliefs, forecasts, or priorities than traditional surveys [5–13]. For
example, in a recent study conducted at Stanford University School of Medicine,
groups of radiologists were asked to estimate the probability that patients are positive
for pneumonia based on a review of their chest x-rays. When forecasting together as a
real-time swarm, diagnostic errors were reduced by over 20% as compared to the
aggregated radiologist survey probabilities [12].
While prior studies have established the value of the repeatability of traditional
polling methods [1–4, 14, 15], and have shown that swarming can generate signifi-
cantly more accurate insights from human groups than votes and polls [13, 16–18], no
previous study has examined the repeatability of human swarms, or assessed what
factors influence or predict the repeatability of swarms. The sections that follow
describe a rigorous repeatability study in which ten groups of randomly selected par-
ticipants were tasked with answering a set of highly subjective questions by working as
an ASI swarm. The repeatability of the group insights generated by real-time swarming
was then statistically assessed.

2 Repeatability Study

Ten groups of randomly selected participants were tasked with answering a set of 25
subjective assessment questions, each of which required them to rate items on a 1–5
scale. Each group consisted of between 20 and 25 individuals. All questions were of the same format: “How important is it to have taken a class in [subject area] before graduating high school?” The group congregated on the Swarm AI platform to answer the questions as a real-time human swarm. A swarm is shown answering question 14 (Environmental Science) in Fig. 1.

Fig. 1. A human swarm answering a question in real-time
For each unique question, the repeatability of the swarm’s answers was calculated
as the fraction of the ten groups that gave the most common answer. For example, if
25% of groups tested chose “1” and 75% of groups tested chose “2”, the repeatability
of the swarms on that question would be 75%.
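A minimal sketch of the repeatability metric defined above, the fraction of swarms whose answer matches the modal answer for a question; the example values are illustrative, not the study's data.

```python
from collections import Counter

def repeatability(answers):
    """Fraction of swarms that gave the most common answer to one question."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

# Illustrative example: ten swarms answering one question on a 1-5 scale
print(repeatability([2, 2, 1, 2, 2, 3, 2, 2, 2, 1]))   # -> 0.7
```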

Fig. 2. Repeatability per question over ten groups

3 Analysis and Results

Over the 25 subjective assessment questions used in this experiment, the ASI swarms
generated answers that were 67% repeatable (i.e., on average, swarms chose the most
commonly generated answer 67% of the time.) The repeatability was also broken down
per question, shown in Fig. 2. Four questions were answered the same way by all 10
swarms, resulting in a 100% repeatability for questions 4, 7, 15, and 21. The minimum
repeatability observed was 40%, on questions 1, 3, and 10. On these questions, the
most popular answer chosen by the swarms was only selected 40% of the time, while
two or more other answers were selected the other 60% of the time. The swarm answers
to each of the 25 questions were bootstrapped 5000 times with replacement, and a 95%
confidence interval of the repeatability of swarms in this question set was calculated.
We can be 95% confident that swarms in this question set were between 60% and 75%
repeatable on average.
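A sketch of the percentile bootstrap used for the 95% confidence interval on mean repeatability, matching the 5,000 resamples reported; the per-question repeatability values below are invented placeholders.

```python
import random

def bootstrap_ci(values, n_resamples=5000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    means = []
    for _ in range(n_resamples):
        sample = [random.choice(values) for _ in values]   # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Invented per-question repeatability values for 25 questions
per_question = [0.4, 0.5, 0.6, 0.7, 1.0, 0.6, 1.0, 0.5, 0.7, 0.4,
                0.6, 0.8, 0.7, 0.6, 1.0, 0.5, 0.7, 0.8, 0.6, 0.5,
                1.0, 0.7, 0.6, 0.8, 0.5]
print(bootstrap_ci(per_question))
```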

To determine whether the repeatability of a given swarm can be predicted without


running many swarms, the Conviction of each swarm answer was correlated with the
repeatability of that answer choice for each question, calculated as the fraction of other
swarms that chose that answer to the question. The Conviction of a swarm is a measure
of the degree of consensus in the chosen answer: the higher the Conviction, the more
the group agreed on that answer. As depicted in Fig. 3, Conviction was significantly
positively correlated with the repeatability of a swarm’s answer on a given question
(r2 = 0.33, p = 6e−23). At high conviction the swarm is very often highly repeatable.
In fact, based on a calculated running average, swarms are 90% repeatable above 85%
conviction. On this dataset, 14% of swarms fell above this 90% repeatability threshold.
Notably, 42 of the 44 swarm-based responses (95.5%) that had a Conviction value
greater than 85%, selected the answer that the majority of other swarms also selected
when answering that same question. This result suggests a powerful guideline for
practitioners generating insights from swarms of this size: if a randomly fielded swarm
selects an answer with 85% conviction or above, the answer the swarm selected is very
likely to be the same answer chosen by the vast majority of other similarly fielded
swarms (p < 0.05).
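Assuming per-response (Conviction, repeatability) pairs are available, the sketch below computes the Pearson correlation and applies the 85%-Conviction guideline described above; the arrays are placeholders, not the study's data.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Placeholder (Conviction, repeatability) pairs for eight swarm responses
conviction    = [0.92, 0.61, 0.88, 0.45, 0.79, 0.95, 0.55, 0.70]
repeat_frac   = [1.00, 0.60, 0.90, 0.40, 0.70, 1.00, 0.50, 0.60]

r = pearson_r(conviction, repeat_frac)
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")

# Practical guideline from the study: trust a single swarm's answer when Conviction > 85%
print([c > 0.85 for c in conviction])
```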

Fig. 3. Swarm repeatability and conviction correlation, including a moving average trendline to
show local average accuracy.

4 Swarm Interpolation

While the explicit answers from swarms are instructive, it’s often the case that a deeper
analysis of the behaviors of individuals within the swarms reveals a more accurate
picture of the beliefs of the group itself. A process of interpolation was used to refine
the swarm’s explicit answer into a fractional answer index that better represents the
beliefs of the group. This interpolated swarm answer was calculated as the mean
answer that individuals expressed over the swarm’s entire deliberation, expressed as a
decimal (e.g. 3.15). One example of the process of interpolating an explicit swarm
answer into an interpolated answer is shown in Fig. 4. In this example, the swarm
chose “2”, but debated mostly between answers “1 (not important)” and “2”. The
interpolated answer for this swarm was 1.60.
An interpolated answer was calculated for each explicit swarm answer in this study.
The interpolated answers had an intra-question variance of 0.138 of an answer index,
meaning that the swarm answers to a single question had a standard deviation of 0.371
answer indexes, on average. The interpolated answers observed in response to each
question were bootstrapped 5,000 times, and a 95% confidence interval of the average
intra-question standard deviation was calculated as [0.316, 0.412].
The average intra-question standard deviation of the explicit answers was 0.574
indexes, with a 95% confidence interval of [0.490, 0.656]. When comparing the intra-
question standard deviation of the explicit and interpolated answers, we find that it’s
highly unlikely (p < 0.001) that the interpolated answers have a lower intra-question
standard deviation. As a result, we can be confident that the interpolated answer metric
from a single swarm is a more precise predictor of the population’s sentiment on a
given question, as measured by the average swarm response to that question, than the
explicit answer that a swarm chooses.
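A sketch of the interpolation described above, under the assumption that the raw swarming data can be reduced to per-time-step support fractions for each answer option; the interpolated answer is then the support-weighted mean answer index averaged over the deliberation. The timeline values are invented.

```python
def interpolated_answer(support_over_time):
    """support_over_time: list of dicts mapping answer index -> fraction of the swarm
    supporting that option at one time step. Returns the support-weighted mean index
    averaged over the whole deliberation."""
    step_means = []
    for support in support_over_time:
        total = sum(support.values())
        step_means.append(sum(idx * w for idx, w in support.items()) / total)
    return sum(step_means) / len(step_means)

# Invented example: a swarm that debates mostly between options 1 and 2
timeline = [
    {1: 0.6, 2: 0.4},
    {1: 0.5, 2: 0.5},
    {1: 0.3, 2: 0.6, 3: 0.1},
]
print(round(interpolated_answer(timeline), 2))   # about 1.6, comparable to the Fig. 4 example
```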

Fig. 4. Interpolated answer based on mean support density over time



5 Conclusions

These results indicate that ASI swarms of 20–25 people are statistically repeatable
systems. Using the highly subjective questions in this study, the swarms were found, on
average, to produce 67% repeatable results. Furthermore, the repeatability of each
response was found to significantly correlate to the Conviction metric generated for that
swarm (r2 = 0.33, p < 0.01). Specifically, swarm-based responses with a Conviction
metric of more than 85% were shown to produce answers that were on average 90% or
more repeatable, and that produced the answer chosen most often by the swarm more
than 95% of the time. This provides a valuable guideline for practitioners using human
swarms to generate insights from sampled populations.
In addition, for questions with ordered answer options, the results of this study
show that interpolating the swarm’s output using the underlying behavioral data can
significantly decrease the variance of answers, indicating that the interpolated answer
may be a more precise estimator of group sentiment than the swarm’s explicit answer.
This study was limited by the number of groups tested, the size of the groups tested,
and the content of the questions. Future work may investigate the repeatability of
smaller (n < 15) or larger (n > 30) groups than in this study: the repeatability of a
survey is expected to increase as more people are surveyed, so it would be interesting to
see if the same trend holds for human swarms. It would further be interesting to
compare the expected repeatability of surveys and human swarms as the number of
respondents changes from small (n = 3) to large (n > 100): is one method more
repeatable at large/small sample sizes? Future work may also expand the question
content considered, as this study was limited to subjective assessments that tasked the
groups with rating items on a 5-point scale. In this study, some questions had full
repeatability (100%), while others were less repeatable than a coinflip (40%), which
may be in part attributable to the content of the questions themselves, so future work
may explore the impact that question content has on the repeatability of human swarms.

Acknowledgment. Thanks to Chris Hornbostel for his efforts in coordinating the swarms. Also,
thanks to Unanimous AI for the use of the Swarm platform for this ongoing work. This work was
partially funded by NSF Grant #1840937.

References
1. Galton, F.: Vox populi. Nature 75, 450–451 (1907)
2. Steyvers, M., Lee, M.D., Miller, B., Hemmer, P.: The wisdom of crowds in the recollection
of order information. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I. (2009)
3. Tetlock, P.E., Gardner, D.: Superforecasting: The Art and Science of Prediction. Crown
Publishing Group, New York (2015)
4. Dana, J., Atanasov, P., Tetlock, P., Mellers, B.: Are markets more accurate than polls? The
surprising informational value of “just asking”. Judgm. Decis. Making 14(2), 135–147
(2019)
5. Rosenberg, L.B.: Human swarms, a real-time method for collective intelligence. In:
Proceedings of the European Conference on Artificial Life, pp. 658–659 (2015)
6. Rosenberg, L.: Artificial swarm intelligence vs human experts. In: International Joint Conference on Neural Networks (IJCNN). IEEE (2016)
7. Rosenberg, L., Baltaxe, D., Pescetelli, N.: Crowds vs swarms, a comparison of intelligence.
In: IEEE 2016 Swarm/Human Blended Intelligence (SHBI), Cleveland, OH, pp. 1–4 (2016)
8. Baltaxe, D., Rosenberg, L., Pescetelli, N.: Amplifying prediction accuracy using human
swarms. In: Collective Intelligence 2017, New York, NY (2017)
9. Willcox, G., Rosenberg, L., Askay, D., Metcalf, L., Harris, E., Domnauer, C.: Artificial
swarming shown to amplify accuracy of group decisions in subjective judgment tasks. In:
Arai, K., Bhatia, R. (eds.) Advances in Information and Communication. FICC 2019.
Lecture Notes in Networks and Systems, vol. 70. Springer, Cham (2020)
10. Rosenberg, L., Pescetelli N., Willcox, G.: Artificial swarm intelligence amplifies accuracy
when predicting financial markets. In: 2017 IEEE 8th Annual Ubiquitous Computing,
Electronics and Mobile Communication Conference (UEMCON), New York City, NY,
pp. 58–62 (2017)
11. Rosenberg, L., Willcox, G.: Artificial swarm intelligence vs vegas betting markets. In: 2018
11th International Conference on Developments in eSystems Engineering (DeSE),
Cambridge, United Kingdom, pp. 36–39 (2018)
12. Rosenberg, L., Lungren, M., Halabi, S., Willcox, G., Baltaxe, D., Lyons, M.: Artificial
swarm intelligence employed to amplify diagnostic accuracy in radiology. In: 2018 IEEE 9th
Annual Information Technology, Electronics and Mobile Communication Conference
(IEMCON), Vancouver, BC, pp. 1186–1191 (2018)
13. Metcalf, L., Askay, D.A., Rosenberg, L.B.: Keeping humans in the loop: pooling knowledge
through artificial swarm intelligence to improve business decision making. Calif. Manag.
Rev. (2019). https://doi.org/10.1177/0008125619862256
14. Lee, R.M., Blank, G.: The SAGE Handbook of Online Research Methods, 2nd edn. SAGE
Publications, Thousand Oaks (2017). Edited by Nigel G. Fielding
15. Bartlett, J.E., et. al.: Organizational research: determining appropriate sample size in survey
research. Inf. Technol. Learn. Perform. J. 19, 43–50 (2001)
16. Schumann, H., Willcox, G., Rosenberg, L., Pescetelli, N.: Human swarming amplifies
accuracy and ROI when forecasting financial markets. In: IEEE International Conference on
Humanized Computing and Communication (HCC), Laguna Hills, CA, pp. 77–82 (2019)
17. Willcox, G., Askay, D., Rosenberg, L., Metcalf, L., Kwong, B., Liu, R.: Measuring group
personality with swarm AI. In: IEEE International Conference on Transdisciplinary AI
(TransAI), Laguna Hills, CA, pp. 10–17 (2019)
18. Willcox, G., Rosenberg, L.: Group sales forecasting, polls vs swarms. In: Future Technology
Conference (FTC), San Francisco, CA (2019)
A Two-Stage Machine Learning Approach
to Forecast the Lifetime of Movies
in a Multiplex

Abhijith Ragav1(B), Sai Vishwanath Venkatesh1, Ramanathan Murugappan2, and Vineeth Vijayaraghavan3

1 SRM Institute of Science and Technology, Chennai, India
{abhijithragav,saivishwanathv}@ieee.org
2 Madras Institute Of Technology, Chennai, India
ramanathanmurugappan@ieee.org
3 Solarillion Foundation, Chennai, India
vineethv@ieee.org

Abstract. Collecting over $2.1 billion annually, the cinema exhibition


industry contributes 55% of the total revenue towards the Indian film
industry. Selection of films is one of the most economically crucial deci-
sions in cinema exhibition. Film selection is incredibly complicated to
execute in India owing to its diverse demographic across regions and the
resulting behavioral complexity. Working with data from one of India’s
leading multiplexes, the authors offer a two-stage solution using machine
learning to predict if a movie would proceed to be screened in the follow-
ing week and the number of weeks it would continue to be screened if it
does. The estimation of a movie’s lifetime helps exhibitors to make intel-
ligent negotiations with distributors regarding screening and scheduling.
The authors introduce a new metric MLE to evaluate the error in pre-
dicting the remaining lifetime of a film. The approach proposed in this
paper surpasses the existing system of lifetime prediction and consequent
selection of movies, which is currently performed based on intuition and
heuristics.

Keywords: Machine learning · Feature engineering · Movie lifetime forecasting · Film industry

1 Introduction
The media and entertainment sector in India represents approximately 1% of the
country’s GDP with a $15.6 billion value estimate. Among this, the Indian film
industry grosses $3.8 billion and is constantly growing with a compound annual
growth rate of over 10% in the past years. This paper focuses on the most
economically valuable subset of the Indian film Industry - Cinema exhibition.
According to a Deloitte report on the economic contribution of the film industry
in the year 2017 [1], more than 50% of the gross share of the film industry is
attributed to cinema exhibition. Cinema exhibition in India is carried out either


through multiple screen theaters (multiplexes) or single screen theaters.
Organizing cinema exhibitions through multiplexes is an arduous task, espe-
cially in India due to the added complexity of the market. Cinema Exhibition in
India deals with more than 2500 films produced nationally every year across 20
languages in different regions, apart from an average of 700 films produced inter-
nationally that are also screened. In contrast to the domestic box office of US
and Canada which majorly consists of Hollywood, the Indian domestic box office
is segmented based on region and language [1] with the majority of the box office
revenue collected by Bollywood (Hindi Language), Kollywood (Tamil Language)
and Tollywood (Telugu Language). Box office revenue collected by Bollywood
accounts for 34% of the total revenue while Kollywood and Tollywood account
for 15% and 13% respectively. In India, cinema exhibition is carried out through
more than 2000 multiplexes across the country and 6000 scattered single screen
theaters with factors that differ extensively from region to region. Typically,
each multiplex will have to cater to movies from at least 3 distinct languages in
addition to movies from languages specific to the multiplex’s region.
According to Jehoshua Eliashberg et al. [2], whose research work is based on a multiplex in the Netherlands, the programming problem faced by a multiplex
can consist of two stages: (i) the selection of movies to be screened and (ii) the
scheduling of these movies over screens, days and times of the day. Conventionally
(i) is handled by an expert at the multiplex on a weekly basis. It is one of the
most economically valuable stages in a multiplex since the dividends of the ticket
revenue of selected films are shared by both the exhibitors and the distributors.
The factors that affect the expert’s decision include the distributor’s pitch for
the film, analysis, intuition, internal policies, previous occupancy statistics, and
existing exhibitor-distributor agreements.
The expert consolidates this information to construct the projected lifetime
of a film. Loss in revenue can be attributed to underestimating the lifetime of a
movie and/or scheduling movies that won’t provide maximum returns for a week.
This also leads to a loss in potential profit from movies that could perform better but fail to be selected for scheduling in the week. This emphasizes the
need for accurate lifetime estimation in multiplexes.
This paper focuses on one of the leading multiplexes operating at a metropoli-
tan city in India. The multiplex under consideration schedules movies across the
screens on a weekly basis. The authors of this paper define the lifetime of a
movie to be the total number of weeks the movie is screened within a multiplex
since its initial screening. The authors offer a two-stage solution using machine
learning to predict if a movie would proceed to be screened the following week
and the number of weeks it would continue to be screened if it does.
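The models and features actually used are described later in the paper; the sketch below only illustrates the two-stage structure with scikit-learn, a classifier for whether a movie is screened the following week, followed by a regressor trained on the continuing movies for the remaining weeks. The feature columns, model choices and toy data are assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Each row is one (movie, business week); illustrative features such as
# [fraction of seats filled, weeks since release, average ticket price]
X = np.array([[0.72, 1, 150.0], [0.31, 3, 120.0], [0.55, 2, 140.0], [0.12, 5, 110.0]])
y_continues = np.array([1, 0, 1, 0])   # Stage 1 target: screened next week?
y_remaining = np.array([3, 0, 1, 0])   # Stage 2 target: remaining lifetime in weeks

stage1 = RandomForestClassifier(random_state=0).fit(X, y_continues)

mask = y_continues == 1                # Stage 2 is trained only on movies that continue
stage2 = RandomForestRegressor(random_state=0).fit(X[mask], y_remaining[mask])

def predict_lifetime(x_week):
    """Return 0 if the movie is predicted to drop, else the predicted remaining weeks."""
    if stage1.predict([x_week])[0] == 0:
        return 0
    return stage2.predict([x_week])[0]

print(predict_lifetime([0.65, 2, 145.0]))
```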
This paper is organized as follows: Sect. 2 discusses the relevant research done
across the Film Industry. The dataset under consideration is briefly described
in Sect. 3 and further explored and analyzed in Sect. 4, based on which intu-
itive features are engineered. Section 5 highlights the training methodology and
approaches employed in this paper while Sect. 6 compares the efficiencies of the
approaches discussed.

2 Related Work
The Film Industry has been an area of active research for several years. Ajay
Shiva Santhosh Reddy et al. [3] deal with box office performance of movies based
on Hype Analysis using Twitter. Hype is calculated using information pertaining
to the number of tweets relating to the movie, the total number of tweets from distinct
users and the follower count of each user. Box-office collection is predicted
by multiplying the hype factor with the number of shows screened during the
first weekend. Ramesh Sharda and Dursun Delen [4] in one of their works have
dealt with the prediction of how successful a movie turns out to be. The target
variable has been divided into nine classes, ranging from ‘flop’ to ‘blockbuster’
based on the movie’s box-office receipts. The classification problem is tackled
using a Neural Network architecture with features including presence of a star
actor, the genre of the movie, number of screens allocated to the movie, etc.
Sameer Ranjan Jaiswal and Divyansh Sharma [5] have come up with a similar
model specifically targeting Bollywood movies. They utilize a feature named
‘music score’, a characteristic factor for Bollywood movies that greatly improves
the performance of the model.
Andrew Ainslie et al. [6] propose an interesting concept in which they ana-
lyze box-office sales in the context of a market share model. They claim that
the number of screens allotted during the opening week are overestimated in
traditional models. The work also specifies that the actors have a direct effect
with the customers’ movie choice while the director has an indirect effect.
The majority of the research work revolving around movies and theaters focuses on
predicting whether a movie is successful based on box office revenue. While most
of the authors consider external factors such as tweets and market share, they
do not consider local behavioral factors such as the behavior of the crowd, the
operational pattern of the multiplex and seasonal characteristics. Each state in
India has a diverse demographic with varying regional languages. Consequently,
movie preferences change vastly from state to state. Hence, behavioral factors
are crucial in analyzing the success of a movie in the locality of the multiplex.
The authors of this paper predict the lifetime of movies as a measure of
success using local behavioral factors as illustrated in Sect. 4. Moreover, accu-
rate estimation of the lifetime of movies within a multiplex helps the exhibitors
maximize profits by making smarter negotiations with distributors. Additionally,
selecting movies that would continue to screen next week helps the multiplex in
scheduling.

3 Dataset
Operating 17 multiplexes across 10 cities, each with different cultural, language
and ethnic backgrounds, the multiplex under consideration is one of India’s lead-
ing cinema exhibitors. The dataset consists of over 15 million records and has been previously used for food sales forecasting [7]. It contains information on transactions from 2015–2017 pertaining to movie ticket purchases from
one of the 17 diverse multiplexes. A show is defined by the authors as a film


being played on screen at a unique time. The fields considered from the dataset
are illustrated in Table 1.

Table 1. Transaction data

Field: Description
Film String Code: Unique identification key for the film
Screen name: Unique name assigned to the screen
Session datetime: Date and time of the film screening
Session seats reserved: Number of seats reserved by the multiplex for every show
Show number: The corresponding time slot of the show in the day
Transaction value: The payment amount remitted by the customer for the transaction
Transaction datetime: Date and time of the transaction
Seats per transaction: The number of seats sold in the transaction
Transaction ID: Unique identification key for each transaction

Fig. 1. Frequency distribution of lifetime in weeks

The multiplex offers special screenings and arrangements for films to screen
exclusively for a maximum of 4 days. For this reason, the authors only consider
movies with at least 5 days of screening. The transaction data also labels tickets
that are canceled by the multiplex due to unforeseen circumstances. Hence to
maintain the integrity of the dataset, the canceled tickets are removed for further
analysis.
Since the multiplex releases movie schedules for the following week each
Wednesday, the authors define a business week to start from Wednesday to
deliver predictions prior to scheduling. The transaction records for over 2500
movies are aggregated based on our definition of a business week. Each record
contains fields specific to a movie in a business week, such as the number of seats
filled in the week and the number of weeks since the movie's release. Some of these
fields will be discussed in detail in Sect. 4.
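To make this aggregation step concrete, the following is a minimal sketch (not the authors' code) of grouping transactions into Wednesday-anchored business weeks with pandas; the column names film_code, session_datetime and seats_per_transaction are hypothetical stand-ins for the fields in Table 1.

```python
# Hedged sketch: aggregate ticket transactions into Wednesday-anchored business
# weeks. Column names are assumptions loosely mirroring Table 1.
import pandas as pd

def to_business_week(ts: pd.Timestamp) -> pd.Timestamp:
    """Map a timestamp to the Wednesday that starts its business week."""
    offset = (ts.weekday() - 2) % 7  # Monday=0 ... Wednesday=2
    return (ts - pd.Timedelta(days=offset)).normalize()

def aggregate_weekly(tx: pd.DataFrame) -> pd.DataFrame:
    tx = tx.copy()
    tx["business_week"] = tx["session_datetime"].apply(to_business_week)
    weekly = (tx.groupby(["film_code", "business_week"])
                .agg(seats_filled=("seats_per_transaction", "sum"),
                     shows=("session_datetime", "nunique"))
                .reset_index())
    # Number of business weeks since the movie first appears in the data.
    first_week = weekly.groupby("film_code")["business_week"].transform("min")
    weekly["weeks_since_release"] = (weekly["business_week"] - first_week).dt.days // 7
    return weekly
```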

The box plot shown in Fig. 1 represents the distribution of lifetime for all the
films considered. The average lifetime is observed to be 2.5 weeks, with 75% of
movies having a lifetime ≤ 3 weeks.

4 Feature Engineering, Extraction and Analysis


As discussed in Sect. 2, it is very important to consider local behavioral aspects
while delivering predictions. Therefore, the authors have engineered features
based on:
– Crowd behavior
– Screening behavior
– Seasonal and Regional behavior
– Movie specific aspects

4.1 Crowd Behavioral Features


Percentage of Seats Filled (PSF). PSF represents the ratio of the Number
Of Occupants (NOC) to the total seats scheduled by the multiplex for the movie
in a particular week (CAP). This is a strong indicator of the popularity of the
movie among the cinema audience. As observed from Fig. 2(a), the mean value of
PSF across all movies tends to reduce as they tend to age. As the PSF declines
for a movie, the multiplex reduces the CAP for the movie to minimize idle
seats. When PSF tends to saturation, the movie is observed to be dropped from
screening.
PSF_avg(i) = (1/n) Σ_{j=0}^{n} [ NOC_j(i) × 100 / CAP_j(i) ]        (1)

Fig. 2. Variation of PSF (a) and average LB (b) across weeks since release of movies

where,
i represents a unique movie,
n represents the max number of weeks the movie has been screened.
The mean of PSF (PSFavg ) for a movie is a measure of consistency for a
movie’s occupancy. While PSF is a short term indicator of the film’s popularity,
PSFavg serves as a long term indicator for measuring the performance of a movie
in theaters.
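As a small illustration of Eq. (1), PSF and its long-term mean could be computed as below; the DataFrame weekly and its columns noc (occupants) and cap (scheduled seats) are assumed names for illustration, not part of the original work.

```python
# Hedged sketch of Eq. (1): weekly PSF and its mean over the weeks a movie has screened.
import pandas as pd

def add_psf(weekly: pd.DataFrame) -> pd.DataFrame:
    weekly = weekly.copy()
    # Short-term indicator: share of scheduled seats actually filled in the week.
    weekly["psf"] = weekly["noc"] * 100.0 / weekly["cap"]
    # Long-term indicator: mean PSF over the weeks the movie has been screened.
    weekly["psf_avg"] = weekly.groupby("film_code")["psf"].transform("mean")
    return weekly
```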
Late-Bookings (LB). LB represents the ratio of the seats booked at the closing
of a show’s transaction window (defined by the authors as after 3 PM of the
previous day of the show screening) to the total seats allocated for the screening.
It is important to know that the multiplex facilitates offline in-person bookings
besides the conventional online bookings. The rate with which the seats are filled
typically represents the zeal for a film. Hence people tend to book tickets well in
advance of the show to ensure they get a seat. Late bookings also account for last
minute transactions that occur moments before a show is screened. Increase in
the average number of late bookings per week (in Fig. 2(b)) indicates a depleting
interest for movies. This inference is drawn based on the comparable reduction
in average occupancy per week as seen in Fig. 2(a).
Relative Occupancy (RO). A higher number of occupants for a particular
movie may require the multiplex to increase the capacity for it, consequently
reducing the capacity for another movie in the week. ROj (i) specifies the share
of seats held by the ith movie among other movies screening in the jth week at
the multiplex and is calculated by Eq. (2). If the ROj (i) is 5%, it implies that
the seats booked for the ith movie holds 5% of all seats booked in the multiplex
for the jth week. The relative occupancy for the multiplex ranges from 0.007% to
81% with an average RO of 4% observed. Movies with consistently higher RO
values have a higher probability of screening in the multiplex the next week.
A smoothed histogram in Fig. 3 illustrates the distribution of movies. It can be
observed that for movies failing to screen the next week, the average RO is 0.8%
and ranges from 0.007% to 18%. Meanwhile, the movies that screen the next
week have a relatively greater average (5.2%) and a larger range (0.14%–81%).

RO_j(i) = NOC_j(i) × 100 / Σ_{k=0}^{n} NOC_j(k)        (2)

where,
i represents a unique movie,
j represents a particular week,
k represents a movie in the week,
n represents the total number of movies in the week.
Frequency of Seats Booked per Transaction (SBT). The number of seats
booked in a single transaction, SBT_m (where m represents the number of bookings),
sheds light on the response of the incoming crowd towards a film, and
in turn characterizes the film. The multiplex facilitates the booking of multiple
seats in a single transaction. Figure 4 represents the correlation between SBT

Fig. 3. Relative Occupancy histogram

and the probability of movies continuing to screen the following week. It can be
observed that the occurrence of higher values of SBT (SBT7 and above) indicate
a very high probability for movies to continue to be screened.

Fig. 4. Seats booked per transaction distribution

History Features. Since the model operates on a weekly basis to make pre-
dictions, it is important that we supply short term memory features to help the
model understand the variations in the behavior across the past week. Therefore,
7 history points corresponding to the days in the prior week are provided for
occupancy features such as RO, LB and PSF.

4.2 Screening Behavioral Features


Days Screened in a Week (DSW). It denotes the number of days a movie has
been screened in the past week. The multiplex can arrange for a movie to screen
any number of days through the week based on factors including accommodating
new releases in the week, holidays, as well as crowd behavioral factors discussed
in Sect. 4.1. DSW captures these dynamics and upon observing its behavior from
Fig. 5, it can be inferred that movies screened for just 1 or 2 days in the past
week have a high probability of getting dropped from screens by the next week
in contrast to films having higher days of screening (5–7 days), thus illustrating
its capability to provide short-term forecasting insight.

Fig. 5. Behavior of DSW

4.3 Seasonal and Regional Behavioral Features


Time of the Year. Seasonality is a vital factor when it comes to modeling
regional crowd behavior across time. It refers to characteristics displayed by data
that happen to recur in a defined periodic cycle. It can be observed in short term
cycles (weekly) such as weekend and weekday behavior or in long term periodic
cycles (yearly) such as festival holidays in a calendar year. The data showcases
periodicity across a few weeks through the year as seen in Fig. 6. Week-based
seasonality is crucial since the multiplex requires the scheduling information offered
by our predictions on a weekly basis.
Language. The population is highly diverse owing to various demographic
factors attributing to local behavior. This feature describes language prefer-
ence across the region. The languages considered were broadly split into Tamil
(Regional language), English, Hindi (National language) and Others. The lan-
guage split displays a dominance for the regional language followed by English as
shown in Fig. 7. This information assists in labeling the character of the incoming
crowd.

Fig. 6. Seasonality in weeks

Fig. 7. Language split across the movies

4.4 Movie Specific Features


Genre and Runtime. Genres represent the themes showcased by a movie.
Drama, Comedy and Romance were observed to be the most popular genres in
the dataset and hence were considered for analysis. In contrast to movies with
a single genre, movies with an intersection between one or more genres were
observed to have a greater mean movie lifetime as shown in Fig. 8. Additionally,
Runtime indicates the scheduled running length of a movie.

New Releases in a Week (NRW). A new release can pivot the sales and
demand of movies currently running. NRW refers to the number of new movie
releases normalized over the total number of movies screening in the week.

Fig. 8. Average lifetime of genres (in weeks)

5 Methodology

The authors make predictions at the start of a business week for two different
use cases. The first is binary classification, to flag whether a particular movie
screening in the week will continue to screen the following week (classification use
case). For the movies predicted to screen the following week, a regressor engine
predicts how long the movie will continue to be screened in weeks (regression
use case).

5.1 Training and Testing

Booking transaction data from the years 2015, 2016 and 2017 were considered
for analysis of movie lifetime prediction. The Booking behavior was observed
to change within seasons through the years considered. The domain experts
attributed this to the volatile nature of the data considered and further specified
that this volatility consistently prevailed over the last decade. Therefore, the
best way to model such data is to consider a uniform split across the 3 years for
training and testing. Hence, the authors allocated 70% data from each year for
training the model and the rest was considered as testing data. This way, the
model is able to capture the behavior across all the years considered.
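A minimal sketch of such a year-wise 70/30 split is shown below, assuming a records DataFrame with a year column; the exact sampling procedure used by the authors is not specified, so this is only one plausible implementation.

```python
# Hedged sketch: uniform 70/30 train/test split within each year (2015-2017).
import pandas as pd

def split_by_year(records: pd.DataFrame, train_frac: float = 0.7, seed: int = 0):
    train_parts, test_parts = [], []
    for _, group in records.groupby("year"):
        train = group.sample(frac=train_frac, random_state=seed)
        test = group.drop(train.index)
        train_parts.append(train)
        test_parts.append(test)
    return pd.concat(train_parts), pd.concat(test_parts)
```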

5.2 Feature Scaling and Standardization


Feature Scaling is the process of normalizing a feature to a defined scale. It is
very important for predictive model optimization, especially for deep networks, as it
offers faster convergence [8]. Occupancy features such as PSF, LB
and RO need to be normalized based on the capacity allocated since a raw value
of occupancy would not provide much insight to the predictive model without
context. This is due to the fact that the multiplex under consideration has screens
with capacities varying from 110 seats to 310 seats. Hence, the features have been
normalized to values that lie between [0,1].
The features were standardized using Standard Scaler, which transforms the
data based on the mean and standard deviation of the feature such that the
resulting distribution has a mean of 0 and variance of 1.
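A sketch of this standardization step with scikit-learn's StandardScaler follows; fitting only on the training data is our assumption of good practice rather than a detail stated by the authors, and the column list is a placeholder.

```python
# Hedged sketch of the standardization step described above.
from sklearn.preprocessing import StandardScaler
import pandas as pd

def scale_features(train: pd.DataFrame, test: pd.DataFrame, numeric_cols):
    scaler = StandardScaler()
    train_scaled, test_scaled = train.copy(), test.copy()
    # Fit only on training data to avoid leaking test statistics.
    train_scaled[numeric_cols] = scaler.fit_transform(train[numeric_cols])
    test_scaled[numeric_cols] = scaler.transform(test[numeric_cols])
    return train_scaled, test_scaled
```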

5.3 Single Model Approach


In this approach, the authors aim to solve both the classification and the regres-
sion use cases using a single machine learning model. As the authors use a
regressor for the classification task, the predictions are transformed based on a
threshold due to the continuous nature of the regressor output (which is in terms
of weeks).
Since the threshold used can be subject to bias, the authors consider all pos-
sible thresholds in a week to determine the best. A threshold labels a movie that
screens the next week as a movie that screens a minimum of ‘n’ days, where
n ≤ 7. A threshold of ‘3’ implies that a movie is considered to screen next week
only if it is predicted to screen for a minimum of 3 days (0.428 weeks). Deep Neu-
ral Network (DNN), Extreme Gradient Boosting (XGB) and ET (Extra Trees)
regressors were considered and their performance is summarized and presented
in Sect. 6.2.
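The thresholding itself reduces to comparing the predicted remaining lifetime (in weeks) against n/7 weeks, as in the small sketch below (names are illustrative).

```python
# Hedged sketch: convert a regressor's predicted lifetime (weeks) into the
# binary "will screen next week" flag for a day threshold n (n/7 weeks).
import numpy as np

def threshold_predictions(pred_weeks: np.ndarray, n_days: int) -> np.ndarray:
    assert 1 <= n_days <= 7
    return (pred_weeks >= n_days / 7.0).astype(int)

# Example: a threshold of 3 days (0.428 weeks), as in the text.
# flags = threshold_predictions(model.predict(X_test), n_days=3)
```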

5.4 Two Model Approach


Although a single model for both use cases reduces the complexity of the solution,
it fails to capture the specificity of the requirements posed by each (Sect. 6.3).
While the classification use case mainly requires short term features such as
DSW (Sect. 4.2), regressors require a combination of both long term and short
term information to make a continuous estimate. This approach deals with the
use of separate classifier and regressor models for the discussed use cases.

Classification. A standalone model for classification is considered because of


the volatile nature of the data. Since the prediction is pipe-lined, the error caused
by the classifier is propagated onto the regressor engine. Therefore, it is very
important to minimize the error caused by the classifier. An Extra Trees classifier
is considered by the authors since it is the best performing model as seen in
Sect. 6.2.

Regression. Regression is carried out to forecast the remaining lifetime for


movies that are predicted to screen next week by the classifier. DNN, XGB and
ET regressors were considered for this use case.
The DNN architecture used consisted of 3 dense hidden layers with varying
number of neurons, each having a supporting dropout layer. In addition to preventing
the proposed DNN from overfitting, the dropout layers add controlled noise
to the model to improve its capability of generalization [9]. The performance of
these models is evaluated and compared in Sect. 6.2.
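A hedged Keras sketch of such a network is given below; the layer widths, dropout rates, loss and optimizer are illustrative assumptions, since the paper does not report these values.

```python
# Illustrative sketch of the regression DNN: three dense hidden layers, each
# followed by dropout. Hyperparameters are assumptions, not the authors' values.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

def build_lifetime_regressor(n_features: int) -> Sequential:
    model = Sequential([
        Dense(128, activation="relu", input_shape=(n_features,)),
        Dropout(0.3),
        Dense(64, activation="relu"),
        Dropout(0.3),
        Dense(32, activation="relu"),
        Dropout(0.3),
        Dense(1),  # remaining lifetime in weeks
    ])
    model.compile(optimizer="adam", loss="mae")
    return model
```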

6 Results and Observation


6.1 Metrics
Classification. Precision, Recall and F1 score are considered as the primary
metrics for classification.

Table 2. Thresholds applied to regressors. Each cell reports (Movie will not screen / Movie will screen / Mean).

Days  Metric     DNN                 XGB                 ET
1     Precision  0.97 / 0.91 / 0.93  0.99 / 0.84 / 0.89  1.00 / 0.90 / 0.93
1     Recall     0.78 / 0.99 / 0.93  0.55 / 1.00 / 0.87  0.74 / 1.00 / 0.92
1     F1 score   0.86 / 0.95 / 0.92  0.71 / 0.91 / 0.87  0.85 / 0.95 / 0.92
2     Precision  0.96 / 0.94 / 0.95  0.98 / 0.88 / 0.91  0.99 / 0.92 / 0.94
2     Recall     0.85 / 0.98 / 0.95  0.67 / 0.99 / 0.90  0.79 / 1.00 / 0.94
2     F1 score   0.90 / 0.96 / 0.94  0.80 / 0.93 / 0.89  0.88 / 0.96 / 0.93
3     Precision  0.95 / 0.95 / 0.95  0.96 / 0.91 / 0.93  0.98 / 0.93 / 0.95
3     Recall     0.88 / 0.98 / 0.95  0.77 / 0.99 / 0.92  0.82 / 0.99 / 0.94
3     F1 score   0.91 / 0.96 / 0.95  0.86 / 0.95 / 0.92  0.90 / 0.96 / 0.94
4     Precision  0.93 / 0.95 / 0.95  0.94 / 0.93 / 0.93  0.98 / 0.94 / 0.95
4     Recall     0.88 / 0.97 / 0.95  0.82 / 0.98 / 0.93  0.86 / 0.99 / 0.95
4     F1 score   0.91 / 0.96 / 0.95  0.88 / 0.95 / 0.93  0.91 / 0.97 / 0.95
5     Precision  0.91 / 0.95 / 0.94  0.89 / 0.94 / 0.93  0.97 / 0.95 / 0.96
5     Recall     0.89 / 0.96 / 0.94  0.86 / 0.96 / 0.93  0.88 / 0.99 / 0.96
5     F1 score   0.90 / 0.96 / 0.94  0.88 / 0.95 / 0.93  0.92 / 0.97 / 0.95
6     Precision  0.88 / 0.96 / 0.94  0.82 / 0.95 / 0.91  0.96 / 0.96 / 0.96
6     Recall     0.90 / 0.95 / 0.93  0.89 / 0.92 / 0.91  0.91 / 0.98 / 0.96
6     F1 score   0.89 / 0.95 / 0.93  0.85 / 0.93 / 0.91  0.93 / 0.97 / 0.96
7     Precision  0.77 / 0.97 / 0.91  0.77 / 0.96 / 0.90  0.88 / 0.97 / 0.95
7     Recall     0.90 / 0.91 / 0.90  0.90 / 0.89 / 0.89  0.93 / 0.95 / 0.94
7     F1 score   0.90 / 0.88 / 0.90  0.83 / 0.92 / 0.89  0.91 / 0.96 / 0.94

Regression. Since the regressors forecast the number of weeks a movie will
continue to screen, the error in lifetime is measured in terms of weeks. The
Movie Lifetime Error (MLE) is calculated as shown in Eq. 3.

MLE = |lifetime_actual − lifetime_predicted|        (3)

where lifetime_actual represents the actual lifetime remaining for the movie and
lifetime_predicted represents the predicted lifetime remaining for the movie.
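Equation (3) and the bucketed reporting used later in Table 5 can be sketched as follows (array names are illustrative).

```python
# Hedged sketch: Movie Lifetime Error (Eq. 3) and its distribution over buckets
# [0,1), [1,2), [2,3), [3,4) and >4 weeks.
import numpy as np

def mle_distribution(actual_weeks: np.ndarray, predicted_weeks: np.ndarray):
    mle = np.abs(actual_weeks - predicted_weeks)
    bins = [0, 1, 2, 3, 4, np.inf]
    counts, _ = np.histogram(mle, bins=bins)
    return counts / len(mle)  # fraction of predictions per MLE bucket
```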

6.2 Single Model Approach Classification Results


Various thresholds (defined in Sect. 5.3) were applied to convert the output of a
single regressor to a classifier output. Table 2 represents the performance of Deep
Neural Network (DNN), Extreme Gradient Boosting (XGB) and ET models
across the 7 thresholds considered. The best results for each of the considered
models are shaded in green as observed from Table 2. ET outperformed the other
models with an average F1 score of 0.96 with an applied threshold of ‘6 days’.

6.3 Two Model Approach Classification Results


A standalone ET classifier provides the best results as seen from Table 3. To find
the algorithm that poses minimum loss for the classification task, the percentage
of wrong predictions is considered in Table 4. The ET classifier provides the
least error, with 1.21% of wrong predictions when predicting if a movie is screened
the next week and 8% of wrong predictions when determining if a movie is not
screened the next week. It is evident that the ET classifier provides exceptional
performance when compared to other models tried out in Sect. 5.3. Not only does
the ET classifier provide a marginally better F1 score, but also predicts with a
lower percentage of wrong predictions. The ET classifier has the capability to be
less sensitive to noise and volatile behavior by offering variable smoothing using
the n_min parameter [10].

Table 3. Classification best results. Each cell reports (Movie will not screen / Movie will screen / Mean).

Metric     DNN regressor       XGB regressor       ET regressor        ET classifier
Precision  0.93 / 0.95 / 0.95  0.89 / 0.94 / 0.93  0.96 / 0.96 / 0.96  0.97 / 0.92 / 0.94
Recall     0.88 / 0.97 / 0.95  0.86 / 0.96 / 0.83  0.91 / 0.98 / 0.96  0.97 / 0.99 / 0.98
F1 score   0.91 / 0.96 / 0.95  0.88 / 0.95 / 0.93  0.93 / 0.97 / 0.96  0.97 / 0.97 / 0.97

Table 4. Percentage of wrong predictions

Models used     Movies screening next week (%)   Movies not screening next week (%)
XGB regressor   5.52                             16.35
DNN regressor   4.17                             9.94
ET regressor    2.02                             10.26
ET classifier   1.21                             8.01

6.4 Regression Results


The performance of three regressor models, namely DNN, XGB and ET, is
compared in Table 5. The DNN clearly performs better than the other two models
considered, predicting with an MLE of less than 1 week 65% of the time and less
than 2 weeks 85% of the time.

Table 5. Regression results

MLE range (weeks)   DNN (%)   XGB (%)   ET (%)
[0,1)               64.85     59.00     55.16
[1,2)               19.10     22.48     28.44
[2,3)               7.82      10.71     10.58
[3,4)               3.45      4.10      3.17
>4                  4.77      3.70      2.65

7 Conclusion
Of the two approaches considered, the Two Model Approach (Sect. 5.4) provides
maximum accuracy and is the optimal solution for the two business use-cases
discussed. The standalone ET classifier performs better than the regressors trans-
formed to do the classification task. The model provides an accuracy of 97% for
the classification use-case, which helps the multiplex accurately schedule movies
on a weekly basis. The regression use case is performed using the DNN due to its
superior performance as shown in Sect. 6.4. The DNN is more robust to outliers
and is able to capture the non-linear trends present in the data.
Currently, the multiplex scheduling experts consider the admissible range of
error to be within 2 weeks. They estimate the lifetime and films screening the
next week based on empirical methods and heuristics. Our approach estimates
the remaining lifetime correctly with less than 2 MLE 85% of the time. These
results set a benchmark for the experts in the domain regarding lifetime esti-
mation. Since our method is the first of its kind, it will be tested in real-world
circumstances in the next revision of strategies by the considered multiplex.

References
1. Deloitte: Economic Contribution of the Film and Television Industry in India,
2017 (2018). Accessed from https://www.mpa-i.org/wp-content/uploads/2018/
05/India-ECR-2017 Final-Report.pdf
2. Eliashberg, J., Hegie, Q., Ho, J., Huisman, D., Miller, S.J., Swami, S., Weinberg,
C.B., Wierenga, B.: Demand-driven scheduling of movies in a multiplex. Int. J.
Res. Mark. 26(2), 75–88 (2009). ISSN 0167-8116
3. Sivasantoshreddy, A., Kasat, P., Jain, A.: Box-Office opening prediction of movies
based on hype analysis through data mining. Int. J. Comput. Appl. 56(1), 1–5
(2012)
4. Sharda, R., Delen, D.: Predicting box-office success of motion pictures with neural
networks. Exp. Syst. Appl. 30(2), 243–254 (2006)
5. Jaiswal, S.R., Sharma, D.: Predicting success of bollywood movies using machine
learning techniques. In: Proceedings of the 10th Annual ACM India Compute Con-
ference (2017)
6. Ainslie, A., Drèze, X., Zufryden, F.: Modeling movie life cycles and market share.
Mark. Sci. 24(3), 508–517 (2005)
7. Ganesan, V.A., Divi, S., Moudhgalya, N.B., Sriharsha, U., Vijayaraghavan, V.:
Forecasting food sales in a multiplex using dynamic artificial neural networks.
In: Arai, K., Kapoor, S. (eds.) Advances in Computer Vision. CVC: Advances in
Intelligent Systems and Computing, vol. 944. Springer, Cham (2019)
8. Huang, L., Liu, X., Liu, Y., Lang, B., Tao, D.: Centered Weight Normalization
in Accelerating Training of Deep Neural Networks. In: 2017 IEEE International
Conference on Computer Vision (ICCV), Venice, pp. 2822–2830 (2017)
9. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn.
Res. 15(1), 1929–1958 (2014). https://dl.acm.org/citation.cfm?id=2670313
10. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63,
3–42 (2006). https://link.springer.com/article/10.1007/s10994-006-6226-1
A Cost-Reducing Partial Labeling
Estimator in Text Classification Problem

Jiangning Chen, Zhibo Dai(B) , Juntao Duan, Qianli Hu, Ruilin Li,
Heinrich Matzinger, Ionel Popescu, and Haoyan Zhai

School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30313, USA


jchen444@math.gatech.edu, zdai37@gatech.edu

Abstract. The paper proposes a new approach to address text classification
problems when learning with partial labels is beneficial. Instead of offering each
training sample a set of candidate labels, researchers assign negative-oriented
labels to ambiguous training examples if they are unlikely to fall into certain
classes. Researchers construct two new maximum likelihood estimators with a
self-correction property, and prove that under some conditions the new estimators
converge faster. The paper also discusses the advantages of applying one of the
new estimators to a fully supervised learning problem. The proposed method has potential
applicability in many areas, such as crowd-sourcing, natural language
processing and medical image analysis.

Keywords: Machine learning · NLP · Text classification · Partial


label · Naive Bayes · Maximum likelihood estimator · Self correction ·
Cost reducing

1 Introduction

In some circumstances, the process of labeling is distributed among less-than-


expert assessors. Given that some data may belong to several classes by nature,
labeling hundreds of pictures, texts, or messages a day is
error-prone. The invention of partial labeling seeks to remedy the labor: instead
of assigning one or some exact labels, the annotators can offer a set of possible
candidate solutions for one sample, thus providing a buffer against potential mis-
takes [1,4,8,16,17,26]; other partial labeling settings involve repeated labeling
to filter out noises, or assessing the quality of the labelers [18,22] to enhance the
reliability of the models.
As the data size in companies such as FANG (Facebook, Amazon, Netflix,
Google) constantly reaches the magnitude of Petabyte, the demand for quick, yet
still precise labeling is ever growing. Viewing some practices, the partial labeling
frameworks that we know of exhibit some limitations. For instance, in a real-
world situation concerning NLP, if the task is to determine the class/classes of
one article, an annotator with a bachelor degree of American literature might find


it difficult to determine if an article with words dotted with ‘viscosity’, ‘gradi-


ent’, and ‘Laplacian’, etc. belongs to computer science, math, physics, chemistry,
or none of the classes above. As a result, the annotator might struggle within
some limited amount of time amid a large pool of label classes and is likely to
make imprecise choices even in a lenient, positive-oriented partial labeling envi-
ronment. Another issue is the cost. Repeated labeling and keeping track of the
performance of each labeler may be pricey, and the anonymity of the labelers
can raise another barrier to certain partial labeling approaches.
Taking real-world scenarios into consideration, researchers present a new method
to tackle the problem of how to gather partially correct information from diverse
annotators at a large scale, while remaining efficient and budget-friendly.
Consider again the text classification problem above: although that same annotator
might not easily distinguish which categories the above-mentioned article belongs
to, in a few seconds he/she can rule out the possibility that the article is
related to cuisines, TV-entertainment, or, based on his/her own expertise, novels.
crossed-off categories labeled by annotators can still be of benefit. Furthermore,
when contradictory labels are marked on one training sample and the identities
of the labelers unknown, our introduced self-correcting estimator can select, and
learn from the categories where the labels agree.
Based on this, the paper proposes a new way to formulate partial labeling.
For some documents, instead of having the exact labels, the information provided
is that they do not belong to certain classes. To make use of both kinds of data,
researchers propose two maximum likelihood estimators, one of which has a
self-correction property, to estimate the distribution of each class. By making
both types of labeled data contribute to the training process, researchers prove
that the new estimators converge faster than the traditional Naive Bayes estimator.
Researchers finally find a way to apply the new methods to a data set with only
positive labels, which is a fully supervised learning problem, and achieve a better
result compared to the traditional Naive Bayes.
The rest of this paper is organized as follows. Section 2 introduces the related
works about text classification and partial labeling. Section 3 introduces the
formulation of this problem, and conclude the result of traditional Naive Bayes
estimator. Section 4 introduces the main results of our estimator, as well as how
to apply it in fully supervised learning problem. Section 5 reports experimental
results of comparative studies. Finally, Sect. 6 concludes the paper and discusses
future research issues.

2 Related Work
The text classification problem is seeking a way to best distinguish different
types of documents [5,12,25]. Being a traditional natural language processing
problem, one needs to make full use of the words and sentences, converting them
into various input features, and applying different models to process training and
testing. A common way to convert words into features is to encoding them based

on the term frequency and inverse document frequency, as well as the sequence
of the words. There are many results about this, for example, tf-idf [19] encodes
term t in document d of corpus D as:
tfidf(t, d, D) = tf(t, d) · idf(t, D),
where tf(t, d) is the term frequency, computed as tf(t, d) = |{t : t ∈ d}| / |d|,
and idf(t, D) is the inverse document frequency, computed as
idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| ). We also have n-gram techniques, which first
combines n nearest words together as a single term, and then encodes it with
tf-idf. Recently, instead of using tf-idf, [21] defines a new feature selection score
for text classification based on the KL-divergence between the distribution of
words in training documents and their classes.
A popular model to achieve this target is to use Naive Bayes model [6,11,20],
the label for a given document d is given by:
label(d) = argmax_j P(C_j) P(d|C_j),

where Cj is the j-th class. For example, we can treat each class as a multino-
mial distribution, and the corresponding documents are samples generated by
the distribution. With this assumption, we desire to find the centroid for every
class, by either using the maximum likelihood function or defining other different
objective functions [2] in both supervised and unsupervised learning version [7].
Although the assumption of this method is not exact in this task, Naive Bayes
achieves high accuracy in practical problems.
There are also other approaches to this problem, one of which is simply
finding linear boundaries of classes with support vector machine [3,9]. Recurrent
Neural Network (RNN) [15,23] combined with word embedding is also a widely
used model for this problem.
In real life, one may have different type of labels [14], in which circumstance,
semi-supervised learning or partial-label problems need to be considered [4].
There are several methods to encode the partial label information into the learn-
ing framework. For the partial label data set, one can define a new loss combining
all information of the possible labels, for example, in [17], the authors modify
the traditional L2 loss

L(w) = (1/(n + m)) [ Σ_{i=1}^{n} l(x_i, y_i, w) + Σ_{i=1}^{m} l(x_i, Y_i, w) ],

where Yi is the possible label set for xi and l(xi , Yi , w) is a non-negative loss
function, and in [4], they defined convex loss for partial labels as:

L_Ψ(g(x), y) = Ψ( (1/|y|) Σ_{a∈y} g_a(x) ) + Σ_{a∉y} Ψ(−g_a(x)),

where Ψ is a convex function, y is a singleton, and ga (x) is a score function


for label a as input x. A modification of the likelihood function is as well an

approach to this problem and [8] gives the following optimization problem using
Naive Bayes method:

θ* = argmax_θ Π_i Σ_{y_i ∈ S_i} p(y_i | x_i, θ)

where Si is the possible labels for xi .


Meanwhile, the similarity of features among data could be considered to
give a confidence for each potential label of a given sample. In [24], K nearest
neighbor (KNN) is adopted to construct a graph structure with the information
of features while in [14] Rocchio and K-means clustering are used.

3 Formulation

3.1 General Setting

Consider a classification problem with independent sample x ∈ S and class set C,


where C = {C_1, C_2, ..., C_k}. Researchers are interested in finding a new estimator:

ŷ = f (x; θ) = (f1 (x; θ), f2 (x; θ), ..., fk (x; θ))

for y, where θ = {θ1 , θ2 , ..., θm } is the parameter, and fi (x; θ) is the likelihood
function of sample x in class Ci . Now assuming that in training set, we have two
types of dataset S1 and S2 , such that S = S1 ∪ S2 :

1. dataset S1 : we know exactly that sample x is in a class, and not in other


classes. In this case, define: y = (y1 , y2 , ..., yk ), if x is in class Ci , then yi = 1.
Notice that if this is a single label problem, then we have Σ_{i=1}^{k} y_i = 1.
2. dataset S2 : we only have the information that sample x is not in a class, then
yi = 0. In this case, define: z = (z1 , z2 , ..., zk ), if x is not in class Ci , we have
zi = 1.

To build the model, researchers define the following likelihood ratio function
and likelihood function:

L1(θ) = Π_{x∈S1} Π_{i=1}^{k} f_i(x; θ)^{y_i} · Π_{x∈S2} Π_{i=1}^{k} f_i(x; θ)^{(1−z_i)/(k−Σ_{j≠i} z_j)}.        (3.1)

L2(θ) = Π_{x∈S} ( Π_{i=1}^{k} f_i(x; θ)^{y_i(x)+t} / Π_{i=1}^{k} f_i(x; θ)^{z_i(x)} )
      = Π_{x∈S} Π_{i=1}^{k} f_i(x; θ)^{y_i(x)−z_i(x)+t}.        (3.2)

The t in L2 satisfies t > 1, which is a parameter to avoid non-convexity.
The intuition of L1 is to consider that a sample labeled z_i = 1 has equal
probability of being labeled in each of the other classes, so each of those classes
receives probability (1 − z_i)/(k − Σ_{j≠i} z_j). The intuition of L2 is to consider
this in a likelihood ratio way: a sample labeled z_i = 1 has a negative effect on
class C_i, so we put it in the denominator. With t > 1, all the terms in the
denominator will eventually be canceled out, so that even f_i(x; θ) = 0 for some
sample x ∈ S will not cause trouble. Another intuition for L2 is that it can
self-correct repeated data which has been labeled incorrectly.
Taking the logarithm of both sides, we obtain the following functions:

log(L1(θ)) = Σ_{x∈S1} Σ_{i=1}^{k} y_i(x) log f_i(x, θ)
           + Σ_{x∈S2} Σ_{i=1}^{k} (1 − z_i)/(k − Σ_{j≠i} z_j) log f_i(x, θ),        (3.3)

and

log(L2(θ)) = Σ_{x∈S} Σ_{i=1}^{k} (y_i(x) + t − z_i(x)) log f_i(x, θ).        (3.4)
The new estimators θ̂ are obtained by maximizing (3.3) or (3.4).

3.2 Naive Bayes Classifier in Text Classification Problem

For the Naive Bayes model, let class C_i have centroid θ_i = (θ_{i1}, θ_{i2}, ..., θ_{iv}),
where v is the total number of words and θ_i satisfies Σ_{j=1}^{v} θ_{ij} = 1. Assuming
independence of the words, the most likely class for a document d = (x_1, x_2, ..., x_v)
is computed as:

label(d) = argmax_i P(C_i) P(d|C_i)        (3.5)
         = argmax_i P(C_i) Π_{j=1}^{v} (θ_{ij})^{x_j}
         = argmax_i [ log P(C_i) + Σ_{j=1}^{v} x_j log θ_{ij} ].

So we have:

log f_i(d, θ) = log P(C_i) + Σ_{j=1}^{v} x_j log θ_{ij}.

For a class C_i, we have the standard likelihood function:

L(θ) = Π_{x∈C_i} Π_{j=1}^{v} θ_{ij}^{x_j},        (3.6)

Taking the logarithm of both sides, we obtain the log-likelihood function:

log L(θ) = Σ_{x∈C_i} Σ_{j=1}^{v} x_j log θ_{ij}.        (3.7)

We would like to solve the optimization problem:

max L(θ)        (3.8)
subject to: Σ_{j=1}^{v} θ_{ij} = 1,  θ_{ij} ≥ 0.        (3.9)

The problem (3.8) can be solved explicitly with (3.7) by Lagrange multipliers;
for class C_i we have θ_i = {θ_{i1}, θ_{i2}, ..., θ_{iv}}, where:

θ̂_{ij} = Σ_{d∈C_i} x_j / ( Σ_{d∈C_i} Σ_{j=1}^{v} x_j ).        (3.10)

For the estimator θ̂, we have the following theorem.

Theorem 1. Assume we have normalized length of each document, that is,
Σ_{j=1}^{v} x_j = m for all d. Then the estimator (3.10) satisfies the following properties:

1. θ̂_{ij} is unbiased.
2. E[|θ̂_{ij} − θ_{ij}|²] = θ_{ij}(1 − θ_{ij}) / (|C_i| m).

The proof of this theorem can be found in the appendix.
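For reference, the estimator (3.10) amounts to class-wise word-count normalization, as in the sketch below (smoothing is omitted to match the formula, every class is assumed non-empty, and the variable names are ours).

```python
# Hedged sketch of (3.10): per-class centroid from a document-term count matrix
# X (documents x vocabulary) and integer class labels y in {0, ..., k-1}.
import numpy as np

def naive_bayes_centroids(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    v = X.shape[1]
    theta = np.zeros((k, v))
    for i in range(k):
        counts = X[y == i].sum(axis=0)      # sum over documents d in C_i of x_j
        theta[i] = counts / counts.sum()    # normalize over the vocabulary
    return theta
```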

4 Main Results
From Theorem 1, we can see that the traditional Naive Bayes estimator θ̂ is an
unbiased estimator with variance O(θ_{ij}(1 − θ_{ij}) / (|C_i| m)). Researchers now
solve for the new estimators and prove that they can use the data in dataset S2
and perform better than the traditional Naive Bayes estimator.

4.1 Text Classification with L1 Setting (3.1)


In order to use data both in S1 and S2, we would like to solve (3.8) with L(θ) =
L1(θ), where L1 is defined in (3.1). Let

G_i = 1 − Σ_{j=1}^{v} θ_{ij};

by Lagrange multipliers, we have:

∂log(L1)/∂θ_{ij} + λ_i ∂G_i/∂θ_{ij} = 0,  ∀ 1 ≤ i ≤ k and ∀ 1 ≤ j ≤ v,
Σ_{j=1}^{v} θ_{ij} = 1,  ∀ 1 ≤ i ≤ k.

Plugging in, we obtain:

Σ_{x∈S1} y_i(x) x_j / θ_{ij} + Σ_{x∈S2} (1 − z_i(x))/(k − Σ_{l≠i} z_l(x)) · x_j / θ_{ij} − λ_i = 0,
Σ_{j=1}^{v} θ_{ij} = 1,  ∀ 1 ≤ i ≤ k,        (4.1)

where i and j range over 1 ≤ i ≤ k and 1 ≤ j ≤ v.

Solving (4.1), we get the solution of the optimization problem (3.8):

θ̂^{L1}_{ij} = [ Σ_{x∈S1} y_i(x) x_j + Σ_{x∈S2} (1 − z_i(x))/(k − Σ_{l≠i} z_l(x)) x_j ]
           / [ Σ_{x∈S1} y_i(x) Σ_{j=1}^{v} x_j + Σ_{x∈S2} (1 − z_i(x))/(k − Σ_{l≠i} z_l(x)) Σ_{j=1}^{v} x_j ].        (4.2)

Theorem 2. Assume we have normalized length of each document, that is,
Σ_{j=1}^{v} x_j = m for all d. Let Z_i(x) = (1 − z_i(x))/(k − Σ_{l≠i} z_l(x)) = K and
l_{ij} = E[x_j | Z_i = K]/m. Assume further that |{i : z_i(x) = 1}| is a constant for
all x ∈ S2. Then the estimator (4.2) satisfies the following properties:

1. θ̂^{L1}_{ij} is biased, with E[θ̂^{L1}_{ij} − θ_{ij}] = |R_i| K (l_{ij} − θ_{ij}) / (|C_i| + |R_i| K).
2. E[|θ̂^{L1}_{ij} − θ_{ij}|²] = O(1/(|S1| + |S2|)).
Proof. 1. We denote Z_i(x) = (1 − z_i(x))/(k − Σ_{l≠i} z_l(x)) = K, l_{ij} = E[x_j | Z_i = K]/m
and R_i = {x : z_i(x) = 0}. We have:

θ̂^{L1}_{ij} = ( Σ_{x∈S1} y_i(x) x_j + Σ_{x∈S2} Z_i(x) x_j ) / ( ( Σ_{x∈S1} y_i(x) + Σ_{x∈S2} Z_i(x) ) m )
           = ( Σ_{x∈S1} y_i(x) x_j + Σ_{x∈S2} Z_i(x) x_j ) / ( (|C_i| + |R_i| K) m ).

Moreover, assuming that p_i = P(y_i(x) = 1) = |C_i|/|S1| and q_i = P(z_i(x) = 0) = |R_i|/|S2|,
it holds that

E[θ̂^{L1}_{ij}] = 1/((|C_i| + |R_i| K) m) ( Σ_{x∈S1} E[y_i(x) x_j] + Σ_{x∈S2} E[Z_i(x) x_j] )
             = 1/((|C_i| + |R_i| K) m) Σ_{x∈S1} p_i E[x_j | y_i(x) = 1]
               + 1/((|C_i| + |R_i| K) m) Σ_{x∈S2} q_i K E[x_j | Z_i(x) = K]
             = ( |C_i| E[x_j | y_i(x) = 1] + |R_i| K E[x_j | Z_i(x) = K] ) / ( (|C_i| + |R_i| K) m )
             = ( |C_i| θ_{ij} m + |R_i| K l_{ij} m ) / ( (|C_i| + |R_i| K) m ).

Thus,

E[θ̂^{L1}_{ij} − θ_{ij}] = |R_i| K (l_{ij} − θ_{ij}) / ( |C_i| + |R_i| K ).

2. As for the second part, we have

(θ̂^{L1}_{ij})² = 1/((|C_i| + |R_i| K) m)² [ Σ_{α∈S1} Σ_{β∈S1} y_i(α) y_i(β) α_j β_j
               + Σ_{α∈S2} Σ_{β∈S2} Z_i(α) Z_i(β) α_j β_j
               + 2 Σ_{α∈S1} Σ_{β∈S2} y_i(α) Z_i(β) α_j β_j ].

Then, by introducing C = (|C_i| + |R_i| K) m and L_{ij} = E[x_j² | Z_i(x) = K], it is
true that

E[(θ̂^{L1}_{ij})²] = 1/C² [ Σ_{x∈S1} E[y_i²(x) x_j²] + Σ_{α,β∈S1, α≠β} E[y_i(α) α_j] E[y_i(β) β_j]
                  + Σ_{x∈S2} E[Z_i²(x) x_j²] + Σ_{α,β∈S2, α≠β} E[Z_i(α) α_j] E[Z_i(β) β_j]
                  + 2 Σ_{α∈S1, β∈S2} E[y_i(α) α_j] E[Z_i(β) β_j] ]
                = 1/C² [ |C_i| m θ_{ij} (1 − θ_{ij} + m θ_{ij}) + (|S1|² − |S1|) p_i² m² θ_{ij}²
                  + |R_i| K² L_{ij} + (|S2|² − |S2|) K² q_i² m² l_{ij}²
                  + 2 |C_i| |R_i| m² K θ_{ij} l_{ij} ]
                = 1/C² [ |C_i| m θ_{ij} (1 − θ_{ij} + m θ_{ij}) − |S1| p_i² m² θ_{ij}²
                  + |R_i| K² L_{ij} − |S2| K² q_i² m² l_{ij}² ]
                  + ( ( |C_i| θ_{ij} m + |R_i| K l_{ij} m ) / ( (|C_i| + |R_i| K) m ) )².

Using the fact that E[(θ̂^{L1}_{ij} − E[θ̂^{L1}_{ij}])²] = E[(θ̂^{L1}_{ij})²] − (E[θ̂^{L1}_{ij}])², we can
conclude that

E[(θ̂^{L1}_{ij} − E[θ̂^{L1}_{ij}])²] = 1/((|C_i| + |R_i| K)² m²) [ |C_i| m θ_{ij} (1 − θ_{ij} + m θ_{ij})
                                   − |S1| p_i² m² θ_{ij}² + |R_i| K² L_{ij} − |S2| K² q_i² m² l_{ij}² ]
                                 = O( 1 / (|S1| + |S2|) ).

Comparing θ̂_{ij} and θ̂^{L1}_{ij}, we can see that even though our estimator is biased,
the variance of θ̂^{L1}_{ij} is significantly smaller than the variance of θ̂_{ij}, which means
that by using the negative sample set, θ̂^{L1}_{ij} converges much faster than the original
Naive Bayes estimator θ̂_{ij}.
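A hedged NumPy sketch of the estimator (4.2) is given below; X1 and X2 are document-term count matrices for S1 and S2, Y and Z the corresponding positive/negative indicator matrices, all hypothetical names of our own choosing.

```python
# Hedged sketch of (4.2): for x in S2 the weight of class i is
# (1 - z_i(x)) / (k - sum_{l != i} z_l(x)); for x in S1 the weight is y_i(x).
import numpy as np

def l1_estimator(X1: np.ndarray, Y: np.ndarray, X2: np.ndarray, Z: np.ndarray) -> np.ndarray:
    k = Y.shape[1]
    neg_total = Z.sum(axis=1, keepdims=True)      # number of crossed-off classes per document
    W = (1.0 - Z) / (k - (neg_total - Z))         # per-document, per-class weights on S2
    num = Y.T @ X1 + W.T @ X2                     # class-by-vocabulary weighted counts
    return num / num.sum(axis=1, keepdims=True)   # normalize each class over the vocabulary
```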

4.2 Text Classification with L2 Setting (3.2)

Another way to use both the S1 and S2 datasets is to solve (3.8) with L(θ) = L2(θ),
where L2 is defined in (3.2). Let

G_i = 1 − Σ_{j=1}^{v} θ_{ij};

by Lagrange multipliers, we have:

∂log(L2)/∂θ_{ij} + λ_i ∂G_i/∂θ_{ij} = 0,  ∀ 1 ≤ i ≤ k and ∀ 1 ≤ j ≤ v,
Σ_{j=1}^{v} θ_{ij} = 1,  ∀ 1 ≤ i ≤ k.

Plugging in, we obtain:

Σ_{x∈S} (y_i(x) + t − z_i(x)) x_j / θ_{ij} − λ_i = 0,  ∀ 1 ≤ i ≤ k and ∀ 1 ≤ j ≤ v,
Σ_{j=1}^{v} θ_{ij} = 1,  ∀ 1 ≤ i ≤ k.        (4.3)

Solving (4.3), we get the solution of the optimization problem (3.8):

θ̂^{L2}_{ij} = Σ_{x∈S} (y_i(x) + t − z_i(x)) x_j / ( Σ_{j=1}^{v} Σ_{x∈S} (y_i(x) + t − z_i(x)) x_j ).        (4.4)

Notice that the parameter t here is used to avoid non-convexity: when 0 ≤ t < 1,
the optimization problem (3.8) has its optimizer located on the boundary of θ,
and it cannot be solved explicitly.
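The estimator (4.4) is a weighted count normalization with per-document, per-class weight y_i(x) + t − z_i(x); a minimal sketch (variable names are ours):

```python
# Hedged sketch of the self-correcting estimator (4.4): X is a count matrix,
# Y and Z are positive/negative indicator matrices over the same documents
# (rows without a given label type simply carry zeros), and t > 1.
import numpy as np

def l2_estimator(X: np.ndarray, Y: np.ndarray, Z: np.ndarray, t: float = 2.0) -> np.ndarray:
    W = Y + t - Z                                  # document-by-class weights
    num = W.T @ X                                  # weighted class-by-vocabulary counts
    return num / num.sum(axis=1, keepdims=True)
```

Note that calling this with Z = 1 − Y on a fully labeled set reproduces the estimator θ̂* of Sect. 4.3 (Eq. 4.6), since y + t − (1 − y) = 2y + t − 1.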

Theorem 3. Assume we have normalized length of each document, that is,
Σ_{j=1}^{v} x_j = m for all d. Assume the negative label has only one entry equal to 1,
namely Σ_i z_i(x) = 1, ∀x ∈ S2. Let |C_i| denote the number of documents in Class i
and |D_i| denote the number of documents labelled not in Class i, with p_i = |C_i|/|S|
and q_i = |D_i|/|S|. Further, we assume that if a document x is labelled not in Class i,
it has equal probability to be in any other class. Then the estimator (4.4) satisfies
the following properties:

1. θ̂^{L2}_{ij} is biased and |E[θ̂^{L2}_{ij} − θ_{ij}]| = O( (t + q_i) / (t + p_i − q_i) ).
2. var[θ̂^{L2}_{ij}] = O( ( (1 + 2t) p_i + (1 − 2t) q_i + t² ) / ( m (p_i − q_i + t)² |S| ) ).

Proof. First of all, we can simplify (4.4) using our assumption to be

θ̂^{L2}_{ij} = Σ_{x∈S} (y_i(x) + t − z_i(x)) x_j / ( Σ_{x∈S} (y_i(x) + t − z_i(x)) m ).

For x ∈ C_l ⊂ S1, E[x_j] = m θ_{lj} and var[x_j] = m θ_{lj}(1 − θ_{lj}). For x ∈ D_l ⊂ S2 with
z_l(x) = 1, which means x is labelled not in Class l, we have E[x_j] = m Σ_{r≠l} θ_{rj} / (k − 1) and

var(x_j, x ∈ D_l) = Σ_{r≠l} E[ E[(x_j − E[x_j])² | x ∈ C_r] ]
                  = Σ_{r≠l} (1/(k − 1)) E[(x_j − E[x_j])² | x ∈ C_r]
                  = (1/(k − 1)) Σ_{r≠l} m θ_{rj}(1 − θ_{rj}).

Moreover, denote N = m Σ_{x∈S} (y_i(x) + t − z_i(x)) = m (|C_i| − |D_i| + t|S|).

1. We first compute the expectation

E[θ̂^{L2}_{ij}] = Σ_{x∈S} (y_i(x) + t − z_i(x)) E[x_j] / N
             = ( t Σ_{x∈S1} E[x_j] + t Σ_{x∈S2} E[x_j] + m |C_i| θ_{ij} − m |D_i| Σ_{l≠i} θ_{lj} / (k − 1) ) / N
             = ( t Σ_{l=1}^{k} |C_l| θ_{lj} + |C_i| θ_{ij} + t Σ_{l=1}^{k} |D_l| Σ_{r≠l} θ_{rj} / (k − 1)
               − |D_i| Σ_{l≠i} θ_{lj} / (k − 1) ) / ( |C_i| − |D_i| + t|S| )
             = ( t Σ_{l=1}^{k} p_l θ_{lj} + p_i θ_{ij} + t Σ_{l=1}^{k} q_l Σ_{r≠l} θ_{rj} / (k − 1)
               − q_i Σ_{l≠i} θ_{lj} / (k − 1) ) / ( p_i − q_i + t ).

Therefore, we can compute the bias:

E[θ̂^{L2}_{ij} − θ_{ij}] = E[θ̂^{L2}_{ij}] − (p_i − q_i + t) θ_{ij} / (p_i − q_i + t)
                      = ( t Σ_{l=1}^{k} p_l θ_{lj} + t Σ_{l=1}^{k} q_l Σ_{r≠l} θ_{rj} / (k − 1) − t θ_{ij}
                        − q_i ( Σ_{l≠i} θ_{lj} / (k − 1) − θ_{ij} ) ) / ( p_i − q_i + t )        (4.5)
                      = O( (t + q_i) / (t + p_i − q_i) ).

2. We now turn to the variance.

var(θ̂^{L2}_{ij}) = E[ |θ̂^{L2}_{ij} − E[θ̂^{L2}_{ij}]|² ]
               = (1/N²) E[ ( Σ_{x∈S} (y_i(x) + t − z_i(x)) (x_j − E[x_j]) )² ].

Since different documents x ∈ S are independent, we have

var(θ̂^{L2}_{ij}) = (1/N²) Σ_{x∈S} (y_i(x) + t − z_i(x))² var(x_j),

where var(x_j) = E[(x_j − E[x_j])²].

If x ∈ C_l has positive labels, var(x_j) = m θ_{lj}(1 − θ_{lj}). Then

V1 := Σ_{x∈S1} (y_i(x) + t − z_i(x))² var(x_j)
    = Σ_{x∈C_i} (1 + t)² m θ_{ij}(1 − θ_{ij}) + Σ_{x∈C_l, l≠i} t² m θ_{lj}(1 − θ_{lj})
    = |C_i| (1 + 2t) m θ_{ij}(1 − θ_{ij}) + Σ_{l=1}^{k} |C_l| t² m θ_{lj}(1 − θ_{lj})
    = O( |C_i| (1 + 2t) m ) + O( |S1| t² m ).

On the other hand, if x ∈ D_l ⊂ S2 has negative labels,
var(x_j, x ∈ D_l) = (1/(k − 1)) Σ_{r≠l} m θ_{rj}(1 − θ_{rj}), and

V2 := Σ_{x∈S2} (y_i(x) + t − z_i(x))² var(x_j)
    = Σ_{x∈D_i} (t − 1)² var(x_j, x ∈ D_i) + Σ_{l≠i} Σ_{x∈D_l} t² var(x_j, x ∈ D_l)
    = Σ_{x∈D_i} (1 − 2t) var(x_j, x ∈ D_i) + Σ_{l=1}^{k} Σ_{x∈D_l} t² var(x_j, x ∈ D_l)
    = ( (1 − 2t) |D_i| / (k − 1) ) Σ_{r≠i} m θ_{rj}(1 − θ_{rj})
      + ( t² / (k − 1) ) Σ_{l=1}^{k} |D_l| Σ_{r≠l} m θ_{rj}(1 − θ_{rj})
    = O( (1 − 2t) m |D_i| ) + O( t² m |S2| ).

Then the variance is

var(θ̂^{L2}_{ij}) = (V1 + V2) / N² = (V1 + V2) / ( m² (|C_i| − |D_i| + t|S|)² )
               = O( (V1 + V2) / ( m² (p_i − q_i + t)² |S|² ) )
               = O( ( (1 + 2t) p_i + (|S1|/|S|) t² + (1 − 2t) q_i + (|S2|/|S|) t² ) / ( m (p_i − q_i + t)² |S| ) )
               = O( ( (1 + 2t) p_i + (1 − 2t) q_i + t² ) / ( m (p_i − q_i + t)² |S| ) ).

Using the same strategy as in 1, the first part of our variance estimation should be of order
O(1/(m|S|)), which is less than the order of the variance for the Naive Bayes estimation,
O(1/|C_i|). We also showed that its order is O(1/(|S1| + |S2|)) < O(1/|C_i|); therefore,
θ̂^{L2}_{ij} converges faster than θ̂_{ij}.

4.3 Improvement of Naive Bayes Estimator with Only S1 Dataset


Now assume that we don’t have dataset S2 , but only have dataset S = S1 , can
we still do better than traditional Naive Bayes estimator θ̂? To improve the
estimator, researchers use the L1 or L2 setting. With z(x) = 1 − y(x), researchers
define the function z on the S1 dataset. In this setting, researchers have effectively
defined S2 ≅ S1.
With simple computation, we have that the estimator of L1 is the same as θ̂_{ij}. As
for the estimator of L2, we have:

θ̂*_{ij} = Σ_{x∈S} (2 y_i(x) + t − 1) x_j / ( Σ_{j=1}^{v} Σ_{x∈S} (2 y_i(x) + t − 1) x_j ),        (4.6)

and by Theorem 3, we have:


Corollary 1. Assume we have normalized length of each document, that is,
Σ_{j=1}^{v} x_j = m for all d. With only dataset S1, let S2 = S1 and define z(x) = 1 − y(x).
Then the estimator (4.6) satisfies the following properties:

1. θ̂*_{ij} is biased, with E[θ̂*_{ij} − θ_{ij}] = O(t).
2. E[|θ̂*_{ij} − θ_{ij}|²] = O(1/|S|).

5 Experiment
Researchers applied the new methods on the top 10 topics of single-labeled documents
in the Reuters-21578 data [13], and on the 20 news group data [10]. Researchers
compared the results of the traditional Naive Bayes estimator θ̂_{ij} and the new
estimators θ̂^{L1}_{ij}, θ̂^{L2}_{ij}, as well as θ̂*_{ij}. t is chosen to be 2 in all the
following figures. The data in S2 is generated randomly by marking a document as not
belonging to a class: for example, if a document d is in class 1 among the 10 classes
in the Reuters data, to put d in S2, researchers randomly pick one class from 2 to 10
and mark d as not in that class. All figures are put in the appendix.
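The generation of S2 described above can be sketched as follows (function and variable names are illustrative, not the authors' code).

```python
# Hedged sketch: for a document with known true class, pick one of the other
# classes uniformly at random and record "not in that class".
import numpy as np

def make_negative_labels(true_classes: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    Z = np.zeros((len(true_classes), k), dtype=int)
    for row, c in enumerate(true_classes):
        wrong = rng.choice([j for j in range(k) if j != c])
        Z[row, wrong] = 1          # z_i(x) = 1 means "x is not in class i"
    return Z
```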
First of all, researchers ran all algorithms on these two sample sets. We
know that when the sample size becomes large enough, the three estimators actually
converge to the same thing, thus researchers took the training sample size
relatively small. See Fig. 1(a) and (b). According to the experiment, researchers
showed the new methods were more accurate for most of the classes, and more
accurate on average.
Then researchers considered a more extreme case: a dataset with |S1| = 0, that is
to say, there is no positively labeled data. In this setting, traditional Naive
Bayes will not work, but what will we get from the new estimators? See Fig. 2(a)
and (b). Researchers showed we could still get some information from negatively
labeled data. The accuracy is not as good as in Fig. 1(a) and (b), because for
each sample, a negative label carries only part of the information of a positive
label.
At last, researchers tested the new estimator θ̂^{L2} with only the S1 dataset, see
Fig. 3(a) and (b). Researchers showed the new methods achieve better results than
the traditional Naive Bayes estimator. Researchers then applied the same training
set and tested the accuracy on the training set itself. Researchers found the
traditional Naive Bayes estimator actually achieved a better result, which means
it might have more over-fitting problems, see Fig. 4(a) and (b).

6 Conclusion

This paper has presented an effective learning approach with a new labeling
method for partially labeled document data, for some of which we only know
the sample is surely not belonging to certain classes. Researchers encode these
labels as yi or zi , and define maximum likelihood estimators θ̂iLj1 , θ̂iLj2 , as well as
θ̂i∗j for multinomial Naive Bayes model based on L1 and L2 . There are several
further questions about these estimators:

1. We have proved that with multinomial Naive Bayes model, our estimators
have smaller variance, which means our estimators can converge to true
parameters with a faster rate than the standard Naive Bayes estimator.
An interesting question is the following: if we consider a more general situation
without the text classification background and the multinomial assumption,
by solving the optimization problem (3.8) with L1 and L2 , can we get the
same conclusion with a more general chosen likelihood function fi ? If not,
what assumption should we raise for fi to land on a similar conclusion?
2. The effectiveness of an algorithm in machine learning depends heavily upon
well-labeled training samples; to some extent, our new estimator can utilize
incorrect-labeled data or different-labeled data. Our estimator, especially L2 ,
Cost-Reducing Partial Labelling Estimator 507

can resolve this problem (3.2), since the incorrect-labeled data can be canceled
out by the correct-labeled data, thus the partial-labeled data can still have
its contribution.
Another question is: besides θ̂iLj1 and θ̂iLj2 , can we find other estimators, or
even better estimators satisfying this property?
3. Based on our experiment, the traditional Naive Bayes estimator acts almost
perfectly in the training set as well as during the cross validation stage, but
the accuracy rate in the testing set is not ideal. To quantify this observation,
we are still working on a valid justification that the traditional Naive Bayes
estimator has a severe over-fitting problem in the training stage.

A Proof of Theorem 1
Proof. With the assumption Σ_{j=1}^{v} x_j = m, we can rewrite (3.10) as:

θ̂_{ij} = Σ_{d∈C_i} x_j / ( Σ_{d∈C_i} m ) = Σ_{d∈C_i} x_j / ( |C_i| m ).

Since d = (x_1, x_2, ..., x_v) follows a multinomial distribution, with d in class C_i we have
E[x_j] = m · θ_{ij} and E[x_j²] = m θ_{ij}(1 − θ_{ij} + m θ_{ij}).

1. E[θ̂_{ij}] = E[ Σ_{d∈C_i} x_j / (|C_i| m) ] = Σ_{d∈C_i} E[x_j] / (|C_i| m)
            = Σ_{d∈C_i} m · θ_{ij} / (|C_i| m) = θ_{ij}.

Thus θ̂_{ij} is unbiased.

2. By (1), we have:

E[|θ̂_{ij} − θ_{ij}|²] = E[θ̂_{ij}²] − 2 θ_{ij} E[θ̂_{ij}] + θ_{ij}² = E[θ̂_{ij}²] − θ_{ij}².

Then

θ̂_{ij}² = ( Σ_{d∈C_i} x_j )² / ( |C_i|² m² )
        = ( Σ_{d∈C_i} x_j² + Σ_{d1,d2∈C_i} 2 x_j^{d1} x_j^{d2} ) / ( |C_i|² m² ),        (A.1)

where d_i = (x_1^{d_i}, x_2^{d_i}, ..., x_v^{d_i}) for i = 1, 2. Since

E[ Σ_{d∈C_i} x_j² ] / ( |C_i|² m² ) = |C_i| m θ_{ij}(1 − θ_{ij} + m θ_{ij}) / ( |C_i|² m² )
                                   = θ_{ij}(1 − θ_{ij} + m θ_{ij}) / ( |C_i| m ),

and

E[ Σ_{d1,d2∈C_i} 2 x_j^{d1} x_j^{d2} ] / ( |C_i|² m² ) = |C_i|(|C_i| − 1) m² θ_{ij}² / ( |C_i|² m² )
                                                       = ( |C_i| − 1 ) θ_{ij}² / |C_i|,

plugging them into (A.1) obtains:

E[θ̂_{ij}²] = θ_{ij}(1 − θ_{ij}) / ( |C_i| m ) + θ_{ij}²,

thus E[|θ̂_{ij} − θ_{ij}|²] = θ_{ij}(1 − θ_{ij}) / ( |C_i| m ).




B Figures
See Figs. 1, 2, 3 and 4.

[Fig. 1 plots: (a) training set = negative set = 10%, behavior in Reuter data; (b) training set = negative set = 10%, behavior in 20 news group]

Fig. 1. We take 10 largest groups in Reuter-21578 dataset (a) and 20 news group
dataset (b), and take 20% of the data as training set, among which |S1 | = |S2 |. The
y-axis is the accuracy, and the x-axis is the class index.

[Fig. 2 plots: (a) training with only negative set = 90%, behavior in Reuter data; (b) training with only negative set = 90%, behavior in 20 news group]

Fig. 2. We take 10 largest groups in Reuter-21578 dataset (a), and 20 news group
dataset (b), and take 90% of the data as S2 training set. The y-axis is the accuracy,
and the x-axis is the class index.

[Fig. 3 plots: (a) training set 10%, behavior in Reuter data; (b) training set 10%, behavior in 20 news group]

Fig. 3. We take 10 largest groups in Reuter-21578 dataset (a), and 20 news group
dataset (b), and take 10% of the data as S1 training set. The y-axis is the accuracy,
and the x-axis is the class index.

[Fig. 4 plots: (a) testing set = training set, training set 10%, behavior in Reuter data; (b) testing set = training set, training set 10%, behavior in 20 news group data]

Fig. 4. We take 10 largest groups in Reuter-21578 dataset (a), and 20 news group
dataset (b), and take 10% of the data as S1 training set. We test the result on training
set. The y-axis is the accuracy, and the x-axis is the class index.

References
1. Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classifi-
cation. Pattern Recogn. 37(9), 1757–1771 (2004)
2. Chen, J., Matzinger, H., Zhai, H., Zhou, M.: Centroid estimation based on sym-
metric KL divergence for multinomial text classification problem. arXiv preprint
arXiv:1808.10261 (2018)
3. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297
(1995)

4. Cour, T., Sapp, B., Taskar, B.: Learning from partial labels. J. Mach. Learn. Res.
12, 1501–1536 (2011)
5. Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms
and representations for text categorization. In: Proceedings of the Seventh Interna-
tional Conference on Information and Knowledge Management, pp. 148–155. ACM
(1998)
6. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach.
Learn. 29(2–3), 131–163 (1997)
7. Hofmann, T.: Probabilistic latent semantic analysis. In: Proceedings of the Fif-
teenth Conference on Uncertainty in Artificial Intelligence, pp. 289–296. Morgan
Kaufmann Publishers Inc. (1999)
8. Jin, R., Ghahramani, Z.: Learning with multiple labels. In: Advances in Neural
Information Processing Systems, pp. 921–928 (2003)
9. Joachims, T.: Text categorization with support vector machines: learning with
many relevant features. In: European Conference on Machine Learning, pp. 137–
142. Springer (1998)
10. Lang, K.: 20 newsgroups data set
11. Langley, P., Iba, W., Thompson, K., et al.: An analysis of Bayesian classifiers. In:
AAAI, vol. 90, pp. 223–228 (1992)
12. Larkey, L.S.: Automatic essay grading using text categorization techniques. In:
Proceedings of the 21st Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 90–95. ACM (1998)
13. Lewis, D.D.: Reuters-21578
14. Li, X., Liu, B.: Learning to classify texts using positive and unlabeled data. In:
IJCAI, vol. 3, pp. 587–592 (2003)
15. Liu, P., Qiu, X., Huang, X.: Recurrent neural network for text classification with
multi-task learning. arXiv preprint arXiv:1605.05101 (2016)
16. McCallum, A.: Multi-label text classification with a mixture model trained by EM.
In: AAAI Workshop on Text Learning, pp. 1–7 (1999)
17. Nguyen, N., Caruana, R.: Classification with partial labels. In: Proceedings of the
14th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 551–559. ACM (2008)
18. Belongie, S., Welinder, P., Branson, S., Perona, P.: The multidimensional wisdom
of crowds. In: Advances in Neural Information Processing Systems, pp. 2424–2432
(2010)
19. Ramos, J., et al.: Using TF-IDF to determine word relevance in document queries.
In: Proceedings of the First Instructional Conference on Machine Learning, vol.
242, pp. 133–142 (2003)
20. Rish, I., et al.: An empirical study of the Naive Bayes classifier. In: IJCAI 2001
Workshop on Empirical Methods in Artificial Intelligence, vol. 3, pp. 41–46. IBM,
New York (2001)
21. Schneider, K.-M.: A new feature selection score for multinomial Naive Bayes text
classification based on KL-divergence. In: Proceedings of the ACL 2004 on Inter-
active Poster and Demonstration Sessions, p. 24. Association for Computational
Linguistics (2004)
22. Sheng, V.S., Provost, F., Ipeirotis, P.G.: Get another label? Improving data quality
and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.
614–622. ACM (2008)

23. Tang, D., Qin, B., Liu, T.: Document modeling with gated recurrent neural network
for sentiment classification. In: Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, pp. 1422–1432 (2015)
24. Zhang, M.-L., Zhou, B.-B., Liu, X.-Y.: Partial label learning via feature-aware dis-
ambiguation. In: Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 1335–1344. ACM (2016)
25. Zhang, M.-L., Zhou, Z.-H.: ML-KNN: a lazy learning approach to multi-label learn-
ing. Pattern Recogn. 40(7), 2038–2048 (2007)
26. Zhang, M.-L., Zhou, Z.-H.: A review on multi-label learning algorithms. IEEE
Trans. Knowl. Data Eng. 26(8), 1819–1837 (2014)
Unsupervised Cross-Lingual Mapping
for Phrase Embedding Spaces

Abraham G. Ayana(B) , Hailong Cao, and Tiejun Zhao

School of Computer Science and Technology, Harbin Institute of Technology,


Harbin, China
abrecool@gmail.com, {caohailong,tjzhao}@hit.edu.cn

Abstract. Cross-lingual embedding has proven an effective way to learn
cross-lingual representations in a joint embedding space. Recent work showed
that cross-lingual phrase embeddings are important to induce a phrase table for
unsupervised phrase-based machine translation. However, most of the cross-lingual
representations in the literature are either limited to word-level embeddings or
use bilingual supervision for a shared phrase embedding space. Therefore, in this
paper, we explore the
ways to map phrase embeddings of two languages into a common embed-
ding space without supervision. Our model uses a three-step process: first
we identify phrases in a sentence by using their mutual information, and
combine the component words of each phrase in the preprocessing stage; then
we independently learn phrase embedding for each language based on
their distributional properties, finally a fully unsupervised linear trans-
formation method based on self-learning is used to map the phrase
embeddings into a shared space. We extracted bilingual phrase trans-
lation as a gold standard to evaluate the result of the system. Besides its
simplicity, the proposed method has shown a promising result for phrase
embedding mapping.

Keywords: Cross-lingual mapping · Word embedding · Phrase


embedding · Machine translation · Mutual Information · Linear
transformation

1 Introduction

Recently, cross-lingual embedding has shown an effective way to learn cross-


lingual representation in a joint embedding space. It has a capacity to represent
meaning and transfer knowledge in cross-lingual context [1], either by learn-
ing mappings between different embedding spaces [2–4] or by jointly training
cross-lingual representations [5–7]. While most of these approaches rely on word
embeddings, phrases are an atomic translation unit in many natural language
processing applications (NLP) such as statistical (phrase-based) based machine
translation [8,9]. Therefore, learning cross-lingual representations for phrases or
sequence of words is more crucial for some of the NLP applications.

To address this issue, some previous work attempted to learn a shared phrase
embedding space of two languages. For instance, bilingual phrase translation
is used as a supervision to learn semantic phrase embedding space between
two languages by bilingually-constrained recursive auto-encoder [10]. This was
later extended to learn inner structures and correspondences within the bilin-
gual phrase [11]. However, these methods depend on parallel data, which is not
available for most language pairs.
On the other hand, unsupervised cross-lingual embedding mapping models
have shown an interesting mapping strategy [12–15]. These models take indepen-
dently trained monolingual word embeddings of two languages and learn a linear
transformation to map them into a shared embedding space. The key insight is that
the monolingual word embedding graphs are approximately isomorphic [16]. Based
on this, some recent methods explored adversarial training [12] or iterative self-
learning [13] to obtain cross-lingual embeddings without any supervision signals.
However, these methods are limited to word-level embeddings.
In this paper, we focus on unsupervised cross-lingual mapping and explore
ways to learn cross-lingual phrase embeddings. We propose a model with a
three-step process. First, we preprocess the corpus to identify phrases based on
their mutual information and then combine them into single tokens. This helps
us effectively extract meaningful phrases from raw text. Second, taking the
preprocessed data, we train a Word2Vec1 model for phrase embeddings. Finally, a
fully unsupervised linear transformation based on self-learning is used to map the
phrase embeddings into a shared space. The general framework of our method
is shown in Fig. 1. Our main contributions are:
– Most work on unsupervised cross-lingual mapping focuses on individual word
embeddings. In this work, we propose to map phrase embedding spaces.
– Naively combining words into n-grams creates meaningless phrases, which in
turn causes a data sparsity problem. To mitigate this, we adopt a collocation
extraction method based on mutual information for phrase identification.
– Moreover, we propose an unsupervised cross-lingual phrase embedding
model that can serve as input for unsupervised statistical machine translation,
particularly between low-resource languages.
The rest of this paper is organized as follows. In Sect. 2 we briefly review
related work. Section 3 presents our proposed method, with each of the three
steps explained in detail. The experimental settings and the results and dis-
cussion are presented in Sects. 4 and 5, respectively. Finally, Sect. 6 presents our
conclusions and future directions.

2 Related Work
Over time, researchers have investigated how to represent cross-
lingual embeddings in a shared space using different methods. In 2013, [2] came
1 Word2Vec downloaded from: https://github.com/tmikolov/word2vec.

up with the notion that vector spaces can encode meaningful relations between
words. They also noticed that the geometric relations that hold between words
are similar across languages, which indicates the possibility of transforming
one language's vector space into the space of another by utilizing a linear trans-
formation. They used a dictionary of five thousand words as supervision to learn
this mapping.
Consequently, several studies aimed at improving these cross-lingual word
embeddings. For instance, Canonical correlation analysis [3] techniques learn a
transformation matrix for every language. The transformation matrix of each
language is learned separately to transform embeddings into a different rep-
resentation, then the new representation can be seen as a shared embedding
space. Another method solely relies on word alignments by counting the num-
ber of times each word in the source language is aligned with each word in the
target language in a parallel corpus [17]. Adversarial auto-encoders [14] were pro-
posed to work like a game: a competition between an auto-encoder trained
to reconstruct the source embeddings and a discriminator trained to differen-
tiate the projected source embeddings from the actual target embeddings. On
the other hand, [4] proposed normalizing word embeddings during training
to solve the inconsistency between the embedding and the distance measurement.
They make the inner product equivalent to cosine similarity and then constrain
the transformation matrix to be an orthogonal matrix by solving a separate
optimization problem. A generalized framework was also built on the idea of
linear transformation by combining orthogonal transformation, normalization,
and mean centering for cross-lingual mapping [18]. But all the above methods
rely on bilingual word lexicons.
Alternatively, recent works have proposed to minimize the amount of bilin-
gual supervision used with a few dictionary entries [19], by iteratively using
embedding mapping to induce a new dictionary in a self-learning fashion. The
self-learning method is able to start from a weak initial solution and iteratively
improve the mapping. This method later became the basis for mapping without any
cross-lingual guidance, by initializing a weak mapping and exploiting the structure
of the embedding spaces through a self-learning approach [13]. Other work on
fully unsupervised adversarial training [12,15] introduces an unsupervised selec-
tion metric that is highly correlated with the mapping quality, which is used
both as a stopping criterion and to select the best hyper-parameters. However,
all of these previous works were based on cross-lingual word embeddings, aiming to
induce bilingual dictionaries of the languages. There have also been limited efforts on
supervised bilingual phrase embeddings. Bilingually-constrained phrase
embeddings [10] use a recursive auto-encoder (RAE) to learn semantic phrase
embeddings that depend on bilingual phrase translations. On the other hand, [11]
extends this work to capture both the inner structures and the correspondences within
the bilingual phrase; in this way, they also learn a bilingual embedding space. However,
both methods depend on bilingual supervision to learn semantic phrase
embeddings.

3 Proposed Method
In order to induce bilingual phrase representations, we first have to represent
the phrases in each language as vectors. To convert phrases to their corresponding
vector representations, we first identify phrases in a sentence based on their
normalized pointwise mutual information and combine them into single tokens
(3.1). Then we feed the phrase-tagged corpus into the Word2Vec model to
independently learn the phrase embeddings (3.2). Finally, the monolingual phrase
embeddings are mapped into a shared space using an unsupervised linear
transformation (3.3). The general framework of the proposed method is shown in Fig. 1.

Fig. 1. The general framework of our proposed method. L1 is the monolingual corpus for
the source language and L2 is the monolingual corpus for the target language. The outputs
of the unsupervised linear transformation are the cross-lingual phrase embeddings for
both the source and target language phrases.

3.1 Phrase Identification

Phrases are combinations of more than one word that appear together in a
sentence, sometimes called n-grams. N-grams can easily be identified in a sen-
tence based on their co-occurrence and combined into a single token. For example,
the sentence "The cat sat on the mat" can be tagged with bigrams as "The_cat
cat_sat sat_on on_the the_mat", but the problem is that most of
the n-grams in the sentence are meaningless (for example "cat_sat"), and working
with only bigrams creates data sparsity problems with our corpus. Alternatively,
[20] introduced a way to extract meaningful collocations using mutual informa-
tion. According to [20], mutual information (MI) is a measure of the information
overlap between two random variables:

MI(x, y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}    (1)

The value of the mutual information will be 1 if x and y occur together and 0
otherwise. To further strengthen the result, a concrete way to evaluate the dependence
between words x and y is called Pointwise Mutual Information (PMI), given by:

PMI(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}    (2)

It is easy to see that when two words x and y appear together many times,
but rarely alone, PMI(x, y) will have a high value, while it will have a value of
0 if x and y are completely independent. While PMI is a good measure of the
dependence between occurrences of x and y, it has no upper bound on its
values [20]. When there is no fixed upper limit, we do not know how close a bi-
gram is to perfect correlation. We want a measure that can be compared between
all bi-grams, so that we can choose only bi-grams above a certain threshold; that is,
we want the measure to have a maximum value of 1 for perfectly correlated
words x and y. This is called Normalized (Pointwise) Mutual Information (NPMI).
Formally:
NPMI(x, y) = \frac{\log \frac{p(x, y)}{p(x)\, p(y)}}{-\log p(x, y)}    (3)
The value of NPMI is 1 when two words always occur together; when they
are distributed as expected under independence, NPMI is 0, as the numerator is
0; finally, when two words occur separately but never together, we define NPMI to
be −1, since it approaches this value when p(x, y) approaches 0 while p(x) and
p(y) are fixed. For comparison, the corresponding orientation values for PMI are
−ln p(x, y), 0, and negative infinity, respectively. Based on this, we prefer NPMI as a
score for identifying meaningful bigrams in sentences.
Now that we have a way to extract meaningful bi-grams from our large
monolingual corpus, we can replace bi-grams with an NPMI above a certain
threshold by a single unigram; for example, "computer science" will be transformed to
"computer_science". It is easy to create tri-grams by taking the transformed corpus
with bi-grams and running the process again (with a lower threshold). Similarly,
we can continue this process to longer n-grams with a decreasing threshold.
Pseudocode for phrase identification is given in Algorithm 1, and a small Python
sketch of the same idea follows below.
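As a concrete illustration, the following Python sketch implements the NPMI-based bigram merging of Algorithm 1 on a tokenized corpus; the threshold, minimum count, and whitespace-style tokenization are illustrative assumptions rather than the exact settings used in our experiments.

import math
from collections import Counter

def merge_bigrams(sentences, threshold=0.5, min_count=5):
    # Merge adjacent word pairs whose NPMI exceeds a threshold into single tokens.
    unigrams, bigrams, total = Counter(), Counter(), 0
    for sent in sentences:                    # sentences: lists of tokens
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
        total += len(sent)

    def npmi(w1, w2):
        # Probabilities estimated from corpus counts (approximate, for a sketch).
        p_xy = bigrams[(w1, w2)] / total
        p_x, p_y = unigrams[w1] / total, unigrams[w2] / total
        return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            pair = tuple(sent[i:i + 2])
            if (len(pair) == 2 and bigrams[pair] >= min_count
                    and npmi(*pair) >= threshold):
                out.append("_".join(pair))    # e.g. "computer_science"
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged

Running the merged output through the same function a second time, with a lower threshold, produces tri-grams in the same way.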

3.2 Phrase Embedding

Once we have a monolingual corpus with identified phrases, we can use it to train
a Word2Vec model for phrase embeddings. Word2Vec is based on the distributional
hypothesis: given a large monolingual corpus, for each word in the corpus we try to
predict the word from its context (CBOW), or to predict the context given the word
(Skip-Gram). Word2Vec is a neural network with one hidden layer (of dimension d)
and an optimization function of negative sampling or hierarchical softmax [16]. In
the training phase, we iterate through the tokens in the corpus and look at a
window of size k.
In our case, we use the skip-gram model with a dimension of 300, a window size of 5,
negative sampling of 10, 5 iterations, and subsampling disabled as the parameters
to train the model. Figure 2 shows similar geometric relations between English and
French phrases. This suggests that it might be possible to transform one language's
vector space into the space of another simply by utilizing a linear transformation.
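For reference, the skip-gram training just described could be run with the gensim implementation of Word2Vec roughly as follows. This is a sketch assuming gensim 4.x, a phrase-tagged input file named phrase_tagged.en.txt (the name is illustrative), and a min_count cutoff that is not reported above.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Phrase-tagged corpus: one sentence per line, phrase words joined by "_".
sentences = LineSentence("phrase_tagged.en.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimension
    window=5,         # context window size
    sg=1,             # skip-gram
    negative=10,      # negative sampling
    sample=0,         # subsampling disabled
    epochs=5,         # training iterations
    min_count=5,      # assumed vocabulary cutoff
)

# Save the phrase embeddings in word2vec text format for the mapping step.
model.wv.save_word2vec_format("phrases.en.vec")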

Algorithm 1. Phrase Identification Algorithm

Require: F, a preprocessed text file
Ensure: S*, phrase-identified sentences
Accept a text file f
Define a threshold T
Define a minimum count mincount
while file f is defined/valid do
  Open f
  Read f
  while line l is not EOF do
    Read l
    for word1, word2 in l do
      if count(word1, word2) >= mincount then
        NPMI(word1, word2) <- ln(P(word1, word2) / (P(word1) P(word2))) / (-ln P(word1, word2))
      end if
    end for
    if NPMI(word1, word2) >= T then
      Concatenate every bigram containing word1 and word2 in l
      S* <- l
    end if
  end while
end while

3.3 Cross-Lingual Mapping

Suppose we are given a source phrase embedding matrix P and a target embed-
ding matrix Q. The goal is to find linear transformation matrices Wp and Wq
such that the products PWp and QWq lie in a common embedding space.
We also induce a phrase dictionary D under the assumption that if two phrases
have equivalent embeddings, they are likely translations of each other.
This assumption is used to initialize the model, which is later improved by
a self-learning procedure [19]. The procedure works by first normalizing the
embeddings in a preprocessing stage, then creating an initial solution without any
supervision that is later improved iteratively.
Based on this, we feed the two independently learned phrase embed-
dings into vecmap2 (an open-source framework by Artetxe et al. 2018a) to
map the embeddings into a shared space without any supervision.
2 Vecmap downloaded from: https://github.com/artetxem/vecmap.
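To make the self-learning idea concrete, the following numpy sketch shows one drastically simplified Procrustes-plus-nearest-neighbour iteration. It is our own illustrative stand-in, not the actual vecmap implementation (which additionally uses normalization, whitening, stochastic dictionary induction, and CSLS retrieval), and the seed dictionary is assumed to be given.

import numpy as np

def self_learning_map(P, Q, seed_pairs, n_iter=10):
    # P, Q: row-normalized source/target phrase embedding matrices.
    # seed_pairs: initial list of (source_index, target_index) dictionary entries.
    pairs = list(seed_pairs)
    W = np.eye(P.shape[1])
    for _ in range(n_iter):
        src = P[[i for i, _ in pairs]]
        trg = Q[[j for _, j in pairs]]
        # 1) Orthogonal Procrustes: rotation W minimizing ||src @ W - trg||.
        U, _, Vt = np.linalg.svd(src.T @ trg)
        W = U @ Vt
        # 2) Re-induce the dictionary by nearest-neighbour search in the mapped space.
        #    (For large vocabularies this similarity matrix should be computed in batches.)
        sims = (P @ W) @ Q.T
        pairs = [(i, int(np.argmax(sims[i]))) for i in range(P.shape[0])]
    return W, pairs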

Fig. 2. Example phrase embeddings for English (a) and French (b) phrases. We took a
translated sentence in the two languages; we can observe that their embeddings occupy
almost the same geometric space.

4 Experimental Settings
For this work, we used the openly available WMT14 News Crawl shared task
monolingual dataset to evaluate our model. To build our monolingual datasets for
English, French, German, and Spanish, we performed some preprocessing.
First, we used the Moses tokenizer to tokenize the texts, then we cleaned the data
and removed duplicated sentences. We also trained a truecaser using the Moses
truecaser and finally prepared a cleaned version of our monolingual corpus.
Table 1 shows the size of the training monolingual corpus for each language after
preprocessing; a small sketch of the preprocessing steps is given after the table.

Table 1. Size of the training corpus

Language   Sentences   Vocabulary size
English    11.7M       988428
German     6.26M       849833
French     4.54M       520044
Spanish    1.66M       84045
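The tokenization and de-duplication steps can be reproduced in Python with the sacremoses port of the Moses scripts, roughly as sketched below; the truecasing step, which we performed with the Moses truecaser, is omitted here, and the function name is ours.

from sacremoses import MosesTokenizer

def preprocess(lines, lang="en"):
    # Tokenize with the Moses tokenizer and drop empty or duplicate sentences.
    mt = MosesTokenizer(lang=lang)
    seen, cleaned = set(), []
    for line in lines:
        sent = " ".join(mt.tokenize(line.strip()))
        if sent and sent not in seen:
            seen.add(sent)
            cleaned.append(sent)
    return cleaned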

Next, we identify and combine n-grams based on their mutual information.
We used the NPMI model to identify bigrams in a sentence. The bigrams are
combined into a single token using the underscore (_) character, and we repeated
the same procedure to identify and combine trigrams. Finally, we fed the
preprocessed corpus into the Word2Vec model to learn the phrase embeddings for
each language. The phrase embeddings were later used to train a shared embed-
ding space for two languages using a linear transformation without any supervi-
sion. English is used as the source language for all of the training.
Since there is no ready-made golden phrase dictionary, we took the most fre-
quent phrases from the monolingual source datasets and translated these phrases

using the online Google Translator; these translations were later used as a gold-
standard phrase translation set for evaluation purposes. Our test set contains a total
of 3k phrase pairs: the most frequent 1000 unigrams, 1000 bigrams, and 1000
trigrams for each language pair.

5 Results and Discussion

We consider a component of unsupervised statistical machine translation [8] as a
baseline because, to the best of our knowledge, it has achieved the best quality so
far for phrase translation. The authors developed a new way to learn word and
phrase embeddings at the same time and also trained cross-lingual phrase
embeddings, which they later used as input to induce a phrase table. We discuss
three different results in this paper: the main result (5.1), a comparison with the
state-of-the-art model (5.2), and a result based on the phrase identification
technique, which measures the effect of NPMI on the model (5.3). In all cases, we
use translation accuracy as the evaluation metric.
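Translation accuracy here means nearest-neighbour retrieval in the shared space checked against the gold-standard dictionary; a minimal sketch of such an evaluation is shown below (the variable names and the plain cosine nearest-neighbour retrieval are our illustrative assumptions; vecmap also ships its own evaluation script).

import numpy as np

def translation_accuracy(src_vecs, trg_vecs, test_ids, gold):
    # src_vecs, trg_vecs: mapped, row-normalized phrase embedding matrices.
    # test_ids: indices of the test-set source phrases.
    # gold: dict mapping a source index to the set of acceptable target indices.
    correct = 0
    for i in test_ids:
        sims = trg_vecs @ src_vecs[i]      # cosine similarity (rows are unit length)
        prediction = int(np.argmax(sims))  # nearest target phrase
        if prediction in gold[i]:
            correct += 1
    return correct / len(test_ids)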

5.1 Main Results

In order to capture a wide variety of languages and to measure the performance
of our model on different language combinations, we evaluated our system
on several language pairs using monolingual corpora from WMT14 News Crawl,
an openly available shared task dataset. We applied our model to the English-
French (EN-FR), English-German (EN-DE), and English-Spanish (EN-ES)
language pairs. We performed 10 runs using different phrase pairs and report the
average accuracy (%) of the evaluation.

Table 2. The average result of the proposed model across different language pairs

            EN-FR    EN-DE    EN-ES
Coverage    40.7%    85.41%   16.92%
Accuracy    35.62%   27.10%   31.08%

Table 2 shows the results of our model on different language pairs. The results in
this table are the performance of our model trained with a purely monolingual
dataset for all language pairs, and each value in the table is the average
evaluation result for that language pair. As one can observe, with 40.7%
coverage of EN-FR pairs, 85.41% coverage of EN-DE pairs, and 16.92% coverage
of EN-ES pairs, our method achieved 35.62%, 27.10%, and 31.08% accuracy,
respectively.
Given that phrase embeddings are more difficult than word embeddings, our
average result of 35.62% for the EN-FR language pair is promising for
phrase-based cross-lingual embeddings.

5.2 Comparison with State of the Art

We also compared our proposed model with the model that uses a skip-gram
model to learn word and phrase embeddings at the same time [8]. We used the
openly available vecmap code to train both models without any supervision.
Table 3 shows the results of the proposed method in comparison to the baseline.

Table 3. Comparison of our model's performance with the state-of-the-art model; in all
cases English is used as the source language.

Models                            EN-FR    EN-DE    EN-ES
(Artetxe et al. 2018c [8]), code  43.54%   28.05%   38.59%
Proposed model                    47.64%   28.14%   39.57%

Based on the table, our proposed system obtained better results than the other
model when trained on the large training corpus. Our proposed method achieved
4.10% higher accuracy than (Artetxe et al. 2018d) on EN-FR pairs, 0.08% on
EN-DE pairs, and 0.98% on EN-ES pairs. Thus, the NPMI-based phrase
identification succeeded in all language pairs, providing improved results for
phrase-based translation.
Our result is also competitive with even the state-of-the-art unsupervised
cross-lingual unigram embedding (Artetxe et al. 2018a), which, for instance,
achieved a 37.33% average result for the EN-ES language pair.

5.3 Phrase Identification Evaluation

To show the role of NPMI-based phrase identification in the proposed method, we
performed three separate experiments based on the bigram score used for phrase
identification. The first uses a simple word frequency score (words that frequently
appear together). The second method uses the PMI score before normalization, and
the third experiment uses our newly adopted NPMI-based phrase identification.
We evaluated these models on the English (EN), French (FR), German (DE), and
Spanish (ES) monolingual datasets. English is used as the source language in all
cases. Table 4 presents the comparison of the experiments performed using the
different phrase identification scores.
In the table, we can observe that the proposed method achieved by far better
results than the random phrase identification. Our proposed method achieved
40.26% higher accuracy than the model based on random phrase extraction on
EN-FR pairs, 23.55% on EN-DE pairs, and 39.57% on EN-ES pairs. These results
confirm the relevance of using mutual information for phrase identification in
phrase-based translation.

Table 4. Cross-lingual phrase embedding results based on the phrase identification score

Models        EN-FR    EN-DE    EN-ES
Frequency     29.6%    24.98%   20.91%
PMI score     32.34%   26.12%   25.48%
NPMI score    47.64%   28.14%   39.57%

The word co-occurrence frequency-based approach is simple but performs worst
in our evaluation. The PMI-based score for phrase identification performs better
than the frequency-based score but worse than the NPMI-based score. NPMI is a
more accurate co-occurrence score, with an upper bound, for identifying bigrams,
as stated in Sect. 3.1, and it shows relatively better results in our evaluation. It also
preserves the structure of the sentence while extracting meaningful phrases that
are easy to find in the phrase dictionary. This helps us capture the real phrase
representations across the languages, which in turn affects the result of the cross-
lingual phrase embedding.

5.4 Example

To better understand the output of our model, we show examples of
English phrases correctly translated to French in Table 5. Most of the French
translations from the model are meaningful and almost semantically equivalent to
the original translations.

Table 5. Some English phrases generated based on their NPMI score and their French
translations. We also present the original translation from the dictionary entry for com-
parison.

English            Dictionary entry     French translation
the peace          la paix              la paix
rise to            monter à             aller jusqu'à
Green Paper        Papier vert          Papier vert
a priority         une priorité         une priorité
the Charter        la charte            le tableau
too much           trop                 trop
Public Health      Santé publique       Santé publique
the environmental  l'environnement      l'environnement
political groups   groupes politiques   groupes politiques

6 Conclusions
We introduced a model that can learn cross-lingual phrase embeddings
without any supervision. Our model uses a three-step process to achieve its
goal. First, we identified and combined the most frequent n-grams in a sen-
tence based on their mutual information; then we used the n-gram-tagged cor-
pus to independently learn phrase embeddings for the two languages; finally, the
embeddings were mapped into a shared space using a linear transformation and a
self-learning approach. Our use of normalized pointwise mutual infor-
mation as a phrase identification technique helped extract meaningful
phrases from the corpus. The results showed that our method succeeded in
all cases, providing promising results for phrase-based cross-lingual embeddings.
In the future, we plan to apply our model to unsupervised phrase-based
machine translation. It is also important to test the findings on more language
pairs.

References
1. Ruder, S., Vulić, I., Søgaard, A.: A survey of cross-lingual word embedding
models. Computing Research Repository, abs/1706.04902 (2017).
arXiv:1706.04902
2. Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for
machine translation. arXiv:1309.4168v1 (2013)
3. Faruqui, M., Dyer, C.: Improving vector space word representations using multilin-
gual correlation. In: Proceedings of the 14th Conference of the European Chapter
of the Association for Computational Linguistics, pp. 462–471. Association for
Computational Linguistics, Stroudsburg (2014). https://doi.org/10.3115/v1/E14-
1049. http://aclweb.org/anthology/E14-1049
4. Xing, C., Wang, D., Liu, C., Lin, Y.: Normalized word embedding and orthog-
onal transform for bilingual word translation. In: Proceedings of the 2015 Con-
ference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pp. 1006–1011. Association for Com-
putational Linguistics, Stroudsburg (2015). https://doi.org/10.3115/v1/N15-1104.
http://aclweb.org/anthology/N15-1104
5. Hermann, K.M., Blunsom, P.: Multilingual models for compositional distributed
semantics. In: Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics, pp. 58–68 (2014). arXiv:1404.4641
6. Vulic, I., Moens, M.F.: Bilingual distributed word representations from
document-aligned comparable data. J. Artif. Intell. Res. 55(2), 953–994 (2016).
arXiv:1509.07308v2
7. Vyas, Y., Carpuat, M.: Sparse bilingual word representations for cross-lingual
lexical entailment. In: Proceedings of the 2016 Conference of the North Amer-
ican Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, pp. 1187–1197. Association for Computational Linguistics,
Stroudsburg (2016). https://doi.org/10.18653/v1/N16-1142. http://aclweb.org/
anthology/N16-1142

8. Artetxe, M., Labaka, G., Agirre, E.: Unsupervised statistical machine translation.
In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, pp. 3632–3642 (2018). arXiv:1809.01272
9. Lample, G., Ott, M., Conneau, A., Denoyer, L., Ranzato, M.: Phrase-based &
neural unsupervised machine translation. In: Emperical Methods for Natural Lan-
guage Processing, vol. 25, no. 6, pp. 1109–1112 (2018). arXiv:1804.07755. https://
doi.org/10.1053/j.jvca.2010.06.032. https://arxiv.org/pdf/1804.07755.pdf
10. Zhang, J., Liu, S., Li, M., Zhou, M., Zong, C.: Bilingually-constrained phrase
embeddings for machine translation. In: Proceedings of the 52nd Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), pp. 111–
121 (2014). https://doi.org/10.3115/v1/P14-1011. http://aclweb.org/anthology/
P14-1011
11. Su, J., Xiong, D., Zhang, B., Liu, Y., Yao, J., Zhang, M.: Bilingual correspondence
recursive autoencoder for statistical machine translation. In: Proceedings of the
2015 Conference on Empirical Methods in Natural Language Processing, pp. 1248–
1258. Association for Computational Linguistics, Stroudsburg (2015). https://doi.
org/10.18653/v1/D15-1146. http://aclweb.org/anthology/D15-1146
12. Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H.: Word translation
without parallel data. In: ICLR Conference Paper (2018). arXiv:1710.04087
13. Artetxe, M., Labaka, G., Agirre, E.: A robust self-learning method for fully unsu-
pervised cross-lingual mappings of word embeddings. In: Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (Long Papers),
Melbourne, Australia, pp. 789–798 (2018). arXiv:1805.06297
14. Miceli Barone, A.V.: Towards cross-lingual distributed representations without
parallel text trained with adversarial autoencoders. In: Proceedings of the 1st
Workshop on Representation Learning for NLP, pp. 121–126. Association for Com-
putational Linguistics, Stroudsburg (2016). arXiv:1608.02996. https://doi.org/10.
18653/v1/W16-1614. http://aclweb.org/anthology/W16-1614
15. Zhang, M., Liu, Y., Luan, H., Sun, M.: Adversarial training for unsupervised bilin-
gual lexicon induction. In: Proceedings of the 55th Annual Meeting of the Asso-
ciation for Computational Linguistics (Volume 1: Long Papers), pp. 1959–1970.
Association for Computational Linguistics, Stroudsburg (2017). https://doi.org/
10.18653/v1/P17-1179. http://aclweb.org/anthology/P17-1179
16. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:1301.3781 (2013)
17. Guo, J., Che, W., Yarowsky, D., Wang, H., Liu, T.: Cross-lingual dependency
parsing based on distributed representations. In: Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics and the 7th Interna-
tional Joint Conference on Natural Language Processing Volume 1, pp. 1234–1244
(2015). http://www.research.philips.com/publications/downloads/martin wilcox
thesis.pdf
18. Artetxe, M., Labaka, G., Agirre, E.: Generalizing and improving bilingual word
embedding mappings with a multi-step framework of linear transformations. In:
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018). Asso-
ciation for the Advancement of Artificial Intelligence (2018)

19. Artetxe, M., Labaka, G., Agirre, E.: Learning bilingual word embeddings with
(almost) no bilingual data. In: Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pp. 451–462.
Association for Computational Linguistics, Stroudsburg (2017). https://doi.org/
10.18653/v1/P17-1042. http://aclweb.org/anthology/P17-1042
20. Bouma, G.: Normalized (pointwise) mutual information in collocation extraction.
In: International Conference of the German Society for Computational Linguistics
and Language Technology, pp. 31–40 (2009)
Sidecar: Augmenting Word Embedding
Models with Expert Knowledge

Mathieu Lemay1, Daniel Shapiro1,2(B), Mary Kate MacPherson1, Kieran Yee1,
Hamza Qassoud2, and Miodrag Bolic2

1 Lemay.ai Inc., Ottawa, ON, Canada
{matt,daniel,marykate,kieran}@lemay.ai
2 School of Electrical Engineering and Computer Science, University of Ottawa,
Ottawa, ON, Canada
mbolic@eecs.uottawa.ca
https://lemay.ai

Abstract. This work investigates a method for enriching pre-trained
word embeddings with domain-specific information using a small, cus-
tom word embedding. For a classification task on text containing out-
of-vocabulary expert jargon, this new approach improves the predic-
tion accuracy when using popular models such as Word2Vec (71.5% to
76.6%), GloVe (73.5% to 77.2%), and fastText (75.8% to 79.6%). Fur-
thermore, an analysis of the approach demonstrates that expert knowl-
edge is improved in terms of discrimination and inconsistency. Another
advantage of this word embedding augmentation technique is that it is
computationally inexpensive and leverages the general syntactic infor-
mation encoded in large pre-trained word embeddings.

Keywords: Transfer learning · Word embedding · Expertise ·
Knowledge · Embedding retraining

1 Introduction
This work examines three widely-used word embedding models, and explores the
use of a secondary, smaller domain-specific embedding to improve the represen-
tation of domain-specific text1 .
Word embedding is a technique for representing words using numeric vec-
tors, which allows calculations to be performed on a “word” in comparison to
all other words. In a good embedding, those calculations will map to semantic
relationships; the canonical example being “king - man + woman = queen” [9].
Unfortunately, processing a large enough corpus to generate useful vectors
is computationally expensive. Therefore, common practice for practical appli-
cations is to use a pre-trained word embedding model trained on a very large
1 The code and training data for the best performing approach described in this work
(fastText pre-trained model with fastText-based custom embedding) are available
at: https://github.com/lemay-ai/sidecar/.

corpus such as Google News2. However, these pre-trained embedding models do
not generalize well to text from more specialized fields and industries, as the
meanings of words can change in the context of a particular domain. For exam-
ple, the word “collision” is usually associated with “car”, “vehicle”, “accident”,
and “insurance”. In the context of computer networking, on the other hand,
words such as “packet”, “transmission”, and “Ethernet” should be considered
more relevant to the word “collision”. In other words, a jack-of-all-trades embed-
ding is master of none; but for many practical applications, such as real estate,
financing, accounting, or computer science, an expert-level model of the problem
domain may be required.
One approach to resolving this issue is to develop a custom word embedding
model for a particular domain, using a domain-specific corpus. The drawback
of this approach is that a smaller dataset may not capture the general semantic
relationships in the language of choice as well as a pre-trained model does. A
second possible solution might be to adjust the pre-trained model. Adjusting the
pre-trained model is mathematically difficult to do without retraining the entire
model from scratch. This is, as noted previously, computationally expensive, and
not suitable for most applications. A third solution, which is investigated in this
work, is to extend or “enhance” the embedding generated by the general-purpose
model by combining it with a second embedding generated by a separate, smaller,
domain-specific model.
This work builds on prior research [6] on enhancing Word2Vec pre-trained
embeddings by combining the pre-trained vectors with vectors from a custom
word embedding model. In that work, a vector for each word was selected from
a pre-trained embedding model and concatenated with an equally-sized vector
from a custom embedding trained on a corpus from the specific domain. The
results of these experiments on both adverse drug reaction tweets and movie
reviews found significant improvement in the performance of a simple CNN when
this concatenation method was used. This work extends that initial research by
investigating how this method works on a variety of word embedding models, as
well as making use of a more formal test of expertise to expose the true benefit
of this concatenation strategy.
The motivation for addressing this problem was previous work on processing
internal audit reports from various application domains3 , and generating action
recommendations using computer screen monitoring [14,15].
This work examines the classification accuracy of a neural network perform-
ing a topic modeling task on text containing a large amount of domain-specific
terminology (jargon). The contribution of this paper is a transfer learning tech-
nique, herein called the “sidecar” method, that uses a secondary, smaller domain-
specific embedding to significantly improve the performance of pre-trained word
embedding models on a domain-specific text classification task. It was also
demonstrated quantitatively that this method improves the expertise of the

2 Google News Vectors Binary File (2019) for Word2Vec can be found at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit.
3 More information on this can be found at https://www.auditmap.ai/.

system, as measured by classification task discrimination and consistency. These
metrics were defined in [13]. Some useful background information is given in
Sect. 2. Prior art is explored in Sect. 3, followed by a description of the technique
in Sect. 4, and the design of experiments in Sect. 5. Results of experiments are
reported in Sect. 6, and concluding remarks are presented in Sect. 7.

2 Background

2.1 Word Embeddings

Word embedding is the technique of mapping words to numeric vectors, such
that the vectors encode relationships between words. If such a meaningful repre-
sentation can be found, concepts such as “distance” between two words can be
used to analyze their relationship, even if those two words have never been seen
together in the corpus. The advantage of this technique over simpler statistical
analysis techniques is that good results can be achieved using a much smaller
corpus of text - billions of words, rather than trillions [9].
Finding such a mapping can be difficult, but over the years several effective
methods have been developed. Three common methods are Word2Vec (2013),
GloVe (2014), and fastText (2016). All three methods operate on the idea of
context, which is defined as the n words that appear before and after the word
of interest, where n is a configurable parameter.
Word2Vec is a context-predicting model that works by continuously trying
to predict a word given its context (CBOW), or the context given the word
(skip-gram) [9]. In doing so, the model internally learns an efficient vector rep-
resentation of each word in the training data, where words with similar contexts
have similar vectors. A vector is a multidimensional tensor representation of a
word. For example, a vector for the word “dog” may be composed of 300 floating
point numbers in an ordered array. A limitation of the Word2Vec method is that
it only learns embeddings for a specific number of words; the final model is essen-
tially a lookup table. It is limited to the words available in the training corpus,
and furthermore, words that occur less frequently than a particular threshold
are typically ignored for efficiency [10]. Thus, pre-trained Word2Vec models are
not suitable for training on text samples that contain infrequently used domain-
specific terminology unlikely to appear frequently, if at all, in general texts.
GloVe, another method for producing word embeddings, uses a regression
model to generate meaningful word vectors from co-occurrences. It learns word
vectors such that the dot product of two word vectors approximately equals the
log of the number of times those two words occur together [11]. It does so by
casting the model as a least-squares problem, which can be optimized iteratively.
The advantage of GloVe is that the generated word vectors are guaranteed to
encode statistical information about the corpus. However, like Word2Vec, it will
only learn a specific number of embeddings, and cannot generate embeddings
for words that were not in the training corpus.

An additional limitation of Word2Vec and GloVe is that these approaches
do not examine the internal structure of a word. This was addressed by [2], an
extension to Word2Vec which was later released as “fastText”. fastText applies
the Word2Vec model, but instead of finding a word vector for each unique word,
it finds word vectors for n-grams. For example, the 3-grams present in the word
“where” are whe, her, ere. The word vector for a particular word is created by
summing the vectors of its n-grams. This greatly improves handling of words
that rarely occur in the training corpus, and allows handling of words that do
not appear in the training corpus at all.
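As a small illustration of the idea (ignoring the boundary symbols that fastText adds around each word), the character n-grams of a word can be enumerated as follows:

def char_ngrams(word, n=3):
    # All contiguous character n-grams of the word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("where"))  # ['whe', 'her', 'ere']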

2.2 Expert Terminology
Issues arise when word embeddings represent data from an expert context. Ter-
minology and vocabulary that has distinct meaning in an expert context is often
referred to as “jargon”. More specifically, jargon is “the technical terminology or
characteristic idiom of a special activity or group” [8]. Such terminology char-
acterizes the particular models and insights that an expert acquires through
experience, and generally comes in two types.
The first is “specific” jargon - words that are rarely used or non-existent
outside of the expert context. These terms often serve to add granularity (what
is “blue” to most might be “cerulean” or “aquamarine” to a visual artist), help
convey complex concepts quickly to other experts (most medical terminology), or
reflect historical peculiarities of the field (e.g. the popular “foobar” placeholder
in computer science).
In addition to this specific jargon, common words are often reused and infused
with new meaning. For example, a “hash collision” has little to do with a “col-
lision” in the usual sense of the word; neither does a “packet collision”. These
terms represent metaphors that are useful to the computer scientist but confus-
ing to the layperson. Similarly, an “option” in the context of equity trading has
a very specific meaning.
Jargon of the first kind falls under the out-of-vocabulary problem in general-
purpose word embedding models. Jargon of the second kind will cause even more
confusion, because the word vector generated on the general-purpose corpus will
encapsulate relationships that may be inappropriate in the expert context (in
the “collision” example, being associated with “car” or “insurance” rather than
“table” or “network”). In order to improve performance of word embeddings
on domain-specific texts, it will be necessary to generate more meaningful word
vectors that encode these context-specific meanings.

2.3 Metrics of Expertise
Typical machine learning model metrics are precision and recall, which may
be combined to form an f1-score. This work examines expertise, and requires a
custom metric to capture this concept. A quantitative definition of expertise in
the context of knowledge and memorization is described in [1], as encoding facts
in a way that aids recall of information relevant to a given situation. It was noted

that, while a medical student may be able to memorize the same information
with the same degree of accuracy as another less educated individual, a medical
expert is better at identifying and recalling the important pieces of information.
In [7], expertise is summarized as:

– extensive knowledge and experience that affect the perception of systems and
the organization of information;
– the ability to recognize a situation and efficiently recall the most appropriate
knowledge to solve a specific problem.

In this work, the focus is on the second point: the ability of an expert to
discriminate between situations that may appear similar on the surface. The
Cochran-Weiss-Shanteau (“CWS”) ratio [13] is a quantitative measure of this
discriminatory ability. Suppose that a candidate is given a series of classification
tasks. The CWS ratio is defined as:
CWS = Discrimination / Inconsistency
“Discrimination” is described as the “ability to differentiate between similar,
but not identical, cases”. “Consistency” is described as the ability of the can-
didate to “repeat their judgment in a similar situation”. “Inconsistency” is the
complement of consistency.
An example of applying CWS is useful for illustrating the advantage of this
metric. Consider the task of comparing the performance of two categorical classi-
fication models on a balanced dataset that contains an equal number of samples
from each of 5 categories. Tables 1 and 2 show example results for the two mod-
els and five categories. Each cell represents a test case. The value of each cell is
the predicted category for that test case. Each column represents the true cat-
egory for each test case. Each row represents a different test of the model, as is
commonly seen with k-fold cross validation when evaluating a machine learning
model. A perfect model would contain only 1s in the first column, only 2s in the
second column, and so forth.

Table 1. Example results for Candidate 1.

Candidate 1 performance chart
Cat1 Cat2 Cat3 Cat4 Cat5
1 2 3 1 5
4 2 3 1 2
4 2 3 4 5
4 2 3 1 2
1 2 3 1 2

Table 2. Example results for Candidate 2.

Candidate 2 performance chart
Cat1 Cat2 Cat3 Cat4 Cat5
1 1 3 4 1
5 3 3 5 2
1 2 4 4 5
1 2 3 2 5
2 5 3 4 5

Consider two cells containing the same predicted category. If they are in the
same column (“within-column match”) then the candidate replied with the same
response for the same category; the answer may not be accurate, but it is at least
consistent. If the matched cells are in different columns (“cross-column match”),
then the candidate failed to discriminate between the two categories. This type of
measurement is called measuring agreement or measuring inter-rater reliability,
and there are many statistical methods that could be used for  this task [12].
This leads to the CWS ratio calculation of Algorithm 21, where \binom{n}{k} is the
binomial coefficient n!/(k!(n−k)!), and UNIQUE_COUNTS() is a function that takes in a table
and returns the set of categories that appear in the table, along with the total
count of each category in the table.

CWS Ratio Calculation

Note that both candidate models had similar overall accuracy (15 out of 25 for
both models), recall (0.60 for both models), and precision (0.64 for Candidate 1
and 0.61 for Candidate 2), but Candidate 1 was more consistent in its classifica-
tion. Table 3 walks through the CWS ratio calculation, showing the intermediate
results of each step.
As expected, Candidate 1’s more consistent classification leads to a higher
CWS ratio. If accuracy, precision, and recall were the only metrics under con-
sideration, the two candidates would be indistinguishable. However, as demon-
strated, not all wrongness is measured equally.
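For readers who prefer running code, the following Python sketch performs the same CWS ratio calculation; it assumes the results are given as a list of rows with one column per true category, as in Tables 1 and 2, and the function name is ours.

from collections import Counter
from math import comb

def cws_ratio(results):
    # results[r][c] = predicted category for the test in row r whose true category is column c.
    num_rows, num_cat = len(results), len(results[0])
    per_col_matches = comb(num_rows, 2)

    # Same-prediction pairs anywhere in the table.
    all_counts = Counter(cell for row in results for cell in row)
    all_matches = sum(comb(n, 2) for n in all_counts.values())

    # Same-prediction pairs restricted to a single column (consistent answers).
    within_matches = 0
    for c in range(num_cat):
        col_counts = Counter(row[c] for row in results)
        within_matches += sum(comb(n, 2) for n in col_counts.values())

    possible_across = comb(num_rows * num_cat, 2) - num_cat * per_col_matches
    actual_across = all_matches - within_matches

    discrimination = (possible_across - actual_across) / possible_across
    inconsistency = (num_cat * per_col_matches - within_matches) / (num_cat * per_col_matches)
    return discrimination / inconsistency

candidate_1 = [[1, 2, 3, 1, 5],
               [4, 2, 3, 1, 2],
               [4, 2, 3, 4, 5],
               [4, 2, 3, 1, 2],
               [1, 2, 3, 1, 2]]
print(round(cws_ratio(candidate_1), 2))  # 2.8

Running the same function on the Candidate 2 table gives 1.26, matching the CWS ratios in Table 3.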

3 Prior Art

3.1 Word Embedding and Expert Terminology

Word embedding models are ultimately limited by the content of the training
corpus. Retraining a model is a computationally expensive endeavor, and so
related works have proposed a number of approaches for improving the contex-
tual map (and therefore the quality) of a pre-trained word embedding model
without retraining it entirely. The retraining approach in [5] was defining a

Input: results, the table of all results.
       num_cat, the number of classification categories.
       num_rows, the number of rows in the table.
Output: CWS, the CWS ratio.
numItems = num_cat × num_rows
maxMatches = \binom{numItems}{2}
perColMatches = \binom{num_rows}{2}
possibleAcrossColMatches = maxMatches − num_cat × perColMatches
catCounts = UNIQUE_COUNTS(results)
allColMatches = 0
for category in catCounts[categories] do
    allColMatches ← allColMatches + \binom{catCounts[counts][category]}{2}
end
withinColMatches = 0
for column in num_cat do
    columnCatCounts = UNIQUE_COUNTS(results[:, column])
    for category in columnCatCounts[categories] do
        withinColMatches ← withinColMatches + \binom{columnCatCounts[counts][category]}{2}
    end
end
actualAcrossColMatches = allColMatches − withinColMatches
discrimination = (possibleAcrossColMatches − actualAcrossColMatches) / possibleAcrossColMatches
inconsistency = (num_cat × perColMatches − withinColMatches) / (num_cat × perColMatches)
CWS = discrimination / inconsistency
return CWS

domain-specific model, and then finding a mapping that translates the pre-
trained word vectors into the domain-specific vector space. Another approach
to retraining [16] is to combine several large pre-trained models into a single
ensemble model with a large, multilingual vocabulary. Similar to this work, the
retraining in [3] combined pre-trained and custom word vectors by concatenat-
ing them to create a richer vector representation. [17] used knowledge graphs to
improve the quality of word embeddings.
Prior work [6] explored a method of concatenating equally-sized vectors from
a pre-trained embedding model and from a custom-trained embedding model
for a particular dataset. In particular, [6] investigated how this method would
improve understanding of a collection of adverse drug reaction tweets, as well
as a dataset of movie reviews. The paper focuses specifically on improving
Word2Vec’s pre-trained Google News embedding when used by a simple classifi-
cation CNN. By concatenating the equally-sized pre-trained and custom vectors,
it was found that the percentage accuracy improved from 88.47% to 88.85% on
the adverse drug reaction tweets, and from 80.56% to 81.29% on the movie review
dataset. To build upon this research, this work investigates a similar technique’s
impact on multiple methods for creating embeddings, rather than just focusing

Table 3. Intermediate results for example CWS ratio calculation.

Measurement                                               Candidate 1   Candidate 2
Possible matches: \binom{25}{2}                           300           300
Possible within-column matches per column: \binom{5}{2}   10            10
Total possible within-column matches: 5 × \binom{5}{2}    50            50
Possible cross-column matches                             250           250
Total matches across columns                              26            35
Total non-matches across columns                          224           215
Discrimination ratio                                      0.896         0.860
Total matches within columns                              34            16
Non-matches within columns                                16            34
Inconsistency ratio                                       0.32          0.68
CWS ratio                                                 2.80          1.26

on Word2Vec. Additionally, by testing on a dataset with more than two possible cat-
egories, it is possible to bring in more precise methods of measuring expertise
than just percentage accuracy. This work also reduces the custom embedding's
number of dimensions from the 300 of [6] to 100, since a much smaller domain
does not need as many dimensions as the pre-trained embedding [14].

4 The “Sidecar” Technique

Figure 1 describes the approach for augmenting a large, pre-trained, general-
purpose word embedding model to handle out-of-vocabulary and context-
specific meaning issues for domain-specific texts. This will be referred to in this
work as the "sidecar" technique, and the domain-specific model as the "sidecar
model".

Fig. 1. Overall architecture of sidecar



The sidecar technique begins with training a small, custom word embedding
model (the “sidecar” model) on a corpus of domain-specific texts, which creates
word embedding vectors of length n. Since the sidecar model is trained on docu-
ments exclusively from that domain, it should learn meanings and relationships
exclusive to that domain.
When analyzing text, for each word, vectors were generated using both the
sidecar model and a general-purpose word embedding model, which generates
embedding vectors of length m. The final word vector is created by concatenating
both vectors lengthwise to create a vector of length n + m.
The intuition behind this technique is that, by using both vectors, the model
gains the strengths of both while covering their weaknesses. The rest of this paper
will examine the benefits of this technique in a domain-specific text classification
task.
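A minimal sketch of the concatenation step is given below; the function names are ours, the models are assumed to expose a simple word-to-vector lookup (as gensim KeyedVectors do), and zeros are used for out-of-vocabulary words, as the Word2Vec and GloVe variants do in Sect. 5.3.

import numpy as np

def sidecar_vector(word, pretrained, sidecar, m=300, n=100):
    # Concatenate an m-dimensional general-purpose vector with an n-dimensional domain vector.
    general = pretrained[word] if word in pretrained else np.zeros(m)
    domain = sidecar[word] if word in sidecar else np.zeros(n)
    return np.concatenate([general, domain])      # length m + n

def encode_text(tokens, pretrained, sidecar, max_len=100, m=300, n=100):
    # Build the fixed-size (max_len, m + n) input matrix, zero-padded at the end.
    vectors = [sidecar_vector(t, pretrained, sidecar, m, n) for t in tokens[:max_len]]
    while len(vectors) < max_len:
        vectors.append(np.zeros(m + n))
    return np.stack(vectors)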

5 Dataset and Algorithms

5.1 Evaluation Task

In order to evaluate the embedding, a corpus of programming-related text from
the Stack Overflow4 dataset hosted on Google Cloud’s BigQuery service was col-
lected5 . Stack Overflow is a question-and-answer forum for programming topics.
Computer programming is a field particularly infamous for its jargon6 ; text from
that domain is therefore ideal for evaluating vocabulary expansion methods. Take
as an example the following sample post:

was curious on how to write a method to remove all zeros from an array.
If I have the array in the main method. For example my Main Method
would look something like

public static void main(String[] args) {


int[] test = {1, 0, 4, 7, 0, 2, 10, 82, 0};
System.out.println(Arrays.toString(test) + ": length = " +
test.length);
int[] result = removeZeros(test);
System.out.println(Arrays.toString(result) + ": length = " +
result.length);
}

and have the code output the length and the array without the zeros like:

Stack Overflow posts were collected and labelled as being related to one of
ten programming languages: C++, vb.net, Java, Perl, PHP, Python, R, SQL,

4 https://stackoverflow.com.
5 https://cloud.google.com/bigquery/public-data/stackoverflow.
6 For some interesting background on programming jargon in particular, see http://catb.org/jargon/html/distinctions.html.

Javascript, and C#. 1000 posts were collected per language, for 10000 posts
total. The task was to examine the content of the post and correctly classify
which language it is related to.

5.2 Sidecar Implementation
For testing, three popular pre-trained word embedding models were evaluated
with and without the “sidecar” technique:
– Google’s pre-trained Word2Vec model7 , trained on Google News as described
in [10]
– spaCy's en_core_web_lg GloVe model8, trained on Common Crawl9
– Facebook’s crawl-300d-2M fastText model10 , also trained on Common Crawl
All three models produce vectors of size 300.
To generate vectors for the “sidecar” approach, each embedding method was
used to generate a custom model with vectors of size 100. Each of the pre-
trained Word2Vec, GloVe and fastText embeddings were paired with a custom
embedding using that same method, to make sure that the improvements were
coming exclusively from the sidecar approach and not through using a better
embedding method.
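For reproducibility, the three pre-trained models can be loaded roughly as follows; the local file names are illustrative, the gensim, spaCy, and fasttext packages are assumed to be installed, and for fastText the subword-capable binary model is assumed so that out-of-vocabulary words can be handled.

import gensim
import spacy
import fasttext

# Google News Word2Vec vectors (binary word2vec format).
w2v = gensim.models.KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# spaCy GloVe vectors from the en_core_web_lg package.
nlp = spacy.load("en_core_web_lg")
glove_vec = nlp.vocab["collision"].vector

# Facebook fastText Common Crawl vectors (subword .bin model assumed).
ft = fasttext.load_model("crawl-300d-2M-subword.bin")
ft_vec = ft.get_word_vector("collision")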

5.3 Context Classification Model and Hyperparameter Tuning
For the context classification task defined in Sect. 5.1, data preparation pro-
ceeded as follows: both pre-trained and custom word vectors were generated
for each of the first 100 words. Following a zero-padding approach, if a text
sample contained fewer than 100 words, then the remaining vectors were filled
with zeros. Out-of-vocabulary words were also replaced with zero vectors for the
Word2Vec and GloVe models.
The neural network trained to perform the classification task is depicted in
Fig. 1, and contains a 3-layer LSTM [4] network of constant width, followed by a
10-neuron densely connected softmax layer. The input of this network was either
the pre-trained word vectors by themselves, or the pre-trained word vectors
concatenated with the custom word vectors. The point of this classification model
was to evaluate the change in predictive power caused by various approaches to
representing expert knowledge in word embedding models.
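A sketch of that classifier in Keras is shown below; the 80-unit width is one of the tuned settings from Table 8, the input dimension of 400 corresponds to the concatenated (sidecar) variant, and the optimizer and loss are illustrative assumptions not reported above.

import tensorflow as tf
from tensorflow.keras import layers

def build_classifier(seq_len=100, vec_dim=400, lstm_width=80, num_classes=10):
    # Three stacked LSTM layers of constant width followed by a 10-way softmax.
    model = tf.keras.Sequential([
        layers.Input(shape=(seq_len, vec_dim)),     # 100 word vectors per post
        layers.LSTM(lstm_width, return_sequences=True),
        layers.LSTM(lstm_width, return_sequences=True),
        layers.LSTM(lstm_width),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model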
There are two hyperparameters associated with the classification model archi-
tecture:
– The number of neurons in each layer of the LSTM network.
– The size of the context window used to generate the custom word vectors
(only applicable for sidecar variant).
7 https://code.google.com/archive/p/word2vec/.
8 https://spacy.io/models/en#section-en_core_web_lg.
9 http://commoncrawl.org/.
10 https://fasttext.cc/docs/en/english-vectors.html.

In order to confirm the hypothesis, it was necessary to compare the best
performance of each variant. Therefore, LSTM layer widths between 10 and
100 neurons were investigated, as were context windows of size 2 through 6.
Training was performed on 95% of the 10000 records, randomly chosen, for each
combination of hyperparameters. The remaining 5% was kept aside for testing.
Hyperparameter combinations with the highest overall accuracy on the testing
set were chosen for final comparisons.

5.4 Embedding Model Comparison

Random permutation and selection of train/test data added an additional hid-
den hyperparameter that may affect performance. To minimize the impact of
randomness on the observations, for each of the hyperparameter combinations
mentioned previously, the model was re-trained and re-tested on ten different
randomized train/test data sets for cross-validation. The CWS scores were pro-
duced using Algorithm 21 on 30 test samples from each of the 10 categories in
the dataset.
The classification experiment was performed using the custom word vectors
by themselves, using the same hyperparameters as the fastText+sidecar variant;
this was used to verify that the combination of the pre-trained and custom
vectors performs better than either one by itself.

6 Results

6.1 Hyperparameter Tuning

Table 4 shows the results of hyperparameter tuning for the classification model
using only the pre-trained word embedding models. Tables 5, 6 and 7 show the
results of hyperparameter tuning for the sidecar variants. The highest accuracy
for each individual model is highlighted. The hyperparameters chosen for each
model are summarized in Table 8.

Table 4. Context classification accuracy for the pre-trained Word2Vec, GloVe and
fastText models

#LSTM neurons 10 20 30 40 50 60 70 80 90 100
Word2Vec 0.425 0.609 0.657 0.673 0.681 0.691 0.695 0.701 0.715 0.715
GloVe 0.507 0.641 0.675 0.661 0.709 0.703 0.699 0.705 0.735 0.723
fastText 0.625 0.679 0.691 0.725 0.733 0.731 0.749 0.758 0.752 0.747

Table 5. Context classification accuracy for models concatenating pre-trained
Word2Vec and custom fastText vectors. Results are reported for various numbers of
LSTM neurons and context window sizes.

windowSize \ #LSTM neurons  10  20  30  40  50  60  70  80  90  100
2 0.671 0.705 0.679 0.733 0.745 0.741 0.752 0.756 0.741 0.739
3 0.661 0.661 0.703 0.725 0.735 0.719 0.725 0.747 0.774 0.752
4 0.669 0.677 0.689 0.699 0.731 0.725 0.731 0.743 0.750 0.725
5 0.647 0.681 0.707 0.754 0.745 0.745 0.750 0.754 0.764 0.749
6 0.621 0.725 0.739 0.754 0.733 0.731 0.731 0.766 0.756 0.764

Table 6. Context classification accuracy for models concatenating pre-trained GloVe
and custom fastText vectors. Results are reported for various numbers of LSTM neu-
rons and context window sizes.

windowSize \ #LSTM neurons  10  20  30  40  50  60  70  80  90  100
2 0.651 0.703 0.747 0.747 0.754 0.754 0.737 0.758 0.743 0.731
3 0.613 0.697 0.735 0.729 0.739 0.754 0.735 0.762 0.768 0.743
4 0.673 0.695 0.721 0.727 0.750 0.754 0.733 0.772 0.741 0.782
5 0.693 0.715 0.743 0.745 0.735 0.760 0.762 0.745 0.737 0.762
6 0.649 0.711 0.754 0.733 0.741 0.762 0.723 0.758 0.747 0.756

6.2 Classification Performance

Figure 2 records the overall accuracy for all three pre-trained embeddings, side-
car alone, and those embeddings enhanced with a sidecar. Unsurprisingly, the
accuracy of fastText is head and shoulders above the other pre-trained models;
the n-gram approach gives it a significant advantage when evaluating a cor-
pus containing many out-of-vocabulary words. What is more significant is that
the sidecar-enhanced Word2Vec and GloVe embeddings manage to consistently
equal or exceed the overall accuracy of the fastText model; and the sidecar-
enhanced fastText consistently exceeds fastText in overall accuracy. The sidecar
embedding performs poorly by itself compared to all other models, rejecting
the hypothesis that sidecar alone (and not the pre-trained embedding model)
was responsible for the performance improvement. This experiment has demon-
strated that sidecar augments pre-trained models to provide a richer context
than either model provides by itself.
Figure 3 records the CWS ratio scores for each embedding. Once again,
the sidecar-enhanced embeddings consistently out-perform the originals. To fur-
ther illustrate the point, the CWS numerator (discrimination) and denominator

Table 7. Context classification accuracy for models concatenating pre-trained fast-
Text and custom fastText vectors. Results are reported for various numbers of LSTM
neurons and context window sizes.

windowSize \ #LSTM neurons  10  20  30  40  50  60  70  80  90  100
2 0.703 0.758 0.750 0.780 0.784 0.774 0.788 0.784 0.784 0.752
3 0.681 0.752 0.776 0.770 0.794 0.766 0.758 0.796 0.788 0.766
4 0.713 0.731 0.776 0.780 0.774 0.784 0.790 0.806 0.774 0.782
5 0.723 0.747 0.762 0.782 0.760 0.784 0.796 0.786 0.784 0.774
6 0.705 0.776 0.768 0.780 0.794 0.782 0.792 0.776 0.776 0.768

Table 8. Hyperparameter summary

Embedding \ hyperparameter   LSTM width   Context window size


Word2Vec 90 N/A
GloVe 90 N/A
fastText 80 N/A
Word2Vec+Sidecar 80 6
GloVe+Sidecar 80 4
fastText+Sidecar 70 5

Fig. 2. Context classification accuracies for each model over 10 different random seeds.

(inconsistency) are shown separately, in Figs. 4 and 5 respectively. The sidecar-


enhanced embeddings consistently achieve both higher discrimination and lower
inconsistency than the originals. Again, the sidecar embedding by itself performs
poorly compared to the others.
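For readers unfamiliar with the metric, the following rough sketch shows how a CWS-style ratio could be computed, assuming (as in the performance-based expertise literature [13]) that discrimination is the variation of the model's mean outputs across different inputs and inconsistency is its variation across repeated runs; the exact quantities plotted in Figs. 3, 4 and 5 are those defined earlier in the paper.

import numpy as np

def cws_ratio(scores):
    # scores[i, j]: model output for input i on repetition j (e.g. a random seed).
    scores = np.asarray(scores, dtype=float)
    discrimination = scores.mean(axis=1).var()   # spread across different inputs
    inconsistency = scores.var(axis=1).mean()    # spread across repetitions
    return discrimination / inconsistency        # higher is better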

Fig. 3. CWS score for each model over 10 different random seeds.

Fig. 4. CWS numerator (discrimination) for each model over 10 different random seeds.

Regarding performance, the fastText version of sidecar alone (without a pre-


trained model) was found to have high variance in general, and poor discrimina-
tion and inconsistency scores, leading to an overall poorer CWS score than other
approaches. This is perhaps because of the lack of general knowledge within the
sidecar vectors, which are produced using only the jargon-filled expert-level text
corpus.

Fig. 5. CWS denominator (inconsistency) for each model over 10 different random
seeds.

7 Conclusion

A method was presented in this work for augmenting word embedding models
with expert knowledge from a corpus of text. This work investigated a simple
method of enriching existing word embeddings with domain-specific information
using a small, custom word embedding. The advantage of this technique is that it
is less computationally expensive than retraining a full word embedding model,
and it leverages the general syntactic information encoded in large pre-trained
word embeddings. This work demonstrated quantitatively that this approach
leads to improvements in performance when classifying domain-specific text,
as measured by both common machine learning metrics such as overall accu-
racy and a quantitative measure of expertise. One major limitation is that it
does require a dataset from the target domain - if there is not enough training
data, the sidecar structure will not be able to improve the embeddings. Another
limitation is that this approach does require additional compute time and con-
figuration to generate the sidecar embeddings, rather than using a pre-trained
model as-is. Future work may include experiments on other expert knowledge
domains beyond the realm of internal audit, such as medicine, finance, and law. The
sidecar method could also be applied to other forms of embedding altogether,
such as improving the quality of image embeddings in specific contexts. Another
interesting approach would be to experiment with merging the pre-trained and
custom embeddings deeper inside the model, passing the custom and pre-trained
vectors separately into a model and then merging the layers later on, as in [6].

References
1. Anders Ericsson, K., Charness, N.: Expert performance: its structure and acquisi-
tion. Am. Psychol. 49, 725–747 (1994)
2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
subword information. arXiv preprint arXiv:1607.04606 (2016)
3. Dong, J., Huang, J.: Enhance word representation for out-of-vocabulary on Ubuntu
dialogue corpus. CoRR abs/1802.02614 (2018). http://arxiv.org/abs/1802.02614
4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9,
1735–1780 (1997)
5. Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A.,
Fidler, S.: Skip-thought vectors. In: Advances in Neural Information Processing
Systems, pp. 3294–3302 (2015)
6. Limsopatham, N., Collier, N.: Modelling the combination of generic and target
domain embeddings in a convolutional neural network for sentence classification.
Association for Computational Linguistics (2016)
7. McBride, M.F., Burgman, M.A.: What is expert knowledge, how is such knowledge
gathered, and how do we use it to address questions in landscape ecology? In:
Perera, A., Drew, C., Johnson, C. (eds.) Expert Knowledge and Its Application in
Landscape Ecology, pp. 11–38. Springer, New York (2012)
8. Merriam-Webster Online: Merriam-Webster Online Dictionary (2009). http://
www.merriam-webster.com
9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:1301.3781 (2013)
10. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: Advances in Neural
Information Processing Systems, pp. 3111–3119 (2013)
11. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word repre-
sentation. In: Empirical Methods in Natural Language Processing (EMNLP), pp.
1532–1543 (2014). http://www.aclweb.org/anthology/D14-1162
12. Saal, F.E., Downey, R.G., Lahey, M.A.: Rating the ratings: assessing the psycho-
metric quality of rating data. Psychol. Bull. 88(2), 413 (1980)
13. Shanteau, J., Weiss, D., Thomas, R., Pounds, J.: Performance-based assessment of
expertise: how to decide if someone is an expert or not. Eur. J. Oper. Res. 136(2),
253–263 (2002)
14. Shapiro, D.: Composing recommendations using computer screen images: a
deep learning recommender system for PC users. Ph.D. thesis, Université
d’Ottawa/University of Ottawa (2017)
15. Shapiro, D., Qassoud, H., Lemay, M., Bolic, M.: Visual deep learning recom-
mender system for personal computer users. In: The Second International Con-
ference on Applications and Systems of Visual Paradigms, VISUAL 2017, pp. 1–
10 (2017). https://www.thinkmind.org/index.php?view=article&articleid=visual
2017 1 10 70006
16. Speer, R., Chin, J.: An ensemble method to produce high-quality word embeddings.
arXiv preprint arXiv:1604.01692 (2016)
17. Xu, C., Bai, Y., Bian, J., Gao, B., Wang, G., Liu, X., Liu, T.Y.: RC-NET: a general
framework for incorporating knowledge into word representations. In: Proceedings
of the 23rd ACM International Conference on Information and Knowledge Man-
agement, pp. 1219–1228. ACM (2014)
Performance Analysis of Support Vector
Regression Machine Models in Day-Ahead
Load Forecasting

Lemuel Clark P. Velasco(&), Daisy Lou L. Polestico,


Dominique Michelle M. Abella, Genesis T. Alegata,
and Gabrielle C. Luna

Premier Research Institute of Science and Mathematics,


MSU-Iligan Institute of Technology, 9200 Iligan City, The Philippines
lemuelclark.velasco@g.msuiit.edu.ph

Abstract. Support vector machines (SVM) constitute a machine learning framework
that has exhibited strong performance in classification and clustering tasks. This study
explored Support Vector Regression Machines (SVRM), a specialized application of
SVM to predictive functions, by conducting a performance analysis of various SVRM
models for day-ahead load forecasting. In order to find an appropriate SVRM model
that can yield promising forecasting results, data preparation involving data
representation and feature selection was conducted on the electric load dataset, and it
was found that only the time attribute has a relevant relationship with the consumed
electric load. Through the selection of an appropriate kernel along with its SVRM
parameters and SVRM architecture, it was found that the Radial Basis Function kernel
with SVRM parameters c = 110, g = 0.001, e = 0.01 and p = 0.005, implemented in an
SVRM architecture that uses the day before, two days before, seven days before, and
fourteen days before electric load data as input, yields the best forecasting results. The
results obtained by this study clearly suggest that with proper data representation,
feature selection, kernel selection, parameter selection and architecture selection,
SVRM can go beyond clustering and classification and serve as a viable technique for
day-ahead electric load forecasting.

Keywords: Machine learning performance · Support vector machines · Support vector
regression machines · Electric load forecasting

1 Introduction

Support Vector Machine (SVM) is a supervised artificial intelligence method originally


applied to solve problems in classification and clustering. Compared to other machine
learning methods, SVM implements the structural risk minimization principle to
minimize an upper bound on the generalization error rather than employing the
empirical risk minimization principle to minimize the training error, giving SVM better
generalization performance [1–3]. The applications of SVM in machine learning have
already gone beyond its original purposes of classification and clustering, as it is showing

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 541–555, 2020.
https://doi.org/10.1007/978-3-030-39442-4_40

promising prediction capabilities in the field of electric load forecasting through


Support Vector Regression Machines (SVRM) [1, 4]. Belonging to a category of kernel
methods, SVRM inherits the classification power of SVM and uses this advantage to
recognize patterns which in turn generate predictions. The kernels provide SVRM with a
powerful and unified framework for pattern discovery that acts on general types of data
such as strings, vectors or text and learns general types of relations in terms of rankings,
classifications, regressions, and clusters [2, 4, 5]. Aside from kernel selection, there are
many components of the SVRM model that need to be configured so that it can
effectively deliver close-to-accurate forecasts. To achieve accurate prediction results, the
data to be used in the SVRM must be scaled, the features must be properly selected,
and the models must have appropriate partitioning and inputs. The SVRM itself has its
own requirements such as the proper kernel, kernel parameters, soft margin
parameter C and other variables [4, 5]. In addition, if the SVRM architecture does not
suit the data, the model tends to yield severely poor performance.
Electric load forecasting is a vital process in any management of power systems as
it allows decision makers to plan the production, transmission and distribution of
electricity used by industries and households. The availability of historical consumed
electric load as well as advanced machine learning integrated development environments
has made it easier to apply artificial intelligence to long-term, medium-term and
short-term electric load forecasting [3, 6]. Day-ahead load prediction, categorized under
short-term electric load forecasting, generally produces predictions at hourly granularity
that aid power utilities in maintenance planning as well as load switching operations.
Performance analysis of machine learning frameworks like SVRM enables the
appropriate configuration of the architecture's structure and components, which in turn
enables classification as well as pattern recognition of historical time, weather and
electric variables [7]. More than the type of data being fed into the SVRM, performance
analysis of the SVRM models plays an important role in the implementation of SVRM
in any forecasting endeavor [5, 8, 9]. It is the SVRM model, consisting not just of the
kernel and its appropriate parameters but also of its architecture, that determines whether
the supervised SVRM, originally intended for classification and clustering, can deliver
close-to-accurate forecasting results in the particular problem domain it tries to solve.
Power utilities assigned in the distribution of electricity among select localities are
in need of personalized predictive models like SVRM to forecast day-ahead load as the
absence of such scientific methods will lead to over or underestimation of nominated
load consumption in their various electricity markets. Accurate day-ahead forecast of
power consumption is essential in the decision making of power utilities that involves
functions in field scheduling, reliability analysis, and purchasing electric power [8].
Due to SVM's original classification function and its relatively novel use in forecasting, this study
aims to contribute in the development of appropriate SVRM forecasting models for
short-term load forecasting by conducting a performance analysis of various kernels,
SVRM architecture structures and component parameters. After electric load data
preparation, feature selection was conducted to pave the way for kernel selection and
the performance evaluation of various SVRM parameters and SVRM architectures.

Through an evaluation of SVRM models for day-ahead load forecasting, this research
hopes to contribute to the growing body of literature that explores the predictive
capability of SVM that will hopefully assist the appropriate generation, transmission
and distribution of electricity.

2 Methodology
2.1 Electric Load Data Preparation
Data preparation involving data selection, data representation and feature selection
plays an important role in successful SVRM implementation for any forecasting purpose,
as this process ensures good data quality before the data are fed into the machine
learning framework. As shown in Table 1, electric load data in terms of kilowatts
delivered (KW_DEL) at fifteen-minute resolution was acquired from a power utility,
along with the corresponding kilowatt-hours delivered (KWH_DEL), kilovolt-ampere
reactive hours delivered (KVARH_DEL), metering point (SEIL), and billing date
(BDATE).

Table 1. Sample raw data.


SEIL BDATE TIME KW_DEL KWH_DEL KVARH_DEL
XX XX XX XX XX XX
XX XX XX XX XX XX

A total of 70,109 rows of data were then represented, scaled and partitioned in
order to be processed by the SVRM. The BDATE attribute was split into its
respective day, month and year attributes to ease machine learning and to deal with the
limitation of SVRM libraries in handling data with commas, semicolons and other marks.
Additional attributes such as calendar day, holiday and weekend indicators were added
to the original dataset, as the literature suggests that these are material to the performance
of SVRM in load forecasting [10]. Binary variables were then used to represent non-
numeric features [4, 11]. The 15-min TIME attribute was represented as an integer
equivalent, with 00:15 (read as 12:15 AM) converted to 1 and 00:00 (read as 12:00
midnight) converted to 96. As shown in Eq. (1), the Min-Max normalization method was
then used to scale the electric load data, since this method has been shown to yield good
results in regression problems by scaling the data so that higher values do not suppress
lower values [12].

x' = (x − min(x)) / (max(x) − min(x))    (1)

where x is the electric load value, min(x) and max(x) are the minimum and maximum
electric load in the dataset, and x' is the scaled electric load value.
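A minimal sketch of this preparation step is shown below, assuming a pandas DataFrame with the raw BDATE, TIME and KW_DEL columns (column names, formats and the use of pandas are assumptions for illustration):

import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    # Split BDATE into DAY/MONTH/YEAR attributes.
    bdate = pd.to_datetime(df["BDATE"])
    out = pd.DataFrame({"DAY": bdate.dt.day, "MONTH": bdate.dt.month,
                        "YEAR": bdate.dt.year})
    # Map the 15-min TIME attribute to an integer 1..96 (00:15 -> 1, 00:00 -> 96).
    t = pd.to_datetime(df["TIME"], format="%H:%M")
    slot = t.dt.hour * 4 + t.dt.minute // 15
    out["TIME"] = slot.replace(0, 96)
    # Min-Max scaling of the consumed load, Eq. (1).
    x = df["KW_DEL"].astype(float)
    out["KW_DEL"] = (x - x.min()) / (x.max() - x.min())
    return out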

After data representation and data scaling, the dataset was partitioned into two sets: the
training set and the validation set. The training set was used for training the formulated
SVRM models with different parameters, while the validation set was used for
testing the design of the SVRM model to confirm its predictive accuracy. As shown in
Fig. 1, January 1, 2013 to November 2014 was identified as the training set, while
December 2014 was set as the validation set, with the available December 2014 data
spanning December 1, 2014 to December 25, 2014.

Fig. 1. Data partitioning of the dataset.

Despite the small number of variables used in this study, the researchers were still
conservative enough to perform feature selection to ensure the materiality of the data to
be fed into the SVRM models, by identifying the most relevant input variables within
the data set and removing irrelevant, redundant, or noisy data. Proper selection of
features or relevant input variables can improve the prediction performance of machine
learning models, specifically that of SVRM, which was originally designed for classification
[3, 8, 13]. Given a subset of the whole data set, correlation-based filter feature selection
and information gain approaches were used for the feature selection process [14–16].
The Waikato Environment for Knowledge Analysis (Weka) and R Programming lan-
guage were used as tools to perform feature selection with Pearson’s correlation,
Spearman correlation matrix, and Kendall's correlation used in the correlation-based
approach. The value of these correlation coefficients ranges between −1 and 1 with the
strongest linear relationship being indicated by a correlation coefficient of −1 or 1 while
the weakest linear relationship is indicated by a correlation coefficient equal to 0.
A positive correlation means that if one variable gets bigger, the other variable tends to
get bigger with it while a negative correlation means that if one variable gets bigger, the
other variable tends to get smaller along with it. On the other hand, information gain IG
(G1;G2;C) is the gain of mutual information of knowing both G1 and G2 with respect
to the class C. A positive value of IG(G1;G2;C) indicates the synergy between G1 and
G2. In other words, it measures the amount of information in bits about the class
prediction, which in this case is the KW_DEL feature. The information gain measure is
based on the entropy concept. It is commonly used as the measure of feature relevance
in filter strategies that evaluate features individually with the advantage of being fast
[17]. Let D(A1;A2;…;An;C), n > 1, be a data set with n + 1 features, where C is the

class attribute. Let m be the number of distinct class values. The entropy of the class
distribution in D, represented by Entropy(D), is shown in Eq. (2).
Entropy(D) = − Σ_{i=1}^{m} p_i · log2(p_i)    (2)

Where pi is the probability that an arbitrary instance in D belongs to class ci. This
concept is used by the single-label strategy known as Information Gain Attribute
Ranking to measure the ability of a feature to discriminate between class values [16,
17]. Similar to correlation based approach, this relationship is constrained by
−1 < r < 1.
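A small sketch of Eq. (2) and of the resulting information gain of a feature is given below, assuming the continuous KW_DEL target has first been discretized into class bins (as a tool like Weka would do); the function and column names are illustrative:

import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    # Eq. (2): Entropy(D) = -sum_i p_i * log2(p_i)
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df: pd.DataFrame, feature: str, target: str) -> float:
    # Reduction in the target's class entropy obtained by splitting on `feature`.
    base = entropy(df[target])
    conditional = sum(len(g) / len(df) * entropy(g[target])
                      for _, g in df.groupby(feature))
    return base - conditional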

2.2 SVRM Model Design Evaluation


To design and evaluate the SVRM models for load prediction, kernel selection,
parameter selection, and architecture selection were conducted. Kernel selection is the
process of selecting the optimal kernel method that fits the regression problem, in this
case day-ahead load forecasting. It involves testing different kernel methods in order to
find the most appropriate kernel to use for selecting the parameters needed for the
SVRM model. Kernels provide a powerful and unified framework for pattern discovery
and motivating algorithms that act on general types of data such as strings, vectors or
text, and look for general types of relations like rankings, classifications, regression and
clusters [18, 19]. The behavior of the SVRM kernels was then tested and observed
based on the prediction accuracy of the resulting predictive model [5, 18, 19]. The
kernel with the smallest Mean Absolute Percentage Error (MAPE) was selected and
used for parameter selection.
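For reference, the MAPE used throughout this evaluation can be computed as in the sketch below (a straightforward definition; the averaging over 15-min intervals and validation days follows the description in Sect. 3):

import numpy as np

def mape(actual, predicted) -> float:
    # Mean Absolute Percentage Error, in percent.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(100.0 * np.mean(np.abs((actual - predicted) / actual)))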
Parameter selection, also known as parameter search, is the process of finding the
optimal values of the parameters needed to accurately predict unknown data, in this
case electric load. These parameters govern the training process and have a
profound effect on the resulting model's performance. Moreover, these parameters
control the tradeoff between margin maximization and error minimization and are used
in the nonlinear mapping of the features [8, 20]. With different schemes of model
construction, a series of experiments was conducted. The MAPE values of models with
different parameters were the main performance concern of this study. Table 2 shows
the parameters needed to create the SVRM model for load prediction.

Table 2. SVRM parameters and its usage.


Parameter                                Usage
Cost penalty (C)                         Determines the balance between the flatness of the hyperplane and the extent to which deviations larger than e are tolerated; in other words, it is the tuning parameter
Insensitive loss function parameter (e)  The width of the e-insensitive zone/tube
P                                        Sets the epsilon in the loss function of epsilon-SVR
Kernel parameter (g)                     Gamma of the RBF kernel function

The parameter C allows a tradeoff between training error and model complexity. If the
value of C is too large, overfitting will occur. On the other hand, if C is too small, it may
result in underfitting and increase the number of training errors [2, 5]. The parameter e
controls the width of the e-insensitive zone used to fit the training data [5]. A larger
e-value results in fewer support vectors being selected and in flatter, less complex
regression estimates [3, 10]. If the value of e is too big, the separating error is high and
the number of support vectors is small, and vice versa [10]. The kernel parameter gamma
(g) defines the nonlinear mapping from the input space to some high-dimensional feature
space [3, 9]. The optimality of the parameter values depends on their effect on the
predictive accuracy of the resulting model. Using the kernel selected in the previous step,
testing of parameters was performed. By comparing the predictive results, the parameters
with the lowest error were selected. The acquired values of the parameters and kernel
were then used to select which SVRM architecture would be used in the model.
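For illustration, an epsilon-SVR with the RBF kernel and parameters of the form used in this study could be instantiated as below. scikit-learn is assumed here rather than the authors' actual toolchain, and the mapping of the paper's e and p onto the library's tol and epsilon arguments is an assumption:

from sklearn.svm import SVR

# Assumed mapping of the paper's parameters onto scikit-learn's epsilon-SVR:
#   c -> C (cost penalty), g -> gamma (RBF kernel parameter),
#   e -> tol (termination tolerance), p -> epsilon (insensitive-tube width).
model = SVR(kernel="rbf", C=125, gamma=0.001, tol=0.01, epsilon=0.0045)
# model.fit(X_train, y_train) and model.predict(X_valid) would then be scored
# with MAPE on the December 2014 validation set.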
Architecture selection is the process of choosing the structure of the dataset
depending on the number of previous days to be included in the training set. In this
study, two SVRM architectures were developed, wherein each model produces 96
forecasted load values at 15-min resolution for the day to be forecasted. As shown in
Fig. 2, the architectures have all attributes approved by the feature selection phase as
inputs.

Fig. 2. SVRM architectures design.

In Architecture I, the model has a day's consumption data at 15-min resolution
from each of the following: a day before, 2 days before, 7 days before, and 14 days
before the date of the model. These are represented by i-1, i-2, i-7, and i-14, where
i represents the current date of the model (i = 1, …, 365). The format of Architecture I
was considered since it has been found relatively effective by researchers [1]. In
Architecture II, the model initially has a day's consumption data at 15-min resolution
taken from the date of the day before it. As long as a minimum error is not achieved,
another day's worth of consumption data is added as input to Architecture II; this
process is iterated until the optimal accuracy is found. This is represented as i-1, i-2, …,
i-n, where n represents the number of days. Because previous research does not
establish a standard number of days of consumption data used for prediction,
Architecture II was created to ensure that the appropriate number of days is utilized to
maximize accuracy in load prediction [1, 7, 8]. The architectures were then trained
using the kernels and parameters chosen in the kernel selection and parameter selection
phases. The optimally performing SVRM model was then identified after choosing
which architecture yielded the best results.
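A sketch of how the Architecture I inputs (i-1, i-2, i-7, i-14) could be assembled from a 15-min-resolution load series is shown below; the pandas layout and column names are assumptions for illustration:

import pandas as pd

STEPS_PER_DAY = 96  # 15-min resolution

def architecture_one(load: pd.Series) -> pd.DataFrame:
    # Build the i-1, i-2, i-7 and i-14 lagged inputs for each 15-min interval.
    frame = pd.DataFrame({"KW_DEL": load})
    for lag_days in (1, 2, 7, 14):
        frame[f"i-{lag_days}"] = load.shift(lag_days * STEPS_PER_DAY)
    return frame.dropna()  # drop the first 14 days, which lack full history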

3 Results and Discussion


3.1 Electric Load Data Preparation Results
In preparing the electric load data for the SVRM model design, the dataset used in the
study was described, represented, and its features were selected. Table 3 shows the
represented data with the added considered features. TIME is now expressed numerically
along with MONTH, YEAR and DAY created from BDATE. The sample data show the
scaled consumed load at 1:15 AM on January 1, 2013 (a Tuesday, indicated by DAY_TYPE
010, and a weekday, indicated by DATE TYPE 1), along with the scaled consumed load at
11:30 PM on February 9, 2013 (a Saturday, indicated by DAY_TYPE 110, with a weekend
DATE TYPE of 0).

Table 3. Sample of the represented dataset with the added features.


KW_DEL TIME MONTH YEAR DAY DAY_TYPE DATE TYPE
0.4581498 5 1 2013 1 010 1
0.4449339 94 2 2013 9 110 0

Despite the usual inclusion of weather attributes in the dataset processed by


machine learning frameworks in predicting electric load, this study did not consider
these variables based on the suggestions of previous authors [1, 7]. It has been reported
that using weather attributes such as the temperature corresponding to the consumed load
may worsen electric load prediction results, since temperature behavior is a separate
forecasting problem [7]. This tends to result in imprecise load forecasts, even though the
model may perform well in the training phase [1, 4]. With
these findings, it was decided that the dataset is sufficient for the load forecasting
problem using SVRM.
After data was scaled and partitioned, final features were then conservatively
identified using both correlation-based filter feature selection and information gain
approaches. Studies support that the use of feature selection, despite the small number

of evaluated features improves the accuracy and effectiveness of the SVRM model with
lower forecasting error [1, 2, 4]. This is due to feature selection reducing the dimen-
sionality of the data and enabling regression algorithms like SVRM to operate faster
and more effectively. Since this study used the R programming language to implement
Pearson’s correlation, Spearman correlation matrix, and Kendall’s correlation, the
function cor() of R was used to calculate the weighted correlation of the given data set.
Pearson’s correlation is a statistical measure of the strength of a linear relationship
between paired data [14, 15].
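An equivalent of this step (shown in Python rather than R purely for consistency with the other sketches; the column names are those of Table 3 and all columns are assumed numeric) would be:

import pandas as pd

def feature_correlations(df: pd.DataFrame, target: str = "KW_DEL") -> pd.DataFrame:
    # Correlation of every attribute with the target under the three coefficients.
    return pd.DataFrame({
        method: df.corr(method=method)[target].drop(target)
        for method in ("pearson", "spearman", "kendall")
    })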
With Pearson’s correlation, Fig. 3(a) shows that the TIME attribute has a
0.52901261 correlation value to KW_DEL, MONTH has −0.270584464, YEAR has
−0.0675271275, DAY TYPE −0.028027669, DATE TYPE −0.021399675, and DAY
has −0.0183968342. Thus, the time attribute has the highest correlation to KW_DEL.
Attributes YEAR, DAY TYPE, DATE TYPE, and DAY have a relatively low corre-
lation to KW_DEL since they are closer to zero.

Fig. 3. Correlation matrices results.

Spearman correlation matrix measures the strength of a monotonic relationship


between paired data. A monotonic relationship is one wherein, as the value of one
variable increases, the value of the other variable either consistently increases or
consistently decreases [15]. As shown in Fig. 3(b), in
Spearman’s correlation, the TIME attribute has a 0.56725912 correlation value to
KW_DEL, MONTH has −0.2463835254, YEAR has −0.0749241004, DAY TYPE
−0.0489527428, DAY has −0.0434596375, DATE TYPE −0.027798819. Thus, the
TIME attribute has the highest correlation to KW_DEL. Attributes YEAR, DAY TYPE,
DAY, and DATE TYPE have a relatively low correlation to KW_DEL since they are
closer to zero. The last correlation-based approach performed in this study is Kendall's
correlation, a non-parametric test for statistical dependence based on the rank coefficient.
As shown in Fig. 3(c), implementing this type of correlation yielded
the TIME attribute to have a correlation value of 0.39739378 to KW_DEL, MONTH has
−0.1805927803, YEAR has −0.0615430231, DAY TYPE −0.0402100228, DAY has
−0.0312288852, and DATE TYPE has −0.022834086. Thus, the TIME attribute has the
highest correlation to KW_DEL. Attributes MONTH, YEAR, DAY TYPE, DAY, and
DATE TYPE have a relatively low correlation to KW_DEL since they are closer to zero.
By performing the three types of correlation based approach in feature selection, it was
evident that TIME attribute is the only attribute that can affect the predictive variable
KW_DEL showing time having the highest correlation to KW_DEL while the rest of the
attributes show a low correlation to KW_DEL.

By implementing the Information Gain Approach in WEKA, Table 4 shows that


the TIME attribute has 0.7277 Information Gain value to KW_DEL and MONTH
attribute has 0.2451.

Table 4. Results of the information gain approach.


Attribute    Information gain value to KW_DEL
TIME         0.7277
MONTH        0.2451
YEAR         0.0688
DAY          0
DATE TYPE    0
DAY TYPE     0

Figure 4 shows the Information Gain relationship between the different attributes
and KW_DEL. The orange parts of the illustration represent KW_DEL while the blue
parts represent specific attributes. Only attributes TIME and MONTH converge to
KW_DEL, signifying that time and month have a relevant relationship to KW_DEL. The
attributes DAY, YEAR, DATE TYPE, and DAY TYPE did not converge with
KW_DEL, signifying little to no relationship. Thus, in the Information Gain approach,
MONTH and TIME were selected as the features that can affect the predictive variable
while the rest of the attributes show poor correlation.

Fig. 4. Information Gain of the features.

To select between the correlation-based approach, which considered only the TIME
attribute as affecting the load, and the Information Gain approach, which considered both
TIME and MONTH, this study compared the MAPE values of four test models that used
different SVRM test parameters. Table 5 shows that
models with correlation-based feature selection yielded better results than the models
with Information Gain feature selection. The MAPE values generated by the models are
the 15-min-resolution averages of the MAPE values over the 25 days of December 2014,
which in this case is the validation set of the study. Correlation-based models were
therefore selected in this study, and the only attribute considered as having a relevant
relationship to KW_DEL is the TIME attribute.

Table 5. Comparison of MAPE values of Correlation-based and Information Gain.

Model      Model parameters                 MAPE (Correlation-based)  MAPE (Information Gain)
FS TEST 1  c=125 g=0.001 e=0.01 p=0.0045    4.09%                     5.60%
FS TEST 2  c=125 g=0.001 e=0.01 p=0.005     4.11%                     5.66%
FS TEST 3  c=115 g=0.001 e=0.01 p=0.005     4.10%                     4.12%
FS TEST 4  c=130 g=0.001 e=0.01 p=0.005     4.16%                     5.62%

3.2 SVRM Model Design Evaluation Results


It is important to select the proper kernel since it provides a powerful and unified
framework for pattern discovery in a regression problem. Studies stated that the kernel
function enables the transformation of the input space into high-dimensional feature
space where it is possible to apply SVRM [2, 4]. A study used the linear kernel in
establishing one of their SVRM prediction models [21]. Another type of kernel called
Radial Basis Function (RBF) kernel was used by a study in their electric load pre-
diction using an SVRM model [1]. To determine whether the linear or RBF kernel would
be used in the final model of this study, the researchers evaluated the MAPE values that
the kernels could yield. Seven models were tested with different parameters for the two
kernels. A 4.09% MAPE, the lowest percentage error among the tested models, was
achieved by a model with the RBF kernel. Table 6 shows that the RBF kernel performed
better than the linear kernel.

Table 6. Comparison of performance between Linear and RBF Kernels.

Linear parameters  Linear MAPE value   RBF parameters                   RBF MAPE value
c=1                5.23%               c=120 g=0.001 e=0.01 p=0.005     4.10%
c=0.5              5.22%               c=110 g=0.001 e=0.01 p=0.005     4.13%
c=0.25             5.22%               c=115 g=0.001 e=0.01 p=0.005     4.12%
c=0.1              5.17%               c=130 g=0.001 e=0.01 p=0.005     4.16%
c=0.05             5.15%               c=125 g=0.001 e=0.01 p=0.0045    4.09%
c=0.01             5.19%               c=126 g=0.001 e=0.01 p=0.005     4.12%

The MAPE, computed from the daily predictions for the 25 days of December 2014,
shows that the RBF kernel's accuracy is superior to that of the linear kernel, which
could not even produce a MAPE below 5%. A study on load forecasting using SVRM
that used the RBF kernel in its load prediction model achieved an accuracy of 97% [1].
The RBF kernel was also used in a similar study, which yielded a MAPE of 2.31% [4].
Given the behavior of the data used in this study, RBF was found to be more accurate.
Thus, RBF was chosen as the kernel for the developed SVRM models.
In searching for the suitable SVRM parameters, this study assessed the performance
of selected SVRM models. To do this, the partitioned datasets, namely the training and
validation sets, were used to generate the MAPE values of the tested models.
The MAPE value generated by each model comes from the 25 days of December 2014,
which in this case was the validation set of the study. Based on each model's
performance on the validation set, the researchers inferred the proper values of the
parameters. Figure 5 shows the iterated procedure conducted in this study to select the
best parameters, which were then used for the best-performing SVRM model.

Fig. 5. Parameter selection results generation procedure.

A study suggested a similar results generation procedure, and the model developed in
that study yielded a 3.62% margin of error [2]. Another study also used a similar
procedure and arrived at good parameters for its SVRM model [7]. The model developed
in that study yielded a predictive error of less than 5% and won first place in the EUNITE
competition in 2001, marking the rising popularity of SVM for forecasting. Table 7
shows the SVRM parameters with the lowest MAPE among the 80 tested models. These
models used the RBF kernel, as decided in the kernel selection phase.

Table 7. Parameter selection results.

Model  Parameters                       MAPE for December 2014
A      c=110 g=0.001 e=0.01 p=0.005     4.36%
B      c=120 g=0.001 e=0.01 p=0.005     4.10%
C      c=118 g=0.001 e=0.01 p=0.005     4.13%
D      c=115 g=0.001 e=0.01 p=0.005     4.12%
E      c=125 g=0.001 e=0.01 p=0.005     4.11%
F      c=130 g=0.001 e=0.01 p=0.005     4.16%
G      c=125 g=0.001 e=0.011 p=0.005    4.12%
H      c=125 g=0.001 e=0.01 p=0.0045    4.09%
I      c=127 g=0.001 e=0.01 p=0.005     4.12%
J      c=126 g=0.001 e=0.01 p=0.005     4.12%
K      c=123 g=0.001 e=0.01 p=0.005     4.11%
L      c=124 g=0.001 e=0.01 p=0.005     4.13%

Results show that Model H with parameters c = 125, g = 0.001, e = 0.01 and
p = 0.0045 yielded the lowest MAPE among the tested models. It can also be observed
that Model B is a promising model with a MAPE value of 4.10%. Models B and H have
the same values for the g and e parameters. By adjusting the cost value C as defined in
Step 3 of the procedure above and adjusting the value of p, Model H yielded a lower
error than Model B. Model A yielded the highest error among the tested models, with
the parameters c = 110, g = 0.001, e = 0.01 and p = 0.005. As observed, the rest of the
models, which have C greater than 110, generated much better MAPE values than
Model A with its C of 110. From the tested models shown, it can be inferred that
c = 125, g = 0.001, e = 0.01 and p = 0.0045 are the optimal parameters for the RBF
kernel. Thus, the parameters of Model H were used for the next phase, which is
selecting the best architecture for the SVRM model.
There were two SVRM architectures presented in this study. In Architecture I, the
model has daily load consumption attributes at 15-min resolution from each of the
following: 1 day before, 2 days before, 7 days before, and 14 days before. This
architecture is denoted as i-1, i-2, i-7, i-14, where i represents the day to be predicted
and the number after the minus sign represents the number of days before the
predicted date. A study used this architecture for short-term load forecasting and
established an accurate SVRM model with a MAPE of 2.861% [1]. Table 8
shows a portion of the Architecture I dataset.

Table 8. Architecture I data format.


TIME KW_DEL i-1 i-2 i-7 i-14
1 0.6946903 0.6548673 0.6460177 0.5973451 0.9955752
2 0.6858407 0.6371681 0.6327434 0.5840708 0.9778761

Since there is no standard number of days of load consumption data used for load
forecasting, this study introduced Architecture II. Architecture II initially has a day's
consumption data at 15-min resolution taken from the previous date, denoted as i-1. The
process is iterated, incrementing the number of previous days considered, until the
smallest predictive error is found. Table 9 shows a portion of the Architecture II dataset;
a sketch of this iterative search is given after the table.

Table 9. Architecture II data format.


TIME KW_DEL i-1 i-2 i-3 i-4 i-5 i-6 i-7
1 0.5991190 0.6387670 0.6607930 0.6519820 0.6519820 0.6475770 0.6828190 0.9955950
2 0.5859030 0.6211450 0.6431720 0.6387670 0.6387670 0.6431720 0.6651980 0.9779740
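A minimal sketch of the Architecture II search described above is given below; the helper names, the date-based split and the fit_and_predict callback (e.g. the RBF epsilon-SVR sketched earlier) are assumptions for illustration:

import numpy as np
import pandas as pd

STEPS_PER_DAY = 96  # 15-min resolution

def lagged_frame(load: pd.Series, n_days: int) -> pd.DataFrame:
    frame = pd.DataFrame({"KW_DEL": load})
    for d in range(1, n_days + 1):                   # i-1 ... i-n
        frame[f"i-{d}"] = load.shift(d * STEPS_PER_DAY)
    return frame.dropna()

def architecture_two_search(load, train_end, valid_start, fit_and_predict, max_days=14):
    # Add one more lag day at a time; keep the lag set with the lowest validation MAPE.
    best = (None, float("inf"))
    for n in range(1, max_days + 1):
        frame = lagged_frame(load, n)
        train, valid = frame[:train_end], frame[valid_start:]
        pred = fit_and_predict(train, valid.drop(columns="KW_DEL"))
        error = 100.0 * np.mean(np.abs((valid["KW_DEL"] - pred) / valid["KW_DEL"]))
        if error < best[1]:
            best = (n, error)
    return best  # e.g. (7, 4.19) for the i-1 ... i-7 variant reported in this study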

In this research, the Architecture II variant that yielded the lowest MAPE was the one
with seven days as past attributes (i-1, i-2, i-3, i-4, i-5, i-6, i-7). As shown in Table 10,
the least accurate variant is the one with five days as past attributes (i-1, i-2, i-3, i-4, i-5).
Interestingly, i-1 and i-1 & i-2 differ by only 0.01% in accuracy. The most accurate of
the seven was the model with (i-1, i-2, i-3, i-4, i-5, i-6, i-7), reaching a MAPE of 4.19%.
Since the (i-1, i-2, i-3, i-4, i-5, i-6, i-7) variant achieved the highest accuracy, this is the
architecture compared against Architecture I.

Table 10. Comparison between the MAPE of Architecture II.

Architecture II Design MAPE

i-1 4.37%

i-1, i-2 4.36%

i-1, i-2, i-3 4.43%

i-1, i-2, i-3, i-4 4.44%

i-1, i-2, i-3, i-4, i-5 5.97%

i-1, i-2, i-3, i-4, i-5, i-6 4.49%

i-1, i-2, i-3, i-4, i-5, i-6, i-7 4.19%

To select the best performing architecture for the SVRM model, the researchers
compared the predicted values of the two architectures to the actual values of the
December 2014 consumed load. Table 11 shows that Architecture II with seven days as
past attribute (i-1, i-2, i-3, i-4, i-5, i-6, i-7) has a MAPE of 4.19% for daily prediction in
December 2014, while Architecture I has a MAPE of 4.03%, making Architecture I
superior by only 0.16 percentage points over Architecture II. It can also be observed that
there is a sharp increase in inaccuracy on Day 25, which is December 25. Architecture I
generated its smallest daily MAPE of 1.58% for December 11, while the smallest daily
MAPE of Architecture II was 1.68%.

Table 11. Daily MAPE of Architecture I vs. Architecture II.


Day Architecture I Architecture II
Day 1 2.540710098 2.644323904
Day 2 2.351082283 2.759164399
– – –
Day 24 6.217774838 7.790127218
Day 25 11.29374136 10.4786477
MAPE 4.031940892 4.19265337

The results show that Architecture I yields the better accuracy. Although Architecture I
yields the lowest average MAPE, it is also worth noting that it differs by only 0.16
percentage points from Architecture II. This indicates that Architecture II is also a
promising architecture for SVRM load prediction.

4 Conclusion and Recommendations

A performance analysis of SVRM models was conducted in this research by performing
data preparation and SVRM model design evaluation in order to assess SVM's
application in day-ahead load forecasting. After cleaning, representing and scaling the
dataset, this study applied correlation-based and Information Gain approaches to feature
selection and found that, between the two, the correlation-based approach yielded better
results. It was also found that the RBF kernel parameters c = 110, g = 0.001, e = 0.01
and p = 0.005 along with Architecture II (i-1, i-2, i-3, i-4, i-5, i-6, i-7) produced the
lowest daily MAPE of 1.68% in day-ahead load prediction and a MAPE of 4.19% for
the December 2014 predictions. This research provided a MAPE comparison between
SVRM parameters as well as SVRM architectures and recorded that using the day
before, two days before, seven days before, and fourteen days before electric load data
as input for the SVRM model yields the best result.
Based on the findings of the study, the researchers recommend further studies that focus
on the performance analysis of SVRM models in different forecasting domains dealing
with different datasets involving seasonal and time series data. A study on the
applicability of different kernels as well as various configurations of SVRM parameters
will be helpful in exploring the applications of SVM in the field of forecasting. Lastly,
researchers could also explore different types of SVRM architectures aside from the
ones presented, and specifically find ways to optimize the performance of Architecture II,
since it lagged behind Architecture I by only a very small margin of error. The results of
this study highlight a proposed performance analysis mechanism for exploring the
potential of SVM, not just in the field of load prediction but in forecasting in general.
This study has laid down a framework for exploring SVRM as a powerful machine
learning method for forecasting and hopes to inspire future researchers in utilizing SVM
for clustering, classification and prediction.

Acknowledgment. The authors would like to thank the support of the MSU-IIT Office of the
Vice Chancellor for Research and Extension through PRISM-Premiere Research Institute in
Sciences and Mathematics for their assistance in this study.

References
1. Ostojin, S., Kulić, F., Švenda, G., Bibić, R.: Short-term electrical load forecasting using
support vector machines. In: Computers and Simulation in Modern Science. Mathematics
and Computers in Science Engineering. A Series of Reference Books and Textbooks, vol. I,
pp 138–142. WSEAS Press (2008)
2. Elattar, E.E., Goulermas, J., Wu, Q.H.: Electric load forecasting based on locally weighted
support vector regression. IEEE Trans. Syst. Man Cybern. C 40(4), 438–447 (2010)
3. Velasco, L.C.P., Polestico, D.L.L., Abella, D.M.M., Alegata, G.T., Luna, G.C.: Day-ahead
load forecasting using support vector regression machines. Int. J. Adv. Comput. Sci. Appl.
(IJACSA) 9(3), 22–27 (2018)

4. Ceperic, E., Ceperic, V., Baric, A.: A strategy for short-term load forecasting by support
vector regression machines. IEEE Trans. Power Syst. 28(4), 4356–4364 (2013)
5. Nagi, J., Yap, K.S., Tiong, S.K., Ahmed, S.K.: Electrical power load forecasting using
hybrid self-organizing maps and support vector machines. In: Proceeding of the 2nd
International Power Engineering and Optimization Conference, PEOCO 2008, Shah Alam,
Selangor, Malaysia (2008)
6. Türkay, B.E., Demren, D.: Electrical load forecasting using support vector machines. In:
Proceedings of the 7th International Conference on Electrical and Electronics Engineering,
ELECO, Bursa, pp. I-49–I-53 (2011)
7. Chen, B.J., Chang, M.W., Lin, C.J.: Load forecasting using support vector machines: a study
on EUNITE competition 2001. IEEE Trans. Power Syst. 19(4), 1821–1830 (2004)
8. Matijas, M., Vukicevic, M., Krajcar, S.: Supplier short term load forecasting using support
vector regression and exogenous input. J. Electr. Eng. 62(5), 280–285 (2011)
9. Tan, H., Yu, X., Chang L., Wan, W.: The performance analysis of support vector machine
parameters based on statistical analysis. In: 2007 IET Conference on Wireless, Mobile and
Sensor Networks, CCWMSN 2007. IEEE (2009)
10. Turkay, B.E., Demren, D.: Electrical load forecasting using support vector machines: a case
study. Int. Rev. Electr. Eng. 6(5), I-49–I-53 (2011)
11. Espinoza, M., Suykens, J., Belmans, R., Moor, B.D.: Electric load forecasting. IEEE Control
Syst. 27(5), 43–57 (2007)
12. Patro, G.K., Sahu, K.K.: Normalization: a preprocessing stage. Int. Adv. Res. J. Sci. Eng.
Technol. 2(3) (2015)
13. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn.
Res. 3, 1157–1182 (2003)
14. Hall, M.A., Smith, L.A.: Feature selection for machine learning: comparing a correlation-
based filter approach to the wrapper. In: Proceedings of 12th International Florida Artificial
Intelligence Research Society Conference, pp. 235–239 (1999)
15. Hauke, J., Kossowski, T.: Comparison of values of Pearson’s and Spearman’s correlation
coefficient on the same sets of data. In: Proceedings of the MAT TRIAD 2007 Conference,
Bedlewo, Poland (2007)
16. Azhagusundari, B., Thanamani, A.S.: Feature selection based on information gain. Int.
J. Innov. Technol. Explor. Eng. (IJITEE) 2(2), 18–21 (2013)
17. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In:
Proceedings of the Fourteenth International Conference on Machine Learning, ICML 1997.
ACM Digital Library (1997)
18. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge
University Press, New York (2004)
19. De Bie, T., Cristianini, N.: Kernel methods for exploratory data analysis: a demonstration on
text data. In: Proceedings of the Joint IAPR International Workshops on Syntactical and
Structural Pattern Recognition (2004)
20. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for
support vector machines. Mach. Learn. 46, 131–159 (2002)
21. Ben-Hur, A., Weston, J.: A user’s guide to support vector machines. Methods Mol. Biol.
609, 223–239 (2010)
Ecommerce Fraud Detection Through Fraud
Islands and Multi-layer Machine
Learning Model

Jay Nanduri, Yung-Wen Liu, Kiyoung Yang, and Yuting Jia(&)

Dynamics 365 Fraud Protection, Microsoft, One Microsoft Way,


Redmond, WA 98052, USA
{jayna,yungliu,kiyan,yutjia}@microsoft.com

Abstract. The main challenge for e-commerce transaction fraud prevention is that
fraud patterns are rather dynamic and diverse. This paper introduces two
innovative methods, fraud islands (link analysis) and a multi-layer machine
learning model, which can effectively tackle the challenge of detecting diverse
fraud patterns. Fraud islands are formed using link analysis to investigate the
relationships between different fraudulent entities and to uncover hidden,
complex fraud patterns through the formed network. The multi-layer model is
used to deal with the largely diverse nature of fraud patterns. Currently, fraud
labels are determined through different channels, which are banks' declination
decisions, manual review agents' rejection decisions, banks' fraud alerts and
customers' chargeback requests. It can be reasonably assumed that different
fraud patterns could be caught through different fraud risk prevention forces (i.e.
bank, manual review team and fraud machine learning model). The experiments
showed that by integrating a few different machine learning models that were
trained using different types of fraud labels, the accuracy of fraud decisions can
be significantly improved.

Keywords: Ecommerce fraud detection · Multi-layer machine learning model ·
Link analysis · Chargebacks · Fraud islands

1 Introduction

Ecommerce has caused a tectonic shift in the retail landscape and opened vast new
opportunities for retail merchants. Although it provides greater convenience for mer-
chants and customers in conducting business, it unfortunately has also exposed them to
serious threats from sophisticated fraudsters who commit various online transaction
fraud and online service abuse. Retail fraud amounts to tens of billions of dollars lost in
the US alone. How to prevent the loss caused by unforeseen fraud attacks without
pushing away the revenue from legitimate transactions has always been a major task for
all online merchants around the world.
The main challenge for ecommerce transaction fraud prevention is that fraud patterns
are rather dynamic and diverse. Fraudsters often change their attack vectors either when
they sense that their malicious behaviors are being successfully prevented by the
merchants or when they find a new loophole in the fraud prevention system to attack. In addition,

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 556–570, 2020.
https://doi.org/10.1007/978-3-030-39442-4_41

fraudsters also attempt to behave like the genuine customers to stay below the radar of
fraud prevention systems. Since, like fraudsters, legitimate customers also change their
online transaction behaviors over time, it is arduous for online merchants to accurately
distinguish fraudsters from legitimate customers. Many existing papers that address
ecommerce fraud detection focus on investigating various types of classification machine
learning modeling, e.g. logistic regression, neural networks, random forests, Support
Vector Machines (SVM), etc. [1–4]. In this research, a machine learning model, more
specifically a gradient boosting decision tree (GBDT), was also applied for fraud
detection. It is important to emphasize that in this paper, instead of developing a new
machine learning model formulation, this research focuses on improving the
performance of the currently adopted machine learning model with the
following two proposed approaches.
The first one is that link analysis (fraud island) is applied to investigate the rela-
tionships between different fraudulent entities and to uncover the hidden complex fraud
patterns through the formed network. The outcome of the link analysis can then be used
as an important input feature for the machine learning model (the gradient boosting
decision tree in this research). Traditionally, online merchants employed the discrete analysis
method to distinguish fraudulent transactions from legitimate ones [5]. While the
discrete approach is effective enough for spotting patterns and capturing fraudsters
acting alone, it doesn’t necessarily detect patterns across all the different data endpoints
and therefore, is not very useful in detecting elaborate crime rings. Furthermore,
sophisticated fraudsters have learnt how to co-exist with and to act like the real cus-
tomers by making their transaction behaviors/patterns look as legitimate as possible.
Therefore, another great opportunity for improving classification accuracy lies in
looking beyond the individual data points and through the network of
associated transaction features to uncover these larger complex patterns and reveal the
tricky fraud transactions.
The second one is a sequential-assembling modeling technique, called the multi-layer
model, designed to advance the accuracy of the conclusive classification decision. In
many fraud management systems, the fraud labels are determined through different
internal and external channels, which are banks' declination decisions, manual review
agents' rejection decisions, banks' fraud alerts and customers' chargeback requests. It
can be reasonably assumed that different distinct fraud patterns could be caught
through different fraud risk prevention forces (i.e. bank, manual review team and
fraud machine learning model). Therefore, in this paper, various machine learning
models that are respectively trained using the fraud labels tagged by different fraud risk
prevention forces are assembled, and sequentially improve the final machine learning
model for conclusive classification.
The rest of the paper is organized as follows. In Sect. 2, some existing research in
link analysis and machine learning models that are either building blocks of or
important references for this research is briefly reviewed and discussed. In Sect. 3, the
main innovative methodologies proposed in this paper, namely the fraud islands that
connect and cluster entities sharing similar or identical fraud transaction features, and
the multi-layer model that applies the sequential-assembling modeling technique, are

discussed in detail. In Sect. 4, the effectiveness of the proposed methodologies is


demonstrated though a few experiments on real ecommerce transaction data. Section 5
concludes this paper.

2 Related Work and Literature Reviews

In this section, some existing research that either serves as a building block of the
proposed research or provides an important reference for theory development in this
study is discussed. As mentioned in the previous section, there are two main
methodologies, the fraud island and the multi-layer model, proposed in this paper.
Therefore, the literature review is conducted on these two topics, respectively.

2.1 Link Analysis


Fraudsters can execute their fraud attacks through different online transaction venues
such as websites, game consoles, phone apps, etc. They can also commit different kinds
of online payment fraud using various payment instruments such as credit cards, debit
cards, PayPal, etc. They can also sign up for transaction accounts using many dummy
email addresses from diverse geographical locations. All of these different features of
fraudulent transactions shape the many “faces” of fraud [6]. In Shah et al.'s paper [6], the
authors showed that different fraud types have different local network structures and
account attributes. They also demonstrated that by applying a link analysis method, the
machine learning model (SVM) could perform better at fraud detection. Although that
paper addresses online social networking fraud rather than ecommerce transaction fraud,
the idea of using network structure to uncover the many hidden, linked characteristics of
fraud can definitely be borrowed for ecommerce fraud detection and other applications,
e.g. insurance fraud [7].
One of the most common link analysis algorithms is Hyperlink-Induced Topic
Search (HITS) [8, 9]. HITS was originally developed in [8] for evaluating the importance
of web pages. It is reasonable to assume that the number of links to a certain page gives
a direct indication of its eminence. For pages with few incoming links, if at least two of
the links are from major search portals such as Google or Bing, these pages can still be
considered eminent. HITS categorizes pages into authorities and hubs, where good hub
pages link to many good authorities while good authority pages are linked to by many
good hub pages. In [9], the authors link users through the same questions they answered
and further used HITS to form different neighborhood groups based on different
categories of questions. They also show that HITS provided greater accuracy for most of
the experimental cases in that paper. This algorithm is adopted here to form so-called
fraud islands, which contain entities sharing the same transaction fraud features, email
domains, payment instruments, etc. In this research, transactions are divided into
fraudulent and genuine ones. How different transaction features lead to these two
distinct types of transactions, and how the two types are respectively linked with
different transaction features, are also discovered.
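As a small illustration of the kind of link analysis described here, the sketch below builds a toy entity graph with networkx, extracts its connected clusters of entities and computes HITS hub/authority scores; the entity names and edges are made up for the example:

import networkx as nx

# Toy entity graph: each edge links two entities that co-occurred in a transaction.
G = nx.Graph()
G.add_edges_from([
    ("card_1", "device_A"), ("card_1", "email_x@example.com"),
    ("card_2", "device_A"),                  # a shared device links two cards
    ("card_3", "device_B"),                  # an isolated cluster
])

# Connected components correspond to linked clusters ("islands") of entities.
islands = list(nx.connected_components(G))

# HITS hub/authority scores highlight entities that tie many others together.
hubs, authorities = nx.hits(G)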

2.2 Multi-layer Machine Learning Technique


The term Multi-layer is typically used in the Neural Network related techniques, where
there are an input layer, a number of hidden layers and an output layer [10, 11]. In each
layer, there are multiple nodes, which are interconnected across layers. Each node acts
as a function which takes inputs from the nodes in the previous layer and outputs a
value to the nodes in the next layer. The output from the output layer is subsequently
transformed into a probability value, which can be interpreted, for example, as the
probability that a given purchase transaction would result in a chargeback in a fraud
detection system. Though this approach is termed a multi-layer model, contrary to
neural network techniques, each layer contains distinct models that have been trained
independently of each other. From this perspective, the proposed approach is more
closely related to ensemble techniques.
Ensemble techniques, e.g., Random Forest [12] and XGBoost [13], have been shown
to perform well in various domains [14, 15]. In [16], the authors reviewed a number of
ensemble techniques, both homogeneous and heterogeneous, depending on whether the
base models are created by the same type of classifier or not, and proposed techniques
to build a credit scoring model based on a heterogeneous ensemble. While all the base
models in [16] were trained with the same Good/Bad labels, the proposed approach
utilizes a number of different “Good/Bad” signals, for each of which a separate
classification model was built. Subsequently, these outputs are incorporated as inputs
to an uber model. Note also that ensemble techniques consist of weak learners whose
outputs are aggregated to produce the final output. In this approach, each model has
been developed using a Gradient Boosting Decision Tree (GBDT) classifier, and the
outputs from these models have been fed into the final model, which is again a
GBDT classifier.

3 Fraud Islands and Multi-layer Models

In this section, the proposed methodologies are described in detail. As mentioned
earlier, the objective of this research is not to investigate different machine learning
classifiers but to explore the possibility of enhancing the accuracy of the currently
running machine learning model (i.e., gradient boosting tree) for ecommerce fraud
detection through the two proposed approaches, fraud islands and multi-layer models.

3.1 Fraud Islands


To detect fraudulent ecommerce transaction activities, mining the massive transaction-
associated data is unavoidable. Although the massive data might provide a great volume
of valuable information for fraud detection, some highly important but not obvious
information could be buried behind much larger volumes of unimportant information.
In order to more effectively distinguish the transaction patterns of legitimate customers
from those of fraudsters, this research investigates the relationships between different
entities and derives statistics to pinpoint fraud that might otherwise have been missed.
More specifically, link analysis consisting of two major components, link generation
and the network graph, was conducted following the steps below.

Step 1: The first step is to create the fraud graph from a set of known fraudulent
transactions. Entities associated with each transaction, such as account identifier,
device ID, email address, and payment instrument, are extracted as nodes in the
graph. The entities in a single transaction are connected through edges, and different
transactions can potentially be connected through common entities. For example, if
two fraudulent transactions use the same payment instrument on two different
devices, these two devices are connected via the payment instrument. After
connecting all the entities from the historical fraudulent transactions, a pool of
transactions represented by these entities is formed. The collection of linked entities
is referred to as the “Fraud Archipelago”. Figure 1 below shows how the Fraud
Archipelago and Fraud Islands are constructed.

Fig. 1. Composition of Fraud Archipelago/Islands.

Step 2: After the Fraud Archipelago and Fraud Islands are constructed using the
fraudulent transactions, non-fraudulent transactions associated with the entities in
the Fraud Archipelago/Islands are added. Through the connected components, a
number of statistics per entity and per Fraud Island are calculated, e.g., node count
in each Fraud Island, edge count per entity, clique size per Fraud Island, centrality
and connectedness.
Step 3: At the entity level, an entity’s existence in a Fraud Island is highly predictive
of future fraudulent transactions which use the same entity. At the level of each
island, statistics such as the total number of nodes by node type are calculated.
Intuitively, the more interconnections exist among entities, the greater the cause
for concern. To quantify the risk for these linked entities, entity-level statistics,
such as good and bad transaction counts, are calculated by aggregating the
historical transactions.
Step 4: Finally, the system checks the connection between each incoming
transaction and the existing fraud islands. The statistics retrieved at both the island
and node levels are provided as extra features to the existing modelling engine,
which outputs scores indicating whether the input transaction is fraudulent or not.
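A minimal sketch of Steps 1 and 2 using the networkx library is shown below; the entity names and the toy transactions are hypothetical placeholders rather than the schema used in this study.

import networkx as nx

# Hypothetical fraudulent transactions: each maps an entity type to an entity value.
fraud_txns = [
    {"account": "a1", "device": "d1", "email": "e1@x.com", "pi": "card_9"},
    {"account": "a2", "device": "d1", "email": "e2@y.com", "pi": "card_7"},
    {"account": "a3", "device": "d2", "email": "e3@z.com", "pi": "card_9"},
]

G = nx.Graph()
for txn in fraud_txns:
    # Add each entity as a node and connect all entities of the same transaction.
    nodes = [f"{etype}:{value}" for etype, value in txn.items()]
    G.add_nodes_from(nodes)
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            G.add_edge(nodes[i], nodes[j])

# Each connected component is one "Fraud Island"; together they form the Archipelago.
islands = list(nx.connected_components(G))
for k, island in enumerate(islands):
    print(f"island {k}: {len(island)} nodes, "
          f"{G.subgraph(island).number_of_edges()} edges")

Per-island statistics such as node counts by type, centrality or clique size could then be computed on each island subgraph in the same way.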

To evaluate the performance, this technique was applied to a dataset, and significant
improvement was observed. To determine fraud risk in real time, the whole process
has also been implemented in production, and the fraud island statistics are updated
on a daily basis.
Link analysis provides a unique ability to uncover a variety of important fraud
signatures, such that previously hidden fraud patterns become obvious. This project
combines the power of machine learning and link analysis by feeding graph features
into the machine learning pipeline.

3.2 Multi-layer Machine Learning Model


For the Multi-layer model, different sub-populations of transactions are utilized and
separate models are built for each sub-population. Figure 2 shows the breakdown of the
online purchase transaction population, which consists of Settled, Approved, Rejected
and Chargeback sub-populations. Given a transaction, the current risk system either
approves or rejects it; if approved, the transaction is sent to the bank for approval. If the
bank approves it, the customer is charged and the transaction is settled. In the case of
fraudulent transactions, chargebacks are received for the settled transactions. Figure 3
depicts the flow of a purchase transaction process, and Table 1 lists the definitions of
these sub-populations.

Fig. 2. Purchase transaction population break-down (sub-populations: Rejected; Approved & Bank Declined; Approved & Settled; Settled & Chargeback)

Table 1. Transaction sub-population definitions


Term Definition
Approved Approved by merchant risk system
Bank declined Approved by merchant risk system, but declined by bank
Chargeback Settled, but later Chargeback reported
Rejected Rejected by merchant risk system
Settled Bank approved and charged

Fig. 3. Purchase transaction process flow

In addition to chargebacks, the fraud management system also receives customer
fraud claims in the form of TC40 for VISA® and the SAFE (System to Avoid Fraud
Effectively) report for MasterCard®. When these claims are reported to the Associations,
i.e., VISA® and MasterCard®, an investigation is initiated. After investigation, some
claims result in chargebacks, while some do not. For example, for one of the adopted
portfolios in this study, only 30% of the chargeback transactions have corresponding
customer fraud claims. However, the claims that have not resulted in chargebacks are
not necessarily non-fraudulent transactions. Sometimes, the merchants do not report the
fraudulent transactions as chargebacks, especially when the transaction amount is not
reasonably greater than the fees associated with processing the chargebacks. This
customer fraud claim information is considered as a fraud alert, and in order to integrate
it into the fraud detection models, a Fraud Alert model that predicts whether a given
transaction would incur a fraud alert or not is built. The output from this model is then
provided to the final model as an input.
When building a fraudulent transaction detection model, the settled and chargeback
sub-populations are employed, with the chargeback sub-population providing the “bad”
labels. This is called a Chargeback Prediction model. From this perspective, the
bank-declined sub-population is not utilized since it does not have the chargeback
signals. However, a bank declination is a pretty clear signal that the corresponding
transaction is very suspicious and very likely to be fraudulent. In order to incorporate
this signal into the fraud detection model, a Bank Auth model that predicts whether a
given transaction would be approved or declined by banks is built. Subsequently, the
output from this model is fed into the final model.
The Chargeback sub-population is typically utilized as labels for fraud detection
models. Since it may take up to three months to receive this information, if the most
recent transactions are included, the chargeback information would not be available for
the models. However, if the most recent data are not incorporated into the model, the
model would fail to detect the latest fraud patterns. In order to address this, a Near
Term model which is based on the most recent few weeks’ worth of data is built. Since
this Near Term model would employ incomplete chargeback information, this model
may not reliably work as a standalone model. Combined with other models, and as an
input to the Long Term model, however, the Near Term model complements them and
improves the overall performance.
Table 2 summarizes the models that are built and the datasets that are used in this
research, and Fig. 4 illustrates the overall Multi-layer model architecture. That is, given
a transaction, three models are run, i.e., the Bank Auth Model, Fraud Alert Model and
Near Term Model, and their outputs are subsequently provided to the Final Model to
generate the final prediction.

Table 2. Model descriptions


Model name | Sub-populations | Label
Bank Auth Model | Approved transactions | Bank response
Fraud Alert Model | Settled transactions | Fraud alert
Near Term Model | Settled transactions (4 to 8 weeks’ worth of data) | Chargeback
Final Model (Long Term Model) | Settled transactions (6+ months’ worth of data) | Chargeback

Fig. 4. Multi-layer architecture
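To make the architecture in Fig. 4 concrete, the following is a minimal Python sketch of the stacking idea, assuming three base GBDT models whose scores are appended to the transaction features of a final GBDT model. The feature matrix, the label arrays and the hyperparameters are placeholders; whether the production system appends scores exactly this way is an assumption made for illustration.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Placeholder data: X holds transaction features, the y_* arrays stand in for
# the different "Good/Bad" signals listed in Table 2.
X = rng.normal(size=(1000, 20))
y_bank = rng.integers(0, 2, 1000)        # bank approve/decline response
y_alert = rng.integers(0, 2, 1000)       # fraud alert (TC40/SAFE) label
y_chargeback = rng.integers(0, 2, 1000)  # chargeback label

# Layer 1: one GBDT per signal (the Near Term model is simplified here and
# reuses the chargeback label instead of a recent time window).
base_models = {}
for name, y in [("bank_auth", y_bank), ("fraud_alert", y_alert),
                ("near_term", y_chargeback)]:
    base_models[name] = GradientBoostingClassifier().fit(X, y)

# Layer 2: append the base-model scores to the original features and train the
# final (Long Term) chargeback model on the enriched feature vector.
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in base_models.values()])
X_final = np.hstack([X, scores])
final_model = GradientBoostingClassifier().fit(X_final, y_chargeback)

print("final model score on training data:", final_model.score(X_final, y_chargeback))

In practice the base-model scores fed into the final model would be produced out-of-fold, or on a disjoint time window as the Near Term/Long Term split suggests, to avoid leaking the training labels into the final model.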

4 Case Study and Performance Discussion

In this section, how the proposed approach can be applied for fraud detection is
illustrated and the effectiveness of this proposed research is also demonstrated through
some case studies and experiments on real ecommerce data obtained from some
ecommerce business partners.

4.1 Fraud Islands


To demonstrate the effectiveness of fraud islands in improving machine learning model
performance, a preliminary case study was conducted. In this case study, seven months of
real ecommerce transaction data that contained the encrypted customer information, i.e.,
email address, payment instrument ID, device ID and account ID, tagged with fraud labels
(good/bad), were collected and used. The link graphs were generated using 30 days’
worth of data. With the full six-month training period, a total of 180 link graphs were
formed offline. Components and transactional data were connected to calculate 12
aggregate features for all graphs. The features were extracted on two levels, node and
cluster, respectively (see Table 3 below). On the node level, features were aggregated as
good/bad transaction counts. On the cluster level, features were aggregated as counts of
nodes for the four types of customer information. The aggregate features were then
appended to the existing feature vector, which had 194 features. Every feature vector was
recreated offline with Apache Spark™ [17]. Among all 12 aggregated features,
Device_Bad_Transaction_Count, PI_Bad_Transaction_Count and
Account_Bad_Transaction_Count were the top three most significant features found for
fraud detection.

Table 3. Features on node and cluster levels.


Node level Cluster level
PI_Bad_Transaction_Count Number_of_Account_Nodes
PI_Good_Transaction_Count Number_of_PIHashId_Nodes
Device_Bad_Transaction_Count Number_of_DeviceId_Nodes
Device_Good_Transaction_Count Number_of_EmailHash_Nodes
Acc_Bad_Transaction_Count
Acc_Good_Transaction_Count
Email_Bad_Transaction_Count
Email_Good_Transaction_Count
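A minimal sketch of how node-level aggregates such as those in Table 3 could be computed with pandas is shown below; the column names of the transaction table and the toy rows are assumptions made purely for illustration.

import pandas as pd

# Hypothetical labeled transactions linked to graph entities.
txns = pd.DataFrame({
    "pi_id":      ["card_9", "card_9", "card_7", "card_7"],
    "device_id":  ["d1", "d2", "d1", "d3"],
    "account_id": ["a1", "a3", "a2", "a2"],
    "is_fraud":   [1, 0, 1, 1],
})

def entity_counts(df, entity_col, prefix):
    """Good/bad transaction counts per entity, mirroring the node-level features."""
    grouped = df.groupby(entity_col)["is_fraud"].agg(
        **{f"{prefix}_Bad_Transaction_Count": "sum",
           f"{prefix}_Good_Transaction_Count": lambda s: (s == 0).sum()})
    return grouped.reset_index()

pi_features = entity_counts(txns, "pi_id", "PI")
device_features = entity_counts(txns, "device_id", "Device")
print(pi_features)
print(device_features)

These per-entity aggregates would then be joined back onto each transaction and appended to the existing 194-feature vector before model training.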

From the ROC (Receiver Operating Characteristic) curve below (Fig. 5), it can be
seen that the machine learning model (gradient boosting tree) with aggregate features
from link analysis performed better than the model without them. There was a 3%
improvement in AUC (area under the curve).
Due to the limited availability of computing power, this study was conducted on a
relatively small scale and scope in terms of the number of collaborating business
partners, the length of the time period for data collection and training, and the types of
customer information used for link graph vertices. It is reasonable to expect that with
more data from more business partners, longer data collection and training periods, and
additional customer information nodes, the proposed approach will be even more
effective and helpful for ecommerce fraud detection.


Fig. 5. Performance comparison between machine learning model with and without graph
features

4.2 Multi-layer Architecture


From the three portfolios adopted in this research, two datasets were prepared for each,
one for in-time evaluation and the other for out-of-time evaluation. For the Near Term
Models, eight weeks’ worth of data were used, and for the final models, six months’
worth of data were used. For the out-of-time evaluation, one month’s data were used.
The performance of the Multi-layer (ML) model was compared against the Long Term
model that does not utilize the outputs from the other models.
Figures 6 and 7 present the performance comparison for Portfolio 1 in terms of
ROC up to 2% non-fraud rates. Note that the typical risk management operating range
would be between 1% and 2% non-fraud rates, depending on the risk management
resources and capacities. Figure 6 shows that at the 1% non-fraud rate, the Multi-layer
(ML) model outperforms the Long Term Model by 2%. The performance improvement
is shown across the range up to 2% non-fraud rates.
For the out-of-time evaluation, in Fig. 7, there is a cross-over at approximately 0.15%
non-fraud rate. However, between the 1% and 2% non-fraud rate range, which is the
typical operating range, the multi-layer model shows more than 2% improvement over
the Long Term model.
Similarly, for both Portfolios 2 and 3, the multi-layer model consistently outper-
forms the Long Term Model across the ranges up to 2% non-fraud rates. Figures 8 and 9
present the performance comparison for Portfolio 2 on in-time data and out-of-time data,
respectively.

(ROC curve: PCT_FRAUD (%) versus PCT_NONFRAUD (%); series: Portfolio1 and Portfolio1_ML)

Fig. 6. Performance comparison of Portfolio 1 using in-time dataset between Long Term Model
and Multi-layer (ML) model

(ROC curve: PCT_FRAUD (%) versus PCT_NONFRAUD (%); series: Portfolio1 and Portfolio1_ML)

Fig. 7. Performance comparison of Portfolio 1 using out-of-time dataset between Long Term
Model and Multi-layer (ML) model

(ROC curve: PCT_FRAUD (%) versus PCT_NONFRAUD (%); series: Portfolio2 and Portfolio2_ML)

Fig. 8. Performance comparison of Portfolio 2 using in-time dataset between Long Term Model
and Multi-layer (ML) model

(ROC curve: PCT_FRAUD (%) versus PCT_NONFRAUD (%); series: Portfolio2 and Portfolio2_ML)

Fig. 9. Performance comparison of Portfolio 2 using out-of-time data between Long Term
Model and Multi-layer (ML) model

Figures 10 and 11 present the performance comparison for Portfolio 3 on in-time
data and out-of-time data, respectively, which show consistent improvement over the
Long Term Model, by as much as 4% at the 1% non-fraud rate.

(ROC curve: PCT_FRAUD (%) versus PCT_NONFRAUD (%); series: Portfolio3 and Portfolio3_ML)

Fig. 10. Performance comparison for Portfolio 3 using in-time data between Long Term Model
and Multi-layer (ML) model

(ROC curve: PCT_FRAUD (%) versus PCT_NONFRAUD (%); series: Portfolio3 and Portfolio3_ML)

Fig. 11. Performance comparison of Portfolio 3 using out-of-time data between Long Term
Model and Multi-layer (ML) model

5 Conclusions

This paper introduces two approaches, fraud islands and the multi-layer model, to boost
the fraud detection capability of the currently running machine learning model.
Through fraud islands, link graph aggregated features were created that can more
effectively provide valuable information about hidden fraud patterns. In addition,
through the conducted case study, it was found that those link graph aggregate features
can also help improve the fraud detection accuracy of the currently adopted machine
learning models. As discussed in Sect. 4.1, due to the limited availability of cloud
computing power, this proposed approach could only be implemented on a relatively
small scale. For future study, the authors will try to get access to the Azure machine
learning service [18], which can provide much larger computing power and will allow
more data (both in volume and variety) to be incorporated for improving the proposed
approach.
Using the multi-layer modelling technique, three sub-models were created for
sub-populations (transactions) that received fraud labels from different risk prevention
systems, i.e., risk decisions made by the merchant’s fraud risk system, banks’
authorization decisions, and fraud alerts from the associations. It is believed that by
using the fraud labels determined by different internal and external risk systems, more
fraud across various kinds of fraud patterns can be caught. The case studies shown in
Sect. 4 support this intuition. For future work, more representative sub-populations for
the sub-models in the multi-layer modelling will be investigated to achieve more
accurate and conclusive fraud detection.

References
1. Kou, Y., Lu, C.T., Sirwongwattana, S., Huang, Y.P.: Survey of fraud detection techniques.
In: IEEE International Conference on Networking, Sensing and Control, vol. 2, pp. 749–754.
IEEE (2004)
2. Wang, S., Liu, C., Gao, X., Qu, H., Xu, W.: Session-based fraud detection in online e-
commerce transactions using recurrent neural networks. In: Joint European Conference on
Machine Learning and Knowledge Discovery in Databases, pp. 241–252. Springer, Cham,
September 2017
3. Şahin, Y.G., Duman, E.: Detecting credit card fraud by decision trees and support vector
machines (2011)
4. Minastireanu, E.A., Mesnita, G.: An analysis of the most used machine learning algorithms
for online fraud detection. Inform. Econ. 23(1), 5–16 (2019)
5. Omen Homepage. https://omen.sg/detect-fraud-in-real-time-with-graph-databases/. Accessed
13 June 2019
6. Shah, N., Lamba, H., Beutel, A., Faloutsos, C.: The many faces of link fraud. In: 2017 IEEE
International Conference on Data Mining (ICDM), pp. 1069–1074. IEEE, November 2017
7. Sadowski, G., Rathle, P.: Fraud detection: discovering connections with graph databases.
White Paper-Neo Technology-Graphs are Everywhere (2014)
8. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM (JACM)
46(5), 604–632 (1999)

9. Jurczyk, P., Agichtein, E.: Discovering authorities in question answer communities by using
link analysis. In: Proceedings of the Sixteenth ACM Conference on Information and
Knowledge Management, pp. 919–922. ACM, November 2007
10. Vanneschi, L., Castelli, M.: Multilayer perceptrons. Encycl. Bioinform. Comput. Biol. 1,
612–620 (2019)
11. Svozila, D., Kvasnickab, V., Pospichalb, J.: Introduction to multi-layer feed-forward neural
networks. Chemometr. Intell. Lab. Syst. 39(1), 43–62 (1997)
12. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
13. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: 22nd SIGKDD
Conference on Knowledge Discovery and Data Mining (2016)
14. Volkovs, M., Yu, G., Poutanen, T.: Content-based neighbor models for cold start in
recommender systems. In: Proceeding of RecSys Challenge 2017, Proceedings of the
Recommender Systems Challenge 2017, Article no. 7 (2017)
15. “Machine Learning Challenge Winning Solutions” in “Awesome XGBoost”. https://github.
com/dmlc/xgboost/tree/master/demo. Accessed 14 June 2019
16. Xia, Y., Liu, C., Da, B., Xie, F.: A novel heterogeneous ensemble credit scoring model based
on bstacking approach. Expert. Syst. Appl. 93, 182–199 (2017)
17. Apache Spark Homepage. https://spark.apache.org/. Accessed 14 June 2019
18. Azure Machine Learning Service Homepage. https://azure.microsoft.com/en-in/services/
machine-learning-service/. Accessed 14 June 2019
Automated Drug Suggestion Using Machine
Learning

Vikrant Doma, Sahil Singh, Nikita Arora, Gabriel Ortiz,


Parneet Kaur Saran, Salvador Chavarin, and Matin Pirouz(&)

California State University, Fresno, CA 93740, USA


mpirouz@csufresno.edu

Abstract. The growing healthcare industry generates a large amount of data on
patient health conditions, demographic plans, and the drugs required for such
conditions. These data attract the attention of medical professionals and data
scientists alike. In this paper, we propose a drug recommendation assistant built
using machine learning techniques and natural language processing, which
draws its accuracy from several major datasets. The proposed system makes it
possible to present the contrasting effects, reviews and ratings, and then
recommend the most “effective” drug for a given individual. The predictive
analysis showed that from 2005–2015, between the ages of 55 and 80, the
death rates of the top deadliest diseases in the U.S. all increased drastically.
Based on the current trends, it is possible to predict, with some level of accuracy,
the next top five medical conditions (Birth Control, Depression, Pain, Anxiety,
Acne) which will be prevalent in the near future and the top five drugs used
to treat them.

Keywords: Drug recommendation · Confusion matrix · Data visualization · Data manipulation

1 Introduction

Analytics is a method of creating insights through the effective use of data and applying
qualitative and quantitative analysis to it. Technological advancement and the growth of
medical data have created quite a good opportunity, but they have also posed challenges
for protecting the privacy and security of patient and research data; working with federal
and expert agencies promotes security procedures which enable better scientific and
medical advances. Understanding the effects of drugs helps with the selection of the
drug that is the most effective and popular amongst patients. Although there are many
reviews available for each drug, people still struggle to pick the one that is best for them.
This work uses publicly available datasets to implement exploratory, descriptive, and
predictive analysis, in order to advance the understanding of the effects of drugs and
recommend the best possible drug for an individual.
Recently, there has been a significant rise in the number of data mining and analytics
techniques being used in the field of medicine and healthcare, which creates the need for
multiple sources of data with similar attributes. For example, the dataset taken from the
UCI Machine Learning Repository [1] contains over 200,000 patient drug reviews.


There are two different datasets. The training dataset dimensions are 161,000 by 7,
while the testing dataset has 53,800 by 7 in terms of rows by columns. Table 1 gives an
overview of the studied datasets. Combining multiple datasets resulted in a dataset
containing 670,000 rows. The second dataset was taken into consideration because the
UCI dataset [1] does not include life-threatening diseases. The final dataset [3] gave a
much-needed diagnosis of the patients’ conditions and symptoms.

Table 1. Dataset descriptions


Datasets | Functions | Dimensions | Source
Recommendation | Patients’ drug reviews | 214,000 × 7 | UCI Machine Learning
Potentially excess deaths | Five leading causes of death | 206,000 × 13 | NCHS
Disease sorting | Available diagnoses for each symptom | 250,000 × 15 | Kaggle

Creating a drug recommendation assistant with the help of these datasets, and with
optimal accuracy, made it necessary to test various machine learning and natural
language processing techniques. The rest of this paper discusses the background
information, the framework setup, the result analysis, and the exploratory, predictive,
and prescriptive analyses.

2 Literature Review

Being able to understand and take care of symptoms should be more closely observed.
Death due to medical conditions is on the rise, and of the ten biggest killers in the
United States, eight are related to some type of medical condition. In addition, many of
these diseases progressively get worse if there is no medication or treatment. Getting
the proper help and medication, along with regular screening, can help patients. To aid
in some way in ending this tragedy, we utilize three different datasets and exploratory,
predictive, and prescriptive analyses to fight this ongoing epidemic with a better chance
of success. Whilst searching for datasets, the UCI Machine Learning repository [1] and
[3] showed tremendous potential, along with the Centers for Disease Control and
Prevention (CDC) and Kaggle. The UCI dataset [1] also comes with a published paper
that works with the data and covers in-domain and cross-domain sentiment analysis [1].
Accessing data that is both relevant and up to date is a top priority, since certain
medicines will have a greater impact than they did before and a better prescriptive model
can be created. Also, people getting the proper help should be a priority, as there is a
documented difference in outcomes between those who get medical attention and those
who do not, as discussed by Larmuseau [3]. Viveka and Kalaavathi [4] discuss the
importance of data mining as an essential part of medical analysis. The reason behind
better data exploration is to complement the existing tools with more effective solutions for health
care professionals and to control the quality of the medicine being discovered, the
treatment, and adverse side effects of the drugs.
Also, being able to understand the death trends caused by some form of disease is
an aspect studied in this research. The reason for this is to be able to figure out which
drug side effects can escalate into life-threatening conditions. Several studies [5, 6]
suggest that CVD mortality has been regularly decreasing over time. CVD mortality has
decreased so significantly that cancer has surpassed CVD as the leading cause of death
in high-income counties in the USA.
We also investigated the current technologies available and how they help patients.
For example, from the patient records of the future, one could quickly access a list of
current problems with a kind of clinical logic, the patient’s health status, and
information about the various treatments the patient can undergo for their condition.
Easy access to and sound organization of data elements can be provided by the
automation of patient records, as suggested in [7], but the availability of the data
elements depends on whether or not practitioners are collecting and recording such data
in the first place; another aspect to consider is security, as discussed in [8–10]. Thus, we
can see many barriers in existing systems.
We investigated different techniques currently under research, as in [7] and [11]. It
was explained in [12] that a study of PCA had been completed which finds the minimal
number of attributes required to enhance the precision of numerous supervised machine
learning algorithms, and it proposed new techniques to train a supervised system which
gains knowledge in order to predict heart disease. Numerous data mining strategies,
such as categorization and preprocessing, are involved.
Romanosky [10] suggested the use of hard and soft clustering methods for detecting
patterns of medication use by patients and then checking whether the probability of a
certain patient profile match could be correctly estimated. However, our data is already
labeled, and thus unsupervised methods would be redundant. We expand the same logic
used for the sentiment prediction and classification techniques in [13] to drug prediction.

3 Datasets and Setup

The first dataset was taken from Kaggle and published by the UCI Machine Learning
Repository; it [1] contains medicine reviews with over 200,000 entries. There are two
different datasets. The Train dataset dimensions are 161,000 × 7, while the Test
dataset has 53,800 × 7 in terms of rows by columns. The rows represent every unique
case while the columns represent a unique ID, drug name, condition name, patient
review, a rating out of 10 stars, the date, and the number of users that found the drug
useful. Another dataset we used was originally from NCHS, but we came across it on
Kaggle [2]. This dataset has dimensions of 206,000 × 13 and includes columns such as
state, year of death, and age. The Kaggle dataset [3], Symptom Disease Sorting,
contains seven different datasets within it, describing the available diagnoses for each
symptom. Preprocessing steps were necessary to format the data and to map the ID of
the symptom to the ID of the condition.

Python was selected because of the abundance of libraries and APIs capable of
efficient and concise code generation. Some of the environments required in the
development are described below:
Jupyter Notebook: Jupyter Notebook documents are produced by the Jupyter Notebook
App and contain both computer code (e.g. Python) and rich text elements
(paragraphs, equations, figures, links, etc.). Notebook documents are both human-
readable documents containing the analysis description and the results (figures, tables,
etc.) as well as executable documents which can be run to perform data analysis.
The Jupyter Notebook can be installed with: pip3 install jupyter.
Anaconda: Anaconda is a package manager, a Python distribution, and a collection
of over 1,000 open source packages. It is free and easy to install, and it
offers free community support. Over 150 packages are automatically installed with
Anaconda. The following command installs the current version of the Anaconda
installer:

bash Anaconda-latest-Linux-x86_64.sh

It is also possible to make a customized package using the “conda build” command.
Some of the required packages and tools were made available through Conda:
• Scikit-learn for implementing machine learning, as it provides efficient tools for data
mining and data analysis, is accessible to everybody, and is reusable in various
contexts; it is built on NumPy, SciPy, and matplotlib, and is open source and
commercially usable. It provides most of the built-in machine learning algorithms
such as SVM, multinomial NB, and KNN. At its lowest level, one can say it is a
third-party extension to SciPy [15, 16].
• Pandas is one of the easiest data structure and data analysis tools for the Python
programming language. Reading and analysis of CSV datasets can be done with
pandas, and it has various features that allow us to format datasets and perform
cleaning.
• NumPy provides high-level mathematical functions to perform operations on arrays.
NumPy can also be used as an efficient multi-dimensional container of generic data,
and arbitrary data types can be defined. This allows NumPy to seamlessly and rapidly
integrate with a wide variety of databases.
• NLTK is one of the leading platforms for working with human language data in
Python. One module in this library, called the snowball stemmer, was essential while
constructing the NLP block of the code. Snowball is a small string processing
language designed for creating stemming algorithms for use in Information
Retrieval.
• Matplotlib is a data visualization library in Python for 2D plots of arrays. It is a
multi-platform data visualization library built on NumPy arrays and designed to work
with the broader SciPy stack. One of its key aspects is visualization, which allows
one to see large amounts of information in easily digestible visuals. Matplotlib offers
numerous plot types such as line, bar, scatter, histogram, pie and so forth, as
suggested in [17–19].

• Requests is an Apache2 Licensed HTTP library, written in Python. It is designed to
be used by humans to interact with the language, so one does not have to manually
add query strings to URLs or form-encode POST data. Requests allows one to
send HTTP/1.1 requests using Python. With it, one can add content like headers,
form data, multipart files, and parameters via simple Python libraries. It also allows
one to access the response data in the same way. This library is required for getting
the datasets and can also be used in a future study to incorporate streaming data [20].
• The Collections module implements high-performance data types (well beyond the
built-in types like list, dictionary, and tuple) and offers useful data structures for
information storage, such as namedtuple, deque, Counter, OrderedDict, and
defaultdict.

4 Methods

4.1 Cleaning/Preprocessing
Before any kind of machine learning algorithm can be used on textual data, we needed
to find out which features could be useful in terms of being able to accurately predict a
well-labeled class. Cleaning involves removing the ‘NaN’ values, 0 instances, and
other garbage data. Combining this phase with preprocessing made it necessary to also
remove stop words and other strange entries at the same time from the ‘review’ column
in the dataset [1]. For example, one type of entry found in a minority of rows under the
review column was “<?span> user found this comment helpful </span>” without the
actual review written; as this did not give any information about what the actual review
was, cleaning techniques helped remove rows containing these bogus entries. Removing
unnecessary whitespace and other symbols was also done effectively with Pandas. The
following are the techniques used to handle the cleaning and preprocessing, as
suggested in [12, 14].

4.2 Tokenize
Removes the punctuation, converts the text to lowercase, and tokenizes the parsed review text.

4.3 CountVectorizer
The previously defined tokenize function is passed as a parameter to the CountVectorizer
function. The result of this function is the Bag of Words generated from the text.

4.4 Snowball Stemmer


Snowball Stemmer is useful in that it removes morphological affixes from words,
leaving only the word stem, which essentially means that it converts words into their
root form. Snowball is a small string processing language designed for creating
stemming algorithms for use in Information Retrieval. By creating a user-defined
snowball function, with one of the parameters being the English stop words, it was
possible to feed in all stop words of the English language in order to eliminate
misclassification errors due to the use of pronouns, prepositions and articles, and to
strengthen the importance of adverbs, adjectives, nouns, etc.

4.5 Tfidf Transformer


Generates the TF-IDF values for the Bag of Words created by the CountVectorizer
function above. These values were then used as features for the ML model. Internally,
the TF-IDF is calculated considering each ‘review’ sentence as a document. The term
frequency is given by (1) and the inverse document frequency by (2); Eq. (3) illustrates
the combination of the two.
$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$   (1)

$idf(w) = \log\left(\frac{N}{df_t}\right)$   (2)

$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_t}\right)$   (3)

Here $n_{i,j}$ is the number of occurrences of term $i$ in document $j$, $df_t$ is the
number of documents containing term $i$, and $N$ is the total number of documents.
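As a minimal sketch of the preprocessing chain described in Sects. 4.2–4.5, the snippet below combines a Snowball-stemming tokenizer with CountVectorizer and TfidfTransformer; the sample reviews and the tokenizer details are illustrative assumptions rather than the exact configuration used in this study.

import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

stemmer = SnowballStemmer("english", ignore_stopwords=True)

def tokenize(text):
    # Lowercase, strip punctuation, split on whitespace, then stem each token.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [stemmer.stem(tok) for tok in text.split()]

reviews = [
    "This pill cleared my skin and the acne is gone",            # hypothetical review
    "Constant headaches and mood swings on this birth control",  # hypothetical review
]

# Bag of Words followed by TF-IDF weighting, as in Eqs. (1)-(3).
count_vec = CountVectorizer(tokenizer=tokenize, stop_words="english")
counts = count_vec.fit_transform(reviews)
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.shape)  # (number of reviews, vocabulary size)

The resulting sparse TF-IDF matrix is what the classifiers in Sects. 4.6–4.9 are fitted on.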

4.6 SGD and Decision Trees


SGD is a linear classifier trained with Stochastic Gradient Descent; it implements a
Support Vector Machine algorithm by default. The model is fitted using the features
from the Tfidf Transformer function. Some of the prominent hyperparameters for this
function are the loss, penalty, learning rate and maximum number of iterations.
Decision Trees are a supervised and non-parametric technique used for regression
or classification. The intention is to create a model that predicts the target variable by
learning simple decision rules inferred from the features. Stochastic Gradient Descent
(SGD) computes the gradient of the parameters $\theta$ using only a single or a few
training examples. The standard gradient descent update of the parameters $\theta$ for
the objective $J(\theta)$ is given by (4)

$\theta = \theta - \alpha \nabla_{\theta} E[J(\theta)]$   (4)

4.7 MultinomialNB
MultinomialNB is a sklearn function that implements the Multinomial Naïve Bayes
algorithm. The multinomial Naïve Bayes classifier is appropriate for discrete features
(e.g., word counts for text classification). The multinomial distribution usually calls for
integer feature counts; in practice, however, fractional counts such as TF-IDF work
as well. Some of the hyperparameters for this function are the prior probabilities and
the Laplace/Lidstone smoothing parameter. In general, the naïve Bayes equation is
given as (5)

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$   (5)

where $P(A \mid B)$ is the posterior probability, $P(B \mid A)$ is the likelihood,
$P(A)$ is the class prior probability, and $P(B)$ is the predictor prior probability.
In Multinomial Naïve Bayes, the distribution is parametrized by vectors
$\theta_y = (\theta_{y1}, \ldots, \theta_{yn})$ for each class $y$, where $n$ is the number
of features (basically the size of the vocabulary) and $\theta_{yi}$ is the probability
$P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$. The
parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood,
i.e., relative frequency counting, as in (6)

$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$   (6)

where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ occurs in samples
of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count
of all features for class $y$. The smoothing parameter $\alpha \ge 0$ accounts for
features not present in the learning samples; $\alpha = 0$ means no smoothing is applied.
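To make Eq. (6) concrete, a small worked sketch of the smoothed estimate follows; the toy count matrix and the value of α are illustrative only.

import numpy as np

# Toy word-count matrix for one class y: rows are documents, columns are features.
X_class = np.array([[2, 0, 1],
                    [1, 0, 0],
                    [3, 1, 0]])
alpha = 1.0                      # Laplace smoothing
n_features = X_class.shape[1]

N_yi = X_class.sum(axis=0)       # per-feature counts N_yi for class y
N_y = N_yi.sum()                 # total feature count N_y for class y

theta_y = (N_yi + alpha) / (N_y + alpha * n_features)  # Eq. (6)
print(theta_y, theta_y.sum())    # smoothed probabilities; they sum to 1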

4.8 Support Vector Classification (SVC)


SVC is a linear classifier model which makes use of the concept of a hyperplane to
represent the data samples of reviews as points in space, mapped so that the examples
of the separate categories are divided by a separation which is as wide as possible.
While training the model, the vector w and the bias b must be estimated by solving a
quadratic problem; hence, SVM can be implemented in polynomial time, which in turn
places it in the complexity class P. Considering only two dimensions, the hyperplane
constraints for each vector are (7), (8)

$w^T x_i + b \ge 1$ for $x_i$ having class A   (7)

$w^T x_i + b \le -1$ for $x_i$ having class B   (8)

where the separating hyperplane is given as $w^T x + b = 0$, with $b$ a constant
offset in the two-dimensional case. The margin of separation is expressed as (9)

$y_i \left(w^T x + b\right)$   (9)

Essentially, learning the SVM simply becomes an optimization problem where
$\frac{2}{\|w\|}$ represents the margin of separation, as stated by (10), (11)

$\max_{w} \frac{2}{\|w\|}$ subject to $w \cdot x_i + b \ge 1$ if $y_i = +1$   (10)

$\max_{w} \frac{2}{\|w\|}$ subject to $w \cdot x_i + b \le -1$ if $y_i = -1$, for $i = 1$ to $N$   (11)

Using these two equations we have the following simplified constraint (12)

$y_i \left(w \cdot x_i + b\right) \ge 1$ for $i = 1, \ldots, N$   (12)

In terms of application, the idea was to feed in the TF-IDF features derived from the
‘review’ column, with the output y being a string from the ‘condition’ column.

4.9 K-Nearest Neighbor (KNN)


When we perform classification using K-nearest neighbor, the algorithm essentially
gives us a way to take a majority vote which decides whether the observation belongs
to the class of its K most similar instances. For this, in any dimensional space, the
Euclidean distance can be used, as in (13)
$Dist = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$   (13)

It should also be noted that the Manhattan, Chebyshev, and Hamming distances are
other metrics which can be used. The classifier algorithm follows two steps. Given a
positive integer K, an unseen observation x and a similarity metric d, the algorithm runs
through the whole dataset, computing d between x and each training observation, and
considers the K points in the training data that are closest to x as the set A. Note that K
is usually odd to prevent situations where a tie occurs.
The classifier then takes an estimate of the conditional probability for each class,
which is the fraction of points in A having that class label. Here I is the indicator
function that evaluates to 1 only when its argument is true and 0 otherwise, as given by (14)

$P(y = j \mid X = x) = \frac{1}{K} \sum_{i \in A} I\left(y^{(i)} = j\right)$   (14)

Finally, the input x is assigned to the class having the largest probability.
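The following is a minimal from-scratch sketch of the two KNN steps described above (distance computation and majority vote); the toy points, labels and value of K are illustrative only.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Step 1: Euclidean distance from x to every training observation (Eq. 13).
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]           # indices of the K closest points (set A)
    # Step 2: majority vote, i.e. the class with the largest vote fraction (Eq. 14).
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array(["Acne", "Acne", "Pain", "Pain"])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> "Acne"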

4.10 Training Validation Module


The features extracted from the above modules were then passed on to each of the
machine learning models for training. As suggested earlier, in order to check the
accuracy range of various machine learning models, a comparative analysis of the
SGD, Naive Bayes, Support Vector Machine (SVM), decision tree and KNN algorithms
was carried out. Furthermore, applying K-fold (k = 5) cross-validation gave a
much better validation accuracy. The pseudo code for this is seen in Algorithm 1:

ALGORITHM 1: K-fold Cross Validation Loop

1: Let x be a vector of length N with the x-values of the data points
2: Let y be a vector of length N with the y-values of the data points
3: err = 0
4: for i = 1 to N
5:   // define the cross-validation subsets (x_input, y_input) by holding out point i
6:   y_output = interpolate(x_input, y_input, x[i])
7:   err = err + (y[i] − y_output)^2
8: end
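A minimal sketch of the comparative K-fold evaluation described above, using scikit-learn's cross_val_score over the classifiers discussed in Sects. 4.6–4.9, is shown below; the tiny toy corpus and labels are placeholders for the real review/condition data, not the study's dataset.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Placeholder reviews and condition labels.
reviews = ["my skin cleared up", "acne got worse", "pain relief was great",
           "still in pain daily", "no more acne", "chronic pain continues"] * 5
labels = ["Acne", "Acne", "Pain", "Pain", "Acne", "Pain"] * 5

X = TfidfVectorizer(stop_words="english").fit_transform(reviews)

models = {
    "SGD (hinge loss)": SGDClassifier(),
    "Multinomial NB": MultinomialNB(),
    "Linear SVC": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

for name, model in models.items():
    scores = cross_val_score(model, X, labels, cv=5)   # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.2f}")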

5 Results and Analysis of Findings


5.1 Analysis of Machine Learning Models
Among the machine learning models, our results demonstrated that SVM and decision
trees worked best, with an accuracy of about 90%, though there was a slight disparity
in execution time, whereas the other classifiers either took too long to execute or gave
poor accuracy results. Figure 1 illustrates the five-fold cross-validation approach based
on randomly selecting 30% of the testing segments. The model evaluation results are
shown in Fig. 2, which clearly shows that SVM executes faster while giving good
enough accuracy. Hence, taking SVM as the final classifier for our ML model, it was
necessary to further improve it with cross-validation. The conclusion drawn from
observing how the words were being classified into different classes of ‘conditions’
was that each model showed bias towards certain words, i.e., greater importance was
given to words closely related to the ‘condition’; for example, if the word ‘nasal’
appears, the classifier will take this word into account more than the surrounding
words or features from the ‘reviews’.

Fig. 1. Five-fold cross-validation performed by randomly selected 30% testing segments



Through cross-validation, it was possible to classify unknown reviews with around
90% accuracy most of the time; however, if no word related to any of the classes was
found, the classifier randomly selects one. This explains why blood pressure had more
instances of misclassification and acne had the highest relative number of correct
classifications. We also found that certain reviews were causing disparities in
classification because side effects may differ from person to person, which is the reason
why the more review data we have, the better the classification accuracy will be.

$Q_{svm} = \lambda x^2 + \max\{0,\ 1 - y\, x\, \Phi(x)\}$   (15)

Features $\Phi(x) \in \mathbb{R}^d$; Classes $y = \pm 1$

Hyperparameter $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$   (16)

5.2 Exploratory Analysis


Data visualization uses statistical graphs to plot patterns from the dataset. First, using
visualization, we examined the top fifty most commonly reported symptoms, as seen in
Fig. 3. These symptoms were then weighted based on the severity of the diagnoses:
common or life-threatening.
Based on the observational analysis, it can be said that pain is among the most
reported symptoms. It also happens to have the greatest number of medicines available
on the market to treat it. So how does one know which medication best treats pain?

Fig. 2. Different machine learning models based on accuracy and time

The relationship between the rating of a medication and the usefulness of the review
was mapped, and doing so suggested that if someone rates a drug as useful and leaves a
positive review, then others are likely to find that comment useful. Symptoms can
indicate an underlying disease or can become one with time. From the dataset [2],
we analyzed the death rate for the top 20 most populous states in the United States.
Since the dataset presents the five leading causes of death (Cancer, Chronic Lower
Respiratory Disease, Heart Disease, Stroke, and Unintentional Injury), a heatmap was
generated to better depict which states lose the greatest number of lives to certain
conditions. Even though Heart Disease is the biggest killer in the United States [5], our
graph shows that Cancer causes the greatest number of deaths. This can be attributed to
the fact that all forms of Cancer are grouped together; the dataset does not distinguish
between, say, Throat Cancer and Lung Cancer. Cardiovascular disease affects more
people living in rural areas, while Cancer affects those in more urban areas [6].

5.3 Predictive Analysis


Predictive analytics is the practice of extracting information from existing datasets in
order to determine patterns and predict future outcomes and trends. Predictive analytics
forecasts what might happen in the future with an acceptable level of reliability.
Analyzing the years 2008 to 2017 to find the top five most prevalent medical
conditions in the USA, the following was found: birth control is the most common
medical condition, followed by depression, pain, anxiety, and acne, in that order.
Corresponding to these medical conditions, an analysis of the five most popular drugs
currently trending for each condition was performed, revealing a dip in the use of
current drugs, probably in favor of newer, more effective ones.
According to the ratings and reviews for the top five medical conditions, the top five
medicines/drugs being used for each condition were analyzed and compared, as seen in
Figs. 4, 5, 6, 7 and 8. For the most popular medical condition, birth control, in the year
2009 the drug Etonogestrel was found to be the most widely recommended by doctors,
followed by Levonorgestrel. Later, in the year 2011, other drugs like Ethinyl
estradiol/norethindrone and Ethinyl estradiol/levonorgestrel came onto the market as
competition, and the sale/usage of the former

Fig. 3. The top 20 conditions cured by the maximum number of drugs



Fig. 4. Top 5 drugs for birth control based on their ratings.

Fig. 5. Top 5 drugs for depression based on their ratings.



Fig. 6. Top 5 drugs for pain based on their ratings.

Fig. 7. Top 5 drugs for anxiety based on their ratings.



decreased drastically. Since 2014, the drug Levonorgestrel has been the most widely
used, and all five drugs have been roughly equally popular and recommended by
doctors. The main motive is to predict the medical conditions which will be most
common soon and the top medicines which can be taken to treat them.
Analysis of the second dataset [2] revealed the deadliest diseases in the U.S.A.
causing deaths between the ages of 50 and 85, from the years 2005 to 2015. As
observed in Fig. 9, the deadliest disease was found to be cancer, followed by heart
disease. Unintentional injuries came under the category of least deadly. The variation
in colors depicts the change in the percentage for the diseases over the years, with
darker shades depicting the most recent years.

Fig. 8. The top 5 drugs for acne based on their ratings.

5.4 Prescriptive Analysis


From the TF-IDF, the most common bigrams and unigrams for the top classes of
conditions were observed, as seen in Table 2. After doing this, it was possible to
incorporate all the above classifiers to give a kind of one-word summary of the
condition the user most likely has, based on the relationship between the conditions
and reviews. On identifying all the identical instances of ‘condition’ paired with the
unique ‘drug name’ column, it was possible to simply group all their rating scores
together and find the average rating of that pair in the dataset, which gave an overall
understanding of how effective that drug was for that condition.
The second part of the system comes in the form of training the SVM model to
analyze a person’s review in order to best predict which medical condition that person has.

Fig. 9. The rise among the top deadliest diseases in the US

After the model is trained, the system allows a user to enter his/her own feelings into
the system, which are then matched with the features corresponding to the medical
conditions currently stored in the combined dataset. Based on that sentence, the system
returns the best-predicted condition for the user, and then uses that condition to return
the top 10 best-rated drugs (after grouping unique ‘conditions’ and ‘drug name’) used
to treat that condition, based on the dataset. To ensure accuracy, stemming was used.
Using feature selection through ‘chi2’, the important words describing each ‘condition’
were revealed in the form of unigrams and bigrams, as seen in Table 2. It should also be
noted that the system was trained on the reviews of the drugs taken; therefore, the
system can also predict which drug you took or the condition you were having after
ingestion of the drug. Figure 10 presents the relationship between the popularity of a
drug and its “usefulness”. The heat map in Fig. 11 shows the correct classifications and
misclassifications when taking a smaller sample of each class.
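A minimal sketch of this two-step prescriptive flow, assuming a trained text-classification pipeline and a reviews DataFrame with ‘review’, ‘condition’, ‘drugName’ and ‘rating’ columns (hypothetical names, not the actual schema), could look like the following.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical training data: review text and the condition it was written for.
df = pd.DataFrame({
    "review":    ["cleared my acne fast", "skin breakouts reduced",
                  "pain relief within an hour", "chronic pain is gone"],
    "condition": ["Acne", "Acne", "Pain", "Pain"],
    "drugName":  ["DrugA", "DrugB", "DrugC", "DrugD"],
    "rating":    [9, 7, 8, 10],
})

# Step 1: predict the condition from the user's free-text description.
condition_clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(df["review"], df["condition"])
user_text = "my face keeps breaking out and my skin is oily"
predicted = condition_clf.predict([user_text])[0]

# Step 2: return the top-rated drugs for the predicted condition,
# grouping unique (condition, drugName) pairs by their average rating.
top_drugs = (df[df["condition"] == predicted]
             .groupby("drugName")["rating"].mean()
             .sort_values(ascending=False)
             .head(10))
print(predicted)
print(top_drugs)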
We used data visualization to visually inspect the accuracy of the resulting graphs.
Analysis of the diseases and death rate dataset revealed the leading causes of death in
the United States, with cancer and heart disease being the deadliest of all, as they
resulted in the most deaths. One way we can help prevent this is by helping people
identify the disease from their occurring symptoms, as well as recommending the
best-rated drugs available as early as possible, before the condition becomes
life-threatening.

Fig. 10. The relationship between ratings and how useful a drug is

Fig. 11. Confusion matrix showing correct classifications of the top 5 most common conditions
in regard to the testing set

Table 2. The most important words in a review that the classifier can give more preference to
while performing classification
Conditions | Most important unigrams | Symptoms
Acne | Really worried, pill 039, Skin, Acne | Face, Dry skin, cystic acne
Birth Control | Headache, attacked, birth, control, period, relief, pain | Mood swings, birth control
Depression | Sadness, crying, hurt, relief, pain, annoying, years | Anxiety depression, doc suggested, asked, depression anxiety
High Blood Pressure | Tired, relief, pain, levels, dropped, said, surgery | High pressure, high heartbeat
Pain | Surgery, relief, pain | Chronic pain, pain relief

6 Conclusion and Future Work

Today, with an increase in diseases all over the world, the number of drugs for these
diseases is also increasing, and the important factor is to realize which drugs are most
suitable for a specific disease. So, the objective of our research work is to suggest to
practitioners/medical students the best-recommended drug for a particular condition
according to the reviews of the drug. Medical data analysis is a research-based strategy
in which specialists retrieve, clean and visualize accessible qualitative and quantitative
data from Electronic Health Records. It was observed that ‘pain’ has the maximum
number of different drugs used to treat it, as pain is the first thing that comes to mind
when something is wrong. The accuracy of the machine learning is visualized using a
heat map. A symptom inputted by the user returns the condition the person is suffering
from and also the top drugs which can be used to treat the condition.
Since the user is returned a list of the top medications used to treat their given
condition, one must be wary of the possible ill effects of consuming the drugs.
Furthering this research project could mean looking at the medications and also
returning the negative effects associated with them. Moreover, some medicines do not
mix well in conjunction with others, so precautions should be taken. Finally, every
person is physiologically different; some conditions are more prevalent if gender, BMI,
age, or even race is taken into consideration.

Acknowledgement. This research is partially supported by a grant from Amazon Web Services.

References
1. Gräßer, F., Kallumadi, S., Malberg, H., Zaunseder, S.: UCI Machine Learning Repository
(2019). http://archive.ics.uci.edu/ml, https://archive.ics.uci.edu/ml/datasets/Drug+Review+
Dataset+%28Drugs.com%29#Irvine. Accessed 08 Apr 2019

2. National Center for Health Statistics. NCHS Data Visualization. Gallery - Potentially Excess
Deaths in the United States. Centers for Disease and Control and Prevention, 28 August
2017. https://www.cdc.gov/nchs/data-visualization/potentially-excess-deaths/. Accessed 27
Apr 2019
3. Larmuseau, P.: Licence Public Last updated ‘2017-03-10’, Date created ‘2017-03-04’,
Current version: Version 4, Symptom Disease sorting. https://www.kaggle.com/plarmuseau/
sdsort/metadata
4. Viveka, S., Kalaavathi, B.: Review on clinical data mining with psychiatric adverse drug
reaction. In: 2016 World Conference on Futuristic Trends in Research and Innovation for
Social Welfare (Startup Conclave), Coimbatore, pp. 1–3 (2016). http://ieeexplore.ieee.org/
stamp/stamp.jsp?tp=&arnumber=7583945&isnumber=7583750
5. ACC News Story. CDC Report Shows CVD Still #1 Killer in US - American College of
Cardiology. American College of Cardiology, 3 December 2018. https://www.acc.org/latest-
in-cardiology/articles/2018/12/03/16/11/cdc-report-shows-cvd-still-1-killer-in-us. Accessed
8 May 2019
6. Rosenburg, J.: Cancer Surpasses CVD as Leading Cause of Death in High-Income Counties.
Ajmc.com, 13 November 2018. https://www.ajmc.com/focus-of-the-week/cancer-surpasses-
cvd-as-leading-cause-of-death-in-highincome-counties. Accessed 8 May 2019
7. Syverson, P., Reed, M., Goldschlag, D.: Private medical instances. J. Comput. Med. Data
(JCS) 5(3), 237–248 (1997)
8. Saint-Jean, F., Johnson, A., Boneh, D., Feigenbaum, J.: Private web search. In: Proceedings
of the 6th ACM Workshop on Privacy in the Electronic Society (WPES) (2007)
9. Levy, S., Gutwin, C.: Improving understanding of website privacy policies with fine-grained
policy anchors. In: Proceedings of the Conference on the World-Medical Data, pp. 480–488
(2005)
10. Romanosky, S.: FoxTor: helping protect your health while browsing online for conditions.
cups.cs.cmu.edu/foxtor
11. Khalid, S., Ali, M.S., Prieto-Alhambra, D.: Cluster analysis to detect patterns of drug use
from routinely collected medical data. In: 2018 IEEE 31st International Symposium on
Computer-Based Medical Systems (CBMS), Karlstad, pp. 194–198 (2018). http://ieeexplore.
ieee.org/stamp/stamp.jsp?tp=&arnumber=8417236&isnumber=8417175
12. Kanchan, B.D., Kishor, M.M.: Study of machine learning algorithms for special disease
prediction using principal of component analysis. In: 2016 International Conference on
Global Trends in Signal Processing, Information Computing and Communication
(ICGTSPICC), Jalgaon, pp. 5–10 (2016). http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=
&arnumber=7955260&isnumber=7955253
13. Gaydhani, A., Doma, V., Kendre, S., Bhagwat, L.: Detecting hate speech and offensive
language on twitter using machine learning: an N-gram and TFIDF, CoRR, Volume is
abs/1809.08651 (2018)
14. Kanchan, B.D., Kishor, M.M.: Study of machine learning algorithms for special disease
prediction using principal of component analysis. In: 2016 International Conference on Global
Trends in Signal Processing, Information Computing and Communication (ICGTSPICC).
IEEE (2016). https://doi.org/10.1109/icgtspicc.2016.7955260
15. Scikit-learn developers (BSD License). SVM Example—scikit-learn 0.20.3 documentation.
Scikit-learn.org (n.d.). https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.
html#sphx-glr-auto-examples-linear-model-plot-ols-py. Accessed 20 Apr 2019
16. The SciPy community. Quickstart tutorial—NumPy v1.16 Manual. Scipy.org Sponsored by
Enthought (2019). https://docs.scipy.org/doc/numpy/user/quickstart.html. Accessed 30 Apr
2019
Automated Drug Suggestion Using Machine Learning 589

17. Hunter, J., Dale, D., Firing, E., Droettboom, M., Matplotlib Development Team: Beginner’s
Guide—Matplotlib 1.5.3 documentation. Matplotlib.org (2016). https://matplotlib.org/users/
beginner.html. Accessed 19 Apr 2019
18. Hunter, J., Dale, D., Firing, E., Droettboom, M., Matplotlib Development Team: Matplotlib
Pyplot Semilogx—Matplotlib 3.0.3 Documentation. Matplotlib.org (n.d.). https://matplotlib.
org/api/_as_gen/matplotlib.pyplot.semilogx.html. Accessed 26 Apr 2019
19. Droettboom, M., Matplotlib Development Team: Matplotlib Pyplot Semilogx—Matplotlib
3.0.3 Documentation. Matplotlib.org (n.d.). https://matplotlib.org/api/_as_gen/matplotlib.
pyplot.semilogx.html. Accessed 26 Apr 2019
20. Dateutil. Parser—dateutil 2.8.0 documentation. Readthedocs.io (2016). https://dateutil.
readthedocs.io/en/stable/parser.html. Accessed 30 Apr 2019
Conditional Random Fields Based
on Weighted Feature Difference Potential
for Remote Sensing Image Classification

Yi Sun(B) , Yan Tian, and Yiping Xu

National Key Laboratory of Science and Technology on Multi-spectral


Information Processing, School of Electronic Information and Communications,
Huazhong University of Science and Technology, Wuhan 430074, China
yi sun@hust.edu.cn

Abstract. Image pixel-wise classification is an important challenge in the field of remote sensing. Some traditional methods, e.g., SVM and random forest, tend to ignore contextual information and may fail on ambiguous samples. Conditional random fields (CRF) are an effective approach that eliminates semantic ambiguity by considering the relationship between adjacent pixels. In the CRF model, the pairwise potential reflects the contextual relationship. However, the existing pairwise potential often treats each feature equally and does not highlight the roles of key features. In this work, we propose a CRF model with a weighted feature difference pairwise potential which makes the key features play important roles, so that the contextual relationship can be depicted more precisely. We further employ a max-margin method to optimize the CRF model. Experimental results on hyper-spectral and multi-spectral remote sensing image datasets demonstrate that our approach achieves better classification performance.

Keywords: Remote sensing · Image classification · Conditional random fields · Weighted feature difference potential

1 Introduction
Hyper-spectral and multi-spectral image interpretation is one of the most challenging tasks in remote sensing. The goal of this task is to label each pixel automatically.
Some classification methods, such as SVM [1,2], KNN [3,4], Boosting [5] and Random Forest [6,7], have frequently been applied to image interpretation. Yet, these methods focus only on the features of each pixel, while ignoring the contextual information. The classification errors are thus particularly significant when pixel-feature ambiguity occurs. Some works based on Markov random fields (MRF) [8,9] attempted to remove the feature ambiguity by introducing contextual information. However, the MRF model only considers the contextual

information of the label and ignores the observation of the image, which may
cause the label bias problem [10].
To overcome this disadvantage of MRF, conditional random fields (CRF) [11] further consider the contextual information of the features of adjacent pixels, constructing a feature-difference pairwise potential to measure the similarity between two adjacent sites (pixels). However, the current pairwise potential often treats each feature as equally important, which does not highlight the role of key features. In this work, we propose a weighted feature difference potential (WFDP) which lets the key features play more important roles through different weights. That is, in the pairwise potential, different features should make different contributions, and these contributions are reflected by the weights.
The WFDP-based CRF model needs an efficient optimization method in practice. Maximum likelihood (ML) optimization involves the evaluation of the partition function in CRF models, which is an NP-hard problem. To avoid the partition function, the max-margin method [12] converts maximizing the likelihood into minimizing the energy with respect to the potential functions. Joachims et al. [13] introduced the cutting plane algorithm [14] into the max-margin method [12], which efficiently solves the constrained optimization problem in polynomial time. Szummer et al. [15] extended the method of [13] to the optimization of CRF models. Inspired by [15], we use the max-margin optimization method with cutting planes in the proposed WFDP-CRF model. Our method achieves better performance on the hyper-spectral and multi-spectral remote sensing image datasets.

2 CRF Model for Pixel-Wise Classification of Image


Let G = <V, ε> be a graph consisting of vertices and edges, where V and ε
represent the vertex set and the edge set respectively. CRF is modeled as,
p(y|x, u) = (1/Z) · e^{−E(X, Y, u)}    (1)
where E(X, Y, u) is the total energy, which is defined as

E(X, Y, u) = Σ_{i∈X} φ(x_i, y_i, w) + Σ_{(i,j)∈ε} ψ(x_i, x_j, y_i, y_j, v).    (2)

Here xi denotes the ith site (pixel) and yi corresponds to the label of xi . The
site set X = {xi }, 1 ≤ i ≤ N , where N is the number of the pixel in image.
The labels set Y = {yi }, 1 ≤ yi ≤ C, where C is the number of categories.
ϕ(·) is the unary potential function, and ψ(·) is the pairwise potential function.
The parameters w of the unary potential and the parameters v of the pairwise
potential are integrated as a set u = {w, v}.
There are a variety of definitions [12,16] of the unary potential function.
Following [12], we utilize the linear form:
φ(x_i, y_i, w) = Σ_{l=1}^{C} w_l^T · h(x_i) · δ(y_i = l)    (3)

Here, for a C-class classification problem, the parameter-vector set of the unary potential is w = {w_l}, 1 ≤ l ≤ C, where w_l is the parameter vector of the lth class. h(x_i) is the feature vector of x_i. δ(y_i = l) is an indicator function: δ(y_i = l) = 1 when y_i = l, and δ(y_i = l) = 0 when y_i ≠ l.
The pairwise potential functions are often divided into two types, namely
label-pair potential (LPP) [16] and feature-distance potential (FDP) [17]. The
former only shows the category difference between the adjacent sites but ignores the feature difference between them. The latter integrates the category and feature difference information, which is defined as

ψ(x_i, x_j, y_i, y_j, v) = { 0, if y_i = y_j ;  v · e^{−‖h(x_i) − h(x_j)‖²₂ / (2σ²)}, if y_i ≠ y_j }    (4)

where v is a parameter that can be learned and σ is a hyper-parameter which is


adjusted as needed. In this work, we define σ in the FDP as

σ = ( Σ_{(i,j)∈ε} ‖h(x_i) − h(x_j)‖²₂ ) / N_ε    (5)

where Nε denotes the edge number of all adjacent sites.
The quality of the features is one of the key factors that determine the performance of a classifier. All features are treated equally in FDP, which ignores the difference in contribution of different features. Thus, the advantage of a good feature is not strengthened, and the disadvantage of a weak feature is not reduced. To address this limitation of FDP, we propose a weighted feature difference potential (WFDP), in which we assign different weights to different feature distances. In the learning process, each feature can automatically be given a suitable weight. The WFDP is defined as

ψ(x_i, x_j, y_i, y_j, v) = { 0, if y_i = y_j ;  Σ_{k=1}^{M} v_k · e^{−(h(x_i)_k − h(x_j)_k)² / (2σ_k²)}, if y_i ≠ y_j }    (6)

Here v = {v_k} is the set of weight parameters of the pairwise potential, h(x_i)_k stands for the kth dimension of the features of the site x_i, and {σ_k} is a hyper-parameter set adjusted manually. In this work, we define the set {σ_k} in the WFDP as
σ_k = ( Σ_{(i,j)∈ε} (h(x_i)_k − h(x_j)_k)² ) / N_ε    (7)

where k ∈ [1, M] and M is the dimension of the features.
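To make the energy formulation concrete, the following minimal Python/NumPy sketch computes the total energy of Eq. (2) for a given labeling, using the linear unary potential of Eq. (3) and the WFDP pairwise potential of Eqs. (6) and (7). It is an illustrative sketch only, not the authors' implementation; the 4-neighborhood grid graph, the array shapes and the helper names (build_edges, wfdp_energy) are our own assumptions.

import numpy as np

def build_edges(height, width):
    # 4-neighborhood edge list over an image grid (an assumed graph structure).
    idx = np.arange(height * width).reshape(height, width)
    right = np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1)
    down = np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1)
    return np.vstack([right, down])                        # shape: (num_edges, 2)

def wfdp_energy(features, labels, w, v, edges):
    # Total energy of Eq. (2): unary term (Eq. 3) plus WFDP pairwise term (Eq. 6).
    unary = np.sum(features * w[labels], axis=1).sum()     # w_{y_i}^T h(x_i) summed over sites
    diff = features[edges[:, 0]] - features[edges[:, 1]]   # per-edge feature differences
    sigma2 = np.maximum(np.mean(diff ** 2, axis=0), 1e-12) # per-dimension sigma_k^2, Eq. (7)
    active = labels[edges[:, 0]] != labels[edges[:, 1]]    # only label-changing edges contribute
    pairwise = np.sum(np.exp(-diff[active] ** 2 / (2.0 * sigma2)) * v)
    return unary + pairwise

# Toy usage: a 4x4 image, M = 3 features per pixel, C = 2 classes.
rng = np.random.default_rng(0)
feats, labs = rng.normal(size=(16, 3)), rng.integers(0, 2, size=16)
w, v = rng.normal(size=(2, 3)), np.abs(rng.normal(size=3))   # non-negative WFDP weights
print(wfdp_energy(feats, labs, w, v, build_edges(4, 4)))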

3 Method
We first learn the parameters of the CRF by max-margin optimization on the training samples. Then, we predict the labels of the testing samples with the trained CRF model.

3.1 Max-Margin Learning

Optimizing the likelihood function of a graphical model involves the evaluation of the partition function Z in CRF models, which is an NP-hard problem. To avoid processing the partition function Z, Szummer et al. [15] introduced the max-margin learning method with cutting planes into the CRF model. Inspired by [15], we employ the max-margin method to learn the WFDP-based CRF model as well. The energy functions w.r.t. the Ground Truth (GT) and the inferred labels of the training samples should satisfy the following inequality

E(x, ŷ, u) ≥ E(x, y*, u),   ∀ŷ ≠ y*    (8)

where y* denotes the GT label set of all the training samples, and ŷ denotes the estimated labels of these training samples during learning. Equation (8) means that
the minimum energy only corresponds to the label set of the Ground Truth. That
is, the global probability p(y|x, u) of the CRF model on the training samples is
maximized on the Ground Truth, which is the idea of maximum likelihood.
We illustrate the max-margin optimization in Algorithm 1, which is based
on the cutting plane method [13]. Before learning, we establish the WFDP-CRF
model with initialized parameters and an empty constraint set S.
Then, the learning process starts. First, we use the current WFDP-CRF model to infer the labels of the training samples with Graph Cut (GC) inference [18]. Second, we add the estimated labels that differ from the GT labels into the constraint set S. Third, we optimize the max-margin objective function under the constraint inequality in Eq. (9) to estimate the model parameters and update the model with the new parameters. In Eq. (9), the 0–1 loss function Δ(ŷ, y*) and the slack variable ξ_n are added into the constraint inequality. The 0–1 loss function Δ(ŷ, y*) counts the number of incorrectly estimated labels, which enforces a relatively larger margin for those labelings that are far from the ground truth. The slack variable ξ_n relaxes the margin constraints. Notice that the v ≥ 0 constraint in Eq. (9) is required to satisfy the precondition [19] of GC inference. We then repeat the above iterative process until convergence.
For the quadratic programming problem in Eq. (9), a key step is to calculate the energy difference between the estimated labels and the ground truth on the constraint set S. Next, we present the derivation of this energy difference. According to Eq. (3), the energy of the unary potential on the label set {y} is written as

Σ_{i∈X} φ(x_i, y_i, w) = Σ_{i∈X} Σ_{l=1}^{C} w_l^T · h(x_i) · δ(y_i = l) = Σ_{l=1}^{C} w_l^T Σ_{i∈X} h(x_i) · δ(y_i = l).    (10)

Algorithm 1. Max-Margin Optimization
Require: training set of input-labeling pairs {x, y*}, the class number C, the WFDP-CRF model with randomly initialized parameters u = {w, v}, and an empty constraint set of labels S = ∅.
repeat
  1. Run graph cut to obtain the MAP inference label ŷ^(n):
       ŷ^(n) ← arg min_y [ E(x, y, u) − Δ(y, y*) ]
  2. If ŷ^(n) ≠ y*, add ŷ^(n) to the constraint set:
       S^(n) ← S^(n) ∪ {ŷ^(n)}
  3. Optimize the max-margin objective function and update the parameters u of the model:
       min  ‖u‖² + (C/N) Σ_n ξ_n,   ∀ŷ^(n) ∈ S^(n), ∀n
       s.t.  E(x, ŷ^(n), u) − E(x, y*, u) ≥ Δ(ŷ^(n), y*) − ξ_n    (9)
             ξ_n ≥ 0
             v ≥ 0
until u is unchanged (within a tolerance)
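A compact sketch of the training loop in Algorithm 1 is given below, for illustration only. The functions loss_augmented_graph_cut and solve_constrained_qp are hypothetical placeholders standing in for a real graph-cut solver [18] and a quadratic-programming routine, and the parameters u are assumed to be flattened into a single vector; none of this is the authors' code.

import numpy as np

def hamming_loss(y_hat, y_star):
    # 0-1 loss Delta of Eq. (9): number of incorrectly estimated labels.
    return int(np.sum(y_hat != y_star))

def train_wfdp_crf(x, y_star, loss_augmented_graph_cut, solve_constrained_qp,
                   u0, max_iter=50, tol=1e-4):
    # Cutting-plane max-margin learning (Algorithm 1), schematic only.
    u, constraints = u0, []                                 # constraints plays the role of S
    for _ in range(max_iter):
        y_hat = loss_augmented_graph_cut(x, y_star, u)      # step 1: loss-augmented MAP inference
        if hamming_loss(y_hat, y_star) > 0:                 # step 2: add violated labeling
            constraints.append(y_hat)
        u_new = solve_constrained_qp(x, y_star, constraints)  # step 3: re-solve the QP
        if np.linalg.norm(u_new - u) < tol:                 # parameters unchanged within tolerance
            return u_new
        u = u_new
    return u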

Then, the energy difference of the unary potential between the estimated-label set {ŷ} and the ground-truth set {y*} is
D_φ = Σ_{i∈X} φ(x_i, ŷ_i, w) − Σ_{i∈X} φ(x_i, y*_i, w)
    = Σ_{l=1}^{C} w_l^T Σ_{i∈X} h(x_i) · δ(ŷ_i = l) − Σ_{l=1}^{C} w_l^T Σ_{i∈X} h(x_i) · δ(y*_i = l)    (11)
    = Σ_{l=1}^{C} w_l^T Σ_{i∈X} h(x_i) · [δ(ŷ_i = l) − δ(y*_i = l)].

For the pairwise potential, according to Eq. (6), the WFDP energy of all the edges on the label set {y} is written as

Σ_{(i,j)∈ε} ψ(x_i, x_j, y_i, y_j, v) = Σ_{(i,j)∈ε} Σ_{k=1}^{M} v_k · e^{−(x_{ik} − x_{jk})² / (2σ_k²)} · δ(y_i ≠ y_j).    (12)

Then, the energy difference of WFDP between the estimated-label set {ŷ} and the ground-truth set {y*} is

D_ψ = Σ_{(i,j)∈ε} ψ(x_i, x_j, ŷ_i, ŷ_j, v) − Σ_{(i,j)∈ε} ψ(x_i, x_j, y*_i, y*_j, v)
    = Σ_{(i,j)∈ε} [ Σ_{k=1}^{M} v_k · e^{−(x_{ik} − x_{jk})² / (2σ_k²)} · δ(ŷ_i ≠ ŷ_j) − Σ_{k=1}^{M} v_k · e^{−(x_{ik} − x_{jk})² / (2σ_k²)} · δ(y*_i ≠ y*_j) ]    (13)
    = Σ_{k=1}^{M} v_k Σ_{(i,j)∈ε} [ δ(ŷ_i ≠ ŷ_j) − δ(y*_i ≠ y*_j) ] · e^{−(x_{ik} − x_{jk})² / (2σ_k²)}.
Thus, the total energy difference of unary and pairwise potential between
{ŷ} and {y ∗ } is

E(x, ŷ, u) − E(x, y*, u) = D_φ + D_ψ
    = Σ_{l=1}^{C} w_l^T Σ_{i∈X} h(x_i) · [δ(ŷ_i = l) − δ(y*_i = l)]    (14)
    + Σ_{k=1}^{M} v_k Σ_{(i,j)∈ε} [ δ(ŷ_i ≠ ŷ_j) − δ(y*_i ≠ y*_j) ] · e^{−(x_{ik} − x_{jk})² / (2σ_k²)}.

Substitute Eq. (14) into Eq. (9) to obtain the constraint inequality, and then
solve the Quadratic Program.
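Because Eq. (14) is linear in the parameters w and v, the energy difference for each constraint can be pre-computed as coefficient vectors, which is convenient when assembling the QP of Eq. (9). The sketch below illustrates this under our own assumptions about array shapes; it is not taken from the paper.

import numpy as np

def energy_difference_coefficients(features, edges, y_hat, y_star, num_classes):
    # Coefficients of Eq. (14): the energy difference equals sum_l coef_w[l].w_l + coef_v.v
    n_sites, n_dims = features.shape
    coef_w = np.zeros((num_classes, n_dims))
    for l in range(num_classes):                                   # unary part, D_phi
        indicator = (y_hat == l).astype(float) - (y_star == l).astype(float)
        coef_w[l] = indicator @ features
    diff = features[edges[:, 0]] - features[edges[:, 1]]
    sigma2 = np.maximum(np.mean(diff ** 2, axis=0), 1e-12)         # sigma_k^2 as in Eq. (7)
    kernel = np.exp(-diff ** 2 / (2.0 * sigma2))                   # per-edge, per-dimension term
    cut_hat = (y_hat[edges[:, 0]] != y_hat[edges[:, 1]]).astype(float)
    cut_star = (y_star[edges[:, 0]] != y_star[edges[:, 1]]).astype(float)
    coef_v = (cut_hat - cut_star) @ kernel                         # pairwise part, D_psi
    return coef_w, coef_v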

3.2 Inference

After the learning process is completed, we obtain the parameters u of the WFDP-CRF model. Next, we apply this model to the test samples to infer the test labels. The inference process finds the labeling with the lowest global energy for the test samples:

Y_test ← arg min_Y E(X_test, Y, u)    (15)

Considering computational efficiency, we again use graph-cut inference to solve Eq. (15).

Fig. 1. Washington D.C. HYDICE Dataset. (a) RGB pseudo-color image and (b)
Ground-truth image.

4 Experiments
4.1 Datasets

The proposed method is evaluated on three remote sensing image datasets.


The first dataset is a hyper-spectral remote-sensing image of the Washington D.C. mall, which was acquired in August 1995 by the HYDICE sensor. The image of 1280 × 307 pixels contains 191 bands. The number of labeled pixels is 88556. The image includes seven land cover/use classes: Grass, Water,
Roof, Road, Tree, Shadow, and Path. The labeled-pixel number of each category
is shown in Fig. 1.
The second dataset is a multi-spectral image of a township located in the southern region of Wuhan, China, and was captured by the Resources Satellite
III of China in August 2013. The multi-spectral image of 506 × 1182 pixels

consists of blue, green, red, and near-infrared spectral channels. The labeled-
pixel number is 50650. The dataset contains the following six classes: Corn field,
Paddy field, Cotton field, Path, Road, and Roof. The labeled-pixel number of
each category is shown in Fig. 2.
The third dataset is a multi-spectral image of the northern suburbs of Hobart, Australia, taken in August 2010 by the GEOEYE-1 satellite sensor. The image of 859 × 593 pixels has four bands consisting of blue, green, red, and near-infrared spectral channels. The number of labeled pixels is 58783. The
image includes five land cover/use classes: Bare, Road, Roof, Tree, and Grass.
The pixel number of each category is shown in Fig. 3.

Fig. 2. Wuhan Resources Satellite III Dataset. (a) RGB pseudo-color image and (b)
Ground-Truth image.

4.2 Features Extraction

The features used on the different datasets are not the same. For the hyper-spectral image data, owing to its rich spectral information, we directly employ the spectral values of the 191 bands as features.
For the two multi-spectral image datasets, i.e., the Wuhan and Hobart datasets, whose spectral channels are both four bands, we utilize the filter-bank feature-extraction method [20,21], in which each band is extended to 11-dimensional features. In total, 44-dimensional features are generated via the filter bank. Further, the four bands themselves also serve as features. Thus, we obtain 48-dimensional features on the Wuhan and Hobart datasets.
598 Y. Sun et al.

Fig. 3. Hobart GEOEYE-1 Satellite Dataset. (a) RGB pseudo-color image and (b)
Ground-Truth image.

4.3 Evaluation of the Proposed Method


The experiments illustrate the performance of the WFDP-based CRF model
on the Washington D.C. mall, Wuhan, and Hobart datasets. We compare
the WFDP-CRF with the traditional methods, i.e. Structured Support Vector
Machine (SSVM) [22], Random Forest (RF), and FDP-CRF. For the Washing-
ton D.C. mall dataset, we randomly select 10% labeled samples as a training set,
leaving 90% labeled samples as a testing set. All the above methods are learned
on the training set and evaluated on the testing set.
We repeat the experiments 10 times and record the best overall accuracy (OA), the corresponding category accuracies (Ci, the classification accuracy of the ith category) and the Kappa value. The average overall accuracy (AOA) is then obtained by averaging the overall accuracies of the 10 runs. We list the experimental results of the above methods on the Washington D.C. mall dataset in Table 1.

Table 1. Experimental results of SSVM, RF, FDP-CRF, and WFDP-CRF on the Washington D.C. Mall dataset. Category accuracies C1–C7 are in %. The best results are shown in boldface.

Method | C1 | C2 | C3 | C4 | C5 | C6 | C7 | OA (%) | AOA (%) | KAPPA
SSVM | 89.74 | 92.99 | 91.03 | 90.07 | 86.90 | 82.65 | 89.56 | 89.80 | 89.12 | 0.8908
RF | 98.25 | 97.08 | 95.22 | 92.05 | 94.74 | 81.39 | 67.88 | 93.64 | 93.02 | 0.9232
FDP-CRF | 97.33 | 97.54 | 95.09 | 92.98 | 96.53 | 84.51 | 88.27 | 94.91 | 94.53 | 0.9420
WFDP-CRF | 98.96 | 98.69 | 98.67 | 95.64 | 96.39 | 90.75 | 91.85 | 97.12 | 96.85 | 0.9663

As can be seen from Table 1, the performance of WFDP-CRF is superior to that of the other three methods. The AOA of WFDP-CRF is 2.32% higher than that of FDP-CRF, which ranks second, and the KAPPA value of WFDP-CRF is 0.0243 higher than that of FDP-CRF as well. We notice that SSVM only achieves 89.80% OA and 89.12% AOA, which is weaker than the other methods; this likely derives from the fact that contextual information is not used. Although RF does not use any spatial dependencies either, it is an ensemble learning method in which we employ 150 decision trees to vote on the inference results, so the RF method shows a significant improvement over SSVM. Owing to the use of contextual information via the feature-distance pairwise potential, the AOA of FDP-CRF is a further 1.5% higher than that of the RF method. The contextual information thus plays an important role in eliminating ambiguity. Our method achieves 97.12% OA and 96.85% AOA, which are better than the 94.91% OA and 94.53% AOA of FDP-CRF, because we further assign different weights to the different feature distances in the pairwise potential. Figure 4 shows the inferred-label images of all the methods on the Washington D.C. mall dataset.

Fig. 4. Classification results on the Washington D.C. Mall Dataset. (a) is the result of SSVM, (b) is the result of RF, (c) is the result of FDP-CRF, (d) is the result of WFDP-CRF (our method), and (e) is the RGB pseudo-color image of this dataset.
For the Wuhan and Hobart multi-spectral datasets, we likewise randomly select 10% of the labeled samples as a training set, leaving the remaining labeled samples as a test set. The corresponding experimental results on the Wuhan and Hobart datasets are listed in Tables 2 and 3, respectively. As shown in Table 2, WFDP-CRF obtains 94.86% OA and 94.56% AOA, which are better than the 91.62% OA and 91.35% AOA of FDP-CRF. Also, for the results on the Hobart dataset in Table 3, the OA and AOA of WFDP-CRF are both over 2.1% higher than those of FDP-CRF. For the two datasets, FDP-CRF and RF have comparable

Table 2. Experimental results of SSVM, RF, FDP-CRF, and WFDP-CRF on the Wuhan dataset. Category accuracies C1–C6 are in %. The best results are shown in boldface.

Method | C1 | C2 | C3 | C4 | C5 | C6 | OA (%) | AOA (%) | KAPPA
SSVM | 93.80 | 94.09 | 94.69 | 85.84 | 82.34 | 57.46 | 89.75 | 89.41 | 0.8879
RF | 97.89 | 95.76 | 95.46 | 85.51 | 86.48 | 54.72 | 91.51 | 91.16 | 0.9053
FDP-CRF | 93.13 | 96.05 | 95.22 | 87.62 | 87.59 | 69.60 | 91.62 | 91.35 | 0.9064
WFDP-CRF | 97.62 | 97.52 | 97.86 | 94.79 | 87.05 | 73.87 | 94.86 | 94.56 | 0.9384

Table 3. Experimental results of SSVM, RF, FDP-CRF, and WFDP-CRF on the Hobart dataset. Category accuracies C1–C5 are in %. The best results are shown in boldface.

Method | C1 | C2 | C3 | C4 | C5 | OA (%) | AOA (%) | KAPPA
SSVM | 91.82 | 93.41 | 86.59 | 91.24 | 92.27 | 91.12 | 90.86 | 0.9021
RF | 91.86 | 95.81 | 91.91 | 95.15 | 91.44 | 93.39 | 93.08 | 0.9246
FDP-CRF | 94.82 | 95.50 | 88.96 | 92.36 | 95.88 | 93.50 | 93.25 | 0.9254
WFDP-CRF | 95.78 | 96.31 | 93.48 | 96.72 | 96.24 | 95.65 | 96.35 | 0.9475

performance, and SSVM is still weaker than all the other methods. The experimental results on these two datasets further demonstrate that WFDP improves the classification performance of CRF. The corresponding inferred-label images are shown in Figs. 5 and 6, respectively.

Fig. 5. Classification results on the Wuhan Dataset. (a) is the results of SSVM, (b) is
the results of RF, (c) is the results of FDP-CRF, (d) is the results of WFDP-CRF (our
method), and (e) is the RGB pseudo-color image of this dataset.

Fig. 6. Classification results on the Hobart Dataset. (a) is the results of SSVM, (b) is
the results of RF, (c) is the results of FDP-CRF, (d) is the results of WFDP-CRF (our
method), and (e) is the RGB pseudo-color image of this dataset.

From the experimental results on the three datasets, we observe a common phenomenon: the classification performance of WFDP-CRF is consistently higher than that of FDP-CRF. The reason may be that the weighted feature distances in the pairwise potential make the key features play more important roles.

5 Conclusion

This paper studies a remote sensing image classification method based on conditional random fields. The introduction of contextual information, which is characterized by the pairwise potential of the CRF, can effectively eliminate the ambiguity of samples. Traditional pairwise potentials such as FDP tend to treat the distance of each feature equally, while our approach generates a more flexible contextual relationship by assigning different weights to different feature distances, so as to improve the CRF model's ability to eliminate ambiguity. Experimental results on three remote sensing image datasets demonstrate that our method has better classification performance and support the practical application of WFDP-CRF.

References
1. Chandra, M.A., Bedi, S.S.: Survey on SVM and their application in image classification. Int. J. Inf. Technol., 1–11 (2018)
2. Mountrakis, G., Im, J., Ogole, C.: Support vector machines in remote sensing: a
review. ISPRS J. Photogram. Remote Sens. 66(3), 247–259 (2011)
3. Blanzieri, E., Melgani, F.: Nearest neighbor classification of remote sensing images
with the maximal margin principle. IEEE Trans. Geosci. Remote Sens. 46(6),
1804–1811 (2008)
4. Bo, C., Huchuan, L., Wang, D.: Spectral-spatial k-nearest neighbor approach
for hyperspectral image classification. Multimed. Tools Appl. 77(9), 10419–10436
(2018)
5. Qi, C., Zhou, Z., Sun, Y., Song, H., Lishuan, H., Wang, Q.: Feature selection and
multiple kernel boosting framework based on PSO with mutation mechanism for
hyperspectral classification. Neurocomputing 220, 181–190 (2017)
6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
7. Xia, J., Ghamisi, P., Yokoya, N., Iwasaki, A.: Random forest ensembles and
extended multiextinction profiles for hyperspectral image classification. IEEE
Trans. Geosci. Remote Sens. 56(1), 202–216 (2018)
8. Cao, X., Zhou, F., Lin, X., Meng, D., Zongben, X., Paisley, J.: Hyperspectral image
classification with markov random fields and a convolutional neural network. IEEE
Trans. Image Process. 27(5), 2354–2367 (2018)
9. Jackson, Q., Landgrebe, D.A.: Adaptive Bayesian contextual classification based
on markov random fields. IEEE Trans. Geosci. Remote Sens. 40(11), 2454–2463
(2002)
10. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001) (2001)
11. Sutton, C., McCallum, A., et al.: An introduction to conditional random fields.
Found. Trends Mach. Learn. 4(4), 267–373 (2012)
12. Taskar, B., Chatalbashev, V., Koller, D., Guestrin, C.: Learning structured pre-
diction models: a large margin approach. In: Proceedings of the 22nd International
Conference on Machine Learning, pp. 896–903. ACM (2005)
13. Joachims, T., Finley, T., Yu, C.-N.J.: Cutting-plane training of structural SVMs. Mach. Learn. 77(1), 27–59 (2009)
14. Kelley Jr., J.E.: The cutting-plane method for solving convex programs. J. Soc.
Ind. Appl. Math. 8(4), 703–712 (1960)
15. Szummer, M., Kohli, P., Hoiem, D.: Learning CRFs using graph cuts. In: European
Conference on Computer Vision, pp. 582–595. Springer (2008)
16. Kumar, S., Hebert, M.: Multiclass discriminative fields for parts-based object
detection. In: Snowbird Learning Workshop, vol. 164 (2004)
17. Boykov, Y.Y., Jolly, M.-P.: Interactive graph cuts for optimal boundary & region
segmentation of objects in ND images. In: Proceedings Eighth IEEE International
Conference on Computer Vision, ICCV 2001, vol. 1, pp. 105–112. IEEE (2001)
18. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction
using iterated graph cuts. In: ACM Transactions on Graphics (TOG), vol. 23, pp.
309–314. ACM (2004)
19. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via
graph cuts. In: Proceedings of the Seventh IEEE International Conference on Com-
puter Vision, vol. 1, pp. 377–384. IEEE (1999)

20. Varma, M., Zisserman, A.: A statistical approach to texture classification from
single images. Int. J. Comput. Vis. 62(1–2), 61–81 (2005)
21. Andreetto, M., Zelnik-Manor, L., Perona, P.: Unsupervised learning of categorical
segments in image collections. IEEE Trans. Pattern Anal. Mach. Intell. 34(9),
1842–1855 (2011)
22. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods
for structured and interdependent output variables. J. Mach. Learn. Res. 6, 1453–
1484 (2005)
Feature Selection Using Flower Pollination
Optimization to Diagnose Lung Cancer
from CT Images

Dhalia Sweetlin Johnson1(✉), Daphy Louis Lovenia Johnson2, Pravin Elavarasan1, and Ashok Karunanithi1
1 Anna University, Chennai, Tamilnadu, India
jdsweetlin@mitindia.edu
2 Karunya Institute of Technology and Sciences, Coimbatore, Tamilnadu, India

Abstract. A segmentation approach using snake splines and a flower pollination based feature selection are suggested in this work to diagnose lung cancer from Computed Tomography scan images. The proposed system segments the lung tissues, including the diseased portion, using a prior shape model generated with snake splines. The model generation uses control points to construct a snake spline model. The model is scaled onto the diseased lung tissues so as to segment the lung region along with its boundary. The level of scaling varies depending on the sizes of the reference and diseased lung images and hence requires certain affine transformations to fit. From the segmented lung, the cancerous regions, which are the regions of interest (ROIs), are extracted, and features are extracted from these ROIs. A binary flower pollination algorithm is used to select the relevant features in a wrapper approach for further classification using an SVM classifier. In this work, 514 ROIs are extracted, of which 414 are used for training and 100 for testing, following the 80:20 Pareto principle. From the 56 extracted features, 33 features are selected by the binary flower pollination algorithm. The proposed approach achieved an accuracy of 84% using the SVM classifier. When compared with similar algorithms, the other approaches either selected more than 33 features or yielded a lower accuracy.

Keywords: Binary Flower Pollination · Catmull-Rom Spline · Computed Tomography images

1 Introduction

The detection of various structures in medical images requires advanced technological support. There exists a range of medical imaging techniques for detecting abnormalities in the lung. One approach widely used by medical practitioners, owing to its numerous benefits, is Computer Aided Diagnosis (CAD). CAD systems are developed to detect cancerous cells and to assess their growth over a period of time [1].
Many different medical imaging techniques are available at present, including Magnetic Resonance Imaging (MRI), X-ray and Computerized Tomography (CT) scans [2]. In this work, CT images are used. An enormous number of CT image slices is produced for a


patient when exposed to CT scanning [3]. Hence, an automated method of analysis is preferred over manual examination, as it can cut down labor and save time. From the CT image slices of the lung, the lung tissues need to be segmented separately so as to aid accurate disease diagnosis. Thresholding-based segmentation approaches are unsuccessful in segmenting the entire lung region if diseased portions are present along the boundary [4]. Certain segmentation models using prior shape knowledge require the landmark points to be plotted manually and require a large range of samples [5]. The standard snake (contour) based segmentation approach traverses along the boundary of the lung, which tends to leave out diseased portions lying along the lung boundary [6].
In this work, a Catmull-Rom Spline based segmentation approach is used to seg-
ment the lung tissues from the lung CT image slices. A spline function is a piecewise
polynomial function of degree k in a variable x [7]. The positions where the pieces meet are referred to as knots. These curves enable interpolation, where curves are constructed through plotted points. Cubic splines are used for interpolation of numeric data specified at given argument values to obtain a smooth continuous function [7]. Cubic polynomial splines are extensively used in computer graphics and geometric modelling to obtain curves or motion trajectories that pass through specified points of the plane or 3D space [7]. The centripetal Catmull-Rom spline is a local interpolating spline used in computer graphics to design curves and surfaces [8, 9]. This spline possesses better interpolation efficiency in comparison with other splines, possibly because of the increased number of knot values used in interpolation [10]. The B-spline (basis spline) has the minimum support with respect to degree, smoothness and domain partition. B-splines contain equidistant knots and are used for curve-fitting and numerical differentiation of experimental data [7].
Nodules present in the lungs that resemble cancerous signs and patterns are considered to be the regions of interest [11]. Image features are extracted from the nodules, from which the relevant features that increase the predictive accuracy of the diagnosis are chosen using a variant of the flower pollination optimization algorithm. Commonly used optimization algorithms are not successful in selecting the optimal feature subset, as too many features are generated from medical images; it has also been concluded that meta-heuristic algorithms provide better solutions [12]. Research has proved that the selection of appropriate features for classifier training has a great impact on the performance of the classifier [13].
The rest of the article is organized as follows: Sect. 2 presents the literature survey relevant to the work carried out. Section 3 deals with the materials and methodology. Experimental results are discussed in Sect. 4, and Sect. 5 concludes the article with future extensions that can be carried out.

2 Literature Survey

In this section, the works related to segmentation of lung and feature selection has been
discussed.

2.1 Lung Segmentation


Dhalia Sweetlin et al. [3] proposed a segmentation approach by generating a shape
model from the lung tissues of the same patient under diagnosis. This approach extracts
the nodules present even along the boundary of the lungs. The model has been tested
with tuberculosis and pneumonia infected lungs which resulted in an average similarity
index of 0.970 and 0.962 respectively. The dice coefficient values obtained are 0.985
and 0.983 respectively for both the datasets.
Song et al. [14] proposed a Toboggan based growing automatic segmentation
approach (TBGA) which includes automation of seed point detection, lung lesion
extraction and lung lesion refinement. This method is completely automated, with no human involvement in lung lesion identification, separation and further
processing. The system gave an accuracy of 96.35%.
Sun et al. [15] proposed a fully automated system to segment high density
pathologies in lung CT images. It combines two methods; initially a robust active shape
model matching method has been performed to obtain the outline of the lung region.
An optimal surface finding approach has been utilized to adapt the initially segmented
region. The left and right lung regions have been extracted out. Performance of the
system with 40 abnormal and 20 normal lung images has resulted in an average Dice
coefficient of 0.975 ± 0.006 and a mean absolute surface distance error of
0.84 ± 0.23 mm, respectively.
Delgado-Gonzalo et al. [16] proposed an active contour-based segmentation that
incorporates knowledge of mathematical spline models. The prior knowledge of the shape
has been fed into the system using energy parameters. Thereby, the shape of the snake
has been orthogonally projected onto the space that spans the affine transformations of
a given shape prior. The curves formed have been proven to be continuous thus giving
an edge over landmark based segmentation techniques. This approach improves the
robustness and quality of spline-based segmentation algorithms, while its computa-
tional overhead is negligible. A continuous spline curve has been projected in the shape
space as a result of construction. They tested this approach in microscopic images and
found the signal-to-noise ratios as −6.6 dB, −11.4 dB, −12.54 dB and −13.46 dB with
a Jaccard index of JI = 0.59.
Brigger et al. [17] proposed a formulation wherein B-spline snakes can be used as a
tool for fast outlining of contour objects. Their work explained that a traditional snake
by itself is a cubic spline irrespective of external energy parameters, but such snakes suffer from slow convergence due to the large number of control points required and
certain energy constraints. The energy constraints have been overcome by controlling
the elasticity of the spline curves. This has been performed by increasing the spacing
between spline knots.

2.2 Feature Selection


Yang et al. [18] proposed and evaluated a binary-constrained version of the Flower Pollination Algorithm (FPA) for feature selection, in which the optimization is carried out over a Boolean lattice where each possible solution is a string of bits denoting whether a feature will be used to compose the final set.

Dhalia Sweetlin et al. [19] proposed a CAD system that improves the diagnosis of
pulmonary tuberculosis. The lung regions have been extracted using region growing
and edge reconstruction algorithms from which the ROIs are extracted. A wrapper-
based approach that combines cuckoo search and one-against-all SVM classifier has
been used for optimal selection of texture features. The algorithm selected 47 features
with 92.77% as classification accuracy. The authors have also proposed a hybrid
feature selection approach using ant colony optimization with tandem run recruitment
[20] to reduce the number of features used in the diagnosis of bronchitis.
Experiments over some public and private datasets against Particle Swarm
Optimization (PSO) [12, 21], Cuckoo Search (CS) [19] and Ant Colony Optimization
(ACO) [20] algorithms have demonstrated the suitability of FPA for feature selection.
The experimental setting has evaluated the recognition rates, convergence speed,
number of selected features and computational load. All techniques have obtained
similar recognition rates; it is shown that FPA is also suitable for feature selection
tasks, since its results are comparable to the ones obtained by some state-of-the-art
evolutionary techniques.
Yang et al. [22] proposed a new feature selection algorithm for finding an optimal
feature set by using metaheuristic approach, called Swarm Search. Simulation exper-
iments were carried out by testing the Swarm Search over a high-dimensional dataset,
with different classification meta-heuristic algorithms. Swarm search was observed to
achieve good results. Though there are many works existing to diagnose cancerous
tumors in lung, this is the first work in this field to combine a spline curve-based
segmentation and a nature inspired feature selection approach.

3 Materials and Methods

The proposed approach is divided into many modules and is collectively illustrated in
Fig. 1. The modules are explained in this section.

3.1 Pre-processing Module


Lung CT images are collected from reputed hospitals. First, these grayscale CT images are converted to binary images to simplify the segmentation. This subsystem consists of background removal and isolation of the left and right lungs from the CT image slices.
Background Removal Subsystem. The stack of CT images of a patient is considered
in its entirety. In each slice, the background pixels are removed and the cavities and
holes present in the lung tissues are filled [19]. From every CT slice, left and right lung
tissues are separated and stored in a directory.

Fig. 1. Proposed system framework (lung CT image pre-processing, reference image selection, control points fixation, spline construction, scaling onto the diseased image, segmentation, ROI extraction, feature extraction, flower pollination based feature selection, classifier training and performance evaluation)

Separation of Left and Right Lung. The left and right lungs are isolated from the binary image. This is done using a filter which, on subtracting and negating from the full image, gives the left and consequently the right lung images.
Input: A stack of binarized lung images.
Process: For each binarized lung image, the largest connected component is determined; removing this component from the entire image isolates the two lungs separately (a minimal sketch follows the step list below).
Step 1: Identify the largest connected component in the image.
Step 2: Store this component and subtract this from the binarized image.
Step 3: Obtain and store the isolated image of a lung.
Step 4: Invert the image obtained in step 3 to get lung in black pixel and back-
ground in white.
Step 5: Store the separated lungs.
Output: A stack of isolated left and right lungs stored in a directory.
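A minimal sketch of this separation step is given below; it assumes the binarized slice is a Boolean NumPy array with the lungs as foreground (True), uses scikit-image's connected-component labeling, and the function name separate_lungs is our own. It is not the authors' implementation.

import numpy as np
from skimage.measure import label

def separate_lungs(binary_slice):
    # Split a binarized slice into two single-lung masks by removing
    # the largest connected component (Steps 1-4 above).
    labels = label(binary_slice)                     # connected components of the foreground
    if labels.max() == 0:
        return None, None                            # nothing to separate
    sizes = np.bincount(labels.ravel())[1:]          # component sizes, background excluded
    first_lung = labels == (np.argmax(sizes) + 1)    # largest component: one lung
    second_lung = binary_slice & ~first_lung         # subtract it to keep the other lung
    return ~first_lung, ~second_lung                 # invert: lung black, background white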

3.2 Reference Image Selection


From the stack of left and right lung images, the best reference slice from each stack is
obtained. This is based on the area of the lung image that is being considered from the
stack of separated left and right lungs. The lung image with the maximum number of pixels is found by traversing the lung image from the top left, moving one pixel at a time and identifying the black pixels in it; simultaneously, a count of these black pixels is maintained. Ultimately the lung which has the largest area among the
lung images in the stack is identified. This is done for both the stacks of images. Hence
the left and right lung with maximum lung area in each stack is selected as reference for
further processing.
Input: A stack of left and right separated lung images.
Process: The stack of separated left and right lung images is traversed to identify the lung with the maximum pixel count. In this way, the left and right lungs with the maximum area in each stack are identified and stored in a separate folder. The isolated images are used for fixing control points and further processing (a minimal sketch follows the step list below).
Step 1: Obtain the left or right lung image from the stack.
Step 2: Traverse along the image through every pixel.
Step 3: Compute the area of each lung using Eq. (1).
Area = Σ_{∀ slice} ( f(x, y) == 0 )    (1)

where x, y are the pixel coordinates in each CT slice.


Step 4: Identify the left and right lung with maximum area in each stack.
Step 5: Display the image identified using step 4, of left and right lung portions as
reference.
Output: Reference image from a stack of left and right lung images.
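The selection of the reference slice can be sketched in a few lines; the sketch assumes each slice is a 2-D NumPy array in which lung pixels have value 0 (black), and select_reference is a hypothetical helper name.

import numpy as np

def lung_area(slice_img):
    # Area of Eq. (1): number of black (lung) pixels in one binarized slice.
    return int(np.sum(slice_img == 0))

def select_reference(lung_stack):
    # Return the slice with the maximum lung area from a stack of isolated lungs.
    areas = [lung_area(s) for s in lung_stack]
    return lung_stack[int(np.argmax(areas))]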

3.3 Control Points Fixation


The construction of spline snakes requires the use of control points. Splines operate on these control points, and hence the reference images found in the previous step are fixed with control points. The left and right reference images are individually traversed along the x-axis while the y-axis is incremented one pixel at a time. The top-most and bottom-most coordinates of the lung are obtained. All the points on the boundary of the curvy inner region of either the left or right lung are obtained by traversing along its sides. This provides the coordinates of the points that are plotted for reference. These plotted points are reduced by deducing the slope between consecutive points: since the slope of a straight line is constant, points along a straight line are omitted and only the starting point of that line is used to construct the model. This can be done by finding the difference between two consecutive slope values of these deduced points. The other side of the lung, the outer boundary, possesses a regular shape and does not require many control points, so only a few regularly spaced points are determined there. These obtained points are used for model construction using the spline curve [8].

Input: Binary image of reference slice of left and right lung.


Process: Lung regions are black pixels and the background is white. Traversing row by row from the left end to the right end, the first black pixel encountered gives a control point at the top; doing the same from bottom to top gives a control point at the bottom. These two control points define the major axis (a minimal sketch follows the step list below).
Step 1: Choose the reference lung image and scan from top along x-axis keeping y
constant.
Step 2: Repeat for all y values until first black pixel of lung is encountered.
Step 3: Save the co-ordinates for reference.
Step 4: Repeat step 3, from bottom corner decrementing the x and y values.
Step 5: Traverse along curvy lung side to find every point of its boundary.
Step 6: Find slope of each consecutive points (x, y) and (x′, y′) using Eq. (2).

Slope (m) = ( Controlx(1, y′) − Controlx(1, y) ) / ( Controlx(1, x′) − Controlx(1, x) )    (2)

Step 7: Compute difference in slope and if 0 record the first control point of that
sequence of 0’s.
Step 8: Fix equally spaced control points along the regular shape of lung by
dividing major axis into 16 regions.
Step 9: Combine all such control points into a single array.
Output: An array of control points to be used for model generation.
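The slope-based reduction of boundary points (Steps 6 and 7) can be sketched as follows; the boundary points are assumed to be ordered along the contour, and reduce_control_points is a hypothetical helper, not the authors' code.

import numpy as np

def reduce_control_points(boundary_points, eps=1e-6):
    # Keep only the first point of each straight run by checking whether the
    # slope between consecutive points (Eq. 2) changes.
    pts = np.asarray(boundary_points, dtype=float)
    kept = [pts[0]]
    prev_slope = None
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        dx = x1 - x0
        slope = np.inf if abs(dx) < eps else (y1 - y0) / dx
        if prev_slope is None or abs(slope - prev_slope) > eps:
            kept.append(np.array([x1, y1]))          # slope changed: keep this point
        prev_slope = slope
    return np.array(kept)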

3.4 Spline Construction


A spline is constructed along the boundary through the deduced control points. A function implementing the Catmull-Rom spline receives the array of control points and outputs the path of the spline connecting them. The spline function is based on interpolation between the control points, dividing the distance between two control points into a specified number of partitions. The interpolated points as a whole define the model that is required for segmentation [8, 9].
Input: An array of control points
Process: The sequence of control points is interpolated and a model of the reference lung is obtained using the Catmull-Rom spline. The control points are given as input to this module (a minimal sketch follows the step list below).
Step 1: Partition the distance between consecutive control points into desired
number.
Step 2: Combine these partitions using the Catmull ROM spline Eqs. (3) through
(8).

C = ((t2 − t)/(t2 − t1)) B1 + ((t − t1)/(t2 − t1)) B2    (3)

where

B1 = ((t2 − t)/(t2 − t0)) A1 + ((t − t0)/(t2 − t0)) A2    (4)
B2 = ((t3 − t)/(t3 − t1)) A2 + ((t − t1)/(t3 − t1)) A3    (5)
A1 = ((t1 − t)/(t1 − t0)) P0 + ((t − t0)/(t1 − t0)) P1    (6)
A2 = ((t2 − t)/(t2 − t1)) P1 + ((t − t1)/(t2 − t1)) P2    (7)
A3 = ((t3 − t)/(t3 − t2)) P2 + ((t − t2)/(t3 − t2)) P3    (8)

where P0, P1, P2 and P3 represent consecutive control points and t, t0, t1, t2 and t3 represent the knot sequence that determines the interpolation.
Step 4: Construct a model plotting the interpolation of control points to define a
reference model.
Output: A model that holds the reference structure of the lung.
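The recursion of Eqs. (3) through (8) (the Barry-Goldman formulation of the Catmull-Rom spline) can be written compactly as below. This is an illustrative sketch: the centripetal knot spacing with alpha = 0.5 and the assumption that consecutive control points are distinct are our own choices, not necessarily the authors' settings.

import numpy as np

def catmull_rom_segment(P0, P1, P2, P3, num_points=20, alpha=0.5):
    # Interpolate between P1 and P2 using Eqs. (3)-(8); alpha = 0.5 gives the
    # centripetal parameterization. Consecutive control points must be distinct.
    P0, P1, P2, P3 = (np.asarray(p, dtype=float) for p in (P0, P1, P2, P3))
    t0 = 0.0
    t1 = t0 + np.linalg.norm(P1 - P0) ** alpha
    t2 = t1 + np.linalg.norm(P2 - P1) ** alpha
    t3 = t2 + np.linalg.norm(P3 - P2) ** alpha
    t = np.linspace(t1, t2, num_points)[:, None]
    A1 = (t1 - t) / (t1 - t0) * P0 + (t - t0) / (t1 - t0) * P1   # Eq. (6)
    A2 = (t2 - t) / (t2 - t1) * P1 + (t - t1) / (t2 - t1) * P2   # Eq. (7)
    A3 = (t3 - t) / (t3 - t2) * P2 + (t - t2) / (t3 - t2) * P3   # Eq. (8)
    B1 = (t2 - t) / (t2 - t0) * A1 + (t - t0) / (t2 - t0) * A2   # Eq. (4)
    B2 = (t3 - t) / (t3 - t1) * A2 + (t - t1) / (t3 - t1) * A3   # Eq. (5)
    return (t2 - t) / (t2 - t1) * B1 + (t - t1) / (t2 - t1) * B2 # Eq. (3)

# Chaining such segments over the whole control-point array yields the lung shape model.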

3.5 Scaling onto Diseased


The constructed model is scaled onto the diseased image using the active contour method. The model is made to shrink or expand to fit the boundary of the diseased lung without distorting its shape. This expansion and shrinkage is driven by the energy function of the snake algorithm and proceeds for a predefined number of iterations. On completion of these iterations, the segmented lung image of the diseased lung is obtained, with the shape characteristics of the model imposed onto it.
Input: A diseased image and a reference model constructed using spline.
Process: The reference model is placed over the diseased image to segment out the regions that are necessary for disease diagnosis (an illustrative sketch follows the step list below).
Step 1: Place the reference model over the diseased image that needs to be
diagnosed.
Step 2: Scale the active contour model by either shrinking or expanding the borders
of model to the boundary of diseased.
Step 3: Exercise an optimum fit using the nature of snakes that functions based on
energy parameters.
Step 4: Control optimum fit using the parameter ratio factor which is determined
using Eq. (9).

Ratio factor = ( Area of reference lung / Area of diseased lung ) × Constant    (9)

where the constant controls the iteration.



Step 5: Segment the diseased lung based on this ratio factor, even along the boundary, as it retains the shape characteristics of the model structure.
Output: A segmented lung.
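For illustration only, the scaling and fitting step could be sketched with scikit-image's active contour (snake) function as below. The way the reference model is rescaled by the ratio factor of Eq. (9), the smoothing, and the snake parameter values are all our own assumptions rather than the authors' settings.

import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import active_contour

def fit_model_to_diseased(model_points, diseased_img, ref_area, diseased_area, constant=1.0):
    # Scale the spline model by the ratio factor of Eq. (9), then refine it as a snake.
    ratio = (ref_area / diseased_area) * constant                 # Eq. (9)
    centroid = model_points.mean(axis=0)
    init = centroid + (model_points - centroid) / np.sqrt(ratio)  # shrink/expand about centroid
    smoothed = gaussian(diseased_img, sigma=2)                    # smooth before snake evolution
    return active_contour(smoothed, init, alpha=0.015, beta=10.0, gamma=0.001)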

3.6 ROI Extraction Module


The diseased lung image is segmented using the obtained model, and the nodular structures required for disease diagnosis are extracted as ROIs. Not all extracted ROIs are cancerous, as other pulmonary structures may appear similar to a nodule in a CT slice, and other pulmonary diseases such as tuberculosis also exhibit nodular patterns. Hence, a pruning stage is carried out with the help of an expert to rule out patterns caused by other pulmonary structures and retain only the required ROIs. The ROIs are isolated and features are extracted from them for further processing. In this work, 514 ROIs are obtained and used for feature extraction.
Input: Segmented lung.
Process: The ROIs in the segmented diseased image are identified and extracted for further processing (a minimal sketch follows the step list below).
Step 1: Scan the image pixel by pixel and pixels that surpass the threshold value are
selected to be the ROI.
Step 2: Store these pixels further and combine with other such obtained pixels using
Eq. (10).

New pixel value = { old pixel value, if x(i, j) > T ;  0, otherwise }    (10)

where T, the threshold value is set after repeated trials in this work.
Output: The ROIs are obtained for feature extraction.
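A minimal sketch of the thresholding rule of Eq. (10) is shown below; the threshold T is chosen by repeated trials in this work, so any concrete value would be a placeholder.

import numpy as np

def extract_roi_pixels(segmented_lung, threshold):
    # Keep pixels brighter than T (Eq. 10); set everything else to zero.
    return np.where(segmented_lung > threshold, segmented_lung, 0)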

3.7 Feature Extraction Module


Texture features extracted from the ROIs using the Gray Level Co-occurrence Matrix (GLCM), together with shape-based features, are used in this work. The GLCM records how often pairs of pixels with given gray values occur in a defined neighborhood relationship. This can be computed for various orientations; 0 and 45 degree orientations are considered in this work, yielding 22 texture-based features for each orientation.
Shape-based features such as area, centroid, major axis length, minor axis length,
eccentricity, orientation, convex area, filled area, Euler number, equiv-diameter,
solidity and perimeter are also extracted. Totally 56 features are derived.
Input: ROIs that are obtained after the pruning process.
Process: GLCM and shape features of these ROIs are obtained (an illustrative sketch follows the step list below).
Step 1: Texture features are obtained by constructing a GLCM [23, 24] matrix of size n × n, where n is the number of gray levels (normally 256). G(a, b) represents the number of times a pixel with gray value a occurs in the neighborhood of a pixel with gray value b.
Features such as correlation, autocorrelation, contrast, cluster prominence and

shade, dissimilarity, energy, entropy, homogeneity and maximum probability are


extracted. Variants of features such as sum of squares, correlation, variance, entropy
and inverse differences are also extracted. A total of 44 features, 22 per orientation, are extracted.
Step 2: Shape features such as area, centroid, major axis length, minor axis length,
eccentricity, orientation, convex area, filled area, Euler number, equiv diameter,
solidity and perimeter are also extracted.
Output: A total of 56 features are obtained.
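A sketch of this step using scikit-image is given below for illustration. It computes only a subset of the 22 GLCM features per orientation and a few of the listed shape features, because the exact feature definitions used by the authors are not given here; the ROI is assumed to be an 8-bit grayscale patch with a binary mask.

import numpy as np
from skimage.feature import graycomatrix, graycoprops
from skimage.measure import label, regionprops

def roi_features(roi_gray, roi_mask):
    # GLCM texture features at 0 and 45 degrees plus a few shape features.
    glcm = graycomatrix(roi_gray, distances=[1], angles=[0, np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    texture = []
    for prop in ("contrast", "dissimilarity", "homogeneity", "energy", "correlation", "ASM"):
        texture.extend(graycoprops(glcm, prop).ravel())      # one value per orientation
    region = max(regionprops(label(roi_mask)), key=lambda r: r.area)
    shape = [region.area, region.perimeter, region.eccentricity, region.solidity]
    return np.array(texture + shape)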

3.8 Feature Selection Module


A feature selection algorithm has been used to reduce the number of features considered for classification. In this work, a Binary Flower Pollination Algorithm (BFPA) has been used, which follows the natural pollination process in flowers. There are two types of pollination: self-pollination and cross-pollination. Cross-pollination is carried out among flowers of different plants, while self-pollination occurs within the same plant or even the same flower [18].
Cross-pollination is carried out through insects or birds that carry the pollen to other plants, while self-pollination is performed with assistance from the wind. The movement of birds and insects is modeled using Levy's formula, which explores various possible solutions [25].
Input: Extracted features from the ROIs.
Process: The number of features is reduced using the BFPA algorithm, which selects a minimum number of features to obtain higher classification accuracy and reduced complexity (a schematic sketch follows the step list below).
Step 1: Initialize the population with a random number of feature sets.
Step 2: Compute a threshold value for each feature and determine a binary notation
for each feature set comparing threshold of each feature.
Step 3: Obtain a binary array of considered feature set for which only the selected
features that surpassed the threshold for each feature set are chosen.
Step 4: Obtain fitness for initial population using the Eq. (11). The classification
accuracy of SVM is used as fitness measure. The global best fitness and the cor-
responding features are stored.

Fitness = (TP + TN) / (FP + FN + TP + TN)    (11)

where, TP is True Positive, TN is True Negative, FP is False Positive and FN is


False Negative.
Step 5: Decide cross or self-pollination for each iteration by probability factor.
Step 6: Perform exploration for cross-pollination, by computing Levy’s flight dis-
tance using the formula stated in Eq. (12).

L(λ) = ( λ · Γ(λ) · sin(πλ/2) / π ) · ( 1 / s^{1+λ} ),   s ≫ s0 > 0    (12)
where Γ(·) is the gamma function and s is the step size.
Step 7: Perform the exploration update of Eq. (13), and update the values accordingly.
 
x_i^{t+1} = x_i^t + α · L(λ) · (g∗ − x_i^t)    (13)

where g∗ is the global best solution found so far, L(λ) is the Lévy step, and α is a step-size scaling factor.
Step 8: Perform exploitation for self-pollination as in the following Eq. (14), where
e is derived from normal distribution [0, 1].
 
x_i^{t+1} = x_i^t + ε · (x_j^t − x_k^t)    (14)

Step 9: Perform Steps 7 and 8 only for those features which have been selected to obtain the global best, or for the features of the current feature set that hold better prospects than the global best features.
Step 10: After every exploration or exploitation, the feature values are binarized for further computation using Eqs. (15) and (16), where r is drawn from a normal distribution [0, 1].
S(x_i^j(t)) = 1 / (1 + e^{−x_i^j(t)})    (15)

x_i^j(t) = { 1, if S(x_i^j(t)) > r ;  0, otherwise }    (16)

Step 11: Obtain the fitness value for that feature set and update the global best if the fitness value is greater than the current global best.
Step 12: Steps 5–10 are repeated for every random feature set chosen; its values and the corresponding binary array are updated.
Step 13: After completion of all the iterations, the global best is the accuracy of the classifier, and the binary array corresponding to that accuracy determines the optimal subset of features selected by BFPA.
Output: A subset of features to provide best accuracy on classification.
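The overall BFPA wrapper loop can be sketched as follows. This is a simplified schematic under our own assumptions (an sklearn linear SVM evaluated with cross-validation as the fitness of Eq. (11), a fixed switch probability, a Mantegna-style approximation of the Lévy step of Eq. (12), and uniform random numbers for the local step); it is not the authors' implementation and omits several details of the steps above.

import numpy as np
from math import gamma
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def levy_step(size, lam=1.5, rng=None):
    # Mantegna-style Levy flight step (an approximation of Eq. 12).
    rng = rng or np.random.default_rng()
    sigma = (gamma(1 + lam) * np.sin(np.pi * lam / 2) /
             (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    return rng.normal(0, sigma, size) / np.abs(rng.normal(0, 1, size)) ** (1 / lam)

def fitness(X, y, mask):
    # Eq. (11) approximated via SVM cross-validation accuracy on the selected subset.
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=3).mean()

def binarize(x, rng):
    # Eqs. (15)-(16): sigmoid transfer followed by a random threshold.
    return 1.0 / (1.0 + np.exp(-x)) > rng.random(x.shape)

def bfpa_select(X, y, n_flowers=10, n_iter=30, p_switch=0.8, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    M = X.shape[1]
    pop = rng.normal(size=(n_flowers, M))                 # continuous flower positions
    masks = binarize(pop, rng)
    fits = np.array([fitness(X, y, m) for m in masks])
    best = int(np.argmax(fits))
    g_pos, g_mask, g_fit = pop[best].copy(), masks[best].copy(), fits[best]
    for _ in range(n_iter):
        for i in range(n_flowers):
            if rng.random() < p_switch:                   # global (cross) pollination, Eq. (13)
                pop[i] += alpha * levy_step(M, rng=rng) * (g_pos - pop[i])
            else:                                         # local (self) pollination, Eq. (14)
                j, k = rng.choice(n_flowers, 2, replace=False)
                pop[i] += rng.random() * (pop[j] - pop[k])
            mask = binarize(pop[i], rng)
            f = fitness(X, y, mask)
            if f > g_fit:                                 # update the global best
                g_pos, g_mask, g_fit = pop[i].copy(), mask.copy(), f
    return g_mask, g_fit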

3.9 Classification Module


A Support Vector Machine (SVM) with a linear kernel is used in the proposed system. In all, 80% of the ROIs are used to train the system and the remaining 20%
ROIs are used to test the efficiency of the system.

4 Results

The input CT images, containing both normal and cancerous cases, are collected from hospitals. These images are processed and the results are presented here. Figure 2 shows the input image, the binarized lung image after background removal, and the separated right and left lungs.

Fig. 2. (a) Input image, (b) background removal, (c) isolated left, (d) isolated right

Reference lung image is selected from the CT stack as the lung with maximum area.
The left and right halves belong to the same CT stack of the same patient but may come from different slices within the stack. The left and right lung reference images and the
corresponding zoomed images thus obtained are shown in Fig. 3(a) through (d).

Fig. 3. (a) Reference left, (b) reference right, (c) zoomed left lung with control points,
(d) zoomed right lung with control points.

The sequence of control points is interpolated using the Catmull-Rom spline to generate the left and right lung shape models, which are shown in Fig. 4(a) and (b).

Fig. 4. (a) and (b) Left and right lung models generated using spline curves

The suspected and diseased CT slices are identified with the help of an expert.
These slices are preprocessed, and the model generated using the spline curve is placed over the diseased image, as shown in Fig. 5(a) through (d).

Fig. 5. (a) Left lung with initial fit, (b) left lung with final fit, (c) right lung with initial fit,
(d) right lung with final fit

The original gray scale pixels are superimposed onto the binarized lung to facilitate
ROI extraction. This can be seen in Fig. 6(a) and (b).


Fig. 6. (a) Segmented lung (b) Extracted ROIs

4.1 Performance Analysis


The feature selection algorithm is analyzed through comparisons across iterations and against other algorithms. The best solution is obtained when the iterations are repeated after fine-tuning its parameters. Graphs of the number of features selected and the best fitness value for each iteration are shown in Figs. 7 and 8.

Fig. 7. Number of features selected

The comparison of the fitness value of the feature selection algorithm during
various iterations has been illustrated in Fig. 8.
The binary flower pollination algorithm is compared with various filter methods based on measures such as information gain, gain ratio and relief attribute evaluation; these consider all 56 features and resulted in an accuracy of 80%. Principal component analysis selected 6 features, yielding an accuracy of 68%. CFS subset evaluation selected a single feature with an accuracy of 76%. In comparison, the binary flower pollination algorithm selected 33 features and yielded a better accuracy of about 84%. The number of features selected by each algorithm is shown in Fig. 9.

Fig. 8. Performance of feature selection (fitness/accuracy versus number of iterations)

Fig. 9. Feature selection algorithm

The selected features are used to train various classifiers to find the best one. The classifier parameters are further tuned to find the best accuracy that the classifier can yield on the data. The accuracy obtained when the data is classified with various algorithms is shown in Fig. 10. The SVM classifier with a linear kernel function yielded higher accuracy compared to the other classifiers.

Fig. 10. Comparison of classification algorithms

5 Conclusion

In this work, a spline curve-based segmentation approach is used to segment the lungs, and a binary flower pollination optimization technique is used to select the optimal image features for classification. From the extracted features, 33 features are selected by the wrapper-based BFPA, giving 84% accuracy in diagnosing lung cancer. Though there are many existing works in this domain, this work uses a flower pollination optimization approach to select features. The accuracy of the approach can be improved if the segmentation technique is optimized to reduce overfitting of the reference lung image over the diseased lung image. The features extracted from the ROIs could be increased for more accurate results, and improvements to the BFPA algorithm are likely to yield better results. Also, this work concentrates on lung cancer alone and could be extended to other lung diseases.

Detecting Cyberbullying in Social Commentary
Using Supervised Machine Learning

Muhammad Owais Raza(&), Mohsin Memon, Sania Bhatti,


and Rahim Bux

Department of Software Engineering,


Mehran University of Engineering Technology, Jamshoro, Pakistan
owais.leghari@hotmail.com,
{mohsin.memon,sania.bhatti}@faculty.muet.edu.pk,
raheembux991@gmail.com

Abstract. This paper addresses the problem of cyberbullying on various online discussion forums in the form of social commentary. Here, supervised machine learning algorithms are employed to detect whether a particular comment is an insult, a threat or a hate message. First, a machine learning model is developed with the Logistic Regression, Random Forest and Naive Bayes algorithms for classification; then both Voting and AdaBoost classifiers are applied to the developed model to observe which works best in this case. In Japan, members of the PTA (Parent Teacher Association) perform net-patrol with manual website monitoring in order to catch and stop cyberbullying activities; however, doing all this manually is a very time-consuming and hectic process. The main contribution of this paper is a mechanism to detect cyberbullying: using supervised machine learning with the logistic regression algorithm, the model achieved an accuracy of 82.7%, and with the voting classifier an accuracy of 84.4% was observed. The evaluation results show that the voting classifier outperforms all other algorithms in detecting cyberbullying.

Keywords: Cyberbullying  Python  NLP  Supervised machine learning

1 Introduction

With the rise of the internet, social media and online discussion forums, there is a new kind of bullying that does not take place in the classroom, at home or in our neighborhood; it happens online, is carried out on the internet and is called cyberbullying. The usual form of cyberbullying is a hate message sent through e-mail or insulting online social commentary; it includes verbal attacks on one's body type, appearance, race or color. Cyberbullying can lead to depression, low self-esteem, low confidence, self-harm or, in some cases, suicide. To deal with this problem, the PTA (Parent Teachers Association) in Japan started net-patrol, in which they monitor websites for such activities, and whenever they detect any such activity they contact the internet provider or website admin to remove the content [2]. This whole process, including the record keeping, is carried out manually, which requires a lot of psychological strength and exerts a lot of mental pressure.


In this paper, a method is proposed that requires very little or no human involvement. The authors have developed machine learning classifiers based on supervised learning; these classifiers are trained using the dataset from [1], which contains both neutral and insulting text with the respective labels. With the proposed technique, the authors are able to catch cyberbullying with little or no human involvement. The proposed mechanism achieves 84.4% accuracy in detecting cyberbullying.

2 Related Work

Various researchers have attempted to automate the detection of cyberbullying. For example, the researchers in [2] tried to detect harmful or insulting entries based on category relevance maximization. Their method has three steps: phrase extraction, categorization with harmfulness polarity and, in the final stage, relevance maximization. Their results report recall and precision: for test data with 50% harmful or insulting entries, the precision was between 49% and 79%, and for test data containing 12% harmful or insulting entries, the precision was between 10% and 61%. In [3], the researchers used data gathered from Formspring.me, a website that contains a lot of bullying content. This data was labelled using AWS's Mechanical Turk [3], and the Weka tool [3] was then employed to train a model with the C4.5 decision tree. They were able to achieve 78.5% accuracy in recognizing true positives. The researchers in [5] took a corpus of posts from the Formspring.me website and report results in two parts. In the first part, an experiment was performed to find the specific words and context used for bullying; they identified the most relevant cyberbullying terms so that content can be queried to check whether it constitutes cyberbullying or not. In the second part, the researchers used machine learning to detect cyberbullying; the social media posts with the highest scores contained the greatest amount of cyberbullying content. Most of the research on detecting cyberbullying focuses on the content, but not on the features of bullying [7]. The content used by a harasser depends on features such as gender, age, race and skin tone. In [4], the researchers used the content together with the gender of the harasser to train an SVM (Support Vector Machine) classifier. Using this technique, the researchers decreased the gender-based discrimination capabilities of the classifier. They concluded that, for the baseline, precision, recall and F-measure were 0.31, 0.15 and 0.20, respectively, whereas with the gender approach precision, recall and F-measure were 0.43, 0.16 and 0.23. Using the gender-based discrimination capabilities of the classifier, the results showed much improvement. The paper [6] went one step further and proposed to incorporate the characteristics and information of users as context before and after the harassment; with this approach, the researchers improved the accuracy of cyberbullying detection. In [11], the researchers used weakly supervised machine learning to train a model from a small existing vocabulary related to bullying, and then applied that model to a huge corpus of unlabeled data to evaluate whether an interaction is bullying or not. The authors in [12] proposed a unique set of features for cyberbullying detection and performed a comparison with baseline features; the results were tremendously better, achieving an F-measure of 0.936. In our research, standard supervised machine learning is used in combination with ensemble learning, which helped to drastically improve the performance of the cyberbullying detection mechanism.

3 Text Classification

Assigning tags and categories to text based on its content is called Text Classification. It is a very basic NLP (Natural Language Processing) task. With text classification, one can perform tasks such as spam detection, sentiment analysis and much more. In this age of the Internet, where there is a lot of unstructured data in the form
of text, extraction of proper information from resources like internet, text documents,
news articles, electronic mails, different databases, etc. is challenging [8]. In this article,
the goal of identifying social commentary as bullying cannot be achieved without text
classification. In this case, the information contained in text is extremely rich, but due
to the unstructured nature, it is hard to gain insight of data from these comments [10].
Therefore, text classification was used to classify and categorize the comments as
bullying by working together with natural language processing and machine learning
tools & techniques. The text classification results in text classifiers which are used to
categorize, organize and structure any textual information. In this case this classifier is
used to classify comments. A text classifier takes comment text as input, analyses the
content of the text and specifies label as output. Text classification algorithms act as a
key in processing the natural language to extricate and cleave the data into different
classes according to the required specifications [9, 15]. By using different machine learning tools and techniques, a person can extract information from a given document or paragraph and classify it into different groups or classes. The authors believe that ensemble learning can be a better solution as it combines multiple models to improve accuracy. In this case, two techniques, the Voting and AdaBoost classifiers, are applied as examples of ensemble learning, and it is found that ensemble learning provides more accurate results.
Text classification comprises of different phases for filtering or categorizing the
document. It follows a series of steps given below [3, 16].
• Text Preprocessing
• Feature Extraction
• Training Classifier
• Classification Model

3.1 Text Preprocessing


The very first step for text classification is text pre-processing. In this step, the input
document is pre-processed by applying some text pre-processing techniques such as
removing null data and making data clean so that it can be used in the next step.

3.2 Feature Extraction


After the first step, the obtained data which is properly cleaned has to pass through
feature extraction phase. This phase transforms the original data of the data set by
removing redundant data, reducing the dimensions of data and constructing the data in
such a way that it is ready to be put in a model for classification.

3.3 Training Classifier


After feature extraction of data, it is ready to be applied on a classifier. In this phase, the
classifier is fed data for training purpose. After training, the classifier is capable of
observing patterns from data and may help understand the trends in data for future
predictions.

3.4 Classification Model


After training, the final classification model is created according to the given dataset and is used for predicting the outputs.

4 Implementation

The dataset used in this paper is taken from [1]. This dataset contains social commentary with labels marking the comments that are insults. The authors have also performed data exploration on the dataset, shown in Fig. 1, to give an understanding of the dataset, and a glimpse of the dataset is shown in Fig. 2. The dataset has three columns: insult, which is either zero or one, with zero representing a neutral comment and one representing an insult; the date of the comment; and the comment itself, which is a string. Voting and AdaBoost classifiers are used for the prediction of comments. The following subsections elaborate on these classifiers.

Fig. 1. Orange bar shows comments which are not insults and blue shows the comments which
are insults.

Fig. 2. Dataset

4.1 Voting Classifier


A voting classifier combines two or more machine learning algorithms and predicts the class or label based on the average of the probabilities voted by all of the classifiers [14]. It is used when there are models that perform well individually, so that they can be combined to overcome the deficiencies of each classifier. In this paper, the voting classifier is used to combine logistic regression and random forest. The voting classifier helped to achieve an accuracy of 84.4%.

4.2 AdaBoost
The basic idea of AdaBoost is to fit a sequence of learners on repeatedly reweighted versions of the data. The predictions from all learners in the sequence are then combined through a weighted majority vote to produce the final prediction. The data is reweighted and the learning algorithm applied again at each iteration [13, 15]. The ensemble technique creates a classifier after undersampling and oversampling the data with various undersampling and oversampling rates [16]. In this case, AdaBoost is applied to logistic regression: it applies logistic regression to various versions of the data and induces a model that is the best of all models. The AdaBoost classifier achieved an accuracy of 84.04%.
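A minimal scikit-learn construction of the two ensembles described above might look as follows; `X_train` and `y_train` (the vectorized comments and insult labels) are assumed to exist, and the hyperparameters are illustrative.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Soft voting averages the class probabilities predicted by logistic regression
# and random forest, as described in Sect. 4.1.
voting = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=100))],
    voting='soft')

# The paper boosts logistic regression; the keyword used to set the base estimator
# differs across scikit-learn versions, so the default base learner is kept here.
adaboost = AdaBoostClassifier(n_estimators=50)

# voting.fit(X_train, y_train); adaboost.fit(X_train, y_train)
```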
The solution proposed in this paper is carried out in several steps, shown in Fig. 3. In the first step, pandas is used to import the dataset; it converts the given dataset, which is a CSV file, into a pandas data frame. Once the data frame is created, the features and labels, namely the comment and the insult flag, are extracted. After obtaining the comments, stopwords are removed, because these are high-frequency words with very little semantic weight, and the words in each comment are stemmed to avoid variants of words with the same root. A count vectorizer is then applied, which gives an easy and simple method to build a vocabulary from the known words and to tokenize the collection of texts and documents. The authors then create the standard and ensemble classifiers. Once a classifier is created, it is trained with the training data; various percentages and cross folds of the training data, shown in Tables 1 and 2, were used. After the classifier is trained, a prediction vector is produced to see how many correct predictions it was able to make. Based on these readings, the authors calculated the accuracy of the model and also generated training curves to see whether the model is over-fitting or under-fitting.
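A hedged end-to-end sketch of this flow is given below; the CSV path, the column names `Comment` and `Insult`, and the use of NLTK for stop-word removal and stemming are assumptions consistent with the description above, not the authors' exact code.

```python
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')                     # Kaggle insults dataset [1] (assumed path)
stemmer = PorterStemmer()
stops = set(stopwords.words('english'))           # requires nltk.download('stopwords') once

def clean(comment):
    # Drop stopwords (high-frequency, low semantic weight) and stem the remaining words
    tokens = [w for w in str(comment).lower().split() if w not in stops]
    return ' '.join(stemmer.stem(w) for w in tokens)

X = CountVectorizer().fit_transform(df['Comment'].map(clean))   # tokenize and build vocabulary
y = df['Insult']                                                # 0 = neutral, 1 = insult

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)))
```

The ensemble classifiers from the previous sketch can be trained on the same vectorized data in place of the single logistic regression model.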

Fig. 3. Implementation flow

5 Evaluation and Results

In this paper, two parameters are used for evaluating the models. The first parameter is accuracy, which is defined as the sum of true positives and true negatives divided by the sum of true positives, true negatives, false positives and false negatives, as shown in Eq. (1). The second parameter used for the evaluation of the models is the learning curve, a graph that compares the performance of a model over the training and test sets and shows whether the performance can be increased with more data. Using both of these parameters, the obtained results are discussed below.

Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)    (1)

The system proposed in this paper for detecting cyberbullying is tested on different numbers of test-train splits and cross folds with different classifiers, and based on these observations it can be said that the proposed approach for detecting cyberbullying in social commentary can be effective. Table 1 shows the accuracy of Logistic Regression, Naïve Bayes and Random Forest on different test-train splits, and Table 2 shows the accuracy of Logistic Regression, Naïve Bayes and Random Forest at different numbers of cross folds, ranging from 10 to 90. To improve the accuracy of the models, the ensemble classifiers, namely the voting classifier and AdaBoost, are used, and both provided better accuracy than Logistic Regression, Naïve Bayes and Random Forest. Table 3 shows the best performance of every classifier with its respective test-train split and number of cross folds. With the help of the training curves, it is shown which model is a good, worst or best fit. According to the training curve shown in Fig. 4, there is a deviation at the start between the training and testing lines, but the lines converge, which indicates a good fit that can still be improved with more data. From Fig. 5, it is evident that the model is the best fit, because the training and testing lines lie on almost the same points. The model depicted in Fig. 6 is the worst fit because the testing line goes down, and this model cannot be improved because both lines in the ending region seem to be parallel.

Table 1. Accuracy values on different Test Train Split


Test train split % Logistic Regression Naïve Bayes Random Forest
20% 0.821 0.570 0.737
25% 0.803 0.577 0.734
30% 0.822 0.581 0.707
35% 0.824 0.594 0.724
40% 0.826 0.582 0.736

Table 2. Accuracy values on various number of cross folds


Number of cross folds Logistics Regression Naïve Bayes Random Forest
10 0.82418151 0.57158078 0.734228
20 0.82747702 0.56320482 0.73423311
30 0.82699253 0.55964056 0.73422676
40 0.82596835 0.55784930 0.73424072
50 0.82723065 0.55741064 0.73423070
60 0.82575583 0.55576650 0.73427477
70 0.82546862 0.55554568 0.73421272
80 0.82570298 0.55550770 0.73425540
90 0.82574961 0.55555790 0.73431054

Table 3. Results of accuracy


Algorithm name   Max accuracy for CV   Max accuracy for split   Number of folds   Split %
Logistic Regression 0.8274 0.8265 20 40%
Naive Bayes 0.5715 0.5949 10 35%
Random Forest 0.7343 0.7379 80 20%
Voting Classifier Null 0.844 Null 20%
AdaBoost Null 0.840 Null 20%

Fig. 4. Learning curve for Logistic Regression

Fig. 5. Learning curve for Random Forest



Fig. 6. Learning curve for Naïve Bayes
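The learning curves compared in Figs. 4 through 6 can be produced, for example, with scikit-learn's learning_curve utility; in this sketch, `clf`, `X` and `y` are assumed to be an estimator and the vectorized data from the implementation step.

```python
import numpy as np
from sklearn.model_selection import learning_curve

# Mean cross-validated accuracy on the training and test folds for growing
# fractions of the training data.
sizes, train_scores, test_scores = learning_curve(
    clf, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 8), scoring='accuracy')
print('train accuracy:', train_scores.mean(axis=1))
print('test accuracy: ', test_scores.mean(axis=1))
```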

6 Conclusion

Cyberbullying is one of the rising issues accompanying the increase in the usage of the internet and social media. Cyberbullying can have very dangerous consequences, so there is a need for a way to control, detect and remove it. The authors have discussed the methods used by various researchers and then applied standard machine learning algorithms and ensemble machine learning algorithms; the ensemble algorithms performed better. An accuracy of 84.4% was achieved with the help of the voting classifier, and learning curves were drawn to compare the train and test scores over various numbers of instances, concluding that logistic regression is a good fit, random forest is less accurate but the best fit, and Naïve Bayes is the worst-fitting algorithm for detecting cyberbullying. One of the most important potential use cases for these models is applying them on a social networking site to detect cyberbullying in real time with the least human involvement.
Like any other system, the proposed model has certain limitations. The model requires more robustness and it works only for English. For future work, we will apply other state-of-the-art machine learning techniques to improve the accuracy of the model. By doing so, the process of cyberbullying detection on online forums can be fully automated. We will also attempt to add multiple-language support to this model so that cyberbullying can be detected in various languages.

References
1. Detecting Insults in Social Commentary “Kaggle”, Kaggle.com (2019). https://www.kaggle.
com/c/detecting-insults-in-socialcommentary/data. Accessed 09 Apr 2019
2. Nitta, T., et al.: Detecting cyberbullying entries on informal school websites based on
category relevance maximization. In: Proceedings of the Sixth International Joint Conference
on Natural Language Processing (2013)
3. Reynolds, K., Kontostathis, A., Edwards, L.: Using machine learning to detect cyberbul-
lying. In: 2011 10th International Conference on Machine learning and applications and
workshops, vol. 2. IEEE (2011)
4. Dadvar, M., et al.: Improved cyberbullying detection using gender information. In:
Proceedings of the Twelfth Dutch-Belgian Information Retrieval Workshop (DIR 2012).
University of Ghent (2012)
5. Kontostathis, A., et al.: Detecting cyberbullying: query terms and techniques. In:
Proceedings of the 5th Annual ACM Web Science Conference. ACM (2013)
6. Dadvar, M., et al.: Improving cyberbullying detection with user context. In: European
Conference on Information Retrieval. Springer, Berlin (2013)
7. DeGregory, K.W., et al.: A review of machine learning in obesity. Obes. Rev. 19(5), 668–
685 (2018)
8. Wu, J.-Y., Hsiao, Y.-C., Nian, M.-W.: Using supervised machine learning on large-scale
online forums to classify course-related Facebook messages in predicting learning
achievement within the personal learning environment. In: Interactive Learning Environ-
ments, pp. 1–16 (2018)
9. Balyan, R., McCarthy, K.S., McNamara, D.S.: Comparing machine learning classification
approaches for predicting expository text difficulty. In: The Thirty-First International Flairs
Conference (2018)
10. Hoogeveen, D., et al.: Web forum retrieval and text analytics: a survey. Found. Trends® Inf.
Retrieval 12(1), 1–163 (2018)
11. Raisi, E., Huang, B.: Cyberbullying detection with weakly supervised machine learning. In:
Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social
Networks Analysis and Mining 2017, pp. 409–416. ACM, July 2017
12. Al-garadi, M.A., Varathan, K.D., Ravana, S.D.: Cybercrime detection in online commu-
nications: the experimental case of cyberbullying detection in the Twitter network. Comput.
Hum. Behav. 63, 433–443 (2016)
13. Randhawa, K., et al.: Credit card fraud detection using AdaBoost and majority voting. IEEE
Access 6, 14277–14284 (2018)
14. Voting Classifier. https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier.
Accessed 24 Apr 2019
15. Ensemble Methods. Scikit. https://scikit-learn.org/stable/modules/ensemble.html#adaboost.
Accessed 24 Apr 2019
16. Rahman, H.A.A., Wah, Y.B., He, H., Bulgiba, A.: Comparisons of ADABOOST, KNN,
SVM and logistic regression in classification of imbalanced dataset. In: International
Conference on Soft Computing in Data Science, pp. 54–64. Springer, Singapore, September
2015
Predicting the Risk Factor for Developing
Chronic Kidney Disease Using a 3-Stage
Prediction Model

Hossam Medhat Aly1(&), Mohamed Aborizka1,


and Soha Safwat Labib2
1
Arab Academy for Science, Technology and Maritime Transport, Cairo, Egypt
hossamemedhat@gmail.com, m.aborizka@aast.edu
2
The Egyptian Chinese University, Cairo, Egypt
Soha.cs@gmail.com

Abstract. Chronic Kidney Disease (CKD) is considered one of the major high-risk chronic diseases affecting human health and causes death in its late stages. Moreover, treating CKD patients costs huge amounts, depending on their stage. It therefore becomes significant not only to detect the disease in its early stages, but also to have a way to assess and predict, early on, the possibility of individuals being affected in the future. In this research, a 3-Stage predictor is introduced to help predict the risk factor for developing CKD during healthcare screening, based on a questionnaire and some laboratory tests. By categorizing parameters into stages, it also aims to reduce and eliminate unjustified test costs, so that tests are requested only when they are needed for the assessment. A comparison between 12 classifiers led to choosing the 3 classifiers used in designing the 3-Stage model, based on the best accuracy and prediction speed. The 3-Stage model is designed using Bagged Trees, Boosted Trees and Medium Tree classifiers. The model was assessed on a dataset collected from the Centers for Disease Control and Prevention (CDC) in the United States. The trained 3-Stage model resulted in 99.97% accuracy in predicting around 3K cases, in comparison with a 1-Stage model.

Keywords: Machine learning  Healthcare  CKD  Risk factor  Data mining

1 Introduction

Chronic Diseases have been a burden for many years worldwide. They are the main
reason of death in many countries, especially in developing ones with low- and middle-
income. They cause around 80% of the deaths in these countries due to the poverty and
economic instability [1]. Chronic Kidney Disease (CKD) is one of those chronic dis-
eases, by which the kidney is damaged and its function to filter the blood from the
waste is degraded; this allows harmful materials to remain in human’s body and leads
to other health problems. CKD is perilous as it can show no symptoms at all till a
massive damage occurs that cannot be recovered. That’s why the first step to treat a
CKD patient is to early detect the disease [2]. Moreover, the early detection of CKD
can be achieved by monitoring some risk factors like: Diabetes, High blood pressure,


Cardiovascular (heart and blood vessel) disease, Anemia, Analgesic (taken long time
medicines) or by knowing that there is a family history of kidney failure [3].
In fact, CKD is ranked as the 12th cause of death worldwide, while in Egypt it is ranked the 6th cause of death [4]. According to statistics in the US, it is estimated that one in every seven people is affected by CKD. In addition, around 15% of adults, which constitutes around 30 million people, are affected by CKD. It is also stated that about 96% of those who have a damaged or reduced-function kidney are not aware of having CKD. Further statistics reveal that the ratios of being affected by CKD due to other diseases that are considered risk factors, like diabetes or high blood pressure, are 1 in every 3 and 1 in every 5, respectively [5].
From the economic perspective, Honeycutt et al. [6] mentioned that, based on their research, the annual medical costs for a CKD patient were $1,700, $3,500 and $12,700 for CKD stages 1, 2 and 3, respectively. This shows that detecting CKD early does not only help in better addressing treatment for CKD patients, but also helps in reducing its cost by diagnosing the disease in its early stages [6]. Moreover, knowing the risks of CKD from different aspects, such as human health and the economy, motivated researchers to work on reducing these impacts. That is why there have been several attempts to detect CKD through technology by engaging computer systems.
Nowadays, the world is taking huge steps in the technological revolution, and with the need for using computers in every field, the development of new and innovative techniques to solve problems has become a demand. Advancing computer systems to be more capable of monitoring, analyzing, learning, predicting risks and even suggesting solutions for real problems in different domains makes researchers more curious to make the best use of them to solve more issues and make the world a better place. One of these domains is the healthcare and medicine field. Many research efforts adopt the involvement of advanced computer systems and algorithms, like data mining and machine learning, in the healthcare field in different ways, such as early detection of some diseases and identification of their stages, or proactive prediction of the possibility of being affected by these diseases [7].
In this paper, a 3-Stage predictor model is introduced. Section 2 briefly discusses the background and related work on CKD and machine learning models. In Sect. 3, the introduced predictor is explained in detail, starting from the motive and objective, going through the designed model, dataset, techniques and methodology, and finally discussing the experimental results. Finally, the conclusion and some recommendations for future work are given in Sect. 4.

2 Related Work

Data Mining and Machine Learning are predominantly used when it comes to detecting
and predicting diseases in the healthcare industry [7]. Several research papers have attempted to predict CKD by comparing multiple classification algorithms and concluding their findings with regard to performance and accuracy. One of these studies included the Naïve Bayes and Support Vector Machine (SVM) algorithms in the comparison, finding that the SVM was better in terms of performance [7].

Another reference, by Koklu & Tutuncu [8], performed an evaluation between Naive Bayes, the C4.5 algorithm, SVM and the Multilayer Perceptron to detect CKD, finding that the correct detection percentages were 95.00%, 97.75%, 99.00% and 99.75% for Naive Bayes, the C4.5 algorithm, SVM and the Multilayer Perceptron, respectively, on a dataset of 400 samples and 25 attributes [8].
Anantha Padmanaban and Parthiban [9] used the Naive Bayes and the decision tree
algorithms to conclude that the accuracy was 91% for the decision tree classification on
a larger dataset of 600 samples and 13 attributes. In addition, Sharma et al. [10],
evaluated 12 classification models on CKD UCI dataset that consists of 400 samples
and 24 attributes. The tested classifiers in this paper were: Decision Tree, Linear
Discriminant, Quadratic Discriminant, Linear SVM, Quadratic SVM, Fine KNN,
Medium KNN, Cosine KNN, Cubic KNN, Weighted KNN, FFBPNN (GD) and
FFBPNN (LM). The results show that the decision tree achieved the best scores, with an accuracy recorded as 98.6% [10].
Moreover, Pavithra & Shanmugavadivu [2] worked on the early prediction of kidney failure with the Fuzzy C-Means algorithm on the UCI CKD dataset. Besides predicting CKD and its severity, the research of Tahmasebian et al. [11] focused on determining the relation between parameters and their degree of relevance for detecting CKD.
By inspecting the different researchers' perspectives on this topic, it can be observed that these attempts aim to detect Chronic Kidney Disease based on a certain set of parameters by applying a number of classification methods, but they do not necessarily predict the possibility of being affected by CKD in the first place. Moreover, these trials used the whole set of parameters in one stage, which might lead to performing unnecessary laboratory tests and, consequently, unjustified costs.

3 Objective

The objective of this paper is to introduce a cost-effective prediction model to identify who might be at risk of developing CKD during the frequent screening processes in clinics and hospitals. This will help in proactively overseeing the possibility of an individual being affected and trying to avoid it. Although the early prediction itself should help decrease treatment costs, the prediction model designed in this paper is also meant to assess the inputs in stages to eliminate unjustified lab test costs: the predictor asks for more parameters only when there is a higher possibility of being affected by CKD. This is achieved by placing the questionnaire and basic screening checkups before the lab tests. Additionally, a larger dataset with more than 5K patients' real records is used to train the classifiers.

3.1 Model
The motive of this research is to help people and governments reduce the costs of CKD treatment by predicting the risk of being affected by CKD during regular healthcare screening. Accordingly, a 3-Stage predictor model is implemented that works on a questionnaire and basic screening checkups in its first stage; then, if there is a risk based on these inputs, the predictor asks for more lab tests to confirm its assessment. The 3-Stage model is presented in Fig. 1.

Fig. 1. 3-Stage CKD risk predictor

The model is designed to work on 13 CKD related data attributes, categorized per
stage, as presented in Fig. 2, based on real medical diagnostic practices. The 3-Stage
predictor outputs the CKD risk values described in Table 1.
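The staged flow of Fig. 1 can be summarized by the following illustrative control-flow sketch (not the authors' MATLAB implementation); the stage models and the callbacks that request additional lab values are assumed objects, and the exact feature grouping per stage follows Fig. 2.

```python
def predict_ckd_risk(stage1_params, request_stage2_params, request_stage3_params,
                     stage1_model, stage2_model, stage3_model):
    # Stage 1: questionnaire answers and basic screening checks only
    risk = stage1_model.predict([stage1_params])[0]
    if risk == 'LR':                             # low risk: no lab tests are requested
        return 'LR'
    # Stage 2: the first group of lab tests is requested only when stage 1 flags a risk
    stage2_params = request_stage2_params()      # e.g. LDL cholesterol, 2-hour glucose (OGTT)
    risk = stage2_model.predict([stage1_params + stage2_params])[0]
    if risk == 'AR':
        return 'AR'
    # Stage 3: the final lab test (albumin creatinine ratio) for the remaining cases
    stage3_params = request_stage3_params()
    return stage3_model.predict([stage1_params + stage2_params + stage3_params])[0]
```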

(Stage 1 uses the questionnaire and screening-check attributes and outputs LR or AR; Stage 2 adds the LDL cholesterol and two-hour glucose (OGTT) lab tests and outputs AR or HR; Stage 3 adds the albumin creatinine ratio lab test and outputs HR, RCKD or CKD.)

Fig. 2. The input data attributes, per stage, for the 3-Stage predictor

Table 1. 3-Stage model output risk values


Risk values Description
Low Risk (LR) As general population, no follow-up needed
At Risk (AR) Needs follow-up
High Risk (HR) High potential, needs follow-up
Reversible CKD (RCKD) Affected Kidney, can recover, needs treatment and follow-up
CKD Affected Kidney, mostly not recoverable, needs treatment to avoid
deterioration

3.2 Dataset
In this research, large datasets with a total of around 30K real patients' records and 239 attributes were collected to support experimenting with the proposed 3-Stage model predictor. The datasets are part of the National Health and Nutrition Examination Survey (NHANES), between 2011 and 2016, provided by the Centers for Disease Control and Prevention (CDC), US [12].

3.3 Techniques and Methods

Data Preparation. To ensure relevance to CKD medical diagnostics, nephrologists were interviewed for parameter selection, focusing on adult patients (>18 years old). Accordingly, the datasets were merged and filtered using SAS software to produce, after data cleansing, a ready dataset with around 8K records and 16 attributes. Table 2 shows the list of attributes of the dataset.

Table 2. Dataset parameters


Parameter Type Values
SEQN Patient Identifier Number
Age General Number
Gender General Female, Male
Doctor told you have diabetes? Question Yes, No, Borderline, Don’t Know
Ever told you had high blood pressure? Question Yes, No, Don’t Know
Doctor told you have high cholesterol level? Question Yes, No, Don’t Know
Do you now smoke cigarettes? Question No, Daily, Some days
Family history of diabetes? Question Yes, Don’t Know, N/A
Weight Screening check (kg)
Standing height Screening check (cm)
Body mass index Screening check (kg/m**2)
Systolic: Blood pres (2nd rdg) Screening check (mm Hg)
Diastolic: Blood pres (2nd rdg) Screening check (mm Hg)
LDL cholesterol Lab test (mg/dL)
Two hour glucose (OGTT) Lab test (mg/dL)
Albumin creatinine ratio Lab test (mg/g)
CKD Risk Result LR, AR, HR, RCKD, CKD

In addition, the last attribute, CKD Risk, was populated according to nephrologists' real medical diagnostic practices, in preparation for training the machine learning classifiers. Figures 3 and 4 show the conditions applied to determine the CKD risk values of the training dataset as part of preparing the data; an illustrative sketch of these rules follows Fig. 4.

(Decision matrix for the level 1 parameters: combinations of the questionnaire answers (diabetes, high blood pressure, high cholesterol, smoking, family history of diabetes) with the screening checks (body mass index >= 30 kg/m**2, i.e. obese, and high systolic or diastolic blood pressure readings) are mapped to At Risk; all remaining cases are mapped to Low Risk.)
Fig. 3. CKD Risk - level 1 parameters

(Decision matrix for the level 2 and level 3 parameters: at level 2, LDL cholesterol >= 160 (high) or a two-hour glucose (OGTT) >= 200 (diabetes) maps to High Risk, otherwise At Risk; at level 3, an albumin creatinine ratio above 300 (severely increased) maps to CKD, a ratio between 30 and 300 (moderately increased) maps to RCKD, and other values map to High Risk.)
Fig. 4. CKD Risk - level 2, 3 parameters
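An illustrative reconstruction of the labelling rules suggested by Figs. 3 and 4 is sketched below; the field names are hypothetical and the thresholds are those recoverable from the figures, so the actual decision matrix agreed with the nephrologists may differ in detail.

```python
def label_ckd_risk(rec):
    """rec: dict of questionnaire, screening-check and lab-test values for one patient."""
    level1_flag = (
        rec.get('diabetes') == 'Yes'
        or rec.get('high_blood_pressure') == 'Yes'
        or rec.get('high_cholesterol') == 'Yes'
        or rec.get('smokes') in ('Daily', 'Some days')
        or rec.get('family_history') == 'Yes'
        or rec.get('bmi', 0) >= 30                        # obese
        or rec.get('systolic_high', False)
        or rec.get('diastolic_high', False)
    )
    if not level1_flag:
        return 'LR'                                       # Low Risk
    # Level 2 lab tests
    if rec.get('ldl', 0) < 160 and rec.get('ogtt', 0) < 200:
        return 'AR'                                       # At Risk
    # Level 3 lab test: albumin creatinine ratio (mg/g)
    acr = rec.get('acr', 0)
    if acr > 300:
        return 'CKD'                                      # severely increased
    if acr >= 30:
        return 'RCKD'                                     # moderately increased
    return 'HR'                                           # High Risk
```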

Classifiers Selection. The classifiers for the 3-Stage predictor model were selected based on the best accuracy and prediction speed. Twelve different classifiers were compared for every stage of the 3-Stage model using MATLAB. Tables 3, 4 and 5 summarize the results of the classifiers for each of the 3 stages. Moreover, a comparison was made by adding all parameters into one stage, to choose the best classifier for the 1-Stage model, as shown in Table 6. Similarly, Figs. 5 and 6 display the differences between the classifiers with respect to accuracy and prediction speed in each stage.
The accuracy is calculated by dividing the correctly classified cases over the total
number of cases available in the testing dataset. Equation (1) shows the accuracy
calculation rule where, T is the number of True Predictions and N is the total number of
cases in the testing dataset.

Accuracy = T/N    (1)



Table 3. Comparison between classifiers accuracies – Stage 1 in the 3-Stage Model


Classifier S1 accuracy (%) S1 pred. speed (obs/sec) S1 training time (seconds)
Fine Tree 99.800% 230000 1.52500
Medium Tree 99.800% 200000 0.74809
Coarse Tree 99.400% 240000 0.64763
Linear SVM 94.400% 96000 6.50760
Quadratic SVM 96.800% 120000 12.97000
Cubic SVM 97.300% 160000 35.89600
Fine Gaussian SVM 95.200% 12000 19.35500
Medium Gaussian SVM 96.600% 57000 10.31700
Coarse Gaussian SVM 93.000% 26000 16.80600
Boosted Trees 99.800% 18000 21.28600
Bagged Trees 100.000% 23000 22.72500
RUSBoosted Trees 95.000% 28000 26.15000

Table 4. Comparison between classifiers accuracies – Stage 2 in the 3-Stage Model


Classifier S2 accuracy (%) S2 pred. speed (obs/sec) S2 training time (seconds)
Fine Tree 99.900% 360000 1.29570
Medium Tree 99.900% 320000 0.71973
Coarse Tree 99.900% 350000 0.60191
Linear SVM 56.400% 110000 4.61980
Quadratic SVM 59.400% 94000 6.16700
Cubic SVM 59.800% 83000 13.93500
Fine Gaussian SVM 59.500% 31000 8.94630
Medium Gaussian SVM 59.600% 98000 7.13280
Coarse Gaussian SVM 58.200% 73000 7.88910
Boosted Trees 100.000% 28000 10.86000
Bagged Trees 99.900% 28000 11.82500
RUSBoosted Trees 99.900% 28000 13.30800

Table 5. Comparison between classifiers accuracies – Stage 3 in the 3-Stage Model


Classifier S3 accuracy (%) S3 pred. speed (obs/sec) S3 training time (seconds)
Fine Tree 100.000% 31000 1.42540
Medium Tree 100.000% 330000 0.79847
Coarse Tree 100.000% 220000 0.63936
Linear SVM 94.600% 100000 5.52720
Quadratic SVM 94.600% 34000 122.52000
Cubic SVM 94.600% 66000 204.44000
Fine Gaussian SVM 94.500% 55000 6.43870
Medium Gaussian SVM 94.500% 48000 11.90400
Coarse Gaussian SVM 94.100% 43000 12.85900
Boosted Trees 59.100% 330000 12.32100
Bagged Trees 100.000% 23000 16.27800
RUSBoosted Trees 91.800% 93000 14.24600

Table 6. Comparison between classifiers accuracies – All in one for the 1-Stage Model
Classifier   All in 1-Stage accuracy (%)   1-Stage pred. speed (obs/sec)   1-Stage training time (seconds)
Fine Tree 99.300% 170000 1.55600
Medium Tree 98.800% 170000 0.66212
Coarse Tree 91.000% 120000 2.57550
Linear SVM 50.400% 63000 6.05280
Quadratic SVM 54.700% 30000 11.52900
Cubic SVM 55.000% 29000 15.84500
Fine Gaussian SVM 45.700% 3300 26.91200
Medium Gaussian SVM 53.700% 16000 13.40100
Coarse Gaussian SVM 48.300% 12000 19.36400
Boosted Trees 99.200% 19000 24.33800
Bagged Trees 99.100% 17000 23.81500
RUSBoosted Trees 52.800% 18000 24.73300


Fig. 5. Classifiers’ accuracy comparison for the 3-Stage and the 1-Stage models

3.4 Results and Discussion


After assessing the classifiers for each stage, the Bagged Trees classifier was chosen for the first stage, the Boosted Trees classifier for the second stage and the Medium Tree classifier for the third stage of the 3-Stage model. The Fine Tree classifier was picked for the 1-Stage model.


Fig. 6. Classifiers’ prediction speed comparison for the 3-Stage and the 1-Stage models

Both, the 3-Stage and 1-Stage models were trained on around 5K records and tested
on around 3K records. The 3-Stage model shows better accuracy of 99.97% compared
to the 1-Stage model with accuracy of 99.16%. Table 7 illustrates the differences
between the 2 compared models in regard to the accuracy. Furthermore, Fig. 7 presents
the confusion matrix for both models as per the predicted values over the 3K records.

Table 7. Prediction accuracies of the 3-Stage model and 1-Stage model


Model Input parameters Prediction accuracy
CKD 3-Stage predictor model Each stage has its set of parameters 99.97%
CKD 1-Stage predictor model All 13 at once 99.16%

(In the 3-Stage model's confusion matrix, all five classes are predicted with per-class accuracies of 0.98–1.00, while the 1-Stage model shows noticeable confusion, with per-class accuracies dropping to 0.96 for High Risk, 0.86 for RCKD and 0.45 for CKD.)
Fig. 7. 3-Stage Model vs. 1-Stage Model - confusion matrix



Based on the experimental results, splitting the CKD attributes into stages, considering their relevance to medical diagnostic practice, has enhanced the prediction accuracy compared with passing them all at once to the 1-Stage predictor.

4 Conclusion and Recommendations

In this research paper, a 3-Stage model was designed to predict the risk factor for developing Chronic Kidney Disease (CKD). Different classifiers were compared for each of the three stages, and those with the best accuracy and prediction speed were chosen. The 3-Stage model was tested and compared to a 1-Stage model and showed better prediction accuracy. In conclusion, multi-stage predictors can help the healthcare and medicine fields to detect and predict diseases like CKD early, during regular healthcare screenings. In addition, the early prediction of CKD using the 3-Stage model will help in avoiding unnecessary tests and costs.
One of the limitations and challenges in this research was related to the dataset used in building the model and conducting the experimental work. Obtaining a trusted and diverse dataset was a limitation, since there are no digital patient records holding information particularly relevant to CKD in Egypt. That was resolved by obtaining the dataset from the CDC.
As a future recommendation, more trials and experiments can be considered with respect to multi-stage prediction models, in which more data attributes can be added and more classifiers can be compared, not only for CKD but also for other diseases whose occurrence can be predicted during regular healthcare screening.

References
1. WHO, Preventing chronic diseases: a vital investment, World Health Organization (2005).
https://www.who.int/chp/chronic_disease_report/part1/en/
2. Pavithra, N., Shanmugavadivu, R.: Efficient early risk factor analysis of kidney disorder
using data mining technique, pp. 1690–1698 (2017)
3. American Kidney Fund, Chronic kidney disease (CKD). http://www.kidneyfund.org/kidney-
disease/chronic-kidney-disease-ckd/#what_causes_chronic_kidney_disease
4. LeDuc Media, World Rankings - Total Deaths. https://www.worldlifeexpectancy.com/
world-rankings-total-deaths
5. National Chronic Kidney Disease Fact Sheet (2017)
6. Honeycutt, A.A., Segel, J.E., Zhuo, X., Hoerger, T.J., Imai, K., Williams, D.: Medical costs
of CKD in the medicare population. J. Am. Soc. Nephrol. 24(9), 1478–1483 (2013)
7. Vijayarani, S., Dhayanand, S.: Data mining classification algorithms for kidney disease
prediction. J. Cybern. Inform. 4(4), 13–25 (2015)
8. Koklu, M., Tutuncu, K.: Classification of chronic kidney disease with most known data
mining methods. Int. J. Adv. Sci. Eng. Technol. 5(2), 14–18 (2017)
9. Anantha Padmanaban, K.R., Parthiban, G.: Applying machine learning techniques for
predicting the risk of chronic kidney disease. Indian J. Sci. Technol. 9(29), 1–5 (2016)
10. Sharma, S., Sharma, V., Sharma, A.: Performance based evaluation of various machine
learning classification techniques for chronic kidney disease diagnosis (2016)
11. Tahmasebian, S., Ghazisaeedi, M., Langarizadeh, M., Mokhtaran, M.: Applying data mining
techniques to determine important parameters in chronic kidney disease and the relations of
these parameters to each other. J. Ren. Inj. Prev. 6(2), 83–87 (2017)
12. CDC NHANES Dataset, Centers for Disease Control and Prevention (CDC). National
Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey
Data. Hyattsville, MD, U.S. Department of Health and Human Services, Centers for Disease
Control and Prevention. https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/
Classification of Diabetic Retinopathy
and Retinal Vein Occlusion in Human Eye
Fundus Images by Transfer Learning

Ali Usman1, Aslam Muhammad1(&), A. M. Martinez-Enriquez2,


and Adrees Muhammad3
1
Department of Computer Science and Engineering, UET, Lahore, Pakistan
usmanali483@gmail.com, maslam@uet.edu.pk
2
Department of Computer Science, CINVESTAV, Mexico City, Mexico
ammartin@cinvestav.mx
3
Department of Computer Science, Superior University, Lahore, Pakistan
adreesgujer@gmail.com

Abstract. Sight-threatening diseases are widespread these days. Some of them are so harmful that they may cause complete vision loss. Diabetic Retinopathy (DR) and Retinal Vein Occlusion (RVO) fall into this category. The first step in curing such diseases is to predict them accurately. A large number of machine/deep learning algorithms have been employed for the prediction of these diseases. In this research, we propose a DR&RVO prediction system, which may help eye specialists in the prediction of these diseases. In the proposed methodology, a retinal image passes through three main steps in a Deep Neural Network (DNN): preprocessing, image segmentation, and feature extraction and classification. For classification of the processed image into DR, RVO and normal labels, pre-trained deep neural networks (DNNs) are used. More than 2680 eye fundus images are collected from 7 online available datasets; all images are converted to the JPG file format during the preprocessing step. After distributing the class labels into three categories, the proposed model is first trained and then randomly tested on Inception v3, ResNet50 and AlexNet. This is done by using a deep learning technique named 'Transfer Learning'. The accuracy obtained from these models shows that Inception v3 (85.2%) outperformed the other two state-of-the-art models.

Keywords: Diabetic Retinopathy (DR)  Retinal Vein Occlusion (RVO) 


Transfer Learning (TL)  Deep Neural Networks (DNNs)

1 Introduction

Diabetes is a widely spreading disease all over the globe. According to the International Diabetes Federation (IDF), 381.8 million patients were suffering from diabetes as of 2018, and the number is expected to increase to 591.9 million (by 55%) by 2030 [1]. Diabetes is characterized by an excess blood sugar level. It can affect the nervous system of the eye, the kidneys, the heart, etc. Because the eye is one of the most sensitive parts of our body, it should be treated with care. Most of the patients that suffer from diabetes are victims of diabetic
retinopathy. This disease becomes active in severe cases of diabetes, which is why early detection of diabetic retinopathy is difficult. In order to avoid serious damage to the eye and to avoid visual impairment or complete blindness, timely detection and treatment of DR is necessary [2]. There are many disorders of the human eye in which complete vision loss can occur, such as diabetic retinopathy, cataract, hypertension, macular degeneration, neovascularization, hemorrhages, glaucoma, retinal artery occlusion and retinal vein occlusion [3]. All the diseases mentioned here are injurious to eyesight, but the two most common of them, DR and RVO, are the focus of our study [4]. People suffering from moderate DR are supposed to have good eyesight, but the forms of DR that can cause vision loss are of two types: diabetic macular oedema (DMO) and proliferative diabetic retinopathy (PDR) [5]. DR is a complication of diabetes and is the main cause of retinal blood vessel destruction. DR occurs in diabetes patients: patients with weak diabetic control suffer from very high blood sugar levels for a long period of time, and this leads to retinal blood vessel destruction. DR causes damage to up to 8 out of 10 patients who suffer from diabetes for ten or more years [6]. The retina is the layer of tissue at the back side of the eye; it functions to detect light, which enables human beings to see. In DR, the light-sensitive tissues at the back of the retina are affected, which may lead to vision loss.
The formation of a blood clot in the retina leads to RVO [7]. Whenever an artery crossing over a vein causes the vein to burst, leading to hemorrhage, this further causes RVO. It also occurs due to a complete blockage of blood vessels. When a painless reduction of eyesight occurs suddenly in elderly people, it may be a symptom of RVO, in case one of the veins is blocked. This prevents blood transport in the eye and further leads to fluid and blood leakage in parts of the retina, with swelling, bruising and a deficiency of oxygen at the rear side of the retina. This is the basic interference with light detection by the retina and causes lack of vision. This condition is not common before 50 years of age, but its frequency increases with age. Like DR, RVO is also a commonly occurring disease of the retina of the human eye, and it causes blindness. RVO has three types [8]: Branch Retinal Vein Occlusion (BRVO), Central Retinal Vein Occlusion (CRVO) and Hemi-Retinal Vein Occlusion (HRVO). BRVO occurs at a branch vein, when one of the four veins is blocked; each vein drains almost one fourth of the retina. CRVO affects the central vein and occurs as a result of central vein blockage; the central vein drains blood from about the whole retinal area, making this the most dangerous type for vision loss in general. HRVO occurs at a sub vein. BRVO is the most commonly occurring of all of the above types. The exact cause of RVO is unknown. The formation of a blood clot in a retinal vein disturbs the flow of blood due to high blood pressure, high cholesterol, glaucoma, smoking, diabetes and certain rare blood disorders. The main difference between DR and RVO is that DR occurs in diabetes patients and its fractal dimensions are totally different from those of RVO patients [9].
These two diseases are classified through different supervised and unsupervised machine learning algorithms. In most of the state-of-the-art work, segmentation techniques are employed for DR prediction and fractal dimension analysis for RVO prediction. Still, no deep/machine learning algorithm has been used for the prediction of both diseases through the same model with improved accuracy [10]. A neural network model is proposed in this research that can predict multiple classes, namely DR, RVO and normal images, by using the transfer learning technique. The main objective of this research is to
provide a single interface to medical specialists so that they can predict the above-mentioned eye diseases with improved accuracy.
In Sect. 1, a brief introduction to eye diseases is provided, together with the problem statement and research objective. Section 2 covers the current state-of-the-art work with a parametric analysis of the techniques used. The proposed methodology is described in Sect. 3. Section 4 presents the proposed DR&RVO prediction system, results and discussion. Section 5 gives the conclusion and future directions.

2 Related Work

Deep/machine learning techniques are widely used in retinal image diagnosis throughout the world. Eye disease prediction, vessel segmentation, optic disc location, DR severity level detection, classification of DR types and detection of CRVO or BRVO by many supervised, semi-supervised and unsupervised state-of-the-art algorithms are briefly discussed in Table 1.

Table 1. Parametric comparison of algorithms employed for DR and RVO prediction


Author(s)                        | Technique(s)            | Prediction                                  | Dataset(s)                    | Performance
Ramachandran et al. [11]         | DNNs                    | DR types classification                     | Otago (485), MESSIDOR (1200)  | Sensitivity and specificity 80%
Roy et al. [12]                  | CNNs                    | DR severity detection                       | Kaggle                        | Performance 86%
Dutta et al. [13]                | DNN                     | DR                                          | Kaggle                        | Accuracy 72.5%
Pratt et al. [6]                 | CNN                     | DR severity levels                          | Kaggle                        | Accuracy 75%
Honnungar, Mehra and Joseph [14] | SVM, MLR, Random Forest | DR severity level grading                   | Kaggle                        | SVM (68%), MLR (73%), Random Forest (72%)
Padmanabha et al. [15]           | SVM, ANN                | DR classification                           | E-OPHTA                       | SVM performed better than ANN
Annunziata et al. [16]           | Unsupervised            | Vessel segmentation                         | STARE, HRF                    | Accuracies
Christodoulidis et al. [17]      | Hybrid method           | Vessel segmentation                         | Erlangen database             | Sensitivity 85%
Saleh et al. [18]                | Hybrid approach         | DR prediction and severity level assessment | Local dataset                 | Accuracy and sensitivity up to 80%
Hassan et al. [19]               | K-Means clustering      | DR prediction                               | DRIVE                         | Accuracy 95.25%
clustering

3 Proposed Methodology
3.1 ANN Learning
The starting point of learning the network is to determine the internal parameters and
readjust the weights and biases. Randomly generated input data is then fed to the
network for training, and the ANN performs weight estimation and optimization. An
error function is defined to measure the error on the training data, and training stops
when the error is less than or equal to a predefined value for all patterns. A schematic
representation of the ANN learning process is shown in Fig. 1.

Fig. 1. Illustration of ANN learning procedure
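As an illustration of this loop (determine the parameters, feed training data, compute the error function, stop at a predefined error), the following minimal NumPy sketch trains a toy one-hidden-layer network. The data, layer sizes, learning rate and error threshold are hypothetical choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for the feature vectors (hypothetical shapes).
X = rng.random((100, 4))
y = (X.sum(axis=1, keepdims=True) > 2.0).astype(float)

# One hidden layer; the weights and biases are the internal parameters to readjust.
W1, b1 = rng.normal(scale=0.1, size=(4, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(scale=0.1, size=(8, 1)), np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, target_error = 0.5, 0.01
for epoch in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    error = np.mean((y - out) ** 2)      # error function on the training data
    if error <= target_error:            # training ceases at the predefined value
        break
    # Backpropagation of the mean squared error through the sigmoids
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0, keepdims=True)
```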

3.2 Transfer Learning in Pre-trained Neural Networks


Deep neural networks are known for giving good results through transfer learning, a
deep learning technique in which, instead of building a deep neural network from
scratch, an existing pre-trained network is modified according to the dataset at hand.
For example, if a neural network is trained for the prediction and classification of 1000
classes, we can replace the last layers of that network so that it predicts the labels in
our own training image data [20].
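A minimal sketch of this idea, assuming a Keras/TensorFlow setup (the paper does not state which framework was used): the pre-trained Inception v3 base is loaded without its 1000-class top, frozen, and a new 3-class softmax layer is attached for the DR/RVO/Normal labels. The image size and optimizer below are also assumptions.

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

# Pre-trained base without the original 1000-class classification head.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                   input_shape=(299, 299, 3))
for layer in base.layers:          # keep the pre-trained weights fixed (transfer learning)
    layer.trainable = False

# New last layer: 3 labels (DR, RVO, Normal) instead of the original 1000 classes.
outputs = Dense(3, activation="softmax")(base.output)
model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```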

3.3 Dataset Collection


The dataset for this research work was collected from 7 databases. Each database uses
a different image file format; files that are not already in JPG format are converted to
JPG. A detailed analysis of the image formats and the number of images collected
from each dataset is given in Table 2.

Table 2. DR, RVO and normal image dataset


Datasets Total images Original type Converted type
CHASEDB1 42 JPG None
DIARETDB0 51 PNG JPG
DIARETDB1 89 PNG JPG
DRIVE 60 GIF JPG
MESSIDOR 397 PPM JPG
IDRID 697 JPG None
E-ophtha 1574 TIF JPG
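The conversion step can be done with a short script such as the following sketch, which assumes a Pillow installation and hypothetical folder names; the PNG, GIF, TIF and PPM files listed in Table 2 are re-saved as JPG.

```python
from pathlib import Path
from PIL import Image

SRC = Path("datasets/raw")   # hypothetical input folder holding the seven collections
DST = Path("datasets/jpg")   # hypothetical output folder with every image as JPG
DST.mkdir(parents=True, exist_ok=True)

for img_path in SRC.rglob("*"):
    if img_path.suffix.lower() not in {".png", ".gif", ".tif", ".tiff", ".ppm", ".jpg", ".jpeg"}:
        continue                                    # skip annotation files, etc.
    with Image.open(img_path) as im:
        # Convert to 3-channel RGB (GIF/PNG may carry a palette or alpha) and save as JPG.
        im.convert("RGB").save(DST / (img_path.stem + ".jpg"), "JPEG", quality=95)
```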

3.4 Class Definition and Plan for Prediction


We have defined three class labels for classification through the ANNs. A brief
definition of each class is given in Table 3.

Table 3. Class label definitions and proposed predictive models


Class  | Definition                                                      | Predictive model(s)
Normal | Healthy eye fundus images having normal fractal dimensions      | Inception v3 [21], ResNet50 [22], AlexNet [23]
DR     | Damage to the light-sensitive tissue at the back of the retina  | Inception v3 [21], ResNet50 [22], AlexNet [23]
RVO    | Formation of a blood clot in the retina due to vein bursting    | Inception v3 [21], ResNet50 [22], AlexNet [23]

The proposed research on retinal images is carried out in three main steps, shown in
Fig. 2:
• Preprocessing
All the images collected from the different datasets come in different file formats
(PNG, PPM, JPG, TIF and GIF), so before further processing they are converted to a
single format. In our case, the files that are not in JPG format are first converted to JPG
by a Python script, prior to the remaining preprocessing steps (a minimal sketch of
these steps is given after this list). An input image is then passed to the input layer of
the CNN, the convolutional layer, where various convolutional filters are applied. The
image is converted to the RGB channel space, and noise removal and enhancement
filters are applied to conclude the first step.
• Segmentation
In the second step, the generated feature map is used for segmentation of the blood
vessels in the eye images. Eye fundus image segmentation in neural networks is
generally based on several steps, e.g. edge detection, soft and hard exudate
segmentation and thresholding.
• Feature Extraction and Classification
In the last stage the image is passed through the fully-connected layer to generate the
final feature map of the retinal images, and an activation function is used at the output
layer to predict the class label.
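As a concrete illustration of the preprocessing step, the sketch below (assuming OpenCV; the exact filters used by the authors are not specified, so a median blur and CLAHE contrast enhancement stand in for the noise-removal and enhancement filters) prepares one converted JPG for the network.

```python
import cv2

def preprocess(path):
    """Illustrative preprocessing: load, denoise and enhance a fundus image."""
    bgr = cv2.imread(path)                    # OpenCV loads images in BGR order
    denoised = cv2.medianBlur(bgr, 3)         # simple noise-removal filter (assumed choice)
    # Contrast enhancement on the luminance channel only (CLAHE), then back to color.
    lab = cv2.cvtColor(denoised, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(l)
    enhanced = cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)
    return cv2.cvtColor(enhanced, cv2.COLOR_BGR2RGB)   # RGB channels for the CNN input
```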
To classify DR, RVO and Normal eye images, we use three pre-trained neural
networks. By applying transfer learning to these models, we can obtain the desired
results. The details of the Inception v3 architecture we have used and its
implementation are described in the next sections.

3.5 Inception v3 Design Requirements


Some key points should be kept in mind while designing an Inception v3 network:
1. At the initial stages of the network, representational bottlenecks should be avoided.
2. Higher-dimensional representations make processing easier within the network.
3. Lower dimensions can be used for spatial embeddings; in this way little or no loss
of representational power occurs.
To get optimum performance from the network, the depth and width of the network
must be balanced, which can be done by balancing the number of filters at each stage
with the depth.

Fig. 2. Proposed methodology

3.5.1 Inception v3 Architecture (Transfer Learning)


The working principle of Inception v3 is to go deeper and deeper: the network has a
total of 42 layers, and this depth underlies the efficiency of the model. The architecture
of this network with transfer learning is shown in Fig. 3.

Fig. 3. A general architecture of Inception v3 with transfer learning



4 DR and RVO Prediction System

The main aim of this research is to design a system that can predict Diabetic
Retinopathy (DR), Retinal Vein Occlusion (RVO) and Normal images through a single
model with improved accuracy. For this purpose, three CNN models are tested on eye
fundus image data: Inception V3, ResNet50 and AlexNet.
The interface of the proposed DR&RVO prediction system is shown in Figs. 4, 5 and
6, where the proposed model predicts the three class labels. This interface is built on
the results obtained from the Inception v3 model.

Fig. 4. Predicted class “Normal”

In Fig. 4, the image is selected from the test data and, on the basis of the trained model,
the predicted class label is Normal, which is correct.

Fig. 5. Predicted class “DR”



In Fig. 5, the image is selected from the test dataset and, on the basis of the training
data, the predicted class label is DR.

Fig. 6. Predicted class “RVO”

Figure 6 shows a predicted class label of RVO, although the image is actually a DR
image; this shows that a small disruption introduced while capturing the image can
change the result.
As mentioned, the interface is built on the Inception v3 model, which gives the best
accuracy on our eye image data. It was developed in Python 3.5, importing Tkinter and
other library files.
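A stripped-down sketch of such an interface is shown below; the window layout and the predict() stub are hypothetical, standing in for the trained Inception v3 model behind the real interface.

```python
import tkinter as tk
from tkinter import filedialog
from PIL import Image, ImageTk

def predict(path):
    # Hypothetical stub: the real interface would run the fine-tuned Inception v3 model here.
    return "Normal"

root = tk.Tk()
root.title("DR & RVO Prediction System")
result = tk.Label(root, text="Select a fundus image")
result.pack()
panel = tk.Label(root)
panel.pack()

def open_image():
    path = filedialog.askopenfilename(filetypes=[("JPEG images", "*.jpg")])
    if not path:
        return
    photo = ImageTk.PhotoImage(Image.open(path).resize((300, 300)))
    panel.configure(image=photo)
    panel.image = photo                        # keep a reference so Tk does not discard it
    result.configure(text="Predicted class: " + predict(path))

tk.Button(root, text="Open image", command=open_image).pack()
root.mainloop()
```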

4.1 Results and Discussions


The results in Table 4 are obtained from the three models mentioned above. They were
obtained by systematically testing and varying parameters such as the number of
epochs and the batch size; we stop varying these parameters where the fine-tuned
models give their best accuracy. The accuracy of the models with different parameters
is given in Table 4.

Table 4. Performance of three models


Model Epochs Batch size Accuracy
Inception V3 1000 100 85.2%
Alex-Net 1500 100 69.03%
Inception V3 1000 100 85.2%

4.2 Accuracy and Loss Function Graph (Inception V3)


The accuracy graph shown in Fig. 7 is obtained from the Inception v3 model on the
training and validation data. The graph shows that accuracy increases with the number
of epochs, but once the number of epochs approaches 1000, the validation accuracy
becomes constant.

Fig. 7. Accuracy graph (Inception V3)

The loss function graph of the model in Fig. 8 shows that the loss is high at the start
and is minimized as the number of epochs increases.

Fig. 8. Loss function graph (Inception V3)



5 Conclusion and Future Work

In our proposed methodology, we have discussed the steps to follow and the detailed
working of the proposed models. The dataset for this research is obtained from seven
publicly available datasets. Considering all these aspects, we have designed a
DR&RVO prediction system with three class labels, namely DR, RVO and Normal eye
images, and used pre-trained deep learning models for prediction. The main idea
behind these models is transfer learning, in which the last layers of the networks are
changed according to our dataset, so that the models classify eye disease images with
improved accuracy. Our system can be used as a decision support system by eye
specialists for predicting DR- and RVO-infected patients' eye images. In future, this
work can be extended by using a local eye image dataset and by further improving the
accuracy of the model. In this research we predict three general eye image categories
(two diseased and one normal); the work can also be extended to predict more than
three eye disease categories, or to predict parent as well as subcategories of these
diseases. For example, besides DR, the model should also predict Diabetic Macular
Oedema (DMO) and Proliferative Diabetic Retinopathy (PDR); similarly, the same
model should predict RVO together with Branch-Retinal-Vein-Occlusion (BRVO),
Central-Retinal-Vein-Occlusion (CRVO) and Hemi-Retinal-Vein-Occlusion (HRVO).

Funding. This research is funded by the National Research Program for Universities (NRPU),
Higher Education Commission (HEC), Islamabad, Pakistan, grant number 20-
9649/Punjab/NRPU/R&D/HEC/2017-18.

References
1. Samant, P., Agarwal, R.: Machine learning techniques for medical diagnosis of diabetes
using iris images. Comput. Methods Programs Biomed. 157, 121–128 (2018)
2. Kaur, M., Talwar, R.: Automatic extraction of blood vessel and eye retinopathy detection.
Eur. J. Adv. Eng. Technol. 2(4), 57–61 (2015)
3. Guo, J., et al.: Automatic retinal blood vessel segmentation based on multi-level
convolutional neural network. In: 2018 11th International Congress on Image and Signal
Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE (2018)
4. Qureshi, I., et al.: Computer aided systems for diabetic retinopathy detection using digital
fundus images: a survey. Curr. Med. Imaging Rev. 12(4), 234–241 (2016)
5. Solkar, S.D., Das, L.: Survey on retinal blood vessels segmentation techniques for detection
of diabetic retinopathy. Diabetes (2017)
6. Pratt, H., et al.: Convolutional neural networks for diabetic retinopathy. Procedia Comput.
Sci. 90, 200–205 (2016)
7. Nicolò, M., et al.: Real-life management of patients with retinal vein occlusion using I-
macula web platform. J. Ophthalmol. (2017)
8. Zode, J.J., Pranali, C.C.: Detection of branch retinal vein occlusions using fractal analysis.
Asian J. Convergence Technol. (AJCT)-UGC LISTED 3 (2017)

9. Fazekas, Z., et al.: Influence of using different segmentation methods on the fractal properties
of the identified retinal vascular networks in healthy retinas and in retinas with vein
occlusion, pp. 361–373 (2015)
10. Schmidt-Erfurth, U., et al.: Artificial intelligence in retina. Prog. Retinal Eye Res. (2018)
11. Ramachandran, N., et al.: Diabetic retinopathy screening using deep neural network. Clin.
Exp. Ophthalmol. 46(4), 412–416 (2018)
12. Roy, P., et al.: A novel hybrid approach for severity assessment of diabetic retinopathy in
colour fundus images. In: 2017 IEEE 14th International Symposium on Biomedical Imaging
(ISBI 2017). IEEE (2017)
13. Dutta, S., et al.: Classification of diabetic retinopathy images by using deep learning models.
Int. J. Grid Distrib. Comput. 11(1), 89–106 (2018)
14. Padmanabha, A.G.A., et al.: Classification of diabetic retinopathy using textural features in
retinal color fundus image. In: 2017 12th International Conference on Intelligent Systems
and Knowledge Engineering (ISKE). IEEE (2017)
15. Annunziata, R., et al.: Leveraging multiscale hessian-based enhancement with a novel
exudate inpainting technique for retinal vessel segmentation. IEEE J. Biomed. Health
Informat. 20(4), 1129–1138 (2016)
16. Khomri, B., et al.: Retinal blood vessel segmentation using the elite-guided multi-objective
artificial bee colony algorithm. IET Image Process. 12(12), 2163–2171 (2018)
17. Saleh, E., et al.: Learning ensemble classifiers for diabetic retinopathy assessment. Artif.
Intell. Med. 85, 50–63 (2018)
18. Hassan, G., et al.: Retinal blood vessel segmentation approach based on mathematical
morphology. Procedia Comput. Sci. 65, 612–622 (2015)
19. Zaheer, R., Humera, S.: GPU-based empirical evaluation of activation functions in
convolutional neural networks. In: 2018 2nd International Conference on Inventive Systems
and Control (ICISC). IEEE (2018)
20. Szegedy, C., et al.: Rethinking the inception architecture for computer vision. arXiv preprint
arXiv:1512.00567 (2018)
21. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (2016)
22. Krizhevsky, A., Ilya, S., Geoffrey, E.H.: Imagenet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems (2012)
Crop Monitoring Agent System Based
on Pattern Recognition Techniques

Ahad Hanif (1), Aslam Muhammad (2, corresponding author), A. M. Martinez-Enriquez (3),
and Andrees Muhammad (4)

(1) Department of Mechatronics Engineering, UET Lahore, Lahore, Pakistan (ahad.hanif@umt.edu.pk)
(2) Department of CS, UET Lahore, Lahore, Pakistan (maslam@uet.edu.pk)
(3) Department of CS, CINVESTAV-IPN, D.F. Mexico, Mexico (ammartin@cinvestav.mx)
(4) Department of CS, Superior University, Lahore, Pakistan (adreesgujer@gmail.com)

Abstract. Agriculturists frequently visit farms to monitor the condition of the
plants and the crop. This monitoring takes time and increases the cost of
production, and conventional monitoring and inspection techniques may lead to
low crop production, severe disease attacks, and plagues of insects and parasites.
This paper describes the design, development, and validation of an integrated
infrastructure consisting of a sensor network and an intelligent agent for
monitoring crops grown in a greenhouse or a tunnel. The infrastructure allows
agriculturists to monitor remotely and to advise farmers to take appropriate
measures according to the environmental conditions. The main modules are data
acquisition, image processing to find deficiencies of nutrients, and advice
provided according to detected events in order to minimize critical effects.

Keywords: Remote sensing · Knowledge-based agent system · Image processing

1 Introduction

Crop science grows and changes continuously, and it is necessary to assist farmers in
critical situations. Several electronic devices, such as sensors, can now collect
environmental data, and dedicated computing tools exist for supporting farmers. Data
acquisition is essential for deducing critical situations; for instance, excessive water
and low soil oxygen (O2) damage the roots of plants [3]. Manual crop monitoring
depends on the experience of the farmer, and problems arise when a farmer does not
have enough expertise. Furthermore, manual monitoring takes time and reduces
profitability, and a farmer may not be sure that his or her supervision is precise and
performed on time.
The two most crucial factors for plants are the crop water status and the availability
of fertilizers to meet the requirements of essential nutrients. Environmental factors
such as temperature and humidity disturb the mobility of nutrients in plants. For instance,
plants may suffer from dehydration, chilling, and freezing, and they become vulnerable
to the attack of pests and fungus. Traditionally, farmers use Leaf Color Charts (LCC) to
check the deficiency of nutrients (Nitrogen) in plants. Manual LCC checking requires
experience; if plants have an insufficient amount of nutrients, the yield decreases and
the production cost increases. Thus, small-scale farming requires system support to
assure a good yield at a reduced cost.
Many monitoring systems exist that use radar, satellite imaging [15], and aerial
photography, but they are quite expensive, sophisticated, and require expertise to
operate. By contrast, sensors are inexpensive, so a monitoring system based on sensor
networks can maintain the optimum range of parameters (light, temperature, humidity),
protecting plants from pests and diseases [1–3].
Greenhouse and tunnel farming techniques would be more productive if they included
a remote monitoring system. This research therefore proposes an intelligent monitoring
system for agriculturists in remote areas. The dedicated system integrates hardware
equipment and adaptable software for the sensor network, together with agents based
on a knowledge base of inference rules.
The sensors acquire data on the parameters related to plant growth: humidity,
temperature, and nutrients. The analysis of this information allows appropriate
measures to be taken before the crop is damaged, in case of risk.
The organization of the paper is as follows: Sect. 2 introduces current research in
phytomonitoring techniques needed to understand the approach proposed in this
article. Section 3 describes the crop monitoring methodology of the present study.
Section 4 explains the experimental results. Section 5 concludes and presents future
work.

2 Related Works

Phytomonitoring techniques have become essential for detecting physiological
disorders in plants. The environmental parameters observed through sensors are
temperature, air humidity, and soil moisture, and the Global System for Mobile
communication (GSM) transmits the data remotely [4]. Sensitive phytomonitoring
parameters, such as the growth rate of the trunk, flower, and fruit, may change with a
slight modification in the water balance [5]. Real-time monitoring and control of the
environmental parameters helps farmers bring innovation to the irrigation strategy [6].
The agriculture sector uses image processing extensively, for instance to determine the
age of a wheat crop; this information helps to determine the suitable amount of
fertilizers and nutrients needed for the plants, reducing the overall cost [7]. Such
processing combines the three prime colors, red, green, and blue, to form an image [9].
The hierarchical control of greenhouse crop production proposed in [10] develops
adaptive and predictive control for heating and ventilation. Suitable ranges of
temperature and humidity preserve the properties of stored fruits and control the
fungus that damages them [8, 9].

A trapezoidal two-dimensional index is empirically derived in [13], which detects
water stress even with a low percentage of canopy cover. Continuous monitoring and
scheduling of the irrigation process using infrared images has provided early warning
of water stress [11]. The best way to find the plant water status is to estimate the soil
water content from evapotranspiration; infrared images were produced to find the
vegetation index [12]. A solar-energy-based system monitors a greenhouse to solve the
low battery issue: all nodes and sensors are solar powered, and data is transferred
through ZigBee and GSM technology [14].
With the increase in world population, the requirement for food increases. Techno-
economic solutions need improvement to decrease the number of people who die each
year due to starvation. Moreover, conventional monitoring systems are not techno-
economic, so there is ample room to develop a system that is cheaper, more reliable,
and gives a quick response.
The goal of this research is to design, build, and validate a system that supports
farmers and agriculturists in taking appropriate measures in case of environmental
stresses and risks. A knowledge base of inference rules encodes the expertise of
specialists, and the adaptive system keeps this information up to date. The system
brings flexibility to the Leaf Color Chart (LCC), so that anyone can use it conveniently.
Furthermore, users can monitor and examine the plants in a better way to increase the
yield.

3 Methodology

The most crucial atmospheric element affecting the growth and yield of plants is the
temperature, since the photosynthesis and respiration processes require an optimum
temperature. The second important parameter is the humidity: the relative humidity is
the ratio of the real to the saturated water vapor content, expressed as a percentage and
measured at the same pressure and temperature.
Plants use the carbon dioxide (CO2) in air for photosynthesis. Plant leaves have pores
to take in CO2, and during respiration some moisture in the air also enters through
these pores. The plant transpires moisture more slowly when the humidity of the air is
high, and vice versa: when the air is dry, plants evaporate more water, become
deficient in water, and their leaves close their pores. Although outgoing water
decreases, the CO2 intake is reduced too. Plants absorb water from the soil through
their roots.
Crop monitoring uses the developed sensors to acquire information about the
environmental parameters. In the closed environment where the plants grow, the
DS18B20 sensor is used for temperature and the HSU-04 sensor for humidity.
The IP camera installed in the field captures images of the plants, which are compared
with the standard Leaf Color Charts (LCCs). The Electronic LCC (E-LCC) substitutes
the manual LCC used before; it is flexible, easy to use, and assists farmers in
minimizing the production cost by indicating the dosage of the appropriate fertilizers.
Finally, the knowledge base of inference rules (KBR) encodes the expertise of a
specialist of the domain, who provides possible solutions to farmers' problems.

The normal growth of plants requires appropriate temperature, humidity, and a precise
quantity of nutrients (Nitrogen). As the plants grow in an enclosed environment, the
sensors can capture the environmental parameters (temperature, humidity) in real time.
The computer center receives this information through the RS-232 serial port, and the
intelligent agent system uses the KBR to deduce risk situations and provide advice.
The agent software automatically carries out the information retrieval and delivery
tasks on behalf of the users. The intelligent agent incrementally accommodates new
problems, using the inference rules to analyze the behavior, the faults, and the required
correction, if any. The agent perceives its environment and surroundings through the
sensor network and responds through actuators, mapping every possible percept to an
action-task.
There are two basic modules in the system: the Remote Monitoring Unit (RMU) and
the Data Manipulation Unit (DMU). Each camera takes images of the plants in real
time to find the Nitrogen deficiency, and the results are sent by email. The images are
compared, in real time, with the standard color scheme to find the percentage of
Nitrogen. The color of a leaf changes with the nutrients; the prime nutrients are
Nitrogen (N), Phosphorous (P), and Potassium (K), and the green color of the leaf is
due to N. Some examples of inference rules are listed below (a small illustrative sketch
follows the list):
• (R1) If the evapotranspiration rate is higher than the absorption rate, Then the plants
undergo severe atmospheric and chemical stresses, causing the plant cells to die.
• (R2) If the humidity in the air is low, Then Suggest (Air is almost_dry, and the
transpiration of moisture in the plant is slow).
• (R3) If the air is dry, Then Suggest (Air is too_dry, and plants evaporate more water,
become deficient in water, and leaves close their pores).
• (R4) If the plant leaf becomes pale green, Then Suggest (the plant is deficient in
Nitrogen).
• (R5) If the leaves appear dark green, Then Suggest (the plant has enough Nitrogen).
• (R6) If the temperature is greater than or equal to the maximum value, Then Suggest
(Too_hot).
• (R7) If the temperature is less than or equal to the minimum value, Then Suggest
(Too_cold).
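A minimal sketch of how such rules can be encoded in software is given below; the numeric thresholds are illustrative placeholders, not values from the knowledge base.

```python
def apply_rules(temperature, humidity, leaf_shade, t_min=15.0, t_max=35.0, dry_air=30.0):
    """Evaluate a few of the rules above; thresholds are illustrative, not from the KBR."""
    advice = []
    if temperature >= t_max:
        advice.append("Too_hot")                                            # rule R6
    if temperature <= t_min:
        advice.append("Too_cold")                                           # rule R7
    if humidity < dry_air:
        advice.append("Air is too_dry: plants evaporate more water and close their pores")
    if leaf_shade <= 2:                       # pale green shade on the LCC scale (1..6)
        advice.append("Plant is deficient in Nitrogen")
    elif leaf_shade >= 5:                     # dark green shade
        advice.append("Plant has enough Nitrogen")
    return advice

print(apply_rules(temperature=38.0, humidity=25.0, leaf_shade=2))
```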
The expert agent receives the information from the field sensors through the
PIC18F452 microcontroller and the RS-232 serial port. The monitoring intelligent
agent system uses this information for better visualization of the values coming from
the field sensors, which are also displayed on a 16 × 2 LCD.
The expert agent (a program written in MATLAB, the multi-paradigm numerical
computing environment) treats this information together with the KBR and generates
an email notifying the End-user (farmer) of the current situation.
The RS-232 Data Logger software has straightforward interfacing features: it generates
a text file containing the temperature and humidity values, which is updated
automatically at regular intervals and used by the agent to apply the inference rules.
Figure 1 shows the system architecture.

Fig. 1. System architecture
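The monitoring cycle built on that log file can be sketched as follows; the one-line-per-reading file format, the mail server and the addresses are assumptions used only for illustration (the actual agent is written in MATLAB).

```python
import smtplib
import time
from email.message import EmailMessage

T_RANGE, H_RANGE = (15.0, 35.0), (40.0, 80.0)   # illustrative ranges, set via the GUI

def latest_reading(logfile="rs232_log.txt"):
    # Assumed data-logger format: one "temperature,humidity" pair per line.
    with open(logfile) as f:
        temp, hum = f.readlines()[-1].split(",")
    return float(temp), float(hum)

def notify(subject, body):
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "agent@farm.local", "farmer@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as server:    # hypothetical mail server
        server.send_message(msg)

while True:
    t, h = latest_reading()
    if not (T_RANGE[0] <= t <= T_RANGE[1]) or not (H_RANGE[0] <= h <= H_RANGE[1]):
        notify("Greenhouse warning", f"Out-of-range reading: temperature={t}, humidity={h}")
    time.sleep(600)                              # re-check at a regular interval
```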

The graphical user interface (GUI) (see Fig. 2) sets the ranges for the temperature and
the humidity. As long as the real-time values remain within these ranges, no warning
email is generated; as soon as a value goes out of range, the End-user (farmer) is
contacted by email to report the unwanted conditions, and the corresponding inference
rules are applied.

Fig. 2. GUI window

While doing color image processing, it is necessary to separate the desired color under
observation from the original image; in the present case, the green color is under
study.

First, the green color is identified in the original image. There are many disturbing
factors such as sunshine, shades, background, and the presence of non-green colors in
the original image, which make it difficult to detect only the pure green color. The
developed image processing technique creates a mask of the required color and applies
it to the original image. The program finds the output for each simple color detection
in the RGB model (red, green, and blue): it first extracts the individual color bands
from the image, then processes the image histogram and selects the color threshold
range. The resulting mask has a smooth border and filled regions; applied to the
original image, it identifies the green portion. The next step is to compare the
processed image with the standard LCCs.
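A small OpenCV sketch of this masking step is given below; the HSV bounds and kernel size are illustrative, not the calibrated values used by the authors.

```python
import cv2
import numpy as np

def extract_green(image_path):
    """Build a green mask with smooth borders and filled regions, then apply it."""
    bgr = cv2.imread(image_path)
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    # Threshold range for green; illustrative bounds, to be tuned per crop and lighting.
    mask = cv2.inRange(hsv, np.array([35, 40, 40]), np.array([85, 255, 255]))
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # smooth the border
    green_only = cv2.bitwise_and(bgr, bgr, mask=mask)        # keep only the green portion
    return mask, green_only
```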
LCC is a straightforward technique for checking the Nitrogen deficiency of plants:
farmers can find out the plant's Nitrogen demand, apply the Nitrogen fertilizer in an
appropriate amount, and minimize the production cost. Usually there are six shades of
green on an LCC, varying from light green to dark green, and each shade represents a
specific amount of Nitrogen deficiency. Figure 3 shows the different shades of green.
shades of green color.

Fig. 3. Color scheme of LCC

Light green means that the plant has a high deficiency of Nitrogen, and dark green
means that the plant has a sufficient amount of Nitrogen. Sometimes the color of a leaf
lies between two shades and does not entirely match any of them. This work addresses
this issue: the Electronic LCC gives an accurate and precise comparison even when the
color lies between two shades.
Figure 4 shows a field image to be compared with the standard LCC. Before the
comparison, and to minimize the effects of the non-green parts, the image processing
techniques described above extract the green color from the image.

Fig. 4. Image from field.



4 Experimental Results

Each color band of the image takes 256 intensity values in the range [0, 255], and each
band has its own histogram. For instance, Fig. 5(a) shows the histograms of the RGB
bands (red, green, blue), and Fig. 5(b) shows the histogram of the green band. After
the identification of the green color, the Electronic LCC compares it with the standard
color chart. The shades of the Electronic LCC may vary depending on the type of crop
and the area in which it grows.

Fig. 5. (a) Histogram of red, green, blue bands; (b) Histogram of green band

First, the images collected in a specific region are treated as the reference images of
the Electronic LCC; these images produce a color scheme. The user can then monitor
the Nitrogen content of the plants by comparing, in real time, the color of a field image
with the standard color chart. The system calculates the average values of the green
shades and of the field images, and divides each pixel of a field image by the average
value of the reference images. Before finding the average value, the image processing
techniques described above are used to extract the actual green color from the image.
Figure 6 shows the results.
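A compact sketch of this comparison is shown below, assuming that the six reference shade means have already been computed from the E-LCC images; the ratio closest to 1 identifies the matching shade, as plotted in Fig. 6.

```python
import cv2
import numpy as np

def mean_green(bgr, mask):
    """Average green-channel intensity over the masked leaf pixels (BGR index 1)."""
    return cv2.mean(bgr, mask=mask)[1]

def match_shade(field_bgr, field_mask, reference_means):
    """reference_means: six mean green values from the E-LCC shades (assumed precomputed)."""
    ratios = [mean_green(field_bgr, field_mask) / ref for ref in reference_means]
    best = int(np.argmin([abs(r - 1.0) for r in ratios]))
    return best + 1, ratios        # shade index in 1..6 and the ratio for each shade
```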
In the graph (Fig. 6), the X-axis indicates the percentage of Nitrogen deficiency in the
plants. There are six shades of green in the Electronic LCC, so the six vertical lines
represent the different shades. The red circle at the top of the graph is near the dotted
horizontal line (passing through the Y-axis at the value 1); the vertical line shows the
matching between the field image and the corresponding shade. If the circle is on the
dotted line, the corresponding shade from the scheme exactly matches the field image;
if it is below or above the dotted line, the color of the field image lies between two
shades. The X-axis indicates the Nitrogen deficiency of the plant, and according to this
information the Nitrogen fertilizer (such as urea) is regulated in the soil.
The Electronic LCC provides a cost-effective, well-performing, and quick
measurement of Nitrogen deficiency in green plants. The intensity of the green color is
associated with the nutrient content (Nitrogen). The critical value on the LCC is one,
which corresponds to yellowish-green and shows the lowest Nitrogen concentration;
the highest concentration is indicated by six and corresponds to dark green. The image
processing tools eliminate the effects of disturbing factors such as
shadow, background, and sunlight; for instance, sunlight may cause a slight difference
in the color.

Fig. 6. Results of the color processing.

To obtain a better result, the camera must focus on the green portion of the plants and
should not include non-green parts such as the color of the trunk or the soil. It should
be verified that the picture is not taken in direct sunlight, because it may show a
different shade compared to one taken in overcast weather.
Some examples of applied rules are the following:
(R8) If LCC = 5, Then add approximately 35 kg of Nitrogen per hectare.
(R9) If LCC = 4, Then add approximately 20 kg of Nitrogen per hectare.
A field recorded an 87% higher yield when rule R8 was applied. When rule R9 was
applied, the yield was lower, but the farmer saved 40 to 50% of the fertilizer.

5 Conclusions and Future Work

This paper has described the design, development, and validation of a crop monitoring
expert-agent system that analyzes the impacts of environmental stresses on the growth
of plants. The approach combines hardware instrumentation with a knowledge-based
agent system. The central computer service receives crucial environmental parameter
data from the sensor network, and the image processing finds the deficiencies of
nutrients in the plants while minimizing the effect of the non-green portions captured.
The Electronic leaf color chart replaces the manual LCC, providing enough
information for future scenarios. The End-user receives warning messages to prevent
critical situations, together with ad hoc solutions for the environmental stress.
The application of Artificial Intelligence techniques such as image processing and
ad hoc instrumentation in agriculture can make this sector more productive and
effective. Monitoring plants in real time through color image processing has made the
system quite useful: it compares, in real time, the image of the plant with patterns of
previous images and communicates the results by e-mail. To avoid abrupt fluctuations
in the temperature and humidity values, the agents take the average of the previous
values, thereby avoiding temporary alarm conditions.
The obtained results show that agriculture can improve day by day: the process of
plant growth can become more sophisticated yet better controlled. For instance, an
adequate amount of water, fertilizers, and nutrients improves the yield and at the same
time reduces the cost. Besides, the monitoring system advises adjusting the water
supply, temperature, and required humidity as a preventive measure, and it is quite
convenient for seeing past trends of Nitrogen deficiency in the plants. In addition to
remotely monitoring and controlling the plant environment, the system minimizes
human effort and supports farmers and landowners by indicating the appropriate
quantity of nutrients, minimizing energy consumption and thus producing more
efficiently and with a better economic rating.
The obtained results are helpful for extending this research to produce new types of
crops with higher yield in remote areas, since plants can grow in areas that do not have
suitable environmental conditions for a specific crop. Future research addresses
improvements in data acquisition; increasing, modifying, and organizing the
knowledge base of inference rules automatically to compensate for environmental
conditions not yet considered; and testing different types of plants (e.g., cotton),
natures of soil and area, as well as the effects of volatile substances.

References
1. Zuo, X., et al.: Design of environmental parameters monitoring system for watermelon-
seedlings based on wireless sensor networks. Appl. Math. Inf. Sci. 5(2), 243S–250S (2011)
2. Albright, L.D., et al.: Environmental control for plants on earth and space. IEEE Control
Syst. Mag. 21(5), 28–47 (2011)
3. Schaffer, B.: Effects of soil oxygen deficiency on avocado (Persea americana Mill) Trees. In:
Seminario International: Manejo del Riego y Suelo en el Cultivo del Palto La Cruz, Chile,
27–28 September 2006 (2006)
4. Avidan, A., Hazan, A.: Application of the phytomonitoring technique for table grapes. In:
The International Workshop on Advances in Grapevine and Wine Research, 15–17
September 2005 (2005)
5. Puig, V., et al.: Optimal predictive control of water transport systems: arrêt-darré/arros, a
case study. Water Sci. Technol. 60(8), 2125–2133 (2009)
6. Ton, Y., Kopyt, M.: Phytomonitoring in realization of irrigation strategies for wine grapes.
Acta Hortic. 652, 167–173 (2004). (ISHS)
7. Kakran, A., Mahajan, R.: Monitoring growth of wheat crop using digital image processing.
Int. J. Comput. Appl. 50(10) (2012). ISSN 0975-8887
8. Ibrahim, M., Rabah, A.B.: Effect of temperature and relative humidity on the growth of
Helminthosporium fulvum. Niger. J. Basic Appl. Sci. 19(1), 127–129 (2011)
9. Plataniotis, K.N., Venetsanopoulos, A.N.: Color Image Processing and Applications (2000)

10. Rodríguez, F., Guzman, J.L., Berenguel, M., Arahal, M.R.: Adaptive hierarchical control of
greenhouse crop production. Int. J. Adapt. Control Signal Process. 22, 180–197 (2008).
https://doi.org/10.1002/acs.974
11. Kopyt, M., Ton, Y.: Phytomonitoring Technique for Table Grapes Application Guide, 2nd
edn. PhyTech Ltd. (2005)
12. Clarke, T.R.: An empirical approach for detecting crop water stress using multispectral
airborne sensors. HortTechnology 7(1), 9–16 (1997)
13. Dupin, S., Gobrecht, A., Tisseyre, B.: Airborne thermography of vines canopy: effect of the
atmosphere and mixed pixels on observed canopy temperature. UMR ITAP, Montpellier
(2011)
14. Gao, L., Cheng, M., Tang, J.: A wireless greenhouse monitoring system based on solar
energy. Telkomnika 11(9), 5448–5454 (2013). e-ISSN 2087-278X
15. Diao, C.: Innovative pheno-network model in estimating crop phenological stages with
satellite time series. ISPRS J. Photogramm. Remote Sens. 153, 96–109 (2019). https://doi.
org/10.1016/j.isprsjprs.2019.04.012
Small Ship Detection on Optical Satellite
Imagery with YOLO and YOLT

Wilder Nina (1), William Condori (1), Vicente Machaca (2), Juan Villegas (1),
and Eveling Castro (1, corresponding author)

(1) Universidad Nacional de San Agustín de Arequipa, Arequipa, Peru
({wnina, wcondori, jvillegasp, ecastro}@unsa.edu.pe)
(2) Universidad La Salle de Arequipa, Arequipa, Peru

Abstract. The use of deep learning for object detection gives good results, but
performance decreases when there are small objects in the image. This work
presents a comparison between the latest version of You Only Look Once
(YOLO) and You Only Look Twice (YOLT) on the problem of detecting small
objects (ships) in optical satellite imagery. Two datasets were used: the High-
Resolution Ship Collection (HRSC) and the Mini Ship Data Set (MSDS), the
latter built by us. The mean object widths for HRSC and MSDS are 150 and 50
pixels, respectively. The results show that YOLT is good only for small objects,
with 76.06% Average Precision (AP), whereas YOLO reached 69.80% on the
MSDS dataset. Moreover, on the HRSC dataset, which has objects of different
sizes, YOLT obtained 40% AP against 75% for YOLO.

Keywords: YOLO · YOLT · Small objects · Object detection · Ship detection · Satellite imagery

1 Introduction

In the field of remote sensing, ship detection can help with many problems such as
maritime management and illegal fishing surveillance, and there is already a great deal
of work on ship detection. However, the majority of object detection methods fail at
detecting small objects.
We compare the latest version of You Only Look Once (YOLO) and You Only Look
Twice (YOLT) on the problem of detecting small ships. For this reason, two datasets
were used for the performance metrics: the High-Resolution Ship Collection (HRSC)
dataset [27], with satellite images of ships of different object sizes, and the Mini Ship
Data Set (MSDS) built in this project, which contains only small objects.
The Average Precision (AP) metric of YOLO and YOLT was evaluated on the MSDS
dataset (which contains only small ships); YOLT obtained 76% AP and YOLO 69.8%.
This demonstrates that for small objects it is a good idea to use YOLT. In the case of
the HRSC dataset
(big, medium, and small ships), YOLT obtained only 40% AP against 75% for YOLO.
In this case YOLT's AP decreases considerably: YOLT changes the network
architecture in order not to downsample the objects, which improves the performance
on small object detection but affects the detection of big and medium objects.

2 Related Works

There are plenty of methods for ship detection; we group them in two. The first group
consists of methods that use statistics and image processing; the second group consists
of methods based on deep learning.
2.1 Statistical and Image Processing


One of the pioneering works was proposed in 1995 by Inggs [14], who used a modified
version of the Fourier transform with neural networks, obtaining 90% accuracy.
Corbane in 2009 [13] proposed these steps: image splitting, binarization, contrast
stretching, morphological operations, statistical filtering, the wavelet transform with
the Radon transform, and finally a logistic regression model. They obtained good ship
detection results, but the number of false positives was too high. In 2012 Bi [15]
proposed a model using a bottom-up visual attention mechanism followed by a
top-down cue, as they named it. Then, in 2018 Shi [16] proposed a method based on
three steps: smoothing filters, sea-land segmentation based on gradients, and finally
Haar-like gradients with the Radon transform, reaching a precision of 0.94. Moreover,
Cheng [17] surveys all the methods for object detection in satellite imagery until 2016,
including ship detection. Recently, Henning [16] determined the ship position, length,
and breadth down to subpixel resolution using image pixel processing.

2.2 Deep Learning

The use of deep learning for object detection has outperformed the classic methods.
The top deep networks for object detection are R-CNN, Fast R-CNN, Faster R-CNN,
YOLO, and SSD. R-CNN [22], presented in 2014, used selective search [26] for the
initial object proposal bounding boxes; a deep convolutional network was then applied
to each proposal region with a Support Vector Machine (SVM) in the last layer, and
this method performed very well in the Visual Object Classes (VOC) challenge.
R-CNN was improved by Fast R-CNN [23] in 2015; in this case the first convolutional
layers are shared by all proposals, and the SVM classifier was replaced by a softmax
function in the last layer. Also in 2015, Faster R-CNN was presented [24]. The
problem with R-CNN and Fast R-CNN was the method used for object proposals:
selective search is a bottleneck, and it also proposes too many candidates. For that
reason, Faster R-CNN uses a convolutional network for
classifying and detecting regions of interest, which reduced the processing time
considerably.
YOLO was proposed in the work of [21]; similarly to Faster R-CNN, it uses a
convolutional network to detect and classify objects. It treats object detection as a
regression problem: since the bounding box coordinates are numbers, a network can
be used to predict them as a regression method does. YOLO has the lowest processing
time but is outperformed by Faster R-CNN in terms of accuracy. Moreover, SSD [25]
presents a method for detecting objects in images using a single deep neural network,
discretizing the output space of bounding boxes into a set of default boxes over
different aspect ratios and scales per feature map location. Then, at the end of 2016,
YOLOv2 [20] was presented, adding batch normalization, anchors, a higher-resolution
input image, fine-grained features, and multi-scale training. Finally, YOLOv3 [19]
outperformed the previous versions by adding a Feature Pyramid Network (FPN),
logistic classifiers for every class instead of softmax, and Darknet-53 instead of
Darknet-19.
In 2016, Liu [8] proposed a CNN-based model to detect ships in Optical Remote
Sensing Images (ORSI), and Zhang [6] proposed an S-CNN for the same purpose.
More recently, Zhang in 2019 [7] adapted Faster R-CNN for detecting ships.
Moreover, Ma [9] proposed a model to detect and recognize ships in ordinary images,
while Zhao in 2019 [10] proposed a model for ship detection in video.
Normally, a Convolutional Neural Network (CNN) for object detection returns
bounding boxes in the form (x_min, y_min, x_max, y_max) or (x, y, width, height),
but some implementations improve on this by returning rotated bounding boxes with
an added angle, e.g. (x, y, width, height, angle). For example, Yang [12] proposed a
model called Rotation Dense Feature Pyramid Networks (R-DFPN) to detect rotated
ships. Also, Yang [5] and Li [4] proposed models that detect the position and direction
of ships. Moreover, Fu [2] proposed a model based on a Feature Fusion Pyramid
Network and deep reinforcement learning (FFPN-RL), while Dong [3] used saliency
and a rotation-invariant descriptor.

3 Proposal
We evaluate YOLOv3 (YOLO version 3) and YOLT for detecting small ships in
optical satellite imagery; by small we mean small objects relative to the image. Top
state-of-the-art CNNs for object detection, such as Faster R-CNN, normally fail to
detect small objects, as is tested in [11]. YOLT was presented as a modification of
YOLOv2 to detect small objects, but YOLOv3 has included an FPN network, so this
work focuses on comparing both when detecting small objects.

3.1 Small Objects

Normally, the top state-of-the-art CNNs for object detection were trained on datasets
such as COCO [1] and VOC [31] with good performance. For example, in Fig. 1 the
objects normally take up a considerable size relative to the full image, approximately
20% to 70% of the image size. The problem occurs when small objects have to be
detected, as in Fig. 2: in that case the image size is 2400 × 1200 pixels, but the objects
have only about 60 pixels of width on average, so taking the widths of the image and
the objects, the objects represent just 2.5% of the image.

Fig. 1. Normal size of objects in object detection methods Source: YOLO [19].

3.2 YOLO

As mentioned before, YOLO was proposed in [21], improved in [20], and the latest
version was presented in [19]. Unlike other convolutional networks, it detects objects
in a single pass, creating a grid of S × S cells where each cell performs a logistic
regression and a classification: the regression predicts five values per box, x, y, w, h,
and the confidence that an object is there, while the classifier predicts C conditional
class probabilities. At the final stage multiple bounding boxes appear around a single
object, so non-maximum suppression is applied to keep only the strongest detection
per object. The architecture of YOLO used in this research project is shown in
Table 1, obtained from the Darknet library [19].
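The non-maximum suppression step mentioned above can be sketched as follows (a generic implementation with axis-aligned boxes and an assumed IoU threshold, not the exact Darknet code).

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    xa, ya = max(a[0], b[0]), max(a[1], b[1])
    xb, yb = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Keep only the strongest detection among heavily overlapping boxes."""
    order = list(np.argsort(scores)[::-1])       # strongest first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```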

Table 1. YOLO network architecture (Darknet-53)

Type Filters Size Output


Convolutional 32 3 × 3 256 × 256
Convolutional 64 3 × 3 / 2 128 × 128
Convolutional 32 1 × 1
1× Convolutional 64 3 × 3
Residual 128 × 128
Convolutional 128 3 × 3 / 2 64 × 64
Convolutional 64 1 × 1
2× Convolutional 128 3 × 3
Residual 64 × 64
Convolutional 256 3 × 3 / 2 32 × 32
Convolutional 128 1 × 1
8× Convolutional 256 3 × 3
Residual 32 × 32
Convolutional 512 3 × 3 / 2 16 × 16
Convolutional 256 1 × 1
8× Convolutional 512 3 × 3
Residual 16 × 16
Convolutional 1024 3 × 3 / 2 8×8
Convolutional 512 1 × 1
4× Convolutional 1024 3 × 3
Residual 8×8
Avgpool Global
Connected 1000
Softmax

3.3 YOLT

YOLT was proposed to reduce model coarseness and to accurately detect dense
objects (such as cars). It follows [28] and implements a network architecture that uses
22 layers and downsamples by a factor of 16 rather than the standard 32×
downsampling of YOLO; the network architecture is presented in Table 2.
Thus, a 416 × 416 pixel input image yields a 26 × 26 prediction grid. The architecture
is inspired by the 28-layer YOLO network, though this new architecture is optimized
for small, densely packed objects. The dense grid is unnecessary for diffuse objects
such as airports, but improves performance for high-density scenes such as parking
lots; the smaller number of layers increases run speed [29].

To improve the fidelity of small objects, the architecture also includes a passthrough
layer (described in [20] and similar to the identity mappings in ResNet [30]) that
concatenates the final 52 × 52 layer onto the last convolutional layer, allowing the
detector access to finer-grained features of this expanded feature map. It utilizes the
same hyperparameters as the YOLO implementation.

Table 2. YOLT network architecture

Layer Type Filters Size/Stride Output size


0 Convolutional 32 3 × 3/1 416 × 416 × 32
1 Maxpool 2 × 2/2 208 × 208 × 32
2 Convolutional 64 3 × 3/1 208 × 208 × 64
3 Maxpool 2 × 2/2 104 × 104 × 64
4 Convolutional 128 3 × 3/1 104 × 104 × 128
5 Convolutional 64 1 × 1/1 104 × 104 × 64
6 Convolutional 128 3 × 3/1 104 × 104 × 128
7 Maxpool 2 × 2/2 52 × 52 × 64
8 Convolutional 256 3 × 3/1 52 × 52 × 256
9 Convolutional 128 1 × 1/1 52 × 52 × 128
10 Convolutional 256 3 × 3/1 52 × 52 × 256
11 Maxpool 2 × 2/2 26 × 26 × 256
12 Convolutional 512 3 × 3/1 26 × 26 × 512
13 Convolutional 256 1 × 1/1 26 × 26 × 256
14 Convolutional 512 3 × 3/1 26 × 26 × 512
15 Convolutional 256 1 × 1/1 26 × 26 × 256
16 Convolutional 512 3 × 3/1 26 × 26 × 512
17 Convolutional 1024 3 × 3/1 26 × 26 × 1024
18 Convolutional 1024 3 × 3/1 26 × 26 × 1024
19 Passthrough 10 → 20 26 × 26 × 1024
20 Convolutional 1024 3 × 3/1 26 × 26 × 1024
21 Convolutional Nf 1 × 1/1 26 × 26 × Nf

4 Experiments and Results

4.1 Datasets

Although some satellite image datasets exist, a dataset reflecting the particular
conditions of the Peruvian sea was built for this work. This dataset, named the Mini
Ship Data Set (MSDS), contains images from Google Earth of the Peruvian and
Chilean seas; some examples are presented in Fig. 2. As can be seen, these images
differ from other datasets in that the objects are very small with respect to the full
image size.

Moreover, some images contain noise, as can be seen in the left part of Fig. 2. The
dataset consists of 200 images with a resolution of 2400 × 1200 pixels, split into 120
for training, 40 for validation, and 40 for testing. Object widths range from 20 to 500
pixels, with the majority between 40 and 50 pixels. Figure 3 presents the histogram of
object widths.

Fig. 2. Satellite images of MSDS dataset. Left: Images with noise. Right: Normal
images.

Fig. 3. Histogram of object width in MSDS dataset.

The High Resolution Ship Collection 2016 (HRSC2016) dataset [18] was also used. It
includes ships at sea and ships close inshore, taken from Google Earth (see Fig. 4). In
this case the image sizes range from 300 × 300 to 1500 × 900 pixels; Fig. 5 presents
the histogram of object widths. As can be seen, the HRSC dataset contains objects of
all sizes, ranging from 20 to 620 pixels.

Fig. 4. Satellite images of HRSC dataset.

Fig. 5. Histogram of object width in HRSC dataset.

4.2 Results

The results presented in this section are based on the comparison between YOLOv3
and YOLT. Take into account that YOLT represents an improvement over YOLOv2,
whereas YOLOv3 uses an FPN network.
In the experiments on the MSDS dataset, YOLOv3 obtained a mAP of 69.8%,
presented in Fig. 6, with 582 true predictions versus 57 false predictions, presented in
Fig. 7. Despite the small dataset we obtain good results, and this also demonstrates
that YOLOv3 works well with small objects.

Fig. 6. Precision Recall curve of ship detection with YOLOv3 in MSDS.
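For reference, the AP values reported here can be computed from a precision-recall curve such as the one in Fig. 6; the sketch below uses the common 11-point interpolation and toy precision/recall pairs, not the actual curve data.

```python
import numpy as np

def average_precision(precisions, recalls):
    """11-point interpolated AP from matched (precision, recall) pairs."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        ap += (max(candidates) if candidates else 0.0) / 11.0
    return ap

# Toy values standing in for a real curve; they are not the MSDS results.
print(average_precision([1.00, 0.95, 0.90, 0.82, 0.70],
                        [0.10, 0.30, 0.50, 0.65, 0.72]))
```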

The experiments on the MSDS dataset with YOLT differ from those with YOLOv3
because YOLT is oriented to satellite images and uses a sliding window, so a slice
size must be applied over the test images. In the experiments of the YOLT article, a
satellite image dataset was used, so the training stage used 416 × 416 pixel cutouts and
the testing stage used a slice size of 416. In our experiments on MSDS no cutouts were
used during training, so in the testing stage the slice size matches the image size.
Figure 8 shows that a slice size of 1500 gives the best mAP. In more detail, YOLT
obtained a mAP of 76.06%, presented in Fig. 9, with 541 true predictions versus 75
false predictions, presented in Fig. 10. This demonstrates that YOLT works well with
small objects.
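The sliding-window slicing used at test time can be sketched as below; the overlap fraction is an assumption, and detections from all windows are merged with non-maximum suppression afterwards.

```python
def sliding_windows(width, height, slice_size=1500, overlap=0.2):
    """Yield (x, y) top-left corners of slice_size windows covering the image;
    windows at the right/bottom border are shifted inwards, and crops taller or
    wider than the image are simply clamped to it."""
    step = int(slice_size * (1.0 - overlap))
    xs = list(range(0, max(width - slice_size, 1), step)) + [max(width - slice_size, 0)]
    ys = list(range(0, max(height - slice_size, 1), step)) + [max(height - slice_size, 0)]
    for y in sorted(set(ys)):
        for x in sorted(set(xs)):
            yield x, y

# A 2400 x 1200 MSDS image with slice_size=1500 yields the windows (0, 0) and (900, 0).
print(list(sliding_windows(2400, 1200)))
```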

Fig. 7. True vs False prediction of YOLOv3 in MSDS.



Fig. 8. Slice size of test images vs mAP on YOLT in MSDS.

Figure 11 presents some results of detected ships, marked with red rectangles; the test
image size is 2400 × 1422 and a slice size of 1500 was applied.
In the experiments on the HRSC dataset, YOLT obtained a mAP of 40.00% and
YOLOv3 obtained a mAP of 75.00%. This confirms that YOLT works well only with
small objects.

Fig. 9. Precision Recall curve of ship detection with YOLT in MSDS.

The Average Precision (AP) of YOLOv3 versus YOLT (remember that YOLT is an
improvement over YOLOv2) is presented in Table 3. Moreover, Table 4 presents the
processing time needed to detect one image, using an NVIDIA Quadro P5000 graphics
card with 2560 CUDA cores and 16 GB of memory.

Fig. 10. True vs False prediction of YOLT in MSDS.

Fig. 11. Results of detected using YOLT in MSDS.

Table 3. AP of YOLO and YOLT.

YOLO YOLT
MSDS 69.80% 76.06%
HRSC 75.00% 40.00%

Table 4. Execution time of YOLO and YOLT.

YOLO YOLT
Time (s) 0.05327 0.06

4.3 Problems

One problem with YOLT is that when ships lie close together it does not detect them
well; one reason is the image size used in the training stage. Figure 12 presents some
examples.

Fig. 12. Problems with ships together with YOLT in MSDS.

5 Conclusions

Two datasets were used: the HRSC dataset, which contains objects of different sizes,
and the MSDS dataset, which was built for this project and contains small ship objects
in optical satellite imagery.
YOLO and YOLT were compared on the HRSC and MSDS datasets. On the MSDS
dataset, YOLT outperformed YOLO with 76% versus 69% AP, respectively.
Meanwhile, on the HRSC dataset, YOLO outperformed YOLT with 75% versus 40%
AP, respectively.
Therefore, this work demonstrates that YOLT is a good framework for detecting small
objects, but it fails in other cases. Faster R-CNN was also tried, but it did not work for
very small objects.

Acknowledgment. This research was supported by the Universidad Nacional de San
Agustín de Arequipa, Contract IBA-0032-2017-UNSA, as part of the project "Detection
of industrial fishing vessels within 5 miles of the Arequipa Region using high
performance computing and satellite images". Thanks to CiTeSoft, Contract
EC-0003-2017-UNSA, for the equipment and resources provided to the project.

References
1. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft COCO: common objects in context. In: European Confer-
ence on Computer Vision, pp. 740–755. Springer (2014)
2. Fu, K., Li, Y., Sun, H., Yang, X., Xu, G., Li, Y., Sun, X.: A ship rotation detection
model in remote sensing images based on feature fusion pyramid network and deep
reinforcement learning. Remote Sens. 10(12), 1922 (2018)
3. Dong, C., Liu, J., Xu, F.: Ship detection in optical remote sensing images based
on saliency and a rotation-invariant descriptor. Remote Sens. 10(3), 400 (2018)
4. Li, S., Zhang, Z., Li, B., Li, C.: Multiscale rotated bounding box-based deep learn-
ing method for detecting ship targets in remote sensing images. Sensors 18(8),
2702 (2018)
5. Yang, X., Sun, H., Sun, X., Yan, M., Guo, Z., Fu, K.: Position detection and
direction prediction for arbitrary-oriented ships via multitask rotation region con-
volutional neural network. IEEE Access 6, 50839–50849 (2018)
6. Zhang, R., Yao, J., Zhang, K., Feng, C., Zhang, J.: S-CNN-based ship detection
from high-resolution remote sensing images. In: International Archives of the Pho-
togrammetry, Remote Sensing & Spatial Information Sciences, vol. 41 (2016)
7. Zhang, S., Wu, R., Xu, K., Wang, J., Sun, W.: R-CNN-based ship detection from
high resolution remote sensing imagery. Remote Sens. 11(6), 631 (2019)
8. Liu, Y., Cui, H.-Y., Kuang, Z., Li, G.-Q.: Ship detection and classification on
optical remote sensing images using deep learning, vol. 12, p. 05015. EDP Sciences
(2017)
9. Ma, M., Chen, J., Liu, W., Yang, W.: Ship classification and detection based on
CNN using GF-3 SAR images. Remote Sens. 10(12), 2043 (2018)
10. Zhao, H., Zhang, W., Sun, H., Xue, B.: Embedded deep learning for ship detection
and recognition. Future Internet 11(2), 53 (2019)
11. Eggert, C., Brehm, S., Winschel, A., Zecha, D., Lienhart, R.: A closer look: small
object detection in faster R-CNN. In: 2017 IEEE International Conference on Mul-
timedia and Expo (ICME), pp. 421–426. IEEE (2017)
12. Yang, X., Sun, H., Fu, K., Yang, J., Sun, X., Yan, M., Guo, Z.: Automatic ship
detection in remote sensing images from google earth of complex scenes based
on multiscale rotation dense feature pyramid networks. Remote Sens. 10(1), 132
(2018)
13. Corbane, C., Pecoul, E., Demagistri, L., Petit, M.: Fully automated procedure
for ship detection using optical satellite imagery. In: Remote Sensing of Inland,
Coastal, and Oceanic Waters, vol. 7150. International Society for Optics and Pho-
tonics (2008)
14. Inggs, M.R., Robinson, A.D.: Ship target recognition using low resolution radar
and neural networks. IEEE Trans. Aerosp. Electron. Syst. 35(2), 386–393 (1999)
15. Bi, F., Zhu, B., Gao, L., Bian, M.: A visual search inspired computational model
for ship detection in optical satellite images. IEEE Geosci. Remote Sens. Lett. 9(4),
749–753 (2012)
16. Shi, H., Zhang, Q., Bian, M., Wang, H., Wang, Z., Chen, L., Yang, J.: A novel ship
detection method based on gradient and integral feature for single-polarization
synthetic aperture radar imagery. Sensors 18(2), 563 (2018)
17. Cheng, G., Han, J.: A survey on object detection in optical remote sensing images.
ISPRS J. Photogramm. Remote Sens. 117, 11–28 (2017)
Small Ship Detection on Optical Satellite Imagery with YOLO and YOLT 677

18. Liu, Z.K., Weng, L.B., Yang, Y.P., et al.: A high resolution optical satellite image
dataset for ship recognition and some new baselines (2017)
19. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv (2018)
20. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. arXiv preprint
arXiv:1612.08242 (2016)
21. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified,
real-time object detection. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 779–788 (2016)
22. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accu-
rate object detection and semantic segmentation. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
23. Girshick, R.: Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015)
24. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object
detection with region proposal networks. In: Advances in Neural Information Pro-
cessing Systems, pp. 91–99 (2015)
25. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.:
SSD: single shot multibox detector. In: European Conference on Computer Vision,
pp. 21–37. Springer (2016)
26. Uijlings, J.R.R., Van De Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective
search for object recognition. Int. J. Comput. Vision 104(2), 154–171 (2013)
27. Liu, Z., Yuan, L., Weng, L., Yang, Y.: A high resolution optical satellite image
dataset for ship recognition and some new baselines. In: ICPRAM, pp. 324–331
(2017)
28. Van Etten, A.: You only look twice: rapid multi-scale object detection in satellite
imagery. CoRR, abs/1805.09512 (2018)
29. Van Etten, A.: Satellite imagery multiscale rapid detection with windowed net-
works. CoRR, abs/1809.09978 (2018)
30. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
31. Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J.,
Zisserman, A.: The Pascal visual object classes challenge: a retrospective. Int.
J. Comput. Vision 111(1), 98–136 (2015)
A Text Classification Model to Identify
Performance Bonds Requirement
in Public Bidding Notices

Urias Cruz da Cunha(B) , Ricardo Silva Carvalho(B) ,


and Alexandre Zaghetto(B)

University of Brasília (UnB), Brasília, Brazil


uriascruz@gmail.com, ricardosc@gmail.com, zaghetto@unb.br

Abstract. A performance bond is a means of guaranteeing that a prod-


uct will be delivered by the seller in a timely and workmanlike
manner. To verify whether a performance bond has been laid down in
a bidding notice, an assessment through an internal auditing process is
made periodically, but it can be costly and take longer than the avail-
able time to conclude an audit engagement. We propose to explore and
apply algorithms to create a model able to identify bidding notices that
demand performance bonds, to make the assessment process more effi-
cient. We applied four different classification algorithms (SVM, kNN,
Random Forest, and Naive Bayes) on two different vector space rep-
resentations (term frequency and term frequency inverse-document fre-
quency). Random Forest with term-frequency produced the best model,
which achieved F1 -score of 0.933 on the test set. The promising results
show the model created could help decrease the time and effort spent on
verifying the performance bonds requirement on bidding notices.

Keywords: Public bidding notice · Performance bonds · Text


mining · Classification · Machine learning · Random forest · SVM ·
K-Nearest Neighbors · Multinomial Naive Bayes · Term frequency ·
Inverse document frequency

1 Introduction
A performance bond is a “bond taken out by the contractor (the obliging party),
usually with a financial institution or insurer, for the benefit and at the request of
the employer (the offering party), in a defined sum of liability and enforceable by
the employer in the event of the contractor’s default” [1,2]. It aims to guarantee
that a product (construction, service or goods) will be delivered in a timely and
workmanlike manner, otherwise, the contractor should pay an indemnity to the
buyer.
In Brazil, government procurement is regulated – among other rules and
laws – by the Law 8,6661 which establishes in its article 56 that “[...] in each
1 Lei nº 8.666, de 21 de junho de 1993 – http://www.planalto.gov.br/ccivil_03/Leis/l8666cons.htm.
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 678–691, 2020.
https://doi.org/10.1007/978-3-030-39442-4_50

case, provided it is laid down in the bid invitation, it is possible to demand


performance bonds in procurements of constructions, services and goods [...]”.
Seeking to diminish the risks involved in procurement, public agents usually set
out a clause in public bidding notices imposing the presentation of a performance
bond or similar guarantee on contractors. According to the law (see footnote 1),
the value of a bond can reach up to 5% of the contract value and, for some special
cases set out in law, this percentage can go up to 10%. To verify whether compli-
ance concerning procurement has been achieved, assessments through internal
auditing process are applied from time to time. One of the objectives of this
process is to check if the performance bond or similar guarantee laid down in a
bidding notice was presented by the contractor and adequately recorded by the
buyer (a public institution).
When assessing procurement compliance, internal auditors are required to
read the bidding notices looking for paragraphs and excerpts which could deter-
mine the obligation of presenting a performance bond and, in case of identifying
the obligation, to verify whether or not performance bonds have been recorded
and are valid. The process of reading and identifying is costly and can take
longer than the available time to conclude an internal audit engagement. To
deal with this limitation, usually auditors review only a small sample of the
bidding notices, thereby reducing comprehensiveness of the analysis.
In order to improve the auditors’ results without compromising on audit
deadline, machine learning classification algorithms can be used in this process.
Instead of having auditors read bidding notices, we could classify bidding notices
as demanding or non-demanding of performance bonds through a classification
model. Once the bidding notices are classified, internal auditors will only have
to check whether corresponding performance bonds have been recorded and are
valid, decreasing the time and effort spent on analyses.
Therefore, we propose to explore and apply classification algorithms to create
a model able to identify bidding notices that demand performance bonds. The
study uses the bidding notices published by the Central Bank of Brazil (CBB)
from 2014 to 2018. Four different classification algorithms were selected based
on previous studies in text mining [3–7]: Support Vector Machines (SVM), k-
Nearest Neighbors (k-NN), Random Forest and Naive Bayes. The chosen model
was the one that yielded the best performance concerning the F1 -score metric.
The remainder of this article is structured as follows: Sect. 2 provides infor-
mation about the state-of-the-art, well recognized, and published algorithms
concerning text classification; Sect. 3 presents some studies concerning text min-
ing, specifically text categorization; Sect. 4 discusses the methodology adopted;
Sect. 5 shows the results obtained in this study; and Sect. 6 presents the conclu-
sion and possible future works.

2 State of the Art


Text mining is a method built upon different disciplines such as informa-
tion retrieval, machine learning, and statistics [8]. It may be defined as the

employment of algorithms and methods originated from these disciplines with


the objective of finding useful text patterns [9].
Typical text mining tasks are categorized as classification (categorization),
clustering, summarization, information extraction (IE), information retrieval
(IR), topic tracking, concept linking, question answering, and information visu-
alization [8]. In this section, we will focus on the classification task since it is
related to the goal of the study.

2.1 Text Categorization

Text classification is the problem of constructing models that can classify new
documents into predefined classes [10]. The process of modeling commonly com-
prises six steps: (1) data acquisition, (2) data analysis and labeling, (3) feature
construction and weighting, (4) feature selection and projection, (5) model train-
ing, and (6) solution evaluation.
In the data acquisition stage, data necessary to solve a research objective is
acquired. The next stage consists of data analysis and labeling, where parts of a
text or groups of text are labeled according to their classes. The following step
is feature construction and weighting, in which data is represented in a manner
that is the most appropriate to applying machine learning algorithms. The most
common representations are vector space representation, in which a document
is represented by a vector of weights and features, and graph representation,
where a document is modeled in graph form [10].
The fourth stage is characterized by the selection of the most relevant features
and disposal of remaining ones, and by transformation of the features’ vector
space [10]. In the fifth stage, the classification models are trained. The algorithms
used in this phase can be grouped in the following categories: supervised learning
– machine learning process that trains a function by using labeled data; semi-
supervised learning – it uses both labeled and unlabeled data to perform a
supervised or an unsupervised task; ensemble learning – it consists of training
multiple classifiers and combining their outcomes in order to build a committee
of decision makers; active learning – when a training algorithm queries a data
provider to label additional training instances; and transfer learning – when a
learning mechanism improves the performance of a task based on the knowledge
acquired in a different but related task [10].
The last stage consists of evaluating the models generated in the previous
stage and choosing the one that yielded the best performance [10]. The evaluation
is done by calculating a metric or a set of metrics for each model and comparing
them.
In this study, we adopt term frequency (TF) and term frequency-inverse
document frequency (TF-IDF) vector space representations. To generate mod-
els, we adopt only supervised learning algorithms, since our training dataset is
composed of labeled samples. Lastly, given that our target classes are slightly
imbalanced, we adopted the F1 -score metric to determine the best model.

2.2 Vector Space Representation


As previously mentioned, one of the steps in text classification is the transfor-
mation of the content of a set of documents (corpus) into a structure so that the
documents can be categorized by a classifier [11]. In a vector space model, a doc-
ument is represented as a vector of terms, dj = {w1j , ..., wkj }, where |dj | is the
total number of terms, a.k.a. features, and wk,j is the weight of term tk , repre-
senting how much the term tk contributes to the semantics of the document dj .
Term weighting methods can be classified into two categories based on the
use of known information present in training documents: supervised term weight-
ing methods and unsupervised term weighting methods [11]. Among the methods
classified as unsupervised, TF and TF-IDF are considered traditional ones.
TF and TF-IDF methods come from the information retrieval domain. A rep-
resentation based on TF is built to capture the frequency of a word, or term, in a
document [12]. It measures the association of a term t with respect to a given doc-
ument d [13] and can be calculated using $tf_{k,j}(d_j, t_k)$, a function that returns
the term frequency (the number of occurrences) of term $t_k$ in document $d_j$.
IDF, in turn, represents the scaling factor, or the importance, of a term with
respect to the corpus. It assumes that the importance of a term is inversely
proportional to the frequency of this term in all documents. It was proposed
based on the heuristic intuition that terms which occur in many documents are
not as good discriminators as terms that occur in only a few documents and,
therefore, should receive a smaller weight than terms with lower occurrences [14].
It can be calculated using Eq. 1, where N is the number of documents in the
corpus and $df_k$ is the number of documents that contain the term $t_k$:

$idf_k(t_k) = \log\left(\frac{N}{df_k(t_k)}\right)$  (1)
Finally, we have TF-IDF, which is basically a combination of TF and IDF
term weightings. It takes into account the frequency of a term in a document
and the occurrence of this term along all documents in the corpus, combining
both information. Equation 2 shows how to calculate the weight of the term tk
in the document dj using TF-IDF:

$w_{k,j}(d_j, t_k) = tf_{k,j}(d_j, t_k) \times idf_k(t_k)$  (2)

where $tf_{k,j}(d_j, t_k)$ represents the term frequency and $idf_k$ represents the inverse
document frequency, as described in Eq. 1.
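As an illustration of Eqs. 1 and 2, the following minimal Python sketch computes TF and TF-IDF weights for a small hypothetical corpus; the toy documents and variable names are ours and do not come from the study's data.

import math
from collections import Counter

# Hypothetical toy corpus; each document is a list of (already pre-processed) terms.
corpus = [["performance", "bond", "bidding"],
          ["bidding", "notice", "notice"],
          ["contract", "bond"]]

N = len(corpus)                        # number of documents in the corpus
tf = [Counter(doc) for doc in corpus]  # tf_{k,j}: raw term counts per document

# df_k: number of documents containing term t_k
df = Counter(term for doc in corpus for term in set(doc))

# idf_k(t_k) = log(N / df_k(t_k)), as in Eq. 1
idf = {term: math.log(N / df[term]) for term in df}

# w_{k,j} = tf_{k,j} * idf_k, as in Eq. 2
tfidf = [{term: count * idf[term] for term, count in doc_tf.items()} for doc_tf in tf]

print(tfidf[0])  # weights of the terms of the first toy document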

2.3 Classification Algorithms


We carried out a short search in order to select the modeling algorithms. In the
study conducted in [3], the authors applied the Naive Bayes classifier with a novel
feature weight-enhancing method seeking the performance improvement of the
classifier in text mining tasks. In order to compare the Naive Bayes performance
with the performance of a state-of-the-art algorithm, the authors also applied the
SVM algorithm.

A research carried out in [4] proposed the improvement of public com-


plaints classification by adoption of a novel term weighting method named Term
Frequency-Inverse Gravity Moment (TF-IGM). To evaluate the potential of this
new vector space representation, Naive Bayes, SVM, and k-NN algorithms were
applied.
Another research [6] conducted an analysis of the impact of natural lan-
guage processing features (stop words, stemming, and a combination of both)
on predictive performance of the base classifiers Naive Bayes, SVM, k-NN, and
J48. According to the authors, these algorithms were selected because they are
frequently successfully used by researchers in text mining.
Based on the results of these works, we could see that Naive Bayes, k-NN,
and SVM have been used as baseline algorithms in text mining realm. In this
study, we applied the baseline algorithms described in [3,4,6]. We also added
the Random Forest algorithm to the list for the purpose of evaluating a different
algorithm from those listed in the mentioned studies. The choice was made taking
into account the promising results presented by Random Forest classifiers in
recent works related to text classification [5,7].
Support Vector Machines is an effective method with a solid theoretical foun-
dation which can achieve high prediction accuracy in a classification task by
learning the optimal hyperplane (maximum-margin) capable of separating obser-
vations into two different classes [15]. Some of the good features of SVM are its
high robustness and generalization ability even when trained with a few samples.
One drawback is the high complexity of the training algorithm, which adopts
quadratic programming, thus having a time complexity of $O(n^3)$.
Classified as a similarity-based machine learning technique, the k-Nearest
Neighbor algorithm uses the training instances to classify new instances [16].
When classifying a new instance, it searches the k most similar training instances
to the new instance and then classifies the new instance based on the predomi-
nant class among those k neighbors. To find the k-Nearest Neighbors, each new
instance is compared to each training instance.
According to the author in [17], “Random Forest is a classifier consisting of
a collection of tree-structured classifiers $\{h(x, \Theta_k), k = 1, \ldots\}$, where $\{\Theta_k\}$ are
independent and identically distributed random vectors and each tree casts one
vote”. During the classification, the new observation is assigned to the class that
receives the majority of votes, i.e., the most popular class cast by the trees [17].
It is worth mentioning that random forests can also be used for regression and
is categorized as an ensemble algorithm.
The authors in [3] state that the Naive Bayes classifier is a probabilistic
classifier which assumes that all attributes (features) of the instances are inde-
pendent of each other given the class of the instance. Despite the independence
assumption, Naive Bayes has shown good performance in different fields. It is
based on the Bayes theorem, which can be stated by the following equation:

$p(c \mid d_j) = \frac{p(d_j \mid c)\, p(c)}{p(d_j)}$  (3)

In the context of text classification, p(c|dj ) is the probability that the docu-
ment dj belongs to the class c [3].
Other novel algorithms, such as Deep Learning, Hierarchical Deep Learning
for Text Classification, and Convolutional Neural Network, were not adopted in
this study because they present high complexity, with increased computational
effort and, since our dataset is not large, their implementation could also lead
to overfitting [18].
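As a minimal sketch (not the authors' exact configuration), the four algorithms above are available in scikit-learn and can be instantiated as follows; the hyperparameter values shown are defaults or illustrative choices.

from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Illustrative instantiation of the four classifiers discussed in Sect. 2.3.
classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Support Vector Machine": LinearSVC(C=1.0),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

# Each classifier exposes the same fit/predict interface:
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)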

2.4 F1 -Score Metric

In order to evaluate and compare the performance of different models, it is


necessary to make use of a metric. The F1 -score is the “harmonic mean of recall
and precision, i.e., F1 = 2pr/(p + r), where p is precision and r is recall” [19].
Precision and recall are defined by $p = \frac{TP}{TP+FP}$ and $r = \frac{TP}{TP+FN}$, respectively,
where TP stands for True Positives – positive examples correctly classified –, FN
stands for False Negatives – positive examples incorrectly classified as negative
–, FP stands for False Positives – negative examples incorrectly classified as
positive.
F1 -score is considered a better choice when the prior probabilities of the
classes are very different, that is, classes are imbalanced [20].

3 Related Works

The work in [21] identified that many enterprise systems have to handle a large
pool of documents. Manual verification of those documents can take a huge
amount of time. Then, they proposed a rule-based framework that enables auto-
matic verification of document-based systems. The objective of the framework
is to reduce manual intervention as much as possible such that an automated
support for document verification can be provided to the customers.
In [22], researchers sought to investigate the downstream potential of
biomedicine research in Indonesia based on scientific publications. To achieve
their objectives, they built a classification model designed to classify a document
into four different classes. For this, they applied four different classifiers: NB, NB
(kernel), k-NN, and SVM. The k-NN model showed better performance than the
other classifiers in accuracy, achieving 84.85%.
In [23], the authors proposed the development of a classifier able to identify com-
ments and posts which are racist and malicious. To train the model they used a
set of documents which contains different antisocial texts collected from various
blogs, speeches, and articles. The algorithm used to create the model was k-NN.
From the experimental results, the authors concluded that the task of detecting
antisocial texts can be accomplished through text mining.

Researchers, in [24], conducted a study to analyze the performance of multi-


ple classifiers in the context of text classification. In this research, seven different
algorithms were applied to three datasets. The metric used for evaluation was
accuracy. The classifier based on Artificial Neural Network showed the best per-
formance with accuracy varying from 89.0% to 94.5%.
Although the above studies deal with text classification, none of them focuses
on classifying public bidding notices as laying down or not laying down
performance bonds as a requirement. The present study uses similar algorithms
to those used in the previous studies, but focuses on classifying public bidding
notices.

4 Methodology

The methodology applied in this study consists of four main steps: (1) Document
gathering – identification and collection of bidding notices (documents); (2) Pre-
processing – text extraction, restructuring, cleansing and transformation; (3)
Modeling – creation and evaluation of the models; and (4) Model evaluation –
estimation of how well the chosen model will work with new data. Figure 1 shows
the execution order of the steps.

Fig. 1. Steps applied to build and evaluate the models.

4.1 Document Gathering

The data used in this work consists of 478 bidding notices retrieved from the
e-BC system – a CBB’s internal document management system – and from

the Brazilian government purchases portal2 . These are the repositories where
the documents of interest can be found. The bidding notices were published
from 2014 to 2018. They are distributed in two classes with percentages of 37%
and 63%, as we can see in Fig. 2. The labels of the data were obtained from
another CBB’s internal system called Sistema de Administração de Instrumentos
Contratuais (SAIC). Most of the files were available in PDF format whilst the
rest of them were available in DOCX format.

Fig. 2. Number and percentage of bidding notices per class.

4.2 Pre-processing

Once we had all files downloaded, we started the text extraction from both PDF
and DOCX files. For PDF files, we used the program pdftotext.exe available
in the open source toolkit Xpdf3 . The program takes a PDF file as input and
outputs a TXT file containing the PDF’s file text content. For the DOCX files,
we used the Python’s library python-docx 4 , which can be used for creating and
updating Microsoft Word (.docx) files. Differently from pdftotext, to extract text
from a DOCX file using python-docx, we need to go through every element in
the file, such as paragraphs, tables, rows, and cells, extract their content and
concatenate them into a string. After all iterations, we save the string into a
new file. At the end of this process, we have got a set of TXT files, each one
containing the text of its corresponding PDF or DOCX file.
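A minimal sketch of this DOCX extraction step, assuming the python-docx library; the file names are illustrative.

from docx import Document

def extract_docx_text(path):
    """Concatenate the text of paragraphs and table cells of a .docx file."""
    doc = Document(path)
    parts = [p.text for p in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                parts.append(cell.text)
    return "\n".join(parts)

# Illustrative usage: save the extracted content as a TXT file.
text = extract_docx_text("bidding_notice.docx")
with open("bidding_notice.txt", "w", encoding="utf-8") as out:
    out.write(text)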
After extracting the data, it was necessary to restructure the text retrieved
from the PDF files. The reason is that, differently from the texts retrieved from
DOCX files, the texts obtained from PDF files come with hyphenation, so that
we need to identify where hyphenation occurred and try to rebuild the word
correctly, removing the hyphen and the newline character between the two parts
of the word.

2 https://www.comprasgovernamentais.gov.br/.
3 http://www.xpdfreader.com/.
4 http://python-docx.readthedocs.io/en/latest/.
Once all texts from all files were equally structured, that is, without
hyphenation and newline characters between parts of a word, we started the
process of cleansing and transformation. Firstly, we standardized the encoding
of all texts in order to be able to work with them indistinctly. The next step was to
remove diacritical marks, such as acute (´) and cedilla (¸). After that we removed
words, expressions, abbreviations, numbers etc. that are non-informative in the
context of the problem. The stop words for Portuguese language available in
the Natural Language Toolkit5 (NLTK) were also identified and removed from
the texts. Finally, we used stemming to remove the suffixes, keeping only the
roots, or stems, of the words. Most of the tasks of cleansing were done with
the use of regular expressions, whereas stemming was done using the method
RSLPStemmer of NLTK, which stands for Removedor de Sufixos da Língua
Portuguesa, or in English, Portuguese Language Suffixes Remover.
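A rough sketch of the cleansing and stemming steps described above, assuming the NLTK stop-word and RSLP resources are available for download; the regular expression and example sentence are illustrative rather than the exact ones used in the study.

import re
import unicodedata
import nltk
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer

nltk.download("stopwords")  # Portuguese stop-word list
nltk.download("rslp")       # RSLP (Portuguese) stemmer rules

def strip_accents(text):
    # Remove diacritical marks such as the acute accent and the cedilla.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

stop_words = {strip_accents(w) for w in stopwords.words("portuguese")}
stemmer = RSLPStemmer()

def clean_text(text):
    text = strip_accents(text.lower())
    # Keep only alphabetic tokens (illustrative; the study also removed
    # abbreviations, numbers and other non-informative expressions).
    tokens = re.findall(r"[a-z]+", text)
    # Remove stop words and reduce the remaining words to their stems.
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

print(clean_text("A garantia de execução será exigida no edital."))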

4.3 Modeling
After all documents were properly cleansed and transformed, we started the
modeling phase. First, the data were split into training and test sets in a pro-
portion of 3:1 using a stratified fashion with the target class distribution as
reference. After that, the data were transformed into TF and TF-IDF vector
space representations. To build a TF representation, we applied the CountVec-
torizer method from scikit-learn 6 library on the documents, regardless of the
modeling algorithm used. CountVectorizer method converts a collection of text
documents to a matrix of token counts. To build a TF-IDF, we also applied
the TfidfTransformer method, which transforms a count matrix to a normalized
TF-IDF representation according to the equation $\text{tf-idf}(d, t) = tf(t) \times idf(t)$, where
t is the term, d is the document, and $idf(d, t) = \log\left(\frac{n+1}{df(t)+1}\right) + 1$, where n
is the total number of documents and df(d, t) is the number of documents that
contain the term t. Here, we can see that the equation applied by the scikit-learn
library differs slightly from Eq. 2 mentioned in Sect. 2. According to the scikit-
learn library documentation7, the default parameter smooth_idf=True adds “1”
to both numerator and denominator in order to prevent zero divisions.
Each vector space was used separately as a training dataset so that all four
algorithms were trained with TF and with TF-IDF representations. For each
algorithm, a set of hyper-parameters were defined and all possible combinations
were tested in order to improve the chances of generating the model that could
yield the best performance possible. To accomplish this, we used the methods
Pipeline and GridSearchCV from scikit-learn library. F1 -score metric and cross-
validation with 10-fold were used to estimate the models’ performance. The best
model was the one that presented the highest metric value, regardless of the algo-
rithm or vector space being used. Table 1 presents the sets of hyperparameters
of each algorithm.

5 https://www.nltk.org/.
6 http://scikit-learn.org/stable/index.html.
7 http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction.

Table 1. Set of hyperparameters for each algorithm

Algorithm                Hyperparameters
Multinomial Naive Bayes  Default
k-Nearest Neighbors      n_neighbors: [3, 5, 7, 9]
                         weights: [‘uniform’, ‘distance’]
                         algorithm: [‘ball_tree’, ‘kd_tree’]
                         leaf_size: [20, 30, 50]
                         p: [1, 2]
                         n_neighbors: [3, 5, 7, 9]
                         weights: [‘uniform’, ‘distance’]
                         algorithm: [‘brute’]
                         p: [1, 2]
Support Vector Machines  C: [0.1, 0.5, 1, 1.5, 2, 2.5, 5, 10]
                         penalty: [‘l1’]
                         loss: [‘squared_hinge’]
                         dual: [False]
                         C: [0.1, 0.5, 1, 1.5, 2, 2.5, 5, 10]
                         penalty: [‘l2’]
                         loss: [‘squared_hinge’, ‘hinge’]
                         dual: [True]
Random Forest            max_features: [None, ‘sqrt’]
                         max_depth: [None, 200, 50]
                         min_samples_split: [2, 10]
                         min_samples_leaf: [1, 10]
                         n_estimators: [10, 50, 100]
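As a rough sketch of the modeling step (not the exact code used in the study), a TF pipeline with the Random Forest classifier can be tuned with GridSearchCV as follows; the shortened parameter grid and variable names are illustrative.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# texts_train: list of cleansed bidding-notice strings; y_train: 1 if a bond is required.
pipeline = Pipeline([
    ("tf", CountVectorizer()),          # term-frequency representation
    ("clf", RandomForestClassifier()),
])

param_grid = {                          # shortened, illustrative grid
    "clf__max_depth": [None, 200, 50],
    "clf__n_estimators": [10, 50, 100],
    "clf__min_samples_leaf": [1, 10],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=10)
# search.fit(texts_train, y_train)
# print(search.best_params_, search.best_score_)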

4.4 Model Evaluation

Once we determined the best model regarding the performance obtained within
the cross-validation, i.e., the model whose F1 -score was the highest, we estimated
the out-of-sample error using the test dataset. The main goal of evaluating the
performance of the final model against the test set is to verify if the model did
not overfit the training data and how well it will perform when classifying new
documents.

5 Results
In this section, we present the best result of each algorithm with the respective
set of parameters. The algorithms’ results were presented in ascending order
according to the performance obtained in cross-validation. All methods used to
fit the data are from scikit-learn library. Table 2 summarizes the performance
obtained in all four algorithms. Metric values are from cross-validation applied
on the training set.

Table 2. Summary of results

Algorithm Vector space representation F1-score (CV = 10)


Multinomial Naive Bayes TF 0.888
k-Nearest Neighbors TF 0.905
Support Vector Machines TF-IDF 0.924
Random Forest TF 0.952

The fourth highest performance was obtained with Multinomial Naive Bayes
classifier using TF representation. The F1 -score obtained was 0.888 and the
method used to fit the data was MultinomialNB.
The third best performance was obtained with k-NN on TF representa-
tion. The F1 -score obtained was 0.905 and the method used to fit the data
was KNeighborsClassifier. The set of parameters of k-NN’s best model was
{algorithm: ball_tree, leaf_size: 20, n_neighbors: 9, p: 1, weights: uniform}.
The second best performance was obtained with SVM on TF-IDF represen-
tation. The F1 -score obtained was 0.924 and the method used to fit the data
was LinearSVC. The set of parameters of SVM’s best model was {C: 1.5, dual:
False, loss: squared_hinge, penalty: l1}.
Finally, the best performance was obtained with Random Forest on TF rep-
resentation. The F1 -score obtained was 0.952 and the method used to fit the
data was RandomForestClassifier. The set of parameters of Random Forest’s
best model was {max_depth: None, max_features: None, min_samples_leaf: 10,
min_samples_split: 2, n_estimators: 100}. When classifying new documents with
the test set, the performance obtained was 0.933. Table 3 presents a confusion

Table 3. Confusion matrix for random forest classifier

               Predicted
               No    Yes   Total
True   No      40    5     45
       Yes     5     70    75
Total          45    75    120

matrix generated from the classification performance on the test set for the Random
Forest classifier.
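Taking the requirement of a bond (“Yes”) as the positive class in Table 3, the reported test score can be checked directly: $p = \frac{TP}{TP+FP} = \frac{70}{75} \approx 0.933$, $r = \frac{TP}{TP+FN} = \frac{70}{75} \approx 0.933$, and hence $F_1 = \frac{2pr}{p+r} \approx 0.933$, which matches the test-set value reported above.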
It is worth noting that the difference between the values obtained in the
cross-validation and the final evaluation (test set) was only 0.019. This is
strong evidence that the final model did not overfit. It can be related to the
fact that bagging approaches – which are used by the Random Forest algorithm
– usually lead to a reduction of variance, consequently reducing the overall
generalization error. This hypothesis is supported by the fact that the value of
the parameter n estimators used in the final model was 100, that is, the biggest
value possible according to the set of options available (see Table 1).

6 Conclusion

This paper reports the application of machine learning classification algorithms


in the context of text mining. The objective was to create a model able to identify
the requirement of performance bonds in public bidding notices.
Several tasks were carried out sequentially in order to create the final model.
Tasks involving download of documents, text extraction, text cleansing and
transformation, vectorization, modeling and evaluation were performed.
Four different classification algorithms were evaluated: Multinomial Naive
Bayes, k-NN, SVM, and Random Forest; and two vector space representations
were adopted: TF and TF-IDF. For validation, cross-validation with 10-fold and
F1 -score metric were used. The final model achieved a performance of 0.933 on
the test set concerning F1 -score metric. The model based on Random Forest algo-
rithm outperformed the other models based on baselines algorithms, confirming
the hypothesis that ensemble models usually outperform single models [25].
The vector space representation used in the final model was TF, which sur-
passed the TF-IDF representation in 3 out of 4 algorithms. The most likely hypoth-
esis to explain this result is that the size of the paragraph which contains
a performance bond requirement (when it is required in the bidding notice)
usually does not vary with the size of the document; under TF-IDF, as
a document grows (the number of words gets larger), the weight of the
words capable of determining the class of the document gets smaller. As a con-
sequence, the models generated with the TF-IDF representation are more sensitive
to the document’s size than the models generated with TF.
Given that the results are promising, that is, the final model achieved high
performance, we consider its deployment feasible in a production environment.
We expect that the model’s use in bidding notices assessment process will
decrease time and effort spent on verifying performance bonds requirement, con-
sequently increasing comprehensiveness of the analysis.
One drawback to a simple classification model is that it doesn’t bring the
excerpt that defines the bond’s requirement. The absence of this piece of infor-
mation may compel auditors to read the bidding notice in order to confirm that a
positive classification assigned by the model is correct, therefore, slowing down
the process and making the use of classification models not as effective as it

could be. One solution to this issue could be the creation of a second model that
identifies the excerpt which defines the requirement of a bond. Combining these
two models by means of a pipeline, the final solution would not only classify a
bidding notice correctly but it would also present to the auditor the piece of text
used to determine the class.
Future work should focus on verifying performance when applying different
vector space models, such as binary and latent semantic indexing (LSI), and new
term weighting methods, such as bi-normal separation (BNS). Another area of
improvement could be the use of n-grams with a different number of items,
instead of only unigrams, as done in this study.

References
1. Hassan, A., Adnan, H.: IOP Conference Series: Earth and Environmental Science,
vol. 117 (2018)
2. Supardi, A., Yaakob, J., Adnan, H.: Performance bond: conditional or uncondi-
tional, MPRA Paper 34007. University Library of Munich, Germany, revised 2009
(2009). https://ideas.repec.org/p/pra/mprapa/34007.html
3. Kim, S.B., Han, K.S., Rim, H.C., Myaeng, S.H.: Some effective techniques for Naive
Bayes text classification. IEEE Trans. Knowl. Data Eng. 18(11), 1457 (2006)
4. Mahfud, F.K.R., Tjahyanto, A.: 2017 International Conference on Sustainable
Information Engineering and Technology (SIET), pp. 220–225 (2017)
5. Onan, A., Korukoğlu, S., Bulut, H.: Ensemble of keyword extraction methods and
classifiers in text classification. Expert Syst. Appl. 57, 232 (2016)
6. Srivasatava, S.K., Kumari, R., Singh, S.K.: 2017 International Conference on Com-
puting, Communication and Automation (ICCCA), pp. 345–349 (2017)
7. Wang, R., Chen, G., Sui, X.: Multi label text classification method based on co-
occurrence latent semantic vector space. Procedia Comput. Sci. 131, 756 (2018)
8. Souza, E., Costa, D., Castro, D.W., Vitório, D., Teles, I., Almeida, R., Alves, T.,
Oliveira, A.L.I., Gusmão, C.: Characterising text mining: a systematic mapping
review of the Portuguese language. IET Software 12(2), 49 (2018)
9. Hotho, A., Nürnberger, A., Paass, G.: A brief survey of text mining. LDV Forum
GLDV J. Comput. Linguist. Lang. Technol. 20, 19 (2005)
10. Mirończuk, M.M., Protasiewicz, J.: A recent overview of the state-of-the-art ele-
ments of text classification. Expert Syst. Appl. 106, 36 (2018)
11. Lan, M., Tan, C.-L., Low, H.-B.: Proposing a new term weighting scheme for
text categorization. In: Proceedings of the 21st national conference on Artificial
intelligence - Volume 1 (AAAI 2006). AAAI Press, pp. 763–768 (2006)
12. Jiang, H., Li, P., Hu, X., Wang, S.: 2009 IEEE International Conference on Intel-
ligent Computing and Intelligent Systems, Shanghai, China, pp. 294–298. IEEE
(2009)
13. De Silva, J., Haddela, P.S.: 2013 IEEE 8th International Conference on Industrial
and Information Systems, Peradeniya, Sri Lanka, pp. 381–386. IEEE (2013)
14. Zhang, W., Yoshida, T., Tang, X.: 2008 IEEE International Conference on Systems,
Man and Cybernetics, Singapore, Singapore, pp. 108–113. IEEE (2008)
15. Liu, C., Wang, W., Wang, M., Lv, F., Konan, M.: An efficient instance selection
algorithm to reconstruct training set for support vector machine. Knowl.-Based
Syst. 116, 58 (2017)

16. Hochbaum, D.S., Baumann, P.: Sparse computation for large-scale data mining.
IEEE Trans. Big Data 2(2), 151 (2016)
17. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324
18. Xu, Q., Zhang, M., Gu, Z., Pan, G.: Neurocomputing (2018)
19. Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., Fienberg, S.: Adaptive name
matching in information integration. IEEE Intell. Syst. 18(5), 16 (2003)
20. Jeni, L.A., Cohn, J.F., De La Torre, F.: 2013 Humaine Association Conference on
Affective Computing and Intelligent Interaction, Geneva, Switzerland, pp. 245–
251. IEEE (2013). https://doi.org/10.1109/ACII.2013.47. http://ieeexplore.ieee.org/document/6681438/
21. Roychoudhury, S., Bellarykar, N., Kulkarni, V.: 2016 IEEE 20th International
Enterprise Distributed Object Computing Conference (EDOC), pp. 1–10 (2016)
22. Silalahi, M., Hardiyati, R., Nadhiroh, I.M., Handayani, T., Amelia, M., Rahmaida,
R.: 2018 International Conference on Information and Communications Technology
(ICOIACT), pp. 515–519 (2018)
23. Chandra, N., Khatri, S.K., Som, S.: 2017 6th International Conference on Reli-
ability, Infocom Technologies and Optimization (Trends and Future Directions)
(ICRITO), pp. 348–354 (2017)
24. Mishu, S.Z., Rafiuddin, S.M.: 2016 19th International Conference on Computer
and Information Technology (ICCIT), pp. 409–413 (2016)
25. Zeng, T., Wu, B., Ji, S.: DeepEM3D: approaching human-level performance on 3D
anisotropic EM image segmentation. Bioinformatics 33(16), 2555 (2017)
Estimating the Time-Lapse Between
Medical Insurance Reimbursement
with Non-parametric Regression Models

Mary Akinyemi, Chika Yinka-Banjo(&), Ogban-Asuquo Ugot,


and Akwarandu Nwachuku

University of Lagos, Akoka, Lagos 100213, Nigeria


cyinkabanjo@unilag.edu.ng

Abstract. Nonparametric supervised learning algorithms represent a succinct


class of supervised learning algorithms where the learning parameters are highly
flexible and whose values are directly dependent on the size of the training data.
In this paper, we comparatively study the properties of four nonparametric
algorithms, k-Nearest Neighbours (k-NNs), Support Vector Machines (SVMs),
Decision trees and Random forests. The supervised learning task is a regression
estimate of the time lapse in medical insurance reimbursement. Our study is
concerned precisely with how well each of the nonparametric regression models
fits the training data. We quantify the goodness of fit using the R-squared metric.
The results are presented with a focus on the effect of the size of the training data,
the feature space dimension and hyperparameter optimization. The findings
suggest the k-NN and SVM algorithms as better models for predicting well-
defined output labels (i.e., time lapse in days). However, overall, the decision tree
model performs better because it makes better predictions on new data points
than the ballpark estimates made by likelihood-based models: SVMs and k-NNs.

Keywords: Machine learning · Supervised learning · Non-parametric learning

1 Introduction

Supervised learning in artificial intelligence is a relatively well-defined problem of


estimating a target output label y, given an input vector x. Therefore, the goal of a
supervised learning algorithm is to learn the objective function that maps a given input
to an output, while minimizing a cost function. When the objective function to be
learned by an algorithm has well defined fixed parameters such that the learning task is
centered around estimating the values of these parameters, such an algorithm can be
classified as a parametric algorithm, and thus will return a parametric model [12].
Parametric algorithms include backpropagation, linear & logistic regression and naïve
Bayes.
Nonparametric algorithms seek to best fit the training data through a “distribution”
or (quasi) assumption-free model, whilst still maintaining some ability to generalize to
unseen data. Nonparametric algorithms include nearest neighbors (KNNs), decision
trees and support vector machines (SVMs). We shall focus on the nonparametric

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 692–704, 2020.
https://doi.org/10.1007/978-3-030-39442-4_51

algorithms and more specifically, nearest neighbors, support vector machines, decision
trees and random forests. These algorithms have been extensively studied on diverse
datasets [1–3]. We attempt to contribute to the comparative literature by studying the
algorithms performance on a medical insurance dataset. The machine learning goal is
centered around a regression model that estimates the time lapse between the date upon
which medical treatment occurs and the date when an insurance company reimburses
charges spent on the treatment.
The analysis focuses on the comparative effect of data preprocessing practices
including feature engineering and extraction, encoding and feature scaling, as well as
other important aspects of machine learning such as hyperparameter tuning using the
grid search algorithm. The R2 score is chosen as the primary scoring technique simply
because we wish to know how well each nonparametric regression model estimates the
real data points, that is, the goodness of fit.
Section 2 of this paper presents a brief but precise literature review under the
moniker related work, here we discuss some important comparative studies that make
use of nonparametric algorithms in various applications. Section 3 presents our
methodology for the comparisons, we also discuss the features of the dataset with
important data analysis and visualization. Section 4 presents our results, Sect. 5 dis-
cusses the results, and finally we conclude in Sect. 6.

2 Related Work

The “no free lunch” theorem stated by David Wolpert [4] shows that there is
no single best supervised learning algorithm. What this means is that one
can only say that an algorithm performs better in some task domains with a well-
defined target label y and less so in other domains; this notion is key when carrying out
comparative studies. The goal of a comparative analysis is not to state that one algo-
rithm is better than the other but rather it is to reinforce certain expectations about the
performance of an algorithm given a certain task. Most researchers are aware of the
lack of a priori distinctions between supervised learning algorithms, and that the per-
formance of an algorithm highly depends on the nature of the target label y. This is
seen in the way the results from comparative studies are presented. In [5], the authors
found that there is no universal best algorithm; however, using datasets from different
substantive domains their results showed that boosted trees followed by SVMs had the
best overall performance while logistic regression and decision trees performed the
worst. Still based on the same prediction of recidivism, the results from [6], showed
that when a subset of predictors was used, traditional techniques such as logistic
regression performed better, while random forest performed the best on the whole set of
predictors. More recent studies have shown again that the sample size is a key factor
when comparing the performance of older algorithms such as logistic regression with
newer algorithms like logitboost [7]. Research on breast cancer detection in [9] used
five machine learning algorithms to separately classify a multidimensional image
dataset and the results were compared. The five machine learning classifiers used on
the Shearlet-transformed images were Support Vector Machines (SVM), Naive
Bayes, Multilayer perceptron, k-Nearest Neighbour, and Linear discriminant analysis

classifier. The results conclude that SVM models were the best classifiers for breast
cancer detection using images with a well-defined region of interest.
The study conducted in [8] compares Gradient Boosting Machine, Random For-
ests, Support Vector Machines, and a Naive Bayes model. An ensemble of these models
was fitted to create the ML model used for comparison with EUROSCORE II and
logistic regression. The models were fitted twice due to the sensitivity to the input data;
a chi-square fit was applied on the second fitting to the input features and only relevant
features were used. The performance of each machine learning model was assessed
with the area under the ROC curve; Random Forest produced the best result regardless
of whether the data was filtered. Out of the four machine learning models, Naive Bayes
produced the weakest accuracy without filtering, but proved to be better than SVMs
with filtering; in both cases, however, logistic regression does better than both Naive Bayes
and SVMs.
However brief, the key thing to take from the review is the various reasons why a
particular learning algorithm performs better than another. While presenting our results,
we shall discuss the performance of the algorithms with respect to certain aspects of the
dataset as well as hyperparameter tuning. Although most of the papers reviewed were
classification problems, we are interested in seeing whether similar effects, such as the effect
of the training set sample size seen in [5] and [6], will affect the performance of our models.

3 Methodology

In this section, we present first the dataset used for the training of the models, then we
present the various techniques used to train the models, test and score their
performance.

3.1 The Dataset


Choosing a health insurance plan is an important task that is seldom paid as much
attention as required. Studies have shown that most people rarely understand the
concepts of cost-sharing, drug coverage and other benefits offered by insurance com-
panies [10]. In Nigeria, Health Maintenance Organizations (HMOs) are often suggested
to patients by hospitals, and these patients often lack detailed information as to the best
HMOs to go with. With these challenges in mind, the dataset used in this paper was
collected, with a view of uncovering, through exploratory analysis, possible informa-
tion in health insurance records that could help with deciding the benefits of one HMO
over another. The dataset is a health insurance record taken between the year 2015 and
2016. Collected for a case study on Clearline HMO in Nigeria, the dataset contains 7
features: the treatment date, provider code (the hospital providing healthcare), diag-
nosis, drug prescription, charges spent (the cost of treatment), company code (the
health insurance company) and the payment date (the date when the charges spent is
reimbursed by the insurance company).

Our goal is to augment a statistical exploratory analysis of the data with supervised
machine learning. The supervised learning task is to train a model to estimate the time
lapse between a treatment date and payment date. To get the time lapse until
reimbursement of the charges spent, we calculate the number of days between the
treatment date and the payment date. This derived time-lapse is then used as the target
variable during the training.
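A minimal sketch of this target derivation, assuming the records are loaded into a pandas DataFrame; the column names and dates are illustrative.

import pandas as pd

# Illustrative records standing in for the insurance dataset.
records = pd.DataFrame({
    "treatment_date": ["2015-03-02", "2015-07-15"],
    "payment_date":   ["2015-06-10", "2016-01-20"],
})

records["treatment_date"] = pd.to_datetime(records["treatment_date"])
records["payment_date"] = pd.to_datetime(records["payment_date"])

# Time lapse in days between treatment and reimbursement (the target variable).
records["time_lapse"] = (records["payment_date"] - records["treatment_date"]).dt.days
print(records["time_lapse"])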
Descriptive Statistics. In Table 1, we present a descriptive analysis of the health
insurance dataset.

Table 1. Summary statistics for the numerical features of the dataset.


S/N  Basic statistics            Treatment date  Provider code  Charges sent  Company code  Payment date  Time lapse
1.   Number of observations      70888           70888          70888         70888         70888         70888
2.   Missing values              0               0              0             0             0             0
3.   Mean                        2014.35         494.04         9980.69       196.27        2015.00       105.23
4.   Standard error of mean      0.002           1.316          100.712       0.552         0.000         0.11
5.   Mode                        2015            580            5000          144           2015          106
6.   Standard deviation          0.459           339.150        27419.072     148.730       0.000         0.013
7.   Variance                    0.210           111502.722     7.371E8       22120.487     0.000         0.00169
8.   Skewness                    −1.415          −0.083         13.344        0.441         −0.023        15.33
9.   Kurtosis                    2.118           −1.534         309.749       −1350         1.33          405
10.  Standard error of kurtosis  0.018           0.018          0.018         0.018         0.018         0.019
11.  Range                       5               1123           1346456       464           0             400
12.  Sum                         146408508       1294759        720396042     14249993      146428035     7469469
13.  Quantiles 25                2014.23         106.15         2500.00       62.00         2015.58       56.00
     Quantiles 50                2014.74         580.06         4330.00       144.00        2016.22       205.00
     Quantiles 75                –               782.86         8000.00       359.00        –             143.00

Data Visualization. The frequency distribution in Fig. 1 shows that most insurance
companies make reimbursements within 100 days.

Fig. 1. Frequency distribution of the Time lapse

In Fig. 2, the frequency distribution of the company codes is presented; this indi-
cates the number of records associated with each company. In Fig. 3, one can observe
some insurance companies make reimbursements in less than 100 days while others
may take as long as a year.

Fig. 2. Frequency distribution of the company code.



Fig. 3. Scatter plot showing the insurance companies (company_code) and the time lapse.

Data Preprocessing. All the features except for the payment date are selected for our
training and test set. The target label is of course the time lapse. We apply label
encoding to the character features, the diagnosis and the prescription. Next, we apply
one hot encoding to ensure that categorical features are not taken as ordinal values.
This increased the number of features in the dataset to 533; we did not apply any
dimensionality reduction techniques such as principal component analysis. We split the
dataset into the training set and test set using the 80-20 (percent) ratio. This resulted in
56,709 instances for the training set and 14,178 instances for the test set.
Finally, we normalize all the values in the training and test set to lie between 0 and
1 using the Minmax scaler function in sklearn. A mathematical description of the
Minmax scaler is shown in Eq. 1, where x is an original value and x′ is the normalized value.
Normalization ensures all values fall within a particular range and that an algorithm
doesn’t place a higher precedence on larger numerical values. The data preprocessing
functions were also carried out using the sklearn library [11].

$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$  (1)
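A rough sketch of the preprocessing steps described above (one-hot encoding, 80-20 split, and Min-Max scaling), using a tiny made-up frame in place of the insurance records; this is not the study's code.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Tiny illustrative frame standing in for the insurance features.
X = pd.DataFrame({
    "provider_code": [580, 12, 580, 44],
    "charges_spent": [5000, 2500, 8000, 4330],
    "company_code":  [144, 62, 359, 144],
    "diagnosis":     ["malaria", "typhoid", "malaria", "ulcer"],
    "drug":          ["a", "b", "a", "c"],
})
y = pd.Series([56, 106, 205, 90])  # illustrative time lapse in days

# One-hot encode the categorical features so they are not treated as ordinal.
X_encoded = pd.get_dummies(X, columns=["diagnosis", "drug"])

# 80-20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=0)

# Scale every feature to [0, 1] as in Eq. 1; the scaler is fitted on the
# training set and then applied to the test set.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)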

3.2 Fitting the Models


Every machine learning algorithm has a set of hyperparameters that determine to a very
large extent how well the algorithm might perform, when fitting it to the data.
Therefore, it was extremely important to choose the right hyperparameters for the
models using an optimization technique. We rely on the Grid search algorithm to help
us choose the best hyperparameters for each of the algorithms. Using grid search only
solves half the problem; one still needs to know which hyperparameters to pass
to the grid search algorithm for optimization. The grid search algorithm is fairly straight-
forward: given a set of hyperparameters a and b and a training set $S_{train}$, the goal of grid
search is to find the optimal value $w^*$ such that:

$w^*(a, b) = \operatorname{argmin}_w P(w; a, b, S_{train})$  (2)

where P is the optimization problem of finding the optimal value of $w^*$. In Eq. (2),
$w^*$ is the model that minimizes the cost function and it is a function of the optimized
parameters a and b. In Table 2, we present briefly the hyperparameters we chose for
optimization, the range of optimization values and the effect of these choices.

Table 2. Hyperparameters chosen for optimization in the nonparametric algorithms.


S/N  Hyperparameter         Algorithm      Range of values      Effect
1.   k                      KNN            5, 10                Lower k values may lead to high variance,
                                                                higher k values may lead to high bias
2.   Tree search algorithm  KNN            k-D Tree, Ball Tree  Both are good for high-dimensional datasets;
                                                                ball tree has lower search time
3.   C                      SVM            0.1, 1               Lower C corresponds to stronger regularization;
                                                                lower values suit noisy data
4.   Kernel                 SVM            Linear, RBF          Linear kernel works best for linearly separable
                                                                data, RBF for non-linear data
5.   Max depth              Decision tree  10, 20               Shallow trees may lead to underfitting and
                                                                deeper trees may lead to overfitting
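A sketch of the grid-search setup implied by Table 2, assuming scikit-learn's regressor implementations; the parameter names map Table 2 onto scikit-learn arguments, and the fit call is left commented because the training data is not defined in this snippet.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Candidate models and the hyperparameter ranges of Table 2.
search_spaces = {
    "knn": (KNeighborsRegressor(),
            {"n_neighbors": [5, 10], "algorithm": ["kd_tree", "ball_tree"]}),
    "svm": (SVR(),
            {"C": [0.1, 1], "kernel": ["linear", "rbf"]}),
    "tree": (DecisionTreeRegressor(),
             {"max_depth": [10, 20]}),
}

# searches = {name: GridSearchCV(model, grid, scoring="r2", cv=10).fit(X_train, y_train)
#             for name, (model, grid) in search_spaces.items()}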

3.3 Validation and Scoring


To validate each model’s performance on the training data, we employ the k-fold cross
validation technique and set k to 10. The same k-fold cross validation is employed for
the predictions on the test data as shown in Table 5.
To measure each model’s goodness of fit, we use the R2 metric. The R2 metric is
used simply because we only want to know precisely how well each of the models fit
the data. The R2 score ranges between 0 and 1, values closer to one are preferable and
indicate a good fit. Equation (3) describes R2 :
$R^2(y, y_{actual}) = 1 - \frac{\sum_{i=1}^{n} (y_{actual(i)} - y_i)^2}{\sum_{i=1}^{n} (y_{actual(i)} - \bar{y})^2}$  (3)

where $y_i$ is the estimated value of the i-th instance of n samples, $y_{actual(i)}$ is the
actual i-th value of n samples, and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_{actual(i)}$.
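A minimal NumPy sketch of the R² computation in Eq. (3); it is equivalent to scikit-learn's r2_score, and the values used in the check are made up.

import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_actual, y_pred):
    """Coefficient of determination as in Eq. (3)."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_actual - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Illustrative check against scikit-learn on made-up values.
y_true = [105, 56, 205, 143]
y_hat = [110, 60, 190, 150]
print(r_squared(y_true, y_hat), r2_score(y_true, y_hat))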

4 Results

Grid Search Results

Table 3. Results from Grid search showing the optimized hyperparameters.


S/N Hyperparameters Algorithm Best value
1. K KNN 10
2. Tree search algorithm KNN Ball Tree
3. C SVM 1
4. Kernel SVM Linear
5. Max depth Decision tree 20

Training Results
The models are trained using the optimized parameters from Table 3. Table 4 shows
the training time for each model.

Table 4. Training time in seconds for each model.


S/N Algorithm Training time (seconds)
1. KNN 43
2. SVM 3367
3. Decision tree 548
4. Random forest 2395

Validation Results

Table 5. The mean of the 10-fold cross validation (R2) score for each regression model.
S/N Regression model R2
1. KNN 0.4753
2. SVM 0.3631
3. Decision tree 0.6618
4. Random forest 0.7189

Test Results
The k-NNs visibly perform better than the SVM model. From Table 6, it is evident
that the k-NN model makes correct predictions on close to 50% of inputs, and predicts
readings of a certain type of information in the data well. The issue with this model
is that the rate of correct prediction on this type of information is low. From
Fig. 4, in about 10 data points of input data, the model appears to correctly predict 3 of
those data points. This form of uncertainty brings the accuracy of the model
down; although it performs better than the SVM, it still does not top the work done by
the Decision tree or the Random forest model.
Decision trees or the Random forest model.

Table 6. R2 score on the test data for each regression model.


S/N Regression model R2
1. KNN 0.4409
2. SVM 0.3403
3. Decision tree 0.6704
4. Random forest 0.6813

Fig. 4. Plot showing the predicted time lapse over the real time lapse for the KNN regression
model.

The SVMs were not able to perform at a level of over 50% accuracy. This poor
performance is observed in Fig. 5 and in the R² shown in Table 6. The most common input
readings (readings on the lower half of the plot) are not even correctly predicted by the
model. On closer inspection, the SVM model could be deemed to have underfit
the data, as most of the new input data points fall into the same bracket or range of
prediction. The nature of the data and the SVM model are not a good match at all.

Fig. 5. Plot showing the predicted time lapse over the real time lapse for the SVM regression
model.

The plot in Fig. 6 does not bear evidence of overfitting or underfitting in the model. The visual representation of the model does not show a general output or fit across an input range or a prediction of outlying data points. Instead, it shows an accurate reading of new inputs spread evenly across the test set. The model performs very well compared to the SVM and k-NN, and this figure reinforces that.

Fig. 6. Plot showing the predicted time lapse over the real time lapse for the Decision tree
regression model.

The Random Forest plot in Fig. 7 also shows no signs of overfitting or underfitting; on closer inspection, it tends to show a model that does not fit too closely to the data. This suggests that the best performing model may not closely fit every data input, but instead applies a less descriptive analysis when fitting inputs. This is consistent with the non-linearity of the data: a less rigorous check on an input from this dataset favors a less descriptive analysis when predicting. Another factor is that the random forest, being an average of many decision trees, trains on a range of scores from those trees; this normalization of its descriptive analysis may be the reason this model performs acceptably.

Fig. 7. Plot showing the predicted time lapse over the real time lapse for the Random forest regression model.

5 Discussion

The decision tree cross validation score of 0.66 and accuracy of 67% on the test set, shown in Tables 5 and 6, are fairly good results, but they stand out when compared to the performance of the two other dissimilar models (SVMs and k-NNs). The data used in this experiment is highly non-linear, meaning that a unique instance of a feature (i.e., Drug) could map to more than one unique instance of other features. This similarity in the dataset compresses the search space of predictions, so models act with very minimal discrimination on new inputs. Although the 'Charges sent' feature may have a unique instance across all samples, it still does not produce a big enough discrepancy for classification. Models like SVMs and k-NNs therefore suffer greatly, because they assess new inputs on how close they are to a cluster of points (SVMs) or make an intuitive prediction based on likeness to other data points (k-NNs); most of their predictions are then fairly wrong.

Decision Trees and Random Forests, on the other hand, define the attention that should be placed on features in new inputs. A new input to the model traverses a tree that classifies based on information rather than on distance or likelihood to other data points. This behavior does not suffer greatly from overly similar or intertwined data instances but bears the risk of overfitting. Thus, a search method, grid search, was employed to monitor this unwanted attribute in our model; grid search was able to find an optimal depth for the decision tree and thereby improve the performance. The only model that performed better than the decision tree is the Random Forest, which intuitively makes sense and is expected. A Random Forest averages a number of these well-performing decision trees and so has better instances from which to make a decision.

6 Conclusion

The comparison between these algorithms underscores that the connectivity between unique data points becomes a major factor when deciding which machine learning models are suitable to employ. In our case, the goodness-of-fit scores selected the Random forest and the Decision tree algorithms as optimal, while the highly non-linear form the data took led to poor scores from algorithms like SVMs and k-NNs. Yet, according to the 'no free lunch theorem', this does not deem the Decision Tree and Random Forest algorithms universally superior for analyzing non-linear data; it instead suggests that using k-NNs and SVMs to predict classes is better suited to applications with well-defined output labels (i.e., time lapse in days). The decision tree model performs better because it makes better predictions on new data points than the ballpark estimates made by the likelihood-based models, SVMs and k-NNs. This study also stands as a reference for the importance of grid search when running non-parametric models. Because nonparametric algorithms learn objective functions that are not defined by fixed parameters, grid search presents an appropriate technique for finding the intricate settings with which the data needs to be modeled.

CAMLPAD: Cybersecurity Autonomous
Machine Learning Platform
for Anomaly Detection

Ayush Hariharan(&), Ankit Gupta, and Trisha Pal

Blue Cloak LLC, Sterling, VA 20164, USA


ahariharan.research@gmail.com

Abstract. As machine learning and cybersecurity continue to explode in the context of the digital ecosystem, the complexity of cybersecurity data combined
with complicated and evasive machine learning algorithms leads to vast diffi-
culties in designing an end-to-end system for intelligent, automatic anomaly
classification. On the other hand, traditional systems use elementary statistics
techniques and are often inaccurate, leading to weak centralized data analysis
platforms. In this paper, we propose a novel system that addresses these two
problems, titled CAMLPAD, for Cybersecurity Autonomous Machine Learning
Platform for Anomaly Detection. The CAMLPAD system’s streamlined, holistic
approach begins with retrieving a multitude of different species of cybersecurity
data in real-time using elasticsearch, then running several machine learning
algorithms, namely Isolation Forest, Histogram-Based Outlier Score (HBOS),
Cluster-Based Local Outlier Factor (CBLOF), and K-Means Clustering, to
process the data. Next, the calculated anomalies are visualized using Kibana and
are assigned an outlier score, which serves as an indicator for whether an alert
should be sent to the system administrator that there are potential anomalies in
the network. After comprehensive testing of our platform in a simulated envi-
ronment, the CAMLPAD system achieved an adjusted rand score of 95%,
exhibiting the reliable accuracy and precision of the system. All in all, the
CAMLPAD system provides an accurate, streamlined approach to real-time
cybersecurity anomaly detection, delivering a novel solution that has the
potential to revolutionize the cybersecurity sector.

Keywords: Machine learning · Cybersecurity · Anomaly detection · Clustering · Visualization

1 Introduction

In recent years, the importance of varying fields within computer science, particularly
cybersecurity and machine learning, has skyrocketed. With new systems depending on
intelligent tools that bring next-level computation and systems open to security brea-
ches, the importance of the intersection of machine learning for analysis in cyberse-
curity data has flourished. However, the burdens of uneven cybersecurity data from a
variety of different sources often makes development of a tool that effectively and
accurately makes use of machine learning to improve cybersecurity data difficult. As a

result, very few end-to-end systems that can automatically classify anomalies in data
exist, let alone those that are accurate.
In this paper, we propose a novel, accurate system for real-time anomaly detection.
We term this product CAMLPAD, or the Cybersecurity and Autonomous Machine
Learning Platform for Anomaly Detection. By processing a plethora of different forms
of cybersecurity data, such as YAF, BRO, SNORT, PCAP, and Cisco Meraki real-time
using a variety of machine learning models, our system immediately determines if a
particular environment is at immediate risk for a breach as represented by presence of
anomalies. The specific machine learning algorithms utilized include Isolation Forest,
Histogram-Based Outlier Detection, Cluster-Based Local Outlier Factor, Multivariate
Gaussian, and K-Means Clustering. Once the data has been processed and anomalies
have been calculated, the CAMLPAD system utilizes Kibana to visualize outlier data
pulled from Elasticsearch and to gauge how high the outlier score is. Once a particular
threshold has been reached for this outlier score, an automated alert is sent to the
system administrator, who has the option to forward the alert to all of the employees in
the company so they are aware that a cybersecurity breach has occurred. By imple-
menting CAMLPAD as a running bash script, the CAMLPAD system immediately
recognizes anomalies and sends alerts.

1.1 Background
Cybersecurity is the practice of defense of an organization’s network and data from
potential attackers that have unauthorized access to the particular network. One measure
of determining to what extent a particular user has this type of unauthorized access is
detecting anomalies in the network traffic data, specifically the data referred to earlier
(e.g. BRO, YAF, PCAP, SNORT). A potential platform that would be able to detect
anomalies would need to process this data in real time then use a model that uses past data
to learn whether current data contains anomalies and thus, if the network has an intrusion.
Specific to the CAMLPAD system, there are several ideas and terminologies that
would benefit the reader to have a background of. To begin with, there are a few pieces
of cybersecurity data that the CAMLPAD system makes use of. YAF, or “Yet Another
Flowmeter”, is a cybersecurity data type that processes PCAP data and exports these
flows to an IPFIX Collecting Process [11]. BRO is an open source framework that ana-
lyzes network traffic and is used to detect anomalies in a network. SNORT, similarly, is
a network intrusion detection system that helps detect emerging threats. Meraki is a
cloud-based centralized management service that contains a network and organization
structure. This specifically is crucial as it further reveals the relationships between
members of a particular organization, which further assists the machine learning model
in determining where a potential anomaly may be.
Machine Learning, or ML, is the subfield within the exploding field of Artificial
Intelligence primarily concerned with having a computer or machine learn to make predictions from a set of previous data rather than being explicitly programmed. There are
several algorithms, or methods, that facilitate this type of learning. Machine learning
consists of two main categories: unsupervised and supervised. In supervised machine
learning, the ML model already has data labeled, so calculating an accuracy is as simple
as detecting whether the model has correctly predicted the labeled data. In unsupervised

machine learning, the main type of ML to be referred to in this paper, the data has not
been labeled, so alternative methods need to be used to evaluate performance. As will be
discussed in further detail in the next section, the specific machine learning models used
as part of the CAMLPAD System are Isolation Forest, Histogram-Based Outlier Detection, Cluster-Based Local Outlier Factor, and Principal Component Analysis (Fig. 1).

Fig. 1. Principal component analysis

Isolation forest (Fig. 2) is an unsupervised machine learning algorithm that randomly selects a feature and then randomly selects a split value between the maximum and minimum values of that feature. Since isolating normal points requires more splits than isolating anomalies, because normal points lie in broader, denser regions, an anomaly score can be defined that measures the number of conditions needed to separate a given observation. The algorithm begins by creating random decision trees, and the score is then taken as the path length required to isolate the observation.

Fig. 2. Isolation forest
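The following sketch, assuming the scikit-learn implementation of the algorithm, shows how such isolation-based scores could be obtained; the synthetic data, variable names, and contamination value are illustrative and not taken from the CAMLPAD system.

# Sketch: isolation-forest anomaly scoring (scikit-learn implementation assumed).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),    # stand-in for ordinary traffic records
               rng.normal(6, 1, size=(10, 4))])    # stand-in for anomalous records

iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
iso.fit(X)
labels = iso.predict(X)              # +1 for inliers, -1 for outliers
scores = -iso.score_samples(X)       # larger value = shorter isolation path = more anomalous
print("flagged", int((labels == -1).sum()), "records as anomalous")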



Histogram-based outlier detection (Fig. 3) is a method that scores records in linear time by assuming independence of features, making it much faster than multivariate approaches, yet less precise. Cluster-based local outlier factor (CBLOF) uses clusters to find anomalous data points by measuring the local deviation of a given point with respect to its neighbors. Specifically, it uses the concept of local density from the k nearest neighbors, comparing the density around an object to the densities around its neighbors in order to identify regions of similar density.

Fig. 3. Histogram-based outlier detection
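A corresponding sketch for HBOS and CBLOF, assuming the PyOD ("python outlier detection") library that the Methods section refers to, is shown below; the data, parameter values, and cluster count are illustrative assumptions.

# Sketch: HBOS and CBLOF outlier scoring (PyOD API assumed; values illustrative).
import numpy as np
from pyod.models.cblof import CBLOF
from pyod.models.hbos import HBOS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),    # stand-in inliers
               rng.normal(5, 1, size=(10, 4))])    # stand-in outliers

models = {"HBOS": HBOS(n_bins=10, contamination=0.02),
          "CBLOF": CBLOF(n_clusters=2, contamination=0.02)}  # few clusters for the tiny sample
for name, model in models.items():
    model.fit(X)                                   # unsupervised fit, no labels needed
    print(name, "flagged", int(model.labels_.sum()), "outliers")   # labels_: 1 = outlier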

Building on the cybersecurity system, data types, and machine learning algorithms described above, the holistic CAMLPAD system incorporates the Elasticsearch data stream in real time with the machine learning algorithms, and if the anomaly score reaches a particular threshold, a warning is sent to the organization. The Methods section delivers further details based on the fundamentals discussed in this section.
In the following sections, the CAMLPAD system is described thoroughly along with results from comprehensive testing to gauge the performance of the system. Comparisons of results are then provided in a discussion, with general insights stated in the conclusion.

2 Methods

Before implementing the machine learning aspect of the research, the data, which
described various online transactions, had to be accurately transferred from the sensors
that were run on Linux virtual machines, to a local server, where the model can process
the data and alert the user if any anomalies are present. The data, specific to the
different sensors running on the virtual machine including BRO, YAF, Snort, and
Meraki, are temporarily stored locally, in a machine composed of 4 Dell VRTX, before
being uploaded to a Hadoop server consisting of one master node and three slave
nodes. After successfully uploading the data to the Hadoop server, Apache NiFi is used
to streamline and process the sensor logs before pushing the processed information into
the Kafka Queue Fig. 4. Apache NiFi, a project from the Apache Software Foundation,

was specifically designed to automate the flow of data between software systems. In
our case, the data is transferred from the virtual storage on the Hadoop server to the
Kafka queue where it can be stored more efficiently. Specifically, the information stored
in each of the logs is queried into a JSON-like format consisting of a field, such as
MAC address or destination IP, and the actual information, such as a list of actual
addresses. Once the data has entered the queue, it is sent to the Elasticsearch database
where it is stored for future processing.

Fig. 4. Kafka queue

Elasticsearch is a database that parses and normalizes raw data before assigning
each query of information a unique identification number. Using this identification
number, and the index associated with the data, information from the sensor logs can be
queried for further processing using the machine learning models. However, one caveat
with Elasticsearch is that it doesn’t allow for custom processing scripts to be run inside
the database. Instead, the information must be queried based on an assigned index and
must be processed on an external server or node. That is where the current research
enters the workflow, since the machine learning algorithms utilize the indexing ability
of Elasticsearch to stream data into a separate machine. This data is streamed directly
from the database, without having to download the data as a CSV or JSON, meaning
that the data is quickly transferred from storage to a local processor on another
machine. Using a unique algorithm centered around the current date and time, all
previous data is indexed and imported into a dataframe, one of the most common
methods of storage for machine learning algorithms. The dataframe will contain the
information used for training and validating the machine learning models that were
created. Once the dataframe has been created, the current day's data is indexed and
imported into another dataframe. This dataframe will contain the latest information
used for anomaly detection based upon patterns observed in the previous data stored.
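A rough sketch of this streaming step is given below, assuming the official elasticsearch Python client and pandas; the host, the daily index-naming pattern, and the page size are illustrative assumptions rather than details of the deployed system.

# Sketch: pulling one day's sensor logs from Elasticsearch into a pandas DataFrame.
# Host, index pattern ("bro-YYYY.MM.DD"), and page size are assumptions.
from datetime import date
import pandas as pd
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")            # assumed local cluster
index_name = f"bro-{date.today():%Y.%m.%d}"            # hypothetical daily index pattern

resp = es.search(index=index_name, query={"match_all": {}}, size=10000)
rows = [hit["_source"] for hit in resp["hits"]["hits"]]
today_df = pd.DataFrame(rows)                          # current day's data to be scored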

Now that the data has been successfully imported into the respective dataframes,
the categorical data present in the sensor logs, such as type of request or url, must be
encoded into numerical values before further analysis by the machine learning algo-
rithms. After encoding the data, two methods of imputation: linear regression Fig. 5,
for purely numerical values, and backfill insertion, for encoded categorical values, were
used. Now that missing or lost data has been imputed, the data can be imported into the
custom ensemble model for anomaly detection. The custom ensemble model consists
of an Isolation Forest algorithm, histogram based outlier detection algorithm, and
cluster based local outlier factor. All of these models are similar to the implementation
in the python outlier detection library. Once the data is fit to the overall model, the
validation data and testing data are assigned an outlier score. Based on the outlier score
and a simple PCA algorithm, clusters are developed depending on the outlier score
assigned by the respective model. Those clusters are then processed and a heat map is
created describing the various levels of outliers present in the data.
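The encoding and imputation steps described above might look roughly like the sketch below; the column names and values are hypothetical examples, and the regression-on-row-position imputation is only a stand-in for the linear-regression imputation applied to numerical fields.

# Sketch: integer-encoding categorical log fields and imputing gaps (column names hypothetical).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"method":   ["GET", "POST", None, "GET"],
                   "uri":      ["/a", None, "/a", "/b"],
                   "duration": [0.2, np.nan, 0.4, 0.8]})

# Categorical fields: encode as integers, then backfill missing codes.
for col in ["method", "uri"]:
    codes = pd.Series(pd.Categorical(df[col]).codes, index=df.index).astype(float)
    df[col] = codes.replace(-1.0, np.nan).bfill()

# Numerical fields: impute with a linear regression (here simply on the row position).
known = df["duration"].notna()
reg = LinearRegression().fit(np.flatnonzero(known).reshape(-1, 1), df.loc[known, "duration"])
df.loc[~known, "duration"] = reg.predict(np.flatnonzero(~known).reshape(-1, 1))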
This process is repeated for each model created, resulting in three heat maps
describing the outlier scores assigned by each model for the data. After the outlier
scores have been assigned, the ensemble model is created through a democratic voting
system, where each model has an equal say on whether a data point is an outlier or an
inlier. After the voting system has been completed, the final outlier scores are run
through the PCA algorithm and a final heat map is created. The process is then repeated
for the different types of data that are stored on the Elasticsearch database, including
YAF, BRO, SNORT, and Meraki. Specifically, the BRO data is split by protocol into
DNS and CONN in order to accurately label the data. Once the final outlier scores have
been compiled for each data type, a final ensemble model is created, using a democratic
voting system in order to reclassify each data point. This final model takes into con-
sideration not only different outlier detection models that have been successful in
previous research, but also different types of sensor data that capture different layers of
internet traffic. After the model has been created, the accuracy is determined by cal-
culating the Adjusted RAND score, a common method of evaluating unsupervised
machine learning algorithms.
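A small sketch of the democratic vote and the Adjusted Rand evaluation, assuming scikit-learn's adjusted_rand_score and purely illustrative labels, is given below.

# Sketch: majority ("democratic") vote over per-model outlier labels, then Adjusted Rand score.
import numpy as np
from sklearn.metrics import adjusted_rand_score

# One row per model (e.g. I-Forest, HBOS, CBLOF); 1 = outlier, 0 = inlier (illustrative).
votes = np.array([[0, 1, 0, 1, 1],
                  [0, 1, 0, 0, 1],
                  [1, 1, 0, 0, 1]])

majority = votes.shape[0] // 2 + 1                      # more than half of the models
ensemble_labels = (votes.sum(axis=0) >= majority).astype(int)

reference = np.array([0, 1, 0, 0, 1])                   # e.g. validated previous-day labels
print("ensemble labels:", ensemble_labels)
print("adjusted RAND score:", adjusted_rand_score(reference, ensemble_labels))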
This represents the last part of the workflow, where the data originating from different sensors has been effectively processed and anomaly scanning has been completed. After the accuracy has been tested and confirmed, the newly assigned outlier
scores are then indexed and queried back into the Elasticsearch database so that
visualizations of these outlier scores can be created in Kibana. Specifically, the index is
portrayed as a gauge, where the outlier score of the current day’s data is compared with
previous data that the model has trained on. When the gauge passes the 75th percentile,
a custom alert is sent to the owner of the Apache database, alerting them that there
might be anomalies in the current data. The user then has the choice of whether to
respond to this alert by blocking certain destination ports or IP addresses or can
perform further investigation to determine the cause of the anomaly.
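The alerting rule can be expressed in a few lines; the sketch below assumes NumPy and uses illustrative score values, with the actual notification mechanism left as a placeholder.

# Sketch: 75th-percentile gauge rule for deciding whether to alert the administrator.
import numpy as np

def should_alert(history_scores, todays_score, percentile=75):
    """True when today's combined outlier score exceeds the chosen percentile of past scores."""
    return todays_score > np.percentile(history_scores, percentile)

history = [0.004, 0.022, 0.025, 0.029, 0.13]       # illustrative past daily outlier scores
if should_alert(history, todays_score=0.261):
    print("ALERT: potential anomalies in today's network data")  # e.g. email or webhook in practice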

Fig. 5. Linear regression

3 Results

In terms of results, the CAMLPAD system consists of five main data components:
BRO DNS, BRO CONN, YAF, SNORT, and Meraki. These components, along with
the three models: Isolation Forest (I-Forest), Cluster Based Local Outlier Factor
(CBLOF), and Histogram Based Outlier Score (HBOS), are then combined in a
democratic voting system to determine final outlier score. Although the user isn’t
alerted based upon each individual data type and only the final combined model, it is
interesting to note how different types of data, both layer 3 and layer 2, show similar
patterns in anomaly detection. In all of these heat maps, there are two separate data
points, previous data that the model trained on, which are represented by the smaller
data points, and current day’s data, which is represented by the larger data points. In
each graph, the differences or similarities between the current day and the previous data
can be observed along with other patterns that represent the level of anomalies present.

3.1 Data Heat Maps and Visualization


Based on the BRO DNS data heat maps (Fig. 6), specifically the HBOS heat map, it is evident that, since many of the data points are lighter, there were more outliers, and the bigger dots for the current day were darker, meaning that the current day had fewer outliers. For the CBLOF heat map, there were lighter dots at the center, surrounded by darker dots, meaning that the current day had more outliers than previous days. This
means that the current day was likely an outlier. While the isolation forest algorithm does not display this, it is further corroborated by the combined algorithm, which shows a dark dot, representing less of an outlier, surrounded entirely by lighter dots.
Since the Kibana Recent day has a score of 0.13, which is higher than 0.022 of the
previous days, the current day is likely an outlier.

Fig. 6. BRO DNS data heat maps and Kibana visualization

Next, for the BRO CONN data (Fig. 7), the HBOS algorithm reveals that all of the data is very light, with a light gray dot in the center, meaning that the current day's data is only a slight outlier in comparison to previous days' data. This is roughly corrob-
orated by the CBLOF algorithm, but the I-Forest and Combined algorithms deliver an
entirely different result. In both Isolation Forest and Combined, there is first a light dot
surrounded by black dots, then a black dot surrounded by white dots, meaning that
there was likely an outlier on the current day. Since the Kibana recent day has a score
of 0.025, which is significantly lower than the previous days’ score of 5.25, the Kibana
score suggests that the current day is less of an anomaly.

Fig. 7. BRO CONN data heat maps and Kibana visualization

Fig. 8. YAF data heat maps and Kibana visualization

For the YAF data shown in Fig. 8, throughout each of the different algorithms, it is evident that the current day's data is an outlier when compared to the previous days' data. For example, in the HBOS diagram, there are black dots surrounded by light dots. This repeats for CBLOF, I-Forest, and the combined algorithm. For Kibana,
the recent day has a score of 2.734, which is less than the previous days’ score of 2.75,
meaning that the most recent day is less of an outlier in comparison to previous days.
For the SNORT data shown below (Fig. 9), the HBOS, CBLOF, and Combined
Algorithm all have black dots surrounded by lighter dots representing previous days’
data. However, the I-Forest algorithm has only light dots, meaning that it does not
perceive the current day’s data to be an outlier.
For Kibana, the recent day has a score of 0.961, which is less than the previous
days’ score of 4.378, meaning that the most recent day is far less of an outlier in
comparison to previous days.

Fig. 9. SNORT data heat maps and Kibana visualization

For the Meraki data shown below (Fig. 10), it is clear that there are four main gray dots representing the current day's data, and these four clusters are outliers. While the HBOS model has the data being almost equivalent to the data from previous days, the CBLOF, I-Forest, and combined models all reveal that the clusters are indeed outliers. For Kibana, the recent day has a score of 5.287, which is higher than the previous days' score of 0.029.

Fig. 10. Meraki data heat maps and Kibana visualization

Lastly, for the combined data (Fig. 11), throughout each of the different algorithms, it is evident that the current day's data is an outlier when compared to the previous days' data. In HBOS, there is an X-like outlier pattern, in CBLOF there is a dark dot, in I-Forest,

Fig. 11. Combined data heat maps and Kibana visualization



there is a gray dot, and in the combined algorithm there is a black dot surrounded by
gray points. For Kibana, the recent day has a score of 0.261, which is more than the
previous days’ score of 0.004, meaning that the most recent day is more of an outlier in
comparison to previous days.

3.2 Accuracy Measure


In order to determine our measure for accuracy, we used the RAND score, which measures the similarity between two clusterings by considering all pairs of samples and counting the pairs that are assigned consistently in both clusterings.
The RAND score takes into account true positives and false positives and determines how accurate the measurement for a previous day is based on the data from the current day. This is used because a previous day's data is already validated, so it is easier to use the current day's data as a training set. All algorithms implemented achieve similar outlier scores, so it is evident that the holistic algorithm is correct, since none of the algorithms contradict each other. After testing with all algorithms implemented and averaging, we achieved a RAND score of 0.95.

4 Discussion

As previously discussed, the CAMLPAD System provides an innovative, comprehensive, and streamlined approach for anomaly detection. Not only does the system
make use of a variety of data types, such as BRO (DNS/CONN), SNORT, YAF, and
Meraki, it also uses a combined, democratic-voting based measure that incorporates all
the different types of data to give a more holistic anomaly detection mechanism. Aside
from the data, the use of four different algorithms, specifically Isolation Forest, His-
togram Based Outlier Score, Cluster-Based Local Outlier Factor, and Angle-Based
Outlier Detection, results in an integrated system that capitalizes on the strengths of
each of the machine learning algorithms.
Although the CAMLPAD System has definitively proven to be accurate and useful
due to the RAND score of 0.95, there are several limitations of the engineered system.
One such limitation is that it only works with a few machine learning algorithms,
meaning that the system is inherently biased towards the four main machine learning
algorithms. Another is that the system does not classify all anomalies, so the sample
size is much smaller since only one type of anomaly is being detected. To further
corroborate these limitations and understandings, it is important to consider comparison
of the system to previous network software architectures that have been traditionally
used for anomaly detection.
Specifically, previous papers have reviewed different anomaly detection systems
and found that there are specific issues with wide-scale deployment of intrusion
detection systems [1]. Other papers have discussed the application of similar artificial
intelligence-based algorithms with respect to the immune system, and how potentially

there are interesting insights to be gained from computational models [2, 4]. Several
researchers have also discussed more sophisticated machine learning algorithms, such
as utilizing a spiking neural network algorithm, but even these have had limited success
[3]. With the rise of machine learning, other researchers have investigated the subfield
of deep learning, including its role as the frontier for distributed attack detection and in
varied prevention systems [5, 6, 10, 17]. In addition, while several papers have pointed
to the intriguing intersection of autonomous AI and anomaly detection, others contain a
systematic review of how cloud computing could potentially assist in the development
of a prevention system [7, 8, 16].
In addition to research papers that have investigated broad fields such as deep
learning and cloud computing, several have dug deeper into the specifics of machine
learning. For example, one paper considers the use of a TCM-KNN algorithm (Fig. 13) for
supervised network intrusion, which provides a novel approach, since most network
intrusion algorithms use unsupervised learning [9]. In addition, another paper considers
use of the n-gram machine learning algorithm in both anomaly detection and classi-
fication [18]. On the other hand, different papers have focused on larger scale intrusion
networks, and more cost-benefit analysis oriented operations, involving research pri-
orities and business intelligence [10, 12, 13, 15]. Real-time network flow analysis (Fig. 12) has been another field populated by a great deal of research, as shown by the system named Disclosure, which can detect botnet command and control servers [14].
Lastly, several papers have investigated feature selection models and learning-based
approaches to intrusion detection, such as applications to SQL attacks [19, 20].

Fig. 12. Real-time network flow



Fig. 13. TCM-KNN

As discussed, clearly a plethora of research has been done in the intersection of machine learning, cloud computing, and cybersecurity, but many research papers only
include segments of a broader, holistic understanding of anomaly detection. Some only
consider one type of data, while others only include one of a plethora of machine
learning algorithms. In contrast, CAMLPAD addresses several types of data with the
most efficient algorithms to provide a real-time, streamlined pipeline for autonomous
anomaly detection.

5 Conclusion

Security breaches and threats are growing along with the cybersecurity field.
CAMLPAD is a machine learning-based platform that can effectively detect anomalies in
real-time resulting in an outlier score. We demonstrated the vast possibilities of
anomaly detection in the cybersecurity field using unsupervised machine learning
models: Isolation Forest, Histogram-Based Outlier Detection, Cluster-Based Local
Outlier Factor, Multivariate Gaussian, and K-Means Clustering.
CAMLPAD’s pipeline starts with data streamed directly from Elasticsearch and
formatted on a local notebook. From there, we run the models which result in a
visualization of the data and an outlier score. From the different models, the outlier
scores are averaged and displayed on the Kibana dashboard.

5.1 Future Research


We plan to directly run our script on The Metadata Encoding and Transmission
Standard (METS) Schema to reach maximum efficiency and timely results. The METS Schema helps encode descriptive and structural metadata for objects in a digital library. In
addition to running the machine learning models on a METS Server, we plan to clean
up the data and make each field consistent. Since the data, streamed from elasticsearch,
is different for each group (based on day), preprocessing the data to make all data
points similar will help increase the efficiency.
In the future, we will use supervised machine learning models to ensure all data
points are represented. For example, we plan to use a Support Vector Machine
(SVM) which is helpful for classification of the outliers and anomalies. The overall
model is trained on data from an earlier period of time, then outliers are detected and
scored based on those inconsistencies.
Overall, CAMLPAD achieved an adjusted rand score of 95%, but with the use of
several ML models and making CAMLPAD more efficient, the accuracy for anomaly
detection can increase.

Acknowledgments. We would like to thank the employees at Blue Cloak, LLC for their gen-
erous support throughout the duration of this research endeavor as well as for the cybersecurity
data and tools used.

References
1. Garcia-Teodoro, P., Diaz-Verdejo, J., Maciá-Fernández, G., Vázquez, E.: Anomaly-based
network intrusion detection: techniques, systems and challenges. Comput. Secur. 28(1–2),
18–28 (2009)
2. Dasgupta, D. (ed.): Artificial Immune Systems and their Applications. Springer, Heidelberg
(2012)
3. Demertzis, K., Iliadis, L., Spartalis, S.: A spiking one-class anomaly detection framework for
cyber-security on industrial control systems. In: International Conference on Engineering
Applications of Neural Networks, pp. 122–134. Springer, Cham (2017)
4. Dasgupta, D.: Immunity-based intrusion detection system: a general framework. In:
Proceedings of the 22nd NISSC, vol. 1, pp. 147–160 (1999)
5. Abeshu, A., Chilamkurti, N.: Deep learning: the frontier for distributed attack detection in
fog-to-things computing. IEEE Commun. Mag. 56(2), 169–175 (2018)
6. Patel, A., Qassim, Q., Wills, C.: A survey of intrusion detection and prevention systems. Inf.
Manag. Comput. Secur. 18(4), 277–290 (2010)
7. Mylrea, M., Gourisetti, S.N.G.: Cybersecurity and optimization in smart “autonomous”
buildings. In: Autonomy and Artificial Intelligence: A Threat or Savior?, pp. 263–294.
Springer, Cham (2017)
8. Patel, A., Taghavi, M., Bakhtiyari, K., Junior, J.C.: An intrusion detection and prevention
system in cloud computing: a systematic review. J. Netw. Comput. Appl. 36(1), 25–41
(2013)
9. Li, Y., Guo, L.: An active learning based TCM-KNN algorithm for supervised network
intrusion detection. Comput. Secur. 26(7–8), 459–467 (2007)
10. Diro, A.A., Chilamkurti, N.: Distributed attack detection scheme using deep learning
approach for Internet of Things. Futur. Gener. Comput. Syst. 82, 761–768 (2018)
11. Inacio, C.M., Trammell, B.: Yaf: yet another flowmeter. In: Proceedings of LISA10: 24th
Large Installation System Administration Conference, p. 107 (2010)

12. Huang, M.Y., Jasper, R.J., Wicks, T.M.: A large scale distributed intrusion detection
framework based on attack strategy analysis. Comput. Netw. 31(23–24), 2465–2475 (1999)
13. Russell, S., Dewey, D., Tegmark, M.: Research priorities for robust and beneficial artificial
intelligence. Ai Mag. 36(4), 105–114 (2015)
14. Bilge, L., Balzarotti, D., Robertson, W., Kirda, E., Kruegel, C.: Disclosure: detecting botnet
command and control servers through large-scale netflow analysis. In: Proceedings of the
28th Annual Computer Security Applications Conference, pp. 129–138. ACM (2012)
15. Chen, H., Chiang, R.H., Storey, V.C.: Business intelligence and analytics: from big data to
big impact. MIS Q. 36(4) (2012)
16. Doelitzscher, F., Reich, C., Knahl, M., Passfall, A., Clarke, N.: An agent based business
aware incident detection system for cloud environments. J. Cloud Comput.: Adv. Syst. Appl.
1(1), 9 (2012)
17. Ten, C.W., Hong, J., Liu, C.C.: Anomaly detection for cybersecurity of the substations.
IEEE Trans. Smart Grid 2(4), 865–873 (2011)
18. Wressnegger, C., Schwenk, G., Arp, D., Rieck, K.: A close look on n-grams in intrusion
detection: anomaly detection vs. classification. In: Proceedings of the 2013 ACM Workshop
on Artificial Intelligence and Security, pp. 67–76. ACM (2013)
19. Aljawarneh, S., Aldwairi, M., Yassein, M.B.: Anomaly-based intrusion detection system
through feature selection analysis and building hybrid efficient model. J. Comput. Sci. 25,
152–160 (2018)
20. Valeur, F., Mutz, D., Vigna, G.: A learning-based approach to the detection of SQL attacks.
In: International Conference on Detection of Intrusions and Malware, and Vulnerability
Assessment, pp. 123–140 (2005)
A Holistic Approach for Detecting DDoS
Attacks by Using Ensemble Unsupervised
Machine Learning

Saikat Das(&), Deepak Venugopal, and Sajjan Shiva

The University of Memphis, Memphis, TN 38152, USA


{sdas1,dvngopal,sshiva}@memphis.edu

Abstract. Distributed Denial of Service (DDoS) has been the most prominent
attack in cyber-physical system over the last decade. Defending against DDoS
attack is not only challenging but also strategic. Tons of new strategies and
approaches have been proposed to defend against different types of DDoS
attacks. The ongoing battle between the attackers and defenders is full-fledged
due to its newest strategies and techniques. Machine learning (ML) has
promising outcomes in different research fields including cybersecurity. In this
paper, an ensemble unsupervised ML approach is used to implement an intrusion detection system that has noteworthy accuracy in detecting DDoS attacks.
The goal of this research is to increase the DDoS attack detection accuracy while
decreasing the false positive rate. The NSL-KDD dataset and twelve feature sets
from existing research are used for experimentation to compare our ensemble
results with those of our individual and other existing models.

Keywords: Unsupervised machine learning ensemble · Novelty and outlier detection · DDoS detection · Accuracy · IDS · False positive rate

1 Introduction

From the beginning of the architectural evolution of the Internet, the proper way to
transmit a packet, and process reduction were the major concerns. Cyber attackers
easily exploit the existing limitations of the Internet protocols (TCP, UDP, etc.) and the
readily available attack tools. A Distributed Denial of Service (DDoS) attack is mostly
a network attack that causes bandwidth overloading due to the use of immense inbound
or outbound traffic over the network, resulting in disruption of the normal operation.
The first well-documented DDoS attack appears to have occurred in August 1999,
when a DDoS tool called ‘Trinoo’ was deployed in at least 227 systems, to flood a
single University of Minnesota computer, which was knocked down for more than 2
days. In recent years, attacks on financial systems, broadcast systems, and Internet-
based services have grown exponentially [1]. Moreover, those attacks are devastating,
wide-ranging, easy to implement, and difficult to detect and defend against, posing a major threat to Internet privacy and security. Today's Internet is badly plagued by DDoS attacks, which have escalated drastically over the last decade. In the last
couple of years, the giants such as GitHub, Amazon, Cloudflare, Facebook, Instagram,

etc. have had their services disrupted by DDoS attacks. According to the World Infrastructure Security Report 2018 [2], for the first time ever a DDoS attack reached 1 Tbps (terabit per second) in size, and the Internet has officially entered the terabit attack era. The largest attack was recorded at 1.7 Tbps. The report mentions that 16,794 DDoS attacks occurred per day, which equals roughly 700 attacks per hour or 12 attacks per minute, and it predicts that this number will keep growing rapidly.
To stay alive in the competition, defenders are developing the newest technologies
and mechanisms against those attacks. The existing DDoS attacks have been scruti-
nized and it is found that the attack can be mitigated by one of these three approaches
or defense mechanisms, namely, attacker-end approach, victim-end approach, and in-
network approach, depending on their locality of deployment. Though attacker-end
detection approach is much more challenging than the victim-end detection approach,
solutions exist. On the other hand, victim-end detection is easier to implement com-
pared to the other two types of detection approaches. The existing detection approaches
can be categorized into statistical, soft computing, clustering, knowledge-based, and
hybrid. These approaches can also be classified as supervised or unsupervised based on
the type of dataset [3].
In the evolution of Intrusion Detection Systems (IDS), anomaly-based detection is
more popular than signature-based detection. Machine learning (ML) has promising
outcomes in detecting cyber-physical attacks including DDoS. Many researchers have
already used ML classifiers to build IDSs in defending against DDoS attacks. In
Machine learning, supervised, semi-supervised, and unsupervised are three basic ways
to classify anomalous packets from normal packets. Supervised methods have the
privilege of differentiating anomalous and normal data from a tagged dataset. Unsu-
pervised methods, on the other hand, cluster dataset into different clusters where the
strength of the clustering lies within the algorithm itself. Among those, novelty and
outlier detection strategies are the unsupervised methods that have significant outcomes
in detecting the unseen anomaly. One class SVM (Support Vector Machine), Local
Outlier Factor, Elliptic Envelope, Isolation Forest, etc. are the most well-known novelty
and outlier detection classifiers.
Both supervised and unsupervised classifiers are being used in developing IDS.
However, the majority of these approaches have focused on learning a single model for
intrusions. Moreover, due to the varied nature of intrusions, it may be hard to learn a
single model that generalizes to all types. For example, some types of intrusions can be
modeled using a simple linear model (e.g. logistic regression) while others may require
more complex non-linear models (e.g. support vector machines with kernels). There-
fore, the main idea of this paper is to train several models that can identify DDoS
intrusions and then combine these into a unified system based on different mechanisms.
The benefits of ensemble learning, i.e., combining multiple classifiers to form a
more powerful classifier have been well-studied in the ML community. Dietterich et al.
[4] mentioned that ensembles can perform better than a single classifier and many
classification problems have benefited from the idea of combining multiple classifiers.
In general, there are two ways to ensemble the classifiers: homogeneous and hetero-
geneous. When similar types of classifiers are used to build a training model, it is called
a homogeneous ensemble (e.g.; bagging and boosting), whereas combining different

types of classifiers is called a heterogeneous ensemble (e.g. stacking). Both homogeneous and heterogeneous ensembles are being used to build IDS. Aburomman et al. [5]
mentioned a wide range of ensemble ML techniques and methods used to detect
network intrusion. However, there are several drawbacks with existing ML approaches.
First, existing techniques do not use relevant domain knowledge in constructing the
classifier. This means that they end up using a lot of irrelevant features which results in
the so-called “curse-of-dimensionality”, i.e., the accuracy and generalization reduce as
the number of features increase. Next, most existing methods focus on supervised ML
models which is problematic since it requires a large amount of labeled data. Finally,
even in existing methods that use unsupervised methods to detect network intrusions,
there is no systematic approach that has been developed that can combine different
unsupervised models, which is particularly important since learning an ensemble model
is more robust as compared to learning a single model. In this work, our goal is to
address all these three issues.
Here, twelve feature sets [6–12] that produce higher accuracy are considered for
experimentation. The novelty of this research is to ensemble ‘novelty and outlier
detection’ type unsupervised classifiers for better detection accuracy and lower false
positive alarm. We show that this ensemble approach outperforms existing research
methods as well and empirically show that generalization over new attacks is signifi-
cantly improved when we combine different approaches as compared to using any one
single approach.
The rest of the paper is organized as follows: In Sect. 2, we discuss the state of the
art of recent IDS that use ensemble learning and how this research contribution is
different and better from other approaches. An ensemble-based IDS framework is
proposed in Sect. 3 and in Sect. 4, experimentation and observations are shown.
Finally, in Sect. 5, the pros and cons of the proposed model are discussed and the
conclusion of the paper with future research direction is drawn.

2 Literature Review

DDoS attacks have become a weapon of choice for hackers as well as for cyber
terrorists and are used as a form of protest in politically unstable societies. Various
detection techniques are being improvised by researchers to defend against DDoS
attacks over the year. To evade the existing DDoS attack detection solutions, the attack
itself changes frequently. Based on the various techniques such as cloud computing,
software defined networking (SDN), backbone web traffic, and big data strategies, the
DDoS attack detection can be categorized into filtering mechanisms, router functions, network flow, statistical analysis, and machine learning.
A comprehensive survey of Machine learning intrusion detection [13], systematic
literature review and taxonomy of DDoS attack [14] are necessary to know the state of
the art of ML approaches for both IDS and DDoS. Ahmad Riza'ain et al. [14] performed an
in-depth analysis on DDoS attack types as well as on existing DDoS detection and
attack prediction techniques by characterizing the attacks. Also, they identified the
factors behind those attacks. Moreover, they have classified and ranked at least 53
articles from different digital libraries such as Science Direct, ACM Digital Library,

IEEE Xplore, Springer, and Web of Science related to DDoS detection and prevention
and found 30% of them using ML techniques as their detection or prevention strategy.
To detect the DDoS attack, supervised [15], semi-supervised [16], and unsuper-
vised methods are being used to build the training model. A combination of supervised
and unsupervised ML model to detect anomaly can also be found in [17]. Neural
Network and SVM for supervised modeling, KNN for unsupervised modeling, and
Principal Component Analysis (PCA) and Gradual Feature Reduction (GFR) for fea-
ture selection with NSL-KDD dataset are used there. However, the reason behind using
the combination of the supervised and unsupervised method in their research is
ambiguous.
Ensemble is the way of combining multiple classifiers for better performance over
single classifier and many classification problems have benefited from this idea [4].
Homogeneous (combination of similar types of classifiers) and heterogeneous (com-
bination of different types of classifiers) are the two major ensemble types. A detailed
survey of ensemble and hybrid classifiers [5] helps in understanding the usage and
shortfalls of ensemble ML in network security.
Outlier and novelty detection techniques are more efficient in detecting unknown
attacks as they use unsupervised ML models. An unsupervised ML model is used [18]
to detect a high-volume DDoS attack using in-memory distributed graph. Jabez et al.
[19] mentioned an outlier detection mechanism NOF (Neighborhood Outlier Factor) to
detect the anomaly. But there could be a high chance for a single classifier to predict
incorrectly compared to multiple classifiers’ prediction. Therefore, an ensemble clas-
sifier would be a perfect fit for predicting anomalous behavior precisely. Smyth et al.
[20] showed that stacked density estimation outperforms a single best model which
could be chosen based on cross-validation, combining with uniform weights, or even
through bias. A few hybrid supervised learning models [21] are used to detect DDoS
attack but realistically for a zero-day attack or unknown attacks, an unsupervised
hybrid model has better detection accuracy.
Though most of the researchers have chosen a single classifier to train their model
in detecting DDoS attack, a combination of classifiers has better accuracy compared to
a stand-alone model. Moreover, none of them have focused on either unsupervised
ensembles or outlier and novelty detection ensembles. In addition, their works do not provide proper guidance for detecting unseen attacks. Therefore, our motivation and the goal of
this paper is to build an unsupervised ensemble model which combines five different
‘outlier and novelty detection’ classifiers, resulting in the detection of unseen DDoS
attacks. Using an unsupervised ensemble model is the novelty of this research, and better detection accuracy with lower false positive rates is obtained with this model.

3 Proposed Method

As DDoS has devastating, damaging effects on organizations' assets, a comprehensive defense mechanism is required to protect those assets. Anomaly-based IDSs have better detection accuracy than signature-based IDSs, with the advantage of detecting unseen attacks, but at the expense of falsely identifying many unusual activities as anomalous. Traditional IDSs keep upgrading their strategies to cope with new

attack types and patterns. However, with changes in attackers' motives and intentions, an adaptive IDS is in high demand in the cyber world. In this paper, an ML-based IDS is proposed whose novelty is to ensemble unsupervised classifiers based on an outlier detection approach, and which gives better detection accuracy with a lower false positive rate in detecting DDoS.

3.1 Dataset
In this research, the NSL-KDD [22] dataset is used for training and testing purposes. NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set that are mentioned in [23]. Although McHugh discussed some problems that this new version of the KDD data set still suffers from, and it may not be a perfect representative of existing real networks due to the lack of public data sets for network-based IDSs, it can be applied as an effective benchmark data set to help researchers compare different intrusion detection methods. NSL-KDD has some major improve-
ments over the original KDD’99 dataset [22]:
1. No redundant records in the train data, so classifiers will not be biased towards more
frequent records.
2. No duplicate records in the test data, so the performance of the learners is not biased
by the methods which have better detection rates.
3. The number of selected records from each difficulty group is inversely proportional
to the percentage of records in the original KDD’99 dataset.
4. The number of records in the train and test sets is reasonable, which makes the dataset affordable for running the experiments, etc.
The dataset contains eight data files of different formats that are compatible with
most experimental platforms. Table 1 shows a summary of the testing and training data
records.

Table 1. Summary of training and testing data records.


Class Training set Occurrence % Testing set Occurrence %
Normal 67343 53.46% 9711 43.08%
DDoS 45927 36.46% 7460 33.085%
Other 12703 10.08% 5373 23.85%
Total 125973 100% 22544 100%

3.2 Data Preprocessing


A machine learning classifier trains and tests its model using a dataset that contains numeric values, so it is necessary to convert any non-numeric data to numeric data before it can be fed to a classifier. After analyzing the whole dataset, categorical values were found in the features protocol type, service, and flag, whereas all other features contain numeric values. In preparation for the feature selection step, where the most important features are selected, these categorical values need to be converted into numeric values. In the data preprocessing phase, for each such feature, the distinct
values are identified for all entries in that column and replaced with numeric values
using simple integer assignment starting from 1. The reference for this conversion is
shown in Table 2.

Table 2. Conversion table for categorical variables to integer values


Feature Integer conversions
name
Protocol ‘tcp’: 1, ‘udp’: 2, ‘icmp’: 3
type
Service ‘ftp_data’: 1, ‘other’: 2, ‘private’: 3, ‘http’: 4, ‘remote_job’: 5, ‘name’: 6,
‘netbios_ns’: 7, ‘eco_i’: 8, ‘mtp’: 9, ‘telnet’: 10, ‘finger’: 11, ‘domain_u’: 12,
‘supdup’: 13, ‘uucp_path’: 14, ‘Z39_50’: 15, ‘smtp’: 16, ‘csnet_ns’: 17, ‘uucp’:
18, ‘netbios_dgm’: 19, ‘urp_i’: 20, ‘auth’: 21, ‘domain’: 22, ‘ftp’: 23, ‘bgp’: 24,
‘ldap’: 25, ‘ecr_i’: 26, ‘gopher’: 27, ‘vmnet’: 28, ‘systat’: 29, ‘http_443’: 30,
‘efs’: 31, ‘whois’: 32, ‘imap4’: 33, ‘iso_tsap’: 34, ‘echo’: 35, ‘klogin’: 36,
‘link’: 37, ‘sunrpc’: 38, ‘login’: 39, ‘kshell’: 40, ‘sql_net’: 41, ‘time’: 42,
‘hostnames’: 43, ‘exec’: 44, ‘ntp_u’: 45, ‘discard’: 46, ‘nntp’: 47, ‘courier’: 48,
‘ctf’: 49, ‘ssh’: 50, ‘daytime’: 51, ‘shell’: 52, ‘netstat’: 53, ‘pop_3’: 54, ‘nnsp’:
55, ‘IRC’: 56, ‘pop_2’: 57, ‘printer’: 58, ‘tim_i’: 59, ‘pm_dump’: 60, ‘red_i’:
61, ‘netbios_ssn’: 62, ‘rje’: 63, ‘X11’: 64, ‘urh_i’: 65, ‘http_8001’: 66, ‘aol’:
67, ‘http_2784’: 68, ‘tftp_u’: 69, ‘harvest’: 70
Flag ‘SF’: 1, ‘S0’: 2, ‘REJ’: 3, ‘RSTR’: 4, ‘SH’: 5, ‘RSTO’: 6, ‘S1’: 7, ‘RSTOS0’:
8, ‘S3’: 9, ‘S2’: 10, ‘OTH’: 11

The NSL-KDD dataset that we have used is a tagged dataset. Since the proposed model initially works with unsupervised methods, which do not require any class label, the 'class' column is removed from the dataset in this phase. On the other hand, logistic regression (LR) and naïve Bayes (NB), which are used to combine the outputs of those unsupervised methods, do require a class label, so we use the removed class label from this phase to train the LR and NB models later.

3.3 Feature Selections


In our earlier research [24], we built a supervised ensemble model using 24 reduced features [6, 7]. Here, twelve different feature sets from existing research are used for experimentation, and their accuracy is analyzed. Based on our domain knowledge, we verified the relevance of each feature to DDoS attacks. Table 3 lists all feature sets used in the data classification phase.

Table 3. Feature sets with features’ list from references


Feature set References Feature count Features
FS-1 [6, 7] 24 2, 3, 4, 5, 7, 8, 10, 13, 23, 24, 25, 26, 27, 28, 29, 30,
33, 34, 35, 36, 38, 39, 40, 41
FS-2 [8] 13 3, 4, 29, 33, 34, 12, 39, 5, 30, 38, 25, 23, 6
FS-3 [8] 14 5, 3, 6, 4, 30, 29, 33, 34, 35, 38, 12, 39, 25, 23
FS-4 [8] 14 12, 26, 4, 25, 39, 6, 30, 38, 5, 29, 3, 37, 34, 33
FS-5 [8] 14 5, 3, 6, 4, 29, 30, 33, 34, 35, 12, 23, 38, 25, 39
FS-6 [8] 14 3, 29, 4, 32, 38, 33, 39, 12, 36, 23, 26, 34, 40, 31
FS-7 [9] 12 23, 5, 3, 6, 32, 24, 12, 2, 37, 36, 8, 31
FS-8 [10] 16 2, 4, 10, 14, 17, 19, 21, 24, 25, 26, 27, 30, 31, 34, 35,
39
FS-9 [11] 35 9, 26, 25, 4, 12, 39, 30, 38, 6, 29, 5, 37, 11, 3, 22, 35,
34, 14, 33, 23, 8, 10, 31, 27, 28, 32, 1, 36, 2, 41, 40,
17, 13, 16, 19
FS-10 [12] 19 3, 4, 5, 6, 12, 23, 24, 25, 26, 29, 30, 32, 33, 34, 35, 36,
37, 38, 39
FS-11 [5] 15 4, 5, 6, 8, 10, 12, 17, 23, 26, 29, 30, 32, 37, 38, 39
FS-12 41 All features

3.4 Data Classification


The data classification section is divided into two major parts: individual classifiers and ensemble classifiers.
Individual Classifier. To detect a DDoS attack using outlier detection and novelty detection techniques, the model must be able to decide whether a new observation belongs to the same distribution as the existing observations (an inlier) or should be considered different (an outlier).
In any training dataset, data may be concentrated in different regions or separated from each other. Observations that lie far from any concentrated region are defined as outliers. Outlier detection estimators try to fit the concentrated regions and ignore the deviant observations. In novelty detection, on the other hand, the training data is not polluted by outliers, and the task is to decide whether a new observation is an outlier; such an observation is called a novelty. Both outlier and novelty detection techniques are used for anomaly detection, where the interest is in identifying abnormal or unusual observations. Outlier detection and novelty detection are also known as unsupervised and semi-supervised anomaly detection, respectively.
Here, the proposed method starts with each of the five unsupervised outlier detection classifiers working individually, and their performance metrics (accuracy, false positive rate, precision, recall, and F1 score) are observed. Then the Majority Voting, Logistic Regression, and Naïve Bayes ensemble techniques are applied on top of these five classifiers to obtain better performance.

One Class SVM. Support vector machines (SVMs) are well-known supervised learning models that analyze data and recognize patterns, and they can be used for both classification and regression tasks. One-class classification (OCC), also known as unary classification or class-modeling, tries to identify objects of a specific class amongst all objects by learning primarily from a training set containing only objects of that class [25].
Therefore, in anomaly detection, one-class SVM is trained with data that has only
one class, which is the “non-anomalous” or “normal” class. It infers the properties of
normal classes and using these properties, it can predict which examples are unlike the
normal. This is useful for anomaly detection because the scarcity of training examples
is what defines anomalies; typically, there are very few examples of the network
intrusion, fraud, or other anomalous behavior [26].
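
A minimal scikit-learn sketch of this train-on-normal-only idea is shown below. The synthetic data and the hyperparameter values are illustrative only and are not the tuned settings used in the experiments (see Table 4 for those).

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(0)
    X_normal = rng.normal(size=(500, 5))                      # stand-in for "normal" traffic features
    X_test = np.vstack([rng.normal(size=(50, 5)),             # unseen normal-like samples
                        rng.normal(loc=4.0, size=(10, 5))])   # anomalous-looking samples

    ocsvm = OneClassSVM(nu=0.2, kernel='rbf', gamma=0.1).fit(X_normal)
    pred = ocsvm.predict(X_test)                              # +1 = inlier ("normal"), -1 = outlier
    print('flagged as anomalous:', int((pred == -1).sum()))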
Local Outlier Factor. The local outlier factor score is computed by the LOF algorithm
which reflects the degree of abnormality of the observations. With respect to its
neighbors, LOF measures the local density (obtained from k-nearest neighbors) devi-
ation of a given data point. As a result, it detects the samples that have a substantially
lower density than their neighbors. For an observation, the LOF score is equal to the
ratio of the average local density of its k-nearest neighbors and its own local density.
A “normal” data is expected to have a local density similar to that of its neighbors,
while an “abnormal” data is expected to have much smaller local density.
Isolation Forest. The Isolation Forest algorithm performs outlier detection efficiently in high-dimensional datasets. The algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature [27]. Recursive partitioning is used to split the values. Since the recursive partitioning can be represented by a tree structure, the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating leaf node. This path length, averaged over a forest of such random trees, is a measure of normality and serves as the decision function. Random partitioning produces noticeably shorter paths for anomalies. Therefore, samples for which the random trees collectively produce shorter path lengths are highly likely to be anomalies [28].
Elliptic Envelope. A common way to detect outliers is to assume that the regular data come from a known distribution (e.g. a Gaussian distribution). From this assumption, a ‘shape’ of the data can be defined, and data points that lie far enough from this fitted shape are treated as outlying observations. Elliptic Envelope assumes the data are normally distributed and, based on that assumption, ‘draws’ an ellipse around the data, classifying any observation inside the ellipse as an inlier (labeled +1, “normal”) and any observation outside the ellipse as an outlier (labeled −1, “anomalous”).
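
The sketch below illustrates, with placeholder data and settings, how the remaining three detectors can be fitted and how the +1/−1 inlier/outlier convention is read; it is not the configuration used in the experiments.

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.ensemble import IsolationForest
    from sklearn.covariance import EllipticEnvelope

    rng = np.random.RandomState(1)
    X_train = rng.normal(size=(500, 5))
    X_test = np.vstack([rng.normal(size=(50, 5)), rng.normal(loc=5.0, size=(10, 5))])

    detectors = {
        'LOF': LocalOutlierFactor(n_neighbors=20, contamination=0.1, novelty=True),
        'ISOF': IsolationForest(contamination=0.1, random_state=0),
        'ELE': EllipticEnvelope(contamination=0.1, random_state=0),
    }
    for name, detector in detectors.items():
        detector.fit(X_train)
        pred = detector.predict(X_test)      # +1 = inlier, -1 = outlier
        print(name, 'flagged', int((pred == -1).sum()), 'of', len(X_test), 'samples as outliers')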
Ensemble Classifier. All four classifiers described previously are used to build five training models, where One-class SVM is used twice with different hyperparameters. According to the framework, different ensemble techniques are then applied on top of these five models to combine them. The majority voting mechanism is chosen as a baseline. Logistic regression (LR) and naïve Bayes (NB) are two supervised models that are subsequently applied on top of these five classifiers for better detection accuracy and a lower false positive rate.
Majority Voting. The majority voting scheme is a common and basic technique in ML ensembles. In general, a majority means that the greater part, i.e. more than half, of the total agrees. In machine learning, an output prediction could be ‘1’ or ‘0’, and a majority voting mechanism can be applied to any number of classifiers’ outputs. When more than half of the classifiers’ predictions agree on a certain prediction value ‘1’ or ‘0’, that value becomes the final output of the majority voting mechanism. For example, if five classifiers A, B, C, D, E predict ‘1, 0, 1, 1, 0’ respectively for a data instance, the final output decided by majority voting will be ‘1’. When majority voting is used in an ML ensemble as a combination rule (which only works with nominal classes), each classifier predicts a nominal class label for a test sample, and the label predicted most often is selected as the output of the voting classifier.
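
A tiny sketch of this voting rule, assuming the 0/1 encoding used above, is shown below.

    import numpy as np

    def majority_vote(predictions):
        """predictions: (n_samples, n_classifiers) array of 0/1 base-classifier outputs."""
        predictions = np.asarray(predictions)
        return (predictions.sum(axis=1) > predictions.shape[1] / 2).astype(int)

    # The example from the text: classifiers A..E predict 1, 0, 1, 1, 0 -> majority output is 1
    print(majority_vote([[1, 0, 1, 1, 0]]))   # -> [1]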
Logistic Regression. Logistic regression is the go-to method for binary classification
problems (problems with two class values) which is borrowed by ML from the field of
statistics. In statistics, the logistic model (or logit model) is used to model the proba-
bility of an existing class or event such as normal/abnormal, pass/fail, win/lose,
hot/cold, etc. This can be extended to model several classes of events, where each event is assigned a probability between 0 and 1 and the probabilities sum to 1. The coefficients (beta values, b) of the logistic regression algorithm must be estimated from the training data, which is done using maximum-likelihood estimation. The best coefficients result in a model that predicts a value very close to 1 (e.g. anomalous) for the default class and a value very close to 0 (e.g. normal) for the other class. The intuition behind maximum likelihood for logistic regression is that a search procedure seeks coefficient values that minimize the error between the probabilities predicted by the model and those observed in the data.
Naïve Bayes. Naïve Bayes is a simple and common classifier used in many ML problems. It is a probabilistic classifier based on Bayes’ theorem, which gives the probability of an event based on prior knowledge of conditions associated with that event. The goal of any probabilistic classifier (with features X0, X1, …, Xn and classes C0, C1, …, Ck) is to determine the probability of the features occurring in each class and to return the most likely class. The naïve Bayes classifier assumes that the features are independent of each other, hence the name “naïve”.
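
The stacking idea described here can be sketched as follows: the five detectors' 0/1 outputs form a five-dimensional feature vector on which LR and NB are trained against the preserved class labels. The data below is random stand-in data, and GaussianNB is only one of several naïve Bayes variants that could be used; the paper does not state which variant it employs.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.RandomState(0)
    base_outputs = rng.randint(0, 2, size=(200, 5))   # stand-in for the five detectors' 0/1 outputs
    true_labels = rng.randint(0, 2, size=200)         # stand-in for the preserved 'class' column

    lr_meta = LogisticRegression().fit(base_outputs, true_labels)
    nb_meta = GaussianNB().fit(base_outputs, true_labels)

    new_vector = [[1, 0, 1, 1, 0]]                    # base predictions for one test sample
    print('LR:', lr_meta.predict(new_vector), 'NB:', nb_meta.predict(new_vector))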
Figure 1 shows the process flow of the proposed framework from data prepro-
cessing to ensemble classification. In this framework, NSL-KDD dataset contains both
training and testing data and is used as an input of Step-1. In Step-2, data are converted
into a model readable format with the help of data preprocessing and feature selection.
Processed training data is then fed into five different classifiers: One-Class SVM
(OCS) with two different hyperparameters, Local Outlier Factor (LOF), Isolation Forest
(ISOF), and Elliptic Envelope (ELE). At the end of this step, all five different classifiers
have built their models using training data. The test data is then passed from the raw dataset through the feature selection and data preprocessing phases to Step-4, where it is used to evaluate the training models built in Step-3 and to predict outcomes. In Step-5, the predictions coming from the different training models in the previous step form a vector for a
single data instance which is then carried to each of the three ensemble classifiers:
Majority Voting (MV), Logistic Regression (LR), and Naïve Bayes (NB). MV, LR,
and NB have different performance measures for that prediction vector. Based on the
higher detection accuracy, precision, recall, F-1 score, and lower false positive rate,
Step-6 decides the best ensemble algorithm which is finally chosen for DDoS attack
detection.

Fig. 1. Process flow of the proposed framework to detect DDoS attacks

4 Experimental Results

The proposed model has been implemented in Python using the ML library scikit-learn [29] to build the training models and to evaluate them on the test data. In the experiments, raw data was extracted from the NSL-KDD website and then converted into a classifier-readable format by the data preprocessing phase described earlier in the ‘Proposed Method’ section; the basic idea is the conversion of non-numeric data content to the corresponding assigned numeric values. Since unsupervised outlier detection models were used in this experiment, the training data should consist mostly of ‘normal’ or ‘non-anomalous’ data instances for better predictions. From the whole training dataset,

normal and anomalous data were separated, and a new training dataset was created that
contained 99% of normal data and 1% of anomalous data. The reason behind using 1%
anomalous data in the training dataset was to make the model run efficiently and
accurately in detecting anomaly by learning from normal behavior. The additional
percentage of noise (anomalous data) was added later to measure the framework’s
efficiency. After separating normal and anomalous data from the training data, 67343
data were found as normal (see Table 1). 1% of this normal data which is 673, was
added with anomalous data to create a new or modified training dataset. On the other
hand, test dataset wasn’t separated but only 1000 amount of random data instances
were chosen from test data and created a modified test dataset to evaluate the training
model. For both cases, the ‘class’ column from the dataset was removed as we used
unsupervised methods in this framework. However, the ‘class’ column from the test
dataset is preserved to use it later in determining the accuracy of all three ensemble
classifiers, and training as well as testing purposes for logistic regression and naïve
Bayes models.
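
A rough sketch of this construction is given below. The text is slightly ambiguous about the exact mixture, so the sketch follows the stated 99% normal / 1% anomalous composition; the column name 'class' and the function signature are assumptions for illustration.

    import pandas as pd

    def build_modified_sets(train_df, test_df, noise_fraction=0.01, test_size=1000, seed=42):
        """Return (X_train, X_test, y_test) following the 99% normal / 1% anomalous construction."""
        normal = train_df[train_df['class'] == 'normal']
        anomalous = train_df[train_df['class'] != 'normal']
        n_noise = int(len(normal) * noise_fraction)                 # e.g. 1% of 67343 is about 673
        modified_train = pd.concat([normal, anomalous.sample(n_noise, random_state=seed)])
        modified_test = test_df.sample(test_size, random_state=seed)
        y_test = modified_test['class']                             # preserved for evaluating the ensembles
        return (modified_train.drop(columns=['class']),
                modified_test.drop(columns=['class']),
                y_test)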
In the early phase of the experiments, five different classifiers were used to build five different training models from the training dataset. As the goal of this research is to detect both existing and new DDoS attack patterns, outlier and novelty detection classifiers were the highest priority when selecting classifiers. Four different types of outlier and novelty detection classifiers were chosen, namely One-Class SVM, Local Outlier Factor, Isolation Forest, and Elliptic Envelope. Since four is an even number and could lead to ties when choosing an outcome with the majority voting ensemble, the next odd number, five, was chosen as the classifier count in this experiment. Initially, we experimented with different hyperparameter values for these classifiers and, based on accuracy, selected the best five hyperparameter combinations. The details of the classifiers and the hyperparameter combinations used in our experiment are listed in Table 4.

Table 4. Outlier and novelty detection classifier details


Classifier name Short code Hyperparameters
One class SVM- OCSVM-1 or nu = 0.2, kernel = “poly”, gamma = 0.1
Poly kernel OCSVM-Poly
One class SVM- OCSVM-2 or nu = 0.2, kernel = “linear”, gamma = 0.1
Linear kernel OCSVM-Linear
Local outlier LOF n_neighbors = 20, contamination = 0.22,
factor novelty = True
Isolation forest ISOF behaviour = 'new’, max_samples = 100,
random_state = RandomState(listLength),
contamination = 0.2
Elliptic envelope ELE support_fraction = 1, contamination = 0.2,
random_state = RandomState(listLength)

Using the different combinations from Table 4, five different training models were built from the modified training set. The test dataset was then used to evaluate each model.

For performance evaluation, the confusion matrix, True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), False Negative Rate (FNR), Precision, Recall, and F-Measure are used very frequently. Three measures derived from the confusion matrix are Sensitivity, Specificity, and Accuracy, which can be defined as follows:

\text{Sensitivity} = \frac{TPR}{TPR + FNR} \qquad (1)

\text{Specificity} = \frac{TNR}{TNR + FPR} \qquad (2)

\text{Accuracy} = \frac{TPR + TNR}{TPR + TNR + FPR + FNR} \qquad (3)

Also, Precision, Recall, and F-Measure are another three important performance
metrics that are used to evaluate a model. Those terms can be defined in terms of TP,
TN, FP, FN from Eqs. (4), (5) and (6).

\text{Precision } (P) = \frac{TP}{TP + FP} \qquad (4)

\text{Recall } (R) = \frac{TP}{TP + FN} \qquad (5)

F\text{-Score} = \frac{2PR}{P + R} \qquad (6)
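
For reference, these metrics can be computed from raw confusion-matrix counts as in the helper below, which follows the standard count-based definitions; the counts in the example call are illustrative only.

    def classification_metrics(tp, tn, fp, fn):
        """Standard count-based metrics; returns a dict of the measures used in this section."""
        tpr = tp / (tp + fn)                    # sensitivity / recall
        tnr = tn / (tn + fp)                    # specificity
        fpr = fp / (fp + tn)
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        f1 = 2 * precision * tpr / (precision + tpr)
        return {'accuracy': accuracy, 'fpr': fpr, 'precision': precision,
                'recall': tpr, 'f1': f1, 'specificity': tnr}

    print(classification_metrics(tp=855, tn=920, fp=79, fn=146))    # illustrative counts only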

Table 5 shows the performance metrics obtained when each of these classifiers was used individually to train a model.

Table 5. Performance metrics when a single classifier is used.

Classifier   OC SVM-Poly kernel  OC SVM-Linear kernel  LOF    Isolation forest  Elliptic envelope
Accuracy     0.863               0.863                 0.570  0.893             0.900
FPR          0.075               0.077                 0.332  0.079             0.092
Precision    0.866               0.884                 0.496  0.890             0.878
Recall       0.780               0.782                 0.438  0.855             0.890
F-1 Score    0.829               0.830                 0.465  0.872             0.884

After training the single classifiers, a top-level classifier or technique was used to ensemble the five classifiers. Majority voting was considered as a baseline; then logistic regression (LR) and naïve Bayes (NB) were used to ensemble the classifiers.

In majority voting, the nominal class label predicted most often by the unsupervised models is selected as the final output. LR and NB, on the other hand, are supervised models that require a class label to learn. Here, those two models are used to combine the outputs of the five unsupervised models. To train them, a labeled dataset was created using the unsupervised models’ outputs as features, together with the class label that had been removed in the data preprocessing phase. Maximum-likelihood estimation is the key mechanism by which both logistic regression and naïve Bayes predict the ensemble outputs. Figure 2 shows a graphical comparison of the ensemble classifiers’ performance metrics with those of the single-classifier models.

Fig. 2. Performance metrics of five single classifiers compared to ensemble classifiers.

As mentioned in Sect. 3.3, 12 different feature sets, in which each feature is relevant to DDoS attacks, were considered for building training models. Figure 3 shows the accuracy of the three ensemble techniques with respect to the 12 feature sets; feature set 4 (FS-4) gives the best accuracy when Logistic Regression is used for the ensemble.
As mentioned earlier, the majority of the data instances were normal, and a very small amount (1%) of noise (abnormal data) was mixed in to build the modified training dataset. To verify the framework’s stability and efficiency as noise increases (i.e. as more anomalous data is added to the training dataset), we varied the noise amount from 1% to 5% and measured the performance metrics. Figure 4 shows how performance changes as noise is added to the training data.

Fig. 3. Accuracy of ensemble classifiers with respect to different feature sets.

Fig. 4. Performance measurement with respect to adding noise

The proposed framework was tested and verified with both single classifiers and ensemble classifiers in order to achieve higher detection accuracy with a lower false positive rate. The Logistic Regression based ensemble model achieved the highest detection accuracy, with a comparably low false positive rate, among all of the single classifiers (One-Class SVM, Local Outlier Factor, Isolation Forest, and Elliptic Envelope) and ensembles. Table 6 shows the overall comparison among all classifiers, including the single classifiers, the ensemble classifiers, and some existing research outcomes. ‘N/A’ indicates results not reported in the corresponding existing studies.

Table 6. Performance metrics comparison of single classifiers, ensemble classifiers, and existing research

Classifier Type Accuracy FPR Precision Recall F-1 Score
OCSVM-1 Single 0.863 0.075 0.886 0.780 0.829
OCSVM-2 Single 0.863 0.077 0.884 0.782 0.830
LOF Single 0.57 0.332 0.496 0.438 0.465
ISOF Single 0.893 0.079 0.890 0.855 0.872
ELEC Single 0.90 0.092 0.878 0.890 0.884
MV Ensemble 0.891 0.075 0.894 0.845 0.869
LR Ensemble 0.938 0.077 0.903 0.958 0.930
NB Ensemble 0.881 0.099 0.865 0.855 0.860
GAR [30] Single 0.773 N/A N/A N/A N/A
IG-GAR [30] Ensemble 0.789 N/A N/A N/A N/A
SU-GAR [30] Ensemble 0.776 N/A N/A N/A N/A
LDA-NB-kNNCF [31] Ensemble 0.82 N/A N/A N/A N/A
LOO-OAR-SVM [32] Ensemble 0.827 N/A N/A N/A N/A

The ROC (Receiver Operating Characteristic) curve is a probability curve. In anomaly detection, the higher the ROC curve, the better the model is at distinguishing anomalous traffic from normal traffic. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), with TPR on the y-axis and FPR on the x-axis. Figure 5 shows the ROC curve for DDoS classification in this experiment.

Fig. 5. ROC curve for DDoS classification



5 Conclusion

In this research, the goal was to detect DDoS attacks using an unsupervised ML ensemble. Classifiers from various families of outlier and novelty detection techniques were chosen to build the proposed framework. Initially, single classifiers were used to measure the performance metrics in detecting DDoS attacks. On top of these five individual classifiers (One-class SVM with two different hyperparameter settings, Local outlier factor, Elliptic envelope, and Isolation forest), an ensemble with majority voting was applied as a baseline. Naïve Bayes and logistic regression were then used to ensemble these five classifiers to obtain better detection accuracy. In our experiments, the logistic regression based ensemble has the best performance measures, not only compared to the baseline majority voting and naïve Bayes ensembles but also compared to the individual single-classifier models. In addition, it was observed from the experimental results, and in comparison with existing research, that the logistic regression based ensemble using feature set 4 (FS-4) has the best detection accuracy, high precision, recall, and F1 score, and a low false positive rate. Because it uses outlier detection classifiers, the proposed model is capable of detecting not only existing DDoS attacks but also unseen or new DDoS attacks.
In this research, only one dataset was considered for experimentation, and we plan to continue our experiments with other datasets using the proposed framework. The dataset we used is an offline dataset, so experimenting with online data remains a limitation. Twelve different feature sets were chosen from existing research for experimentation; in the future, we plan to reduce the features ourselves using different feature reduction techniques and domain knowledge. With this research as a base, we will also consider deep learning methods and software agents [33] for detecting DDoS attacks more accurately.

References
1. Lee, Y.-J., Baik, N.-K., Kim, C., Yang, C.-N.: Study of detection method for spoofed ip
against DDoS attacks. Pers. Ubiquitous Comput. 22(1), 35–44 (2018)
2. NETSCOUT Report. https://www.netscout.com/report/. Accessed 10 July 2019
3. Specht, S.M., Ruby B.L.: Distributed denial of service: taxonomies of attacks, tools, and
countermeasures. In: Proceedings of the 17th International Conference on Parallel and
Distributed Computing Systems (2004)
4. Dietterich, T.G.: Ensemble methods in machine learning. In: International Workshop on
Multiple Classifier Systems. Springer, Heidelberg (2000)
5. Aburomman, A.A., Reaz, M.B.I.: A survey of intrusion detection systems based on
ensemble and hybrid classifiers. Comput. Secur. 65, 135–152 (2017)
6. Noureldien, N.A., Yousif, I.M.: Accuracy of machine learning algorithms in detecting DoS
attacks types. Sci. Technol. 6(4), 89–92 (2016)
7. Olusola, A.A., Oladele, A.S., Abosede, D.O.: Analysis of KDD’99 intrusion detection
dataset for selection of relevance features. In: Proceedings of the World Congress on
Engineering and Computer Science, WCECS, vol. 1 (2010)
8. Osanaiye, O., et al.: Ensemble-based multi-filter feature selection method for DDoS
detection in cloud computing. EURASIP J. Wirel. Commun. Netw. 2016(1), 130 (2016)

9. Ambusaidi, M.A., et al.: Building an intrusion detection system using a filter-based feature
selection algorithm. IEEE Trans. Comput. 65(10), 2986–2998 (2016)
10. Gaikwad, D.P., Thool, R.C.: Intrusion detection system using bagging ensemble method of
machine learning. In: 2015 International Conference on Computing Communication Control
and Automation. IEEE (2015)
11. Shrivas, A.K., Dewangan, A.K.: An ensemble model for classification of attacks with feature
selection based on KDD99 and NSL-KDD data set. Int. J. Comput. Appl. 99(15), 8–13
(2014)
12. Tesfahun, A., Bhaskari, D.L.: Intrusion detection using random forests classifier with
SMOTE and feature reduction. In: 2013 International Conference on Cloud & Ubiquitous
Computing & Emerging Technologies. IEEE (2013)
13. Haq, N.F., et al.: Application of machine learning approaches in intrusion detection system:
a survey. IJARAI-Int. J. Adv. Res. Artif. Intell. 4(3), 9–18 (2015)
14. Yusof, A.R., Udzir, N.I., Selamat, A.: Systematic literature review and taxonomy for DDoS
attack detection and prediction. Int. J. Digit. Enterp. Technol. 1(3), 292–315 (2019)
15. Belavagi, M.C., Muniyal, B.: Performance evaluation of supervised machine learning
algorithms for intrusion detection. Procedia Comput. Sci. 89, 117–123 (2016)
16. Ashfaq, R.A.R., et al.: Fuzziness based semi-supervised learning approach for intrusion
detection system. Inf. Sci. 378, 484–497 (2017)
17. Perez, D., et al.: Intrusion detection in computer networks using hybrid machine learning
techniques. In: 2017 XLIII Latin American Computer Conference (CLEI). IEEE (2017)
18. Villalobos, J.J., Rodero, I., Parashar, M.: An unsupervised approach for online detection and
mitigation of high-rate DDoS attacks based on an in-memory distributed graph using
streaming data and analytics. In: Proceedings of the Fourth IEEE/ACM International
Conference on Big Data Computing, Applications and Technologies. ACM (2017)
19. Jabez, J., Muthukumar, B.: Intrusion detection system (IDS): anomaly detection using outlier
detection approach. Procedia Comput. Sci. 48, 338–346 (2015)
20. Smyth, P., Wolpert, D.: Stacked density estimation. In: Advances in Neural Information
Processing Systems (1998)
21. Hosseini, S., Azizi, M.: The hybrid technique for DDoS detection with supervised learning
algorithms. Comput. Netw. 158, 35–45 (2019)
22. Canadian Institute for Cybersecurity, Datasets/NSL-KDD. https://www.unb.ca/cic/datasets/
nsl.html. Accessed 10 July 2019
23. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.: A detailed analysis of the KDD CUP 99
data set. In: Submitted to Second IEEE Symposium on Computational Intelligence for
Security and Defense Applications (CISDA) (2009)
24. Das, S., Mahfouz, A.M., Venugopal, D., Shiva, S.: DDoS intrusion detection through
machine learning ensemble. In: 2019 IEEE 19th International Conference on Software
Quality, Reliability and Security Companion (QRS-C), pp. 471–477. IEEE, July 2019
25. One-Class classification. https://en.wikipedia.org/wiki/One-class_classification. Accessed 10
July 2019
26. Microsoft, One-Class Support Vector Machine. https://docs.microsoft.com/en-us/azure/
machine-learning/studio-module-reference/one-class-support-vector-machine. Accessed 10
July 2019
27. Scikit learn, Novelty and Outlier Detection. https://scikit-learn.org/stable/modules/outlier_
detection.html. Accessed 10 July 2019
28. Scikit learn, Isolation Forest. https://scikit-learn.org/stable/modules/generated/sklearn.
ensemble.IsolationForest.html. Accessed 10 July 2019
29. Scikit learn. https://scikit-learn.org. Accessed 10 July 2019

30. Kanakarajan, N.K., Muniasamy, K.: Improving the accuracy of intrusion detection using
GAR-Forest with feature selection. In: Proceedings of the 4th International Conference on
Frontiers in Intelligent Computing: Theory and Applications (FICTA) 2015. Springer,
New Delhi (2016)
31. Pajouh, H.H., Dastghaibyfard, G.H., Hashemi, S.: Two-tier network anomaly detection
model: a machine learning approach. J. Intell. Inf. Syst. 48(1), 61–74 (2017)
32. Pervez, M.S., Farid, D.Md.: Feature selection and intrusion classification in NSL-KDD cup
99 dataset employing SVMs. In: The 8th International Conference on Software, Knowledge,
Information Management and Applications (SKIMA 2014). IEEE (2014)
33. Das, S., Shiva, S.: CoRuM: collaborative runtime monitor framework for application
security. In: 2018 IEEE/ACM International Conference on Utility and Cloud Computing
Companion (UCC Companion). IEEE (2018)
Moving Towards Open Set Incremental
Learning: Readily Discovering
New Authors

Justin Leo(B) and Jugal Kalita

University of Colorado, Colorado Springs, CO 80918, USA


{jleo,jkalita}@uccs.edu

Abstract. The classification of textual data often yields important


information. Most classifiers work in a closed world setting where the
classifier is trained on a known corpus, and then it is tested on unseen
examples that belong to one of the classes seen during training. Despite
the usefulness of this design, often there is a need to classify unseen
examples that do not belong to any of the classes on which the classifier
was trained. This paper describes the open set scenario where unseen
examples from previously unseen classes are handled while testing. This
further examines a process of enhanced open set classification with a
deep neural network that discovers new classes by clustering the exam-
ples identified as belonging to unknown classes, followed by a process of
retraining the classifier with newly recognized classes. Through this pro-
cess the model moves to an incremental learning model where it continu-
ously finds and learns from novel classes of data that have been identified
automatically. This paper also develops a new metric that measures mul-
tiple attributes of clustering open set data. Multiple experiments across
two author attribution data sets demonstrate the creation an incremental
model that produces excellent results.

Keywords: Incremental learning · Open set · Deep learning ·


Authorship attribution

1 Introduction
Formal as well as informal textual data are over-abundant in this Internet-
connected era of democratized publishing and writing. These textual information
sources are in multiple forms such as news articles, electronic books and social
media posts. The use of text classification allows us to determine important
information about the texts that can often be used to connect to the respective
authors, naturally leading to the concept of Authorship Attribution. Authorship
Attribution is seen as the process of accurately finding the author of a piece of
text based on its stylistic characteristics [1]. Authorship Attribution is useful in
scenarios such as identification of the author of malicious texts or the analysis
of historical works with unknown authors.
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 739–751, 2020.
https://doi.org/10.1007/978-3-030-39442-4_54

Typically, text classification has a few well-established stages. The words in


the text corpus are transformed using an embedding algorithm, and a classifier
is trained with documents labeled with associated classes. In Authorship Attri-
bution, the text samples tend to be books such as novels, transcribed speeches,
or Internet-mediated social media posts, where each sample is labeled with the
corresponding author. The trained text classifier is given testing data that is
usually unseen text samples from the same set of trained authors. This process
describes a closed set approach because the tested samples are associated with
the same trained classes. A problem with this process of classification arises if
the testing data includes samples from unfamiliar authors. In these cases, the
classifier typically and erroneously associates the piece of text with a wrong
author, an author on which it was trained. To remedy this problem, a new app-
roach called open set classification has been proposed. Open set classification
enables the classifier to discriminate among the known classes, but additionally
and importantly, to identify if some test example is not associated with any of
the classes on which it was trained [2].
There has been some recent work on open set classification using convolution
neural networks (CNN) and recurrent neural networks (RNN). Prior work on
open set classification has often been in areas such as computer vision [3], speech
processing [4], and natural language processing [5]. This paper utilizes open set
recognition to identify the presence of test examples from novel classes, and
incorporate these new classes to those already known to create an incremental
class-learning model.
The rest of the paper is organized as follows. After describing the related
work in the next section, the approach is presented to identify new classes and
instantiate them. Then, the following section discusses evaluation metrics for
assessing incremental learning, followed by experimental results using author-
ship attribution datasets and analysis. The final section reiterates the research
accomplishments and thoughts on future work.

2 Related Work
Related work is discussed in terms of four topics: deep networks for open set classification, metrics for open set classification, open set text classification, and recent proposals to use loss functions for open set classification in the context of computer vision.

2.1 Open Set Deep Networks

Using deep neural networks for open set classification often requires a change in
the network model. Modern neural networks have multiple layers connected in
various ways, depending on the classifier architecture being used. Most models
eventually include a softmax layer that classifies the data to the known classes,
with an associated confidence level or probability for each class. A test example
is considered to belong to the class which has the highest probability among

all the classes. To adapt this model to the open set scenario, the softmax layer
was replaced by a unique layer named the OpenMax layer [6]. This layer esti-
mates the probability of an input being from one of the known classes as well
as an “unknown” class, which lumps together all classes unseen during training.
Thus, the network is able to recognize examples belonging to unknown classes,
enhancing the ability of the closed set classifier it starts with.

2.2 Metric for Evaluating Open Set Classification

The process of open set class recognition leads to new challenges during the
evaluation process. There are multiple sources of error that could be present
including: misclassification of known or unknown classes and determination of
novel classes. Bendale and Boult (2015) proposed a metric to evaluate how indi-
vidual examples are classified. Although the metric was originally proposed for
use in computer vision, it could be applicable in author attribution as well.

2.3 Deep Open Set Text Classification

Prakhya, Venkataram, and Kalita (2017) modify the single OpenMax layer pro-
posed by [6] to replace the softmax layer in a multi-layer convolution neural
networks with an ensemble of several outlier detectors to obtain high accuracy
scores for open set textual classification. The ensemble of classifiers uses a voting
model between three different approaches: Mahalanobis Weibull, Local Outlier
Factor [7], and Isolation Forest [8]. The average voting method produced results
that are more accurate in detecting outliers, making detection of unknown classes
better.

2.4 Loss Functions for Open Set Classification

A problem that often occurs in open set classification is the classifier label-
ing known class data as unknown. This problem typically occurs if there are
some similar features in the examples of the pre-trained classes and unknown
classes encountered during testing. In the context of computer vision, Dhamija,
Günther, and Boult (2018) introduce what is called the Entropic Open-Set loss
function that increases the entropy of the softmax scores for background train-
ing samples and improves the handling of background and unknown inputs.
They introduce another loss function called the Objectosphere loss, which further
increases softmax entropy and performance by reducing the vector magnitudes of
examples of unknown classes in comparison with those from the known classes,
lowering the erroneous classification of known class data as unknown. Since this
approach squishes the magnitudes of all examples that belong to all unknown
classes, it makes later separation of individual unknown classes difficult.

Fig. 1. Protocol for open set classification and incremental class learning

3 Approach
This paper explores open set classification and the process of moving towards
incremental learning of new classes. The objective is to create a classifier frame-
work that can incrementally learn and expand its knowledge base as additional
data is presented as shown in Figs. 1 and 2. The approach is also outlined in
Algorithm 1.
In prior work on open set classification, authors have focused on recognizing
test samples as belonging to classes unknown during training. To the best of our
knowledge, this paper is the first to instantiate new classes iteratively, extending
prior work to real incremental class learning. A summary of the approach is given
first to provide an easily comprehensible sketch before moving on to details. The
classifier framework is initialized by training it with examples from a small number
of selected classes. The trained classifier is then exposed, during testing, to a mix of
examples from the already-known classes as well as from unknown classes. At a
certain point, testing of the current classifier is paused and all examples recognized
as belonging to unknown classes are clustered. Clustering groups similar data
together and visually represents the differences between distinct clusters. The
hypothesis is that, if the clustering is good, one or more of the clusters of unknown
examples can be thought of as new classes the current classifier has not seen, and
these clusters are

Fig. 2. Ensemble model and testing classifier diagram. This diagram more clearly
describes the ‘Ensemble Outlier Detector’ component from Fig. 1.

instantiated as new classes by making up new unique labels for them. At this
point, the current classifier is updated by retraining it with all examples of the old
known classes as well as of the newly instantiated classes. This process of training,
accumulating outliers, clustering, and instantiating selected new classes out of the
clusters is repeated a number of times, as long as the error of the entire learning
process remains acceptable.
In particular, the classifier is a multi-layer CNN structure for training purposes.
During testing, the softmax layer at the very end is replaced by an outlier ensemble,
following the work of [9]. The outlier detector ensemble consists of a Mahalanobis
model, a Local Outlier Factor model, and an Isolation Forest model, as in [9]. The
classifier model, as used in training, is shown in Fig. 2. Initially the model is created
by training a classifier E_current with a given number k_seed of classes found in the
entire training data set D. Then a derived dataset D_current^test is created for testing
the model by mixing examples of k_unknown unknown classes with the previously
trained k_seed classes. The process always adds k_new classes to the number of
known classes. Thus, at the end of the i-th iteration of class-learning, the classifier
knows k_seed + (i − 1) k_new classes. The model instantiates “new” classes by
choosing dominant clusters and then retrains the model with these new classes.
These classes are then removed from the set of all classes, and new ones are selected
for the next incremental addition.
This paper experiments with multiple clustering techniques, including K-Means [10], Birch [11], DBScan [12], and Spectral [13], to determine the most suitable one for author attribution. There is also experimentation with various values of the parameters k_seed, k_unknown, and δ.

 
Input: Training set D = {(x^(i), y^(i))}, i = 1 · · · N, samples from all known classes
Output: An incrementally trained classifier E on examples from a number of classes in D
1  C_all ← {C_1, · · ·, C_n}, the set of all known classes
2  C_current^train ← (randomly) pick k_seed classes from C_all
3  D_current^train ← {(x^(i), y^(i)) | y^(i) ∈ C_current^train}, samples from classes in C_current^train
4  repeat
5      C_current^unknown ← (randomly) pick k_unknown classes from C_all − C_current^train
6      D_current^test ← D_current^train ∪ {(x^(i), y^(i)) | y^(i) ∈ C_current^unknown}
7      E_current ← (CNN) classifier trained on D_current^train
8      O ← outlier samples detected by the ensemble outlier detector when tested on D_current^test
9      L ← set of clusters produced from O using a selected clustering algorithm
10     L_dominant ← pick k_new dominant clusters from L; call these clusters new classes by making up new labels for them
11     C_current^train ← C_current^train ∪ L_dominant, increasing the number of “known” classes
12     D_current^train ← D_current^train ∪ {(x, y) ∈ L_j | L_j ∈ L_dominant}
13 until accuracy is too low or n iterations have been performed
14 E ← E_current
15 return E
Algorithm 1: Algorithm for incremental class-learning
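
A rough Python rendering of Algorithm 1 is sketched below. The callables train_cnn, detect_outliers, cluster_outliers, and pick_dominant_clusters are hypothetical placeholders for the CNN classifier, the outlier-detector ensemble, the clustering step, and the dominant-cluster selection described in the text; they are not defined here and would need project-specific implementations.

    import random

    def incremental_class_learning(D, all_classes, k_seed, k_unknown, k_new, max_iters,
                                   train_cnn, detect_outliers, cluster_outliers,
                                   pick_dominant_clusters, min_accuracy=0.5):
        """High-level sketch of Algorithm 1; the four callables are hypothetical placeholders."""
        known = set(random.sample(sorted(all_classes), k_seed))
        train_data = [(x, y) for (x, y) in D if y in known]
        classifier = None
        for _ in range(max_iters):
            unknown = set(random.sample(sorted(set(all_classes) - known), k_unknown))
            test_data = train_data + [(x, y) for (x, y) in D if y in unknown]
            classifier, accuracy = train_cnn(train_data)             # multi-layer CNN on known classes
            if accuracy < min_accuracy:                              # "until too low accuracy"
                break
            outliers = detect_outliers(classifier, test_data)        # ensemble outlier detector
            clusters = cluster_outliers(outliers)                    # e.g. spectral clustering
            new_classes = pick_dominant_clusters(clusters, k_new)    # {new_label: samples}
            for label, samples in new_classes.items():               # instantiate the new classes
                known.add(label)
                train_data += [(x, label) for x in samples]
        return classifier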

4 Evaluation Methods

Since the method uses clustering as well as classification in the designed protocol for incremental classification, both need to be evaluated. First, we outline how the clusters obtained from examples classified as unknown are evaluated, and then we describe how the incremental classifier is evaluated.

4.1 Evaluation of Clustering

There are a variety of clustering algorithms, and the model needs one that works
efficiently in the domain of author attribution. The test samples that are deemed
to be outliers are clustered, with the hypothesis that some of these clusters
correspond to actual classes in the original dataset. The evaluation process uses
the Davies-Bouldin Index as shown in Eq. (1) to evaluate clustering [14].
DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right) \qquad (1)

In this formula, n is the number of clusters produced, σi is the average distance


between the points in cluster i and its centroid, d(ci , cj ) is the Euclidean distance
between the centroids of clusters indexed i and j. Typically lower Davies-Bouldin
Index scores indicate better clustering. Another clustering evaluation metric used

is the V-Measure as shown in Eq. (2), which has been widely used in clustering in
natural language processing tasks when ground truth is known, i.e., the samples
and their corresponding classes are known. This metric computes the harmonic
mean between homogeneity and completeness [15]. Homogeneity measures how
close the clustering is such that each cluster contains samples from one class only.
Completeness measures how close the clustering is such that samples of a given
class are assigned to the same cluster. Typically scores close to 1 indicate better
clustering. Here β is a parameter used to weigh between the two components, a
higher value of β weighs completeness more heavily over homogeneity, and vice
versa.
V = \frac{(1 + \beta) \cdot \text{homogeneity} \cdot \text{completeness}}{\beta \cdot \text{homogeneity} + \text{completeness}} \qquad (2)
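
Both metrics are available in scikit-learn (davies_bouldin_score and v_measure_score; the latter corresponds to β = 1). The snippet below only illustrates the calls on synthetic data, not on the outlier samples used in this paper.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score, v_measure_score

    X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)
    y_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    print('Davies-Bouldin:', davies_bouldin_score(X, y_pred))   # lower is better
    print('V-Measure:', v_measure_score(y_true, y_pred))        # closer to 1 is better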

4.2 Evaluation of Open Set Misclassification Error

Assuming there are n known classes, multi-class classification using a classifier E_n(), trained on n classes, can be evaluated using the misclassification error:

\epsilon_n = \frac{1}{N} \sum_{i=1}^{N} \left[ E_n(x^{(i)}) \neq y^{(i)} \right] \qquad (3)

where N is the total number of samples in the dataset. When the same classifier E_n() is tested in the context of open set classification, errors that occur between known and unknown classes must also be tracked. When the classifier is tested on N samples from the n known classes and N′ samples from u unknown classes, the test comprises a total of N + N′ samples over n + u classes. The open set classification error \epsilon_{OS} for classifier E_n is given as [3]:

\epsilon_{OS} = \epsilon_n + \frac{1}{N'} \sum_{j=N+1}^{N+N'} \left[ E_n(x^{(j)}) \neq \text{unknown} \right] \qquad (4)
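
A small helper following Eqs. (3) and (4) is sketched below: the first term is the usual misclassification rate on the known-class samples, and the second is the rate at which unknown-class samples are not labelled "unknown". The example values are illustrative only.

    def open_set_error(known_true, known_pred, unknown_pred, unknown_label='unknown'):
        """Eq. (3) on the N known-class samples plus the Eq. (4) term for the N' unknown-class samples."""
        eps_n = sum(p != t for p, t in zip(known_pred, known_true)) / len(known_true)
        eps_unknown = sum(p != unknown_label for p in unknown_pred) / len(unknown_pred)
        return eps_n + eps_unknown

    print(open_set_error(known_true=['a', 'b', 'a', 'c'],
                         known_pred=['a', 'b', 'c', 'c'],
                         unknown_pred=['unknown', 'a', 'unknown']))   # 0.25 + 1/3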

4.3 Evaluation of Incremental Class Learning Accuracy

For this research, the approach uses clustering in order to obtain new classes after
open set recognition is performed. This way the new data identified for the novel
classes can be used to incrementally train the model. For the evaluation of these
clusters this paper presents a new metric ICA (Incremental Class Accuracy)
which takes into account the specific data from an identified cluster and averages
calculations of homogeneity, completeness, and unknown identification accuracy
of the cluster. This paper defines homogeneity as the ratio of the number of data
samples of the predominant class c in the cluster k (nc|k ) and the total number
of values in the cluster (Nk ). This paper defines completeness as the ratio of
the number of data samples of the predominant class c in the cluster k (nc|k )
and the total number of tested samples of the same class Nc . This paper defines
unknown identification accuracy as the ratio of the number of unknown data samples in cluster k (n_{u|k}) and the total number of values in the cluster (N_k). The equation used for ICA assumes only one cluster is being evaluated, but it can be adapted to multiple clusters by computing an ICA score for each cluster and averaging.
\text{Homogeneity} = \frac{\max(n_{c|k})}{N_k} \qquad (5)

\text{Completeness} = \frac{\max(n_{c|k})}{N_c} \qquad (6)

\text{Unknown Identification Accuracy} = \frac{n_{u|k}}{N_k} \qquad (7)

ICA = \left( \frac{\max(n_{c|k})}{N_k} + \frac{\max(n_{c|k})}{N_c} + \frac{n_{u|k}}{N_k} \right) \cdot \frac{1}{3} \qquad (8)
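
The metric can be computed directly from cluster counts as in the sketch below, which transcribes Eqs. (5)-(8) for a single cluster; the cluster composition in the example call is hypothetical.

    def incremental_class_accuracy(class_counts, n_unknown_in_cluster, n_class_total):
        """class_counts: true-class -> count inside the cluster;
        n_unknown_in_cluster: cluster members whose true class was unknown to the classifier;
        n_class_total: total tested samples of the cluster's predominant class (N_c)."""
        n_k = sum(class_counts.values())          # cluster size N_k
        n_ck = max(class_counts.values())         # predominant class count max(n_{c|k})
        homogeneity = n_ck / n_k                  # Eq. (5)
        completeness = n_ck / n_class_total       # Eq. (6)
        unknown_acc = n_unknown_in_cluster / n_k  # Eq. (7)
        return (homogeneity + completeness + unknown_acc) / 3   # Eq. (8)

    # Hypothetical cluster: 40 samples of unknown author X, 5 of unknown author Y, 5 of a known author
    print(incremental_class_accuracy({'newAuthorX': 40, 'newAuthorY': 5, 'knownAuthorZ': 5},
                                     n_unknown_in_cluster=45, n_class_total=60))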
Other metrics used to determine the performance of the model are accuracy and F1-score; these figures reflect the accuracy of the classifier as well as its detection of novel data.

5 Experiments and Results


This section discusses the data sets used, the experiments performed, and the
results with analysis.

5.1 Datasets
Since the objective is for open set author attribution, the testing uses two
datasets each of which contains 50 authors.
– Victorian Era Literature Data Set [16]: This dataset is a collection of
writing excerpts from 50 Victorian authors chosen from the GDELT database.
The text has been pre-processed to remove specific words that identify the
individual piece of text or author (names, author made words, etc.). Each
author has hundreds of unique text pieces with 1000 words each.
– CCAT-50 [17]: This data set is a collection of 50 authors each with 50 unique
text pieces divided for both training and testing. These texts are collections
of corporate and industrial company news stories. This data is a subset of
Reuters Corpus Volume 1.

5.2 Preliminary Clustering Results


After experimental comparison of the different clustering techniques, the final decision was to use Spectral Clustering [13], as it typically produces the highest accuracy results, as seen in Figs. 3 and 4; the clustering evaluation scores are also used for comparison. The pre-trained word2vec model [18] is used to obtain the word embeddings that are passed into the multi-layer CNN structure.
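
A sketch of this kind of clustering comparison is given below, using synthetic feature vectors as a stand-in for the outlier samples' embeddings; the data and parameter values are not the authors', and the V-Measure is used here as the comparison score.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans, Birch, DBSCAN, SpectralClustering
    from sklearn.metrics import v_measure_score

    X, y_true = make_blobs(n_samples=400, centers=3, cluster_std=1.5, random_state=7)
    algorithms = {
        'K-Means': KMeans(n_clusters=3, n_init=10, random_state=7),
        'Birch': Birch(n_clusters=3),
        'DBScan': DBSCAN(eps=1.0, min_samples=5),
        'Spectral': SpectralClustering(n_clusters=3, random_state=7),
    }
    for name, algorithm in algorithms.items():
        labels = algorithm.fit_predict(X)
        print(name, 'V-Measure:', round(v_measure_score(y_true, labels), 3))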

Fig. 3. Clustering plots for Victorian Literature data with accuracy score, 5 trained
classes and 8 tested classes

Fig. 4. Clustering plots for CCAT-50 data with accuracy score, 5 trained classes and
8 tested classes

5.3 Incremental Classification Results

For the first experiment, the objective was to see whether the proposed method would improve the classification accuracy and to decide which clustering algorithm works best. Both data sets are run individually, first with five known training classes and then with ten known training classes; for each test, the model is introduced to three unknown classes during the testing phase. The results, compared in terms of accuracy and F1-Score, are found in Table 1; a significant increase in these values is observed after the classifier is retrained with the identified novel classes. The clustering evaluation metrics are found in Table 2. The V-Measure scores prove to be more useful, because the Davies-Bouldin scores do not always indicate the highest clustering accuracy; the best-formed clusters do not necessarily yield higher accuracy. Even though the chosen data sets have not been used for open set classification in prior research, this paper compares the calculated open set classification scores with state-of-the-art closed set classification scores. Based on prior work, the best classification F1-Score for the Victorian Literature data set using only a few classes is 0.808 [16], and the designed model produces a slightly better score. Also based on prior research, the best classification accuracy score

Table 1. Pre-trained class scores and post-open set classification scores, either 5 or 10
initial trained classes and 3 unknown added during testing

Dataset Pre-trained Post-open set
        Acc F1 Acc F1
Victorian 5class 56.29% 0.592 85.43% 0.855
CCAT-50 5class 54.75% 0.565 83.00% 0.825
Victorian 10class 61.29% 0.644 71.38% 0.706
CCAT-50 10class 62.50% 0.727 86.77% 0.866

Table 2. Davies Bouldin Index and V-Measure score for clustering methods evaluated,
either 5 or 10 trained classes and 3 unknown added during testing.

Data set Davies-Bouldin V-Measure
         Vic-5 CCAT-5 Vic-10 CCAT-10 Vic-5 CCAT-5 Vic-10 CCAT-10
K-Means 2.739 2.045 1.989 0.876 0.078 0.039 0.147 0.082
Birch 2.670 2.237 4.193 3.654 0.165 0.075 0.147 0.083
Spectral 4.457 2.550 4.841 0.807 0.319 0.242 0.328 0.258
DBScan 4.031 2.432 4.783 4.349 0.065 0.101 0.158 0.149

for the CCAT-50 data set using only a few classes is 86.5%, and the designed model obtains similar results. The clustering models seem to contribute the most error for both data sets (especially the CCAT-50 data); presumably, better clustering models would produce better results.
For the second experiment, the model is initially trained with a fixed number k_seed of classes, and the method then incrementally adds k_unknown classes for testing. This process is repeated to demonstrate that the model incrementally learns as the learning and open set classification cycle is repeated. The test is run by adding classes over multiple iterations and recording the change in the F1-Score for the overall classification and the generation of new classes; the objective is to run each test until the results drop significantly or until the model reaches a maximum number of classes. Figure 5 displays the results of the incremental cycle, and it is observed that the model achieves better results when fewer classes are added at a time. The experiment runs tests that add 1, 2, and 3 classes at a time. The open set error given in Eq. 4 is also calculated for each test; this metric captures the error of unknown data identification but not of novel class generation. A problem noticed in the experiment is that error propagates through the process, so as error accumulates the results deteriorate. Another observation, based on the results from both data sets, is that adding one class per iteration gives better results because it limits the clustering error. It is also clear that the Victorian Literature data performs worse than the CCAT-50 data, and the initial reasoning for this concerns the text samples: the Victorian text includes words with slurs and accent mark symbols, and word2vec is not pre-trained with these

Fig. 5. Incremental learning plots. Initially trained with 5, 10, 15, and 20 initial classes
then tested by incrementally adding 1, 2, and 3 classes. These plots show the final F1-
scores and open set error from Eq. 4.

new features. The CCAT-50 data tends to have very distinct authors, and the pieces of text also tend to be more unique. Overall, based on the results, it is concluded that most of this error can be attributed to the clustering process.
As stated for the previous experiments, the clustering process tends to have the most variance; this is evident from the low clustering accuracy caused by the lack of fully distinct clusters. Thus, there needs to be a way to evaluate the clustering. Using the Incremental Class Accuracy (ICA) metric defined in Eqs. (5)–(8), the clustering can be evaluated in terms of homogeneity, completeness, and unknown identification accuracy. From the previous experiment it was also noticed that adding one class at a time tends to produce the best results, so the ICA score is calculated when one class is added and instantiated. The results for both data sets are shown in Table 3. From these results it is observed that a smaller number of initially trained k_seed classes produces better results, which is expected since the k_unknown classes are then more easily identified.

Table 3. ICA scores for 1 added class/cluster evaluation. Scores based on Eqs. (5)–(8).

Initial training Victorian CCAT-50


5 Classes 0.687 0.875
10 Classes 0.593 0.754
15 Classes 0.529 0.764
20 Classes 0.387 0.681

6 Conclusion
This research addresses open set classification for NLP text analysis in the area of Authorship Attribution. The model created determines the originating author of a piece of text based on its textual characteristics. This research also moves towards a novel incremental learning approach in which unknown authors are identified and the corresponding data is labeled so that the classifier expands its knowledge. Through this process, the state-of-the-art implementation is extended into a full-cycle model that trains on the given data and then expands the trained knowledge based on new data found during testing.
Text-based Authorship Attribution can be applied to research involving security and linguistic analysis. Similar developing work uses related research methods for image recognition [19], which can be applied to facial recognition tasks and video surveillance applications. This model can also be further improved by developing a more precise way of distinguishing different pieces of text. Another direction for future research is the use of backpropagation. Once novel classes are identified, the model should be able to modify the already trained classifier with the D_current^train data; the model can then be tested with D_current^test to determine whether it can recognize previously unknown classes. Backpropagation in a neural network requires a fully interconnected set of layers that allow data to be processed through either side of the model [20]. This process would save the step of fully retraining the classifier model. A similar approach would be to add new “neurons” to a deep neural network to allow a trained model to be extended [21]. With these future improvements, the designed model can be further refined and potentially obtain better results.

Acknowledgment. The work reported in this paper is supported by the National


Science Foundation under Grant No. 1659788. Any opinions, findings and conclusions
or recommendations expressed in this work are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation.

References
1. Rocha, A., Scheirer, W.J., Forstall, C.W., Cavalcante, T., Theophilo, A., Shen, B.,
Carvalho, A.R.B., Stamatatos, E.: Authorship attribution for social media foren-
sics. IEEE Trans. Inf. Forensics Secur. 12(1), 5–33 (2016)
2. Scheirer, W.J., de Rezende Rocha, A., Sapkota, A., Boult, T.E.: Toward open set
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(7), 1757–1772 (2012)
3. Bendale, A., Boult, T.: Towards open world recognition. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 1893–1902
(2015)
4. Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural
networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang.
Process. 20(1), 30–42 (2011)

5. Higashinaka, R., Imamura, K., Meguro, T., Miyazaki, C., Kobayashi, N., Sugiyama,
H., Hirano, T., Makino, T., Matsuo, Y.: Towards an open-domain conversational
system fully based on natural language processing. In: Proceedings of COLING
2014, the 25th International Conference on Computational Linguistics: Technical
Papers, pp. 928–939 (2014)
6. Bendale, A., Boult, T.E.: Towards open set deep networks. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572
(2016)
7. Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Loop: local outlier probabili-
ties. In: Proceedings of the 18th ACM Conference on Information and Knowledge
Management, pp. 1649–1652. ACM (2009)
8. Liu, F.T., Ting, K.M., Zhou, Z.-H.: Isolation forest. In: 2008 Eighth IEEE Inter-
national Conference on Data Mining, pp. 413–422. IEEE (2008)
9. Prakhya, S., Venkataram, V., Kalita, J.: Open set text classification using convo-
lutional neural networks. In: International Conference on Natural Language Pro-
cessing (2017)
10. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: a K-means clustering algorithm.
J. Roy. Stat. Soc.: Ser. C (Appl. Stat.) 28(1), 100–108 (1979)
11. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering
method for very large databases. In: ACM SIGMOD Record, vol. 25, pp. 103–
114. ACM (1996)
12. Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al.: A density-based algorithm for
discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp.
226–231 (1996)
13. Stella, X.Y., Shi, J.: Multiclass spectral clustering. In: Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV), p. 313. IEEE (2003)
14. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern
Anal. Mach. Intell. 2, 224–227 (1979)
15. Rosenberg, A., Hirschberg, J.: V-Measure: a conditional entropy-based external
cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empiri-
cal Methods in Natural Language Processing and Computational Natural Language
Learning (EMNLP-CoNLL), pp. 410–420 (2007)
16. Gungor, A.: Benchmarking authorship attribution techniques using over a thou-
sand books by fifty Victorian era novelists. Ph.D. thesis (2018)
17. Houvardas, J., Stamatatos, E.: N-gram feature selection for authorship identifica-
tion. In: International Conference on Artificial Intelligence: Methodology, Systems,
and Applications, pp. 77–86. Springer (2006)
18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: Advances in Neural
Information Processing Systems, pp. 3111–3119 (2013)
19. Rebuffi, S.-A., Kolesnikov, A., Sperl, G., Lampert, C.H.: iCaRL: incremental clas-
sifier and representation learning. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp. 2001–2010 (2017)
20. Hecht-Nielsen, R.: Theory of the backpropagation neural network. In: Neural Net-
works for Perception, pp. 65–93. Elsevier (1992)
21. Draelos, T.J., Miner, N.E., Lamb, C.C., Cox, J.A., Vineyard, C.M., Carlson, K.D.,
Severa, W.M., James, C.D., Aimone, J.B.: Neurogenesis deep learning: extending
deep networks to accommodate new classes. In: 2017 International Joint Conference
on Neural Networks (IJCNN), pp. 526–533. IEEE (2017)
Automatic Modulation Classification Using
Induced Class Hierarchies and Deep Learning

Toluwanimi Odemuyiwa and Birsen Sirkeci-Mergen(&)

San Jose State University, San Jose, CA 95192, USA


toluwa@live.ca, birsen.sirkeci@sjsu.edu

Abstract. In this work, we contribute to the emerging field of deep learning
(DL) methods for automatic modulation classification (AMC) for cognitive
radios. Traditional AMC methods rely on expert-based knowledge of the wireless
channel and incoming signals. These methods suffer from a lack of generalizability
to real-world channels that may be severely impaired or unknown. DL does not
require a priori or expert-based knowledge and has seen success in other fields
such as image processing and natural language processing. In recent years, DL
has been explored as an alternative to traditional methods; however, currently
proposed DL AMC methods suffer from high training times due to the many
layers used to improve classification accuracy. We propose the use of induced
class hierarchies to decompose the AMC task into subcomponents, while still
maintaining deep architectures for improved classification accuracy. A publicly
available, synthetic radio data set is used, which models severe channel
impairments under a range of signal-to-noise ratio (SNR) levels. Three hier-
archical convolutional neural network (h-CNN) architectures are developed: a
single-level, baseline model; a two-level hierarchical model, termed model A;
and a three-level hierarchical model, termed model B. Model A achieves a 4%
improvement in classification accuracy over the baseline model while model B
maintains comparable accuracy. Moreover, the training times of both models are
reduced, with 50% improvement with model A and 28.6% improvement with
model B, from the baseline model.

Keywords: Cognitive radio · Deep learning · Hierarchical classification

1 Introduction

In cognitive radios (CRs), automatic modulation classification (AMC) is a method used


by the receiver to automatically determine the modulation scheme of a transmitted
signal, regardless of various channel effects. Traditional AMC methods fall under the
following two categories: maximum-likelihood based methods and feature-based
methods. While theoretically accurate, likelihood-based methods are computationally
complex and require a priori knowledge of the signal and channel parameters [1].
Feature-based methods require signal preprocessing and expert feature engineering,
before applying a machine learning method [1].
In recent years, deep learning (DL) has seen widespread success in several fields and
is gradually gaining traction in wireless communication. Its advantage over traditional

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 752–769, 2020.
https://doi.org/10.1007/978-3-030-39442-4_55

AMC methods is in its ability to inherently learn features from the dataset and use these
learned features to form a decision. Neural networks (NNs) have been shown to be
universal function approximators [2], suggesting they might be capable of competing
with likelihood-based AMC methods. Since NNs inherently learn features from a given
dataset, there is no need for feature engineering, as with feature-based AMC methods.
The work in [3] demonstrates that convolutional neural networks (CNN) can achieve
comparable accuracy to traditional AMC methods, even in the presence of severe
channel impairments. In [4], the authors apply a hierarchical approach and provide a
proof-of-concept that hierarchical deep neural nets (DNNs) are feasible for AMC.
However, their hierarchies are manually chosen, based on expert knowledge of mod-
ulation type.
Hierarchical classification has been used to improve classification in other fields,
such as text-classification. In [5], the authors developed a method to automatically
induce a hierarchy of classifiers based on a confusion matrix of a base classifier. The
hierarchical approach performed better than flat classifiers on text classification [5]. In
this work, we propose a hierarchical CNN architecture framework, h-CNN, that
leverages induced class hierarchies to completely remove the need for expert domain
knowledge for feature engineering and hierarchy determination. Moreover, to the best
of our knowledge, previous works have not applied induced class hierarchies to DL
architectures; this work provides an opportunity to explore that area.
In this paper, related research is presented in Sect. 2; Sect. 3 overviews the dataset
and experimental framework, results are presented in Sect. 4; a discussion on the
viability of h-CNN follows in Sect. 5, and finally Sect. 6 notes key conclusions and
outlines future work.

2 Related Work

One problem with previous research on AMC methods is the lack of uniformity with
datasets; that is, various papers generate their own datasets which undergo different
channel conditions. Thus, while the works are useful for determining which features and
classification methods perform well under certain channel conditions, it is difficult to
directly compare and evaluate results across all works. To meet this need in modulation
classification, in [3], a GNU Radio channel model is developed, and three signal datasets
are generated to create benchmarks for future work using DL techniques for AMC. The
datasets are available in [8]. Using this dataset, several candidate neural networks are
developed in [3] and a CNN of four layers performed comparably to feature-based
methods. Since CNNs use filters and convolution to remain impervious to shifting,
rotation, linear mixing, and scaling in images, they are a viable candidate to naively
learn modulation schemes for radio signals that have undergone various transformations
due to the channel [3]. Recent works on the application of DL to AMC have been based
on this pioneering work done in [3], which the authors expanded in [6, 7].
In [9], the authors expand on the work of [3] by using other DL architectures such
as ReLu and CLDNN. While the work in [9] was able to achieve significantly better
results on all four models, the use of deeper networks resulted in higher computational
complexity and a significant increase in training times compared to the original

architecture in [3]. In [10], the authors further expand on the work of [9] but focus on
reducing overall training time. Various dimension reduction methods, such as principal
component analysis (PCA) and subsampling methods, are used to reduce the input
vector dimensions. Some of these methods reduced training time by up to a factor of 2,
but several of the models suffered considerable loss in accuracy, depending on the
original number of dimensions and reduction method used [10].

3 Methods

In this work, the publicly available synthetic radio dataset, RadioML2016.10a from [3,
8] is used. The code used to generate this dataset is available at [11]. This dataset
comes from an effort highlighted in [3] to create a benchmark database on which
various DL AMC methods can be evaluated and compared against. Two sources are
used for the dataset: an analogue source consisting of an episode from a podcast, and a
digital source composed of the entire works of Shakespeare from Gutenberg.
To simulate a real-world transmission, the signals are passed through a channel
modelled in GNU Radio [6]. The channel model is shown in Fig. 1. Five transfor-
mations are applied to the signal:
1. Sample Rate Offset: This gives an offset to the sample rate to model timing offsets
between transmitter and receiver.
2. Center Frequency Offset: This injects a randomized frequency offset to model
effects such as a mismatch between the transmitter and receiver oscillator, or the
Doppler effect between a moving receiver or transmitter.
3. Selective Fading Model: This block captures the Rayleigh or Rician fading pro-
cesses and uses the sum of sinusoids method to generate a signal output based on
multipath propagation.
4. Additive White Gaussian Noise: This is usually the result of thermal noise.
Twenty SNR levels are used, ranging from −20 decibels (dB) to 18 dB.

Fig. 1. Channel effects applied to the transmitted signal to simulate real-world scenarios.

3.1 Signal Modulation


A total of eleven modulation schemes are used. The WBFM, AM-SSB, and AM-DSB
schemes are used for the analogue source. The 8PSK, BPSK, CPFSK, GFSK, PAM4,
QAM16, QAM64, and QPSK schemes are used for the digital source. The final signal
consists of a series of symbols which are represented as the sum of sinusoidal

functions. For example, in the case of QPSK, given a symbol c_i and a carrier waveform
with frequency f_c, the transmitted waveform x(t_i) is as follows [3]:

x(t_i) = e^{\,j\left(2\pi f_c t + \pi \frac{2 c_i + 1}{4}\right)}, \quad c_i \in \{0, 1, 2, 3\}    (1)
3.2 Input Vectors


The input data to the AMC architectures is a set of 2 × 128 vectors comprising the
real-valued and imaginary-valued portions of the signal. The first row contains the in-
phase (I) samples, and the second row contains the quadrature-phase (Q) samples. Each
column represents a timestamp, for a total of 128 timestamps. The sample rate is 1
million samples per second, giving 128 µs of observation time per observed sample.
Every sample is represented as a 32-bit floating point number [3].
Overall, the dataset contains 220,000 samples, of which half are randomly selected
for training, and the other half for validation and testing. There are 1000 samples per
modulation scheme at each SNR level. Figure 2 shows the power spectral density of these signals, where each
waveform is taken from the 536th sample at 18 dB SNR.
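For reference, a minimal sketch of loading and splitting this dataset is given below. The pickle file name and its dict-of-(modulation, SNR) layout are those distributed with RadioML2016.10a [8, 11] and should be treated as assumptions here:

```python
import pickle
import numpy as np

with open("RML2016.10a_dict.pkl", "rb") as f:           # assumed file name
    data = pickle.load(f, encoding="latin1")             # {(mod, snr): (1000, 2, 128)}

mods = sorted({mod for mod, _ in data})
X, y, snr = [], [], []
for (mod, s), frames in data.items():
    X.append(frames)                                      # I/Q frames, 2 x 128 each
    y += [mods.index(mod)] * len(frames)
    snr += [s] * len(frames)
X, y, snr = np.vstack(X), np.array(y), np.array(snr)      # X: (220000, 2, 128)

rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
train, test = idx[: len(X) // 2], idx[len(X) // 2:]       # random 50/50 split
```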

3.3 Hardware
Training and testing are conducted on the GPU machines provided by the Google
Colaboratory cloud service [12]. Hosted on the Google Cloud platform, Colaboratory
provides free access to GPU acceleration. The underlying hardware consists of an
NVIDIA Tesla K80 GPU, 12 GB of RAM, and 2496 CUDA cores [13]. All code is
written using Python 3 and the Keras library, with a TensorFlow backend.

3.4 Experimental Framework


In order to determine the effectiveness of hierarchical CNN architectures, the following
steps are followed:
1. For a reliable comparison, a baseline model, patterned after that in [3] and shown in
Fig. 3 is trained. The reference source code can be found online at [14].
2. A confusion matrix is then generated based on the prediction performance of the
base model. Using this confusion matrix, a series of various probable class hier-
archies are generated using a similar method as described in [5]. Based on cluster
analysis, one class hierarchy is chosen.
3. Various brute-force hierarchical classifiers are developed. That is, each sub-
classifier is trained individually on its subset of children labels. Each sub-classifier
is developed using the baseline model as a template and modified to achieve the
highest possible accuracy for that subset of classes.
4. The overall hierarchical CNN architecture is formed by chaining the sub-classifiers.
Prediction is performed using this chained architecture.

Fig. 2. PSD waveforms versus frequency for I-phase (blue) and Q-phase (orange) at 18 dB
SNR. Sample #536.

Fig. 3. The baseline CNN model consists of two convolutional layers followed by two fully
connected layers.

4 Results: Induced Hierarchies with CNN

4.1 Baseline Model


The baseline model is trained, and prediction is run, based on the work in [3]. Training is
completed with batch sizes of 1024 samples over 100 epochs, with early stopping once
validation loss stops significantly changing after five epochs. Optimization is done
using the Adam solver.
Overall, training took 39 epochs, at 30 s per epoch, for a total of 1149 s in training
time. Figure 4 shows the training and validation losses over the number of epochs for
the baseline model. In this run, the baseline CNN architecture yields an accuracy of
50.1% when predicting modulation schemes regardless of SNR noise level.
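A minimal Keras sketch of this baseline is given below. The filter counts, kernel sizes, and dropout rate follow the public reference implementation in [14]; treat them as assumptions rather than values stated in this paper:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_baseline(n_classes=11):
    """Two convolutional layers followed by two fully connected layers (cf. Fig. 3)."""
    model = keras.Sequential([
        layers.Reshape((2, 128, 1), input_shape=(2, 128)),
        layers.ZeroPadding2D((0, 2)),
        layers.Conv2D(256, (1, 3), activation="relu"),
        layers.Dropout(0.5),
        layers.ZeroPadding2D((0, 2)),
        layers.Conv2D(80, (2, 3), activation="relu"),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training setup as described above: batch size 1024, up to 100 epochs,
# early stopping on validation loss with patience 5, Adam optimizer.
early = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                      restore_best_weights=True)
# build_baseline().fit(X[train], y[train], batch_size=1024, epochs=100,
#                      validation_data=(X[test], y[test]), callbacks=[early])
```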

Fig. 4. Training and validation loss over 39 epochs for the baseline model.

4.2 Inducing Hierarchies


Figure 5 shows the confusion matrix of the baseline model in an all-SNR scenario. The
resulting distance matrix generated using the method outlined in [5] is shown in Fig. 6.
Values closer to 0 in the distance matrix indicate the two classes are nearly identical,
while values closer to 1 indicate the two classes are fully distinct, according to the
baseline model. From the matrices, the baseline model commonly misclassifies all
signals as AM-SSB, and has trouble distinguishing between QPSK and 8PSK, AM-
DSB and WBFM, and the QAM16 and QAM64 modulation schemes.
From the generated distance matrix, a set of dendrograms is created. Dendrograms
are the final step when performing agglomerative clustering [15]. Agglomerative
clustering uses a bottom-up approach in which individual data points are successively
grouped with similar data points until a larger cluster is formed. A dendrogram then
displays the hierarchy of each cluster that is combined to form the larger cluster. In the
case of hierarchical classification, the data points are the accuracy measures of each
class, and the generated dendrogram provides insight into how close each class is to
other classes in the feature space.
Various distance measures are used to produce different dendrograms. To deter-
mine which dendrogram to select based on a distance measure, the cophenetic value is
used. Given x(i, j), which represents the distance between classes i and j, and t(i, j) - the
height of the node where the two classes are joined [16], the cophenetic correlation
coefficient is defined as [16]:
c = \frac{\sum_{i<j} (x(i,j) - \bar{x})(t(i,j) - \bar{t})}{\sqrt{\left[\sum_{i<j} (x(i,j) - \bar{x})^2\right]\left[\sum_{i<j} (t(i,j) - \bar{t})^2\right]}}    (2)

where \bar{x} and \bar{t} are the averages of x(i, j) and t(i, j), respectively. Overall, the
cophenetic value is a measure of how well a dendrogram remains faithful to pairwise
distances [16]. Using this value, the standardized Euclidean measure, with a coefficient
of 95.9%, was chosen as the final measure. Its resulting dendrogram is shown in Fig. 7.
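A minimal SciPy sketch of this selection step follows. It assumes the baseline confusion matrix has been saved to a CSV file (a hypothetical name) and that each class is described by its row of confusion values, which is one common way of realizing the procedure of [5]:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, dendrogram
from scipy.spatial.distance import pdist

conf = np.loadtxt("baseline_confusion.csv", delimiter=",")    # assumed file

best = None
for metric in ["euclidean", "seuclidean", "cityblock", "cosine", "correlation"]:
    d = pdist(conf, metric=metric)             # pairwise class distances
    Z = linkage(d, method="average")           # agglomerative clustering
    c, _ = cophenet(Z, d)                      # cophenetic coefficient, Eq. (2)
    if best is None or c > best[1]:
        best = (metric, c, Z)

print(f"selected measure: {best[0]} (cophenetic coefficient = {best[1]:.3f})")
# dendrogram(best[2], labels=class_names)      # induced hierarchy, cf. Fig. 7
```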
An induced hierarchy can aid in better understanding hidden features of the
modulation schemes. In this induced hierarchy, the root separates the QAM signals
from the remaining schemes; rather than analog versus digital as done in previous
papers where hierarchies were determined manually [4]. From a classifier perspective,
this suggests the higher-order information contained in the QAM schemes versus the
remaining schemes has higher discriminatory power than the analog versus digital
features of the modulation schemes. AM-DSB and WBFM are grouped together as
both contain similar inputs of silent or carrier-tone only vectors from the original
analog signal, which confuses the base CNN classifier and affects the resulting
dendrogram.

Fig. 5. Confusion matrix of the baseline CNN model in an all-SNR scenario.

4.3 Hierarchical CNN Architectures: Model A


Figure 8 shows the architecture of the 2-level h-CNN. It consists of three separately
trained CNN classifiers. We term the hierarchical model in Fig. 8 as “Model A.” Here,
QAM-16 and QAM-64 are separated out from the rest of the schemes first. To avoid
sample bias during training, rather than start with the root of the dendrogram, the next
node is chosen instead. From Fig. 7, this node separates out QAM-16, QAM-64,
GFSK, WBFM, and AM-DSB from AM-SSB, 8PSK, BPSK, CPFSK, PAM4 and
QPSK. The next level of the hierarchy then implements a 6-way and 5-way CNN
classifier, respectively.

Fig. 6. Distance matrix for the baseline CNN model in an all-SNR scenario.

Fig. 7. Dendrogram generated using the standardized Euclidean measure.



For the root-binary classifier, we experiment with various modifications and find
that a deeper network with three convolution layers and three dense layers produces
the highest accuracy. CNN_L1a and CNN_L1b are also trained using similar CNN
architectures as the root. Overall, the root classifier achieves an accuracy of 84%, while
CNN_L1a and CNN_L1b achieve accuracies of 53% and 65%, respectively. The total
combined training time of all three classifiers is around 568 s, cutting the training time

Fig. 8. Model A contains two levels of CNN classifiers.

Fig. 9. Confusion matrix of the baseline model at 18 dB SNR.



Fig. 10. Confusion matrix of Model A at 18 dB SNR.

of the baseline model by over half. At first glance, this result appears surprising, as the
hierarchical model is a deeper architecture overall. However, each sub-CNN is trained on
a smaller subset of classes, and the root is simply a binary classifier, which is less complex
than an N-way classifier. The overall accuracy achieved in the all-SNR scenario
is 51.1%. Though only slightly improved from the base CNN, the training time has
noticeably decreased. Moreover, at lower SNR ranges, the hierarchical classifier per-
forms better than the baseline CNN. Figures 9 and 10 show the confusion matrices of
the baseline CNN and Model A at 18 dB SNR. From these matrices, Model A does an
excellent job of accurately classifying 8PSK, AM-DSB, and QAM-64, at the cost of a higher
misclassification rate for WBFM, AM-DSB, and QAM-16. In this case, the hierarchical
classifier is biased towards AM-DSB versus WBFM, 8PSK versus QPSK, and QAM64
versus QAM16, to achieve higher accuracy values.
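A minimal sketch of the chained prediction used by Model A is shown below; which branch each of CNN_L1a / CNN_L1b serves, and the exact class groupings, are assumptions based on the description above:

```python
import numpy as np

GROUP_0 = ["QAM16", "QAM64", "GFSK", "WBFM", "AM-DSB"]          # one root branch
GROUP_1 = ["AM-SSB", "8PSK", "BPSK", "CPFSK", "PAM4", "QPSK"]   # the other branch

def predict_model_a(root, cnn_l1a, cnn_l1b, x):
    """Route each 2x128 frame through the two-level hierarchy of Fig. 8."""
    route = np.argmax(root.predict(x), axis=1)            # level-0 binary decision
    out = np.empty(len(x), dtype=object)
    for branch, (clf, names) in enumerate([(cnn_l1a, GROUP_0), (cnn_l1b, GROUP_1)]):
        sel = route == branch
        if sel.any():
            out[sel] = np.array(names)[np.argmax(clf.predict(x[sel]), axis=1)]
    return out
```

Because each sub-classifier only ever sees the frames routed to it, the chained model keeps the per-classifier training problem small, which is consistent with the reduced training time reported above.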

4.4 Hierarchical CNN Architectures: Model B


The three-level h-CNN architecture is shown in Fig. 11. This architecture steps one
level below Model A in the dendrogram tree to add an additional 4 sub-CNN classi-
fiers. The final level of classifiers contains two 2-way classifiers, a 3-way classifier, and
a 4-way classifier; models CNN L2a and CNN L2c, CNN L2b, and CNN L2d,
respectively. As with Model A, each sub-CNN is trained separately, then combined to
form the overall hierarchical structure for prediction. CNN models root, L1a, and L1b

are kept the same as Model A. The individual classification accuracies of CNN L2a,
CNN L2b, CNN L2c, and CNN L2d are 50%, 62%, 83%, and 55%, respectively. The
overall model achieves an accuracy of 50.7%, equivalent to the baseline model.

Fig. 11. Model B contains three levels of CNN classifiers.

Fig. 12. Model B’s confusion matrix at −8 dB SNR.



This degradation in accuracy from Model A can be attributed to the difficulty of the
lowest-level CNNs in naturally extracting the fine-grained features that distinguish
similar classes. For example, regardless of how deep CNN_L2a is designed for clas-
sifying QAM-16 and QAM-64, the area under the curve (AUC) measure – taken from a
graph of True Positive rates versus False Positive rates – consistently oscillates around
50%, indicating the classifier cannot definitively distinguish between the two modu-
lation schemes. Even upon changing the representation of the input vectors from raw
IQ samples to instantaneous amplitude and phase samples, the classification results
remain the same. Previous works using the RadioML datasets have noted similar
results, and several authors suggest adding a preprocessing step to manually extract
features [10–14]. However, this involves stepping back into using expert feature
engineering, which DL should not require. Figures 12 and 13 show the confusion
matrix of Model B at −8 dB SNR and 18 dB SNR. In both the high and low SNR
scenario, there is no significant improvement over the base CNN classifier.
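For completeness, the AUC measure discussed above can be computed with a short helper such as the following sketch; the classifier and data names are hypothetical placeholders for the trained CNN_L2a model and its held-out QAM-16/QAM-64 frames:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def qam_auc(clf, X_qam, y_qam):
    """AUC of a binary QAM-16 (label 0) vs QAM-64 (label 1) softmax classifier."""
    p_qam64 = clf.predict(X_qam)[:, 1]        # predicted probability of QAM-64
    return roc_auc_score(y_qam, p_qam64)      # ~0.5 means no discrimination
```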

Fig. 13. Model B’s confusion matrix at 18 dB SNR.

4.5 Performance Comparisons


Overall, the baseline model is the shallowest model; thus, it has the lowest number of
parameters. However, it has the highest training time, since it takes as input a larger
dataset for training. Both Model A and B have a significantly larger number of training

parameters but reduce the overall training time by nearly half and a third, respectively.
Figure 14 shows the overall performance, in terms of accuracy, of all 3 models.
Starting at −12 dB, Model A performs significantly better than the other two models, while
Model B performs comparably to the baseline CNN model. Figures 15 and 16 depict
the overall training and performance times of the three models.

Fig. 14. Accuracy versus SNR of all three models.

Fig. 15. Number of parameters of each model.



Fig. 16. Training time of all three models.

5 Induced Hierarchies for AMC: Are They Viable?

This work has provided a baseline that shows the viability and benefits of using DL
based induced hierarchical classification for AMC. First, using DL eliminates the need
for expert-based feature extraction, since the model can independently learn features
during the training phase.
Second, there are various performance benefits to using induced class hierarchies.
For the DL based model, both hierarchical models have significantly improved training
times over the base CNN. For previous works using the RadioML dataset, an
improvement in classification accuracy generally also results in an increase in training
time due to the use of deeper neural networks [9, 17]. In this work, though the overall
architecture is deeper than the base model, training time is reduced due to the modu-
larization of the N-way problem into smaller classification tasks. Moreover, the 2-level
hierarchy yields a significant increase in accuracy from the baseline model, while the 3-
level hierarchy has comparable performance. However, while induced class hierarchies
improve or provide comparable performance in terms of classification accuracy and
training time, their effectiveness and depth are limited by the effects of error propagation.

5.1 Error Propagation in Hierarchical Models


A weakness in the hierarchical classification approach is the issue of error propagation;
the mistakes of higher nodes in the hierarchy affect the ability of children nodes to

properly classify a given signal. If a higher-level node misclassifies a modulation
scheme, the subsequent classifiers in the chain will automatically classify incorrectly as
well. This is a well-known phenomenon in hierarchical text classification [18]. In the
case of Model A, the root classifier has a classification accuracy of 84.6%. Thus,
regardless of how the two children classifiers are designed to achieve the highest
possible accuracy, the overall classification accuracy of the model is now upper-bounded
by 84.6%. The effect of error propagation is a future topic to explore, and future models
can be developed to mitigate this issue and capitalize on the benefits of induced class
hierarchies.
One such model to explore could directly incorporate the probability values of
the decision of each classifier. For example, the root classifier may assert with 51%
probability that the current instance belongs to node 1, and with 49% probability that it
belongs to node 2. In a sense, these probability values are measures of how confident
the classifier is on its final decision. In this work, models A and B are currently
designed such that the root classifier will automatically decide on class 1 as the label of
the current instance, with 51% confidence. An improved model based on [19] may then
observe the responses of the children classifiers of both nodes and based on the overall
confidence values of each path, select the final path. A downside to this method is it
will likely also increase training times as the different possible paths must now be
considered to update the weight matrices.
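A minimal sketch of this confidence-weighted alternative is given below; the classifier and label names are placeholders, and the multiplicative combination of branch and leaf probabilities is one simple realization of the idea rather than the scheme of [19]:

```python
import numpy as np

def predict_soft_paths(root, left, right, x, left_labels, right_labels):
    """Pick the leaf class with the highest overall path confidence."""
    p_root = root.predict(x)                     # (N, 2) branch probabilities
    p_left, p_right = left.predict(x), right.predict(x)
    # Joint confidence of a leaf = branch probability * within-branch probability.
    joint = np.hstack([p_root[:, :1] * p_left, p_root[:, 1:] * p_right])
    labels = np.array(list(left_labels) + list(right_labels))
    return labels[np.argmax(joint, axis=1)]
```

Unlike the hard routing used by Models A and B, a frame that the root places at, say, 51% versus 49% can still end up in the other branch if that branch's leaf classifier is much more confident.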

6 Conclusions and Future Work

The hierarchical approach for CNN shows promise in terms of reducing overall training
time, and after fine-tuning, in improving accuracy as well. In previous works, a con-
fusion matrix highlighting the areas where the base CNN classifier has issues motivated
the need for deeper architectures to extract the subtle differences between schemes. In a
sense, this work is creating deeper CNN architectures, but with separately trained CNN
classifiers, whose hierarchy is mathematically determined by the confusion matrix. By
using an induced hierarchy, the overall neural network is guided to learn generic
features at higher levels and subsequently learn more intricate features in lower levels
where the classes have higher similarity in the feature space.
In addition, from a computer architecture perspective, the chained h-CNN archi-
tecture allows for a modular hardware implementation of each sub-CNN. Rather than
accessing an entire block of memory containing weights, only the relevant memory
locations containing the weights for a sub-CNN need to be accessed. Future work can
focus on quantifying the hardware efficiency increase of an h-CNN. Moreover, each
sub-CNN model can be trained in parallel allowing for greater use of parallelization
techniques than with traditional CNN architectures.
This work is completed using the RadioML2016.10a dataset provided in [3, 8].
Several of the recent works based on the RadioML database use RadioML2016.10b,
which is a slightly larger database; however, most works remove the AM-SSB mod-
ulation scheme. The removal of the AM-SSB scheme yields higher accuracy values - of
above 75% at high SNRs - than what is achieved in this work [4, 9, 10]. Future
experiments can run both the RadioML2016.10a dataset and RadioML2016.10b

dataset on the CNN architectures presented in this work, removing the AM-SSB
scheme, to directly compare performance of these architectures against those in pre-
vious works. For true training time comparison, code-sharing of the structure of each
architecture from various works is needed, such that independent researchers can run
and verify the architectures of other works on their own machines, and directly com-
pare against their own architectures.
Finally, in a real-world system, cognitive radios are exposed to a large variety of
modulation schemes and unpredictable channel conditions. While this work focuses on
11 modulation schemes and a severely impaired channel, future work should expand
the number of modulation schemes and include multi-receiver situations that require
schemes such as OFDM. Moreover, an interesting study would be to determine the
performance of DL architectures as channel constraints are successively relaxed; that is,
beyond the impairments of sample rate offset, carrier frequency offset, fading and
multipath effects, how do other channel conditions affect a classifier's accuracy?

References
1. Dobre, O.A., Abdi, A., Bar-Ness, Y., Su, W.: Survey of automatic modulation classification
techniques: classical approaches and new trends. IET Commun. 1(2), 137–156 (2007)
2. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal
approximators. Neural Netw. 2(5), 359–366 (1989)
3. O’Shea, T.J., West, N.: Radio machine learning dataset generation with GNU Radio. In:
Proceedings of the GNU Radio Conference, vol. 1, no. 1 (2016)
4. Karra, K., Kuzdeba, S., Petersen, J.: Modulation recognition using hierarchical deep neural
networks. In: 2017 IEEE International Symposium on Dynamic Spectrum Access Networks
(DySPAN) (2017). https://doi.org/10.1109/dyspan.2017.7920746
5. Silva-Palacios, D., Ferri, C., Ramírez-Quintana, M.J.: Improving performance of multiclass
classification by inducing class hierarchies. Procedia Comput. Sci. 108, 1692–1701 (2017).
https://doi.org/10.1016/j.procs.2017.05.218
6. O’Shea, T.J., Corgan, J., Clancy, T.C.: Convolutional radio modulation recognition networks.
In: Engineering Applications of Neural Networks. Communications in Computer and
Information Science, pp. 213–226 (2016). https://doi.org/10.1007/978-3-319-44188-7_16
7. O’Shea, T., Hoydis, J.: An introduction to deep learning for the physical layer. IEEE Trans.
Cogn. Commun. Netw. 3(4), 563–575 (2017). https://doi.org/10.1109/tccn.2017.2758370
8. Datasets: DeepSig Inc. https://www.deepsig.io/datasets
9. Liu, X., Yang, D., Gamal, A.E.: Deep neural network architectures for modulation
classification. In: 2017 51st Asilomar Conference on Signals, Systems, and Computers
(2017). https://doi.org/10.1109/acssc.2017.8335483
10. Ramjee, S., Ju, S., Yang, D., Liu, X., Gamal, A., Eldar, Y.C.: Fast deep learning for
automatic modulation classification. J. Sel. Areas Commun. (2019)
11. RadioML. https://github.com/radioML
12. Colaboratory: Frequently asked questions. https://research.google.com/colaboratory/faq.html
13. Carneiro, T., Da Nobrega, R.V.M., Nepomuceno, T., Bian, G.-B., De Albuquerque, V.H.C.,
Reboucas Filho, P.P.: Performance analysis of Google colaboratory as a tool for accelerating
deep learning applications. IEEE Access 6, 61677–61685 (2018). https://doi.org/10.1109/
access.2018.2874767

14. radioML: RadioML/examples. https://github.com/radioML/examples/blob/master/modulati


on_recognition/RML2016.10a_VTCNN2_example.ipynb
15. Malik, U.: Hierarchical clustering with python and scikit-learn (2019). https://stackabuse.
com/hierarchical-clusteringwith-python-and-scikit-learn/
16. Saraçlı, S., Doğan, N., Doğan, İ.: Comparison of hierarchical cluster analysis methods by
cophenetic correlation. J. Inequal. Appl. 2013, 203 (2013). https://doi.org/10.1186/1029-
242x-2013-203
17. Wang, T., Wen, C.-K., Wang, H., Gao, F., Jiang, T., Jin, S.: Deep learning for wireless
physical layer: opportunities and challenges. China Commun. 14(11), 92–111 (2017). https://
doi.org/10.1109/cc.2017.8233654
18. Silla, C.N., Freitas, A.A.: A survey of hierarchical classification across different application
domains. Data Min. Knowl. Discov. 22(1–2), 31–72 (2010). https://doi.org/10.1007/s10618-
010-0175-9
19. Zhu, S., Wei, X.-Y., Ngo, C.-W.: Error recovered hierarchical classification. In: Proceedings
of the 21st ACM International Conference on Multimedia, MM 2013 (2013). https://doi.org/
10.1145/2502081.2502182
Using Digital Image Processing to Characterize
Flocculation of Papermaking Wastewater

Ming Li1(&), Kaitang Hu2, and Jin Wang3


1
University of Michigan-Flint, Flint, MI 48502, USA
minglilm@umich.edu
2
Yuncheng Professional Technology College, Yuncheng, China
3
Beijing Institute of Chemical Defense, Beijing, China

Abstract. Wastewater generated from pulp and paper mills is a major pollution
source. It is important to identify the flocculation characteristics of papermaking
wastewater so that the wastewater treatment can be optimized. In this paper the
flocculation characteristics of deinking wastewater were studied by computer
image processing. Experiments were carried out to acquire images of flocculation.
A series of image parameters related to the floc sedimentation characteristics, together with
their variation over time, were found. Using computer visualization technology to study
the static and dynamic behavior of wastewater flocs has many advantages, and com-
puter visualization technology can be used to improve wastewater treatment.

Keywords: Flocculation · Sedimentation · Image processing · Wastewater treatment

1 Introduction

Wastewater generated from pulp and paper mills is a major source of industrial pollution.
As the demand for paper products continues to rise, it is imperative to optimize the
treatment of papermaking wastewater [1]. Papermaking wastewater contains soluble
organic and inorganic substances as well as insoluble organic and inorganic substances, making it a
very complex pollution source [2]. At present, the common methods for papermaking
wastewater treatment include physical, physical-chemical, and biological
methods. The new technologies currently being explored and developed include the
electrochemical method, the membrane separation method and the photocatalytic decomposition
method [3]. Regardless of the method, treatment is essentially the separation of the
components in the wastewater. Effective separation depends largely on the charac-
teristics of the flocculation of the various components in the wastewater. The flocculation
method is the most widely used in wastewater treatment. It is not only used in pre-
treatment, primary treatment and final treatment of wastewater, but also in sludge
treatment [4]. It can remove high-molecular-weight substances, plant fibers, various organic
substances, biological sludge, heavy metals such as lead, cadmium and
mercury, and other pollutants. The size of flocs formed during flocculation and the
characteristics of their sedimentation are important topics for environmental researchers
[5, 6]. Digital image processing has been used to identify flocculation and improve
wastewater treatment [7, 8].

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 770–774, 2020.
https://doi.org/10.1007/978-3-030-39442-4_56

Characterizing the flocculation is of great significance for studying the flocculants,
improving the separation effect and developing new separation technology. In this
paper, the use of digital image processing to analyze the flocs in wastewater generated
from pulp and paper mills was investigated. Experiments were carried out to acquire
images of flocculation during various treatment processes, and the floc sedi-
mentation characteristics were obtained, which provides a basis for the automatic
control of the flocculation process.

2 Experimental Setup and Methods

2.1 Experimental Setup


As shown in Fig. 1, a digital camera was used to acquire images during the
flocculation sedimentation; it converted the optical signal of the measured flocs into an
analog signal. An analog/digital acquisition card converted the analog signal into a
digital signal and input it into a computer to form a digital image. Image pro-
cessing methods were then used to analyze the image and identify the floc size, density
and sedimentation velocity in the entire measurement area. Through this method, the test
results can be displayed in real time on a computer screen.
Figure 2 shows the schematic diagram of the coagulation sedimentation tank. The
tank is a glass container; its height is 98 cm, its width is 15 cm, its thickness is 16 cm,
and its total effective volume is 2187 ml. During the experiment, the valve at the bottom
was closed, and wastewater flowed in through the inlet at the top.

Fig. 1. Experimental setup
Fig. 2. Sedimentation tank

2.2 Experimental Methods


The flocculation experiment was first carried out in a large beaker, using papermaking
wastewater after deinking. The flocculant, polyaluminium chloride, was then
added to the wastewater. After a certain period of stirring, the wastewater was slowly
poured into the coagulation sedimentation tank, and the images of the flocs were acquired
by the CCD device.

3 Experimental Results and Discussion


3.1 Determine the Sedimentation Velocity
The images in Fig. 3 illustrate the flocculation of the wastewater poured into the
sedimentation tank after 0, 1, 5, 10, 20, and 30 min.

Fig. 3. Images during flocculation sedimentation

A program was coded in order to monitor the flocculation characteristics of the floc-
culants; its main functions are:
• Display the captured floc activity image on the computer screen in real time.
• Perform edge enhancement, digital filtering, binarization, and connectivity dis-
crimination on the floc image (a minimal sketch of these steps follows this list).
• According to the experimental analysis, find a set of parameters closely related to
the floc sedimentation characteristics.
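A minimal scikit-image sketch of the filtering, binarization and connectivity steps listed above is given below; the paper does not state its implementation, so the particular filters chosen here are assumptions:

```python
from skimage import filters, measure, morphology

def segment_flocs(frame):
    """Segment flocs in a grayscale image (2-D uint8 array)."""
    smoothed = filters.median(frame)                     # digital filtering
    edges = filters.unsharp_mask(smoothed)               # edge enhancement
    thresh = filters.threshold_otsu(edges)               # binarization threshold
    binary = edges > thresh
    binary = morphology.remove_small_objects(binary, min_size=5)
    labels = measure.label(binary)                       # connectivity discrimination
    return labels, measure.regionprops(labels)           # per-floc regions
```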
The average gray value is expressed by the following formula:

\bar{f} = \frac{1}{N} \sum_{x=0}^{L-1} \sum_{y=0}^{L-1} f(x, y)    (1)

where L is the gray level of the flocculation image, which is 256;
N is the number of pixel points in the floc image area; and
f(x, y) is the gray value at the pixel point (x, y).

The average gray value is then used to calculate the floc strength:

A = 1 - \frac{\bar{f}}{L}    (2)

where A is the floc strength.
Then the equivalent diameter of the floc was calculated as:

\Phi = 2\sqrt{\frac{s}{\pi}}\left(1 - \frac{2\sqrt{s\pi}}{l}\right)\left(1 - \frac{1}{m}\right) A    (3)

• According to the floc characteristics, the settling velocity of the flocs is automatically
calculated by the computer using the Stokes equation:

v = \frac{(\rho_s - \rho)\, g}{18 \mu} d_s^2    (4)

where \rho_s and \rho are the floc and water densities, g is the gravitational acceleration, \mu is the dynamic viscosity, and d_s is the floc equivalent diameter.
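Building on the segmentation sketch above, the per-floc parameters of Eqs. (1), (2) and (4) can be computed along the following lines; the densities, viscosity and pixel calibration are illustrative assumptions, and the simple area-equivalent diameter is used in place of the shape-corrected Eq. (3):

```python
import numpy as np

L = 256                                    # gray levels
RHO_S, RHO = 1050.0, 1000.0                # floc / water density, kg/m^3 (assumed)
MU, G = 1.0e-3, 9.81                       # dynamic viscosity (Pa*s), gravity (m/s^2)
PIXEL_SIZE = 0.1e-3                        # metres per pixel (assumed calibration)

def floc_parameters(region, frame):
    """Gray value, strength, diameter and settling velocity for one floc region."""
    pixels = frame[tuple(region.coords.T)]          # gray values inside the floc
    f_bar = pixels.mean()                           # Eq. (1): average gray value
    A = 1.0 - f_bar / L                             # Eq. (2): floc strength
    area = region.area * PIXEL_SIZE ** 2            # projected area in m^2
    d_s = 2.0 * np.sqrt(area / np.pi)               # area-equivalent diameter;
                                                    # the paper's Eq. (3) adds shape terms
    v = (RHO_S - RHO) * G * d_s ** 2 / (18.0 * MU)  # Eq. (4): Stokes settling velocity
    return f_bar, A, d_s, v
```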

Table 1 shows the characteristics of the flocs in the corresponding images:

Table 1. Characteristics of flocculation of wastewater.

Floc figure number | Average equivalent diameter of flocs, mm | Floc sedimentation velocity, mm/s | Floc average grayscale
12 | 0.2  | 3.608 × 10−3 | 252
13 | 0.5  | 2.255 × 10−2 | 251
14 | 1.2  | 0.130        | 253
15 | 0.2  | 3.608 × 10−3 | 253
16 | 0.1  | 9.02 × 10−6  | 250
17 | 0.01 | 9.02 × 10−6  | 253

4 Conclusion

Wastewater generated from pulp and paper mills is a major source of industrial pol-
lution. It is imperative to optimize the treatment of wastewater generated from pulp and
paper mills. Characterization of the flocculation is of great significance to study the
flocculants and improve wastewater treatment. In this paper, wastewater generated from
pulp and paper mills after deinking was investigated utilizing digital image processing.
Digital images of flocculation during sedimentation were acquired and analyzed to
identify the characteristics of flocculation. The equivalent diameter of the flocs was
calculated and based on the results, the sedimentation velocity was determined. Image
processing can be an effective tool to be used for flocculation morphology analysis and
dynamic characterization in wastewater treatment.

References
1. Pokhrel, D., Viraraghavan, T.: Treatment of pulp and paper mill wastewater-a review. Sci.
Total Environ. 333(1–3), 37–58 (2004)
2. Kamali, M., Khodaparast, Z.: Review on recent developments on pulp and paper mill
wastewater treatment. Ecotoxicol. Environ. Saf. 114, 326–342 (2015)
3. El-Ashtoukhy, E.S., Amin, N.K., Abdelwahab, O.: Treatment of paper mill effluents in a
batch-stirred electrochemical tank reactor. Chem. Eng. J. 146(2), 205–210 (2009)
4. Nasser, M.S.: Characterization of floc size and effective floc density of industrial papermaking
suspensions. Sep. Purif. Technol. 122(10), 495–505 (2014)
5. Jenne, R., Cenens, C., Geeraerd, A.H., Impe, J.F.: Towards on-line quantification of flocs and
filaments by image analysis. Biotechnol. Lett. 24(11), 931–935 (2002)
6. Biggs, C.A.: Activated sludge flocculation: on-line determination of floc size and the effect of
shear. Water Res. 34(9), 2542–2550 (2000)
7. Juntunen, P., Liukkonen, M., Lehtola, M., Hiltunen, Y.: Characterization of alum floc in water
treatment by image analysis and modeling. Cogent Eng. 1(1), 944767 (2014)
8. García, H.L., González, I.M.: Self-organizing map and clustering for wastewater treatment
monitoring. Eng. Appl. Artif. Intell. 17(3), 215–225 (2004)
Detection of Anomalous Gait as Forensic Gait
in Residential Units Using Pre-trained
Convolution Neural Networks

Hana’ Abd Razak1, Ali Abd Almisreb2,


and Nooritawati Md. Tahir1(&)
1
Faculty of Electrical Engineering, Universiti Teknologi MARA,
40450 Shah Alam, Malaysia
nooritawati@ieee.org
2
International University of Sarajevo, 71210 Sarajevo, Bosnia and Herzegovina

Abstract. One of the advantages of the transfer learning technique is its capability
to learn a new dataset using the finest pre-trained architectures. Other advantages of
this technique are its small dataset requirements along with a faster learning process
that can yield highly accurate results. Hence, in this paper, anomalous gait
detection, also known as forensic gait, during housebreaking crime at the gate
of residential units is discussed using the transfer learning technique based on five
popular pre-trained convolution neural networks (CNNs) as classifiers. High
accuracy and sensitivity are achieved from the remodeled pre-trained CNNs
for the learning process, the offline test, and the real-time test. The accuracy attained
from the remodeled pre-trained CNNs shows high potential towards
developing a forensic intelligent surveillance technique.

Keywords: Anomalous behavior · Forensic gait · Pre-trained CNN · Remodeled pre-trained CNN · Transfer learning

1 Introduction

Malaysia is listed as one of the most rapidly evolving countries in the region of
Southeast Asia. This development growth is also in line with the rise in the crime index [1–
3]. Motorcycle theft, car theft, and housebreaking are the most frequent crimes and
contribute up to 56% of reported cases in Malaysia [3, 4]. Previous studies have
shown that terraced, semi-detached and detached houses have a higher risk of
housebreaking [5, 6], and residences without surveillance systems were at six times higher
risk of becoming housebreaking crime victims [7].
Nowadays, the utilization of closed-circuit television (CCTV) as surveillance
cameras is prominently increasing in public places viz. streets, banks, shop lots, etc.,
including residential units, as a precautionary measure to protect against crime inci-
dents. Explicitly, the main task of the surveillance cameras is to monitor and detect the
occurrence of anomalous behavior in their vicinity. The anomalous state of
objects and people can be defined as a change in pattern from the original state or
movement of normal behavior [8, 9]. Furthermore, anomalous behaviors are

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 775–793, 2020.
https://doi.org/10.1007/978-3-030-39442-4_57

frequently ambiguous between normal and anomalous behavior, albeit with variation
according to person, place, and environment. The monitoring of anomalous
events can be regarded as a waste of funds, time and labor, as it involves several
employees being assigned to observe the screen for long hours while waiting for
anomalous behavior to occur [9–11]. Therefore, numerous studies have been con-
ducted to detect and track the anomalous state of objects and people by employing
image recognition using convolution neural networks (CNNs) [9, 10, 12–14]. Recently, a
new type of learning with the ability to yield better results and faster training and allow
smaller input data has been introduced, known as transfer learning, which exploits
pre-trained CNNs [15, 16]. Transfer learning can be applied to two types of pre-trained
CNNs: (i) series networks, namely AlexNet, VGG-16 and VGG-19, and (ii) directed acyclic
graph (DAG) networks, for instance GoogLeNet, Inception-v3, ResNet-50, and ResNet-
101. Technically, transfer learning takes the knowledge of a pre-trained CNN that was
previously learned on an enormous dataset and utilizes it to learn a new or related assignment, or even to
carry out assignments in different domains [16–20]. Two strategies for implementing
transfer learning techniques over pre-trained CNNs are feature extraction and fine-tuning
[15, 21, 22]. The layers of convolution networks for image classification commonly
comprise two sections, (i) the convolution base and (ii) the dense base, where both strategies
use the convolution base of the pre-trained CNNs to acquire the reusable knowledge of
learned weights. Meanwhile, the dense base generally has its layers replaced or its
hyper-parameters fine-tuned, as it holds knowledge specific to the previous assignment that is
unbeneficial to the current assignment [21, 22].
Thus, in this study, transfer learning for detecting anomalous behaviors at the gate
while committing housebreaking crimes is evaluated and validated:
1. Forensic gait features of housebreaking crime are used with the consent of the
Royal Malaysia Police (RMP).
2. Information learned from five pre-trained CNNs, viz. AlexNet, GoogLeNet,
Inception-v3, ResNet-50, and ResNet-101, is leveraged to learn the normal and anomalous
behavior at the gate of residential units.
3. Transfer learning with a supervised classification approach is employed to extract and train the
features of the normal and anomalous behavior of the housebreaking crime and is tested
on offline videos and real-time video feeds from a webcam.

2 Related Work

Research in medical, surveillance and forensic biometrics relies considerably
on anomalous data to assist in improving the results attained [23]. Generally, anomaly
detection is a method or process to identify behavior that differs from normal behavior,
which is difficult due to the complexity and diversity of anomalous behaviors. This method has been
successfully applied in many studies on anomalies of human behavior, either individ-
ually or in groups, in high or low crowd density scenarios such as:

• predicting anomalous pedestrian or vehicle path in walkway [19, 24];


• detecting ignorant driver making illegal U-turn, driving in the wrong direction,
running a red light, speeding on the road [25] or thoughtlessly using a mobile phone,
applying cosmetics, consuming food and beverage, etc., upon driving [26, 27];
• detecting condition of senior citizen if they are falling, loitering or wandering [28–30];
• detecting harassing, pushing, hitting, punching, kicking, fighting, or even robbery
for violence behavior [11, 31–34];
• detecting suspicious behavior in public areas e.g., ATM bank and elevator [35, 36].
Previous studies done on detecting anomalous behavior involving image recogni-
tion are increasing along with the emergence of several public datasets, for instance
CAVIAR, CASIA, Weizmann, KTH, and many more. These databases provide various
types of normal behavior data like walking, bending, jogging, jumping, running, hand
waving, hand clapping, handshaking, and several others [35, 37–39]. The commonly
encountered data for abnormal behaviors offered by public databases such as WED, UTI,
PEL, HOF, and ViF are panic actions, violent gestures such as pushing, kicking,
punching, fighting and several more [12, 33]. However, some researchers are volun-
tarily collecting their own data to facilitate the needs of their studies [11, 29, 40]. The
major role of these databases is to provide massive data in enabling the training process
using CNNs techniques for feature extraction in tracking and detecting anomalous
behavior or events. It is crucial as CNNs are known for their ability to learn valuable
features from large training data. Various anomalous human behaviors as falling
gesture, careless driving, abnormal behavior at the public area namely fighting, rob-
bery, shoplifting, etc. have been successfully detected using CNNs [11, 12, 32, 34].
Some studies implemented hardware such as Raspberry Pi to measure the anomalous
behavior in real-time [27, 30]. However, the CNN model requires huge input data, long
duration of the training process and demands high consumption time in formulating the
finest architecture for higher detection accuracy [16, 18] and these issues can be solved
by applying the transfer learning technique. However, very few studies have explored anomalous
human behavior using the transfer learning technique. Most studies used pre-
trained CNNs (AlexNet, VGG-16, ResNet-50, Xception, and DenseNet-121) with
shallow classifier as feature extractor in recognizing human anomalies on the face,
gender, handwriting, pedestrian walking in the wrong direction and vehicle in pedes-
trian walkway [17, 18, 41]. In addition, there is a study that fine-tunes the pre-trained
AlexNet by adding a new convolution layer with two Gaussian classifiers to detect
anomalous human behaviors during entering or exiting the subway entrance illegally
and avoiding payment [25].
Hence, we extend the efforts made by previous studies to detect anomalous behavior
[11] using the transfer learning technique [19] to housebreaking scenes. In this study,
forensic gait features defined in collaboration with the Royal Malaysia Police (RMP) and fo-
cused on anomalous behavior at a specific place are investigated, due to the importance
of understanding the vague features between normal and anomalous behavior as
the key to reducing false alarms in surveillance systems. In addition, the transfer learning
technique using five pre-trained CNNs is utilized, and the remodeled pre-trained CNNs
are tested in both offline and real-time modes.

3 Transfer Learning Techniques

Transfer learning is a machine learning technique that exploits the knowledge learned
in a particular setting to improve performance while modeling a related or
different new assignment. Transfer learning has given a huge advantage to users of
deep learning with image data by providing the finest architectures of pre-trained
CNNs and requiring smaller data to obtain better results [15, 42]. Practically, there are
two focal steps in implementing the transfer learning technique: (i) selecting the pre-
trained CNN, and (ii) remodeling the pre-trained CNN.

3.1 Selecting Pre-trained CNN


The first step is selecting the pre-trained CNN. Pre-trained CNNs and deep learning share
similar attributes regarding training datasets, which are generally massive and challenging.
There are two types of pre-trained CNNs offered by MATLAB, as tabulated in Table 1.
Pre-trained CNNs with a series network have an architecture of layers connected sequen-
tially one after another [43] and consist of fewer layers but an immense number of
trainable parameters [44, 45]. On the other hand, pre-trained CNNs with a DAG network
are composed of more graph layers yet a smaller number of trainable parameters [45, 46].
The architecture of graph layers is more sophisticated, as the network variables are not
restricted to linking layers successively one after the other and the layers may even
have multiple connections of variables as the input and output [43, 45, 47, 48].

Table 1. Popular pre-trained CNNs provided by MATLAB.

Network | Name                | Deep layers | Total layers | Layers connection | Trainable parameters
Series  | AlexNet [49]        | 8           | 25           | 25                | 62 million
Series  | VGG-16 [50]         | 16          | 41           | 41                | 138 million
Series  | VGG-19 [50]         | 19          | 47           | 47                | 144 million
DAG     | GoogLeNet [51]      | 22          | 144          | 170               | 6.79 million
DAG     | Inception-v3 [52]   | 42          | 316          | 350               | 23 million
DAG     | ResNet-50 [53, 54]  | 50          | 177          | 192               | 25.6 million
DAG     | ResNet-101 [53, 54] | 101         | 347          | 379               | 44.5 million

3.2 Remodeling Pre-trained CNN


There are two popular strategies to remodel the pre-trained CNNs: feature extraction and
fine-tuning. Theoretically, deep learning has a hierarchical architecture of layered
features that learns different features at different layers, with a fully connected layer as the
final layer. In supervised classification learning, the initial layers of the network, or
convolution base, learn general features, and the final layers, or dense base, learn
features specific to a particular class of assignment [21, 22].

The feature extractor strategy operates the pre-trained CNN without its
final layers, as the new assignment is different from the original setting. The final layers
are then replaced by shallow classifier models (SVM, OCSVM or IPCA) or
dense classification layers (fully connected layers) in order to specify the output classes of the
new assignment [17, 18, 41]. This strategy enables the feature extraction process of the
new assignment to leverage knowledge from the convolution base of the pre-trained
CNNs.
The fine-tuning strategy requires more skill: the final layers are replaced with a
shallow classifier or dense classification layers and the hyperparameters are fine-tuned. In
addition, this strategy requires the new layers to learn faster by modifying their hyper-
parameters, while the convolution base is effectively frozen by fixing its weights. The
training process can then achieve better performance with less training time.

4 Experiment Protocol

All experiments and analyses in this study are conducted using an HP Pavilion 15
Notebook with 8 GB memory and a GeForce 840M graphics card, including the data collection
phase, the learning phase for all five pre-trained CNNs, which took over thirty-five hours of
training time, and finally the testing phase. Early stopping based on the validation loss
during the training process is applied. The trained networks are tested using two
methods: (i) offline mode and (ii) real-time mode.

4.1 Forensic Gait Features


Firstly, it is challenging to set standard features of anomalous behavior considering
its complex definition: it is impossible to enumerate all the anomalous behaviors in the
real world. Anomalous behavior in the epistemology of crime is described as
behavior that deviates from its common state with the intention to threaten human
property, life, and freedom. This definition is permissible but shall not be recklessly accepted
without a proper understanding of the vague features between normal and
anomalous behavior, since the ambiguities are slight. Forensically, crime
behaviors are related to the place, environment and perpetrator, so this argument can be
used to scope out the anomalous behavior more precisely.
In this study, the definition of forensic gait features is interpreted according to the
Criminal Procedure Code practiced by the Royal Malaysia Police (RMP). The author-
ities have agreed that the observations from their experience are relatively in line with
the Malaysian Penal Code. There are four postures most frequently adopted while committing the
crime that meet the definition and observations suggested by the RMP: (i) squatting with
heels down, (ii) bending, (iii) squatting with heels up, and (iv) kneeling with heels
up. The detailed configurations of the body are presented in Fig. 1.

Fig. 1. The features of forensic gait: (a) bending, (b) squatting with heels down, (c) squatting
with heels up, and (d) kneeling with heels up.

4.2 Data Acquisition


For this study, twelve participants (8 male, 4 female) aged 23 to 35 years volunteered as subjects. The average age of the participants was 27.1 ± 4.05 years, the average height was 163.6 ± 6.73 cm, and the average weight was 68.2 ± 16.68 kg. Although many scenarios generally show a significant correlation between human behaviors and the gates, each participant was only required to perform both normal and anomalous activities while opening the gate in various scenarios according to their own habits, interpretation, perspective, and evaluation. The data were recorded using the Microsoft Kinect sensor.

4.3 Transfer Learning of Pre-trained Series Network


As mentioned earlier, the transfer learning of the pre-trained series network uses both the feature extraction and the fine-tuning techniques. The fine-tuning technique complements feature extraction by updating the weights for the new dataset while keeping the convolution base of the pre-trained CNN during the training process [15, 16, 18, 22]. The hierarchical architecture of the series network makes it easier to train and generalize.
Pre-trained AlexNet is the only series network selected, owing to its modest architecture and its compatibility with the hardware platform compared with pre-trained VGG-16 and VGG-19. The convolution base of the pre-trained AlexNet acts as a fixed feature extractor that learns useful features of the normal and anomalous dataset. The final three layers of the dense base (Layer 23 to Layer 25) are replaced with five new layers (Layer 23 to Layer 27), as in Table 2: (i) a fully connected layer with 64 outputs instead of the 1000 outputs of the original setting, (ii) a rectified linear unit (ReLU) layer to optimize the weights for the new dataset, (iii) a fully connected layer with 2 outputs for the new setting, i.e. normal and anomaly, (iv) a softmax layer that maps the outputs into the range of 0 to 1 for the classification process, and (v) a classification output layer to classify the feature as either normal or anomalous behavior. The hyperparameters of the first newly added fully connected layer are kept at the original settings to synchronize the connection between both configurations. However, the learning rate hyperparameters of the second newly added fully connected layer are set to higher values so that the weights and biases in the new layer learn faster and converge at the same pace as the pre-trained layers. The regularization factor hyperparameters of the series network are fixed throughout the learning process.
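A hedged sketch of this AlexNet remodeling in PyTorch is given below (the study used MATLAB's pre-trained networks, so this is only a re-expression of the same idea); the frozen convolution base, the FC(64)-ReLU-FC(2) head, and the larger learning rate for the new head follow the description above, while the single 10x learning-rate group is a simplification of the separate weight and bias factors listed in Table 2.

```python
# Remodeling pre-trained AlexNet: replace the 1000-way head with FC(64) -> ReLU -> FC(2).
import torch
import torch.nn as nn
import torchvision.models as models

model = models.alexnet(pretrained=True)

# Keep the convolution base relatively frozen by fixing its weights.
for p in model.features.parameters():
    p.requires_grad = False

# classifier[6] is the original final fully connected layer (4096 -> 1000).
model.classifier[6] = nn.Sequential(
    nn.Linear(4096, 64),   # new fully connected layer, 64 outputs
    nn.ReLU(),             # new ReLU layer
    nn.Linear(64, 2),      # new fully connected layer: normal vs. anomaly
)

base_lr = 1e-3  # small global learning rate (see Sect. 5)
optimizer = torch.optim.SGD(
    [
        {"params": model.classifier[:6].parameters(), "lr": base_lr},
        {"params": model.classifier[6].parameters(), "lr": 10 * base_lr},  # new layers learn faster
    ],
    lr=base_lr,
    momentum=0.9,
)
```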

4.4 Transfer Learning of Pre-trained DAG Network


The transfer learning of the pre-trained DAG networks likewise uses both the feature extraction and the fine-tuning techniques, although the DAG network possesses a unique hierarchical architecture [42, 46, 55]. Based on their trainable parameters, several pre-trained DAG networks are compatible with our modest hardware platform.
In this study, four pre-trained DAG networks, namely GoogLeNet, Inception-v3, ResNet-50, and ResNet-101, are selected and remodeled to train on both the normal and anomalous datasets. The transfer learning process is similar for all pre-trained DAG networks. The convolution base of each pre-trained DAG network is responsible for the feature extraction process. Generally, the final three layers of the dense base in the pre-trained CNNs are configured for 1000 classes. These layers are now substituted with three new layers that configure the two classes of human behavior, as in Table 3. No additional new layers are needed to synchronize with the pre-trained layers because of the unique connection architecture of the pre-trained DAG networks. The final three layers of the pre-trained DAG networks are as follows: (i) GoogLeNet (Layer 142 to Layer 144), (ii) Inception-v3 (Layer 314 to Layer 316), (iii) ResNet-50 (Layer 175 to Layer 177), and (iv) ResNet-101 (Layer 345 to Layer 347). The learning rate hyperparameters of the newly added fully connected layer are set to higher values so that its weights and biases learn faster. The regularization factor hyperparameters of the DAG networks are kept constant during the learning process.
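The same remodeling for the DAG networks can be sketched as follows, again in PyTorch rather than the MATLAB toolbox actually used; the torchvision model names and the handling of Inception-v3's auxiliary head are assumptions of this sketch.

```python
# Remodeling the DAG networks: swap the 1000-way final layer for a two-class one.
import torch.nn as nn
import torchvision.models as models

def remodel_two_class(model):
    """Replace the original 1000-way fully connected layer with a 2-way one."""
    model.fc = nn.Linear(model.fc.in_features, 2)   # normal vs. anomaly
    return model

dag_models = {
    "googlenet": remodel_two_class(models.googlenet(pretrained=True)),
    "resnet50": remodel_two_class(models.resnet50(pretrained=True)),
    "resnet101": remodel_two_class(models.resnet101(pretrained=True)),
}

# Inception-v3 also carries an auxiliary head whose final layer would need the
# same replacement if that head is used during training.
inception = remodel_two_class(models.inception_v3(pretrained=True))
inception.AuxLogits.fc = nn.Linear(inception.AuxLogits.fc.in_features, 2)
dag_models["inception_v3"] = inception
```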

Table 2. Transfer learning process of series networks.

Pre-trained AlexNet:
  Layer (end−3): Fully Connected Layer (Learning Rate of Weight 1, Learning Rate of Bias 2,
    Regularization Factor of Weight 1, Regularization Factor of Bias 0)
  Layer (end−2): Softmax Layer
  Layer (end−1): Classification Output Layer

Remodeled pre-trained AlexNet:
  Layer (end+1): Fully Connected Layer (Learning Rate of Weight 1, Learning Rate of Bias 2,
    Regularization Factor of Weight 1, Regularization Factor of Bias 0)
  Layer (end+2): ReLU Layer
  Layer (end+3): Fully Connected Layer (Learning Rate of Weight 10, Learning Rate of Bias 20,
    Regularization Factor of Weight 1, Regularization Factor of Bias 0)
  Layer (end+4): Softmax Layer
  Layer (end+5): Classification Output Layer

Table 3. Transfer learning process of DAG networks.

Pre-trained DAG networks:
  Layer (end−3): Fully Connected Layer (Learning Rate of Weight 1, Learning Rate of Bias 2,
    Regularization Factor of Weight 1, Regularization Factor of Bias 0)
  Layer (end−2): Softmax Layer
  Layer (end−1): Classification Output Layer

Remodeled pre-trained DAG networks:
  Layer (end+1): Fully Connected Layer (Learning Rate of Weight 20, Learning Rate of Bias 20,
    Regularization Factor of Weight 1, Regularization Factor of Bias 0)
  Layer (end+2): Softmax Layer
  Layer (end+3): Classification Output Layer

5 Experimental Results and Discussion

This section discusses the experimental analysis and the results attained. The input images for the remodeled pre-trained CNNs consist of 9558 color images for each class, namely normal and anomaly, with regard to the forensic gait feature requirements. The images are collected from the footage of participants during data collection together with augmented images. 7000 images per class are randomly selected as training images and the rest serve as testing images. All images are resized according to the input requirements of the pre-trained networks: (i) AlexNet 227 × 227, (ii) GoogLeNet 224 × 224, (iii) Inception-v3 299 × 299, (iv) ResNet-50 224 × 224, and (v) ResNet-101 224 × 224.
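A small helper illustrating this per-network resizing is shown below; the input sizes come from the list above, while the transform pipeline itself is only an assumed way of applying them, not the authors' preprocessing code.

```python
# Per-network input sizes and a simple resize-and-tensorize transform.
import torchvision.transforms as T

INPUT_SIZE = {"alexnet": 227, "googlenet": 224, "inception_v3": 299,
              "resnet50": 224, "resnet101": 224}

def make_transform(network_name):
    size = INPUT_SIZE[network_name]
    return T.Compose([T.Resize((size, size)), T.ToTensor()])
```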
The validation frequency is the same for each remodeled pre-trained CNN, but the minibatch sizes differ according to the memory of the hardware. The total number of iterations therefore varies between the remodeled pre-trained CNNs, even though the maximum number of epochs is set to 1000 for every network. A small learning rate of 0.001 is used to ensure consistent convergence between the pre-trained layers and the newly added layers. Stochastic Gradient Descent with Momentum is chosen with the momentum coefficient β set to 0.9 for better navigation towards the global minimum, and validation is performed every 50 iterations at regular intervals.
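The training options described above can be summarized roughly as follows (PyTorch equivalents of the authors' MATLAB settings; the early-stopping patience is an assumption, since the text only states that early stopping on the validation loss is used, while the minibatch sizes follow Table 6).

```python
# Training configuration sketch: SGDM, small learning rate, validation-based early stop.
import torch

MINIBATCH = {"alexnet": 20, "googlenet": 20, "inception_v3": 20,
             "resnet50": 10, "resnet101": 3}   # limited by GPU memory

MAX_EPOCHS = 1000        # upper bound; early stopping ends training far earlier
VALIDATE_EVERY = 50      # validation frequency, in iterations
PATIENCE = 5             # assumed patience for early stopping

def make_optimizer(model):
    # Stochastic Gradient Descent with Momentum, beta = 0.9, learning rate 0.001
    return torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def should_stop(val_losses, patience=PATIENCE):
    """Stop once the validation loss has not improved for `patience` checks."""
    best_idx = val_losses.index(min(val_losses))
    return len(val_losses) - best_idx - 1 >= patience
```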
The effectiveness of each remodeled pre-trained CNN is investigated using two methods: (i) the offline mode tests the remodeled pre-trained CNNs using videos as input data, downloaded from YouTube channels showing real housebreaking events captured by CCTV in Malaysia, and (ii) the real-time mode tests the remodeled pre-trained CNNs in the laboratory using a live webcam feed as input data.

5.1 Remodeled Pre-trained CNNs


Five pre-trained CNNs, one series network and four DAG networks, are remodeled using the transfer learning technique. Table 4 compares the pre-trained CNNs with the remodeled pre-trained CNNs after the transfer learning process using the feature extraction and fine-tuning techniques. Meanwhile, Table 5 describes the final layers of the remodeled pre-trained CNNs.

5.2 Performance of the Five Remodeled Pre-trained CNNs


The accuracy and sensitivity of each remodeled pre-trained CNN are indeed high, as tabulated in Table 6. The training of the series network representative, AlexNet, took only 75 min, almost two times faster than the DAG network with the fewest layers, GoogLeNet, which needed 153 min to complete. Referring to Tables 1 and 6, deeper networks require more memory: Inception-v3 with 316 layers and ResNet-101 with 347 layers required 841 min and 624 min, respectively, to complete the training process. On low-memory hardware, a higher number of trainable parameters forces a smaller minibatch per learning step; this can be observed for ResNet-101, which has 44.5 million trainable parameters and can only process 3 images per batch. However, a smaller minibatch may also contribute to a shorter training time, as can be seen for Inception-v3 and ResNet-101: both networks have more than 300 layers, yet ResNet-101, trained with a minibatch of 3 images, completed more than 200 min faster than Inception-v3, which was trained with a minibatch of 20 images.
A previous study has shown that ResNet-50 and ResNet-101 require high-end GPU hardware to achieve good performance and to avoid significantly long training durations [50, 54]. Nevertheless, ResNet-50 obtained the highest accuracy of 98.44% in classifying the forensic gait postures using low-memory hardware. Owing to the stabilization of the validation loss, all runs completed within very few epochs and iterations compared with the values set before the training process. The number of misclassified images was very small for all pre-trained CNNs compared with the number of images used for validation, namely 2558 images per class. ResNet-50 has the fewest misclassified images, 80, and ResNet-101 has the most, 148.
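As a quick consistency check, the accuracies reported in Table 6 follow directly from these miss counts and the 2558 validation images per class; the short snippet below reproduces them.

```python
# Sanity check: with 2558 validation images per class (5116 in total), the miss
# counts in Table 6 reproduce the reported accuracies.
misses = {"AlexNet": 140, "GoogLeNet": 147, "Inception-v3": 114,
          "ResNet-50": 80, "ResNet-101": 148}
total = 2 * 2558
for name, m in misses.items():
    print(f"{name}: {(1 - m / total) * 100:.2f}% accuracy")
# AlexNet: 97.26%, GoogLeNet: 97.13%, Inception-v3: 97.77%,
# ResNet-50: 98.44%, ResNet-101: 97.11%
```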
From the classification results, remodeled ResNet-50 attained the highest accuracy, whilst remodeled GoogLeNet obtained the lowest accuracy. Remodeled ResNet-101, however, attained the highest sensitivity while contributing the highest number of misclassified images. In general, a higher accuracy of a remodeled pre-trained CNN corresponds to a lower total number of misclassified images.
The distributions of misclassified images for the series network and the DAG networks appear contrary to each other. Remodeled AlexNet obtained the lowest number of false negatives, which resulted in the smallest number of misclassified anomaly images (18 images), but also the highest number of false positives, which contributed to an almost seven times higher count of misclassified normal images (122 images).

Table 4. The fine-tuning of pre-trained CNNs.

AlexNet (series network)
  Pre-trained:
    Layer (end−3): Fully Connected Layer (Input Sizes 4096, Output Sizes 1000, Learning Rate of
      Weight 1, Learning Rate of Bias 2, Regularization Factor of Weight 1, Regularization
      Factor of Bias 0)
    Layer (end−2): Softmax Layer
    Layer (end−1): Classification Output Layer (Output Sizes 1000)
  Remodeled:
    Layer (end+1): Fully Connected Layer (Input Sizes 'auto', Output Sizes 64, Learning Rate of
      Weight 1, Learning Rate of Bias 2, Regularization Factor of Weight 1, Regularization
      Factor of Bias 0)
    Layer (end+2): ReLU Layer
    Layer (end+3): Fully Connected Layer (Input Sizes 'auto', Output Sizes 2, Learning Rate of
      Weight 10, Learning Rate of Bias 20, Regularization Factor of Weight 1, Regularization
      Factor of Bias 0)
    Layer (end+4): Softmax Layer
    Layer (end+5): Classification Output Layer (Output Sizes 'auto')

GoogLeNet (DAG network)
  Pre-trained:
    Layer (end−3): Fully Connected Layer (Input Sizes 1024, Output Sizes 1000, Learning Rate of
      Weight 1, Learning Rate of Bias 2, Regularization Factor of Weight 1, Regularization
      Factor of Bias 0)
    Layer (end−2): Softmax Layer
    Layer (end−1): Classification Output Layer (Output Sizes 1000)
  Remodeled:
    Layer (end+1): Fully Connected Layer (Input Sizes 'auto', Output Sizes 2, Learning Rate of
      Weight 20, Learning Rate of Bias 20, Regularization Factor of Weight 1, Regularization
      Factor of Bias 0)
    Layer (end+2): Softmax Layer
    Layer (end+3): Classification Output Layer (Output Sizes 'auto')

Inception-v3, ResNet-50, ResNet-101 (DAG networks)
  Pre-trained:
    Layer (end−3): Fully Connected Layer (Input Sizes 2048, Output Sizes 1000, Learning Rate of
      Weight 1, Learning Rate of Bias 1, Regularization Factor of Weight 1, Regularization
      Factor of Bias 0)
    Layer (end−2): Softmax Layer
    Layer (end−1): Classification Output Layer (Output Sizes 1000)
  Remodeled:
    Layer (end+1): Fully Connected Layer (Input Sizes 'auto', Output Sizes 2, Learning Rate of
      Weight 20, Learning Rate of Bias 20, Regularization Factor of Weight 1, Regularization
      Factor of Bias 0)
    Layer (end+2): Softmax Layer
    Layer (end+3): Classification Output Layer (Output Sizes 'auto')

Table 5. Final dense layers of CNNs before and after the transfer learning process.

AlexNet
  Pre-trained: Layer 23 'fc8' (fully connected, 1000 fully connected); Layer 24 'prob'
    (softmax); Layer 25 'output' (classification output, crossentropyex)
  Remodeled: Layer 23 'special_2' (fully connected, 64 fully connected); Layer 24 'relu'
    (ReLU); Layer 25 'fc8_2' (fully connected, 2 fully connected); Layer 26 'softmax'
    (softmax); Layer 27 'classoutput' (classification output, crossentropyex)

GoogLeNet
  Pre-trained: Layer 142 'loss3-classifier' (fully connected, 1000 fully connected); Layer 143
    'prob' (softmax); Layer 144 'output' (classification output, crossentropyex)
  Remodeled: Layer 142 'fc' (fully connected, 2 fully connected); Layer 143 'softmax'
    (softmax); Layer 144 'classoutput' (classification output, crossentropyex)

Inception-v3
  Pre-trained: Layer 314 'prediction' (fully connected, 1000 fully connected); Layer 315
    'prediction_softmax' (softmax); Layer 316 'classificationLayer_prediction'
    (classification output, crossentropyex)
  Remodeled: Layer 314 'fc' (fully connected, 2 fully connected); Layer 315 'softmax'
    (softmax); Layer 316 'classoutput' (classification output, crossentropyex)

ResNet-50
  Pre-trained: Layer 175 'fc1000' (fully connected, 1000 fully connected); Layer 176
    'fc1000_softmax' (softmax); Layer 177 'classificationLayer_fc1000' (classification
    output, crossentropyex)
  Remodeled: Layer 175 'fc' (fully connected, 2 fully connected); Layer 176 'softmax'
    (softmax); Layer 177 'classoutput' (classification output, crossentropyex)

ResNet-101
  Pre-trained: Layer 345 'fc1000' (fully connected, 1000 fully connected); Layer 346 'prob'
    (softmax); Layer 347 'classificationLayer_prediction' (classification output,
    crossentropyex)
  Remodeled: Layer 345 'fc' (fully connected, 2 fully connected); Layer 346 'softmax'
    (softmax); Layer 347 'classoutput' (classification output, crossentropyex)

The remodeled pre-trained DAG networks learned the anomaly images better than the normal images. Moreover, higher numbers of trainable parameters were observed to produce lower numbers of misclassified images: GoogLeNet has 6.79 million, Inception-v3 has 23 million, and ResNet-50 has 25.6 million trainable parameters, and their misclassified images number 147, 114, and 80, respectively. Figure 2 shows examples of misclassified images, i.e. false positive and false negative images, from the classification process.

Table 6. Training results of remodeled pre-trained CNNs.

Remodeled CNN   Epochs (max 1000)   Iterations   Minibatch (images)   Time (min)   Accuracy (%)   Sensitivity (%)   Misses (normal / anomaly / total)
AlexNet         3                   1801         20                   75           97.26          95.39             122 / 18 / 140
GoogLeNet       3                   1851         20                   153          97.13          97.77             55 / 92 / 147
Inception-v3    4                   2201         20                   841          97.77          97.89             54 / 60 / 114
ResNet-50       2                   2251         10                   448          98.44          98.44             40 / 40 / 80
ResNet-101      1                   2051         3                    624          97.11          98.66             34 / 114 / 148

Fig. 2. Misclassified images of remodeled pre-trained CNNs (a) normal behavior and
(b) anomalous behavior

5.3 Offline Mode Test


Videos of CCTV footage of housebreaking at the gates of residential units were downloaded from YouTube channels for use during the offline mode test. The downloaded videos include single or multiple perpetrators committing the crimes either in broad daylight or at night. The anomalous features as defined by the RMP are clearly performed by each of the perpetrators. The videos are then used as the input dataset to test the remodeled pre-trained CNNs.
Figure 3 displays the results of the offline mode test for all remodeled pre-trained CNNs. Most remodeled pre-trained DAG networks were good at classifying anomalous behavior, achieving high accuracies of 95% to 100%. A tendency to detect normal scenes in the videos as anomalies was observed for all networks except remodeled ResNet-101. Remodeled AlexNet also prioritized anomalous behavior during detection in each video frame, resulting in a moderate detection accuracy of 90% to 97% in distinguishing normal and anomaly images; some frames showed low accuracy of around 70%, and most of those frames were mistakenly detected as anomalies for the bending and squatting postures. In comparison, remodeled GoogLeNet achieved moderate accuracy of 80% to 90% for most frames during the offline mode test. Remodeled ResNet-101 was in third place, detecting most videos with high accuracy of 90% to 95%. Moreover, remodeled Inception-v3 and remodeled ResNet-50 were on a par, showing high accuracy of 95% to 100% on most tested video frames. However, most remodeled pre-trained CNNs attained lower detection of the gate scene as an anomalous situation. This shows that only networks with high accuracy can reliably distinguish between normal and anomalous behavior in the offline mode test.

Fig. 3. Offline mode test to detect behaviors at the gate using remodeled pre-trained CNNs
(a) Normal behavior and (b) Anomalous behavior

5.4 Real-Time Mode Test


The real-time mode test is held in the laboratory and conducted in various situations, such as a luminous environment, a dusky environment, and different acts according to the type of gate, for instance a slide gate or a push gate. A webcam provides the live input images to test the remodeled pre-trained CNNs in real-time mode, and the live feed is processed to perform the detection. Results showed that detection was accomplished within 40 ms to 0.2 s per frame. Four courses were carried out:
• Course 1 required the participant to behave normally at the gate with a multi-condition gate.
• Course 2 required the participant to impersonate a housebreaking crime at the gate with a multi-condition gate.
• Course 3 required the participant to behave normally at the gate without a multi-condition gate.
• Course 4 required the participant to impersonate a housebreaking crime at the gate without a multi-condition gate.
From Fig. 4, remodeled Inception-v3 was the best network during the real-time mode test, with 99% accuracy in detecting normal and anomalous behaviors for all courses but only moderate detection of 80% to 90% for Course 1 and Course 3. On the other hand, remodeled GoogLeNet accomplished almost perfect accuracy of 99% in all courses, especially Course 2 and Course 4, except for 80% to 90% detection accuracy in Course 1 and weak detection of 60% to 70% accuracy in Course 3. Further, remodeled AlexNet and remodeled ResNet-101 were relatively good in Course 1 and Course 2, with 80% to 90% accuracy, but totally failed in Course 3; for Course 4, however, the accuracy obtained was 99% to 100% for all frames as anomaly, whilst remodeled AlexNet attained 70% to 80% accuracy during Course 3 and Course 4 but failed to detect in Course 1 and Course 2. Moreover, remodeled ResNet-50 failed to detect normal behavior in every frame throughout all courses. Furthermore, almost the same characteristics were observed throughout the testing: (i) squatting was highly identified as an anomaly, (ii) bending and kneeling were identified more as an anomaly than as normal, and (iii) standing was highly identified as normal. All normal activities at the gate recorded a shorter time duration than the anomalous activities.
Figure 5 outlines how the behaviors change over time for all four courses. Figure 5(a) and (b), for Course 1 and Course 2, present the real-time results from remodeled GoogLeNet, while the graphs in Fig. 5(c) and (d) are the results captured from remodeled Inception-v3. Figure 5(a) depicts the steps made by the participant while searching for keys in a backpack to open the gate lock (steps 1 to 10); note that the subject opened the lock mostly in a bending posture before walking towards the frontage of the house (steps 11 to 15). Figure 5(b) describes the steps taken while sneaking with a slightly bending posture, swiftly changing into a full squat posture to check the status of the gate lock (steps 1 to 10), and then trying to break the lock with a tool in a bending posture (steps 11 to 25). The steps of normal behavior in Fig. 5(c) are interpreted from

Fig. 4. Real-time mode test in detecting behaviors at the gate using remodeled pre-trained
CNNs, (a) Normal behavior (i) Open or slide gate to open, (ii) Unlock the padlock or latch,
(iii) Other activities (b) Anomalous behavior (i) Lurking or sneaking, (ii) Breaking the gate using
tool, (iii) Wildly shaking the gate, (c) Detection on dusky environment without multi-condition
gate and (d) Two perpetrators

the scene of a person walking towards the gate with full bags in hand. Steps 3 to 5 and steps 9 to 11 are detected as anomalies because of the squatting posture while putting down and picking up the bags. Steps 6 to 8 are considered normal bending while opening the lock, not anomalous bending. Steps 1 to 2 and steps 12 to 15 are considered normal walking towards the gate and the frontage of the house. The anomalous behavior steps in Fig. 5(d) are acted out by the participant as typical actions during a housebreaking attempt, namely the scene of wildly shaking the gate (steps 10 to 25). The first three steps represent lurking with an upright standing posture, and steps 4 to 9 are classified alternately as normal and anomalous behavior as the participant stands and bends extensively while checking the gate locks. Detection with remodeled GoogLeNet was faster, completing in 40 to 80 ms per frame (about 16 fps), compared with remodeled Inception-v3, which took 0.2 s per frame (5 fps).


Fig. 5. Detection time (s) versus step number for (a) Course 1, (b) Course 2, (c) Course 3, and (d) Course 4

6 Conclusion

In conclusion, the anomalous postures of housebreaking crime, defined with the consent of the RMP, are physically performed by the perpetrators and have been captured in CCTV footage. Both normal and anomalous behaviors at the gates of residential units are used as the dataset, instead of normal data only, to achieve optimum detection. Five pre-trained CNNs with the transfer learning technique are evaluated and validated to achieve better results in detecting and recognizing anomalous behavior. The learning process shows that most misclassified images, for both the normal and the anomaly classes, involve standing postures from opening the lock, sneaking, or lurking, which are hard to distinguish as anomalous postures even for a human. The remodeled pre-trained CNNs are then tested in offline and real-time mode to verify the effectiveness of each network. As expected, high accuracy was obtained for the classification of both behaviors: accuracy and sensitivity of approximately 98% were obtained in classifying normal and anomalous behavior. Most networks achieved accuracy of up to 100% during the offline mode test, and remodeled GoogLeNet and Inception-v3 continued this success in the real-time mode test with 99% accuracy, whereas the other networks performed only moderately in the real-time mode test.
Future work will focus on detecting the sequence of movements before declaring the observed scene normal or anomalous, in order to reduce false alarms. This can be supported by developing a forensic criminal database to strengthen the understanding of criminal behavior. Remodeled pre-trained CNNs with transfer learning techniques have proven to acquire strong detection ability for anomalous behavior, but several of them still fall short in the real-time test mode; architectures such as Faster R-CNN may improve the detection ability.

Acknowledgments. This research is funded by the Research Management Centre (RMC), Universiti Teknologi MARA (UiTM), Shah Alam, Selangor, Malaysia, Grant No. 600-IRMI/MyRA5/3/BESTARI (041/2017). The first author would like to thank the Ministry of Education (MOE) Malaysia for the scholarship awarded under MyBrain MyPhD, as well as the Faculty of Electrical Engineering, UiTM Shah Alam, for all the support given during this research. In addition, special thanks to the Royal Malaysia Police for providing legal information and assisting in developing the forensic gait features.

References
1. Sidhu, A.C.P.A.S.: The rise of crime in Malaysia: an academic and statistical analysis.
J. Kuala Lumpur R. Malaysia Police Coll. 4, 1–28 (2005)
2. Hamid, L.A., Toyong, N.M.P.: Rural area, elderly people and the house breaking crime.
Proc. - Soc. Behav. Sci. 153, 443–451 (2014)
3. Soh, M.C.: Crime and urbanization: revisited malaysian case. Proc. Soc. Behav. Sci. 42(July
2010), 291–299 (2012)
4. Marzbali, M.H., Abdullah, A., Razak, N.A., Tilaki, M.J.M.: The relationship between socio-
economic characteristics, victimization and CPTED principles: evidence from the MIMIC
model. Crime Law Soc. Chang. 58(3), 351–371 (2012)
5. Chris, K., Natalia, C.-M., Carys, T., Rebbecca, A.: Burglary, vehicle and violent crime. In:
The 2001 British Crime Survey. First Results, England and Wales, vol. 18, pp. 23–27. Home
Office Statistical Bulletin, Queen Anne’s Gate, London (2001)
6. Van Dijk, J.J.M., Mayhew, P., Killias, M.: Victimization rates. In: Experiences of Crime
across the World: Key findings of the 1989 International Crime Survey, pp. 23–25. Kluwer
Law and Taxation Publishers, Deventer (1990)
7. Murphy, R., Eder, S.: Acquisitive and other property crime. In: Flatley, J., Kershaw, C.,
Smith, K., Chaplin, R., Moon, D. (eds.) Crime in England and Wales 2009/10, Third Edit.,
vol. 12, pp. 79–87. Home Office Statistical Bulletin, Marsham Street, London (2010)
8. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv.
1–72 (2009)
9. Lawson, W., Hiatt, L.: Detecting anomalous objects on mobile platforms. In: 2016 IEEE
Conference on Computer Vision Pattern Recognition Working, pp. 1426–1433 (2016)
10. Mohammadi, S., Perina, A., Kiani, H., Murino, V.: Angry crowd: detecting violent events in
videos. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision - ECCV 2016.
LNCS, vol. 9911, pp. 3–18. Springer, Cham (2016)
11. Sultani, W., Chen, C., Shah, M.: Real-world anomaly detection in surveillance videos. In:
Proceedings of IEEE Computing Socitey Conference on Computer Vision and Pattern
Recognition, pp. 6479–6488 (2018)
12. Tay, N.C., Tee, C., Ong, T.S., Goh, K.O.M., Teh, P.S.: A robust abnormal behavior
detection method using convolutional neural network. In: Alfred, R., Lim, Y., Ibrahim, A.,
Anthony, P. (eds.) Computational Science and Technology. Fifth International Conference
on Computational Science and Technology. Lecture Notes in Electrical Engineering, vol.
481, pp. 37–47. Springer, Singapore (2019)
13. Al-Dhamari, A., Sudirman, R., Mahmood, N.H.: Abnormal behavior detection in automated
surveillance videos: a review. J. Theor. Appl. Inf. Technol. 95(19), 5245–5263 (2017)
14. Delgado, B., Tahboub, K., Delp, E.J.: Automatic detection of abnormal human events on
train platforms. In: IEEE National Aerospace and Electronics Conference (NAECON 2014),
pp. 169–173 (2014)
15. Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
792 H. A. Razak et al.

16. Almisreb, A.A., Jamil, N., Md Din, N.: Utilizing AlexNet deep transfer learning for ear
recognition. In: 2018 Fourth International Conference on Information Retrieval and
Knowledge Management, pp. 8–12 (2018)
17. Andrew, J.T.A., Tanay, T., Morton, E.J., Griffin, L.D.: Transfer representation-learning for
anomaly detection. In: Proceedings of 33rd International Conference on Machine Learning
Research, New York, USA, vol. 48, pp. 1–5 (2016)
18. Ali, A.M., Angelov, P.: Anomalous behaviour detection based on heterogeneous data and
data fusion. Soft. Comput. 22(10), 3187–3201 (2018)
19. Sabokrou, M., Fayyaz, M., Fathy, M., Moayed, Z., Klette, R.: Deep-anomaly: fully
convolutional neural network for fast anomaly detection in crowded scenes. J. Comput. Vis.
Image Underst. 1–30 (2018). (arXiv00866v2 [cs.CV])
20. Huang, Z., Pan, Z., Lei, B.: Transfer learning with deep convolutional neural network for
SAR target classification with limited labeled data. Remote Sens. 9(907), 1–21 (2017)
21. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural
networks? Adv. Neural. Inf. Process. Syst. 27, 1–14 (2014)
22. Chollet, F.: Deep learning for computer vision: using a pretrained convnet. In: Deep
Learning with Python, pp. 143–159. Manning, Shelter Island (2018)
23. Ali, A.M., Angelov, P.: Applying computational intelligence to community policing and
forensic investigations. In: Bayerl, P.S., Karlovic, R., Akhgar, B., Markarian, G. (eds.)
Advanced Sciences and Technologies for Security Applications: Community Policing - A
European Perspective, pp. 231–246. Springer, Cham (2017)
24. Lu, J., Yan, W.Q., Nguyen, M.: Human behaviour recognition using deep learning. In: 2018
15th IEEE International Conference on Advanced Video and Signal Based Surveillance,
pp. 1–6 (2018)
25. Hospedales, T., Gong, S., Xiang, T.: A Markov clustering topic model for mining behaviour
in video. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1–8 (2009)
26. Zhang, C., Li, R., Kim, W., Yoon, D., Patras, P.: Driver behavior recognition via interwoven
deep convolutional neural nets with multi-stream inputs. arXiv:1811.09128v1 [cs.CV], pp.
1–10 (2018)
27. Pang, Y., Syu, S., Huang, Y., Chen, B.: An advanced deep framework for recognition of
distracted driving behaviors. In: 2018 IEEE 7th Global Conference on Consumer
Electronics, pp. 802–803 (2018)
28. Arifoglu, D., Bouchachia, A.: Activity recognition and abnormal behaviour detection with
recurrent neural networks. In: 14th International Conference on Mobile Systems and
Pervasive Computing (MobiSPC 2017), vol. 110, pp. 86–93 (2017)
29. Kröse, B., van Oosterhout, T., Englebienne, G.: Video surveillance for behaviour monitoring
in home health care. Proc. Meas. Behav. 2014, 2–6 (2014)
30. Leixian, S., Zhang, Q.: Fall behavior recognition based on deep learning and image
processing. Int. J. Mob. Comput. Multimed. 9(4), 1–16 (2019)
31. Xu, H., Li, L., Fang, M., Zhang, F.: Movement human actions recognition based on machine
learning. Int. J. Online Biomed. Eng. 14(4), 193–210 (2018)
32. Datta, A., Shah, M., Da Vitoria Lobo, N.: Person-on-person violence detection in video data.
In: Proceedings of International Conference on Pattern Recognition, vol. 16, no. 1, pp. 433–
438 (2002)
33. Gao, Y., Liu, H., Sun, X., Wang, C., Liu, Y.: Violence detection using oriented VIolent
flows. Image Vis. Comput. 48–49, 37–41 (2016)
34. Kooij, J.F.P., Liem, M.C., Krijnders, J. D., Andringa, T., Gavrila, D.M.: Multi-modal human
aggression detection. Comput. Vis. Image Underst. 1–35 (2016)

35. Patil, S., Talele, K.: Suspicious movement detection and tracking based on color histogram. In:
2015 International Conference on Communication, Information and Computing Technology,
pp. 1–6 (2015)
36. Zhu, Y., Wang, Z.: Real-time abnormal behavior detection in elevator. In: Zhang, Z., Huang,
K. (eds.) Intelligent Visual Surveillance. IVS 2016. Communications in Computer and
Information Science, vol. 664, pp. 154–161. Springer, Singapore (2016)
37. Ben Ayed, M., Abid, M.: Suspicious behavior detection based on DECOC classifier. In: 18th
International Conference on Sciences and Techniques of Automatic Control and Computer
Engineering, pp. 594–598 (2017)
38. Yu, B.: Design and implementation of behavior recognition system based on convolutional
neural network. In: ITM Web Conference, vol. 12, no. 01025, pp. 1–5 (2017)
39. He, L., Wang, D., Wang, H.: Human abnormal action identification method in different
scenarios. In: Proceedings of 2011 2nd International Conference on Digital Manufacturing
and Automation ICDMA 2011, pp. 594–597 (2011)
40. Min, W., Cui, H., Han, Q., Zou, F.: A scene recognition and semantic analysis approach to
unhealthy sitting posture detection during screen-reading. Sensors (Basel) 18(9), 1–22
(2018)
41. Nazare, T.S., de Mello, R.F., Ponti, M.A.: Are pre-trained CNNs good feature extractors for
anomaly detection in surveillance videos? arXiv:1811.08495v1 [cs.CV], pp. 1–6 (2018)
42. Lee, J., Kim, H., Lee, J., Yoon, S.: Transfer learning for deep learning on graph-structured data. In:
Proceedings of Thirty-First AAAI Conference on Artificial Intelligence, pp. 2154–2160 (2017)
43. Bell, M.O.: Computational Complexity of network reliability analysis: an overview. IEEE
Trans. Reliab. R-35(3), 230–239 (1986)
44. The Mathworks: Series network for deep learning – MATLAB (2016). https://www.
mathworks.com/help/deeplearning/ref/seriesnetwork.html. Accessed 12 June 2019
45. Vedaldi, A., Lenc, K., Gupta, A.: MatConvNet - convolutional neural networks for
MATLAB. arXiv:1412.4564 [cs.CV], pp. 1–59 (2015)
46. The Mathworks: Directed acyclic graph (DAG) network for deep learning - MATLAB
(2017). Available: https://www.mathworks.com/help/deeplearning/ref/dagnetwork.html.
Accessed: 12 June 2019
47. Sahner, R.A., Trivedi, K.S.: Performance and reliability analysis using directed acyclic
graphs. IEEE Trans. Softw. Eng. SE-13(10), 1105–1114 (1987)
48. Bang-Jensen, J., Gutin, G.Z.: Acyclic digraphs. In: Diagraphs: Theory, Algorithms and
Applications, Second Edition. Monographs in Mathematics, pp. 32–34. Springer, London (2009)
49. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional
neural networks. In: Advance in Neural Information Processing Systems, vol. 25, pp. 1–9
(2012)
50. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. In: 3rd International Conference on Learning Representation (ICLR 2015),
pp. 1–14 (2015)
51. Szegedy, C., et al.: Going deeper with convolutions. In: IEEE Conference on Computer
Vision Pattern Recognition (CVPR 2015), pp. 1–9 (2015)
52. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J.: Rethinking the inception architecture for
computer vision. In: 2016 IEEE Conference on Computer Vision Pattern Recognition
(CVPR 2016), pp. 2818–2826 (2016)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016
IEEE Conference on Computer Vision Pattern Recognition, pp. 770–778 (2016)
54. Zagoruyko, S., Komodakis, N.: Wide residual networks. arXiv:1605.07146v4 [cs.CV], pp.
1–15 (2017)
55. Liu, Y.Y., Slotine, J.-J., Barabasi, A.-L.: Control centrality and hierarchical structure in
complex networks. PLoS One 7(9), 1–7 (2012)
Occluded Traffic Signs Recognition

Shwu-Huey Yen(&), Chun-Yung Shu, and Hui-Huang Hsu

Tamkang University, New Taipei City, Taiwan, ROC


105390@mail.tku.edu.tw, 605410652@s05.tku.edu.tw,
hsu@gms.tku.edu.tw

Abstract. Traffic sign recognition is very important in intelligent driving. It can remind drivers to react properly to road conditions and increases driving safety. One of the challenges in recognizing traffic signs is occlusion. In this paper, we focus on this problem, particularly in the greater Taipei area, including Taipei and New Taipei City. We propose a convolutional neural network equipped with regional masks to solve occluded traffic sign recognition. Traffic sign images of Taipei and New Taipei City are collected, mainly from Google Maps, for training and testing. Finally, the proposed method is tested both on our own dataset and on the German public dataset GTSRB. The experimental results demonstrate that the occlusion problem is greatly alleviated and the results are very promising.

Keywords: Occlusion · Traffic sign · Recognition · GTSRB · Convolutional Neural Network · Mask

1 Introduction

Automatic traffic sign recognition plays a vital role in intelligent self-driving vehicles and driver assistance systems. Especially in the greater Taipei area, including Taipei and New Taipei City, the heavy traffic of cars and motorcycles and a rapidly growing number of aged drivers make a reliable sign-recognition system even more crucial for enhancing safe driving.
When driving in real-world road scenes, the quality of traffic sign images is easily affected by various conditions that are difficult to control, such as weather, illumination, or occlusion. In this article, we focus on the case of occluded traffic signs. Occlusion is a common problem caused by trees, pedestrians, or other signs, and the resulting incomplete information makes sign recognition difficult.
Recently, many object recognition tasks have been solved using convolutional neural networks (CNNs). Owing to their high recognition rates and fast execution, the superior performance of CNNs on various computer vision tasks is widely known. We propose a CNN-based recognition system for occluded traffic signs. It utilizes multiple local masks to learn the regional features of traffic signs. With auxiliary classifiers, robust local features are learned and the problem of incomplete information is much alleviated. In the following sections, we give a brief introduction to related works, followed by the proposed method, experiments, and, finally, the conclusion and future work.

© Springer Nature Switzerland AG 2020


K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 794–804, 2020.
https://doi.org/10.1007/978-3-030-39442-4_58

2 Related Works

The task of recognizing traffic sign images includes detecting a sign and then recognizing the detected one. As mentioned, many issues make the automatic detection of traffic signs difficult; hence features from both the color and the shape of signs are used for detection [1–3]. Greenhalgh et al. proposed a system that detects candidate regions as Maximally Stable Extremal Regions (MSERs), a method that is robust to variations in lighting and illumination in the scene and thus diminishes the illumination effect. These candidate regions were then classified using HOG features with a cascade of Random Forests [1]. Kuang et al. improved Greenhalgh's method by applying image contrast enhancement before MSER, which makes the sign detection more robust to illumination and rotation scale. In the recognition phase, they also used Random Forests, but with an HSV-HOG-LBP feature descriptor [2].
Conventional methods for traffic sign recognition, such as those mentioned above, usually include feature extraction and classifier training. Features like color, HOG, SIFT, SURF, and LBP, or their improved versions, are combined with classifiers like LDA, SVM, AdaBoost, or Random Forests [4]. However, a convolutional neural network can be trained without hand-designed features, which makes it popular nowadays. For example, Ciresan et al. used an ensemble of CNNs, and the final prediction averages the output scores of all columns to boost the classification accuracy with a low error rate [5, 6]. In 2011, a traffic sign classification challenge on the German Traffic Sign Recognition Benchmark (GTSRB) dataset was held at the International Joint Conference on Neural Networks (IJCNN), and [6] won first place with an accuracy rate better than a human's [7]. Singh et al. used an HSV color feature to extract sign candidates and trained a classifier with a CNN [8]; they reported 97.92% accuracy when tested on the GTSRB dataset [7].
When the sign is occluded, the recognition task becomes more difficult due to insufficient visual clues. Hou et al. [9] proposed a two-stage recognition system to solve the occlusion problem. First, it extracts HOG features to classify the category of the sign with pairwise One-Versus-One SVMs; with the voting strategy of the SVMs, the correct shape can be determined even if the sign is partially occluded. An SVM is then trained to determine the exact label of the traffic sign within the specific shape.
Occlusion in tracking is a well-studied issue. Yang et al. [10] utilized a Siamese network to capture the features of the exemplar image and the search image as a kernel mask and a candidate area, respectively. The score map is obtained by correlating the kernel mask with the candidate area. In addition, they used multiple (2-by-2) position-sensitive score maps to alleviate the occlusion effect; finally, the four position-sensitive score maps are merged into a final score map to locate the target object. Our method is inspired by this viewpoint.
In this article, a CNN-based system is proposed to recognize traffic signs, especially occluded signs, in the greater Taipei area. We observe that it is distracting to recognize traffic signs in the brief time available during busy traffic, not to mention the cluttered street scenes of the area. In order to train a system that works for the Taipei area, it is better to use training images reflecting the true environment. In addition, the public dataset GTSRB only has a limited number of occluded sign images. For these reasons, we collected our own training images of the Taipei area. We captured some traffic sign images ourselves, but most of the images are collected from Google Maps [11]. These images are all taken in the greater Taipei area. By analyzing the statistics of the collected data, the twenty most common traffic signs are used as our recognition targets.


Fig. 1. The architecture diagram of our method, which consists of three parts: global feature extraction, local feature extraction, and classification.

3 Methodology

The architecture of the system is shown in Fig. 1. Given an input sign image, the global feature is first extracted, followed by five local feature extractions. Each local feature extractor, together with an auxiliary classifier, learns important features even when only partial information is available. Finally, the label of the sign is predicted from a concatenation of the local features with a Softmax classifier. To illustrate the proposed system in detail, we first explain the training data, followed by the system architecture. Then the feature maps are discussed, especially the differences between the global and local features.

3.1 Database Collection


In the field of traffic sign recognition, the German dataset GTSRB is probably one of the most well-known datasets. It contains 43 different signs under various sizes, lighting conditions, motion blur, and low resolution (down to 15 × 15). Since road conditions and street scenes in Germany differ from Taipei's, and GTSRB contains only a limited number of occluded signs, we build a dataset of our own. 41,684 images are collected from Google Maps or captured by ourselves. Only the signs with at least 400 images, including occluded ones, are kept, which gives 20 different signs for our system. Since the goal is to train a system for recognizing occluded signs, we carefully choose every occluded sample to be a testing image, giving 330 such test images; the remaining non-occluded images are used for training.
Occluded Traffic Signs Recognition 797

In preparing the training and testing images, we crop each image by a rectangular box roughly inscribing the sign and resize the cropped image to 64 × 64 with a label annotation. Figure 2 shows some training images and some testing images.

(a) Training images

(b) Testing images

Fig. 2. Some sample images from our dataset. From left to right, these signs are “Dividing Ways”, “Right Bend”, “Speed Limit (50 km/h)”, and “Two-Stage Left Turn for Bicycles/Motorcycles”. The first row shows training images and the bottom row shows testing images, which are occluded. These images are resized to 64 × 64.

Since the distribution of the collected traffic signs is not uniform, we first enlarge the data of those classes with fewer than 1000 samples, using random translation in the horizontal and/or vertical direction to replicate images until the sample number reaches 1000. We take these images of each sign as the original training images and augment the training data by randomly adding noise, rotating by ±10°, ±15°, or ±25°, and synthesizing occlusion blobs onto the training images. In this way, each traffic sign has 9,000 images and the total number of training images is 180,000. During training, we randomly take 80% for training and 20% for validation.
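A minimal sketch of this augmentation is given below; the noise level, the blob size, and the combination of all three operations in one function are assumptions, since the paper only names the operations and the rotation angles.

```python
# Data augmentation sketch: Gaussian noise, rotation from {+-10, +-15, +-25} degrees,
# and a synthetic occlusion blob pasted over the 64 x 64 sign image.
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def augment(img):                       # img: 64 x 64 x 3, float values in [0, 1]
    out = img + rng.normal(0.0, 0.02, img.shape)          # random noise (assumed sigma)
    angle = rng.choice([-25, -15, -10, 10, 15, 25])       # rotation angle from the paper
    out = rotate(out, angle, axes=(0, 1), reshape=False, mode="nearest")
    y, x = rng.integers(0, 48, size=2)                    # occlusion blob position
    out[y:y + 16, x:x + 16, :] = 0.0                      # 16 x 16 black patch (assumed size)
    return np.clip(out, 0.0, 1.0)
```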

3.2 System Architecture


As shown in Fig. 1, the proposed system is made of three parts: global feature
extraction, local feature extraction, classification.
Global Feature Extraction. This part learns the overall feature of the traffic sign. Figure 3 shows the structure of the global feature extractor in detail. In Fig. 3, as well as in Fig. 4, h, w, and c represent height, width, and channel. In Op(s), Op and s represent an operation with stride s; for example, (C3×3(s1) + ReLU) denotes a CNN operation with a 3 × 3 kernel and stride 1, followed by a ReLU. c_number represents the number of kernels, which becomes the number of channels of the input to the next layer.
798 S.-H. Yen et al.

Fig. 3. Global feature extraction. The detected sign image of size 64 × 64 × 3 is the input to the system. After global feature extraction the output feature maps are of size 14 × 14 × 64 (output of layer 5). In the figure, CNN, ReLU, and Max_Pooling operations are represented as purple, brown, and green bars, respectively. It is made of 5 layers, where layers 1, 2, and 4 are 3 × 3 CNN (stride 1) followed by ReLU, and layers 3 and 5 are 2 × 2 max pooling with stride 2.
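The five layers of Fig. 3 can be written down directly in tf.keras (the paper's implementation uses TensorFlow); only the 14 × 14 × 64 output is stated explicitly, so the filter counts of the intermediate layers are assumptions of this sketch.

```python
# Global feature extractor sketch: two valid 3x3 convs, max pool, 3x3 conv, max pool.
import tensorflow as tf
from tensorflow.keras import layers

def global_feature_extractor():
    inp = layers.Input(shape=(64, 64, 3))
    x = layers.Conv2D(64, 3, strides=1, activation="relu")(inp)   # layer 1: 62 x 62
    x = layers.Conv2D(64, 3, strides=1, activation="relu")(x)     # layer 2: 60 x 60
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)            # layer 3: 30 x 30
    x = layers.Conv2D(64, 3, strides=1, activation="relu")(x)     # layer 4: 28 x 28
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)            # layer 5: 14 x 14 x 64
    return tf.keras.Model(inp, x)

print(global_feature_extractor().output_shape)   # (None, 14, 14, 64)
```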

Local Feature Extraction. This part learns local features of traffic signs. Figure 4 illustrates the local feature extraction architecture. To alleviate the problem of incomplete features caused by occlusion, we make the machine discriminate traffic signs by learning partial features located only in the upper-left, upper-right, center, lower-right, and lower-left quarter regions, as shown in the top of Fig. 4. Each of these networks learns features through three CNNs and ReLUs, as shown in the bottom of Fig. 4. For each subnetwork, the features are flattened in layer 9, followed by a dense operation that maps 4096 nodes to 20 nodes in layer 10. These 20 nodes are classified into traffic signs in the next step. In addition, a concatenation of these five 20-node vectors is fused in the later classification. The details are explained in the following.

Fig. 4. Local feature extraction. (Top) The input is the output of layer 5 described in Fig. 3. It is copied into 5 parallel networks, each with only one quarter of the features retained and zeros for the rest. (Bottom) Each network performs the operations as described. At layer 10, a dense operation maps the flattened 4096 nodes to 20 nodes.
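One local branch of Fig. 4 can be sketched as follows; the quarter-region masks and the three Conv+ReLU layers follow the figure, while the exact filter counts and the placement of the center window are assumptions (three valid 3 × 3 convolutions with 64 filters reduce the 14 × 14 map to 8 × 8 × 64, which matches the 4096 flattened nodes).

```python
# One local branch: keep one quarter of the 14x14x64 global map, zero the rest,
# apply three Conv+ReLU layers, flatten to 4096, and map to 20 nodes.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def region_mask(region, size=14, half=7):
    m = np.zeros((size, size, 1), dtype="float32")
    slices = {
        "upper_left":  (slice(0, half),           slice(0, half)),
        "upper_right": (slice(0, half),           slice(size - half, size)),
        "center":      (slice(3, 3 + half),       slice(3, 3 + half)),   # assumed placement
        "lower_left":  (slice(size - half, size), slice(0, half)),
        "lower_right": (slice(size - half, size), slice(size - half, size)),
    }
    r, c = slices[region]
    m[r, c, :] = 1.0
    return m

def local_branch(global_features, region, n_classes=20):
    x = layers.Lambda(lambda t, m=region_mask(region): t * m)(global_features)
    for _ in range(3):                                    # layers 6-8: Conv + ReLU
        x = layers.Conv2D(64, 3, activation="relu")(x)    # 14 -> 12 -> 10 -> 8
    x = layers.Flatten()(x)                               # layer 9: 8 * 8 * 64 = 4096
    return layers.Dense(n_classes)(x)                     # layer 10: 20 nodes
```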

Classification. The traffic sign is recognized by a Softmax classifier with the help of five auxiliary classifiers, as explained in Fig. 5. In order to make the subnetworks learn local features, we use an auxiliary (Softmax) classifier for each of the five subnetworks. In addition, these local features are fused together to classify the traffic sign. For each output of layer 10, a vector of dimension 20 right before the Softmax is activated by a ReLU and then concatenated with those of the other subnetworks, resulting in a column vector of 100 nodes. This column vector is densely connected to 20 nodes, followed by a Softmax, and then classified. We call this classifier the Final Classifier. Therefore, the loss function consists of the loss functions from the five (regional) auxiliary classifiers and the one from the overall final classifier. As given in (1), all loss functions are categorical cross-entropy, where α is empirically set to 0.2.

$L_{\mathrm{total}} = \alpha\, L_{\mathrm{auxiliary}} + L_{\mathrm{final}} = \alpha (L_1 + L_2 + L_3 + L_4 + L_5) + L_{\mathrm{final}}$   (1)

Fig. 5. Auxiliary classifiers and the Final Classifier in the system. The output of layer 10 is used for both the auxiliary classifiers and the final classifier: the former learn different local features of traffic signs, and the latter classifies the traffic sign after fusing the learnt local features.
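The loss in Eq. (1) maps naturally onto a multi-output model in tf.keras, with the five auxiliary Softmax outputs weighted by α = 0.2 and the final classifier weighted by 1; the layer names and the optimizer choice below are assumptions of this sketch, not details from the paper.

```python
# Attach five auxiliary Softmax heads and the fused Final Classifier, then compile
# with per-output loss weights so the total loss matches Eq. (1).
import tensorflow as tf
from tensorflow.keras import layers

def attach_heads(inputs, branch_outputs, n_classes=20, alpha=0.2):
    """branch_outputs: the five 20-node tensors from layer 10 of each branch."""
    aux = [layers.Softmax(name=f"aux_{i}")(b) for i, b in enumerate(branch_outputs)]
    fused = layers.Concatenate()([layers.ReLU()(b) for b in branch_outputs])  # 100 nodes
    final = layers.Dense(n_classes, activation="softmax", name="final")(fused)
    model = tf.keras.Model(inputs, aux + [final])
    model.compile(optimizer="adam",                           # optimizer is an assumption
                  loss="categorical_crossentropy",            # all outputs use cross-entropy
                  loss_weights=[alpha] * len(aux) + [1.0])    # Eq. (1) with alpha = 0.2
    return model
```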

3.3 Feature Map


We demonstrate the effect of the global and local feature extractors in Figs. 6 and 7. Figure 6 shows heat maps of the global features of two images of the “Speed Limit (50 km/h)” traffic sign: (a) is non-occluded and (b) is occluded. In Fig. 6(a), the heat map properly exhibits the important feature of the sign since the image provides complete information. Contrarily, in Fig. 6(b), the heat map indicates a higher response on the right side (the occluded area), which would mislead the recognition task if the global feature were used for classification. Next, this global feature is fed into the local feature extractors. Taking Fig. 6(b) as an example, the five heat maps of the local feature extractors (output of layer 10) are shown in Fig. 7. The heat maps of the upper-right and lower-right regions indicate low responses, while the heat maps of the upper-left, lower-left, and center regions all show higher responses. These responses agree with the layout of the occluded input image. With the help of these local features, the occluded sign is recognized correctly by our system. Figure 8 gives two more occluded examples; their local features show high responses to the sign area and low responses to the occluded area.

(a) non-occlusion sign (b) occluded sign

Fig. 6. Traffic sign “Speed Limit (50)” and its heat map of the global feature (output of layer 5). Red and blue colors represent high and low responses in the heat map. While the heat map of (a) is capable of representing the important feature of the sign, the occluded one mistakenly responds to the green sign on the right of the image.

Upper-left Upper-right Center Lower-right Lower-left

Fig. 7. Heat maps of the local features (output of layer 8) of the occluded traffic sign from Fig. 6(b). Heat maps of the upper-left and lower-left regions give high responses, the upper-right and lower-right regions give low responses, and the center region gives a middle response. These heat maps indicate that the local features of the sign are learned well.

(a) “No U Turn” (b) “Watch Out for Pedestrians”

Upper-left Upper-right Center Lower-right Lower-left

Fig. 8. More heat maps of global & local features. First row, (a) and (b) are occluded images
and global features. Row (c) is the local features of (a) “No U Turn”, and Row (d) is the local
features of (b) “Watch Out for Pedestrians”.

4 Experiment

To evaluate the effectiveness of our system, we performed several experiments. In Sect. 4.1, we test the system on our dataset. In Sect. 4.2, we test the system on the widely used public dataset GTSRB. The system was implemented in Python 3.6 with TensorFlow and trained on a computer with a 1080 Ti GPU.

4.1 Effectiveness on Occluded Test Images of Our Dataset


Figure 9 shows the results of classifying the 330 occluded test images over twenty traffic sign classes. Eleven test images are classified incorrectly, i.e., the accuracy rate is 96.67%. The twenty signs are the following: 1. Dividing ways; 2. No entry; 3. Straight; 4.
No U turn; 5. Prohibit left turn; 6. Approaching vehicle on left; 7. Narrow road; 8. No
stopping; 9. Left turn only lane; 10. No parking; 11. Watch out for pedestrians; 12.
Speed limit (50); 13. Right bend; 14. Speed limit (70); 15. Watch out for children; 16.
Do not turn right; 17. Keep right; 18. Keep left; 19. Two-stage left turn for
bicycles/motorcycles; 20. Slow.

Fig. 9. Confusion matrix for 330 occluded traffic signs.

In the confusion matrix of Fig. 9, blank cells represent zeros. The diagonal entries are almost all ones, with only three classes below 90%. Among these 20 signs, the third class (“Straight”), the thirteenth class (“Right Bend”), and the fifteenth class (“Watch Out for Children”) have low accuracy rates of 0.778, 0.769, and 0.857, respectively. One of the main reasons for the low accuracy rates is the small number of test samples. Figure 10 shows the erroneously classified test images of these three classes: (first row) 2 errors among the 9 test images of “Straight”, (middle row) 3 errors among the 13 test images of “Right Bend”, and (third row) 1 error among the 7 test images of “Watch Out for Children”. From these erroneously classified images, we can conclude that the classification is prone to be incorrect when the occlusion is severe or combined with blurring.
To further explore whether the auxiliary classifiers boost the performance of our system, we trained an identical system using the same training samples but without auxiliary classifiers. The accuracy rate on the 330 occluded test images of the system without auxiliary classifiers is 85%. Compared with the 97% of the proposed system, it is clear that the auxiliary classifiers help the machine learn robust and discriminative local features of the different signs.

Fig. 10. Some erroneously classified test images.

4.2 Effectiveness on GTSRB Dataset


GTSRB is an open and widely used traffic sign database. It contains 43 different signs under various conditions. Compared with our collected data, the GTSRB data contain different lighting and weather conditions and/or motion-blur challenges, as shown in Fig. 11. These challenges are not the core of our collected dataset. We apply the GTSRB dataset to the proposed system to explore the generality of the proposed method.

Fig. 11. Some examples of GTSRB dataset.

However, our system is trained using signs collected from Taipei and the vicinity and covers only 20 different signs. For comparison, we carefully design two tests on the GTSRB dataset: one selects similar traffic signs from GTSRB and fine-tunes the system, and the other retrains our system with GTSRB training data.
For the first test, we select 5 signs that are similar to Taipei's: No Entry, Speed Limit (50), Speed Limit (70), Keep Right, and Keep Left. Among these five signs, the smallest training set in GTSRB has 210 images, so we randomly take 210 training images for each sign for fine-tuning and test on the GTSRB test images. The test result has an accuracy rate of 93.5%. Although the system is fine-tuned using only 1050 images, it still achieves a satisfactory result.
Next, we would like to know whether the proposed architecture is able to work on the GTSRB traffic sign database in general. Among the 43 signs, we discard the 3 signs with the fewest training images and divide the remaining data into groups of 20 signs (to fit our architecture), retraining the system for each group. Without loss of generality, according to their order in GTSRB (after the 3 signs are discarded), the dataset is divided into 4 groups of 20 signs: 1–20, 21–40, odd-numbered 1, 3, …, 39, and even-numbered 2, 4, …, 40. We train the system four times using these 4 groups and test on the corresponding test images, respectively. The results are summarized in Table 1. The average accuracy is 0.9803 ± 0.0023. Compared with the existing methods, Ciresan's 0.9915 [5] and Singh's 0.9792 [8], our result is comparable with the accuracy rates of these methods, although both methods classify all 43 classes of GTSRB, which is more difficult. We want to point out that although the proposed system is designed to solve the occlusion problem, the experiment demonstrates that the system can capture important features even under different lighting, weather conditions, and/or motion-blur challenges.

Table 1. GTSRB test result using our architecture.


Sign classes 1–20 21–40 1, 3, …, 39 2, 4, …, 40 Average accuracy
Accuracy 0.9778 0.9824 0.9790 0.9821 0.9803
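For concreteness, the grouping of the 40 retained classes and the average accuracy in Table 1 can be reproduced with the short sketch below; the renumbering of the kept GTSRB classes to 1–40 is an assumption made only for illustration.

import statistics

# The 40 retained GTSRB classes, renumbered 1..40 after discarding the 3 classes
# with the fewest training images (the renumbering itself is an assumption).
remaining = list(range(1, 41))
groups = {
    "1-20": remaining[:20],
    "21-40": remaining[20:],
    "odd": remaining[0::2],    # 1, 3, ..., 39
    "even": remaining[1::2],   # 2, 4, ..., 40
}

# Per-group test accuracies reported in Table 1.
acc = {"1-20": 0.9778, "21-40": 0.9824, "odd": 0.9790, "even": 0.9821}
mean = statistics.mean(acc.values())
std = statistics.stdev(acc.values())   # sample standard deviation across the 4 groups
print(f"average accuracy = {mean:.4f} +/- {std:.4f}")

Running this prints 0.9803 +/- 0.0023, which matches the average reported in Table 1 when the ± term is read as the sample standard deviation across the four groups.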

5 Conclusion and Future Work

In this paper, we illustrated how to use local features in solving the occluded traffic sign
recognition problem. With five local feature extractors, each equipped with a Softmax
classifier, the system can learn discriminative partial features of traffic signs. The
experimental results are promising. However, we also found a few samples which can
be classified correctly using the global feature alone but are erroneously classified by the
system. This indicates that it should be beneficial to combine global and local features
together.
Another issue concerns the benchmark dataset. Although we collected our own dataset,
its size is too small. The GTSRB dataset currently contains mostly illumination
variation and motion blur; it does not contain enough occlusion samples. In addition,
the distribution of samples across traffic sign classes is imbalanced, which can negatively
impact recognition performance. Last but not least, performance on the current dataset
has reached saturation. A new balanced dataset that contains signs from different regions,
captured in more complex backgrounds under different conditions, is needed in this
research field.
Acknowledgment. The authors would like to thank the Ministry of Science and Technology
(MOST) of R.O.C. for financially supporting part of this research under grant number MOST
107-2221-E-032-035-MY2.

References
1. Greenhalgh, J., Mirmehdi, M.: Real-time detection and recognition of road traffic signs.
IEEE Trans. Intell. Transp. Syst. 13(4), 1498–1506 (2012)
2. Kuang, X., Fu, W., Yang, L.: Real-time detection and recognition of road traffic signs using
mser and random forests. Int. J. Online Biomed. Eng. (iJOE) 14(3), 34–51 (2018)
3. Yang, Y., Luo, H., Xu, H., Wu, F.: Towards real-time traffic sign detection and
classification. IEEE Trans. Intell. Transp. Syst. 17(7), 2022–2031 (2016)
4. Saadna, Y., Behloul, A.: An overview of traffic sign detection and classification methods.
Int. J. Multimed. Inf. Retr. 6(3), 193–210 (2017)
5. Ciresan, D., Meier, U., Masci, J., Schmidhuber, J.: A committee of neural networks for
traffic sign classification. In: The 2011 International Joint Conference on Neural Networks,
pp. 1918–1921 (2011)
6. Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image
classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR
2012), 3642–3649 (2012)
7. The German Traffic Sign Recognition Benchmark (GTSRB). http://benchmark.ini.rub.de/?
section=gtsrb&subsection=news. Accessed 01 Aug 2019
8. Singh, M., Pandey, M.K., Malik, L.: Traffic sign detection and recognition for autonomous
vehicles. Int. J. Adv. Res. Ideas Innov. Technol. (IJARIIT) 4(2), 1666–1670 (2018)
9. Hou, Y.L., Hao, X., Chen, H.: A cognitively motivated method for classification of occluded
traffic signs. IEEE Trans. Syst. 47(2), 2168–2216 (2017)
10. Yang, L., Jiang, P., Wang, F., Wang, X.: Region based fully convolutional siamese networks
for robust real-time visual tracking. In: IEEE International Conference on Image Processing
(ICIP 2017)
11. Google Maps. https://maps.google.com/. Accessed 01 Aug 2019
Consumer Use Pattern
and Evaluation of Social Media
Based Consumer Information Source

Yafang Zhang, Harim Yeo, Xu Li, and Hyesun Hwang(&)

Sungkyunkwan University, 25-2 Sungkyunkwan-ro, Jongno-gu, Seoul, Korea


h.hwang@skku.edu

Abstract. This study attempted to investigate consumers’ evaluation of the
utility of a social media based consumer information source (SMCIS). The utility
was identified by the evaluation of information characteristics and the evaluation
of network characteristics. The evaluation of information characteristics includes
reliability, vividness, and variety; the evaluation of network characteristics
includes economics, interactivity, and regenerativity. The results showed the
highest evaluation for regenerativity, followed by economics, interactivity,
vividness, variety, and reliability. Mean differences were verified according
to socio-economic characteristics and usage patterns of the information source. In
particular, it was found that females rated reliability higher than males, and the
middle-income group rated vividness higher than the low-income group.
In addition, consumers with relatively high access to SMCIS evaluated the
reliability of the information less positively than those with lower
access to SMCIS. This study explored the elements that social media needs in
order to function as a useful information channel for consumers.

Keywords: Social media · Consumer information source · Consumer evaluation of information source

1 Purpose

The characteristics of social media, such as openness, participation, and sharing, have
created consumer-driven sources of information and active interactions among
consumers on the network. Consumer-driven sources of information have different
This study aims to explore the emerging consumer information environment by
investigating the characteristics of social media based consumer information source
(SMCIS) and analyzing consumers’ evaluation of the source.
This study first classifies the characteristics of SMCIS into information character-
istics and network characteristics to validate the evaluation of consumers based on
previous research. Then we examine the patterns of using SMCISs and investigate
consumers’ evaluation of the information and network characteristics of SMCIS.
Finally, we analyze the consumers’ evaluation of SMCIS according to their socio-
economic characteristics and usage patterns.


2 Background

In the process of consumer information search, consumers use their own experience
and memory, or gather the necessary information from various channels, such as
advertising, Internet sites, and people around them [1]. Based on the participatory
online environment, online consumer-driven information, such as recommendations,
user reviews, and assessments of products from other consumers, has a significant
impact on consumers’ purchase decisions [2]. Recently, the social media based
consumer information source (SMCIS) has been emerging as an interactive and
participatory consumer-driven information source.
Social media, a tool and platform that enables the transmission of documents,
pictures, videos, and music sources or online communication, has led to the production
and sharing of information by social media users. The production and sharing of
consumer-driven information has gained momentum in an environment where con-
sumers can communicate and elaborate useful information by themselves. Furthermore,
in this context, consumers trust information received from SMCIS, such as
recommendations of products or user reviews generated by other consumers, which
also influences their purchasing decisions [3].
According to previous studies, the usefulness of SMCIS can be identified by the
evaluation of information characteristics and the evaluation of network characteristics.
The information characteristics include reliability, vividness, and variety [3–5]. The
network characteristics have been identified as economics, interactivity, and
regenerativity [6, 7]. In this study, the evaluation of the characteristics of SMCIS will
be explored through reliability, vividness, variety, economics, interactivity, and
regenerativity. Consumers’ evaluation of SMCIS will be identified through a case
study of “RED (www.xiaohongshu.com)”, China’s leading online information-sharing
community.

3 Method

3.1 Data
The number of users per region was calculated using the location information of RED
users from December 20, 2017, to December 20, 2018. The results showed that 56.5%
of the users were located in Shanghai, Guangdong, and Beijing out of the total 34
regions as shown in Fig. 1. In addition, Fig. 2 shows the user age distribution of RED.
In order to secure the validity and reliability of the survey results, the selected subjects
were Chinese consumers in their 20s to 40s in Shanghai, Guangdong, and Beijing who
had experience using SMCISs.
The responses from consumers aged 20 to 49 who have experience using SMCISs
and live in Beijing, Shanghai and Guangdong were collected by WenJuanXing, a
Chinese online research firm, for 10 days from December 1, 2018 to December 11,
2018. A total of 417 responses were collected and analyzed.

Fig. 1. User distribution by region. Source: Baidu Index (https://index.baidu.com)

Fig. 2. User age distribution of RED (age groups ≤19, 20–29, 30–39, 40–49, ≥50; vertical axis in 10% steps from 0% to 60%). Source: Baidu Index (https://index.baidu.com)

3.2 Measurement
The use pattern of SMCIS was measured by three items including Internet usage time
per day, most commonly used source of information, and the average usage of SMCIS
per week. The evaluation of the information and network characteristics of SMCIS was
measured by three subfactors each. All measurement items were extracted from
previous research [8–11]. In this study, the reliability of information was defined as
“the extent to which the information provided by the information source is realistic,
truthful, neutral and objective.” Vividness was defined as “the extent to which the
information provides vivid feelings as if it had actually been seen and specific, detailed,
and realistic depiction.” Variety was defined as “the extent to which the information
addresses various topics, alternatives, and substantial content.” Economics was defined
as “the cost of exploring information.” Interactivity was defined as “the extent to which
the network helps users to share information among themselves and to build relationships.”
Regenerativity was defined as “the extent to which the network provides up-to-date
information in time.” Information characteristics include reliability (4 items), vividness
(4 items), and variety (4 items). Network characteristics include economics (3 items),
interactivity (4 items), and regenerativity (4 items). All items were measured on a 5-
point Likert scale (1 = not at all to 5 = perfectly). The measurements were examined
for face validity and refined by two consumer researchers. A preliminary online survey
was conducted on 45 consumers who had experience using SMCISs. The preliminary
survey used a test-retest design on the same respondents from November 16, 2018 to
November 23, 2018 to verify reliability. Items with low reliability were deleted or
modified.

3.3 Analysis
The analysis was conducted using SPSS 24.0. Cronbach’s alpha (α) coefficients, which
indicate internal consistency, were calculated to determine the reliability of the mea-
surements composed of several items. The validity of the measurement items was
tested through exploratory factor analysis. T-tests and ANOVA were performed to
examine the differences in the evaluation of SMCIS characteristics according to socio-
demographic characteristics.
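Although the analysis was performed in SPSS 24.0, an equivalent computation can be sketched in Python for readers who wish to reproduce the reliability and mean-difference tests; the column names and the synthetic data below are placeholders, not the actual survey data.

import numpy as np
import pandas as pd
from scipy import stats

def cronbach_alpha(items: pd.DataFrame) -> float:
    # Cronbach's alpha for a set of Likert items (rows = respondents).
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Synthetic stand-in for the survey data (417 respondents, 4 reliability items
# and 4 vividness items on a 5-point scale, plus gender and income group).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 6, size=(417, 8)),
                  columns=["rel1", "rel2", "rel3", "rel4",
                           "viv1", "viv2", "viv3", "viv4"])
df["gender"] = rng.choice(["M", "F"], size=417)
df["income"] = rng.choice(["low", "middle", "high"], size=417)

print("alpha(reliability) =", cronbach_alpha(df[["rel1", "rel2", "rel3", "rel4"]]))

# Gender difference in the mean reliability score (independent-samples t-test).
rel = df[["rel1", "rel2", "rel3", "rel4"]].mean(axis=1)
t, p = stats.ttest_ind(rel[df.gender == "M"], rel[df.gender == "F"])

# Income-group difference in vividness (one-way ANOVA across the three groups).
viv = df[["viv1", "viv2", "viv3", "viv4"]].mean(axis=1)
F_stat, p_anova = stats.f_oneway(*[viv[df.income == g] for g in ("low", "middle", "high")])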

4 Result

4.1 Characteristics of Research Participants


The characteristics of research participants are shown in Table 1. Of the respondents,
44.1% and 55.9% were men and women, respectively. Regarding age, 53.5% were in
their twenties, 36.4% in their thirties, and 10.1% in their forties. Regarding education
level, 73.8% were college graduates or more. With regard to income, 32.4% of the
respondents earned 8000 yuan or more and 23.5% earned between 6000 and 8000 yuan.

4.2 Consumer Evaluation of SMCIS

Evaluation of Information and Network Characteristics. The results of the
descriptive statistics of the evaluation of information characteristics and network
characteristics are as follows. The evaluation of regenerativity was the highest at 3.76,
followed by economics at 3.72, interactivity at 3.7, vividness at 3.68, variety at 3.66
and reliability at 3.24. Specifically, in the reliability factor, the item “Information is
neutral and not distorted” showed the highest mean (3.26), while the lowest mean
(3.23) was shared by “Information contains both advantages and disadvantages of a
product or service” and “Information is reliable and realistic”. In vividness, “The
content of the information is concrete” had the highest mean at 3.74, while “The
content of the information is vivid with pictures or videos attached” had the lowest at
3.60. In variety, “Information covers a wide range of topics” had the highest mean at
3.76, while “The content of information varies” had the lowest at 3.60. In economics,
“I can find the information that I need without much effort” had the highest mean at
3.74, while “I can find the information that I need quickly” had the lowest at 3.72. The
highest mean item in interactivity was “It allows me to build a bond of sympathy with
other people” at 3.83, while “It helps me build relationships” was the lowest at 3.65.
The highest mean item in regenerativity was “The information about the provision and
sharing of information among users is updated in real-time” at 3.79, while “The
information provided is constantly being updated” was the lowest at 3.72.

Table 1. General characteristics and information source usage pattern of participants (values are Frequency (%))

Socio-demographic characteristics
Gender: Male 184 (44.1); Female 233 (55.9)
Age: 20–29 223 (53.5); 30–39 152 (36.4); 40–49 42 (10.1)
Education: Below high school graduation 7 (1.7); High school graduation 102 (24.5); College/university graduation 270 (64.7); Graduate school 38 (9.1)
Income level: <2000 yuan 36 (8.6); ≥2000 and <4000 yuan 61 (14.6); ≥4000 and <6000 yuan 87 (20.9); ≥6000 and <8000 yuan 98 (23.5); ≥8000 yuan 135 (32.4)

Information source usage pattern
Internet usage time per day: Less than an hour 10 (2.4); Less than 1 to 2 h 58 (13.9); Less than 2 to 3 h 104 (24.9); Less than 3 to 4 h 83 (19.9); Less than 4 to 5 h 48 (11.5); More than 5 h 114 (27.3)
Most commonly used source of information: Advertising, a salesperson 5 (1.2); Newspapers, magazines, and publications 13 (3.1); Family, friends, colleagues and neighbors 46 (11.0); Corporate homepage 105 (25.2); Government agencies and organizations 37 (8.9); Social media-based information site 211 (50.6)
The average usage of SMCIS per week: 1–2 times 18 (4.3); 3–4 times 81 (19.4); 5–6 times 96 (23.0); 7–8 times 55 (13.2); 8–9 times 33 (7.9); More than 10 times 134 (32.1)

Evaluation of SMCIS According to Personal Characteristics. T-tests and analysis of
variance were conducted to find out how the evaluation of SMCIS differed depending
on individual characteristics, as shown in Table 2. As a result of the t-tests and analysis
of variance on the evaluation of each characteristic of SMCIS according to gender, age,
education, and income level, there were significant differences in the evaluation of
SMCIS according to gender and income level.
The gender-specific differences indicated that the assessment of reliability was 3.13 for
males and 3.32 for females (p < .10). In terms of differences by income level, the mean
of vividness for the low-income group was 3.46, and the mean for the middle-income
group was 3.79. The middle-income group rated the vividness of information
significantly higher than the low-income group at the 5% significance level.

Table 2. Evaluation of SMCIS according to personal characteristics


Socio-demographic characteristics n Evaluation of information characteristics
Reliability Vividness Variety
M SD M SD M SD
Gender Male 184 3.13 1.15 3.74 0.97 3.61 0.98
Female 233 3.32 1.14 3.62 1.00 3.71 0.99
t −1.699† 1.229 −0.97
Age 20–29 223 3.23 0.08 3.64 0.07 3.68 0.07
30–39 152 3.28 0.09 3.67 0.08 3.68 0.08
40–49 42 3.13 0.18 3.85 0.15 3.51 0.15
F 0.294 0.730 0.607
Education ≤ High school 109 3.22 1.20 3.72 0.93 3.67 0.95
≥ College/university graduation 308 3.25 1.13 3.66 1.01 3.66 1.00
t 0.249 −0.527 −0.124
Monthly income Low income (<4000 yuan) 97 3.23 1.20 3.46 1.06 3.76 0.99
Middle income (4000–8000 yuan) 185 3.14 1.16 3.79 0.95 3.62 1.01
High income (≥8000 yuan) 135 3.38 1.09 3.68 0.96 3.65 0.94
F 1.723 3.610* 0.664
Socio-demographic characteristics n Evaluation of network characteristics
Economics Interactivity Regenerativity
M SD M SD M SD
Gender Male 184 3.76 1.03 3.68 1.04 3.70 1.00
Female 233 3.69 1.01 3.72 0.93 3.80 0.96
t 0.730 −0.400 −1.080
Age 20–29 223 3.64 0.07 3.74 0.07 3.73 0.07
30–39 152 3.81 0.08 3.72 0.08 3.80 0.08
40–49 42 3.85 0.16 3.43 0.15 3.76 0.15
F 1.645 1.750 0.254
Education ≤ High school 109 3.61 1.00 3.68 1.01 3.72 0.98
≥ College/university graduation 308 3.76 1.02 3.71 0.97 3.77 0.98
t 1.372 0.215 0.438
Monthly income Low income (<4000 yuan) 97 3.74 1.02 3.54 0.97 3.64 0.99
Middle income (4000–8000 yuan) 185 3.79 1.00 3.31 0.96 3.79 0.99
High income (≥8000 yuan) 135 3.71 1.05 3.67 1.00 3.80 0.95
F 0.026 2.584 0.899

† p < .10, * p < .05

Evaluation of SMCIS According to Usage Patterns. T-tests and analysis of variance
were conducted to analyze how the evaluation of each characteristic of SMCIS differed
depending on the participants’ usage patterns: internet usage time per day, most
commonly used source of information, and the average number of SMCIS uses per
week. As shown in Table 3, the results showed that there were significant differences
in the evaluation of SMCIS depending on internet usage time per day and the average
frequency of SMCIS use per week.
Regarding the time spent on the Internet per day, the mean of reliability for groups
with an average Internet usage time of more than three hours per day was 3.11, and the
mean for those with less than three hours was 3.43. Groups with an average daily
Internet usage time of less than three hours rated reliability significantly higher than
those with more than three hours, at the 1% significance level. For the average
frequency of weekly SMCIS use, the mean evaluation of vividness for the group that
used SMCIS more than once a day was 3.76, and the mean for the group that used
SMCIS less than once a day was 3.58. Groups that use SMCIS more than once a day
rated vividness significantly higher than groups that use it less than once a day, at the
10% significance level. There were no significant differences in the evaluation of
SMCIS based on the most frequently used source of information.

Table 3. Evaluation of SMCIS according to usage patterns


Information Source Usage Pattern n Evaluation of information characteristics
Reliability Vividness Variety
M SD M SD M SD
Internet usage time per day <3 h 172 3.43 1.13 3.65 1.03 3.70 0.98
≥3 h 245 3.10 1.15 3.69 0.96 3.63 0.98
t −2.822** 0.439 −0.783
Most commonly used Information source SMCIS 211 3.24 1.17 3.69 0.94 3.60 0.97
Others 206 3.24 1.13 3.66 1.03 3.73 0.99
t −0.062 0.304 −1.287
Usage of SMCIS per week <7 times 195 3.30 1.11 3.58 1.01 3.68 0.99
≥7 times 222 3.18 1.18 3.76 0.96 3.65 0.98
t −1.076 1.783† −0.221
Information source usage pattern n Evaluation of network characteristics
Economics Interactivity Regenerativity
M SD M SD M SD
Internet usage time per day <3 h 172 3.76 1.00 3.62 0.95 3.77 0.94
≥3 h 245 3.70 1.02 3.76 1.00 3.75 1.00
t −0.674 1.425 −0.270
Most commonly used Information source SMCIS 211 3.70 1.00 3.63 1.00 3.76 0.98
Others 206 3.75 1.04 3.78 0.96 3.75 0.97
t −0.542 −1.537 0.061
Usage of SMCIS per week <7 times 195 3.72 1.03 3.76 1.00 3.72 0.99
≥7 times 222 3.72 1.01 3.65 0.96 3.79 0.97
t −0.026 −1.234 0.729

† p < .10, ** p < .01

5 Conclusions

The purpose of this study was to analyze the characteristics of SMCIS and to identify
how consumers evaluate these characteristics, with the aim of improving understanding
of social media as a newly emerging consumer information environment. In
addition, we studied the difference in the evaluation of SMCIS according to the indi-
vidual characteristics and usage patterns of SMCIS. The evaluation of information
characteristics consisted of three subfactors: reliability, vividness, and variety, while
the evaluation of network characteristics consisted of three subfactors: economics,
interactivity, and regenerativity.
The results of the study are as follows. First, among SMCIS users, there were more
females than males, and the largest group consisted of college graduates in their 20s
with an average monthly income of more than 8000 yuan. According to the information
source usage pattern, most people spent more than five hours a day on the Internet, and
they often use online product reviews on SMCIS and community sites as sources of
information.
Second, the evaluation of the information and network characteristics of SMCIS
showed the highest average of regenerativity, followed by economics, interactivity,
vividness, variety, and reliability.
Third, when it comes to personal characteristics, it was found that females rated
reliability higher than males, and the middle-income group rated vividness higher
than the low-income group. In terms of usage patterns, groups with an average
Internet usage time of less than three hours per day rated the reliability higher than
those with more than three hours per day, and those who use SMCIS more than once
per day rated the vividness higher than those who use it less than once per day.

5.1 Implications and Future Studies


The implications of this study are as follows:
First, it was found that consumers evaluated SMCIS most positively in terms of the
regenerativity of information. This suggests that for SMCIS to flourish, systems
and technologies that enable consumers to easily and quickly create and share
information anytime, anywhere must be continuously improved and supported.
Second, consumers rated the reliability of SMCIS the lowest. In addition, it
appears that SMCIS is not trusted completely by Internet-intensive groups with rela-
tively high access to SMCIS. Because the information on social media is based
mostly on users’ subjective opinions, it is difficult to determine its reliability.
Therefore, when a SMCIS information provider publishes new information, it is
recommended to present information about the provider in order to increase the
perceived reliability of the information itself and help consumers make purchasing
decisions.
This study attempted to investigate consumers’ evaluation of the utility of SMCIS
through the evaluation of information characteristics and the evaluation of network
characteristics. This study provided implications that enhance our understanding of the
new consumer information environment and explored the elements that social media
needs to function as a useful information channel for consumers.

Future research proposed by this study includes the following:
First, SMCIS offers a variety of possibilities for consumer participation, which
increases the value of information. The scope of the usefulness and value of the
information source could be expanded if consumers participate more actively in
generating, sharing, and elaborating information on SMCIS. Therefore, future studies
should comprehensively consider the various factors influencing consumer participa-
tion in generating, sharing, and elaborating information. Second, it is expected that
conducting experimental studies over time on the experience of using SMCISs will
provide a more accurate understanding of the factors affecting consumers’ information
assessments. Third, the usefulness of information acquired from SMCIS should be
investigated by product category to derive more practical implications.

References
1. YuJin, N., KyungJa, K.: Consumer confusion about information channels and information
contents in a multichannel environment. J. Korean Soc. Liv. Sci. 22(3), 455–471 (2013)
2. Bickart, B., Schindler, R.M.: Internet forums as influential sources of consumer information.
J. Interact. Mark. 15(3), 31–40 (2001)
3. TanWoo, P., KyungRyul, L.: Accessing social networking sites and accessing social
networking sites. Advert. Res. 100, 172–224 (2006)
4. JiSook, K., HyukKi, K.: A study on the influence of online oral information characteristics
on information acceptance and re-decision intention: focused on recipient’s expertise.
J. Korean Ind. Inf. Soc. 21(6), 81–93 (2016)
5. SangHyun, L., YongGil, J.: Influence of online oral characteristics on trust, oral acceptance
and purchase intent. J. Korean Cont. Soc. 16(9), 545–559 (2016)
6. WonWoo, J., SunJin, H., DongDong, C.: The impact of online community information
characteristics and pursuit benefits on old-fashionedness. J. Korean Assoc.Phys. Beauty Arts
16, 143–159 (2015)
7. JungMi, P., SungJin, H.: The influence of information characteristics of cosmetic blogs on
trust and oral effect of oral process. Korean Gastroenterol. 62(2), 13–25 (2012)
8. Elliott, K.M.: Understanding consumer-to-consumer influence on the web. Doctoral
Dissertation. Duke University (2002)
9. Chiou, J., Cheng, C.: Should a company have message boards on its web sites? J. Interact.
Mark. 17(3), 50–61 (2003)
10. TaeMin, L., EunYoung, L.: A study on the effects of perceived risk and perceived benefits
on the intent of mobile commerce. Asia Pac. J. Inf. Syst. 15(2), 1–21 (2005)
11. SangJo, K., SunMi, J.: Effect of the characteristics of the sender of the information on social
networking sites and the value of the information on the users’ intention to visit the
restaurant. Internet e-Commer. 14(6), 126–127 (2014)
Hardware-Software Implementation
of a McEliece Cryptosystem
for Post-quantum Cryptography

Mariano López-García1(&) and Enrique Cantó-Navarro2


1
Universitat Politècnica de Catalunya, 08800 Vilanova I la Geltrú, Spain
mariano.lopez@upc.edu
2
Universidad Rovira i Virgili, 43007 Tarragona, Spain

Abstract. This paper shows the implementation on FPGA of a McEliece
cryptosystem, which meets the security recommendations given by the
European Telecommunications Standards Institute (ETSI) for the next generation of
quantum-resistant cryptosystems. The proposed implementation provides more
than 128 bits of quantum security using a public key of 2,097,152 bytes. The
proposed system is based on a hardware/software implementation that uses an
ARM Cortex-A53 core connected to a coprocessor through an AXI4-Lite
interface. The complete system was implemented on a Xilinx Zynq UltraScale+
and it is able to decipher texts of 8192-bit length in 47.39 ms.

Keywords: McEliece · Post-quantum cryptography · FPGA · Hardware/software co-design · Embedded devices

1 Introduction

Robert J. McEliece proposed in 1978 a code-based public-key cryptosystem whose
security relies on a decoding problem proven to be NP-complete [1, 2]. Despite its
promising properties, its use in the field of embedded systems was very limited, due to
the memory required to store both private and public keys. This was, initially, a clear
disadvantage against other cryptosystems based, for instance, on RSA or elliptic
curves. Nowadays, with the advance of the microelectronic industry, most commercial
brands provide embedded platforms that already include large memories, allowing its
implementation with no restrictions.
The McEliece cryptosystem can achieve the same security level against attacks per-
formed on classic computers as those based on RSA or elliptic curves; moreover,
it is able to perform encryption and decryption 25 and 2 times faster than RSA,
respectively [3]. In fact, it is considered a secure cryptosystem against quantum attacks.
Indeed, the National Institute of Standards and Technology (NIST) included McEliece
among the public-key cryptosystems listed in the document “Report on Post-Quantum
Cryptography” [4]. On the other hand, the European Telecommunications Standards
Institute (ETSI) issued a report that not only includes McEliece in the list of quantum-
safe cryptosystems, but also recommends for this algorithm the use of public
keys of 1,046,739 bytes to obtain 131 bits of quantum security [3].


Similarly, the European project “PQCrypto: Post-quantum cryptography for long-term
security” also recommends, for public key encryption, the use of McEliece as a
quantum-resistant cryptosystem [5].
The original proposal of McEliece was based on binary Goppa codes, although the
scheme can be extended to any family of linear codes. In order to decrease the size of
the keys, several proposals based on other codes were presented in the past. However,
until now there is no clear evidence that, using other codes, it is possible to increase the
security of this scheme.
Obtaining the public key of a McEliece cryptosystem consists in choosing two
parameters: n and t. These two integers (where n = 2^m is much bigger than t) are used
to create the generator matrix G, of dimension k × n, of a binary irreducible Goppa code
(k = n − m·t). The parameter t is the number of errors that can be corrected by the
code, n is the length of the cypher text, and k is the size of the block to be encrypted
(message). Matrix G is hidden by using a permutation matrix P of dimension
n × n and a substitution matrix S of dimension k × k. Both matrices are chosen at
random in such a way that the public key Gpub is obtained as Gpub = S·G·P. On the
other hand, the matrices G, S and P form the private key.
One of the weaknesses of the original McEliece cryptosystem is that it fails when
encrypting messages that have some known relationship between them [6]. However,
such a problem can be fixed by scrambling or randomizing the messages in order to
eliminate their dependency [7]. Though it is not easy to choose the best parameters for
n and t, Bernstein et al. [8] proposed concrete parameters for a variant of the McEliece
cryptosystem called CCA2-secure, which was attacked applying the list decoding algo-
rithm and using a classic computer based on a 2.4 GHz Core 2 Quad CPU. From the
obtained results, the authors conclude that for a non-quantum security level of 256 bits,
Goppa codes of length n = 6624, t = 115 and k = 5129 should be used, which leads to
a public key size of 958,481 bytes (for the CCA2-secure variant of McEliece, the
public key can be reduced to k × (n − k) bits instead of n × k bits).
In the past, there have been several proposals for the implementation of McEliece
in hardware. In [9] this cryptosystem was implemented on both an 8-bit AVR
microcontroller and a Xilinx Spartan-3 FPGA. In this proposal, the matrix S is not
stored and it is created using a PRNG (Pseudorandom Number Generator). The
implementation achieved a maximum throughput for encryption of 1,626,517 bits/s
(calculated by extrapolation on a DDR2 memory) and 3,899 bits/s for the FPGA and
8-bit AVR implementations, respectively. In the implementation proposed by Bernstein
in [10], performed on an Intel Core 2, the best rate of cycles per byte obtained was
2260 for decryption with n = 2048 and t = 40. Other authors, such as Von Maurich and
Güneysu [11], proposed a lightweight implementation of McEliece using quasi-
cyclic moderate-density parity-check codes. Their system was implemented on a Virtex
6 FPGA using a non-quantum security level of 80 bits. Such a system was able to encrypt
and decrypt an input block in 2.2 ms and 13.4 ms, respectively. In [12], Ghosh et al.
presented a hardware-software (Hw/Sw) co-design implemented on a Spartan 3 FPGA
that optimizes speed and area. The authors claimed that their solution is faster and smaller
(in terms of area) than existing designs, taking less than 100k clock cycles for
decryption. In fact, it is difficult to perform a fair comparison between all these pub-
lications, since the goals to be optimized are very different, as well as the performance

of the embedded platforms or the security level (quantum or non-quantum) used in their
implementations.
This paper presents the implementation of a complete McEliece cryptosystem on an
embedded device with dedicated hardware. The security level is in accordance with
the recommendations given by ETSI for post-quantum resistant cryptosystems. The
implementation, which uses as parameters n = 2^m = 8192, m = 13, t = 315 and
k = n − m·t = 4097, is based on a hardware-software co-design that includes an ARM
Cortex-A53 microprocessor and reconfigurable hardware. The overall system is
integrated on a Xilinx Zynq UltraScale+ FPGA. The encryption and decryption pro-
cesses are carried out in 1.55 ms (throughput of 5.28 Mbits/s) and 47.39 ms
(throughput of 86.45 kbits/s), respectively.
This paper is organized as follows. Section 2 reviews the basic theory of the
McEliece cryptosystem. Section 3 presents the hardware-software partitioning that
was performed. Section 4 describes the internal structure of the McEliece IP block
implemented in hardware. Section 5 shows the experimental results, and finally con-
clusions are presented.

2 McEliece Cryptosystem Review

The fundamentals of the McEliece cryptosystem are well documented in [1, 13, 15], so
they will only be briefly reviewed here, as needed to understand the internal structure of
the implemented coprocessor and how the software-hardware partitioning is performed.
As in the original proposal, the proposed implementation of McEliece is based on a
binary irreducible Goppa code. Thus, operations (sums and products) performed with
binary matrices are computed over F_2, while the coefficients of a Goppa polynomial are
defined in F_2^m.

2.1 Key Generation


The key generation consists in choosing the values of n and t; in our case, this choice is
made in compliance with the security recommendations given by ETSI for
post-quantum cryptosystems. As mentioned before, such values are n = 2^m = 8192
and t = 315. Then, the following matrices are generated:
• G: a k × n generator matrix of a binary irreducible Goppa code C that has minimum
distance d ≥ 2t + 1, with k = n − m·t = 4097.
• S: a k × k binary non-singular matrix over F_2 chosen at random.
• P: an n × n permutation matrix over F_2 chosen at random, with exactly one 1 in
every row and column, and the rest of the entries 0.
• Gpub: a k × n matrix obtained by the product S·G·P.
The public key is then defined by the pair (Gpub, t). It is important to remark that
the permutation matrix P is chosen so as to create a generator matrix G of the form
G = (J^T | I_{n−k}), where I_{n−k} is the identity matrix. See [14] for a wider explanation.
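The randomized part of this key generation can be sketched in software as follows; the Goppa generator matrix G is assumed to be given (its construction from the Goppa polynomial is omitted), the helper names are illustrative, and a real implementation of this size would use bit-packed arithmetic rather than dense integer matrices.

import numpy as np

def gf2_rank(mat):
    # Rank of a binary matrix over F_2 via Gaussian elimination.
    a = (mat % 2).astype(np.uint8)
    rows, cols = a.shape
    rank = 0
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if a[r, col]), None)
        if pivot is None:
            continue
        a[[rank, pivot]] = a[[pivot, rank]]
        for r in range(rows):
            if r != rank and a[r, col]:
                a[r] ^= a[rank]
        rank += 1
    return rank

def keygen_from_G(G, rng=np.random.default_rng()):
    # Given the k x n generator matrix G of the Goppa code, draw a random
    # non-singular S (rejection sampling) and a random permutation matrix P,
    # and form Gpub = S * G * P over F_2.
    k, n = G.shape
    while True:
        S = rng.integers(0, 2, size=(k, k), dtype=np.uint8)
        if gf2_rank(S) == k:
            break
    P = np.eye(n, dtype=np.uint8)[rng.permutation(n)]
    Gpub = (S.astype(int) @ G.astype(int) @ P.astype(int)) % 2
    return S, P, Gpub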

2.2 Encryption Process


The message m to be encrypted is k bits in length, and the cypher text is created as follows:
• Choose a random z ∈ F_2^n with the property that z has t entries different from zero and
n − t entries equal to zero.
• Compute the encrypted message r as r = m·Gpub + z, in which the vector z is
interpreted as the error.
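A minimal software sketch of this encryption step is given below, assuming Gpub is available as a dense 0/1 matrix; a practical implementation would operate on packed machine words, but the arithmetic is the same.

import numpy as np

def encrypt(m, G_pub, t, rng=np.random.default_rng()):
    # r = m*Gpub + z over F_2, where z has exactly t nonzero entries.
    # m: length-k 0/1 vector; G_pub: k x n 0/1 matrix (k = 4097, n = 8192,
    # t = 315 in this design).
    k, n = G_pub.shape
    c = (m.astype(int) @ G_pub.astype(int)) % 2      # codeword m * Gpub
    z = np.zeros(n, dtype=int)
    z[rng.choice(n, size=t, replace=False)] = 1      # random weight-t error vector
    return (c + z) % 2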

2.3 Decryption Process


The cipher-text decryption is performed using the private key (S, D, P), in which S and
P are the matrices generated during key generation, and D is an algorithm for decoding
the irreducible Goppa code C. The original message is obtained following these steps:
• Compute the string y = r·P^-1 = (m·Gpub + e)·P^-1 = m·S·G + e·P^-1.
• As y is a codeword of C (with at most t errors), use the decoding algorithm D to find
y′ as D(y) = D(m·S·G + e·P^-1) = m·S·G = y′. This decoding is performed using the
Patterson algorithm.
• It is observed that y′ = m(SG) = (mS)G = m′G, where m′ = mS is another mes-
sage of the same length. Therefore, y′ is the codeword associated with the message m′.
• As P was chosen to create a matrix G of the form G = (J^T | I_{n−k}), the last
n − k coordinates of y′ are the message m′ (truncation).
• Once we have m′, we multiply by S^-1 to obtain m = m′·S^-1 = m·S·S^-1.
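The matrix-level steps of decryption (everything except the decoder D) can be sketched as follows; decode_goppa is a placeholder for the Patterson decoder described next, S_inv is the precomputed S^-1 over F_2, and the position of the identity block used for the truncation is an assumption made for illustration.

import numpy as np

def decrypt(r, S_inv, P, decode_goppa):
    # Matrix-level steps of decryption; decode_goppa(y) stands in for the
    # Patterson decoder and returns the corrected codeword y'.
    k = S_inv.shape[0]
    y = (r.astype(int) @ P.T.astype(int)) % 2     # y = r * P^-1 (P^-1 = P^T for a permutation)
    y_corr = np.asarray(decode_goppa(y), dtype=int)
    m_prime = y_corr[-k:]                         # truncation: coordinates of the identity
                                                  # block of G (assumed to be the last k here)
    return (m_prime @ S_inv.astype(int)) % 2      # m = m' * S^-1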

Patterson Algorithm
The Patterson algorithm is used to compute the error-locator polynomial σ(Z) of a
codeword with errors. The polynomial σ(Z) of a vector y satisfies the property that
σ(a_i) = 0 if and only if y has an error in the i-th position. The algorithm is based on the
knowledge of the syndrome Sy(Z) of the codeword, and it proceeds as follows:
1. Compute the syndrome Sy(Z).
2. Compute the inverse T(Z) = Sy(Z)^-1.
3. Compute the square root R(Z) = sqrt(T(Z) + Z).
4. Solve the equation a(Z) = b(Z)·R(Z).
5. Compute the error-locator polynomial σ(Z) = a^2(Z) + Z·b^2(Z).
6. Compute the error vector ε from σ(Z).
7. Correct the codeword y′ = y + ε.
The inverse T(Z) and the resolution of the equation a(Z) = b(Z)·R(Z) are computed
using the Extended Euclidean Algorithm.

Extended Euclidean Algorithm


The Extended Euclidean algorithm computes the greatest common divisor of two
polynomials a(Z) and b(Z), along with two additional polynomials λ(Z) and μ(Z)
satisfying:

λ(Z)·a(Z) + μ(Z)·b(Z) = gcd(a(Z), b(Z))   (1)

The two initial polynomials have deg(b) ≥ deg(a). The algorithm works as fol-
lows. First, b(Z) is divided by a(Z), obtaining a quotient q(Z) and a remainder r(Z) that
satisfy b(Z) = a(Z)q(Z) + r(Z) and deg(r) < deg(a). Since gcd(a(Z), b(Z)) = gcd(r(Z),
a(Z)), we can reduce the problem of finding gcd(a(Z), b(Z)) to determining gcd(r(Z),
a(Z)), where the degree of r(Z) is smaller than the degree of b(Z). The process is
repeated until one of the arguments is zero.

Extended Euclidean Algorithm

Input: (r0, r1, λ0, λ1, μ0, μ1) := (a, b, 1, 0, 0, 1); i := 1
Output: (g, λ, μ)

while ri ≠ 0 do
    qi := quotient of ri-1 on division by ri;
    (ri+1, λi+1, μi+1) := (ri-1, λi-1, μi-1) - qi · (ri, λi, μi);
    i := i + 1;
end while;
return (g, λ, μ) := (ri-1, λi-1, μi-1);

3 Hardware-Software Partitioning

The architecture of a system based on hardware-software co-design includes a
microprocessor, whose purpose is to manage the organized execution of the main
program, the communication with input/output devices and the flow of infor-
mation through the system’s buses. The specific subsystems designed on reconfigurable
hardware, known as coprocessors, are connected to the microprocessor by means of the
bus architecture. Due to their specific design, coprocessors are able to resolve
some specific functionality, regarding some part of the application or algorithm,
efficiently and in a very short time. However, they offer low flexibility, in the sense
that they are only able to perform the task for which they have been designed.
The concept of partitioning consists in determining which parts of an algorithm are
best suited for execution in software (microprocessor) or for synthesis in dedicated
hardware (coprocessors), in order to meet a set of constraints and goals.

Usually, the parameter to be optimized is the execution time, while the main constraint
is the area needed by the coprocessor. Additionally, the success of the
hardware-software co-design depends not only on the active cooperation between the
software and hardware modules, but also on the communication protocol needed for
their proper operation.
Table 1 shows the execution time when encrypting and decrypting a block of 512
characters chosen at random. The table also includes the specific time needed by each
of the steps described in Sect. 2. These results were obtained by programming the
overall McEliece cryptosystem in C.

Table 1. Execution time for each step of the McEliece cryptosystem (encryption and
decryption). Results are given in clock cycles and milliseconds using an ARM Cortex-A53
microprocessor clocked at 1.1 GHz. The steps that will be implemented in hardware are the
inverse syndrome, the equation a(Z) = b(Z)·R(Z), and the error vector ε.

Function  No. cycles  Time (ms)
Encryption
Compute m·Gpub  166.53 × 10^4 × TCLK  1.51 ms
Add error: r = m·Gpub + z  497.42 × 10^2 × TCLK  0.045 ms
Total time for encryption:  166.65 × 10^4 × TCLK  1.515 ms
Decryption
Compute r·P^-1  186.95 × 10^3 × TCLK  0.169 ms
Syndrome polynomial Sy(Z)  350.85 × 10^4 × TCLK  3.189 ms
Inverse syndrome T(Z) = Sy(Z)^-1  347.94 × 10^6 × TCLK  316.31 ms
Square root R(Z) = sqrt(T(Z) + Z)  124.14 × 10^5 × TCLK  11.28 ms
Equation a(Z) = b(Z)·R(Z)  184.45 × 10^6 × TCLK  167.68 ms
Compute σ(Z) = a^2(Z) + Z·b^2(Z)  112.56 × 10^5 × TCLK  10.23 ms
Compute error vector ε  537.72 × 10^6 × TCLK  488.84 ms
Correct the codeword y′ = y + ε  189.02 × 10^3 × TCLK  0.171 ms
Truncation of y′ to obtain m′  873.31 × 10^2 × TCLK  0.079 ms
Compute m = m′·S^-1  461.34 × 10^4 × TCLK  4.19 ms
Total time for decryption:  110.23 × 10^7 × TCLK  1002.1 ms

As can be seen, the calculation of the inverse syndrome, the polynomial a(Z) and
the error vector ε takes about 98% (972.8 ms, or 1070 × 10^6 clock cycles) of the
total time. Thus, these three steps are the candidates to be implemented in hardware.
On the other hand, the partitioning can be performed at different levels, ranging
from simple operations or instructions (fine granularity) to complex processes or
routines (coarse granularity). After analyzing the most time-consuming steps, it turns
out that the optimal partitioning mixes different granularities. Thus, a single
coprocessor is designed, which internally includes several subsystems implementing
some basic operations. The coprocessor is able to perform operations on Goppa
polynomials. The proper sequence in which such operations should be per-
formed depends on the step being executed by the coprocessor. Such a
sequence is generated by a control unit that manages several internal signals and,
conveniently programmed, establishes the order in which the subsystems are activated.
The internal subsystems are also designed in a similar way, each of them including its
corresponding control unit.

4 Coprocessor Design (McEliece IP)

Figure 1 shows the internal architecture of the McEliece IP block. The IP is connected
as a slave peripheral to the AXI4 bus, which is used during the configuration process
performed by the microprocessor. Additionally, the IP is attached to an AXI4-Stream
interface used for retrieving and storing polynomial data from/to memory. In
order to improve the performance, a couple of buffers are included. Such buffers
temporarily store the input/output Goppa polynomials (input operands) while the arithmetic
unit is busy processing data.

Fig. 1. Internal architecture of the McEliece IP: an AXI slave interface to the AXI4 bus, slave/master AXI4-Stream interfaces with Goppa reader/writer buffers, and a 316 × 13-bit Goppa arithmetic unit containing the XOR-Mod multiplier/divisor and error-location blocks.

The coprocessor is designed to carry out seven different operations using Goppa
polynomials (see Table 2). Such operations are indicated by writing a specific code in
the input registers, which are used for communication between the microprocessor and
the IP block.

Table 2. Codification table and Goppa polynomial operations performed by the arithmetic unit.
Note: Gp is the Goppa polynomial generator used to create the private and public keys.
Codification Operation
0000 op1 * op2
0001 (op1 * op2) mod Gp
0010 (op1 * op2) ⊕ op3
0011 ((op1 * op2) ⊕ op3) mod Gp
0100 op1/op2
1000 Error location (op1)
1001 Not used
1110 Set Gp

All these operations can be performed by designing only four basic operations: mul-
tiplication, division, addition (XOR) and modulus.
The Goppa multiplier is presented in Fig. 2. Note that its design includes 316
(t + 1, with t = 315) internal multipliers in F_2^m that are able to perform in parallel the 316
multiplications between the coefficients that form the two input polynomials. An additional
block is added to perform the XOR operation between this product and a third
polynomial when necessary. This additional operation is very common in the Extended
Euclidean Algorithm.
As shown in Fig. 3(a), a Goppa divider can be designed by including a Goppa
multiplier and a divider in F_2^m. Finally, the error-location module can be implemented
by simply using a multiplier in F_2^m (see Fig. 3(b)).
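As a software model of what the GF_Mult, Goppa multiplier and error-location blocks compute, the sketch below implements multiplication in F_2^m (m = 13), a coefficient-wise Goppa polynomial product, and error location by evaluating σ(Z) over the code support; the field polynomial chosen here is an assumption, since the paper does not state which irreducible polynomial the hardware uses.

M = 13
FIELD_POLY = 0x201B   # x^13 + x^4 + x^3 + x + 1 (assumed; not stated in the paper)

def gf_mult(a, b):
    # Bit-serial multiplication of two GF(2^13) elements (13-bit integers),
    # mirroring what each GF_Mult block computes in one operation.
    res = 0
    while b:
        if b & 1:
            res ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):          # reduce modulo the field polynomial
            a ^= FIELD_POLY
    return res

def goppa_mult(p1, p2):
    # Schoolbook product of two Goppa polynomials given as coefficient lists
    # (index = degree); the IP performs these coefficient products in parallel.
    out = [0] * (len(p1) + len(p2) - 1)
    for i, c1 in enumerate(p1):
        for j, c2 in enumerate(p2):
            out[i + j] ^= gf_mult(c1, c2)   # addition in GF(2^m) is XOR
    return out

def error_locations(sigma, support):
    # Model of the error-location block: position i is in error iff sigma(a_i) = 0,
    # where support lists the field elements a_i of the code support.
    def eval_poly(poly, x):                 # Horner evaluation in GF(2^13)
        acc = 0
        for coeff in reversed(poly):
            acc = gf_mult(acc, x) ^ coeff
        return acc
    return [i for i, a_i in enumerate(support) if eval_poly(sigma, a_i) == 0]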

Fig. 2. Internal architecture of the Goppa multiplier: 316 parallel GF_Mult blocks (op1, op2 → res) driven by a control unit, with an XOR stage combining the product of P1(Z) and P2(Z) with P3(Z) to produce O(Z).

Fig. 3. Internal architecture of: (a) the Goppa XOR-Mod multiplier/divisor, built from a GF_Div block and a Goppa multiplier with its control unit; (b) the error-location block, built from a GF_Mult block and a control unit that evaluates σ(Z) to produce the error vector ε.

5 Experimental Results

The overall embedded system was implemented on a Xilinx Zynq UltraScale+ ZCU102
evaluation board, which includes an MPSoC device XCZU9EG-2FFVB1156.
Although the MPSoC includes four ARM Cortex-A53 cores clocked at 1.1 GHz, only
one of them is used in the final implementation. Additionally, a 4 GB DDR4 64-bit
memory is available, so that there is no limitation on the size of the keys to be stored.
All the hardware designs were described in VHDL and implemented using the
software tools provided by Xilinx. The results are shown in Table 3.
Table 4 shows the execution times of the overall hardware-software
co-design proposed in this paper. Results are obtained when encrypting a block of 4097
bits (512 characters). Afterwards, the cypher text of 8192 bits (1024 characters) is
decrypted, recovering the original message. The table also shows the acceleration factor,
provided by the McEliece IP block, for the steps implemented in hardware with respect
to the software version. In order to meet the maximum frequency requirements imposed
by the critical path, the McEliece IP block is clocked at 250 MHz. Note that the inverse
of the syndrome is processed in 15.67 ms, which means that it is about 20.18 times faster
than the software execution. A similar figure is obtained when calculating the poly-
nomial a(Z). This polynomial is computed by the coprocessor in 7.38 ms, leading to an
acceleration factor of 22.72. The error vector calculation offers the best result. This
vector is obtained in 5.21 ms, against the 488.84 ms needed by the microprocessor
(acceleration factor of 93.82). Thus, a complete decryption can be performed in
approximately 48 ms.

Table 3. Resources used by the McEliece IP Block when implemented on Zynq UltraScale+™
MPSoC family.
Used Available Utilization %
LUTs 77357 274080 28.22
FFs 41338 548160 7.54

Table 4. Comparison of execution times for both: software-only implementation (ARM Cortex
clocked at 1.1 GHz) and hardware-software co-design including the McEliece IP (coprocessor)
clocked at 250 MHz.

Function  Time (ms) (Only software)  Time (ms) (Hardware-software)  Ratio: acceleration factor
Compute m·Gpub  1.51 ms  1.51 ms (sw)  –
Add error: r = m·Gpub + z  0.045 ms  0.045 ms (sw)  –
Compute r·P^-1  0.169 ms  0.169 ms (sw)  –
Syndrome polynomial Sy(Z)  3.189 ms  3.189 ms (sw)  –
Inverse syndrome T(Z) = Sy(Z)^-1  316.31 ms  15.67 ms (hw)  20.18
Square root R(Z) = sqrt(T(Z) + Z)  11.28 ms  11.28 ms (sw)  –
Equation a(Z) = b(Z)·R(Z)  167.68 ms  7.38 ms (hw)  22.72
Compute σ(Z) = a^2(Z) + Z·b^2(Z)  10.23 ms  10.23 ms (sw)  –
Compute error vector ε  488.84 ms  5.21 ms (hw)  93.82
Correct the codeword y′ = y + ε  0.171 ms  0.171 ms (sw)  –
Truncation of y′ to obtain m′  0.079 ms  0.079 ms (sw)  –
Compute m = m′·S^-1  4.19 ms  4.19 ms (sw)  –
Total time for encryption:  1.515 ms  1.515 ms  –
Total time for decryption:  1002.1 ms  47.39 ms  21.14

Table 5 shows the throughput achieved by the software-hardware co-design when
encrypting a plain text and decrypting its corresponding cypher text. Due to the sim-
plicity of the operations involved in the process, the best result is obtained when
encrypting the original message: 5.285 Mb/s.
It is difficult to perform a fair comparison against other publications carried out in
the past, because each of them uses a different FPGA (with different resources), a
specific and limited RAM memory, higher or lower values of n, k, t (and
therefore higher or lower security levels) and a different number of coprocessors.
Furthermore, the goal of some implementations could be to obtain a lightweight ver-
sion to be embedded in hardware with limitations in area and speed. Table 6 shows
the results obtained by other authors when implementing in hardware the three steps
presented in Sect. 3 (inverse syndrome, equation a(Z) and error computation). In order
to perform a fair comparison, such results are given in clock cycles (independent of the
clock frequency) along with the values of n, k, t used in each implementation. Note that
no extrapolations can be performed, since there is no proportionality between such
values and the number of clock cycles. Furthermore, it is important to remark that the
security level given by most authors refers to non-quantum cryptography and is
based on the publication presented in [8].

Table 5. Throughput for both software-only and hardware/software implementations.


Throughput
Only software Hardware/Software
Plain text: 4097 bits (encryption) 5.285 Mb/s 5.285 Mb/s
Ciphertext: 8192 bits (decryption) 4.09 kb/s 86.45 kb/s

Table 6. Comparison of the McEliece decryption implementations.

Column order: MicroEliece [13] | Hw/Sw co-design [12] | Our design
Frequency / throughput for decryption: 80 MHz, 161,829 bit/s | 92 MHz, 1,716,66 bits/s* | 250 MHz, 86,452 bits/s
Security, (n, k, t): 80 bits**, (2048, 1751, 27) | 80 bits**, (2048, 1751, 27) | >131 bits***, (8192, 4097, 315)
Inverse syndrome T(Z) = Sy(Z)^-1: 625 cycles | 1821 cycles | 391 × 10^3 kcycles
Equation a(Z) = b(Z)·R(Z): 312 cycles | 960 cycles | 1845 × 10^3 kcycles
Compute error vector ε: 312,328 cycles | 57,356 cycles | 1302 × 10^3 kcycles

*Result estimated by us.
**Non-quantum security level obtained from [8].
***As we are using a public key size that is twice the recommendation given by ETSI [3], we
assume that the quantum security level would be higher than 131 bits.

6 Conclusions

This paper presented the implementation of a McEliece cryptosystem on a Xilinx Zynq
UltraScale+ FPGA. The structure of the system was based on a hardware-software co-
design, which consists of an ARM Cortex microprocessor and a McEliece IP block that
includes several hardware subsystems. The microprocessor acts as the master of the sys-
tem, while the IP block processes the most time-consuming steps involved in
decrypting a cypher text. The processing speed is increased on average by a factor of
21 when compared with a software-only implementation. The parameters n, k and t were
chosen in order to provide a security level in accordance with the recommendations
given by ETSI for post-quantum cryptography. In order to improve area and speed, a
faster version of the McEliece cryptosystem, particularly suited for FPGA implementa-
tions, will be chosen in future work.

Acknowledgments. This work was supported by the Ministerio de Economía y Competitividad


in the framework of the Programa Estatal de Investigación, Desarrollo e Innovación Orientada a
los Retos de la Sociedad, project TEC2015-68784-R.

References
1. McEliece, R.J.: A public key cryptosystem based on algebraic coding theory. DSN progress
report 43.44 (1978)
2. Berlekamp, E.R., McEliece, R.J.: On the inherent intractability of certain coding problems.
IEEE Trans. Inf. Theory 24(3), 384–386 (1978)
3. ETSI – European Telecommunications Standards Institute: Quantum Safe Cryptography
(QSC); Quantum-safe algorithmic framework. ETSI GR QSC 001 v1.1.1 (2016)

4. National Institute of Standards and Technology: Report on Post-Quantum Cryptography.


Internal report 8105 (2016). http://dx.doi.org/10.6028/NIST.IR.8105
5. Augot, D., et al.: Initial recommendations of long-term secure post-quantum systems.
Horizon 2020 ICT-645622. Revision 1 (2015)
6. Berson, T.: Failure of the McEliece public-key cryptosystem under message-resend and
related-message attack, pp. 213–220. Springer, Heidelberg (1997)
7. Engelbert, D., Overbeck, R., Schmidt, A.: A summary of McEliece-type cryptosystems and
their security. http://eprint.iacr.org/2006/162.ps
8. Bernstein, D.J., Lange, T., Peters, C.: Attacking and defending the McEliece cryptosystem.
In: International Workshop on Post-Quantum Cryptography, pp. 31–46 (2008)
9. Eisenbarth, T., Güneysu, T., Heyse, S., Paar, C.: MicroEliece: McEliece for embedded
devices. In: International Conference on Cryptographic Hardware and Embedded Systems -
CHES (2009)
10. Bernstein, D.J., Buchmann, J., Dahmen, E.: Post-Quantum Cryptography. Springer,
Heidelberg (2009)
11. Von Maurich, I., Güneysu, T.: Lightweight code-based cryptography: QC_MDPC McEliece
encryption on reconfigurable devices. In: Design, Automation & Test in Europe Conference
& Exhibition (DATE) (2014)
12. Ghosh, S., Delvaux, J., Uhsadel, L., Verbauwhede, I.: A speed area optimized embedded co-
processor for McEliece cryptosystem. In: IEEE 23rd International Conference on Application-
Specific Systems, Architectures and Processors (2012)
13. Heyse, S.: Code-based cryptography: implementing the McEliece scheme on reconfigurable
hardware. Master thesis, Faculty of Electrical Engineering and Information Technology,
Ruhr-University Bochum (2009)
14. Flexiprovider. http://www.flexiprovider.de/
15. Quantum-resistant cryptography. Oriol Farràs. Technical report (2017)
Design of Ride Sharing Service
“ZOUGAME” in Neighborhood
Community

Minako Matsui(B) , HongYu Chang, Satomi Manzaki, Chihiro Sato,


and Naohito Okude

Keio University Graduate School of Media Design,


4-1-1 Hiyoshi Kohoku-ku, Yokohama, Kanagawa, Japan
ak375mdr@keio.jp
http://www.kmd.keio.ac.jp/ja/

Abstract. This paper presents a ride-sharing service concept, “ZOUGAME”,
which provides a novel mobility experience with pet-like, slow-moving mobility
vehicles limited to a certain neighborhood. Users can pick up any particular
ZOUGAME vehicle circulating around the neighborhood and can enjoy chatting
with other riders during the carpool ride. The researchers aim to bond the
community together through this service. This service concept was created
through ethnographic research on how a home-visiting physical therapist uses an
e-bike in a neighborhood as a maneuverability master. The researchers validate
the possibilities that industries can contribute to and benefit from this service
through a collaborative value co-creation design process.

Keywords: Mobility as a service · City · Design thinking · Well-being · Service design

1 Introduction
With the development of technology, our daily lives have changed tremendously.
Streets, which exist in physical space, have also been significantly impacted by these
technologies, and people expect to see further changes in their forms.
The city of Los Angeles was built up around mobility vehicles [1], while Amsterdam
developed the shape of its streets around its canals. The shapes of streets are strongly
related to how people move within the cities they live in. Entering the 21st
century, with projects such as the linear Chūō Shinkansen, people can travel in faster and
more convenient ways. However, in the case of circulating around a neighborhood,
travelers might gain more value from a slower pace rather than pursuing
fast and convenient transportation; e.g., people might want to walk around
slowly and feel the streets when visiting a tourist attraction, or enjoy the beautiful
seasonal scenery when out on a spontaneous walk. In this study, the researchers
believe these low-speed moving experiences can also, in a different direction, enrich
our daily lives.

ZOUGAME is a ride-sharing service which provides people in a local community with novel mobility vehicles for use in various scenarios, e.g., going to a supermarket to buy discounted items or climbing up to high ground for sightseeing. Simultaneously, ZOUGAME accumulates memories of the routes it has traveled. Therefore, when new visitors come to the town for the first time, ZOUGAME can recommend an optimal route according to their goals. Furthermore, because ZOUGAME is a local community mobility vehicle, people who share the same destination can also get on it along the way. Through riding on ZOUGAME, people new to the town can meet new friends, and local people can share their experiences of the town. People can also greet each other when they run into other ZOUGAME vehicles on their rides.
It has been over one hundred years since the first automobile for personal transportation was made. With the exponential spread of the automotive industry, not only have our everyday lives changed, but the shapes of streets in our cities have also been transformed. Since the first car appeared, numerous changes, from people's lifestyles to their sense of value, have kept evolving. For the last 50 years the automobile has been regarded as the optimal means of transportation, but it is questionable whether it still is today. In this study, based on today's lifestyles, the researchers seek the optimal form of mobility, one that can provide people new value through the transportation experience.
In this study, the researchers analyzed records obtained through an ethnographic survey [2], derived the goals of the people involved in the service, and designed personas for the service [3]. The researchers integrated the resources possessed by each persona and designed a concept in which each person can feel value [4]. The researchers demonstrated the effectiveness of the concept by prototyping with cardboard [5]. This research is conducted in collaboration with three corporations: DENSO CORPORATION, an automotive manufacturer researching mobility as a service; YAKUJU CORPORATION, a pharmaceutical company which aims to be a pharmacy that forms a local community; and SEKISUI KINZOKU CORPORATION, a model railroad manufacturer and dealer which enriches people's lives by enhancing their individual interests.

2 Related Works
2.1 Developing Ride-Sharing Services Rooted in the Town Street
Ride-Share Services for Easy Transporting. Electric scooter sharing services have already been expanding worldwide: e.g., “Bird's electric scooters [6]”, based in LA, provide a ride-share service for people to move around the city easily and reduce LA's traffic problems, and “Telepod [7]”, based in Singapore, gives another transportation choice for people who do not want to walk in the hot weather. These scooter-sharing services provide users with easy transportation experiences in which they do not need to pedal a bicycle while sweating, and can enjoy a beautiful day in a comfortable ride. With the smartphone application, users can find the scooters' parking spots, and a ride starts simply by scanning the QR code on a scooter. Users can also pay through the application and easily return the scooters to any parking spot of the rental service.

A Delivery Robot in the Town Street. Starship Technologies, Inc. provides an autonomous delivery service which delivers food through the use of robots [8]. The robot is equipped with six wheels, a camera, sensors, a control system, telecommunication equipment, LEDs, and a battery. It moves at about the same speed as a walking human and senses pedestrians to avoid collisions; an alarm also rings if it is lifted or vandalized. Recipients can trace the position of the robot through an application. On Starship's website, we can find scenes which show how these robots co-exist with people in the street.

2.2 The Relationship Between Streets and Residents


The Trust of a City Street. The urban theorist Jane Butzner Jacobs said, “The trust of a city street is formed over time from many, many little public sidewalk contacts . . .” [9]. The formation of a street depends not only on the plan of its architecture and roads, but also on the interactions between people: visiting the coffee shop on the street corner, greeting someone walking past, talking in a grocery store, having eye contact with another person, etc. However, looking at the modern city today, we can rarely find the “medium” which facilitates connections between people and increases the trust of a city street.

Community Cats. In Japan, one solution for creating a local community is “community cats”, which can strengthen the connections among residents through taking care of cats. Feral cats in local streets have become a big issue which cannot be solved by an individual person (Watanabe 2015). On the other hand, the feral cat issue is also considered a “local environmental problem”, so local people try to solve it spontaneously. As a result, some local communities have formed in Japan which regard feral cats as a local resource and raise them as “community cats” cared for by all the local people together. “Community cats” turned an environmental problem into an adorable resource in the town. The local town becomes more comfortable and enjoyable to live in, while developing the foundation of community welfare [10].

2.3 Contributions

In this study, the researchers present a novel approach to facilitate a local community with easier transportation experiences provided by ZOUGAME vehicles, which aim to enhance and enchant users' relationship with their surrounding landscape. The researchers capture traces of people's movements in the town streets and, by accumulating this data, are able to understand and express the character of the town streets in a new form. As an academic contribution, the researchers set out the possibility of constructing and strengthening the existing social relationships among the town's people and of creating new social bonds through this service, and captured traces of people's movements in the town streets to accumulate data for extracting the character of the street.

3 Design

3.1 Concept

In this study, the researchers designed a ride-share service called “ZOUGAME”, which provides a pet-like vehicle and a fun riding experience that can also expand users' view of the world through the communication that happens in the streets during the ride to their destination.
Users can get the nearest ZOUGAME vehicle's position through the application. By tapping the ZOUGAME icon, users can ask the ZOUGAME vehicle to come to them. When the ZOUGAME vehicle arrives, users can put their packages into it and set their desired destination. On the way to the destination, other users who have the same destination can also call the ZOUGAME vehicle by sending a message to it, and the ZOUGAME will go to pick them up. In this way, users are given an opportunity to ride together, and local users can share their experiences of the town with new visitors. Accordingly, ZOUGAME accumulates memories of the routes it has traveled, so it can also navigate an optimal route for a user who just wants to enjoy the unique sights of the street.
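To make the flow just described concrete, the toy sketch below illustrates nearest-vehicle dispatch, pooling of riders who share a destination, and the accumulation of route “memories”. It is purely illustrative: the paper does not specify an implementation, and all class, field, and function names here are hypothetical.

```python
# Illustrative sketch only; not the authors' implementation.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Zougame:
    vehicle_id: str
    position: Tuple[float, float]          # coordinates in the neighborhood
    destination: Optional[str] = None      # None means the vehicle is idle
    riders: List[str] = field(default_factory=list)
    route_log: List[Tuple] = field(default_factory=list)  # accumulated "memories"

def distance(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def request_ride(fleet: List[Zougame], user_pos, user_dest, user_name) -> Zougame:
    """Prefer a vehicle already heading to the same destination (shared ride);
    otherwise dispatch the nearest idle vehicle."""
    same_dest = [v for v in fleet if v.destination == user_dest]
    idle = [v for v in fleet if v.destination is None]
    vehicle = min(same_dest or idle, key=lambda v: distance(v.position, user_pos))
    vehicle.destination = user_dest
    vehicle.riders.append(user_name)
    vehicle.route_log.append((user_pos, user_dest))   # memory for future routing
    return vehicle

fleet = [Zougame("Z1", (0.0, 0.0)), Zougame("Z2", (5.0, 5.0))]
v = request_ride(fleet, (1.0, 1.0), "supermarket", "resident A")
request_ride(fleet, (2.0, 2.0), "supermarket", "visitor B")
print(v.riders)   # ['resident A', 'visitor B'] -- the two users share one ZOUGAME
```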
Because ZOUGAME vehicles are community pets in the town, when a ZOUGAME vehicle gets a scratch or deteriorates over time, people in the town can take a photo and send it to the provider through the application, and a mechanic will come and repair it. Also, when a ZOUGAME vehicle runs out of power, much like iRobot's Roomba returning to its dock, it goes back to its charging space, located in a free space near residents' houses, where it is charged and stored.
ZOUGAME provides fun and enchanting ride experiences to the people in
the town, and in return people take care of the ZOUGAMEs. Through this,
a local community can be formed. People can enjoy the town streets and the
enhanced connections and interactions in the community (Fig. 1).

3.2 Method
ZOUGAME is based on a concept design framework, through the use of design thinking and service-dominant logic methodology. Design thinking is a service design methodology which co-creates value with customers by observing and describing in fieldwork and by making several prototypes based on the empathy found through the fieldwork. Service-dominant logic is a new marketing logic advocated by Stephen L. Vargo and Robert F. Lusch in 2004, which holds that value is created by both customers and corporations through multiple intersections and interactions between them.

Fig. 1. Concept sketch
ZOUGAME is a ride-sharing service which can enhance people's lives by making transportation easier and more enjoyable. To achieve this, the researchers extracted the actors needed in the service. An actor is an entity which has agency toward a goal and acts within a structure that works as a system or a policy. The researchers set up four actors for this service: first, the “mobility service provider”, who provides this service; second, the “local resident in the street”, who is familiar with the town street; third, the “new resident who just moved into the town”; and fourth, the “mobility maker/mechanic”, who produces the vehicles and performs maintenance.
To realize this service, the researchers need to make these four actors feel the value of the service. Thus, the researchers conducted ethnographic research to figure out the actors' goals and mental models, especially those of the “local resident in the street” and the “new resident who just moved into the town”, and observed how residents live and interact in the town streets and how they move around the streets in a standard day of life (Fig. 2).
In this ethnographic research, the researchers visited physical therapist Y and observed the way she gets around the street on her electric bicycle from 10 am to 1 pm on 6 August 2018, followed by an interview. Physical therapist Y was born in Shinagawa, Tokyo, and now works as a home-visiting physical therapist. After getting married, she still lives in the same place as before, and her work and life are both in the same neighborhood. The nursing center where she works is responsible for all the patients in Shinagawa, so she has to travel over 20 km on her electric bicycle every day.

Fig. 2. Fieldwork map

On the way to the nursing center, physical therapist Y stopped at a supermarket to purchase some things, and also made a short stop to talk with patients and colleagues whom she ran into at the parking space. Therefore, the researchers set up her goals as “wanting to use her hiatus (travel time between two places) efficiently” and “wanting to get some new information from someone she encounters along the way”. From the interview, the researchers also set up “wanting to enjoy the seasonal changes in the sights of the street while traveling” and “wanting a mobility option which can carry heavy bags and items throughout her journey” as her goals.
Based on the analysis of the ethnographic research, the researchers concretized the four actors' goals. The goals of the new resident who just moved into the town are “wanting to move around the town street easily in any season”, “wanting to make friends who can show him/her around the streets”, and “wanting to know the fun places of the streets”. The goals of the local resident in the street are “wanting to show off the street that he/she loves”, “wanting to carry all their baggage and items at the same time”, and “wanting to get new information in a short time”. The goals of the mobility service provider are “wanting to make a new service that users can enjoy” and “wanting to share information with developers to make it easier to develop a new mobility service in the future”. The goals of the mobility maker/mechanic are “wanting the vehicles he/she made to be used permanently” and “wanting to repair the vehicles as soon as possible”.
To make a concept which can give more attractive value to the actors, the researchers integrated all the elements obtained from the combined research, analysis, and ideation, and reconfirmed that the four actors can feel value within their own contexts. The researchers then designed a service ecosystem as shown in Fig. 3.
Fig. 3. Service ecosystem

Based on the concept, the researchers made some simple cardboard prototypes: a route-generating application that lets users call and have conversations with ZOUGAMEs, and a ZOUGAME vehicle equipped with a platform for loading one person's baggage, which can be connected to another ZOUGAME as a modular vehicle system (Fig. 4).

4 Results
The researchers used the simple ZOUGAME vehicle prototype to validate the effectiveness of this service. The research members played the four actors and performed a skit. The researchers set up a scene in which a father takes his daughter to her kindergarten class in the morning (Fig. 5). The feedback from the member who played the father was, “It feels good when you can have a conversation with ZOUGAME,” and “ZOUGAME moves at a comfortable pace, so I can smoothly talk to my daughter who rides on a connected ZOUGAME.” The member who played the daughter said, “It's exciting to ride on my own ZOUGAME,” and “It would be more enchanting if I could customize some elements of the ZOUGAME.” As a result, the researchers showed that ZOUGAME can provide a higher-value experience that is fun and enchanting during the ride, and demonstrated the effectiveness of the service ecosystem.

Fig. 4. Making props

Fig. 5. Concept skit



5 Conclusion
5.1 Value Co-creation with Research Partners
This study is collaborative research conducted with three corporations and the Keio University Graduate School of Media Design. Based on the methodology of design thinking, the members came together and joined hands to clearly define the form of the service. Engineer M, who works at DENSO CORPORATION, pointed out that “if ZOUGAME extends a cable for charging, it will be like a normal home appliance”, so he contributed his knowledge and skills in electrical engineering and proposed an idea that makes ZOUGAME more adorable and vibrant: one of the ZOUGAME's back legs serves as a positive pole and the other as a negative pole, so it can charge itself by stepping onto a charging board located in its storage space. Physical therapist Y, who works at YAKUJU CORPORATION, shared feelings about a past experience of commuting around the city by bicycle as a daily routine, and offered a couple of ideas for discussion. Mechanic M, who works at the model railroad manufacturer and dealer SEKISUI KINZOKU, shared his view of the railroad fan community. Hence, the researchers considered that ZOUGAME can give more value by installing functions which make it more pet-like. The researchers will continue to develop “ZOUGAME” and co-create its value with the corporations.

5.2 Future Work


In this study, the researchers designed a ride-sharing service, “ZOUGAME”, and presented a novel service model. To examine the validity of ZOUGAME, the researchers focus on one of the housing complexes in Chiba, called Yonamoto Danchi. Yonamoto Danchi is a small town that people can walk across in around 20 min. Because it was built for nuclear families bringing up their children, the environment of Yonamoto Danchi is well managed. The facilities necessary for a daily routine, such as a supermarket, a post office, a kindergarten, an elementary school, and a public hall, are all present in Yonamoto Danchi. The researchers are now conducting ethnographic research, a field survey, and follow-up studies in Yonamoto Danchi, and have made a simple prototype to validate the value of the service in this study. From now on, the researchers intend to manufacture a fully developed prototype based on the research in Yonamoto Danchi and to conduct repeated user tests in the area to upgrade this service.

Acknowledgments. This research is funded by DENSO CORPORATION, YAKUJU CORPORATION, and SEKISUI KINZOKU. The researchers would also like to thank physical therapist Y for helping with this research.

References
1. Ratti, C., Claudel, M.: The City of Tomorrow: Sensors, Networks, Hackers, and
the Future of Urban Life (The Future Series). Yale University Press, New Haven
(2016)

2. Holtzblatt, K., Beyer, H.: Contextual Design: Design for Life (Interactive Tech-
nologies), 2nd edn. Morgan Kaufmann, Burlington (2016)
3. Goodwin, K.: Designing for the Digital Age: How to Create Human-Centered Prod-
ucts and Services. Wiley, Hoboken (2009)
4. Lusch, R.F.: Service-Dominant Logic: Premises, Perspectives, Possibilities. Cam-
bridge University Press, Cambridge (2014)
5. Levy, J.: UX Strategy: How to Devise Innovative Digital Products that People
Want. O’Reilly Media, Newton (2015)
6. Bird. https://www.bird.co/
7. Telepod. https://www.telepod.co/
8. Starship Technologies. https://www.starship.xyz/business/
9. Jacobs, B.J.: The Death and Life of Great American Cities. Vintage Books, New
York (1961)
10. Watanabe, S., Watanabe, Y.: Community Cat Activity that Connects People:
We Develop the Foundation of Community Welfare. Research Bulletin of Kyushu
Junior College of Kinki University, Higashi-osaka (2015)
Deep Learning Based Face Recognition
Application with Augmented Reality Devices

Andrew Kim, Ehsan Kamalinejad, Kelby Madal-Hellmuth,


and Fay Zhong(&)

California State University East Bay, Hayward, CA 94544, USA


jiaofei.zhong@csueastbay.edu

Abstract. Augmented reality devices such as the Microsoft HoloLens are


currently in development and have potential uses in many applications. These
devices allow the user to interact with holograms projected into their sur-
roundings. Because the augmented reality devices can project into the user’s
environment, one potential application for these devices is in recognizing and
labeling objects and people in the user’s view. This paper describes a deep
learning based method that uses a deep Convolutional Neural Network
(CNN) with the HoloLens to recognize and classify individual faces in a scene.

Keywords: Face recognition · Convolutional Neural Network · HoloLens

1 Purpose

Deep learning has become a topic of intense interest in the last few years due to its
success in many different fields such as computer vision and speech recognition [1].
The performance of some deep learning algorithms has even exceeded that of humans.
The convolutional neural network (CNN), a particular class of deep neural net-
works, has met with great success in the last few years in solving image classification
problems [2]. This makes CNNs the ideal tool to classify faces in the user’s view of the
HoloLens.
There has been a resurgence in popularity and demand for Convolutional Neural Networks (CNNs), influenced by the ImageNet challenges held every year. The CNN approach outperforms the next best algorithm by a huge margin; the gap is so large that it makes CNNs the only favorable choice for object recognition today. A CNN is an object recognition algorithm whose design is inspired by the organization and structure of the brains of living organisms. As of now, CNNs outperform the human eye on the ImageNet challenges.
The purpose of this research is to increase the accuracy of CNN-based object detection with additional data, specifically depth data. Usually, CNNs are used to detect objects in two dimensions, such as in video streams or images. Adding depth data is a complicated task, but having the neural network accept and identify three-dimensional objects should logically improve the accuracy of identification. Therefore, this research on CNNs with three-dimensional objects is carried out using a device called the HoloLens, created by Microsoft. The product is an augmented reality headset with key features such as spatial mapping, with which it can detect the shapes of surrounding objects and the distance between an object and the user.

2 Background

The success of CNNs at image classification can be attributed to the unique architecture
of the CNN. CNNs use the convolution operation at each layer of the network. This
involves convolving a set of learnable filters across the width and height of the input to
the layer.
The result of this process is that the filters learn different features of the input image
and deeper layers learn progressively more abstract features of the image. The resultant
output of the last convolutional layer can then serve as input to a traditional dense feed
forward network, which can perform the classification of the initial input image.
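As an illustrative aside not taken from the paper, the convolution operation described above can be sketched in Python/NumPy for a single-channel input and a single filter; the horizontal-edge kernel here is only an example in the spirit of Fig. 2(b). In an actual CNN the filter weights are learned during training rather than fixed by hand.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid (no padding) 2D convolution of one channel with one filter,
    as used conceptually at each layer of a CNN."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # element-wise product of the filter with the image patch, summed
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Horizontal edge-detection filter (illustrative)
kernel = np.array([[ 1.0,  1.0,  1.0],
                   [ 0.0,  0.0,  0.0],
                   [-1.0, -1.0, -1.0]])

image = np.random.rand(8, 8)           # stand-in for a grayscale image
feature_map = conv2d(image, kernel)    # 6x6 response map
print(feature_map.shape)               # (6, 6)
```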

Fig. 1. An example of the kernel convolution.

Figure 1 shows an example of kernel convolution, and specifically how a kernel


operates on one pixel [3]. Figure 2 shows a non-trivial image of a dog, and the corre-
sponding Horizontal Edge Detection on that image [4].

3 Methodology

A four-step process has been proposed to complete the following tasks. First, detect the faces in the image. Second, transform the detected faces to make them suitable as input to a CNN. Third, pass each face through a CNN to produce a feature vector. Finally, compare the feature vector to the feature vectors of known faces and pick the closest one as the correct face.

Fig. 2. (a). A non-trivial image of a dog; (b). Horizontal Edge Detection on (a).

3.1 Detect Faces


There may be more than one face in the image. Therefore, the first step of our process is to detect all of the faces in the image.
The Histogram of Oriented Gradients (HOG) has been utilized for this task [5].
HOG computes the gradient of each pixel with respect to brightness and then compares
the resultant gradient map with a known face gradient pattern to identify all of the faces
in the image. Then the location of the faces computed by HOG will be used in the next
step of our process.

3.2 Preprocessing
Because the faces detected in the previous step could be looking in different directions,
they need to be transformed so that they are all oriented in approximately the same way
to serve as better input to our CNN. Therefore, the second step is a preprocessing step.
This step can be done using Face Landmark Estimation (FLE). An FLE algorithm
will identify a certain number of “landmarks” of a face such as the eyes, nose, lips, etc.
Then we can apply a transform to the image based on the positioning of the landmarks
to center the face in the image as best as possible. Doing this will improve the accuracy
of our CNN’s prediction.

3.3 Generate Feature Vector


The third step in our process is to feed each preprocessed detected face through a CNN
to produce a feature vector, which represents some number of learned facial attributes.
It is important that the CNN used has been previously trained on a large dataset of faces, such as those available from OpenFace [6]. The time to train a deep CNN can be lengthy; however, it only needs to be done once. Once the network has been properly trained, a single forward pass through the network is all that is required to produce a feature vector.

Fig. 3. Application architecture of our proposed method.

3.4 Comparison
Once the feature vectors have been computed by the CNN, the final step is quite
simple. The first thing is to compute a set of feature vectors for known faces, by
running them through Steps 1–3.
This will need to be done only once before we attempt to classify any faces. Then,
for each vector to be classified, it is compared against the set of feature vectors of
known faces, and the closest match will be selected.
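The four steps above can be sketched with the face_recognition Python API cited later in Sect. 4 [7]; the image file names and the 0.6 distance threshold below are illustrative assumptions, not values reported in this paper.

```python
import face_recognition

# One-time step: compute feature vectors ("encodings") for the known faces.
known_image = face_recognition.load_image_file("known_person.jpg")   # example file
known_encodings = [face_recognition.face_encodings(known_image)[0]]
known_names = ["Known Person"]

# Steps 1-3 on a new image: HOG-based detection, landmark alignment and the
# CNN encoding are all handled internally by the library.
frame = face_recognition.load_image_file("scene.jpg")                # example file
locations = face_recognition.face_locations(frame, model="hog")
encodings = face_recognition.face_encodings(frame, locations)

# Step 4: compare each detected face against the known encodings.
for (top, right, bottom, left), encoding in zip(locations, encodings):
    distances = face_recognition.face_distance(known_encodings, encoding)
    best = int(distances.argmin())
    name = known_names[best] if distances[best] < 0.6 else "Unknown"
    print(name, (top, right, bottom, left))
```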

4 Implementation

Implementing this process on a HoloLens, though possible, is problematic, as the HoloLens does not have the hardware necessary to efficiently run the CNN locally on the device itself. To work around this problem we chose to employ a client/server approach, where most of the processing task is offloaded to the server.
The client, HoloLens, sends a video stream captured from the onboard video
camera to a server. The server implements the four-step process explained above and
returns data to the HoloLens that contains bounding box locations and classifications
for each face detected.
The HoloLens then updates its display with this information by drawing the
bounding box around each face and displaying the classified name below the box. This
process is repeated for each frame of video. We used the face_recognition [7]
Python API which implements this process in building our server. Figure 3 shows the
architecture of this application and illustrates our proposed implementation method.

Fig. 4. Face Recognition with Hololens (Input: Photo of people, Output: Photo with bounding
boxes on face detected and their names).

5 Results

Figure 4 shows the output of our face recognition application implemented on the augmented reality device, i.e., the HoloLens. Our application is able to detect faces, add bounding boxes on the detected faces, and display their names in real time.
The main issue with this workaround is that it doesn’t allow for a very high frame
rate on the HoloLens. This is caused by several factors. Assuming no overhead for
processing, the time delay of sending a frame to the server and receiving a response,
that is, latency, imposes an absolute upper bound on the frame rate.
Once accounting for the processing time necessary on the server, the effective
frame rate is further reduced. This can be minimized by using a very efficient GPU
based implementation on the server.
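As a rough, illustrative calculation (the timing figures below are hypothetical, not measurements from this work), the per-frame bound on the loop can be written as:

```python
def max_frame_rate(latency_s: float, processing_s: float) -> float:
    """Upper bound on frames per second for a send-frame / wait-for-result loop."""
    return 1.0 / (latency_s + processing_s)

# Hypothetical numbers: 50 ms of round-trip network latency plus 950 ms of
# CPU-only inference would cap the loop at about the ~1 fps reported in Sect. 6.
print(max_frame_rate(0.050, 0.950))   # -> ~1.0
```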

6 Conclusion

In summary, a deep learning based method that uses a deep Convolutional Neural Network (CNN) with the HoloLens can successfully recognize and classify individual faces in a scene in real time.
Our experience implementing this process on a non-GPU based server resulted in a frame rate of only approximately one frame per second. It has been reported that the next version of the HoloLens will have a special coprocessor aimed at accelerating deep learning workloads. This may allow a client-side-only implementation that should run considerably faster.

Acknowledgments. We would like to thank the California State University, Fresno, the California State University, East Bay, and the National Science Foundation for their financial support (NSF Grant #DMS-1460151), the FURST program (NSF Grant #DMS-1620268), the CSUEB FURST program (NSF Grant #DMS-1620500), and our mentors, Dr. Kamalinejad and Dr. Zhong, for their support during the completion of the project.

References
1. Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. The MIT Press, Cambridge (2017)
2. A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN. https://
blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-
34ea83205de4. Accessed 23 June 2019
3. Performing Convolution Operations. https://developer.apple.com/library/content/
documentation/Performance/Conceptual/vImage/ConvolutionOperations/Convolution
Operations.html. Accessed 23 June 2019
4. Dither a Grayscale Image. https://codegolf.stackexchange.com/questions/26554/dither-a-
grayscale-image. Accessed 23 June 2019
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: International
Conference on Computer Vision & Pattern Recognition (CVPR 2005), vol. 1, pp. 886–893.
IEEE Computer Society, San Diego (2005)
6. Amos, B., Ludwiczuk, B., Satyanarayanan, M.: Openface: a general-purpose face recognition
library with mobile applications. CMU-CS-16-118, CMU School of Computer Science, Tech.
Rep. (2016)
7. Face Recognition. https://github.com/ageitgey/face_recognition. Accessed 23 June 2019
e-Canteen for the Smart Campus Application

Zulfajri Basri Hasanuddin1(&), Muhammad Niswar1, Aulia Fadhila1,


and Mayong Adi Wardana2
1
Department of Electrical Engineering, Faculty of Engineering,
Hasanuddin University, Jln. Poros Malino KM 6, Bontomarannu,
92172 Gowa, South Sulawesi, Indonesia
zulfajri@unhas.ac.id
2
Department of Computer Science, Faculty of Engineering,
Universitas Sulawesi Barat, Jln. Prof. Dr. H. Baharuddin Lopa, SH, MH,
Lutang, Majene, West Sulawesi, Indonesia

Abstract. Nowadays, in the Industry 4.0 and digital economy era, manual service should be replaced by services that use technology as a supporting tool, with applications that support a smart canteen in solving the problems of serving food and queuing to pay. Therefore, an e-canteen web system with a Raspberry Pi as the server was created, which allows ordering directly from the customer's table without waiting for a waiter to bring the menu, and handles payment as well, because the system is compactly designed to make it easier for the cashier, the waiters, and even the chef in the kitchen to coordinate with each other. Testing was conducted with the Apache JMeter software to evaluate the server's ability to handle many clients based on its response time. The results show that a Raspberry Pi 3 Model B with 1 GB of RAM is able to handle more than 150 threads (users) at the same time, with response times of around 8 ms–17.6 ms and a throughput of around 132.23 requests/s. In the QoS parameter testing, when using the Raspberry Pi's WiFi network and the ARGtek ARG-1209 wireless network adapter, an index value of 3 (87.5%) was obtained, which means that the QoS of web access when using either the Raspberry Pi WiFi network or the wireless network adapter is in the good category.

Keywords: e-Canteen · Raspberry Pi · Apache JMeter · WiFi network · Wireless network adapter · Queue solution

1 Introduction

The development of technology continues to show rapid improvement in information and communication technology. In this modern world especially, everybody wants ease and speed in fulfilling their needs, and efficiency and effectiveness are the most influential factors in achieving this. One way to create efficiency and effectiveness is to use technology systems which make work more effective and efficient.
In Indonesia, the application of technology systems that make work more effective and efficient still lags behind the pace of technology development.


For instance, consider the culinary industry, such as canteens. Canteens are a familiar part of office and school environments.
According to Azwar and Sapri (2012) [1], the canteen of Hasanuddin University is a place where students can spend their time eating, drinking, or just taking a rest. However, payment transactions are still processed manually and paid in cash, which causes a queuing problem. Therefore, a smart card was designed as a transaction tool in the canteen. By using the smart card as a payment tool, the queue at the cashier could be solved [2], but the queue for ordering food remains unsolved: food and drinks are still ordered through a manual system, which takes much time.
Today, to fulfill customers' needs, most canteens let customers line up in front of the cashier or wait for a waiter to come and hand over the menu before placing a food order. After finishing their meal, customers go to the cashier to pay the bill, forming a line again. Manual ordering also tends to produce mistakes, such as misreading the menu items ordered by the customer. The service, ordering, and payment methods of this system show how ineffective and inefficient it is. Therefore, a system is needed that allows ordering directly from the customer's table, without waiting for the waiter to bring the menu list or lining up in front of the cashier either to order or to pay the bill.
According to Vinayak Ashok Bharadi et al. (2013) [3], an e-menu provides information about menu items with interactive pictures. Confusion often arises in the kitchen about the orders written down by the waiter; using a tablet minimizes ordering mistakes and also makes service faster.
Based on the description above, an effective and efficient system is designed for canteen service using a web application installed on tablets as the ordering tool. Pictures of the canteen's food and drinks with their prices appear on the tablet, and customers can choose the menu items they want using the tablet available at each table. After the order is completed, the kitchen receives a message about the food and drinks ordered by the customer, and the chef in the kitchen prepares them. The cashier also receives the order list and the amount of the customer's bill, so the customer can simply pay the bill directly without re-checking the order list. A Raspberry Pi is used as the server to connect the customers to the cashier and the kitchen. Therefore, the authors propose the “e-Canteen for the Smart Campus Application” as a solution to the long queue problem in the ordering and payment process.

2 Theoretical Background

2.1 Raspberry Pi
The Raspberry Pi is a credit-card-sized single-board computer whose operating system is generally Linux-based; in its later development it is also capable of running a Windows IoT-based operating system. Development of the Raspberry Pi was started in 2006 by a non-profit institution, the Raspberry Pi Foundation, which consists of volunteers and technology academics in England [4].
The Raspberry Pi 3 is built around the Broadcom BCM2837 system with a 1.2 GHz 64-bit quad-core ARMv8 CPU, 802.11n wireless LAN, Bluetooth 4.1, and Bluetooth Low Energy (BLE). Like the Raspberry Pi 2, the Raspberry Pi 3 also has 1 GB of RAM, 4 USB ports, 40 GPIO pins, a full HDMI port, an Ethernet port, a combined 3.5 mm audio jack and composite video output, a camera interface (CSI), a display interface (DSI), a Micro SD card slot, and a VideoCore IV 3D graphics core. It does not include a built-in hard disk or solid-state drive, but uses an SD card for booting and persistent storage. The foundation provides Debian and Arch Linux ARM distributions for download, with Python as the main programming language [5].

2.2 WEB Server


In generally, server could be defined as the center and used as the “servant” which is
useful to the data delivery or data received, and also to arrange the delivery and demand
data between the connected computers, in other words server is used to provide the
service to the client. Meanwhile the Web Server is a form of server which is used
specifically to save the website page or home page [6].

2.3 PHP
PHP (PHP: Hypertext Preprocessor) is a scripting language embedded in HTML. Most of its syntax is similar to that of C, Java, and Perl, plus a number of PHP-specific functions. The main purpose of the language is to enable web creators to write dynamic web pages quickly [7].

2.4 MySQL
MySQL is one of the most popular database servers. Its popularity stems from its use of SQL as the basic language for accessing its databases. SQL is a concept for database operations, especially selection and data input, which enables data operations to be carried out easily and automatically [8].

2.5 Apache JMeter


Apache JMeter is open-source, 100% pure Java software created to load-test functional behaviour and to measure performance. It was originally created to test web applications, but has since been expanded to other test functions. Apache JMeter can be used to test the performance of both static and dynamic resources (web services (SOAP/REST), dynamic web languages such as PHP, Java, and ASP.NET, files, etc.) [9].

2.6 Wireshark
Wireshark is a network packet analyzer application which captures or filters the packets on a network and displays all the information in those packets clearly. The filtered packets can be used to analyze the network. Analysis of network performance covers many things, from capturing data packets and information passing through the network, to checking network security and troubleshooting, and even sniffing (obtaining important information such as passwords and private data). Wireshark is an open-source network analyzer with a GUI (Graphical User Interface) [10].

3 Research Methodology
3.1 General Overview of the System
Figure 1 shows the system flow from the beginning, when the customer opens the menu web system on the tablet available at each table. The customers then look at the menu shown on the tablet when they want to place an order. Once an order is made, the order list is sent to the kitchen to be processed and to the cashier to calculate the payment. The LCDs at the cashier and in the kitchen show the customers' order lists in real time: the kitchen LCD shows the order list, while the cashier LCD shows the order list with the total payment. After the customer pays the bill, the system flow is complete.

Fig. 1. System flow diagram of e-canteen
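The data handed to the kitchen and to the cashier in this flow can be modeled with a small illustrative sketch. The actual system is implemented with PHP and MySQL on the Raspberry Pi, so the Python below, including the menu items and prices, is only a hypothetical mirror of that flow, not the authors' code.

```python
# Illustrative model of the order flow only; not the authors' PHP/MySQL code.
MENU_PRICES = {"fried rice": 15000, "iced tea": 5000}    # hypothetical menu (IDR)

def place_order(table_id: int, items):
    """items: list of (product_name, quantity) chosen on the table's tablet."""
    order = [
        {"product": name, "qty": qty, "price": MENU_PRICES[name] * qty}
        for name, qty in items
    ]
    total = sum(line["price"] for line in order)
    kitchen_view = [(line["product"], line["qty"]) for line in order]   # what to cook
    cashier_view = {"table": table_id, "order": order, "total": total}  # what to charge
    return kitchen_view, cashier_view

kitchen, cashier = place_order(3, [("fried rice", 2), ("iced tea", 2)])
print(kitchen)             # [('fried rice', 2), ('iced tea', 2)]
print(cashier["total"])    # 40000
```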



3.2 Measurement Scenario


The measurement scenario used to evaluate the performance of the Raspberry Pi web server and the network quality is based on QoS parameters, both when using the Raspberry Pi WiFi and when using the wireless network adapter, as shown in Fig. 2.

Fig. 2. Measurement scenario.

4 Results and Discussion

4.1 The Implementation of Device System


4.1.1 The Implementation of User Interface
The user page is the page where customers place orders from the menu; it can be opened after going through the login process shown in Fig. 3. The options on the user page are home and pictures of food and drinks.

Fig. 3. Login user

The login menu is used to input the username and password of the customer's table. Login serves as the system's authentication of users who are allowed to access it. If the login data entered is valid, the user enters the homepage of the website.

Fig. 4. Home user

On the home user screen, as shown in Fig. 4, there are buttons to enter the customer's name, add an order, finalize the order, and log out:
1. Enter the customer's name: input the customer's name.
2. Add order: move to the next page and see the menu list.
3. Final order: end the order process.
4. Logout: the user exits the system.

Fig. 5. The view of food and drink

In the view of food and drinks, as shown in Fig. 5, there are the following buttons:
1. Add to cart
Add to cart selects the food and drinks that customers want.
2. Quantity
Quantity determines the amount of food and drink in the customer's order.
3. Order
Order sends the customer's order data to the cashier and the kitchen.

After customers order food and drinks by clicking Order on the user page, they return to the home user page to re-check the order list, as shown in Fig. 6.

Fig. 6. The view of home user

After re-checking the order list, the customer clicks Final Order to end the ordering process. The order data is then sent to the kitchen and the cashier.

4.1.2 The Implementation of Kitchen Interface


The kitchen page is used to check the order lists from the customers. The options on the kitchen page are the order view, food management, and drink management.

Fig. 7. The view of kitchen page

The kitchen page shows the customers' order list, as shown in Fig. 7. The table contains:
1. Product name: the list of food and drinks ordered by the customer
2. Quantity: the amount ordered
3. Price details: the price of the food and drinks ordered by the customer
4. Order total: the total payment for the customer's order.

Fig. 8. Food and drink management

Food and drink management, as shown in Fig. 8, is a view of the kitchen page used to disable food or drink items on the user menu page. If the kitchen marks a food or drink item as disabled or unavailable, customers cannot order it from the menu.

4.1.3 The Implementation of Cashier Interface


The cashier page is the page used by the admin. On this page, the admin processes the customers' payment transactions, arranges the menu shown on the page, and saves the customers' order data to be used as the order database. This page is opened after going through the login process.

Fig. 9. The view of login admin page

Login is used to input the admin's username and password, as shown in Fig. 9. This login is an authentication system for users who are allowed to access it. If the login data entered is valid, the user enters the cashier website page.

Fig. 10. The view of admin page



After login, the admin page appears, as shown in Fig. 10. This page has a logout button used by the admin to exit the page.

Fig. 11. The view of cashier page

Figure 11 shows the cashier page, which has Pay and Clear Data buttons. The Pay button is used when the customer has paid the bill, while the Clear Data button deletes the data on the cashier and kitchen pages.

Fig. 12. The view of insert food and insert drink page

The insert food and insert drink page, shown in Fig. 12, is used to add food or drink items to the e-canteen ordering system. The data to be input are:
1. Product name
2. Product price
3. Product image

Fig. 13. The view of data order page



The database of customer orders can be reviewed by the admin on the data order page, as shown in Fig. 13. The data order page shows the following information:
1. Data order: the view of the customer's order list
2. Time: the time of the customer's order
3. Total: the total amount of the customer's order
4. Detail: the order details of the customer

4.2 The Response Testing of Web Server


The testing is conducted in two conditions, which the first condition is using the WiFi
network on Raspberry, while the second condition is using the wireless adapter, which
is ARGtek ARG-1209. The parameters which are being tested are Average, Min, Max,
Throughput, and request status. The test results of both conditions according to the
response analysis of Web Server are:
1. The graph of average response time from the observation results is shown in Fig. 14 below.

Fig. 14. Average graphic of response time

2. The graph of average throughput from the observation results is shown in Fig. 15 below.

Fig. 15. Average graphic of Throughput



Based on the graphs for both conditions above, it can be seen that there is a difference between using the Raspberry Pi WiFi network and using the wireless network adapter. In the testing, the response time when using the Raspberry Pi WiFi network ranged from around 8 ms to 15.2 ms, while with the wireless network adapter the response time ranged from around 12.8 ms to 19.6 ms. The highest response time for both conditions occurred with 5 threads (clients): 15.2 ms for the Raspberry Pi WiFi and 19.6 ms for the ARGtek adapter. Based on these results, the better response time is obtained when using the Raspberry Pi WiFi network.
The largest throughput when using the Raspberry Pi WiFi network is 132.33 requests/s, and when using the wireless network adapter it is 141.86 requests/s, both with 150 threads, while the lowest throughput is 6.04 requests/s for the Raspberry Pi WiFi network and 6.06 requests/s for the wireless network adapter, both with 5 threads. For the tests with 5–150 threads, it can be seen that the more samples are sent, the faster the requests are served each second, because the server tends to keep the response time as low as possible while the throughput keeps increasing. For throughput, a higher value is better, as it means the server can execute more requests per unit time.
Based on the test results, the web server performance in the first condition is better at responding to requests, shown by a lower response time than in the second condition, with a difference of 4.4 ms. Meanwhile, the second condition performs better in terms of throughput, which is larger than in the first condition by 0.02–9.5 requests/s.
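The authors performed these measurements with Apache JMeter. Purely as an illustration of what such a test measures (the URL, path, and thread count below are hypothetical, not the actual test plan), a similar response-time and throughput measurement could be scripted as follows:

```python
# Illustrative load-test sketch; not the authors' JMeter test plan.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://192.168.4.1/e-canteen/index.php"   # hypothetical server address
N_USERS = 150                                    # simulated number of threads (users)

def one_request(_):
    start = time.perf_counter()
    urlopen(URL).read()                          # issue one HTTP request
    return time.perf_counter() - start           # per-request response time (s)

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_USERS) as pool:
    times = list(pool.map(one_request, range(N_USERS)))
elapsed = time.perf_counter() - t0

print("average response time: %.1f ms" % (1000 * sum(times) / len(times)))
print("throughput: %.2f requests/s" % (len(times) / elapsed))
```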

4.3 QoS Testing


Table 1 gives the QoS parameter measurement results when using the Raspberry Pi WiFi.

Table 1. Index value of QoS by using Raspberry WiFi


Total user QoS parameter Measurement result TIPHON standard Index
2 Packet Loss (%) 0 Very Good 4
Delay (ms) 44.6716 Very Good 4
Jitter (ms) 1.4112 Good 3
Throughput (kbps) 149.8 Good 3
3 Packet Loss (%) 0 Very Good 4
Delay (ms) 45.1276 Very Good 4
Jitter (ms) 1.4386 Good 3
Throughput (kbps) 146.6 Good 3
4 Packet Loss (%) 0 Very Good 4
Delay (ms) 46.8872 Very Good 4
Jitter (ms) 0.785 Very Good 4
Throughput (kbps) 144.4 Good 3

Table 2 gives the QoS parameter measurement results when using the wireless network adapter.

Table 2. Index value of QoS by using Wireless Network Adapter


Total user QoS parameter Measurement result TIPHON standard Index
2 Packet Loss (%) 0 Very Good 4
Delay (ms) 32.9098 Very Good 4
Jitter (ms) 0.7954 Very Good 4
Throughput (kbps) 208 Good 3
3 Packet Loss (%) 0 Very Good 4
Delay (ms) 34.2178 Very Good 4
Jitter (ms) 1.0404 Good 3
Throughput (kbps) 201.8 Good 3
4 Packet Loss (%) 0 Very Good 4
Delay (ms) 36.7262 Very Good 4
Jitter (ms) 0.7884 Very Good 4
Throughput (kbps) 185.8 Good 3

Looking at the QoS index tables, the QoS value is determined by dividing the total of the parameter indices by 16. The resulting QoS index values for each condition are as follows (a short sketch of this calculation is given after the list):
1. The QoS index when using the Raspberry Pi WiFi to make requests is 14/16 = 0.875 × 100% = 87.5%, with an index value of 3, which means the QoS is in the good category.
2. The QoS index when using the wireless network adapter to make requests is 15/16 = 0.9375 × 100% = 93.75%, with an index value of 3, which means the QoS is in the good category, for 2 and 4 total users. For 3 total users, the QoS index is 14/16 = 0.875 × 100% = 87.5%, with an index value of 3, which also means the QoS is in the good category.
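A minimal sketch of this calculation, using the per-parameter TIPHON indices from Tables 1 and 2 (the function name is only for illustration), is:

```python
def qos_percentage(parameter_indices):
    """Sum of the four TIPHON parameter indices divided by the maximum total of 16."""
    return sum(parameter_indices) / 16.0 * 100.0

# Raspberry Pi WiFi, 2 users: packet loss 4, delay 4, jitter 3, throughput 3
print(qos_percentage([4, 4, 3, 3]))   # -> 87.5
# Wireless network adapter, 2 users: 4, 4, 4, 3
print(qos_percentage([4, 4, 4, 3]))   # -> 93.75
```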

5 Conclusion

From this research, it can be concluded that:

1. The e-Canteen can help solve the long queue problem in the ordering and payment process.
2. The test results reveal that the web server shows a good response when handling many clients, whether using the Raspberry Pi WiFi network or the wireless network adapter. This can be seen from the throughput: the more samples are sent, the faster the requests are served, and a higher throughput is better, meaning the server can execute more requests per unit time.

3. The QoS testing results when using the Raspberry Pi WiFi network and the ARGtek ARG-1209 wireless network adapter both show an index value of 3, which means that the QoS of web access when using either the Raspberry Pi WiFi or the ARGtek ARG-1209 wireless network adapter is in the good category, with QoS percentages of 87.5% for the Raspberry Pi WiFi and 93.75% for the ARGtek ARG-1209 wireless network adapter, respectively.

6 Future Works

There are many things that could be done to improve this research in the future, including:
1. This e-Canteen is still simple, especially in its appearance and security, so it is expected to be made more attractive in further improvements.
2. The design of the e-Canteen is expected to become more interactive so that the information needed by the customer is more useful.
3. Facilities that can handle the payment process should be provided.
4. A big screen (display) should be provided for promoting new food and drink menus, and CCTV should be added for safety reasons to support the e-Canteen management system.

Acknowledgment. The authors would like to thank the Hasanuddin University and the
University of Sulawesi Barat for supporting this work.

References
1. Aswar, Z.: Kartu Mahasiswa Cerdas Menggunakan Teknologi RFID, Universitas Hasanud-
din (2016)
2. Amol, S., Aayushi, V., Manali, C.: Integrated cafeteria management system using RFID.
IOSR J. Electron. Commun. Eng. (IOSR-JECE) 1, 01–06 (2017)
3. Vinayak, A., et al.: Intelligent e-Restaurant Using Android OS (2013)
4. Patulak, Mitra, Priyon: Aplikasi Raspberry Pi sebagai Webserver untuk mengendalikan
Lampu melalui Website. Skripsi. FT, Teknik Informatika. Universitas Hasanuddin (2016)
5. Rao, P.B., et al.: Raspberry Pi home automation with wireless sensors using smart phone.
Int. J. Comput. Sci. Mob. Comput. 4(5), 797–803 (2015)
6. Betha, S.: Pemrograman WEB dengan PHP. Informatika Bandung, Bandung (2014)
7. Muhammad, I.: PHP dan MySQL untuk orang Awam. Maxikom, Palembang (2003)
8. Abdul, K.: Tuntunan Praktis: Belajar Database Menggunakan MySQL. Andi, Yogyakarta
(2008)
9. Evin, Asmunin: Performance Test Dan Stress Website Menggunakan Open Source Tools,
Jurnal Manajemen Informatika 6(1) (2016). ISSN: 208-215 208. http://jurnalmahasiswa.
unesa.ac.id/index.php/jurnal-manajemeninformatika/article/view/18463/16837. Accessed 14
Mar 2018
10. Annisa, C.: Pengenalan dan Dasar Penggunaan Wireshark (2013). http://ilmukomputer.org/
2013/04/22/pengenalan-dan-dasar-penggunaan-wireshark/. Accessed 19 Mar 2018
Voluntary Information
and Knowledge “Hiding” in a Conservative
Community and Its Consequences: The Case
of the Ultra-orthodox in Israel

Moshe Yitzhaki(&)

Department of Information Studies, Bar-Ilan University,


52900 Ramat-Gan, Israel
yitzhm@biu.ac.il

Abstract. A voluntary self-imposed “hiding” of information and knowledge is


one means by which a conservative community, striving to retain its members,
especially its youngsters, within a traditional lifestyle, copes with the challenges
of unrestricted information and knowledge flow in a developed country like
Israel. These voluntary barriers to the free flow of information and knowledge are usually effective in achieving their target, but also have considerable drawbacks, on the national level as well as on the individual one.

Keywords: Voluntary information and knowledge hiding · Israeli ultra-orthodox community · Barriers to information and knowledge flow · Conservative communities · Internet filtering

1 Introduction

Generally speaking, two distinct religious groups can be distinguished among the Jewish population world-wide: modern orthodox and ultra-orthodox. The term used by members of the latter community to refer to themselves is the Hebrew word 'Haredi' (rather than ultra-orthodox). The modern orthodox are usually much more open to the modern world, combine 'secular' general and academic studies in their educational system, and are more involved in the surrounding 'secular' world. The ultra-orthodox ('Haredi') groups, on the contrary, are less open to the modern world, strongly adhering to traditions and old customs.
Understandably, the free flow of unfiltered information, as facilitated by current IT,
poses great challenges for a conservative community which strives to retain its
members and especially its youngsters, within a traditional lifestyle [1].

2 Objectives of the Study

The aim of our study was to explore the ways in which a conservative minority community upholding a unique subculture copes with the challenges of unrestricted information and knowledge flow in a cyber-developed country like Israel. We tried to


determine the extent of the self-imposed information and knowledge “hiding” which prevails in this specific community, the means by which it is carried out, and what its implications have been.
Our current analysis is limited to the ultra-orthodox community, which today constitutes approximately 10% to 15% of Israeli society; the modern orthodox community (about 20%) deserves a separate study, since it differs considerably in its degree of openness.

3 Literature Review

After being neglected for a long time by academics and the scholarly community, ultra-orthodox society has, during the last three decades, seen an influx of academic publications, articles as well as books, not to mention popular news reports, dealing with its various branches and trying to decipher its roots, codes, motives, goals, behavior and future. The growing interest in this group has to do with its rapid demographic growth. From a small and negligible minority at the time of the establishment of the State of Israel in 1948, its population is now estimated at about 10% to 15% of the Israeli population, and is becoming larger every year.
One of the first academic experts in Israel was Friedman, from the Department of Sociology and Anthropology at Bar-Ilan University, with his now-classic 1991 book titled The Ultra-Orthodox Society: Sources, Trends and Processes [1]. Friedman has since had many followers, who have used various research methods, mainly qualitative ones, to study various aspects of this unique society, which differs in many respects from the general population in Israel but cannot be ignored due to its growing political, economic and sociological influence.
To mention at least some of the studies that followed: Nurit Stadler published in 2008 her book titled Yeshiva Fundamentalism: Piety, Gender and Resistance in the Ultra-Orthodox World, in which she analyzes the reconstruction of masculinity in the fundamentalist world as a result of the challenges of modernity. It addresses these questions through an investigation of the redefinition of the family, work, the army and civil society in the ultra-orthodox yeshiva world in Israel [24]. In her 2012 book A Well-Worn Tallis for a New Ceremony she explores new aspects of voluntarism, citizenship, family life and the concept of freedom in ultra-orthodox culture today [25].
Hakak's 2012 book (volume 19 in the series “Jewish Identities in a Changing World”), titled Young Men in Israeli Haredi Yeshiva Education: The Scholars' Enclave in Unrest, is based on papers he had previously published, dealing with various aspects of the Israeli ultra-orthodox community [20]. Foscarini published in 2014 a long and thorough study of the ultra-orthodox community in Israel, claiming that the secular education and vocational training that ultra-orthodox Jewish women receive in order to get a decent job in the labor market have been a source of emancipation and modernization in their community [21].
In a recent article 2015 Adam Ferziger reviews in detail former studies dealing with
the burgeoning egalitarian trends featured in the new roles for women within liberal
Jewish denominations and among the Modern Orthodox as well as the ultra-orthodox
one. Ferziger builds upon and adds to these former studies by offering an initial portrayal
and analysis of a relatively new phenomenon: the American female non-hasidic haredi
outreach activist. He does so by locating these figures within overall trends of American
ultra-orthodox Jewry, as well as in relation to the broader phenomenon of Orthodox
feminism [22].

4 Methods

In the present study we made extensive use of qualitative research tools and techniques. Several ultra-orthodox newspapers, radio stations, public announcements and other media, as well as adult and children’s books, were reviewed. In addition, members of the community were interviewed and various public announcements of religious leaders were analyzed, using qualitative research techniques, including in-depth content analysis.

4.1 Findings and Discussion


A general background on the main features and characteristics of the Jewish ultra-orthodox community in Israel would seem necessary here. We will skip it, however, since it has already been described in great detail in some of our former papers, especially the one given at the InSITE conference in Vilnius [17], as well as in numerous articles and books published during the last three decades, both in Hebrew and English [1, 3, 17–19].
As already mentioned above, the ultra-orthodox society strives to counter external influences of the surrounding secular and permissive western society, thus preventing desertion of tradition and religion by its members, especially the young generation. The goal is to achieve maximal cultural and social segregation [19]. After graduating from elementary school, in which secular core studies are very limited, boys continue in a ‘Small Yeshiva’ from age 14 to 17, where secular core studies do not exist at all [20]. They then face entrance examinations to a ‘Higher Yeshiva’, where they remain until they marry, usually in their early twenties. The ‘Higher Yeshiva’ to which they are accepted has a significant impact on their future: on their chances of finding a good match and becoming breadwinners, as well as on their social mobility.
“The years of study in the Yeshiva – wrote Yaakov Friedman in his book - are the
root of the Jewish soul. . . . They provide the young man with tools for a full spiritual
life. . . . The Yeshiva will make those who dwell in it contemptuous of the temptations
of the external world and will reveal to them the beauty hidden in the pages of the
Talmud”. Hakak explains in his book that “The control and discipline of the students’
[in the Yeshiva’] dress code is part of a broader effort to discipline and control the
students’ bodies, protecting them from these external temptations, influences and
fashions, which are likely to impact on the spiritual and religious realm. In the Yeshiva
world, the body, like all other aspects of the students’ earthly lives must be disciplined
so that their spiritual lives may flourish. The extent to which the students are successful
in these efforts will have major influence in determining if they will be accepted to a
higher yeshiva and to which one” [23].

It is widely known that no form of media, including adult and children’s literature, is ideologically and culturally neutral; all strongly reflect the various biases of the people who created them. Consequently, the spiritual leaders of the ultra-orthodox society (the “Rabbis”) insist that both adults and youngsters consume only media and literature upholding, promoting and consistent with their values and lifestyle. These leaders strongly oppose, on ideological grounds, any use of so-called ‘secular’ media, both printed and electronic, including children’s books.
The Internet in particular is perceived as a very serious spiritual threat and is the
object of heated condemnation and total rejection. Prominent rabbinical figures from
almost all ultra-orthodox circles constantly warn, in mass public gatherings as well as in
all kinds of their media of the spiritual hazards of the Internet. In public rallies the Internet
is termed “today’s major threat to the souls of the young generation” [4, 5, 11–16].
Special newly-written prayers are disseminated in synagogues and congregations
against the use of the new IT. Following is one of these new prayers, claimed to be
recommended by chief spiritual leaders (like Rabbis Hayyim Kanievsky, Moshe
Tsedaka and David Abu-Hatsira): “Lord of the universe, you know very well that
terrible dangers are hidden in most of the technological tools which are common in our
generation, and have ‘killed’ many. And several men, women and youngsters, boys and
girls, who had been god fearers, have strongly deteriorated due to these tools, god
forbid. We beg you that you will help them … to fully repent soon. Have mercy on us
for your holy sake to help us as well as all your beloved children (especially to…) to
keep great distance from all the non-Kosher tools and to earn our living by only the
permitted means ….” Another slogan says: “Thank God we have no email and no site
in the Internet (Inter-sin…), let its name be erased” [a privately owned document].
Admittedly, such harsh admonitions may attest to the contrary; perhaps the rabbinical leadership does not expect total compliance with its prohibitions and fears the consequences of dissent. After all, obedience is voluntary and the rabbis lack tangible means of enforcement, besides social and community pressure [3, 9].

4.2 Solutions and Alternatives


Thus, it is possible to understand the rationale behind the objection to IT and to the free flow of information and knowledge in general. The ultra-orthodox community is well aware of the hazards of unrestricted information and has employed various measures to counter them. These measures have enabled it to “walk the tightrope”, that is, to utilize at least part of the immense benefits offered by the new IT and media while trying not to compromise its ideology and religious principles.
Obviously, the ultra-orthodox cannot completely isolate themselves from the surrounding society. Working people need the Internet to communicate by e-mail and for necessary information searches. Many jobs require the use of computers and cell-phones. Even full-time learners often need cell-phones to contact family and friends, in the absence of pay-phone booths. The fact is, however, that the vast majority of the ultra-orthodox sector avoids television and in principle does not own a set.

Home computers are almost non-existent in many ultra-orthodox families, and those which do exist are either not connected to the Internet or are filtered. Others have limited Internet access, for breadwinning work, including e-mail messages but not information searches. Due to public demand and for commercial reasons, all Israeli cellular companies offer a certain type of cell-phone, nicknamed the ‘Kosher’ cell-phone, adapted to the limiting demands of the ultra-orthodox community. This device lacks the option of an Internet connection and consequently lacks free surfing, texting, WhatsApp, etc. [17]. To avoid consumption of ‘secular’ media, including television and the Internet, considered a serious spiritual threat, the ultra-orthodox society has successfully developed its own sub-cultural media and recreation activities, separate and different from mainstream cultural life in Israel [7–9].
Regarding books, there has been enormous ongoing demand for books and magazines. The demographic reality of large families has resulted in a vast population of young book consumers, for whom reading has to replace television, computers, the Internet, etc. Current official data of the governmental Central Bureau of Statistics show that in towns in which ultra-orthodox communities form the majority, the proportion of youngsters up to the age of 17 is 47% to 64% [10]. Studies by Nitsa Kasir, a non-ultra-orthodox woman serving as co-head of the ‘Ultra-Orthodox Institute for Policy Studies’, indicate that Israel is on the verge of a demographic revolution: within about 45 years, by 2065, the ultra-orthodox population will increase three-fold, comprising about one-third of the total population and about 40% of the Jewish population in Israel. Already today 20% of children up to age 9 are ultra-orthodox, since ultra-orthodox women give birth to 7 children on average, compared to about 2.3 in the general sector [10].
Consequently, reading remains one of the main leisure activities, creating a constantly growing demand for books, magazines and newspapers. As a result, publishing has become an important economic branch in this sector, annually producing thousands of titles of all genres: novels, thrillers, etc., intended for youngsters, women, and also men [2, 17].

4.3 Adverse Social and Economic Consequences


Without doubt, the voluntary segregation of this community from great portions of the information and knowledge shared by the general public around it takes its toll by reducing its involvement in important areas of life in modern society, such as academia, professions requiring higher education, army service, etc. The exclusion of so-called “secular studies” (i.e. math, English, physics, chemistry, etc.) from the curriculum of most ultra-orthodox boys between the ages of 14 and 18 makes it almost impossible for them to later pursue academic studies in higher education institutions [9, 18–20].
The absence of core subjects (English, math, etc.) from the school curriculum is clearly reflected in the proportion of graduates who are eligible for the ‘Matriculation Certificate’: while it was about 81% in the general sector in 2018 and 79.5% in the modern-orthodox sector, it drops to as low as 32.5% in the ultra-orthodox one. Nationally, ultra-orthodox boys got an average grade of only 39 in mathematics, compared to 62 in the Jewish non-orthodox sector [Central Bureau of Statistics, 2018].

As a result, at a later phase of their life, when these boys marry and have to find a good, high-income job, their options are very limited and many of them are forced to hold only low-salary jobs, which do not require a higher-education background. This is a loss for them personally as well as for the whole country on the macro level. No doubt, the absence of core subjects (English, math, etc.) from the school curriculum clearly damages the chances of ultra-orthodox graduates to earn a decent salary once they leave the ‘Kollel’ and go out to the labor market [20].
A recent Facebook post by David Uman, a former ultra-orthodox ‘Kollel’ member and now a recent divorcee, published in the weekly modern-orthodox magazine ‘Be-Sheva’ (Sep. 26, 2019, p. 32) [6], reads: “This is what I told the rabbinical judges of the family court this morning, at the end of the session dealing with the monthly sum of money I have to pay to my ex-wife: You have explained very well how much I have to
pay as a father responsible for his daughters, and why 2000 Shekels (about $550 USD)
are not enough, and even though I don’t have a profession, I should have one. You are
100% right. It’s my responsibility and I’ll do my best to live up to it. However, I have
only one question to you: Now you come?! All my life I was told that I don’t need a
profession and that I can stay in the ‘Kollel’, bring home 1500 Shekels (about $400
USD) for 10 children and everything will be OK. And now you are telling me that in
fact a father should bring 1400 Shekels for each child?! You, as judges who are
familiar with the problem, are responsible to change it, to prepare the boys long time
ahead for a situation in which they will have to provide living for their future children,
before they reach an impossible critical situation” [6].
Mr. Uman, 28, was raised in a typical ultra-orthodox family and passed through all stages of ultra-orthodox education. After marrying and having children, striving for a decent living, he tried to establish a clothing business together with his wife, but failed and fell into growing debt. He then tried to find a job, but could not find one with a decent salary due to his lack of education and training, and thus deteriorated to the situation described in his post [6].
Interviewed later by the abovementioned magazine, Mr. Uman added: “They keep
telling us that we must learn in the ‘Kollel’ as much as possible, year after year. Core
studies are strictly forbidden in both elementary and high school, and thus comes the
day when a guy has 5 children and realizes that his wife’s salary together with modest
allowance from the ‘Kollel’ are not enough. But then it’s too late. He has already lost
the chance to go to the labor market to earn the money he needs… the ultra-orthodox
community has developed the model of helping their children by buying an apartment for
the young couple. It helps the young couple at the beginning and makes them happy.
Soon, however, they realize that they, in turn, will have to provide 400,000 Shekels
(more than $100,000 !) to each of their eight children, when they get from the ‘Kollel’
the 1500 Shekels allowance” [6].
The dissonance presented by Uman’s story appears in a delicate and complicated form in the story of Avigdor Feldman, his classmate. He too encountered the difficulty of covering his family’s living costs with the modest allowance of a ‘Kollel’ member, but chose a different way of coping with it. After failing the entrance examination for academic studies five (!) times, he did not give up and finally completed a Bachelor’s degree at the Open University; he will soon start studying for his Master’s degree. Meanwhile, he started a large start-up project of searching the Internet for jobs
for ultra-orthodox people. He also started a project of teaching English to ultra-orthodox youngsters. He cites data showing that many ultra-orthodox students who want to acquire a secular education fail due to a lack of English knowledge, and that there is demand for English study among parents who want their children to be able to earn a decent living. He manages to walk “between the drops”: his projects are part of an unwritten and unofficial consensus between the leadership and its rank and file. The spiritual leadership is aware of his projects, which are publicized in the ultra-orthodox newspapers, but it avoids any hint that might show support for them. Knowing the ultra-orthodox community, Feldman knows how to conduct himself in a way that will not attract ‘fire’. “I’ll never attack,” he says, “and never start a controversy with a rabbi concerning studying English. When parents demand it, we provide it and the rabbis are usually aware and agree in silence. If a local rabbi or another spiritual leader publicly objects, we stay away immediately, avoiding head-to-head confrontation. This is the way that the change takes place, and it occurs in great numbers” (‘Be-Sheva’ weekly magazine, Sep. 26, 2019, p. 32) [6].
It is worth noting that the question of introducing core subjects into the ultra-orthodox school curriculum has been a very controversial issue, hotly debated in the Israeli political arena. Some political parties, mainly from the ‘left-center’ of the political map, have demanded that ultra-orthodox schools be required to teach core subjects if they want to receive government money, and have made this a necessary condition for joining a coalition and forming a government. On the other hand, the ultra-orthodox parties still sharply object to it, striving for ideological reasons to keep the independence of their school system curriculum-wise. They fear that introducing the core subjects will interfere with their boys’ Torah and Talmud studies and will attract some of them to academic studies [1, 19].
Another obstacle preventing ultra-orthodox adults from pursuing academic studies has been the issue of ‘gender separation’ in classes. Most ultra-orthodox adults prefer to study in classes separated between males and females, as they were used to in their own separate educational system. As a matter of fact, most universities have opened separate tracks especially tailored for the ultra-orthodox community. This step has been greatly encouraged and financially supported by various government executive agencies and branches, in order to help people from the ultra-orthodox community enter the labor market and earn a decent living, and to increase the labor force and strengthen the economic state of the country.
Following are two examples, out of many, of the ongoing efforts to teach at least vocational skills to ultra-orthodox adults, to enable them a decent standard of living. A bulletin board in an ultra-orthodox neighborhood presents an advertisement for an evening class teaching plaster construction. Another ad announces the opening of a weekly evening class for acquiring the skills of a technician for small home electric appliances. No prior knowledge is required, and graduates will receive a certificate and assistance in work placement. Moreover, it states that the class is run by “ultra-orthodox management under rabbinical supervision”.

5 Conclusions
1. Being a religious-cultural minority, the ultra-orthodox community in Israel seeks to conserve a specific sub-culture which advocates maximal separation from the general secular culture and adherence to tradition, thus retaining its youngsters within the community.
2. This has significant implications regarding the use of IT, which is heavily utilized in daily life in Israel, but with very clear-cut limits, such as various degrees of filtering and partial access to the Internet and other forms of IT.
3. To be sure, these practical solutions block to a great extent the free flow of information and knowledge in the ultra-orthodox community and have adverse implications both nationally and individually. However, the ultra-orthodox community has been ready to pay this price in order to conserve its old traditions.

References
1. Friedman, M.: The ultra-orthodox society: sources, trends and processes. The Jerusalem
Institute for the study of Israel, Jerusalem (1991)
2. Hovav, L.: Ultra-orthodox children literature – realistic or didactic? Sifrut Yeladim ve-Noar
20(3–4), 20–35 (1994). (in Hebrew)
3. Kaplan, K., Shtadler, N.: Leadership and authority in the ultra-orthodox society in Israel-
Challenges and Alternatives. Van-Leer Institute, Jerusalem (2009). (in Hebrew)
4. Karmi-Laniado, M.: Ideologies and their reflections in children literature. Dvir, Tel-Aviv
(1983). (in Hebrew)
5. Regev, M.: Children literature – reflections of society, ideology and values in Israeli children
literature. Ofir, Tel-Aviv (2002). (in Hebrew)
6. Rotenberg, Y.: A remedy to lack a livelihood. ‘Be-Sheva’, 26 September 2019, p. 32 (2019).
(in Hebrew)
7. Segev, Y.: Literature as creating and reflecting a narrative: the national-religious children
literature as a test-case. Talelei Orot; Annual of Orot Yisrael College, vol. 15, pp. 229–242
(2009). (in Hebrew)
8. Segev, Y.: Themes and trends in the ultra-orthodox children literature from the 1990’s on. Oreshet 4, 327–355 (in Hebrew)
9. Stadler, N., Lomsky, F.E., Ben Ari, E.: Fundamentalist citizenships: the Haredi challenge.
In: Ben-Porat, G., Turner, B.S. (eds.) The Contradictions of Israeli Citizenship: Land,
Religion, and State. Routledge, Milton Park (2011)
10. Tsweek, G.: The state misses the ultra-orthodox 8200. Yisrael Hayom [daily newspaper], 5
July 2019, pp. 12–13 (2019)
11. Yafe, O.: Psychological aspects of Ultra-orthodox children literature: child and self concepts.
Megamot 41(1–2), 10–19 (2001). (in Hebrew)
12. Yitzhaki, M.: Censorship in Israeli high school libraries. Yad la-Kore 31, 20–33 (1998). (in
Hebrew)
13. Yitzhaki, M.: Censorship in high school libraries in Israel: an exploratory field study. In:
Education for All; Culture, Reading and Information - Selected Papers from the Proceedings
of the 27th Annual Conference of the International Association of School Librarian-
ship. Ramat-Gan, IASL, pp. 265–275 (1998)
14. Yitzhaki, M.: Censorship in high school libraries in Israel: the role of the sectorial affiliation
factor; an extended nation-wide field study. In: Hughes, P., Selby, L. (eds.) Inspiring
Connections: Learning, Libraries and Literacy; Selected Papers from the Fifth International
Forum on Research in School Librarianship, pp. 231–247. IASL, Seattle (2001)
15. Yitzhaki, M.: The Internet as viewed by school librarians in Israel: conceptions, attitudes,
use and censorship. Yad la-Kore 35, 56–73 (2003). (in Hebrew)
16. Yitzhaki, M., Sharabi, Y.: Censorship in Israeli high school libraries; analysis of complaints
and librarians’ reactions. In: Lee, A.O.S. (ed.) Information Leadership in a Culture of
Change; Selected Papers from the Ninth International Forum on Research in School
Librarianship, pp. 183–202. IASL, Hong Kong (2005)
17. Yitzhaki, M.: Free flow of information and knowledge and use of IT in a conservative
community: the case of the ultra-orthodox in Israel. In: Proceedings of Informing Science &
IT Education Conference (In SITE) 2016, pp. 159–164 (2016). http://www.
informingscience.org/Publications/3503. Accessed 3 Nov 2019
18. Zicherman, H., Kahaner, L.: Modern ultra-orthodox – a Hareidi middle-class in Israel.
Jerusalem, Israel Institute for Democracy (2012). (in Hebrew)
19. Zicherman, H.: Black blue-white: a journey into the ultra-orthodox society in Israel, Tel-
Aviv, 360 p. (2014). (in Hebrew)
20. Hakak, Y., Rapoport, T.: Equality or excellence in the name of God? The case of ultra-
orthodox enclave education in Israel. J. Relig. 92(2), 251–276 (2011)
21. Foscarini, G.: Ultra-orthodox Jewish Women go to work: secular education and vocational
training as sources of emancipation and modernization. Annali di Ca’Foscari 50, 53–74
(2014)
22. Ferziger, A.S.: Beyond Bais Ya’akov: orthodox outreach and the emergence of Hareidi
women as religious leaders. J. Mod. Jewish Stud. 14(1), 140–159 (2015)
23. Friedman, Y.: Shtaygen. Benei-Brak (2006)
24. Stadler, N.: Yeshiva Fundamentalism: Piety. Gender and Resistance in the Ultra-Orthodox
World. NYU Press, New York (2008)
25. Stadler, N.: A Well-Worn Tallis for a New Ceremony. Academic Studies Press, Brighton
(2012)
Emoji Prediction: A Transfer
Learning Approach

Linrui Zhang(&), Yisheng Zhou, Tatiana Erekhinskaya, and Dan Moldovan

901 Waterfall Way Building 5, Richardson, TX 75080, USA

Abstract. We present a transfer learning model for the Emoji Prediction task described at SemEval-2018 Task 2. Given the text of a tweet, the task aims to predict the most likely emoji to be used within that tweet. The proposed method uses a pre-training and fine-tuning strategy, which applies knowledge pre-learned from several upstream tasks to the downstream Emoji Prediction task, alleviating the data scarcity issue suffered by most of the SemEval-2018 participants that used a supervised learning strategy. Our transfer learning-based model outperforms the state-of-the-art system (the best performer at SemEval-2018) by 2.53% in macro F-score. Apart from providing details of our system, this paper also provides a comparison between supervised learning models and transfer learning models for the Emoji Prediction task.

Keywords: Deep learning · Transfer learning · Emoji prediction

1 Introduction

Emojis are graphic symbols that represent ideas or concepts used in electronic messages and web pages [1]. Currently, they are widely adopted by almost every social media service and instant messaging platform. However, understanding the meanings of emojis is not straightforward, i.e. people sometimes have multiple interpretations of emojis beyond the designer’s intent or the physical object they evoke [2]. For example, the folded-hands emoji is intended to mean “pray”, but it is misused as “high five” on many occasions. A misunderstanding of emojis can reverse the meaning of a sentence and mislead people. Therefore, effectively predicting emojis from text is an important step toward understanding content, especially for emoji-enriched social media messages.
SemEval-2018 Task 2 [3] introduced an Emoji Prediction task. Given a text message including an emoji, the task consists of predicting that emoji based exclusively on the textual content of that message. Specifically, the messages are selected from Twitter data, and it is assumed that only one emoji occurs inside each tweet. Figure 1 illustrates an example of a tweet message with an emoji at the end.

Fig. 1. Example of a tweet with an emoji at the end.


Most of the participants in SemEval-2018 treat the task as a classification problem and train a machine learning classifier in a supervised manner. The most commonly selected classification methods are linear SVMs [4, 5] and Recurrent Neural Networks (RNNs) [6, 7]. The experimental results showed that linear SVMs yield better performance than RNNs. For example, the top performer [4] compared SVM and RNN models and showed that SVMs perform consistently better than RNNs regardless of the training set size. In addition, their SVM models outperform the second-ranked system [6], which used a Bi-LSTM with attention, and the fourth-ranked system [7], which used a Bi-LSTM with several lexical features. Beyond the emoji prediction task, SVMs have also shown better performance than Deep Neural Networks in a series of other text classification tasks [8, 9]. This observation seems to contradict the general trend in NLP research that deep neural models show superior results to linear models. There are two plausible explanations: (1) Neural Network models typically require more training data than SVM models. (2) Twitter data normally contains short text; the short-range dependencies captured by the n-gram features in SVM models are effective enough for the classification problem, which presumably limits the benefit of RNNs in capturing long-term dependencies in long text.
Transfer Learning is a machine learning strategy that stores knowledge gained in solving some upstream tasks and then applies the stored knowledge to solve new but related downstream tasks [10]. In this paper, we designed a transfer learning model to evaluate whether leveraging the pre-learned knowledge from related tasks can help deep neural models achieve better performance than linear SVM models. Furthermore, we would like to evaluate whether transfer learning is a better strategy than supervised learning for the Emoji Prediction task. In addition, we conducted experiments to analyze the factors, e.g. fine-tuning set size, that may affect the performance of the transfer learning model.
Currently, almost all popular transfer learning-based NLP models follow a pre-training and fine-tuning paradigm, in which a model is pre-trained on several upstream tasks and then customized to solve downstream tasks by fine-tuning the model parameters with the training data of those downstream tasks. Typical transfer learning-based NLP models include GPT [11], BERT [12] and XLNet [13]. Our model is built with BERT and is constructed in three steps. First, we selected the pre-trained BERT model and customized its structure for the Emoji Prediction task. Second, we fine-tuned the customized structure with the corresponding training data. Third, we evaluated the performance of the fine-tuned model on the testing set.
We evaluated our model on the SemEval-2018 Task 2 benchmark and compared its performance with other learning models in the literature. Our transfer learning-based model achieves state-of-the-art performance. The primary contributions of this paper are as follows:
• We present a novel transfer learning-based method for the Emoji Prediction task. Our model achieves state-of-the-art performance on SemEval-2018 Task 2, exceeding the top performer by 2.53% in macro F-score.
• We compare the transfer learning-based model with supervised learning-based models (using SVMs and RNNs) and demonstrate the strengths and limitations of transfer learning-based models for the Emoji Prediction task.

Fig. 2. Overall pre-training and fine-tuning procedures of our model.

2 Model Description

The main structure of the proposed model is illustrated in Fig. 2. Part (a) shows the
basic architecture and the pre-training step of BERT. Ei and Ti are the token embed-
dings and the final hidden state with respect to the input token Tok i. Particularly, a
special classification token ([CLS]) is always added to the beginning of the input
sequence and its corresponding hidden state C is used as the aggregate sequence
representation for classification tasks. A separator token ([SEP]) is used to separate the
input sentences. 12 to 24 layers of bidirectional Transformer are used as the repre-
sentation layers to learn token representations (Ti) from token embeddings (Ei). For
more details about the architecture and the pre-training procedure of BERT, please
refer to the original paper [12].

2.1 Preprocessing
Generally speaking, preprocessing is a fundamental step for many NLP systems, as it directly affects the performance of the follow-up components in the system. In particular, social media message processing is challenging, since there is large variation in the vocabulary and expressions used in such data.

Table 1. Example of our text processor.


Original Gonna be #oneepicsummer 3 Days | May 7th 2016 https://t.co/10PEnv2pz5 @
Epic Residences
Processed gonna be <hashtag> one epic summer </hashtag> <number>
<allcaps> days </allcaps> | <date> <url> @ epic residences

We utilized the ekphrasis tool [14] as the text preprocessor. It can perform tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction for Twitter data. Table 1 shows an example of a tweet processed by ekphrasis. There are several benefits to using the preprocessed tweets as inputs. For example, ekphrasis can recognize dates (e.g. May 8, 1989) and replace them with labels (<date>). This helps the system reduce the vocabulary size without losing too much information.
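
To make this normalization concrete, the following is a minimal, illustrative Python sketch that reproduces part of the behavior described above (URL, date and number tagging plus lowercasing) using simple regular expressions. It is not the ekphrasis implementation itself; hashtag segmentation, spell correction and the <allcaps> annotation shown in Table 1 are omitted for brevity.

import re

# Simplified stand-in for ekphrasis-style normalization (not the ekphrasis API).
URL_RE = re.compile(r"https?://\S+")
DATE_RE = re.compile(
    r"\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?"
    r"\s+\d{1,2}(?:st|nd|rd|th)?(?:,?\s+\d{4})?\b", re.IGNORECASE)
NUM_RE = re.compile(r"\b\d+(?:\.\d+)?\b")

def normalize_tweet(text: str) -> str:
    """Lowercase a tweet and replace URLs, dates and numbers with tags."""
    text = URL_RE.sub(" <url> ", text)
    text = DATE_RE.sub(" <date> ", text)
    text = NUM_RE.sub(" <number> ", text)
    return " ".join(text.lower().split())

print(normalize_tweet("Gonna be #oneepicsummer 3 Days | May 7th 2016 "
                      "https://t.co/10PEnv2pz5 @ Epic Residences"))
# -> gonna be #oneepicsummer <number> days | <date> <url> @ epic residences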

2.2 Pre-training Step


BERT is pre-trained with two unsupervised upstream tasks: Masked LM and Next Sentence Prediction (NSP). In the Masked LM task, the system randomly masks out 15% of the tokens in each sequence and uses the rest of the content to predict the masked tokens. This allows the model to learn a deep bidirectional representation of a sequence. In the NSP task, the system is trained to predict whether a sentence B directly follows another sentence A in a corpus. This allows the model to learn the relationship between two sentences.
We leveraged the pre-trained BERT to build our model. Specifically, we selected the part of BERT that is designed for sequence classification (a.k.a. BertForSequenceClassification) and modified its structure to accept the input of the Emoji Prediction task for the fine-tuning step.

2.3 Fine-tuning Step

Structure: The structure of the customized BERT is illustrated in Part (b) of Fig. 2. It is similar to the structure of the original BERT model, except that it accepts single sentences as input. A classification layer is built on top of the sequence representation state C to generate the class labels of the emojis. The parameters within this model are pre-learned from the pre-training step.
Input Sequence Representation: The preprocessed tweets need to be transformed into the BERT input format (BERT input embeddings) before being sent to the fine-tuning step. The BERT input embeddings consist of three embeddings corresponding to the input tokens: (1) pre-learned token embeddings, (2) segment embeddings, indicating whether a token belongs to the first sentence A or the second sentence B, and (3) position embeddings, indicating the positions of the tokens in the sentence. The final input representation is constructed by summing the corresponding token, segment and position embeddings. A visualization of the BERT input representation of the sentence “the dog is hairy” can be seen in Fig. 3.

Fig. 3. The representation of sentence “the dog is hairy” in BERT input format.
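
As an illustration only, the sketch below shows how such a single-sentence input can be converted into the token ids, segment ids and attention mask that the model consumes; position embeddings are added internally by BERT. It assumes a recent version of the HuggingFace transformers library rather than the original BERT release, so parameter names may differ slightly from the code used in our experiments.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer.encode_plus(
    "the dog is hairy",        # single-sentence input, as in our setting
    add_special_tokens=True,   # prepends [CLS] and appends [SEP]
    max_length=128,            # Max_seq_length from Table 2
    padding="max_length",
    truncation=True,
)
# Token ids, segment (token type) ids and attention mask; segment ids are all 0
# for a single sentence, and the mask is 0 over the padding positions.
print(tokenizer.convert_ids_to_tokens(enc["input_ids"])[:6])
print(enc["token_type_ids"][:6], enc["attention_mask"][:6])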

Fine-tuning: To customize the model for the Emoji Prediction task, we fine-tuned the model parameters with the training data provided by SemEval-2018 Task 2. Cross-entropy loss is used as the objective function for the fine-tuning procedure, which is calculated as follows:
Loss = -\sum_{i=1}^{n} \sum_{j=1}^{20} y_i^j \log P_i^j \qquad (1)

where y_i^j is a binary indicator (0 or 1) of whether class j is the gold label of training sample i, and P_i^j is the predicted probability of class j for sample i; i ∈ [1, n] indexes the training samples and j ∈ [1, 20] indexes the class labels.
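
Equation (1) is the standard multi-class cross-entropy. As a minimal illustration (not part of the original system’s code), PyTorch’s built-in criterion reproduces it up to the choice of sum versus mean over the n training samples:

import torch
import torch.nn as nn

batch_size, num_labels = 4, 20
logits = torch.randn(batch_size, num_labels)        # classifier outputs for 20 emojis
gold = torch.randint(0, num_labels, (batch_size,))  # gold emoji class indices

# nn.CrossEntropyLoss applies log-softmax and averages -log P of the gold class.
loss = nn.CrossEntropyLoss()(logits, gold)

# Explicit form of Eq. (1), normalized by the number of samples.
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(batch_size), gold].mean()
assert torch.allclose(loss, manual)
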
The only new parameter introduced during the fine-tuning step is the number_of_labels in the output layer. We set it to twenty since there are twenty different classes to be predicted in the emoji label set. The rest of the hyperparameters are set to the BERT defaults, which are shown in Table 2.

Table 2. Hyperparameters for fine-tuning.


Hyperparameter Value
Max_seq_length 128
Train_batch_size 32
Learning_rate 2e−5
Num_training_epochs 3
Number_of_labels (output layer) 20
Bert_model Bert-base-uncased
Optimizers BERT Adam
Lower case in tokenization Yes
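
A condensed, illustrative sketch of this fine-tuning step is given below. It uses the modern HuggingFace transformers API with torch.optim.AdamW standing in for the original BERT Adam optimizer, and two toy tweets stand in for the roughly 500K preprocessed SemEval-2018 Task 2 training instances; apart from these substitutions, the hyperparameters follow Table 2.

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification

# Toy placeholders for the preprocessed tweets and their emoji class indices (0..19).
train_texts = ["gonna be <hashtag> one epic summer </hashtag>", "love this <url>"]
train_labels = [2, 0]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=20)

enc = tokenizer(train_texts, max_length=128, padding="max_length",
                truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=32, shuffle=True)   # Train_batch_size

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # Learning_rate
model.train()
for epoch in range(3):                                      # Num_training_epochs
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    labels=labels)                          # returns Eq. (1) loss
        out.loss.backward()
        optimizer.step()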

3 Experiments and Results

In this section, we introduce the corpus, present the experimental results and compare
our system with other learning models in the literature.

Table 3. Distribution of emoji labels (relative frequency in the train and test sets; emoji glyphs not reproduced here).

#   Train   Test   |  #   Train  Test
1   22.4%   21.6%  |  11  3.2%   2.9%
2   10.3%   9.7%   |  12  3.0%   3.9%
3   10.2%   9.1%   |  13  2.9%   2.5%
4   5.5%    5.2%   |  14  2.6%   2.2%
5   4.9%    7.4%   |  15  2.7%   2.6%
6   4.7%    3.2%   |  16  2.7%   2.5%
7   4.3%    4.0%   |  17  2.6%   2.3%
8   3.6%    5.5%   |  18  2.6%   3.1%
9   3.4%    3.1%   |  19  2.6%   4.8%
10  3.2%    2.4%   |  20  2.5%   2.0%

3.1 Corpus
SemEval-2018 Task 2 provided a corpus for Multilingual Emoji Prediction with
roughly 500K training data and 50K testing data in English track. It collected tweets
that include one of the twenty emojis that occur most frequently in the Twitter data.
The relative frequency percentage of each emoji in the train and test set is shown in
Table 3.

3.2 Experimental Results


Table 4 demonstrates the performance of our model compared with the top performers in SemEval-2018 Task 2. We selected the first-ranked Tubingen-Oslo [4], the second-ranked NTUA-SLP [6], the fourth-ranked EmoNLP [7], the sixth-ranked UMDuluth-CS8761 [5] and the seventh-ranked BASELINE system as the comparison systems. The macro-averaged precision, recall and F-score are presented. From the results, we can observe that our model achieves state-of-the-art performance, exceeding the top performer and the baseline model by 2.53% and 7.54%, respectively, in F-score.

Table 4. Comparison of the participating systems with our system by precision, recall and
macro F-score in the test set of SemEval-2018 task 2 English track.
Team Approach F-score Prec. Recall
Ours Pre-trained Model with BERT 38.52 40.64 41.76
Tubingen-Oslo SVMs, RNNs 35.99 36.55 36.22
NTUA-SLP RNNs 35.36 34.53 38.00
EmoNLP RNNs 33.67 39.43 33.70
UMDuluth-CS8761 SVMs 31.83 39.80 31.37
BASELINE Pre-trained Classifier with FastText [15] 30.98 30.34 33.00
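
For reference, macro-averaged scores of this kind can be computed with scikit-learn as sketched below; y_true and y_pred stand for the gold and predicted emoji indices (0..19) over the test tweets, and the values shown are toy placeholders rather than our actual predictions.

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 3, 0]   # toy gold emoji indices
y_pred = [0, 1, 2, 3, 1]   # toy predicted emoji indices

# Macro averaging gives each of the emoji classes equal weight.
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"macro P={p:.4f}  R={r:.4f}  F={f:.4f}")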

[Figure: macro F-score of our model plotted against the fine-tuning set size (in thousands of instances, 0–600), rising from about 29.05 to 38.52; the final results of BASELINE (30.98) and Tubingen-Oslo (35.99) are also shown for reference.]

Fig. 4. Learning curve of our model against the training set size (in thousands of instances).

3.3 Effect of Fine-tuning Set Size


The size of the fine-tuning data is a very important factor in a pretrain-finetune model; it directly affects the quality of the customized model for the downstream task. We therefore present the F-score of our model against the training set size in Fig. 4. To better visualize the comparison, the final results of the top performer and the BASELINE model are also labeled in the figure.
From Fig. 4, we can observe that the performance curve of the model increases with the size of the fine-tuning set. Our model surpasses the BASELINE model and the top performer when the fine-tuning set size reaches around 20K and 200K instances, respectively. This indicates that our model is gradually customized to the target task as the size of the corresponding fine-tuning data increases.

3.4 Results Analysis


We believe there are several reasons for the superior performance of the transfer learning model over the supervised learning models. One of the most important is the pre-learned knowledge used in the transfer learning model, since:
• The knowledge pre-learned from the upstream tasks provides a higher starting point for the downstream task, compared with a supervised learning model that is trained from scratch.
• As opposed to a supervised learning model, which relies heavily on the training data, the knowledge learned from upstream tasks has no dependency on the training data of the downstream task. This is essential for solving resource-scarce tasks that have little training data.
• The pre-learned knowledge may contain extra syntactic and semantic information, e.g. the relation between sentences or a deep bidirectional representation of a sequence, that cannot be learned directly from the training data of the downstream task.
We also observe that the size of the fine-tuning set significantly affects the fine-tuned model in terms of both accuracy (shown in Fig. 4) and speed. According to our experiments, the speed of the fine-tuning procedure is roughly 10K tweets/min, so it takes approximately 55 min to complete one epoch of fine-tuning, since there are 500K tweets in the fine-tuning set.

4 Conclusion

In this paper, we described our system for Multilingual Emoji Prediction at SemEval-2018 Task 2. Our pre-trained system, based on BERT, achieves state-of-the-art performance on the SemEval benchmark, exceeding the top performer and the BASELINE system by 2.53% and 7.54%, respectively, in F-score. In addition, we compared our transfer learning model with previous supervised learning models using SVMs and RNNs. The experimental results demonstrate that leveraging the pre-learned knowledge from upstream tasks can significantly increase system performance on downstream tasks, resulting in the superior performance of the transfer learning model over the supervised learning models. We also observe that the performance of the transfer learning model is strongly affected by the fine-tuning set. Optimizing the quality, as well as the quantity, of the fine-tuning set should be one of the future directions in designing transfer learning models.

References
1. Cappallo, S., Mensink, T., Snoek, C.G.: Image2emoji: zero-shot emoji prediction for visual
media. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1311–
1314. ACM (2015)
2. Barbieri, F., Ballesteros, M., Saggion, H.: Are emojis predictable? In: Proceedings of the
15th Conference of the European Chapter of the Association for Computational Linguistics:
Volume 2, Short Papers, pp. 105–111. Association for Computational Linguistics, Valencia,
Spain (2017)
3. Barbieri, F., Camacho-Collados, J., Ronzano, F., Anke, L.E., Ballesteros, M., Basile, V.,
Saggion, H.: SemEval-2018 Task 2: multilingual emoji prediction. In: Proceedings of the
12th International Workshop on Semantic Evaluation (SemEval-2018). Association for
Computational Linguistics, New Orleans, LA, United States (2018)
4. Coltekin, C., Rama, T.: Tubingenoslo at semeval-2018 task 2: SVMs perform better than
RNNs in emoji prediction. In: Proceedings of the 12th International Workshop on Semantic
Evaluation, pp. 32–36. Association for Computational Linguistics, New Orleans, LA, United
States (2018)
5. Beaulieu, J., Owusu, D.A.: Umduluth-cs8761 at semeval-2018 task 2: emojis: too many
choices? In: Proceedings of the 12th International Workshop on Semantic Evaluation,
pp. 32–36. Association for Computational Linguistics, New Orleans, LA, United States
(2018)
6. Baziotis, C., Athanasiou, N., Chronopoulou, A., Kolovou, A., Paraskevopoulos, G., Ellinas, N., Narayanan, S., Potamianos, A.: NTUA-SLP at semeval-2018 task 1: predicting affective content in tweets with deep attentive RNNs and transfer learning. arXiv preprint arXiv:1804.06658 (2018)
7. Liu, M.: Emonlp at semeval-2018 task 2: English emoji prediction with gradient boosting
regression tree method and bidirectional LSTM. In: Proceedings of the 12th International
Workshop on Semantic Evaluation, pp. 32–36. Association for Computational Linguistics,
New Orleans, LA, United States (2018)
8. Coltekin, C., Rama, T.: Discriminating similar languages with linear SVMs and neural
networks. In: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties
and Dialects, pp. 15–24. Osaka, Japan (2016)
9. Medvedeva, M., Kroon, M., Plank, B.: When sparse traditional models outperform dense
neural networks: the curious case of discriminating between similar languages. In:
Proceedings of the Fourth Workshop on NLP for similar language, Varieties and Dialects,
pp. 156–163. Association for Computational Linguistics, Valencia, Spain (2017)
10. West, J., Venture, D., Warnick, S.: Spring research presentation: a theoretical foundation for
inductive transfer. Brigham Young Univ. Coll. Phys. Math. Sci. 1, 32 (2007)
11. Radford, A., Narasimhan, K., Salimans, T., Sutskever. I.: Improving language understanding
by generative pre-training. https://s3-us-west-2.amazonaws.com/opennai-assets/research
covers/languageunsupercised/languageunderstandingpaper.pdf (2018)
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
13. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: generalized
autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237
(2019)
14. Baziotis, C., Pelekis, N., Doulkeridis, C.: Datastories at semeval-2017 task 4: deep lstm with
attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th
International Workshop on Semantic Evaluation (SemEval-2017), pp. 747–754. Vancouver,
Canada (2017)
15. Joulin, A., Grave, E., Bojanowski, R., Mikolov, T.: Bag of tricks for efficient text
classification. arXiv:1607.01759 (2016)
On the Emerging Area of Biocybersecurity
and Relevant Considerations

Xavier-Lewis Palmer1, Lucas Potter1, and Saltuk Karahan2(&)


1 Biomedical Engineering Institute, Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23529, USA
2 Department of Political Science and Geography, Old Dominion University, Norfolk, VA 23529, USA
skarahan@odu.edu

Abstract. Biocybersecurity is a novel space for the 21st century that meets our
innovations in biotechnology and computing head on. Within this space, many
considerations are open for and demand consideration as groups endeavor to
develop products and policies that adequately ensure asset management and
protection. Herein, simplified and brief exploration is given followed by some
surface discussion of impacts. These impacts concern the end user, ethical and
legal considerations, international proceedings, business, and limitations. It is
hoped that this will be helpful in future considerations towards biocybersecurity
policy developments and implementations.

Keywords: Bio · Biocybersecurity · Cyberbiosecurity · Security · Bio-Security

1 Introduction

Biocybersecurity presents a new way of exploring how we protect our societies. It can be thought of in part as an extension of cybersecurity, which involves the protection of systems, made of hardware and software, from unauthorized access and attacks. Biocybersecurity is alternatively referred to as Cyberbiosecurity, which, according to the Peccoud Lab, exists at the intersection of cybersecurity, cyber-physical security, and bio-security, and focuses on mitigating risks within and relating to their intersections [1, 2]. A growing need exists for expertise in this field as we live in a world where computer systems and biotechnology are increasingly ingrained in day-to-day life, in both developed and developing economies. Furthermore, strong lines eventually need to be drawn when determining where biocybersecurity and other fields end, in order to adequately allot resources and focus towards mapping vulnerabilities and preventing exploits that may occur and evolve [1]. To this end, we have provisionally defined biocybersecurity thus: any cybersecurity system where a biological component, target, or interlock is involved in the terminal or intermediate stages. The first potential hurdle is: is there a clear and present need for the development of an entirely new field of study? Let us first turn to the current ease of obtaining biological data. Whereas sequencing DNA used to be a rigorous process, it has gotten easier [3]. In fact, the Human Genome Project was projected to last 15 years and was completed in 13,
demonstrating that the speed of computing and insights into the structure of life have made it easier than ever to obtain, disseminate, and utilize biological data [3]. Secondly, the processing of and accessibility to the mentioned data are easier than ever. The physical barrier to acquiring healthcare data has been demolished in the name of digital ease of access and patient-centered care. So, while the hardware to crack conventional cybersecurity barriers is more prevalent, the safeguards of that data have been chipped
away. Thirdly, new computational platforms question the nature of the separation of
biology and computing, leading to a more tightly integrated biocybersecurity process
[3, 4]. Some rising platforms even call into question the contemporary understanding of
typical cybersecurity processes and could circumvent typical security at the cost of
creating entirely new (and unforeseen) problems that arise from biological matter and
the application of medicine being used to perform computations [1, 4]. An entirely new group of subdisciplines may be needed to understand the unknown complications that arise from the use of these platforms [5]. All these features of modern healthcare and technology combine, meaning that biological data can be applied in more ways. For instance, the process of implicit authentication using biological data is now possible with COTS (Commercial, Off-The-Shelf) components. A smartwatch can access multiple kinds of information, including heart rate data and more [6]. An RHR monitor can be used as a simple mode of implicit authentication, matching a set of recorded data to a user accessing a work terminal. However, this means that said data would, somehow, be accessible to other, perhaps nefarious, individuals. As touched on above, part of the rise of potential threats in the field of biocybersecurity is the demand for easier and faster access to data. Patients desire faster, more convenient access to medical records, while medical research companies need larger and more comprehensive trials and data sets to remain viable for the more technically demanding medical interventions currently used [7–10]. Medical companies also need to increase
their awareness of the potential of malware to compromise device outputs; for example, one research team recently demonstrated an algorithm that could modify CT scans to mislead sick patients into believing that they are healthy, and vice versa [11]. Inadequate defenses against such malware and inadequate protection of patient data could brew a maelstrom of related crises, on the scale of ransomware outbreaks [11]. Devices and systems that demonstrate the confluence of biology and cybersecurity include thumbprint scanners, retina scanners, digitized healthcare records, forensics databases, DNA sequencing databases, and pharmacology records. All of these could be accessed and used for threats in the biocybersecurity domain. The potential growth in demand for biological data is seen, for example, in the rise of services to sequence and interpret DNA, such as ancestry and health services based on DNA analysis. For a more concrete example of what is meant by our definition of biocybersecurity: let us say that a user is at a computer using implicit authentication. The computer’s security system tracks the eyes of the user, and if the saccadic rhythms change, the computer locks the user out. Under our definition, this would fall under biocybersecurity, as the system is using biological inputs or data as an intermediate step, in this case as a preventative interlock.
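
To make the notion of a biological interlock more tangible, the following is a purely hypothetical Python sketch of the implicit-authentication example above, using heart-rate readings rather than eye movements; the enrolled profile, the threshold and the lock action are all invented for illustration and do not describe any existing product.

from statistics import mean, stdev

# Hypothetical enrolled resting-heart-rate profile (beats per minute).
ENROLLED_RHR = [62, 64, 61, 63, 65, 62, 60, 64]

def matches_profile(live_readings, profile=ENROLLED_RHR, z_threshold=3.0):
    """Return True if all live readings stay within z_threshold standard
    deviations of the enrolled mean."""
    mu, sigma = mean(profile), stdev(profile)
    return all(abs(bpm - mu) <= z_threshold * sigma for bpm in live_readings)

def check_session(live_readings):
    # In a real system the lock would be enforced by the OS or session manager.
    return "session active" if matches_profile(live_readings) else "terminal locked"

print(check_session([63, 64, 62]))   # consistent with the profile -> session active
print(check_session([97, 101, 99]))  # inconsistent readings       -> terminal locked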

2 Impacts and Considerations


2.1 Some Ethical Considerations
Let us consider some ethical conundrums that help us view problems within the field from another angle, relating to the previous examples. We will do so with a few, including the trolley problem and the Gettier problem. The trolley problem states that a trolley can head down one of two tracks, and a moral operator selects which track to send it down, killing whoever is on the chosen track. One considerable problem in this context is that with many of these bulk DNA or other -omics analyses, the operator is not directly at the switch. In fact, they may produce or remove multiple switches, which complicates the ethical calculus involved in handling digitalized biological data. The use can get out of hand. The operators’ data collection gives way to many more switch pullers, those who get their hands on the data, who can affect a great many lives beyond the original intention. Now let us pivot to another potential scenario with problems
that might be faced: a business has access to a large amount of personal DNA data that gets misused. They use it to discover the prevalence of people who enjoy a certain kind of sugar more, or consume considerably more of it, than others. If they then mold their advertisements based on this data, either personally with internet-based ads or in bulk, are they responsible for the health of this population of people? With this data, companies within the food and beverage industry have an extra curated set of people who may have greater difficulty exercising proper agency over their dietary decisions. At the same time, it works against those companies’ interests not to use such data.
Insurance agencies might obtain and use this data to modify payment or coverage rates with diseases such as diabetes and other metabolism-linked disorders in mind, juggling their business model and consumer finances. After all, this must additionally be viewed in light of the emergent actions that appear from this data merely being available and obtainable. Once produced, the data is subject to analysis by internal actors who may not share the company’s motives, by companies that have data-sharing agreements, by malicious actors who may leak or funnel said data to other groups at multiple levels of power, and, even harder to discern, to meta-analysis by an independent company able to link said DNA to the customers involved. Furthermore, analyses, both core and meta, of this data by the companies it is shared with can lead to further emergent concerns of abuse. To summarize the Gettier problem, it is a case in which an agent can have a justified basis for belief in a proposition and still be wrong about the fact of the matter. The Gettier problem can prompt us, within biocybersecurity, to consider what data we collect and hold, but also to ask why, in what form and under what conditions we interface with the data and meta-data, in terms of the risks posed if we are wrong about how such data might be used and abused. One must bet on the possibility of being wrong and the consequences that follow from that. We can relate this to biocybersecurity in a case where the same company above has a justified belief in its level of security, and yet suffers a breach that results in the leak of millions of sensitive and valuable biometrics. Companies must be willing to maintain ethics boards and rigorous standards to limit unnecessary data collection, holdings, transfers, eyes on data, and the time for which said data is held. They must be prepared not only for damage control, but for compensation and talks with the public that they serve, so as not to damage
public perceptions of biocybertechnologies. For example, is the user of a hypothetical system made aware that their eye movements or other biosignatures or biological means of expression are being tracked? Can the user manually turn this feature on and off? If not, they should be made aware and given that ability. In case of potential abuse that they may have overlooked, the company must be ready to responsibly deal with the mountain of problems that may follow. In general, ethical approaches to biocybersecurity must be comprehensive.

2.2 End User and Social Impacts


As Biocybersecurity policy matures, it is ever prudent to consider social implications
that exist in a world where biocybertechnologies and those adjacent become more
prominent and their misuses become more of a threat. This increasingly applies to
interconnected technologies that we easily and often take for granted, especially those
that have benefitted from recent life sciences research as mentioned earlier. Let us
consider the “Internet of Things”, known as IOT and the devices that can fall under this
paradigm. IOT can be thought of as a mass constellation of devices connected to the
internet [12]. You may recognize them in the form of commonly used products such as
refrigerators that report on the quality of its contents or remind you on when to restock,
your wearable exercise equipment that gives your heartrate or temperature, medical
autoinjectors that monitor or regulate your insulin supply and report to your doctor,
rooms that monitor your position and try to keep the room at a suitable temperature for
you, implants that augment features of your body, or even more simply, your smart-
phone, with its bevy of sensors. Each of these devices gathers and transmits a variety of
data that can directly or indirectly characterize consumers in ways that they may or may
not consent to. Quite easily, a consumer can consent to the use of a device that monitors
their heartrate, but to a skilled analyst, studying the heartrate over significant amounts
of time can reveal one’s sleep, work, schooling, romantic, diet, and social behavior, in
ways that the consumer certainly wouldn’t easily consent to. The same considerations
can be applied to the earlier mentioned refrigerator, exercise equipment, and medical
equipment – they all can give data which can generate a mesh of complex stories in
different curated combinations and when interpreted differently [12]. Even when
guarded with a degree of caution, a skilled hacker can gain access to said data, which
leaves an ever-existing risk with the nature of said technology for anyone with privacy
in mind. With such data ever open to exploitation, one huge social implication is
increased societal fear, and one to follow is the erosion of trust in advanced technology.
Depending on how ingrained said technology is within companies or governments, this
can mean erosion of trust and cooperation with those entities as these technologies
become increasingly exploited at the detriment of citizens. Companies and large
governments would do well to tread cautiously while and when employing these
technologies. Failure to reign in control of said data could lead to mass social disarray,
which could be irreparably injurious to societal stability, depending on the extent of the
damage. One more source of lay perceptions of biocybersecurity that no doubt affect
policy is popular media in how it has influenced how we may see and interact with such
technologies. One considerable influencer is that of Cyberpunk culture, which
encompasses futures that push the boundaries of technologies, leading to a blending or
enhancement of humans and their technology in often unique and beneficial ways [13,
14]. In some stories, such innovations arise from a lack of oversight, or from reduced
confidence in the ability of the government to adequately assuage the needs of an
increasingly frantic populace in the face of ever-growing technological reliance.
Cyberpunk culture has also contributed to the growth of cultures like the Maker
movement, which is composed of individuals who often resist traditional, institutional
control of technologies while self-policing [13–15]. With respect to biology, some of
them are addressing prosthetics, implantable electronics, gene editing and protein
engineering, and bio-adjacent biotechnologies with wide appeal and the potential to
correct for deficiencies in their communities [14, 15]. An easy case to consider is the
failure of the US government to control drug prices, which has led some groups, such
as community bio labs that have met with increasing success at mobilizing their
communities, to take matters into their own hands and produce those drugs or their
analogues themselves [13–15]. Less benign efforts have simply been to engineer other
means of producing food, whereas others aim to re-write parts of life itself through the
creation of synthetic organisms [13–16]. Some of these groups may or may not apply
for government funding, instead pursuing their own path to innovation through private
or self-funded measures. Examples can be seen among groups in the Community Bio
movement, which arose out of the Maker movement and in which people have been
inspired to pursue these research areas and more through a mix of traditional and non-
traditional cooperation, with mixed success [17–19]. Plenty of these successes have
resulted in start-ups that deal in large amounts of tracked biometric data or material for
improving health outcomes or expanding bodily functions [14–19]. Much of what is
thought of as cyberpunk in science fiction has reached reality, which implies that the
time to think ahead regarding the protection of biometric data is now; there is little
reason to suggest that these projects will not become even more complex. Overall,
there is much to consider socially as we pursue cyberbiotechnological policies that
tackle our increasing reliance on, and potential overexposure to, technology. Given that
the technology already exists in large amounts, and that data is already being generated
in volumes and at rates that have the potential to stir and stoke negative public action,
we need to bolster our focus on the further social implications of such technology.
Failing to do so could undo many societal gains within technologically advanced
nations.

2.3 Policy and Legal Impacts


Quite a few existing policies have provisions and objectives that groups would be wise
to consider factoring into their cybersecurity policies. Some worthy of mention are the
Nagoya Protocol, the Genetic Information Nondiscrimination Act, the Dual Use
Research of Concern policy, and the Health Insurance Portability and Accountability
Act [2]. Each will be briefly summarized and drawn from. The Nagoya Protocol, known
formally as the Nagoya Protocol on Access to Genetic Resources and the Fair and
Equitable Sharing of Benefits Arising from their Utilization (ABS) to the Convention
on Biological Diversity, is an agreement signed in 2010 that set out a framework for
sensible sharing of and access to genetic resources, within the scope of preserving
biodiversity [2, 20]. Its aims include ensuring flexible, consent-based access to genetic
resources that respects the
jurisdiction to which said resources and/or their owners belong, thereby protecting
commercial and academic chains of value. The Genetic Information Nondiscrimination
Act, passed in 2008, aims to protect people from DNA-based discrimination by health
insurers and employers. A core weakness is that life and disability insurance, as well as
long-term care plans, are not covered, leaving people at the mercy of state laws [2, 21].
The Dual Use Research of Concern policy, implemented in 2012, outlines measures to
regulate life sciences research that could be double-edged; research that pursues overly
risky methods or is otherwise unethical faces defunding and additional potential
penalties [2, 22]. Lastly, the Health Insurance Portability and Accountability Act,
originally enacted in 1996, outlined a means of protecting citizens' medical data while
making it available to health professionals in order to allow adequate, if not superior,
care [2, 23]. This is increasingly important in an industry where data-driven care is used
to deliver more informed treatment, wherein health professionals can learn of
complications and patient differences, allowing for more personalized and accurate care
of an individual and reducing confusion. What can be taken from the existence of these
policies is that, in the context of cybersecurity, biocybersecurity policies will need to be
flexible, based on consent, and oriented toward the benefit of those whose information
is at risk. Policies not taking this into account are likely to face considerable legal
action.

2.4 International Impacts


Historically, the connection between health and international security was based on
the spread of diseases and disease-related casualties in wars. As scholars focused on
political stability and considered the relation between democracy, growth, and political
stability, the relation was mostly seen in terms of democracy's supporting or hindering
role in growth and stability [24]. However, democracies also require economic and
social well-being for political stability. A deterioration in public health has the potential
to fuel political instability and public unrest in democratic nations [25]. This direct
relationship between public health and political stability allows international actors
(mostly autocratic ones) to use public health as a tool of coercive power in international
relations. One other factor influencing general public health is the ascent of
globalization, with its positive and negative influences. While globalization has been a
factor in the spread of diseases, the technology that brought widespread connectivity
has at the same time highly benefited health service providers in reaching populations
in remote corners of the world [26]. The management of resources and their delivery to
the areas of greatest need have likewise been facilitated through global connectedness.
All of this naturally raises concerns about the security, reliability, and resilience of that
connectedness. The link between interconnectedness, public health, and political
stability naturally makes biocybersecurity a concern for international relations.
Furthermore, this relevance is strengthened when international political economy is
considered. Using the example from the Ethical Considerations section above, the
fictitious scenario below can help us understand the effects. Let us assume that one
nation or NGO finds that a competing nation is particularly susceptible to a specific
non-infectious condition. To
make this example concrete, let us say Nation A is particularly susceptible to type 2
diabetes mellitus, and Nation B is a producer of fructose. If Nation B increases
marketing, increases supply, and seeks trade deals to increase the uptake of fructose in
Nation A, does this count as a kind of stochastic act of war? This and similar questions
form the landscape defining the impact of biocybersecurity on international security.

2.5 Business Impacts


Potential business impacts of biocybersecurity policies are wide reaching, with
significant effects on consumer trust, intellectual property (IP), domestic and
international supply chains, and ultimately capital. In terms of consumer trust, the
business impacts seen today from data breaches are likely not as far-reaching as they
will be in the future. This is potentially due simply to a lack of understanding of the
value of user data and a comparatively loose coupling of individuals to their data. As
more technically literate people become the primary consumers of such businesses, the
mishandling of data breaches may have more dire consequences, especially if the data
is biological in nature. In the same vein, as the demand for faster innovation rises, more
biological IP (potentially even in the form of trade secrets) will be placed in digital
formats, which renders it vulnerable to theft. Additionally, the new avenues of security
breaching will not solely be in the domain of more advanced technology; as historical
examples demonstrate, even the most current technologies can be breached by
relatively simple methods [27, 28]. A tangentially related topic is the targeting of the
synthetic biology supply chain. The supply chain of any manufacturing company is an
important matter, but the biological field has its own major dangers and limitations in
the items that it requires and exports [28]. At the core of the problems above lies the
possibility of considerable economic damage in the form of lost capital and weakened
economic sectors linked to cyberbioeconomies.

2.6 Complications and Limitations


The emerging spread of data processing methods into biological domains is, from a
scientific perspective, a worthy and meaningful goal for many fields. However, the
demand for faster and easier accessibility, the dependence on biometric data for
security purposes, and the potential growth of automated biological analysis together
pose a looming threat in the coming world. It is the hope of the authors that this work
accomplishes two goals: (1) that the "failure of imagination" which led to many threats
in the cybersecurity domain will occur to a lesser extent after reading it, and (2) that the
novelty of biodata will be seen for what it is – not just an interesting benefit of the
coordination of biology and the computational sciences, but also a venue of attack.

2.7 Conclusion
There is little rationality in denying the currently robust and potentially explosive
growth of biocybersecurity as a field of thought, research, and action. The dangers
presented in this paper, as well as the complications of our world, are clear. The
question is not what to do if these problems occur, but what we will do to prevent,
ameliorate, and treat them. What are we going to do about it?
References
1. Murch, R.S., So, W.K., Buchholz, W.G., Raman, S., Peccoud, J.: Cyberbiosecurity: an
emerging new discipline to help safeguard the bioeconomy. Front. Bioeng. Biotechnol. 6, 39
(2018). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5895716/
2. Dieuliis, D., Lutes, C.D., Giordano, J.: Biodata risks and synthetic biology: a critical
juncture. J. Bioterrorism Biodefense, 09(01) (2018). https://doi.org/10.4172/2157-2526.
1000159
3. Chial, H.: DNA sequencing technologies key to the human genome project. Nat. Educ. 1(1),
219 (2008). https://www.nature.com/scitable/topicpage/dna-sequencing-technologies-key-
to-the-human-828/
4. Trafton, A.: MIT News Office. Scientists program cells to remember and respond to series of
stimuli. http://news.mit.edu/2016/biological-circuit-cells-remember-respond-stimuli-0721.
Accessed 21 July 2016
5. Radanović, I., Likić, R.: Opportunities for use of blockchain technology in medicine. Appl.
Health Econ. Health Policy 16(5), 583–590 (2018). https://doi.org/10.1007/s40258-018-
0412-8
6. Arriba-Pérez, F.D., Caeiro-Rodríguez, M., Santos-Gago, J.: Collection and processing of
data from wrist wearable devices in heterogeneous and multiple-user scenarios. Sensors 16
(9), 1538 (2016). https://doi.org/10.3390/s16091538
7. Patient Demand for Patient-Driven Health Information (2018). https://catalyst.nejm.org/
patient-demand-for-patient-driven-health-information/
8. Providers Turn to Portals to Meet Patient Demand, Meaningful Use (2012). https://journal.
ahima.org/2012/08/23/providers-turn-to-portals-to-meet-patient-demand-meaningful-use/
9. Doi, S., Ide, H., Takeuchi, K., Fujita, S., Takabayashi, K.: Estimation and evaluation of
future demand and supply of healthcare services based on a patient access area model. Int.
J. Environ. Res. Pub. Health 14(11), 1367 (2017). https://doi.org/10.3390/ijerph14111367
10. Merrill, R.A.: Regulation of drugs and devices: an evolution. Health Aff. 13(3), 47–69
(1994). https://doi.org/10.1377/hlthaff.13.3.47
11. Mirsky, Y., Mahler, T., Shelef, I., Elovici, Y.: CT-GAN: Malicious Tampering of 3D
Medical Imagery using Deep Learning (2019). https://arxiv.org/abs/1901.03597
12. Pauwels, E., Denton, S.W.: The internet of bodies: life and death in the age of AI. Calif.
Western Law Rev. 55(1), 221 (2019). https://scholarlycommons.law.cwsl.edu/cgi/
viewcontent.cgi?article=1667
13. de Beer, J., Jain, V.: Inclusive innovation in biohacker spaces: the role of systems and
networks. Technol. Innov. Manage. Rev. 8(2) (2018). https://doi.org/10.22215/timreview/
1133
14. Parisi, L.: What can biotechnology do? Theory Cult. Soc. 26(4), 155–163 (2009). https://doi.
org/10.1177/0263276409104973
15. Wilbanks, R.: Real vegan cheese and the artistic critique of biotechnology. Engag. Sci.
Technol. Soc. 3, 180 (2017). https://doi.org/10.17351/ests2017.53
16. Agapakis, C.M.: Designing synthetic biology. ACS Synth. Biol. 3(3), 121–128 (2013).
https://doi.org/10.1021/sb4001068
17. Hyysalo, S., Kohtala, C., Helminen, P., Mäkinen, S., Miettinen, V., Muurinen, L.:
Collaborative futuring with and by makers. CoDesign 10(3–4), 209–228 (2014). https://doi.
org/10.1080/15710882.2014.983937
18. Landrain, T., Meyer, M., Perez, A.M., Sussan, R.: Do-it-yourself biology: Challenges and
promises for an open science and technology movement. Syst. Synth. Biol. 7(3), 115–126
(2013). https://doi.org/10.1007/s11693-013-9116-4
19. Gallegos, J.E., Boyer, C., Pauwels, E., Kaplan, W.A., Peccoud, J.: The open insulin project:
a case study for ‘Biohacked’ medicines. Trends Biotechnol. 36(12), 1211–1218 (2018).
https://doi.org/10.1016/j.tibtech.2018.07.009
20. About the Nagoya Protocol. https://www.cbd.int/abs/about/
21. Genetic Discrimination Fact Sheet (2008). https://www.genome.gov/10002328/genetic-
discrimination-fact-sheet/
22. United States Government Policy for Institutional (2015). https://www.phe.gov/s3/dualuse/
Documents/durc-policy.pdf
23. HHS Office of the Secretary, Office for Civil Rights, Ocr. Summary of the HIPAA Privacy
Rule (2013). https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.
html
24. Feng, Y.: Democracy, Political stability and economic growth. Br. J. Polit. Sci. 27(3), 391–
418 (1997). https://doi.org/10.1017/s0007123497000197
25. Price-Smith, A.T.: The Health of Nations: Infectious Disease, Environmental Change, and
Their Effects on National Security and Development. MIT Press, Cambridge (2002)
26. MHealth: New Horizons for Health through Mobile Technologies. World Health Organi-
zation (2011). https://www.who.int/ehealth/mhealth_summit.pdf
27. M. H. History of the Bureau of Diplomatic Security of the United States Department of State
[PDF]. United States Department of State Bureau of Diplomatic Security (2011). https://
www.state.gov/documents/organization/176589.pdf
28. Molteni, M.: Hackers Listen in on What Synthetic DNA Machines are Printing (2019).
https://www.wired.com/story/hackers-listen-synthetic-dna-machines/
The Impact of an Online Course of Inclusive
Physical Education on Teachers’ Skills
and Knowledge

Noa Choresh(&) and Yesha’ayahu Hutzler

The Academic College at Wingate, Netanya, Israel


noa_c@wincol.ac.il

Abstract. In order to integrate pupils with disabilities into physical education


(PE) classes, an online training course has been developed for PE teachers. The
goal of the course was to provide knowledge and skills for teachers concerning
issues of integration and inclusion of students with different physical and cog-
nitive disabilities. The purpose of the study was to examine the contribution of
the course to the teacher’s self-efficacy (SE) for integration of students with
various disabilities in PE classes. The research method was to collect background
data on about 100 teachers in the course, examining their knowledge of the
integration of students with disabilities and their ability to resolve various
situations at entry, and assessing their perceived self-efficacy at the start of the
course. Throughout the course, the teachers watched videos and performed
various tasks involving disability issues and possibilities for integrating students
with disabilities into their classes. At the end of the course, the teachers were
given a knowledge test and reported their perception of their own level of
knowledge. Findings indicate that the
teachers improved their capability for inclusion in general, and also for edu-
cating their peers in terms of fitness and motor skills. The teachers reported that
they gained their highest SE in the inclusion of children with developmental
coordination disorders (mean = 3.6), and that the smallest gain involved the
inclusion of those with spinal cord injury (mean = 3.0). The conclusion was that
teachers benefited from the online course mostly through reflecting on the
knowledge gained from their past experiences.

Keywords: Inclusion · MOOC (Massive Open Online Courses) · Physical education

1 Background

1.1 Children with Disabilities in Schools


Various surveys (e.g. [32, 33, 43]) have shown that the extent of participation in
physical activities of children with disabilities in and out of school is significantly
smaller than that of children without disabilities, and that they experience different
barriers to participation. According to a study in Israel [10], the participation of chil-
dren with special needs in physical education (PE) classes has made a significant
contribution to the feeling of burnout of PE teachers in the school. Furthermore, in a
study conducted among teachers in the Arab sector in the north of the country regarding
the ability to integrate students with disabilities in PE, a one-dimensional picture was
presented showing low to medium values of self-efficacy (SE) [19], with particularly
low values for the inclusion of children with visual impairments and mobility
impairments [18]. It is therefore worth noting that among PE teachers in Israel, there is
still a great lack of knowledge on ways of teaching that are adapted for children with
disabilities, especially children who challenge the movement system, such as those
with severe visual impairments or those using a wheelchair.

1.2 Inclusion
Towards the beginning of the 21st century, the inclusion trend spread within educa-
tional systems worldwide [29]. Inclusion is an educational worldview that supports the
active participation of students with a variety of abilities in school culture [23]. The
inclusion trend is based on the right of every learner to quality education that enables
personal development and the realization of the child’s potential, while taking into
account the diversity of children’s environments and abilities, in order to promote
opportunities and reduce barriers to learning [1]. In order to achieve this educational
goal, the school system requires an effort that encourages teachers to develop high
expectations of all students and to ensure that the educational programs are tailored to
the child’s needs [41]. The inclusion trend is based on early conceptions of the Swedish
researcher Bengt Nirje [27], who together with Wolf Wolfensberger and other col-
leagues [42] developed the concept of normalization to establish the right of people
with disabilities to live in a normative environment alongside people with normative
behavior. The goal is to prevent the phenomenon of isolation in people with disabilities
who develop non-normative behaviors in the absence of normative models.
Some believe that the educational system does not provide enough PE for students with
disabilities, and that these students must be provided with a supportive educational
environment that includes reducing barriers to participation, in line with the principles
of universal design, together with a set of unique adjustments according to the abilities
and needs of the child – defined as corrective teaching (Israeli Ministry of Education, 2015).
Inclusion in Physical Education. Professional perceptions that support the develop-
ment of teaching methods that are adapted to the inclusion of students with disabilities
were mentioned in the United States as early as the 1950s [38]. Further validation of the
requirement of inclusion of children with disabilities in PE and sport was provided by
the United Nations Education, Science and Culture Organization (UNESCO) [40],
which states that “it is needed to provide opportunities for participation in PE, physical
activity and sports for all people, especially children of school age, women and girls,
people with disabilities and local minorities in developed countries” [40]. Referring to
the international demand and the ongoing efforts to train teaching forces and devise
tailored teaching methods, the literature presents a variety of evidence that
demonstrates success in acquiring skills and positive experiences within inclusive
frameworks. On the other hand, evidence of negative experiences and social rejection
in North America [3, 13], in Europe [5], and in Israel [20] has also been revealed.
Unfortunately, one of the main arguments of many teachers is the lack of adequate
training, and this is also common in the United States [2, 15], where most states require
a union-based exam before teachers may tailor educational programs for students with
disabilities [22]. A survey of 129 universities sampled in 41 states examined the
characteristics of the teacher training courses used in the United States [30]. The
findings of this survey indicated that in most training frameworks (69%), the single
course offered is usually in a face-to-face setting, and only one percent reported using
an entirely online environment.
Physical Education Teachers’ Self-efficacy Towards Inclusion. Self-efficacy is a
concept that Albert Bandura (1977) embedded in the social-cognitive theory of
learning, expressing the individual’s belief in his or her ability to successfully complete
a concrete task. This term is considered to be among the key variables that influence
motivation to perform a particular activity and to maintain participation in the activity
despite difficulties that may emerge over time. Therefore, ensuring teachers’ SE is a
very important issue for promoting quality teaching (Bandura 1997). Studies that have
examined teachers’ perception of SE indicate that perceived professional ability is a
significant asset that distinguishes teachers who engage in their work, know how to use
effective strategies, and improve the performance of their students from teachers who
feel helpless [7, 44]. The SE of teacher educators toward the inclusion of students with
disabilities was first investigated through self-report questionnaires without the backing
of a theoretical model [21], and later by constructing a theoretical model that addresses
students with different disabilities individually [4].

1.3 21st Century Skills


In the 21st century, computer and information skills are essential for daily life in
general and academic development in particular [6, 11]. This trend includes lifelong
learning, self-directed learning, massive open online courses (MOOCs), and learning
from video clips, forums, and texts. All of the above methods have been found to be
relevant to the 21st century [6, 12]. Therefore, teachers must experience self-regulated
learning with feedback in the digital environment [16, 25].
Self-directed Learning. Self-learning, or self-directed learning, is an active and
constructivist process that engages the learner’s cognitive, metacognitive, and
motivational processes, during which learners set short-term goals, monitor and control
the learning process, and regulate their behavior and motivation [16, 28, 39]. Self-
learning can occur with a teacher who guides students on what to learn independently
and how to learn it, and who knows where to provide them with scaffolding and
learning aids [39]. Technology and communication can be used to assist both teachers
and learners in such self-learning, through sophisticated, varied, and tailored learning
aids and innovative teaching methods [25]. Self-learning and online self-learning may
also occur without a teacher (self-taught learning), albeit in a learning environment that
provides the required or recommended material, goals, and learning aids. These
methods draw on a variety of learning tools: interactive learning, videos, written
tutorials, and exercises.
The self-learning method began with textbook-driven self-learning and is now
increasingly prevalent in MOOCs. In recent years we have seen a dramatic increase in
these courses, where technology platforms have been used to facilitate video viewing
and interaction through forums [9]. Self-learning, self-learning without a teacher, and
learning from practice according to guidelines were found to be important and to
contribute to students even before they enter school [6]. Self-learning is characterized
by a personal and unique learning experience with personal meanings, the development
of plans, the setting of goals and objectives, and monitoring processes that represent
metacognitive aspects in the context of the learner’s task and the learner’s ability to
cope with and control his or her own learning [28, 39]. This is active learning [16, 28,
37], in which the learner needs to be task-oriented and to demonstrate planning,
organization, and self-control skills. Learners are required to find out what motivates
them, what helps them to assess themselves, and what the inhibitory factors are [17, 26,
28]. In order to engage in self-learning as characterized above, the learner must use
management and control skills [28, 39] that include process reflection and that require
high motivation.
According to Shamir and Blau’s study [37], based on field-based theory, four
components of an integrated self-regulated learning course were identified: (1) teaching
processes and the role of the lecturer; (2) learning processes and the role of learners;
(3) evaluation processes; and (4) the role of technology in supporting teaching-
learning-assessment processes. From a qualitative analysis of student comments in a
self-directed distance learning course taught by the “flipped classroom” method, the
following findings regarding self-directed distance learning emerged: (1) flexibility of
place and time in the learning process and responsibility for the learning process;
(2) the student “at the center” – active, controlling the way of learning and choosing
action strategies; (3) the regulation of learning processes; (4) the teacher as a facilitator
who provides scaffolding for learning and defines the quality of performance required;
(5) discussions and collaborations in the learning community; and (6) authentic
problem solving that promotes learning motivation. The researchers also found that
online self-learning fosters metacognitive thinking and requires monitoring of learning
strategies, learning-rate regulation, self-discipline, taking responsibility for learning,
and organizing time. In the context of Information and Communication Technologies
(ICT), it has been found that technology enables flexibility, helps learners access
content, and fosters learning and communication processes [31, 37].
Learning Tools in Self-regulated Learning. Learning aids and scaffolding are the
means available to students to learn the material, acquire a skill, apply it, and evaluate
it. A variety of learning aids have been found to contribute to students and even to
motivate them, if only by giving students a choice, which they perceive as helpful and
positive [14, 34], not least because diversity itself contributes more to learning than any
single method of teaching. In a study of the use of learning aids among homeschooled
learners (learners who practice independently with online learning aids
and only a minority under a teacher’s direction), it was found that learning takes place
when different learning aids are used, and not only one type of aid [8]. In addition, a
positive relationship was found between the use of different learning aids and the level
of education of children who studied in homeschooling or open schools.

1.4 The Rationale of the Study


In light of the above background, a multifaceted online course was first developed in
Israel to teach the inclusion of students with disabilities in PE classes at the school and
to mentor students and teachers in the field. The construction of this course was aided
by the results of U.S. studies that reported findings from online courses for training
teachers in this area, and found that (a) teachers’ self-efficacy perception increased
significantly after the online course as compared to teachers who received only written
material [24], and (b) the significant outcomes for teachers of this course were:
(1) changes in the teacher’s role perception, (2) the development of a professional
community learning concept, and (3) gaining of a deeper understanding of adaptive PE
issues [35]. In another study, perceptions of teachers who participated in online courses
were reported [36]. These perceptions focused on (a) communication between the
course instructor and the students, (b) student discussions, and (c) knowledge appli-
cation on issues revealed by the assessment of the students with disabilities. So far,
information is lacking that would help online self-learning course developers under-
stand the factors that may affect course success.
Therefore, we created an online training course in a MOOC style for inclusion of
children with disabilities in PE classes, and we studied the contribution of this course to
teachers’ knowledge and SE in inclusion.

2 Purpose

The purpose of the study was to investigate the impact of an online course for inclusive
PE on teachers’ knowledge, skills, and perspectives about integrating children with
disabilities into regular PE classes.

3 Method

One hundred and ten PE teachers participated in five online groups, each registered to
an online asynchronous course on a Moodle platform. All the courses had the same
content, comprising 14 units that included APE (adapted physical education) theory
and philosophy, adaptation principles, and disability-based and practice-related
modules. The course administrators were APE graduates and certified teachers who
received on-the-job training in online course administration. The training included
weekly meetings where issues were discussed after, or prior to, opening the units for
learning.
A preliminary questionnaire collected demographic data and general SE toward
inclusion in PE. An additional questionnaire was filled in toward the end of the courses,
collecting feedback on teachers’ perceptions of outcomes: knowledge, skills, and
inclusion capacity gained, as well as specific SE toward including children with
different disabilities in PE. Additional outcomes considered were the knowledge test
score and the number of learning events recorded. Regression analyses were used to
predict the outcomes from demographic data.
In addition to the descriptive analysis, the methods included correlations and
multiple regression analyses across SE and outcome variables measured during the
course, as well as analyses of variance for differences between SE at the beginning and
toward the end of the course, taking into account the impact of training and experience.
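Purely as an illustrative sketch of this analysis workflow (the authors' actual scripts are not published here; the file name, the column names such as se_pre and se_post, and the paired t-test used as a simplified stand-in for the repeated-measures comparison are all assumptions), the steps could look as follows in Python:

```python
# Purely illustrative analysis sketch; file name and column names are assumed.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("teacher_survey.csv")   # merged pre/post questionnaire export

# Correlations between SE gain and the other outcome variables
df["se_gain"] = df["se_post"] - df["se_pre"]
print(df[["se_gain", "knowledge_test", "learning_events"]].corr())

# Multiple regression: predict end-of-course SE from demographics and entry SE
model = smf.ols("se_post ~ se_pre + years_experience + prior_ape_training",
                data=df).fit()
print(model.summary())

# Simplified stand-in for the pre/post comparison (paired t-test instead of ANOVA)
t_stat, p_value = stats.ttest_rel(df["se_post"], df["se_pre"])
print(f"Paired comparison: t = {t_stat:.2f}, p = {p_value:.4f}")
```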

4 Results

Findings indicated that teachers improved their capability for inclusion in general, as
well as in educating peers in terms of fitness and motor skills (3.6, 3.3, 3.7, respec-
tively, from a range of 1–5). Figure 1 shows the teachers’ perceived capability average
scores in various popular PE activities.

[Bar chart “Perceived Capability Score by Activity”: scores on a 1–5 scale for peer
support training, fitness, games, individual training, safety, and motor skills.]

Fig. 1. Teachers’ scores on perceived self-efficacy or capability for the inclusion of children
with disabilities in various sports activities.

Teachers reported that they gained their highest SE in the inclusion of children with
developmental coordination disorders (mean = 3.6), while the smallest gain involved the
inclusion of those with spinal cord injury (mean = 3.0). Figure 2 shows the teachers’
perceived SE average scores regarding the inclusion of children with various disabilities.
[Bar chart “Perceived SE Score by Disability”: scores on a 1–5 scale for the disability
categories ID, ASD, DCD, CP, SCI, and VIS.]

ID = Intellectual Disorder; ASD = Autism Spectrum Disorder; DCD = Developmental
Coordination Disorder; CP = Cerebral Palsy; SCI = Spinal Cord Injury; VIS = Visual
Impairment

Fig. 2. Teachers’ scores on perceived SE or capability for the inclusion of children with various
disabilities in PE classes.

5 Discussion and Conclusion

Findings indicate that all the teachers concluded the course with good knowledge and
skills for the inclusion of children with various disabilities in a variety of sports
activities. We learned from the results that the course was fairly effective. Some cases,
such as the inclusion of children with a spinal cord injury and the training of peers to
support children with disabilities in class, still need to be improved.
We suggest that training courses for applied APE include practical workshops in
addition to digital tools, in order to reach higher self-efficacy in general.
A further investigation could be a follow-up study of the teachers years after the
course, during which time they will probably have taught a significant number of
children with disabilities, to determine whether they actually related to these children
and integrated them more effectively in their classes.

References
1. Ainscow, M., Miles, S.: Making education for all inclusive: where next? Prospects 145(1),
15–34 (2008)
2. Ammah, J.O., Hodge, S.R.: Secondary physical education teachers’ beliefs and practices in
teaching students with severe disabilities: a descriptive analysis. High Sch. J. 89, 40–54
(2005)
3. Blinde, E.M., McCallister, S.G.: Listening to the voices of students with physical disabilities.
J. Phys. Educ. Recreat. Dance 69, 64–68 (1998)
4. Block, M., Hutzler, Y., Klavina, A., Barak, S.: Creation and validation of the situational-
specific self-efficacy scale. Adapt. Phys. Act. Q. 29, 184–205 (2013)
5. Bredahl, A.-M.: Sitting and watching the others being active: the experienced difficulties in
PE when having a disability. Adapt. Phys. Act. Q. 30, 40–58 (2013)
6. Chee, T.S., Divaharan, S., Tan, L., Mun, C.H.: Self-Directed Learning with ICT: Theory,
Practice and Assessment, pp. 1–65. Ministry of Education, Singapore (2011)
7. Dibapile, W.T.S.: A review of literature on teacher efficacy and classroom management.
J. Coll. Teach. Learn. 9(2), 79–92 (2012). http://trace.tennessee.edu/utk_educpubs/31.
Accessed 20 Sept 2018
8. Donkor, F.: The comparative instructional effectiveness of print-based and video-based
instructional materials for teaching practical skills at a distance. Int. Rev. Res. Open Distrib.
Learn. 11(1), 96–116 (2010)
9. Draus, P.J., Curran, M.J., Trempus, M.S.: The influence of instructor-generated video
content on student satisfaction with and engagement in asynchronous online classes.
J. Online Learn. Teach. 10(2), 240–254 (2014)
10. Fejgin, N., Talmor, R., Erlich, I.: Inclusion and burnout in physical education. Eur. Phys.
Educ. Rev. 11(1), 29–50 (2005)
11. Fullan, M., Langworthy, M.: Towards a New End: New Pedagogies for Deep Learning.
Collaborative Impact, Washington (2013)
12. Geri, N., Winer, A.: Patterns of online video lectures use and impact on student achievement.
In: Eshet-Alkalai, Y., Blau, I., Caspi, A., Geri, N., Kalman, Y., Silber-Varod, V. (eds.)
Proceedings of the 10th Chais Conference for the Study of Innovation and Learning
Technologies: Learning in the Technological Era, pp. 9E–15E. The Open University of
Israel, Raanana (2015). (in Hebrew)
13. Goodwin, D.L., Watkinson, E.J.: Inclusive physical education from the perspective of
students with physical disabilities. Adapt. Phys. Act. Q. 17, 144–160 (2000)
14. Hahn, E.: Video lectures help enhance online information literacy course. Ref. Serv. Rev. 40
(1), 49–60 (2012)
15. Hardin, B.: Physical education teachers’ reflections on preparation for inclusion. Phys. Educ.
62, 44–56 (2005)
16. Horizon Report. Higher Education Edition. NMC (2016)
17. Huffaker, D., Calvert, S.: The new science of learning: active learning, metacognition, and
transfer of knowledge in E-Learning applications. J. Educ. Comput. Res. 29(3), 325–334
(2003)
18. Hutzler, Y., Barak, S.: Self-efficacy of physical education teachers in including students with
cerebral palsy in their classes. Res. Dev. Disabil. 68, 52–65 (2017). https://doi.org/10.1016/j.
ridd.2017.07.005
19. Hutzler, Y., Shama, E.: Attitudes and self-efficacy of Arabic-speaking physical education
teachers in Israel toward including children with disabilities. Int. J. Soc. Sci. Stud. 5(10), 28–
42 (2017). https://doi.org/10.11114/ijsss.v5i10.2668
20. Hutzler, Y., Fliess, O., Chacham, A., van den Auweele, Y.: Perspectives of children with
physical disabilities on inclusion and empowerment: supporting and limiting factors. Adapt.
Phys. Act. Q. 19, 300–317 (2002)
21. Hutzler, Y., Zach, S., Gafni, O.: Physical education students’ attitudes and self-efficacy
towards the participation of children with special needs in regular classes. Eur. J. Spec.
Needs Educ. 20(3), 309–327 (2005)
22. Kelly, L.: Adapted Physical Education National Standards, 2nd edn. Human Kinetics,
Champaign (2006)
23. Kugelmass, J.W.: The Inclusive School: Sustaining Equity and Standards. Teachers College
Press, New York (2004)
24. Kwon, E.H., Block, M.E.: Implementing the adapted physical education E-learning program
into physical education teacher education program. Res. Dev. Disabil. 69(1), 18–29 (2017)
25. McLoughlin, C., Lee, M.J.: Personalised and self regulated learning in the Web 2.0 era:
international exemplars of innovative pedagogy using social software. Australas. J. Educ.
Technol. 26(1), 28–43 (2010)
26. Mega, C., Ronconi, L., De Beni, R.: What makes a good student? How emotions, self-
regulated learning, and motivation contribute to academic achievement. J. Educ. Psychol.
106(1), 121 (2014)
27. Nirje, B.: The normalisation principle and its human management implications. In: Kugel,
R., Wolfensberger, W. (eds.) Changing Patterns in Residential Services for the Mentally
Retarded. President’s Committee on Mental Retardation, Washington, DC, chap. 7 (1969)
28. Nodoushan, M.A.S.: Self-regulated learning (SRL): emergence of the RSRLM model. Int.
J. Lang. Stud. 6(3), 1–16 (2012)
29. Pecora, P.J., Whittaker, J.K., Maluccio, A.N., Barth, R.P.: The Child Welfare Challenge:
Policy, Practice, and Research, 4th edn. Adline Transaction, London (2012)
30. Piletic, C.K., Davis, R.: A profile of the introduction to adapted physical education course
within undergraduate physical education teacher education programs. ICHPER-SD J. Res. 5
(2), 27–32 (2010). https://files.eric.ed.gov/fulltext/EJ913329.pdf. Accessed 29 Sept 2018
31. Platt, C.A., Amber, N.W., Yu, N.: Virtually the same?: student perceptions of the
equivalence of online classes to face-to-face classes. J. Online Learn. Teach. 10(3), 489–503
(2014)
32. Rimmer, J.A., Rowland, J.L.: Physical activity for youth with disabilities: a critical need in
an underserved population. Dev. Neurorehabilitation 11(2), 141–148 (2008)
33. Rimmer, J.H., Riley, B., Wang, E., Rauworth, A., Jurkowski, J.: Physical activity
participation among persons with disabilities: barriers and facilitators. Am. J. Prev. Med. 26,
419–425 (2004)
34. Rose, K.K.: Student perceptions of the use of instructor-made videos in online and face-to-
face classes. J. Online Learn. Teach. 5(3), 487–495 (2009)
35. Sato, T., Haegele, J.A.: Professional development in adapted physical education with
graduate web-based professional learning. Phys. Educ. Sport Pedagogy 22(6), 618–631
(2017)
36. Sato, T., Haegele, J.A., Foot, R.: In-service physical educators’ experiences of online
adapted physical education endorsement courses. Adapt. Phys. Act. Q. 34, 162–178 (2017)
37. Shamir, T.A., Blau, I.: “The flipped classroom” at the open university? Promoting personal
and collaborative self-regulated learning in an academic course. In: Eshet-Elkalay, Y., Blau,
I., Caspi, N., Gerri, N., Kelmn, Y., Zilber-Warod, W. (eds.) The 11 Conference for
Innovation Study and Learning Technologies Chaise: The Learning Person on the Digital
Era, pp. 226–233 (2016). (in Hebrew)
38. Sherrill, C.: Adapted Physical Activity, Recreation, and Sport: Crossdisciplinary and
Lifespan, 6th edn. McGraw-Hill, New York (2004)
39. Svinicki, M.D.: Student learning: from teacher-directed to self-regulation. New Dir. Teach.
Learn. 2010(123), 73–83 (2010)
40. UNESCO: International Charter of Physical Education, Physical Activity and Sport (2015).
http://unesdoc.unesco.org/images/0023/002354/235409e.pdf
41. Villa, R., Thousand, J.: Restructuring for Caring and Effective Education. Brookes,
Baltimore (2000)
42. Wolfensberger, W.P., Nirje, B., Olshansky, S., Perske, R., Roos, P.: The Principle of
Normalization in Human Services. National Institute of Mental Retardation. Online: Books:
Wolfensberger Collection (1972). http://digitalcommons.unmc.edu/wolf_books/1
43. Wright, A., Roberts, E., Bowman, G., Crettenden, A.: Barriers and facilitators to physical
activity participation for children with physical disability: comparing and contrasting the
views of children, young people, and their clinicians. Disabil. Rehabil. (2018). https://doi.
org/10.1080/09638288.2018.1432702
44. Zee, M., Koomen, H.M.Y.: Teacher self-efficacy and its effects on classroom processes,
student academic adjustment, and teacher well-being: a synthesis of 40 years of research.
Rev. Educ. Res. 86(4), 981–1015 (2016)
5G Service and Discourses
on Hyper-connected Society in South Korea:
Text Mining of Online News

Li Xu, Harim Yeo, Hyesun Hwang(&), and Kee Ok Kim

Sungkyunkwan University, 25-2 Sungkyunkwan-ro, Jongno-gu, Seoul, Korea


h.hwang@skku.edu

Abstract. This study explored social discourses on hyper-connected society by


using text mining of online news articles in South Korea. Online news data were
collected from a database of news articles provided by the Korea Press Promotion
Foundation using the R 3.5.3 program, and data cleaning and tokenization were
conducted. Then, topic modeling (Latent Dirichlet Allocation) and network analysis
(an N-gram language model) were performed. The number of topics was set to 10
based on the results of several LDA runs. Many words related to 5G next-generation
communications and innovative technologies are mentioned in several topics. In
addition, the results
showed that various social discourses, such as education, human life, societal
changes, governmental support, and industry, are currently being discussed as
major topics of a hyper-connected society. The results showed that hyper-
connected society not only signifies the emergence and application of innovative
communication technology, but also includes extensive changes to human life,
social relations, education, and industry.

Keywords: Hyper-connected society · Hyper-connectivity · Topic modeling · Social Network Analysis · 5G service

1 Purpose

With the rapid development of information and communication technology, our
everyday experiences are based on hyper-connectivity. A society based on hyper-
connectivity is known as a hyper-connected society, that is, a society in which people
and people, people and objects, things and things, and online and offline are connected
one to one, one to many, and many to many using digital technology [1]. This shift has
given rise to various agendas that are spreading across societies. In particular, the rapid
development of the communication technologies that promote hyper-connectivity is
expected to change the industrial structure at a macro level and human life patterns at a
micro level.
This study investigates the movement of change in human life and society by
analyzing how the social discourse on hyper-connected society is formed. South Korea
is experiencing the fastest transition to a hyper-connected society as the first country in
the world to commercialize 5G. This study therefore examines the social discourses
being formed in South Korea by text mining of online news articles.

2 Background

Due to the rapid development of information and communication technology,
connections are manifest in various forms, such as people-to-people, people-to-things,
and things-to-things [1]. Through such connections, a hyper-connected society is
formed, in which opportunities for growth and value creation are realized [2]. Through
the Internet, people can interact with each other regardless of time and space. Recently,
various forms of information have been actively generated and shared via YouTube
and SNS.
The 5G environment must be equipped to utilize the latest technologies, such as
connecting various smart devices or collecting and transmitting multiple kinds of data
[3]. The super-connectedness of the 5G service also contributes to the realization of a
hyper-connected society. It enables smart cities, for example, by connecting
information systems to other city infrastructure with the IoT (Internet of Things).
South Korea started the commercialization of 5G for the first time in the world on
April 3, 2019 [4]. The commercialization of the 5G service marks the advent of an era
of information and communication that is one step further along and has the potential
to progress toward a hyper-connected society. Meanwhile, despite the rapid
development of information and communication technologies and the worldwide level
of innovation, South Korea still lacks legislation to support industries based on new
communication technologies and even imposes many regulations on them. With both
rapid development of technology and industry and rigid regulation in South Korea,
various discourses have formed regarding its progress toward a hyper-connected
society.

3 Method
3.1 Data Collection
This study used news articles available online to explore aspects of a hyper-connected
society that are being discussed socially in Korea. News articles were crawled using
R3.5.3 on BigKinds (www.bigkinds.or.kr), a database of news articles provided by the
Korea Press Promotion Foundation. A total of 918 news articles were collected, dated
from April 3, 2019, when 5G was first commercialized, to August 7, 2019.

3.2 Data Cleaning


Refined data were extracted with nouns as the unit of analysis. The collected news
articles were constructed as data for text mining. The data were cleaned in the R 3.5.3
program by removing unnecessary words, symbols, and meaningless punctuation
marks, and by converting various colloquial expressions into regular expressions.
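The study performed this cleaning in R 3.5.3; as a rough, hypothetical analogue only, the following Python sketch illustrates the kind of rule-based cleaning described, where the regular-expression patterns and the tiny stop-word list are assumptions rather than the authors' actual rules:

```python
# Hypothetical Python analogue of the cleaning step (the study used R 3.5.3);
# the patterns and the tiny stop-word list are illustrative assumptions.
import re

STOPWORDS = {"기자", "뉴스"}  # assumed boilerplate terms (e.g. "reporter", "news")

def clean_article(text: str) -> str:
    text = re.sub(r"[^\w\s]", " ", text)      # strip symbols and punctuation
    text = re.sub(r"\b\d+\b", " ", text)      # drop standalone numbers (keeps "5G")
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean_article("5G 상용화!!! ‘초연결’ 시대가 열렸다… (서울=뉴스)"))
# -> "5G 상용화 초연결 시대가 열렸다 서울"
```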

3.3 Topic Modeling


Latent Dirichlet Allocation (LDA), a topic modeling method, was used to detect events
in the refined news data [5]. LDA is a generative statistical model that allows sets of
observations to be explained by unobserved groups that explain why some parts of the
data are similar [5]. A main field of interest for LDA is modeling the relations between
topics [5]; this can be achieved by using another distribution on the simplex instead of
the Dirichlet [5]. As a result, it is possible to calculate the distribution of topics in a
single document and the proportion of each topic within the set of documents [5].
Following previous studies that used LDA and set the hyperparameters to α = 0.5 and
η = 0.01, this study applied the same values, which are common settings in the
literature [5]. The number of topics was set to 10, which was judged best on the basis
of several LDA runs with between 6 and 40 topics.
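A minimal sketch of such an LDA run is shown below, using Python's gensim library rather than the R tooling used in the study; the number of topics (10) and the priors α = 0.5 and η = 0.01 mirror the settings reported above, while the placeholder token lists and all other parameters are assumptions.

```python
# Hypothetical gensim sketch of the LDA step (the study used R); only the
# topic count and the priors alpha = 0.5, eta = 0.01 come from the text above.
from gensim import corpora
from gensim.models import LdaModel

tokenized_docs = [
    ["5G", "상용화", "초연결", "사회", "산업"],
    ["스마트시티", "IoT", "플랫폼", "5G", "데이터"],
    # ... in practice, one noun-token list per cleaned news article (918 total)
]

dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=10,      # chosen after comparing runs with 6-40 topics
    alpha=0.5,
    eta=0.01,
    passes=10,
    random_state=42,
)

for topic_id, words in lda.show_topics(num_topics=10, num_words=5, formatted=False):
    print(topic_id, [word for word, _ in words])
```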

3.4 Network Analysis


The first step of the analysis is data tokenization: nouns must be extracted first to form
a network from the text of the collected articles. To tokenize the data, a total of 132,201
noun tokens were extracted by conducting a morphological analysis with the KoNLP
(Korean Natural Language Processing) package. After completing these tasks, an
N-gram language model was used to conduct the network analysis.
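As a hypothetical sketch of the bigram-based network construction (again in Python with networkx rather than the R/KoNLP stack actually used; the helper names, the min_count cutoff, and the commented usage are illustrative assumptions):

```python
# Hypothetical sketch of the bigram (N-gram, n=2) co-occurrence network;
# tokenized_noun_docs would hold the noun tokens produced by the KoNLP step.
from collections import Counter
from itertools import islice

import networkx as nx

def bigrams(tokens):
    """Yield adjacent noun pairs, e.g. ('초연결', '사회')."""
    return zip(tokens, islice(tokens, 1, None))

def build_network(tokenized_noun_docs, min_count=5):
    """Count bigrams across all documents and keep frequent pairs as edges."""
    counts = Counter()
    for doc in tokenized_noun_docs:
        counts.update(bigrams(doc))
    graph = nx.Graph()
    for (w1, w2), n in counts.items():
        if n >= min_count:
            graph.add_edge(w1, w2, weight=n)
    return graph

# Example (assumed input):
# g = build_network(tokenized_noun_docs)
# print(sorted(g.degree, key=lambda item: -item[1])[:10])  # most connected words
```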

4 Results
4.1 Results of Topic Modeling
The results of the LDA are shown in Table 1. Each of the 10 topics contains various
social discourses on the hyper-connected society that are currently being discussed.
First, many words related to 5G next-generation communications, such as “Auto-
matic Driving,” “Mobile Communication,” and “Technology” are mentioned in Topic
1. This indicates that changes based on information and communication technology,
which is the basis of a hyper-connected society, are becoming an essential factor. As an
industry that leads change, it is possible to see the emergence of communication-based
products and services, such as self-driving cars. Similarly, commercialization of 5G as
the “First” can represent the general characteristics of the hyper-connected society
itself; rapid “Speed” and “Network” indicate that the main attributes of the changes in
hyper-connected society are being discussed.
Second, Topic 6 presents a description of lower-level technologies that specifically
apply to develop solutions for corporate products and services, such as “Platform,”
“Cloud,” and “Service.” In addition, words such as “Technology,” “Solution,” and
“Game” are mentioned, which shows the content on the side of innovative technology
that will emerge in the hyper-connected society.
Third, Topic 4 features words, such as “Industry,” “Construct,” and “Develop-
ment,” as well as information on the overall structure of the industry and government
support to strategically create and foster it.
Fourth, Topic 3 and Topic 7 show content on societal issues and concerns in the
hyper-connected society. The words “People” and “Issue” were mentioned in Topic 3
and the words “Security” and “Safety” about “Information” emerged in Topic 7.
Finally, words such as “Education,” “Progress,” and “Support” appeared in Topic


2. Additionally, there has been a growing need for close cooperation between academia
and industry, with “Small and medium enterprises” and “Forums” appearing under
Topic 5. This shows that the need for an industrial-academic cooperative relationship is
being discussed.

Table 1. Results of topic modeling of news data.

     Topic 1               Topic 2      Topic 3                        Topic 4
1    Service               University   People                         Industry
2    World                 Major        Industry                       Business
3    Technology            Education    Fourth Industrial Revolution   Technology
4    Automatic Driving     Student      Internet                       Smart
5    Commercialize         Progress     Artificial                     Support
6    First                 Department   Change                         Development
7    Mobile Communication  Support      IoT                            Innovation
8    Era                   President    Innovation                     Enterprise
9    Communication         Amalgamate   Future                         Create
10   Network               Professor    Issue                          Construct

     Topic 5               Topic 6      Topic 7
1    Small businesses      Service      Security
2    Economy               Data         Technology
3    Subject               Technology   Service
4    Government            Cloud        Smart
5    Block chain           Enterprise   Information
6    Era                   Game         Construct
7    Hyper-connected       Telecom      Safety
8    Future                Solution     Telecom
9    Forum                 Market       Data
10   Enterprise            Platform     System

4.2 Results of Network Analysis


Figure 1 shows the results of a network analysis showing which semantic words form a
network within the text data.
Major 5G feature words, such as “High-speed,” “Hyper low-latency,” and “Hyper-
connected,” were connected [6]. In South Korea, where 5G is now commercialized, the
“Infrastructure,” “Ecosystem,” and “Platform” are being “Constructed,” indicating that
the speed of mobile communication is driving changes in societal and industrial
structure in a hyper-connected society. In addition, “Automatic Driving,” “Artificial Intelligence,”
“IoT,” and “Big data” are linked to “Core” and “Technology.” These “Technologies” are
linked to “Services,” such as “Connected cars,” “Clouds,” and “Mobile comm-
unications.”

Fig. 1. Results of network analysis of news data.

5 Conclusion

As South Korea is the first country to commercialize 5G, it is leading many of the
changes brought by the new generation of telecommunications. Therefore, this study
attempted to explore the aspects of the hyper-connected society that have been
discussed in Korean society.
First, the result of the Topic Modeling shows important aspects of the technologies
underlying the hyper-connected society. In particular, this shows that the innovative
performance that can be achieved by applying the leading Fourth Industrial Revolution
information and communication-related technologies to the industry is being discussed
as an essential topic.
Second, the analysis shows that a discussion of both negative and positive aspects
of a hyper-connected society is taking place simultaneously. In particular, the social
environment that is evolving in a hyper-connected society revolves around technology
and relationships among people; this is being discussed as one of the primary topics.
This implies that advanced communication technology does not just mean a problem of
speed and accessibility but should be able to support human life reliably through the
implementation of secure and reliable network technologies.
Third, in the case of South Korea, the government actively implements policies to
promote informatization and fosters industries related to information and
communication technology. The results reflect these government responses, including not only
the growth of self-sustaining industries but also the strategic creation of new industries
led by the government. Therefore, rapid development of a hyper-connected society can
be expected in Korea, where the government supports and fosters changes in the
information and communication environment.
Fourth, with the progress of the hyper-connected society, changes in technology and
the social environment have changed the demand for education. This has led
universities to offer new majors or strengthen relevant educational programs and has
brought recognition of the need for close collaboration between industry and academia.
These analyses show that a hyper-connected society does not mean only the
emergence and application of innovative communication technologies; it also
represents a wide range of changes encompassing human lives, social relationships,
education, industry, and more. In the future, we must prevent and adequately deal with
any problems of maladjustment, lag, or adverse effects that may arise in a hyper-
connected society. To this end, discussion of the hyper-connected society must not
focus solely on technical aspects. Comprehensive discussions are needed on “how”
information and communication technologies, which underpin human life, are being
used and “what” the results will be.
This study analyzed online news articles to understand how the new era of hyper-
connectedness has been discussed. In future studies, it is necessary to consider the
responses from people who are experiencing the progress of hyper-connected society.

References
1. Seungwha (andy), C., Sunju, P., Seungyong, L.: The era of hyper-connected society and the
changes in business activities: focusing on information blocking and acquisition activities. Int.
J. Manag. Appl. Sci. 3 (2017). ISSN: 2394-7926
2. YoungSung, Y.: The Advent of Hyper-Connected Society and Our Future. Hanulbooks, Seoul
(2014)
3. The Financial News, A hyperconnected society…A new world is coming. http://www.efnews.
co.kr/news/articleView.html?idxno=78971. Accessed 26 Mar 2019
4. Etnews, First commercialization of 5G in Korea, opening of a new era. http://www.etnews.
com/20190404000302. Accessed 04 Apr 2019
5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(4–5),
993–1022 (2003). Edited by J. Lafferty
6. The Financial News, It’s not just communication…The “Industrial Big Bang” is coming.
http://www.efnews.co.kr/news/articleView.html?idxno=78904. Accessed 15 Mar 2019
A Digital Diagnostic Aide for Skincare: The
Role of Computer Vision and Machine
Learning in Revealing Skin Texture Changes

Jaya Shankar Vuppalapati1, Santosh Kedari1, Anitha Ilapakurti2,
Chandrasekar Vuppalapati2(&), Sharat Kedari2, and Rajasekar Vuppalapati2

1 Hanumayamma Innovation and Technologies Private Limited, HIG-II, Block-2/Flat-7, Baghlingampally, Hyderabad, Telangana, India
{jaya.vuppalapati,skedari}@sanjeevani-ehr.com
2 Hanumayamma Innovations and Technologies Inc., 628 Crescent Terraces, Fremont, CA, USA
{Ailapakurti,cvuppalapati,Sharath,rvuppalapati}@hanuinnotech.com

Abstract. Skin disease is serious and can be deadly. The prevalence of skin disease
is high and is likely to increase as the population ages. Skin disease burdens
Americans, their families, and their employers. According to the American Academy
of Dermatology (AAD), nearly 25% of the population aged 0–17 was diagnosed with a
skin disease in 2013, and the price tag for treatment was $75 billion. Worldwide, an
estimated 1.9 billion people suffer from a skin condition at any given time, and a
shortage of dermatologists aggravates the issue. One of the chief early signs of a
potential skin disease, according to dermatologists, is a change in the skin, ranging
from discoloration to new growth. In this paper, we discuss the application of machine
learning algorithms and computer vision techniques to analyze skin texture changes
that are invisible to the naked eye and to provide an actionable-insights framework
that would trigger preventive treatment procedures to address any impending skin
disease. We discuss several computer vision techniques and cognitive services to
improve the efficiency of computer vision techniques. Our goal is to develop assistive
computer vision models that could potentially help dermatologists take proactive
healthcare actions to reduce the occurrence of skin diseases.

Keywords: Cognitive Services · Computer Vision (CV) · EHR · Local Binary Pattern (LBP) ·
Azure and Amazon Cognitive Services · Sanjeevani Electronic Health Records

1 Introduction

Skin disease is serious and can be deadly. Prevalence of skin disease is high and is
likely to increase as the population ages. Skin disease burdens Americans, their

families, and employers¹. Patients and caregivers dealing with skin disease suffered $11
billion² in lost productivity. Market research projects that the global market for skin
disease treatment technologies will reach $20.4 billion in 2020³. Prevention is better than
cure: applying machine learning and computer vision to analyze images, predict, and
prevent the onset of skin disease, which generally begins as a change of texture invisible
[1] to the naked eye, would reduce overall cost and improve health outcomes (see Fig. 1).

Fig. 1. ML for dermatology [1]

Most artificial intelligence applications that cater to skin-related and dermatology-
based use cases fall into skin image analysis and skin care treatment
personalization⁴.
Additionally, computer-assisted diagnosis (CAD) systems use artificial intelligence
(AI) to analyze lesion data and arrive at a diagnosis of skin cancer⁵. Computer-vision-
infused AI systems could extract valuable diagnostic markers from selfies⁶, forecast
a potentially masked disease, and thus trigger actions to forestall emergency healthcare
incidents, saving millions of lives. Images also provide valuable medical aides for
diagnosing jaundice⁷ (see Fig. 2) [2, 3].

1 AAD - https://www.ncmedsoc.org/wp-content/uploads/2018/07/NCDA-AADA-LC-18-Burden-of-Skin-Disease.pdf.
2 New study shows significant economic burden of skin disease in the United States - https://www.aad.org/media/news-releases/burden-of-skin-disease.
3 Face of Dermatology Industry Changing; Companies in Global Skin Disease Market Extending Products - https://www.bccresearch.com/pressroom/phm/face-of-dermatology-industry-changing-companies-in-global-skin-disease-market-extending-products.
4 Machine Learning for Dermatology – 5 Current Applications - https://emerj.com/ai-sector-overviews/machine-learning-dermatology-applications/.
5 Computer-assisted diagnosis techniques (dermoscopy and spectroscopy-based) for diagnosing skin cancer in adults - https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD013186/abstract.
6 The Role of Selfies in Creating the Next Generation Computer Vision Infused Outpatient Data Driven Electronic Health Records (EHR) - https://ieeexplore.ieee.org/document/8622458.
7 Skin - https://www.livescience.com/46868-skin-changes-signal-health-problems.html.

Fig. 2. Selfies [1]

One challenge is that texture analysis of images is computationally intensive and could
lead to perpetual CV analysis. Additionally, the data footprint could be huge, as data
are collected from various clinics for several texture-change diseases. In order to optimize
compute and reduce overall costs, our algorithms and end application perform image
compression to reduce storage size and the cost of compute analysis.
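As an illustration of this step, the following is a minimal sketch (not the authors' implementation) of JPEG-based compression with OpenCV; the file names and the quality setting of 70 are assumptions.

import cv2

# Read a hypothetical skin image and re-encode it as JPEG at reduced quality
# so that less storage and compute are needed downstream.
img = cv2.imread("skin_sample.png")
ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, 70])
if ok:
    with open("skin_sample_compressed.jpg", "wb") as f:
        f.write(buf.tobytes())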
This paper provides a framework for image processing, discusses optimized image
compression techniques, compares accuracies with public cognitive services, and
describes integration with an EHR. Its major contribution is the proposed image
compression analysis with machine learning and artificial intelligence infused into
medical images.
The organization of this paper is as follows: image compression, cognitive services,
and computer vision techniques are discussed in Sect. 2; the core machine learning
algorithm is discussed in Sect. 3; Sect. 4 presents our EHR analytics service system;
and Sect. 5 shows a case study. Section 6 concludes the paper with a brief outline of
future work.

2 Understanding Analytics and Computer Vision for Digital


Diagnostic Aide for Skincare

2.1 Computer Vision (CV)


In simple terms, the goal of computer vision is to enable computers to see and
understand digital images such as photographs and videos⁸.

8 A gentle introduction of computer vision - https://machinelearningmastery.com/what-is-computer-vision/.

Computer vision⁹ is a broadly multi-disciplinary field that sits at the intersection of
computer science (graphics, algorithms, theory, systems, architecture), mathematics
(information retrieval, machine learning), engineering (robotics, speech, NLP, image
processing), physics (optics), biology (neuroscience), and psychology (cognitive
science). Deep learning techniques make it possible to solve computer-vision-related
use cases: the models and algorithms that deep learning offers enable object, feature,
or image classification.
Many popular computer vision applications involve trying to recognize the following
in images or photographs (see footnote 8), with the “object” as the central concept
(C-I-V-D-L-S-R [4]):
• Classification: what category of object is in the photograph
• Identification: which type of object is in the photograph
• Verification: is the object in the photograph?
• Detection: where are the objects in the photograph?
• Landmark detection: key points for the objects in the photograph
• Segmentation: which pixels belong to the object
• Recognition: confirmation of the objects in the photograph and their location.

2.2 Computer Vision – Cognitive Services


Public cloud providers such as Microsoft Azure and Amazon AWS expose Application
Programming Interfaces (APIs) that apply computer vision techniques to extract
information from images and to categorize and process the data.
(1) Microsoft Azure Cognitive Services
Microsoft Azure¹⁰ provides a rich Vision Cognitive Services API that extracts useful
information from images.
For example, if we upload the image in Fig. 2, the following data is provided by the
Cognitive Services (see Table 1).
As it is clear from the above table, the API provides all the details on the image (see
Fig. 3).
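A hedged sketch of such a call through the Azure Computer Vision REST endpoint is given below; the endpoint, subscription key, image file name, and API version are placeholders and may differ from the service configuration actually used.

import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
SUBSCRIPTION_KEY = "<your-key>"                                    # placeholder

with open("selfie.png", "rb") as f:  # hypothetical input image (cf. Fig. 2)
    image_bytes = f.read()

# Request several visual features; the response resembles Table 1.
response = requests.post(
    ENDPOINT + "/vision/v2.0/analyze",
    params={"visualFeatures": "Objects,Tags,Description,Color,Adult,ImageType,Faces"},
    headers={
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        "Content-Type": "application/octet-stream",
    },
    data=image_bytes,
)
print(response.json())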
(2) Amazon Rekognition
The Amazon AWS Rekognition service provides computer vision capabilities for parsing images.

9 The 5 Computer Vision Techniques That Will Change How You See The World - https://heartbeat.fritz.ai/the-5-computer-vision-techniques-that-will-change-how-you-see-the-world-1ee19334354b.
10 Microsoft Azure Vision Cognitive Services - https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/.

Table 1. Microsoft cognitive service response

Feature Name: Value

Objects [ { "rectangle": { "x": 89, "y": 38, "w": 132, "h": 74 }, "object": "Glasses",
"parent": { "object": "Personal care", "confidence": 0.757 }, "confidence":
0.749 } ]

Tags [ { "name": "person", "confidence": 0.9999424 }, { "name": "man",


"confidence": 0.974333167 }, { "name": "indoor", "confidence":
0.9717449 }, { "name": "human face", "confidence": 0.9135097 },

{ "name": "glasses", "confidence": 0.7313341 }, { "name": "shirtless",


"confidence": 0.548878849 } ]

Description { "tags": [ "person", "man", "indoor", "holding", "glasses", "looking",


"front", "woman", "hand", "standing", "wearing", "shirt", "young",
"teeth" ], "captions": [ { "text": "a man wearing glasses", "confidence":
0.687249959 } ] }

Image format "Png"

Image dimensions 158 x 237

Clip art type 0

Line drawing type 0

Black and white false

Adult content false

Adult score 0.0366226844

Racy false

Racy score 0.08602096

Categories [ { "name": "people_baby", "score": 0.4609375 } ]

Faces []

Dominant color background "White"

Dominant color foreground "White"

Accent Color #3F558C



Fig. 3. Selfies Parsed on Microsoft Azure (Microsoft Azure Cognitive Services - https://azure.
microsoft.com/en-us/services/cognitive-services/computer-vision/)

Uploading the image in Fig. 2 yields the response shown in Table 2:

Table 2. Amazon AWS Rekognition response

{ "Labels": [ { "Name": "Sunglasses", "Confidence":


99.43801879882812, "Instances": [ { "BoundingBox":
{ "Width": 0.57877516746521, "Height":
0.5148587822914124, "Left": 0.33443233370780945,
"Top": 0.20909881591796875 }, "Confidence":
99.43801879882812 } ], "Parents": [ { "Name":
"Accessories" } ] }, { "Name": "Accessories",
"Confidence": 99.43801879882812, "Instances": [],
"Parents": [] }, { "Name": "Accessory", "Confidence":
99.43801879882812, "Instances": [], "Parents": [] },
{ "Name": "Human", "Confidence": 99.14139556884766,
"Instances": [], "Parents": [] }, { "Name": "Person",
"Confidence": 99.14139556884766, "Instances":
[ { "BoundingBox": { "Width": 0.9506932497024536,
"Height": 0.9779492020606995, "Left":
0.04508737102150917, "Top": 0.015436968766152859 },
"Confidence": 99.14139556884766 } ], "Parents": [] },
{ "Name": "Glasses", "Confidence": 96.43098449707031,
"Instances": [], "Parents": [ { "Name": "Accessories" } ] },
{ "Name": "Head", "Confidence": 62.213706970214844,
"Instances": [], "Parents": [] } ], "LabelModelVersion":
"2.0" }

As it is clear from the above table, the API provides all the details on the image (see
Fig. 4).
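A comparable hedged sketch using the boto3 client for Amazon Rekognition follows; AWS credentials and region configuration are assumed to be in place, and the image file name is a placeholder.

import boto3

client = boto3.client("rekognition")

with open("selfie.png", "rb") as f:  # hypothetical input image (cf. Fig. 2)
    image_bytes = f.read()

# Detect labels; the returned structure resembles Table 2.
result = client.detect_labels(Image={"Bytes": image_bytes}, MaxLabels=10)
for label in result["Labels"]:
    print(label["Name"], label["Confidence"])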

Fig. 4. Selfies Parsed on AWS Rekognition (Amazon AWS Rekognition - https://console.aws.


amazon.com/rekognition/home?ad=c&cp=bn&p=rkn&region=us-east-1#/label-detection)

Table 3. Google vision API response

{
"cropHintsAnnotation": {
"cropHints": [
{
"boundingPoly": {
"vertices": [
{
"x": 35
},
{
"x": 161
},
{
"x": 161,
"y": 157
},
{
"x": 35,
"y": 157
}
]

"labelAnnotations": [
{
"description": "Hair",
"mid": "/m/03q69",
"score": 0.97040063,
"topicality": 0.97040063
},
{
"description": "Skin",
"mid": "/m/06z04",
"score": 0.9424602,
"topicality": 0.9424602
},
{
"description": "Shoulder",
"mid": "/m/01ssh5",
"score": 0.8860629,
"topicality": 0.8860629
},
{
"description": "Arm",
"mid": "/m/0dzf4",
"score": 0.88020605,
"topicality": 0.88020605
},

{
"description": "Muscle",
"mid": "/m/04_fs",
"score": 0.8697178,
"topicality": 0.8697178
},
{
"description": "Chest",
"mid": "/m/0dzdr",
"score": 0.86141694,
"topicality": 0.86141694
},
{
"description": "Male",
"mid": "/m/05zppz",
"score": 0.8549979,
"topicality": 0.8549979
}
]

(3) Google Vision AI


Google offers Vision AI, which parses objects and provides image features.
Google Vision AI returns the following response for the image in Fig. 2.
Additionally, it returns Web Entities similar to the image; for instance, Fig. 2 matches
“dermatology” with 70.03% confidence (see Table 3).
As it is clear from the above table, the API provides all the details on the image (see
Fig. 5).
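A hedged sketch of label detection with the google-cloud-vision client library is shown below; application default credentials are assumed, the file name is a placeholder, and older library versions expose the Image type as vision.types.Image instead.

from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("selfie.png", "rb") as f:  # hypothetical input image (cf. Fig. 2)
    content = f.read()

# Label detection; the annotations resemble the labelAnnotations block of Table 3.
image = vision.Image(content=content)
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, label.score)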

Fig. 5. Selfies Parsed on Google Cloud Vision (Google Cloud Vision AI - https://cloud.google.
com/vision/)

2.3 CV Algorithm Analysis¹¹


Generally, selfies are taken by users who are non-professional enthusiasts, so the images
are often perturbed; this is one of the key issues with such images, and the computer
vision algorithm needs to handle these deviations.
Let $A$ denote the CV algorithm:

$A : (U_{in}, T) \to U_{out} \qquad (1)$

The algorithm takes the input units $U_{in}$ and generates the output $U_{out}$; $T$ is the
tuning vector. Because an image may be perturbed by camera quality or other
environmental factors, the input is treated as a random variable:

$A : (\hat U_{in}, T) \to \hat U_{out} \qquad (2)$

Performance characterization of the algorithm must balance the random variance, or
imperfections, of the input and output data.

2.4 LBP Image Steps


Let us take Fig. 2 as input to the LBP operator code [5]. The LBP operator converts
Fig. 2 into a grayscale image and then constructs a histogram of the image (see Fig. 6
and Table 4).

11 Performance Characterization in Computer Vision: A Guide to Best Practices - http://www.tina-vision.net/docs/memos/2005-009.pdf.

Fig. 6. LBP feature extraction

Table 4. LBP (LBP - https://github.com/arsho/local_binary_patterns)

import cv2
import numpy as np
from matplotlib import pyplot as plt


def get_pixel(img, center, x, y):
    new_value = 0
    try:
        if img[x][y] >= center:
            new_value = 1
    except:
        pass
    return new_value


def lbp_calculated_pixel(img, x, y):
    '''
     64 | 128 |  1
    ----------------
     32 |  0  |  2
    ----------------
     16 |  8  |  4
    '''
    center = img[x][y]
    val_ar = []
    val_ar.append(get_pixel(img, center, x - 1, y + 1))  # top_right
    val_ar.append(get_pixel(img, center, x, y + 1))      # right
    val_ar.append(get_pixel(img, center, x + 1, y + 1))  # bottom_right
    val_ar.append(get_pixel(img, center, x + 1, y))      # bottom
    val_ar.append(get_pixel(img, center, x + 1, y - 1))  # bottom_left
    val_ar.append(get_pixel(img, center, x, y - 1))      # left
    val_ar.append(get_pixel(img, center, x - 1, y - 1))  # top_left
    val_ar.append(get_pixel(img, center, x - 1, y))      # top

    power_val = [1, 2, 4, 8, 16, 32, 64, 128]
    val = 0
    for i in range(len(val_ar)):
        val += val_ar[i] * power_val[i]
    return val

The LBP Code is referenced from SCIKIT [6–8].
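For comparison, a short sketch of the equivalent, vectorized LBP computation with scikit-image [6] is given here; the parameters P=8, R=1 approximate the 3 x 3 neighborhood used by the code in Table 4, and the file name is a placeholder.

import cv2
from skimage.feature import local_binary_pattern

# Load the image in grayscale and compute the 8-neighbor, radius-1 LBP map.
gray = cv2.imread("selfie.png", cv2.IMREAD_GRAYSCALE)
lbp = local_binary_pattern(gray, P=8, R=1, method="default")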

2.5 Histogram Comparison Techniques


There are two ways to compare histograms¹²:
• OpenCV – cv2.compareHist
• SciPy Distance metrics
(4) OpenCV – compareHist
The compareHist function, as the name implies, compares two histograms¹³. The API
takes two histograms H1 and H2 and applies a comparison method to compute their
similarity.
Comparison methods include14:
• CV_COMP_CORREL Correlation
• CV_COMP_CHISQR Chi-Square
• CV_COMP_INTERSECT Intersection
• CV_COMP_BHATTACHARYYA Bhattacharyya distance
• CV_COMP_HELLINGER Synonym for CV_COMP_BHATTACHARYYA
(5) SciPy distance metrics
The main difference between the SciPy distance functions and the OpenCV methods is
that the OpenCV methods are histogram specific, whereas SciPy implements much more
general distance functions¹⁵; a short sketch follows the list below.
• Euclidean
• Manhattan
• Chebyshev
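The sketch below compares two LBP histograms with these SciPy distance functions; hist_a and hist_b are placeholder, normalized histograms of equal length.

import numpy as np
from scipy.spatial import distance

hist_a = np.array([0.2, 0.3, 0.5])  # placeholder histograms
hist_b = np.array([0.1, 0.4, 0.5])

print(distance.euclidean(hist_a, hist_b))   # Euclidean
print(distance.cityblock(hist_a, hist_b))   # Manhattan
print(distance.chebyshev(hist_a, hist_b))   # Chebyshev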

2.6 Sanjeevani Electronic Health Records [9]


Sanjeevani electronic health records (EHR) [9] store electronic health information
about individual patients. Sanjeevani stores a range of electronic health data, including
demographics, medical history, medications and allergies, immunization status,
laboratory test results, and personal statistics such as age and weight. Sanjeevani
pioneered digitizing electronic health records and storing them in the commercial cloud.
As a result, our system ensures:

12 How-To: Three Ways to compare histograms using OpenCV and Python - https://www.pyimagesearch.com/2014/07/14/3-ways-compare-histograms-using-opencv-python/.
13 CompareHist - https://docs.opencv.org/2.4/modules/imgproc/doc/histograms.html?highlight=comparehist#comparehist.
14 Histogram Comparison methods - https://docs.opencv.org/2.4/modules/imgproc/doc/histograms.html?highlight=comparehist#comparehist.
15 Histogram comparison methods - https://www.pyimagesearch.com/2014/07/14/3-ways-compare-histograms-using-opencv-python/.

• High data availability even in the presence of faults in the network or computer
hardware (e.g. due to power outages, environmental disasters, and regional strife).
• High performance to ensure the system can function even under the high loads that
may arise in emergency situations (such as a pandemic, large-scale accident or war).
• Security to protect patient data from misuse, unauthorized access, or attacks [9].

3 System Overview

In order to extract additional attributes that correlate with skin disease, we have
partnered with Sanjeevani Electronic Health Records, which provides de-identified details
of patients. Our analysis correlates users' skin diseases with other medical conditions
(see Fig. 9).

3.1 Data Collection


The skin images and de-identified patient details are collected from Sanjeevani
Electronic Healthcare services' house calls and senior citizen services.

3.2 Feature Extraction


The following steps extract image features using LBP (see Fig. 7); a sketch of the
pipeline follows the list.
1. The image is first converted to a grayscale (monochromatic) image.
2. Next, the LBP vector is extracted.
3. Next, the histogram is constructed. The histogram is translated into a 20-bin matrix
   over the 0–256 pixel-value range.
4. The LBP is assigned to the training database.
5. Next, the pixel values are averaged to construct a unique array.
Finally, a unique feature is calculated for each image for classification purposes.
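The following is a sketch of these five steps, assuming the scikit-image LBP operator; the 20-bin count and the 0–256 pixel-value range follow the description above, while the file name and the P, R parameters are assumptions.

import cv2
import numpy as np
from skimage.feature import local_binary_pattern

gray = cv2.imread("skin_image.png", cv2.IMREAD_GRAYSCALE)     # step 1: grayscale image
lbp = local_binary_pattern(gray, P=8, R=1, method="default")  # step 2: LBP values
hist, _ = np.histogram(lbp.ravel(), bins=20, range=(0, 256))  # step 3: 20-bin histogram
hist = hist.astype("float") / hist.sum()                      # step 4: normalized pattern for the training database
unique_feature = float(lbp.mean())                            # step 5: average value as a compact unique feature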

Fig. 7. Feature extraction



3.3 Pattern Database


The pattern database (see Fig. 8) contains all the columns necessary to store texture
features of the image and the data that can be fed to the pattern classifier [10, 11].
Some of the important columns include:
• LBP Histogram matrix
• Pattern (average values of histogram matrix)
• Unique Features per image
• UUID of the Image
• Original Image
• Monographic
• Histogram
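Purely as an illustration (not the authors' schema), a pattern table with these columns could be created with Python's built-in sqlite3 module as sketched below; all table and column names are assumptions based on the bullet list.

import sqlite3

conn = sqlite3.connect("pattern_db.sqlite")  # hypothetical database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS lbp_patterns (
        image_uuid     TEXT PRIMARY KEY,  -- UUID of the image
        lbp_histogram  BLOB,              -- serialized LBP histogram matrix
        pattern_avg    REAL,              -- average value of the histogram matrix
        unique_feature REAL,              -- unique feature per image
        original_image BLOB,              -- original (compressed) image
        gray_image     BLOB,              -- grayscale image
        histogram_plot BLOB               -- stored histogram
    )
""")
conn.commit()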

Fig. 8. LBP feature extraction training dataset

3.4 Pattern Classification


Pattern classification is based on real-time, time-series-based feature analysis
[11, 12]. One of the features that the time series compares is the LBP histogram, that is,
the texture difference between a previous image and a new one¹⁶.

16 MORPH II – Feature Vector Documentation - https://libres.uncg.edu/ir/uncw/f/wangy2018-1.pdf.

4 System Design and Implementation


4.1 Histogram Comparison Implementation
The historical skin texture images are stored as part of the pattern database. When a new
image is uploaded by the patient, the LBP process extracts the histogram of the new image
(see Fig. 9).

Fig. 9. LBP feature extraction training dataset

Second, histogram comparison is applied [13, 14]. This is performed by calling the
OpenCV histogram-comparison method (see Table 5).
The comparison yields histogram differences; once these are detected, the next step
is to locate the differences in image texture. That would indicate the prominence of the
skin texture change and allow it to be correlated with the disease.

Fig. 10. Sanjeevani EHR

Table 5. Open CV - Histogram comparison
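A minimal sketch of the OpenCV histogram-comparison call described in Sect. 2.5 and referenced by Table 5 is given below; the two histograms are placeholders, and the method flags are named HISTCMP_* in recent OpenCV releases (CV_COMP_* in the 2.4 API cited above).

import cv2
import numpy as np

# Placeholder 20-bin histograms: one stored for the previous image, one for the new upload.
hist_prev = np.random.rand(20, 1).astype("float32")
hist_new = np.random.rand(20, 1).astype("float32")

correlation = cv2.compareHist(hist_prev, hist_new, cv2.HISTCMP_CORREL)
chi_square = cv2.compareHist(hist_prev, hist_new, cv2.HISTCMP_CHISQR)
bhattacharyya = cv2.compareHist(hist_prev, hist_new, cv2.HISTCMP_BHATTACHARYYA)
print(correlation, chi_square, bhattacharyya)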



5 Conclusion and Future Work

This paper presented a novel approach that integrates computer vision and LBP with
electronic health records to identify skin disease from selfies taken for medical imaging
purposes. The application of computer vision with OpenCV statistical methods provides
a valuable diagnostic tool for identifying skin texture change and a prognostic indicator
for impending skin disease.

Acknowledgment. We sincerely thank the management and field staff of Sanjeevani Electronic
Health Records (www.sanjeevani-ehr.com) for their active support in providing images, de-
identified database and image analysis (see Fig. 10).

References
1. Guerrero, A.: Medicalized Smartphones: Giving New Meaning to “Selfies”. https://
technologyadvice.com/blog/healthcare/medicalized-smartphones-giving-new-meaning-to-
selfies/. Accessed 11 Aug 2018
2. Langston, J.: New app could use smartphone selfies to screen for pancreatic cancer, 28
August 2017. https://www.washington.edu/news/2017/08/28/new-app-uses-smartphone-
selfies-to-screen-for-pancreatic-cancer/. Accessed 06 Aug 2018
3. Honnungar, S., Mehra, S., Joseph, S.: Diabetic Retinopathy Identification and Severity
Classification, Fall (2016). http://cs229.stanford.edu/proj2016/report/HonnungarMehra
Joseph-DRISC-report.pdf
4. Le, J.: The 5 Computer Vision Techniques That Will Change How You See The World, 12
April 2018. https://heartbeat.fritz.ai/the-5-computer-vision-techniques-that-will-change-
how-you-see-the-world-1ee19334354b
5. Pietikäinen, M., Hadid, A., Zhao, G., Ahonen, T.: Computer Vision Using Local Binary
Patterns, 1st edn. Springer, London (2011). Hardcover ISBN 978-0-85729-747-1, ASIN
B009T11G7M
6. SCIKIT-Image, Local Binary Pattern for texture classification. http://scikit-image.org/docs/
dev/auto_examples/features_detection/plot_local_binary_pattern.html. Accessed 20 July
2018
7. Praksa, E.: Texture Feature Extraction by Using Local Binary Pattern, 04 January 2017.
https://www.researchgate.net/publication/305152373_Texture_Feature_Extraction_by_
Using_Local_Binary_Pattern
8. Ojala, T., Valkealahti, K., Oja, E., PietikaKinen, M.: Texture discrimination with
multidimensional distributions of signed gray-level differences, November 1999. http://
www.ee.oulu.fi/mvg/files/pdf/pdf_58.pdf
9. Hanumayamma Innovations and Technologies, Sanjeevani Electronic Health Records &
Healthcare analytics platform. http://hanuinnotech.com/healthcare.html. Accessed 08 Jan
2017
10. Rosebrock, A.: Local Binary Patterns with Python & OpenCV, 7 December 2015. https://
www.pyimagesearch.com/2015/12/07/local-binary-patterns-with-python-opencv/
11. Hanzra, B.S.: Texture Matching using Local Binary Patterns (LBP), OpenCV, scikit-learn
and Python, 30 May 2015. http://hanzratech.in/2015/05/30/local-binary-patterns.html
12. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques, 3rd edn. Morgan
Kaufmann, Burlington (2011)

13. Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press,
New York (2011)
14. Vuppalapati, C., Ilapakurti, A., Kedari, S.: The role of big data in creating sense EHR, an
integrated approach to create next generation mobile sensor and wearable data driven
electronic health record (EHR). In: Proceedings of the 2016 IEEE Second International
Conference on Big Data Computing Service and Applications (BigDataService) (2016)
Deep-Learned Artificial Intelligence
for Semantic Communication
and Data Co-processing

Nicolay Vasilyev1(&), Vladimir Gromyko2, and Stanislav Anosov1,3


1 Fundamental Sciences, Bauman Moscow State Technical University, 2-d Bauman street, 5, b. 1, 105005 Moscow, Russia
nik8519@yandex.ru, sanosov@cs.msu.su
2 Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Leninskye Gory, 1-52, 119991 Moscow, Russia
gromyko.vladimir@gmail.com
3 Public Company Vozrozhdenie Bank, Luchnikov per., 7/4, b. 1, 101000 Moscow, Russia

Abstract. Trans-disciplinary activity in the system-informational culture (SIC) has
caused great sophistication of knowledge and the necessity for every person to have a
synthesis of true scientific presentations. The complexity of informational flows
saturated with scientific meanings requires their co-processing with the help of
deep-learned artificial intelligence (DL IA). Artificial intelligence (IA) will assist
man in identifying the universalities needed to understand the third world. Otherwise,
it will be impossible to live comfortably with computer instrumental systems and their
applications. The arising intellectual difficulties will significantly alter the SIC
subject armed with DL IA, a powerful means of learning, cognition, and world study.
A trained rational consciousness allows achieving a semantic level of communication
in SIC. In its work, DL IA leans on the system axiomatic method and a personal
cogno-ontological knowledge base described in the language of categories.
Examples explain the contributed technology.

Keywords: Trans-disciplinary activity · Meaning · Universalities · Deep-learned
artificial intelligence · Cogno-ontological knowledge base · Consciousness
auto-building · Self-reflection · Language of categories · System axiomatic method

1 Introduction: Bio-Socio-Intellectual Evolution

Man was transcendentally transformed as a result of the mind's transition from oral to
written communication. This duality was caused by the conceptualization of speech and
the necessity of presenting complex communication. Biological evolution of the genus
stopped once the ability to master language had formed. The mind achieved its
present-day complexity; it cannot help thinking discursively. But intellectual processes
need a synthetic platform of thinking from which to launch discursion.


Socialization of the genus significantly improved communication among people.
Intellectual development happened by means of the cultural superstructure created by
society. Discoveries in the natural sciences enriched the genus with new complex
meanings. Since those times, artificial intelligence (IA) has pierced all human
existence. The investigation of nature developed mathematical analysis, whose
application caused the differentiation of knowledge. The initial unity of man's
humanitarian and rational origins was split apart. Knowledge lost its self-obviousness,
and communication acquired professional features.
Educational systems were impelled to train personnel in the skills and laws of a
particular field of knowledge, loading the relict ability of memory to remember rather
than to understand. Generalization of rational knowledge has always supported the
discursive abilities of the mind with new forms of scientific abstraction. This trend
toward the unification of knowledge was also supported by the humanities.
Synthesis of presentations has become decisive for contemporary society. Sophisticated
scientific knowledge requires great endeavors to understand. Trans-disciplinary work
in SIC is based on comprehension of the underlying abstractions. It is now impossible
to do without deep-learned artificial intelligence (DL IA), which will be able to tutor
man universally with the help of these abstractions [1–3].
Computer applications support cognition. Multimedia environments are used to
present information in a clear, convenient, and suitable form. Bionic, brain-like IA
already assists a person in intellectual activity and education, imitating human behavior
and traditional teaching (TT) [4–8]. But the latter cannot cope with the difficulties of
breeding the universal man without special concern for it; it does not even pursue that
aim. The coming cardinal change of social life requires improving the TT system.
The genus has created computer networks as new tools of cognition and investigation.
Man is now occupied by interdisciplinary labor based on trans-disciplinary meanings.
Computer networks give quick access to big data saturated with sophisticated scientific
meanings. Work in electronic libraries is not accompanied by an understanding of
things. Knowledge has become complex and huge in volume. It requires adequate labor
because the methods of TT have ceased to work. Universal tutoring (TU) is needed for
rational auto-poiesis and harmony between real and ideal presentations [9].
Only the self-transcendentality of rational consciousness on the basis of SIC meanings
can resolve this educational crisis and support intellectual evolution in culture. It is
connected with man's rational self-building by means of universal training.
Informational flows in the natural sciences are full of scientific meanings. They can be
understood only in life-long partnership with DL IA [1]. DL IA can assist man in his
personal self-development; only on this ground can natural life in SIC be maintained.
Men's communication will occur on the semantic level because learning, cognition, and
investigation are always based on the understanding of meanings. The arising
intellectual difficulties can be overcome if they are resolved in good time, and only
DL IA can be continuously available.
Multimedia presentation of universal abstractions can help their comprehension.
Mankind's intellectual evolution will occur by means of TU, which applies the unity of
natural (LN) and semantic (LC) languages to the constructive description of meanings
[10, 11]. The future trend is semantic glottogenesis. Under TU, the interaction of
languages stimulates the crossing-over and auto-folding intellectual processes of the
mind. They happen on the ground of ontological universalities. The language of
categories (LC) plays the role of mathesis

universalis. It is a normal outcome of the development of the natural sciences and of
the evolution of communication. Traditional cognition will be strengthened by it.
Detailed analysis of data will be complemented by meta-mathematical study, developing
the subject's scope of life. Ideated thinking supported by DL IA will transform
communication, uplifting it to the semantic level.

2 Consciousness Auto-development and Communication

Communication is possible due to the existence of language objectivity (Leibniz).
Information becomes available to a person by means of reasoning in precisely expressed
abstractions. Rationalization of the mind humanizes the self-development of
consciousness under study-cognition. It happens with the help of linguistic tools and
needs investigation of meanings. Then communication will reach a higher level.
Mastering universalities will allow man to super-strengthen his $I_N$ in partnership
with DL IA. Moreover, the contributed technology is a means of creating DL IA.
Meanings are used in communication. The de docta ignorantia principle in SIC
influences the auto-building of rational consciousness necessary for semantic
interaction. DL IA helps to activate self-reflection directed at understanding
universalities, resulting in the rational transcendentality of the mind. Human
achievements in mathematics are now enriched by the language of categories (LC).
Meta-mathematical investigations allow applying the knowledge-without-premises
principle [12].
The language of categories (LC) was discovered in mathematics while the meanings of
whole theories were studied. Their axiomatic presentation leant on the same
universalities displayed in different systems. The ontological essence of meanings
allows expecting their usage in communication. For their identification and study,
precise mathematical tools and descriptive means become necessary [14]. They
supplement inductive practical approaches to the formation of theories, uniting
knowledge in a wholesome structure. The system AM and LC allow uniting the
meta-mathematical approach with the description of things' properties in LC.
Deep-learned IA functions on the basis of CogOnt, presenting universal constructions
in all their unity [13–25].
Urged by DL IA, the same thinking processes begin to work in the self-reflected mind
on sophisticated scientific data. Glottogenesis increases the power of the linguistic
tools of communication, LC [3, 13]. It occurs by synthesis of all languages used for
communication. The subject's rational self-building of consciousness and development
of discursive abilities are the result of self-reflection about the means of knowledge
modelling. Comparison of different models provides food for the mind. Inductive work
and ideal presentations are to be coordinated by adequate means, see Table 1. The trend
is knowledge generalization, which can be done only in an abstract, objective,
ontological form. Human approaches can be classified in accordance with the three
levels of the axiomatic method.

Table 1. CogOnt: cognogenesis repeats anthropogenesis.


Obviousness of abstractions Semantic naturalness Compulsory identification
AMI: categorical theories AMM: open theories AMS: theories synthesis
(function) (morphism) (functor + meta-mathematics)
space: Eucleidus Descarte number: Hamilton Artin Pontriagin; informatics: SI, OOP; ATD;
Hilbert Weyl Kolmogorov; space: non – Archimedes– Hilbert; ISI ; DB; KB; TC, TII ; IA;
algebra: group, ring, module; non – Eucleidus – Lobachevski, CogOnt networks;
structure – Skorniakov Riemann; algebra: categories; free
Shafarevich; universal algebra: Van der Warden; constructions;
analysis: non–standard – Hilbert; Kourosh; category theories of: logic;
standard – Newton Leibniz; analysis: discrete – Knuth Scott; algorithms; linear algebra;
calculation: machine – Turing; elemental – Tarski; non standard – analysis; casual values;
language – McCarty; insoluble - Leibniz Robinson; functional – Fourier measure;
innumerable – Post Church; Schwarz; probabilities – Kolmogorov; means of expression: ZFC –
means of expression: sentence – measurable – Lebesgue; Zermelo Fraenkel; LC –
Boole; set – Cantor; predicate – computation: effective; multi- McLane, TOPOI – Lauvere;
Frege Hilbert; processor; methods of: closure – Cantor;
factorization in CogOnt: means of expression: multi valued evaluation – Carry; transfinite
propeudevtic courses in LN logics; induction – Zorn;
factorization in CogOnt: meta method– Gödel;
philogenetic courses of general algebra theory of models – Maltsev;
factorization in CogOnt:
universalities in LC, meta-
mathematics, methods

The partnership of $I_N$ and DL IA presupposes the availability of an intellectual
interface $(T_{II})$ described in LC (an ideal presentation). Constructive work in
integrated computer systems ($IS_I$) serves knowledge understanding. Cloud technology
($T_C$) supports semantic processing, while databases (DB) are transformed into
knowledge bases (KB). The realization of DL IA leans on object-oriented programming
(OOP), which is an application of LC to informatics. The use of abstract data types
(ATD) and an interpretive scheme of programming will help in the co-processing of
meanings. The genus has always needed conceptual grasping for strict communication.
The personal adaptive form PCogOnt is based on well-verified tutoring approaches:
similar problems were solved in the past, and knowledge about the formation and
origins of the sciences is necessary for semantic processing in SIC [16, 17].

[Fig. 1 shows the partnership scheme $\tilde I_N$: DL IA and $I_N$ jointly process data (D), guided by PCogOnt, to extract knowledge (K).]

Fig. 1. Natural and artificial intelligences partnership $\tilde I_N$ for knowledge (K) extraction by means of data (D) processing.

Communication and understanding happen by the same laws, see Fig. 1. Let $\tilde I_N$
be the human mind working jointly with its life-long partner DL IA. The scheme of
future communication between SIC persons can then be presented in the following form:

$\tilde I_N^{\,i} \xrightarrow{\;K_i \,=\, D_j\;} \tilde I_N^{\,j} \longrightarrow K_j.$

Knowledge of phylogenetic significance is used in CogOnt [15]. Databases (DB)
transformed by the functor $CogOnt : DB \to SSKSN$ give birth to understanding, see
Table 1. CogOnt is an open dynamic system. Interacting with multi-disciplinary
electronic libraries and living among SIC meanings, man will undergo intellectual
changes under the influence of SIC and with the help of DL IA.

3 Axiomatic Method Dynamics and Meta-mathematics

The intellectual evolution of mind and communication is based on scientific
presentations of how the natural sciences grew up. These can be acquired from a system
analysis of AM dynamics, see Table 1. Besides, a person must have the general view of
the outcome of the sciences that meta-mathematical investigation gives.
The means of expressing knowledge and the applied approaches differ significantly
depending on the AM level. One of the important problems solved in science is the
existence of objects. In mathematics, it takes the form of inheriting algebraic
structures. According to Euclid's $AM_I$ [16], the existence of a geometrical image
means its geometrical construction with compasses and a ruler. Hilbert used $AM_M$:
for him, existence signified the possibility of defining a thing in the form of an ATD
and obtaining a consistent model of it [17]. Euclid avoided reductio ad absurdum;
nevertheless, he proved the existence of equivalent triangles, whereas Hilbert took it
as one of his congruence axioms [16, 17]. The equivalence relation is presented in full
measure by $AM_S$, where it is introduced for objects from different categories and
existence is postulated by means of commutative diagrams. Thinking in LC leans on the
universality of maps, crowned by morphisms as adequate means of modeling, see
Table 1 [10, 11].
Example 1. Dual presentations accompany thinking. This universality is displayed in
speech, thinking itself, communication, languages, sciences, and concepts themselves.
Obviously, it is reflected in LC: categories, morphisms, and objects all have mutual
duals, see Figs. 2 and 3 [10]. Duality is expressed in LC by reversing all arrows in
commutative diagrams.
Example 2. Duality of the equivalence relation and object factorization. Both
properties are very important for the creation and usage of CogOnt. With their help,
semantic compression of communication can be executed.

Fig. 2. Descartes’ square universality.

In any category, equivalence is expressed by means of the kernel relation $R_f$ [10],
see Fig. 2.
The dual concept is the notion of a factor-object, which falls within the same
category. Knowledge factorization is used on all levels of AM, see Table 1. A
factor-object can be presented in diverse ways in LC. It is expressed by the
co-Descartes square, the commutative diagram dual to the one in Fig. 2. Besides, the
same object is described with the help of the morphism $f_{R_f}$ co-equalizing the
projections $p_1, p_2$ [11]:

Fig. 3. Co-equalizer universality.
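A minimal LaTeX sketch of the two dual diagrams that Figs. 2 and 3 depict, assuming the standard kernel-pair (pullback) and coequalizer constructions with the names used in the text, is:

% Requires \usepackage{amscd,amsmath}.
% Left: Descartes (pullback) square defining the kernel relation R_f of f.
% Right: the coequalizer q of the projections p_1, p_2, giving the factor-object A/R_f.
\[
\begin{CD}
R_f @>p_1>> A \\
@Vp_2VV     @VVfV \\
A   @>>f>   B
\end{CD}
\qquad\qquad
R_f \underset{p_2}{\overset{p_1}{\rightrightarrows}} A
\xrightarrow{\;q\;} A/R_f ,
\qquad q \circ p_1 = q \circ p_2 .
\]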

Universalities are interconnected. The same diagram can be used, for instance, to
introduce the notion of the inverse image of a map.

4 Thinking in Universal Constructions

The aim of DL IA functioning is to turn man's instinctive synchronization with SIC
into self-reflection on the basis of mathematical universalities. Reflection on the
auto-building of this synchronization will develop rational consciousness and prepare
man for trans-disciplinary work with sophisticated computer applications. Cognogenesis
repeats anthropogenesis on a new basis, issuing from semantic comprehension of the
underlying knowledge. The things studied will become self-obvious on the ground of the
subjective objectivity of idealization. By means of semantic forms, DL IA can assist
man in seeing the ideas laid down at the origins of theories [14]. Through these ideas,
any information or communication can be analyzed deeply. The de docta ignorantia
principle and DL IA assistance can help man understand the structure and methods of
theories.
Many important concepts in mathematics are morphisms and functors [2, 3]. Their
usage can significantly compress the presentation of a theory.

Example 3. Morphisms universality. Consider the category of multi-operator rings
$Ring^M$, in which composition of morphisms is the superposition of the corresponding
maps [23, 24]. Its objects

$O = \{\,O;\; A_1, \ldots, A_K\,\}, \qquad A_k : O \to O,$

are transformed with the help of the linear operators $A_k : O \to O$, $k = 1, 2, \ldots, K$.
Morphisms $\mu : O \to O'$ in $Ring^M$ are maps conserving the operators of the
considered rings, i.e. $\mu A_k = A'_k$, $k = 1, 2, \ldots, K$. It means that for every
$A = A_k$ the following diagram is commutative:

$\begin{array}{ccc} O & \xrightarrow{\;A\;} & O \\ {\scriptstyle\mu}\downarrow & & \downarrow{\scriptstyle\mu} \\ O' & \xrightarrow{\;A'\;} & O' \end{array} \qquad (1)$

In other words, a morphism transfers the action of the operators from one ring into
another. For instance, let $p : O \to O'$ be the projection onto a subspace
$O' \subseteq O$ invariant under the operators $A = A_k$. Then the map $\mu = p$
satisfies (1), i.e. $p$ is a morphism.
If an operator $A = A_k$ is invertible, then the property $\mu A^{-1} = (\mu A)^{-1}$
holds. The scheme is universal because it is recognized in many mathematical theories
and applications. Its specialization is given in functional rings of the kind

$O = \{\,f : \mathbb{R}_{+} \to \mathbb{R},\; f(0) = 0\,\},$
$O' = \{\,F : \mathrm{Dom}\,F \to \mathbb{C},\; F(\infty) = 0\,\}, \qquad \mathrm{Dom}\,F = \{\,p \in \mathbb{C} : \operatorname{Re} p \geq \sigma(f)\,\}.$

They are compared with the help of the morphism $\mu = M$, which sends a function $f$
to its Laplace transform $F(p)$:

$Mf = F(p) = \int_{0}^{+\infty} e^{-pt} f(t)\, dt.$

In these rings there is the following family of linear operators:

$A_1 = D_O, \qquad A_2 f(t) \triangleq t \cdot f(t), \qquad A_g f = g * f, \quad g \in O.$

They are derivation DO , multiplication on function gðtÞ ¼ t, and convolution g  f .


For all f ; g 2 O next equalities are satisfied:

Mðaf þ bgÞ ¼ aMf þ bMg;


MDO f ¼ p  Mf ;
: ð2Þ
Mðt  f ðtÞÞ ¼ DO0 F;
MAg f  Mðg  f Þ ¼ G  F

According to (1), (2), in object O0 next family of operators act:

A01 F , pFðpÞ; A02 ,  DO0 ; AG F ¼ G  F; G 2 O0 :

Moreover, inverse Laplace transformation M1 exists. By means of algebraic


construction identification, idea of operational calculus becomes clear and obvious.
Therefore, main formulas are discovered connecting morphisms M; M1 images. The
rings are isomorphic O ffi O0 ; O0 ¼ ImO. Integration in the rings is inverse to derivation
operators A ¼ DO ; DO0 . Next equalities are consequence of general law lA1 ¼ ðlAÞ1
applied to morphisms l ¼ M; M1 :

1 1
MD1 1 1 1
O f ¼  Mf ; M DO0 F ¼   M F:
p t

The considerations show how similar elements S : f ðtÞ ! f ðatÞ of object O are
transformed in the ones S0 : FðpÞ ! 1aFðpaÞ of object O0 .
With the help of forgiving functor U ring UO is defined without operators
Ak ; k ¼ 1; 2; . . .; K. Ring UO is embedded in ring of lattice functions On :

On ¼ ff : Ordn ! R; f ð0Þ ¼ 0g; Ordn ¼ f0; 1; . . .; n : 0\1\. . .\ng; n ¼ 0; 1; . . .:

In rings On , discrete analogues DAk of operators Ak ; k ¼ 1; 2; . . .; K are introduced.


Then there is discrete Laplace transform Mn : On ! O0 ,
X
Mn f ¼ F  ðpÞ ¼ ekp f ðkÞ:
k2Ordn

Co-domain of arrow Mn is continuous object O0 . It helps to study discrete object


On ¼ ðOn ; DAk ; k ¼ 1; . . .; KÞ by analytical methods.
In probabilities theory it is used particular case of morphism M. It is Fourier
U
transform O ! O0 ; O  O:

O ¼ ff : R þ ! R; f ð0Þ ¼ 0; 9C j f j Cg;
:
O0 ¼ fF : R ! C; Fð1Þ ¼ 0g

Besides it, functions F 2 O0 have analytic continuation F 0 2 O0 . That is why


morphisms M; U are connected by next commutative diagram:

U
O ! O0
#.# :
M
O ! O0

Therefore, category view helps to observe operational calculus and to understand


its essence.
Categories morphisms are functors [10, 11]. Category RingM admits division in two
parts – continuous RingM M
c and discrete Ringd one, see ex. 3.

Example 4. Functors universality. With the help of functor D ¼ D1 of objects


discretization

D1 : ðf : A ! BÞ ! ðfn : An ! Bn Þ;

algebraic methods can be supported by calculations. Derivation operator is now


replaced by first difference and integration – by finite sum of function fn values found in
lattice nodes k:
Z X
Dðf 0 ÞðkÞ ¼ Dfn ðkÞ; Dð fÞ ¼ fn :
k

Commutative diagram (1) l : RingMc ! Ringd ; l ¼ D1 , can be regarded on as the


M

categories comparison. Functor l can be also considered as “approximate” morphism


of initial category RingM having an error en ! 0; n ! 1. It can be described in the
form of diagram

U UD;n
Dn : ½O  O0  ! ½On  O0n
U1 U1
D;n

Discretization functor D1 transits morphism U 2 RingM c in Un 2 Ringd one of


M
M
discrete category Ringd . It is known as discrete Fourier transform of lattice functions:

n X
T 2N1 T
f ðtk Þe 2N ; n ¼ 0; 1; . . .2N  1; tk ¼ k
ipkn
Un ðfn Þð Þ ¼ 2 ½0; T :
T 2N k¼0 2N

As consequence, discrete analogues of correlations (2) take place.



In applications, there are also approximation functors

$D = D_2, \qquad D_2 : Ring^M \to Ring^M_{fin}.$

They reduce the dimension of the rings while conserving the continuity of the elements:

$D_2 : f = \sum_{k=0}^{\infty} a_k f_k \;\to\; \sum_{k=0}^{n} a_k f_k.$

The image $D_2(f)$ of a function $f \in O$ is either an interpolation polynomial or a
partial sum of a series that is an expansion of the map; the latter can be a Laurent or
a Fourier series. The ring $O' \simeq \{\,a : a = (a_0, \ldots, a_n)\,\} \in Ring^M_{fin}$
is finite dimensional. In finite dimensional rings, analytical methods can be combined
with computational ones.
In this approach, some diagrams (1), with $\mu = D_2$, remain true in the finite
dimensional category $Ring^M_{fin}$, i.e. $D_2$ is a morphism in $Ring^M$. For
instance, this is the case for $\mu = p$, the projection onto a subspace invariant
under the operators $A_k : O \to O$, $k = 1, 2, \ldots, K$. In general, objects of the
categories $Ring^M_d$ or $Ring^M_{fin}$ are not homomorphic images of $Ring^M$ objects.
The category view helps to observe the general through a connected presentation of the
theories under study. The universality of thinking overcomes theoretical complexity and
explains the essence of specialized knowledge.

5 Conclusions

The huge abilities of the mind are used only in infinitesimal measure. The emerging
technology of man's partnership with DL IA is meant to assist a person's rational
auto-development (self-development on rational ground). Man is born to think in
meanings (to live sanely in thinking). Self-reflection removes the divergence between
ideal ontological constructions and the real means of implementing knowledge in
computer systems. Intellectual processes lean on CogOnt and the system axiomatic
method. DL IA will foster man's thinking in the language of categories. De docta
ignorantia and the attendant perfection of consciousness on the basis of universalities
allow breeding the SIC subject. Man's auto-molding happens at the highest genome level
of neuro-organization and in the trend of rational sophistication. Future information
processing and communication will be held at the semantic level in order to support a
person's successful trans-disciplinary activity.

References
1. Gromyko, V.I., Kazaryan, V.P., Vasilyev, N.S., Simakin, A.G., Anosov, S.S.: Artificial
intelligence as tutoring partner for human intellect. J. Adv. Intell. Syst. Comput. 658, 238–
247 (2018)
2. Gromyko, V.I., Vasilyev, N.S.: Mathematical modeling of deep-learned artificial intelligence
and axiomatic for system-informational culture. Int. J. Robot. Autom. 4(4), 245–246 (2018)

3. Vasilyev, N.S., Gromyko, V.I., Anosov, S.S.: On inverse problem of artificial intelligence in
system-informational culture. J. Adv. Intell. Syst. Comput. Hum. Syst. Eng. Des. 876, 627–
633 (2019)
4. Sadique Shaikh, Md.: Defining ultra artificial intelligence (UAI) implementation using
bionic (biological-like-electronics) brain engineering insight. MOJ Appl. Bio Biomech. 2(2),
127–128 (2018)
5. Deviatkov, V.V., Lychkov, I.I.: Recognition of dynamical situations on the basis of fuzzy
finite state machines. In: International Conference on Computer Graphics, Visualization,
Computer Vision and Image Processing and Big Data Analytics, Data Mining and
Computational Intelligence, pp. 103–109 (2017)
6. Fedotova, A.V., Davydenko, I.T., Pförtner, A.: Design intelligent lifecycle management
systems based on applying of semantic technologies. J. Adv. Intell. Syst. Comput. 450, 251–
260 (2016)
7. Volodin, S.Y., Mikhaylov, B.B., Yuschenko, A.S.: Autonomous robot control in partially
undetermined world via fuzzy logic. J. Mech. Mach. Sci. 22, 197–203 (2014)
8. Svyatkina, M.N., Tarassov, V.B., Dolgiy, A.I.: Logical-algebraic methods in constructing
cognitive sensors for railway infrastructure intelligent monitoring system. Adv. Intell. Syst.
Comput. 450, 191–206 (2016)
9. Hadamer, G.: Actuality of Beautiful. Art, Moscow (1991)
10. Mclane, S.: Categories for Working Mathematician. Phys. Math. Ed., Moscow (2004)
11. Goldblatt, R.: The Categorical Analysis of Logic. North-Holland Publishing Company,
Amsterdam (1979)
12. Husserl, A.: From Idea to Pure Phenomenology and Phenomenological Philosophy: Book 1:
General Introduction in Pure Phenomenology. Acad. Project, Moscow (2009)
13. Pinker, S.: Thinking Substance Language as Window in Human Nature. Librokom, Moscow
(2013)
14. Kassirer, E.: Philosophy of Symbolical Forms. Language Univ. Book, Saint Petersburg 1
(2000)
15. Courant, R., Robbins, G.: What is Mathematics?. Moscow Center of Continuous Education,
Moscow (2017)
16. Euclid: Elements. GosTechIzd, Leningrad (1949–1951)
17. Hilbert, D.: Grounds of Geometry. Tech.-Teor. Lit., Leningrad (1948)
18. Kirillov, A.: What is the Number?. Nauka, Moscow (1993)
19. Artin, E.: Geometric Algebra. Nauka, Moscow (1969)
20. Bachman, F.: Geometry Construction on the Base of Symmetry Notion. Nauka, Moscow
(1969)
21. Maltsev, A.: Algebraic Systems. Nauka, Moscow (1970)
22. Maltsev, A.I.: Algorithms and Recursive Functions. Nauka, Moscow (1986)
23. Shafarevich, I.R.: Main Notions of Algebra. Reg. and Chaos Dynam, Izhevck (2001)
24. Kourosh, A.: Lecture Notes on General Algebra. Phys.-Mat., Moscow (1962)
25. Engeler, E.: Metamathematik der Elementarmathematik. MIR, Moscow (1987)
Author Index

A Boucher, Thomas R., 295


Abella, Dominique Michelle M., 541 Bucyk, Marko, 169
Aborizka, Mohamed, 631 Bux, Rahim, 621
Agerwala, Tilak, 125
Akinyemi, Mary, 692 C
Aldawsari, Layla S., 385 Cantó-Navarro, Enrique, 814
Alegata, Genesis T., 541 Cao, Hailong, 512
Alharthi, Dalal N., 27 Carvalho, Ricardo Silva, 678
Ali, Mesan, 310 Castro, Eveling, 664
Almisreb, Ali Abd, 775 Cedillo, Priscila, 111
AlTunaiji, Salem, 328 Céspedes-González, Yaimara, 17
Aly, Hossam Medhat, 631 Chamansingh, Nicholas, 147
Andaluz, Víctor H., 454 Chang, HongYu, 826
Anosov, Stanislav, 916 Chavarin, Salvador, 571
Ansonska, Evija, 67 Chen, Jiangning, 494
Arhipova, Irina, 67, 157 Choresh, Noa, 882
Arieta-Melgarejo, Patricia, 17 Condori, William, 664
Arman, Md. Shohel, 224 D
Arora, Nikita, 571 da Cunha, Urias Cruz, 678
Ayala, Darnes Vilariño, 282 Dai, Zhibo, 494
Ayana, Abraham G., 512 Das, Saikat, 721
Davoyan, Ashot, 401
B de Jesús Lavalle Martínez, José, 282
Badruzzaman, Khalid Been Md., 224 Dien, Tran Thanh, 55
Balar, Kalpesh, 83 Doma, Vikrant, 571
Barbosa, Salvador E., 270 Domnauer, Colin, 473
Bárcenas, Everardo, 17 Duan, Juntao, 494
Baumeister, Joachim, 91
Berzins, Gundars, 67 E
Bhatti, Sania, 621 Elavarasan, Pravin, 604
Bodike, Yadagiri, 133 Ellington, Beth, 362
Bolic, Miodrag, 525 Erekhinskaya, Tatiana, 864
Bonham-Carter, Oliver, 237 Erglis, Aldis, 67


F Kim, Andrew, 836


Fadhila, Aulia, 842 Kim, Kee Ok, 892
Farma, Vansh, 1 Kiser, Brandon, 133
Korger, Andreas, 91
G Krishna, Praful, 83
García, Aimee Cecilia Hernández, 282 Kulkarni, Harshad, 83
Ghalib, Abdulaziz, 190 Kumar, Akshi, 1
Giang, Nguyen Thi Phuong, 55
Gromyko, Vladimir, 916 L
Gupta, Aarushi, 1 Labib, Soha Safwat, 631
Gupta, Ankit, 705 Lee, James J., 78
Gupta, Himanshu, 83 Lee, Misuk, 78
Lemay, Mathieu, 525
H Leo, Justin, 739
Hammad, Mahmoud M., 27 Li, Ming, 770
Hanif, Ahad, 654 Li, Ruilin, 494
Hariharan, Ayush, 705 Li, Xu, 805
Hariprasad, N., 409 Liu, Yung-Wen, 556
Hasanuddin, Zulfajri Basri, 842 Liyanage, Liwan, 257
Heu, David, 133 López-García, Mariano, 814
Hong, Don, 270 Luna, Gabrielle C., 541
Hosein, Patrick, 42, 147
Hossain, Syeda Sumbul, 224 M
Hsu, Hui-Huang, 794 Machaca, Vicente, 664
Hu, Kaitang, 770 MacPherson, Mary Kate, 525
Hu, Qianli, 494 Madal-Hellmuth, Kelby, 836
Hutzler, Yesha’ayahu, 882 Malik, Zeeshan Haider, 310
Hwang, Hyesun, 805, 892 Manzaki, Satomi, 826
Martinez-Enriquez, A. M., 642, 654
I Matsui, Minako, 826
Ilapakurti, Anitha, 898 Matzinger, Heinrich, 494
Meirane, Inga, 157
J Memon, Mohsin, 621
Jain, Anant, 1 Moldovan, Dan, 864
Jaschek, Tim, 169 Molero-Castillo, Guillermo, 17
Jessup, Tyler D., 190 Monemian, Seyedamin, 190
Jia, Yuting, 556 Moreno, Erika F., 454
Jin, Zhao, 419 Msabaeka, Tsitsi, 295
Johnson, Daphy Louis Lovenia, 604 Muhammad, Adrees, 642
Johnson, Dhalia Sweetlin, 604 Muhammad, Andrees, 654
Johnson, Julia, 190 Muhammad, Aslam, 642, 654
Mullo, Álvaro S., 454
K Munir, Tayyab, 310
Kadari, Bhavishya, 133 Muniswamaiah, Manoj, 125
Kalita, Jugal, 739 Murugappan, Ramanathan, 480
Kamalinejad, Ehsan, 836
Karahan, Saltuk, 873 N
Karunanithi, Ashok, 604 Nanduri, Jay, 556
Kedari, Santosh, 898 Nina, Wilder, 664
Kedari, Sharat, 898 Niswar, Muhammad, 842
Khoa, Tran Thi Minh, 55 Nwachuku, Akwarandu, 692

O Shu, Chun-Yung, 794


Oberoi, Jaspreet S., 169 Singh, Pardeep, 436
Odemuyiwa, Toluwanimi, 752 Singh, Sahil, 571
Ojala, Juha, 341 Sirkeci-Mergen, Birsen, 752
Okude, Naohito, 826 Siva Balan, N., 409
Orellana, Marcos, 111 Sun, Yi, 590
Ortiz, Gabriel, 571 Sundaravadivu, K., 409
Suri, P. K., 436
P
Pacheco, Evelyn E., 454 T
Pal, Trisha, 705 Tahir, Nooritawati Md., 775
Palmer, Xavier-Lewis, 873 Tappert, Charles C., 125
Partala, Timo, 341 Tian, Yan, 590
Partovinia, Vahid, 210 Toma, Tapushe Rabaya, 224
Pellerin, Robert, 210 Trujillo, Andrea, 111
Pike, Adam, 352
Pirouz, Matin, 133, 571 U
Polestico, Daisy Lou L., 541 Ugot, Ogban-Asuquo, 692
Popescu, Ionel, 494 Usman, Ali, 642
Potter, Lucas, 873
Pour, Mahan Balal, 210 V
Pugdeethosapol, Krittaphat, 419 Vasilyev, Nicolay, 916
Velasco, Lemuel Clark P., 541
Q Velázquez-Mena, Alejandro, 17
Qassoud, Hamza, 525 Venkatesh, Sai Vishwanath, 480
Qiu, Qinru, 419 Venugopal, Deepak, 721
Vidal, Mireya Tovar, 282
R Vijayaraghavan, Vineeth, 480
Rafiq, Fatama Binta, 224 Villegas, Juan, 664
Ragav, Abhijith, 480 Vitols, Gatis, 157
Rahman, Shadikur, 224 Vuppalapati, Chandrasekar, 898
Ramoudith, Shiva, 42 Vuppalapati, Jaya Shankar, 898
Rawshan, Lamisha, 224 Vuppalapati, Rajasekar, 898
Raza, Muhammad Owais, 621
Razak, Hana’ Abd, 775 W
Regan, Amelia C., 27 Wang, Jin, 770
Reyes-Ortiz, José A., 282 Wardana, Mayong Adi, 842
Rider, Daniel, 419 Wijesekara, W. M. L. K. N., 257
Rosenberg, Louis, 473 Willcox, Gregg, 473
Rossi, Markku, 341
X
S Xu, Li, 892
Sadeesh Kumar, A., 409 Xu, Shuzhe, 270
Samra, Rasha Abou, 328 Xu, Yiping, 590
Saran, Parneet Kaur, 571
Sato, Chihiro, 826 Y
Selin, Jukka, 341 Yang, Kiyoung, 556
Semwal, Sudhanshu Kumar, 352 Yee, Kieran, 525
Sengupta, Jyotsna, 436 Yen, Shwu-Huey, 794
Shapiro, Daniel, 525 Yeo, Harim, 805, 892
Sharari, Nizar Al, 328 Yinka-Banjo, Chika, 692
Shiva, Sajjan, 721 Yitzhaki, Moshe, 855

Z Zhang, Yafang, 805


Zaghetto, Alexandre, 678 Zhao, Tiejun, 512
Zhai, Haoyan, 494 Zhong, Fay, 836
Zhang, Linrui, 864 Zhou, Yisheng, 864
