Kohei Arai
Supriya Kapoor
Rahul Bhatia Editors
Advances in Information and Communication
Proceedings of the 2020 Future of Information and Communication Conference (FICC), Volume 2
Advances in Intelligent Systems and Computing
Volume 1130
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland
Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing,
Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering,
University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University,
Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas
at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao
Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology,
University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute
of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro,
Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management,
Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering,
The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications
on theory, applications, and design methods of Intelligent Systems and Intelligent
Computing. Virtually all disciplines such as engineering, natural sciences, computer
and information science, ICT, economics, business, e-commerce, environment,
healthcare, and life science are covered. The list of topics spans all the areas of modern
intelligent systems and computing, such as: computational intelligence, soft computing
including neural networks, fuzzy systems, evolutionary computing and the fusion
of these paradigms, social intelligence, ambient intelligence, computational neuroscience,
artificial life, virtual worlds and society, cognitive science and systems,
perception and vision, DNA and immune based systems, self-organizing and
adaptive systems, e-Learning and teaching, human-centered and human-centric
computing, recommender systems, intelligent control, robotics and mechatronics
including human-machine teaming, knowledge-based paradigms, learning paradigms,
machine ethics, intelligent data analysis, knowledge management, intelligent
agents, intelligent decision making and support, intelligent network security, trust
management, interactive entertainment, Web intelligence and multimedia.
The publications within “Advances in Intelligent Systems and Computing” are
primarily proceedings of important conferences, symposia and congresses. They
cover significant recent developments in the field, both of a foundational and
applicable character. An important characteristic feature of the series is the short
publication time and world-wide distribution. This permits a rapid and broad
dissemination of research results.
** Indexing: The books of this series are submitted to ISI Proceedings,
EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **
Editors
Kohei Arai
Faculty of Science and Engineering
Saga University
Saga, Japan

Supriya Kapoor
The Science and Information (SAI) Organization
Bradford, West Yorkshire, UK

Rahul Bhatia
The Science and Information (SAI) Organization
Bradford, West Yorkshire, UK
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Abstract. In this paper, we deal with the issue of feature selection by comparing
different approaches based on the Gravitational Search Algorithm, Particle
Swarm Optimisation and the Genetic Algorithm. The comparison is drawn on the
parameters of feature reduction percentage, accuracy and time taken. The
optimization is performed with the following supervised predictive models:
Multinomial Naive Bayes, Support Vector Machine, Decision Tree, K-Nearest
Neighbour and Multilayer Perceptron. Datasets were acquired from the SemEval
2015 task on the sentiment analysis of figurative language on Twitter (Task 11),
which provided a set of 5198 tweets scored with a fine-grained sentiment score
(−5 to +5, including 0). 11538 features were generated from the datasets, and the
experiments performed succeeded in reducing the features by an average of 55%
without any decline in accuracy.
1 Introduction
Natural language processing (NLP) is a leading research area which deals with
computational methods to achieve human-like language processing. Traditionally, NLP is
widely used to devise efficient and robust techniques for tasks like syntactic
and semantic analysis, text classification, grammar induction, document clustering, etc.
Recently, research in sentiment analysis (SA) has increased to a great extent.
Sentiment analysis includes algorithms which can provide the overall sentiment
expressed in a document. Traditional methods used to determine sentiment were less
accurate and offered little scope for improvement. Later, with the use of supervised algorithms like
Naive Bayes, SVR, and KNN, far more accurate sentiment could be predicted. These
algorithms, however, incur high computational cost due to the large number of features
generated. Among the thousands of features derived per document, very few actually
provide useful information to the classifiers; hence the need for feature reduction
techniques arose.
Social media produces large volumes of varied data at a high rate, describing
people’s beliefs, views and feelings towards an object. Thus, the huge number of tweets
generated from one such social media platform, Twitter, enables sentiment analysis to
transform text into an efficient knowledge database. There exist two major challenges
in Twitter sentiment analysis: (a) tweets contain figurative speech - irony, metaphor
and sarcasm - in which literal meanings should be set aside and the tangential or projected
meaning considered, as they affect the polarity of the overall sentiment of the
tweet; and (b) the high dimensionality of the features used to describe texts, which raises
the problem of expensive computation when applying many widely-used, sophisticated
supervised learning algorithms to determine sentiment.
Figurative language poses a major challenge when assigning a polarity to
tweets. Such speech is unfortunately not restricted to specific forms of literature but
is part of everyday life. Veale and Hao [9], while harvesting similes from Web
knowledge, targeted simile constructions of the form “as A as B” (e.g. “as brave as a
lion”, “as tall as a giraffe”, “as sweet as sugar”, etc.). However, to their amazement, up to
20% of the similes harvested were ironic (e.g. “as white as coal”, “as clear as mud”, “as
private as a park bench”, etc.). Since no significant knowledge can be gathered from
such texts, irony and other figurative speech are considered the worst kind of noise for
semantic analysis. Therefore, to capture “irony”, “metaphor” and “sarcasm” in the
data acquired during the SemEval 2015 Task 11 and other crowdsourced sources, participants
were asked to provide a fine-grained sentiment score between −5 and +5,
including 0 (zero) - the degree to which the sentiment has been expressed as positive,
negative or neutral. These scores were to represent the overall sentiment of the tweet.
Having a high number of variables requires more space to store training data,
expensive computation and long training times, and for some algorithms it can even degrade
the classifier’s performance. The computational cost increases as the number of attributes
grows, and hence models require a larger amount of data for training. The “Hughes
phenomenon” states that the classifier’s performance improves with the
number of features only until a threshold number of features is reached; adding
further features while keeping the size of the training data constant only
reduces the classifier’s performance. Thus, feature reduction methods are used to deal
with the issue of high dimensionality (Fig. 1). The objective of feature reduction is to
find the subset of the original feature set which is most relevant for sentiment analysis, so
as to improve classification accuracy and reduce the execution
time of learning algorithms.
In this paper, we have considered a training dataset rich in irony, metaphor and
sarcasm. Using “Term Frequency-Inverse Document Frequency” (TF-IDF), 11538 features
were generated from the raw data. TF provides the frequency of a word
appearing in each document in the corpus, and IDF gives the weight of rare words
across all the tweets; combining both scores selects the more significant words. Thousands
of features are then reduced using three modified versions of wrapper feature selection
techniques based on GSA, PSO and GA, used in binary form. These techniques are
used with five supervised predictive models: Multinomial Naive Bayes, Support
Vector Machine, Decision Tree, K-Nearest Neighbour and Multilayer Perceptron. The
vector of the most relevant and meaningful set of features is then selected, the
corresponding relevant training data is extracted, and the predictive model is
executed. The results are evaluated on the basis of efficacy measures: the percentage
of features reduced, the accuracy of the predictions, and the execution time.
The discussion is arranged in the following manner. Section 2
summarizes the literature survey and the need for this study, along with related work
already done in this field. Section 3 provides a brief overview of the PSO, GSA and GA
algorithms and their application in feature reduction. Section 4
describes the implementation and the experimental steps. Section 5 presents
the findings, and Sect. 6 concludes the work.
2 Related Work
Sentiment analysis appeared early in some published work [6, 8, 14], and research in
this area has expanded since then. Comprehensive research has
been conducted to determine the polarity of tweet sentiment; such studies explore
and evaluate the microblogging platform Twitter [3, 5, 22, 24, 26].
The challenge that these studies face arises from natural language processing.
Traditional experiments generate thousands of attributes as a result of spelling
variations across separate tweets, for example “oh my god” as “omg” and “before” as “b4”
[11, 16]. The use of multilingual words, emoticons, conjunctions, prepositions and other
irrelevant words does not significantly differentiate the polarity of the tweets. This
causes the difficulty of high dimensionality. Many other studies have been done to
overcome these challenges [1, 4, 18, 19, 25]. Some of those studies derived approaches
such as wrapper techniques, which we applied to our algorithm for subset selection [12].
This research is aimed at text classification, opinion mining and sentiment analysis.
Document frequency (DF), Information Gain (IG), Mutual Information (MI), the
Chi-Squared test (CHI) and Term Strength (TS) have been compared in the past
for statistical learning of text categorization [27]. The experimental results found that IG
and the CHI test were the most efficient methods among those used.
When focusing on the use of metaheuristic learning algorithms in feature reduction,
some studies that make use of them were gathered [2, 13]. Kurniawati and Pardede
[15] proposed using a hybrid of particle swarm optimisation (PSO) and
information gain (IG) to select the most appropriate attributes from the documents, with
SVM as the classifier. The experimental results show that an accuracy of 94.80% was achieved,
which was higher when PSO and IG were used than without them.
4 A. Kumar et al.
Similarly, some studies relate to the use of swarm-based algorithms
in feature reduction [7, 20, 23]. Various PSO variants have been applied with SVM in
the wrapper phase [21]; using the most suitable PSO variant, both phases showed higher and
improved accuracy than the filter phase in the problem domain. Experiments in [17]
combined GSA and KNN and concluded that classification improved with a 66%
decrease in the number of features. In fact, the use of a Genetic Algorithm for feature
selection showed better performance than using all features for text clustering and
classification [10].
To the best of our knowledge, no studies have been done to compare these
metaheuristic methods for feature selection in sentiment analysis. Consequently,
using the five supervised soft computing strategies mentioned above, we implement
the application of PSO, GSA and GA to the models. In this paper, we focus on
drawing a contrast between the use of these three methods on the grounds of
accuracy, time taken and feature reduction percentage.
3 Feature Reduction
This section briefly describes the steps followed for the application of PSO, GA and GSA with
different classification models. The fitness score is calculated using a fitness function that
applies the k-fold cross-validation technique to the learning algorithm.
Output:
  Best Value: type float.
  Best Position: type list of length ‘dim’.
Begin:
  Initialise pbest(i) = 0, fit = [-infinity] * n, it = 0;
  gens = generate the initial population randomly;
  /* randomly initialise the position and velocity vectors of the population generated */
  gbest = min(fit);
  xgbest = index of min(fit);
  while it < max_it do
    for each particle i = 1, ..., n do
      /* estimate the performance score of each particle using evaluate_func */
      score = evaluate_func(gens[i]);
      fit[i] = score;
The output returned is a vector of zeroes and ones, where each 1 represents
the selection of that attribute. The selected subset of attributes is then used to reduce
the dimensions of the training and testing data before executing the prediction models.
The motive of the algorithm is to remove the worst performers in the population and
collect as many of the best individuals as possible, so that only the good population is
carried forward. Hence it is used in searching, and here we use it to find the most
optimal feature subset.
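As a concrete illustration of the binary PSO wrapper sketched above, a minimal self-contained version might look like the following. This is our own sketch, not the authors' implementation: the function name, the constants `w`, `c1`, `c2`, and the sigmoid transfer rule are illustrative choices.

```python
import math
import random

def binary_pso_feature_selection(evaluate_func, dim, n=10, max_it=20,
                                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal binary PSO: each particle is a 0/1 mask over `dim` features.

    `evaluate_func(mask)` returns a cost to minimise (e.g. cross-validated
    error of a classifier trained on the selected features).
    """
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n)]
    vel = [[0.0] * dim for _ in range(n)]
    pbest = [p[:] for p in pos]
    pfit = [evaluate_func(p) for p in pos]
    g = min(range(n), key=lambda i: pfit[i])
    gbest, gfit = pbest[g][:], pfit[g]
    for _ in range(max_it):
        for i in range(n):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # sigmoid transfer: velocity -> probability that the bit is 1
                pos[i][d] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-vel[i][d])) else 0
            f = evaluate_func(pos[i])
            if f < pfit[i]:
                pbest[i], pfit[i] = pos[i][:], f
                if f < gfit:
                    gbest, gfit = pos[i][:], f
    return gfit, gbest
```

The returned `gbest` is exactly the 0/1 selection vector described above; in the paper's pipeline its cost function would be the k-fold cross-validated error of one of the five models.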
Force = G × (Mass1 × Mass2) / Distance^2   (1)
Acceleration = Force / Mass   (2)
However, Rashedi et al. showed through their experimental results that inverse
proportionality to the distance between agents gave better results than inverse
proportionality to the square of the distance stated by the law of gravitation [22].
In GSA, individual particles are considered as entities and their performance
measures are their masses. Each massive object has four specifications: position,
inertial mass, active gravitational mass and passive gravitational mass. All particles are
pulled by gravity towards objects with heavier masses. This is exploited as follows: the
best individual particle, being the heaviest, moves slowly and hence stays closer to the
optimal solution. The steps and formulas are explicitly explained in [22]. The binary
GSA (BGSA) differs in that each dimension can take the value 0 or 1 only. A position
update in BGSA means flipping a value from “0” to “1” or vice versa, and the velocity
of a mass determines the probability of switching between values. Hence, the final value
returned is the most relevant subset of the feature set.
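A single BGSA iteration, following the force/acceleration relations and the bit-flip rule described above, might look like this. It is a simplified, illustrative sketch (the full algorithm of Rashedi et al. also carries per-agent velocities across iterations and decays G over time; all names here are ours):

```python
import math
import random

def bgsa_step(positions, fitness, G, rng):
    """One simplified iteration of binary GSA over 0/1 masks (minimisation).

    Masses are normalised from fitness (best agent -> heaviest). The
    acceleration on agent i from agent j is G * M_j * (x_j - x_i) / distance,
    using inverse distance rather than inverse squared distance, as Rashedi
    et al. found experimentally better. |tanh(v)| gives the flip probability.
    """
    n, dim = len(positions), len(positions[0])
    best, worst = min(fitness), max(fitness)
    raw = [(worst - f) / (worst - best + 1e-12) for f in fitness]
    mass = [m / (sum(raw) + 1e-12) for m in raw]
    new_positions = []
    for i in range(n):
        accel = [0.0] * dim
        for j in range(n):
            if i == j:
                continue
            # Hamming distance between the two binary positions
            dist = sum(a != b for a, b in zip(positions[i], positions[j])) + 1e-12
            for d in range(dim):
                accel[d] += rng.random() * G * mass[j] * (positions[j][d] - positions[i][d]) / dist
        new_pos = []
        for d in range(dim):
            flip = rng.random() < abs(math.tanh(accel[d]))
            new_pos.append(1 - positions[i][d] if flip else positions[i][d])
        new_positions.append(new_pos)
    return new_positions
```

Iterating this step while tracking the best mask seen plays the same wrapper role as the PSO loop: the heaviest (fittest) agents pull the population towards good feature subsets.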
4 System Architecture
The overall system architecture (Fig. 2) can be divided into 5 parts: collecting the
data, making features from the collected tweets, using feature selection techniques,
applying machine learning models and analysing the results.
Fig. 3. Distribution of fine-grained scores as negative, zero and positive scores.
Let us analyse the fine-grained score distribution. Figure 3 shows that 88.7% of
the tweets have been marked with a negative score, whereas 6.9% of the tweets are
marked with a positive score and the remaining 4.4% of the tweets are marked as neutral,
i.e. a zero score.
IDF: Inverse Document Frequency. Words like "is", "are", "the", "a", etc. are usually
present in each document, and there might also be other words present in
every document. Words that appear in every document are not
significant for ranking the documents.
IDF is calculated as the logarithm of the total number of documents in the corpus
divided by the number of documents in which the word x is present; hence,
IDF(x) = log(N/d(x)). For example, if the word "dog" is present in 100,000 documents out of
1,000,000 documents, then IDF(dog) = log(1,000,000/100,000) = 1. With TF(dog) = 0.1,
TF-IDF(dog) = TF(dog) × IDF(dog) = 0.1 × 1 = 0.1.
Hence, TF-IDF has been applied to the pre-processed tweets, producing 11538
features. Features here are the words used in the tweets, and each feature has
been assigned a TF-IDF weighted score.
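The TF and IDF computation described above can be sketched in a few lines (a minimal illustration, assuming already-tokenised documents and log base 10, as in the worked example; the function name is ours):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF, following the formulas above:
    TF(x, d) = count(x in d) / len(d), IDF(x) = log10(N / d(x))."""
    n = len(docs)
    df = Counter()                       # d(x): documents containing word x
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log10(n / df[w])
                       for w, c in tf.items()})
    return scores
```

Note how a word present in every document gets IDF = log10(N/N) = 0 and therefore a zero score, which is exactly why stop-words like "the" drop out of the ranking.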
4.4 Models
Various machine learning models are used to fit and test the results based on the
feature subsets produced by GSA, PSO and GA. The following machine learning
models are used:
1. Multinomial Naive Bayes (MNB) - A family of probabilistic algorithms based on
Bayes' theorem with the "naive" assumption that each pair of attributes is independent
of each other [5, 24]. Only a small amount of training data is required to estimate
the crucial parameters.
2. Decision Tree (DT) - This algorithm breaks the dataset down into smaller subsets,
creating a tree-like structure where each internal node is a condition, each branch an
outcome and each leaf node a result. It contains only conditional control statements.
3. Support Vector Regression (SVR) - One of the most powerful machine
learning techniques, based on maximising the distance between the
hyperplane and the nearest data points; the classes are separated on the basis of
these planes [5, 11, 24].
4. Multi-Layer Perceptron (MLP) - A simple form of deep learning
consisting of an input layer, hidden layers and an output layer. The hidden layers
have an activation function, which helps in generating a non-linear output. Each
layer has weights associated with it, which are adjusted in each epoch (forward
and backward pass) using back-propagation.
5. K-Nearest Neighbour (KNN) - A non-parametric method that assigns a sample the
majority label among its k closest training samples in feature space.
Comparative Study on Swarm Based Algorithms 11
The feature reduction, final accuracies and the time taken have been recorded for the
feature selection techniques GSA, PSO and GA using the machine learning models
Multinomial Naive Bayes (MNB), Decision Tree (DT), Support Vector Regression
(SVR), Multi-Layer Perceptron (MLP) and K-Nearest Neighbour (KNN). Table 1
shows the comparison between the methods adopted during the experiment and their
findings. Let us compare the results on the three factors mentioned above:
Fig. 4. Feature reduction, final accuracy and time taken (minutes) observed using GSA, PSO
and GA on various Machine Learning Models.
b. MLP was able to reduce 8884 features using GA, whereas around 11000
features were reduced in GSA and PSO (10692 in GSA and 10940 in PSO).
c. MNB and KNN performed somewhat similarly. The maximum features reduced in
MNB were 4475 using GA, and in KNN 5252 using GA. So GA outperformed
GSA and PSO in MNB and KNN.
d. The minimum features reduced in SVR were 4488 using GA, whereas the maximum
features reduced were 7284 using PSO. Hence, GA was not able to outperform
GSA and PSO in SVR.
e. Decision Tree was the most impressive of all, reducing a minimum of 8899
features in PSO, 9430 features in GA and 11238 features in GSA
(using just 300 features out of 11538).
c. It is clear from the graph that the time taken depends only on the model used and is
mostly independent of the feature reduction technique used.
d. The time taken by the models is in increasing order, i.e.
MNB < DT < KNN < MLP < SVR.
6 Conclusions
The improvement in final accuracy despite the huge reduction in the number of features
leads to the conclusion that only a certain number of features are useful for prediction.
These features are reliable in terms of accuracy, including on entirely new data on
which the model has not yet been trained.
Such a large reduction in features prevents long computation times. Time is
consumed only once, during the selection of the optimal subset of features. Once the
best features are selected, the model takes less time to predict the output than with the
original number of features, since the algorithm runs on fewer features.
It also shows that it is not safe to say that more features mean better prediction; our
experiment has shown that better features mean better prediction, that is, "quality
over quantity".
References
1. Agarwal, B., Mittal, N.: Categorical probability proportion difference (CPPD): a feature
selection method for sentiment classification. In: Proceedings of the 2nd Workshop on
Sentiment Analysis where AI meets Psychology, COLING 2012, Mumbai, December,
pp. 17–26 (2012)
2. Ahmad, S.R., Bakar, A., Yaakub Mohd, R.: Metaheuristic algorithms for feature selection in
sentiment analysis. In: Science and Information Conference (SAI), London, UK, pp. 222–
226, 28–30 July 2015
3. Bing, L., Chan, K.C.C.: Fuzzy logic approach for opinion mining on large scale twitter data.
In: Proceedings of 7th International IEEE Conference Utility and Cloud Computing,
pp. 652–657 (2014)
4. Brezočnik, L., Fister, Jr., I., Podgorelec, V.: Swarm intelligence algorithms for feature
selection: a review. Appl. Sci. 8(9), 1521 (2018)
5. Dash, A., Rout, J., Jena, S.K.: Harnessing twitter for automatic sentiment identification using
machine learning techniques. In: Proceedings of 3rd International Springer Conference.
Advanced Computing, Networking and Informatics, India, pp. 507–514 (2016)
6. Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: opinion extraction and
semantic classification of product reviews. In: Proceedings of 12th International ACM
Conference World Wide Web, Hungary, pp. 519–528, 20–24 May 2003
7. Fong, S., Gao, E., Wong, R.: Optimized swarm search-based feature selection for text
mining in sentiment analysis. In: IEEE International Conference on Data Mining Workshop
(ICDMW), Atlantic City, NJ, USA, pp. 1153–1162, 14–17 November 2015
8. Gamon, M.: Sentiment classification on customer feedback data: noisy data, large feature
vectors, and the role of linguistic analysis. In: COLING 2004: Proceedings of the 20th
International Conference on Computational Linguistics, Geneva, Switzerland, pp. 841–847,
23–27 August 2004
9. Ghosh, A., Li, G., Veale, T., Rosso, P., Shutova, E., Barnden, J., Reyes, A.: SemEval-2015
task 11: sentiment analysis of figurative language in Twitter. In: Proceedings of the 9th
International Workshop on Semantic Evaluation, Denver, Colorado, pp. 470–478, 4–5 June
2015
10. Hong, S., Lee, W., Han, M.: The feature selection method based on genetic algorithm for
efficient of text clustering and text classification. Int. J. Adv. Soft Comput. Appl. 7(1), 2074–
8523 (2015)
11. Huq, M.R., Ali, A., Rahman, A.: Sentiment analysis on twitter data using KNN and SVM.
IJACSA Int. J. Adv. Comput. Sci. Appl. 8(6), 19–25 (2017)
12. Kohavi, R., John, G.: Wrappers for feature subset selection. Artif. Intell. 97(1–2), 273–324
(1997)
13. Kristiyanti, D.A., Wahyudi, M.: Feature selection based on Genetic algorithm, particle
swarm optimization and principal component analysis for opinion mining cosmetic product
review. In: 5th International Conference on Cyber and IT Service Management (CITSM),
Denpasar, Indonesia, 8–10 August 2017
14. Kumar, A., Sebastian, T.M.: Sentiment analysis on twitter. Int. J. Comput. Sci. 9(4), 372–
378 (2012)
15. Kurniawati, I., Pardede, H.F.: Hybrid method of information gain and particle swarm
optimization for selection of features of SVM – based sentiment analysis. In: Proceedings of
2018 International Conference on Information Technology Systems and Innovation
(ICITSI), Bandung - Padang, Indonesia, 22–26 October 2018
16. Larsen, M.E., Boonstra, T.W., Batterham, P.J., Bridianne, O., Paris, C., Christensen, H.: We
feel: mapping emotion on twitter. IEEE J. Biomed. Health Inform. 19(4), 2168–2194 (2015)
17. Nagpal, S., Arora, S., Dey, S.: Feature selection using gravitational search algorithm for
biomedical data. Procedia Comput. Sci. 115, pp. 258–265 (2017)
18. Nicholls, C., Song, F.: Comparison of feature selection methods for sentiment analysis. In:
Farzindar, A., Kešelj, V. (eds.) Advances in Artificial Intelligence. Canadian AI 2010.
LNCS, vol. 6085, pp. 286–289. Springer, Heidelberg (2010)
19. O’Keefe, T., Koprinska, I.: Feature selection and weighting methods in sentiment analysis.
In: Proceedings of the 14th Australasian Document Computing Symposium, Sydney,
pp. 67–74 (2009)
20. Papa, J.P., Pagnin, A., Schellini, S.A., Spadotto, A., Guido, R.C., Ponti, M., Chiachia, G.,
Falcao, A.X.: Feature selection through gravitational search algorithm. In: IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2052–2055
(2011)
21. Rahman, S.A., Bakar, A.A., Mohamed-Hussein, Z.A.: An intelligent data pre-processing of
complex datasets. Intell. Data Anal. 16(2), 305–325 (2012)
22. Rashedi, E., Nezamabadi, H., Saryazdi, S.: BGSA: binary gravitational search algorithm.
Nat. Comput. 9(3), 727–745 (2010)
23. Vieira, S.M., Mendonça, L.F., Farinha, G.J., Sousa, J.M.C.: Modified binary PSO for feature
selection using SVM applied to mortality prediction of septic patients. Appl. Soft Comput.
13(8), 3494–3504 (2013)
24. Wang, N., Varghese, B., Donnelly, P.D.: A machine learning analysis of twitter sentiment to
the sandy hook shootings. In: Proceedings of 12th International IEEE Conference e-Science,
USA, pp. 303–312 (2016)
25. Wang, S.: A feature selection method based on fisher’s discriminant ratio for text sentiment
classification. In: International Conference on Web Information Systems and Mining. LNCS,
vol. 5854, pp. 88–97. Springer, Heidelberg (2009)
26. Whitley, D.: A genetic algorithm tutorial. Stat. Comput. 4(2), 65–85 (1994)
27. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In:
Proceeding of ICML 1997 Proceedings of the Fourteenth International Conference on
Machine Learning, San Francisco, CA, USA, pp. 412–420, 08–12 July 1997
Statistical Analysis of the Effects of Institutions
on the Economic Growth of France
in Recent Years
Abstract. This paper presents an analysis of the effects of institutions on
the economic growth of France during the period 2014–2017. We chose to study
this period due to the reforms which took place from 2014 onwards, aiming to improve
productivity and economic growth so as to preserve France's place in the worldwide
economy. The data source used was the OECD Economic Outlook database,
which collects information to encourage governments to make decisions on
economic and social issues for the future, such as finding stability and promoting
responsibility. The analysis was made in two stages, a preliminary
review and a trend analysis. The variables analyzed were business confidence, the
labor market, exports, and public spending. The results obtained show that
business confidence in France, backed by the growth of economic investment,
has increased significantly in recent years. In addition, it was observed that the
labor market had an important growth in employment, and that exports tend to
be volatile due to a lack of strengthening demand from its business partners.
1 Introduction
The prosperity of a country depends on its economic institutions, and these, in turn,
depend on political power and political institutions [1, 2]. This results in an accumulation
of physical and human capital, access to technology, an appropriate assignment of
resources, and innovation [3–5]. However, attaining a sufficient level of these elements
might be affected by institutional characteristics [6]. These characteristics vary
according to the type of organization, the distribution of political powers, how the
productive areas work, the security level of property rights, the risk of expropriation, as
well as the efficiency of the legal system and of the government, among others [1].
The concept of institutions has been defined by several authors. According to [2],
institutions are the limitations and incentives that determine human interaction
in the political, social and economic exchange. In [7] defines the institutions as
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 17–26, 2020.
https://doi.org/10.1007/978-3-030-39442-4_2
18 Y. Céspedes-González et al.
organized entities with clear processes, structured and regulated. As for [1] they rep-
resent the combination of three connected notions: (a) the economic institutions,
establishing the distribution of resources and influencing the assignment of de facto the
power; (b) the political power, shaped by political institutions setting up the economic
institutions and economic performance, so they can have an impact on the prosperity
and future assignment of political power; (c) the political institutions, assigning the
political power of de jure to certain groups.
The connection between these three notions establishes a subordination of the
economic institutions to the political ones. On the one hand, the political institutions set up
the distribution of de jure political power, while on the other hand, the allocation of
resources influences the distribution of de facto political power. Both powers,
de jure and de facto, affect the development of political institutions and the choice of
economic institutions, the latter being the ones that define the economic results and
the future allocation of resources.
Institutions are considered quality institutions when they guarantee and protect
property rights. Besides, these institutions ensure equitable access to economic resources
and, to some degree, equality of opportunity for a significant number of individuals [1, 8].
This creates competition and encourages the involvement of people in economic
activities. Therefore, quality institutions are more likely to be implemented
when political power is held by a representative group of society, and when there are
regulations.
On the other hand, economic growth means the increase in the value of the goods and
services produced in an economy during a specific period. The Gross Domestic Product
(GDP) is generally used to measure this economic growth. This growth is linked
with productivity and the improvement of people's standard of living [6]. Therefore,
this indicator is used to weigh the socioeconomic conditions of a country.
France is one of the five largest economies worldwide, measured by GDP, a
position mainly due to its strength in several sectors such as defense, technology,
aeronautics and the nuclear industry, among others [9]. In this context, French economic
growth in recent years, from 2014 onwards, has shown a stable rhythm of expansion. Even
though this growth process has been stable for some years, periods of economic decline
and stagnation have also been evinced. Nevertheless, these periods have been
overcome with success, without reaching extended terms like other countries of the region.
That is why, in the European sphere, French growth could be considered a stable
experience.
In this regard, this study aimed to analyze the effects of institutions on the
economic growth of France over the last four years, 2014–2017. To reach this goal, we
investigated the impact of institutional reforms on French economic growth. According
to [10], the proper implementation of these measures predicts a rise in GDP, in the
employment rate, in the trade balance, and in the government balance. It is essential
to emphasize that the unfolding of French economic development has its antecedents
in a historical process: the transition from an absolute monarchy to institutions that
supported the feudal system and, later, capitalism.
Statistical Analysis of the Effects of Institutions 19
2 Background
Democracy was established in France long ago: it was the first European country to
use a voting system for elections, in 1848. France has gone through empires, monarchies,
and republics; currently, it is a semi-presidential republic [11, 12] with a relatively
independent executive. France is a member of international organizations such as the
G8, the European Union, the Schengen Area, the United Nations, and NATO [10], among
others. In addition, it is home to important headquarters, such as those of the Council of
Europe and the European Parliament, as well as to market-leading multinational
companies in several economic sectors.
role of political ideology was confirmed as an indirect determinant of growth, (b) it was
shown that left-wing parties promote equality at the expense of economic growth, and (c) the
means through which political ideology affects economic growth were identified, such
as public expenditure and fiscal and budgetary policies; these elements affect employment
and income inequality and, therefore, GDP growth.
In the current context, various studies have also been carried out. One of these, by
the Organisation for Economic Co-operation and Development (OECD), analyzed certain
structural reforms and their impact on France's economic growth. The analysis showed
that France's economic growth between 2008 and 2013 was slow, at 1.25%, while in
2014 it was 0.4%. Because of this situation, in 2014 France faced the challenge of
improving its productivity and growth. To do so, it was necessary to modify its economic
and social structures through reforms aimed at preserving its place in the world
economy. The reforms were made in four areas [9]:
– Improve competitiveness in the goods and services markets. These regulations were
designed to strengthen competition and reduce prices, especially in sectors such as
energy, transport, commerce, legal services, accounting, and architecture. In addition,
other regulations were modified to protect certain professional practices and to avoid
monopolies in some sectors.
– Improve the functioning of the labor market, in order to stimulate the supply of labor
and to reduce its cost. These reforms were accompanied by fiscal modifications and
other reforms aimed at improving labor supply, incentives, and the quality of the
workforce.
– Clean up distortions in the fiscal structure. These reforms were aimed at reducing
corporate tax rates, introducing a carbon tax, increasing VAT for certain rates, and
reducing income tax.
– Simplify the territorial organization of the country. The intention was to reduce political
segmentation and to facilitate the proper functioning of local labor markets.
In general, France has devoted special attention to reducing the bureaucracy that
affects companies and the productive process. The stated goal of improving education and
training is noteworthy, as it guarantees the implementation and continuity of appropriate
mechanisms. The creation of metropolitan areas is an important step toward defining
responsibilities at the territorial level, with the aim of promoting local productivity and
improving trade between regions.
Therefore, according to [10], the implementation of the reforms promises, by the
year 2020, an increase in GDP of 0.4%, in employment of 0.31%, in the trade
balance of 0.03%, and in the government balance of 0.27%.
3 Method
3.1 Data Source
The analysis of the effects of institutions on the economic growth of France was
based on data from the OECD Economic Outlook 102 database, accessed through the
OECD institutional page (http://stats.oecd.org/index.aspx). This data source includes a
set of macroeconomic data for the 35 countries that belong to the OECD, including
France, as well as data for 10 other countries that do not belong to the OECD, such as
Colombia, Brazil, China, and Russia.
The OECD (Organisation for Economic Co-operation and Development) collects and
analyzes data and information to help governments make decisions on future economic
and social issues, such as fighting poverty, finding stability, and promoting
responsibility [17]. The OECD Economic Outlook database includes data on spending,
foreign trade, production, employment and unemployment, interest and exchange rates,
the balance of payments, government and household disbursements and revenues,
government debt, supply, and fiscal indicators.
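As an illustration of how such a series might be loaded for analysis, the sketch below parses a minimal OECD-style CSV. The column names and layout are hypothetical stand-ins for the actual Economic Outlook export format; only the two unemployment values are taken from Table 1:

```python
import csv
import io

# Hypothetical OECD-style export: LOCATION/TIME/VARIABLE/VALUE columns are
# illustrative, not the database's real schema.
sample = """LOCATION,TIME,VARIABLE,VALUE
FRA,2014-Q1,UNR,10.135
FRA,2014-Q2,UNR,10.204
"""

# Keep only the French unemployment series and index it by quarter.
rows = [row for row in csv.DictReader(io.StringIO(sample))
        if row["LOCATION"] == "FRA" and row["VARIABLE"] == "UNR"]
unemployment = {row["TIME"]: float(row["VALUE"]) for row in rows}
print(unemployment["2014-Q1"])  # prints 10.135
```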
Table 1. Cumulative quarterly values for the business confidence index, the labor market
(employment in % changes, unemployment in % of the labor force), and exportation
(indices, 2008 = 100) variables.

Date   | Business confidence (index) | Employment (% changes) | Unemployment (% force) | Motor vehicles | Air and spacecraft | Other goods
2014q1 | 94.073  | 0.485  | 10.135 | 67.195 | 146.448 | 103.177
2014q2 | 94.172  | 0.190  | 10.204 | 67.494 | 144.872 | 102.559
2014q3 | 91.799  | −0.216 | 10.415 | 68.158 | 143.081 | 102.397
2014q4 | 92.507  | 0.120  | 10.462 | 68.113 | 146.036 | 102.789
2015q1 | 94.224  | −0.174 | 10.314 | 68.289 | 150.851 | 103.125
2015q2 | 97.095  | −0.048 | 10.439 | 70.192 | 155.761 | 104.422
2015q3 | 99.678  | 0.367  | 10.505 | 71.625 | 162.634 | 105.747
2015q4 | 100.856 | 0.283  | 10.193 | 74.375 | 166.653 | 105.953
2016q1 | 100.932 | 0.727  | 10.180 | 76.149 | 165.038 | 106.331
2016q2 | 100.973 | 0.675  | 9.955  | 77.369 | 162.780 | 106.292
2016q3 | 101.371 | 0.510  | 10.121 | 77.349 | 160.748 | 105.508
2016q4 | 102.930 | 0.478  | 9.959  | 78.069 | 158.720 | 105.281
2017q1 | 104.246 | 0.448  | 9.541  | 79.749 | 162.154 | 106.192
2017q2 | 105.303 | 1.233  | 9.424  | 81.798 | 160.762 | 107.300
22 Y. Céspedes-González et al.
In the second stage, the trends of the previously selected variables were analyzed. The
analysis of economic growth in France was carried out from 2014 onward, owing to the
reforms initiated in that year to improve the country's productivity and growth and to
preserve its place in the world economy.
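The trend analysis of this stage can be sketched as an ordinary least-squares slope over a quarterly series. The business confidence values below are copied from Table 1; the helper function itself is an illustrative stand-in, not the statistical tooling actually used in the study:

```python
# Business confidence index values, 2014q1-2017q2, copied from Table 1.
confidence = [94.073, 94.172, 91.799, 92.507, 94.224, 97.095, 99.678,
              100.856, 100.932, 100.973, 101.371, 102.930, 104.246, 105.303]

def ols_slope(y):
    """Least-squares slope of y against the period index 0..n-1."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    cov = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y))
    var = sum((x - x_mean) ** 2 for x in range(n))
    return cov / var

# A positive slope of about one index point per quarter: an upward trend.
print(round(ols_slope(confidence), 3))
```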
4 Results
Figures 1 and 2 show the results obtained for the four variables analyzed: business
confidence, labor market, exportation, and public spending. Regarding business confidence
in France, it was observed that in recent years this confidence has increased continuously,
thus supporting the growth of economic investment. This may be a consequence of
increased social security and of business tax cuts granted to companies. However, it is
notable that investment growth suffered a slight deceleration during 2017, which may
be related to the depreciation in April 2017.
Fig. 1. Trends of analyzed variables: (a) Business confidence, and (b) Labor market.
On the other hand, it was also observed that the labor market experienced significant
growth during 2017. This indicates a recovery driven by low interest rates and supported
by household consumption and investment. In addition, according to [16], business
surveys indicate favorable hiring prospects. To improve labor-market outcomes, the
French government must promote the inclusion of less qualified workers by encouraging
the use of both short-term and open-ended contracts. This could improve access to more
stable jobs and the training of many workers.
Fig. 2. Trends of analyzed variables: (a) Exportation, and (b) Public spending.
In the case of exports, they show a negative (volatile) pattern owing to a series of
temporary factors, such as supply interruptions in aeronautics and bad weather that
affected agricultural exports. This situation arises because exports have been driven by a
few sectors, especially the transport industries. Another unfavorable factor is that French
companies have not yet fully benefited from the strengthening of demand from their
trading partners.
In terms of public spending, compared with other countries, in 2017 the French
government took measures to reduce such spending. These measures include reducing
the overlap of responsibilities (that is, merging small municipalities) and improving the
access of older workers to training and gradual retirement. In addition, the government
intends to reduce hiring and housing subsidies, as well as to hold the growth of public
spending below 0.5% per year between 2018 and 2022. Another measure adopted,
according to the OECD, is the increase in taxes on energy and tobacco, which helps
make economic growth more ecological and strengthens prevention in health care.
5 Conclusions
France was one of the first democratic countries, which allowed it to become a solid
nation with rising economic growth.
In the French economy, despite the elimination of hiring subsidies for small
businesses and the reduction of jobs, employment growth should continue, supported by
growth in GDP. This could strengthen the growth of private consumption and bring a
gradual fall in the unemployment rate.
Labor-market reform in France can facilitate negotiations in the business sector,
especially in small businesses. In addition, substantial investment in training and in the
strengthening of apprenticeships should be planned.
Exports in France should benefit from increased demand from trading partners and
a rebound in emerging markets, as well as from the recovery of tourism after the terrorist
attacks of recent years.
French institutions determine the economic growth of the country. However,
events in other European countries, such as terrorist attacks, could undermine
confidence in the economic union and in trade among member countries.
France has faced and overcome crises caused by events of a global order thanks to
the institutions the country has forged. However, France still faces multiple challenges
in increasing its competitiveness. To achieve this, it is essential to have efficient
regulatory systems that simplify and foster economic growth and that protect property
rights.
References
1. Acemoglu, D., Johnson, S., Robinson, J.: Institutions as a fundamental cause of long-run
growth. In: Handbook of Economic Growth, vol. 1A, pp. 385–472. Elsevier (2005)
2. North, D.: Institutions, institutional change and economic performance. Fourth reprint.
Fondo de Cultura Económica, Mexico (2012)
3. Esso, L.: Changement technologique, croissance et inégalité: l’importance du capital humain
et des institutions, Ph.D. thesis, Economies et finances, Université Panthéon-Sorbonne-Paris
I (2006)
4. Galindo, M.: La innovación y el crecimiento económico: una perspectiva histórica. Econ.
Ind. 368, 7–25 (2008)
5. Galindo, M.: Crecimiento económico. Tendencias y Nuevos Desarrollos de la Teoría
Económica 858, 39–56 (2011)
6. Docquier, F.: Identifying the Effect of Institutions on Economic Growth. Institutional
Competition between Common Law and Civil Law, pp. 25–40. Springer, Berlin (2014)
7. Williamson, O.: The new institutional economics: taking stock, looking ahead. J. Econ. Lit.
38(3), 595–613 (2000)
1 Introduction
In our digital world, information security threats can be divided into two primary
types: technical hacking and social engineering attacks. In technical hacking,
cyberattackers conduct attacks using advanced techniques to gain unauthorized
access to systems.
However, it has become difficult for hackers to successfully attack computer
systems and networks using purely technical means [3]. Therefore, hackers rely
on social engineering attacks to bypass technical controls. Social engineering
allows cyberattackers to gain unauthorized access to systems by psychologically
manipulating users [7,17].
Compared to technical hacking, social engineering is an easier, cheaper, and
more effective way to gain access to confidential information. Numerous previous
research efforts have demonstrated the success of social engineering attacks [6,15,
21,27,34]. Social engineering attacks are conducted either through person-to-person
interaction (in person or over the phone) or through computer interaction (email,
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 27–41, 2020.
https://doi.org/10.1007/978-3-030-39442-4_3
28 D. N. Alharthi et al.
2 Background
This section provides an overview of the different kinds of common social engi-
neering security attacks.
Organizations mainly focus on deploying high-quality, sophisticated security
tools to detect security vulnerabilities or even prevent security attacks. However,
security is only as strong as the weakest point in the system, which includes
the human actors. Since humans are the weakest point in the information security
chain, they are targeted by social engineers. According to [10], misuse of
information systems by humans, both intentional and unintentional, accounts
for 50% to 75% of cybersecurity threats.
As stated by Granger [14], social engineering is "the art and science of getting
people to comply with your wishes". It can be defined as the practice of acquiring
information through technical and non-technical means [22]. Social engineering
attacks therefore rely on convincing people that the social engineer is a trusted
friend or colleague. Social engineering attacks can be carried out either by a
human or by a machine through a software system [23]. Social engineering attacks
have no limit; they depend only on the creativity of social engineers. In the
past few years, the number and sophistication of social engineering attacks
have increased, and the attacks have become more diverse. These attacks are difficult to detect
A Taxonomy of Social Engineering Defense Mechanisms 29
As shown on the left side of Fig. 1, the technical social engineering attacks are Vishing,
Phishing, Spear Phishing, Spam Email, Interesting Software, Popup Window, Baiting,
Tailgating, and Waterholing.
– Phishing and Trojan Email rely on carefully crafted messages to entice victims
to open attachments or click on embedded hyperlinks [3]. In this security
attack, the victim is entirely unknown to the social engineer.
Phishing is one of the most successful social engineering attacks. One of the
biggest phishing attacks occurred in March 2016, during the U.S. presidential
election. It targeted John Podesta, the former chairman of Hillary Clinton's U.S.
presidential campaign, and, through his account, some of Clinton's emails. The
target of the attack was Podesta's personal Gmail account, which held messages
from 2007 through 2016 [29, 33]. The phishing email contained a "change
password" link; once John Podesta clicked on it and changed his password, the
social engineers received his password and locked his account.
– Vishing (voice phishing) occurs by tricking people into revealing sensitive
information through a phone call.
– Spear Phishing is similar to the phishing attack, but in this attack the victim's
information is known to the social engineer, who can therefore launch customized
cyberattacks.
– Spam Email is an email that offers friendships, diversion, gifts and various
free pictures and information in order to plant malicious code on the reader’s
machine.
– Interesting Software and Popup Windows are other social engineering techniques
in which a social engineer convinces a victim to download and install a seemingly
useful program or application, such as a CPU performance enhancer, or displays
a pop-up window that prevents the victim from proceeding with the session
unless he re-enters his username and password.
– Baiting happens when a malware-infected storage medium is left in a location
where it is likely to be used by targeted victims [22].
– Tailgating aims at accessing unauthorized places by getting help from an
authorized person.
– Waterholing means compromising a website that is likely to be of interest to
a chosen victim [3].
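Several of the email-borne attacks above hinge on a mismatch between the address a link displays and the address it actually points to. The sketch below illustrates that single heuristic only; the regular expression, function name, and sample message are illustrative constructions of ours, not a mechanism proposed in this paper:

```python
import re

# Naive illustration (not a production defense): flag anchors whose visible
# link text claims one URL while the underlying href points elsewhere --
# the classic lure in phishing and trojan emails.
LINK = re.compile(r'<a href="(?P<url>[^"]+)">(?P<text>[^<]+)</a>')

def suspicious_links(html: str) -> list:
    flags = []
    for m in LINK.finditer(html):
        url, text = m.group("url"), m.group("text")
        # Link text looks like a URL but does not appear in the real target.
        if "http" in text and text.strip() not in url:
            flags.append(url)
    return flags

msg = '<a href="http://evil.example.net/reset">https://mail.example.com</a>'
print(suspicious_links(msg))  # ['http://evil.example.net/reset']
```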
3 Related Work
A large body of research focuses on purely technical security attacks,
while fewer researchers have focused on social engineering attacks. This section
discusses the related research efforts in light of this research.
Medlin et al. [24] conducted a study to analyze the vulnerability of U.S.
hospitals to social engineering attacks. Employees who volunteered to complete
the survey were rewarded with both candy and a chance to win a gift card.
Within the questions, employees were asked to reveal their passwords and some
other confidential information. Surprisingly, 73% of them shared their passwords.
Krombholz et al. [22] illustrated some real-world examples of social engineer-
ing attacks against major companies, including the New York Times, Apple,
Facebook, Twitter, and the RSA Network Security LLC company. In 2013, social
engineers targeted the New York Times. The initial attack was a Spear Phishing
attack, recall Sect. 2, which sent fake FedEx notifications. Then the New York
Times hired computer security experts to analyze the attack, and they found
that some of the methods used to break into the company’s infrastructure were
associated with the Chinese military, i.e., a political motive. Because of this SE
attack, social engineers stole the passwords of some employees in The New York
Times, and hence they were able to access the personal devices of 53 employees.
As another example, leveraging Waterholing SE attacks in 2013 against
Apple, Facebook, and Twitter, social engineers were able to exploit a zero-day
vulnerability. Specifically, they were able to sneak into the corporate networks
and inject malicious code onto their websites. Once a user visited the infected
website, his device would be compromised. Moreover, in 2011, a small number of
RSA employees received an email entitled “2011 Recruitment Plan”. The email
was well written, so readers were convinced that it was legitimate. The email
contained a spreadsheet which contained a malicious payload to exploit a vulner-
ability on the user’s device. This SE attack led to stealing sensitive information
of the RSA SecurID system [22].
Aldawood and Skinner [2] suggested several methods organizations can follow
to educate their employees about reducing the effect of social engineering
attacks: Serious Games, Gamification, Virtual Labs, Simulations, Modern
Applications, and Tournaments. A serious game lets employees face real-time
scenarios with an opportunity to apply their knowledge and implement mitigation
strategies. Similarly, an organization can use gamification to assess the behavior
of hypothetical victims of social engineering attacks. Remote online networks,
known as Virtual Labs, help trainees learn about social engineering threats via
virtual solutions. Simulations can be used as models of real scenarios to evaluate
various social engineering attacks. Additionally, modern applications that rely on
software-based training and learning modules can be used to assess different types
of social engineering threats. Furthermore, tournaments, i.e., communication-threat
competitions, can be run between multiple organizations that need social
engineering mitigation training.
Orgill et al. [27] demonstrated two metrics for determining security compliance
in an organization: user education and security auditing. They emphasized the
importance of educating employees about social engineering attacks and how to
prevent them.
Ghafir et al. [13] emphasized the importance of adopting a multi-layer
defense, also referred to as defense in depth, to lower the risk associated with
social engineering attacks. They showed that a good defense-in-depth structure
should include a mixture of security policy, user education/training, and audits/
compliance, as well as safeguarding the organization's network, software, and
hardware. The paper also illustrated the four steps of social engineering, which are
(1) information gathering, (2) developing relationships, (3) exploitation, and (4)
execution.
Chitrey et al. [9] developed a model of social engineering attacks. The model
categorized social engineering attacks under two main entities: (1) vulnerable
entities which are human, technology, and government laws and (2) safeguards
entities which are information security awareness program, organization security
policies, physical security, access control, technical control, and secure applica-
tions development. Such a model can be used in the development of organization-
wide information security policy.
Gupta and Sharman [16] proposed a framework for the development of a
Social Engineering Susceptibility Index (SESI) based on social network theory
propositions. The framework reveals the real risks of social engineering attacks
to which employees are exposed. It suggests five indices: social function,
organizational hierarchy, organizational environment, network characteristics,
and relationship characteristics.
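As an illustration of how such an index might be operationalized, the sketch below combines the five suggested indices into a single score. The equal weighting and the 0–1 scale are our assumptions for illustration; [16] does not prescribe this formula:

```python
# Illustrative only: [16] proposes the five indices, but this equal-weight
# combination and the 0-1 scale are assumptions, not the paper's formula.
INDICES = ["social_function", "organizational_hierarchy",
           "organizational_environment", "network_characteristics",
           "relationship_characteristics"]

def sesi(scores: dict, weights: dict = None) -> float:
    """Combine per-index scores in [0, 1] into one susceptibility value."""
    weights = weights or {name: 1 / len(INDICES) for name in INDICES}
    return sum(scores[name] * weights[name] for name in INDICES)

# Hypothetical employee profile.
employee = {"social_function": 0.8, "organizational_hierarchy": 0.4,
            "organizational_environment": 0.5, "network_characteristics": 0.7,
            "relationship_characteristics": 0.6}
print(round(sesi(employee), 2))  # prints 0.6
```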
Beuran et al. [8] used the main cybersecurity training programs in Japan
as a detailed case study for analyzing best practices and methodologies in the
field of cybersecurity education and training. The paper defined a taxonomy of
requirements for ensuring adequate cybersecurity education and training. The
developed taxonomy has two main aspects: training content and training activities.
Regarding training content, there are three main categories: attack-oriented
training, defense-oriented training, and analysis/forensics-oriented training.
Another perspective on cybersecurity training focuses on security-related activities,
which include individual skills, team skills, and Computer Security Incident
Response Team (CSIRT) skills.
According to [20], a combination of technical, social, economic, and psychological
factors affects an employee's decision-making process when contemplating
4 Research Methodology
This section describes the research question this research aims to answer
(Sect. 4.1) and the research methodology followed to develop the taxonomy
(Sect. 4.2).
The goal of this research is to develop a taxonomy of the main defense
mechanisms against social engineering attacks. To that end, this section presents
and discusses the research question this study tries to answer.
RQ1. What are the main defense mechanisms against social engineering
attacks that employees and organizations should be aware of?
To answer this research question, the paper conducted a thorough literature
review and identified the main target points of social engineers. It then outlined
the defense mechanisms relevant to each target point. These defense mechanisms
reduce or even prevent social engineering attacks; hence, employees should be
made aware of them, either through training programs or through reading
materials distributed periodically by their organizations.
By answering this research question, organizations will have a better under-
standing of social engineering defense mechanisms so they can take the right
actions to incorporate them and make them part of their organizational culture.
5 The Taxonomy
To answer the research question (RQ1 in Sect. 4), the authors conducted a
thorough investigation of the literature (recall Sect. 4) and found that there are
five main target points for social engineers. Social engineers try to achieve their
malicious goals through these target points, which are the main assets of any
organization: People, Data, Software and Hardware (SW/HW), and Network. For
each target point, the authors determined the defense mechanisms that prevent
potential social engineering attacks against it. Figure 2 depicts a tree-structured
taxonomy of the main target points and the defense mechanisms for each. Next,
the paper provides a detailed description of each target point and of the defense
mechanisms against social engineering attacks targeting it. Employees and
organizations should be aware of these defense mechanisms in order to prevent
social engineering attacks.
5.2 Data
Data is a valuable asset for any organization, and it is a critical target point
for cyberattackers, both at the personal and at the organizational level. At the
personal level, social engineers might target the personal data of a high-profile
employee, such as family pictures, videos, or salary. At the organizational level,
many types of sensitive information can be targeted, such as planning
Backup and Replication. Constantly backing up data and creating replicas,
whether online or offline, ensures the integrity and availability of the data.
Employees should be aware of their organization's backup and replication policy
so that they can apply it to the organization's data stored on servers and on their
work computers. Providing consistent rules for backup and replication management
is critical to ensuring the High Availability (HA) of the organization's data.
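One concrete check that a backup and replication policy might mandate is verifying a replica against the original before declaring the backup valid. The sketch below does this with SHA-256 digests over in-memory stand-ins for file contents; it is illustrative, not a mechanism prescribed by this paper:

```python
import hashlib

# Minimal sketch: a replica is accepted only if its SHA-256 digest matches
# the original's. The byte strings stand in for real file contents.
def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

original = b"quarterly-report-2017"
replica = b"quarterly-report-2017"
assert digest(original) == digest(replica), "replica failed integrity check"
print("replica verified")  # prints: replica verified
```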
Work Emails and Accounts. Protecting work email by filtering potential spam is
critical since, in most cases, email is considered a formal means of communication.
If a social engineer were able to send an email from an employee's work address,
this could lead to severe consequences. Moreover, social engineers mainly use
email as a medium to spread their malicious content. Employees therefore need to
know the relevant security policies: organizations must ensure that their employees
understand what constitutes acceptable and unacceptable use of work emails and
accounts, and must prevent unauthorized computers or locations from accessing
employees' work emails and accounts.
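As a toy illustration of one such control (not a production defense, which would rely on standards such as SPF, DKIM, and DMARC), the sketch below flags mail that claims the corporate domain but arrives via an unknown relay; all domain names are hypothetical:

```python
# Hypothetical domains; real deployments authenticate senders with
# SPF/DKIM/DMARC rather than a hand-maintained relay list.
CORPORATE_DOMAIN = "corp.example.com"
TRUSTED_RELAYS = {"mail1.corp.example.com", "mail2.corp.example.com"}

def is_spoofed(sender: str, relay_host: str) -> bool:
    """Flag mail claiming the corporate domain but sent via an unknown relay."""
    claims_corporate = sender.endswith("@" + CORPORATE_DOMAIN)
    return claims_corporate and relay_host not in TRUSTED_RELAYS

print(is_spoofed("ceo@corp.example.com", "relay.evil.example.net"))   # True
print(is_spoofed("ceo@corp.example.com", "mail1.corp.example.com"))   # False
```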
Bring Your Own Device (BYOD). Many organizations allow their employees
to use their own devices for work purposes to increase employees' efficiency and
productivity during working hours. However, many employees do not pay attention
to the security risks associated with a BYOD policy, and such employees need to
be made aware of the risks these devices might pose. According to [12], a lack of
understanding of BYOD puts organizations at risk of losing control of their critical
information resources and assets. Hence, it is essential to ensure that these devices
do not compromise the confidentiality, integrity, and availability goals of information
security. This can be done by incorporating effective security and privacy policies
to manage BYOD.
5.4 Network
Employees access databases and other servers through a network, which could
be a Local-Area Network (LAN), a Wide-Area Network (WAN), a wireless network,
a wired network, etc. Each network has a different security policy. For example,
an employee connected to an organization's LAN might have access to servers
that he could not reach when connecting from his home network.
Moreover, most organizations nowadays support VPN (Virtual Private Network)
or RDP (Remote Desktop Protocol) connections to allow their employees to access
the local network remotely. All of these different network security policies pose
security threats to organizations if employees are not aware of them. For example,
an employee may access his or her organization's local network using a VPN from
a public computer or a friend's computer; if that computer is
6 Threats to Validity
This section discusses the threats to validity for building the taxonomy and the
authors’ steps to minimize those threats.
To develop the taxonomy of the main target points and the various defense
mechanisms, the authors relied mainly on a comprehensive literature review. As
with any such review, some significant references could have gone unnoticed. To
minimize this threat, the authors examined papers with "Social Engineer" or
"Social Engineering" keywords in the title and read their abstracts, introductions,
and conclusions. Moreover, this study also leveraged the Human Aspects of
Information Security Questionnaire (HAIS-Q) [28], the SANS Awareness Survey,
and the Essential Cybersecurity Controls of the Saudi National Cybersecurity
Authority (2018).
7 Conclusion
Humans have become the weakest link in the security pipeline, and social engi-
neers are taking advantage of the knowledge gap in this area. Successful social
References
1. Verizon 2018. 2018 data breach investigations report (2018)
2. Aldawood, H., Skinner, G.: An academic review of current industrial and commercial
cyber security social engineering solutions. In: Proceedings of the 3rd International
Conference on Cryptography, Security and Privacy, pp. 110–115. ACM (2019)
3. Applegate, S.D.: Social engineering: hacking the wetware! Inf. Secur. J.: Glob.
Perspect. 18(1), 40–46 (2009)
4. Arachchilage, N.A.G., Love, S.: Security awareness of computer users: a phishing
threat avoidance perspective. Comput. Hum. Behav. 38, 304–312 (2014)
5. Aviv, A.J., Gibson, K.L., Mossop, E., Blaze, M., Smith, J.M.: Smudge attacks on
smartphone touch screens. Woot 10, 1–7 (2010)
6. Bakhshi, T., Papadaki, M., Furnell, S.: A practical assessment of social engineering
vulnerabilities. In: HAISA, pp. 12–23 (2008)
7. Berg, A.: Cracking a social engineer. LAN times (1995)
8. Beuran, R., Chinen, K., Tan, Y., Shinoda, Y.: Towards effective cybersecurity
education and training (2016)
9. Chitrey, A., Singh, D., Singh, V.: A comprehensive study of social engineering
based attacks in india to develop a conceptual model. Int. J. Inf. Netw. Secur.
1(2), 45 (2012)
10. Choi, M., Levy, Y., Hovav, A.: The role of user computer self-efficacy, cybersecurity
countermeasures awareness, and cybersecurity skills influence on computer misuse.
31. Stoner, J.A.F.: Risky and cautious shifts in group decisions: the influence of widely
held values. J. Exp. Soc. Psychol. 4(4), 442–459 (1968)
32. Thapar, A.: Social engineering: An attack vector most intricate to tackle. CISSP:
Infosec Writers (2007)
33. Thomas, K., Li, F., Zand, A., Barrett, J., Ranieri, J., Invernizzi, L., Markov, Y.,
Comanescu, O., Eranti, V., Moscicki, A., et al.: Data breaches, phishing, or malware?
Understanding the risks of stolen credentials. In: Proceedings of the 2017 ACM
SIGSAC Conference on Computer and Communications Security, pp. 1421–1434.
ACM (2017)
34. Workman, M.: A test of interventions for security threats from social engineering.
Inf. Manage. Comput. Secur. 16(5), 463–483 (2008)
A Trust Framework for the Collection
of Reliable Crowd-Sourced Data
The University of the West Indies, St. Augustine, Trinidad and Tobago
shiva@lab.tt, patrick.hosein@sta.uwi.edu
1 Introduction
Trust holds a key role in society. It is a multidisciplinary concept that has been
studied in sociology, economics, psychology, and computer science, each of which
has its own definition of trust. Broadly speaking, trust can be defined as a
measure of confidence that an entity will behave in an expected manner [1]. It
allows users to make decisions, sort and filter information, receive recommendations,
and develop a context within a community with respect to whom to trust and
why [2].
Networks such as Facebook, Twitter, and Slashdot are examples of decentralized
environments where users can create and upload content at their
discretion [3]. This freedom can result in the fabrication of misinformation and
the exploitation of systems. Creating a web of trust (a network where a link between
two nodes or entities means a trust decision has been made and the value of that
c Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 42–54, 2020.
https://doi.org/10.1007/978-3-030-39442-4_4
the Semantic Web, trust can be applied in scenarios such as information quality
assessment and Semantic Web service composition [2].
Introducing trust can have a positive impact on the confidence placed in
users' data. Internet-based surveys are quite common. They alleviate issues such
as mailing/distribution costs for questionnaires, human error, and reaching people
across geographic areas [11]. Despite these benefits, a host of issues emerge
concerning the exploitation of internet-based surveys and the impact they can have
on research conducted using flawed data [12]. Studies reveal that careless or
insufficient responders can account for 1–30% of respondents, with a modal rate
close to 8–12% [12]. Even a small percentage of such respondents can affect
measures of central tendency, spread, and reliability [13].
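A small numeric illustration of this effect (the values below are invented for illustration, not data from any cited study): a handful of careless, extreme responses noticeably shifts both the mean and the standard deviation of an otherwise tight sample.

```python
from statistics import mean, stdev

# Hypothetical responses on some numeric survey item.
honest = [50.0] * 45 + [55.0] * 45          # 90 careful respondents
careless = [5.0, 5.0, 5.0, 500.0, 500.0]    # ~5% careless/extreme answers

combined = honest + careless
print(round(mean(honest), 2), round(stdev(honest), 2))      # tight sample
print(round(mean(combined), 2), round(stdev(combined), 2))  # shifted mean, inflated spread
```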
We provide an approach for increasing the quality of crowd-sourced
data in online surveys. Section 2 presents an example of the problems faced with
online surveys. Section 3 describes the proposed framework in detail.
Finally, Sects. 4 and 5 discuss the results (compared with a
traditional online survey) and conclude.
The interactions between the user types (trusted and non-trusted) are illustrated
in Fig. 1. Note that there are no self loops.
[Fig. 1. User interactions: invited users become registered users; registered users who are trusted become trusted users, who can take surveys and recommend other users; untrusted users can only view statistics.]
We assume that we are given a set of potential data providers (users), U, and
a set of surveys, S, which require responses. The following information is stored
about each user:
Initially, we assume that a subset of these users, T, is trustworthy and that
these users will provide valid data as well as recommend other users whom they trust.
When new users are added to the system, they are initially not trusted and
belong to the subset ¬T. For a new user to be considered trusted, they
must satisfy all of the following conditions:
46 S. Ramoudith and P. Hosein
Increasing Trust:
  +2 pts  Friend
  +1 pt   Acquaintance
  −1 pt   Suspicious
  −2 pts  Untrustworthy
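Read as a scoring scheme, the levels above can be sketched in a few lines. The function name and the cumulative-sum reading are our assumptions, not the paper's exact mechanism:

```python
# Trust points per relationship label, as in the table above.
TRUST_POINTS = {
    "friend": 2,
    "acquaintance": 1,
    "suspicious": -1,
    "untrustworthy": -2,
}

def update_trust(score, label):
    """Hypothetical update: add the points for one rating to a user's score."""
    return score + TRUST_POINTS[label]
```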
For an invitation, a new user’s initial trust level is dependent on the trust level
of the user sending the invitation. When a new user Uk is added to the platform
There always exists the possibility of malicious users within social networks. As
mentioned in the literature, global trust metrics are susceptible to malicious
users, so we present some novel features within our framework that prevent
such users from being easily introduced into the network and gaining trust
quickly. Table 1 lists common issues expected to occur with malicious users and
the preventative measures built into the framework to mitigate them. These
measures are by no means foolproof, but the difficulty of overcoming them may
deter malicious users.
Furthermore, each recommendation made within the platform is stored and
can be used to identify malicious users (as mentioned in the introduction) who
are already trusted.
Let Pj denote the number of positive trusts (Lij ≥ 2.0) for Uj and let ¬Pj
be the number of negative trusts (Lij ≤ −1.0) for Uj. One possible way of
detecting an anomalous user is when |¬Pj − Pj| ≥ 3 and both counts cross some
threshold (≥ 5). When such a user is detected, they are flagged as not trusted
and the administrator can discard their submissions from the respective survey(s).

¹ A CAPTCHA is a program that protects websites against bots by generating and
grading tests that humans can pass but current computer programs cannot.
http://www.captcha.net.
The trust framework was implemented using the Django framework². A relational
database is used for storing information concerning users, user relationships,
user requests, surveys, and user responses to surveys. User responses are
stored as JSON³ objects within the relational system.

² Django is a high-level Python Web framework that encourages rapid development
and clean, pragmatic design. https://www.djangoproject.com.
³ JavaScript Object Notation. JSON is a lightweight data-interchange format.
https://www.json.org.

[Figure: Handling of flagged submissions — a flagged invalid submission prompts
the admin to contact the user; if the user resubmits and the admin verifies the
new submission, it is included in the statistics; otherwise it is rejected and
investigated.]

Scheduled recurring tasks
are created to log the number of trusted and non-trusted users in the system as
well as the number of users that completed each survey on a daily basis. A cache
is also implemented to reduce the time taken to display the overall statistics
(average, standard deviation, etc.) for each survey.
Survey design also plays an important role in the collection of user information.
Obtaining information about users' ISPs was the first survey administered
on our platform. We decided that a survey must be fairly quick to complete and
should require the least user input while still gathering as much information
as possible. This reduces the likelihood of users submitting inaccurate
or dishonest information. Our first survey, on ISPs, is an example: instead of
asking users to submit their ISP information (achieved rate, advertised rate,
ISP, etc.), we asked them to submit their subscription rate and a link to their
speed-test results (which gave us all the required information).
The platform ensured that duplicate links were not submitted and verified that
the ISP identified in the link belonged to the set of ISPs we were considering for
the survey. Failure of either check results in a rejected submission.
Users have the opportunity to resubmit.
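A minimal sketch of these two checks; the ISP set and function names are illustrative, not from the paper's implementation:

```python
# ISPs considered in the survey (hypothetical names).
VALID_ISPS = {"ISP A", "ISP B", "ISP C", "ISP D"}
seen_links = set()  # speed-test links already submitted

def validate_submission(link, isp):
    """Reject duplicate speed-test links and ISPs outside the survey's scope."""
    if link in seen_links or isp not in VALID_ISPS:
        return False
    seen_links.add(link)
    return True
```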
To give users an incentive to complete surveys, we created a survey
statistics module that displays useful information about a user's submission
relative to other users. In the case of the ISP survey, we provided statistics
that allowed a user to gauge the performance of their ISP and compare it to
competing ISPs. The framework allows customized statistics to be computed for
each survey, helping users make an informed decision about whatever the
survey's goal may be.
4.1 Analysis
We find that there were 9.29% fewer invalid submissions in the trusted-platform
survey than in the traditional survey. Given the nature of the survey we
conducted, it was possible for an invalid response to be nullified and a new
response submitted by a user (only the most recent submission is considered). If we
determined that a user had submitted invalid data, we attempted to contact them via
email; however, we received no responses from such users in the traditional
survey, perhaps because they supplied invalid email addresses. Similarly, for
the trusted-platform survey, all users with irregular submissions (4 in total) were
contacted. These invalid submissions in the trusted platform were determined
to be truthful after contacting the relevant users and asking them to retake
the survey; there was no significant change compared to their original
values. However, since we know that the responses are valid, an ISP might have
incorrectly provisioned a client. One user decided to contact their ISP (ISP D),
and the company realized it had been provisioning the user according to an out-of-date
plan. The user then retook the survey, which positively affected the measured
properties (decreases in μq, σq, and σq/μq). The remaining users are
still in contact with their ISPs (all have ISP D as their provider), and we
have decided to classify their submissions as invalid until they receive
confirmation of their subscription rate from their ISP, since they may also be
victims of the provisioning issue.
Tables 2 and 3 show the metrics using all data collected from each platform.
We notice a decrease in σ/μ by an average of 74.39% across all ISPs for
the trusted case compared with the traditional case. The differences in σ/μ
are quite large, revealing that the invalid data in the traditional
survey misrepresent the ISPs. In particular, only for ISP D do we notice
a substantial increase in μ, indicating that users might be trying to skew the
results for this ISP. This also indicates that there is less variability in
the data gathered from users in the trusted case than in the traditional case.
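The σ/μ metric used above is the coefficient of variation of the reported rates; a one-line helper makes the comparison concrete (the sample values in the test are invented, not the paper's data):

```python
from statistics import mean, stdev

def coeff_variation(rates):
    """sigma/mu: relative spread of reported rates, comparable across ISPs."""
    return stdev(rates) / mean(rates)
```

A lower σ/μ for an ISP means its users report more consistent rates, which is why the metric drops when invalid submissions are removed.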
Furthermore, Tables 4 and 5 show the metrics using only valid responses.
Interestingly, the σ/μ values are comparable for both approaches. Our platform can
easily flag submissions and link them back to a user. The submission
can then be inspected by the administrator, potentially converted to a valid
submission, and then included in the statistical results. Moreover, our
platform can also collect additional information from the user, such as
ping, upload rate, and IP address, without requiring any additional input.
The trusted platform provided unique insight into the ISP survey:
users can compare their service against competing ISPs using our metrics. The
traditional survey, on the other hand, could only provide a basic level of
statistical reporting to users, such as the number of people associated
with an ISP and the number of users who took part in the survey.
References
1. Sherchan, W., Nepal, S., Paris, C.: A survey of trust in social networks. ACM
Comput. Surv. 45(4), 47:1–47:33 (2013). https://doi.org/10.1145/2501654.2501661
2. DuBois, T., Golbeck, J., Srinivasan, A.: Predicting trust and distrust in social
networks. In: 2011 IEEE Third International Conference on Privacy, Security, Risk
and Trust and 2011 IEEE Third International Conference on Social Computing,
pp. 418–424, October 2011
3. Massa, P., Avesani, P.: Trust-aware recommender systems. In: Proceedings of the
2007 ACM Conference on Recommender Systems, RecSys 2007, pp. 17–24. ACM,
New York (2007). https://doi.org/10.1145/1297231.1297235
4. Artz, D., Gil, Y.: A survey of trust in computer science and the semantic web.
Web Semant. Sci. Serv. Agents World Wide Web 5(2), 58–71 (2007)
5. Wang, Y., Vassileva, J.: Bayesian network-based trust model in peer-to-peer net-
works. In: Proceedings of the Workshop on Deception, Fraud and Trust in Agent
Societies, pp. 57–68. Citeseer (2003)
6. Bhattacharya, R., Devinney, T.M., Pillutla, M.M.: A formal model of trust based
on outcomes. Acad. Manag. Rev. 23(3), 459–472 (1998)
7. Zhao, K., Pan, L.: A machine learning based trust evaluation framework for online
social networks. In: 2014 IEEE 13th International Conference on Trust, Security
and Privacy in Computing and Communications, pp. 69–74, September 2014
8. Massa, P., Avesani, P.: Trust metrics on controversial users: balancing between
tyranny of the majority. Int. J. Semant. Web Inf. Syst. (IJSWIS) 3(1), 39–64
(2007)
9. Cho, J.H., Chan, K., Adali, S.: A survey on trust modeling. ACM Comput. Surv.
48(2), 28:1–28:40 (2015). https://doi.org/10.1145/2815595
10. Guha, R., Kumar, R., Raghavan, P., Tomkins, A.: Propagation of trust and
distrust. In: Proceedings of the 13th International Conference on World Wide
Web, WWW 2004, pp. 403–412. ACM, New York (2004). https://doi.org/10.1145/
988672.988727
11. Roztocki, N.: Using internet-based surveys for academic research: opportunities
and problems. In: Proceedings of the 2001 American Society for Engineering Man-
agement (ASEM) National Conference, pp. 290–295 (2001)
12. Curran, P.G.: Methods for the detection of carelessly invalid responses in survey
data. J. Exp. Soc. Psychol. 66, 4–19 (2016)
13. Curran, P., Hauser, D.: Understanding responses to check items: a verbal protocol
analysis. Paper presented at the 30th Annual Conference of the Society for
Industrial and Organizational Psychology, Philadelphia, PA (2015)
14. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Com-
put. Surv. (CSUR) 41(3), 15 (2009)
15. Ramoudith, S., Hosein, P.: A metric for the fair comparison of ISPs (To be pre-
sented at the Sixth International Conference on Internet Science in December 2019)
Sentiment Analysis for University
Students’ Feedback
1 Introduction
Opinion mining can address classification problems for data such as customer
reviews and automatic suggestions based on users' purchase histories. Kieu and Pham [1] use
sentiment analysis on computer-product reviews to help sales managers decide
which products receive the most positive responses. Most organizations using opinion
mining are in e-commerce sectors such as retail and travel, especially in Vietnam. Changes
in education are becoming a top priority: the evolution of education is
at the heart of a nation. That is why we need to point out the weak spots in the
way we deliver knowledge to students.
In recent years, every higher-education institution has developed its own way of collecting
student feedback. Some use paper polls; others collect feedback online. Most polls consist of
yes/no questions and rating scales, which are easy to process but not very informative. These
kinds of polls prevent students from expressing their opinions freely. With full-text responses,
on the other hand, summarizing takes a lot of time, as responses must be read and noted one by one.
Feedback is a helpful source for any institution to obtain a better view of its overall
education. There are two types of feedback: feedback from lecturers to students,
which supports students' self-improvement, and feedback from students to
lecturers, which guides lecturers toward teaching the course in ways students
can understand better. Most research focuses mainly on feedback from the student side, since
students have greater expectations of education. Several sentiment analysis experiments, such
as [2–5], point out that digital methods such as web forms and social media are better
than older approaches such as paper forms or direct communication in class. Students can express
their feelings and opinions about any study aspect with or without revealing their identity.
In our experiments, we propose a student-feedback model based on classification
algorithms to analyze content and assign it a positive, negative, or neutral sentiment. We
compare three different classifiers: Naive Bayes (NB), Maximum Entropy
(MaxEnt), and Support Vector Machines (SVM), all of which have been used in sentiment
analysis for many years. Our goal is to find the best classifier for our Vietnamese
educational sentiment data.
Section 2 presents the foundation of our work: the papers, articles, and research
relating to opinion mining. Section 3 describes the data construction process, from raw data to
an annotated dataset; we chose three labels for our data: positive (POS),
negative (NEG), and neutral (NEU). In Sect. 4, we build the classifier system for our
experiments and describe how the features work with it. Section 5 covers the experiments and our results
on the students' feedback data annotated positive, negative, and neutral. Using
our error analysis tool, we can examine in detail which features contribute most
to mislabeled results.
2 Related Work
Observation protocol COPUS [6] is used to qualify teaching process in United States
higher education institutes. Students have 13 codes and faculties have 12 codes to
express their opinions are agree or disagree with methods, activities in class. This
protocol was experimented by Achen and Lumpkin (2015) [4] and mixed with normal
students’ feedbacks. Beside COPUS codes, students and faculties can give back their
quantitative responses. Result from both protocols helps faculties see which aspect of
their lectures is more effective. The downside here is that quantitative responses are not
processed automatically as COPUS protocol.
In [7], Delen used several machine learning techniques to determine which
freshmen were likely to drop out, based on a range of factors. His study mainly focuses on
comparing machine learning algorithms such as SVM classifiers and neural networks on
two-class datasets. The dataset is quite complex, with many attributes that may be
continuous or binary. SVM achieved an accuracy of 81.18%, with per-class accuracy
above 74% on the balanced dataset. A sensitivity analysis detected interactions
between factors and identified important attributes in the dataset.
In Vietnam, Duyen et al. [8] used hotel reviews from the agoda.com page to compare
three models: Naive Bayes, Maximum Entropy, and SVM. Their results show that SVM with
word-based and unigram features gives the best results. They also show that the overall score
plays a very important role in predicting sentiment. As attention to the education field
grows, we see more education-related studies in Vietnam. Phuc and Phung [5] used a Naive
Bayes classifier to determine the main subject in student messages, with POS
features for detecting Vietnamese messages. Based on the classifier's results, their
system moves misplaced messages to the correct topics in a school forum.
Fig. 1. The process of building an opinion mining system for Vietnamese students’ feedbacks.
4 Methodology
Figure 1 shows the complete process of our sentiment system for the students' feedback
data. In natural language processing, data can be processed at the word, sentence,
or document level. With students' feedback, we believe analyzing at the sentence
level provides enough information, as most students use one or two sentences to
express their opinions. In our experiments, we focus on single-sentence feedback.
4.2 Features
For features, we use word n-grams, previous/next words, and sentence length. For
example, the sentence “thầy dạy rất tuyệt vời” (“The lecturer
teaches amazingly”) is split into “thầy”, “dạy”, “rất”, “tuyệt vời”. In Vietnamese, “tuyệt
vời” is a single meaningful word and cannot be tokenized into smaller units. Table 2 lists the 19
features derived from the sentence. The word n-gram features are unigrams, bigrams, trigrams,
and 4-grams (rows 5 to 14 in Table 2). Each word feature is also used as a previous-word
(or next-word) feature. The sentence-length value can be 1 to 10, 11 to
20, 21 to 30, or above 30 words, so this example's length is 5.
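The n-gram and sentence-length features can be sketched as follows. The bucket labels are our shorthand, and tokenization itself (which keeps “tuyệt vời” as one unit) is assumed to be done beforehand by a Vietnamese tokenizer:

```python
def ngrams(tokens, n):
    """Contiguous word n-grams from an already-tokenized sentence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def length_bucket(n_words):
    """Sentence-length feature: 1-10, 11-20, 21-30, or above 30 words."""
    if n_words <= 10:
        return "1-10"
    if n_words <= 20:
        return "11-20"
    if n_words <= 30:
        return "21-30"
    return ">30"

tokens = ["thầy", "dạy", "rất", "tuyệt vời"]  # "tuyệt vời" stays one token
unigrams, bigrams = ngrams(tokens, 1), ngrams(tokens, 2)
```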
libSVM [15] is a popular open-source library for Support Vector Machines with flexible
configuration and many kernels. We chose Joachims's SVMlight
[13] because of its fast optimization algorithm. Next, we chose the Stanford Classifier,
which uses Maximum Entropy as its main algorithm [16]; it has been shown to
give better results than Naive Bayes and SVM classifiers on English datasets. The Stanford
Classifier also comes with many feature tweaks, but we apply the same feature
settings to all three models.
To estimate the performance of the classification models, we use 10-fold cross-
validation. With 5,000 labelled sentences, this gives 500 sentences per fold.
Empirical studies by Kohavi [17] showed that 10 appears to be an optimal number of
folds. In 10-fold cross-validation, the entire dataset is divided into 10 mutually
exclusive folds, and each fold is used once to test the model trained on the
combined data of the remaining nine folds.
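The fold construction described above can be sketched without any ML library, assuming for simplicity that the number of samples divides evenly by k (as 5,000 does by 10):

```python
def kfold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k mutually exclusive folds.

    Each fold serves once as the test set; the remaining k-1 folds
    form the training set. Assumes n_samples is divisible by k.
    """
    fold_size = n_samples // k
    idx = list(range(n_samples))
    for i in range(k):
        test = idx[i * fold_size:(i + 1) * fold_size]
        train = idx[:i * fold_size] + idx[(i + 1) * fold_size:]
        yield train, test
```

In practice the data would be shuffled (or stratified by label) before splitting, so that each fold reflects the class distribution.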
5.2 Results
Based on the 10-fold cross-validation, Table 3 presents summary results for the three
classifiers in our study. MaxEnt obtains the best overall score at 91.36%, followed by
Naive Bayes at 88.00%. Surprisingly, SVM has the lowest accuracy at
78.45%. Some previous studies show that the SVM classifier stands out in binary cases,
i.e., with positive and negative classes only; indeed, SVM obtains the highest negative-class
accuracy at 98.56%, but it is the only model that fails on the neutral class.
Although MaxEnt's neutral-class accuracy would not be acceptable in practice, its positive
and negative accuracies are fairly balanced.
Tables 4, 5, and 6 are the confusion matrices of the classifier models. With only 4.22%
neutral sentences, we did not expect any classifier to achieve accuracy above 50% on the
neutral class. Table 4 shows that Naive Bayes makes most of its errors on the
POS (100%) and NEU (94.74%) classes by classifying those sentences as NEG.
In Table 5, MaxEnt has lower accuracy on the NEG class but the best result on the
NEU class. Although MaxEnt makes mistakes similar to Naive Bayes on POS and NEU,
it obtains the lowest rate of mislabeling NEU as NEG (81.25%) of all three models.
Table 6 shows that SVM does best on the binary classes but fails on the NEU class:
it cannot classify even one case correctly.
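The per-class error counts behind Tables 4–6 come from confusion matrices; a plain-Python sketch of how one is tallied (label names as in the paper, function name ours):

```python
def confusion_matrix(gold, predicted, labels=("POS", "NEG", "NEU")):
    """counts[g][p] = number of sentences with gold label g classified as p."""
    counts = {g: {p: 0 for p in labels} for g in labels}
    for g, p in zip(gold, predicted):
        counts[g][p] += 1
    return counts
```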
Along with our data, we ran our MaxEnt model on the newspaper-comment (D2)
and social-media (D3) datasets from Sect. 3. As seen in Table 7, D2 achieves 67.29%
overall accuracy and D3 scores 60.25%. The lower D2 result compared to our experiment may be
caused by the small dataset size, but its 92.75% NEG-class accuracy is still
impressive, which can be explained by the narrow domain coverage of the dataset. D3 is
much more complex than any other dataset here: Facebook posts are truly cross-domain
data. Although D3 does not have the best overall score, it has the most balanced
per-class accuracies, thanks to its nearly balanced dataset.
Table 7. MaxEnt model results for 10-fold cross-validation of three different datasets

  Dataset                  D1       D2       D3
  Overall accuracy         91.36%   67.29%   60.25%
  Per-class   POS          84.10%   12.50%   58.27%
              NEG          96.83%   92.75%   63.56%
              NEU          23.81%   25.00%   57.61%
shows up in the training dataset, the final score of the sentence depends on the other words.
E.g., in “cách giảng dạy thấm thía vào học sinh” (“the teaching method is understandable
for students”), the word “thấm thía” (easy to understand) is misspelled,
but in real life we accept it. We have only 4.22% NEU in the dataset, so cutting
the three classes down to equal sizes is not practical. We could solve this problem by
combining the positive and neutral classes and then balancing the two resulting classes.
This is much easier and more effective than taking all the samples from the minority class
(neutral) and randomly selecting an equal number of samples from the majority class. A
neutral space that is too small could pull down the accuracy of the other classes.
• Sentence length: If a sentence is too short (2 or 3 words) or too
long (more than 10 words), our model mostly marks it as NEG; POS sentences are
misclassified most often. We ran a small experiment without this feature, and
accuracy dropped by at least 10%, mostly in the NEG class.
• Previous-word/next-word features: In 13 cases, a previous-word or next-word feature
dominated the result. Eliminating these features decreased accuracy
by 3 to 5%.
Furthermore, we see minor errors related to elevated negation and contrastive
conjunctions. We cannot classify comparative sentences, but for a complex one such as
“thầy rất nhiệt tình, nói rất nhiều đến khản giọng mà sức truyền tải không bằng thầy
XYZ” (“The lecturer is so enthusiastic that his voice gets hoarse, yet his delivery
is not as good as Mr. XYZ's”), we know it is clearly a negative response. This is not a
common comparative sentence, because it compares one aspect (attitude) of object A (the
lecturer), which is good, with a totally different aspect (teaching ability) of object B
(Mr. XYZ). By using dependency features with a neural network, we could handle some
complex sentences containing negation conjunctions such as “nhưng” and “mà”
(“but”, “yet” in English). We use our tool to analyze these errors. For example,
“môn học rất có ích” (“The subject is helpful”) is assigned positive by the
model based on the word feature “ích” and the next-word feature “ích” (Fig. 3). These feature
weights are above 1.0, whereas the NEU and NEG classes mostly have negative weights.
5.4 Application
We want to build an application that can take raw user data and export results in text
format or as visual charts. Two or three model results can be compared side by
side in one table, as seen in Fig. 2, which shows our application interface. The application
serves both testers and everyday users, so we added one function especially for each:
error analysis and data charts.
As testers, we need a tool that helps us figure out where our system goes wrong.
We can inspect mislabeled sentences with the tool, as in Fig. 3, or challenge the system with any new
sentence we make up. The four columns show the feature type and the statistical data for each
feature in the three classes.
The second addition is visualizing our results in an intuitive way. School administrators
can currently export data as text, but when they want a quick look at the results,
our tool can draw pie or column charts. If we select more than one
year of student feedback, we get the charts side by side, as in Fig. 4. Moreover,
these charts can be exported as images and easily imported into any student report.
64 N. T. P. Giang et al.
The three classifiers in our experiment are based on probability and statistics, so training data
play a key role. The per-class accuracy results show that unbalanced data does not affect
the negative class as much as the positive class. Our data containing more negative
sentences than positive ones is usual in real life, which makes our accuracy reliable.
We contribute our educational sentiment dataset of 5,000 sentences labeled positive, negative,
or neutral. Our data could be a helpful resource for the sentiment analysis
community in the future.
With the best accuracy of 91.36%, MaxEnt is full of promise. Our research
also shows that Naive Bayes may be old and less accurate in some fields, but not in this one.
Although feedback is complex, traditional algorithms can work effectively with
appropriately selected features. Based on our results and analysis, we suggest some
possible directions for the future development of sentiment analysis on our data:
(1) Enriching our data by collecting and labeling student feedback from the
universities, aiming for 10,000 sentences in our dataset.
(2) Classifying student feedback by topics such as lectures, facilities, and
curriculum.
(3) Building a sentiment treebank for our data based on the method of the English
Sentiment Treebank [18].
(4) Developing a dictionary of educational sentiment to improve the accuracy of the
classifiers.
References
1. Kieu, B.T., Pham, S.B.: Sentiment analysis for Vietnamese. In: 2010 Second International
Conference on Knowledge and Systems Engineering (KSE), pp. 152–157 (2010)
2. Altrabsheh, N., Gaber, M., Cocea, M.: SA-E: sentiment analysis for education. In: 5th KES
International Conference on Intelligent Decision Technologies (2013)
3. Mac Kim, S., Calvo, R.A.: Sentiment analysis in student experiences of learning. In:
Educational Data Mining 2010 (2010)
4. Achen, R.M., Lumpkin, A.: Evaluating classroom time through systematic analysis and
student feedback. Int. J. Sch. Teach. Learn. 9, 4 (2015)
5. Phuc, D., Phung, N.T.K.: Using Naïve Bayes model and natural language processing for
classifying messages on online forum. In: 2007 IEEE International Conference on Research,
Innovation and Vision for the Future, pp. 247–252 (2007)
6. Smith, M.K., Jones, F.H., Gilbert, S.L., Wieman, C.E.: The classroom observation protocol
for undergraduate STEM (COPUS): a new instrument to characterize university STEM
classroom practices. CBE Life Sci. Educ. 12, 618–627 (2013)
7. Delen, D.: A comparative analysis of machine learning techniques for student retention
management. Decis. Support Syst. 49, 498–506 (2010)
8. Duyen, N.T., Bach, N.X., Phuong, T.M.: An empirical study on sentiment analysis for
Vietnamese. In: 2014 International Conference on Advanced Technologies for Communi-
cations (ATC 2014), pp. 309–314 (2014)
9. Liu, B.: Sentiment analysis and opinion mining. Synth. Lect. Hum. Lang. Technol. 5, 1–167
(2012)
10. Rohrer, B.: How to choose algorithms for Microsoft Azure machine learning (2015)
11. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level
sentiment analysis. In: Proceedings of the Conference on Human Language Technology and
Empirical Methods in Natural Language Processing, pp. 347–354 (2005)
12. Klein, D., Manning, C.: Maxent models, conditional estimation, and optimization. In:
HLTNAACL 2003 Tutorial (2003)
13. Joachims, T.: SVM-Light Support Vector Machine, vol. 19. University of Dortmund (1999).
http://svmlight.joachims.org/
14. Vryniotis, V.: Developing a Naive Bayes text classifier in JAVA, 27 January 2014 (2014)
15. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans.
Intell. Syst. Technol. (TIST) 2, 27 (2011)
16. Klein, D.: The Stanford Classifier. The Stanford Natural Language Processing Group (2003)
17. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model
selection. In: IJCAI, pp. 1137–1145 (1995)
18. Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., et al.: Recursive
deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the
Conference on Empirical Methods in Natural Language Processing (EMNLP), p. 1642
(2013)
Development of Waste Collection Model Using
Mobile Phone Data: A Case Study in Latvia
1 Introduction
The waste management policy in Latvia is laid out in the National Waste Management
Plan 2013–2020 as well as in the regional plans of ten waste management regions and
Riga City. The regional plans envisage the establishment of a separate waste collection
system in the regions. In cities, it means a separate waste collection point for every 300
to 500 inhabitants. Sorted waste collection points should be created in larger cities, as
well as in regional populated areas with over 1000 inhabitants.
One of the goals envisaged in the plan is to ensure that the generated waste is not
hazardous and with a low risk to the environment and human health. Where possible,
the collected waste should be returned to the economic cycle, particularly through
recycling. Specifically designed areas would constitute the necessary infrastructure for
collecting sorted household waste. Organizing household waste management and
controlling waste collection and disposal is one of the functions of the local govern-
ments. The production of glass is expensive and energy consuming, so glass products
should be in use as long as possible through re-use and recycling. In the production of
new glass products, it is economically advantageous to add up to 30% of used glass. In
Latvia, the existing glass sorting and collection system is not optimal and does not
ensure efficient waste collection and recycling.
Laws and regulations in Latvia oblige producers of household waste to entrust the
management of their waste to companies which have received the permit of the
respective local government for the operation in its territory. An individual or a legal
entity in question is also responsible for covering the costs associated with the man-
agement of household waste, including hazardous waste. These requirements also
apply to owners and users of summer cottages or other short-term accommodation
buildings. In line with the objectives set by the European Union (EU), by 2020 Latvia,
like other EU countries, must collect and hand over for recycling 68% of glass placed
in its territory and ensure an annual increase in recycling. Often, due to the absence of
containers for sorted glass waste, glass packaging ends up in household waste containers, making it more difficult to sort and recycle.
To increase the efficiency of waste collection management, different models are
used: for example, GIS-based models [1], data-driven models [2], integrated
management models [3], network optimization models [4], decision-making models,
cost-minimization models and landfill models [5], as well as multi-criteria
decision models and graphical models [6]. In recent years, mobile phone data have
been widely used to solve application problems in different fields: tourism
management [7, 8], population estimation and migration monitoring [9, 10], traffic flow
measurement [11], regional economic activity evaluation [12–14] and others. The research
objective is to develop a responsive waste collection model that responds to the
current demands of the population and allows planning of waste container loading based
on mobile phone data statistics.
Glass waste collection services are available to every resident and company in Latvia.
The environmental management company Eco Baltia Vide, Ltd. provides the widest
range of environmental management services: collection of household and sorted
waste, management of used packaging, construction and bulky waste management,
cleaning of premises and territories, and various seasonal services. Eco Baltia
Vide, Ltd. is part of Eco Baltia, the largest waste management company in the Baltic
countries, providing a complete waste management cycle from collection to recycling.
Almost 55 000 tons of glass products are reused and returned to the Latvian market
every year, and only 36 000 tons of glass are recycled. Therefore, Eco Baltia Vide, in
co-operation with local governments of Latvia, is planning to place 1000 additional
containers for glass waste collection, allowing at least 5000 tons more glass to be
collected annually. This will reduce the amount of glass waste disposed of in municipal
landfills and will improve the collection and recycling of glass in Latvia. Usually, a waste
collection point is set up where it is physically possible, given municipal rules and
agreements with property owners, rather than as a result of decisions based on data on
potential waste volumes. Waste collection infrastructure is thus created on an
easy-to-place principle: a container may end up not in its optimal
location but in an available one. For the collection of glass waste, containers of two
different volume capacities are used: a 1.5 m3 container holds up to 300 kg of glass
waste and a 2.5 m3 container up to 500 kg.
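As a back-of-the-envelope illustration (a hypothetical helper, not part of the paper's model), the two container types can be encoded as a capacity lookup to estimate how often a collection point must be emptied for a given forecast volume:

```python
import math

# Capacity of the two glass-waste container types (from the paper):
# a 1.5 m^3 container holds up to 300 kg, a 2.5 m^3 container up to 500 kg.
CAPACITY_KG = {1.5: 300, 2.5: 500}

def pickups_needed(forecast_kg, container_m3):
    """Estimate how many times a container must be emptied for a forecast volume."""
    return math.ceil(forecast_kg / CAPACITY_KG[container_m3])
```

For example, a point forecast to generate 1200 kg of glass with a 1.5 m3 container would need four pickups.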
In this research, mobile base station data from the operator LMT (Latvia Mobile
Telephone) are used. Other mobile operators also operate in Riga, but LMT is the
oldest and largest one in the region, and its data include fixed calls and text messages
made from and to other mobile operators as well. Geographically, the entire territory of
Riga is covered by 298 mobile phone base stations (see Fig. 1).
Fig. 1. Locations of glass waste containers and LMT mobile phone base stations in Riga.
Mobile phone base stations (BS) are erected according to the consumption in the
respective part of the city. In areas where the use of mobile phone services is growing,
the operator installs a base station, ensuring that the distribution of base stations
70 I. Arhipova et al.
corresponds with the level of use of mobile phone services. For each BS, the following
parameters are used:
• base station ID;
• base station position by Longitude and Latitude;
• the number of the unique users on a base station;
• base station activity, i.e. the number of outgoing call and text-message activities.
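A record for one base station, carrying the four parameters listed above, might be sketched as follows (the field names and sample values are our assumption; the paper does not specify a schema):

```python
from dataclasses import dataclass

@dataclass
class BaseStation:
    """One record per base station, mirroring the four parameters in the text."""
    station_id: str      # base station ID
    longitude: float     # base station position
    latitude: float
    unique_users: int    # number of unique users on the station
    activity: int        # outgoing call and text-message activities

# Hypothetical example record
bs = BaseStation("RIX-001", 24.1052, 56.9496, 1532, 4807)
```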
The data from BSs are available in 15-min intervals, allowing the identification of
users’ behaviour in the respective area. These time-series data can be used for forecasts
and dynamic analysis across time, as well as in aggregated form. In the data model for
mobile activity analysis, data can be used at different aggregation levels: daily, weekly,
monthly or even yearly. The data from different mobile base stations do not
overlap; therefore, they can be used for geographically precise location analysis.
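Rolling the 15-min interval counts up to a coarser level, as described above, can be sketched with the standard library (the record layout and sample numbers are hypothetical):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical 15-minute activity records: (timestamp, station_id, activity_count)
records = [
    (datetime(2018, 4, 2, 8, 0), "BS-17", 41),
    (datetime(2018, 4, 2, 8, 15), "BS-17", 39),
    (datetime(2018, 4, 3, 8, 0), "BS-17", 44),
]

def aggregate_daily(records):
    """Roll 15-min interval counts up to daily totals per base station."""
    daily = defaultdict(int)
    for ts, station, count in records:
        daily[(ts.date(), station)] += count
    return dict(daily)
```

The same grouping key can be swapped for a week, month or year to obtain the other aggregation levels mentioned in the text.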
The database used for the collected waste volume study includes information on the
weight of every glass waste container, the date of collection and the location address
for the period of 40 months, from January 2015 to April 2018 (Table 1).
As a result, the first PC can be interpreted as representing business districts and the
second PC as representing residential districts. Using the component loadings, i.e. the
correlations of the individual base stations’ call activities and unique phone users with
the first two principal components, three groups of mobile phone base stations in the
Riga city districts were identified by call activities and unique phone users:
• the 1st group includes 45% of all base stations, in business districts, where the 1st PC
loadings are more than 0.7 and the 2nd PC loadings are less than 0.5;
• the 2nd group includes 22% of all base stations, in residential districts, where the
1st PC loadings are less than 0.5 and the 2nd PC loadings are more than 0.7;
• the 3rd group includes 33% of all base stations, in mixed districts, where both PC
loadings are more than 0.5 and mobile phone call activities are relatively high at
all times during the day.
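The three grouping rules above can be expressed as a small classifier (a sketch; the fallback for stations matching none of the rules is our assumption, since the paper does not state how boundary cases were treated):

```python
def classify_district(pc1_loading, pc2_loading):
    """Assign a base station to a district type from its two PC loadings,
    following the thresholds reported in the text."""
    if pc1_loading > 0.7 and pc2_loading < 0.5:
        return "business"       # 1st group, ~45% of stations
    if pc1_loading < 0.5 and pc2_loading > 0.7:
        return "residential"    # 2nd group, ~22% of stations
    if pc1_loading > 0.5 and pc2_loading > 0.5:
        return "mixed"          # 3rd group, ~33% of stations
    return "unclassified"       # assumption: not covered by the stated rules
```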
The distribution of mobile base stations by the type of district can be marked on the
map – business districts are located in the city centre, residential districts are further
away from the city centre, while mixed districts are located in between the business and
the residential districts (see Fig. 3).
Fig. 3. Distribution of base stations and container types in the city’s districts in 2018.
The two-factor analysis of variance shows that the district type (p-value < 0.01) and
the month of the year (p-value < 0.01) have a significant effect on the collected glass waste
(kg/day), but there is no significant factor interaction effect in 2015–2018. Higher volumes of collected glass waste are associated with spring cleaning in April and the
Easter period. The summary for various time periods in the Riga city districts from 2015 to
2018, using the two component loadings (correlation coefficients), is shown in Table 3.
All data transformations are performed to convert the data into a format that allows the
creation of a common relational data model, which can be used for data analysis from
different perspectives. Because of the amount and structure of the data, it is more
reasonable to use a normalized relational model instead of a single large table. The
relational data model also allows data to be loaded and reloaded, fully or incrementally,
from different data sources during analysis, or even new data sources to be added if
necessary. The solution comprises the data model and data input interfaces, algorithms
and transformation methods, and presentation of results in tabular and graphical format
using Bing maps.
The most time-consuming task was cleaning and processing the data to combine all
the needed data sources into a common relational data model. The relational data model
contains several tables, some larger than 1 million rows; therefore, the
PowerPivot add-in for Excel was used. PowerPivot allows relational data with
tables larger than 1 million rows to be stored in Excel and is built on top of the VertiPaq
in-memory engine, which allows working with data about 20 times faster than in an Excel sheet.
The methods and data analysis techniques used in this research could be built into
a commercial product for mobile operators, allowing the optimal placement of waste
containers to be predicted in any territory where mobile base station data are
available. It would work in three major steps:
1. preparation of source data (CDR data, waste container data, other data);
2. calculation and parametrisation;
3. results and analysis of maps and tabular results.
The main benefit of the standardised solution is the data model and data structure.
By preparing and mapping source data to the data model, it is possible to repeat all
calculations and result generation. Calculations can be adjusted using different
parameters, such as the radius within which base stations are assigned to a waste container. To
create visualisations of geographical information on the map, we needed to generate
multilayer maps, which allow different data to be combined so that part of the analysis
and conclusions can be made visually. Visual examination of the results helps to understand the
big picture and how the data are distributed across a geographical area.
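Assigning base stations to a waste container within a chosen radius, as mentioned above, can be sketched with a standard haversine distance (the default radius, data layout and coordinates are assumptions; the paper does not give its exact assignment rule):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def stations_near_container(container, stations, radius_km=0.5):
    """Return the (lat, lon) base stations within radius_km of a container."""
    c_lat, c_lon = container
    return [s for s in stations if haversine_km(c_lat, c_lon, s[0], s[1]) <= radius_km]
```

Varying `radius_km` corresponds to the parametrisation step described in the text.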
Several GIS tools, such as Quantum GIS or ArcGIS, allow the creation of multilayer maps.
In this research, we used the 3D Bing multilayer maps built into Excel 2016. Bing Maps is
the original Microsoft add-in for Excel 2016 and provides the functionality needed for
the data visualisation part of this research. The following PowerMap functions
are used:
• geocoding – finds a point on the map using the street address of the object;
• filtering – filters objects;
• layers – creates several layers, each of which can carry a different
visualisation;
• transparency – overlaps several layers while keeping them legible;
• exact location – finds an object’s location using GPS coordinates;
• map themes – adds different map visualisation themes and colour
schemes.
All these PowerMap functions allowed GIS information to be visualised in a way that
automatically generates results from a relational data model holding a large amount
of data – several million rows – stored in PowerPivot.
5 Conclusions
The developed approach, techniques and data model could be used for:
• analysis of other waste containers and other applications, such as small places of
commerce, real estate and information kiosks, where optimised placement means
potential attractiveness to a larger number of people;
• provision of digital products and applications by mobile operators, using CDR data
in combination with different business data;
• short-term and long-term forecasts of the number of people located across a
geographical area, for the planning of public events, marketing campaigns and the
placement of short-term engagements.
Choosing any of the proposed strategies helps achieve direct benefits (increasing the
amount of glass collected) and indirect benefits (increasing the amount of glass collected
in the remaining containers).
If the anticipated effect is achieved after long-term monitoring, it will be necessary to recalculate the appropriate locations of the remaining containers and to create
an optimal container placement map for the entire area. An additional effect can
be achieved through a public awareness campaign and by promoting an innovative
approach to glass waste collection. The amount of data available also makes it possible to
optimise the routes and periods of waste collection.
Optimal placement of glass waste collection containers can contribute to the development of the circular economy, which is still at an early stage in Latvia. In the future, the
amount of waste disposed of in landfills and the costs of municipal waste management will
also decrease, as glass packaging will no longer end up in household waste containers.
Acknowledgments. The research leading to these results has received funding from the research
project “Development of Responsive Glass Waste Collection System”, the contract Nr.
ZD2018/20580 signed between the University of Latvia and Eco Baltia Vide, Ltd.
References
1. Vu, H.L., Ng, K.T.W., Bolingbroke, D.: Parameter interrelationships in a dual phase GIS-
based municipal solid waste collection model. Waste Manag. 78, 258–270 (2018)
2. Esmaeilian, B., Wang, B., Lewis, K., Duarte, F., Ratti, C., Behdad, S.: The future of waste
management in smart and sustainable cities: a review and concept paper. Waste Manag. 81,
177–195 (2018)
3. Ilankoon, I.M.S.K., Ghorbani, Y., Chong, M.N., Herath, G., Moyo, T., Petersen, J.: E-waste
in the international context – a review of trade flows, regulations, hazards, waste management
strategies and technologies for value recovery. Waste Manag. 82, 258–275 (2018)
4. Van Engeland, J., Beliën, J., De Boeck, L., De Jaeger, S.: Literature review: strategic
network optimization models in waste reverse supply chains. Omega 91, 102012 (2020).
https://doi.org/10.1016/j.omega.2018.12.001
5. Eiselt, H.A., Marianov, V.: Location modeling for municipal solid waste facilities. Comput.
Oper. Res. 62, 305–315 (2015)
6. Kayakutlu, G., Daim, T., Kunt, M., Altay, A., Suharto, Y.: Scenarios for regional waste
management. Renew. Sustain. Energy Rev. 74, 1323–1335 (2017)
7. Ahas, R., Aasa, A., Roose, A., Mark, U., Silm, S.: Evaluating passive mobile positioning
data for tourism surveys: an Estonian case study. Tour. Manag. 29, 469–486 (2008)
8. Zhao, X., Lu, X., Liu, Y., Lin, J., An, J.: Tourist movement patterns understanding from the
perspective of travel party size using mobile tracking data: a case study of Xi’an, China.
Tour. Manag. 69, 368–383 (2018)
9. Balzotti, C., Andrea, B., Briani, M., Cristiani, E.: Understanding human mobility flows from
aggregated mobile phone data. IFAC-PapersOnLine 51(9), 25–30 (2018)
10. Bwambale, A., Choudhury, C.F., Hess, S.: Modelling trip generation using mobile phone
data: a latent demographics approach. J. Transp. Geogr. 76, 276–286 (2019). https://doi.org/
10.1016/j.jtrangeo.2017.08.020
11. Ni, L., Wang, X.C., Chen, X.M.: A spatial econometric model for travel flow analysis and real-
world applications with massive mobile phone data. Transp. Res. Part C 86, 510–526 (2018)
12. Arhipova, I., Berzins, G., Brekis, E., Opmanis, M., Binde, J., Steinbuka, I., Kravcova, J.:
Pattern identification by factor analysis for regions with similar economic activity based on
mobile communication data. Adv. Intell. Syst. Comput. 886, 561–569 (2019)
13. Arhipova, I., Berzins, G., Brekis, E., Kravcova, J., Binde, J.: The methodology of region
economic development evaluation using mobile positioning data. In: 20th International
Scientific Conference on Economic and Social Development, pp. 111–120. Varazdin
Development and Entrepreneurship Agency, Prague, University North, Koprivnica, Croatia,
Faculty of Management University of Warsaw, Poland (2017)
14. Arhipova, I., Berzins, G., Brekis, E., Binde, J., Opmanis, M.: Mobile phone data statistics as
proxy indicator for regional economic activity assessment. In: 1st International Conference
on Finance, Economics, Management and IT Business, pp. 27–36. SCITEPRESS – Science
and Technology Publication, Lda., Crete, Greece (2019)
Artificial Social Intelligence: Hotel Rate
Prediction
1 Introduction
Artificial Intelligence (AI), which first gained fame in the 1950s, has been revived in
modern times, with initiatives from the business world now the main driving force. With help
from advancements in cloud computing and data analytics, AI is spreading over
today’s fast networks. Though mature AI source code libraries exist for the
majority of programming platforms, AI solutions still require
tremendous effort and interaction from human modelers: AI models must be refined with
parameters to be adopted today, then tweaked again tomorrow.
Artificial Social Intelligence (ASI) is an innovative framework capable of replacing
human modelers’ time-consuming jobs with machine agents implemented in
microservices. These agents are then facilitated by cloud native computing foundation
(CNCF).
2 Microservices
With the “Keep it Simple, Stupid” (KISS) Unix philosophy, Linux naturally supports
the modular nature of cloud architecture through virtualization technology that maximizes resource utilization, alongside the concept of multi-tenant systems on one physical server [3]. This creates one of the major benefits of cloud
computing, scalability, by creating as many virtual machines as needed. However,
like parallel computing in today’s data analytics field, container orchestration is
getting attention because microservice architecture is a double-edged sword: decomposing
applications into multiple microservices adds complexity, and managing the
overwhelming number of microservices as an application grows proves challenging
[6]. Orchestration addresses this by managing resources systematically. Managing a multitude of containers can be a daunting job without orchestration
tools such as Kubernetes, Mesos, ECS, Swarm, and Nomad.
At this time a new paradigm has emerged from the practices of the last decade in
the form of fast-moving microservices architecture deployment, called cloud native
computing foundation (CNCF). CNCF is an open source software stack that deploys
applications as microservices, packages each part into its own container, and
dynamically orchestrates those containers to optimize resource utilization. The AI field
vastly needs this processing power. Today’s data analytics foundations
have widely adopted parallel computing, such as the Hadoop and Spark processing
engines; still, human analysts remain fully responsible for modeling AI algorithms.
This research is a first step towards using CNCF to aid human analysts’ modeling jobs with
machine agents in a microservices architecture.
The process of structuration is the reciprocal interaction of human players and the institutional properties of organizations [2]. The theory of structuration recognizes that
human actions are enabled and constrained by structures, yet these structures are the
result of previous actions [5]. Though run by machines, CNCF adopts the structuration
process in AI use cases. This research theorizes possible structuration between
microservices (machine agents) and institutional properties (rules and resources).
80 J. J. Lee and M. Lee
AI algorithms are purely logical ways of thinking, but there are various methods of
implementation when one applies algorithms to a data source. Each analyst creates
his or her own methods to interpret the data source; as a result, multiple views of data
interpretation are created, and the best explanation is adopted on a case-by-case basis.
With CNCF, Artificial Social Intelligence is proposed, where modeling is run by
machine agents with an additional layer that blends other algorithms. This is
discussed in the following case study on hotel rate prediction.
When buying a perishable good, most consumers consider the trade-off between
buying today and waiting a little longer in the hope that prices will drop. This trade-off
is more pressing the larger the uncertainty of one’s future preferences. A buyer
must consider whether sellers will drop or increase prices. In the presence of
price uncertainty, price forecasts for the remaining days in the booking horizon are
valuable information to drive consumers’ purchase decisions for perishable goods.
In this case study, the CNCF architecture is applied to predicting minimum hotel
prices. Several forecasting models, including traditional time-series models
and machine learning models, are examined, and a machine learning blending
architecture is proposed to improve forecasting accuracy. Using real-life hotel data, this
research provides empirical results for the proposed approach along with a comparison to traditional forecasting methods.
6 Results
The performance of the proposed blending method is compared with benchmarks on
predictive accuracy for three hotels. For the benchmarks, the ARIMA
model (one of the most advanced time-series models), a support vector machine, and a
neural network model are used.
The Mean Absolute Percentage Error (MAPE) is used to assess the performance of
forecasting models:
\mathrm{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \frac{|F_t - A_t|}{A_t}

where F_t is the forecast and A_t the actual value at time t.
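A direct implementation of MAPE as a Python function (a sketch; `actual` and `forecast` are assumed to be equal-length sequences of values, with actuals non-zero):

```python
def mape(actual, forecast):
    """Mean absolute percentage error: average of |F_t - A_t| / A_t over all t."""
    n = len(actual)
    return sum(abs(f - a) / a for a, f in zip(actual, forecast)) / n
```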
For model validation, the daily hotel rate data of three hotels, with a booking
window of 1–60 days out from 120 arrival dates, are used. The suggested system is
tested over 60 days, i.e. forecasting models are run for 60 forecasting dates. For each
forecasting run (date), the system generates forecasts for the next 60 days. Table 1
reports the average MAPE of the 60 forecasting runs of the machine learning blended
model along with the benchmarks. The empirical result in Table 1 confirms that the
machine learning blended model can significantly improve forecasting accuracy in all cases.
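The paper does not spell out its blending rule here; a weighted average of the individual models' forecasts is one minimal form such a blend could take (the weighting scheme is an assumption for illustration):

```python
def blend(forecasts, weights):
    """Combine forecasts from several models as a weighted average.

    `forecasts` is a list of equal-length forecast series (one per model);
    `weights` holds one weight per model, e.g. inverse validation error.
    """
    total = sum(weights)
    horizon = len(forecasts[0])
    return [
        sum(w * series[i] for w, series in zip(weights, forecasts)) / total
        for i in range(horizon)
    ]
```

With equal weights this reduces to a simple model average; weights derived from each model's past MAPE would favour the historically better forecaster.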
7 Concluding Remarks
Currently, microservices are implemented in the modeling layer alone. The data
layer will be implemented later, followed by the blending layer. Once all three layers
are implemented in microservices, CNCF will be adopted with tools such as Kuber-
netes, GitLab, and Digitalocean in CI/CD approaches.
The CNCF framework allows for deploying, scheduling, and running multiple
machine learning models. Moreover, the machine learning selection process leverages
complementary characteristics of distinctive models to produce optimal forecasts. The
empirical study indicates that the CNCF machine learning approach can significantly
improve the prediction accuracy of hotel minimum rates.
References
1. Anderson, C.: Docker. IEEE Software, May/June 2015
2. Giddens, A.: The Constitution of Society. University of California Press, Berkeley (1984)
3. Gupta, D., Lee, S., Vrable, M., Savage, S., Snoeren, A., Varghese, G., Voelker, G., Vahdat,
A.: Difference engine: harnessing memory redundancy in virtual machines. Commun. ACM
53(10), 85–93 (2010)
4. Julian, S., Shuey, M., Cook, S.: Containers in research: initial experiences with lightweight
infrastructure. In: XSEDE16 Proceedings of the XSEDE16 Conference on Diversity, Big
Data, and Science at Scale, Miami, USA (2016). Article no. 25
5. Orlikowski, W.J., Robey, D.: Information technology and the structuring of organizations.
Inf. Syst. Res. 2(2), 143–169 (1991)
6. Richardson, C., Smith, F.: Microservices: from design to deployment. Nginx Inc. (2016)
7. Thones, J.: Microservices. IEEE Software, January/February 2015
New Metric Based on SQuAD for Evaluating
Accuracy of Enterprise Search Algorithms
1 Introduction
In the field of Information Retrieval, searching for textual answers within enterprises
has always been thought of as different from similar searches on the Internet. As
early as 2004, Hawking et al. described such differences in detail [1]. Multiple
challenges make it harder for enterprise search systems to be as accurate as search
systems focused on the Internet. First, there is no hierarchy among documents like the
hyperlinks used by Google’s PageRank algorithm [2]. Second, the traffic on an
enterprise search system is a fraction of that of a system like Google, which is used for
more than five billion searches every day, as per Internet Live Statistics as of the
publication date. Third, every enterprise has a highly specialized taxonomy or vocabulary
that differs from other enterprises [3], and sometimes there are differences even among
different departments or business units of an enterprise.
Search has kept evolving, both for textual answers within enterprises
and over the Internet. In particular, the concept of Natural Language Queries has
become popular, wherein the search string is simply a sentence in natural language and the
search system parses it for relevant keywords to complete the search [4]. Over time,
semantic search has also been discussed, which reduces the dependence on typing exact
keywords to get the right answers [5].
As enterprise search evolves, there is a dearth of widely accepted objective metrics
that can measure and compare the performance of various enterprise search systems. In this
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 83–90, 2020.
https://doi.org/10.1007/978-3-030-39442-4_8
84 H. Kulkarni et al.
paper we take such a metric for machine reading comprehension and modify it for
enterprise search. We discuss the modifications in detail and provide source code for
others to evaluate their own systems.
Today, with more sophisticated techniques like artificial intelligence, enterprise search
can achieve new successes despite the challenges enumerated above. As a research
team focused exclusively on enterprise search, and working with numerous enterprises
to solve these problems, we define Next Generation (NextGen) Enterprise Search as a
search system with the following objectives:
A. Accept a Natural Language Query and return a nugget of information that answers
the query. This answer could be a specific data point, a text snippet or an image,
but should not be a document or a list of documents. If such an answer does not
exist, the search system should not return any answers. If multiple such answers
exist, the search system should return all these answers.
B. Complete the objective A, as stated above, by focusing on the intent behind the
question and not on keywords. If the user asks the same question using a different
set of words such that the meaning of the question does not change, the answer
should not change either. On the other hand, if for the same question, if any
relevant information in the context changes, the answer should change to reflect the
new intent based on the changed context.
C. Complete the objectives A and B, as stated above, within a latency of five seconds.
There is an inverse correlation between accuracy and latency for most search
systems used for enterprise search [6]. Fixing the upper limit of latency levels the
playing field.
Different teams may use different objectives; however, there is a need for a metric
that can evaluate the performance of algorithmic search systems against any such set of
objectives. In the rest of the paper we call this metric NextGen Enterprise Search
Accuracy.
Since its publication in 2016, a scoring system based on the Stanford Question Answering
Dataset (SQuAD) [7] has emerged as a well-accepted methodology for comparing
different algorithmic approaches to machine reading comprehension (MRC). The SQuAD team
used more than 500 publicly available Wikipedia articles and crowdsourced more
than 100,000 questions about them. The team required crowdworkers to enter
questions as free-form text and to mark one or multiple answers as spans within the
article, ensuring that answers are always contained in the articles. The team also
provided standard scripts to generate two scores for any reading comprehension system:
• The Exact Match (EM) score gives the percentage of predictions that exactly
match the expected answer, controlling for punctuation.
• The F1 score is the average of the F1 scores of each question, computed from the precision
and recall of tokens in the predicted answer vs. tokens in the expected answer.
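Simplified versions of the two scores can be sketched as follows (the official SQuAD evaluation script also strips articles and applies fuller normalisation; this is a reduced illustration, not the official implementation):

```python
def exact_match(pred, gold):
    """1 if prediction equals the gold answer after light normalisation, else 0."""
    norm = lambda s: " ".join(s.lower().replace(".", "").replace(",", "").split())
    return int(norm(pred) == norm(gold))

def token_f1(pred, gold):
    """Token-overlap F1 between a predicted and an expected answer."""
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    common, g_rest = 0, list(g_toks)
    for t in p_toks:
        if t in g_rest:          # count each gold token at most once
            common += 1
            g_rest.remove(t)
    if common == 0:
        return 0.0
    precision = common / len(p_toks)
    recall = common / len(g_toks)
    return 2 * precision * recall / (precision + recall)
```

Averaging these per-question values over a dataset yields the EM and F1 scores reported on the SQuAD leaderboards.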
In 2018, Rajpurkar et al. enhanced the dataset by crowdsourcing another set of
more than 50,000 questions that were unanswerable by the original set of articles [8].
As of the date of this publication, these works have been cited more than 743 times by
major publications as per Google Scholar [9]. At least 144 successful evaluations of
different algorithms have been completed against the prescribed methodology in
SQuAD as per the SQuAD leaderboards [10]. Of these attempts, an ensemble approach
by Joint Laboratory of HIT and iFLYTEK Research has achieved an Exact Match score
of 87.147, which is higher than the Exact Match score of 86.831 achieved by humans.
To illustrate the need for a new metric for NextGen Enterprise Search Accuracy other
than SQuAD scores, we used a search system based on an algorithm called Calibrated
Quantum Mesh (CQM) [11]. Other algorithms or search systems can be used for
establishing NextGen Enterprise Search Accuracy.
We chose CQM because it was developed specifically for information retrieval
from natural language text corpora (the objectives are given in Sect. 2). It avoids many
problems related to the use of other machine learning algorithms, notably deep learning,
in enterprise search systems. Specifically, with CQM, unsupervised learning is possible without any annotation, tagging or structuring, which leads to significantly lower effort in training a search system.
CQM works on three basic principles, as shown in Fig. 1. First, it recognizes that
any symbol, word or text can have more than one meaning (or quantum state), each with a
different probability. Second, it recognizes that everything is correlated in a mesh:
each node modifies the others’ probability distributions across quantum states. Finally,
CQM sequentially adds all available information to help the mesh converge to a
single meaning. The calibrations are implemented using training data, contextual data,
reference data and other known facts about the problem; these calibrating systems are
called Calibrating Data Layers. When training data are passed through CQM, they
define many of the mesh’s interrelationships. Where applicable, data layer algorithms
learn from such data.
The scores in Table 1 suggest that the search system based on CQM used in this
experiment is very limited in its reading comprehension capabilities. However, from
our experience applying CQM in other situations we know that the search system
performs much better. For example, CQM was used to develop an Intelligent Machine
for Document Preparation at Eli Lilly [12]. The CQM based search system achieved an
accuracy of 89% when used by Eli Lilly’s scientists. While many other deployments have
not been published in academic journals, we have observed that the CQM-based search
system consistently scores between 87% and 98% on NextGen Enterprise Search Accuracy.
We hypothesize that this discrepancy exists because, while SQuAD scores are a
great metric for reading comprehension, they do not reflect the real expectations of business enterprises. Specifically, while SQuAD focuses on getting the exact information,
for enterprises finding the snippet or image containing that information is sufficient.
Second, SQuAD insists on getting this answer as the first and only answer from the
Table 2. Key differences between Machine Reading Comprehension and NextGen Enterprise
Search.

Feature: Focus
– Machine Reading Comprehension: focus is on returning the exact answer, e.g. for the
query “Who was the president in 2012?” the system must return “Barack Obama”.
– NextGen Enterprise Search: allows a small text nugget containing the answer, e.g. for
the same query it accepts a sentence like “Barack Obama was the President in 2012.” as
the correct answer.

Feature: Number of answers
– Machine Reading Comprehension: insists on the correct answer being the only answer
from the system.
– NextGen Enterprise Search: allows the correct answer to be among several top answers;
this number varies from use case to use case, but is typically 3.

Feature: Handling absent answers
– Machine Reading Comprehension: insists on reporting “No Answer Found” or equivalent
if the answer is not available in the corpus.
– NextGen Enterprise Search: requires best-effort search so that the user can conclude
on their own that the answer is not available in the corpus.
search system, for enterprises it is sufficient to produce the right answer among the top
few. For all analysis in this paper, we have assumed that if the right answer is among
top three, it is sufficient for an enterprise search system. Third, SQuAD, in its second
version (SQuAD 2.0) requires that a search system should not return any answer if the
exact information is not present in the corpus. While this is consistent with the
objectives of NextGen Enterprise Search, enterprise users do expect related informa-
tion. The related information helps them conclude on their own that the answer is
absent, which, in turn, helps them trust the search system better. Table 2 tabulates these
differences between SQuAD expectations and expectations of enterprise users.
While it is tempting to argue that MRC is a more sophisticated application than enterprise search, and that a more stringent set of metrics helps both applications, in truth such a stark difference makes it impractical to apply SQuAD metrics to enterprise
search. We propose to leverage SQuAD’s strength in curated questions and answers,
and still use it meaningfully to evaluate algorithmic systems for enterprise search.
If Q is the set of all questions, Ei is the set of expected answers for the question i,
and Pi the set of predicted answers, then we can define a function to determine whether
an answer was produced:
A(i) = \begin{cases} 1, & \text{if } \exists\, e \in E_i \text{ and } p \in P_i \text{ such that } e \text{ is a substring of } p \\ 0, & \text{otherwise} \end{cases} \qquad (1)
Then,
\mathrm{AnswerScore} = \frac{\sum_{i \in Q} A(i)}{|Q|} \qquad (2)
We also note that in the proposed methodology, precision has little meaning. We
propose to replace SQuAD’s F1 Score by a Recall Score. For each question, we create
tuples of all expected answers and the top three predicted answers. For each tuple, we compute the Tuple Recall as the ratio of the number of tokens in the expected answer that are also present in the predicted answer to the total number of tokens in the expected answer. The Recall Score of each question is taken as the maximum Tuple Recall over all tuples related to the question. The aggregate Recall Score for a search system is computed as the arithmetic mean of the Recall Scores of all questions in the corpus.
If T(s) represents the set of tokens in string s, then recall of a string e versus another
string p is:
R(e, p) = \frac{|T(e) \cap T(p)|}{|T(e)|} \qquad (3)
Further,
\mathrm{RecallScore} = \frac{\sum_{q \in Q} \max\{\, R(e, p) : e \in E_q,\ p \in P_q \,\}}{|Q|} \qquad (4)
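Formulas (1)-(4) can be sketched in a few lines of Python. This is only a sketch: whitespace tokenization and lowercasing are our assumptions, since the text does not fix a tokenizer.

```python
def tokens(s):
    # T(s): the set of tokens of a string; whitespace tokenization is an assumption
    return set(s.lower().split())

def answered(expected, predicted):
    # Eq. (1): A(i) = 1 if some expected answer e is a substring of some predicted answer p
    return int(any(e.lower() in p.lower() for e in expected for p in predicted))

def recall(e, p):
    # Eq. (3): R(e, p) = |T(e) ∩ T(p)| / |T(e)|
    te = tokens(e)
    return len(te & tokens(p)) / len(te)

def answer_score(questions):
    # Eq. (2): mean of A(i) over all questions
    return sum(answered(q["expected"], q["predicted"]) for q in questions) / len(questions)

def recall_score(questions):
    # Eq. (4): mean over questions of the maximum tuple recall over all (e, p) pairs
    return sum(
        max(recall(e, p) for e in q["expected"] for p in q["predicted"])
        for q in questions
    ) / len(questions)
```

With the example of Table 2, a predicted snippet "Barack Obama was the President in 2012." scores A(i) = 1 and tuple recall 1.0 against the expected answer "Barack Obama".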
We used the same search system based on CQM but used the new metrics to
measure the NextGen Enterprise Search Accuracy. The scores are shown in Table 3.
To reproduce these results, please complete the following steps.
1. Download the output of the CQM based search system as run on SQuAD with three
answers per question from this link and unzip:
https://www.dropbox.com/s/517d1m0igt48ghw/cqm-output-3.zip?dl=0
2. Download the SQuAD training data and unzip from:
https://www.dropbox.com/s/id51mfcymdrox8i/train-v1.1.json.zip?dl=0
3. Download the new metric scoring script from this link and unzip:
https://www.dropbox.com/s/qnb8wskte1j6z2o/evaluate-cqm.zip?dl=0
4. At a command prompt, run: python evaluate-cqm.py train-v1.1.json cqm-output-3.json
Table 3. Score of CQM based search method as per the proposed methodology
F1 Score 5.21%
EM Score 0.00%
Recall Score 92.45%
Answer Score 88.42%
These results are more in line with expectations of the performance of CQM.
Table 4. Score of CQM-based search method as per the various metrics and observations

Score type | Value
SQuAD metric (EM) | 0.00%
Proposed metric (Answer Score) | 88.42%
Observed accuracy at various clients (user annotated) | 87 to 98%
References
1. Hawking, D.: Challenges in enterprise search. In: Proceedings of the 15th Australasian
Database Conference (ADC 2004), vol. 27, pp. 15–24. Australian Computer Society, Inc.,
Darlinghurst (2004)
2. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing
order to the web (1999)
3. Guha, R., McCool, R., Miller, E.: Semantic search. In: Proceedings of the 12th International
Conference on World Wide Web (WWW 2003), pp. 700–709. ACM, New York (2003)
4. Voigt, C.A., Gordon, D.B., Mayo, S.L.: Trading accuracy for speed: a quantitative
comparison of search algorithms in protein sequence design. J. Mol. Biol. 299(3), 789–803
(2000). Edited by J Thornton
5. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine
comprehension of text. CoRR, abs/1606.05250 (2016)
6. Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for
SQuAD. CoRR, abs/1806.03822 (2018)
7. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: Squad Rajpurkar - Google Scholar (2019).
Accessed 21 May 2019
8. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: The Stanford Question Answering Dataset
(2019). Accessed 21 May 2019
9. Kulkarni, R., Kulkarni, H., Balar, K., Krishna, P.: Cognitive natural language search using
calibrated quantum mesh. In: 2018 IEEE 17th International Conference on Cognitive
Informatics Cognitive Computing (ICCI*CC), pp. 174–178, July 2018
10. Viswanath, S., Yates, M., Burt, J., Yazell, J., Kuhr, R., Strum, B., Krishna, P., Balar, K.,
Kulkarni, R., Kulkarni, H., Fennell, J.: An intelligent machine for document preparation. In:
AICHE Annual Meeting, October 2018
11. Han, K.H., Park, J.W.: Process-centered knowledge model and enterprise ontology for the
development of knowledge management system. Expert Syst. Appl. 36(4), 7441–7447
(2009)
12. Redfern, D.M.: Natural language meta-search system and method. US Patent 6,078,914, June 2000
Case-Based Generation of Regulatory Documents and Their Semantic Relatedness
1 Introduction
To this end, we implement and adapt techniques like word and sentence embeddings on a small corpus.
For generating a new document, we need to answer three questions. Which
incidents are likely to happen? Which preventive and reactive measures are suit-
able for each incident under a certain context? How important is each measure?
The last question reflects the fact that a limited budget and time do not allow for the implementation of all preventive measures. Additionally, in the case of an incident there may only be time to take limited action and implement some of the available reactive measures. This amounts to considering a convenient context-based ranking of the incidents and measures. The presented approach is
a general framework and easily adaptable to domains providing textual as well
as ontological information for the context dependent classification, prevention
of, and reaction to incidents.
For reasons of simplification and consistency we give examples of the domain
of public events. We therefore present a domain ontology for the classification and
security assessment of public events. The ontology is also capable of modeling the
according regulatory documents. Our approach is driven by theoretical and case-
based considerations we describe in the first part of this work. Afterwards we
show a novel technique of adaptation of word vector spaces. Then we present the
experimental setup we used to conduct a case study showing the practical capabilities of our approach. We finish with related work as well as discussions and future work.
In this chapter we first introduce the fundamental ontological structures for the
modeling of regulatory documents. Then we extend this structure to be capable
of representing incidents, measures and special relations between them. Finally
we explain why we use the methods of case-based reasoning and introduce our
case-based architecture. We make the assumption that there exists a corpus of
regulatory documents of a certain domain. The documents are sub-classified into passages that are connected with incidents or the according measures. Those passages of text are called information units. The work of identifying these passages was done by domain experts. Figure 1 shows the annotation of a document
using a hierarchical classification structure.
All information beyond the textual corpus of available documents is coded
into a knowledge base [4]. This knowledge base consists of entities (ontological
concepts) as logical units and the relations between them. An entity may be
for instance an action, an agent, an event, or a resource. In a textual context
an entity is coded or described as one word (term), several words, or up to a few sentences. Entities may be composed of subparts in an arbitrary manner. In the
following, we introduce the basic concepts of our scenario.
Case-Based Generation of Regulatory Documents 93
(Fig. 1: annotation of a regulatory document RD1 (hasAnnotation “page 1”) with ontological concepts such as secco:Document, secco:Chapter, secco:SecurityConcept, skos:Concept, and prov:Entity, connected by the relations secco:hasPart, secco:hasMandatoryPart, secco:partOf, secco:coveredByConcept, skos:broader/skos:narrower, and sub-properties such as secco:partOfBavarianGovRecommendations.)
(Figure: contextual elements C1-C3 influence incident I1, which is linked to preventive measures P1-P3 and reactive measures R1-R3 with importance weights between 0.7 and 0.95.)
Fig. 5. PIRI-diagram under a context C = (C1, ..., Cj) showing the ranked preventive and reactive measures with the according importance weights.
(Figure: the SECRI ontology — a secri:InformationUnit is partOf a secco:Document and targets a secri:Incident, a secri:Measure, or a secri:Importance; secri:PreventiveMeasure and secri:ReactiveMeasure are subclasses of secri:Measure.)
Definition 7. The case base CB = {c1 , ..., cm } is the collection of all cases ci
extracted from available regulatory documents and constructed as described before
as PIRI-snippets. A query q to the case base is a conjunct subset of (negated)
measures and incidents.
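A PIRI-snippet and a conjunctive query over the case base (Definition 7) can be represented compactly. The sketch below uses the Storm snippet given later in the text; the class and function names are our own.

```python
from dataclasses import dataclass

@dataclass
class PiriSnippet:
    incident: str
    preventive: dict  # preventive measure -> importance weight
    reactive: dict    # reactive measure -> importance weight

    def elements(self):
        # all incidents and measures mentioned in the snippet
        return {self.incident, *self.preventive, *self.reactive}

def matches(case, required, negated=frozenset()):
    # a query is a conjunction of required and negated incidents/measures
    elems = case.elements()
    return all(r in elems for r in required) and not any(n in elems for n in negated)

# the Storm snippet from the case study section
storm = PiriSnippet(
    "Storm",
    {"WeightTents": 0.9, "GetWeatherForecast": 0.8},
    {"CallFireDepartement": 0.9, "FullEvacuation": 0.8},
)
```

Retrieval then filters the case base with `matches` before the similarity-based ranking described next.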
98 A. Korger and J. Baumeister
The incidents and measures are classified by a taxonomy that was derived from
the connected ontology, building the base for the similarity assessment and adap-
tation. The local similarities SimP , SimI , SimR are calculated via the taxonomic
order of their elements. The incidents I and the measures P, R are hierarchically structured. Each element of the hierarchy is assigned a value symbolizing the similarity of its sub-elements. This value is set to 1 for the leaf elements and to 0 for the root element; it increases with the depth d of the element according to, for instance, sim_d = 1 − 1/2^d [2]. If we want to compare two
PIRI-snippets it is desirable to consider the context. For this reason we define
the following extended similarity measure under the context C:
\mathrm{Sim}_{Context}(c_k, c_l) = \big(\omega_1\, \mathrm{Sim}_{PIRI}(c_k, c_l) + \omega_2\, \mathrm{Sim}_{C}(Context_k, Context_l)\big) / 2 \qquad (2)
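The taxonomic local similarity sim_d = 1 − 1/2^d can be sketched as follows. The hierarchy below is purely illustrative: the intermediate concepts SportsEvent and MusicEvent are our own, not taken from the ontology.

```python
# illustrative taxonomy: child -> parent; EventType is the root (depth 0)
PARENTS = {
    "Running": "SportsEvent", "Motorsport": "SportsEvent",
    "Rock": "MusicEvent", "Classical": "MusicEvent",
    "SportsEvent": "EventType", "MusicEvent": "EventType",
}

def path_to_root(concept):
    path = [concept]
    while path[-1] in PARENTS:
        path.append(PARENTS[path[-1]])
    return path

def sim(a, b):
    # similarity of two concepts: sim_d = 1 - 1/2^d at their deepest common ancestor
    if a == b:
        return 1.0
    ancestors_b = set(path_to_root(b))
    for anc in path_to_root(a):  # walks upward, deepest ancestor first
        if anc in ancestors_b:
            d = len(path_to_root(anc)) - 1  # depth of the common ancestor
            return 1 - 1 / 2 ** d
    return 0.0
```

Two sports events thus share similarity 0.5 (common ancestor at depth 1), while a sports and a music event only meet at the root and score 0.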
(Figure: the EventType taxonomy with similarity values — EventType (0); Running, MartialArts, Motorsport, Rock, Classical, StreetFestival, ChristmasFair (each 1).)
(Figure: the weighting elements a1 (“close”), a2 (“gas”), and a3 (“door”) act on elements d1, d2, d3 of the word vector space VS1. Depending on the weights, elements are drawn inside the event horizon or are pushed outside.)
Definition 12. To recalculate the word vector space VS, each element v_j ∈ VS is updated by the operator SHIFT, where v'_j = SHIFT(v_j, A). Let f_s : (A, VS, ω) → R be a shifting function assigning the scalar norm of the shift to each pair (a_k, v_j) ∈ (A, VS). Let (v_j − a_k) be the vector direction the shift has to be applied in. For each element of A the shifting components are summed up:

v'_j = \mathrm{SHIFT}(v_j, A) = v_j + \sum_{k=1}^{n} f_s(a_k, v_j, \omega_k)\,(v_j − a_k), \quad n = |A|.
Figure 10 shows how the vector shifts induced by the weighting elements a1, a2, and a3 are applied to move the element v1 to its new position v1'. This strategy works in vector spaces of any dimension. The effort needed for the calculation grows linearly with the number of weighting elements and the number of elements to shift. Additionally, the recalculation could be limited to a certain neighborhood around the weighting elements. The retrieval based on word embeddings is used in addition to or in combination with the case-based retrieval. If a term is not covered by the domain ontology, similar terms are retrieved by searching, e.g., for the most similar ontological concept covering the retrieved terms. Two information units are compared by computing the semantic distance of each pair of words in the information units. The values are aggregated and the outcome is the similarity of the two information units. These similarity values are influenced by the recalculation of the word vector space [15].
(Fig. 10: the weighting elements a1 (“close”), a2 (“gas”), and a3 (“door”) shift v1 and v2 to their new positions v1' and v2' in VS1.)
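The SHIFT operator of Definition 12 can be sketched in a few lines. The concrete shifting function fs below is an illustrative choice (decaying with the distance between the element and the weighting element); the text explicitly leaves the construction of shifting functions open.

```python
def sub(u, w):
    return [x - y for x, y in zip(u, w)]

def norm(u):
    return sum(x * x for x in u) ** 0.5

def fs(a, v, w):
    # illustrative shifting function: strength w, decaying with the distance ||v - a||
    return w / (1.0 + norm(sub(v, a)))

def shift(v, anchors, weights):
    # v' = SHIFT(v, A) = v + sum_k fs(a_k, v, w_k) * (v - a_k)
    out = list(v)
    for a, w in zip(anchors, weights):
        s = fs(a, v, w)
        out = [o + s * d for o, d in zip(out, sub(v, a))]
    return out
```

With a positive fs, each weighting element pushes v away from itself; a negative fs would draw it inside the "event horizon" instead.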
4 Case Study
We exemplify the previous approach by a case study in the domain of public
events. We started with 30 real world regulatory documents of different public
events of which the 15 most relevant were annotated manually by three differ-
ent domain experts. This corpus is the basis for the present evaluation. For the
annotation process we developed and evaluated several ontological components.
These were used for the classification of public events (OECLA ) and the struc-
turing of the according security documents (OSECCO ). The following Table 1
shows the number of ontological concepts covered by each ontology.
The architecture we use to build the knowledge based system consists of different
components. The multi modal knowledge base contains all ontological, case-based
and text-based information and according tools. The text-based system holds a
word vector model and the corpus of annotated documents.
(Fig. 11: architecture with the multi-modal knowledge base (ontology model, CBR, annotated textual corpus) and the numbered steps (1)-(8) of the case-based cycle.)
Figure 11 shows the user interaction and the case-based cycle of natural doc-
ument extraction and generation. We assume that there already exists a corpus
that has been processed using the following workflow of extraction. In step (1) a
new problem arises. That may be for instance that a new regulatory document
is required or an existing document has to be improved as shown in step (2). All
features are extracted out of the problem description and the old document at
step (3) and queried to the knowledge base at step (4). The retrieved features,
phrases and documents are returned in step (5) and adapted in step (6), which requires user interaction. The new regulatory document is used (7) and retained in the corpus, enlarging the case base (8).
For the search in the textual corpus we used indexed text files. The index
files and according term frequency vectors were created using Apache Lucene [1].
The tool is an environment that allows for the usage of NLP techniques such as stop-word removal and stemming for the German language. It showed that, for our application, the wikipedia-ratio was almost 100%. This will surely be different in domains with a more specialized vocabulary and domain knowledge.
Figure 12 shows the workflow of breaking documents into reusable information
units. Beginning with selected features, it shows how they can be put together to form a new document. It presents which methods are used on each level for extraction, retrieval, and adaptation.
(Fig. 12: workflow of breaking regulatory documents (RD) of the corpus into reusable information units — EXTRACT, SELECT, and ADAPT AND GENERATE steps connect the corpus, the information units, the ontological concepts, and the newly generated documents RD1, RD2.)
(Figure: excerpt of the measure taxonomy — secri:Measure with broader relations to secri:CrowdControl and secri:InspectionMeasure, and further concepts secri:Disaggregation, secri:ObeyAuthorities, secri:Reprimand, and secri:SiteDismissal.)
A set of relevant cases was extracted out of the corpus and installed in myCBR, making up the experimental case base. Table 2 shows the number of different cases contained in the case base.
The pairwise similarities of the event classification cases are already avail-
able due to a postmortem analysis done in previous work and can be seen in
Fig. 15. The case-based postmortem analysis was evaluated against a postmortem analysis of the same cases done by domain experts.
Pages 17 30 31 41 58 72 26 10 64 4 4 9 2 47 16
Coverage 11,90% 23,70% 23,70% 18,00% 22,30% 14,40% 15,50% 16,20% 30,22% 13,31% 12,23% 13,67% 16,91% 25,18% 23,38%
Event christm wine wine folk city carne folk music carne fair fair running camp arena campus
Event Case ecla0 ecla1 ecla2 ecla3 ecla4 ecla5 ecla6 ecla7 ecla8 ecla9 ecla10 ecla11 ecla12 ecla13 ecla14
christm ecla0 x 0.63 0.5 0.6 0.6 0.6 0.75 0.6 0.6 0.65 0.65 0.35 0.34 0.59 0.34
wine ecla1 0.63 x 0.65 0.6 0.48 0.48 0.75 0.35 0.48 0.65 0.53 0.35 0.4 0.34 0.34
wine ecla2 0.53 0.65 x 0.57 0.4 0.4 0.65 0.28 0.4 0.75 0.63 0.33 0.5 0.44 0.44
folk ecla3 0.6 0.6 0.57 x 0.68 0.68 0.6 0.68 0.68 0.54 0.45 0.5 0.33 0.57 0.33
city ecla4 0.6 0.48 0.4 0.68 x 0.75 0.6 0.75 0.75 0.53 0.4 0.43 0.28 0.43 0.18
carne ecla5 0.6 0.48 0.4 0.68 0.75 x 0.6 0.75 0.75 0.53 0.53 0.43 0.28 0.46 0.21
folk ecla6 0.75 0.75 0.65 0.6 0.6 0.6 x 0.6 0.6 0.62 0.53 0.35 0.4 0.65 0.4
music ecla7 0.6 0.35 0.28 0.68 0.75 0.75 0.6 x 0.75 0.24 0.28 0.43 0.28 0.46 0.21
carne ecla8 0.6 0.48 0.4 0.68 0.75 0.75 0.6 0.75 x 0.53 0.53 0.43 0.28 0.46 0.21
fair ecla9 0.65 0.65 0.75 0.54 0.53 0.53 0.62 0.24 0.53 x 0.75 0.33 0.44 0.38 0.38
fair ecla10 0.65 0.53 0.63 0.45 0.4 0.53 0.53 0.28 0.53 0.75 x 0.33 0.44 0.44 0.44
running ecla11 0.35 0.35 0.33 0.5 0.43 0.43 0.35 0.43 0.43 0.33 0.33 x 0.33 0.48 0.23
camp ecla12 0.34 0.4 0.5 0.33 0.28 0.28 0.4 0.28 0.28 0.44 0.44 0.33 x 0.38 0.63
arena ecla13 0.59 0.34 0.44 0.57 0.43 0.46 0.65 0.46 0.46 0.38 0.44 0.48 0.38 x 0.5
campus ecla14 0.34 0.34 0.44 0.33 0.18 0.21 0.4 0.21 0.21 0.38 0.44 0.23 0.63 0.5 x
The domain experts were informed about the public events' parameters. Then they estimated the similarity of the events with regard to the writing of a regulatory security document for those events. Precision and recall [17] show how well the manual assessment done by real persons matches the objective assessment done by the case-based system. For the calculation we merged the evaluations of the three experts into one by neglecting multiple classifications and just considering whether an event was rated by one of the three experts, as depicted in Fig. 16. As recall and precision simply switch roles when we estimate how well the case-based classification matches the manual classification, we did not depict this information.
Fig. 16. Precision and recall of the evaluation, 0 = cbr, 1 = aggregated domain experts.
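How precision and recall compare the two ratings can be sketched as follows; the event sets below are hypothetical examples, not the ones from the study.

```python
def precision_recall(predicted, relevant):
    # precision: share of predicted items that are also relevant
    # recall: share of relevant items that were predicted
    tp = len(predicted & relevant)
    return tp / len(predicted), tp / len(relevant)

# hypothetical ratings: events the CBR system vs. the experts marked as similar
cbr_rated = {"ecla0", "ecla1", "ecla6"}
expert_rated = {"ecla0", "ecla6", "ecla9"}
p, r = precision_recall(cbr_rated, expert_rated)
```

Swapping the two arguments swaps precision and recall, which is why only one direction needs to be depicted.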
The postmortem analysis depicted in Fig. 17 makes clear where the influence of the context changes the similarity ranking of retrieved PIRI-cases.
Fig. 17. Postmortem analysis of the PIRI-snippets for the measure FireAndExplosion
without and with the context of the event classification. The values show the similari-
ties SimPIRI | SimContext . The value for SimContext was calculated out of SimPIRI and
SimECLA, which were weighted with 0.5 each. A significant change of the retrieval by the incorporated context is marked in bold.
During abstraction, terms are replaced by their class name. For instance, a city name is replaced by location data or by the part-of-speech class. The following exemplary text for the incident storm shows how an according passage of a security document would look in reality.
“Storm. Get weather information on a regular basis from the Munich
weather station. Weight all tents with heavy material or fix with ropes. In case
of upcoming storm, evacuate the event site using the Franz Josef avenue and call
the fire department.”
The PIRI-snippet with exemplary importance values for this would be:
Preventive(WeightTents(0.9),GetWeatherForecast(0.8))
Incident(Storm)
Reactive(CallFireDepartement(0.9),FullEvacuation(0.8)).
An abstracted information unit for the measure FullEvacuation would be:
“[FullEvacuation][StopWord][EventSite][Verb][StopWord][LocationData]”
This information unit can be for instance adapted to the measure PartialE-
vacuation. The ontological concept FullEvacuation is replaced by a retrieved
information unit for the new measure. The concept EventSite is for instance
replaced by the more specific concept EventSiteComponent. This information
can be retrieved out of other cases because PartialEvacuation is commonly combined with EventSiteComponent. The concept LocationData has to be replaced by the contextual location information, which is left to the user. The stop words are inserted and corrected by an NLG tool or the user. The generated textual
passage before stop word correction and context correction looks as follows:
“[Partial evacuation of the affected area][the]
[EventSiteComponent][using][the][LocationData]”
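The concept substitution described above can be sketched as a token-wise replacement over the abstracted information unit (the helper name `adapt` is ours):

```python
def adapt(information_unit, substitutions):
    # replace ontological concepts of an abstracted information unit;
    # tokens without a substitution (e.g. [StopWord], [Verb]) are kept as-is
    return [substitutions.get(token, token) for token in information_unit]

unit = ["[FullEvacuation]", "[StopWord]", "[EventSite]",
        "[Verb]", "[StopWord]", "[LocationData]"]
adapted = adapt(unit, {
    "[FullEvacuation]": "[PartialEvacuation]",
    "[EventSite]": "[EventSiteComponent]",
})
```

The remaining placeholders, such as [LocationData] and the stop words, are then filled in by the user or an NLG tool as described above.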
5 Related Work
We started the search for related work with an overview of state-of-the-art publications in the domain of natural language generation presented by Gatt and Krahmer [10]. Most of the presented work requires a large corpus for
the application of statistical methods. More suitable for our necessities seemed
6 Conclusions
In this paper we presented a data structure called PIRI for the representation of
a regulatory document describing incidents and according measures. After for-
mally describing it, we transferred the structure into a case-based model. Using
this model a novel approach for the adaptation of similarity measures and word
embeddings to different contexts was shown. In a case study the approach was
applied to a corpus of regulatory documents of the domain of public events. A
case base was built and a postmortem analysis was done on it. The results of
the case study and the general architecture were discussed with domain experts.
What we leave for future work is the construction and semantic evaluation of different shifting functions. Depending on their influence on the word vectors, a meaning has to be assigned to them. An approach for the evaluation of the retrofitted word vector space in comparison to the original one is needed. The major task is to find an evaluation strategy that works without the assistance of domain experts. Additionally, the integration of grammar-based natural language generation approaches seems promising for adapting abstracted information units to different contexts and reducing the needed user support.
References
1. Apache Lucene: http://lucene.apache.org/
2. Bach, K., Althoff, K.-D.: Developing case-based reasoning applications using
myCBR3. In: Agudo, B.D., Watson, I. (eds.) Case-Based Reasoning Research and
Development, pp. 17–31. Springer, Heidelberg (2012)
3. Bach, K., Sauer, C., Althoff, K.D., Roth-Berghofer, T.: Knowledge modeling with
the open source tool myCBR. In: Proceedings of the 10th International Conference
on Knowledge Engineering and Software Engineering - Volume 1289, KESE 2014,
Aachen, Germany, pp. 84–94. CEUR-WS.org (2014)
4. Baumeister, J., Reutelshoefer, J.: The connectivity of multi-modal knowledge
bases. CEUR Work. Proc. 1226, 287–298 (2014)
5. Baumeister, J., Reutelshoefer, J., Puppe, F.: KnowWE: a semantic Wiki for knowl-
edge engineering. Appl. Intell. 35(3), 323–344 (2011)
6. Bergmann, R.: Experience Management. Springer, Heidelberg (2002)
7. Craw, S., Aamodt, A.: Case-based reasoning as a model for cognitive artificial
intelligence. In: Proceedings of the 26th International Conference, ICCBR 2018,
Stockholm, Sweden, 9–12 July 2018, pp. 62–77, July 2018
8. Faruqui, M., Dodge, J., Jauhar, S.K., Dyer, C., Hovy, E.H., Smith, N.A.:
Retrofitting word vectors to semantic lexicons. CoRR, abs/1411.4166 (2014)
9. fastText. https://github.com/facebookresearch/fastText
10. Gatt, A., Krahmer, E.: Survey of the state of the art in natural language generation:
core tasks, applications and evaluation. CoRR, abs/1703.09902 (2017)
11. Huang, G., Guo, C., Kusner, M.J., Sun, Y., Weinberger, K.Q., Sha, F.: Super-
vised word mover’s distance. In: Proceedings of the 30th International Conference
on Neural Information Processing Systems, NIPS 2016, pp. 4869–4877. Curran
Associates Inc., USA (2016)
12. Khan, N., Alegre, U., Kramer, D., Augusto, J.C.: Is ‘context-aware reasoning =
case-based reasoning’ ? In: Brézillon, P., Turner, R., Penco, C. (eds.) Modeling and
Using Context, pp. 418–431. Springer, Cham (2017)
13. Korger, A., Baumeister, J.: The SECCO ontology for the retrieval and generation
of security concepts. In: Cox, M.T., Funk, P., Begum, S. (eds.) ICCBR. Lecture
Notes in Computer Science, vol. 11156, pp. 186–201. Springer, Cham (2018)
14. Korger, A., Baumeister, J.: Textual case-based adaptation using semantic related-
ness - a case study in the domain of security documents. In: Wissensmanagement
Potsdam (2019)
15. Kusner, M.J., Sun, Y., Kolkin, N.I., Weinberger, K.Q.: From word embeddings
to document distances. In: Proceedings of the 32nd International Conference on
International Conference on Machine Learning - Volume 37, ICML 2015, pp. 957–
966. JMLR.org (2015)
16. Moreau, L., Groth, P.: Provenance: an introduction to PROV. Synthesis Lectures
on the Semantic Web: Theory and Technology. Morgan and Claypool (2013)
17. Perry, J.W., Kent, A., Berry, M.M.: Machine literature searching X. Machine language: factors underlying its design and development. Am. Doc. 6(4), 242–254
(1955)
18. Remus, S., Biemann, C.: Retrofitting word representations for unsupervised sense
aware word similarities. In: LREC (2018)
19. Sizov, G., Ozturk, P., Marsi, E.: Let me explain: adaptation of explanations
extracted from incident reports. AI Commun. 30, 1–14 (2017)
20. Stensrud, B.S., Barrett, G.C., Trinh, V.C., Gonzalez, A.J.: Context-based reason-
ing: a revised specification. In: FLAIRS Conference (2004)
21. W3C: OWL2 Profiles, April 2009. http://www.w3.org/tr/owl2-profiles/
22. W3C: SKOS Simple Knowledge Organization System Reference, August 2009.
http://www.w3.org/TR/skos-reference
23. W3C: PROV-O: The PROV Ontology, April 2013. http://www.w3.org/TR/prov-o
A Comparative Evaluation of Preprocessing
Techniques for Short Texts in Spanish
1 Introduction
Natural Language Processing (NLP) is formally defined as a field of study that combines informatics, artificial intelligence, linguistics, and analysis processes related to natural language to generate knowledge and intelligence [1]. The relevance of this science is based on the processing and comprehension of information expressed through different media. These media are used in areas such as search [2], machine translation, named entity recognition (NER), information grouping, classification [3], and sentiment analysis [4], among others [5].
NLP consists of techniques applied in Knowledge Discovery in Databases (KDD), such as the selection of a dataset, data cleaning and preprocessing, transformation, data mining, and evaluation/interpretation [6]. Preprocessing is one of the most critical stages; it performs a cleaning process and prepares the text for its final analysis [7]. This stage includes techniques (e.g., tokenization, stemming, lemmatization, stop-word removal, lowercasing) usually applied during NLP
processes. There are specific techniques that depend on the nature of the text in which
the analyst is working. For example, when working with Twitter data, it is necessary to
apply data cleaning techniques, such as: deleting web links, usernames, hashtag sym-
bols, blank spaces, proper names, numeric words, and punctuation marks [8].
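The cleaning steps listed for Twitter data can be sketched with plain regular expressions. This is only a sketch: the stop-word list is a tiny illustrative subset, and production pipelines would draw on NLTK, SpaCy, or FreeLing resources instead.

```python
import re

# tiny illustrative Spanish stop-word list (real lists come from NLTK/SpaCy/FreeLing)
STOPWORDS = {"el", "la", "los", "las", "de", "del", "en", "y", "que", "un", "una"}

def clean_tweet(text):
    text = re.sub(r"https?://\S+", " ", text)  # delete web links
    text = re.sub(r"[@#]\w+", " ", text)       # delete usernames and hashtags
    text = re.sub(r"\d+", " ", text)           # delete numeric words
    text = re.sub(r"[^\w ]", " ", text)        # delete punctuation marks
    return re.sub(r"\s+", " ", text).strip().lower()  # blank spaces, lowercase

def preprocess(text):
    # tokenize by whitespace and remove stop words
    return [tok for tok in clean_tweet(text).split() if tok not in STOPWORDS]
```

Note that `\w` matches accented Spanish letters (á, ñ, ...) under Python's default Unicode semantics, so the punctuation rule leaves them intact.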
In improving the text classification through pre-processing techniques, researchers
focused their studies on texts obtained from social networks [8–10], news and emails
[3], and movie reviews [7]. However, these studies were applied for English texts and
used only one library. Preprocessing techniques are important for processing low-quality data and allow one to obtain high-quality datasets [7]. Applying preprocessing techniques to Spanish texts using different libraries generates knowledge that is useful as a foundation for studies related to NLP, mainly in short text classification.
Pre-processing techniques have been implemented using libraries or tools such as
NLTK [11], Stanford NLP [12], SpaCy [13], and FreeLing [14]. FreeLing is considered by researchers to be one of the most powerful, due to its functionalities and its scope in the Spanish language, and for its support of other languages (e.g., English, Portuguese, Italian, Russian, Catalan) [15].
The libraries for different languages have evolved substantially concerning the
corpus, dictionaries, and other functions [16]; however, for the Spanish language they are still in a maturation stage due to its complex syntax and semantics [5]. Thus, defining the techniques and tools for NLP is a challenge, even more so if applied to natural language in Spanish.
The combination of techniques can help in obtaining relevant results in the final processing of the text. However, most authors evaluate combinations of techniques only shallowly, relegating the libraries used in the experiments to a secondary role. Hence, it is necessary to assess each of these combinations in more detail, focusing on the main problem: data analysis in Spanish.
The aim of this study is to analyze and measure the impact of preprocessing techniques for NLP in the area of sentiment analysis. This study reports the results of the application of a sentiment classification model using libraries with support for the Spanish language. The data used in this study are tweets related to the Football World Cup 2018 and transit in Ecuador. We also report the evaluation parameters, the processing time, and the characteristics of the techniques for each library. The results show the differences in performance when applying preprocessing techniques to short texts written in Spanish.
The structure of this article is as follows. Section 2 presents the related work, Sect. 3
explains the methodology applied in the experimentation, Sect. 4 presents the results
obtained during this study, and finally, Sect. 5 presents the conclusions.
2 Related Works
In recent years, online data have grown exponentially as a result of the variety of online communities, forums, and societies that promote network interactions among users. The search for techniques that allow automatic gathering and processing of the information generated in real time is gaining increasing attention. NLP has become dominant in recent years, and research relating to this topic has centered especially on sentiment analysis, detection, and classification, areas that are linked to tools, algorithms, and libraries.

A Comparative Evaluation of Preprocessing Techniques 113
NLP considers different levels of analysis including semantics [17], lexical [18],
syntactic [19], and pragmatics [20]. Researchers mainly have focused their studies on
techniques applied to the pre-processing stage. Among the most relevant techniques
addressed by researchers are: lowercase [21], tokenization [22], stemming [23],
lemmatization [24], and stop-word removal [25]. Gupta [8] explored more specific techniques applied in social networks such as Twitter, which cover two aspects: (i) the removal of noise, and (ii) the normalization of the text by converting non-standard words to their canonical forms. In his work, he used the NLTK library applied to the domain of tweets in the English language. A more
specific application is the study conducted by [9], in this comparative research about
pre-processing techniques, the author focused on the opinion analysis in Twitter,
assessing the effect of six types of techniques (i.e., removal of web links, stop-word
removal, replacement of negative words, elimination of numbers, deletion of repeated
letters, extension of acronyms to their original words). These techniques were used to
evaluate the effect on the performance in the application of feelings analysis through
two types of methods, and experimenting with five different data sets in English. This
evaluation shows that the precision and the F-value are improved when using pre-
processing methods, to expand acronyms and to replace the denial, but they have not
significate changes if web links and numbers are removed, or stop-word removal is
applied in a simple way.
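The individual techniques listed above can be illustrated with a minimal, self-contained sketch (plain Python, with a toy stop-word list and a naive suffix stripper standing in for a real stemmer; production systems would rely on NLTK, spaCy, or FreeLing):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of"}  # toy list for illustration

def preprocess(text):
    """Apply lowercase, noise removal, tokenization, stop-word removal,
    and a naive suffix stripper to a raw tweet-like string."""
    text = text.lower()                                   # lowercase
    text = re.sub(r"https?://\S+", "", text)              # remove web links (noise)
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop numbers/punctuation
    tokens = text.split()                                 # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # crude stemming
    return tokens

print(preprocess("The match IS starting now! http://t.co/xyz 2018"))
# -> ['match', 'start', 'now']
```

Each line corresponds to one of the techniques discussed above; real stemmers and tokenizers are considerably more sophisticated, which is precisely why the choice of library matters.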
Other languages with complex structures, such as Arabic, were also investigated in
areas such as automatic text summarization [23] and information retrieval [26]. Arabic is a
highly flexible language, with variable words and complex morphology; for these
reasons, the application of techniques such as stemming is suitable, since it allows the
reduction of text length and fast searching of information [23]. Jianqiang et al.
[9] highlighted the importance of pre-processing techniques applied to this language.
The authors experimented with two datasets, applying three types of classification
algorithms: SVM (Support Vector Machine), Naive Bayes, and K-Nearest Neighbors. Regarding
pre-processing techniques, they used feature correlation, stemming, and n-grams,
demonstrating that the use of these techniques on reviews increases the performance
of the classifiers.
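As an illustration of the n-gram features mentioned above, word n-grams can be extracted from a token sequence in a few lines (a generic sketch, not the cited authors' implementation):

```python
def ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["service", "was", "very", "slow"]
print(ngrams(tokens, 2))
# -> [('service', 'was'), ('was', 'very'), ('very', 'slow')]
```

Bigrams and trigrams produced this way are then fed as features to classifiers such as SVM, Naive Bayes, or k-NN.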
In NLP, authors rely on libraries such as FreeLing and SpaCy, which serve as tools
for processing and pre-processing text. FreeLing is considered one of the most
powerful libraries for building robust opinion-mining systems, providing language-analysis
functionalities for different languages [14]. SpaCy is a library focused on
advanced natural-language processing aimed at commercial applications;
according to [25], it is considered one of the fastest and most accurate libraries,
achieving the best results in tokenization, with 90% accuracy compared to
others.
Although there are studies related to pre-processing techniques and their impact on
NLP applications, most of them have been implemented for the English or Arabic
languages and measure the impact of these techniques in a single library. This approach
can lead to incomplete results in short-text classification.
114 M. Orellana et al.
3 Methodology
The methodology for the experimentation consists of three main sections: (i) Data
Extraction (see Fig. 1A), (ii) Techniques Application (see Fig. 1B), and (iii) Classification
(see Fig. 1C). The method also relies on several settings, features, and
algorithms.
The task of data extraction includes the download process, extraction settings, and
the domains (i.e., Football, Transit). This process is performed by a software engineer
because it requires prior programming knowledge to connect to the
Twitter API. It includes a cleaning sub-task performed by a data
analyst, who identifies the prior processing required by the extracted tweets.
The application of pre-processing techniques is also executed by the data
analyst, whose role is to select the correct techniques and libraries to generate clean text
that is easy to analyze. Finally, the task of sentiment analysis is performed by a specialist in
data analysis, who sets the correct configuration for an efficient analysis.
measure the recall and precision, and to determine the F-measure as a parameter of the model's
performance.
The tweets related to transit in different cities of Ecuador were extracted into
set T. This set consists of tweets posted during May 2018 and, as in the case of set F, it
was classified in a binomial manner, obtaining 381 positive tweets and 919 negative
tweets. The classification yielded two unbalanced datasets, that is, datasets with a
large size difference between classes (positive, negative). Unbalanced
datasets are common in social media data, since a positive or negative
opinion is usually predominant among social media users. This type of data can lead
the classifier to minimize the error on the predominant class, which generates
biased accuracy metrics (see Fig. 1A).
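The bias can be made concrete: on the T set's 381/919 split, a classifier that always predicts the majority (negative) class reaches about 70% accuracy while achieving zero precision, recall, and F-measure on the positive class (hand-computed toy counts, not the paper's model):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f

# Majority-class baseline on the T set (381 positive / 919 negative tweets):
# every tweet predicted negative, so tp = fp = 0 for the positive class.
accuracy = 919 / (381 + 919)
print(round(accuracy, 3))          # ~0.707 despite learning nothing
print(prf(tp=0, fp=0, fn=381))     # (0.0, 0.0, 0.0) for the positive class
```

This is why the F-measure, rather than raw accuracy, is used later as the evaluation criterion.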
Two different domains were included in the experimentation in order to
collect datasets with different language dimensionality. For example, the M set contains
different expressions and forms of communication from a variety of countries. To
evaluate the accuracy of the results with the proposed method, it was necessary
to assess two domains, considering the dimensionality, the way in
which each country communicates, and the proper characteristics of each form of
communication. The set T, on the other hand, offers another perspective: the application
of the same techniques to a set that is more closed with respect to language.
(b) Libraries
There are multiple libraries that provide tools in different areas of NLP,
such as text classification, automatic labeling, sentiment analysis, NER analysis, and
preprocessing. However, their functionality is still limited for certain languages such as
Spanish, which is a widespread and difficult language to process. For this reason,
a more objective evaluation was performed, experimenting with different libraries that
provide support for these types of techniques. The selected libraries, NLTK, SpaCy, and
FreeLing, were the result of an analysis of the characteristics that make them suitable for
the purpose of this research. Additionally, they support the majority of the combinations
of established techniques (see Table 3).
3.4 Classification
The evaluation stage was developed based on a specific application of NLP:
sentiment analysis. Through this experimentation, the impact of the techniques and
their combinations in NLP was evaluated. For this process, an analysis model based on the
Support Vector Machine (SVM) algorithm with a Radial Basis Function (RBF) kernel was
built in the RapidMiner [29] tool (see Fig. 1C). This model consists of three phases:
(i) the process of reading documents, where the combinations of techniques were
applied and the documents were classified according to the sentiment they generated
(i.e., positive, negative); (ii) the split of tweets for training and application of the model,
established at 70% and 30%, respectively. In addition, an optimization operator was used,
which finds the optimal values of the parameters chosen for the kernel operators of the
model; in this case, the parameters C and γ. By means of this optimization operator, the
values of the constants C and γ were configured following the recommendations of [26].
In this case, it is recommended to use a "grid search" by means of the cross-validation
method. This method consists of testing different pairs of values of C and γ and selecting
the pair that obtains the best value of the chosen evaluation parameter, in order to
measure the general performance of the model. Here, the F-measure was used, since it is
the most objective and suitable measure for working with unbalanced text. Taking into
account that using exponentially growing sequences for the parameters (C and γ) is the
most practical method, the sequence where C takes values of 2^-5, 2^-3, …, 2^15 and γ
takes values of 2^-15, 2^-13, …, 2^3 was used. The last phase was the writing of the
generated confusion matrix and the classification results of the model with its evaluation
parameters (see Fig. 2).
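The exponential grid described above can be sketched in plain Python; the `score` function below is a stand-in (assumed, not the paper's code) for the cross-validated F-measure that the optimization operator would compute for each (C, γ) pair:

```python
import math

# Exponential grids for the SVM hyper-parameters, following the
# common libsvm heuristic referenced in the text.
C_grid = [2.0 ** e for e in range(-5, 16, 2)]      # 2^-5, 2^-3, ..., 2^15
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3

def score(C, gamma):
    """Placeholder for the cross-validated F-measure of an RBF-kernel SVM;
    a smooth dummy surface with a single optimum, for illustration only."""
    return -((math.log2(C) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)

best_C, best_gamma = max(((C, g) for C in C_grid for g in gamma_grid),
                         key=lambda pair: score(*pair))
print(len(C_grid), len(gamma_grid))  # 11 and 10 candidate values
print(best_C, best_gamma)            # 8.0 0.03125, i.e. C = 2^3, gamma = 2^-5
```

In practice each candidate pair is evaluated with cross-validation rather than a closed-form score, exactly as the text describes.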
4 Results
Based on the experimentation results, the characteristics of each library were taken
into account in the selection of techniques, which led to a reduction in the number of
technique combinations for the sentiment-analysis study. Among those characteristics,
transformation to lowercase is mandatory within the stop-word removal
technique, which eliminated the combinations involving the lowercase technique. In
addition, the tokenization technique is already present within the sentiment-analysis
algorithm, so it was also removed from the proposed combinations (see Table 4).
Fig. 6. Evaluation of the f_measure, recall, and precision values in dataset M with each combination
of techniques.
Fig. 7. Evaluation of the f_measure, recall, and precision values in dataset T with each combination
of techniques.
5 Conclusions
This study presents the impact of pre-processing techniques in NLP, taking as a case
study the field of sentiment analysis, one of the most widely explored applications of this
type of language processing. The techniques were evaluated in two different domains of
tweets, World Cup 2018 and Transit in Ecuador, using all possible combinations of the
selected pre-processing techniques, evaluated in different NLP
libraries. The aspects considered in the evaluation were the processing time of the
libraries for each technique, the processing characteristics and capacity of their corpora
and algorithms applied to the Spanish language and, in terms of sentiment analysis, the
impact of each of the combinations obtained, considering precision, recall, and F-measure
as measures of overall performance.
The experimental analysis revealed that appropriate combinations of techniques in the
pre-processing stage, such as lemmatization and lowercase, provide a significant
improvement in classification according to the F-measure obtained after applying the
sentiment-analysis model. Although there are pre-processing techniques, such
as lowercase or stop-word removal, that do not reach the maximum improvement in the
sentiment-analysis process, they still yield a measurable improvement. Therefore, for a
text-classification problem such as sentiment analysis in any domain, it is advisable to
carefully analyze all possible combinations of techniques, instead of using one or all of
them in the pre-processing stage. When working with other fields of NLP, it is necessary
to examine the aim of the study and analyze the function of each technique on the text, in
order to achieve more efficient processing and greater precision in the final results.
6 Future Works
Our work evaluated three libraries: spaCy, FreeLing, and NLTK. For future work, we
propose the evaluation of new libraries such as UDPipe, a C++ library currently
available as a package in RStudio. This library was designed for data preparation
in NLP tasks and has support for the Spanish language. Additionally, we propose to
apply the selected techniques in other areas such as NER analysis and classification.
References
1. Reese, R.M.: Natural Language Processing with Java. Packt Publishing (2015)
2. Battistelli, D., Charnois, T., Minel, J.L., Teissèdre, C.: Detecting salient events in large
corpora by a combination of NLP and data mining techniques. Comput. y Sist. 17, 229–237
(2013)
3. Uysal, A.K., Gunal, S.: The impact of preprocessing on text classification. Inf. Process.
Manage. 50, 104–112 (2014). https://doi.org/10.1016/j.ipm.2013.08.006
4. Krouska, A., Troussas, C., Virvou, M.: The effect of preprocessing techniques on twitter
sentiment analysis. In: 2016 7th International Conference on Information, Intelligent System
Application (IISA), pp. 1–5 (2016). https://doi.org/10.1109/iisa.2016.7785373
5. Hidalgo, O., Jaimes, R., Gomez, E., Luján-Mora, S.: Sentiment analysis applied to the
popularity level of the Ecuadorian political leader Rafael Correa. In: 2017
International Conference on Information Systems and Computer Science (INCISCOS),
pp. 340–346 (2017)
6. Gómez-Jiménez, G., Gonzalez-Ponce, K., Castillo-Pazos, D.J., Madariaga-Mazon, A.,
Barroso-Flores, J., Cortes-Guzman, F., Martinez-Mayorga, K.: The OECD Principles for (Q)
SAR Models in the Context of Knowledge Discovery in Databases (KDD). Elsevier Inc.
(2018)
7. Haddi, E., Liu, X., Shi, Y.: The role of text pre-processing in sentiment analysis. Procedia
Comput. Sci. 17, 26–32 (2013). https://doi.org/10.1016/j.procs.2013.05.005
8. Gupta, I., Joshi, N.: Tweet normalization : a knowledge based approach. In: 2017
International Conference on Infocom Technologies and Unmanned Systems (Trends Future
Directions) (ICTUS), pp. 1–6 (2017)
9. Jianqiang, Z., Xiaolin, G.: Comparison research on text pre-processing methods on twitter
sentiment analysis. IEEE Access. 5, 2870–2879 (2017). https://doi.org/10.1109/ACCESS.
2017.2672677
10. Galadanci, B.S., Muaz, S.A., Mukhtar, M.I.: Comparing research outputs of Nigeria Federal
Universities based on the scopus database. In: CEUR Workshop Proceedings, vol. 1755,
pp. 79–84 (2016). https://doi.org/10.1177/0165551510000000
11. Paramkusham, S.: NLTK: The natural language toolkit. Int. J. Technol. Res. Eng. 5, 2845–
2847 (2017)
12. Weerasooriya, T., Perera, N., Liyanage, S.R.: A method to extract essential keywords from a
tweet using NLP tools. In: 16th International Conference on Advances in ICT for Emerging
Regions, ICTer 2016 - Conference Proceedings, pp. 29–34 (2017)
13. SpaCy: spaCy. https://spacy.io/usage/linguistic-features#_title
14. Padró, L., Stanilovsky, E.: FreeLing 3.0: towards wider multilinguality. In: Proceedings
Language Resources Evaluation Conference (LREC 2012), pp. 2473–2479 (2012)
15. Henríquez, C., Guzmán, J., Salcedo, D.: Minería de Opiniones basado en la adaptación al
español de ANEW sobre opiniones acerca de hoteles. Proces. del Leng. Nat. 41, 25–32
(2016)
16. Prata, D.N., Soares, K.P., Silva, M.A., Trevisan, D.Q., Letouze, P.: Social data analysis of
Brazilian’s mood from twitter. Int. J. Soc. Sci. Humanit. 6, 179–183 (2016). https://doi.org/
10.7763/IJSSH.2016.V6.640
17. Altszyler, E., Brusco, P.: Análisis de la dinámica del contenido semántico de textos. In:
Argentine Symposium on Artificial Intelligence, pp. 256–263 (2015)
18. Pérez-guadarramas, Y., Rodríguez-blanco, A., Simón-cuevas, A.: Combinando patrones
léxico - sintácticos y análisis de tópicos para la extracción automática de frases relevantes en
textos. Proces. L. 59, 39–46 (2017)
19. Antonio, F., Velásquez, C., Paul, J., De Paz, Z., Guzmán, J.F.: Aplicación del análisis
sintáctico automático en la atribución de autoría de mensajes en redes sociales. Res. Comput.
Sci. 137, 109–119 (2017)
20. Soto Kiewit, L.D.: Un acercamiento a la concepción de gobernabilidad en los discursos
presidenciales de José María Figueres Olsen. Rev. Rupturas. 7, 1 (2017). https://doi.org/10.
22458/rr.v7i1.1609
21. Poornima, B.K.: Text preprocessing on extracted text from audio/video using R. Int.
J. Comput. Intell. Inform. 6, 267–278 (2017)
22. He, Y., Kayaalp, M.: A comparison of 13 tokenizers on MEDLINE. Bethesda, MD List. Hill
Natl. Cent. Biomed. Commun. 48 (2006)
23. Alami, N., Meknassi, M., Ouatik, S.A., Ennahnahi, N.: Impact of stemming on Arabic text
summarization. In: Colloquium in Information Science and Technology, CIST, pp. 338–343
(2017)
24. Singh, T., Kumari, M.: Role of text pre-processing in twitter sentiment analysis. Procedia
Comput. Sci. 89, 549–554 (2016). https://doi.org/10.1016/j.procs.2016.06.095
25. Katariya, N.P., Chaudhari, M.S.: Text preprocessing for text mining using side information.
Int. J. Comput. Sci. Mob. Appl. 3, 3–7 (2015)
26. Althobaiti, M., Kruschwitz, U., Poesio, M.: AraNLP: a Java-based library for the processing
of Arabic text. In: Proceedings of the Ninth International Conference on Language
Resources and Evaluation (LREC 2014), pp. 4134–4138 (2014)
27. Twitter Inc: Search Tweets. https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
28. RStudio: Take control of your R code. https://www.rstudio.com/products/rstudio/
29. RapidMiner GmbH: RapidMiner Documentation
Automatic Visual Recommendation for Data
Science and Analytics
Seidenberg School of CSIS, Pace University, White Plains, New York, USA
{mm42526w,tagerwala,ctappert}@pace.edu
Abstract. Data visualization is used to extract insight from large datasets. Data
scientists repeatedly generate different visualizations from datasets to test
their hypotheses. Analyzing datasets that have many attributes can be a
cumbersome process and lead to errors. The goal of this research paper is to
automatically recommend interesting visualization patterns using optimized
datasets from different databases. It reduces the time spent on low-utility
visualizations and displays recommended patterns.
1 Introduction
Data visualization tools have been used increasingly by data scientists and analysts. They
load different datasets and examine their hypotheses using visualization tools; this
process is repeated several times until they find an interesting pattern. Data scientists
need to derive insights using this trial-and-error method, which is tedious. The main goal
of this research paper is to find interesting patterns in large datasets across different
databases. When a user issues a query, the optimizer substitutes the query with an
optimized copy and returns the results, from which the recommended visualizations are
displayed automatically without manual intervention. The SQL-optimized framework
supports the different kinds of databases used in an organization and provides an extension
to SeeDB [1].
Data scientists need to build different visualizations from new datasets in order to
find various patterns and anomalies. When the dataset has high dimensionality, finding
various patterns becomes a tedious task. Determining relationships between attributes
and their subsets is required for analysis of the data. Visualizations are likely to display
interesting patterns if the plotted data deviates largely from the reference points or
historical data. Even for a small dataset, the number of visualizations that can be
generated is large. Furthermore, the visualizations should be displayed at interactive
speed, with quick response times for the users.
Today, data is stored in different databases whose storage models have been
customized to their needs. When data needs to be analyzed, it must be retrieved
from various sources. Healthcare datasets such as the publicly available MIMIC-III
(Medical Information Mart for Intensive Care) contain de-identified data collected
from Intensive Care Unit (ICU) patients [2]. MIMIC-III contains both structured and
unstructured data: medications, doctor reports, and streaming data from medical
devices. These varied data formats are stored in different databases; structured
data is stored in relational databases and unstructured data in NoSQL data stores, in
order to gain the performance advantage of native databases specifically designed to
handle them.
MIMIC-III is a data repository containing information about patients admitted
to hospitals. It contains details of vital signs, medical-device readings, doctor
notes, and patient-admission data. The data is released to researchers after
de-identifying the patient information [2]. Data federation among these different databases
is required to build a visualization recommendation tool, which would help data
scientists and analysts test their hypotheses on different datasets. Today this
process is manual: the user has to gather the data of interest and go
through all the visualizations, which is a cumbersome task.
Graphs can be plotted from de-identified patient information collected from
different sources, including patient admission date, gender, doctor notes, and time-series
data from medical devices. The number of visualizations grows significantly
with the number of points of user interest, and tracking all the generated visualizations
becomes difficult. Users are interested in visualizations in which the target
data shows a large deviation from the reference data. The data a user wants to
analyze can be stored across different databases; a federated SQL framework
that retrieves the data quickly across databases helps them focus on their tasks.
Fig. 1. Graph of heart-related problems for admitted married and unmarried patients.
Figure 1 shows that target data which deviates largely from the reference points
helps in identifying outliers and anomalies in the data for further investigation.
Automatic Visual Recommendation for Data Science and Analytics 127
2 Problem Statement
Data scientists analyze datasets that are retrieved from different databases. Dimensional
attributes represent the facts, and measure attributes are derived from aggregate
functions; these two kinds of attributes are used in the visualization tools [3]. This
research extends SeeDB [1], which recommends visualizations for a query using a
high-utility function that favors larger deviations, by retrieving datasets from different
databases. The customized SQL framework used to derive the data also makes
use of optimized data created by the database administrators. Healthcare and
medical devices generate data of different formats, which need to be stored in different
databases [4]. The customized SQL framework queries the data from any of the
registered databases; at runtime, the query optimizer substitutes the optimized data
when applicable and retrieves the data quickly, which is later used to build the
recommended visualizations. Dimensional attributes D represent the group-by attributes
of the query. The dimensional attributes are quantified using the measure attributes
M and a set of aggregate functions A. These queries are executed against a set of
registered databases S. We can group by the dimensional attributes D and aggregate
based on the measure attributes M. This results in a two-dimensional table which can be
used for visualization. Recommended visualizations have a high utility factor and can be
obtained by executing aggregation over the group-by attributes on the registered
databases, represented by the function (d, m, a) where d ∈ D, m ∈ M, a ∈ A. T(S) represents
the grouping of data in the set of registered databases S for the target data, and R(S)
represents the data for the reference datasets. The query results Q(target) and Q(reference)
determine the utility factor, which in turn determines the visualizations to be displayed [1].
The utility factor is calculated from the views of Q(target) and Q(reference); the
difference between the views is used to determine the recommended visualizations to be
displayed (see Figs. 2 and 3).
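A minimal sketch of a deviation-based utility in the SeeDB spirit: each view is normalized into a probability distribution over the group-by values, and the distance between the target and reference distributions serves as the utility (Euclidean distance here, though SeeDB supports several metrics; the diagnosis counts are invented for illustration):

```python
import math

def utility(target_view, reference_view):
    """Deviation between two aggregate views, each given as a dict mapping a
    group-by value to an aggregated measure (e.g. a count)."""
    keys = sorted(set(target_view) | set(reference_view))
    def normalize(view):
        total = sum(view.get(k, 0) for k in keys) or 1
        return [view.get(k, 0) / total for k in keys]
    p, q = normalize(target_view), normalize(reference_view)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Q(target): e.g. married patients by diagnosis; Q(reference): unmarried patients.
t = {"cardiac": 60, "respiratory": 30, "hypotension": 10}
r = {"cardiac": 20, "respiratory": 40, "hypotension": 40}
print(round(utility(t, r), 3))  # large deviation -> worth recommending
```

Views are then ranked by this score, and the top-k are displayed as the recommended visualizations.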
(Figs. 2 and 3: client/server architecture. The client displays the recommended visuals; the server hosts the view generator on top of HTAP, NoSQL, and RDBMS data stores.)
3 Architecture
The front-end component generates visualizations based on the data from different
databases. It interacts with a customized SQL framework which acts as a federated
query layer when retrieving the data. At runtime, the framework rewrites the query into
its optimized copy and returns the result set. The incoming query is parsed and
validated for syntax. The query optimizer applies various transformation rules to the
relational node to obtain an optimized node which has the same semantic meaning as
the original one but with a reduced query cost. The query optimizer transforms the
relational node by substituting it, in whole or in part, with rules that match its pattern.
The metadata about the different registered databases is stored in a catalog
which is consulted at runtime by the query optimizer. It provides information about
the overall execution cost of the query, the data size of tables, and the memory and CPU
usage for the query execution.
Sharing-Based Visualization Optimizations
Aggregate queries that have the same group-by attributes are combined into a single
view, and later multiple group-bys are integrated. This results in improved query
latency and better performance [1]:
SELECT diagnosis, occupation, count(diagnosis), sum(age) FROM admission_married GROUP BY GROUPING SETS ((diagnosis), (occupation))
Pruning-Based Visualization Optimizations
This optimization implements interval pruning based on the confidence intervals of the
utility scores: it discards all visualizations whose utility-score upper bound falls below
the least lower bound among the top-k visualizations [1].
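The pruning rule can be sketched as follows: given a confidence interval on each view's utility, any view whose upper bound falls below the k-th largest lower bound can never enter the top-k and is discarded (a sketch of the idea, not SeeDB's exact implementation):

```python
def prune(views, k):
    """views: dict mapping view name -> (lower, upper) utility confidence interval.
    Keep only the views that could still reach the top-k."""
    lowers = sorted((lo for lo, _ in views.values()), reverse=True)
    threshold = lowers[k - 1] if len(lowers) >= k else float("-inf")
    return {name: iv for name, iv in views.items() if iv[1] >= threshold}

views = {"v1": (0.8, 0.9), "v2": (0.5, 0.7), "v3": (0.1, 0.4), "v4": (0.6, 0.85)}
print(sorted(prune(views, 2)))
# -> ['v1', 'v2', 'v4']; v3 is pruned since its upper bound 0.4 < 2nd lower bound 0.6
```

As the confidence intervals tighten with more data, more low-utility views are pruned early, which is what keeps the response time interactive.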
4 Evaluation
The evaluation was performed to display the visualizations with the best utility factor
among the top views, with better accuracy and reduced response time.
130 M. Muniswamaiah et al.
Fig. 4. Recommended visualizations display for admitted patients diagnosed with respiratory
related issues.
Fig. 5. Recommended visualizations display for admitted patients diagnosed with heart related
issues.
Fig. 6. Recommended visualizations display for admitted patients diagnosed with hypotension
related issues
All the experiments were conducted on a 64-bit Linux machine with a 3 GHz Intel
Xeon processor and 16 GB RAM. PostgreSQL [5] was used to store patient-related
information; the medical-device data was stored in Splice Machine [6] and the text notes
in MongoDB [7]. As shown in Figs. 4, 5 and 6, dataset1 is the target dataset and
dataset2 is the reference dataset. Visualizations with a high utility factor, deviating
from the reference points, are displayed as the recommended visualizations.
5 Related Work
There are various visualization tools that have been developed and introduced in the
market [8]. These tools require users to manually specify graphs, which is a tedious
task. Visualization tools should support datasets that are large in volume while
maintaining good response times. Some of these data are volatile and not completely
preprocessed for display. There has been some active research on developing
recommended visual displays, such as VISO and VizDeck [9]; the problem with these
approaches is that they quickly become intractable as the number of dataset attributes
to be analyzed increases. In order to improve scalability and response time, in-memory
caching and sampling of the datasets have been used. Data materialization has
been used by database administrators to help reduce the run-time computation while
analyzing these datasets [10]. Visualization tools must be interactive in displaying
the data. They must also support relationships between different views, filter down
to the data required for display, and focus on the information of interest. Visualization
of datasets that are both structured and unstructured is challenging. Data-reduction
techniques like binned aggregation have been used effectively for visualization.
Pre-computed data and data parallelization are also techniques adopted by many
visualization tools on the market to reduce the latency of displaying visualizations.
Data cubes and nanocubes are built from the tables, performing aggregation over the
table dimensions for scalability. There are different types of tools which support
visualization through services, libraries, and platforms [11].
6 Conclusion
This research implements a visualization analytics tool, along with a query optimizer,
that helps recommend interesting visualizations automatically from different databases.
This work helps improve interactive data exploration for data scientists
and analysts. A further extension of this work would be to integrate it with cloud
databases.
References
1. Vartak, M., Madden, S., Parameswaran, A., Polyzotis, N.: SeeDB: automatically generating
query visualizations. Proc. VLDB Endow. 7(13), 1581–1584 (2014)
2. Johnson, A.E., Pollard, T.J., Shen, L., Li-wei, H.L., Feng, M., Ghassemi, M., Moody, B.,
Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database.
Sci. Data 3, 160035 (2016)
3. Waller, M.A., Fawcett, S.E.: Data science, predictive analytics, and big data: a revolution
that will transform supply chain design and management. J. Bus. Logistics 34(2), 77–84
(2013)
4. Muniswamaiah, M., Agerwala, T., Tappert, C.C.: Context-aware query performance
optimization for big data analytics in healthcare. In: 2019 IEEE High Performance Extreme
Computing Conference (HPEC-2019), pp. 1–7 (2019)
5. https://www.postgresql.org/
6. https://www.splicemachine.com/
7. https://www.mongodb.com/
8. Keim, D., Qu, H., Ma, K.-L.: Big-data visualization. IEEE Comput. Graph. Appl. 33(4), 20–
21 (2013)
9. Perry, D.B., et al.: VizDeck: streamlining exploratory visual analytics of scientific data
(2013)
10. Fisher, D., et al.: Interactions with big data analytics. Interactions 19(3), 50–59 (2012)
11. Wang, L., Wang, G., Alexander, C.A.: Big data and visualization: methods, challenges and
technology progress. Digit. Technol. 1(1), 33–38 (2015)
A Novel Recommender System for Healthy
Grocery Shopping
1 Introduction
With the growth of online shopping, companies are scrambling to figure out new
methods to improve growth, obtain profits, and increase customer retention. Due to the
large amount of dynamic, growing data, these methods must provide accurate, efficient,
and quick answers. With this in mind, we investigate the broad topic of recommender
systems, methods and techniques that are used to predict preferences a user has for
certain items. By using recommender systems, businesses can provide only relevant
products to users and therefore expand their total inventory of products [6].
In this paper, we analyze a specific implementation of a recommendation system:
namely, user-based collaborative filtering [9]. To test this implementation, we analyze
an Instacart dataset containing grocery order information from over 200,000 users. We
utilize user-based collaborative filtering to accomplish two tasks: first, we calculate the
similarity of users in order to provide product recommendations, and second, we cal-
culate the similarity of orders to investigate trends and classify them based on their
nutritional value [18]. Finally, we propose a method to improve user-based collaborative
filtering product recommendations by utilizing the order classification method [5].
To accomplish both tasks, we implement utility matrices to map user and order
preferences to items, and similarity matrices to store calculations and locate the closest
users and orders in the dataset. Furthermore, we implement a natural language pro-
cessing algorithm to evaluate the nutritional value of food products by comparing them
to a USDA product dataset [7]. To determine similarity measures between users and
between orders, we utilize the cosine similarity and the Pearson correlation coefficient
equations.
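Both similarity measures follow directly from their definitions; a plain-Python sketch over dense preference vectors (the actual system operates on rows of the utility matrix, and the purchase counts below are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pearson(u, v):
    """Pearson correlation = cosine similarity of the mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

a = [4, 0, 3, 5]   # purchase counts of user A over four products
b = [2, 0, 1, 3]   # purchase counts of user B
print(round(cosine(a, b), 3))   # -> 0.983
print(round(pearson(a, b), 3))  # -> 0.956
```

Pearson's mean-centering makes it insensitive to users who simply buy more of everything, which is why both measures are worth comparing on purchase-count data.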
2 Related Works
In this section, we present various related works of recommendation systems, and other
studies specifically relating to the Instacart dataset. We also provide some examples of
real use cases of recommender systems, as in the case of Netflix.
A Stanford report [4] shows a study of the Instacart dataset and describes the
method of extracting features, the use of market-basket analysis, and logistic regression.
The report describes a model that initially recommends random items to users and
then, given another model, recommends items based on predictions. This report
demonstrates the usage of a recommender method based on a binary classification
problem. Another related method of recommender systems is to use a collaborative
filtering algorithm [9]. A collaborative filtering algorithm is a generic process to obtain
similar and dissimilar users and base recommendations on those classifications [18].
That study, however, uses an item-based collaborative filtering algorithm and concludes
that item-based filtering is much more accurate for sparse networks [16].
Lastly, we see recommender systems in large corporations ranging from Amazon to
Netflix, used to recommend products or movies. In 2007, Netflix [1] created a
challenge to see if anyone could develop a more accurate recommender algorithm given
a subset of their movie-ratings dataset. As we see, recommender systems are being
extensively researched and improved upon and have always been at the forefront of
e-commerce.
3 Datasets
Instacart is an ecommerce grocery delivery service where users can order online from
various grocery stores to be delivered by a personal shopper. Our dataset was provided
freely by Instacart through a Kaggle competition. The dataset contains an ample col-
lection of anonymized users, orders, and product types from Instacart. Specifically, it
contains over three million grocery orders from over 200,000 users.
As shown in Table 1, our dataset is compiled into various files that are related by
identifiers; we also show the number of entries each file has and the number of attri-
butes. The structure of each file acts like a table in a relational database. For example,
orders contain a listing of product identifiers which relate to the products file.
Specifically, each order contains the user associated, a listing of products (in order),
and the date and time that the order was made. Furthermore, we are given information
on whether a product in an order has been previously purchased. The dataset also
provides an extensive listing of departments, aisles, and names for each of the products.
That is, names provided are actual branded products of Instacart with generic categories
(department, aisle).
Lastly, the data does not have any identifying material, nor do we have any sen-
sitive details of any user or order. The dates are given on a relative basis rather than an
absolute basis.
column in M1 is guaranteed to sum to at least one and therefore the sparsity of the
matrix is reduced.
This implementation of user-item preference may overweight purchase quantity relative
to product selection. For instance, a user who purchased 80 strawberries, while their
average purchase count is below five, is characterized almost solely by an apparent
exclusive preference for strawberries. Therefore, we define the
second utility matrix M2 equivalent to M1 , except we enforce an upper limit of six on
the value of x. An example of this normalization is provided in Fig. 2.
Utility matrix M2 prioritizes item selection rather than item quantity when deter-
mining item preference. Ideally, user preference for products would not be defined as
ambiguously as product purchase count. A user may purchase a product less than their
average purchase count and still enjoy the item.
The decision to select the upper limit for M2 as six was arbitrary. The most common
product purchase count for most users was one, whereas outliers include quantities
surpassing 80. Therefore, a limit of six emphasizes product selection by diminishing
the effect of outliers in the dataset.
Finally, to calculate order similarity, we define utility matrix M3 as follows:
for each xij in M3, if order i includes item j, then xij = 1; otherwise xij = 0. M3 is similar in nature to
M1 except the rows represent orders instead of users. An example is given in Fig. 3.
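The three utility matrices described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' implementation; the `purchases` and `orders` structures and all function names are assumptions chosen for demonstration.

```python
def build_m1(purchases, users, items):
    """M1: raw purchase counts per user/item pair."""
    return [[purchases.get((u, i), 0) for i in items] for u in users]

def build_m2(purchases, users, items, cap=6):
    """M2: same as M1 but each entry is capped (the paper uses six)."""
    return [[min(purchases.get((u, i), 0), cap) for i in items] for u in users]

def build_m3(orders, order_ids, items):
    """M3: binary order-item incidence (1 if the order contains the item)."""
    return [[1 if i in orders[o] else 0 for i in items] for o in order_ids]

# A user who bought 80 strawberries is capped at 6 in M2:
purchases = {("u1", "strawberries"): 80, ("u1", "milk"): 2}
m1 = build_m1(purchases, ["u1"], ["strawberries", "milk"])  # [[80, 2]]
m2 = build_m2(purchases, ["u1"], ["strawberries", "milk"])  # [[6, 2]]
```

The cap in `build_m2` is what shifts the matrix from quantity-driven to selection-driven preference.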
To calculate the similarity between Instacart users, we utilize the cosine similarity
function [12]. This equation calculates the cosine between two vectors which can be
used to determine the correlation between them. The cosine similarity between vectors
u and v is as follows:
\[ \mathrm{sim}(u, v) = \cos(\theta) = \frac{u \cdot v}{\|u\| \, \|v\|} \tag{1} \]
Because all the values in each vector are positive (it is impossible to purchase a
negative number of products), the cosine similarity between any two users will range
between 0.0 and 1.0. A cosine similarity of 0.0 means that there exists zero correlation
between the two vectors. Conversely, a similarity of 1.0 means that the two vectors are
equivalent.
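As a minimal sketch (not the authors' code), the cosine similarity defined above can be computed directly:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between equal-length vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat as uncorrelated
    return dot / (norm_u * norm_v)
```

Since utility-matrix rows are non-negative, the result for any two users falls in [0.0, 1.0], as noted above.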
To calculate the similarity between Instacart orders, we utilize another correlation
calculation known as the Pearson correlation coefficient [8]. The correlation coefficient
r between two vectors X and Y is defined as follows:
\[ r = \frac{\sum xy - \frac{\sum x \sum y}{N}}{\sqrt{\left( \sum x^2 - \frac{(\sum x)^2}{N} \right) \left( \sum y^2 - \frac{(\sum y)^2}{N} \right)}} \tag{2} \]
Similar to the cosine similarity, the correlation coefficient r lies between −1.0 and
1.0: values near 0.0 mean the vectors are uncorrelated, while values near 1.0 mean they
are similar. Note that the Pearson correlation coefficient is equivalent to the cosine
similarity except that it first normalizes each vector by subtracting its mean from each value.
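A corresponding sketch of the correlation coefficient r defined above, again illustrative rather than the authors' implementation:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between equal-length vectors x and y."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = sxy - sx * sy / n
    den = math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return num / den if den else 0.0  # constant vector: undefined, report 0.0
```

Each term mirrors the formula: the numerator is the co-moment, the denominator the product of the two (un-normalized) standard deviations.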
Using this abstract similarity matrix definition, we compute a similarity matrix for
each of the three utility matrices defined in section A: S1 for M1 , S2 for M2 , and S3 for M3 .
138 Y. Bodike et al.
The most similar users are calculated using (1) as given in section B. Of the most
similar users, we recommend the most popular products that they purchased. For
example, Table 2 displays a recommended items list for the department “pantry.” Each
item in the list is found in a previous order of the most similar users and is among the
most popular items from the department.
For a new product purchase, we check whether the product appears in the
recommendation list. If it does, we classify the recommendation as a success. The
recommendation list may therefore include otherwise unfavorable items, but as long as
the user purchases an item from the list, the recommendation is considered successful,
which makes intuitive sense [13].
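The recommend-then-check procedure above might be sketched as follows; the `sims` and `user_orders` structures and all names are hypothetical, chosen only to illustrate the idea:

```python
from collections import Counter

def recommend(user, sims, user_orders, top_users=5, top_items=10):
    """Recommend the most popular products among a user's nearest neighbours.

    sims[user] maps other users to similarity scores;
    user_orders maps a user to the list of products they purchased.
    """
    neighbours = sorted(sims[user], key=sims[user].get, reverse=True)[:top_users]
    counts = Counter(p for n in neighbours for p in user_orders[n])
    return [p for p, _ in counts.most_common(top_items)]

def is_success(recommendations, new_purchase):
    """A recommendation counts as a success if the purchased item is listed."""
    return new_purchase in recommendations
```

A hit on any listed item counts as success, matching the evaluation criterion described in the text.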
In this section, we provide a generalized explanation of our dataset and various data
visualizations that were used to gain insight on the Instacart dataset.
The dataset contains between four and 100 orders for each user, with the sequence
of products purchased in each order. It also provides the day of the week and hour of
the day each order was placed, along with a relative measure of time between orders.
The exploratory analysis performed on the dataset revealed the following.
Fig. 5. Heat map for frequency of day in a week vs hour of the day
Figure 5 is a heat map of order frequency by hour of the day versus day of the week.
The maximum order counts, above 50,000, occur between 1 PM and 3 PM on Sunday
(labeled as 0), while on Monday the highest frequency is observed from 10 AM to
11 AM. There were almost no orders from 2 AM to 6 AM on any day. Surprisingly,
orders peak on Monday morning, even though Monday is generally a working day.
On the other hand, Saturday, a day off for most, has an order frequency similar to
the working days. Overall, most orders are placed between 9 AM and 4 PM on any
given day.
Fig. 6. Bar chart showing frequency distribution by days since prior order
Figure 6 is a bar chart showing the frequency distribution of orders by days since
the prior order. Order frequency spikes at 7 days, with smaller peaks at days 14, 21,
and 28, indicating that order frequency increases every seven days; presumably, users
prefer to order on a specific day of the week. Finally, there is a large peak at the
end of the range, indicating that most users make purchases on a monthly rather than
a weekly or bi-weekly basis.
Figure 7 is a bar plot comparing order frequency for the top 5 aisles. Fresh fruits
and fresh vegetables have high order frequencies of around 3.5 million. Based on their
nutritional value, fresh fruits and fresh vegetables are considered healthy foods,
whereas packaged cheese (with an order frequency of about 1 million) is considered
unhealthy. From this, we can conclude that healthy food items are exceptionally
common and will appear in a
A Novel Recommender System for Healthy Grocery Shopping 141
majority of orders, even when an order also includes unhealthy products. The sparsity
of the utility matrices M1 and M2 is best illustrated by the figure above: 11% of the
users had approximately five total orders, with the probability rapidly decreasing as
the number of orders increases.
Figure 8 presents a partial histogram of the number of orders per user; order counts
exceeding 50 exist but are exceptionally rare. Figure 9 is a heat map of the
correlation matrix for different attributes of the datasets. We found a positive
correlation of 0.25 between the reordered and order_number attributes: the later an
order occurs in a user's order history, the more likely the user is to reorder a
previous product, so as order_number increases, reordered is more likely to equal one.
Furthermore, there is a negative correlation of −0.36 between
Fig. 9. Correlation matrix for various attributes merged from the datasets
In this section, we report the results obtained with the implementations described in
the methods section. We show the results of user-based collaborative filtering in
providing recommendations. Additionally, we report the trend consistency of healthy
and unhealthy orders and propose an improvement to our recommendation algorithm.
Acknowledgment. This research is partially supported by a grant from Amazon Web Services.
References
1. Bell, R.M., Koren, Y.: Lessons from the Netflix prize challenge. SiGKDD Explor. 9, 75–79
(2007)
2. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing with the
Natural Language Toolkit. O’Reilly Media, Sebastopol (2009)
3. Chen, J., Huang, Y., Cohen, I.: Pearson correlation coefficient. In: Springer Topics in Signal
Processing, vol. 2, pp. 1–4. Springer, Heidelberg (2009)
4. Flores-Lopez, A., Perry, S., Bhargava, P.: What’s for Dinner? Recommendations in online
grocery shopping. Final report, Stanford University (2017)
5. Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J.: An algorithmic framework for
performing collaborative filtering, pp. 230–237. Association for Computing Machinery
(1999)
6. Leskovec, J., Rajaraman, A., Ullman, J.: Mining of Massive Datasets. Stanford University
(2015)
7. National Agricultural Library: USDA National Nutrient Database for Standard Reference,
April 2019
8. Newman, M.E.J.: Networks: An Introduction. Oxford University Press, Oxford (2010)
9. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering
recommendation algorithms, pp. 285–295 (2001)
10. Souci, S.W., Fachmann, W., Kraut, H.: Food Composition and Nutrition Tables. Medpharm
HmbH Scientific Publishers, Stuttgart (2007)
11. Xue, G.R., Lin, C., Yang, Q., Xi, W., Zeng, H.J., Yu, Y., Chen, Z.: Scalable collaborative
filtering using cluster-based smoothing. In: Proceedings of the 28th Annual Interna-
tional ACM SIGIR Conference on Research and Development in Information Retrieval,
pp. 114–121. ACM, August 2005
12. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for
collaborative filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in
Artificial Intelligence, pp. 43–52. Morgan Kaufmann Publishers Inc., July 1998
13. Drineas, P., Kerenidis, I., Raghavan, P.: Competitive recommendation systems. In:
Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing,
pp. 82–90. ACM, May 2002
14. Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative
filtering. IEEE Internet Comput. 7(1), 76–80 (2003)
15. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT
Press, Cambridge (2009)
16. Li, D., Zhao, G., Wang, Z., Ma, W., Liu, Y.: A method of purchase prediction based on user
behaviour log. In: Proceedings of the IEEE International Conference on Data Mining
Workshop, Atlantic City, NJ, USA, 14–17 November 2016
17. Ye, F., Zhang, H.: A collaborative filtering recommendation based on users’ interest and
correlation of items. In: Proceedings of the 2016 International Conference on Audio,
Language, and Image Processing (ICALIP), Shanghai, China, 11–12 July 2016
18. Polatidis, N., Georgiadis, C.K.: A dynamic multi-level collaborative filtering method for
improved recommendations. Comput. Stand. Interfaces 51, 14–21 (2017)
Using Topic Modelling to Correlate
a Research Institution’s Outputs
with Its Goals
1 Introduction
Topic modelling examines the structure of text within a document or body of
text in topically coherent portions. A topic is a probability distribution over a
fixed vocabulary [1]. Given a number of documents, which serve as the vocabulary,
a number of topics can be derived and used to associate documents that share the
probability distribution of text with a particular topic. As a result, documents
with similar contents will be associated with the same topic distributions. Several
researchers have investigated topic similarity, topic modeling, and document
similarity based on topics. Various methods have been proposed by [2–7], where the
main focus is to determine documents similar to a given document or group of
documents based on the
2 Related Work
Topic modeling can be applied to various document-related problems. [11] proposed
a collaborative Web recommendation scheme based on Latent Dirichlet Allocation
(LDA). Their model associates user sessions with the topics of each web page
visited. They used a variational probability inference technique to estimate the
associations between user sessions and multiple topics and between topics and the
Web page space. Their work discovered the distribution of a user's navigational
preferences over a topic space, and their results showed that their approach
achieved better recommendations than prior techniques.
The authors in [12] proposed a method to improve standard collaborative-filtering-based
recommendation systems using contextual data available in the form of text
descriptions of items. The features of their model were derived from latent features
of users and items through topic modeling. Their method provided a hybrid similarity
score to refine neighbourhood formation, which helped mitigate sparsity because it
allowed the similarity between users to be calculated even with no overlap in their
item ratings.
Topic Modelling for Output and Goals Correlation 149
compare these with the topic clusters for SU. Since the target in this case is
essentially random, it provides a reasonable lower bound. Therefore, as the metric
value obtained for TU moves from the lower bound toward the upper bound, we get an
idea of how closely the TU research output matches the SU research output. One
additional application of this research is to compare the publications from one of
the universities that SU comprises and determine how close the target set is to it.
This comparison between the universities can also be done over a period of time to
reveal trends in research output closeness. Naturally this approach can be extended
to other problems. In particular, the initial objective of this work was to compare
the research performed at a particular university with the research goals of the
university's country. However, we were unable to obtain sufficient information on
the research goals of the country.
4 Topic Modelling
6 Numerical Results
At each interval, the first step involved using LDA to generate topic clusters for
the SU corpus. We started with 3 topics and increased the number of topics by
factors of 2. At each step the coherence value was computed and compared to the
previous value. This was repeated until the coherence value did not increase for
four consecutive iterations. The number of topics that produced the highest
coherence value was used as the topic distribution for the abstracts. We refer to
these as the base topics, T.
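The stopping rule described above can be sketched generically. The `coherence` callable is an assumption standing in for an actual coherence computation (in practice, e.g., a library coherence model over the LDA output), and the fixed step size is an illustrative simplification:

```python
def select_num_topics(coherence, start=3, step=2, patience=4):
    """Increase the topic count until the coherence score fails to improve
    for `patience` consecutive steps; return the best count seen.

    `coherence` is a caller-supplied function k -> coherence score.
    """
    best_k, best_score = start, coherence(start)
    k, stale = start, 0
    while stale < patience:
        k += step
        score = coherence(k)
        if score > best_score:
            best_k, best_score = k, score
            stale = 0  # reset: we just improved
        else:
            stale += 1
    return best_k
```

The returned count is then used as the topic distribution for the abstracts, per the procedure in the text.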
Given the set of topics and the words for each topic, we next compared the
documents for TU, RU, SU1, and SU to determine their closeness to the topics
identified from the SU corpus. This is done as follows. Consider the abstracts
for TU. For each topic in T, we determine the sum of the probabilities of all
words that belong to both the topic and the abstract. This is repeated for all
topics, and we then choose the topic with the highest probability. This represents
the topic whose words are common with the abstract and have high probabilities of
belonging to that topic; hence it represents a degree of closeness between the
abstract and the topic. Finally, we take the average of this highest topic
probability over all abstracts in the TU dataset. This average is the metric used
as the degree of closeness, denoted by CTU, CRU, and CSU for the sets TU, RU, and
SU respectively. For each set of publications from the individual universities
that make up SU, we refer to the closeness as CSUi for university i.
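A sketch of this closeness computation under assumed data structures (topics as word-to-probability dicts, abstracts as word sets); this is illustrative, not the authors' code:

```python
def closeness(abstracts, topics):
    """Average, over abstracts, of the best per-topic overlap score.

    `topics`: list of dicts mapping word -> probability within the topic.
    `abstracts`: list of sets of words appearing in each abstract.
    """
    scores = []
    for words in abstracts:
        # For each topic, sum probabilities of words shared with the abstract,
        # then keep the highest-scoring topic for this abstract.
        best = max(sum(p for w, p in topic.items() if w in words)
                   for topic in topics)
        scores.append(best)
    return sum(scores) / len(scores)
```

Applied to the TU, RU, and SU abstract sets, this yields the quantities CTU, CRU, and CSU described above.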
Note that since the abstracts in SU are the ones used to generate T, there should
be significant overlap between abstract words and the words in a specific topic,
so this value forms the ideal case. For RU there is some small probability of
overlap, but in general this is low, so RU forms the lower bound. Our aim is to
determine how closely the TU set of abstracts matches the ideal set SU. For the
publications from the universities that make up SU, the closeness factor indicates
which university performs best among them. We therefore define the following metric:
\[ \rho_{TU} = \frac{C_{TU} - C_{RU}}{C_{SU} - C_{RU}} \tag{1} \]
Note that this metric roughly lies between 0 and 1 with ρRU = 0 and ρSU = 1.
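This metric amounts to a simple normalization of the closeness values, sketched here for clarity (the function name is illustrative):

```python
def rho(c_tu, c_ru, c_su):
    """Normalized closeness: 0 at the random lower bound (RU),
    1 at the ideal upper bound (SU)."""
    return (c_tu - c_ru) / (c_su - c_ru)
```

Plugging in the lower bound (c_tu = c_ru) gives 0, and the upper bound (c_tu = c_su) gives 1, matching the stated range.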
Using the values that were computed for the datasets described above we
computed the various values for each of the respective sets of abstracts over
time and obtained the following results illustrated by Fig. 1. As expected, ρSU1
is very high since we are comparing the documents from a university that was
used to generate the topics. Furthermore this value does not vary significantly
over time. However, when we look at TU we see that its value is significantly
less. We also can see how this closeness varies over time. It appears that from
2007 onward the research performed at TU gradually improved and seems to be
approaching that of the best university in the set that made up SU.
Next we look at the contents of the generated clusters. In Table 1 we show
the top 15 words for each of the top topic clusters, that is, the top topic cluster
[Fig. 1: plot of ρ (vertical axis, 0 to 0.8) for the SU1 and TU sets over the periods 1997–2007, 2007–2010, 2010–2014, and 2014–2018]
that contained the most abstracts, over the period 1997 to 2018. We also look
at the respective top cluster words for the TU set. Here we can clearly see the
change in research focus over time. For each period, the number of topics that
produced the highest topic coherence value was: (1997–2007, 25), (2007–2010, 25),
(2010–2014, 20), (2014–2018, 15).
7 Discussion
For each interval, the number of publications for each set, SU1 and SU, varied,
and this resulted in various ranges for the number of topics. For example, the
period 1997–2007 had 25 topics producing the highest topic coherence value, while
2014–2018 had 15. Table 1 shows the associated top topics over the period
1997–2018, with each topic represented as a list of words. Based on these words,
there was no consistent set of top topics over the years, and we can see a change
in research focus.
A by-product of our research was the ability to compare the closeness between a
university that makes up the SU set and the target set. Because these datasets
were used to generate the topic models, it was expected that their closeness
values would be much higher than that of the TU dataset. This was indeed so, as
illustrated by Fig. 1. The SU1 set had a high value compared to the TU set,
implying that the top topics in Table 1 are closely related to SU1 but have a
weaker relationship with TU.
Our initial aim in this research was to compare TU to SU and determine the
closeness of the research between the two sets. Our metric lies between 0 and 1,
154 N. Chamansingh and P. Hosein
Table 1. Top 15 words for each top topic clusters from 1997–2018 SU & TU
and the results show that over the years TU has fluctuated between higher and
lower values relative to SU1. In the period 1997–2007, TU's value was closer to
SU1; it then dropped for the next period and rose over the subsequent intervals.
This may imply that TU takes a while to change research focus and direction; that
is, in the periods with lower values, its research areas were becoming outdated.
This is further illustrated in Table 1, where, based on the top topic words, there
appear to be overlapping topics between SU and TU in 2014–2018.
Over the years there is a clear distinction between the top topics from
the period 1997–2007 and the period 2014–2018. These consisted of words
like “system coupling tool support wireless communication repudiation atomic
language query exchange data application programming language peer peer
speech recognition repudiation protocol” for the period 1997–2007 and words
such as “feature data image system based learning real world model propose
approach deep learning method first large scale performance” for the period
2014–2018.
To determine the sensitivity of our approach, we varied the number of abstracts
used from TU and compared them to the entire SU set. We sorted the TU set by year,
took the first 100 abstracts, and then added publications 100 at a time until we
reached the total number of TU publications, calculating the closeness at each
iteration. The results were: (100, 0.53), (200, 0.63), (300, 0.68), (398, 0.67).
From the results we find that the closeness factor
does not vary much with the number of abstracts as long as at least 200 are used.
In our numerical results we had to use 100 because of the dataset size, but our
objective there was relative rather than absolute performance.
References
1. Lin, L., Tang, L., Dong, W., Yao, S., Zhou, W.: An overview of topic modeling
and its current applications in bioinformatics. SpringerPlus 5(1), 1608 (2016)
2. Chen, Y., Bordes, J.-B., Filliat, D.: An experimental comparison between NMF
and LDA for active cross-situational object-word learning. In: 2016 Joint IEEE
International Conference on Development and Learning and Epigenetic Robotics
(ICDL-EpiRob), pp. 217–222. IEEE (2016)
3. Purushotham, S., Liu, Y., Kuo, C.-C.J.: Collaborative topic regression with social
matrix factorization for recommendation systems. arXiv preprint arXiv:1206.4684
(2012)
4. Agarwal, D., Chen, B.-C.: fLDA: matrix factorization through latent dirichlet allo-
cation. In: Proceedings of the Third ACM International Conference on Web Search
and Data Mining, pp. 91–100. ACM (2010)
5. Hall, D., Jurafsky, D., Manning, C.D.: Studying the history of ideas using topic
models. In: Proceedings of the Conference on Empirical Methods in Natural Lan-
guage Processing, pp. 363–371. Association for Computational Linguistics (2008)
6. Jacobi, C., van Atteveldt, W., Welbers, K.: Quantitative analysis of large amounts
of journalistic texts using topic modelling. Digit. J. 4(1), 89–106 (2016)
7. Wang, C., Blei, D.M.: Collaborative topic modeling for recommending scientific
articles. In: Proceedings of the 17th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 448–456. ACM (2011)
8. Liu, Y., Niculescu-Mizil, A., Gryc, W.: Topic-link LDA: joint models of topic and
author community. In: Proceedings of the 26th Annual International Conference
on Machine Learning, pp. 665–672. ACM (2009)
9. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Nat. Acad. Sci.
101(suppl 1), 5228–5235 (2004)
10. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for
authors and documents. In: Proceedings of the 20th Conference on Uncertainty in
Artificial Intelligence, pp. 487–494. AUAI Press (2004)
11. Xu, G., Zhang, Y., Yi, X.: Modelling user behaviour for web recommendation using
LDA model. In: IEEE/WIC/ACM International Conference on Web Intelligence
and Intelligent Agent Technology, WI-IAT 2008, vol. 3, pp. 529–532. IEEE (2008)
12. Wilson, J., Chaudhury, S., Lall, B.: Improving collaborative filtering based rec-
ommenders using topic modelling. In: Proceedings of the 2014 IEEE/WIC/ACM
International Joint Conferences on Web Intelligence (WI) and Intelligent Agent
Technologies (IAT), vol. 01, pp. 340–346. IEEE Computer Society (2014)
13. Kuhn, T.S.: The Structure of Scientific Revolutions, vol. 2. University of Chicago
Press, Chicago (1963)
14. De Smet, W., Moens, M.-F.: Cross-language linking of news stories on the web
using interlingual topic modelling. In: Proceedings of the 2nd ACM Workshop on
Social Web Search and Mining, pp. 57–64. ACM (2009)
15. Nikolenko, S.I., Koltcov, S., Koltsova, O.: Topic modelling for qualitative studies.
J. Inf. Sci. 43(1), 88–102 (2017)
16. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
17. Stevens, K., Kegelmeyer, P., Andrzejewski, D., Buttler, D.: Exploring topic coher-
ence over many models and many topics. In: Proceedings of the 2012 Joint Confer-
ence on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning, pp. 952–961. Association for Computational Linguis-
tics (2012)
18. Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic
coherence. In: Human Language Technologies: The 2010 Annual Conference of the
North American Chapter of the Association for Computational Linguistics, pp.
100–108. Association for Computational Linguistics (2010)
19. Mimno, D., Wallach, H.M., Talley, E., Leenders, M., McCallum, A.: Optimizing
semantic coherence in topic models. In: Proceedings of the Conference on Empirical
Methods in Natural Language Processing, pp. 262–272. Association for Computa-
tional Linguistics (2011)
20. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence mea-
sures. In: Proceedings of the Eighth ACM International Conference on Web Search
and Data Mining, pp. 399–408. ACM (2015)
Long Period Re-identification Approach
to Improving the Quality of Education:
A Preliminary Study
Abstract. Early school leaving is one of the most frequently mentioned reasons
for social exclusion later in life. In order to reduce the risk of early school
leaving, it is necessary to automate the process of entering unjustified lesson
delays into the school management system. Re-identifying a person (Re-ID) is a
complex automated process; most studies analyze descriptors of clothing and
appearance, which are intended for short-period Re-ID. In contrast, there is
little research on the real-time long-term Re-ID process, where images or videos
are taken at intervals of several days or months in an uncontrolled environment.
In this case, descriptors characterizing a person's biometric identity based on
unique features, such as a facial digital image, are required. The objective of
this research is to develop a real-time long-term person re-identification
approach for the accounting of non-attended lessons in educational institutions.
The proposed Re-ID mechanism includes face identification and a new method that
uses multiple face etalon versions and multiple descriptor versions for a single
person. This allows long-term Re-ID of a person in different clothing and
appearance, from different camera angles.
1 Introduction
Early school leaving is one of the most frequently mentioned reasons for social
exclusion later in life [1], and one of the known contributing factors is regular
non-attendance of school classes by pupils. According to the regulations of the
Latvian Cabinet of Ministers, the head of the educational institution determines
the order of registration of pupils' attendance or non-attendance at the school
every day. For example, pupils' delays are recorded in the study e-journal by a
class teacher; every Friday the class teacher summarizes the delays for the week,
notes the reason for each delay, and informs the pupils' parents about unjustified
delays. Real-time registration of class delays is not assumed in existing
procedures, and as a result there is a risk to data quality. Because of the manual
data input, errors can be made: a teacher who substitutes for another teacher
during classes is not familiar with all pupils and is guided by the total number
of pupils, or at the beginning of the school term a teacher is not yet familiar
with all the pupils in his/her
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 157–168, 2020.
https://doi.org/10.1007/978-3-030-39442-4_14
158 I. Arhipova et al.
lessons. Checking and marking pupils' lesson attendance should be done during the
lesson, which takes extra time away from it. Problems with pupil registration can
arise when a pupil comes to the lesson late or leaves the classroom during the
lesson for various reasons, or when the pupil is on the school territory but not
in the room where the lesson takes place.
According to the Law on Education in Latvia, the head of an educational
institution is responsible for security at a secondary school. The set of
measures implemented by the heads of educational institutions and municipalities
may differ, considering the requirements for the safety of pupils. Requirements
are created based on the potential hazards at each educational institution;
threats mainly depend on the location of the educational institution within its
social environment. For example, unauthorized persons arriving at a secondary
school must report to a duty attendant, stating their name, surname, the purpose
of the visit, and the person they have come to see. Pupils are forbidden to bring
objects that are not necessary for study work. Registration of personal identity
data is not automated, which affects pupils' safety through problems such as the
inability to:
• indicate the exact number of persons and maintain records during evacuation or
emergency situations;
• identify pupils who are in the educational institution but not attending classes;
• control pupils staying in the yard of the educational institution without
permission;
• control pupils' behaviour;
• control unauthorized objects brought by the pupils;
• control and register the stay of unauthorized persons on the territory of the
educational institution.
In order to reduce the risk of early school leaving, as well as to increase
pupils' safety, facilitate the administrative work of the educational institution,
and promote the reliability and timeliness of recording unjustified delays, it is
necessary to automate the process of entering unjustified delays into the school
management system.
Re-identifying a person is an essential aspect of establishing a person's identity
and location by retrieving data from multiple devices. Re-ID is the process of
matching images captured by multiple devices, used to determine whether digital
images from multiple devices belong to the same person. Person re-identification
is a difficult automated process for several reasons, such as a person's differing
appearance as captured by different devices in different locations. Short-period
Re-ID establishes correspondence between images of a person taken from different
devices within a few minutes, whereas long-period Re-ID deals with images taken
several months apart [2].
The objective of this research is to develop a real-time long-term person
re-identification (Re-ID) approach for recording non-attended lessons in educational
institutions, increasing pupils' safety, and reducing administrative burdens.
Proposed tasks and solutions for improving the quality of education are given in
Table 1.
Long Period Re-identification Approach to Improving the Quality of Education 159
Table 1. Proposed tasks and solutions for the improvement of the quality of education.

# | Task | Solution | Result
1 | Increasing security and safety | Identification of the physical person (face digital image) | Reducing the risks of potential hazards at a school
2 | Improving the quality of the implementation of the education program | Automated input of non-attended lessons' hours into the school management system | Reducing the risk of early school leaving in an educational institution; pupils' parents are informed in real time about non-attended lessons
3 | Reducing administrative burdens | Automated input of lesson attendance data into school management systems | Automated tracking of attendance at lessons
Descriptors describing a person's identity are obtained from several devices with
different technical parameters, and these differences make the Re-ID process more
difficult. If two images are taken a few minutes or hours apart, one can assume that
the person's visual appearance, for example clothing, will be the same; such a
re-identification scenario is called short-period Re-ID. If images or videos are taken
at intervals of several days or months, the re-identification process is long-period
Re-ID [2]. The distribution of images over a long period of time is one of the
complexity factors of Re-ID.
In 2018 alone, SCOPUS indexed 380 articles on Re-ID in computer-science
publications. Most studies in the Re-ID area analyse clothing and appearance
descriptors, which are suited to short-period Re-ID; by contrast, there is little
research on the long-period Re-ID process. If a person's re-identification is intended
for a long period of time, descriptors characterizing a person's biometric identity
based on unique features, such as a facial digital image, are required.
Modern face recognition methods can identify many individuals. However, these
results are obtained with high-quality (high-resolution) data captured in a controlled
environment (lighting, posture, settings). Using biometric data for re-identification
is problematic because Re-ID data are generated in an uncontrolled environment.
Automatic face recognition from low-resolution images, accounting for changes in
posture, age, and lighting, remains an open problem. Thus, the use of biometric
information in Re-ID is theoretically possible but faces practical implementation
problems [2].
Scalability is a key factor for the Re-ID process: the technology must be able to
adapt to changing factors while maintaining performance. The following scalability
issues are topical:
• in real applications the gallery is large and constantly growing, so general methods
based on coexistence ranking are not effective;
• to increase uniqueness, descriptors are large and expensive to obtain, which adds
to the complexity of the Re-ID process;
160 I. Arhipova et al.
• effective data analysis requires a large amount of memory and computing resources;
• automatic video analysis can be simplified by using data-processing devices (smart
cameras) and communication between cameras; however, Re-ID systems that are
intensive in memory and computing resources cannot be scaled simply by working
with low-power processors and narrow-bandwidth transmission channels [2].
Thus, the central problems in practical Re-ID systems are scalability and
computational complexity. Available solutions are limited because the distribution
of images over a long period of time is one of the complexity factors of Re-ID, and
because Re-ID data come from an uncontrolled environment, which makes the use
of biometric data for re-identification problematic.
• light variations;
• makeup, scars, and damage to the face; and
• aesthetic surgeries and aging.
Other constraints relate to hardware, blurry images, low-resolution images, etc. For
each constraint there are various hardware solutions and a diversity of methods that
can be applied [7]. Algorithms for face recognition use various parameters for
identification, such as the length of the jaw line, the depth of the eye sockets, the
width of the nose, the distance between the eyes, and the shape of the cheekbones.
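As a small illustration, such geometric parameters can be collected into a feature vector from facial landmark coordinates. The landmark names and coordinates below are hypothetical; a real system would obtain them from a landmark detector.

```python
import math

def geometric_features(landmarks):
    """Build a simple geometric feature vector from 2D facial landmarks.

    `landmarks` maps hypothetical landmark names to (x, y) pixel coordinates.
    """
    def dist(a, b):
        (ax, ay), (bx, by) = landmarks[a], landmarks[b]
        return math.hypot(ax - bx, ay - by)

    return {
        "inter_eye_distance": dist("left_eye", "right_eye"),
        "nose_width": dist("nose_left", "nose_right"),
        "jaw_line_length": dist("jaw_left", "jaw_right"),
    }

# Hypothetical landmark positions for a single face crop.
face = {
    "left_eye": (30.0, 40.0), "right_eye": (70.0, 40.0),
    "nose_left": (45.0, 60.0), "nose_right": (55.0, 60.0),
    "jaw_left": (15.0, 80.0), "jaw_right": (85.0, 80.0),
}
features = geometric_features(face)
```

Such a vector is only one ingredient; robust systems combine many more measurements and normalize them for pose and scale.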
One of the challenges in face recognition, where the limits of current recognition
methods are reached, relates to rapid changes in the face, such as after an accident,
aesthetic surgery, or facial growth in infants and youngsters. According to a report
from The American Society for Aesthetic Plastic Surgery, in the US alone people
spend over 8 billion US dollars on aesthetic procedures each year, and the most
demanded procedures relating to the facial image are chin augmentation, nose
surgery, and eyelid and ear aesthetic surgeries [8]. South Korea leads the list in the
frequency of aesthetic facial procedures, with 13.1 procedures per 1000 capita.
Studies show that more than 40% of female college students undergo some cosmetic
surgery [9]. These changes can significantly impact face recognition.
Research shows [10] that after facial aesthetic surgery, recognized algorithms such
as Principal Component Analysis, Fisher Discriminant Analysis, Geometric
Features, Local Feature Analysis, Local Binary Patterns, and neural network
architectures cannot be applied successfully, especially if the surgery changes
various markers of the face, e.g., the nose, skin pattern, or chin. Aging is another
factor in face recognition. Observational studies [11] show that people have
personalized aging patterns that depend on factors such as lifestyle, genetics,
environment, and stress level. The authors of [12] conclude that "the variations in
the shape of a face are more prominent while in the later stages of life, texture
variations such as wrinkles and pigmentation are more visible".
There are research papers on estimating a person's age from facial recognition,
applied, for example, to checking whether a person is old enough to buy alcohol.
Results show that age is easiest to predict for the infant and toddler group, where
other factors such as ethnicity do not play a major role. Later age groups pose more
challenges, and face recognition for people above 60 years old is the most
challenging [12].
Solutions for recognizing baby faces, as opposed to adult faces, are available and
work with improved precision [13]. An algorithm has been developed to identify
newborns, toddlers, and pre-school children aged 0–4 by extracting children's
unique features. The proposed algorithm uses a deep learning model that applies
class-based penalties while learning the filters of a convolutional neural network [14].
The authors of [12] identify two aspects of an age-invariant face recognition system:
facial age estimation and age-separated face recognition. It is concluded that the eye
region is one of the most important for age prediction. There have been successful
solutions for determining not only a person's age but also gender and race [15],
which shows the expansion of
3 Results
The Re-ID process excludes the use of appearance descriptors from face
identification [18] and instead includes descriptors of a mathematical formula of the
human face, i.e., a versioned biometric template maintained through a supervision
process with an independent evolution option. The proposed mechanism consists of
M N, where M is an etalon version of the biometric pattern and N is a version of a
combination of descriptors from the biometric template with exact values by which
a person can be identified with high precision (see Fig. 1).
objects. The Viola–Jones performance indicators are high, but its precision rate is
low. SSD offers more configuration options. Therefore, the SSD algorithm will be
included in the further investigation due to its configuration options, even though
YOLO's performance indicators and object-detection accuracy rates are similar.
formula of the face that defines a person's identity. This set of values includes
information from descriptors of face anatomy that are not influenced by descriptors
of appearance.
Thereafter, the biometric pattern is used to identify a person and can be used to
re-identify the person in the long term, because it changes only slightly on an
everyday basis. Identification of a person involves searching the knowledge base for
the most similar biometric pattern among the existing ones. Further research will
determine and define similarity-threshold equivalence classes that describe high
precision, low precision, critical points, etc., to estimate the result of the biometric
pattern comparison. If there is a similar biometric pattern in the knowledge base that
meets the required similarity threshold, the person's identification parameters can be
output for further processing.
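A minimal sketch of the knowledge-base lookup described above, assuming biometric patterns are stored as numeric vectors compared by cosine similarity; the similarity measure, the 0.9 threshold, and the pattern values are illustrative assumptions, since the paper leaves the threshold equivalence classes to further research.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify(pattern, knowledge_base, threshold=0.9):
    """Return (person_id, similarity) of the best-matching biometric
    pattern, or (None, similarity) if no pattern meets the threshold."""
    best_id, best_sim = None, -1.0
    for person_id, etalon in knowledge_base.items():
        sim = cosine_similarity(pattern, etalon)
        if sim > best_sim:
            best_id, best_sim = person_id, sim
    if best_sim >= threshold:
        return best_id, best_sim
    return None, best_sim

# Hypothetical knowledge base of etalon pattern vectors.
kb = {"pupil_A": [1.0, 0.0, 0.2], "pupil_B": [0.0, 1.0, 0.1]}
match, sim = identify([0.9, 0.05, 0.2], kb)
```

If no stored pattern clears the threshold, the query could become a candidate for a new etalon pattern, which is what the unsupervised process below exploits.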
Elaboration of the Knowledge Base by Using Supervised or Unsupervised Learning
Data in the knowledge base and continuous training of the Siamese Neural Network
are the keys to successful long-term person identification and are the main objective
of this research. In order to achieve a high identification rate, it is necessary to fill
the knowledge base with high-quality biometric patterns using supervised learning.
Several high-quality photos taken from different angles (for example, frontal full
face, right or left profile) should be processed through the proposed mechanism in a
supervised manner in order to create an etalon version of the pattern and to link it to
the other etalon pattern versions.
If there is person’s biometric template that consists of the several linked template
versions that are made at one-time, patterns can be blended together or used inde-
pendently during the detection of the best matching pattern. High quality etalons
versions are significant to person’s long-term identification and in unsupervised
learning process.
There is an alternative method of elaborating the data in the knowledge base: an
unsupervised learning process. It starts with an evaluation of the tracking results,
which concludes whether the same person has been identified successfully, or not
identified, in several non-sequential frames. During the unsupervised learning
process (without human interaction), an option to elaborate the existing etalon
pattern version can be identified and proposed if there is a high-quality candidate for
a new etalon pattern or for a version of an existing etalon pattern.
A candidate for an identified person can be selected during the unsupervised
learning process through continuous elaboration of the etalon version of the
biometric pattern, which consists of more than one version. For example, a person is
detected in the video feed and the face is fixed in a crop, which is processed through
the neural network; the quality of the crop is sufficient to build a biometric pattern.
A few seconds later, the face is fixed in another crop, and so on. The object-tracking
algorithm makes it possible to generate pairs and triplets consisting of object
instances detected in several crops, which can be used to train the Siamese Neural
Network. During video streaming, several persons can be detected without a defined
sequence.
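The pair and triplet generation described above can be sketched from tracker output as follows; the track and crop identifiers are hypothetical, and in practice each crop would be a face image fed to the Siamese network.

```python
from itertools import combinations

def make_training_samples(tracks):
    """Generate labelled pairs and (anchor, positive, negative) triplets
    from tracked face crops.

    `tracks` maps a tracker-assigned ID to a list of crop identifiers
    belonging to the same tracked person.
    """
    pairs, triplets = [], []
    ids = list(tracks)
    # Crops from one track form positive pairs (label 1).
    for tid in ids:
        for a, b in combinations(tracks[tid], 2):
            pairs.append((a, b, 1))
    # Crops from different tracks form negative pairs (label 0).
    for ta, tb in combinations(ids, 2):
        for a in tracks[ta]:
            for b in tracks[tb]:
                pairs.append((a, b, 0))
    # Triplets: anchor and positive from one track, negative from another.
    for ta, tb in combinations(ids, 2):
        for anchor, positive in combinations(tracks[ta], 2):
            for negative in tracks[tb]:
                triplets.append((anchor, positive, negative))
    return pairs, triplets

# Hypothetical tracker output: two tracks with their crop identifiers.
tracks = {"track_1": ["crop_a", "crop_b"], "track_2": ["crop_c"]}
pairs, triplets = make_training_samples(tracks)
```

A real pipeline would also filter crops by quality before admitting them as training samples, as the text requires.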
There should be a mechanism that can receive data about compatibility among
persons detected in several crops and blend the information together in order to create
a complete biometric pattern that may consist of one or more etalon pattern versions.
The Siamese Neural Network (a part of the proposed mechanism) should be trained
to take a decision on the basis of crops obtained in the short term. Compatibility
among etalon pattern versions that were not made sequentially at one time can be
detected in an evolutionary process; determining the short-term equivalence classes
is a further field of research. Evolution of the blending of data about several persons
from the tracking algorithm can be done in one or more steps, each of which
elaborates the links between one or more versions. During the biometric pattern
creation process, a search of the knowledge base for the best match can be applied.
Thereafter, it is possible to assign identities to all persons detected in the crops. Each
crop is made at a determined place and time.
4 Conclusions
In order to reduce the risk of early school leaving, increase pupils' safety, facilitate
the administrative work of the educational institution, and improve the reliability
and timeliness of recording unjustified delays, it is proposed to automate the process
of entering unjustified delays into the school management system.
A real-time long-term person re-identification (Re-ID) approach for recording
non-attended lessons in educational institutions has been developed. The authors
propose a mechanism M N, where M is an etalon version of the biometric pattern
and N is a version of a combination of descriptors from the biometric template with
exact values by which a person can be identified with high accuracy. Descriptors of
the biometric pattern are more reliable for a person's long-term Re-ID because they
change evenly over a period of time.
Automatic face recognition in real-world settings must be performed with
lower-resolution images, as the majority of images are taken with consumer-level
equipment; taking into consideration changes in posture, age, and lighting is
therefore one of the main issues that must be addressed.
Rapid changes in an individual's face can significantly decrease the efficiency of
existing face recognition algorithms, especially for infants, children, seniors (above
age 65), and people after facial damage or certain aesthetic surgeries.
The developed mechanism consists of processing the video feed in order to track
and localize the object in a crop, creating the biometric pattern and identifying the
person, and elaborating the knowledge base using supervised and unsupervised
learning processes. The supervised learning process includes manual human
interaction in order to create links between one or more etalon pattern versions.
The unsupervised learning process includes continuous elaboration of the
knowledge base: evaluating the tracking results, searching for the etalon pattern
version with the best match, creating a new etalon pattern or a version of one, and
blending one or more etalon pattern versions into one biometric etalon pattern
version in order to create a complete biometric pattern and to be able to link the
created biometric pattern (which may consist of several etalon pattern versions
created from events fixed in two separate video feeds) to the identified person's data
without human interaction.
Acknowledgments. The research leading to these results has received funding from the project
“Competence Centre of Information and Communication Technologies” of EU Structural funds,
contract No. 1.2.1.1/18/A/003 signed between IT Competence Centre and Central Finance and
Contracting Agency, Research No. 2.1 “Person long-period re-identification (Re-ID) solution to
improve the quality of education”.
References
1. Nevala, A.M., Hawley, J., Stokes, D., Slater, K., Otero, M.S., Santos, R., Duchemin, C.,
Manoudi, A.: Reducing early school leaving in the EU. European Parliament, Brussels
(2011)
2. Bedagkar-Gala, A., Shah, S.K.: A survey of approaches and trends in person re-
identification. Image Vis. Comput. 32(4), 270–286 (2014)
3. Lee, K.W., Sankaran, N., Setlur, S., Napp, N., Govindaraju, V.: Wardrobe model for long
term re-identification and appearance prediction. In: 15th IEEE International Conference on
Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, pp. 1–6
(2018)
4. Ding, Y.: Pedestrian re-identification based on image enhancement and over-fitting solution
strategies. In: 5th International Conference on Systems and Informatics (ICSAI), Nanjing,
China, pp. 745–750. IEEE (2018)
5. Nambiar, A., Bernardino, A.A.: Context-aware method for view-point invariant long-term
re-identification. In: Cláudio, A., et al. (eds.) Computer Vision, Imaging and Computer
Graphics – Theory and Applications, VISIGRAPP 2017. Communications in Computer and
Information Science, vol. 983, pp. 329–351. Springer, Cham (2017)
6. Kamalakumari, J., Muthuraman, V.: Recognizing heterogeneous faces-a study. Int. J. Pure
Appl. Math. 118(8), 661–663 (2018)
7. Hassaballah, M., Aly, S.: Face recognition: challenges, achievements, and future directions.
IET Comput. Vis. 9(4), 614–626 (2015)
8. The American Society for Aesthetic Plastic Surgery. https://www.surgery.org/sites/default/
files/ASAPS-Stats2018_0.pdf. Accessed 01 June 2019
9. Kim, Y.A., Cho Chung, H.I.: Side effect experiences of South Korean women in their
twenties and thirties after facial plastic surgery. Int. J. Women’s Health 10, 309–316 (2018)
10. Singh, R., Vatsa, M., Noore, A.: Effect of plastic surgery on face recognition: a preliminary
study. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Workshops, pp. 72–77 (2009)
11. Mayes, A.E., Murray, P.G., Gunn, D.A., Tomlin, C.C., Catt, S.D., Wen, Y.B., Zhou, L.P.,
Wang, H.Q., Catt, M., Granger, S.P.: Environmental and lifestyle factors associated with
perceived facial age in Chinese women. PLoS ONE 5(12), 1–7 (2010). e15270
12. Yadav, D., Singh, R., Vatsa, M., Noore, A.: Recognizing age-separated face images: humans
and machines. PLoS ONE 9(12), 1–22 (2014). e112234
13. Wen, D., Fang, C., Ding, X., Zhang, T.: Development of recognition engine for baby faces.
In: 20th International Conference on Pattern Recognition, Istanbul, Turkey, pp. 3408–3411
(2010)
14. Siddiqui, S., Vatsa, M., Singh, R.: Face recognition for newborns, toddlers, and pre-school
children: a deep learning approach. In: 24th International Conference on Pattern Recognition
(ICPR), pp. 3156–3161. IEEE (2018)
15. Han, H., Otto, C., Liu, X., Jain, A.K.: Demographic estimation from face images: human vs.
machine performance. IEEE Trans. Pattern Anal. Mach. Intell. 37(6), 1148–1161 (2015)
16. Mandal, B.: Face recognition: perspectives from the real world. In: 14th International
Conference on Control, Automation, Robotics and Vision (ICARCV), pp. 1–5. IEEE (2016)
17. Rothoft, V., Si, J., Jiang, F., Shen, R.: Monitor pupils’ attention by image super-resolution
and anomaly detection. In: International Conference on Computer Systems, Electronics and
Control (ICCSEC), pp. 843–847. IEEE (2017)
18. Hendel, R.K., Starrfelt, R., Gerlach, C.: The good, the bad, and the average: characterizing
the relationship between face and object processing across the face recognition spectrum.
Neuropsychologia 124, 274–284 (2019)
19. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision
Conference (BMVC), pp. 41.1–41.12. BMVA Press (2015)
20. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition
and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, Boston,
pp. 815–823 (2015)
21. Zhuang, N., Zhang, Q., Pan, C., Ni, B., Xu, Y., Yang, X., Zhang, W.: Recognition oriented
facial image quality assessment via deep convolutional neural network. Neurocomputing
358, 109–118 (2019)
22. Heyman, J.: TracTrac: a fast multi-object tracking algorithm for motion estimation. Comput.
Geosci. 128, 11–18 (2019)
A Quantum Annealing-Based Approach
to Extreme Clustering
1 Introduction
Traditionally, clustering approaches have been developed and customized for
tasks where the resultant number of clusters k is not particularly high. In such
cases, algorithms such as k-means++ [1], BIRCH [2], DBSCAN [3], and spectral
clustering produce high-quality solutions in a reasonably short amount of time.
This is because these traditional algorithms scale well with respect to the dataset
cardinality n. However, in most cases, the computational complexity of these
algorithms, in terms of the number of clusters, is either exponential or higher-
order polynomial. Another common issue is that some of the algorithms require
vast amounts of memory.
The demand for clustering algorithms capable of solving problems with larger
values of k is continually increasing. Present-day examples involve deciphering
T. Jaschek and J. S. Oberoi have contributed equally to this research.
c Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 169–189, 2020.
https://doi.org/10.1007/978-3-030-39442-4_15
170 T. Jaschek et al.
the content of billions of web pages by grouping them into millions of labelled
categories [4,5], and identifying similarities among billions of images using nearest-
neighbour detection [6–8]. This domain of clustering, where n and k are both
substantially large, is referred to as extreme clustering [9]. Although there is great
value in perfecting this type of clustering, very little effort towards this end has
been made by the machine learning community. The proposed algorithm is, in
fact, such an effort. Its output is a clustering tree, which can be used to generate
multiple clustering assignments (or “levels”) with varying degrees of accuracy
(i.e., coarseness or fineness) of the approximation. Generating such a tree is
not uncommon for clustering algorithms. Consider, for example, hierarchical
clustering algorithms which generate binary clustering trees. Clustering trees
are useful tools for visualizing real-world data. Our algorithm, the Big Data
Visualization Tool, or BiDViT, provides this functionality.
BiDViT employs a novel approach to clustering problems, which is based on
the maximum weighted independent set (MWIS) problem in a graph induced
by the original dataset and a parameter we call the radius of interest, which
determines a relation of proximity. The use of such a parameter has been suc-
cessfully employed in density-based spatial clustering of applications with noise
(DBSCAN) [3]. The MWIS problem can be transformed into a quadratic uncon-
strained binary optimization (QUBO) problem, the formulation accepted by a
quantum annealer. An alternative way to address the underlying problem is
to use a heuristic algorithm to approximate solutions to the MWIS problem.
Quantum annealing and simulated annealing have been applied in centroid-based
clustering [10,11] and in density-based clustering [12]. However, the approaches
studied are not capable of addressing problems in the extreme clustering domain.
The document is structured as follows. Sect. 2 introduces a novel approach
to clustering problems and proves that, under a separability assumption, the
method identifies the ground truth labels when parameters are selected that are
within the bounds determined by that assumption. Sect. 3 outlines the way in
which the underlying optimization problem can be solved efficiently, and shows
how it can be algorithmically employed in extreme clustering. Runtime and solu-
tion quality values are provided for BiDViT with respect to internal evaluation
schemes such as the Calinski–Harabasz and the Davies–Bouldin scores in Sect. 4.
The results suggest that BiDViT yields clustering assignments of a quality com-
parable to that of assignments generated by common clustering algorithms, yet
does so a full order of magnitude faster.
Fig. 1. Visualization of chunk collapsing (left) and data partitioning (right). Left) A
maximal ε-separated subset (red dots) of a dataset (red dots and blue dots). The circles
have a radius equal to the radius of interest ε. The weights of the red points are updated
according to the number of blue points within a distance of ε. The yellow borders are
a Voronoi partition of the dataset indicating the clustering assignment. Right) Data
partitioning of a dataset along the axes of maximum variance. In this example, there
are s = 5 partitioning steps, resulting in 2^5 = 32 chunks.
For example, the coarsening method can be used for clustering assignments on
finite subsets of Riemannian manifolds with respect to their geodesic distance,
for instance, in clustering GPS data on the surface of the Earth when analyzing
population density. In what follows, we assume that X = {x^(1), . . . , x^(n)} is a
dataset consisting of n d-dimensional data points, equipped with a metric
d : X × X → [0, ∞). Finding an arbitrary ε-dense subset of X does not necessarily yield
a helpful approximation. For example, X itself is always ε-dense in X. However,
enforcing the additional constraint that any two points in the subset S must
be separated by a distance of at least ε yields more-interesting approximations,
often leading to a reduction in the number of data points (one of our primary
objectives). We call such a set ε-separated. Figure 1 shows a point cloud and
an ε-dense, ε-separated subset. The theorem that follows shows that a maximal
ε-separated set S of X is necessarily ε-dense in X. Let B(x, r) denote the open
metric ball with respect to d, with centre x and radius r.
Proof. Note that (i) is equivalent to S being ε-dense in X and that, in combina-
tion with (ii), is equivalent to S being minimal with respect to this property.
To prove (i), let S be a maximal ε-separated subset of X and assume, in con-
tradiction, that S is not ε-dense in X. Then one could find x ∈ X such that
d(x, y) ≥ ε, for every y ∈ S. Hence, S ∪ {x} would be ε-separated, which is
in contradiction to the maximality of S. To prove (ii), we fix a point x ∈ S.
Since S is ε-separated, d(x, y) ≥ ε for any y ∈ S \ {x} and, thus, S \ {x} is not ε-dense
in X. Property (iii) follows from the triangle inequality.
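As a minimal illustration of these notions, a maximal ε-separated subset can be built greedily: scan the dataset and keep each point that is at least ε away from every point kept so far. This sketch ignores the weights used in the optimization later in the paper, and the dataset and ε below are invented for the example; by the theorem above, the result is also ε-dense.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def maximal_eps_separated(points, eps, dist=euclidean):
    """Greedily build a maximal ε-separated subset of `points`.

    Every kept point is at least `eps` from all other kept points, and no
    further point of the dataset can be added, so the subset is maximal
    (and hence ε-dense in the dataset).
    """
    selected = []
    for p in points:
        if all(dist(p, s) >= eps for s in selected):
            selected.append(p)
    return selected

X = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.1), (3.0, 3.0)]
S = maximal_eps_separated(X, eps=0.5)
```

The subset returned depends on the scan order, which is exactly the non-uniqueness the next paragraphs address by maximizing the captured weight.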
Note that a maximal ε-separated subset does not refer to an ε-separated
subset with fewer than or equally as many elements as all other ε-separated
subsets but, rather, to an ε-separated subset that is no longer ε-separated when
a single data point is added. Contrary to Theorem 1, a minimal ε-dense subset
does not need to be ε-separated. Consider the set X = {1, 2, 3, 4} ⊂ R, and
let d be the Euclidean distance on R. Then, S = {2, 3} is 3/2-dense in X but
not 3/2-separated. Also note that an ε-separated subset is not necessarily an ε-
coreset, which is a weighted subset whose weighted k-means cost approximates
the k-means cost of the original set with up to an accuracy of ε [13,14].
In the following, we assume that X is equipped with a weight function
w : X → R_+. We call w_i = w(x^(i)) the weight of x^(i) and gather all weights in a
weight vector w ∈ R_+^n. It will be clear from the context whether we refer to
a weight function or a weight vector. The weight of a set S ⊆ X is given by
ω(S) = Σ_{x∈S} w(x). We have already argued that maximal ε-separated subsets
yield reasonable approximations. However, such subsets are not unique. We are
thus interested in finding an optimal one, that is, one that captures most of the
weight of the original dataset. In other words, we are interested in solving the
optimization problem
maximize_{S ⊆ X} ω(S)   subject to   S is ε-separated.   (P1)
When imposing unit weights, the solution set to (P1) will consist of the
maximal ε-separated subsets of X with a maximum number of elements among
all such subsets. The term “maximal” refers to set inclusion and the “maxi-
mum” refers to set cardinality. Since w(x) > 0 for all x ∈ X, a solution S ∗ to
(P1) will always be a maximal ε-separated subset and, therefore, by Theorem 1,
ε-dense. In Sect. 3.6, we show that this problem is equivalent to solving an MWIS
problem for a weighted graph Gε (X, E ε , w), depending solely on the dataset X,
the Euclidean metric d, and the radius of interest ε. Thus, the computational task
of finding a maximal ε-separated subset of maximum weight is NP-hard [15,16].
Every U ⊂ X gives rise to a clustering assignment C = {C_x}_{x∈U}, where

C_x = {y ∈ X : d(x, y) ≤ d(x′, y) for all x′ ∈ U}.   (1)
Data points that are equidistant to multiple representative points are assigned
to only one of them, uniformly at random. Typically, larger values of ε result
in smaller cardinalities of C. The following corollary summarizes properties of C
when U is ε-separated, and can be readily verified.
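The assignment in (1) can be sketched directly: each point of X goes to its nearest representative in U. For determinism, this sketch breaks ties by taking the first minimizer rather than choosing uniformly at random as in the text.

```python
import math

def assign_clusters(X, U):
    """Assign every point of X to its nearest representative in U,
    following Eq. (1); ties go to the first minimizer for determinism."""
    clusters = {u: [] for u in U}
    for y in X:
        nearest = min(U, key=lambda u: math.dist(u, y))
        clusters[nearest].append(y)
    return clusters

X = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.1, 0.0)]
U = [(0.0, 0.0), (1.0, 0.0)]
clusters = assign_clusters(X, U)
```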
Corollary 1. Let C be the clustering assignment generated from a maximal ε-
separated set S ⊂ X. Then, the following properties are satisfied:
(i) The clusters in C are non-empty and pairwise disjoint.
(ii) The cluster diameter is uniformly bounded by 2ε, i.e., sup_{x∈S} diam(C_x) ≤ 2ε.
(iii) For all x ∈ S, it holds that max_{y∈C_x} d(x, y) < ε.
Notice that these properties are not satisfied by every clustering assignment, for
example, the ones generated by k-means clustering. They are desirable in specific
applications, such as image quantization, where a tight bound on the absolute
approximation error is desired. However, they are undesirable if the ground truth
clusters have diameters larger than 2ε. More details on the clustering assignment
are provided in Sect. 3.
One could argue that prior to identifying a maximum weighted independent
set and using it to generate a clustering assignment, a dataset should be nor-
malized. However, normalization is a transformation that would result in chunks
not being defined by metric balls, but rather by ellipsoids. In particular, such a
transformation would change the metric d. We assume that the metric d already
is the best indicator of proximity. In general, one can apply any homeomorphism
f to a dataset X, apply a clustering algorithm to the set f (X), and obtain a
clustering assignment by applying f⁻¹ to the individual clusters.
A common assumption in the clustering literature is separability—not to be
mistaken with ε-separability—of the dataset with respect to a clustering C. The
dataset X is called separable with respect to a clustering C = {C1 , . . . Ck } if
that is, if the maximum intra-cluster distances are strictly smaller than the
minimum inter-cluster distances. The following theorem shows that, if ε is chosen
correctly, the proposed coarsening method yields the clustering assignment C.
Proof. To simplify notation, the lower and upper bounds of the interval in (3)
are denoted by l and r, respectively. By the separability assumption, this interval
is non-empty. One can see that, for any admissible choice of ε, any two points
from different clusters are ε-separated. Indeed, for x ∈ C and y ∈ C , it holds
that d(x, y) ≥ r ≥ ε. Furthermore, if a point x in a cluster C is selected, then no
other point y in the same cluster can be selected, as d(x, y) ≤ l < ε. Therefore,
every solution S ⊆ X to (P1) is a union of exactly one point from each cluster.
Using the separability of X with respect to C, one can see that the clustering
assignment induced by (1) is coincident with C.
3 The Algorithm
Let X = {x^(1), . . . , x^(n)} ⊂ R^d denote a dataset of n d-dimensional data points.
Note that, mathematically speaking, a dataset is not a set but rather a multiset,
that is, repetitions are allowed. BiDViT consists of two parts: data partitioning
and data coarsening, the latter of which can be further subdivided into chunk
coarsening and chunk collapsing.
the maximum chunk cardinality κ, yielding a binary tree of data chunks. After
s iterations, this leaves us with 2^s chunks P_k^(s) such that X = ⋃_{1≤k≤2^s} P_k^(s),
where the union is disjoint. Figure 1 provides a visualization.
Here, the inner summation of the constraint does not need to run over all indices, due to the symmetry of N^(ε). The matrix form of (P2) is given by maximizing s^T w subject to the constraint s^T N̄^(ε) s = 0, where N̄^(ε) is the upper triangular part of N^(ε) having all zeroes along the diagonal. As explained in Sect. 3.6, (P2) is equivalent to the NP-hard MWIS problem for G_ε = (P, E^ε, w_P), and thus is computationally intractable for large problem sizes. Note that (P2) can be written as the 0–1 integer linear program (ILP)

maximize_{s ∈ {0,1}^n}  Σ_{i=1}^n s_i w_i   subject to   s_i + s_j ≤ 1, for i, j such that N̄^(ε)_{ij} = 1.    (P3)
for any solution S ∗ to (P1) and any output S of the algorithm. Moreover, the
bound in (4) is tight.
The Quantum Method. In contrast to the heuristic method, the QUBO app-
roach provides an actual (i.e., non-approximate) solution to (P2). We reformulate
the problem by transforming the QCQP into a QUBO problem.
Using the Lagrangian penalty method, we incorporate the constraint into
the objective function by adding a penalty term. For a sufficiently large penalty
multiplier λ > 0, the solution set of (P2) is equivalent to that of

maximize_{s ∈ {0,1}^n}  Σ_{i=1}^n s_i w_i − λ Σ_{i=1}^n Σ_{j>i} s_i N^(ε)_{ij} s_j.    (P4)
One can show that, for λ > max_{i=1,...,n} w_i, every solution to (P4) satisfies the separation constraint [23, Thm. 1]. Instead, we use individual penalty terms
A Quantum Annealing-Based Approach to Extreme Clustering 177
λij , as this may lead to a QUBO problem with much smaller coefficients, which
results in improved performance when solving the problem using a quantum
annealer. Expressing (P4) as a minimization, instead of a maximization, problem
and using matrix notation yields the problem
minimize
n
sT Qs, (P5)
s∈{0,1}
(ε)
where Qij = −wi if i = j, Qij = λij if Nij = 1 and i < j, and Qij = 0 other-
wise. Solutions to (P5) can be approximated using heuristics such as simulated
annealing [24], path relinking [25], tabu search [25], and parallel tempering [26].
Before solving (P5), it is advisable to reduce its size and difficulty by making use
of logical implications among the coefficients [27]. For instance, every variable that corresponds to a node with no neighbours can be fixed to one, as such a node is necessarily included in every ε-dense subset.
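As a concrete sanity check, the QUBO matrix Q of (P5) can be assembled directly from the weights and the neighbourhood indicator, with each λ_ij chosen just above max{w_i, w_j}; a brute-force minimizer then confirms on a toy instance that the optimum is an ε-separated subset. The helper names are our own, and the exhaustive search is feasible only for tiny n.

```python
import numpy as np

def build_qubo(w, N):
    """Build the QUBO matrix Q of (P5) from weights w and the strict
    upper-triangular neighbourhood indicator N (N[i, j] = 1 iff i < j
    and points i and j are closer than eps)."""
    n = len(w)
    Q = np.zeros((n, n))
    for i in range(n):
        Q[i, i] = -w[i]                         # reward for selecting point i
        for j in range(i + 1, n):
            if N[i, j]:
                Q[i, j] = max(w[i], w[j]) + 1.0  # penalty lambda_ij
    return Q

def brute_force_min(Q):
    """Exhaustive minimizer of s^T Q s; for tiny sanity checks only."""
    n = Q.shape[0]
    best_s, best_val = None, np.inf
    for bits in range(2 ** n):
        s = np.array([(bits >> k) & 1 for k in range(n)], dtype=float)
        val = s @ Q @ s
        if val < best_val:
            best_s, best_val = s, val
    return best_s

# Three points where pairs (0,1) and (1,2) conflict: the optimum selects
# the two non-adjacent points 0 and 2 rather than the single heavier point 1.
w = np.array([1.0, 1.5, 1.0])
N = np.zeros((3, 3))
N[0, 1] = N[1, 2] = 1
s = brute_force_min(build_qubo(w, N))
```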
The following theorem shows that (P2) is equivalent to (P5) for a suitable choice of λ_{ij}, 1 ≤ i < j ≤ n.
Theorem 4. Let λ_{ij} > max{w_i, w_j} for all 1 ≤ i < j ≤ n. Then, for any solution s ∈ {0, 1}^n to (P5), the corresponding set S ⊆ X is ε-separated. In particular, the solution sets of (P2) and (P5) coincide.
Proof. We generalize the proof of [23, Thm. 1] and show that every solution s to (P5) satisfies the separation constraint Σ_{i=1}^n Σ_{j>i} s_i N^(ε)_{ij} s_j = 0. Suppose, for the sake of contradiction, that this were not the case. Then we could find a solution s and indices k and ℓ such that 1 ≤ k < ℓ ≤ n and s_k = s_ℓ = N^(ε)_{kℓ} = 1. Let e_k denote the k-th standard unit vector, and let v = s − e_k. Then,

v^T Q v = s^T Q s − Σ_{j>k} s_j Q_{kj} − Σ_{i<k} s_i Q_{ik} − Q_{kk}    (5)
        = s^T Q s − Σ_{i≠k} s_i λ_{σ(i,k)} N^(ε)_{ik} + w_k,    (6)

where σ : N² → N², defined by σ(i, k) = (min(i, k), max(i, k)), orders the index pair accordingly. This technicality is necessary, as we defined λ_{ij} only for 1 ≤ i < j ≤ n. As N^(ε)_{kℓ} = s_ℓ = 1, we have Σ_{i≠k} s_i λ_{σ(i,k)} N^(ε)_{ik} ≥ λ_{σ(ℓ,k)} = λ_{kℓ}, and thus

v^T Q v ≤ s^T Q s − λ_{kℓ} + w_k.    (7)

Therefore, as λ_{kℓ} > max{w_k, w_ℓ} ≥ w_k, it holds that v^T Q v < s^T Q s, which contradicts the assumption that s is a solution to (P5).
We now show that the solution sets of (P2) and (P5) coincide. Note that (P2) is equivalent to the optimization problem

minimize_{s ∈ {0,1}^n}  −s^T w   subject to   s^T N̄^(ε) s = 0.    (P6)
178 T. Jaschek et al.
Let s1 and s2 be solutions to (P6) and (P5), respectively. We denote the objective functions by p1(s) = −s^T w and p2(s) = −s^T w + s^T (Λ ◦ N̄^(ε)) s, where Λ is the matrix defined by Λ_{ij} = λ_{ij} for 1 ≤ i < j ≤ n, and zero otherwise, and Λ ◦ N̄^(ε) ∈ R^{n×n} denotes the Hadamard product of the matrices Λ and N̄^(ε), given by element-wise multiplication. Then, as λ_{ij} > max{w_i, w_j} for 1 ≤ i < j ≤ n, by the observation above, both s1 and s2 satisfy the separation constraint. Since s and N̄^(ε) are coordinate-wise non-negative and λ_{ij} > min_{k=1,...,n} w_k > 0 for 1 ≤ i < j ≤ n, it holds that

s^T N̄^(ε) s = 0  ⇔  s^T (Λ ◦ N̄^(ε)) s = 0;    (8)

thus, if s satisfies the separation constraint, then p2(s) = p1(s). Using this observation, and the fact that s1 and s2 minimize p1 and p2, respectively, we have

p1(s1) ≤ p1(s2) = p2(s2) ≤ p2(s1) = p1(s1).    (9)

Hence, the inequalities in (9) must actually be equalities; thus, the solution sets of the two optimization problems coincide.
In practice, to prevent having very large values for the individual weights, one
might wish to add a linear or logarithmic scaling to this weight assignment. In
our experiments, we did not add such a scaling.
Algorithm 2. BiDViT
input : data set X; initial radius ε0 ; maximum chunk cardinality κ; radius increase rate α
output: tree structure that encodes the hierarchical clustering T
1 T ← create_node_list(X)
2 ε ← ε0
3 while length(T ) > 1 do
4 P ← partition(T , κ)
5 T ←∅
6 for P ∈ P do
7 compute neighbourhood matrix N (ε) for P
8 identify representative data points by solving MWIS for P, N^(ε), and w
9 compute Voronoi partition of P with respect to representative points
10 compute centroids of the cells of the Voronoi partition
11 for x ∈ P do
12 ind ← closest_centroid(x, centroids)
13 centroids[ind].weight += x.weight
14 centroids[ind].parents.append(x)
15 end
16 T .append(centroids)
17 end
18 ε ← αε
19 end
20 return T
to a single data point. We call these iterations BiDViT levels. The increase of ε
between BiDViT levels is realized by multiplying ε by a constant factor, denoted
by α and specified by the user. In our implementation we have introduced a
node class that has three attributes: coordinates, weight, and parents. We
initialize BiDViT by creating a node_list containing the nodes corresponding
to the weighted dataset (if no weights are provided, the weights are taken to be the multiplicities of the data points). After each iteration, we remove the
nodes that collapsed into representative nodes from the node_list and keep only
the remaining representative nodes. However, we append the removed nodes to
the parents of the representative node. The final node_list encodes a data tree: it consists of only one node, the root, and we can move upwards in the hierarchy by accessing its parents (and their parents, and so on); see Fig. 2. Two leaves of
the data tree share a label with respect to a specific BiDViT level, say m, if
they have collapsed into centroids which, possibly after multiple iterations, have
collapsed into the same centroid at the m-th level of the tree. For the sake of
reproducibility, we provide pseudocode (see Algorithm 2).
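For concreteness, one coarsening step of Algorithm 2 (representative selection, Voronoi assignment, and centroid collapse) might look as follows. The Node class mirrors the three attributes named above; the greedy highest-weight-first selection is a heuristic stand-in for the MWIS solver, and all names are ours rather than the authors' implementation.

```python
import numpy as np

class Node:
    """Node of the BiDViT hierarchy: coordinates, weight, and parents."""
    def __init__(self, coords, weight=1.0):
        self.coords = np.asarray(coords, dtype=float)
        self.weight = weight
        self.parents = []   # nodes that collapsed into this one

def coarsen_chunk(nodes, eps):
    """One coarsening step on a chunk: pick eps-separated representatives
    (greedy, highest weight first), then collapse every node onto the
    weighted centroid of its Voronoi cell."""
    reps = []
    for node in sorted(nodes, key=lambda v: -v.weight):
        if all(np.linalg.norm(node.coords - r.coords) >= eps for r in reps):
            reps.append(node)
    cells = {id(r): [] for r in reps}
    for node in nodes:
        nearest = min(reps, key=lambda r: np.linalg.norm(node.coords - r.coords))
        cells[id(nearest)].append(node)
    centroids = []
    for r in reps:
        members = cells[id(r)]
        total = sum(v.weight for v in members)
        coords = sum(v.weight * v.coords for v in members) / total
        c = Node(coords, total)
        c.parents = members     # keep the hierarchy for label lookup
        centroids.append(c)
    return centroids

nodes = [Node([0, 0]), Node([0.2, 0]), Node([4, 4]), Node([4.2, 4])]
level1 = coarsen_chunk(nodes, eps=1.0)
```

Iterating this step with a growing ε, on each chunk of the partition, produces the levels of the hierarchy described above.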
It is worth noting that, at each level, instead of proceeding with the identified
representative data points, one can use the cluster centroids, allowing more-
accurate data coarsening and label assignment.
Our analysis shows that every iteration of the heuristic version of BiDViT has a computational complexity of O(dn log(n/κ) + dnκ). Note that κ ≪ n.
The order of complexity of the partitioning procedure is O(dn log(n/κ)). To
see this, note that there are at most log₂(n/κ) partitioning stages, and in the s-th stage we split 2^{s−1} chunks P_i, where i = 1, . . . , 2^{s−1}. Let n_i denote the number of data points in chunk P_i. Finding the dimension of maximum variance has a complexity of O(dn_i), and determining the median of this dimension can be achieved in O(n_i) via the "median of medians" algorithm. Having computed the median, one can construct two chunks of equal size in linear time. Since Σ_{1 ≤ i ≤ 2^{s−1}} n_i = n, each stage is O(dn), and the whole partitioning procedure is O(dn log(n/κ)). Any division of a chunk
is independent of the other chunks at a given stage; thus, this procedure can
benefit from distributed computing.
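A sketch of this partitioning procedure, with NumPy's selection-based median standing in for median-of-medians and hypothetical function names:

```python
import numpy as np

def partition(X, kappa):
    """Recursively split X at the median of its maximum-variance
    dimension until every chunk has at most kappa points, yielding
    the leaves of a binary tree of data chunks."""
    if len(X) <= kappa:
        return [X]
    dim = np.argmax(X.var(axis=0))         # dimension of maximum variance
    med = np.median(X[:, dim])             # linear-time selection internally
    left, right = X[X[:, dim] <= med], X[X[:, dim] > med]
    if len(left) == 0 or len(right) == 0:  # guard against degenerate splits
        return [X]
    return partition(left, kappa) + partition(right, kappa)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
chunks = partition(X, kappa=100)
```

Because each chunk is split independently of its siblings, the recursive calls could also be dispatched to separate workers, as noted above.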
Fig. 2. Dendrogram representing the output tree of BiDViT and the encoding of the
clustering assignment. The original dataset is represented by the leaves, which collapse
into a single centroid after three BiDViT iterations. The first iteration (“BiDViT level
1”) results in four centroids, each corresponding to a cluster consisting of the nodes
that collapsed into it. At the next iteration, the algorithm merges the clusters of the
centroids. For example, c1,3 and c1,4 are merged into c2,3 at the next level.
constraint mentioned earlier, where two vertices are adjacent whenever they are
less than a distance of ε apart. The MWIS problem can be expressed as

maximize_{S ⊆ V}  Σ_{v ∈ S} w_v   subject to   S being an independent set of G,

and is NP-complete for a general weighted graph [16], yet, for specific graphs,
there exist polynomial-time algorithms [28,29]. Note that the QUBO formulation
of the MWIS problem in [23,30] is related to the one in (P5).
If all weights are positive, a maximum weighted independent set is necessarily
a maximal independent set. A maximal independent set is a dominating set, that
is, a subset S of V such that every v ∈ V \ S is adjacent to some w ∈ S. This
corresponds to our observation that every maximal ε-separated subset is ε-dense.
4 Results
The datasets used to demonstrate the efficiency and robustness of the proposed
approach are the MNIST dataset of handwritten digits [31], a two-dimensional
version of MNIST obtained by using t-SNE [32], two synthetic grid datasets, and
a dataset called Covertype containing data on forests in Colorado [33]. The synthetic grid datasets are the unions of 100 samples (in the 2D case) and 1000 samples (in the 3D case) drawn from N(μ_ij, σ²) with means μ_ij = (10i + 5, 10j + 5) and a variance of σ² = 4 for 0 ≤ i, j ≤ 9 in the 2D case, and the natural extension
in the 3D case. Dataset statistics are provided in Table 1. In the following sec-
tions, our technical experiments are explained in detail; a practical application
of BiDViT for image quantization is illustrated in Fig. 7. All experiments were
performed using a 2.5 GHz Intel Core i7 processor and 16 GB of RAM.
the ground truth label assignment for the dataset. Such schemes must be viewed
as heuristic methods: their optimal values do not guarantee optimal clusters
but provide a reasonable measure of clustering quality. Detailed analyses have
been conducted on the advantages and shortcomings of internal clustering mea-
sures [36,37]. In the extreme clustering scenario, where the objective is to obtain
an accurate approximation of the entire dataset instead of categorizing its ele-
ments, no true labels are given and thus external evaluation schemes (ones based
on the distance to a ground truth clustering assignment) do not qualify as success
measures.
Let C1 , . . . , Cnc denote a total of nc detected clusters within a dataset X
with n data points. The Calinski–Harabasz score SCH of a clustering is defined
as a weighted ratio of the inter-cluster squared deviations to the sum of the
intra-cluster squared deviations. More precisely, S_CH is given by

S_CH(C_1, . . . , C_{n_c}) = ((n − n_c)/(n_c − 1)) · Σ_{k=1}^{n_c} |C_k| ‖c_k − c‖²₂ / Σ_{k=1}^{n_c} Σ_{x ∈ C_k} ‖x − c_k‖²₂,    (11)

where c_k, for k = 1, . . . , n_c, are the cluster centroids, and c is their mean. High
values of SCH are indicative of a high clustering quality. The Davies–Bouldin
score SDB is the average maximum value of the ratios of the pairwise sums of
the intra-cluster deviation to the inter-cluster deviation. The score is defined as
S_DB(C_1, . . . , C_{n_c}) = (1/n_c) Σ_{k=1}^{n_c} max_{j ≠ k} (S_k + S_j) / ‖c_k − c_j‖₂,    (12)

where S_i = Σ_{x ∈ C_i} ‖x − c_i‖₂ / |C_i|. Low values of S_DB indicate accurate clustering.
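Both scores can be computed directly from the definitions in Eqs. (11) and (12). The NumPy sketch below uses our own helper names and the standard (n − n_c)/(n_c − 1) weighting for S_CH:

```python
import numpy as np

def calinski_harabasz(X, labels):
    """S_CH: weighted ratio of between-cluster to within-cluster squared
    deviations. Higher values indicate better clustering."""
    n, clusters = len(X), np.unique(labels)
    nc = len(clusters)
    c = X.mean(axis=0)
    centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])
    between = sum((labels == k).sum() * np.sum((ck - c) ** 2)
                  for k, ck in zip(clusters, centroids))
    within = sum(np.sum((X[labels == k] - ck) ** 2)
                 for k, ck in zip(clusters, centroids))
    return (n - nc) / (nc - 1) * between / within

def davies_bouldin(X, labels):
    """S_DB: mean over clusters of the worst ratio of summed intra-cluster
    deviations to centroid distance. Lower values indicate better clustering."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])
    S = np.array([np.mean(np.linalg.norm(X[labels == k] - ck, axis=1))
                  for k, ck in zip(clusters, centroids)])
    nc = len(clusters)
    ratios = [max((S[k] + S[j]) / np.linalg.norm(centroids[k] - centroids[j])
                  for j in range(nc) if j != k) for k in range(nc)]
    return float(np.mean(ratios))

# Two tight, well-separated clusters: S_CH is large and S_DB is small.
X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
labels = np.array([0, 0, 1, 1])
ch = calinski_harabasz(X, labels)
db = davies_bouldin(X, labels)
```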
Figure 4 shows SCH and SDB of clustering assignments obtained with BiD-
ViT and Mini Batch k-means clustering [38] for different values of k on the
Covertype dataset. Due to their high computational complexity with respect to
k, many common clustering algorithms could not be applied. Remarkably, SCH
values are quite similar, indicating that the cluster assignments generated by
BiDViT are of comparable quality even though the runtime of BiDViT is signif-
icantly shorter. For SDB , BiDViT outperforms the others for lower values of k,
and is comparable for large values. One explanation for the slightly weaker per-
formance of BiDViT with respect to SCH is that BiDViT aims to minimize the
Fig. 4. SCH (left) and SDB (right) of clustering assignments on the Covertype dataset
generated by the heuristic BiDViT algorithm (κ = 103 , α = 1.5, and ε0 = 102 ) and
Mini Batch k-means (batch size = 50, max iter = 103 , tol = 10−3 , and n init = 1).
Whereas a higher value of SCH indicates better clustering, the opposite holds for SDB .
Fig. 5. Time to solution (left) and SDB (right) of common clustering algorithms and
BiDViT (κ = 103 , α = 1.3, and ε0 = 16.0) on a subset of the Covertype dataset for
different numbers of clusters. For k-means++ and Mini Batch k-means clustering, we
modified the number of initializations, and for Birch clustering, it was the branching
factor. These parameters resulted in a speed-up with a minimum loss of quality; their
values are indicated in the legend.
Table 1. Dataset statistics (left) and runtime comparison of extreme clustering algo-
rithms in seconds (right). PERCH-C (“collapsed-mode”) was run, as it outperforms
standard PERCH. The parameter L sets the maximum number of leaves (see [9] for an
explanation). BiDViT selected the values ε0 = 30 and ε0 = 0.5, such that a percent-
age of the nodes collapsed in the initial iteration, for the Covertype and the MNIST
datasets, respectively. The mean and standard deviation were computed over five runs.
For the quantum version of BiDViT, one can observe higher-quality solutions and a significant speed-up when compared to common clustering methods. Both
observations are based on results shown in Fig. 6.
Fig. 6. Runtime and quality of results for the quantum version of BiDViT obtained
using a D-Wave 2000Q quantum annealer. Left: computational time for the 3D grid dataset. Right: comparison of the Calinski–Harabasz score of the quantum version of BiDViT and of k-means clustering on a subset of the MNIST dataset for different
numbers of clusters. The orientation of the abscissae has been inverted to illustrate that
at low BiDViT levels there are many clusters and at high levels only a few remain.
However, the heuristic version of BiDViT and the common clustering algo-
rithms were executed on a classical device that has a limited computational
capacity, whereas the D-Wave 2000Q is a highly specialized device. Running
these algorithms on a high-performance computer might lead to an equivalent
degree of speed-up.
Fig. 7. Image quantization via clustering in the colour space of a standard test image.
The original image has 230,427 colours. BiDViT is particularly fast at reducing its
colours to a number on the order of 104 , as this falls into the extreme clustering range.
Here, the k-means clustering algorithm faces its computational bottleneck. A commonly
employed algorithm for such problems is the median cut algorithm. Naturally, it is
faster than BiDViT—as BiDViT employs the median cut algorithm in its chunking
procedure—but BiDViT produces a more accurate colour assignment.
5 Conclusion
We have developed an efficient algorithm capable of performing extreme clustering. Our complexity analysis and numerical experiments have shown that, if
the dataset cardinality and the desired number of clusters are both large, the
runtime of BiDViT is at least an order of magnitude faster than that of classi-
cal algorithms, while yielding a solution of comparable quality. With advances in
quantum annealing hardware, one can expect further speed-ups and an increase in the size of the datasets the algorithm can process.
Independent of BiDViT, the proposed coarsening method, based on identi-
fying a maximal ε-separated subset, is valuable in its own right—it is a novel
approach to clustering which is not limited solely to extreme clustering.
Further investigation of the proposed coarsening approach is justified, as we
have identified a domain for the radius of interest (in Theorem 2) such that,
under a separability assumption, every solution to (P1) (i.e., every maximum
weighted ε-separated subset) yields the optimal clustering assignment. Potential
paths for future research include the development of selection procedures for the
initial radius of interest and investigating the clustering quality of the coarsening
method (without data chunking) in the non-extreme domain.
References
1. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In:
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algo-
rithms, pp. 1027–1035. SIAM (2007)
2. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering
method for very large databases. In: ACM Sigmod Record, vol. 25, pp. 103–114.
ACM (1996)
3. Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for
discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp.
226–231 (1996)
4. Nayak, R., Mills, R., De-Vries, C., Geva, S.: Clustering and labeling a web scale
document collection using Wikipedia clusters. In: Proceedings of the 5th Interna-
tional Workshop on Web-scale Knowledge Representation Retrieval & Reasoning,
pp. 23–30. ACM (2014)
5. de Vries, C.M., de Vine, L., Geva, S., Nayak, R.: Parallel streaming signature EM-
tree: a clustering algorithm for web scale applications. In: Proceedings of the 24th
International Conference on World Wide Web, pp. 216–226. International World
Wide Web Conferences Steering Committee (2015)
6. Wang, X.J., Zhang, L., Liu, C.: Duplicate discovery on 2 billion internet images. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, pp. 429–436 (2013)
7. Liu, T., Rosenberg, C., Rowley, H.A.: Clustering billions of images with large scale
nearest neighbor search. In: Proceedings of the 8th IEEE Workshop on Applications
of Computer Vision, WACV 2007, p. 28. IEEE Computer Society, Washington
(2007)
8. Woodley, A., Tang, L.X., Geva, S., Nayak, R., Chappell, T.: Parallel K-Tree: a
multicore, multinode solution to extreme clustering. Future Gener. Comput. Syst.
99, 333–345 (2018)
9. Kobren, A., Monath, N., Krishnamurthy, A., McCallum, A.: A hierarchical algo-
rithm for extreme clustering. In: Proceedings of the 23rd ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 255–264. ACM
(2017)
10. Kumar, V., Bass, G., Tomlin, C., Dulny, J.: Quantum annealing for combinatorial
clustering. Quantum Inf. Process. 17(2), 39 (2018)
11. Merendino, S., Celebi, M.E.: A simulated annealing clustering algorithm based on
center perturbation using Gaussian mutation. In: The 26th International FLAIRS
Conference (2013)
12. Kurihara, K., Tanaka, S., Miyashita, S.: Quantum annealing for clustering.
arXiv:1408.2035 (2014)
13. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In:
Proceedings of the 26th Annual ACM Symposium on Theory of Computing, pp.
291–300. ACM (2004)
14. Balcan, M.F., Ehrlich, S., Liang, Y.: Distributed k-means and k-median clustering
on general topologies. In: Advances in Neural Information Processing Systems, pp.
1995–2003 (2013)
15. Lucas, A.: Ising formulations of many NP problems. Front. Phys. 2, 5 (2014)
16. Karp, R.M.: Reducibility among combinatorial problems. In: Complexity of Com-
puter Computations, pp. 85–103. Springer (1972)
17. D-Wave Systems Inc.: The D-Wave 2000Q Quantum Computer: Tech-
nology Overview (2017). https://www.dwavesys.com/sites/default/files/D-Wave
%202000Q%20Tech%20Collateral 0117F.pdf. Accessed 13 Feb 2019
18. Fujitsu Ltd.: Digital Annealer Introduction: Fujitsu Quantum-inspired Com-
puting Digital Annealer (2018). http://www.fujitsu.com/global/documents/
digitalannealer/services/da-introduction.pdf. Accessed 13 Feb 2019
19. Malkomes, G., Kusner, M.J., Chen, W., Weinberger, K.Q., Moseley, B.: Fast dis-
tributed k-center clustering with outliers on massive data. In: Advances in Neural
Information Processing Systems, pp. 1063–1071 (2015)
20. Balaji, S., Swaminathan, V., Kannan, K.: Approximating maximum weighted inde-
pendent set using vertex support. Int. J. Comput. Math. Sci. 3(8), 406–411 (2009)
21. Hifi, M.: A genetic algorithm-based heuristic for solving the weighted maximum
independent set and some equivalent problems. J. Oper. Res. Soc. 48(6), 612–622
(1997)
22. Kako, A., Ono, T., Hirata, T., Halldórsson, M.: Approximation algorithms for the
weighted independent set problem in sparse graphs. Discrete Appl. Math. 157(4),
617–626 (2009)
23. Abbott, A.A., Calude, C.S., Dinneen, M.J., Hua, R.: A hybrid quantum-classical
paradigm to mitigate embedding costs in quantum annealing. arXiv:1803.04340
(2018)
24. Nolte, A., Schrader, R.: A note on the finite time behavior of simulated annealing.
Math. Oper. Res. 25(3), 476–484 (2000)
25. Lü, Z., Glover, F., Hao, J.K.: A hybrid metaheuristic approach to solving the
UBQP problem. Eur. J. Oper. Res. 207(3), 1254–1262 (2010)
26. Zhu, Z., Fang, C., Katzgraber, H.G.: borealis – a generalized global update algo-
rithm for Boolean optimization problems. arXiv:1605.09399 (2016)
27. Glover, F., Lewis, M., Kochenberger, G.: Logical and inequality implications for
reducing the size and difficulty of quadratic unconstrained binary optimization
problems. Eur. J. Oper. Res. 265(3), 829–842 (2018)
28. Mandal, S., Pal, M.: Maximum weight independent set of circular-arc graph and
its application. J. Appl. Math. Comput. 22(3), 161–174 (2006)
29. Köhler, E., Mouatadid, L.: A linear time algorithm to compute a maximum
weighted independent set on cocomparability graphs. Inf. Process. Lett. 116(6),
391–395 (2016)
30. Hernandez, M., Zaribafiyan, A., Aramon, M., Naghibi, M.: A novel graph-based
approach for determining molecular similarity. arXiv:1601.06693 (2016)
31. LeCun, Y., Cortes, C., Burges, C.J.: MNIST handwritten digit database. AT&T
Labs (2010). http://yann.lecun.com/exdb/mnist
32. Maaten, L.V.D., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res.
9, 2579–2605 (2008)
33. Blackard, J.A.: UCI Machine Learning Repository (2017). http://archive.ics.uci.
edu/ml. Accessed 13 Feb 2019
34. Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat.
Theory Methods 3(1), 1–27 (1974)
35. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern
Anal. Mach. Intell. 2, 224–227 (1979)
36. Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J.: Understanding of internal clustering
validation measures. In: 2010 IEEE International Conference on Data Mining, pp.
911–916 (2010)
37. Jain, R., Koronios, A.: Innovation in the cluster validating techniques. Fuzzy
Optim. Decis. Making 7(3), 233 (2008)
38. Sculley, D.: Web-scale k-means clustering. In: Proceedings of the 19th International
Conference on World Wide Web, pp. 1177–1178. ACM (2010)
Clustering and Classification
to Evaluate Data Reduction
via Johnson-Lindenstrauss Transform
1 Introduction
Each observation (row) of the matrix X is represented by a point x in the d-dimensional space R^d. A dataset can include a massive number of dimensions. However, it is not always useful to have such a large number of attributes, because oftentimes many contain irrelevant or redundant information. Also, it is common that, by shrinking the number of dimensions, we can gain additional crucial information.
There are two types of dimensionality: extrinsic and intrinsic. The extrinsic dimensionality of a matrix X indicates the dimensionality in which its data points are observed, so we may say that there are d extrinsic dimensions in X. The other type, which is the significant one, is the intrinsic dimensionality. The intrinsic dimensionality of X indicates the number of dimensions that are important and required to return a useful response to a given query of our choice.
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 190–209, 2020.
https://doi.org/10.1007/978-3-030-39442-4_16
Data Reduction Evaluation via Johnson-Lindenstrauss Transform 191
hence also k are heavily dependent. In fact, in many of the evolutions and improvements of the lemma, the choice of random projection matrix, together with that matrix's well-defined characteristics, enters into the lemma's proof and derivation, and is often the largest influence on the lemma's lower bound k.
Now, over the evolutions and improvements of the JLT and the JL lemma, how has the random projection matrix been defined? The answer is that the definitions of the random projection matrix fall into three clean and distinct classes; as observed from many proofs of improvements of the lemma, there are indeed only these three:
All three encompass specific definitions of the random projection matrix, and
different classes of evolutions and improvements of the original lemma’s proof.
The JL lemma has been improved over time. Many researchers have sought to reduce the bound k and to simplify the proof of the lemma. The improvements focused on making the lower bound on the dimension k tighter. Also, research has centered on making the transformation T efficient by reducing the required number of matrix operations and the amount of space needed to compute the transformation. Even
though there are various improved versions of the lemma, all of them still adhere
to the condition of pairwise distance preservation.
Two approaches to the construction of the JL transform are of interest in this study: the first uses sub-Gaussian projection matrices and the second uses sparse projection matrices. We investigate the effects of dimensionality reduction using the two techniques on real-world data for Attention Deficit Hyperactivity Disorder (ADHD).
1.1 Motivation
The authors of [2] used simulated, rather than natural, data to perform data reduction based on the JL lemma, and statistical measures to judge the quality of the reduction.
In this study we use natural neuroimaging data and clustering and classifica-
tion techniques to judge the quality of the JL lemma reduction. A question of
paramount importance is whether our original high dimensional dataset gives
the same analytic results as the reduced dataset. For instance, after performing
clustering on both original and reduced datasets, would they produce the same
classification accuracy? Even though some papers [3–6] have conducted research
on the topic of Johnson-Lindenstrauss lemma to reduce high dimensional data
Fig. 1. Novelty of this research. The dotted lines show work that has previously been done. Clustering followed by classification is new.
for classification or clustering (see the dotted lines in Fig. 1), to the best of our
knowledge, comparison of the classification via clustering results (as illustrated
using the bold solid black arrows in Fig. 1) has not been done.
An outline of the paper follows: Sect. 2 gives related work, which includes
an introduction to the data set under consideration, that of fMRI measures of
people experiencing Attention Deficit Hyperactivity Disorder (ADHD), review
of data reduction strategies, in general, and the JL lemma, in particular, and
clustering and machine classification algorithms. Section 3 describes the ADHD
dataset in greater detail and shows how it was used for performing data reduc-
tion and classification via clustering analysis to illuminate the essential meaning
in the data. Also, in that section the method is explained that was used to imple-
ment and compare different versions of the JL lemma, and the development of
a general tool based on the JL lemma is described that makes it easy to accom-
plish data reduction tasks. In Sect. 4 we show implementation of the tool for
194 A. Ghalib et al.
comparing different JL reduction techniques and also show the results of exe-
cution of the clustering and classification algorithms on our dataset. Section 5
shows analysis and comparison of results of running two different versions of
the lemma and examines the amount of reduction done by each. Also, it shows
the results of comparing machine classification outcomes for the original high
dimensional dataset and the reduced ones. Section 6 summarizes the outcomes
from our experiments and suggests future research directions.
Currently, data are growing rapidly in all science and business domains, making it complex to process such large amounts of data. Therefore, there is a real necessity to consider data reduction strategies to help solve many real-world problems, such as crime prevention, reducing unemployment, or making decisions for medical treatment. There are also other goals for data reduction techniques, such as minimizing the amount of data that needs to be stored in a data storage environment, or enabling the efficient analysis of big data, such as healthcare datasets, which contain massive amounts of data, for example the health status of patients. Many real-world problems can be solved after preprocessing big data. Many organizations in different industries, such as banking, government, manufacturing, education, health care, and retail, are using big data.
During data reduction, we perform techniques on our original dataset to
obtain a reduced representation of it. Indeed, the reduced representation should
be much smaller in volume and produce the same (or almost the same) analytical
outcomes. Therefore, using a reduced representation of our dataset would be
more efficient due to the fact that the space and computational complexity are
reduced. Dimensionality reduction [9], numerosity reduction (parametric: log-linear models [10], regression [11]; non-parametric [12]), and data compression [13] are three types of data reduction strategies. Choosing the right model is
essential for the given type of dataset.
The singular value decomposition (SVD), also known in different fields of study as principal component analysis (PCA), latent semantic indexing, or the Karhunen-Loève (KL) transform, is one of the most effective methods for dimensionality reduction. The authors of [21] utilize SVD as a dimensionality reduction method combined with the k-means clustering algorithm in order to yield a more accurate recommender system. In [22], the authors proposed higher-order singular value decomposition (HOSVD) as a robust dimensionality reduction method to overcome the sparsity problem of high-dimensional matrices. The author of [23] proposed a semi-supervised data reduction method, Semi-supervised Linear Discriminant Analysis (SLDA), which can use a limited number of labeled data points and a quantity of unlabeled ones for training.
Theoretically, the JL lemma and SVD solve different problems in data reduction. The JL lemma promises a low-dimensional space that captures the distances
up to a given error, while SVD gives the best possible embedding according to a
given dimension. In other words, JL Lemma yields a worst-case pairwise guaran-
tee while SVD computes a projection of the input points in a lower dimensional
space where the sum of the errors is minimized.
We were unable to scale PCA to large datasets, so it was not considered for our current study. PCA and the Johnson and Lindenstrauss approach
(the focus of this study) are data extraction techniques that distort the values
of data in columns in contrast with feature selection techniques that leave the
column values of the selected features intact.
To select the best algorithm for our problem, we have to take into consider-
ation these questions: What kind of problem are we dealing with? For example,
are we dealing with classification, regression, or clustering? Does our data consist
of a very high dimensionality? Is our data labeled or unlabeled? Then we can
decide on answers to the following questions: Are we interested in performing
feature selection or extraction? Are we going to use a supervised or unsuper-
vised method? Do we want to use a very computationally intensive method or
an inexpensive one?
The JL lemma aims to preserve all the pairwise distances in the Euclidean space of n vectors of length d. The idea is that we want to embed all of the vectors into a space of dimensionality k ≪ d. The JL lemma first appeared as a geometric lemma used to prove a result on extensions of Lipschitz mappings into a Hilbert space.
In the original paper [24], the JL lemma was stated as follows: given a set of n points in ℝ^d, for some n, d ∈ ℕ, and given ε ∈ (0, 1), there exists k0 = O(ε^(-2) log n) such that, if k ≥ k0, there exists a linear mapping T : ℝ^d → ℝ^k such that, for any two points u, v, (1 − ε)‖u − v‖² ≤ ‖T(u) − T(v)‖² ≤ (1 + ε)‖u − v‖².
matrix as was done in [2]. However, their experiments tested the two implemen-
tations of the JL-lemma on simulated data using statistics while in this work, the
two versions were tested on natural data using machine classification techniques.
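The sub-Gaussian construction can be sketched as follows (a minimal numpy illustration with a Gaussian projection; the constant in the dimension formula is one common choice from random-projection proofs of the lemma, not necessarily the one used in [2]):

```python
import numpy as np

def jl_dim(n, eps):
    # k0 = O(eps^-2 log n); the constant 4 / (eps^2/2 - eps^3/3) is one
    # common choice in random-projection statements of the JL lemma
    return int(np.ceil(4 * np.log(n) / (eps ** 2 / 2 - eps ** 3 / 3)))

def jl_project(X, eps, seed=0):
    # Sub-Gaussian (here: Gaussian) random projection T : R^d -> R^k
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = jl_dim(n, eps)
    T = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ T

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10_000))  # 50 points in d = 10,000 dimensions
Y = jl_project(X, eps=0.5)

# Spot-check the pairwise-distance guarantee for a few pairs
for i, j in [(0, 1), (2, 3), (4, 5)]:
    ratio = (np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])) ** 2
    assert 1 - 0.5 <= ratio <= 1 + 0.5
```

Note that k depends only on n and ε, not on d, which is what makes the transform attractive for very high-dimensional data.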
3 Methodology
An application programming interface (API) was implemented that enabled two
different versions of the JL lemma to be performed. Quality of the reduction was
judged by the amount of reduction done and by the extent to which the reduced
data capture the shape of the original data. The first criterion was determined by
looking at the value k of the reduced datasets. The second was to examine the accuracy and performance of clustering and classification techniques applied to the reduced datasets.
Fig. 3. Method by which JL lemma was tested to see if it preserves information content
in the reduction as well as geometric properties of the data.
Refer to Fig. 3 for an overview of how the quality of the reduction was judged.
The ADHD data can be viewed in several ways. For example, subjects could
appear as rows and Time series/features (also called regions of interest) as
columns, or features could appear as rows and the other two as columns. In
all there are six possible permutations as illustrated in Table 1.
The original subject files contained numeric codes in their first row that
describe each column. The codes describe the regions of interest (ROIs) in the
brain. A numeric code of each ROI is associated with a narrative abbreviation.
We have provided six different permutations of the ADHD dataset, each of which contains at least 19,952 dimensions. However, only one of the permutations was used during our experiment (permutation 1), which has 37,152 dimensions. We
have developed an API to construct these permutations as well as apply the imple-
mentations of the JLT algorithms with error tolerances from 0.1 through 0.9.
Let us discuss permutation1 with subjects along the rows and feature-time
along the columns. The structure consists of all of the 116 ROIs each with its
related 172 time series. In the first row (i.e. ROI number 1-time 1, ROI number 1-
time 2, and so on) the features and time-series will be given. Then, the following
rows would hold the record for each subject (i.e. row 2 will have subject number
1, row 3 will have subject number 2, and so on).
The API for running the JL-lemma construction provides flexibility for reor-
ganizing the input dataset to view it from different angles of interest.
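The permutation-1 layout described above (subjects as rows, ROI-time pairs as columns, using the 116 ROIs × 172 time points from the text) can be sketched as a tensor reshape; the signal values here are synthetic, not the ADHD data:

```python
import numpy as np

n_subjects, n_rois, n_times = 216, 116, 172
rng = np.random.default_rng(0)
# Synthetic stand-in for the fMRI signals: (subject, ROI, time)
signals = rng.normal(size=(n_subjects, n_rois, n_times))

# Permutation 1: subjects along rows; columns ordered
# (ROI 1, time 1), (ROI 1, time 2), ..., (ROI 116, time 172)
perm1 = signals.reshape(n_subjects, n_rois * n_times)

assert perm1.shape == (216, 116 * 172)      # 19,952 feature-time columns
assert perm1[0, 172] == signals[0, 1, 0]    # column 172 is (ROI 2, time 1)
```

The other five permutations correspond to the remaining axis orderings and can be produced with `np.transpose` before the reshape.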
WEKA is a powerful tool that allows us to run many machine learning (ML) algorithms and to analyze their results. Subjects were grouped based on their brain signal measurements; for this we applied a clustering technique using the K-means algorithm. We then used the 'AddCluster' filter available in WEKA, which allowed us to know which instances fall in which cluster. This results in a new attribute called 'cluster' that acts as a class label, giving the cluster number for each instance. Different classification algorithms were then run, as the data in each cluster were labeled with the cluster's number. The so-labeled ADHD dataset was used to test the accuracy and performance of classification algorithms. In our experiment, we used ten-fold cross-validation for training and testing on the ADHD dataset.
A dataset can be divided into two separate sets: a training set and a testing set. The training set is intended for building a model; the testing set is intended for validating that built model. Note that the data points of the training set are excluded from the testing set. Evaluation measurements available in WEKA include cross-validation, the confusion matrix, error rate, accuracy, precision, recall, and the ROC curve.
To evaluate our predictive models, we used the cross-validation technique for all of our classifiers. This evaluation technique for supervised learning divides the dataset into a training set and a testing set. We used 10-fold cross-validation, which means that the dataset is sliced into ten equal-sized sets. In each fold, the data are divided into two parts (90% training set and 10% testing set), and the selected classification algorithm is run on each of the ten folds. At the end, the cross-validation technique returns the average accuracy over these ten models. The selection of 10 folds was based on theoretical evidence [27,28] as well as on empirical evidence from using different datasets with different learning techniques. These studies show that 10 is the optimal number of folds for a good estimate of the error.
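The 10-fold procedure described above can be sketched as follows (a minimal numpy illustration; the toy nearest-centroid classifier and synthetic data are stand-ins for the WEKA classifiers and the ADHD dataset):

```python
import numpy as np

def ten_fold_cv(X, y, fit, predict, k=10):
    # Slice into k folds; train on the other k-1 folds, test on the
    # held-out fold, and return the average accuracy over the k models.
    idx = np.arange(len(X))
    accs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        model = fit(X[train], y[train])
        accs.append(np.mean(predict(model, X[fold]) == y[fold]))
    return float(np.mean(accs))

# Toy classifier: assign each point to the nearest class centroid
def fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, Xt):
    classes = np.array(sorted(model))
    cents = np.stack([model[c] for c in classes])
    d = np.linalg.norm(Xt[:, None, :] - cents[None], axis=2)
    return classes[d.argmin(axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 5)), rng.normal(3, 1, size=(50, 5))])
y = np.repeat([0, 1], 50)
acc = ten_fold_cv(X, y, fit, predict)
```

Each of the ten models sees 90% of the data for training and is scored on the remaining 10%, exactly as in the procedure above.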
Using data reduction techniques reduces the computational complexity of the K-means clustering algorithm as well as of the classification techniques. We tried performing PCA on the ADHD dataset in WEKA and in R for the sake of comparison. However, the attempts were unsuccessful due to the computational complexity of PCA on the high-dimensional ADHD dataset; computing the PCA would have required an upgraded machine environment.
200 A. Ghalib et al.
4 Implementation
We have implemented the JL Embedding Tool, which acts like an API for using
two different MatLab programs to compute two different versions of the JL
lemma: JL lemma using sub-Gaussian projection and JL lemma using sparse
matrix projection. mtlb_t_4_2 and mtlb_t_4_3 are abbreviations used for the two versions of the JL lemma because in [2] the two approaches were referred to as Theorems 4.2 and 4.3.
Let us describe the input/output architecture of the JL Embedding Tool that performed the reduction (Fig. 4). The tree shows the original dataset at the root as an input data file. The second layer contains all of the six permutations (pre-embedded) of that input data file. The third layer (post-embedded) produces the
output of using the two different versions of the JL lemma algorithms. For each
permutation with each algorithm, there will be nine different result files, arising from ε = 0.1 to 0.9. That results in (6 permutations × 2 algorithms × 9 epsilon values) = 108 '.csv' output files. Figure 4 shows an overview of the
input output structure of the JL Embedding Tool that was used. Note that this
was the main tool for our experiments.
1. The API was used to reconstruct the ADHD data files for 216 subjects, which
will produce six different permutations. Also, the API will apply two different
JL transformations (for error tolerance ε = 0.1, …, 0.9 per permutation per
JL algorithm) on these six permutations.
2. Choose one permutation (Only permutation 1 was chosen to conduct the
experiment).
3. Load the dataset into R programming language to obtain its transpose.
4. Write an R program script to apply the elbow method on the dataset, giving insight into the optimal number of clusters K for the K-means clustering algorithm.
5. Load the ‘.csv’ data file in WEKA and save it as ‘.arff’ WEKA file.
6. Apply the K-means clustering algorithm with K = 3 (from step 4, it turns out that 3 is the optimal value of K) on the dataset and then add a new
feature that represents the cluster name for each instance in the dataset
using WEKA’s filter ‘AddCluster’.
7. Apply classification algorithms and observe the accuracy and performance
of each algorithm.
8. Cluster the reduced datasets (JL 4.2 and 4.3 for error tolerance ε = 0.1, 0.5, 0.9) using the K-means clustering algorithm with K = 3. Then add a new feature to the datasets that represents the cluster name for each instance in the datasets.
9. Apply classification algorithms on the datasets in step 8 and observe the
accuracy and performance of each algorithm.
10. Compare the accuracy and performance of the original data and the reduced
datasets.
11. Compare the reduced dimensions of the datasets (JL 4.2 and JL 4.3 for error tolerance ε = 0.1, …, 0.9).
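Step 4 above, the elbow method, can be sketched as follows (a minimal numpy illustration, not the R script used in the study; the deterministic quantile seeding is a simplification to keep the example reproducible):

```python
import numpy as np

def kmeans_sse(X, k, iters=100):
    # Lloyd's algorithm with deterministic quantile seeding on the first axis
    order = np.argsort(X[:, 0])
    C = X[order[np.linspace(0, len(X) - 1, k).astype(int)]]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - C[None], axis=2)
        lbl = d.argmin(axis=1)
        C = np.stack([X[lbl == j].mean(axis=0) if np.any(lbl == j) else C[j]
                      for j in range(k)])
    # Within-cluster sum of squared errors for this K
    return float(((X - C[lbl]) ** 2).sum())

# Three well-separated 2-D blobs: the SSE curve should show an elbow at K = 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(40, 2)) for m in (0, 5, 10)])
sses = [kmeans_sse(X, k) for k in (1, 2, 3, 4, 5)]

# SSE drops sharply up to K = 3, then flattens: the "elbow"
assert sses[0] > sses[1] > sses[2]
assert sses[2] - sses[3] < 0.1 * (sses[1] - sses[2])
```

Plotting `sses` against K and picking the point where the curve bends is the elbow criterion used to choose K = 3 in the study.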
that we have tested for both sub-Gaussian and sparse matrix projections. The x axis represents the value of the error tolerance and the y axis represents the number of reduced attributes (k dimensions).
Now we can answer research question three ‘Which one of the two JL algo-
rithms produces more reduction?’ The answer is JLT 4.2 using the sub-Gaussian
approach.
Table 2 shows the amount of retained cluster labels from the original ADHD
dataset in the reduced dataset using JLT 4.2. The table shows that with error
tolerance 0.1, the reduced dataset was able to capture 67% of the labels. With
error tolerance 0.5, the dataset was able to capture 55% of the labels. However,
with reduction using error tolerance 0.9, the reduced dataset captured 37% of
the labels.
Table 2. Retained amount of clusters in the reduced ADHD dataset using JLT 4.2
Table 3. Retained amount of clusters in the reduced ADHD dataset using JLT 4.3
Table 3 shows the amount of retained cluster labels from the original ADHD
dataset in the reduced dataset using JLT 4.3. As in the previous table, the amount of retention was significant with error tolerance 0.1, while with larger error tolerances (e.g. 0.5 and 0.9) the retention rate decreases.
Tables 2 and 3 answer research question one which was ‘Does the Johnson-
Lindenstrauss Transform (JLT) in fact protect the reduced data from major loss
of important characteristics by virtue of the fact that it preserves the pairwise
distance structure?’ and the answer is it depends on the value of error tolerance.
Table 4. Accuracy rates of JL 4.2 and JL 4.3 each with ε = 0.1, 0.5, and 0.9 for each
of the classification algorithms.
The accuracy rates for several different classification algorithms are summarized in Table 4. The accuracy rates are good except for ZeroR, which is known to be a poor classifier because it relies on the target and ignores all predictors, simply predicting the majority category.
Overall, K-NN and random forest achieve the best accuracy among all the classification algorithms, reaching 100% when predicting the reduced ADHD dataset produced by JLT 4.3 with error tolerance 0.9. Notice that we used the same training and testing settings for all classifiers. Averaging the accuracy of all models for each classification algorithm, the decision tree achieves 88%, K-NN 92%, Naïve Bayes 95%, random forest 93%, and ZeroR 50%. The highest average, 95%, is achieved by Naïve Bayes. However, we cannot rely only on the rate of correctly classified instances (the accuracy rate), because a high accuracy rate does not necessarily mean a good model.
After clustering, both the original and reduced ADHD datasets contain unbalanced class labels. A balanced class label means having 50% of the labels in one class and 50% in the other (i.e., 50% in cluster 1 and 50% in cluster 2), while an unbalanced class label means having more labels in one class than the other. For example, if 90% of our instances are in one class, a classifier can simply assign everything to that class; it will report 90% accuracy most of the time even though the prediction model performs poorly. Therefore, it is important to report other metrics, such as the area under the ROC curve (AUC), to check classifier performance. This metric examines the items of each class in the dataset and gives the percentage of the time that the classifier correctly predicts that item's class.
The above answers research question two that asks ‘Does the reduced dataset
by JLT produce the same accuracy or almost the same of classification models
as the original dataset?’ and the answer is that it depends on the classification
algorithm used and the value of error tolerance.
In Table 5, we compare the highest ROC AUC rates for the different classification algorithms. The table shows that K-NN, Naïve Bayes, and random forest reach the highest ROC AUC rate of 1 on the reduced ADHD datasets, which indicates that their models are perfect. Although Naïve Bayes was not previously selected as having the highest accuracy rate, we can now consider it as good a prediction model as K-NN and random forest based on the ROC AUC achieved. Notice that this Naïve Bayes model, on the ADHD dataset reduced by JLT 4.3 with distortion rate 0.9, has a 97% accuracy rate, whereas K-NN and random forest reach 100% accuracy using JLT 4.3 with distortion rate 0.9. Also, random forest using JLT 4.2 with distortion rate 0.9 represents a perfect model, with an accuracy rate of 99%.
be reduced. Even though this method could help in reducing data to a more
condensed representation of its original state, it was found to be unusable on the
ADHD dataset due to computational intensity.
We strive for a better way to capture the intrinsic dimensionality of data. For that reason, we turned to a mathematical result from 1984, the Johnson-Lindenstrauss lemma (JL lemma), which states that we can embed high dimensional data points into a lower dimensional representation while approximately preserving the pairwise distance between each pair of points. Indeed, it states that any set of n points in d-dimensional Euclidean space can be embedded into k-dimensional Euclidean space while preserving the pairwise distances between all points to within a given factor (1 ± ε), where k ≪ d and k is logarithmic in n and independent of d.
Two different approaches were used for construction of the JL lemma: the
first approach uses sub-Gaussian projection (JLT 4.2) and the second uses sparse
projection matrices (JLT 4.3). Both approaches were implemented in an API
and the sub-Gaussian was found to be superior for reduction of the ADHD data
(research question three).
The accuracy and performance of classifying the original and reduced ADHD
datasets were compared using different classifiers. The K-means clustering algo-
rithm was used to partition the brain signals of the subjects into different groups.
The selection of the number of partitions was obtained by using the elbow
method, which provided us with the optimal number of partitioning (k = 3)
(research question four). Both original and reduced ADHD datasets became
labeled and thus we were able to perform classification algorithms on them.
It was illustrated that using dimension reduction by the JLTs as a preprocessing step before performing clustering and classification yields the best prediction models in terms of both accuracy and performance. This shows that the reduced datasets have a positive impact on the accuracy and performance of those techniques, which is to be expected. However, the clustering and classification
results themselves help to answer the research questions of this study concern-
ing whether information is preserved in the Johnson-Lindenstrauss Transform
(JLT).
Clustering results answer research question one ‘Does the Johnson-
Lindenstrauss Transform (JLT) in fact protect the reduced data from major loss
of important characteristics by virtue of the fact that it preserves the pairwise
distance structure?’. Best results were obtained for error tolerance 0.1 because
amount of retained cluster labels is 67% for JLT 4.2 and 82% for JLT 4.3.
The K-NN and random forest algorithms were found to have the best predic-
tive models in terms of accuracy rate. Both have 100% accuracy when predicting
the reduced ADHD dataset using JLT 4.3 with error tolerance 0.9. In addition, K-NN using JLT 4.3 with error 0.9, Naïve Bayes using JLT 4.3 with error 0.9, and random forest using both JLT 4.2 and 4.3 with error 0.9 yield perfect models in terms of performance, having ROC AUC = 1. The results show
that these prediction models outperform the other models. The worst prediction models of all were obtained with the ZeroR classification algorithm; these models showed poor accuracy and performance rates. The ZeroR classifier was used as a baseline against which to compare the other classification algorithms.
Hence, research question two ‘Does the reduced dataset by JLT produce
the same accuracy or almost the same of classification models as the original
dataset?’ can be answered in the affirmative depending on the classification
algorithm used and the value of error tolerance.
Previous conceptualizations of ADHD have been limited by the lack of tools to analyze huge stores of complex neurological data. It seems likely that the disorder has a complex topology: different people may have different brain regions involved in their symptoms, and those brain regions might be the foundation for a typology. An initial step towards developing tools for analyzing ADHD data has been provided.
Finally, it was observed from proofs of JL-lemma and JLT that the lin-
ear transformation can be computed in polynomial time for high dimensional
datasets without any assumptions on the original data. In contrast, other dimen-
sionality techniques such as Principal Component Analysis are only useful for
datasets constrained to lower-dimensional spaces. That is, the technique explored here is not constrained by assumptions about the dimensionality of the data: high dimensionality imposes no prohibitive expense on this technique's operation or computational performance.
Our description of the Johnson-Lindenstrauss Transform/Embedding is a top-down, coarse-grained description, and only a description. It was made in an attempt to be simple, clear, and straightforward, in plain English, avoiding the mathematical complexities imposed by the Johnson-Lindenstrauss Lemma's formulation and proof. Further work is in progress towards understanding the implicit geometrical nature of the Lemma's construction and its implications for relativity, proportionality, and space in general. Work towards these objectives will supersede the high-level description discussed here.
We are in the process of creating a single application that is a hybrid of a
variety of JL techniques. This application will be a single pipeline, of one black-
box to another, forming a chain from input to output. UML and architectural
diagrams of the design and implementation are forthcoming. Refactoring will
be undertaken to adhere to a solution stack in an attempt to decouple the
application and perhaps provide it as an open source library for JL.
References
1. Wang, A., Gehan, E.A.: Gene selection for microarray data analysis using principal
component analysis. Stat. Med. 24(13), 2069–2087 (2005)
2. Fedoruk, J., Schmuland, B., Johnson, J., Heo, G.: Dimensionality reduction via
the Johnson-Lindenstrauss lemma: theoretical and empirical bounds on embedding
dimension. J. Supercomput. 74(8), 3933–3949 (2018)
3. Cannings, T.I., Samworth, R.J.: Random-projection ensemble classification. J. Roy.
Stat. Soc. B (Stat. Methodol.) 79(4), 959–1035 (2017)
18. Stein, G., Chen, B., Wu, A.S., Hua, K.A.: Decision tree classifier for network intrusion detection with GA-based feature selection. In: Proceedings of the 43rd Annual
Southeast Regional Conference - Volume 2, ACM-SE 43, pp. 136–141. ACM,
New York (2005)
19. Mukherjee, S., Sharma, N.: Intrusion detection using Naive Bayes classifier with
feature reduction. Procedia Technol. 4, 119–128 (2012). 2nd International Confer-
ence on Computer, Communication, Control and Information Technology (C3IT-
2012) on February 25–26, 2012
20. Deshmukh, S., Rajeswari, K., Patil, R.: Analysis of simple K-means with multiple
dimensions using WEKA. Int. J. Comp. Appl. 110(1), 14–17 (2015)
21. Zarzour, H., Al-Sharif, Z., Al-Ayyoub, M., Jararweh, Y.: A new collaborative filter-
ing recommendation algorithm based on dimensionality reduction and clustering
techniques, pp. 102–106 (2018)
22. Nilashi, M., Ibrahim, O., Ahmadi, H., Shahmoradi, L., Samad, S., Bagherifard, K.:
A recommendation agent for health products recommendation using dimensional-
ity reduction and prediction machine learning techniques. J. Soft Comput. Decis.
Support Syst. 5, 7–15 (2018)
23. Wang, S., Lu, J., Gu, X., Du, H., Yang, J.: Semi-supervised linear discriminant
analysis for dimension reduction and classification. Pattern Recogn. 57, 179–189
(2016)
24. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert
space. Contemp. Math. 26(7), 189–206 (1984)
25. Bellec, P., Chu, C., Chouinard-Decorte, F., Benhajali, Y., Margulies, D.S., Crad-
dock, R.C.: The neuro bureau ADHD-200 preprocessed repository. NeuroImage
144, 275–286 (2017). Data Sharing Part II
26. Matoušek, J.: On variants of the Johnson-Lindenstrauss lemma. Random Struct.
Algorithms 33(2), 142–156 (2008)
27. Bengio, Y., Grandvalet, Y.: No unbiased estimator of the variance of K-fold cross-
validation. J. Mach. Learn. Res. JMLR 5, 1089–1105 (2004)
28. Markatou, M., Tian, H., Biswas, S., Hripcsak, G.: Analysis of variance of cross-
validation estimators of the generalization error. J. Mach. Learn. Res. JMLR 6,
1127–1168 (2005)
Application of Statistical Learning
in Ferro-Titanium Industry
1 Introduction
The need to improve productivity and efficiency, reduce management levels, and increase process security has led to increased research activity on fault detection and isolation [2].
On the other hand, quality and price are essential requirements that clients place above all other considerations when making a purchase or signing a contract with a producer. Therefore, for producers, a rigid quality plan that controls critical processes and alerts process owners before the limit points are reached is vital. One of the main industries, which can fairly be called the root of other industries, is steel production. Steel producers compete with each other to lead in production quality, and to improve quality they apply purifying chemical compounds known as ferroalloys. A ferroalloy is a combination of a metal with a high amount of alloying elements such as vanadium, titanium, silicon, aluminum, etc. Each of these ferroalloys has
By plotting these chemical alloys, we observed correlations between some of them, which we investigate further in this research using supervised machine learning methods to determine the linear and non-linear correlations between chemical compounds. Therefore, the specific objective of this work is to determine the relationship between the dependent and independent variables in the ferrotitanium dataset and to form a stable algorithm for predicting these relations in the production process of ferrotitanium.
In order to explore the goal of this research, the following experiments will be
performed:
• Visualizing the collected dataset to detect the potential correlations;
• Data cleansing: removing sample data generated by human error during the sampling steps;
• Determining dependent (response) and independent (predictor) variables;
• Exploring robust patterns between response and predictor variables;
• Applying multiple regression and random forests as the main statistical methods of supervised learning;
• And, training the model on the general dataset of collected data from the quality
department of a titanium producer.
The result is a roadmap for the quality, production, and engineering departments of this industry for predicting the suitable combination of different materials as the main recipe for the production process. The results can assist the company in improving its operations, thereby enabling better planning and guidance for high-quality products.
2 Research Methodology
This research starts by collecting and sorting the data from the quality department of a
ferrotitanium producer and is continued by applying supervised learning methods over
the collected data. Analyzing the main predictor variables affecting the quality of
response variables of the ferrotitanium producers is the target of this research. Some of
these companies use the statistical process control (SPC) method to control the main chemical
alloys required by their clients. They are using several control charts to control different
alloys by setting different upper control limit (UCL) and lower control limit (LCL).
Figure 1 represents the SPC control chart of vanadium in the studied company.
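Shewhart-style control limits such as the UCL and LCL in Fig. 1 are conventionally placed three standard deviations from the process mean; a minimal sketch with hypothetical vanadium measurements:

```python
import numpy as np

# Hypothetical vanadium measurements (% by weight) from successive heats
v = np.array([2.1, 2.3, 1.9, 2.0, 2.4, 2.2, 2.1, 2.0, 1.8, 2.2,
              2.3, 2.1, 2.0, 2.2, 1.9, 2.1, 2.4, 2.0, 2.1, 2.2])

mean = v.mean()
sigma = v.std(ddof=1)
ucl = mean + 3 * sigma  # upper control limit
lcl = mean - 3 * sigma  # lower control limit

# Flag out-of-control samples so process owners are hinted
# before the process drifts past the limit points
out_of_control = np.flatnonzero((v > ucl) | (v < lcl))
```

Plotting `v` against the constant `ucl`/`lcl` lines reproduces the shape of a control chart like Fig. 1.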
The methodology proceeds in three steps:
• Applying linear regression: multiple linear regression is applied to assess the relation between the dependent and independent variables.
• Applying random forest: using random forest, the results of the linear regression method are analyzed and the main predictive variables affecting each dependent variable are determined.
• Result comparison: the main elements determined by multiple linear regression are compared with the random forest results, and the calculated R-squared is checked to see which trained model is more reliable.
3 Data Analysis
3.1 Dependent (Response) and Independent (Predictor) Variables
In this research, after investigating all sales contracts of 2018, it was found that there are five main alloys requested by clients; quality officers control them for every sample result using SPC control charts to meet customer requirements. In this paper, the results for vanadium as a critical response variable are presented, and the results for the other four response variables are presented in the appendix section.
To control the key elements required by clients, the technical department prepares a
special recipe to blend and melt the different materials to meet the customer’s
requirements. This recipe is prepared according to the types of available materials in
the inventory system of the company and the description of the job order for each alloy's boundary coming from the sales contract. Therefore, we consider these five
main alloys as our dependent (response) variables to run the data analysis and see
which alloys have more effects on these dependent variables. Table 2 displays the
independent variables which are analyzed to find out their effect on vanadium as our
response variable.
4 Supervised Learning
Statistical learning methods are divided into two general categories: supervised and unsupervised. In supervised learning, for every observation of the predictor measurements x_i, i = 1, 2, …, n, there is an associated answer y_i, sometimes called the response, and we fit a model for forecasting and predicting responses. A simple statistical model may help to understand the relationship between predictors and response [4].
Supervised learning models are built to predict the relation between the response
and predictors using multiple regression as a linear method, and random forests as a
non-linear method.
Y = β0 + β1·mo + β2·si + β3·sn + β4·zr + β5·mn + β6·cr + β7·fe + β8·ni + β9·cu + β10·n    (1)
In Table 3, the results of the predictor variables for vanadium as the response variable are depicted. For response variables with a reported R-squared above 50%, the model adequately fits the data and multiple linear regression is a suitable tool for prediction. On the other hand, for response variables with an R-squared below 50%, the trained model cannot be used on its own to predict the response variable on the dataset.
The fitted regression model for vanadium takes the following form.
v = β0 + β1·mo + β2·si + β3·sn + β4·zr + β5·mn + β6·cr + β7·fe + β8·ni + β9·cu + β10·n    (2)
The results from the fitted model on the training set are shown below.
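A fit of this form can be sketched with ordinary least squares (a numpy illustration; the coefficients and data below are synthetic stand-ins for the ferrotitanium measurements, not the study's values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic predictor matrix: 10 alloy contents (mo, si, sn, zr, mn, cr, fe, ni, cu, n)
X = rng.normal(size=(n, 10))
true_beta = np.array([0.5, 0.0, 0.2, -0.3, 0.4, 0.1, 1.0, 0.0, 0.0, 0.25])
v = 2.0 + X @ true_beta + rng.normal(scale=0.1, size=n)  # response (vanadium)

# Ordinary least squares for v = B0 + b1*mo + ... + b10*n
A = np.column_stack([np.ones(n), X])         # prepend the intercept column
coef, *_ = np.linalg.lstsq(A, v, rcond=None)

# R-squared of the fit, the adequacy criterion used in the text
resid = v - A @ coef
r2 = 1 - resid.var() / v.var()
```

With an R-squared above 50% the text treats the linear model as adequate; on this synthetic data the fit is nearly exact because the true relation is linear.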
218 M. B. Pour et al.
These results¹ indicate that two chemical components, si and cu, are not significant. This means the model can be simplified by removing si and cu. We regenerated the regression model after removing these two alloys in turn; the results are shown in Table 4.
¹ Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1. Multiple R-squared: 0.5526, Adjusted R-squared: 0.5504.
In this section, using the random forest method, a nonlinear counterpart of the multiple regression explained in the previous section, we verify the results of the multiple linear regression and find the most important elements for predicting vanadium as the response variable. The random forest algorithm can be used for both classification and regression problems [1].
The results are depicted in Table 6 and show that the main chemical components for predicting the vanadium content are fe, mn, zr, and sn, in that order.
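Predictor-importance rankings of this kind can be illustrated with permutation importance, a generic, model-agnostic stand-in for random forest importances (the data and coefficients below are synthetic, not the study's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))            # stand-ins for fe, mn, zr, sn
beta = np.array([1.0, 0.6, 0.3, 0.1])  # fe strongest ... sn weakest
y = X @ beta + rng.normal(scale=0.1, size=n)

# Fit OLS once, then measure how much the error grows when each
# predictor column is shuffled (destroying its information)
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)

def mse(Xm):
    return np.mean((y - coef[0] - Xm @ coef[1:]) ** 2)

base = mse(X)
imp = []
for j in range(4):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    imp.append(mse(Xp) - base)

# Importance order recovers the true strength: fe > mn > zr > sn
assert np.argsort(imp)[::-1].tolist() == [0, 1, 2, 3]
```

Random forest importances are computed differently (from impurity decrease or out-of-bag permutation), but they rank predictors by the same idea: how much predictive power is lost when a variable is scrambled.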
Table 7. Comparison of linear regression (LR) and random forest (RF) results for vanadium.

Predictor   RF rank   LR rank
Mo          -         3
Si          -         -
Sn          4         -
Zr          3         -
Mn          2         2
Cr          -         -
Fe          1         1
Ni          -         -
Cu          -         -
N           -         4

Priority result: the first two items (Fe, Mn) have the same priority in both methods.
5 Conclusion
In this research, we have presented the results for vanadium, one of the main response variables. The results for the other four response variables, aluminum (Al), carbon (C), oxygen (O2), and titanium (Ti), are shown in Tables 8, 9, 10, 11, 12 and 13 in the Appendix. After plotting and visualizing the collected dataset, we observed correlations between the variables, which encouraged us to fit multiple linear regression models to the dataset. The multiple linear regression results show that there are linear relationships between some of the variables. The linear regression model fits the data well enough to predict titanium and vanadium, since R-squared exceeds 50% for each of those models. For the other responses, aluminum, carbon, and oxygen, linear regression does not fit the data effectively and is not an efficient statistical model, because R-squared is below 50%.
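The R-squared criterion above can be checked with an ordinary least-squares fit; the synthetic data below are placeholders for the plant's composition records.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))           # three hypothetical predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=200)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r2 = 1.0 - ((y - A @ coef) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# The rule of thumb used above: the linear model is adequate only if R-squared > 0.5.
print(round(r2, 3), r2 > 0.5)
```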
To find out whether there are nonlinear relationships between the variables, and to prioritize the most important predictors of every response, we ran the random forest over the general dataset. The results show that multiple linear regression and random forest agree on the first two main predictors for titanium, vanadium, and oxygen, and on the first main predictor for carbon and aluminum over the general dataset. The main predictors are iron and nickel for titanium content, iron and manganese for vanadium content, and nitrogen and silicon for oxygen.
Therefore, looking over the analyzed general dataset results, we can say that linear regression and random forest agree on the first two main predictors of titanium content (iron and nickel) and of oxygen content (nitrogen and silicon). For the other responses, more research is needed to find the main predictors of each. Also, we studied the chemical behavior of all the different products only under the umbrella of the general dataset. We recommend that future researchers repeat this study for every single product (low vanadium, low carbon, low oxygen, etc.) to see whether the results are the same. Working on the other products and preparing a prediction algorithm based on the analysis of historical data is also recommended.
Based on these initial results, we are planning to apply unsupervised learning methods to the general dataset as well; the results will be published in future papers. The knowledge generated during this research, by analyzing the historical data of a ferrotitanium producer, can serve as a hint for metallurgists and materials science researchers to carry out complementary studies on the technical reasons for each correlation identified in this research.
The structure of this research can serve as a guide for future researchers to apply to other ferroalloy producers, in order to build a robust understanding of material reactions based on historical data observations and statistical calculations.
Appendix
Table 8. All dependent and independent variables over the general dataset.

Variable   Response   Predictor
Al         √
Mo                    √
Si                    √
Sn                    √
Zr                    √
Mn                    √
Cr                    √
V          √
Fe                    √
C          √
Ni                    √
Cu                    √
N                     √
O          √
Ti         √
Table 10. Comparison of linear regression (LR) and random forest (RF) results for titanium.

Response: titanium
Predictor   RF   LR
Mo          -    -
Si          -    -
Sn          -    -
Zr          -    -
Mn          3    3
Cr          -    -
Fe          1    1
Ni          2    2
Cu          -    -
N           4    4
Priority result: same priorities.
Table 11. Comparison of linear regression (LR) and random forest (RF) results for aluminum.

Response: aluminum
Predictor   RF   LR
Mo          4    4
Si          -    -
Sn          -    -
Zr          3    -
Mn          -    2
Cr          -    -
Fe          1    1
Ni          -    -
Cu          2    3
N           -    -
Priority result: not the same for the second and third items.
Table 12. Comparison of linear regression (LR) and random forest (RF) results for carbon.

Response: carbon
Predictor   RF   LR
Mo          -    -
Si          1    1
Sn          -    -
Zr          3    -
Mn          -    -
Cr          2    -
Fe          -    4
Ni          -    -
Cu          4    2
N           -    3
Priority result: only the first item has the same priority in both methods.
Table 13. Comparison of linear regression (LR) and random forest (RF) results for oxygen.

Response: oxygen
Predictor   RF   LR
Mo          -    -
Si          2    2
Sn          -    -
Zr          -    -
Mn          3    4
Cr          4    -
Fe          -    3
Ni          -    -
Cu          -    -
N           1    1
Priority result: the first two items are the same.
References
1. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
2. Chen, Q., Wynne, R.J., Goulding, P., Sandoz, D.: The application of principal component
analysis and kernel density estimation to enhance process monitoring. Control Eng. Pract.
8(5), 531–543 (2000)
3. Holappa, L.: Towards sustainability in ferroalloy production. J. South Afr. Inst. Min. Metall.
110(12), 703–710 (2010)
4. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning, vol.
112. Springer (2013)
5. Kano, M., Hasebe, S., Hashimoto, I., Ohno, H.: A new multivariate statistical process
monitoring method using principal component analysis. Comput. Chem. Eng. 25(7–8),
1103–1113 (2001)
6. Liu, Y.-J., André, S., Saint Cristau, L., Lagresle, S., Hannas, Z., Calvosa, É., Devos, O.,
Duponchel, L.: Multivariate statistical process control (MSPC) using Raman spectroscopy
for in-line culture cell monitoring considering time-varying batches synchronized with
correlation optimized warping (COW). Anal. Chim. Acta 952, 9–17 (2017)
7. Panigrahi, M., Paramguru, R.K., Gupta, R.C., Shibata, E., Nakamura, T.: An overview of production of titanium and an attempt to titanium production with ferro-titanium. High Temp. Mater. Processes (Lond.) 29(5–6), 495–514 (2011). https://doi.org/10.1515/HTMP.2010.29.5-6.495
8. Patel, S.V., Jokhakar, V.N.: A random forest based machine learning approach for mild steel
defect diagnosis. In: 2016 IEEE International Conference on Computational Intelligence and
Computing Research (ICCIC), pp. 1–8. IEEE (2016)
9. Rogalewicz, M.: Some notes on multivariate statistical process control. Manag. Prod. Eng.
Rev. 3(4), 80–86 (2012)
10. Rogalewicz, M., Poznańska, P.: The methodology of controlling manufacturing processes
with the use of multivariate statistical process control tools. J. Trends Dev. Mach. Assoc.
Technol. 17(1), 89–93 (2013)
Assessing the Effectiveness of Topic Modeling Algorithms in Discovering Generic Label with Description
1 Introduction
In our previous study [9], we proposed a model to find a generic label for the polynomial topics over text documents. In that study, we used only the LDA model to generate topic models. Here, we use the same model for generating the generic label and use LDA, LSI and NMF for training the model. We ran these models over short text documents and measured the WUP similarity of each label. By comparing the WUP similarities for each model, we conclude that the exact labels can be found using the LDA model. Though we performed this experiment on short documents, it can also be applied to large texts or documents.
The rest of the paper is organized as follows: related work is discussed in Sect. 2, followed by the research methodology and the result and discussion in Sects. 3 and 4, respectively. Finally, Sect. 5 concludes the paper.
2 Related Work
The task of topic labelling was introduced by [14] for LDA topics as an unsupervised approach. The authors in [12] presented an approach for topic labelling. First, they generated a set of candidate labels from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. They then ranked the label candidates using a combination of association measures, lexical features and an information retrieval feature. Natural language processing (NLP) is also used in topic modeling and labeling methods. Using Word2Vec word embeddings, [1] labeled online news considering the article terms and comments. The authors in [16] proposed the topic2vec approach for representing topics with words. Using Wikipedia document titles as label candidates, the authors in [3] presented an approach to select the most relevant labels for topics by computing neural embeddings for documents and words. A graph-based method for topic labelling was developed in [10] using the structured data exposed by DBpedia. In [2], the authors introduced a framework which applies summarization algorithms to generate topic labels. Using embeddings and letter trigrams, a novel method for topic labelling was proposed by [11].
Though different approaches have been proposed in several studies, to the best of our knowledge there is no prior evidence assessing these models for generating topic labels.
3 Research Methodology
The overall process of our research activities is described in this section. First of all, we chose our dataset1. For the dataset, we selected some online documents on which to perform our research process. Doing so involves several steps: step-by-step pre-processing, noun-phrase choosing, N-gram extraction, model training, label processing and label description with the help of WordNet synsets. We then derive the topic label based on our topic-model clustering result and compare three topic representations: (1) the clustering result, (2) the topic labels and (3) the label descriptions. Figure 1 shows an overview of our research experiment.
1 https://github.com/sadirahman/Effectiveness-of-Topic-ModelingAlgorithms-in-Discovering-Generic-Label-withDescription.
226 S. Rahman et al.
Stop Words Removing: Stop words are the most common words in a language, such as “about”, “an”, “the”, “is”, “into”. These words carry little meaning and are removed from the text documents of our dataset.
After preprocessing, we keep only the nouns and proper nouns from the pre-processed result. Under this approach, the topic is represented by the nouns with the largest frequency in the text corpus. For noun-phrase choosing, the text is first tokenized to split out the words. The tokenized text is then tagged with parts of speech: NN (nouns), NNP (proper nouns), VB (verbs), JJ (adjectives), etc. Parts-of-speech (POS) tagging is done before lemmatizing and stop-word removal; the stop words are removed after POS tagging. In the final stage, words, together with their tags and counts, are put in a hash table, and the most frequent nouns are selected from it to create a heading for a text. The resulting noun-phrase words are presented in Table 1.
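The noun-selection step above can be sketched as follows; the (word, tag) pairs are assumed to come from a POS tagger such as NLTK's, so a tiny pre-tagged sentence and stop list stand in here.

```python
from collections import Counter

STOP_WORDS = {"about", "an", "the", "is", "into", "a", "to", "in"}

# (word, POS tag) pairs, as a tagger such as nltk.pos_tag would return them.
tagged = [("The", "DT"), ("topic", "NN"), ("model", "NN"), ("assigns", "VBZ"),
          ("a", "DT"), ("label", "NN"), ("to", "TO"), ("each", "DT"),
          ("topic", "NN"), ("in", "IN"), ("WordNet", "NNP")]

# Keep nouns (NN) and proper nouns (NNP), drop stop words, then count.
nouns = [w.lower() for w, tag in tagged
         if tag in ("NN", "NNP") and w.lower() not in STOP_WORDS]
top = Counter(nouns).most_common(3)
print(top)  # the most frequent nouns become heading candidates
```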
3.3 N-Gram
In this step, after training the models, we take the topics from the documents and keep only the top-weighted word of each topic, because it is the most robust representative of its topic set. Then we look up the semantic definitions of this word in WordNet [15], the lexical database of English parts of speech. We choose the definition that is most appropriate within its phrases. After pre-processing the definition, we prepare the candidate labels and score each candidate label against the main topic word with the WUP [18] conceptual semantic-relatedness measure. We then begin our WUP similarity process for labeling. Figure 3 shows the WUP similarity process for labeling our topics.
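The WUP measure [18] scores two concepts by the depth of their lowest common subsumer, wup(a, b) = 2·depth(lcs) / (depth(a) + depth(b)). A minimal sketch over a hand-made is-a taxonomy (a toy stand-in for WordNet):

```python
# Toy is-a taxonomy: child -> parent. Hypothetical, stands in for WordNet.
PARENT = {"dog": "canine", "canine": "mammal", "cat": "feline",
          "feline": "mammal", "mammal": "entity"}

def path_to_root(node):
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path  # e.g. dog -> canine -> mammal -> entity

def depth(node):
    return len(path_to_root(node))  # the root has depth 1

def wup(a, b):
    ancestors_a = set(path_to_root(a))
    # Lowest common subsumer: first ancestor of b that also subsumes a.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

print(wup("dog", "cat"))  # 2*2 / (4+4) = 0.5
```

In practice the same score comes from NLTK's WordNet interface; this sketch only makes the formula explicit.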
4 Result and Discussion
In this section, we discuss the overall results of our research work. We selected the top three words for each model and took the top word, i.e., the one with the highest weighted value. We then obtain the description of the top-weighted word using the WordNet synsets. After that, we pre-process the discovered description and choose the candidate labels of each description using noun phrases and N-grams.
Finally, we select the label by comparing the WUP similarity between each candidate label and the top-weighted word. Tables 6, 7, and 8 show the details of selecting the labels of three documents using LSI, LDA and NMF, respectively.
To choose the final label of each topic, the WUP similarity values are used. The WUP similarity values are shown in Tables 9, 10 and 11 for the LSI, LDA and NMF models, respectively.
Topic Modeling Algorithms 231
After choosing the topic labels, we attach a description based on the topic labels from the WordNet synsets, together with a synset definition and a synset example. Table 13 shows the label-word descriptions with label-word examples. After obtaining the WUP similarity values of each label, we average them. Table 12 shows that LDA achieves the best accuracy (71%) compared with the other two models.
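Averaging the WUP scores per model and picking the best can be sketched as below; the score lists are made-up numbers chosen only to mirror the reported 71% for LDA, not our measured values.

```python
# Hypothetical WUP similarity scores of the chosen label per document.
wup_scores = {
    "LDA": [0.74, 0.69, 0.70],
    "LSI": [0.61, 0.58, 0.66],
    "NMF": [0.55, 0.63, 0.60],
}

# Average per model, then select the model with the highest mean score.
averages = {m: sum(v) / len(v) for m, v in wup_scores.items()}
best = max(averages, key=averages.get)
print(best, round(averages[best], 2))  # → LDA 0.71
```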
Table 10. WUP similarity between topic and label with LDA
Table 11. WUP similarity between topic and label with NMF
5 Conclusion
The main objective of this research is to find the most relevant topic label with a description. We used LDA, LSI and NMF to train our models. For each model, we determined the top three words and selected the top-weighted word. Using the WordNet synsets of the lexical database, we obtained the description of each selected top word and generated the candidate labels for each of the words. By comparing the WUP similarities between the candidate labels and the top words, we selected the most exact label for the topics. After analyzing the WUP similarities of each model, we found that LDA gives the most accurate topic label. This work will help others to choose the best model when labeling topics. This research can be extended to find topics and labels for large documents and can also be refined with further optimizations.
References
1. Aker, A., Paramita, M., Kurtic, E., Funk, A., Barker, E., Hepple, M., Gaizauskas,
R.: Automatic label generation for news comment clusters. In: Proceedings of the
9th International Natural Language Generation Conference, pp. 61–69 (2016)
2. Basave, A.E.C., He, Y., Xu, R.: Automatic labelling of topic models learned from
twitter by summarisation. In: Proceedings of the 52nd Annual Meeting of the
Association for Computational Linguistics (Volume 2: Short Papers), pp. 618–624
(2014)
3. Bhatia, S., Lau, J.H., Baldwin, T.: Automatic labelling of topics with neural
embeddings. arXiv preprint arXiv:1612.05340 (2016)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
5. Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based
n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992)
6. Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., Blei, D.M.: Reading tea
leaves: how humans interpret topic models. In: Advances in Neural Information
Processing Systems, pp. 288–296 (2009)
7. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Index-
ing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
8. Hofmann, T.: Probabilistic latent semantic analysis. In: Proceedings of the Fif-
teenth Conference on Uncertainty in Artificial Intelligence, pp. 289–296. Morgan
Kaufmann Publishers Inc. (1999)
9. Hossain, S.S., Ul-Hassan, R., Rahman, S.: Polynomial topic distribution with topic
modeling for generic labeling. In: Communications in Computer and Information
Science, vol. 1046, pp. 413–419. Springer (2019)
10. Hulpus, I., Hayes, C., Karnstedt, M., Greene, D.: Unsupervised graph-based topic
labelling using DBPedia. In: Proceedings of the Sixth ACM International Confer-
ence on Web Search and Data Mining, pp. 465–474. ACM (2013)
11. Kou, W., Li, F., Baldwin, T.: Automatic labelling of topic models using word
vectors and letter trigram vectors. In: AIRS, pp. 253–264. Springer (2015)
12. Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic
models. In: Proceedings of the 49th Annual Meeting of the Association for Com-
putational Linguistics: Human Language Technologies, vol. 1, pp. 1536–1545. Asso-
ciation for Computational Linguistics (2011)
13. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix fac-
torization. Nature 401(6755), 788 (1999)
14. Mei, Q., Shen, X., Zhai, C.X.: Automatic labeling of multinomial topic models. In:
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 490–499. ACM (2007)
15. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11),
39–41 (1995)
16. Niu, L., Dai, X., Zhang, J., Chen, J.: Topic2Vec: learning distributed representa-
tions of topics. In: 2015 International Conference on Asian Language Processing
(IALP), pp. 193–196. IEEE (2015)
17. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Sharing clusters among related
groups: hierarchical Dirichlet processes. In: Advances in Neural Information Pro-
cessing Systems, pp. 1385–1392 (2005)
18. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the
32nd Annual Meeting on Association for Computational Linguistics, pp. 133–138.
Association for Computational Linguistics (1994)
BeagleTM: An Adaptable Text Mining Method for Relationship Discovery in Literature
Oliver Bonham-Carter
1 Introduction
When performing a literature review using search algorithms, locating articles is difficult due to the obscure nature of the necessary keywords. For any project, one must supply the literature's search engines with keywords that are closely associated with one's research. Correct keywords are carefully selected to isolate relevant works; however, there are three general problems inherent in locating knowledge based on the non-uniform links provided by the diverse authors of the articles.
© Springer Nature Switzerland AG 2020. K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 237–256, 2020. https://doi.org/10.1007/978-3-030-39442-4_19
238 O. Bonham-Carter
The first major problem is that the terms for the particular desired information may not follow a popular convention of keyword naming. Some authors choose words which are no longer current in their fields; their articles are hence found with other non-contemporary research due to an antiquated use of language. In other cases, authors of seminal articles may invent their own terms to describe the details of their work. This implies that one must know an exact term, or a particular usage of words by specific authors, to locate their articles. Meanwhile, other research teams may be working along similar research themes, yet use entirely different keywords to put their work in a scientific context. Therefore, when locating articles across different researchers, multiple sets of specific keywords must be applied to search engines to retrieve articles from a particular area of research.
The second general problem concerns the growth that many research areas enjoy as a result of their popularity. As an area of research evolves, some of its terminology, including keywords, may gradually be replaced. The natural evolution of a research field may cause disconnections between current and former work, and to continue to locate new developments one must be knowledgeable of the modernized keywords in addition to the former ones. In a single field of research, we already see a widening gap between different generations of knowledge as a function of its evolution: if former keywords are queried in search engines, one may find only the former research while the latest literature remains undiscovered. In Fig. 1, we use terms from network research to illustrate this phenomenon of keyword obsolescence resulting from field evolution.
The third general problem of searching for articles by keywords is that, although particular knowledge is likely to exist in the literature, specific insights may only be found in articles with completely different keywords. For instance, a particular fact or detail may be briefly mentioned in one or several articles whose associated keywords are irrelevant to one's research interests. The discovery of knowledge in seemingly unrelated articles is occasional and unpredictable. Often, the researcher must read numerous articles with diverse keywords to discover pieces of valuable knowledge, becoming familiar with many types of alternative research from which to derive the header and footnote knowledge woven into an informed literature review.
In bioinformatics, the keywords of articles often do not describe the total wealth of information contained within the article. For instance, a gene or a protein may appear in many types of articles although no formal keyword declares its inclusion in a particular article. This lack of information requires one to be familiar with many types of articles, in addition to those which directly concern one's field of research.
BeagleTM: Relationship Discovery 239
In this article, we present a text mining framework and method to assist investigators in locating articles while countering the annoyances caused by the three general problems discussed above, which would otherwise persist to inhibit knowledge discovery. We have applied this method to develop a tool called BeagleTM which performs text mining for researchers without being limited to the sometimes cryptic keywords of article meta-data. Our method processes the abstract sections of articles provided by PubMed to locate information that may be aggregated with that of other articles to infer relationships, which are output as visual networks. We also provide the details of our method and discuss how selected keywords are used to drive the network-creation system. Finally, we explain how to read the resulting relationship networks to obtain knowledge of the connections from the processed articles.
The text of an article's abstract relates the goals and context of the information in the article. Since space in the abstract section is limited, this text is often written exactly and unambiguously, and it is likely to be a better source of information than what can be inferred from keywords alone. Our approach is to process the abstracts of the corpus articles to connect their words and concepts to those of abstracts in other articles. The user inputs chosen terms into our algorithm, which are then used as focus points in the process of creating customized networks that describe relationships.
When multiple keywords are found together in the same abstract, one may assume that some common thread runs between them. In this article, we propose a text mining method called BeagleTM which permits the investigator to locate the common threads of multiple keywords found in the article abstracts of the PubMed literature. Furthermore, our method creates relationship networks (i.e., graphical visualizations) of keywords to visually describe how they are associated with each other according to the peer-reviewed articles of the literature. In Sect. 1.2 we discuss our approach to text mining and our method's ability to link terms.
Text mining tools have been successfully applied to extract information for convenient use (text summarization, document retrieval), assess document similarity (document clustering, key-phrase identification), extract structured information (entity extraction, information extraction) [6], and extract social media information [7]. Additionally, text mining tools also exist as plug-ins or libraries for programming languages, such as TM [8] and Rattle [9], and as online tools such as [10].
While text mining tools fulfill important needs for the bioinformatics community, they are generally hosted on web sites and their automation in pipelines may be problematic. Furthermore, many of these tools find specific details from particular articles and do not infer associations between search terms. Tools such as PubTator [11], PIE The Search [12] and Meshable [13] are useful for bringing articles to the attention of the investigator where keywords have been highlighted,
from the author's work is shown in Fig. 4. The models created by BeagleTM describe the connections between the predefined terms automatically; the exact causality behind a relationship must be explored in the article by Marcelli et al. Observing the created plot may help researchers determine the relevance of an article at a glance.
2 Methods
Once the keywords have been defined for a text mining operation, BeagleTM scans all abstracts of the corpus to find their occurrences across articles, as shown in Fig. 5. A relationship implies that these keywords are relevant to a study in which they play a role. Several keywords found together in an article likely signify a central theme that binds them together. In bioinformatics, for example, learning that a particular protein and a stress appear in the same article strongly suggests that the protein has been studied in some context of that stress. If a type of PTM is also mentioned in the text, then there is reason to suggest that it may be a part of the stress response for the protein. If several distinct articles are uncovered in which these same keywords are found together, or are linked by edges in relationship networks, then again this may point to a deeper relation or, perhaps, a common mechanism.
In general, by studying keywords and how they are found in the literature, we may gather strong evidence to suggest that they share some form of a relationship.
Fig. 5. Flowchart: The abstracts of each article are individually parsed for pre-
selected keywords. The results of this parsing are organized in specific database tables.
keywords changes. We note that working with such an enormous number of files may introduce bottlenecks; however, our analysis was completed on solid-state drives, which enabled elevated performance without compromising the execution time. In addition, any time lost during text processing is recovered when the database programming (discussed in Sect. 2.1) is applied.
Since each article from NCBI has a unique PMID number (the identification reference for PubMed citations) that acts as a primary key for the database programming element of the method (discussed in Sect. 2.1), all encountered keywords are recorded with the PMID number. The associations between keywords are made by connecting their sources through these PMID numbers. Our method connects keywords to each other by finding the intersections of their PMIDs and uses databases to manage this task.
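The PMID-intersection step can be sketched with plain set operations; the keyword-to-PMID index below is a hypothetical fragment patterned on the PMIDs discussed later in Sect. 3.

```python
from itertools import combinations

# Hypothetical index built during parsing: keyword -> PMIDs where it occurs.
index = {
    "SOD1": {"22384126", "23983902", "25998424"},
    "oxidative stress": {"23983902"},
    "Parkinson's": {"22384126", "23983902"},
}

# Two keywords are associated if at least one article (PMID) mentions both.
edges = {(a, b): index[a] & index[b]
         for a, b in combinations(sorted(index), 2)
         if index[a] & index[b]}
for pair, pmids in sorted(edges.items()):
    print(pair, sorted(pmids))
```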
Fig. 6. Article meta-data: Each article is downloaded from NCBI as an nxml-formatted file. BeagleTM parses each file for specific types of information to be stored in its internal SQLite database. This information is indicated by the arrows (i.e., the PMID, the title of the journal, the title of the article, and the abstract block). The text of this abstract [17] is parsed for information relevant to the keywords.
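Extracting those meta-data fields can be sketched with the standard library; the element names below are assumptions about the nxml layout for illustration, not its exact schema.

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for a PubMed nxml record; real files are far richer.
NXML = """<article>
  <front>
    <article-id pub-id-type="pmid">23983902</article-id>
    <journal-title>Example Journal</journal-title>
    <article-title>SOD1 under oxidative stress</article-title>
    <abstract>SOD1 is studied in the context of ROS and oxidative stress.</abstract>
  </front>
</article>"""

root = ET.fromstring(NXML)
record = {
    "pmid": root.findtext(".//article-id[@pub-id-type='pmid']"),
    "journal": root.findtext(".//journal-title"),
    "title": root.findtext(".//article-title"),
    "abstract": root.findtext(".//abstract"),
}
print(record["pmid"], "--", record["title"])
```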
Fig. 7. Extracted data: For each article in which relevant terms are found, referential details (such as the PMID, the article sources, and the blurb of text in which the term is found) are inserted into the database for further analysis with advanced queries.
file with which our tool works. SQLite3 also provides a convenient way to set up and import data from BeagleTM processing by simple scripts to build the tool's database on seemingly any hardware. Our tool extracts information about each occurrence of a keyword as it is encountered in an article (i.e., the PMID number, the article references, the occurrence number, and its associated blurb of text) to be inserted into the database, as shown in Fig. 7.
The database is used to perform stringent queries to locate keywords and associated data according to matching PMID numbers. We have customized our tool and database to perform analyses of articles for PTM, stress and protein keywords, as discussed above, to help explain its function. Our method has six main tables, including: Functional (containing the functional origins of types of proteins), MitoSymbols, PTMGeneral (general PTM names such as acetylation or phosphorylation) and PTMSpecific (containing specific types of PTMs which are actually subsets of more general types; phosphotyrosine, for example, is a type of phosphorylation). The Stress table concerns the factors to which the proteins had been exposed, according to the articles. In Table 1, we provide the details of two of these tables, and note that all tables have a similar construction.
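A minimal sketch of this storage scheme using Python's built-in sqlite3; the table and column names are illustrative, patterned on the description of Table 1, and a composite key is used here so that one article can contribute several keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real tool writes to a database file
conn.execute("""
    CREATE TABLE Stress (
        pmid    TEXT NOT NULL,
        keyword TEXT NOT NULL,
        blurb   TEXT NOT NULL,
        PRIMARY KEY (pmid, keyword)
    )""")
conn.executemany("INSERT INTO Stress VALUES (?, ?, ?)", [
    ("23983902", "oxidative stress", "...oxidative damage induced by ROS..."),
    ("23983902", "SOD1", "...SOD1 for its connection to NRF2..."),
    ("25998424", "SOD1", "...acidosis, oxidation, acetylation and SOD1..."),
])

# Keywords are associated when they share a PMID: a self-join on the table.
pairs = conn.execute("""
    SELECT a.keyword, b.keyword, a.pmid
    FROM Stress AS a JOIN Stress AS b
      ON a.pmid = b.pmid AND a.keyword < b.keyword
""").fetchall()
print(pairs)
```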
2.2 Networks
Across all networks, an edge between two nodes signifies that at least one peer-reviewed study exists in which the two keywords are mentioned together (i.e., the keywords share a common PMID number). A clique in a network indicates that its keywords originated from the same abstract. To find the associations of keywords bound by cliques, BeagleTM queries all keywords having the same PMID number. In our tool, this output is then relayed to the NetworkX plotting tool [18] to create the relationship networks. In Sect. 3, we discuss the specific results of the networks that suggest interrelationships according to our method. We note that when all these cliques are shown together, it may be difficult to differentiate a particular
Table 1. Database schemas: Here we provide the SQLite3 code used to create two of the tables in our database. The creation code for the other tables is similar. The integrity constraint NOT NULL was necessary to ensure that the relation for each article was complete. PubMed's PMID, an article reference number, is assigned by the NIH National Library of Medicine and functions as a primary key.
clique from another. To ascertain the members of a specific clique, including the article PMIDs, one may consult the non-graphical data (i.e., the output provided by BeagleTM, not shown) from which the networks are made.
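Relaying keyword cliques to NetworkX [18], as described above, can be sketched as follows; the cliques shown are illustrative, patterned on the PMIDs of Sect. 3.

```python
from itertools import combinations
import networkx as nx

# Each abstract (PMID) yields a clique over its co-occurring keywords.
cliques = {
    "25998424": ["acidosis", "oxidation", "acetylation", "SOD1"],
    "23983902": ["SOD1", "ROS", "Parkinson's"],
}

G = nx.Graph()
for pmid, keywords in cliques.items():
    for a, b in combinations(keywords, 2):
        G.add_edge(a, b, pmid=pmid)  # edge = at least one shared article

print(sorted(G.neighbors("SOD1")))
# nx.draw(G) would render the relationship network (omitted here).
```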
Table 2. Sample keywords: Below are the four main rubrics for our curated keywords, from which we built relationship networks. The total number of keywords is current as of April 2018. We note that the number of terms increases in tandem with the expansion of the PubMed corpus.
been linked to several diverse disorders such as Alzheimer's, Parkinson's, diabetes and other neurodegenerative ailments, as discussed in [22] (PMID: 22384126). There are also links to types of stresses: ROS (reactive oxygen species), oxidative stress, general stress, toxins and others, which have been introduced by the articles. From this observation, we note that SOD1 is likely involved with these disorders and stresses, since there is at least one peer-reviewed study in which these edge-connected nodes are mentioned in the same article.
From the network, we note the article by Milani et al. [23] (PMID: 23983902), in which the authors discuss the involvement of oxidative damage induced by ROS in Parkinson's disease and amyotrophic lateral sclerosis. This suggests the roles played by these actors. Furthermore, the authors study SOD1 for its connection to NRF2, a transcriptional factor and master regulator of the expression of many antioxidant/detoxification genes. With the exception of the NRF2 protein and the discussion of amyotrophic lateral sclerosis, this relationship to SOD1 may readily be observed from the network itself. We reserve judgment on NRF2 (an important neuroprotective protein in neurodegenerative diseases), which may be deeply connected to the stresses and ailments in the network of Fig. 8.
Also in Fig. 8, we explore the article (PMID: 25998424) by Collins et al. [24], and we note that the keywords of the network, {acidosis, oxidation, acetylation and SOD1}, share a commonality. According to the article, our network infers that the actors of the article, acidosis, SOD1 and ROS, share a relationship. From the simplicity of the network, these relationships may be used to determine that SOD1 can be related to the stress of ROS. In addition, since the authors mention ROS specifically, we may infer that other oxidative stresses are likely to play roles, given the discussion of redox in the article. It is interesting to note that if, upon consultation of the article, there were no discussion of oxidation, one might hypothesize that such a relationship may eventually be discovered.
Fig. 8. Protein clique: Relationship model of SOD1, which has been found according to the literature to share a relationship with Alzheimer's, Parkinson's disease, and others. Several node types are featured in this model: the square represents the single protein to which every other node is related; the circles and pentagons denote the PMIDs and stresses; and the triangles denote the disorders that have documented relationships to the other nodes. All edges denote that terms are connected by at least one common article.
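The co-mention structure described in the caption can be sketched with the networkX library [18] on which the tool builds. The node names and co-mention records below are illustrative only, drawn from the two articles discussed above; they are not the tool's actual data structures.

```python
import networkx as nx

# Hypothetical co-mention records: (PMID, terms mentioned in the same abstract),
# drawn from the two articles discussed above.
comentions = [
    ("23983902", ["SOD1", "ROS", "Parkinson's disease"]),
    ("25998424", ["SOD1", "acidosis", "oxidation"]),
]

G = nx.Graph()
G.add_node("SOD1", kind="protein")              # the square in Fig. 8
for pmid, terms in comentions:
    G.add_node(pmid, kind="article")            # a circle
    for term in terms:
        if term != "SOD1":
            G.add_node(term, kind="keyword")    # a stress or disorder node
        G.add_edge(pmid, term)                  # edge = at least one shared article

# Terms that reach SOD1 through a shared article:
related = {t for pmid in G.neighbors("SOD1") for t in G.neighbors(pmid)} - {"SOD1"}
print(sorted(related))
```

Reading the network then amounts to walking two hops from the protein node through an article node, exactly as done for SOD1 and ROS in the text.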
In Fig. 9, we note that ageing shares a relationship with the stresses ROS, oxidative stress, pollutants, and others. A relationship is also shared with PTMs such as thyroxine (the main hormone secreted into the bloodstream by the thyroid gland), methionine sulfoxide, lactic acid, and triiodothyronine, a thyroid hormone that plays vital roles in the body's metabolic rate.
More specifically, when exploring the article (PMID: 27199942) by Bastard et al. [25], we note that the actual clique for this article is composed of {ageing, stress, tolerance}. The article concerns the ability of the Gram-positive bacterial species Oenococcus oeni, which is used in the production of wine to reduce acidity, to tolerate stresses through the formation of biofilms or planktonic cells. The article provides examples of relationships where stress and tolerance play major roles in the study.
In Fig. 10, which has been reduced from its full size to facilitate discussion, we note that Alzheimer's is related to stress types such as {heat shock, oxidation, oxidative stress, reactive oxygen species, (stress) tolerance, and others}. We have customized our output to show only peer-reviewed journals in which relationships between Alzheimer's and the stress factors are described.
250 O. Bonham-Carter
Fig. 9. Functional clique: The red circles represent the PMID numbers for PubMed articles, the blue squares indicate PTMs, the green triangles denote the stress-factors, and the mustard pentagons correspond to the ailment by name, to which all elements are related by the literature. We note that all these terms are related by peer-reviewed studies; however, we must return to the PMID of each clique to determine the nature of the relationship. We show the summary plot of text mining tasks for the keyword aging.
Fig. 10. Functional clique: The red circles represent the PMID numbers for PubMed articles, the blue squares indicate stress-factors, the green triangles denote the journal names, and the mustard pentagon corresponds to the ailment by name. This network allows us to study which types of journals are featuring unique types of research. We show the summary plot of text mining tasks for the keyword Alzheimer's Disease. This relationship network is a subset of the entire network, which was too densely populated to be legible.
which are likely working together for a process for disorders such as Parkinson’s
[28] and discussed in [29], [30], [20], investigators may also wish to consult PTM
relationship networks, such as that of Fig. 12 (glycosylation), to gain a fuller
understanding of other PTMs that may work in tandem.
Some literature reviews may begin with a study of stress factors and PTMs to determine the effects and/or potential onsets of disorders. In such a case, the study of stresses in conjunction with a particular PTM would lead the research team to articles where potential disorders are explored and where stresses and PTMs are integral components. To determine some of the disorders which may result from exposure to a particular stress and PTM, we created the relationship networks of Fig. 13 (reduced in size to facilitate discussion). In the relationship networks of the figure, we note that oxidative stress has been linked to {apoptosis, diabetes, heart disease, obesity, and others}, in concert with PTMs such as methionine sulfoxide (oxidation), nitrated tyrosine (nitration), thyroxine (iodination), and others. More information about the nature of each PTM in this network is available from UniProt at http://www.uniprot.org/docs/ptmlist.
Fig. 11. PTM clique: The red circles represent the PMID numbers for PubMed articles, the blue square indicates a PTM, the green triangles denote stresses, and the mustard pentagon corresponds to the ailment by name. This relationship network is from the study of the keyword acetylation in light of stress-factors and associated disorders.
Fig. 12. PTM clique: The red circles represent the PMID numbers for PubMed articles, the blue square indicates a PTM, the green triangles denote stresses, and the mustard pentagon corresponds to the ailment by name. This relationship network is from the study of the keyword glycosylation in light of stress-factors and associated disorders.
Fig. 13. Stress clique: The red circles represent the PMID numbers for PubMed articles, the blue square indicates a stress, the green triangles denote ailments, and the mustard pentagons correspond to the PTMs. This relationship network is from the study of the keyword oxidative stress, in light of PTMs and associated disorders.
4 Conclusion
Due to the problems associated with attaching keywords to articles, we noted that text mining may be an appropriate remedy to help researchers find concepts in a literature that has no apparent uniform method for writing keywords. In addition, even when keywords for articles exist, they may not suggest the full depth of knowledge that their articles contain. In this study, we discussed examples where our method and tool, BeagleTM, was used to extract relationships between PTMs, stress-factors, and proteins that may be involved in disorders. We described how to read relationship networks that suggest connections between keywords and allow researchers to obtain knowledge from the literature.
During the discussion of the technicalities of the method itself, we discussed
how our method is able to process a corpus of arbitrary size since it parses one
article’s abstract at a time. Since abstracts are excellent representations of the
entire work, we used the articles’ abstracts as the inputs to our tool. However, our
method and tool will work similarly on any size of text, including a full article.
We discussed that the method of determining commonalities across keywords
revolves around the idea that each article in our corpus (supplied by PubMed)
is automatically assigned a PMID number. When a keyword is located, the keyword and its reference details, together with the PMID number, are inserted into a local SQL database. Robust SQL queries can then be utilized to determine data
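The keyword-to-PMID storage and commonality query described above can be sketched with Python's built-in sqlite3 module. The schema and column names below are hypothetical stand-ins, not BeagleTM's actual ones; the sample rows use PMIDs and titles from the articles cited earlier.

```python
import sqlite3

# Hypothetical schema in the spirit of the described tool: one row per
# keyword hit, carrying the PMID and reference details of the article.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hits (keyword TEXT, pmid TEXT, title TEXT)")
con.executemany("INSERT INTO hits VALUES (?, ?, ?)", [
    ("SOD1", "23983902", "SOD1 and DJ-1 converge at Nrf2 pathway"),
    ("ROS", "23983902", "SOD1 and DJ-1 converge at Nrf2 pathway"),
    ("SOD1", "25998424", "Resveratrol and N-acetylcysteine influence redox balance"),
])

# Two keywords share a relationship if at least one article mentions both.
rows = con.execute("""
    SELECT a.keyword, b.keyword, a.pmid
    FROM hits AS a JOIN hits AS b
      ON a.pmid = b.pmid AND a.keyword < b.keyword
""").fetchall()
print(rows)   # [('ROS', 'SOD1', '23983902')]
```

A self-join on the PMID column is one simple way such "robust SQL queries" can surface edge-connected terms for the relationship networks.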
Acknowledgment. I would like to thank Janyl Jumadinova for her help in proofing
this manuscript.
References
1. Splendiani, A., Donato, M., Drăghici, S.: Ontologies for bioinformatics. In: Springer
Handbook of Bio-/Neuroinformatics, pp. 441–461. Springer, Heidelberg (2014)
2. Schouten, K., Frasincar, F., Dekker, R., Riezebos, M.: Heracles: a framework for
developing and evaluating text mining algorithms. Expert Syst. Appl. 127, 68–84
(2019)
3. Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B.,
Kochut, K.: A brief survey of text mining: classification, clustering and extraction
techniques. arXiv preprint arXiv:1707.02919 (2017)
4. Sharma, S., Srivastava, S.K.: Review on text mining algorithms. Int. J. Comput.
Appl. 134(8), 39–43 (2016)
5. Lamurias, A., Couto, F.M.: Text mining for bioinformatics using biomedical litera-
ture. In: Encyclopedia of Bioinformatics and Computational Biology, vol. 1 (2019)
6. Paynter, R., Bañez, L.L., Berliner, E., Erinoff, E., Lege-Matsuura, J., Potter, S.,
Uhl, S.: EPC methods: an exploration of the use of text-mining software in sys-
tematic reviews (2016)
7. Maynard, D., Roberts, I., Greenwood, M.A., Rout, D., Bontcheva, K.: A framework
for real-time semantic social media analysis. Web Seman.: Sci. Serv. Agents World
Wide Web 44, 75–88 (2017)
8. Feinerer, I.: Introduction to the tm package text mining in R (2017)
9. Williams, G.J., et al.: Rattle: a data mining GUI for R. R J. 1(2), 45–55 (2009)
10. Müller, H.-M., Van Auken, K.M., Li, Y., Sternberg, P.: Textpresso central: a cus-
tomizable platform for searching, text mining, viewing, and curating biomedical
literature. BMC Bioinform. 19(1), 94 (2018)
11. Wei, C.-H., Kao, H.-Y., Lu, Z.: PubTator: a web-based text mining tool for assisting
biocuration. Nucleic Acids Res. 44, gkt441 (2013)
12. Kim, S., Kwon, D., Shin, S.-Y., Wilbur, W.J.: PIE the search: searching PubMed
literature for protein interaction information. Bioinformatics 28(4), 597–598 (2011)
13. Kim, S., Yeganova, L., Wilbur, W.J.: Meshable: searching pubmed abstracts by
utilizing mesh and mesh-derived topical terms. Bioinformatics 32(19), 3044–3046
(2016)
14. Papadopoulou, P., Lytras, M., Marouli, C.: Bioinformatics as applied to medicine:
challenges faced moving from big data to smart data to wise data. In: Applying
Big Data Analytics in Bioinformatics and Medicine, pp. 1–25. IGI Global (2018)
15. NCBI Resource Coordinators: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 45(D1), D12 (2017)
16. Marcelli, S., Corbo, M., Iannuzzi, F., Negri, L., Blandini, F., Nisticò, R., Feligioni,
M.: The involvement of post-translational modifications in Alzheimer’s disease.
Curr. Alzheimer Res. 15, 313–335 (2017)
17. Hunnicut, J., Liu, Y., Richardson, A., Salmon, A.B.: MsrA overexpression targeted
to the mitochondria, but not cytosol, preserves insulin sensitivity in diet-induced
obese mice. PloS One 10(10), e0139844 (2015)
18. Schult, D.A., Swart, P.: Exploring network structure, dynamics, and function using
networkX. In: Proceedings of the 7th Python in Science Conferences (SciPy 2008),
vol. 2008, pp. 11–16 (2008)
19. Bonham-Carter, O., Pedersen, J., Bastola, D.: A content and structural assessment
of oxidative motifs across a diverse set of life forms. Comput. Biol. Med. 53, 179–
189 (2014)
20. Bonham-Carter, O., Pedersen, J., Najjar, L., Bastola, D.: Modeling the effects of
microgravity on oxidation in mitochondria: a protein damage assessment across a
diverse set of life forms. In: IEEE Data Mining Workshop (ICDMW), pp. 250–257.
IEEE (2013)
21. Thygesen, C., Boll, I., Finsen, B., Modzel, M., Larsen, M.R.: Characterizing
disease-associated changes in post-translational modifications by mass spectrome-
try. Expert Rev. Proteomics 15(3), 245–258 (2018)
22. Li, Y., Chigurupati, S., Holloway, H.W., Mughal, M., Tweedie, D., Bruestle, D.A.,
Mattson, M.P., Wang, Y., Harvey, B.K., Ray, B., et al.: Exendin-4 ameliorates
motor neuron degeneration in cellular and animal models of amyotrophic lateral
sclerosis. PLoS One 7(2), e32008 (2012)
23. Milani, P., Ambrosi, G., Gammoh, O., Blandini, F., Cereda, C.: SOD1 and DJ-1
converge at Nrf2 pathway: a clue for antioxidant therapeutic potential in neurode-
generation. Oxidative Med. Cell. Longevity 2013 (2013)
24. Collins, J.A., Moots, R.J., Clegg, P.D., Milner, P.I.: Resveratrol and n-
acetylcysteine influence redox balance in equine articular chondrocytes under acidic
and very low oxygen conditions. Free Radical Biol. Med. 86, 57–64 (2015)
25. Bastard, A., Coelho, C., Briandet, R., Canette, A., Gougeon, R., Alexandre, H.,
Guzzo, J., Weidmann, S.: Effect of biofilm formation by Oenococcus oeni on malo-
lactic fermentation and the release of aromatic compounds in wine. Front. Micro-
biol. 7, 613 (2016)
26. Millan, M.J.: The epigenetic dimension of Alzheimer’s disease: causal, consequence,
or curiosity? Dialogues Clin. Neurosci. 16(3), 373 (2014)
27. Ansari, A., Rahman, M., Saha, S.K., Saikot, F.K., Deep, A., Kim, K.-H., et al.:
Function of the SIRT3 mitochondrial deacetylase in cellular physiology, cancer,
and neurodegenerative disease. Aging Cell 16(1), 4–16 (2017)
28. Ferrer, I.: Early involvement of the cerebral cortex in Parkinson’s disease: conver-
gence of multiple metabolic defects. Progress Neurobiol. 88(2), 89–103 (2009)
29. Stetz, G., Tse, A., Verkhivker, G.M.: Dissecting structure-encoded determinants
of allosteric cross-talk between post-translational modification sites in the Hsp90
chaperones. Sci. Rep. 8(1), 6899 (2018)
30. Bonham-Carter, O., Thapa, I., Bastola, D.: Evidence of post translational mod-
ification bias extracted from the tRNA and corresponding amino acid interplay
across a set of diverse organisms. In: Proceedings of the 5th ACM Conference
on Bioinformatics, Computational Biology, and Health Informatics, pp. 774–781.
ACM (2014)
Comparison of Imputation Methods
for Missing Values in Air Pollution Data:
Case Study on Sydney Air Quality Index
Abstract. Missing values in air quality data may lead to a substantial amount
of bias and inefficiency in modeling. In this paper, we discuss six methods for
dealing with missing values in univariate time series and compare their per-
formances. The methods we discuss here are Mean Imputation, Spline Inter-
polation, Simple Moving Average, Exponentially Weighted Moving Average,
Kalman Smoothing on Structural Time Series Models and Kalman Smoothing
on Autoregressive Integrated Moving Average (ARIMA) models. The performances of these methods were compared using three performance measures: Mean Squared Error, Coefficient of Determination and the Index of
Agreement. Kalman Smoothing on Structural Time Series method is the best
method among the methods considered, for imputing missing values in the
context of air quality data under Missing Completely at Random (MCAR)
mechanism. Kalman Smoothing on ARIMA, and Exponentially Weighted
Moving Average methods also perform considerably well. Performance of
Spline Interpolation decreases drastically with increased percentage of missing
values. Mean Imputation performs reasonably well for smaller percentages of missing values; however, all the other methods outperform Mean Imputation regardless of the number of missing values.
1 Introduction
Air quality data is widely used in models for various purposes including assessing the
impact of air quality on health and wellbeing. It is common to expect a large number of
missing values in data sources collected using sensors. Missing values of air quality
data may lead to underestimation of the associated health effects. In general, missing values create problems by introducing a substantial amount of bias and reducing the efficiency of analysis [1]. Although many methods have been developed for missing value imputation, methods for time series data are still in their infancy. The inherent autocorrelation, trend, seasonality, and cyclic effects make the process more difficult and challenging. However, it is necessary to impute missing values in certain situations to make more accurate predictions. Therefore, it is an important area of
research. In this study, we focus on the air pollution data in the Sydney region of
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 257–269, 2020.
https://doi.org/10.1007/978-3-030-39442-4_20
258 W. M. L. K. N. Wijesekara and L. Liyanage
Australia. Figure 1 summarizes the percentage of missing data in air pollutant variables
at two monitoring stations (Liverpool and Rozelle) from 1994 to 2018.
Fig. 1. Missing value percentages of pollutant variables in two monitoring stations in Sydney
Substantial percentages of missing values are present in almost all pollutant vari-
ables. Liverpool station has lower percentages of missing data than Rozelle. Even
though only two stations are presented here, all other stations also showed a consid-
erable percentage of missing values which we cannot disregard. Therefore, a proper
mechanism to impute these missing values is essential before any type of modelling.
In this paper, we discuss six well established methods of dealing with missing
values in a univariate time series context and compare their performance on imputing
missing values for air quality data in the Sydney region. The methods discussed here
are Mean Imputation, Spline Interpolation, Simple Moving Average, Exponentially
Weighted Moving Average, Kalman Smoothing on Structural Time Series Models and
Kalman Smoothing on ARIMA models. The performances of these methods were
compared with three performance measures: Mean Squared Error (MSE), Coefficient of
Determination (R2) and Index of Agreement (d). The objective of this paper is to
identify the best available imputing method for air quality data in the Sydney region.
2 Literature Review
A variety of methods, ranging from simple methods such as mean imputation to
advanced methods such as Long Short Term Memory (LSTM) Recurrent Neural Net-
works have been applied to impute missing values in the context of air pollution data.
Mean imputation methods have performed well in most of the situations where the
percentage of missing values is as low as 5% and especially in single imputations [2, 3].
Other widely used methods include interpolations (linear, quadratic and cubic), Nearest
Neighbor (NN) [2, 4], Regression-based methods [4, 5], Self-Organizing Maps
Comparison of Imputation Methods for Missing Values in Air Pollution Data 259
(SOM) and Multi-Layer Perceptron (MLP) [4]. When the data can be formulated as a
multivariate normal time series, the Expectation-Maximization (EM) based methods
appeared to perform well [6]. Moreover, attempts have been made to combine the power of neural networks and fuzzy logic in handling missing air quality data [7, 8]. These methods have been recommended for nonlinear and complex phenomena. One such
method is the hybrid approach of Multiple Imputation (MI) and the Adaptive Neuro-Fuzzy Inference System (ANFIS). Recently, deep learning techniques such as LSTM Recurrent Neural Networks have also been used for missing value imputation in air quality data [9]. However, the simplest technique, mean imputation, is still dominant in this area and is considered the best method in some scenarios.
The most commonly used performance measures for comparing missing data
imputation methods are Mean Absolute Error (MAE), Root Mean Square Error
(RMSE) and the Coefficient of Determination (R2). The Index of Agreement (d) has also been used in some studies. There is no single universal method to measure the
performance of imputation techniques. Methods widely depend on the nature of data
and distribution of missing values.
There are three types of missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing Not at Random (MNAR) [10]. In the MCAR scenario, the missingness is independent of both
observable and unobservable parameters of interest. In MAR, there is a systematic
relationship between the propensity of a value to be missing and the observed data
while in MNAR there is a relationship between the propensity of a value to be missing
and its unobserved value.
3 Methodology
Mean Imputation
This is the most commonly used single imputation technique, where the missing values are replaced with the mean value of the variable. The mean of a series of values $y_1, y_2, \ldots, y_n$ is given by

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \qquad (1)$$
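A minimal sketch of Eq. (1) applied as an imputation rule (using NumPy, with NaN marking a missing value):

```python
import numpy as np

def mean_impute(y):
    """Replace missing values (NaN) with the mean of the observed values, Eq. (1)."""
    y = np.asarray(y, dtype=float)
    filled = y.copy()
    filled[np.isnan(y)] = np.nanmean(y)   # nanmean averages only observed values
    return filled

print(mean_impute([2.0, np.nan, 4.0]))    # [2. 3. 4.]
```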
Spline Interpolation
For $n+1$ pairs of observations $\{(t_i, y_i) : i = 0, 1, \ldots, n\}$, the shape of the spline is modeled by interpolating between each pair of observations $(t_{i-1}, y_{i-1})$ and $(t_i, y_i)$ with polynomials

$$y = q_i(t), \quad i = 1, 2, \ldots, n \qquad (2)$$
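Assuming SciPy is available, spline imputation per Eq. (2) can be sketched as fitting a cubic spline through the observed pairs and evaluating it at the missing time points:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def spline_impute(t, y):
    """Fit a cubic spline through the observed (t_i, y_i) pairs, Eq. (2),
    and evaluate it at the time points where y is missing."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    obs = ~np.isnan(y)
    spline = CubicSpline(t[obs], y[obs])
    filled = y.copy()
    filled[~obs] = spline(t[~obs])
    return filled

t = np.arange(5.0)
y = np.array([0.0, 1.0, np.nan, 9.0, 16.0])   # t**2 with one value missing
print(spline_impute(t, y))                     # the gap is filled with 4.0
```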
$$y_{ma} = \frac{1}{n} \sum_{i=0}^{n-1} y_{t-(n-i)} \qquad (3)$$
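A two-sided nearest-neighbour variant of the moving-average idea in Eq. (3) can be sketched as follows. The authors used the imputeTS R package for this step; the window logic here is a simplified illustrative stand-in, not imputeTS's exact algorithm.

```python
import numpy as np

def moving_average_impute(y, n=2):
    """Fill each missing value with the mean of the 2n nearest observed
    values (a two-sided variant of the window average in Eq. (3))."""
    y = np.asarray(y, dtype=float)
    filled = y.copy()
    obs_idx = np.flatnonzero(~np.isnan(y))
    for t in np.flatnonzero(np.isnan(y)):
        # indices of the 2n observed points closest to the gap position t
        nearest = obs_idx[np.argsort(np.abs(obs_idx - t))[:2 * n]]
        filled[t] = y[nearest].mean()
    return filled

print(moving_average_impute([1.0, 2.0, np.nan, 4.0, 5.0], n=1))   # gap -> 3.0
```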
$$y_t = c + \sum_{i=1}^{p} \varphi_i\, y_{t-i} + e_t + \sum_{j=1}^{q} \theta_j\, e_{t-j} \qquad (5)$$
where the $\varphi_i$ are the autoregressive parameters, the $\theta_j$ are the moving average parameters to be estimated, and the $e_t$ are a series of unknown random errors assumed to follow a normal distribution.
A time series that needs to be differenced to become stationary is said to be an "integrated" version of a stationary series. In an ARIMA(p, d, q) model, the number of autoregressive terms, the number of non-seasonal differences, and the number of lagged forecast errors in the prediction equation are denoted by p, d and q, respectively.
Structural Time Series Models
All linear time series have a state space representation. This representation relates the
disturbance vector $\{\varepsilon_t\}$ to the observation vector $\{y_t\}$ via a Markov process $\{\alpha_t\}$. A convenient expression of the state space form is

$$y_t = Z_t \alpha_t + \varepsilon_t, \quad \varepsilon_t \sim N(0, H_t); \qquad \alpha_t = T_t \alpha_{t-1} + R_t \eta_t, \quad \eta_t \sim N(0, Q_t), \quad t = 1, \ldots, n \qquad (6)$$
Kalman Smoothing
Kalman filter calculates the mean and variance of the unobserved state, given the
observations. This filter is a recursive algorithm; the current best estimate is updated
whenever a new observation is obtained. Kalman Smoothing takes the form of a
backwards recursion and it can be used to compute smoothed estimator of the dis-
turbance vector [11].
The R package, “imputeTS” was used for Kalman smoothing and moving average
imputations [12].
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (P_i - O_i)^2 \qquad (7)$$

where $n$ is the number of imputations, $O_i$ is the observed data point, $P_i$ is the imputed data point, $\bar{O}$, $\bar{P}$ are the means, and $\sigma_O$, $\sigma_P$ are the standard deviations of the observed and imputed data, respectively [4].
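The three performance measures can be sketched directly from their definitions. R2 is computed here as the squared Pearson correlation, and the Index of Agreement follows Willmott's usual form; both are stated in terms of the quantities named above.

```python
import numpy as np

def mse(O, P):
    """Mean Squared Error, Eq. (7)."""
    O, P = np.asarray(O, float), np.asarray(P, float)
    return float(np.mean((P - O) ** 2))

def r_squared(O, P):
    """Coefficient of Determination, here the squared Pearson correlation."""
    return float(np.corrcoef(O, P)[0, 1] ** 2)

def index_of_agreement(O, P):
    """Willmott's Index of Agreement d in [0, 1]; 1 means perfect agreement."""
    O, P = np.asarray(O, float), np.asarray(P, float)
    num = np.sum((P - O) ** 2)
    den = np.sum((np.abs(P - O.mean()) + np.abs(O - O.mean())) ** 2)
    return float(1.0 - num / den)

O = np.array([1.0, 2.0, 3.0, 4.0])
print(mse(O, O), index_of_agreement(O, O))   # 0.0 1.0 for a perfect imputation
```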
Fig. 2. Percentage of missing values of air pollutants at Liverpool station over the time from
1994 to 2018
In order to select a reference time series to carry out missing value simulations and
to use as the ground truth in the comparison of imputation methods, the Air Quality
Index (a standard index calculated by incorporating all the air pollutants) recorded at
each station was considered. Figure 3 shows a heat map of the number of missing values
in the hourly air quality index for each monitoring station from 1994-01-01 01:00:00
AEST to 2018-12-31 24:00:00 AEST.
Fig. 3. Number of missing values of the air quality monitoring stations in Sydney from 1994 to
2018
As can be seen in Fig. 3, the dataset suffers from the problem of missing values. The Liverpool, Richmond and Earlwood stations appeared to have fewer missing values. Therefore, these three stations were further analyzed, and a subset of Earlwood hourly air quality indices for the two-year period from 2014-01-01 01:00:00 AEST to 2015-12-31 24:00:00 AEST with no missing values (Fig. 4) was selected as the reference series.
Missing values for this series were created by artificially deleting observations under the Missing Completely at Random (MCAR) mechanism. Four scenarios were created where the percentages of missing values were 5%, 10%, 15% and 20%. The
missing values of each scenario were imputed by using the six methods; Mean
Imputation, Spline Interpolation, Simple Moving Average, Exponentially Weighted
Moving Average, Kalman Smoothing on Structural Time Series Models and Kalman
Smoothing on ARIMA models. Then the performance of each method in each scenario was assessed using the three performance measures: Mean Squared Error (MSE), Coefficient of Determination (R2) and Index of Agreement (d).
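The MCAR deletion step can be sketched as follows; the sinusoidal reference series is an illustrative stand-in for the actual Earlwood AQI data.

```python
import numpy as np

def make_mcar(y, frac, seed=0):
    """Delete a fraction of observations completely at random (MCAR)."""
    rng = np.random.default_rng(seed)
    y_miss = np.asarray(y, dtype=float).copy()
    n_delete = round(frac * y_miss.size)
    idx = rng.choice(y_miss.size, size=n_delete, replace=False)
    y_miss[idx] = np.nan
    return y_miss

# Stand-in for the Earlwood AQI reference series (17,517 hourly values in the paper).
reference = 50.0 + 50.0 * np.sin(np.linspace(0.0, 20.0, 1000))
for frac in (0.05, 0.10, 0.15, 0.20):
    scenario = make_mcar(reference, frac)
    print(frac, int(np.isnan(scenario).sum()))   # 50, 100, 150, 200 deleted
```

Because the deleted positions depend on neither the observed nor the unobserved values, the resulting gaps satisfy the MCAR assumption described in Sect. 2.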
Fig. 4. Distribution of hourly air quality data from 2014.01.01 01:00:00 AEST to 2015.12.31
24:00:00 AEST in Earlwood
Figure 5 shows the positions of missing values in the time series in each of the four scenarios mentioned in Sect. 3.3. The vertical red lines indicate the positions of
missing values. Only the first 1,000 observations out of 17,517 observations are dis-
played for ease of viewing.
Fig. 5. Distribution of missing values in the simulations for 5%, 10%, 15% and 20% of missing
values in the dataset
The six methods as stated in Sect. 3.3 were used to impute missing values which
were artificially deleted under simulations. The performance of each method was
evaluated with three measures MSE, R2, and d.
Table 1 shows the performance of the six methods measured by MSE. Since the performance of the Mean Imputation method was poor compared to the other methods, Fig. 6 compares only the other five methods.
The Kalman Smoothing on Structural Time Series method appeared to be the best
while Mean Imputation appeared to be the worst. When the percentage of missing
values increases, performance of all the methods decreases. Kalman Smoothing on
ARIMA models and Exponentially Weighted Moving Averages perform well for small
percentages of missing values.
Table 2 shows the performance of the six methods measured by R2. Figure 7
displays the performance of methods except Mean Imputation.
Table 2. R2 measures

Method | 5% | 10% | 15% | 20%
Spline Interpolation | 0.999 | 0.998 | 0.998 | 0.996
Kalman Smoothing Structural TS | 0.999 | 0.999 | 0.998 | 0.998
Kalman Smoothing ARIMA | 0.999 | 0.999 | 0.998 | 0.998
Simple MA | 0.999 | 0.998 | 0.997 | 0.996
Exp MA | 0.999 | 0.999 | 0.998 | 0.997
Mean | 0.973 | 0.944 | 0.918 | 0.893

Fig. 7. Comparison of R2 measures of each method for the 5%, 10%, 15% and 20% missing value scenarios
Again, the Kalman Smoothing on Structural Time Series method appears to be the best among the considered methods. The Kalman Smoothing on ARIMA and Exponentially Weighted Moving Average methods also perform well. However, once again the performance of all the methods decreases as the percentage of missing values increases.
Table 3 gives the performance of methods measured by Index of Agreement (d).
Figure 8 shows the performance of methods excluding Mean Imputation in order to
compare the other models clearly.
The Kalman Smoothing on Structural Time Series models and on ARIMA models perform equally well. Although Spline Interpolation performed well with a smaller percentage of missing values, its performance decreases drastically with increasing missing values. Except for Mean Imputation, all other methods show approximately equal performance. It is clear that the performance of all the methods decreases when the percentage of missing values increases.
Figure 9 exhibits the imputed values from the Kalman Smoothing on Structural
time series model for the four simulated scenarios. Red, green and blue represent the imputed values, actual values and known values, respectively. Again, only the first
1,000 observations out of 17,517 observations are presented for ease of viewing.
It can be seen that this method has performed extremely well under the MCAR missing mechanism in the air quality data. However, further studies must be carried out to compare the performance of these methods under the MAR and MNAR missing mechanisms. Also, here we have considered a subset of an observed series to artificially create missing values. When there are large numbers of missing values, these methods may produce sub-optimal results.
Fig. 9. Comparison of imputed data using Kalman Smoothing on Structural Time Series models
against the actual data for the 5%, 10%, 15% and 20% missing value scenarios
Among the six methods considered, Kalman Smoothing on Structural Time Series is the best method for imputing missing values in the context of air quality data where the missing mechanism is MCAR. Kalman Smoothing on ARIMA and Exponentially Weighted Moving Average methods also perform considerably well. The performance of Spline Interpolation decreases drastically with an increased percentage of missing values. Even though Mean Imputation performs reasonably well for smaller percentages of missing data, all the other five methods outperform this method regardless of the number of missing values. The six methods can be ranked from best to worst as: Kalman Smoothing on Structural Time Series Models, Kalman Smoothing on ARIMA models, Exponentially Weighted Moving Average, Simple Moving Average,
References
1. Nakagawa, S., Freckleton, R.P.: Missing inaction: the dangers of ignoring missing data.
Trends Ecol. Evol. 23(11), 592–596 (2008)
2. Norazian, M.N., et al.: Estimation of missing values in air pollution data using single
imputation techniques. ScienceAsia 34(3), 341–345 (2008)
3. Zakaria, N.A., Noor, N.M.: Imputation methods for filling missing data in urban air pollution
data for Malaysia. Urbanism 9(2), 159–166 (2018)
4. Junninen, H., et al.: Methods for imputation of missing values in air quality data sets. Atmos.
Environ. 38(18), 2895–2907 (2004)
5. Wyzga, R.E.: Note on a method to estimate missing pollution data. J. Air Pollut. Control Assoc. 23(3), 207–208 (1973)
6. Junger, W.L., de Leon, A.P.: Imputation of missing data in time series for air pollutants.
Atmos. Environ. 102, 96–104 (2015)
7. Lei, K.S., Wan, F.: Pre-processing for missing data: a hybrid approach to air pollution
prediction in Macau. In: Proceedings of the 2010 IEEE International Conference on
Automation and Logistics (2010)
8. Shahbazi, H., et al.: A novel regression imputation framework for Tehran air pollution
monitoring network using outputs from WRF and CAMx models. Atmos. Environ. 187, 24–
33 (2018)
9. Yuan, H.W., et al.: Imputation of missing data in time series for air pollutants using long
short-term memory recurrent neural networks. In: Proceedings of the 2018 ACM
International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings
of the 2018 ACM International Symposium on Wearable Computers (Ubicomp/Iswc 2018
Adjunct), pp. 1293–1300 (2018)
10. Rubright, J.D., Nandakumar, R., Glutting, J.J.: A simulation study of missing data with
multiple missing X’s. Pract. Assess. Res. Eval. 19(10) (2014)
11. Abril, J.C.: Structural time series models. In: Lovric, M. (ed.) International Encyclopedia of
Statistical Science, pp. 1555–1558. Springer, Heidelberg (2011)
12. Moritz, S., Bartz-Beielstein, T.: imputeTS: time series missing value imputation in R. R J. 9
(1), 207–218 (2017)
BERT Feature Based Model for Predicting
the Helpfulness Scores of Online
Customers Reviews
1 Introduction
Online shopping has grown exponentially across the world, and customer reviews play
an increasingly important role in helping shoppers make purchase decisions. These
reviews usually contain detailed information about the product, including product
descriptions, user experiences, and personalized suggestions. Given the large number
of reviews for specific products, an automated recommendation system incorporating
reviewer comments can be of great importance to consumers.
A lot of research work along this direction has been done, and many different tech-
niques, including regression methods and machine learning algorithms, have been used
to evaluate the helpfulness of product reviews. Liu et al. [1] developed a non-linear model incorporating linguistic features such as writing style and using a radial basis function for predicting helpfulness. Zhang et al. [2] proposed a regression model
using lexical subjective clues, lexical similarity features, and shallow syntactic features
to evaluate helpfulness. Mudambi and Schuff [3] built a regression model based on
c Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 270–281, 2020.
https://doi.org/10.1007/978-3-030-39442-4_21
BERT Model for Predicting Scores of Customer Reviews 271
their hypotheses that certain factors in a review influence its helpfulness, using those factors as explanatory variables in the regression model. Such explanatory variables, including such things as word count, can be unreliable in representing a review's helpfulness.
Park [4] analyzed psychological and linguistic features of reviews as explanatory variables in a linear model. However, linear and non-linear regression models based on limited explanatory variables have limited reliability and often fail to consider the complete review contents. Additionally, some of the models may require time-consuming pre-processing of data. Due to uncertainty in many such factors, the
prediction of product review helpfulness becomes a difficult and challenging research
problem. Using machine learning techniques to analyze customer reviews as a Natural
Language Processing (NLP) task can lead to the identification of linguistic indicators
between the review texts and their helpfulness. Some researchers [5–7] have success-
fully used machine learning methods to extract features for opinion mining, semantic
classification and sentiment classification. They showed that machine learning tech-
niques as well as traditional statistical methods can solve NLP tasks. With modern
computer technology, more information from the data can be extracted to reduce uncer-
tainty and to more fully consider available data during modeling and predicting.
Bidirectional Encoder Representations from Transformers (BERT) is a machine learning based NLP tool released by Google in late 2018 [8]. As a pre-trained language representation model that requires fine-tuning for different NLP downstream tasks, BERT obtained new state-of-the-art results on many NLP benchmarks such as the General Language Understanding Evaluation (GLUE). Researchers have already shown that BERT
can process product review data with high accuracy on tasks such as review reading comprehension and aspect-based sentiment analysis [9]. In that research, features extracted from a review by BERT are used to represent the review text, avoiding heavy pre-processing of the original data, in order to answer customers' questions based on product reviews. Hence, BERT is expected to be useful for the goal of predicting scores
of review helpfulness.
In this study, a neural network (NN) based model is developed with BERT features,
instead of explanatory variables, and is used to rank the helpfulness of product review
data collected from Amazon.com, using the ratio of helpful votes to total votes for each
review. This NN based tool is used to analyze the product review data by incorporating
BERT features. The proposed model predicts the helpfulness of customer reviews with
a ranking score by analyzing the review text, its star rating, and the product type. The
prediction should help consumers to make a better purchase decision. The remainder
of this paper is organized as follows. In Sect. 2, a brief introduction of related previous
research work is provided. Data preprocessing and the details of the model are pre-
sented in Sect. 3. An analysis and comparison of the proposed model’s results with one
that uses explanatory variables are included in Sect. 4. In the last section, results and
possibilities that could help improve the model in future research are discussed.
2 Related Work
2.1 Factors Affecting Review Helpfulness
Predicting a helpfulness score as a NLP task is quite challenging. There are many
factors that determine the helpfulness score of a review. Some of these are extractive
272 S. Xu et al.
information, such as the overall star rating for each product obtained directly from the
data, and others are abstractive information like linguistic features that are more difficult
to extract from the review text.
Star Rating and Product Type. The star rating is intuitive and can be extracted from
data directly. The star ratings for online product review usually are numerical values
and range from one to five stars. A one star rating reflects an extremely negative aspect
of the review; conversely, a five star rating shows an extremely positive aspect. For each
product, the star rating indicates the attitude from customers.
Previous research has evaluated the relative diagnosticity of review extremity (those
tending toward one star or toward five stars) [3]. The relationship between the numerical
star ratings and the actual helpfulness scores is difficult to establish. Past research has
shown that a moderate star rating in the mid-range has great credibility [10], and is often
more helpful to customers than an extreme one. On the contrary, other researchers found
that an extreme star rating became more helpful than a moderate rating for eBay sellers
[11] and books [12]. These contradictory findings do not provide an exact answer to the
question of which one is more helpful.
Depending on the nature of the product type being reviewed, the relative value of
moderate ratings and extreme ratings can differ. In 1970, Nelson [13] defined search
goods as products that consumers are able to get product quality information about
before purchasing them, and experience goods as products that require user experience
to establish product quality. Past research has found that customers are more skeptical of experience (subjective) product claims than of search (objective) claims [14]. Mudambi
and Schuff [3] found that prior research failed to take into consideration product types
in assessing moderate ratings versus extreme ratings. They indicated that moderate
reviews were more helpful than either extremely positive or extremely negative reviews
for an experience good, in the decision making stage. For a search good, the extreme
ratings were seen as more helpful than moderate ones.
Linguistic Features, Content of Reviews and Other Factors. Review details can
increase information availability of the product, and help consumers to determine
whether the product is good or not. However, this is not always the case. Consumers look for the specific information they want to know about the product. A review could be less helpful if it lacks that information, even if it contains many sentences and words. Many other similar factors affect the perceived helpfulness of reviews to consumers, depending on the situation.
Previous studies addressed some factors that affect review helpfulness. For the lin-
guistic aspect, a review with a high readability is likely to be accepted by customers [15–
17]. Mudambi and Schuff [3] indicated that review depth by word count has a positive
effect on review helpfulness depending on the product type. Ghose and Ipeirotis [16]
have shown that readability-based features affect review helpfulness. Other research [17] also supports the finding that readability is more important for review helpfulness than review length.
From the content aspect, the meaning of reviews was extracted using latent seman-
tic analysis (LSA) in a previous study by Cao et al. [18]. They showed that the semantic
features of a review have a greater effect on review helpfulness. Some research [12, 16]
found that reviews containing both subjective and objective information are more help-
ful to customers, compared to reviews containing subjective sentences only, or objective
sentences only. Sentiment features, such as words expressing positive or negative emotions, are also indicated as important elements of review helpfulness. Pan and Zhang
[15] found that positive reviews are more likely to receive helpful votes than negative
reviews. However, negative reviews also have a large impact on decision making for
customers. Kuan et al. [19] showed that negative reviews are more helpful than positive
reviews. Other research discussed sentiment features in more detail. [20, 21] found that
sentiment features affect review helpfulness.
In the past few years, researchers have tended toward using machine learning based
techniques, instead of traditional statistical based methods, for extracting features. The
study in [22] used deep learning technique to measure the helpfulness of hotel reviews
with user-provided photos. In [23], a convolutional neural network was applied to
extract information from the product reviews with auxiliary domain discriminators.
2.2 BERT
BERT is a fine-tuning based language representation model developed very recently. It
obtains state-of-the-art results on many NLP tasks such as the General Language Understanding Evaluation (GLUE) benchmark, Natural Language Inference (NLI), and the Corpus of Linguistic Acceptability (CoLA) [8]. The basic idea of BERT is to pre-train a language model on large-scale corpora using a transformer architecture [24]. The model
is trained bidirectionally through multiple layers. Compared to other language models
[25, 26], BERT is adapted for different end tasks through a fine-tuning approach. Thus,
BERT can feed any of a number of different downstream tasks, without changing its
pre-trained language model. BERT has two primary model sizes with different param-
eters:
BERT_Base: 12 layers, 768 hidden dimensions, 12 self-attention heads, and 110 million total parameters.
BERT_Large: 24 layers, 1024 hidden dimensions, 16 self-attention heads, and 340 million total parameters.
Compared to other machine learning based methods, BERT has the best performance in many different downstream task domains [8], and almost all of them are related to feature extraction. Feature extraction is the most important task in review helpfulness analysis. Previous studies showed that linguistic and psychological features influence review helpfulness, yet those features are difficult to measure with a limited number of explanatory variables. In this research, BERT is used as a front-end model to extract linguistic, psychological, and other features that are passed to the downstream task of determining review helpfulness.
Total votes: Number of total votes the review received. This is equal to the number of
helpful votes plus the number of not helpful votes.
Review body: The review text.
5000 reviews of each category are selected from the original data set by letting the
total number of votes be a control variable to filter the reviews. The helpfulness score
denoted by s is a percentage defined as follows.
s = helpful votes / total votes.    (1)
If the number of total votes is small, the helpfulness score is less stable than one based on a large number of total votes, and thus the helpfulness score of a review with few total votes could be biased.
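Equation (1), together with the total-vote filter just described, can be sketched in Python as follows (the function name and the `min_votes` parameter are ours, for illustration):

```python
def helpfulness_score(helpful_votes, total_votes, min_votes=30):
    """Helpfulness score s = helpful_votes / total_votes (Eq. 1).

    Reviews whose total vote count is not greater than min_votes are
    rejected, since scores based on few votes are unstable; the paper
    keeps reviews with more than 30 total votes.
    """
    if total_votes <= min_votes:
        return None  # too few votes for a stable score
    return helpful_votes / total_votes

print(helpfulness_score(24, 40))  # 0.6
print(helpfulness_score(3, 5))    # None: below the vote threshold
```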
10000 reviews were selected randomly (from those where the number of total votes
was greater than 30) from 2 different categories as listed in Table 1.
After filtering the data, the extracted data is divided into three subsets: 80% of the
data as the training set, 10% as the development set and another 10% as the testing set.
Therefore, for each category, there are 4000 reviews to train the regression model, 500
reviews to optimize the parameters of the model (development set), and 500 reviews for
testing. The second data set is used for comparison with previous research, and will be discussed in Sect. 4.2.
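The 80/10/10 split described above can be sketched as follows (the function name and fixed seed are illustrative):

```python
import random

def split_reviews(reviews, seed=0):
    """Shuffle and split reviews 80/10/10 into train/dev/test sets."""
    rng = random.Random(seed)
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

train, dev, test = split_reviews(range(5000))
print(len(train), len(dev), len(test))  # 4000 500 500
```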
Here s is the helpfulness score, f is the ReLU activation function, x is the input vector, and w is the weight vector of x, where the lengths of x and w are determined by the length of the input. b is the bias, a single value, since there is only one output, the predicted helpfulness score. Equation (2) can be extended in detail.
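A minimal sketch of this single-output layer (a plain dot product plus bias, followed by ReLU); the toy input, mixing two hypothetical BERT features with a star rating and a product-type indicator, and the weights are invented for illustration:

```python
def relu(z):
    """Rectified linear unit, the activation f in Eq. (2)."""
    return max(0.0, z)

def predict_score(x, w, b):
    """Single-output layer: s = f(w . x + b)."""
    assert len(x) == len(w)
    return relu(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy input: two hypothetical BERT features, a star rating, and a
# product-type indicator, with invented weights and bias.
x = [0.12, -0.40, 5.0, 1.0]
w = [0.30, 0.25, 0.08, 0.05]
print(predict_score(x, w, b=-0.02))
```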
Data Pre-processing. The files are generated for testing and comparison. The process is listed below:
Step 1: Select data to match the condition, for example, the number of total votes is
greater than 30.
Step 2: Convert the data file to a .tsv file.
Step 3: Randomly divide the data into training, development and test sets.
Step 4: Repeat step 3 until enough data is collected.
Neural Network Incorporating BERT. The general steps to run the code on TensorFlow are:
Step 1: Extract BERT features from the review text with the fine-tuned BERT model.
Step 2: Combine the BERT features with star rating and product type as the input to
the NN.
Step 3: Train the model in NN using training data and optimize the parameters with
the development set.
Step 4: Test the model and calculate the error.
Step 5: Repeat Steps 1 to 4 on 10 random data subsets, and calculate the average
error and standard deviation.
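Step 5 amounts to averaging the per-run errors and taking their sample standard deviation; applied to the ten per-run MSE values reported in Sect. 4, it reproduces the summary rows given there:

```python
import statistics

def summarize_trials(errors):
    """Average error and sample standard deviation over repeated runs (Step 5)."""
    return statistics.mean(errors), statistics.stdev(errors)

# Per-run MSE values from the first results column reported in Sect. 4.
mse_per_run = [0.05619, 0.05368, 0.05824, 0.06105, 0.06281,
               0.05478, 0.05458, 0.04820, 0.05474, 0.05135]
avg, sd = summarize_trials(mse_per_run)
print(round(avg, 5), round(sd, 5))  # 0.05556 0.00432
```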
4 Results
4.1 Prediction
The BERT pre-trained model was fine-tuned to extract BERT features by setting the hyperparameter values. The model was tested on a GeForce GTX 1080 GPU, with hyperparameters adjusted to avoid out-of-memory issues. The hyperparameters used are listed in Table 2. The regression model was tested 10 times for each category
Hyper parameter    Value
max_seq_length     144
train_batch_size   16
Model type         BERT_Base
Optimizer          Adam
with random subsets (training, development, and testing). First, the model was trained
and optimized with the training set and development set, then the helpfulness scores
were predicted for the test set. The prediction results of each test are listed in Tables 3
and 4. From the results, it can be seen that the BERT features worked well as variables in the regression model. The MSE differs across product categories. A comparison of these results with those from an explanatory variable based regression model is included in the next section.
4.2 Comparison
In this section, the exact same data as in [4] is used for comparison against the best results (best average MAE and smallest standard deviation) in that paper. The data
is collected from http://jmcauley.ucsd.edu/data/amazon/ and reviews with more than 10
total votes are selected, to match what that paper did (this data is different from the
testing data used in Sect. 3). Ten different data sets are generated randomly by using
these selected data. For each data set, 80% is chosen as the training set, 10% as the
development set, and another 10% as the test set. There are 520 reviews for each test
of Cellphone products and 836 reviews for each test of Beauty products, as was done
in [4], and the proposed model does not contain the product type (experience or search
goods) for this comparison. The mean absolute error (MAE) is given as follows.
MAE = (1/n) * Σ_{i=1..n} |s_i − ŝ_i|.    (5)
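Equation (5) in code (the scores here are invented, on the 0–100 percentage scale this comparison uses):

```python
def mean_absolute_error(actual, predicted):
    """MAE = (1/n) * sum of |s_i - s_hat_i| over the n test reviews (Eq. 5)."""
    assert len(actual) == len(predicted) and len(actual) > 0
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(mean_absolute_error([80.0, 50.0, 100.0], [70.0, 60.0, 90.0]))  # 10.0
```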
The results of comparison are listed in Tables 5 and 6 and Tables 7 and 8, respectively.
Data set   MSE (NN)
1          0.05619
2          0.05368
3          0.05824
4          0.06105
5          0.06281
6          0.05478
7          0.05458
8          0.04820
9          0.05474
10         0.05135
Average    0.05556
Std. dev.  0.00432

Data set   MAE (SVR)  MAE (M5P)
1          11.1354    12.1070
2          11.5486    12.5098
3          10.4736    11.7598
4          10.3240    10.6372
5          11.8700    12.2581
6          11.5203    11.8930
7          11.9216    12.4486
8          12.1916    11.8303
9          13.3505    12.2984
10         12.9721    12.6732
Average    11.7308    12.0415
Std. dev.  0.9178     0.5495
Data set   MAE (NN)
1          11.2604
2          12.0296
3          11.0899
4          11.9005
5          11.5507
6          11.4561
7          11.1970
8          10.5708
9          11.9430
10         11.5520
Average    11.3550
Std. dev.  0.4382

Data set   MAE (SVR)  MAE (RandF)
1          11.6135    11.9215
2          12.2497    12.6919
3          13.1854    13.2427
4          11.9664    12.2743
5          10.9451    11.5257
6          11.1959    11.8278
7          11.5863    12.1718
8          10.8911    11.3696
9          11.2281    12.1587
10         12.3415    12.5452
Average    11.7203    12.1729
Std. dev.  0.6862     0.5300
Comparison of the results is shown in Tables 5 and 6. Although the BERT feature based NN model did not achieve a better MAE in every subset tested, compared to the best support vector regression (SVR) explanatory variable regression model, the proposed NN model has a better average MAE across all 10 subsets, with nearly 9% improvement. The difference in the standard deviations also shows that the NN model is much more stable than the SVR model across different test datasets. The standard deviation of this model is even smaller than that of the explanatory variable based (M5P) model, which yielded the best standard deviation in Park's results.
Tables 7 and 8 show a similar comparison result using another dataset of a different
product category. The proposed NN model still resulted in the best average MAE and
the smallest standard deviation.
Much additional work can be done in the future to improve the prediction accuracy of this study. First, the model can be run on a TPU server with better hyperparameters to check for improvements. Second, a post-processing procedure can be considered to optimize the weights of BERT features, based on the specific product category of the reviews. Lastly, BERT features can be modeled with more advanced statistical computing techniques for use in review helpfulness prediction.
Acknowledgment. The authors are indebted to the anonymous reviewers for providing constructive comments and suggestions, which have resulted in improvements to both the readability and quality of the paper.
References
1. Liu, Y., Huang, X., An, A., Yu, X.: Modeling and predicting the helpfulness of online
reviews. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 443–452. IEEE
(2008)
2. Zhang, Z., Varadarajan, B.: Utility scoring of product reviews. In: Proceedings of the 15th
ACM International Conference on Information and Knowledge Management, pp. 51–57.
ACM (2006)
3. Mudambi, S.M., Schuff, D.: What makes a helpful review? A study of customer reviews on
Amazon.com. MIS Q. 34(1), 185–200 (2010)
4. Park, Y.J.: Predicting the helpfulness of online customer reviews across different product
types. Sustainability 10(6), 1735 (2018)
5. Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: opinion extraction and
semantic classification of product reviews. In: Proceedings of the 12th International Confer-
ence on World Wide Web, pp. 519–528. ACM (2003)
6. Read, J.: Using emoticons to reduce dependency in machine learning techniques for sen-
timent classification. In: Proceedings of the ACL Student Research Workshop, pp. 43–48.
Association for Computational Linguistics (2005)
7. Ye, Q., Zhang, Z., Law, R.: Sentiment classification of online reviews to travel destinations
by supervised machine learning approaches. Expert Syst. Appl. 36(3), 6527–6535 (2009)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
9. Xu, H., Liu, B., Shu, L., Yu, P.S.: BERT post-training for review reading comprehension and aspect-based sentiment analysis. arXiv preprint arXiv:1904.02232 (2019)
10. Eisend, M.: Two-sided advertising: a meta-analysis. Int. J. Res. Mark. 23(2), 187–198 (2006)
11. Pavlou, P.A., Dimoka, A.: The nature and role of feedback text comments in online market-
places: implications for trust building, price premiums, and seller differentiation. Inf. Syst.
Res. 17(4), 392–414 (2006)
12. Forman, C., Ghose, A., Wiesenfeld, B.: Examining the relationship between reviews and
sales: the role of reviewer identity disclosure in electronic markets. Inf. Syst. Res. 19(3),
291–313 (2008)
13. Nelson, P.: Information and consumer behavior. J. Polit. Econ. 78(2), 311–329 (1970)
14. Ford, G.T., Smith, D.B., Swasy, J.L.: Consumer skepticism of advertising claims: testing
hypotheses from economics of information. J. Consum. Res. 16(4), 433–441 (1990)
15. Pan, Y., Zhang, J.Q.: Born unequal: a study of the helpfulness of user-generated product
reviews. J. Retail. 87(4), 598–612 (2011)
16. Ghose, A., Ipeirotis, P.G.: Estimating the helpfulness and economic impact of product
reviews: mining text and reviewer characteristics. IEEE Trans. Knowl. Data Eng. 23(10),
1498–1512 (2010)
17. Korfiatis, N., Garcı́a-Bariocanal, E., Sánchez-Alonso, S.: Evaluating content quality and
helpfulness of online product reviews: the interplay of review helpfulness vs. review con-
tent. Electron. Commer. Res. Appl. 11(3), 205–217 (2012)
18. Cao, Q., Duan, W., Gan, Q.: Exploring determinants of voting for the helpfulness of online
user reviews: a text mining approach. Decis. Support Syst. 50(2), 511–521 (2011)
19. Kuan, K.K., Hui, K.L., Prasarnphanich, P., Lai, H.Y.: What makes a review voted? An empir-
ical investigation of review voting in online review systems. J. Assoc. Inf. Syst. 16(1), 48
(2015)
20. Yin, D., Bond, S., Zhang, H.: Anxious or angry? Effects of discrete emotions on the perceived
helpfulness of online reviews. MIS Q. 38(2), 539–560 (2014)
21. Ahmad, S.N., Laroche, M.: How do expressed emotions affect the helpfulness of a product
review? Evidence from reviews using latent semantic analysis. Int. J. Electron. Commer.
20(1), 76–111 (2015)
22. Ma, Y., Xiang, Z., Du, Q., Fan, W.: Effects of user-provided photos on hotel review helpfulness: an analytical approach with deep learning. Int. J. Hosp. Manag. 71, 120–131 (2018)
23. Chen, C., Yang, Y., Zhou, J., Li, X., Bao, F.S.: Cross-domain review helpfulness prediction
based on convolutional neural networks with auxiliary domain discriminators. In: Proceed-
ings of the 2018 Conference of the North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies (Short Papers), vol. 2, pp. 602–607
(2018)
24. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing
Systems, pp. 5998–6008 (2017)
25. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
26. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)
27. Bragge, J., Storgårds, J.: Utilizing text-mining tools to enrich traditional literature reviews.
case: digital games. In: Proceedings of the 30th Information Systems Research Seminar in
Scandinavia IRIS, pp. 1–24 (2007)
28. Weathers, D., Sharma, S., Wood, S.L.: Effects of online communication practices on con-
sumer perceptions of performance uncertainty for search and experience goods. J. Retail.
83(4), 393–401 (2007)
Evaluating Taxonomic Relationships
Using Semantic Similarity Measures
on Sensor Domain Ontologies
1 Introduction
The information on the current web grows day by day, and much of this information is generated without a structure that can be understood by both machines and humans, which makes it difficult to process.
The semantic web, proposed by Berners-Lee [3], seeks to give structure and knowledge to the conventional web. Ontologies are used to represent knowledge in a structured way on the semantic web.
An ontology is defined as “an explicit specification of a conceptualization” [9]. Conceptualization refers to an abstract model of some phenomenon of the world; the specification is explicit because the concepts and constraints used are explicitly defined.
Then an ontology is understood as a formal model that describes a particular
domain and specifies the concepts within the domain with relationships between
them.
Ontologies in the domain of sensor networks are mainly used to provide an accepted language that represents the definitions of sensors, properties, and taxonomies, and thus to improve data fusion and interoperability [2].
They are also used as a dictionary of terms and for the analysis of observed
data. It is for these reasons that several researchers have designed ontologies for
the semantic representation of sensor networks.
The evaluation of an ontology is an essential part of its construction process. Whether the ontology has been designed manually or automatically, it is necessary to evaluate its quality.
In this area, the proposals in the literature are classified depending on the form of evaluation used: comparing the ontology with a gold standard, applying the ontology in an application and evaluating the results, making comparisons against source data of the ontology's domain, and finally, evaluation by humans to determine which established criteria the ontology satisfies [4]. On
the other hand, Gómez-Pérez [8] presents two terms for the ontology evaluation:
verification and validation. The verification ensures that the definitions meet the
requirements correctly. The validation ensures that the meaning of the definitions
correctly models the phenomena of the world.
There are tools for the design of ontologies that include the use of reasoners for their evaluation, such as those integrated with the Protégé software¹. The reasoners allow the consistency criterion of an ontology to be evaluated.
If the ontology is large, the evaluation becomes an exhaustive task that
requires more man-hours by the engineer or knowledge expert. Given this prob-
lem, automatic methods have been proposed to carry out the evaluation of
ontologies. One aspect that is considered in the evaluation of ontologies and that
can be automated is the verification of taxonomic relationships. Semantic similarity measures the semantic closeness of one concept to another using a mathematical metric.
In this work, semantic similarity measures are applied to evaluate the taxonomic relationships of ontologies in the sensor or IoT domain. The aim of the experiment is to decide whether the evaluation of taxonomic relationships using similarity measures is an effective method for assessing the quality of ontologies in this domain. For this experiment, four ontologies are used: the MMI Device Ontology [22], OntoSensor [10], the CSIRO Sensor Ontology [17], and Intelligent Environment [5,6].
¹ https://protege.stanford.edu/.
284 M. T. Vidal et al.
The first three ontologies are based on the SSN ontology², an ontology created by the W3C Semantic Sensor Network Incubator group (SSN-XG) that describes sensors and their observations. The Intelligent Environment ontology was considered because it is an ontology for the semantic representation of objects, people, and interactions that occur in academic environments enabled by IoT. All are available for use.
The content of the present paper is divided in the following way. In Sect. 2,
the state of the art about semantic similarity measures based on information
content is briefly discussed. Section 3 shows the similarity measures used for
evaluating taxonomic relationships. Section 4 shows the algorithm to evaluate
taxonomic relationships in domain ontologies. In Sect. 5, the data set considered
and the evaluation of our approach is presented. Finally, conclusions and future
work are presented in Sect. 6.
2 Related Work
Some works related to this research are presented below.
In [7], an ontology is designed and implemented to describe the concepts
and relationships of data in sensor networks. This ontology aims to improve the
accuracy and recall of the sensor data sought. After constructing the proposed
ontology, an experimental evaluation was carried out using two tests: the subsumption test and verification of consistency. For this task, the RacerPro reasoner was used to verify the logical consistency of the ontology.
In [12], a tool was designed to allow ontologies to be validated against the
concepts and properties used in the SSN (Semantic Sensor Network) model of
W3C. It also generates validation reports and collects statistics regarding the
terms or concepts most used in ontologies. The authors used the tool to validate
a set of ontologies in which the SSN ontology of the W3C was used as a basis; the evaluation includes discovering noise errors and inconsistencies, checking syntax, and measuring similarity between the concepts of the W3C SSN ontology and the concepts of the
other ontologies. This tool can help developers identify which parts of the SSN ontology are most used, and thus focus on the most commonly used functions to create tools that improve interoperability between systems or applications in the sensor or IoT domain.
In [1], an ontology is proposed to allow semantic interoperability in several
fragmented test benches in the IoT domain. This ontology reuses basic concepts
of several popular ontologies and taxonomies, such as SSN, M3-Lite, WGS84,
IoT-Lite, and DUL. The ontology is supported by annotation of references and a
validation tool called Annotation Validator Tool (AVT), as well as best practices
and guides. AVT tries to adopt the SSN validator by applying the necessary
changes and checking semantic and syntactic problems together, compared to
the SSN validator, which only performs a syntactic validation. The elements that AVT reviews are: the inheritance relations of classes and properties, cardinality, and unexpected domains and ranges.
² https://www.w3.org/2005/Incubator/ssn/ssnx/ssn.
Evaluating Taxonomic Relationships Using Semantic Similarity Measures 285
The similarity measures used by the system to calculate the similarity of the
concepts that integrate semantic relations are detailed below.
Rada [19] proposed the Path semantic similarity measure, which uses the shortest path length between two concepts cu and cv in a taxonomy:

sim_path(cu, cv) = 1 / (1 + length(cu, cv)).    (1)
Wu and Palmer [26] defined a measure that computes the similarity of two concepts cu and cv using the shortest distance of each concept from the root and the distance of their lca (least common ancestor) clca from the root:

sim_wup(cu, cv) = 2 · depth(clca) / (depth(cu) + depth(cv)).    (2)
Li [13] formulated Eq. 3, in which the depth of both concepts cu and cv and of their lca (clca) are used to calculate similarity:

sim_li(cu, cv) = e^(−α·length(cu, cv)) · (e^(β·depth(clca)) − e^(−β·depth(clca))) / (e^(β·depth(clca)) + e^(−β·depth(clca))),    (3)

where α is a parameter that weights the length of the path and β is the parameter for the depth of the path. According to [13], the optimal value of α is 0.2 and of β is 0.6.
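The three structure-based measures of Eqs. 1–3 can be sketched over a toy taxonomy; the concept names below are illustrative, not taken from the evaluated ontologies, and the fraction in Eq. 3 is written as tanh(β·depth(clca)):

```python
import math

# Toy taxonomy as child -> parent (the root maps to None).
PARENT = {
    "Sensor": None,
    "PhysicalSensor": "Sensor",
    "ChemicalSensor": "Sensor",
    "Thermometer": "PhysicalSensor",
    "Barometer": "PhysicalSensor",
}

def ancestors(c):
    """Path from c up to the root, inclusive of c itself."""
    path = [c]
    while PARENT[c] is not None:
        c = PARENT[c]
        path.append(c)
    return path

def depth(c):
    """Depth of a concept, counting the root as depth 1."""
    return len(ancestors(c))

def lca(u, v):
    """Least common ancestor: deepest concept on both root paths."""
    au = set(ancestors(u))
    return next(c for c in ancestors(v) if c in au)

def length(u, v):
    """Shortest path between u and v in edges, routed through the lca."""
    c = lca(u, v)
    return (depth(u) - depth(c)) + (depth(v) - depth(c))

def sim_path(u, v):                      # Eq. (1), Rada
    return 1.0 / (1.0 + length(u, v))

def sim_wup(u, v):                       # Eq. (2), Wu and Palmer
    return 2.0 * depth(lca(u, v)) / (depth(u) + depth(v))

def sim_li(u, v, alpha=0.2, beta=0.6):   # Eq. (3), Li; the fraction is tanh
    return math.exp(-alpha * length(u, v)) * math.tanh(beta * depth(lca(u, v)))

print(sim_wup("Thermometer", "Barometer"))  # 2*2 / (3+3) ≈ 0.667
```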
Maedche and Staab. The measure proposed in [15], based on the Jaccard index, considers the characteristics, or set of ancestors, of a concept:

sim_CMatch(u, v) = |A(u) ∩ A(v)| / |A(u) ∪ A(v)|,    (4)

sim_RE(u, v) = |A(u) ∩ A(v)| / (γ|A(u) \ A(v)| + (1 − γ)|A(v) \ A(u)| + |A(u) ∩ A(v)|),    (5)

where γ ∈ [0, 1] is a parameter that adjusts the symmetry of the measure, A(u) is the set of ancestors of the concept u, and A(v) is the set of ancestors of the concept v.
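Equations (4) and (5) operate directly on ancestor sets and can be sketched as follows (the sets are illustrative):

```python
def sim_cmatch(Au, Av):
    """Jaccard-style overlap of the ancestor sets of two concepts (Eq. 4)."""
    return len(Au & Av) / len(Au | Av)

def sim_re(Au, Av, gamma=0.5):
    """Eq. (5); gamma in [0, 1] adjusts the symmetry of the measure."""
    inter = len(Au & Av)
    return inter / (gamma * len(Au - Av) + (1 - gamma) * len(Av - Au) + inter)

# Illustrative ancestor sets for two sibling concepts.
Au = {"Sensor", "PhysicalSensor", "Thermometer"}
Av = {"Sensor", "PhysicalSensor", "Barometer"}
print(sim_cmatch(Au, Av))  # 2/4 = 0.5
print(sim_re(Au, Av))      # 2 / (0.5 + 0.5 + 2) ≈ 0.667
```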
Lin. Lin [14] proposed a measure based on the work of Resnik [20], but Lin also
took into account the information content of the pair of concepts (u, v). The
measure is defined in Eq. (9).
Jiang and Conrath. The semantic similarity measure defined in [11] is derived from the edge-based proposal, adding information content to its decision factor. To compute this similarity, the following are used: the information content of both the parent concept and the child concept; the shortest path between concepts u and v; and their MICA. This similarity is defined in Eq. (10).
Mazandu and Mulder. This measure is proposed in [16]; the semantic similarity between two concepts is obtained by computing the MICA between them and taking the maximum IC of the two concepts, as shown in Eq. (12).
Zhou. This hybrid measure, proposed by Zhou et al., combines the information content based proposal of [11] with the shortest-distance calculation of the structure-based measures.
Razan et al. The HSS3 measure proposed in [18] introduces smoothing parameters such as the constant K, where K = 1/ND (ND is the total number of disorders in the original proposal; in this case, it was taken as the total number of taxonomic relationships). This measure also combines measures based on information content with measures based on structure. This measure is defined in Eq. 14:

sim_HSS3(u, v) = D · (IC(MICA(u, v)) + K) / (length(u, clcs) + length(v, clcs)).    (14)
The HSS4 measure, also proposed in [18], is similar to HSS3, but with the difference that it also considers the information content of each compared concept. This measure is defined in Eq. 15:

sim_HSS4(u, v) = D · (IC(MICA(u, v)) + IC(u)·IC(v) / (IC(u) + IC(v))) / (length(u, clcs) + length(v, clcs)).    (15)
4 Proposed Algorithm
The proposed algorithm uses each measure of semantic similarity: based on struc-
ture, based on characteristics, based on information content and hybrids. The
algorithm steps are presented below.
1. Preprocessing. In this phase, through the use of Apache Jena, the concepts and
taxonomic relationships of the input ontology in OWL format are extracted.
2. Semantic Similarity Computation. For each pair of concepts (u, v) the com-
putation of the semantic similarity is applied using the Semantic Measure
Library3 (SML).
3. Thresholds Computation. The threshold of each similarity measure is obtained
by averaging the similarity results obtained for that measure, where RT is the
total number of taxonomic relationships in the ontology:

thld_simi = ( Σ_(u,v) res_simi(u, v) ) / RT    (16)
The average threshold is obtained by applying Eq. 17, where N is the number
of measures in the category used; for example, if the structure-based similarity
measures are being used, N = 3, the number of structure-based measures used
in this work:

thld_avg = ( Σ_i thld_simi ) / N    (17)
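The threshold computation of Eqs. 16 and 17 can be sketched as follows; the measure names and similarity values used in the example are invented placeholders, one similarity result per taxonomic relationship.

```python
def measure_threshold(similarities):
    """Eq. 16: threshold of one measure = sum of its similarity results
    over all taxonomic relationships divided by RT (their count)."""
    return sum(similarities) / len(similarities)


def average_threshold(thresholds):
    """Eq. 17: average of the per-measure thresholds of one category
    (N = number of measures in the category, e.g. N = 3 for the
    structure-based measures)."""
    return sum(thresholds) / len(thresholds)


# Invented similarity results for three measures of one category:
results = {
    "measure_1": [0.9, 0.8, 1.0],
    "measure_2": [0.7, 0.6, 0.8],
    "measure_3": [0.5, 0.4, 0.6],
}
per_measure = {name: measure_threshold(vals) for name, vals in results.items()}
category_threshold = average_threshold(list(per_measure.values()))
```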
3 http://www.semantic-measures-library.org/sml/.
Evaluating Taxonomic Relationships Using Semantic Similarity Measures 289
5 Experimental Results
In this section, the results obtained by applying the approach to ontologies
of the sensor domain are presented. The proposed algorithm was implemented
in Python. Semantic similarity measures
based on features and hybrid measures were implemented in the Java program-
ming language using the library SML (https://www.semantic-measures-library.
org/sml/). In this library the measures based on structure and those based on
information content are already implemented.
Table 1 shows the total number of classes, the number of taxonomic relation-
ships (TR) and the maximum depth of each ontology.
This subsection presents the results obtained with the structure-based measures
using the SML library, applying Eqs. 1, 2 and 3 to ontologies in the sensor
domain. Table 2 shows the average threshold, calculated using Eq. 17, and the
thresholds obtained for each measure per ontology, calculated using Eq. 16.
Table 3 shows the accuracy results for each similarity measure applied to
each ontology. As noted, the ontology with the
highest value in the overall average of the approach is Intelligent Environment
followed by Onto Sensor.
290 M. T. Vidal et al.
Table 3. Results of the accuracy measure obtained with each structure-based semantic
similarity measure
Table 4. Thresholds for sensor domain ontologies using measures based on character-
istics
Table 5 shows the accuracy results for each similarity measure applied to
each sensor ontology.
For this type of similarity measure applied to sensor ontologies, the
ontology with the highest accuracy value is again Intelligent Environment,
followed by Device Ontology.
Table 5. Results of the accuracy measure obtained with each semantic similarity
measure based on characteristics
Table 6 shows the thresholds obtained for each measure of similarity for these
ontologies.
Table 6. Thresholds for sensor domain ontologies using measures based on information
content
Table 7 shows the accuracy results for each similarity measure applied to
each sensor ontology.
Table 7. Results of the accuracy measure obtained with each semantic similarity
measure based on information content
As can be seen in Table 7, the ontology with the best accuracy value in the
general average is Sensor Ontology, followed by Onto Sensor.
This subsection presents the results obtained from the system when hybrid
measures are applied to the ontologies of the sensor domain. Table 8 shows the
thresholds obtained for each measure of similarity for these ontologies.
Table 9 shows the accuracy results for each similarity measure applied to
each sensor ontology.
Table 9. Results of the accuracy measure obtained with each hybrid semantic similarity
measure
For this category of measures, the ontology with the highest value in the
overall system average is Device Ontology, followed by Onto Sensor.
As shown in Table 10, Intelligent Environment on average has the best results
in the evaluation of taxonomic relationships, with a GA_average of 61%, although
with the information-content-based and hybrid measures it obtains the lowest
GA values, 44.4% and 29.6%, respectively.
6 Conclusions
Acknowledgment. This work is supported by the Sectoral Research Fund for Educa-
tion with the CONACyT project 257357, and partially supported by the VIEP-BUAP
project.
References
1. Agarwal, R., Fernandez, D.G., Elsaleh, T., Gyrard, A., Lanza, J., Sanchez,
L., Georgantas, N., Issarny, V.: Unified IoT ontology to enable interoperability
and federation of testbeds. In: 2016 IEEE 3rd World Forum on Internet of Things
(WF-IoT), pp. 70–75, December 2016
2. Ali, S., Khusro, S., Ullah, I., Khan, A., Khan, I.: Smartontosensor: ontology for
semantic interpretation of smartphone sensors data for context-aware applications.
J. Sensors 2017, 8790198:1–8790198:26 (2017)
3. Berners-Lee, T., Hendler, J.: The semantic web. Sci. Am. 284, 34–43 (2001)
4. Brank, J., Grobelnik, M., Mladenić, D.: Automatic evaluation of ontologies, pp.
193–219. Springer, London (2007)
5. Bravo, M., Reyes, J., Cruz-Ruiz, I., Gutiérrez-Rosales, A., Padilla-Cuevas, J.:
Ontology for academic context reasoning. Procedia Comput. Sci. 141, 175–182
(2018)
6. Bravo, M., Reyes-Ortiz, J.A., Cruz, I.: Researcher profile ontology for academic
environment. In: Arai, K., Kapoor, S. (eds.) Advances in Computer Vision, pp.
799–817. Springer, Cham (2020)
7. Eid, M., Liscano, R., Saddik, A.E.: A novel ontology for sensor networks data. In:
2006 IEEE International Conference on Computational Intelligence for Measure-
ment Systems and Applications, pp. 75–79, July 2006
8. Gómez-Pérez, A.: Ontology evaluation, pp. 251–273. Springer, Heidelberg (2004)
9. Gruber, T.R.: Toward principles for the design of ontologies used for knowledge
sharing? Int. J. Hum.-Comput. Stud. 43(5), 907–928 (1995)
10. Russomanno, D.J., Kothari, C., Thomas, O.A.: Building a sensor ontology: a prac-
tical approach leveraging ISO and OGC models, pp. 637–643, January 2005
11. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and
lexical taxonomy. CoRR, cmp-lg/9709008 (1997)
12. Kolozali, S., Elsaleh, T., Barnaghi, P.M.: A validation tool for the W3C SSN ontol-
ogy based sensory semantic knowledge. In: TC/SSN@ISWC (2014)
13. Li, Y., Bandar, Z.A., Mclean, D.: An approach for measuring semantic similarity
between words using multiple information sources. IEEE Trans. Knowl. Data Eng.
15(4), 871–882 (2003)
14. Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the
Fifteenth International Conference on Machine Learning, ICML 1998, pp. 296–
304. Morgan Kaufmann Publishers Inc., San Francisco (1998)
15. Maedche, A., Staab, S.: Measuring similarity between ontologies. In: Gómez-Pérez,
A., Benjamins, V.R. (eds.) Knowledge Engineering and Knowledge Management:
Ontologies and the Semantic Web, pp. 251–263. Springer, Heidelberg (2002)
16. Mazandu, G., Mulder, N.: Information content-based gene ontology semantic sim-
ilarity approaches: toward a unified framework theory. BioMed Res. Int. 2013,
292063 (2013)
17. Neuhaus, H., Compton, M.: The semantic sensor network ontology: a generic lan-
guage to describe sensor assets (2009)
18. Paul, R., Groza, T., Zankl, A., Hunter, J.: Semantic similarity-driven decision
support in the skeletal dysplasia domain, November 2012
19. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a
metric on semantic nets. IEEE Trans. Syst. Man Cybern. 19(1), 17–30 (1989)
20. Resnik, P.: Semantic similarity in a taxonomy: an information-based measure and
its application to problems of ambiguity in natural language. CoRR, abs/1105.5444
(2011)
21. Rodriguez, M.A., Egenhofer, M.J.: Determining semantic similarity among entity
classes from different ontologies. IEEE Trans. Knowl. Data Eng. 15(2), 442–456
(2003)
22. Rueda, C., Galbraith, N., Morris, R., Bermudez, L., Arko, R., Graybeal, J.: The
MMI device ontology: enabling sensor integration. In: American Geophysical Union
Fall Meeting – Session, vol. 16, pp. 44–48, January 2010
23. Seco, N., Veale, T., Hayes, J.: An intrinsic information content metric for semantic
similarity in wordnet. In: Proceedings of the 16th European Conference on Artificial
Intelligence, ECAI 2004, pp. 1089–1090. IOS Press, Amsterdam (2004)
24. Sánchez, D., Batet, M., Isern, D., Valls, A.: Ontology-based semantic similarity: a
new feature-based approach. Expert Syst. Appl. 39(9), 7718–7728 (2012)
25. Tversky, A.: Features of similarity. Psychol. Rev. 84, 327–352 (1977)
26. Wu, Z., Palmer, M.: Verb semantics and lexical selection. CoRR, abs/cmp-
lg/9406033 (1994)
Trained Synthetic Features in Boosted
Decision Trees with an Application
to Polish Bankruptcy Data
1 Introduction
In the classification problem, a training set consists of n observations of a binary
categorical target Y and p predictors X1 , . . . , Xp . The training set is used to
build a model that predicts Y based on X1 , . . . , Xp . For example, in a financial
application the training set could consist of observations on n companies and Y
could denote whether or not a company declared bankruptcy during the obser-
vation period. The numeric predictors X1 , . . . , Xp could be financial indicators
taken from the companies’ financial statements. This data can then be used to
build a model that will predict the bankruptcy status of companies Y given
observations of their financial indicators X1 , . . . , Xp .
One approach to building such a model is to augment the data set by com-
bining the predictors X1 , . . . , Xp into new variables called synthetic features and
adding these synthetic features to the collection of predictors. Standard model-
building procedures such as decision trees or ensembles of boosted trees can then
be applied to the augmented data set. Zieba et al. [11] follow this approach to
create ensembles of boosted classification trees built using synthetic features.
Two predictor variables are chosen by variable-importance weighted random selection;
the predictors are then combined using a randomly chosen binary arithmetic
operation. Melville and Mooney [6] create artificial data by randomly sampling
values of predictors and then sampling a class value inversely proportional to the
current ensemble’s predictions. Rodriguez et al. [8] create synthetic features by
c Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 295–309, 2020.
https://doi.org/10.1007/978-3-030-39442-4_23
296 T. R. Boucher and T. Msabaeka
2 Methods
Least squares regression, which minimizes the RSS, can be replaced with
variants like robust regression with a psi function, or least-trimmed squares
regression. One concern with these alternatives to least-squares regression is
that computational time is increased; for example, robust regression models are
fit using iteratively reweighted least squares. Whichever method is used, the
synthetic feature is the value of β0 + β1 x1 + . . . + βp xp returned by the fitted
model, which is the estimate of P (Y = A) given the data.
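A minimal sketch of the least-squares variant, shown for a single predictor for brevity (the multivariate fit is analogous): the class indicator is coded 0/1, and the fitted values of the regression serve as the synthetic feature.

```python
def ols_synthetic_feature(x, y):
    """Fit Y (coded 1 for class A, 0 otherwise) on one predictor by
    ordinary least squares; the fitted values b0 + b1 * x serve as the
    synthetic feature, a rough estimate of P(Y = A)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx                 # slope estimate
    b0 = ybar - b1 * xbar          # intercept estimate
    return [b0 + b1 * xi for xi in x]


# Toy data: the feature increases with the class indicator.
feature = ols_synthetic_feature([0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1])
```

Robust or least-trimmed-squares fits would replace the closed-form solution above with an iterative one, at extra computational cost, as noted in the text.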
selected is then |Corr(Xi, Ynew)| / Σ_j |Corr(Xj, Ynew)|. If the observations are
weighted unequally, the weighted correlation Corrw will be used, where

Corrw(X, Y) = Σ_(i=1..n) wi (Xi − X̄)(Yi − Ȳ) / sqrt( Σ_(i=1..n) wi² (Xi − X̄)² · Σ_(i=1..n) wi² (Yi − Ȳ)² ).
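A sketch of the selection probabilities and the weighted correlation described above; with equal weights the expression reduces to the ordinary correlation. The placement of the weights and means follows the reconstruction of the displayed formula, so treat it as an assumption rather than the paper's exact code.

```python
import math


def weighted_corr(x, y, w):
    """Weighted correlation as displayed: weighted cross-products in the
    numerator, squared weights and squared deviations in the denominator."""
    n = len(x)
    xbar = sum(x) / n   # unweighted means, as written in the display
    ybar = sum(y) / n
    num = sum(wi * (xi - xbar) * (yi - ybar) for xi, yi, wi in zip(x, y, w))
    dx = sum(wi ** 2 * (xi - xbar) ** 2 for xi, wi in zip(x, w))
    dy = sum(wi ** 2 * (yi - ybar) ** 2 for yi, wi in zip(y, w))
    return num / math.sqrt(dx * dy)


def selection_probs(corrs):
    """P(select X_i) = |Corr(X_i, Ynew)| / sum_j |Corr(X_j, Ynew)|."""
    total = sum(abs(c) for c in corrs)
    return [abs(c) / total for c in corrs]
```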
This is similar to the methods used in [11] but with weighted correlations used
here in place of variable importance. Weighted correlations are preferred to vari-
able importance in boosting as this will encourage the algorithm to explore new
regions of the predictor space as the weights evolve, rather than favor predic-
tor variables by the frequency with which they have already been selected. The
synthetic feature is the value of Xi Xj .
Naive Bayes: Naive Bayes applies Bayes’ rule to express P (Y = A|X) as pro-
portional to the product f (X|Y = A) × π(Y = A), where π(Y = A) is the
prior probability of class A and f (X|Y = A) is the joint likelihood of X =
(X1 , . . . , Xp ) given that Y = A. The naive assumption is that the Xi are inde-
pendent given the value of Y, so that the joint conditional likelihood equals the
product of the conditional marginals; that is, f(X|Y = A) = Π_(i=1..p) f(Xi |Y = A),
with f(Xi |Y = A) being the conditional marginal distribution of Xi given that
Y = A. In practice the f(Xi |Y = A) are not known, and so parametric forms are assumed. For
example, each Xi given Y = A is often assumed to follow a Normal distribution
with mean X̄iA and standard deviation SiA , these statistics being calculated for
the Xi values for which Y = A. P (Y = B|X) is calculated similarly; the constant
of proportionality is handled by normalizing and returning
P(Y = A|X) = Π_(i=1..p) f(Xi |Y = A) π(Y = A) / ( Π_(i=1..p) f(Xi |Y = A) π(Y = A) + Π_(i=1..p) f(Xi |Y = B) π(Y = B) )
as the synthetic feature. If the observations are weighted unequally, the weighted
observations are used in calculating X̄iA , SiA , X̄iB , SiB , i = 1, . . . , p. This will
have the effect of shifting the modes of the conditional marginal Normal densities
towards the more heavily weighted observations.
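A sketch of the Naive Bayes synthetic feature under the Normal assumption. The per-class (mean, sd) pairs passed in `stats_a` and `stats_b` stand in for the (possibly weighted) statistics X̄iA, SiA, X̄iB, SiB described above; the values in the usage line are illustrative only.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def nb_synthetic_feature(row, stats_a, stats_b, prior_a, prior_b):
    """Normalized Naive Bayes posterior P(Y = A | X) used as a synthetic
    feature; stats_* hold one (mean, sd) pair per predictor, computed from
    the class-conditional (and possibly weighted) training observations."""
    like_a = prior_a
    like_b = prior_b
    for xi, (ma, sa), (mb, sb) in zip(row, stats_a, stats_b):
        like_a *= normal_pdf(xi, ma, sa)
        like_b *= normal_pdf(xi, mb, sb)
    return like_a / (like_a + like_b)


# Illustrative call: one predictor, class A centered at 0, class B at 4.
p_a = nb_synthetic_feature([0.0], [(0.0, 1.0)], [(4.0, 1.0)], 0.5, 0.5)
```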
Splines: Another option is to create synthetic features with cubic smoothing
splines [5] of each Xi as a predictor of Ynew. Multivariate splines can be applied to collections of Xi as
predictors of Ynew . The computational burden of fitting splines can be intense
but splines allow for nonlinear class boundaries. Observation weights can be
incorporated in the spline fitting.
3 Experiments
All work was done in the R environment [7]. R package rpart [9] was used for
creating the stumps at each boosting iteration. Original scripts were written in
R to perform the discrete and gentle boosting (apart from using rpart for stump
creation), since the new observation weights were needed at each boosting iteration
in order to create the new set of synthetic features. The synthetic features were
then added to the collection of predictor variables available for that iteration.
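The discrete boosting loop described above can be sketched as plain AdaBoost with exhaustively searched decision stumps. This is a simplified Python sketch, not the authors' R scripts: the per-iteration regeneration of synthetic features from the updated weights is only indicated by a comment.

```python
import math


def stump_fit(X, y, w):
    """Best one-feature threshold stump under weights w (y in {-1, +1})."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            for sign in (1, -1):
                pred = [sign if row[j] > t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best


def adaboost(X, y, rounds):
    """Discrete AdaBoost: reweight observations after each stump. In the
    paper's procedure, a new set of synthetic features would be generated
    from the current weights before each stump_fit call (omitted here)."""
    n = len(y)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, t, sign = stump_fit(X, y, w)
        err = max(err, 1e-10)                    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # stump's vote weight
        ensemble.append((alpha, j, t, sign))
        preds = [sign if row[j] > t else -sign for row in X]
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        s = sum(w)
        w = [wi / s for wi in w]                 # renormalize weights
    return ensemble


def predict(ensemble, row):
    """Sign of the weighted vote of all stumps."""
    score = sum(a * (s if row[j] > t else -s) for a, j, t, s in ensemble)
    return 1 if score >= 0 else -1
```

Gentle boosting differs only in how the weights and stump outputs are updated, which is why the authors needed the weights at each iteration in both variants.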
3.1 Data
Polish Companies Bankruptcy Data: The Polish companies bankruptcy
data set is described in [11] and publicly available through the UCI Machine
Learning Repository [1]. This is an interesting data set: it contains a large
number of observations and 64 predictors, has many missing values, and the
predictor distributions are often skewed and contain outliers. The classes are
also highly unbalanced; for example, the data for Year 1 contains 7027
observations, 271 of which represent companies that went bankrupt, while the
remaining 6756 did not. Two of the variables, X37 and X27, are missing
approximately 39% and 23% of their observations, respectively. The remaining 62 variables are
missing from 0% to 4.4% of their values. The missing values do not appear to be
missing at random so imputation should be applied cautiously. However, since
the data is being used to illustrate boosting with synthetic features, the missing
values will be imputed according to the following simple scheme. For a given
predictor Xi containing missing values, the most highly correlated predictor Xj ,
j = i, is selected and simple linear regression of Xi on Xj is used to impute the
missing values of Xi . If Xj contains missing values so that some of the missing
Checkerboard Data: This is an artificially generated data set often used for
machine learning experiments. A black-and-white checkerboard pattern is super-
imposed over the unit square. Instances are generated by randomly sampling X
and Y values uniformly between 0 and 1. These (X, Y ) coordinates determine
which square in the black-and-white checkerboard pattern the instance falls into.
The color of the square determines the class of the instance. An example of
10,000 simulated instances is in Fig. 1. The classification problem involves pre-
dicting whether an instance belongs to the black-square class or the white-square
class given the X, Y coordinates.
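The checkerboard generator can be sketched as follows. The grid size k = 4 is an assumption for illustration, since the text does not state how many squares the pattern has.

```python
import math
import random


def cell_label(x, y, k=4):
    """Color of the k-by-k checkerboard cell containing (x, y):
    0 for one color, 1 for the other (parity of the cell indices)."""
    return (math.floor(k * x) + math.floor(k * y)) % 2


def checkerboard(n, k=4, seed=0):
    """Sample n points uniformly on the unit square; each instance's
    class is the color of its containing cell."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        data.append((x, y, cell_label(x, y, k)))
    return data
```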
Fig. 1. 10,000 simulated instances of checkerboard data. The black squares are one
group and the white squares are another group.
Spheres Data: This is another artificially generated data set often used for
machine learning experiments. Instances belong to one of two classes defined by
two spheres, one nested within the other. The first class is defined by a sphere
centered at the origin of radius 1/2. The second class is defined by a hollow sphere
centered at the origin having inner radius 1/2 and outer radius 1. Instances are
generated by randomly sampling X, Y and Z values from each of the two spheres.
An example of 1,000 simulated instances is in Fig. 2. The classification problem
involves predicting the class membership given the (X, Y, Z) coordinates.
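A sketch of the spheres generator, using rejection sampling so that points are uniform in volume over each region; the paper does not state its exact sampling scheme, so this is one plausible reading.

```python
import math
import random


def sample_region_point(rng, r_min, r_max):
    """Rejection-sample a point of the cube [-1, 1]^3 whose distance
    from the origin lies in [r_min, r_max)."""
    while True:
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        z = rng.uniform(-1, 1)
        if r_min <= math.sqrt(x * x + y * y + z * z) < r_max:
            return (x, y, z)


def spheres_data(n_per_class, seed=0):
    """Class 0: solid ball of radius 1/2; class 1: hollow shell with
    inner radius 1/2 and outer radius 1."""
    rng = random.Random(seed)
    inner = [(sample_region_point(rng, 0.0, 0.5), 0) for _ in range(n_per_class)]
    outer = [(sample_region_point(rng, 0.5, 1.0), 1) for _ in range(n_per_class)]
    return inner + outer
```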
Boosting Trained Synthetic Features 303
Fig. 2. 1,000 simulated instances from two nested spheres. Classes are indicated by
plotting symbol.
3.2 Results
Polish Bankruptcy: Due to the size of the training set (7027 observations),
a stratified sample of size n = 200 was selected with classes represented in the
same proportions as found in the training set. The stratified sample was used
for model fitting. The 64 predictor variables were replaced with 16 principal
components which together accounted for over 90% of the generalized variance.
There were still too many variables to be able to compute multivariate splines
so this synthetic feature was omitted.
Using both discrete boosting and gentle boosting combined with 10-fold
cross-validation, models with a variety of boosting iterations were fit to the
stratified sample from the Polish data. The average and standard deviation of
the misclassification errors across the cross-validation runs were calculated. The
results are in Table 2.
The mean cross-validation classification errors should be compared with the
0.039 misclassification error rate for the naive majority class classification rule
which classifies every observation as the not bankrupt majority class (6756 of
the 7027 companies did not go bankrupt). Table 2 shows that for both discrete
and gentle boosting the mean cross-validation misclassification error decreases
as boosting iterations increase. The standard deviation of the cross-validation
misclassification errors decreases for gentle boosting and remains stable for dis-
crete boosting. Table 2 indicates discrete boosting resulted in slightly smaller
mean cross-validation misclassification error and smaller standard deviation of
cross-validation error than gentle boosting. Both boosting methods crossed the
0.039 naive misclassification error rate at about 20 boosting iterations.
Table 2. Results for Polish bankruptcy data. Average (‘MCVE’) and standard devia-
tion (‘SCVE’) of misclassification errors from 10-fold cross-validation of bootstrap ensem-
bles with (‘Iterations’) 3, 5, 10, 15, 20, 25, 50, and 100 boosting iterations. Discrete
(‘Discrete’) and gentle (‘Gentle’) boosting applied.
Variable importances from the discrete and gentle boosting models with
100 boosting iterations were averaged over the 10 cross-validation runs and
normalized to sum to 1 for easier evaluation. The top 5 variables for discrete
and gentle boosting by the resulting variable importance are in Table 3.
The variable importances for discrete boosting differ very little, suggesting
that none dominates the others in its ability to predict the bankruptcy status of a
company. For discrete boosting the first 4 variables in importance are univariate
splines of principal components of the original predictor variables, suggesting
some nonlinearity in the partition between classes on these principal components.
The final variable in the top 5 is logistic regression with the principal components
as independent variables, a further suggestion of a nonlinear partition between
the classes. The variable importances using gentle boosting differ from those using
discrete boosting in that the logistic regression synthetic feature strongly dominates
the other trained synthetic features. Univariate splines again appear in the top
5, though of a different collection of principal components. Recall that discrete
Table 3. Results for Polish bankruptcy data. Top 5 variables by variable importance in
decreasing order, 100 boosting iterations. Discrete and gentle boosting applied. ‘DVar’
and ‘DImp’ are variables and importances for discrete boosting, ‘GVar’ and ‘GImp’
are variables and importances for gentle boosting.
and gentle boosting have differing observation weights at each boosting iteration,
leading to differing sets of synthetic features, explaining the difference between
variables and their importance.
Table 4. Results for checkerboard data. Average (‘MCVE’) and standard deviation
(‘SCVE’) of misclassification errors from 10-fold cross-validation of bootstrap ensembles
with (‘Iterations’) 3, 5, 10, 15, 20, 25 boosting iterations. Discrete and gentle boosting
applied.
Spheres Data: The training set consisted of 1,000 instances, 500 in each group.
The predictors were not replaced with their principal components as this yielded
no appreciable increase in performance and the predictors were few in num-
ber so no computational improvement would result. Table 6 shows the mean
cross-validation misclassification error and the standard deviation of the cross-
validation misclassification error across boosting iterations, using both discrete
and gentle boosting to update the observation weights.
Table 6. Results for spheres data. Average (‘MCVE’) and standard deviation (‘SCVE’)
of misclassification errors from 10-fold cross-validation of bootstrap ensembles with
(‘Iterations’) 3, 5, 10, 15, 20, 25 boosting iterations. Discrete and gentle boosting
applied.
While both methods perform well, the results for gentle boosting are even
more impressive than those for discrete boosting as the corresponding cross-
validation average misclassification errors and standard deviation of these errors
are smaller than those obtained from discrete boosting. For both discrete and
gentle boosting, the average cross-validation misclassification error and standard
deviation of cross-validation misclassification error are low for a small number
of boosting iterations and vary little as boosting iterations change. The models
clearly outperform the naive majority-class classification rule (here a coin toss)
with its 0.5 misclassification error rate. It was also common to reach a perfectly
performing boosting model in fewer than the maximum 20 boosting iterations.
Variable importances from the discrete and gentle boosting models with 100
boosting iterations were averaged over the 10 cross-validation runs and normal-
ized to sum to 1 for easier evaluation. The top 5 variables by variable importance
are in Table 7.
Table 7. Results for spheres data. Top 5 variables by variable importance in decreasing
order, 100 boosting iterations. Discrete and gentle boosting applied. ‘DVar’ and ‘DImp’
are variables and importance for discrete boosting, ‘GVar’ and ‘GImp’ are variables
and importance for gentle boosting.
The variable importances are similar for discrete and gentle boosting. Naive
Bayes comes through as the most important variable; this was also noted in
[4] when performing experiments on randomly generated instances of this data.
The remaining synthetic features are nonlinear splines, suited to capturing the
nonlinear boundary between the classes. Taken together the 5 variables account
for nearly 90% of variable importance.
The cross-validation estimate of model performance can be overly optimistic,
since the final model selection (including deciding the number of boosting itera-
tions, and gentle or discrete boosting) is based on the cross-validation results. As
a check, a model fit with gentle boosting and 20 boosting iterations on the entire
training set and then applied to a new test set of 1,000 simulated instances of sphere
data yielded a misclassification error of 1.3% and the confusion matrix in Table 8.
Table 8. Confusion matrix for model with 20 gentle boosting iterations, performance
on new simulated test set of 1,000 instances.
A test run using discrete boosting and 20 boosting iterations on the same
new set of 1,000 simulated instances of sphere data yielded a misclassification
error of 2.6% and confusion matrix in Table 9.
In both cases, the model performance on the test set falls short of the cross-
validation estimate but is still very good.
Table 9. Confusion matrix for model with 20 discrete boosting iterations, performance
on new simulated test set of 1,000 instances.
4 Conclusions
References
1. Dua, D., Karra Taniskidou, E.: UCI Machine Learning Repository. University of
California, School of Information and Computer Science, Irvine (2017). http://
archive.ics.uci.edu/ml
2. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning
and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
3. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical
view of boosting. Ann. Stat. 28(2), 337–374 (2000)
4. Frank, E., Hall, M., Pfahringer, B.: Locally weighted Naive Bayes. In: Proceedings
of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pp. 249–256
(2003)
5. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edn. Springer,
New York (2009)
6. Melville, P., Mooney, R.J.: Creating diversity in ensembles using artificial data.
Inf. Fusion 6(1), 99–111 (2005). https://doi.org/10.1016/j.inffus.2004.04.001
7. R Core Team: R: a language and environment for statistical computing. R Foun-
dation for Statistical Computing, Vienna, Austria (2016). https://www.R-project.
org/
8. Rodriguez, J.J., Kuncheva, L.I., Alonso, C.J.: Rotation forest: a new classifier
ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 28(10), 1619–1630
(2006). https://doi.org/10.1109/TPAMI.2006.211
9. Therneau, T., Atkinson, B., Ripley, B.: rpart: Recursive partitioning and
regression trees. R package version 4.1-10 (2015). https://CRAN.R-project.org/
package=rpart
10. Wang, L., Wu, C.: Business failure prediction based on two-stage selective ensem-
ble with manifold learning algorithm and kernel-based fuzzy self-organizing map.
Knowl.-Based Syst. 121(2017), 99–110 (2017)
11. Zieba, M., Tomczak, S.K., Tomczak, J.M.: Ensemble boosted trees with synthetic
features generation in application to bankruptcy prediction. Expert Syst. Appl.
58, 93–101 (2016)
12. Zieba, M., Hardle, W.: Beta-boosted ensemble for big credit scoring data. Expert
Systems With Applications, SFB 649 Discussion Paper 2016-052 (2016). Avail-
able at SSRN: https://ssrn.com/abstract=2875664 or https://doi.org/10.2139/
ssrn.2875664
UI Design Patterns for Flight
Reservation Websites
Abstract. Flight Reservation Websites have become one of the most frequently
used online booking systems in today’s technology-oriented world. To ensure
that such systems provide a good User Experience, research has been conducted
to analyze Flight Reservation Websites and to formulate User Interface Design
Patterns that enhance their User Experience. A systematic approach has been
adopted for the user study to derive UI Design Patterns that will help system
developers design a system in a way that ensures a better User Experience and
addresses diverse user needs.
1 Introduction
2 Literature Review
The research [1] mainly focuses on creating successful interactive systems. User
interfaces need to be developed in cooperation with developers, designers, and
experts; this group exchanges the ideas and terminologies used to enhance the
design and usability of the project. The paper also presents an approach that
uses pattern languages in software development, HCI and application domains.
One of its sections explains where the pattern idea comes from and how it has
been adapted to other disciplines, for example Patterns in Urban Architecture,
Patterns in Software Engineering, Patterns in Human Computer Interaction and
Patterns in the Application Domain.
One pattern language in interdisciplinary design is the Formal Hypertext Model
of a Pattern Language, a form of pattern description that makes it less ambiguous
for the parties involved to decide what a pattern is supposed to look like in
terms of structure and content. The most important work for the developer is to use
patterns in the Usability Engineering Lifecycle. There are eleven major points in these
patterns, such as “know the user”, “competitive analysis”, “setting usability goals”,
“parallel design”, “participatory design”, “coordinated design of the total interface”,
etc. Focus on HCI is particularly important for exhibits, but it is of equal importance
to “kiosk” and similar public-access systems, where typical users are first-time and
one-time users with short interaction times and no time for learning.
As for software design patterns, they are important to supplement the general
ones. The main focus in this text is on patterns that relate specifically to software
design, for example for music exhibits. The purpose of the text is to apply the
pattern technique to entirely new application domains, which could further
strengthen the argument that structuring them into patterns is generally a valid
approach. A more intensive use of HCI patterns in user interface design courses
will hopefully yield more detailed findings about their usefulness in education [1].
Another research effort [2] focused on how UI Design Patterns are introduced
to HCI students and on the students’ reviews of them. The author discusses
UI Design Patterns and how these patterns are collected from pattern languages.
Patterns provide the reader with a solution to a known problem within a defined context.
Fourteen students were asked to develop a pattern-based model for each of the two
given interfaces after they attended a lecture on how to develop UI Design Patterns.
They were also given a six-step generative process to follow which was another
method of developing UI Design Patterns. Once the students had enough knowledge
about the development process, they were observed and asked to fill in questionnaires
about their experience. The results were more positive than negative, and most of the
students thought that design patterns should be used. This paper shows what approach
should be used to form UI Design Patterns and how one should start learning in this field [2].
A study [3] of a one-day workshop analyzed how UI designers use patterns
today. Ten possible topics within this scope were listed and explained, divided
into two categories: topics related to writing patterns and topics related to using
existing patterns. It was explained why writing patterns is important, but there
was no discussion of how to write a pattern, which makes the workshop more
theoretical than practical. The paper claimed that very little work has been done
on reading and using existing patterns, which could be true, as we have not found
any similar work for our literature review. The topics related to writing patterns
are essentially different approaches that can be used while writing a pattern, and
these ideas can be helpful when one is writing a pattern of one’s own [3].
Another qualitative study about design patterns [4] refers to problems in the usability of existing interaction patterns: the existing patterns do not fulfill their purpose.
312 Z. H. Malik et al.
After using several pattern collections, the authors listed the problems they faced. One was that none of the pattern collections covered all of the problem cases. Furthermore, a number of the pattern collections used actually tend to focus on different aspects of the design process. In some cases, the naming of interaction patterns seemed inconsistent and difficult to learn.
Having shown that problems exist when it comes to using design patterns, the authors’ suggested solutions included: organizing the patterns using problem-based or platform-based grouping; using graphical references to help the developer understand the core of the solution; and adopting systematic, standardized naming conventions to help in identifying and remembering patterns [4].
Another study [5] discusses how UI designers and researchers can evaluate existing design patterns before using them. The goal was to build a tool to manage pattern collections. First, the problems faced while building such a tool were addressed, which included standardizing a common pattern form, customizing patterns, relating patterns, etc. The existing tools were then investigated in a survey, and it turned out that some of the functions were already supported in some of the tools while others were given very little support. There are details about which tool supports which specific function. This information was further used to identify six activities that require support from a tool when managing pattern collections. These details about what functions a pattern tool holds were important for us because they show how patterns differ and which features matter, and we will need this information when forming our UI Design Patterns [5].
Another study talks about building complex analysis patterns from the combination of simpler analysis patterns in the context of flight reservations. The requirements forming the common context were several cases related to flight reservation: a flight is defined by a number and a date; a plane is assigned to a flight and contains a set of numbered seats; flights are connected to a route, and a route consists of the origin and destination of a flight (its span). Different use cases of flight reservation describe how different passengers book a flight according to their plans and choose different routes to the same destination.
Then come the component patterns; these patterns are basically the building blocks for the complex patterns. Each pattern has an intent and a problem, and its solution is described with the approach of Object-Oriented Analysis and Design. These patterns include the Travel Ticket, Seat Assignment, Collection of Seats, Self-Connection, Flight-Route, and Airport Role patterns. Finally, there is a flight reservation pattern, which describes the placement of an order for a series of tickets.
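The composition described above can be sketched in object-oriented form. The following is our own minimal illustration of how the named component patterns might combine into a reservation model, not code from the cited paper; all class and function names are our assumptions.

```python
# Sketch of composing the component patterns named in the text:
# Flight-Route, Collection of Seats, Seat Assignment, Travel Ticket.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Route:                  # Flight-Route pattern: the span of a flight
    origin: str
    destination: str

@dataclass
class Flight:                 # A flight is defined by a number and a date
    number: str
    date: str
    route: Route
    seats: List[str] = field(default_factory=list)  # Collection of Seats

@dataclass
class Ticket:                 # Travel Ticket + Seat Assignment
    passenger: str
    flight: Flight
    seat: str

def reserve(passenger: str, flight: Flight) -> Ticket:
    """Flight Reservation pattern: assign the next free seat, issue a ticket."""
    if not flight.seats:
        raise ValueError("flight is full")
    return Ticket(passenger, flight, flight.seats.pop(0))
```

Each small class corresponds to one component pattern; the `reserve` function shows how the complex flight reservation pattern is built by composing them.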
Different examples with different problems are discussed in this paper in detail, with their solutions. The authors’ approach involves the use of object-oriented methods and Semantic Analysis Patterns; by solving the problems with object-oriented methods, the benefits they obtained are reusability, extensibility, and conceptual abstraction. In this paper, the ability of Semantic Analysis Patterns to compose patterns to build complex patterns, or complex models in general, is illustrated through a case study. The specific problem that the authors used as a case study is of intrinsic interest because of its economic importance. They looked for software of this kind that had been designed either by the procedural approach (most likely) or by object-oriented methods.
UI Design Patterns for Flight Reservation Websites 313
However, their search did not yield any complete examples, but the use of analysis patterns can help build good conceptual models for designers who have little experience [6].
Another study [7] is about design patterns for ubiquitous computing. Design patterns are a format for capturing and sharing design knowledge. The overall goal of this work was to aid practice by speeding up the diffusion of new interaction techniques and evaluation results from researchers, presenting the information in a form more usable to practicing designers. Design patterns were first developed by Christopher Alexander for architectural purposes; he developed 253 patterns. The pattern language for ubiquitous computing uses the definition generated at INTERACT 99: “The goals of an HCI pattern language are to share successful HCI design solutions among HCI professionals”. This paper points out that patterns offer solutions to specific problems instead of providing high-level suggestions. Pattern languages started in architecture and have since been emerging for UI design; patterns have seen success in the area of software design and among the software development community. The authors developed their own pattern language using an iterative process, and they believe that more work can be done on this language in a further iteration. The patterns formed in this paper were evaluated by practicing designers, and the evaluations were used to improve the patterns [7].
When it comes to validating a UI pattern language, three types of validation need to be considered: the validity of the individual patterns, the internal validation of the pattern language, and its external validation. The elements that characterize a pattern and a pattern language were identified. Six tests, or questions, were developed in [8] to determine the internal validity of a pattern language. The same validation technique has been used to validate the UI Design Patterns of the current paper.
3 User Study
To conduct this study, 35 participants were selected, of diverse age groups and with varying experience of travelling and of using online booking systems to book flights. Of the 35, 20 participants had prior experience of online flight booking, while the remaining 15 had never booked their flights through an online booking system; they preferred to have a travel agent book flights for them. Moreover, 27 participants were male and 8 were female.
For the study, the top 5 airline websites of 2017 were selected for booking flights [9]. The five airlines were:
1. American Airlines (www.aa.com)
2. Delta Airlines (www.delta.com)
3. Southwest Airlines (www.southwest.com)
4. United Airlines (www.united.com)
5. Ryan Air (www.ryanair.com)
A Within-Subject Design [10] was used to conduct the experiment, so that each website was tested with each participant. The Within-Subject Design also helped the participants analyze the websites and make a sound comparison between the different websites.
To avoid fatigue and learning bias, the Latin square method [11] was used. “In general, a Latin square is an N × N table filled with N different symbols positioned such that each symbol occurs exactly once in each row and each column” [12]. This method makes sure that each airline website gets an unbiased spot, from first to last, to create variation.
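The counterbalancing described above can be sketched as follows. This is a minimal illustration of a cyclic Latin square assigning website orders to participants; the function names and the use of a cyclic construction are our assumptions, not details from the paper.

```python
# Cyclic Latin square for counterbalancing presentation order:
# each website appears exactly once in each row (order for one
# participant slot) and once in each column (position in the order).
WEBSITES = ["American", "Delta", "Southwest", "United", "Ryanair"]

def latin_square(n: int) -> list:
    """N x N cyclic Latin square: row i is the index sequence shifted by i."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]

def presentation_order(participant_index: int, items=WEBSITES) -> list:
    """Website order for a participant; rows are reused cyclically."""
    square = latin_square(len(items))
    row = square[participant_index % len(items)]
    return [items[k] for k in row]
```

Across any block of five consecutive participants, each website occupies each serial position exactly once, which is what removes the order bias the text describes.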
4 Experiment
To make the experiment and the participant interviews run smoothly, a pilot test was performed on each of the five airline websites by the interviewers, and each step of the flight booking procedure was noted down from beginning to end. After pilot testing, each task was transferred onto flash cards of 5 different colors, one color for each of the five airline websites on which the pilot test was performed. It was made sure that interviewees were able to perform every task on their own, without anyone’s help.
Retrospective Testing [13] was selected to conduct the study. The participant was first handed the tasks and briefly instructed about what to do. For each task, the time taken to perform it was recorded. In addition, a camera was placed at an angle that captured the participant’s facial expressions during each task, and the screen was recorded so that it could be used for analysis afterwards. The website on which the user was going to perform the task was already open on the screen. One of the interviewers sat right next to the participant so that any technical problem or question from the participant could be handled immediately.
An interviewer also took notes during each task, so that the problems the participant faced could be collected as data and later used to devise solutions that make the system better. The user proceeded to book the flight as instructed on the task cards, and the recordings were stopped when the purchase button was clicked without any error in the form.
After each task (five tasks per participant), the participant was asked to fill in a usability evaluation form. After that, a set of predetermined questions about the task just performed was asked by the interviewer. All the responses from the participants were recorded.
All the data collected from this experiment was stored in five different spreadsheet files. Each sheet recorded the user’s name and ID, the user’s prior flight booking and travelling experience, the time the user took to book a flight, and the usability score the user gave the airline. Rankings of the airlines, based on average time and score, were recorded on a separate sheet.
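The ranking step described above can be sketched as follows. This is a hypothetical illustration of how airlines could be ranked from the collected session data; the record layout, field names, and sample values are all our assumptions, not the study’s actual data.

```python
# Rank airlines by average booking time (lower is better), breaking
# ties by average usability score (higher is better).
from statistics import mean

# One record per (participant, airline) session; values are made up.
sessions = [
    {"airline": "Delta",   "time_sec": 310, "usability": 78},
    {"airline": "Delta",   "time_sec": 290, "usability": 81},
    {"airline": "Ryanair", "time_sec": 405, "usability": 64},
    {"airline": "Ryanair", "time_sec": 380, "usability": 70},
]

def rank_airlines(records: list) -> list:
    by_airline = {}
    for r in records:
        by_airline.setdefault(r["airline"], []).append(r)
    summary = {
        a: (mean(r["time_sec"] for r in rs), mean(r["usability"] for r in rs))
        for a, rs in by_airline.items()
    }
    # Sort ascending by average time, descending by average score.
    return sorted(summary, key=lambda a: (summary[a][0], -summary[a][1]))

# rank_airlines(sessions) → ["Delta", "Ryanair"]
```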
5 Analysis
After conducting the experiment, the analysis showed several usability problems that can easily be handled with some UI Design Patterns.
A series of questions was asked of the participants, in the form of an interview about the website on which they performed the task. Those questions covered their prior experience of booking online flights and of travelling, and their first impression of the particular site when they started the task. General questions about the user interface were also included to get more feedback about their experience.
One of the optional questions in that interview asked for the participants’ suggestions about additional features or any change they felt a particular flight booking website needed. Some UI design patterns are designed in such a way that they play the role of a sub-pattern of a particular main pattern. Most of the participants gave suggestions about the design and other aspects related to layout, colors, or main functionalities such as selecting the travel date. Some users got confused about the date: if a passenger wanted to book a one-way flight, then at the moment of selecting a departure date the user was asked to enter two dates, one for departure and one for return. This happened because, when a user opens an airline website for flight booking, the default trip type is “Round-Trip”, and the icons or radio buttons for changing it vary across airline websites. They were set in such a small font and placed so inconspicuously that a typical or new user would not see and select the appropriate option. A pattern is therefore designed to eliminate such problems.
Another major issue noticed was that all the websites share an almost similar layout for the flight booking procedure, and most users faced difficulty or were delayed in performing the task. A new pattern for flight booking is therefore suggested, with a simple, efficient, and easy-to-operate layout. This layout has an appropriate font size and a color scheme visible to all users. The flight booking pattern contains five different sub-patterns, all of which address basic design problems that a typical user faces. These patterns start with the “Flight Menu” pattern, followed by the “Airport Selection” pattern, which covers the main functionality of a flight booking procedure and gives enough options to select a preferred airport for the departure and arrival of a passenger. Next is the “Trip Type Selection” pattern, addressing the issue that confused most users, who were asked to enter or select multiple dates for flight booking; it requires the user to select the Round-Trip, One-Way, or Multi-City trip type in a drop-down menu. This makes sure that the user does not have to enter or select multiple dates when travelling on a one-way trip. Next comes the “Flight Date Selection” pattern, in which a monthly calendar appears with navigation buttons to scroll to the month in which a traveler wants to travel. The final sub-pattern for flight booking is the “Passenger Selection” pattern, which is simple and provides enough options for a traveler to select from multiple age groups.
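The core behavior of the “Trip Type Selection” pattern can be sketched in a few lines. This is our own minimal illustration, not code from any airline site; the function name and trip-type strings are assumptions.

```python
# The number of travel-date fields shown on the booking form depends on
# the selected trip type, so a one-way traveller is never asked for a
# return date — the confusion described in the text.
def dates_required(trip_type: str, legs: int = 2) -> int:
    """Number of date fields the booking form should display."""
    trip_type = trip_type.lower()
    if trip_type == "one-way":
        return 1          # departure only
    if trip_type == "round-trip":
        return 2          # departure + return
    if trip_type == "multi-city":
        return legs       # one date per leg
    raise ValueError(f"unknown trip type: {trip_type}")
```

Driving the form from the selected trip type, instead of always showing two date fields with “Round-Trip” preselected, removes the default-value trap the participants fell into.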
Another pattern, “Travel Date Flexibility”, was formed following users’ suggestions: it shows multiple travelling dates so that the traveler can decide between feasible dates, since the price of a particular flight can vary with the date. This pattern led to the formation of another pattern for a feature that was not up to the mark in terms of layout, positioning, and other aspects of design: the “Sort Flight Options” pattern, which provides numerous categories for sorting the flight results; for example, if the traveler wants a direct flight to the destination, he will select the option “number of stops” in a drop-down menu.
Further on, another pattern, “Flight Filter Option”, was formed so that the traveler can also filter the departure/arrival options based on a budget; this option also shows the possible departure and arrival airports of the city from or to which the traveler is travelling.
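The “Sort Flight Options” and “Flight Filter Option” patterns just described amount to sorting and filtering a result list. The following is an illustrative sketch under our own assumptions (the record fields and sample values are invented), not an implementation from any airline site.

```python
# Sort flight results by a chosen category (e.g. number of stops) and
# filter them by budget, as the two patterns in the text describe.
flights = [
    {"price": 420, "stops": 1, "departure": "JFK"},
    {"price": 310, "stops": 2, "departure": "EWR"},
    {"price": 650, "stops": 0, "departure": "JFK"},
]

def sort_flights(options: list, key: str) -> list:
    """'Sort Flight Options': order results by the selected category."""
    return sorted(options, key=lambda f: f[key])

def filter_by_budget(options: list, max_price: int) -> list:
    """'Flight Filter Option': keep only flights within the budget."""
    return [f for f in options if f["price"] <= max_price]

direct_first = sort_flights(flights, "stops")   # direct flight listed first
affordable = filter_by_budget(flights, 500)     # drops flights over budget
```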
6 UI Design Patterns
Following are the patterns that were formed after conducting a series of interviews, getting questionnaires filled in, managing tasks, compiling the data retrieved from participants, and completing the overall analysis.
6.1.1. Pattern Name
Flight Menu.
6.1.2. Pattern Description
The very first menu on a flight reservation website.
6.1.3. Problem Statement
Some websites use these options in random order.
6.1.4. Solution
Options in the right order as per the user’s need, with a design that visibly shows which option is selected.
6.1.5. Use When
Shall be used as the main menu when making a flight reservation website.
6.1.6. Example
Example of the UI Design Pattern is illustrated in Fig. 1.
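The six-field template used for each pattern in this section (name, description, problem statement, solution, use-when, example) can be captured as a simple data structure. The dataclass below is our own illustration of that template, not an artifact of the paper; the field names mirror the section headings.

```python
# Each UI Design Pattern in Section 6 follows the same six-part template.
from dataclasses import dataclass

@dataclass
class UIDesignPattern:
    name: str             # 6.x.1 Pattern Name
    description: str      # 6.x.2 Pattern Description
    problem: str          # 6.x.3 Problem Statement
    solution: str         # 6.x.4 Solution
    use_when: str         # 6.x.5 Use When
    example_figure: str   # 6.x.6 Example (figure reference)

flight_menu = UIDesignPattern(
    name="Flight Menu",
    description="The very first menu on a flight reservation website.",
    problem="Some websites use these options in random order.",
    solution="Options in the right order as per the user's need, with a "
             "design that visibly shows which option is selected.",
    use_when="As the main menu when making a flight reservation website.",
    example_figure="Fig. 1",
)
```

Recording patterns in a uniform structure like this is also what makes the pattern-collection tools discussed in the literature review feasible.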
6.3.6. Example
Example of the UI Design Pattern is illustrated in Fig. 3.
Fig. 3. A single menu with two calendars to select departure and return date.
Fig. 7. Drop-down menu showing the list of all sorting options for flights.
Fig. 10. Payment method for flight booking with necessary information only.
Fig. 12. Select suitable contact method from two possible options.
Fig. 13. A map of the airplane to select the seat of your choice.
7 Conclusion
This paper provides UI Design Patterns for flight reservation websites. These UI Design Patterns will prove beneficial in guiding future developers of such websites and will serve as a basis for their design decisions. In light of these UI Design Patterns, flight reservation websites will provide a better user experience and cater to diverse user needs.
References
1. Borchers, J.O.: A pattern approach to interaction design, pp. 1–10. ACM (2000)
2. Todd, E.G., Kemp, E.A., Phillips, C.P.: Introducing students to UI patterns, pp. 37–40. ACM
(2009)
3. Van Welie, M., Mullet, K., McInerney, P.: Patterns in practice: a workshop for UI designers,
pp. 908–909. ACM (2002)
4. Segerståhl, K., Jokela, T.: Usability of interaction patterns, pp. 1301–1306. ACM (2006)
5. Deng, J., Kemp, E., Todd, E.G.: Managing UI pattern collections, pp. 31–38. ACM (2005)
6. Jiang, Z., Fernandez, E.B.: Composing analysis patterns to build complex models: flight
reservation. ACM (2009)
7. Chung, E.S., Hong, J.I., Lin, J., Prabaker, M.K., Landay, J.A., Liu, A.L.: Development and
evaluation of emerging design patterns for ubiquitous computing, pp. 233–242. ACM (2004)
8. Todd, E., Kemp, E., Phillips, C.: Validating user interface pattern languages, pp. 125–126.
ACM (2003)
9. World Airline Rankings (2017). https://www.flightglobal.com/news/articles/insight-from-flightglobal-worldairline-rankings-20-439587
10. Malik, Z.H., Arfan, M.: Evaluation of accuracy: a comparative study between touch screen
and midair gesture input, pp. 448–462. Springer (2019)
11. Malik, Z.H.: Usability evaluation of ontology engineering tools, pp. 567–584. IEEE (2017)
12. MacKenzie, S.: Human-computer interaction: an empirical research perspective, pp. 177–
188. Morgan Kaufmann Publishers (2013)
13. Malik, Z.H., Farzand, H., Shafiq, Z.: Enhancing the usability of android application
permission model, pp. 236–255. Springer (2018)
Conceptual Model for Challenges
and Succession Opportunities for Virtual
Project Teams in the GCC
Much research has focused on the fields of project management (Kelly, Edkins, Smyth and Konstantinou, 2013 [28]; Sommerville, Craig and Hendry, 2010 [16]; Smith [3]; Paton, Hodgson and Cicmil, 2010 [37]) and of project teams (Rezania and Lingham [4]; Sackmann and Friesl, 2007 [29]; Mueller, 2012 [19]) in the workplace. However, the growing tendency toward connectivity and virtual contexts for project teams’ work has not received the same research effort. Virtual project teams are increasingly used for the types of projects that need high levels of knowledge from different cultures. Virtual teams solve many problems, such as time differences and geographical constraints. The formation of organizational project teams needs a cooperative background among team members [15, 20, 21].
Globalization and advances in digital communication have facilitated the creation of international virtual project teams (IVPTs), allowing cross-cultural and multinational collaboration between team members regardless of cultural, historical, socio-political, and educational differences [1]. On the other hand, virtual teamwork is not always a guaranteed success and can be a bumpy ride. All this indicates the increasing importance and usage of virtual project teams now and in the future, and it also means that the importance of research in this area will increase. Traditionally, virtual team members are chosen because they have certain knowledge, not because they are able to cooperate well with the other personality styles in the virtual team. Mehtab and Ishfaq [25] stated that virtual leadership is one of the major challenges of modern work. From the organizational point of view, virtual leadership is important because it gives the organization a high level of flexibility and responsiveness (Potter, Balthazard and Cooke, 2000) [34]. Potter et al. (2000) found that the manager of the team will be able to have higher control over the team members if the size of the virtual team is small and the time needed for the project is relatively short. Communication and control of long-term virtual projects, as well as of large virtual project teams, is a challenge of the modern work environment, as stated by Morgan, Paucar-Caceres and Wright, 2014 [26].
From the team members’ point of view, they can join more than one team in more than one organization, and they can also gain higher flexibility and knowledge diversification. Potter et al. (2000) also found that virtual team members can participate and contribute their knowledge and opinions simultaneously, without waiting for a dominating member to stop speaking. This conclusion leads us to expect the role of the virtual team manager to become less important over time; there is a chance that self-managed virtual teams will arise intensively in the coming decade. Holton (2001) [18] found that organizational effectiveness may be affected through virtual project team performance if there is no deep dialogue to cultivate a set of shared values among virtual team members.
However, in the GCC countries this factor may represent a positive point that can strongly affect the success of virtual project teams in the Arabian Gulf Area.
330 R. A. Samra et al.
Collaboration enhances organizational learning. The GCC charter makes it clear that the GCC countries have common cultural factors, such as language, because they are all Arabic nations; in addition, they are all Islamic ones, and they have strong historical bonds. These are the main reasons for the establishment of the Gulf Countries Council. According to this charter, the GCC countries share cooperation values; however, research on the factors affecting virtual cooperation among the GCC countries is rare. A window is open for both practitioners and researchers to take further steps in the field of virtual project teams’ potential for success in the GCC. In the following parts of the study we systematically review the literature on the relationships between factors related to cultural rapprochement and the efficiency of virtual communication and coordination. The cost of VPTs and the efficiency of communication and coordination may hinder the succession of VPTs in the GCC; however, innovation capabilities will play a positive role in reinforcing the succession of VPTs in the GCC countries, as we discuss in the coming parts of this study.
2 Literature Review
Having shared values among virtual project teams in the Gulf countries may strongly affect the effectiveness of their organizations. It would also further the GCC countries’ objectives of reinforcing links and strengthening cooperation. This research is important because it helps the GCC countries achieve integration with higher flexibility, fewer geographical constraints, and better cultural understanding among virtual team members. David J. Pauleen of Victoria University of Wellington in New Zealand (2002) [6] defined virtual project teams as “the dispersed groups of coworkers that use telecommunication technologies to accomplish organizational tasks”. This definition highlights the importance of virtual and global project teams. Another definition classifies virtual teams into geographically dispersed and organizationally dispersed teams.
These teams have a goal to accomplish, and it is possible to accomplish it using information technologies and telecommunication (de Jong, Schalk and Curşeu, 2008 [33]). In conclusion, virtual project teams use technologically mediated communication to perform their tasks. By using virtual project teams, an organization can become more dynamic and become involved in projects that are more complex. The complexity and ambiguity of boundary-less virtual project teams tend to be the main research points in most virtual project team research (Peters and Manz, 2007 [17, 22]; Denton [5]; Oertig and Buergi, 2006 [24]; Morris, 2008 [36]; de Jong, Schalk and Curşeu, 2008 [33]). However, the nature of the GCC countries has a positive impact on minimizing the ambiguity of boundary-less virtual project teams, and yet more research is still needed on the area of challenges and potential for success in the GCC countries.
The national institutes in the Arab Gulf countries estimated that the total population of the GCC countries reached 51 million in 2016. The population density per square kilometre of the GCC countries is increasing, at higher percentages in 2009 compared to 2007 and 2008; in Saudi Arabia the population increased by about 700% from 1961 to 2017 [27, 30]. Statistics show that a growing share of the Gulf Cooperation Council region’s population will reside in an urban setting. The following chart (Fig. 1) shows the share of urban population in the Gulf Cooperation Council region from 2005 to 2030.
This increase gives researchers a higher priority to find more cost-controlled methods of conducting business. Research has shown that virtual project teams are cost-controlled (Potter, Balthazard and Cooke, 2000 [34]).
Foreign direct investment in the GCC is rapidly increasing: it grew from less than 50,000 million US dollars in the year 2000 to almost 300,000 million US dollars in the year 2009. The main GCC country attracting foreign investment is KSA. The GCC countries’ foreign direct investment in other countries increased from 1,100,000 million US dollars in the year 2000 to 13,500,000 million US dollars in the year 2009. The main GCC countries holding foreign direct investments in other countries are the UAE (39.7%) and KSA (29.9%) [31, 32].
This makes it more important to study how to increase the effectiveness of conducting high-performance virtual project teams, whether among GCC countries or between the GCC countries and the rest of the world. Accordingly, the researchers apply their research to two of these countries, KSA and UAE. The following table shows the size of KSA’s and UAE’s intra-trade for the years 2007, 2008, and 2009 (numbers are in million dollars):
Table 1 shows that the UAE and KSA were not able to keep increasing their intra-trade in the year 2009. They must find tools to facilitate the intra-trade between each other. One of the main objectives of the GCC countries is mentioned in the GCC charter as:
“To stimulate scientific and technological progress in the fields of industry,
mining, agriculture, water and animal resources; to establish scientific research;
to establish joint ventures and encourage cooperation by the private sector for the
good of their peoples” [35].
This objective is another reason for researchers to focus on technological solutions. Using technological advances to establish joint ventures and encourage cooperation among the GCC countries, mainly in the private sector, is very important. Working very early or very late because of time differences is one component of the virtual aspects of communication. This challenge does not exist for virtual project teams in the GCC as long as they do not include members from outside the GCC. Morris (2008) [36] found that project managers in many cases treat virtual teams the same as co-located teams. This indicates that a study is needed to investigate the human impact versus the virtual working impact on the success of the project. Some practices are shaped by cultural retention and traditions; however, these practices have not been tested for productivity and success when applied virtually. A virtual work environment may increase stress indicators such as tensions and conflicts, and disengagement in the virtual environment is higher than in a co-located environment, as found by Morris (2008) [36]. This shows that cultural differences are still not well managed in virtual project teams, and that managing virtual project teams within similar cultures will tend to be more successful and effective [38].
For example, instead of saying “hi”, some members use “sir”, which is not the case for Australian members.
Such environmental barriers may be common in other virtual contexts. The Gulf countries share a common language and common traditions, and they have higher chances of repeating co-located communication due to the short geographical distances among the GCC countries. Holton (2001) [18] used both inductive and deductive research approaches to find out how to increase trust and collaboration among virtual team members as a major factor for organizational success. He focused on the team climate and the opportunity for regular communication to create mutual trust and cohesiveness [39].
Beranek and Martz (2005) [40] conducted research on how to make virtual teams more effective by improving relational links using training tools. They concluded that training could increase cohesiveness and the ability to exchange information, which positively affects team performance. In the GCC countries, other sources of cohesiveness are already available as a cultural background before the virtual team is even formed. The level of knowledge represents another important factor affecting virtual project team performance.
The GCC countries are still in the process of importing external experts in different fields, which may decrease the desire to form virtual project teams entirely from GCC members. The advantage of facilitated cultural communication may increase team learning, especially if a high level of knowledge is one of the conditions at the stage of virtual project team formation. Kuruppuarachchi (2006) [41] tried to find out how to maximize the performance of virtual project teams. He emphasized arranging periodic co-located face-to-face meetings among team members on both a monthly and a weekly basis. This is more convenient if the team members live in neighboring countries [16].
Giving more freedom to the virtual team, rather than relying on old control-oriented management styles, and raising the skills of members were part of Kuruppuarachchi’s (2006) [41] research conclusions. This supports the view that, for the successful application of virtual project teams in the Arabian Gulf Area, it is necessary to minimize managerial control and to raise the learning effort of organizational members, which may lead to better performance of GCC virtual project teams. This conclusion matches the findings of Denton’s work [5]: increasing intranet learning, through comparing results with plans and creating feedback loops, raises the level of virtual project teams’ performance. Denton [5] found that rapid and clear feedback loops encourage flexibility in controlling virtual teams’ self-directed tasks.
This again supports the idea of minimizing the need for traditional management control and maximizing virtual project members’ self-control. Keeping on track through repeated, regular co-located face-to-face communication is also mandatory for effective task performance. The environmental context of GCC virtual project teams is another positive component of the potential for their successful use in the GCC countries [8].
Oertig and Buergi (2006) [24] discussed the challenges of managing cross-cultural virtual project teams. Through their qualitative research, which took the form of a thematic analysis, they found that the challenges include managing language and cultural issues among virtual project team members; managing the virtual aspects of communication and building trust are further challenges. As long as there is a unified language and a semi-unified culture among the GCC countries, it can be expected that building trust among virtual project team members will be easier, faster, and more effective.
Peters and Manz (2007) [22] tried to find out how the depth of relationships, trust, and shared understanding among team members feed into a team’s collaborative ability. They initiated a conceptual model for identifying antecedents of virtual project team collaboration, conceptualizing an effect of the depth of the relationship on the degree of virtual collaboration. In addition, they expected the depth of relationships to positively affect the level of trust and to speed up the attainment of shared understanding; consequently, this increases the potential for a higher degree of virtual collaboration. If those four factors are reinforced in the virtual project team, the researchers expected that more innovation in performance might occur.
The internal feeling of collaboration among virtual project team members is another
point that can be highly expected of GCC citizens due to the agreements they have and
the GCC charter, which shares the same goal. More focus on the task rather than on
relationship building may lead to higher competition or sometimes miscommunication
(Oertig and Buergi, 2006) [24]. This will not be the case with people living next to
each other who share the same economic and social goals, as in the GCC. For building
deeper relationships and keeping all team members in the loop, it was found that
face-to-face communication is a must. In addition, it is recommended that the team
leader be located in one country and the team manager in another, with almost the
same number of virtual project team members with each (Oertig and Buergi, 2006) [24].
In their qualitative research, Oertig and Buergi found by interviewing managers that
building trust in virtual project teams takes between three and nine months. This
trust encourages virtual team members to report problems to the project leader even
before taking formal action to report them, which leads to higher performance
efficiency.
From the researcher's point of view, this aspect of trust relates to the task itself,
after agreement on the relationship itself. Informal contacts among virtual team
members, and between the team manager and the team leader, are reinforcing factors for
reaching better levels of trust more quickly. Keeping everybody at the same level of
information is another very important aspect, and it may be part of the role of the
virtual project manager and leader.
Equality of participation without dominance and a shared understanding of workload are
main characteristics of managing virtual project teams. Oertig [24] found in her
research that, between the USA, Europe, and Japan, using the English language involved
cultural barriers that significantly affected the level of trust. In the case of the
GCC, by contrast, using the Arabic language will remove barriers and increase trust,
especially since this language is correlated with the zone's shared traditions and
religion. Regional stereotyping is minimal among the GCC states, because they share
many factors such as weather, language, level of income, religion, history, norms, and
traditions. They also have interrelationships revealed by the growing ratio of
intra-GCC travel.
Regarding the management of the tasks of virtual project teams, the use of matrix
virtual project teams also has its challenges. The information loop has to feed all
functions, as well as the project, to avoid isolation. That is why it is recommended
that the virtual project team manager continue co-location face-to-face communication
for better relationship building. The cost of continuous
Conceptual Model for Challenges and Succession Opportunities for Virtual Project 335
intercultural communication skills training is saved, or is at least lower for GCC
virtual project teams than for other culturally dispersed virtual project teams.
High turnover in virtual project teams, one of their main characteristics, represents
one of the barriers to trust building. That is why it is highly recommended that the
virtual project team be formed on the basis of a shared background, so that every team
starts with an acceptable level of trust, which is exactly the case for the GCC's
virtual project teams. This may increase the speed and efficiency of performance
compared to virtual project teams based on different cultures. It also reinforces the
idea of assuming responsibility for the team's goals, not only for one's own
contribution, the lack of which was one of the barriers to creating collaboration in
Peters's research, 2007 [17]. In that research she also recommended forming informal
personal relationships for better functioning and collaboration; this too is highly
available in the GCC, where statistics show that citizens have extended families in
other GCC states and are increasingly travelling to each other's countries.
Reliance on team members' knowledge is the aspect that needs more in-depth research in
comparison with multicultural virtual project teams. de Jong, Schalk and Curşeu,
2008 [33] found that higher levels of virtuality increase the effect of perceived task
conflict on virtual team performance, and vice versa. By the level of virtuality, they
meant three dimensions: the degree of synchronization, the presence of nonverbal and
para-verbal cues, and the extent of use of communication media. Regarding nonverbal
communication among GCC virtual project teams, high mutual understanding and clear
interpretation of these cues can be expected. This will lessen the manager's effort in
resolving conflicts arising from misunderstandings of those cues. de Jong and
colleagues were also able to show that relationship conflict has a negative impact on
perceived virtual project team performance [33].
Task conflicts have a positive impact on perceived virtual project team performance,
whereas an increase in relationship conflict decreases it. Relationship conflict in the
GCC is lower than in multicultural virtual project teams due to the informal nature of
the relationships among GCC citizens, their interrelationships, and related
families [13].
The cognitive ability of the virtual team members in that research was measured by
four variables: the breadth of perspectives on the problem at hand; the size of the
pool of potential solutions to examine; the extent of innovative ideas present among
members; and the variety of criteria for evaluating possible solutions among members.
All these variables may represent training needs for virtual team members in the GCC
countries, which represents one of the main challenges.
Innovation in virtual team performance will depend highly on the cognitive preparation
of virtual project team members. This is another cost item that has to be kept under
control at the stage of forming virtual project teams, and the return on investment of
this training in terms of virtual project team performance has to be measured and
evaluated. Behrend and Erwee [2] conducted research in the same area, mapping
knowledge flows in virtual teams. They found that knowledge management in virtual
environments is more complex than common business practice suggests, which gives
training before the formation of the virtual project team a higher value. They
concluded that organizations often launch new initiatives without understanding the
inner workings of the formal and informal networks involved, relying on the
philosophy that more communication and collaboration are better.
This keeps an initial understanding of cognitive capital essential. It also shows that
strong communication and collaboration are a factor for success but are not enough on
their own without understanding the dimensions of virtual knowledge sharing. Resource
allocation, coordination, and communication support systems were found to be further
challenges that virtual project team management will have to deal with, according to
the findings of Drouin, Bourgault and Gervais [7]. Innovative performance follows from
both knowledge exploitation and access to knowledge in virtual project teams.
Innovation capabilities are another challenge facing GCC localized virtual project
teams, because they want to overcome cross-cultural communication problems without
losing higher levels of team performance. This conclusion matches the conceptual model
developed by Gressgard, 2011 [42]: technological development improves communication
characteristics, and communication characteristics improve two innovation capabilities,
knowledge exploitation and knowledge access (which researchers may refer to as
knowledge exploration); both improve the innovative performance of the virtual project
team.
Although the ease of building trust is one of the success points of GCC virtual
project teams, it was found that carefully choosing the communication medium affects
virtual team members' perception of trust, as shown by Olson and Olson in 2012 [17].
The last challenge is the need for decentralized responsibility to enhance sharing.
Charismatic and participative leadership is needed, but at the same time project goals
and decisions are negotiated at the level of virtual team members. These are the
findings of Muganda and Pillay in 2013 [29], who studied forms of power, politics and
leadership in asynchronous virtual project environments.
3 Conclusion
The research aims at investigating the special environmental needs for the applicability
of virtual project teams in the GCC countries. The main research question is: “What are
the main challenges and succession opportunities for virtual project teams in the
GCC?”.
To answer this research question, the researchers conducted a systematic review of the
literature and came up with a conceptual model (Fig. 2) that represents a starting
point for validating and measuring the successful performance of GCC virtual project
teams:
Fig. 2. Conceptual model of the success of virtual project teams (VPT) in the GCC.
Cultural rapprochement (language, traditions, history, religion and values, family
relations) feeds into three factor groups: efficiency of virtual communication and
coordination (choosing the communication medium, easiness of co-location); cost of VPT
functioning (building trust, resolving relationship conflict, sharing knowledge, VPT
leadership, resources allocation); and innovation capabilities of VPT (self-managed
knowledgeable teams, cognitive capital, knowledge sharing techniques,
decentralization). Together these factors lead to the success of VPT in the GCC.
In fact, we can consider these virtual project teams organizationally dispersed rather
than culturally dispersed because of their cultural rapprochement. Virtual project
teams have high potential for success in the GCC countries because better
communication is available, including better understanding of non-verbal communication
cues and the easiness of repeated co-location face-to-face communication. The shared
common values and traditions and the same language will also minimize the difficulty
of resolving virtual project teams' conflicts. Private sector entrepreneurs have a
higher chance of being supported by the GCC to start their virtual project teams,
which are considered a cost-controlled method of performing projects. Time limitations
do not exist as a challenge for GCC virtual project teams. There are some challenges
regarding virtual project teams' performance in the GCC area. They mainly concern
knowledge sharing techniques, the training needed to raise the cognitive capital of
virtual project members, and the careful choosing of the communication medium. Further
validation of the factors affecting this model is also needed, especially when taking
into consideration the effects of economic and political factors in the region.
References
1. A strategic direction. 2(5), 22–24 (2011)
2. Behrend, F., Erwee, R.: Mapping knowledge flows in virtual teams with SNA. J. Knowl.
Manag. 13, 99–114 (2009). https://doi.org/10.1108/13673270910971860
3. Smith, C.: Understanding project manager identities: a framework for research. Int.
J. Manag. Proj. Bus. 4(4), 4 (2011)
4. Rezania, D., Lingham, T.: Coaching IT project teams: a design toolkit. Int. J. Manag. Proj.
Bus. 2(4), 577–590 (2009)
5. Denton, D.K.: Using intranets to make virtual teams effective. Team Perform. Manag. 12
(7/8), 253–257 (2006)
6. Pauleen, D.J.: Leadership in a global virtual team: an action learning approach. Leadersh.
Organ. Dev. J. 24(3), 153–162 (2003)
7. Drouin, N., Bourgault, M., Gervais, C.: Managing virtual project teams: recent findings
(2009)
8. http://sites.gcc-sg.org/Statistics/Files/1300866234.pdf. Accessed 23 Jan 2019
9. http://sites.gcc-sg.org/Statistics/Files/1306750608.pdf. Accessed 5 Dec 2018
10. http://sites.gcc-sg.org/Statistics/Files/1306750608.pdf. Accessed 19 Dec 2018
11. https://www.statista.com/statistics/1005526/gcc-population-growth/. Accessed 12 Nov 2019
12. https://data.worldbank.org/indicator/EN.POP.DNST. Accessed 2 Feb 2019
13. https://www.statista.com/statistics/957368/gcc-urban-population. Accessed 15 Jan 2019
14. Sommerville, J., Craig, N., Hendry, J.: The role of the project manager: all things to all
people? Struct. Surv. 28(2), 132–141 (2010)
15. Olson, J., Olson, L.: Virtual team trust: task, communication and sequence. Team Perform.
Manag. 18(5/6), 256–276 (2012)
16. Holton, J.A.: Building trust and collaboration in a virtual team. Team Perform.
Manag. Int. J. 7(3/4), 36–47 (2001)
17. Mueller, J.: Knowledge sharing between project teams and its cultural antecedents. J. Knowl.
Manag. 16(3), 435–447 (2012)
18. Kuruppuarachchi, P.: Managing virtual project teams: How to maximize performance.
Handb. Bus. Strategy 6, 71–87 (2006)
19. Jarle Gressgård, L.: Virtual team collaboration and innovation in organizations. Team
Perform. Manag. 17(1/2), 102–119 (2011)
20. Peters, L.M., Manz, C.C.: Identifying antecedents of virtual team collaboration. Team
Perform. Manag. 13(3/4), 117–129 (2007)
21. Lee-Kelley, L.: Situational leadership managing the virtual project team. J. Manag. Dev. 21
(6), 461–476 (2002)
22. Oertig, M., Buergi, T.: The challenges of managing cross-cultural virtual project teams.
Team Perform. Manag. 12(1/2), 23–30 (2006)
23. Mehtab, K., Rehman, A., Ishfaq, S., Jamil, R.: Virtual leadership: a review paper. Mediterr.
J. Soc. Sci. 8, 183–193 (2018)
24. Morgan, L., Paucar-Caceres, A., Wright, G.: Leading effective global virtual teams: the
consequences of methods of communication. Syst. Pract. Action Res. 27, 607–624 (2014)
25. Drouin, N., Bourgault, M., Gervais, C.: Effects of organizational support on components of
virtual project teams. Int. J. Manag. Proj. Bus. 3(4), 625–641 (2010)
Abstract. Virtual technologies and game engines provide new possibilities for
collaborative virtual design within digital building models. The current paper
describes an approach, in which computer-aided design (CAD) models of
buildings are transferred into a game engine based environment, where they can
be reviewed and further designed collaboratively. Following a user-centered
design (UCD) process based on interviews and iterative interactions with
designers and architects, the prototype of Virtual Construction, a game engine
based platform for collaborative virtual design meetings, was designed and
implemented using Unreal Engine 4. The interactive tools developed can be
used both in full immersive virtual reality and using traditional devices (e.g.
laptop or desktop computers). Based on identified user needs, interaction
techniques were implemented for moving, rotating, and aligning objects, adding
and resizing shapes and objects, as well as moving and measuring distances in
the three-dimensional (3D) building model. In addition, the communication
techniques implemented based on user needs included synchronous features
such as voice communication, text chat, pointing, and drawing, and asynchronous
features such as leaving messages and feedback, augmented with screenshots, at exact
virtual locations. Other implemented features included different lighting scenarios,
an evacuation scenario, and crowdsourced voting between different designs.
1 Introduction
The current paper contributes by presenting a user-needs-based approach for
designing tools for interaction and for virtual design and review meetings in the
construction industry. A process for utilizing a game engine (Unreal Engine) in the design and
review of buildings is suggested. Based on the approach and a user-centered design
process, a prototype platform and a set of tools for collaboration in building design
were developed based on interviews and regular discussions with architects and
designers.
We have developed a process and methods for bringing building information models to
game engines in order to be able to view and manipulate them in virtual reality. In our
process, there are two ways to bring the building information model with its metadata
(e.g. manufacturer, material, and price for each component) into a game engine. If the
Unreal Engine is used, the Datasmith add-on for Unreal Engine can be used to import a
building information model or its parts. Datasmith is a collection of tools and plugins,
which directly supports bringing content from more than 20 modeling formats into
Unreal Engine 4. For example, if a model or a part of it has been created with
Autodesk’s Revit software, it can be imported into the Unreal Engine with metadata
directly in Revit’s native format. Figure 1 below illustrates the proposed process for
bringing building models into game engines.
Fig. 1. A process for bringing building information models into game engines.
For a universal solution working with all the game engines and CAD software, the
Industry Foundation Classes (IFC) format is the only possible option. The IFC format
is the standard data storage and data transfer format for BIM. The IFC format is not
344 J. Ojala et al.
based on meshes (polygon structures), and game engines do not support it as such for
performance reasons. Thus, IFC models must first be converted into a mesh format
supported by game engines. The IFC format also supports the inclusion of metadata,
while pure mesh-based models do not contain metadata. In the process we have
developed, the IFC sub-models of the building data model are converted into the mesh-
based Wavefront OBJ format with the IfcOpenShell open source software. In addition,
the IFC models are translated into Extensible Markup Language (XML) using the same
software. Then the metadata they contain can be read directly into the game engine
with an XML parser along with the data model. In the game engine, the models and
associated metadata are linked to each other using common identifier fields (e.g. id or
name). We have developed a computer program to support this process and convert
data models into a format that is supported by game engines.
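The last step of this process, attaching the exported metadata to the converted mesh objects through a shared identifier field, can be sketched as below. The XML element and attribute names here are illustrative assumptions, not the actual IfcOpenShell output schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata export; element and attribute names are assumptions.
XML = """<building>
  <element id="wall-01" material="concrete" price="120.0"/>
  <element id="door-07" material="oak" price="340.0"/>
</building>"""

def load_metadata(xml_text):
    """Index metadata records by their common identifier field."""
    root = ET.fromstring(xml_text)
    return {e.get("id"): dict(e.attrib) for e in root.iter("element")}

def link_meshes(mesh_names, metadata):
    """Attach a metadata record to each mesh via the shared identifier."""
    return {name: metadata.get(name) for name in mesh_names}

metadata = load_metadata(XML)
linked = link_meshes(["wall-01", "door-07"], metadata)
```

In the actual game engine, the dictionary lookup would be replaced by attaching the record to the imported mesh actor, but the join logic is the same.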
Table 1. Summary of identified user needs for a virtual building design platform
General:
- Need for an integrated communication and visualization solution
- Need for real-time information; plans always updated and shared
Visualization:
- Immersion and walk-in (to examine the model like a real building)
- Easy navigation and wayfinding within the entire 3D model
- Sketching by drawing and adding simple objects
- Changing textures and colors “on the fly”
- Seeing behind structures (e.g. locations of pipes)
- Decoration by easily adding and moving stock objects
- Simulation of lighting options and different times of day and year
Communication:
- Synchronous design meetings and asynchronous communication
- Communicate in the model using voice and chat (“like in Skype”)
- Awareness of others and their locations in the model
- Pointing and highlighting parts of the 3D model
- Leaving feedback messages to 3D objects
(continued)
Virtual Construction: Interactive Tools for Collaboration in Virtual Reality 345
Table 1. (continued)
Access control and crowdsourcing:
- Default user group of all project members and private groups
- Involving end users of the building, “design your own work room”
- Crowdsourcing the design of public buildings
- Access control with high information security
- Easy versioning and version control
Special needs:
- Supporting design for accessibility
- Scenarios and simulations for building evacuation
First, a design meeting focusing on the requirements of each company for col-
laborative virtual technology was arranged. Next, four in-depth user interviews were
conducted. The interviewed persons were: a senior architect, a junior architect, a senior
building designer, and a designer of industrial structures. The semi-structured inter-
views concentrated on understanding their work processes, communication during the
processes, current use of tools and technology, and ideas for improved or new tools.
The identified requirements and ideas were listed and categorized. The most central
need categories and needs are summarized in Table 1 above. Interactive tools were
designed based on the requirements. The designs were refined iteratively in the context
of monthly meetings and special events (e.g. live demonstrations) over the course of
more than one year.
3.2 Technology
The frontend application of the Virtual Construction platform was implemented using
Unreal Engine 4 due to its superior visualization capabilities and more straightforward
compatibility with CAD software. The server side backend was programmed using
Node.js. HTTP calls and web sockets are used for communication between the frontend
application and the backend. The system uses Vivox voice services for voice chat. The
system has been tested with both HTC Vive Pro and Oculus Rift virtual reality
headsets.
The different tools implemented can be accessed from a radial menu activated by
pressing the menu button of the controller. The menus and dialogs are operated using two
hands so that the 2D menu or dialog can be moved with the non-dominant hand, and
pointing and selecting items is carried out with raycasting as in interaction with the 3D
objects. The default method for moving both shorter and longer distances in virtual
reality is to teleport to visible locations. This is achieved by pressing the controller
touchpad button, moving the controller so that the ray ends at a desired position on
ground or floor, and releasing the touchpad button. Thus, the technique used for
moving is a type of specified coordinate movement [14].
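The floor-intersection calculation behind this kind of teleport can be sketched as follows; this is a simplified geometric version, not the actual Unreal Engine raycasting implementation.

```python
def teleport_target(origin, direction, floor_y=0.0):
    """Intersect the controller ray with a horizontal floor plane.

    Returns the teleport destination on the floor, or None when the
    ray does not point downward and therefore never reaches the floor.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if dy >= 0.0:  # ray parallel to or pointing away from the floor
        return None
    t = (floor_y - oy) / dy          # parameter along the ray at the floor
    return (ox + t * dx, floor_y, oz + t * dz)

# Controller held 2 m above the floor, pointing forward and down:
target = teleport_target((0.0, 2.0, 0.0), (0.0, -1.0, 1.0))
```

Releasing the touchpad button would then move the user to `target`, matching the specified-coordinate movement described above.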
Interaction and Communication Techniques. The different interaction and
communication techniques designed and implemented for VR are displayed in Figs. 2 and
3 below. Fairly detailed descriptions are given for each technique so that readers can
adopt them in their own VR developments.
Floor Plan and Guidance. Using the map tool, a 2D architectural floor plan of the
premises appears on a sign on the user’s non-dominant hand (Fig. 4, left). The
architectural plan can help in understanding the overall design of the building. It also
acts as a map, in which the user’s current location is indicated, and the user can point
and select a destination on the plan with the raygun and the trigger. The user gets
guidance to the selected location. The user can also see the locations of other users in
the same virtual model. The guidance is drawn as arrows on the floor on walkable
routes (Fig. 4, right). If the user points the map with the raygun and selects a location
with the touchpad button, he/she is directly teleported to that location.
Crowdsourcing and Voting. Opinions and votes can be gathered from the end users
of the building or the general public in the case of public buildings. After completing a
registration, end users can inspect the model freely, leave location-specific feedback
(see Fig. 3), and participate in voting. Voting buttons can be added to rooms (Fig. 5).
After selecting the voting button, the user can toggle between predefined alternatives
for, for example, furniture, layouts, materials and colors, or lighting by selecting the
number of the option in the voting dialog, and give her/his vote after inspection.
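A minimal sketch of this toggling-and-voting mechanism is given below; the class and option names are illustrative and are not taken from the actual platform.

```python
from collections import Counter

class VotingButton:
    """A room-level button that toggles between predefined design
    alternatives and collects votes for the one currently shown."""

    def __init__(self, room, options):
        self.room = room
        self.options = options
        self.current = 0          # index of the displayed alternative
        self.votes = Counter()

    def toggle(self, number):
        """Show the alternative with the selected number."""
        self.current = number % len(self.options)
        return self.options[self.current]

    def vote(self):
        """Record one vote for the alternative under inspection."""
        self.votes[self.options[self.current]] += 1

button = VotingButton("lobby", ["oak floor", "tile floor", "carpet"])
button.toggle(1)   # inspect alternative number 1
button.vote()      # vote for the displayed alternative
```

The same index-based toggling works for furniture, layouts, materials, colors, or lighting alternatives.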
Scenarios for Visualization. The developed platform allows the selection of one of
the (currently eight) available visualization scenarios to display 3D models of buildings
under different lighting conditions from the menu. An example of visualizing a
building under normal lighting conditions and in an evacuation mode is presented in
Fig. 6.
Fig. 4. The floor plan (left) and route instructions to a destination (right)
Fig. 5. Buttons can be added to rooms for toggling between alternatives and voting.
4 Conclusion
Acknowledgments. The authors would like to thank everybody who participated in the current
study. This research was funded by Business Finland from the European Regional Development
Fund (project A73293) and by the participating companies.
References
1. Kunz, J., Fischer, M.: Virtual design and construction: themes, case studies and
implementation suggestions. Working Paper #097, Center for Integrated Facility Engineer-
ing (CIFE), Stanford University (2012)
2. Waly, A.F., Thabet, W.Y.: A virtual construction environment for preconstruction planning.
Autom. Constr. 12(2), 139–154 (2003)
3. Hilfert, T., König, M.: Low-cost virtual reality environment for engineering and
construction. Vis. Eng. 4(2), 1–18 (2016)
4. Kuliga, S.F., Thrash, T., Dalton, R.C., Hölscher, C.: Virtual reality as an empirical research
tool—exploring user experience in a real building and a corresponding virtual model.
Comput. Environ. Urban Syst. 54, 363–375 (2015)
5. Kosmadoudi, Z., Lim, T., Ritchie, J., Louchart, S., Liu, Y., Sung, R.: Engineering design
using game-enhanced CAD: the potential to augment the user experience with game
elements. Comput. Aided Des. 45(3), 777–795 (2013)
6. Lee, G., Eastman, C.M., Taunk, T., Ho, C.H.: Usability principles and best practices for the
user interface design of complex 3D architectural design and engineering tools. Int. J. Hum
Comput Stud. 68(1–2), 90–104 (2010)
7. Partala, T., Nurminen, A., Vainio, T., Laaksonen, J., Laine, M., Väänänen, J.: Salience of
visual cues in 3D city maps. In: Proceedings of the 24th BCS Interaction Specialist Group
Conference, pp. 428–432 (2010)
8. Partala, T., Salminen, M.: User experience of photorealistic urban pedestrian navigation. In:
Proceedings of the International Working Conference on Advanced Visual Interfaces, AVI
2012, pp. 204–207 (2012)
9. de Klerk, R., Duarte, A.M., Medeiros, D.P., Duarte, J.P., Jorge, J., Lopes, D.S.: Usability
studies on building early stage architectural models in virtual reality. Autom. Constr. 103,
104–116 (2019)
10. Moloney, J., Amor, R.: StringCVE: Advances in a game engine-based collaborative virtual
environment for architectural design. In: Proceedings of CONVR 2003 Conference on
Construction Applications of Virtual Reality, pp. 156–168 (2003)
11. Lin, Y.C., Chen, Y.P., Yien, H.W., Huang, C.Y., Su, Y.C.: Integrated BIM, game engine
and VR technologies for healthcare design: a case study in cancer hospital. Adv. Eng.
Inform. 36, 130–145 (2018)
12. Glue – Universal Collaboration Platform. https://glue.work/. Accessed 30 Aug 2019
13. Fake multi-user VR/AR – Next level meetings and collaboration. http://www.fake.fi/
multiuser. Accessed 30 Aug 2019
14. Mackinlay, J.D., Card, S.K., Robertson, G.G.: Rapid controlled movement through a virtual
3D workspace. Comput. Graph. 24(4), 171–176 (1990)
15. Bassanino, M., Fernando, T., Wu, K.C.: Can virtual workspaces enhance team communi-
cation and collaboration in design review meetings? Archit. Eng. Des. Manag. 10(3–4), 200–
217 (2014)
Implementing Material Changes in Augmented
Environments
1 Introduction
Today, Computer Aided Design (CAD) applications are a standard in the design and
construction industries. With new technology innovations come improvements to how
CAD is used and implemented. Augmented reality (AR) and virtual reality (VR) are
two examples of how designers are taking the next steps toward a design process that
creates immersive experiences. In this paper, an application is discussed that uses
augmented reality to allow the user to change the materials that make up an object. The
application runs on a mobile device (smartphone or tablet): it creates an augmented
reality house, and the user can then choose to change the color or material of the
house.
2 Previous Work
The primary usage of this application is intended for those in the fields of architecture
and design, including building design, landscape architecture, and environmental
planning. As a subarea of building design, remodeling is also a key industry that
this application of augmented reality can benefit. For the growing field of mixed reality
applications, there is a subtle question that helps draw the line for the promise of an
application: how real does it have to be? Our answer is that it depends on the type of
application. Applications in the health sciences or that work with patients need to be
highly realistic and multi-sensory [1]; examples would be remote surgery or
post-traumatic stress therapy. On the other hand, detailed realism may not be as
necessary for landscape design.
The level of realism will greatly depend on the level of immersion the user expects
and on the interaction the application can provide. Finally, the level of realism can
only be expected to improve for all applications as it becomes easier to design and
create 3D models.
VR/AR is being used to enhance collaboration among design teams [2]. Virtual
environments, for example, have been used to bring remotely located members into a
common space to co-design [3]. This helps the designer-customer relationship as well:
in the immersive environments of mixed reality applications, designers can bring the
design to the client and work with the client's imagination.
Another advantage of mixed reality applications is the testing of spatial concepts.
The controlled environments of VR enable the testing of hypothetical designs and
practices [4]. These safe spaces, which can be easily manipulated, allow designers to
test and survey different design ideas and get user input before the physical structure
is built. Spatial concepting can help create real-world perspectives that are not
possible on computer screens or with physical models. For the application developed
here, material changes could also be applied to interior modeling, with floors, walls,
countertops, or even furniture. To a further extent, the application could be used in
fashion, letting a user try on different clothes or shoes in different colors. This
application helps take the guesswork and uncertainty out of the picture.
What makes this application useful is the level of realism that is achievable with 3D
modeling. With the modeling capabilities of programs such as 3DS Max and Unity,
the model home can be given great detail and realism, as shown in Fig. 1.
There are two ways to change a material in augmented reality. The first is to identify
the polygons that make up the object and apply a new material to them. The second is
to recreate the same object with the selected material. This paper focuses on the
latter and discusses possible ways to accomplish the former. Augmented objects are
created from various objects and game components that are grouped together to form
the needed object. The difficulty in changing the materials of an object lies in
identifying the specific polygon or group of polygons that need to be changed. This is
a challenge, at least in our opinion, because augmented objects are stored in the
application as prefabricated objects, which generally do not get ungrouped or changed.
If the objects are grouped, then materials can be changed as shown in Fig. 2.
converted to prefabricated objects. Next, a script would be added that changes the objects based on the color selected. This can be done using an array of materials and linking each button to the array index for that material. In this way, one model is created that can offer many different material options.
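The material-array approach described above could be sketched as a Unity C# script along the following lines; the class, field, and method names are illustrative assumptions, not taken from the application itself.

```csharp
using UnityEngine;

// Illustrative sketch (hypothetical names): one Renderer, an array of
// material options, and a method that UI buttons call with the index
// of the material to apply.
public class MaterialChanger : MonoBehaviour
{
    public Renderer targetRenderer;   // renderer of the grouped object
    public Material[] materials;      // options shown as buttons in the UI

    // Wire each material button's On Click () event to this method,
    // passing that button's index into the materials array.
    public void SetMaterial(int index)
    {
        if (index >= 0 && index < materials.Length)
            targetRenderer.material = materials[index];
    }
}
```

Each button is then bound to a single model, so adding a new material option only means appending to the array rather than building another prefab.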
Part of the recent success of augmented reality can be linked to Google™ and Apple™ and their augmented reality platforms, ARCore™ and ARKit™, respectively. These platforms have made augmented reality systems easy to build for their mobile devices. Google's ARCore has its origins in the now-terminated Project Tango, a division of Google founded in 2014. That platform used computer vision to enable a mobile device to detect its position relative to the world around the user, enabling the platform to create various 3D mappings and environmental recognitions. The software worked by integrating motion tracking, area learning, and depth perception. Like Project Tango, ARCore uses three technologies to implement its augmented reality functions: motion tracking, environmental recognition, and light estimation. ARCore provides Software Development Kits (SDKs) for development platforms like Android, Unity, and Unreal, which is what allowed us to create this AR application.
There are ways to counter this. These solutions include sensing the 3D structure of the real world, creating a digital 3D model of the real-world structure, and then either rendering the model as a transparent mask that hides virtual objects or integrating the model into the 3D rendering of the real world. Sensing the environment can be done using a light sensor, a time-of-flight sensor, or stereo cameras. Light sensors project an IR light pattern onto a 3D surface and use the pattern's distortion to reconstruct the surface contours.
356 A. Pike and S. K. Semwal
Time-of-flight sensors also use IR light, but instead reflect it off objects in the field of view and use the delay of the reflected light to calculate depth. Stereo cameras simulate human binocular vision, measuring the displacement between pixels seen by two cameras. Unfortunately, each of these sensors has its limitations. Some of us feel that devices are not yet able to understand the environment well enough, or quickly enough, to make this work for real-time augmented reality. The main issues stem from the poor range of the sensors, low resolution, and slow mesh reconstruction of the 3D scene. Some early attempts have shown that a convolutional neural network architecture can create feature maps representing components of a simplified histogram of oriented depths [6].
For our application, a skybox was created around the model to simulate a blue sky outside the windows. This was made optional so that the user could place the home in its actual intended location and get the real view of what they would see. To overcome the movement problem, a feature was needed that allows the user to move around the environment without physical movement. A set of buttons programmed to move the environment around the user solves this issue: each button raises event triggers on button-down and button-up that call a method to execute a movement script.
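A hold-to-move button of the kind described above might look like the following Unity C# sketch (all names are hypothetical); the button-down and button-up event triggers simply toggle a flag that a per-frame update consumes.

```csharp
using UnityEngine;
using UnityEngine.EventSystems;

// Illustrative sketch: attach to a UI button and register PointerDown /
// PointerUp EventTrigger entries so the environment keeps moving while
// the button is held.
public class MoveButton : MonoBehaviour
{
    public Transform environmentRoot;          // parent of the augmented model
    public Vector3 direction = Vector3.forward;
    public float speed = 1.0f;

    private bool pressed;

    public void OnPointerDown(BaseEventData _) { pressed = true; }
    public void OnPointerUp(BaseEventData _)   { pressed = false; }

    void Update()
    {
        // Moving the environment opposite to `direction` makes the user
        // appear to move through the model without physically walking.
        if (pressed)
            environmentRoot.Translate(-direction * speed * Time.deltaTime,
                                      Space.World);
    }
}
```

Translating the model rather than the camera keeps the AR camera under the tracking system's control, which is one reasonable way to reconcile virtual movement with device tracking.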
To implement this project, the model that the application would project as an augmented image first needed to be created. These 3D models were created in 3DS Max. Next, the application to run the augmented reality environment was created using Unity, a game engine. As part of the application, ARCore (mentioned earlier) was used to provide plane detection and object dropping. ARCore is an SDK from Google that provides the tools for users to create their own augmented reality applications on mobile smart devices. Finally, scripts were written to accomplish the stated features.
ARCore™ helps make augmented applications possible by allowing Unity™ to use certain elements that detect and track planes within the camera's field of view. By adding the ARCore™ library to a Unity script, it can be used to estimate where planes are, called trackedPlanes, and to group them into a List. A separate script and prefab from ARCore can then display these planes as a grid of triangles so that the user can see where the program has identified a plane. If a plane is detected, the user can touch a point on that plane to serve as a tracking point where an object can be placed. This touched point, or hit, creates an anchor at which an object is then instantiated.
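The touch-hit-anchor flow described above follows the familiar pattern from the HelloAR sample in Google's ARCore SDK for Unity; the sketch below is illustrative (the class and prefab field names are assumptions, not from the paper).

```csharp
using GoogleARCore;
using UnityEngine;

// Illustrative sketch based on the ARCore-for-Unity HelloAR pattern:
// raycast a screen touch against detected planes and instantiate the
// model at an anchor on the hit pose.
public class ModelPlacer : MonoBehaviour
{
    public GameObject modelPrefab;  // e.g. the prefabricated model home

    void Update()
    {
        if (Input.touchCount == 0) return;
        Touch touch = Input.GetTouch(0);
        if (touch.phase != TouchPhase.Began) return;

        TrackableHit hit;
        if (Frame.Raycast(touch.position.x, touch.position.y,
                          TrackableHitFlags.PlaneWithinPolygon, out hit))
        {
            // Anchoring keeps the instantiated model fixed in the world
            // as ARCore refines its understanding of the scene.
            Anchor anchor = hit.Trackable.CreateAnchor(hit.Pose);
            GameObject model = Instantiate(modelPrefab,
                                           hit.Pose.position,
                                           hit.Pose.rotation);
            model.transform.parent = anchor.transform;
        }
    }
}
```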
For future work, it will be necessary to have a mix of what is occluded and what is not occluded, and this will have to be done computationally. If using AR movement on a headset, the user would need to use their hands to manipulate the scene while keeping the real-world environment of desks and tables hidden. To do this, a new approach would be needed that uses hand pose recognition in combination with model-based methods for estimating occlusion [7]. By using a sensing device like the Leap Motion, one can track the hand's estimated position and calculate a mask of the occluded portion. In our case, materials could be changed in the AR environment by using buttons displayed on top of the window shown on a mobile phone or tablet (Fig. 4). The material buttons call a method, using an On Click () event, that assesses the object created and finds the appropriate tags. In the model, certain elements were tagged as "ExtWalls", "Floors", etc. The method finds each of these elements and changes the material to the newly selected one. Exterior elements are changed in groups, as in the code snippet on the right (Figs. 4, 5, 6 and 7).
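As a sketch of this tag-based approach (the tag names "ExtWalls" and "Floors" come from the paper; the class and method names are hypothetical), a button's On Click () event could pass a material to a method that re-skins every object carrying the matching tag:

```csharp
using UnityEngine;

// Illustrative sketch: each material button's On Click () event calls
// one of the public methods, and every tagged element is updated in
// a single pass, so exterior elements change as a group.
public class TaggedMaterialChanger : MonoBehaviour
{
    public void ChangeExteriorWalls(Material newMaterial)
    {
        ChangeByTag("ExtWalls", newMaterial);
    }

    public void ChangeFloors(Material newMaterial)
    {
        ChangeByTag("Floors", newMaterial);
    }

    private static void ChangeByTag(string tag, Material newMaterial)
    {
        foreach (GameObject element in GameObject.FindGameObjectsWithTag(tag))
        {
            Renderer renderer = element.GetComponent<Renderer>();
            if (renderer != null)
                renderer.material = newMaterial;
        }
    }
}
```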
4.2 Scaling
This initial test of the applicability of the material changer was a success, and the next steps are ready to be taken. Next, the method for changing materials needs to be updated so that only one model is used and a larger number of material options is available (Fig. 7).
Fig. 4. Using Instantiation, Skybox, and interaction to move in augmented reality environment
[10]
Fig. 6. Interaction and On click button calls method to provide the interaction [10].
With the promising results of this project, further development of the application could strengthen the role of AR in design and architecture [8, 9]. These future works include both improvements to existing features and additional features that would add more benefits. Some ideas would not fit this application but could merit applications of their own. Currently, the application works only for a single floor; split floors and multiple stories are not navigated in our implementation at this time, although they are certainly feasible, as a person can walk up stairs in the 3D world. To facilitate this, functionality to change floors could be added with a button in our implementation, switching the displayed scene from one model to another.
To push the material changer idea further, a user could be given the ability to change between different home options, like an extended deck or a 3-car garage. Instead of changing just the look of an object, the user would be able to change the entire layout of the model. A step beyond that would be an object manipulation feature that allows the user to change individual components, like a single wall or window, without breaking the model up. In this way, the user could play with ideas such as expanding their house or moving walls around. The latter idea could be accomplished by linking objects' data points. For example, if a user moved the blue wall out several feet, it would leave a gap in the adjacent red wall, as seen in Fig. 8. To account for this, a script in our application would need to take the vertex points of the red wall connected to the blue wall and move those points an equal distance, and in the same direction, as the blue wall moved. An algorithm following this premise could redesign a model in-app while maintaining the model's completeness. Another feature that would help is the adding and removing of objects, e.g. furniture; several companies, such as IKEA™, offer similar features.
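The linked-vertex repair described above could be sketched as follows in Unity C#; this is a hypothetical illustration that assumes the plane of the moved wall's shared face is known.

```csharp
using UnityEngine;

// Hypothetical sketch of the linked-vertex idea: after the "blue" wall is
// moved by `offset`, shift the vertices of the adjacent "red" wall that lay
// on the shared face by the same offset, so no gap opens between the walls.
public static class LinkedWallEditor
{
    public static void FollowMovedWall(Mesh adjacentWall, Plane sharedFace,
                                       Vector3 offset, float tolerance = 0.01f)
    {
        Vector3[] vertices = adjacentWall.vertices;
        for (int i = 0; i < vertices.Length; i++)
        {
            // Only vertices that sat on the moved wall's face are shifted,
            // by the same distance and in the same direction as the wall.
            if (Mathf.Abs(sharedFace.GetDistanceToPoint(vertices[i])) < tolerance)
                vertices[i] += offset;
        }
        adjacentWall.vertices = vertices;
        adjacentWall.RecalculateBounds();
        adjacentWall.RecalculateNormals();
    }
}
```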
Finally, an issue when touring a potential home is that you see it only at one time of day on one particular day of the year. Narrowing the time of viewing does not allow the buyer (or designer) to know how the home will look at other times. For example, a kitchen that gets a lot of light during the morning may not get any light during the rest of the day, or a driveway may stay icy during the winter because it doesn't get enough sun. Linking a home's potential geographic location to a database that tracks sun paths could allow a simulation of the natural lighting for that home. With this feature, the user would be better informed of all conditions that may affect the home.

This project looked at the creation of an augmented reality application that displays an interactive 3D model home. The results show working models that were easy to tour through and manipulate. Although the realism of the augmented image was passable, there is still great room for improvement and refinement. Based on the promising results found in this application, we believe there is room for growth for augmented and virtual reality in design and architecture, as well as for consumer usage in finding, touring, and remodeling homes. This application merely scratches the surface of what is possible with the new AR and VR technology available. As consumer adoption and understanding of mixed reality increase and hardware capabilities advance, so too will the potential for this application.
Acknowledgments. This paper is based on an independent study which the first author
undertook during the Spring 2019 with the second author. The independent study paper was
entitled: Material Changes in Augmented Environments, Adam Pike, pp. 1–5 (Spring 2019).
References
1. Portman, M.E., Natapov, A., Fisher-Gewirtzman, D.: To go where no man has gone before:
virtual reality in architecture, landscape architecture and environmental planning. Comput.
Environ. Urban Syst. 54, 376–384 (2015)
2. Wang, X.: Mutually augmented virtual environments for architectural design and
collaboration. In: Dong, A., Moere, A.V., Gero, J.S. (eds.) Computer-Aided Architectural
Design Futures (CAAD Futures) 2007. Springer, Dordrecht (2007)
3. Gu, N., Kim, M.J., Maher, M.L.: Technological advancements in synchronous collaboration:
the effects of 3D virtual worlds and tangible user interfaces on architectural design. Autom.
Constr. 20, 270–278 (2011)
4. Parush, A., Berman, D.: Navigation and orientation in 3D user interfaces: the impact of
navigation aids and landmarks. Int. J. Hum. Comput. Stud. 61, 375–395 (2004)
5. Image taken from website: https://hackernoon.com/why-is-occlusion-in-augmented-reality-
so-hard-7bc8041607f9
6. Höft, N., Schulz, H., Behnke, S.: Fast semantic segmentation of RGB-D scenes with GPU-
accelerated deep neural networks. In: Proceedings of 37th German Conference on Artificial
Intelligence (KI) (2014)
7. Feng, Q., Shum, H.P.H., Morishima, S.: Occlusion for 3D object manipulation with hands in
augmented reality. In: MIRU 2018: Proceedings of the 2018 Meeting on Image Recognition
and Understanding, Sapporo, Japan, August 2018
8. Image taken from website: https://www.mecanoo.nl/Media/Model-Workshop-2017-11-22
9. Image taken from website: https://video.architecturaldigest.com/watch/this-hologram-table-could-revolutionize-architecture-2017-11-22
10. Pike, A.: Architectural Modelling in Augmented Reality. MS Project, Advisor: Dr. S.K. Semwal, Department of Computer Science, pp. 1–31, Summer 2019
Using Activity Theory and Task Structure
Charts to Model Patient-Introduced Online
Health Information into the Family
Physician/Patient Examination Process
Beth Ellington
1 Research Objectives
1.1 Introduction
Online health information may be found in a variety of sources, including government, educational-institution, medical, non-profit and commercial web sites. Research has shown that with the increased access to health information provided via the Internet, not only patients but also their healthcare providers, their caregivers and even healthy people are increasingly seeking health information online [18]. Health information seeking is not a new patient practice but one that has been enhanced through the convenient 24/7 availability of online health information. The ability to access this information from home, work or school and with mobile devices, such as smartphones and tablets, has eased the access limitations of traditional sources of health information by providing the technology for collaboration between patients, doctors and caregivers via websites and email, and has enabled patients to self-educate and form online support communities [20].
One factor that may be contributing to the rise in patients' searching for online health information is the shortage of primary care providers, including family physicians and pediatricians, that exists today in the United States [19]. Primary care provider [14]. By modeling the physician examining patient activity, defining the subject, object, community, tools, rules, and division of labor of the activity, the contradictions in the process should emerge [15]. Activity modeling therefore provides a better understanding of the best fit for the introduction of online health information by the patient during the physician/patient examination and knowledge transfer process.
This study was conducted to define the patient-introduced online health information
niche in the physician examining patient activity by analyzing physician interview
transcripts, activity diagrams and task structure charts to answer the following research
questions:
1. How does the introduction of patient-introduced online health information into the family physician/patient examination process impact clinical workflow?
2. What potential barriers, challenges or improvements to physician/patient examination and communication effectiveness are created by patient-introduced online health information?
3. What process improvements or best practices may be developed to better manage
patient-introduced health information that could enhance the productivity of the
physician examining patient activity?
2 Methods
2.1 Interviews
This research employed the interview method because it allowed data collection for the
study with an individual activity as the unit of analysis. Interviews can be used for
descriptive, explanatory and exploratory purposes [17]. Interviews provide the
researcher with a mechanism for collecting subjective, objective, qualitative and
quantitative data with the advantages of greater flexibility in sampling and fewer
misunderstood questions. Interviews are also more effective for gathering data by tape
recording responses and analyzing those transcribed responses for complicated issues
such as physician/patient communication and clinical workflow processes [5].
The interview design utilized both closed-ended and open-ended questions (see
Appendix A). The interview questions were designed to specifically answer questions
to complete the elements of a preliminary activity diagram (see Fig. 1). Quantitative
data was gathered via questions 4, 5, 21 and 22, which collected data through closed-
ended questions, (e.g., How many days a week do you schedule patient appointments?
or How many patients do you see each week?). Qualitative data was gathered via
questions 1–3, 6–20, 23 and 24, which collected data through a combination of closed-
ended and open-ended questions, (e.g., List the steps you follow when interacting with
a patient from the time you enter the examination room until you exit the room or What
does the phrase “patient health literacy” mean to you?)
366 B. Ellington
3 Results
Four of the physicians described their practice of medicine in traditional terms such
as private practice or urgent care center. Others defined their practice by describing
those they worked with, specific populations served or providing detailed information
about their practice.
[Bar chart: Frequency of Patient-Introduced Online Health Information, reported over the last 7, 14, and 30 days.]
Some physicians felt that the introduction of online health information by the
patient was disruptive and time-consuming. Other physicians just seemed to accept
patient-introduced health information as part of their normal routine and fit it into their
examination workflow.
Evaluation of the task structure charts, using Hierarchical Task Analysis tech-
niques, demonstrated that all physicians reported that they performed the same set of
six subtasks during the patient examination. The subtasks performed by all physicians
were (1) Enters examination room, (2) Communicates with patient, (3) References
medical record, (4) Examines patient, (5) Communicates with patient and (6) Leaves
examination room. The introduction of online health information by the patient
occurred during one of the “Communicates with patient” subtasks, either at the
beginning or the end of the examination depending upon when it was introduced by the
patient.
The physicians who were utilizing electronic medical records systems that included patient educational material often printed out health information for their patients. Other physicians referred their patients to web sites or provided brochures with links to web sites. One physician recommended web sites when they needed to counteract bad information their patients had found online. Figure 4 below shows the various types of health information provided to patients by the physicians.
Fig. 4. Types of health information provided to patients by the physicians (web sites, printouts, brochures).
Some physicians suggested their patients visit specific web sites for additional health information. Some physicians had developed criteria for selecting web sites, which they described as known, trusted, accurate, reliable and high quality, or they knew the URL. One physician recommended that their patients not go to what he described as "junk" sites. The physicians recommended thirteen specific web sites to their patients. Figure 5 shows the web sites recommended to patients by the physicians and, of those recommended sites, which contained advertisements or were commercial sites.
The physicians expressed both legitimate concerns and practical reasons for not communicating with patients via email. Physicians were concerned about HIPAA violations when communicating via unencrypted servers, as well as about personal privacy issues. The physician working in the urgent care center stated that she had no relationship with her patients and therefore had no reason to communicate with them via email.
Two physicians encouraged their patients to communicate via their patient portals. One physician expressed a desire to communicate with his patients via email but was unable to do so due to a lack of office technology. One physician communicated with her patients via email but stated it was very time-consuming because patients sent emails at all hours of the day and she spent hours answering them.
The reason categories for not communicating via email with patients included: lack
of office technology, lack of encryption resulting in potential HIPAA violations, lack of
time, lack of patient relationship, personal preference, personal privacy and preference
for using their patient portal (see Fig. 6).
Fig. 6. Reasons for not communicating with patients via email: lack of time (2), lack of patient relationship (1), personal preference (1), personal privacy (2).
Tools are used to mediate an activity and include both tangible and intangible tools
which support the activity of examining the patient [14]. Tools utilized by the physi-
cians during the examination process included the physician’s medical knowledge, the
patient’s knowledge, patient-introduced online health information, physician suggested
health information, standard medical instruments, electronic medical records, non-
electronic medical records, vital statistics, lab and/or diagnostic test results, prescription
and/or medication information, computers and smartphones. Patient-introduced online health information is an intangible tool used by the physicians during the examination process.
Nine of the physicians had implemented electronic medical records systems in their
practice and one physician was using a smartphone to access prescription information.
Computer technology was used by nine of the physicians in the examination room.
Examples of using a smartphone and electronic medical records in the examination
room are in the physicians’ responses below.
Most physicians with access to electronic medical records systems in their practices perceived them simply as another "tool" in the physician's toolbox. Electronic medical records systems were used for patient file storage and as communication tools within and between practices. However, integration with other systems was mentioned frequently as a barrier to full implementation of electronic medical records systems and to their use as a communication tool for patients and other physicians.
Community is defined as the office staff that supports the activity of examining the
patient [14]. Community in the physicians’ practices included the following titles:
Administrative Staff, Business Office Staff, Certified Nurse’s Aides, Front Desk
Supervisor, Front Office Personnel, Practice Managers, Clinical Care Coordinators,
Instructional Assistants, Laboratory Technicians, Licensed Practical Nurses, Medical
Assistants, Medical Residents, Nurse Practitioners, Nursing Supervisor, Office Man-
agers, Phlebotomists, Physician Assistants, Radiology Technicians, Receptionists,
Referral Clerks, Registered Nurses and Schedulers. The size of the physician’s com-
munity ranged from three to thirty employees. Size of community was generally
dependent upon practice type, for example, the physicians who worked in major
medical center clinics had larger communities compared to the private practices and the
urgent care center.
Division of labor is defined as the relationships and interactions within the community that affect the completion of the activity of examining the patient [14]. Division of labor was obtained by analyzing the physicians' answers to the interview question "How do these other employees support the activity of examining patients?" to determine which employees directly supported the physician examining patient activity. The results of the analysis showed that the division of labor comprised Registered Nurses, Licensed Practical Nurses, Nurse's Aides and Medical Assistants directly supporting the activity of examining patients.
Rules regulate the activity within the community [14]. Common rules and laws that
govern the practice of medicine in North Carolina, in addition to local, state and federal
laws, include the Chaperone Rule, Clinical Laboratory Improvement Amendments of
1988 (CLIA), Health Information Technology for Economic and Clinical Health Act,
Health Insurance Portability and Accountability Act of 1996 Privacy and Security
Rules (HIPAA), Medical Malpractice Liability, and the Patient Protection and
Affordable Care Act of 2010. Regulatory agencies that govern medical practices in
addition to the North Carolina Medical Board are the Centers for Disease Control and
Prevention (CDC), U. S. Department of Health & Human Services, Drug Enforcement
Administration (DEA), Occupational Safety & Health Administration (OSHA) and
Recovery Audit Contractors (RAC).
No practice-specific policies or procedures that the family physicians were required to follow during the examination activity were mentioned in the interview responses. Many of the physicians expressed a sense of autonomy, with fewer regulations and guidelines to follow in the organization of their work, in both medical-center-owned clinics and private practices. However, all physicians seemed keenly aware of the risk-management implications of not following rules, laws and regulations, such as the Chaperone Rule and HIPAA, to minimize liability, litigation and malpractice claims against their clinical practices. Family Physicians Two and Nine also mentioned other regulatory agencies such as OSHA, DEA and CLIA, as well as RAC auditors and private medical insurance company inspectors.
Outcome in an activity diagram is defined as the product of the examination activity [14]. The physician-defined outcomes were based upon the main result the physicians hoped to have achieved when they exited the patient examination room. These intended outcomes are combined in the activity diagram in Fig. 9 below. Activity diagrams provided a visual representation to better identify the online health information niche in the physician examining patient activity and to better understand the interactions that occurred during the activity. Figure 9 also summarizes the subjects, objects, tools, communities, divisions of labor and rules for the ten physician examining patient activities.
The physician with the highest productivity value worked directly in the exami-
nation room with a Nurse Aide only, the second highest worked directly with a Nurse
only, and the third highest worked directly with a Medical Assistant and a Nurse.
Figure 11 provides a comparison of physician productivity with division of labor.
These results indicate that neither the type of practice nor the number of employees directly supporting the physician examining patient activity influences higher physician productivity values. Family Physician Nine, with the highest productivity value, works in a solo private practice where patient appointments are scheduled three at a time,
and Nurse Aides assist in the examination room. This physician utilizes a tablet PC in
the examining room and inputs patient data into the electronic medical record during
the examination. He also uses voice recognition software for dictation. This physician
has also linked suggested web sites for his patients to his practice web site, and either
he or his Nurse Aides recommend those web sites to patients.
Family Physician One had the third highest productivity value and worked directly
with a Nurse in the examining room. His community contained not only Nurses, but
Medical Office Assistants, Instructing Assistants, Administrative Clerks and Medical
Residents. The facility where this physician worked may have directly influenced his
higher productivity value since according to its web site description, “Our center has 58
exam rooms, with areas for X-rays and minor procedures." Fifty-eight exam rooms would accommodate a higher patient volume by allowing the practice to schedule 58 appointments at one time. Family Physician One was also the only
physician who mentioned using a smartphone in the examination room, in addition to
his use of electronic medical records and a laptop.
These results indicate that higher physician productivity values are influenced by
multiple factors. These factors include: efficient utilization of support staff, organization
of workflow, practice business model and use of technology such as electronic medical
records, computers and smartphones in the examination room.
Compared to the physicians with higher productivity values, the physicians with
lower productivity values had other factors adversely affecting their productivity val-
ues. Family Physician Six, with the lowest productivity value, works in a medical
center owned clinic where a Medical Assistant and a Licensed Practical Nurse assist in
the examination room. This physician utilizes electronic medical records and a com-
puter in the examination room but he is currently using two electronic medical records
systems that are not integrated. This physician also mentioned that he is a practicing
geriatrician, so his patients are most likely over 65 with chronic health conditions that
require more time to examine, diagnose and treat. In addition, he stated that he had only
been in that practice for about two years and was in the process of building the practice
resulting in patient output being below capacity. Therefore, his productivity may be
adversely affected by the use of two nonintegrated electronic medical records systems
while trying to build the practice to full capacity along with the extra time needed to
care for his elderly patients.
These results indicate that lower physician productivity values are influenced by
lack of office technology, inefficient use of electronic medical records systems, low
patient output due to practice not being at full capacity and characteristics of patient
populations served creating inefficiencies in workflow.
4 Discussion of Results
An analysis of the results data obtained from the physicians’ interview transcripts, task
structure charts and evaluation of the activity diagrams modeled from the interview
data, revealed that all physicians experienced patient introduction of online health
information during the previous thirty days. The findings indicate that patient-
introduced online health information has an established niche within the
physician/patient examination process (see Fig. 3).
There were no indications from the results of this study that these family physicians
were “unprepared” to deal with the health information introduction, nor that they
experienced anxiety when it was introduced by the patient as in previous studies [3].
This may be attributed to the family physician/patient relationship being a longitudinal
relationship which better enables management of the situation when the introduction of
online health information occurs. The exception was Family Physician Three who was
employed by an urgent care center and self-reported that she had no relationship with
her patients. However, many of the physicians perceived online health information as
problematic, generating patient misinformation and increasing a physician’s workload
as found in other studies (Kim and Kim [21]; Ahmad et al. [4]).
Nine of the ten physicians suggested health information to their patients by providing printouts and brochures or recommending web sites (see Fig. 4). In addition, eight of the physicians had learned to use the Internet as an ally by recommending online resources to their patients. This was a suggested strategy from previous studies (Ball and Lillis [6]; Kim and Kim [21]).
four were commercial web sites that contained advertisements for various products
including pharmaceuticals and prescription drugs. This indicates that those physicians
do not consider bias or conflict of interest to be criteria for exclusion when suggesting
online health resources to patients or determining the “quality” of the resource as
suggested in earlier studies (Brann and Anderson [8]). Physicians may not be aware
that their patients consider bias, especially from pharmaceutical advertising (Fox and
Rainie [18]; Elkin [13]), when determining the reliability and trustworthiness of online
health information (see Fig. 5).
Since health literacy is an emerging clinical concept in health communication, a
question was included in the interview asking physicians to define the term “patient
health literacy.” The physicians were able to partially define it compared to the Centers
for Disease Control and Prevention’s (2011) [10] definition, indicating that they have
experienced some degree of patient health “illiteracy” when examining patients (see
Fig. 7). Most physicians defined it in terms of language, educational literacy or level of
understanding without taking into account the patient’s capacity to obtain health
information and health services. According to the National Action Plan to Improve Health Literacy, the quality of clinician–patient communication can affect patient health outcomes, including how well patients follow instructions from clinicians, but few health care professionals receive formal training in communication, particularly in working with people with limited literacy. Therefore, these results are not uncommon [28].
However, the analysis of the quality of the physician-recommended web sites found that six of the thirteen sites contained low-health-literacy (easy-to-read) resources or resources in a language other than English, so some of the physicians were actually providing these types of resources to patients, similar to the targeted health communication models suggested in the Kreuter and McClure [22] study. Ironically, only three physicians
Using Activity Theory and Task Structure Charts 377
expressed concern about health literacy among their patients when defining it, and
of those three, only one mentioned recommending Family Doctor, which contains
health information in Spanish, to their patients.
The use of electronic medical records, computers in the examination room and direct input of data by the physician appears to enhance physician productivity; however, technology was mostly used by the physicians for computer-supported work methods rather than as a physician/patient communication tool. All of the physicians maintained personal email accounts, but none chose to actively communicate with their patients via professional email, instead expressing a legitimate concern with the level of encryption in their email systems and those of their patients' email accounts (see Fig. 6). Other reasons for not communicating with patients via email were the time required to read and answer email, and the desire to maintain personal boundaries by not being accessible to patients 24/7. This indicates that email would not be a feature
that family physicians would utilize in an electronic medical records system but would
prefer a more secure form of professional communication with patients such as an
encrypted patient portal with HIPAA compliant levels of encryption.
Nine of the physician practices were utilizing electronic medical records systems.
However, some had invested in earlier versions and were in the process of upgrading
their systems due to the compliance requirements of the Patient Protection and
Affordable Care Act and the Health Information Technology for Economic and Clinical
Health Act [7]. The physicians mentioned problems with integration of systems when
referring patients to specialists in other hospitals and having to “wait for the mail” but
those who worked in medical center owned clinics or large private practices valued the
ability to communicate patient information with other healthcare “co-workers”
electronically.
Evaluation of the task structure charts demonstrated the steps involved in the
physician examining patient activity. All physicians expressed a sense of autonomy in
their organization of work during the examination process in regard to their use of
tools, adherence to rules and division of labor. However, they independently self-
reported the same “set” of six subtasks performed during the physician examining
patient activity, with the only difference being the order of subtask completion. This
was true regardless of the physician’s age, gender, years practicing medicine, type of
practice or whether they utilized electronic medical records, computers in the exami-
nation room, traditional medical charts, low tech medical instruments or high tech
medical instruments during the examination. Of interest from the analysis of the data and the evaluation of the diagrams was the recurring “sameness” of the physicians’
activities. According to the physicians’ data obtained from the North Carolina Medical
Board [23], all attended medical school in the United States with six of them attending
medical school in North Carolina. This indicates that this “sameness” may be attributed
to the physicians’ medical school physician/patient examination and/or relationship
training, which is one of the most commonly assessed qualities of students in medical
schools [12]. This “sameness” behavior may also be attributed to the absence of
observation in the data collection process creating an inability to detect possible
variations in actual behaviors.
Productivity is a measure of effective use of resources and is expressed as a ratio of
output to input. In this study the output was defined as number of patients examined
378 B. Ellington
and the input was defined as number of physician labor hours spent examining patients.
In order to better evaluate productivity value comparisons, one must also evaluate the characteristics of the workplace that affect productivity [26]. In the case of the family
physician practices those factors that influence productivity values were efficient uti-
lization of support staff, organization of workflow, type of business model and use of
technology. The analysis of the work organization of the physicians with the three
highest productivity values demonstrated how those factors influenced their produc-
tivity values.
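The productivity measure described above reduces to a simple ratio. The sketch below uses hypothetical weekly figures (not the study's data) to illustrate how productivity values would be computed and compared:

```python
def productivity(patients_examined, exam_hours):
    """Productivity = output / input: patients examined per
    physician labor hour spent examining patients."""
    if exam_hours <= 0:
        raise ValueError("exam_hours must be positive")
    return patients_examined / exam_hours

# Hypothetical weekly figures for three physicians (illustrative only):
# (patients examined, labor hours spent examining)
weekly = {"FP1": (110, 36), "FP2": (95, 40), "FP3": (140, 38)}
values = {fp: productivity(p, h) for fp, (p, h) in weekly.items()}

# Rank physicians by productivity value, highest first
ranking = sorted(values, key=values.get, reverse=True)
```

As the study notes, such ratios only become meaningful for comparison once workplace characteristics (support staff, workflow organization, business model, use of technology) are taken into account.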
There was no recognizable difference in productivity between physicians in regard
to type of medical practice or years practicing medicine. However, there emerged
recognizable characteristics of the physicians with higher productivity such as:
(1) highly organized office support staff; (2) utilization of electronic medical records;
(3) utilization of either a laptop or tablet computer in the examination room;
(4) physician’s direct input of data into the electronic medical record during the
examination; (5) physician suggested online health resources linked directly to practice
web site or specific sites suggested routinely for chronic disease management. These
characteristics that emerged were similar to findings from the Wensing et al. [29] study
suggesting methods to improve knowledge management and patient outcomes.
There were 17 intended outcomes the physicians hoped to have achieved when they left the examination room, combined in Fig. 9 from the individual physician examining patient activity diagrams. The intended outcomes were: (1) resolution of
patient’s concerns; (2) make the correct diagnosis; (3) develop a plan to cure the
problem; (4) patient satisfaction; (5) determine or address the patient’s specific prob-
lem; (6) patient’s understanding of their treatment plan; (7) addressed the patient’s
questions; (8) helped the patient; (9) patient is more confident; (10) patient is informed
about their condition; (11) patient’s understanding; (12) answers patient’s questions;
(13) improve the patient’s health; (14) modify the patient’s disease behavior; (15) al-
leviate the patient’s suffering; (16) diagnose the problem; and (17) set up treatment for
the patient. These responses indicate that the physicians are more interested in the quality than the quantity of their patients’ examination outcomes, since these measures are qualitative rather than quantitative or readily quantifiable.
5.1 Recommendations
In addition to defining the patient-introduced online health information niche, this
study was undertaken to answer the following questions:
1. How does the introduction of patient-introduced online health information into the
family physician/patient examination process impact clinical workflow?
2. What are the potential barriers, challenges or improvements to physician/patient
examination and communication effectiveness created by patient online health
information introduction?
enabled the physician to spend “quality” time with the patient thus increasing the
effectiveness of the examination, diagnosis and treatment.
One issue mentioned by the physicians was the amount of time spent providing
information for audits to insurance companies and government entities such as
Medicare. Electronic medical records systems provided ease of storage and retrieval of
documentation needed for these types of audits for the practices that had implemented
electronic medical records. All physicians should utilize their electronic medical
records systems for this type of information storage and management to facilitate audits by regulatory agencies and compliance with rules, laws and regulations.
Other interesting practice business model issues mentioned by the physicians were: (1) the current trend of hospitals buying private practices, raised by Family Physician Nine; and (2) the institutional guidelines and goals being developed for monitoring physician response time for patients through the medical center admissions group, raised by Family Physician Six. If the business model for private practices is indeed changing and
medical center owned clinics are monitoring response time for process improvement
then physicians should be utilizing their electronic medical records systems to docu-
ment and justify their practice value and efficiency goals attained to the hospitals.
5.2 Conclusions
This study indicates that physician workflow and process efficiency improvements may
be gained by moving the patient-introduced online health information niche currently
residing in the physician/patient examination activity, and the recommending of online
health information sites that coincides with it, to the office support staff activities. This
workflow model is similar to the model currently in the urgent care center described by
Family Physician Three.
This could be accomplished by simply tagging patients as online health information
seekers in their medical chart/electronic medical record, to remind the staff to discuss
this type of information during the staff/patient encounter, prior to physician/patient
examination. This method of tagging could also be utilized for patients with low health
literacy, language comprehension and cultural issues to alert the support staff to direct
the patient to “easy to read” resources and resources in their native language. This
could be accomplished with minimal support staff training in the areas of health lit-
eracy, evaluation of online health information and electronic medical records systems.
Utilizing these changes in process may minimize online health information introduc-
tion’s effect on the examination and knowledge transfer effectiveness by ensuring that
only “quality” online health information is discussed during the physician/patient
examination.
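As a minimal sketch of the tagging idea (the record fields, tag labels and prompts are hypothetical, not drawn from any specific EMR system), the chart flag and the resulting support-staff reminders might look like:

```python
from dataclasses import dataclass, field

@dataclass
class PatientRecord:
    """Minimal stand-in for an EMR entry; field names are hypothetical."""
    patient_id: str
    tags: set = field(default_factory=set)

def tag(record, label):
    """Flag the patient's chart, e.g. as an online health information seeker."""
    record.tags.add(label)

def staff_prompts(record):
    """Reminders shown to support staff during the staff/patient
    encounter, prior to the physician/patient examination."""
    prompts = []
    if "online-info-seeker" in record.tags:
        prompts.append("Discuss patient-introduced online health information")
    if "low-health-literacy" in record.tags:
        prompts.append("Direct patient to easy-to-read resources")
    if "non-english-primary" in record.tags:
        prompts.append("Provide resources in patient's native language")
    return prompts
```

The same mechanism covers all three tagging cases the study suggests; only the tag vocabulary and prompt text would need agreement within the practice.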
This study indicates a need for the development of policies, procedures and best
practices for integrating health information into medical practice workflow to replace
the ad hoc methods currently being utilized. Developing these types of guidelines has
the potential to improve operational efficiencies for the medical practice. Improving
operational efficiencies could improve physician productivity and enhance quality of
patient care by optimizing time spent with the patient during the physician/patient
examination activity. Additional studies should also be conducted to determine best
practices for integrating online health information with electronic medical records and
Thank you for participating in this interview today. This interview should take
approximately 30 min to complete. I would like to assure you that all of your responses
will remain confidential. You will be assigned a participant code that will be used to
maintain your anonymity. Your participant code for this study is FP###.
(Hand the participant the confidentiality agreement to read and sign with their
participant code already entered on the form.)
By signing the confidentiality agreement you have agreed that your responses may
be recorded on audiotape and you are guaranteed that no personally identifiable
information will be linked to your recorded responses. (Turn on tape recorder.)
Interview Questions
1. How would you describe your practice of medicine? Probe if necessary: for
example private practice, hospital, medical school
2. Do you communicate with your patients via email? Probe: Why or why not?
3. Does your practice use electronic medical records? Probe: Why or why not?
4. How many days a week do you schedule patient appointments?
5. How many patients do you see each week?
6. What does the phrase “patient health literacy” mean to you? Probe: Are you
concerned about your patients’ health literacy?
7. List the steps you follow when interacting with a patient from the time you enter
the examination room until you exit the room. Probe for tools: Do you use a
computer in the examining room? Do you use medical instruments such as blood
pressure monitor, stethoscope, ear scope, tongue depressor? Do you reference
their medical record, lab results, diagnostic tests results? Do you talk to the patient?
Do you talk to the patient’s family if they are present? Do you record information
into their medical record?
8. What is the main focus of your activity during the patient’s examination in the
steps you listed above? Probe if needed: the patient, the patient’s health, diagnosis
of the problem, other?
9. What is the main result or outcome you hope to have achieved when you exit the
patient examination room? Probe: If more than one is mentioned.
10. In the past 7 days have any of your patients brought health information they found
on the internet to their examination? (If no, then past 30 days? If no, then past 60
days? If no then omit questions 11 and 12.)
11. (If yes in #10 then ask), Was the health information your patient found on the
internet directly related to their disease or health condition?
12. Did you discuss the information with your patient?
13. (If yes in #12 then ask), Where did the discussion occur in the steps outlined in the
generic patient examination activity above?
14. How many employees other than physicians do you work with in your practice?
15. What are the titles of these employees? (Probe: e.g. nurses, nurse practitioners,
physician’s assistants, administrative assistants, medical technologists?)
16. How do these other employees support the activity of examining patients?
17. Does your practice have a policy to refer patients to internet health information? (If
no, omit questions 18 and 19.)
18. (If yes in #17 then ask), Who is designated to refer the patient to internet health
information in your practice? Probe: Where/How does this occur? During exam-
ination, after examination, follow-up visit, sent to patient later?
19. In what format do they give the suggested resources to the patient? (Probe: word
document?, brochure?, email attachment?, information prescription?)
20. Other than local, state, HIPAA and other federal laws what additional rules,
guidelines, policies or procedures are you expected to follow when examining
patients?
I would also like to ask you a few more questions to allow me to better understand
the characteristics of my interviewees.
21. In what year were you born?
22. In what year did you start practicing medicine?
(Questions for the interviewer to answer by observation if possible)
23. What is the interviewee’s gender?
24. What is interviewee’s race?
Thank you for agreeing to participate in this interview. (Ask if they would be
willing to complete a brief online survey in the future relating to internet health
information. If so, then ask them for their email address to send them the survey link
or give them a printed copy of the survey link on the signed confidentiality form.)
References
1. Aarts, J., van der Sijs, H.: CPOE, alerts and workflow: taking stock of ten years research at
Erasmus MC. Stud. Health Technol. Inform. 148, 165–169 (2009)
2. Ahern, D.: Challenges and opportunities of eHealth research. Am. J. Prev. Med. 32, 75–85
(2007)
3. Ahluwalia, S., Murray, E., Stevenson, F., Kerr, C., Burns, J.: ‘A heartbeat moment’:
qualitative study of GP views of patients bringing health information from the internet to a
consultation. Br. J. Gen. Pract. 60, 88–94 (2010)
4. Ahmad, F., Hudak, P., Bercovitz, K., Hollenberg, E., Levinson, W.: Are physicians ready for
patients with Internet-based health information? J. Med. Internet Res. 8, e22 (2006)
5. Babbie, E.: The Practice of Social Research. Thomson Wadsworth, Belmont (2007)
6. Ball, M., Lillis, J.: E-health: transforming the physician/patient relationship. Int. J. Med.
Inform. 61, 1–10 (2001)
7. Blumenthal, D.: Implementation of the federal health information technology initiative.
N. Engl. J. Med. 365, 2426–2431 (2011)
8. Brann, M., Anderson, J.: E-Medicine and health care consumers: recognizing current
problems and possible resolutions for a safer environment. Health Care Anal. 10, 403–415
(2002)
9. Brooks, L., Griffin, T.: Is it time for a new practice environment? An operational look at your
practice. J. Med. Pract. Manag. 25, 307–310 (2010)
10. Centers for Disease Control and Prevention: Health Literacy (2011). http://www.cdc.gov/
HealthLiteracy/. Accessed 9 Oct 2011
11. Dugdale, D., Epstein, R., Pantilat, S.: Time and the patient-physician relationship. J. Gen.
Intern. Med. 14, S34–S40 (1999)
12. Elcin, M., Odabasi, O., Gokler, B., Sayek, I., Akova, M., Kiper, N.: Developing and
evaluating professionalism. Med. Teach. 28, 36–39 (2006)
13. Elkin, N.: How America Searches: Health and Wellness. iCrossing, Inc., pp. 1–17 (2008)
Layla S. Aldawsari 1,2
1 Department of Computer Science and Engineering, University of Colorado Denver, Denver, CO, USA (layla.aldawsari@ucdenver.edu)
2 College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia (lsaldossary@pnu.edu.sa)
1 Introduction
Anonymous process naming is one of the most fundamental problems in dis-
tributed systems. It can be seen as a basis for solving other distributed problems
because many solutions to distributed systems problems, such as leader election, rely on processes having unique names. A system of anonymous processes is considered where processes communicate through shared memory based on a special
object type known as a test-and-set (TAS) register in addition to read/write reg-
isters. This system is in a symmetric state, where each process is indistinguish-
able from other processes. Thus, breaking the symmetry is necessary to allow a
naming solution to assign unique names.
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 385–400, 2020.
https://doi.org/10.1007/978-3-030-39442-4_29
386 L. S. Aldawsari
A wait-free algorithm was introduced by Panconesi et al. [21] for the naming problem in a setting where processes can experience crashes. The developed algorithm uses single-writer multi-reader atomic registers. Each process has a private register that can be read by all other processes; however, a given private register may be addressed differently by different processes. The algorithm has a running time of O(n log n log log n) and a namespace of size (1 + ε)n with a probability of 1 − o(1), where n is the number of processes and ε > 0. The developed algorithm uses an α-Test-and-Set-Once object to assign a name randomly among all contending processes with a minimum probability of α against a dynamic adversary.
Chlebus et al. [8] presented solutions for naming anonymous processes when they can only communicate using beeps sent over a communication channel. The channel is assumed to be a shared communication channel, such that any beep sent over it is delivered to all processes connected to it. The channel has only two possible types of feedback: a beep or silence. Chlebus et al. developed a Las Vegas algorithm and a Monte Carlo algorithm for the models where n is known or unknown. The presented solutions give processes contiguous integer names starting from 1. The algorithms
have been proved to be optimal, with an expected running time of O(n log n) steps and O(n log n) bits used.
The main task of the work proposed by Chlebus et al. [9] is that processors
are able to assign names to themselves in the anonymous distributed environ-
ment. A Monte Carlo algorithm was introduced that has an optimal running
time and uses O(n log n) random bits. The developed algorithms work with two
different scenarios to solve the naming problem. One is where the number of
memory cells is considered constant, while in the other, the number of mem-
ory cells is not restricted, yet the amount of memory available to the algorithm
depends on the number of processors.
Anonymous distributed systems have been considered in several other dis-
tributed system problems such as leader election and consensus [5,7,12,13,18].
Some of these works focus on showing the power of a distributed system in
computing certain functions such as SUM, AND and Orientation [5] in anony-
mous systems, where the authors developed solutions for these functions that
have a complexity of O(n log n) messages. Moreover, Guerraoui and Ruppert [13]
considered anonymity as a way for protecting the identity of processes where they
developed algorithms for solving problems such as time-stamping, snapshots and
consensus.
3 Preliminaries
It is assumed that the system consists of anonymous processes that start with
no names assigned to them, and they can only communicate through shared
registers. The processes are assumed to be failure free, and the objective is to
assign unique names to them with or without randomization. Synchronous and
asynchronous communication models are considered in which n processes com-
municate through shared atomic multiple-reader/single-writer registers. Several
categories of the problem model are considered, based on the number of TAS registers available and knowledge of n.
Coarse-grain atomicity is assumed regarding shared memory such that more
than one operation can be performed in a single step, which allows serialization
of access to shared registers. Some shared registers are implemented as TAS
registers. It is assumed that the initial value for all TAS registers is 0. A process
that is calling operation TAS() on a TAS register to obtain either winner or
non-winner status is said to be querying the TAS register. Simultaneous queries
to a TAS register by multiple processes will result in exactly one process with a
winner status.
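As an illustrative model (mine, not code from the paper), the TAS object's guarantee that simultaneous queries produce exactly one winner can be sketched in Python, with a lock standing in for the register's atomicity:

```python
import threading

class TASRegister:
    """Test-and-set register: initial value 0. The first caller of
    test_and_set() after a reset observes 0 (winner status);
    every other caller observes 1 (non-winner status)."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def test_and_set(self):
        # Atomically read the old value and set the register to 1.
        with self._lock:
            old = self._value
            self._value = 1
            return old

    def reset(self):
        with self._lock:
            self._value = 0

# Four concurrent queries: exactly one observes 0, the rest observe 1.
t = TASRegister()
results = []
threads = [threading.Thread(target=lambda: results.append(t.test_and_set()))
           for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
assert sorted(results) == [0, 1, 1, 1]
```

The register stays set until a reset, so late queries keep returning non-winner status; this is the property the naming algorithms below rely on.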
4 Naming Algorithms
Two deterministic algorithms have been implemented for both synchronous and
asynchronous communication models with an optimal number of shared registers
using an optimal namespace size. The first algorithm, developed for synchronous
communication models, known as the Counting algorithm, works by having each
Naming Anonymous Processes 389
process keep an internal counter variable that tracks the number of processes with
assigned unique names. The second deterministic algorithm, which is developed
for asynchronous communication models, uses a global shared counter instead
of a private variable. Finally, a randomized algorithm was developed that uses
log n shared TAS registers and log n shared read/write registers that are
used as counters.
Algorithm 1: Counting
Initialization:
  counter ← 0
repeat
  v ← test-and-set(T)
  if v = 0 then
    name ← counter
    reset(T)
  else
    counter ← counter + 1
  end
until v = 0
Fig. 3. Pseudo code for Global Counting algorithm. T and global_counter are shared
registers where T is a TAS register
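The phase structure of the Counting algorithm can be simulated deterministically. In this sketch (mine, not the paper's), `pick_winner` stands in for the adversarial choice of which querying process wins the TAS register in each phase:

```python
def counting_naming(n, pick_winner=min):
    """Simulate n synchronous anonymous processes running the
    Counting algorithm. pick_winner chooses, among the processes
    still querying, the one that wins the TAS register this phase."""
    counters = [0] * n          # each process's private counter
    names = [None] * n
    active = set(range(n))      # processes still without a name
    while active:
        # Phase: every active process queries the TAS register;
        # exactly one wins, takes its counter value as its name,
        # and resets the register at the end of the phase.
        winner = pick_winner(active)
        names[winner] = counters[winner]
        active.remove(winner)
        for p in active:        # non-winners increment their counters
            counters[p] += 1
    return names
```

Whatever order winners are chosen in, after n phases every process holds a distinct name from {0, …, n − 1}.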
value. The Global Counting algorithm works by having each process query a
TAS register until a winning status is returned. Each process is guaranteed to
win the TAS register at least once before terminating because processes do not
experience a fault, so the TAS register is always reset by the winning process.
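Under the stated fault-free assumption, the Global Counting algorithm can be sketched with threads; this is an illustrative Python model (mine, not the paper's), with a lock-based object standing in for a hardware test-and-set register:

```python
import threading

class TASRegister:
    """Minimal lock-based test-and-set register (illustrative stand-in)."""
    def __init__(self):
        self._value, self._lock = 0, threading.Lock()
    def test_and_set(self):
        with self._lock:
            old, self._value = self._value, 1
            return old
    def reset(self):
        with self._lock:
            self._value = 0

def global_counting(n):
    """Asynchronous Global Counting sketch: the TAS register provides
    mutual exclusion over the shared counter; each winner takes the
    current counter value as its name, increments the counter, and
    resets the register so the next process can win."""
    tas, names = TASRegister(), [None] * n
    shared = {"counter": 0}          # the global_counter register
    def process(i):
        while tas.test_and_set() != 0:
            pass                     # non-winner: retry until winner status
        names[i] = shared["counter"]  # exclusive access between win and reset
        shared["counter"] += 1
        tas.reset()
    threads = [threading.Thread(target=process, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return names
```

Because only the current winner touches the shared counter between its win and its reset, each counter value is handed out exactly once, matching Lemma 3 below.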
Fig. 6. Pseudo code for Segment Shuffling algorithm. It uses an array of shared TAS
registers T and an array of shared read/write registers global_counter. Array L con-
tains the index values of unfinished segments
Proof. The proof is by induction. The value of the internal counter
is first initialized to 0 for all processes before starting the algorithm. Assuming
that the value of the counter is k, the proof must show that it is incremented
identically in all remaining n − k processes. Because all processes check the same
TAS register, exactly one process is a winner of the TAS register, while the
remaining n − k − 1 processes fail to win the TAS register. The non-winner
processes increment the counter value to k + 1 and wait for the end of the phase,
which occurs immediately after the winning process resets the TAS register.
Thus, all remaining n − k − 1 processes begin a new phase with an identical
counter value of k + 1, which concludes the proof.
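The induction argument above can also be checked mechanically; the following sketch (illustrative, not from the paper) verifies that, for any order of TAS winners, the counters of all non-winner processes agree at the end of every phase:

```python
def check_counter_consistency(n, pick_winner=min):
    """Replay the Counting algorithm phase by phase and assert that
    the internal counters of all non-winner processes stay identical."""
    counters = [0] * n
    active = set(range(n))
    while active:
        winner = pick_winner(active)   # exactly one TAS winner per phase
        active.remove(winner)
        for p in active:               # non-winners increment their counters
            counters[p] += 1
        # all remaining processes must hold the same counter value
        assert len({counters[p] for p in active}) <= 1
    return True
```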
Next, all processes eventually terminate with a unique name. In each phase,
exactly one process becomes a winner of the TAS register and assigns itself the
value of the counter as a name, after which it resets the TAS register. Thus,
after n phases, all processes have terminated with a unique name. The space
complexity of the algorithm is O(1) shared registers since exactly one shared
TAS register is used by all processes, while the time complexity of the Counting
algorithm is O(n²) steps, because there are n processes performing at most n
queries on the TAS register.
Infinite Number of TAS Registers and Unknown Number of Processes
In the Counting algorithm, knowledge of the number of processes n is not required, since each process uses a counter to identify the number of processes that have already obtained names, and after getting a name the process immediately terminates. The assigned name is unique and is based on the counter’s value, which is dynamically modified during the whole execution of the algorithm. It is necessary in this algorithm that the counter’s value be modified identically by all processes in each phase. The proof of correctness is shown in two parts,
where the first part shows consistency of the counter’s value among non-winner
processes in each phase.
Lemma 2. The internal counter variable is identical for all non-winner pro-
cesses in every phase of the algorithm.
Proof. From Lemma 2, it was shown that the value of the counter is identical for
all processes in every phase. Because all processes query a single TAS register,
exactly one winner process exists in every phase, where a process uses the counter
value as a name and terminates after resetting the TAS register. Thus, each
counter value is used as a name by no more than one process, which shows that
all processes obtain unique names.
In this category of the model, the step complexity of the algorithm is unknown
because it depends on the number of processes, which is unknown, while the space
complexity is O(1) shared registers, as exactly one shared TAS register is used by all processes to assign unique names to all processes.
Finite Number of TAS Registers and Known Number of Processes
Regardless of whether the model has a finite or infinite number of TAS registers,
the Counting algorithm needs exactly one TAS register. This is shown by the
fact that the single TAS register is reused by calling the reset operation, so that a
new unique name is obtained and assigned by a new process. Thus, the Counting
algorithm can be used as a naming solution for anonymous processes whether
the number of processes is known or not.
To prove that unique names are assigned to every process, the proof must
show that the value of the counter is exactly the same for all processes. From
Lemma 1, it is implied that the internal counter variable is modified identically
on all non-winner processes, which can be proved by induction as follows.
Proof. It is already known that the counter value is initialized to 0 for all
processes, so the proof must show that, when the counter value is k for all
processes, it is identically incremented for all non-winner processes. Exactly one
process wins the TAS register while the remaining n − k − 1 processes increment the
counter value to k + 1. Thus, at the end of the phase, all remaining n − k − 1
processes begin a new phase with an identical value of k + 1, which concludes
the proof.
Next, to prove the correctness of the algorithm, it should be shown that all
processes are assigned unique names. It can be easily proven that each process is
assigned a unique name because a TAS register guarantees one winner process.
In addition, based on the fact that each process’s counter value is identical to
other processes, it is guaranteed that each counter value is assigned as a name
to at most one process. From the definition of the algorithm, each process keeps
querying the single shared TAS register at the beginning of each phase, which
implies that all processes will eventually obtain unique names.
The time complexity of the Counting algorithm is O(n2 ) steps, as each pro-
cess performs at most n queries to the TAS register before being assigned a
unique name, and there are a total of n processes performing n queries.
Finite Number of TAS Registers and Unknown Number of Processes
To prove that the algorithm is correct in this category of the model, it must be
shown that the internal counter is consistent among all non-winner processes.
Even though there is a finite number of TAS registers, correctness of the algo-
rithm is proved using the same proof for Lemma 2, which shows consistency
of the internal counter value. The proof must also show that each process is
assigned a unique name. Even though there is a finite number of TAS registers,
only one TAS register is needed. Thus, the proof from Theorem 1 can be used
to prove correctness of the Counting algorithm, which shows that all processes
terminate with assigned unique names.
Lemma 3. At most one process uses each value of the global counter register.
Proof. The proof is shown by induction. Assuming that the current value of
the counter is y and that a process has obtained a winning status, it will assign
the value y as a name to itself. When the next process wins the TAS register
after it has been reset by the previous process, it will assign itself the current
global counter value y′. However, y′ is guaranteed to be a different value because
each winning process increments the global counter by 1 after assigning the old
value as a name. The key idea of the proof is that no more than one process can
use and alter the global counter variable by using the TAS register for mutual
exclusion.
The second part of the correctness proof is to show that each process wins
the TAS register at least once before termination.
Theorem 2. Each process assigns itself a unique name after winning the TAS
register at least once.
Proof. Assuming that k processes have already won the TAS register once and
that all of them have terminated with unique names assigned to them, this
implies that global counter = k. The next process to win the TAS register will
assign the value k as a name and set global counter = k + 1 before termination.
Since each process is fault free and each of them terminates after resetting the
TAS register, it is guaranteed that each of the n processes wins the TAS register
at least once. Based on Lemma 3 and the fact that each process wins the TAS
register at least once, each process assigns itself a unique name before termination.
To analyze the time complexity, it is assumed that the steps of all processes
are bounded by t, where, in a step, a process can either read or write a register
and perform some local computation. The number of steps taken by a winning
process to finish assigning itself a name and resetting the TAS register is five.
Naming Anonymous Processes 397
7 Conclusion
In this work, I have designed several models for naming anonymous processes
with different categories where the communication model is either synchronous
400 L. S. Aldawsari
Development Trends of Information
Technology Industry in Armenia
Ashot Davoyan
1 Introduction
web design. The total export of IT products from Armenia amounted to 338 million
USD in 2017 [1]. The main export destinations are the USA and Canada (45%),
European Union (25%), Russia (10%) and Asian countries (10%). The rest of the
products are consumed by the former Soviet republics, where 3D modelling, animation,
games and mobile apps dominate. At this early stage of artificial intelligence, the market
for technological solutions is virtually limitless, and Armenia, in spite of its small IT
community, is equipped with the necessary resources to offer and export technological
solutions. Its high growth potential is driven by the following factors:
– High-quality IT programs have been established in universities as a result of
collaboration between universities, the IT industry and the State.
– Availability of highly skilled specialists with relevant educational backgrounds and
knowledge of the English language.
– Collaboration between local companies and diaspora creates synergistic effects.
– Availability of a competitive IT workforce and low operating costs.
– A large number of multinationals, including Cisco, Synopsys and National
Instruments, have set up branches in the country.
Foreign investors can benefit from the following advantages:
– IP protection regulations.
– Free economic zones (FEZs). Residents of FEZs are completely exempted from
profit tax, VAT, property tax and customs duty. Services on behalf of the state
bodies are delivered on a “one-stop-shop” basis.
– Right of 100% property ownership.
– No restriction on staff recruitment.
– Duty free import of personal goods.
– Armenia is a member of the Eurasian Economic Union and enjoys Generalized
Scheme of Preferences status with the USA, Canada, Japan, Switzerland and Norway in
addition to the European Union states. Armenia implements an open-door policy as
a result of a positive attitude towards investments from overseas.
Table 1 below shows the key economic indicators.
The economy grew by 7.5% in 2017 and reached approximately 11.60 billion USD,
while the per capita GDP reached 3,862 USD. Stable high growth is positive in terms
of attracting new investors who acknowledge the country as a hotspot for high-quality
IT product development and start-ups. However, this is not always an attainable goal,
given the volatility of the South Caucasus region.
Fig. 1. The Armenian IT ecosystem: Academia, the State and the IT industry, together with investors, support organizations and coworking spaces, deliver talents, IT products and start-ups to global markets.
2.2 Investors
Armenia has an open and favourable policy towards foreign investment. The
government continually carries out reforms aimed at improving the business and
investment climate. According to the World Bank ‘Doing Business 2019’ report,
Armenia ranked 41st among 190 countries. As the former Soviet center for software
development, semiconductor and electronics production, and industrial computing,
Armenia has managed to keep its capacity as a regional IT hub. Table 2 below shows a
comparison of several key indicators to consider before investing in Armenia.
Florence [3]. “Bottegas”, Italian for workrooms, brought together different types of
talent to improve their skills, often under the supervision of a master whose role was
similar to that of a mentor. “Bottegas” encouraged interaction and helped participants
turn ideas into reality, which led to an overall higher level of creativity. Coworking
spaces play a vital role in creating localized (town/city/region/state) innovation
processes through tailored programs, diversification and collaboration. They typically
charge a service fee in exchange for a chair at a desk, which, depending on the selected
plan, includes the use of office equipment, a locker, a business address and other
services. The environment in these coworking spaces usually leads to improved
performance thanks to the high level of independence and flexibility they offer. It is
noteworthy that members of the local coworking spaces are predominantly well-
educated and hold jobs in IT and related industries. Currently, there are up to a
dozen coworking spaces in Armenia:
1. Platform Coworking Space is a collaborative workspace with a strong sense of
community.
2. AEON Co-Work is a community of forward-thinking, innovative freelancers, co-
workers and start-up professionals, giving them the opportunity not only to work
independently but also to network. This anti-café has no membership fee; visitors
simply pay for the hours they spend there.
3. PMC is a shared workspace which offers a coworking office, private offices and
meeting rooms. Members can also book facilities for events.
4. Mydesk accommodates 12 residents and has a fully equipped conference room.
Convenient location and good infrastructure make Mydesk a perfect place for
young businesses that want to be at the center of events.
5. Coworking Armenia is a coworking space created by Startup Armenia Fund. It
provides opportunities for professionals and startups from all over the world.
6. Impact Hub Yerevan is a shared community which incorporates an inspiring
workspace, a social enterprise community center, an opportunity for education as
well as a global network of like-minded people, rather than a simple coworking
space. Its membership is valid across 81 hubs operating across the world.
7. Loft is a self-development and leisure multifunctional center where everything is
free except time.
8. Utopian Lab is a shared workspace environment made to thrive alongside a moti-
vated community in Armenia.
9. Innovative Solutions and Technologies Center (also known as ISTC) is a co-
working space founded by IBM, USAID, Government of Armenia, and Enterprise
Incubator Foundation.
currently six in number: The State Engineering University of Armenia, Yerevan State
University (YSU), American University of Armenia (AUA), European Regional
Educational Academy and Russian-Armenian Slavonic University (RAU). Russian-
Armenian Slavonic University and American University of Armenia provide degree
programs in Russian and English languages, respectively. Developers in Armenia are
still considered low-cost service providers, capable of producing IT products and
services that meet international standards. They mainly specialize in software devel-
opment, semiconductor production, systems integration, web design and multimedia.
Demand for IT services is also growing in local markets from banks, enterprises,
universities and other entities. Independent developers serve global markets through
support organizations or through their own contacts. Armenia’s start-up scene is fuelled
by well-thought-out initiatives and tax breaks aimed at boosting the industry. Over the
last decade, the Armenian start-up ecosystem has flourished in both quality and
quantity. Armenia is home to some 200 tech start-ups, among the most successful of
which are:
– Picsart – It is regarded as the number 1 photo and video editing app, powered by a
community of 100M+ monthly active users.
– Sololearn – It is an app offering mobile code and programming tutorials, as well as
specialized AI and machine learning content.
– ServiceTitan – It is a software platform, allowing home-service businesses to
manage businesses and improve customer service.
– Teamable – It is an employee referral and diversity hiring platform transforming
social networks into high-performance talent pools.
– Joomag – It is an all-in-one platform, offering integrated solutions for content
marketing, corporate communications, digital publishing and sales.
It is widely believed that the IT industry will be a dominant player in the Armenian
economy, creating new wealth for generations to come.
2.8 Obstacles
A disintegrated regional ecosystem, lack of access to financing, and corruption
create obstacles for the development of start-ups and IT companies in Armenia, in
large part due to regional conflicts and an inefficient economic system. However, the
difficult environment has not prevented the IT industry from growing. The number of
IT companies, including start-ups, has increased 250-fold since Armenia's
independence, growing to 800 companies with a total turnover of 730.2 million USD,
excluding turnover generated by ISPs. In 2017, the IT industry grew by 25%, reaching
730.2 million USD, or 6.25% of Armenia's total GDP.
3 Conclusion
This article provided a brief description of the Armenian IT industry, which employs
2.5% of the country's total workforce. The IT industry is considered a priority by the
Government of Armenia, which has taken effective steps to improve the quality of
specialized education and to develop infrastructure for local and foreign IT companies
as well as start-ups. To support its IT industry, the government has set a 0% profit tax
for the first 3 years of operation. Given the availability of a high-quality workforce
along with improvements in the investment climate, the industry promises high returns
for the development of other industries. The country exported IT products and services
worth $338 million to the USA,
the EU, Russia and other countries in 2017. Armenian IT companies mainly specialize in
software development, semiconductor design, multimedia and web design. In spite of
being recognized as a lucrative place for development of IT products and services, a
number of challenges still remain as a result of the country’s location and geopolitics.
References
1. EIF, State of the industry report
2. FocusEconomics, Armenia Economic Outlook
3. IDC Research Consultancy
4. Formica, P.: The innovative coworking spaces of 15th-century Italy, Harvard Bus. Rev.
(2016)
5. Etkowitz, H.: Triple Helix Conference, Daegu, Korea (2017)
A Study on the Inspection of Fundus
Retinal Picture
1 Introduction
always requires extensive computation due to its Red, Green and Blue pixel planes.
Because of this complexity, a variety of methods are employed by researchers to pre-
process RGB images and improve the information available for appraisal. The
current work employs an Image-Evaluation-System (IES) to analyse the retinal
optic-disc (OD) section of fundus images. Usually, to categorize a defect in the eye,
an imaging procedure is followed: the eye is recorded with a specialized camera,
called the fundus camera, and these images are then examined by a doctor, or by a
dedicated tool available in eye clinics, to recognize the eye abnormality and plan an
appropriate treatment procedure [9, 10].
In this work, assessment of the retinal OD is performed, and the benchmark OD
database called Rim-One [11, 12] is adopted to check the performance of the
implemented image-processing procedure. In the literature, a number of conventional
and recent soft-computing-based retinal OD assessment methods have previously been
discussed and executed [7, 8]. Current works on OD inspection verify that a two-step
process will generally attain improved outcomes compared to a single-step process.
Hence, in this paper, a combination of RGB multi-thresholding and segmentation is
incorporated to improve the outcome of the retinal OD assessment. The RGB multi-
thresholding is applied with the Modified-Firefly-Algorithm (MFA) [13, 14] based
Otsu's between-class variance, and the segmentation is executed with DRLS
segmentation [15].
The proposed investigational task is implemented in Matlab 7, and the results
of this approach are confirmed by considering the vital Picture-Quality-Parameters
(PQP): Sensitivity (SE), Specificity (SP), Accuracy (AC) and Precision (PR).
These values are calculated by comparing the extracted OD segment with the OD of
the Ground-Truth (GT). The database includes a total of five GTs for each test image.
Ultimately, the average of these PQP is considered for the verification of the
performance of the IES. The implemented imaging system is highly proficient and
offers improved PQP values, namely SE (98.11%), SP (97.89%), AC (98.22%) and PR
(98.57%).
The remaining sections are arranged as follows: Sect. 2 presents the methodology,
Sect. 3 outlines the results and discussion, and Sect. 4 provides the conclusion
of the proposed approach.
2 Methodology
2.2 Pre-processing
Otsu’s multi-thresholding was initiated in 1979 to improve the gray-scale images [16].
This process assists to find the finest thresholds by maximizing the Between-Class-
Variance (BCV) of picture pixels. Due to its advantage, this scheme was widely
adopted to improve RGB images [17].
In the RGB class, let L indicate the total number of gray levels [0, 1, 2, …, L−1].
Then the probability distribution $A_i^C$ can be written as:

$$A_i^C = \frac{h_i^C}{N}, \qquad \sum_{i=0}^{L-1} A_i^C = 1 \qquad (1)$$

where $h_i^C$ is the number of pixels of component $C \in \{R, G, B\}$ at level $i$ and $N$ is the total number of pixels. The between-class variance to be maximized is:

$$\sigma_B^{C\,2} = \sum_{j=1}^{m} w_j^C \left(\mu_j^C - \mu_T^C\right)^2 \qquad (2)$$

where $w_j^C$ and $\mu_j^C$ are the probability and mean of class $j$, and $\mu_T^C$ is the total mean.
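As an illustration, Eqs. (1)–(2) can be evaluated directly on a gray-level histogram of one color component. The sketch below (the function name `between_class_variance` is ours, not from the paper) computes the BCV for a given set of thresholds:

```python
import numpy as np

def between_class_variance(hist, thresholds):
    """Otsu's between-class variance (Eq. 2) for the classes induced by
    `thresholds` on a gray-level histogram `hist`:
    sigma_B^2 = sum_j w_j * (mu_j - mu_T)^2."""
    p = hist / hist.sum()                     # probability distribution A_i (Eq. 1)
    levels = np.arange(len(p))
    mu_T = (levels * p).sum()                 # total mean
    bounds = [0, *thresholds, len(p)]         # class boundaries
    sigma_B = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        w = p[lo:hi].sum()                    # class probability w_j
        if w > 0:
            mu = (levels[lo:hi] * p[lo:hi]).sum() / w   # class mean mu_j
            sigma_B += w * (mu - mu_T) ** 2
    return sigma_B
```

A threshold set that cleanly separates the histogram's modes yields a larger BCV than one that splits a mode, which is exactly the criterion the optimizer maximizes.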
Fig. 1. Stages of the proposed IES: fundus retinal picture, MFA + Otsu's thresholding, DRLS segmentation, comparison of OD with GT, and assessment of PQP and validation.
The information on the FA considered in this paper can be found in [17]. The FA
parameters are set as follows: total fireflies = 30, number of iterations = 2000,
search dimension = 3, and stopping criterion = Jmax.
The above process is used as the pre-processing step, which improves the
visibility of the OD based on Otsu's three-level thresholding procedure [7].
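A simplified firefly-style search for Otsu thresholds can be sketched as below. This is an illustrative approximation, not the authors' MFA implementation: the attraction and perturbation rules are reduced to their essentials, and the function name and parameter defaults are ours.

```python
import numpy as np

def firefly_thresholds(hist, n_thresh=3, n_fireflies=30, iters=200, seed=0):
    """Firefly-style search for multi-level Otsu thresholds: candidate
    threshold vectors (fireflies) drift toward the brightest candidate,
    where brightness is the between-class variance, plus a random walk."""
    rng = np.random.default_rng(seed)
    L = len(hist)
    p = hist / hist.sum()
    levels = np.arange(L)
    mu_T = (levels * p).sum()

    def brightness(th):
        # between-class variance of the classes induced by thresholds th
        bounds = [0, *sorted(int(t) for t in th), L]
        s = 0.0
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            w = p[lo:hi].sum()
            if w > 0:
                mu = (levels[lo:hi] * p[lo:hi]).sum() / w
                s += w * (mu - mu_T) ** 2
        return s

    pos = rng.uniform(1, L - 1, size=(n_fireflies, n_thresh))
    for _ in range(iters):
        light = np.array([brightness(th) for th in pos])
        best = pos[light.argmax()]
        # move every firefly toward the brightest one, plus a perturbation
        pos += 0.5 * (best - pos) + rng.normal(0.0, 1.0, pos.shape)
        pos = np.clip(pos, 1, L - 1)
    light = np.array([brightness(th) for th in pos])
    return sorted(int(t) for t in pos[light.argmax()])
```

For a bimodal histogram with a single threshold, the search settles on a cut between the two modes, where the BCV plateaus at its maximum.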
In this work, an adaptable curve is allowed to identify all the probable pixel
values connected to the irregular section of the picture. After identifying all
the probable pixel clusters, the section lying within the converged curve is
extracted for the evaluation procedure.
2.4 Assessment
The merit of the IES is established by computing the PQP as discussed in [4]. In this
paper, the PQPs SE, SP, AC and PR are used to appraise the proposed
method [19, 20]. The necessary PQP values are calculated by comparing the extracted
OD against the five GTs available in Rim-One. The average values of these PQPs are
then recorded individually for each picture group, and the merit of the method is
established on this basis.
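The PQP computation can be sketched as follows, assuming the extracted OD and each GT are binary masks (function names are illustrative): SE = TP/(TP+FN), SP = TN/(TN+FP), AC = (TP+TN)/total, PR = TP/(TP+FP), averaged over the five GTs.

```python
import numpy as np

def pqp_scores(extracted, ground_truth):
    """Picture-Quality-Parameters comparing a binary extracted-OD mask
    against a binary ground-truth mask."""
    e = extracted.astype(bool)
    g = ground_truth.astype(bool)
    tp = np.sum(e & g)          # OD pixels found in both masks
    tn = np.sum(~e & ~g)        # background pixels in both masks
    fp = np.sum(e & ~g)         # extracted but not in GT
    fn = np.sum(~e & g)         # in GT but missed
    return {
        "SE": tp / (tp + fn),                     # sensitivity (recall)
        "SP": tn / (tn + fp),                     # specificity
        "AC": (tp + tn) / (tp + tn + fp + fn),    # accuracy
        "PR": tp / (tp + fp),                     # precision
    }

def average_pqp(extracted, gts):
    """Average PQP of one extracted OD against several GT masks,
    e.g. the five GTs provided per image in Rim-One."""
    scores = [pqp_scores(extracted, gt) for gt in gts]
    return {k: float(np.mean([s[k] for s in scores]))
            for k in ("SE", "SP", "AC", "PR")}
```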
3 Results and Discussion
This section presents the experimental outcomes attained with the proposed IES. All
these works are realized in Matlab 7.
Figure 2 shows the example test image considered for the early assessment. It is
from the Deep (D) class; Fig. 2(a) presents the pseudo-name and Fig. 2(b) shows
the examination image. A tri-level threshold with the MFA and Otsu is then
implemented on these images, and the pre-processed test images are shown in Fig. 2(c).
In this image, the OD segment is separated from the blood vessels and the background
by the pre-processing procedure. The DRLS is then implemented to extract the OD
section.
The OD examination consists of procedures such as test-image thresholding,
DRLS-based image segmentation and validation of the OD against the GT picture. The
pre-processing step enhances the test image with a suitable image-processing
approach. The OD section is then extracted using the DRLS segmentation. Finally, a
relative assessment between the extracted OD and the GT is performed to confirm
the advantage of the proposed technique.
Figure 3 presents the example test images available in Rim-One for the other
illness classes considered in this manuscript, where Fig. 3(a) shows the illness class and
Fig. 3(b)–(d) show the sample images. Similar pre-processing and segmentation
techniques are implemented for these images, and the necessary PQP are calculated
through a comparative study with the GTs.
Figure 4 presents the outcome attained with the DRLS process. First, a
bounding box is placed manually on the OD segment of the pre-processed image by
choosing the X- and Y-axis values. As the iterations progress, the box is permitted to
converge towards the OD, and finally the OD segment is extracted. The OD fragment
is then compared against each GT, shown in Figs. 4(f) to 4(j); the required
PQP are computed, and the corresponding values are presented in Table 1. A similar
process is then implemented for the other test pictures, and the corresponding average
PQP for each illness class are presented in Table 2.
414 K. Sundaravadivu et al.
(Deep-class test images: D13, D38, D48, D69, D103.)
From the above table, it can be noted that the average PQP values obtained for the
Rim-One dataset are as follows: SE (98.06%), SP (98.89%), AC (97.67%) and PR
(98.78%). From these results it is evident that all values are >97.67%, which
substantiates that the proposed approach is highly capable for the assessment of fundus
retinal images. In the future, this technique can be used to inspect clinical images.
Further, instead of the proposed technique, suitable Deep-Learning and Machine-
Learning procedures could be implemented to examine fundus retinal pictures.
(Illness classes: Early, Moderate, Normal, OHT.)
Fig. 4. Result attained through the DRLS segmentation. (a) Pre-processed D13 picture,
(b) Initial bounding-box, (c) Converged curve, (d) Extracted OD, (f)–(j) GT1–GT5
4 Conclusion
This work implements a modern imaging approach based on MFA-with-Otsu
pre-processing and DRLS-based segmentation. The standard retinal OD image
database called Rim-One is considered for the examination, and the investigational
work is implemented in Matlab 7. The Rim-One images of the Deep, Early, Moderate,
Normal and OHT classes are examined separately, and the PQP values SE, SP, AC,
and PR are computed through a relative examination between the extracted OD and
the GTs existing in the dataset. The results of this study confirm that this imaging
system is competent and offers an average PQP of >97.67%.
References
1. Dey, N., Rajinikanth, V., Ashour, A.S., Tavares, J.M.R.S.: Social group optimization
supported segmentation and evaluation of skin melanoma images. Symmetry 10(2), 51
(2018). https://doi.org/10.3390/sym10020051
2. Raja, N.S.M., Kavitha, N., Ramakrishnan, S.: Analysis of vasculature in human retinal
images using particle swarm optimization based Tsallis multi-level thresholding and
similarity measures. Lecture Notes in Computer Science, vol. 7677, pp. 380–387 (2012)
3. Vaishnavi, G.K., Jeevananthan, K., Begum, S.R., Kamalanand, K.: Geometrical analysis of
schistosome egg images using distance regularized level set method for automated species
identification. J. Bioinform. Intell. Control 3(2), 147–152 (2014)
4. Rajinikanth, V., Raja, N.S.M., Kamalanand, K.: Firefly algorithm assisted segmentation of
tumor from brain MRI using Tsallis function and Markov Random Field. J. Control Eng.
Appl. Inform. 19(3), 97–106 (2017)
5. Rajinikanth, V., Kamalanand, K.: Advances in Artificial Intelligence Systems. Nova Science
Publishers Inc., USA
6. Khan, M.W., Sharif, M., Yasmin, M., Fernandes, S.L.: A new approach of cup to disk ratio
based glaucoma detection using fundus images. J. Integr. Des. Process Sci. 20(1), 77–94
(2016). https://doi.org/10.3233/jid-2016-0004
7. Shree, T.D.V., Revanth, K., Raja, N.S.M., Rajinikanth, V.: A hybrid image processing
approach to examine abnormality in retinal optic disc. Procedia Comput. Sci. 125, 157–164
(2018). https://doi.org/10.1016/j.procs.2017.12.022
8. Sudhan, G.H.H., Aravind, R.G., Gowri, K., Rajinikanth, V.: Optic disc segmentation based
on Otsu’s thresholding and level set. In: International Conference on Computer Commu-
nication and Informatics (ICCCI), pp. 1–5. IEEE (2017). https://doi.org/10.1109/iccci.2017.
8117688
9. Dey, N., Roy, A.B., Das, A., Chaudhuri, S.S.: Optical cup to disc ratio measurement for
glaucoma diagnosis using Harris corner. In: Third International Conference on Computing
Communication & Networking Technologies (ICCCNT). IEEE (2012). https://doi.org/10.
1109/icccnt.2012.6395971
10. Almazroa, R., Burman, K., Raahemifar, K., Lakshminarayanan, V.: Optic disc and optic cup
segmentation methodologies for glaucoma image detection: a survey. J. Ophthalmol. 2015
(2015). https://doi.org/10.1155/2015/180972. Article ID 180972, 28 pages
11. Fumero, F., Alayon, S., Sanchez, J.L., Sigut, J., Gonzalez-Hernandez, M.: RIM-ONE: an
open retinal image database for optic nerve evaluation. In: 24th International Symposium on
Computer-Based Medical Systems (CBMS), pp. 1–6. IEEE (2011). https://doi.org/10.1109/
cbms.2011.5999143
12. http://medimrg.webs.ull.es/research/retinal-imaging/rim-one/
13. Roopini, I.T., Vasanthi, M., Rajinikanth, V., Rekha, M., Sangeetha, M.: Segmentation of
tumor from brain MRI using Fuzzy entropy and distance regularised level set. In: Nandi, A.,
Sujatha, N., Menaka, R., Alex, J. (eds.) Computational Signal Processing and Analysis.
Lecture Notes in Electrical Engineering, vol. 490, pp. 297–304. Springer, Singapore (2018)
14. Rajinikanth, V., Satapathy, S.C., Dey, N., Vijayarajan, R.: DWT-PCA image fusion
technique to improve segmentation accuracy in brain tumor analysis. In: Anguera, J.,
Satapathy, S., Bhateja, V., Sunitha, K. (eds.) Microelectronics, Electromagnetics and
Telecommunications. Lecture Notes in Electrical Engineering, vol. 471, pp. 453–462.
Springer, Singapore (2018)
15. Li, C., Xu, C., Gui, C., Fox, M.D.: Distance regularized level set evolution and its
application to image segmentation. IEEE Trans. Image Process. 19(12), 3243–3254 (2010).
https://doi.org/10.1109/TIP.2010.2069690
16. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man
Cybern. 9(1), 62–66 (1979)
17. Rajinikanth, V., Couceiro, M.S.: RGB histogram based color image segmentation using
firefly algorithm. Procedia Comput. Sci. 46, 1449–1457 (2015)
18. Raja, N.S.M., Rajinikanth, V., Latha, K.: Otsu based optimal multilevel image thresholding
using firefly algorithm. Model. Simul. Eng. 2014 (2014). https://doi.org/10.1155/2014/
794574. Article ID 794574, 17 pages
19. Rajinikanth, V., Satapathy, S.C., Fernandes, S.L., Nachiappan, S.: Entropy based
segmentation of tumor from brain MR images–a study with teaching learning based
optimization. Pattern Recogn. Lett. 94, 87–95 (2017). https://doi.org/10.1016/j.patrec.2017.
05.028
20. Grgic, S., Grgic, M., Mrak, M.: Reliability of objective picture quality measures. J. Electric.
Eng. 55(1–2), 3–10 (2004)
Accelerating Block-Circulant Matrix-Based
Neural Network Layer on a General Purpose
Computing Platform: A Design Guideline
Abstract. Deep neural networks (DNNs) have become a powerful tool and
enabled state-of-the-art accuracy on many challenging tasks. However, large-
scale DNNs consume large amounts of both computational time and storage space. To
optimize and improve the performance of the network while maintaining the
accuracy, the block-circulant matrix-based (BCM) algorithm has been intro-
duced. BCM utilizes the Fast Fourier Transform (FFT) with block-circulant
matrices to compute the output of each layer of the network. Unlike conven-
tional pruning techniques, the network structure is maintained while using the
BCM. Compared to conventional matrix implementation, the BCM reduces the
computational complexity of a neural network layer from O(n^2) to O(n^2/k),
and it has been proven to be highly effective when implemented using cus-
tomized hardware, such as FPGAs. However, its performance suffers from
overhead of FFT and matrix reshaping on general purpose computing platforms.
In certain cases, using the BCM does not improve the total computation time of
the network at all. In this paper, we propose a parallel implementation of the
BCM layer and provide guidelines that generally lead to better implementation
practice. The guidelines cover popular implementation languages and packages,
including Python, numpy, intel-numpy, tensorflow, and nGraph.
1 Introduction
In the past few years, driven by increasing amounts of data and processing speed, Deep
Neural Networks (DNNs) have been able to deliver impressive results for many complex
and challenging problems. Particularly, large-scale DNNs have significantly enhanced
object-recognition accuracy and led a revolution in many real-world applications, such
as automatic machine translation [1], self-driving systems [2], and drug discovery [3].
The resurgence of neural networks has attracted both academia and industry to their
evaluation, improvement and promotion.
Deep neural networks consist of multiple layers with various parameters and
thousands of neurons. Recent research has shown that the depth of the DNN structure is
crucial to its stand-out accuracy [4]. As a result, large-scale DNNs demand remarkable
amounts of computation and memory. Driven by this challenge, more and
more techniques have been proposed to compress deep neural networks with
negligible accuracy loss. One strategy is the block-circulant matrix-based (BCM)
algorithm [5], a principled approach utilizing the Fast Fourier Transform (FFT) and
block-circulant matrices to reduce both computational and memory complexity. Compared to
other compression techniques such as weight pruning [9], the BCM algorithm has three
main advantages. First, it allows us to derive a tradeoff between accuracy and
acceleration. Second, the BCM algorithm reduces storage complexity from $O(n^2)$ to $O(n^2/k)$
by compressing the weight matrix into k dense vectors, whereas conventional weight
pruning gives a sparse weight matrix that requires an additional memory footprint for
indexing. Lastly, the BCM algorithm maintains the regular network structure and retains a
rigorous mathematical foundation on the compression ratio and accuracy [5].
In prior work, the BCM algorithm has only been evaluated on embedded platforms,
owing to their portability, versatility, and energy efficiency [5]. We aim to answer two
remaining questions. First, can the BCM algorithm be implemented efficiently on
software-based platforms, especially in Python, the most popular programming language
used for deep learning? Second, how should the BCM algorithm be configured to balance
the tradeoff between accuracy and compression/acceleration?
In this paper, we aim to guide users to implement the BCM algorithm and
achieve the best performance. To answer the two questions, we evaluate the
performance of the algorithm in Python using the numpy, intel-numpy, tensorflow, and
nGraph packages. Additionally, we design a parallel BCM algorithm that effectively
utilizes the multiple cores of the target systems.
2 Related Works
In the past decade, numerous techniques have been proposed to compress neural
networks. These include structured weight matrices [6, 7], parameter pruning [8–10] and
quantization [11, 12]. Recently, weight-pruning methods have become more and more
popular. Although weight pruning can achieve an impressive compression ratio, the
network structure and weight storage after pruning become irregular, so indexing is
required, which weakens the performance improvement. Especially when implemented in
an embedded system, it requires customized hardware capable of loading sparse matrices
and/or performing sparse matrix-vector operations [13]. Inherently, irregular memory
access and the extra storage footprint reduce the speed of weight pruning.
Frequency-domain operation was first proposed by LeCun to accelerate the
computations of the convolution layer by replacing the convolution operation with
element-wise multiplication in the frequency domain [14]. No weight compression was
considered in [14]. The circulant weight matrix was first proposed in [6] in 2015, as a
means to reduce the storage complexity of fully connected neural networks. By
compressing a weight matrix into a circulant weight matrix, it reduces the space
complexity from O(d²) to O(d). By a property of circulant matrices, the matrix-vector
multiplication (between the weights and the inputs) can be done as element-wise
vector-vector multiplication in the frequency domain, and hence reduces the time
complexity from O(d²) to O(d log d). Unlike
conventional weight pruning, the circulant weight matrix has a dense structure and it
Accelerating Block-Circulant Matrix-Based Neural Network Layer 421
could be used to optimize both speed and space. The FFT is used to transform the weights
and inputs to the frequency domain.
For very large weight matrices, the circulant matrix approach provides a very sig-
nificant compression ratio, but it can also lead to considerable quality degradation of the
neural network. The block-circulant weight matrix was first proposed by Ding et al. [5] as a
way to balance storage complexity and neural network quality during com-
pression. The authors also proposed CirCNN to implement BCM-based deep
neural networks on hardware such as ASICs and FPGAs. With a customized pipeline
structure, the FFT and element-wise operations achieve their best performance on such
customized hardware implementations. However, the efficiency of the BCM on general-
purpose computing platforms, which are still commonly used by the machine learning
community, has not been studied.
In this paper, we consider all the potential overheads in software and propose a
parallel design of the block-circulant matrix-based algorithm for general-purpose
computing platforms. We evaluate its performance on popular deep learning
frameworks/packages and provide guidelines that can generally lead to better
implementations.
a_i = IFFT( Σ_{j=1}^{q} FFT(w_ij) ∘ FFT(x_j^T) ),   (2)

where ∘ denotes element-wise multiplication, and FFT denotes the Fast Fourier Transform.
An illustration of the BCM calculation is shown in Fig. 2.
By using the BCM algorithm, we can reduce the storage complexity from O(mn) to
O(pqk). Since we only need to store FFT(w_ij) for each submatrix, this is equivalent to
O(n) for small p and q values. Additionally, the computational complexity of an FC layer
is reduced from O(n²) to O(n log n), and for a CONV layer from O(WHr²CP) to
O(WHQ log Q), where Q = max(r²C, P).
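The frequency-domain identity behind Eq. (2) can be sketched for a single circulant block using numpy's real FFT (the function name and shapes here are illustrative, not the paper's code):

```python
import numpy as np

def circulant_block_matvec(w, x):
    """Multiply one circulant block by a vector in the frequency domain.

    w is the length-M vector defining the block (its first column), so the
    product equals C @ x with C[k, l] = w[(k - l) % M], but costs
    O(M log M) instead of O(M^2); only the M/2 + 1 rfft coefficients of w
    would need to be stored.
    """
    return np.fft.irfft(np.fft.rfft(w) * np.fft.rfft(x), n=len(w))
```

A quick check against the explicit circulant matrix, built as `np.column_stack([np.roll(w, l) for l in range(M)])`, confirms that both give the same product.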
Fig. 3. Total time used with different numbers of CPU cores. Left panel: matrix multiplication
results. Right panel: block-circulant matrix-based algorithm results. The y-axis displays the time
used in milliseconds; the x-axis shows the number of CPU cores (the maximum is 8, since the
machine has 4 physical cores and 8 threads); and each label represents a different package.
As shown in Fig. 3, the time used in matrix multiplication decreases as the number
of cores increases. This can be explained by each package's utilization of multiple cores
through either multiprocessing or multithreading. In contrast, the time used in
the block-circulant matrix-based algorithm slightly decreases in tensorflow and
tensorflow+nGraph while remaining stable in numpy and intel-numpy.
Therefore, we design the parallel block-circulant matrix-based algorithm to accel-
erate the computation. The key idea is to separate the computation into blocks and run
them on different processes, as each block can be calculated independently. Figure 4
illustrates the parallel block-circulant matrix-based algorithm: we partition the block-
circulant matrix by row and assign the rows to different processes. Once the calculations
are completed, we combine and convert the matrices to get the final output.
424 K. Pugdeethosapol et al.
In case there are multiple inputs, we can use data parallelism to separate these
inputs across processes, such that each process is assigned a portion of the data. The
portion size is defined as input size / number of CPU cores.
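This data partition can be sketched with numpy (the batch shape and core count here are illustrative):

```python
import numpy as np

batch = np.ones((1024, 8))   # hypothetical input batch of 1024 samples
num_cores = 4
# Each portion holds input_size / number_of_CPU_cores samples,
# ready to be handed to a separate worker process.
portions = np.array_split(batch, num_cores)
```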
In terms of implementation, we initially used the native multithreading and multipro-
cessing provided in Python. However, Python has a Global Interpreter Lock (GIL) that
allows only one thread to hold control of the interpreter at a time, creating a performance
bottleneck for multithreading. In contrast to multithreading, multiprocessing uses sub-
processes to bypass the GIL, which allows the program to utilize multiple cores in a given
machine. Nevertheless, there is overhead in spawning processes and sending data.
To improve the performance, we use Ray [16], a simple framework that has been
proven to be faster than native multiprocessing and multithreading. Additionally, Ray
can be easily integrated with Python. Figure 5 shows basic sample code for using Ray
to accelerate the parallel block-circulant matrix-based algorithm.
Without Ray:

    import time

    def f():
        time.sleep(1)
        return 1

    results = [f() for i in range(4)]      # runs serially, ~4 s

With Ray:

    import time
    import ray

    ray.init()

    @ray.remote
    def f():
        time.sleep(1)
        return 1

    results = ray.get([f.remote() for i in range(4)])  # runs in parallel, ~1 s
Using the parallel design and Ray, we can achieve better performance than the original
block-circulant matrix-based algorithm as the number of CPU cores increases.
Figure 6 shows the results of our parallel version versus the previous version. The new
parallel version achieves a stable speedup ratio up to 4 cores, which is the number of
physical cores.
Fig. 6. Comparing the original and our Ray-parallel implementation. Top-left panel: numpy.
Top-right panel: intel-numpy. Bottom-left panel: tensorflow. Bottom-right panel: tensorflow
+nGraph. A batch size of 1024 and a block size of 128 are used in these experiments. The y-axis
displays the time used in milliseconds, and the x-axis shows the number of CPU cores.
In this paper, the block-circulant matrix-based algorithm has been applied to the model
during the inference phase. In terms of computational complexity, the block-
circulant matrix-based algorithm is faster than matrix multiplication. In terms of
implementation, however, we need to examine the overhead from the IFFT, the FFT, and
matrix reshaping. Design parameters such as the batch size, block size, and number of
CPU cores all affect the calculation time. For some combinations, matrix multi-
plication may be faster than the BCM, while for others the BCM may be more effective
than matrix multiplication. While increasing the block size always reduces storage and
computational complexity, it also lowers the model capacity of the neural network and
hence may lead to larger prediction error. It has been shown in [5] that with a com-
pression ratio of up to 30-50x the loss is sometimes negligible, and the com-
pressed models may even outperform the baseline models; in other cases, however, the
loss is noticeable. In general, the loss increases monotonically with the compression
ratio. When optimizing for inference speed, users must therefore decide whether
the accuracy loss is acceptable.
In order to choose the best model with the most efficient configuration and
acceptable accuracy without exhaustively exploring the entire design space, the
designer needs to know how performance is affected by the design parameters,
including the batch size, block size, and number of CPU cores. In this work, we
designed a set of benchmark programs that characterize the performance of different
BCM configurations in comparison to the matrix-based implementation.
Guidelines for choosing the BCM configuration were derived from the results.
These guidelines help designers choose a configuration without having to
attempt all combinations.
The study is performed on an Intel(R) Xeon(R) W-2123 CPU @ 3.6 GHz, which has 4
physical cores and 8 threads. Matrix multiplication and the BCM algorithm are
implemented in the Python programming language with various packages. The fol-
lowing lists the hardware/software configurations and design parameters that
were evaluated.
• Packages: numpy, intel-numpy, tensorflow, and tensorflow+nGraph
• Number of CPU cores: 1, 2, 4, and 8
• Block size (M): 128, 256, 512, 1024, and 2048
• Batch size (N): 128, 256, 512, 1024, 2048, 4096, and 8192
The evaluation results reported in this paper are platform-specific; however, the
benchmarks and methodologies can be applied to other platforms. The model that we
consider is a fully connected layer with 4,096 hidden neurons. The weights and
inputs have sizes (4096, 4096) and (number of batches, 4096), respectively.
Table 1 shows the sizes of inputs, blocks, and weights used in the experiments.
Table 1. Size of inputs, blocks, weights before and after partitioning & FFT.
  Input size   Weight size before      Block size   Weight size after
               partitioning & FFT      (M)          partitioning & FFT
  (X, Y)                                            (1, X/M, Y/M, M/2+1)
  (N, 4096)    (4096, 4096)            128          (1, 32, 32, 65)
                                       256          (1, 16, 16, 129)
                                       512          (1, 8, 8, 257)
                                       1024         (1, 4, 4, 513)
                                       2048         (1, 2, 2, 1025)
We assume that the block-circulant weight matrix has already been trained and the
FFT of each block has been calculated; please refer to [5] for how to train a block-
circulant weight matrix. For a fully connected layer whose input and output sizes are
X and Y, let M denote the block size; there will then be (X/M) × (Y/M) circulant blocks.
After the FFT, each block is represented as a vector of size M/2 + 1, since the weights
are real-valued and the remaining FFT coefficients are conjugate-symmetric. Overall,
we represent the weight as a 4D tensor of size (1, X/M, Y/M, M/2 + 1). In Table 1,
N represents the number of batches. The weight size after partitioning & FFT has
4 dimensions: 1, X/M, Y/M, and the size after the FFT, where the leading 1 is an extra
dimension that helps match the batch dimension of the input during element-wise
multiplication.
In each experiment, we record the time used from the initial step until we
receive the output of the algorithm. The algorithm consists of the following six steps:

1. Reshape the input X into 4 dimensions (N, X/M, 1, M) to match the size of the weight
   tensor.
2. Calculate the FFT of the input from step 1, FFT(X); the size becomes
   (N, X/M, 1, M/2 + 1).
3. Calculate the element-wise multiplication FFT(W) ∘ FFT(X); the output size
   becomes (N, X/M, Y/M, M/2 + 1).
4. Sum the output from step 3 over the ⌈X/M⌉ input blocks, i.e.
   Σ_{i=1}^{⌈X/M⌉} FFT(w_ij) ∘ FFT(x_i) for each of the Y/M output blocks j;
   the output size becomes (N, Y/M, M/2 + 1).
5. Calculate the IFFT along the third dimension of the tensor from step 4 to obtain the
   output Y; the output size becomes (N, Y/M, M).
6. Reshape the output into (N, output size), where the output size is 4,096.
The dimensions that have been set to 1 are used for broadcasting, which is available
in all packages used in our experiments, reducing and simplifying the code.
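The six steps above can be sketched as a batched numpy implementation (a sketch under the shapes described in the text; the names are ours, and w_fft is assumed to be precomputed with np.fft.rfft):

```python
import numpy as np

def bcm_fc(x, w_fft, M, Y):
    """Block-circulant FC layer following the six steps above.

    x:     input of shape (N, X)
    w_fft: precomputed rfft of each block, shape (1, X/M, Y/M, M/2 + 1)
    """
    N, X = x.shape
    # Step 1: reshape the input to (N, X/M, 1, M).
    xb = x.reshape(N, X // M, 1, M)
    # Step 2: rfft along the last axis -> (N, X/M, 1, M/2 + 1).
    x_fft = np.fft.rfft(xb, axis=-1)
    # Step 3: broadcasted element-wise multiply -> (N, X/M, Y/M, M/2 + 1).
    prod = w_fft * x_fft
    # Step 4: sum over the X/M input blocks -> (N, Y/M, M/2 + 1).
    summed = prod.sum(axis=1)
    # Step 5: inverse rfft along the last axis -> (N, Y/M, M).
    y = np.fft.irfft(summed, n=M, axis=-1)
    # Step 6: reshape to (N, Y).
    return y.reshape(N, Y)
```

The result can be verified against an explicit dense block-circulant weight matrix, which is how we would sanity-check such an implementation before benchmarking it.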
Fig. 7. Total time used with different block sizes on a single core with the numpy package. Left panel:
small batch size (128). Middle panel: medium batch size (1024). Right panel: large batch size
(8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block sizes.
Fig. 8. Total time used with different block sizes on a single core with the intel-numpy package. Left
panel: small batch size (128). Middle panel: medium batch size (1024). Right panel: large batch
size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block
sizes.
Fig. 9. Total time used with different block sizes on a single core with the tensorflow package. Left
panel: small batch size (128). Middle panel: medium batch size (1024). Right panel: large batch
size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block
sizes.
Fig. 10. Total time used with different block sizes on a single core with the tensorflow+nGraph
package. Left panel: small batch size (128). Middle panel: medium batch size (1024). Right
panel: large batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis
shows different block sizes.
In all packages, increasing the block size reduces the time used. However, this
correlation is not linear: doubling the block size does not halve the time.
Once the block size reaches 1024, the speedup ratio remains stable.
In numpy, the BCM algorithm is faster than matrix multiplication at all block
sizes, for the small, medium, and large batch sizes. Meanwhile, in intel-numpy the BCM
algorithm is faster only when the block size is larger than 128. The same result applies
to all batch sizes, with a larger difference observed at the large batch size.
In contrast to numpy and intel-numpy, tensorflow and tensorflow+nGraph show the
BCM algorithm running slower than matrix multiplication.
This applies to the small batch size when the block size is less than 256, and to all block
sizes for the medium and large batch sizes. The results can be explained by the overhead
of computing the FFT and IFFT and of creating a session to run the calculation.
Multiple Cores. Exploiting multiple cores of the system/machine helps
increase overall performance. As mentioned earlier, we use the Ray library to
implement the parallel block-circulant matrix-based algorithm. In this experiment, we
set the number of cores to 4 and ran the experiments using the small, medium, and large
batch sizes. Figures 11, 12, 13 and 14 show the time used as the block size increases in
each package using multiple cores.
Fig. 11. Total time used with different block sizes on multiple cores with the numpy package. Left
panel: small batch size (128). Middle panel: medium batch size (1024). Right panel: large batch
size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different block
sizes.
Fig. 12. Total time used with different block sizes on multiple cores with the intel-numpy package.
Left panel: small batch size (128). Middle panel: medium batch size (1024). Right panel: large
batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different
block sizes.
Fig. 13. Total time used with different block sizes on multiple cores with the tensorflow package.
Left panel: small batch size (128). Middle panel: medium batch size (1024). Right panel: large
batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis shows different
block sizes.
Fig. 14. Total time used with different block sizes on multiple cores with the tensorflow+nGraph
package. Left panel: small batch size (128). Middle panel: medium batch size (1024). Right
panel: large batch size (8192). The y-axis displays the time used in milliseconds, and the x-axis
shows different block sizes.
Because each package uses different multi-core techniques, different results
are expected. In numpy, the BCM algorithm appears to be faster than matrix multi-
plication once the block size is above 128 with the small batch size, and at all block
sizes with the medium and large batch sizes.
With intel-numpy, matrix multiplication is always faster than the BCM algorithm
for block sizes below 128; the BCM outperforms matrix multiplication only for rela-
tively large blocks. This outcome can be explained by the highly optimized matrix
multiplication of the Intel Math Kernel Library (MKL), which is designed
specifically for Intel CPUs.
With tensorflow and tensorflow+nGraph, the break-even block size at which the BCM
and matrix multiplication have similar performance shifts toward larger blocks: even
larger blocks are now needed to outperform matrix multiplication. As in the single-core
case, the BCM still favors smaller batch sizes, and the break-even block size grows
as the batch size increases.
Fig. 15. Total time used with different numbers of CPU cores with the numpy package. Left panel:
small batch size (128). Middle panel: medium batch size (1024). Right panel: large batch size
(8192). The y-axis displays the time used in milliseconds, the x-axis shows the number of CPU
cores, and the legend distinguishes matrix multiplication and the different block sizes.
Fig. 16. Total time used with different numbers of CPU cores with the intel-numpy package. Left
panel: small batch size (128). Middle panel: medium batch size (1024). Right panel: large batch
size (8192). The y-axis displays the time used in milliseconds, the x-axis shows the number of
CPU cores, and the legend distinguishes matrix multiplication and the different block sizes.
Fig. 17. Total time used with different numbers of CPU cores with the tensorflow package. Left panel:
small batch size (128). Middle panel: medium batch size (1024). Right panel: large batch size
(8192). The y-axis displays the time used in milliseconds, the x-axis shows the number of CPU
cores, and the legend distinguishes matrix multiplication and the different block sizes.
Fig. 18. Total time used with different numbers of CPU cores with the tensorflow+nGraph package.
Left panel: small batch size (128). Middle panel: medium batch size (1024). Right panel: large
batch size (8192). The y-axis displays the time used in milliseconds, the x-axis shows the
number of CPU cores, and the legend distinguishes matrix multiplication and the different block sizes.
Fig. 19. Total computation time. Top-left panel: 1 core with block size 128. Top-right panel: 4
cores with block size 128. Bottom-left panel: 1 core with block size 256. Bottom-right panel: 4
cores with block size 256. The y-axis displays the total time used in milliseconds, and the x-axis
shows the batch size.
Fig. 20. The time used per sample. Top-left panel: 1 core with block size 128. Top-right panel:
4 cores with block size 128. Bottom-left panel: 1 core with block size 256. Bottom-right panel: 4
cores with block size 256. The y-axis displays the time used per sample in milliseconds, and the
x-axis shows the batch size.
5 Conclusion
This paper proposed a parallel design of the block-circulant matrix-based algorithm and
demonstrated that the new design achieves better performance than the previous ver-
sion of the algorithm. We also provide guidelines on how to select the block size, batch
size, and number of cores in given situations in order to achieve the best performance.
The guidelines cover popular implementation languages and packages, including
Python, numpy, intel-numpy, tensorflow, and nGraph.
References
1. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural
networks with multitask learning. In: Proceedings of the 25th International Conference on
Machine Learning, pp. 160–167. ACM (2008)
2. Huval, B., Wang, T., Tandon, S., Kiske, J., Song, W., Pazhayampallil, J., Andriluka, M.,
Rajpurkar, P., Migimatsu, T., Cheng-Yue, R., et al.: An empirical evaluation of deep
learning on highway driving. arXiv preprint arXiv:1504.01716 (2015)
3. Burbidge, R., Trotter, M., Buxton, B., Holden, S.: Drug design by machine learning: support
vector machines for pharmaceutical data analysis. Comput. Chem. 26(1), 5–14 (2001)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–
778 (2016)
5. Ding, C., Liao, S., Wang, Y., Li, Z., Liu, N., Zhuo, Y., Wang, C., Qian, X., Bai, Y., Yuan,
G., Ma, X.: CirCNN: accelerating and compressing deep neural networks using block-
circulant weight matrices. In: Proceedings of the 50th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 395–408. ACM, October 2017
6. Cheng, Y., Yu, F.X., Feris, R.S., Kumar, S., Choudhary, A., Chang, S.F.: An exploration of
parameter redundancy in deep networks with circulant projections. In: Proceedings of the
IEEE International Conference on Computer Vision, pp. 2857–2865 (2015)
7. Saxe, A.M., Koh, P.W., Chen, Z., Bhand, M., Suresh, B., Ng, A.Y.: On random weights and
unsupervised feature learning. In: ICML, vol. 2, no. 3, p. 6, June 2011
8. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural
networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440, 3 (2016)
9. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient
neural network. In: Advances in Neural Information Processing Systems, pp. 1135–1143
(2015)
10. Luo, J.H., Wu, J., Lin, W.: ThiNet: a filter level pruning method for deep neural network
compression. In: Proceedings of the IEEE International Conference on Computer Vision,
pp. 5058–5066 (2017)
11. Cai, Z., He, X., Sun, J., Vasconcelos, N.: Deep learning with low precision by half-wave
Gaussian quantization. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 5918–5926 (2017)
12. Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using
vector quantization. arXiv preprint arXiv:1412.6115 (2014)
13. Zhu, M., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning for model
compression. arXiv preprint arXiv:1710.01878 (2017)
14. Mathieu, M., Henaff, M., LeCun, Y.: Fast training of convolutional networks through FFTs.
arXiv preprint arXiv:1312.5851 (2013)
15. Pan, V.Y.: Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer,
Boston (2012)
16. Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z.,
Paul, W., Jordan, M.I., Stoica, I.: Ray: a distributed framework for emerging AI applications.
In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI
2018), pp. 561–577 (2018)
Energy Aware Next Fit Allocation Approach
for Placement of VMs in Cloud Computing
Environment
1 Introduction
As data demands grow day by day, it has become essential to sustain the quality of
data resources, which requires high-performance computing equipment and a great
deal of processing energy. This would not only be highly expensive but may also be
unavailable at times. To meet these requirements, various cloud computing services are
being deployed by industry, providing features such as affordability, mobility, and
proper utilization of expensive infrastructure at a reasonable price.
Service providers have to maintain large-scale data centers comprising millions of
computing servers. These servers consume a high amount of electrical energy. In 2013,
91 billion kWh of electricity was consumed by data center equipment in the United
States, approximately equivalent to the annual output of 34 coal-fired power plants.
This consumption is projected to reach approximately 140 billion kWh per year by
2020, with an estimated price of around $13 billion per year [1]. Moreover, equipment
in data centers also produces heat, requiring cooling devices that add around
$2-5 million annually to the cost of running data centers.
The primary reason behind this huge energy consumption is improper management
of available resources in the cloud environment, such as servers, storage devices, and
networks. Energy consumption has a direct relationship with server utilization,
increasing linearly as utilization rises from 0% to 100%. So the main concept behind
reducing energy consumption is optimizing CPU utilization. The relationship between
energy and CPU utilization can be understood through the following equations:
E = P · T′   (1)

P(U) = P_i + (P_f − P_i) · U   (2)

where
E = energy,
T′ = time,
P = power as a function of utilization,
P_i = power consumption when the CPU is idle, i.e. at 0% utilization,
P_f = power consumption at full CPU utilization,
U = current value of CPU utilization.

There is another empirical nonlinear power model, defined by [2] as:

P(U) = P_i + (P_f − P_i)(2U − U^r)   (3)
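The two power models can be written as simple Python functions (a sketch; the parameter names are ours, and the exponent r in Eq. (3) is a device-specific calibration constant whose default here is an assumption, not a value given in the text):

```python
def power_linear(u, p_idle, p_full):
    """Linear power model of Eq. (2): P(U) = Pi + (Pf - Pi) * U."""
    return p_idle + (p_full - p_idle) * u

def power_empirical(u, p_idle, p_full, r=1.4):
    """Empirical nonlinear model of Eq. (3): P(U) = Pi + (Pf - Pi) * (2U - U**r).

    r is a calibration exponent; 1.4 is only a placeholder default.
    """
    return p_idle + (p_full - p_idle) * (2 * u - u ** r)
```

Both models agree at the endpoints: at U = 0 they return P_i, and at U = 1 they return P_f.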
1.1 VM Consolidation
Virtualization can be used to handle the server utilization issue in a better
way by dividing servers into multiple virtual machines. It not only facilitates the sharing of
hardware among users but also manages each user's large amount of data separately.
Each virtual machine (VM) instance defined over a server possesses its own virtual
hardware and software package. VM instances can be homogeneous or
heterogeneous, depending on their application requirements.
the services availed by users through these VMs [5]. On the one hand, shifting workloads
in the form of VMs away from overloaded hosts reduces the probability of SLA violation,
because overloaded hosts put VMs on hold and increase their waiting time. On the
other hand, migration helps make under-loaded hosts idle by moving all their
VMs to other vacant hosts; this dynamic process of freeing hosts can then save
energy by shutting them down. So dynamic, or live, migration of VMs is another
important aspect that must be handled for optimal resource management [6]. The decision
algorithm decides which physical server will host a given VM. This process of deciding
hosts for all available VMs is known as VM Placement; it includes not only the initial
placement of VMs over hosts but also managing VM placements at regular intervals
according to load variation. Various consolidation strategies have been defined by
researchers, and research continues, for different objectives; an effective
strategy can achieve more than one objective without affecting the others.
For each VM v_i, where i = {1, 2, 3, ..., n}, a suitable host h_k needs to be found
such that E is minimized [8]. Much research has been done, and is ongoing, to solve the
BPP using the Best Fit, First Fit, Worst Fit, and Next Fit allocation approaches.
Researchers have also applied these approaches to problems similar to the BPP, such
as the VM placement problem [4, 8, 10]. As applied by [4], Best Fit
searches for the minimum-power-consuming host for a VM, where hosts are arranged in
order of decreasing CPU utilization. In this work, the Next Fit approach of the
BPP is applied primarily to optimize VM placement for lower energy consumption.
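A generic Next Fit bin-packing sketch (not the paper's CloudSim implementation) illustrates the allocation idea: the current host stays open until a VM does not fit, after which a new host is opened and earlier hosts are never revisited:

```python
def next_fit_placement(vm_demands, host_capacity):
    """Next Fit allocation: place each VM on the current host if it fits,
    otherwise open a fresh host. Demands and capacity are in the same
    (illustrative) resource unit, e.g. normalized CPU utilization.

    Returns a list of hosts, each a list of the VM demands placed on it.
    """
    hosts = [[]]
    used = 0.0
    for demand in vm_demands:
        if used + demand <= host_capacity:
            hosts[-1].append(demand)
            used += demand
        else:
            hosts.append([demand])   # open a new host; never revisit old ones
            used = demand
    return hosts
```

For example, with demands [0.5, 0.6, 0.3, 0.4] and host capacity 1.0, Next Fit uses three hosts, since 0.5 + 0.6 overflows the first host and 0.6 + 0.3 + 0.4 overflows the second.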
In this paper, the VM placement algorithm is modified using the Next Fit allocation
approach, and the results have been verified using basic performance metrics. The prime
contributions of this paper are:
• Introducing the Next Fit allocation approach for placing VMs over hosts.
• Implementing the proposed approach by modifying the existing approach in
cloudsim.
• Analyzing the effects of the proposed approach on improving energy efficiency and
reducing SLA violations in data centers.
The rest of the paper is organized in five parts. The next section describes
existing research in the same realm, and the section after it explains the
problem statement. This is followed by a preliminary section that introduces the baseline for
this research, after which the proposed approach is elaborated. The experiment and
result analysis section then defines the experiment configuration and performance
parameters and verifies the significance of the proposed approach with tables and
graphs. Finally, the conclusion is presented along with the future scope of this research.
2 Related Work
Many researchers have paid attention to optimizing energy usage, because higher
energy consumption both affects the environment and increases running costs. In
addition, QoS is another important factor considered by researchers, as better QoS
helps retain a better relationship with clients. Most research has focused on solving
this problem by modifying the VM consolidation scheme; some of this work is
reviewed below.
Chowdhury et al. [9] considered VM placement as a bin-packing problem and
introduced three variants of PABFD [4] that use modified worst fit, second worst
fit, and first fit allocation in place of best fit allocation. In addition, a clustering
technique, a modified k-means algorithm, is applied to create clusters of VMs based on
CPU utilization and allocated RAM. The highest-density cluster, i.e. the one with the
most VMs, has its VMs allocated first, then the next densest cluster, and so on, until all
VMs are allocated. The clustering technique developed in this work performed best
with the modified first fit decreasing with decreasing host algorithm (FFHDVP_C).
Damodar and Koli [10] modified the VM consolidation process by proposing new
overloaded-host and underloaded-host detection approaches. For detecting an
Energy Aware Next Fit Allocation Approach for Placement of VMs 441
overloaded host, they consider SLA violation as the basic parameter: if the VMs on a
host demand more resources than the available capacity, the host will surely violate the
SLA, so it is put into the list of overloaded hosts. After using an existing VM selection
technique to select VMs from these overloaded hosts and placing them on other hosts
using PABFD, they declare a host underloaded based on minimum utilization.
Kuo et al. [11] presented a resource-based first fit algorithm (RFFA) for assigning VMs to
hosts. In this procedure, the resource requirements of a requesting VM are analyzed before
assignment, and the available resources of each host are updated after the termination of
the VMs assigned to it. The results of the proposed approach are compared with a
resource-based worst fit algorithm (RWFA) and a resource-based best fit
algorithm (RBFA). The performance evaluation shows that the RFFA scheme performs
better than RWFA and RBFA at decreasing energy consumption.
Mosa and Paton [12] addressed the issue of SLA violation while conserving energy
in data centers by developing a new utility-based approach. The approach follows a
design model consisting of Monitoring, Analysis, Planning, and Execution (MAPE).
The monitoring phase finds the CPU utilization of active hosts; the analysis phase
finds alternative VM-to-host mappings using genetic algorithms; the planning phase
evaluates the alternatives and finds the best VM-to-host mapping based on a utility
formula; and the last phase, execution, performs the necessary VM migrations and shuts
down idle hosts.
Castro et al. [13] worked on two sub-problems of the VM consolidation process.
For underloaded host detection, they define a novel algorithm named Underload Detection
(UD); for VM placement, they propose an algorithm called CPU and RAM Energy
Aware (CREW). In UD, an average CPU usage is calculated using an Exponentially
Weighted Moving Average (EWMA), and this average CPU utilization value is used to
identify underloaded hosts. CREW is a modified PABFD that checks both the
available CPU and the available RAM on a particular host before placing a VM on that
host. In this way, their work demonstrates the role of RAM in energy consumption and
provides a heuristic that considers RAM usage along with CPU to reduce energy and
SLA violations.
Han et al. [14] defined two algorithms: one for underloaded host detection using the
power-efficiency (PE) value of a host, defined as the ratio between power con-
sumption and the number of VMs running on that host; and a second for
VM placement based on remaining utilization, which considers the remaining available CPU
resources of the PMs when placing VMs. A new integrated VM consolidation
algorithm consisting of these two techniques is then applied to five different types of
PlanetLab workload data using the cloudsim simulator. The results, compared with
PABFD and UMC, show that the proposed algorithm has little impact on energy
consumption, while SLA violation, SLATAH, the number of VM migrations, and the
ESV metric are reduced significantly.
Mevada et al. [15] improved the VM placement algorithm as part of the VM
consolidation process. They consider the utilization of the complete data center to
identify hosts that can be evacuated, and further use this information to define a
lower threshold value. Based on this value, underutilized hosts are detected and shut
down after all their VMs have been shifted to other vacant hosts.
442 J. Sengupta et al.
Khoshkholghi et al. [16] researched all four sub-problems of VM consolidation
defined by Beloglazov and Buyya [4] and generated a new consolidation scheme
comprising all the proposed techniques. In it, an overloaded host detection algorithm is
developed that considers two utilization thresholds defined by an iterative weighted
linear regression technique. A new VM selection algorithm is also presented, based
upon three different policies: the maximum power reduction policy, the time and power
trade-off policy, and the violated-MIPS VMs policy. Then a two-phase algorithm for
VM placement is proposed: in the first phase, the VMs selected from overloaded hosts
are placed, and in the second phase all VMs selected from underloaded hosts are
allocated hosts. Further, a Multiple Resources Under-loaded Host Detection (MRUHD)
algorithm is introduced; it defines an adaptive lower threshold value, and a host is
considered underloaded when the utilizations of CPU, RAM and bandwidth are all
below this threshold. The proposed consolidation scheme is evaluated using the
CloudSim toolkit. The results depict that the new scheme outperforms the benchmarks
and can reduce energy consumption by up to 28% and SLA violation by up to 87%.
3 Problem Statement
According to Beloglazov and Buyya [4], four sub-problems need to be researched
during the VM consolidation process to obtain an optimal mapping of VMs to hosts.
The first is to find the overloaded hosts, i.e. hosts which have been allocated more VMs
than their serving capacity, because this results in idle VMs and hence promotes SLA
violation. The second is selecting an adequate number of VMs from the overloaded
hosts for migration, depending upon parameters such as CPU utilization, VM size, and
VM correlation. The third is detecting the underloaded hosts: a threshold value is
defined for particular parameters, and hosts below this threshold are considered
underloaded. The purpose of detecting underloaded hosts is to find hosts that can easily
be shut down without affecting the other parameters, as all the VMs from underloaded
hosts can migrate to other active hosts. The last is to place all the VMs selected for
migration on suitable physical hosts in an optimal way. Placement here defines how
the migrated VMs are allocated to the available resources of active physical hosts that
are neither overloaded nor underloaded. This mapping between hosts and VMs is not
stable, because new VMs keep being created to satisfy dynamic user demands while
old VMs terminate progressively after completing the tasks assigned to them. In
addition, it impacts all the performance measurements: the energy consumed by hosts;
how many hosts become overloaded or underloaded after a specific interval; how many
VM migrations occur; and, last but not least, how much SLA violation occurs.
With the analysis of the related work, some research gaps have been identified,
which are helpful in designing the proposal of this research work. All the reviewed
papers have modified existing techniques or developed new policies to improve one or
more of the performance metrics defined later. Researchers have developed many
strategies for reducing energy consumption, considering the required or available
resources, or the amount of energy consumed under different allocation policies.
Energy Aware Next Fit Allocation Approach for Placement of VMs 443
For reducing energy, the main concepts used are to allocate the VMs in such a way
that the number of active PMs is reduced and to shut down the idle servers. However,
various other solutions exist for NP-hard problems that have not been tested here; to
optimize one NP-hard problem, the solution of another problem of a similar type can
be applied. This gap in the research leads us to concentrate on applying a bin packing
technique to the VM placement procedure in the cloud environment, although for this
both problems must first be scaled to a common format. It has also been observed that
there is a trade-off between SLA violation and energy consumption [9]: algorithms that
reduce energy consumption might increase SLA violation and, similarly, reducing SLA
violation may consume more energy. So, a VM placement policy that maintains the
balance between both parameters needs to be defined.
4 Preliminary
Of the four sub-problems defined in the problem statement section, VM placement
is the main sub-problem considered in this research. This section describes the base
VM placement policy that is modified.
In [4], the Power Aware Best Fit Decreasing (PABFD) technique is applied, defined
in Algorithm 1 and shown in Fig. 3. This is the default algorithm for VM placement in
CloudSim. In it, the VMs to be migrated are selected one by one from a list, and their
suitability is checked against every available host. Power consumption is the main
factor considered in this method when finding a suitable host for a VM.
As shown in step 8 of Algorithm 1, the power is estimated for each VM on each
particular host. Then, based upon the condition applied in step 9, a VM is allocated to
the host for which the estimated power consumption is minimal. Hence, VMs are
allocated to the best-fit available hosts.
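The best-fit placement loop described above can be sketched as follows. This is a simplified illustration: the data structures and the power-estimation callback are assumptions, not CloudSim's actual classes.

```python
def pabfd(vms, hosts, power_increase):
    """Power Aware Best Fit Decreasing (sketch).
    vms: list of (vm_id, cpu_demand); hosts: list of (host_id, free_cpu);
    power_increase(host_id, demand): estimated extra power if the VM runs there."""
    allocation = {}
    free = {h: cap for h, cap in hosts}
    # Best fit decreasing: consider the largest VMs first
    for vm, demand in sorted(vms, key=lambda v: v[1], reverse=True):
        best, best_power = None, float("inf")
        for h in free:                    # every host is checked for every VM
            if free[h] >= demand:
                p = power_increase(h, demand)
                if p < best_power:        # keep the minimum-power host
                    best, best_power = h, p
        if best is not None:
            allocation[vm] = best
            free[best] -= demand
    return allocation
```

For example, with an illustrative per-host marginal power model, `pabfd([("v1", 50), ("v2", 30)], [("h1", 100), ("h2", 60)], lambda h, d: {"h1": 0.5, "h2": 0.3}[h] * d)` places v1 on the lower-power host h2 and v2 on h1.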
5 Proposed Algorithm
In this section, the proposed algorithm, Power Aware Next Fit Decreasing
(PANFD), elaborated in Fig. 4, is described. Next fit allocation is an advanced version
of the first fit allocation policy: the next host for allocation is searched for starting from
the location of the previously allocated host. However, the condition is that all the
hosts must be in sorted order. The sort order can depend upon different parameters of
the hosts as well as the VMs, such as the available number of MIPS, the amount of
energy consumed, or the current CPU utilization. In this algorithm, the CPU utilization
value of each VM is used to sort the VMs in ascending order, because utilization is
directly proportional to energy consumption, and the available MIPS per host is used
to arrange the available hosts in ascending order. Thus, this scheme allocates a VM
with a given utilization requirement to the first suitable host in the list and, thereafter,
for the next VM, the search continues from the position where the last host was
selected. Keeping the hosts sorted could increase the time spent searching for a host,
but it utilizes the resources in an optimized way, resulting in less energy consumption
with fewer migrations.
This algorithm is based upon the assumption that if a host is not compatible with a
VM, it will also not be compatible with the next VM in the list, because both hosts and
VMs are sorted in ascending order; it is therefore better to skip those hosts and search
only among the remaining ones. The modified algorithm shown in Fig. 4 defines a new
pointer named "position" for searching for the next suitable host. Initialized to the first
position, the pointer is updated after every allocation of a suitable host to the current
VM in the list. This modification not only improves the performance of the algorithm
but also affects the overall allocation of VMs in an optimized manner, leading to
optimized performance on the various factors discussed later, compared to the base
approach.
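The next-fit search with the "position" pointer can be sketched as follows; the data structures are simplified assumptions, mirroring the description above rather than the published Algorithm of Fig. 4:

```python
def panfd(vms, hosts):
    """Power Aware Next Fit Decreasing (sketch).
    vms: list of (vm_id, required_mips); hosts: list of (host_id, available_mips)."""
    vms = sorted(vms, key=lambda v: v[1])        # ascending VM utilization
    free = sorted(([h, m] for h, m in hosts),    # ascending available MIPS
                  key=lambda h: h[1])
    allocation = {}
    position = 0                                 # resume point for the next VM
    for vm, demand in vms:
        for i in range(position, len(free)):     # skip hosts before 'position'
            if free[i][1] >= demand:
                allocation[vm] = free[i][0]
                free[i][1] -= demand
                position = i                     # next search starts here
                break
    return allocation
```

Because both lists are sorted in ascending order, any host skipped for the current VM would also fail for the next, larger VM, which is exactly why the pointer never moves backwards.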
where
V = the total number of available VMs,
Pdi = the estimated performance degradation due to the ith VM migration,
Pci = the total CPU capacity required by the ith VM.
SLA Violation Time per Active Host (SLATAH): Represents the percentage of time
for which a host's CPU is 100% utilized; in other words, it defines the portion of time
for which VMs have to wait because the CPU is fully utilized.
SLATAH = (1/H) * Σ_{i=1}^{H} (T_Ci / T_Ai)  (3)

where H is the number of hosts, T_Ci is the time during which host i experienced
100% CPU utilization, and T_Ai is the total active time of host i.

Overall SLA Violation (OSLAV):

OSLAV = (totalRequestedMips − totalAllocatedMips) / totalRequestedMips  (4)
Here totalRequestedMips is the sum of all the MIPS requested by the available VMs
in the data center according to their resource requirements, whereas totalAllocatedMips
is the sum of all the MIPS actually allocated to them.
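A direct computation of Eqs. (3) and (4) under these definitions might look like this; the data structures are illustrative:

```python
def slatah(hosts):
    """Eq. (3): hosts is a list of (time_at_full_cpu, total_active_time)
    pairs, one per active host."""
    return sum(t_c / t_a for t_c, t_a in hosts) / len(hosts)

def oslav(total_requested_mips, total_allocated_mips):
    """Eq. (4): fraction of requested MIPS that was not actually allocated."""
    return (total_requested_mips - total_allocated_mips) / total_requested_mips
```

For instance, two hosts that each spent 10 and 30 time units at full CPU out of 100 active units give a SLATAH of 0.2, and allocating 900 of 1000 requested MIPS gives an OSLAV of 0.1.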
Two types of servers, HP G4 and HP G5, are considered [18]. The energy consumed
by these two types of servers at various CPU utilization levels is given in Table 2. The
reason for selecting this configuration is to measure the effectiveness of the proposed
VM placement algorithm, as servers with less resource capacity can be overloaded
easily even with a light workload. Four types of virtual machines corresponding to
Amazon EC2 instance types are created, as defined in Table 1 [4].
Energy Consumption is defined as the amount of energy consumed during an
interval at running time by all the physical machines. A lower energy consumption
value means less expenditure, so an algorithm can be considered optimized only if it
helps to reduce this parameter.
Table 2. Amount of energy consumed (in Watts) by hosts at various CPU utilization levels.
Server 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
HPG4 86.0 89.4 92.6 96.0 99.5 102.0 106.0 108.0 112.0 114.0 117.0
HPG5 93.7 97.0 101.0 105.0 110.0 116.0 121.0 125.0 129.0 133.0 135.0
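In simulation, the power draw at intermediate utilization levels is commonly obtained by linear interpolation between the tabulated points; a sketch using the HPG4 row of Table 2 (the interpolation scheme mirrors common practice and is an assumption, not necessarily the authors' exact model):

```python
# SPECpower measurements for the HPG4 host at 0%, 10%, ..., 100% CPU (Table 2)
HPG4_WATTS = [86.0, 89.4, 92.6, 96.0, 99.5, 102.0, 106.0, 108.0, 112.0, 114.0, 117.0]

def power_at(utilization, table=HPG4_WATTS):
    """Linearly interpolate host power (Watts) for a utilization in [0, 1]."""
    if utilization <= 0:
        return table[0]
    if utilization >= 1:
        return table[-1]
    step = utilization * (len(table) - 1)   # fractional position between points
    i = int(step)
    frac = step - i
    return table[i] + frac * (table[i + 1] - table[i])

print(power_at(0.35))  # halfway between the 30% and 40% points, ≈ 97.75 W
```

Summing such per-host values over each scheduling interval yields the Energy Consumption metric defined above.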
SLA Time Per Active Host: This factor is mainly responsible for defining the level of
QoS attained while catering to the services of cloud users. As represented in Fig. 8,
SLATAH has decreased dramatically for all the combinations except IQR_MU. Here,
LRR_MC performed best, with a 65.13% improvement and the lowest SLATAH value
of 2.43%. Another related quality factor considered for evaluation is the Overall SLA
Violation (OSLAV), shown in Fig. 9. It depicts that OSLAV has not improved for
IQR_MMT, IQR_MU, LRR_MU and LR_MU; however, for LR_RS it has been
optimized by 41.17%. Also, IQR_MMT under the existing approach has the minimum
OSLAV of 0.07%. Hence, PANFD has produced mixed results for this factor, showing
improvement to some extent.
7 Conclusion
References
1. Kepes, B.: Aligned energy changes the data center model. https://www.networkworld.com/
article/3025455/aligned-energy-changes-the-data-center-model.html
2. Fan, X., Weber, W.D., Barroso, L.A.: Power provisioning for a warehouse-sized computer.
ACM SIGARCH Comput. Archit. News 35(2), 13–23 (2007)
3. Singh, P., Sengupta, J., Suri, P.K.: A novel approach of virtual machine consolidation for
energy efficiency and reducing sla violation in data centers. Int. J. Innovative Technol.
Exploring Eng. 8, 547–555 (2019)
4. Beloglazov, A., Buyya, R.: Optimal online deterministic algorithms and adaptive heuristics
for energy and performance efficient dynamic consolidation of virtual machines in cloud data
centers. Concurrency Comput. Pract. Experience. 24, 1397–1420 (2011)
5. Silva Filho, M., Monteiro, C., Inácio, P., Freire, M.: Approaches for optimizing virtual
machine placement and migration in cloud environments: a survey. J. Parallel Distrib.
Comput. 111, 222–250 (2018)
6. Clark, C., Fraser, K., Hand S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I., Warfield, A.: Live
migration of virtual machines. In: Proceedings of the 2nd Conference on Symposium on
Networked Systems Design & Implementation, vol. 2, pp. 273–286 (2005)
7. Coffman, E.G., Garey, M.R., Johnson, D.S.: Approximation algorithms for bin packing: a
survey. In: Approximation Algorithms for NP-hard Problems, pp. 46–93 (1996)
8. Kumaraswamy, S., Nair, M.K.: Bin packing algorithms for virtual machine placement in
cloud computing: a review. Int. J. Electr. Comput. Eng. (IJECE) 9, 512 (2019)
9. Chowdhury, M., Mahmud, M., Rahman, R.: Implementation and performance analysis of
various VM placement strategies in CloudSim. J. Cloud Comput. 4, 20 (2015)
10. Pagare, M.J.D., Koli, N.A.: Performance analysis of an energy efficient virtual machine
consolidation algorithm in cloud computing. Int. J. Comput. Eng. Technol. (IJCET) 6(5),
24–35 (2015)
11. Kuo, C.F., Yeh, T.H., Lu, Y.F., Chang, B.R.: Efficient allocation algorithm for virtual
machines in cloud computing systems. In: Proceedings of the ASE BigData & SocialInfor-
matics, p. 48. ACM (2015)
12. Mosa, A., Paton, N.: Optimizing virtual machine placement for energy and SLA in clouds
using utility functions. J. Cloud Comput. 5, 17 (2016)
13. Castro, P., Barreto, V., Corrêa, S., Granville, L., Cardoso, K.: A joint CPU-RAM energy
efficient and SLA-compliant approach for cloud data centers. Comput. Netw. 94, 1–13
(2016)
14. Han, G., Que, W., Jia, G., Shu, L.: An efficient virtual machine consolidation scheme for
multimedia cloud computing. Sensors 16, 246 (2016)
15. Mevada, A., Patel, H., Patel, N.: Enhanced energy efficient virtual machine placement policy
for load balancing in cloud environment. Int. J. Cur. Res. Rev. 9(6), 50 (2017)
16. Khoshkholghi, M.A., Derahman, M.N., Abdullah, A., Subramaniam, S., Othman, M.:
Energy-efficient algorithms for dynamic virtual machine consolidation in cloud data centers.
IEEE Access 5, 10709–10722 (2017)
17. Calheiros, R., Ranjan, R., Beloglazov, A., De Rose, C., Buyya, R.: CloudSim: a toolkit for
modeling and simulation of cloud computing environments and evaluation of resource
provisioning algorithms. Softw. Pract. Experience. 41, 23–50 (2010)
18. Standard Performance Evaluation Corporation, “SPECpower_ssj2008”, Spec.org. https://
www.spec.org/power_ssj2008/results/res2011q1/power_ssj2008-20110124-00338.html.
https://www.spec.org/power_ssj2008/results/res2011q1/power_ssj2008-20110124-00339.
html
19. Park, K., Pai, V.: CoMon: a mostly-scalable monitoring system for PlanetLab.
ACM SIGOPS Operating Syst. Rev. 40, 65 (2006)
Multi-user Expert System for Operation
and Maintenance in Energized Lines
1 Introduction
2 System Structure
(Figure: system structure relating the DigSilent PowerFactory electrical software, the
virtual environment, and the ZXE Overseer, ZXE Group Leader and ZXE Operator
roles.)
3 Virtual Environment
communication network and interact with each other, strengthening collaborative
work. The structural diagram of the interrelation between the components is shown in
Fig. 2.
(Fig. 2: component diagram relating CAD 3D models and WorldComposer scripts for
virtualizing the environment; the electrical software controller exchanging input and
output variables with DigSilent PowerFactory; virtual input/output devices including
the Oculus Rift; 3D model and behavior managers; sound and audio-effects controllers;
a communication plugin with multi-user TCP/IP support; movement and grab-object
VR controllers; and game scenes with operator avatars and canvas UI methods.)
(Figure: virtualization of the real environment at Illuchi I-II and the S/E El Calvario
substation, Cotopaxi, within the National Interconnected System of Ecuador; phase 4
covers map regions, heightmap export, image export and terrain creation.)
In the first phase, a survey of information about the exact place to be virtualized is
required, together with its location coordinates; this information is obtained from 360°
photographs, each of which carries the georeferencing of the real point at which it was
taken. The WorldComposer tool allows height maps and images of a real location
anywhere in the world to be extracted. For its initialization, the options described
below must be activated: map parameters, regions, heightmap export, image export
and create terrain. Once the parameters are activated, the virtualized area is created
directly in the Unity assets.
460 E. F. Moreno et al.
Figure 9 presents the levels of the operators: (i) Chief of Maintenance (engineer,
master's degree, high level): controls and coordinates maintenance maneuvers and
energized-line operation with the group leader, and ensures compliance with safety
standards; (ii) Overseer (engineer, high level): carries out work orders and delivers
materials according to the operations to be performed in the work area; (iii) Chief of
Group (technologist, middle level): coordinates, receives materials according to work
orders and performs maintenance maneuvers on energized lines; (iv) Operator (low-
level technician): fulfills work orders and performs operations and maneuvers such as
cleaning and changing insulators, changing and relocating crossheads, changing
disconnectors, and assembling and disassembling transformers.
Multi-user Expert System for Operation and Maintenance 463
(Fig. 9: operator hierarchy modeled with Adobe Fuse: Chief of Maintenance and
Overseer at the advanced level, Group Leader at the medium level, and Operator at the
low level.)
4 Multi-user System
(Figure: multi-user connection system linking users and the administrator through the
Internet to the virtual environment and the character selection system.)
The results achieved by the multi-user virtual training expert system are described in
this section. The system serves as an additional tool for learning the correct
management of the operation and maintenance of energized lines, given the dangerous
events involved in such work and the high cost of line protection equipment. By means
of the virtual application, the novel advantage was obtained of simulating different
failure events in the electrical power system, specifically on the main busbars, without
directly exposing the operators, who gain integral theoretical-practical experience
focused on efficient collaborative work between groups during emergencies in the
system.
The virtual work environment offers two training modes (operator and visitor) that
allow user interaction and immersion according to the learning requirements, as shown
in Fig. 12.
(Fig. 12: mode selection: maintenance on energized lines, or visit to the substation.)
Maintenance Mode on Energized Lines. In this mode the user can access the
environment from any location by entering a name and ID, after which a selection
menu of maintenance operators is presented, organized by hierarchical position
according to activities and responsibilities, as shown in Fig. 13. After selection, the
operator enters the information room, in which visual and auditory instructions on
safety standards, diagrams and maintenance protocols are displayed (see Fig. 14).
Selection of Maneuver and Work Area. A menu is presented to select the type of
maintenance in the sub-transmission or distribution system, with maneuvers such as
assembly and disassembly of a single-phase transformer, changing and cleaning of
insulators, and changing of crossarms (see Fig. 15).
Work Area. In this environment the operator is able to visualize and interact directly
with the power system: elevation substation, reduction substation, insulators,
transmission lines, power transformer and other electromechanical components (see
Fig. 16), in order to perform maintenance maneuvers on energized lines under safety
standards and protocols.
Multi-user. The training system offers the novel alternative of training operators on a
multi-user platform, which allows the immersion and interaction of several users in the
work area according to the complexity of the maneuver, as shown in Fig. 17.
Fault Model. Figure 18 displays a fault occurring in the E-bar of the system, caused by
an opening error in a disconnector that opens the circuit and generates an electric arc,
as shown in Fig. 19; the switches of the Illuchi 1 lifting substation are directly affected.
The programming reproduces the visual and auditory realism at the exact moment of
the electrical emergency.
Visitor Mode. In this mode the user has access to the three substations, Illuchi 1,
Illuchi 2 and El Calvario, in which students and new operators have the advantage of
visiting and learning about the elements that make up the system through an audio
guide that facilitates the tour and learning in the electrical field (see Fig. 20).
The evaluation and validation of the multi-user training assistant provides a range
of approaches to the effectiveness of training tasks and maintenance maneuvers.
Finally, the joint evaluation verifies the validity of the user-system interface in real
situations, quantifying the results in terms of benefit (knowledge, cost, and an extra
risk-free training tool). The usability of the expert system was evaluated by means of a
SUS questionnaire [17]. A group of 15 people participated in this evaluation process:
10 students and 5 professors of the Electromechanical Engineering program, located at
different sites (the Research Laboratory and the Network & Industrial Process Control
Laboratory), with students who take the subjects of Maintenance Engineering and
Electrical Systems (see Fig. 21). The evaluators issued criteria with respect to the
animations and the presentation of the maintenance procedures, for the analysis of the
information received by the evaluated users and for possible improvements in terms of
a redesign of the system. Subsequently, research was carried out whose main focus was
the improvement of training, comparing two groups of people: users trained using the
conventional method (theoretical knowledge only) versus users trained with theoretical
knowledge plus training in the virtual reality environment. The comparison showed a
positive influence on the improvement of cognitive skills, collaboration between
students and teachers, and retention of knowledge.
In Table 1, the items for the usability assessment corresponding to the application
are shown.
(Fig. 22: bar chart of survey responses; y-axis: number of people surveyed (0–16),
x-axis: question number (1–10).)
The results obtained are shown in Fig. 22: 14 of the 15 people evaluated indicated
that the virtual application for training operators on energized lines contributes to the
learning and development of the cognitive skills of students and teachers in the area,
since it is a novel, easily accessible tool that serves as theoretical-practical support
when interacting with the components of the electrical power system.
6 Conclusions
This article presents the development of a multi-user training system based on virtual
reality for the electrical domain. It offers assisted guidance to the user in the recognition
of processes and electrical elements of sub-transmission and distribution, contributing
to the education and training of professionals by means of immersion in and interaction
with operation and maintenance maneuvers in a risk-free environment, optimizing
economic resources, time and infrastructure.
The Unity 3D graphics engine features an easy-to-use interface that provides the
realism of an electrical power system through interaction between the operators of the
multi-user system; the experimental results show the development of skills and abilities
in a collaborative environment, renewing current training models.
Exchanging data between the electrical software and the Unity 3D graphics engine
enables the analysis of possible faults, which can be caused by bad maneuvers or
external conditions in critical parts of the electrical power system (substation bars,
distribution systems). The objective of this application is to provide an extra training
tool that facilitates learning the maneuvers and safety protocols involved in the
operation and maintenance of energized lines.
The virtual environment is currently used by professors and students of the
Electromechanical Engineering program at the University of the Armed Forces ESPE
to support their theoretical-practical training. This training system, based on immersive
virtual reality, can be easily installed on a computer, allowing users to interact with the
multi-user system through the Internet and strengthen collaborative work.
In the future, it is planned to integrate an augmented reality system that facilitates
maintenance not only on energized lines but also in electrical substations, building on
the present system.
References
1. Faccio, M., Persona, A., Sgarbossa, F., Zanin, G.: Industrial maintenance policy
development: a quantitative framework. Int. J. Prod. Econ. 147, 85–93 (2014)
2. Van de Kerkhof, R., Akkermans, H., Noorderhaven, N.: Knowledge lost in data:
organizational impediments to condition-based maintenance in the process industry. In:
Zijm, H., Klumpp, M., Clausen, U., Hompel, M. (eds.) Logistics and Supply Chain
Innovation, pp. 223–237. Springer, Cham (2016)
3. Daily, J., Peterson, J.: Predictive maintenance: how big data analysis can improve
maintenance. In: Richter, K., Walther, J. (eds.) Supply Chain Integration Challenges in
Commercial Aerospace, pp. 267–278. Springer, Cham (2017)
4. Koksal, A., Ozdemir, A.: Improved transformer maintenance plan for reliability centred asset
management of power transmission system. IET Gener. Transm. Distrib. X(8), 1976–1983
(2017)
5. Barbosa, C., Nallin, F.: Corrosion detection robot for energized power lines. In: Proceedings
of the 2014 3rd International Conference on Applied Robotics for the Power Industry, pp. 1–
6. IEEE (2014)
6. Neitzel, D.K.: Electrical safety when working near overhead power lines. In: 2016 IEEE PES
13th International Conference on Transmission & Distribution Construction, Operation &
Live-Line Maintenance (ESMO), pp. 1–5. IEEE (2016)
7. Galvan, I., Ayala, A., Rodríguez, E., Arroyo, G.: Virtual reality training system for
maintenance of underground lines in power distribution system. In: Virtual Reality (2016)
8. Ayala, A., Galván, I., Pérez, G., Ramirez, M., Muñoz, J.: Virtual reality training system for
maintenance and operation of high-voltage overhead power lines. In: Third International
Conference on Innovative Computing Technology (INTECH 2013) (2013)
9. Perez-Ramirez, M., Arroyo-Figueroa, G., Ayala, A.: The use of a virtual reality training
system to improve technical skill in the maintenance of live-line power distribution
networks. Interact. Learn. Environ. 1–18 (2019)
10. Zayas, B., Perez, M.: An instructional design model for virtual reality training environments.
In: EdMedia+ Innovate Learning. Association for the Advancement of Computing in
Education (AACE), pp. 483–488 (2015)
11. Li, B., Bi, Y., He, Q., Ren, J., Li, Z.: A low-complexity method for authoring an interactive
virtual maintenance training system of hydroelectric generating equipment. Comput. Ind.
100, 159–172 (2018)
12. Hernández, Y., Pérez, M., Ramírez, W., Ayala, E., Ontiveros, N.: Architecture of an
intelligent training system based on virtual environments for electricity distribution
substations. Res. Comput. Sci. 129, 63–70 (2016)
13. Dos Reis, P., Matos, C., Diniz, P., Silva, D., Dantas, W., Braz, G., Araújo, A.: An immersive
virtual reality application for collaborative training of power systems operators. In: 2015
XVII Symposium on Virtual and Augmented Reality, pp. 121–126. IEEE (2015)
14. Chiluisa, M., Mullo, R., Andaluz, V.H.: Training in virtual environments for hybrid power
plant. In: International Symposium on Visual Computing, pp. 193–204. Springer, Cham
(2018)
15. Zhang, S., Ying, S., Shao, Y., Gao, W., Liang, Y., Peng, P., Luo, X.: Design and application
of electric power skill training platform based on virtual reality technology. In: 2018 Chinese
Automation Congress (CAC), pp. 1548–1551. IEEE (2018)
16. Cardoso, A., do Santos Peres, I., Lamounier, E., Lima, G., Miranda, M., Moraes, I.:
Associating holography techniques with BIM practices for electrical substation design. In:
International Conference on Applied Human Factors and Ergonomics, pp. 37–47. Springer,
Cham (2017)
17. Cai, L., Cen, M., Luo, Z., Li, H.: Modeling risk behaviors in virtual environment based on
multi-agent. In: 2010 The 2nd International Conference on Computer and Automation
Engineering (ICCAE) (2010)
The Repeatability of Human Swarms
1 Introduction
2 Repeatability Study
Ten groups of randomly selected participants were tasked with answering a set of 25
subjective assessment questions, each of which required them to rate items on a 1–5
scale. Each group consisted of between 20 and 25 individuals. All questions were of
the same format: “How important is it to have taken a class in [subject area] before
graduating high school?” The group congregated on the Swarm AI platform to answer
the questions as a real-time human swarm. A swarm is shown answering question 14
(Environmental Science) in Fig. 1.
For each unique question, the repeatability of the swarm’s answers was calculated
as the fraction of the ten groups that gave the most common answer. For example, if
25% of groups tested chose “1” and 75% of groups tested chose “2”, the repeatability
of the swarms on that question would be 75%.
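This definition can be computed directly from the list of per-group answers; a minimal sketch:

```python
from collections import Counter

def repeatability(group_answers):
    """Fraction of groups that gave the most common answer to a question."""
    counts = Counter(group_answers)
    return counts.most_common(1)[0][1] / len(group_answers)

# 75% of groups answered "2", so the repeatability is 0.75
print(repeatability([2, 2, 2, 1]))  # 0.75
```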
Over the 25 subjective assessment questions used in this experiment, the ASI swarms
generated answers that were 67% repeatable (i.e., on average, swarms chose the most
commonly generated answer 67% of the time.) The repeatability was also broken down
per question, shown in Fig. 2. Four questions were answered the same way by all 10
swarms, resulting in a 100% repeatability for questions 4, 7, 15, and 21. The minimum
repeatability observed was 40%, on questions 1, 3, and 10. On these questions, the
most popular answer chosen by the swarms was only selected 40% of the time, while
two or more other answers were selected the other 60% of the time. The swarm answers
to each of the 25 questions were bootstrapped 5000 times with replacement, and a 95%
confidence interval of the repeatability of swarms in this question set was calculated.
We can be 95% confident that swarms in this question set were between 60% and 75%
repeatable on average.
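The bootstrap procedure described above can be sketched as a percentile bootstrap over the per-question repeatability values; the function names and the exact resampling unit are our assumptions:

```python
import random

def bootstrap_ci(values, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`:
    resample with replacement, collect the resampled means, and take the
    alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Applied to 25 per-question repeatability values, this yields an interval such as the [60%, 75%] reported in the text.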
476 G. Willcox et al.
Fig. 3. Swarm repeatability and conviction correlation, including a moving average trendline to
show local average accuracy.
4 Swarm Interpolation
While the explicit answers from swarms are instructive, it’s often the case that a deeper
analysis of the behaviors of individuals within the swarms reveals a more accurate
picture of the beliefs of the group itself. A process of interpolation was used to refine
the swarm’s explicit answer into a fractional answer index that better represents the
beliefs of the group. This interpolated swarm answer was calculated as the mean
answer that individuals expressed over the swarm’s entire deliberation, expressed as a
decimal (e.g. 3.15). One example of the process of interpolating an explicit swarm
answer into an interpolated answer is shown in Fig. 4. In this example, the swarm
chose “2”, but debated mostly between answers “1 (not important)” and “2”. The
interpolated answer for this swarm was 1.60.
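As a minimal sketch, the interpolated answer is the mean of the answer indexes that individuals expressed over the deliberation:

```python
def interpolated_answer(expressed_answers):
    """Mean answer index expressed by individuals during the deliberation,
    returned as a decimal rather than a discrete answer choice."""
    return sum(expressed_answers) / len(expressed_answers)

# A swarm debating mostly between "1" and "2" whose explicit answer was "2"
print(round(interpolated_answer([1, 2, 1, 2, 2]), 2))  # 1.6
```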
An interpolated answer was calculated for each explicit swarm answer in this study.
The interpolated answers had an intra-question variance of 0.138 of an answer index,
meaning that the swarm answers to a single question had a standard deviation of 0.371
answer indexes, on average. The interpolated answers observed in response to each
question were bootstrapped 5,000 times, and a 95% confidence interval of the average
intra-question standard deviation was calculated as [0.316, 0.412].
The average intra-question standard deviation of the explicit answers was 0.574
indexes, with a 95% confidence interval of [0.490, 0.656]. Comparing the intra-
question standard deviations of the explicit and interpolated answers, we find it highly
unlikely (p < 0.001) that the interpolated answers do not have a lower intra-question
standard deviation. As a result, we can be confident that the interpolated answer from a
single swarm is a more precise predictor of the population's sentiment on a given
question, as measured by the average swarm response to that question, than the explicit
answer that the swarm chooses.
5 Conclusions
These results indicate that ASI swarms of 20–25 people are statistically repeatable systems. On the highly subjective questions in this study, the swarms produced results that were, on average, 67% repeatable. Furthermore, the repeatability of each response was found to correlate significantly with the Conviction metric generated for that swarm (r² = 0.33, p < 0.01). Specifically, swarm-based responses with a Conviction metric above 85% produced answers that were on average 90% or more repeatable, and that matched the answer chosen most often by the swarm more than 95% of the time. This provides a valuable guideline for practitioners using human swarms to generate insights from sampled populations.
In addition, for questions with ordered answer options, the results of this study
show that interpolating the swarm’s output using the underlying behavioral data can
significantly decrease the variance of answers, indicating that the interpolated answer
may be a more precise estimator of group sentiment than the swarm’s explicit answer.
This study was limited by the number of groups tested, the size of the groups tested,
and the content of the questions. Future work may investigate the repeatability of
smaller (n < 15) or larger (n > 30) groups than in this study: the repeatability of a
survey is expected to increase as more people are surveyed, so it would be interesting to
see if the same trend holds for human swarms. It would further be interesting to
compare the expected repeatability of surveys and human swarms as the number of
respondents changes from small (n = 3) to large (n > 100): is one method more
repeatable at large/small sample sizes? Future work may also expand the question
content considered, as this study was limited to subjective assessments that tasked the
groups with rating items on a 5-point scale. In this study, some questions had full repeatability (100%), while others were less repeatable than a coin flip (40%); this may be partly attributable to the content of the questions themselves, so future work may explore the impact that question content has on the repeatability of human swarms.
Acknowledgment. Thanks to Chris Hornbostel for his efforts in coordinating the swarms. Also,
thanks to Unanimous AI for the use of the Swarm platform for this ongoing work. This work was
partially funded by NSF Grant #1840937.
A Two-Stage Machine Learning Approach
to Forecast the Lifetime of Movies
in a Multiplex
1 Introduction
The media and entertainment sector in India represents approximately 1% of the country's GDP, with an estimated value of $15.6 billion. Within this, the Indian film industry grosses $3.8 billion and is growing steadily, with a compound annual growth rate of over 10% in recent years. This paper focuses on the most economically valuable subset of the Indian film industry: cinema exhibition. According to a Deloitte report on the economic contribution of the film industry in the year 2017 [1], more than 50% of the gross share of the film industry is
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 480–493, 2020.
https://doi.org/10.1007/978-3-030-39442-4_36
2 Related Work
The film industry has been an area of active research for several years. Ajay Shiva Santhosh Reddy et al. [3] deal with the box-office performance of movies based on hype analysis using Twitter. Hype is calculated from the number of tweets relating to the movie, the total number of tweets from distinct users, and the follower count of each user. Box-office collection is predicted by multiplying the hype factor by the number of shows screened during the first weekend. Ramesh Sharda and Dursun Delen [4] have dealt with predicting how successful a movie turns out to be. The target variable is divided into nine classes, ranging from 'flop' to 'blockbuster', based on the movie's box-office receipts. The classification problem is tackled with a neural network whose features include the presence of a star actor, the genre of the movie, the number of screens allocated to the movie, etc. Sameer Ranjan Jaiswal and Divyansh Sharma [5] have come up with a similar model specifically targeting Bollywood movies. They utilize a feature named 'music score', a characteristic factor for Bollywood movies that greatly improves the performance of the model.
Andrew Ainslie et al. [6] propose an interesting concept in which they analyze box-office sales in the context of a market share model. They claim that the number of screens allotted during the opening week is overestimated in traditional models. The work also specifies that actors have a direct effect on customers' movie choices, while the director has an indirect effect.
The majority of research on movies and theaters focuses on predicting whether a movie is successful based on box-office revenue. While most authors consider external factors such as tweets and market share, they do not consider local behavioral factors such as the behavior of the crowd, the operational pattern of the multiplex, and seasonal characteristics. Each state in India has a diverse demographic with varying regional languages; consequently, movie preferences vary greatly from state to state. Hence, behavioral factors are crucial in analyzing the success of a movie in the locality of the multiplex.
The authors of this paper predict the lifetime of movies as a measure of success using local behavioral factors, as illustrated in Sect. 4. Moreover, accurate estimation of the lifetime of movies within a multiplex helps exhibitors maximize profits by making smarter negotiations with distributors. Additionally, knowing which movies will continue to screen next week helps the multiplex with scheduling.
3 Dataset
Operating 17 multiplexes across 10 cities, each with a different cultural, linguistic and ethnic background, the multiplex chain under consideration is one of India's leading cinema exhibitors. The dataset consists of over 15 million records and has previously been used for food sales forecasting [7]. It contains transactions from 2015–2017 pertaining to movie ticket purchases from
Field: Description
Film String Code: Unique identification key for the film
Screen name: Unique name assigned to the screen
Session datetime: Date and time of the film screening
Session seats reserved: Number of seats reserved by the multiplex for every show
Show number: The corresponding time slot of the show in the day
Transaction value: The payment amount remitted by the customer for the transaction
Transaction datetime: Date and time of the transaction
Seats per transaction: The number of seats sold in the transaction
Transaction ID: Unique identification key for each transaction
The multiplex offers special screenings and arrangements for films that screen exclusively for a maximum of 4 days. For this reason, the authors only consider movies with at least 5 days of screening. The transaction data also labels tickets that were canceled by the multiplex due to unforeseen circumstances; to maintain the integrity of the dataset, these canceled tickets are removed from further analysis.
Since the multiplex releases movie schedules for the following week each Wednesday, the authors define the business week to start on Wednesday, so that predictions are delivered prior to scheduling. The transaction records for over 2,500 movies are aggregated based on this definition of a business week. Each record contains fields specific to a movie in a business week, such as the number of seats filled in the week and the number of weeks since the movie's release. Some of these fields are discussed in detail in Sect. 4.
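Mapping a calendar date to its Wednesday-anchored business week can be sketched as follows (an illustrative helper, assuming Python's `datetime.weekday()` convention of Monday = 0):

```python
from datetime import date, timedelta

WEDNESDAY = 2  # datetime.weekday(): Monday = 0, ..., Sunday = 6

def business_week_start(d):
    # Wednesday that opens the business week containing calendar date d;
    # transactions are aggregated per returned Wednesday.
    return d - timedelta(days=(d.weekday() - WEDNESDAY) % 7)
```

Grouping transaction records by `business_week_start(transaction_date)` then yields one aggregate row per movie per business week.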
484 A. Ragav et al.
The box plot shown in Fig. 1 represents the distribution of lifetime for all the
films considered. The average lifetime is observed to be 2.5 weeks, with 75% of
movies having a lifetime ≤ 3 weeks.
Fig. 2. Variation of PSF and average LB across weeks since release of movies
PSF_avg(i) = (1/n) Σ_{j=1}^{n} PSF_j(i), (1)
where i represents a unique movie and n represents the maximum number of weeks the movie has been screened. The mean of PSF (PSF_avg) for a movie is a measure of the consistency of the movie's occupancy. While PSF is a short-term indicator of the film's popularity, PSF_avg serves as a long-term indicator for measuring the performance of a movie in theaters.
Late Bookings (LB). LB represents the ratio of seats booked at the closing of a show's transaction window (defined by the authors as after 3 PM on the day before the show's screening) to the total seats allocated for the screening. Note that the multiplex facilitates offline in-person bookings besides conventional online bookings. The rate at which seats fill typically reflects the enthusiasm for a film: people book tickets well in advance of a popular show to ensure they get a seat. Late bookings also account for last-minute transactions that occur moments before a show is screened. The increase in the average number of late bookings per week (Fig. 2(b)) indicates a depleting interest in movies; this inference is drawn from the comparable reduction in average occupancy per week seen in Fig. 2(a).
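Under the 3 PM cutoff definition above, the LB ratio might be computed roughly as follows (a simplified sketch that treats each transaction as a timestamped seat count; names are hypothetical):

```python
from datetime import datetime, timedelta

def late_booking_cutoff(show_time):
    # Transactions after 3 PM on the day before the screening count as late.
    day_before = show_time - timedelta(days=1)
    return day_before.replace(hour=15, minute=0, second=0, microsecond=0)

def late_booking_ratio(transactions, show_time, seats_allocated):
    # transactions: list of (transaction_datetime, seats_in_transaction).
    # Returns late-booked seats over total seats allocated for the show.
    cutoff = late_booking_cutoff(show_time)
    late_seats = sum(seats for when, seats in transactions if when > cutoff)
    return late_seats / seats_allocated
```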
Relative Occupancy (RO). A higher number of occupants for a particular movie may require the multiplex to increase its capacity, consequently reducing the capacity for another movie that week. RO_j(i) specifies the share of seats held by the i-th movie among the movies screening in the j-th week at the multiplex, and is calculated by Eq. (2). If RO_j(i) is 5%, the seats booked for the i-th movie account for 5% of all seats booked in the multiplex in the j-th week. The relative occupancy for the multiplex ranges from 0.007% to 81%, with an average RO of 4%. Movies with consistently higher RO values have a higher probability of screening in the multiplex the next week. A smoothed histogram in Fig. 3 illustrates the distribution of movies. For movies failing to screen the next week, the average RO is 0.8%, ranging from 0.007% to 18%; movies that do screen the next week have a greater average (5.2%) and a larger range (0.14%–81%).
RO_j(i) = s_j(i) / Σ_{k=1}^{n} s_j(k), (2)
where i represents a unique movie, j represents a particular week, k represents a movie in the week, n represents the total number of movies in the week, and s_j(k) denotes the seats booked for movie k in week j.
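The share defined by Eq. (2) reduces to a normalization over the week's bookings; a minimal sketch (hypothetical data layout, not the authors' code):

```python
def relative_occupancy(week_bookings, movie):
    # week_bookings maps movie id -> seats booked for it in the week;
    # RO is the movie's share of all seats booked that week.
    total = sum(week_bookings.values())
    return week_bookings[movie] / total if total else 0.0
```

By construction the RO values of all movies in a week sum to 1.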
Frequency of Seats Booked per Transaction (SBT). The number of seats booked in a single transaction, SBT_m (where m represents the number of seats booked), sheds light on the response of the incoming crowd toward a film and in turn characterizes the film. The multiplex facilitates booking multiple seats in a single transaction. Figure 4 represents the correlation between SBT and the probability of movies continuing to screen the following week. The occurrence of higher values of SBT (SBT_7 and above) indicates a very high probability that a movie will continue to be screened.
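The SBT feature amounts to a normalized histogram of group sizes per transaction; a minimal sketch (illustrative helper name):

```python
from collections import Counter

def sbt_frequencies(seats_per_transaction):
    # SBT_m: how often exactly m seats were booked in a single transaction,
    # normalized over all transactions for the movie that week.
    counts = Counter(seats_per_transaction)
    n = len(seats_per_transaction)
    return {m: c / n for m, c in counts.items()}
```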
History Features. Since the model makes predictions on a weekly basis, it is important to supply short-term memory features that help the model understand variations in behavior across the past week. Therefore, 7 history points, corresponding to the days of the prior week, are provided for occupancy features such as RO, LB and PSF.
New Releases in a Week (NRW). A new release can shift the sales and demand of the movies currently running. NRW refers to the number of new movie releases normalized by the total number of movies screening in the week.
5 Methodology
The authors make predictions at the start of a business week for two different use cases. The first is binary classification: flagging whether a particular movie screening in the week will continue to screen the following week (the classification use case). For the movies predicted to screen the following week, a regressor engine predicts how many more weeks the movie will continue to be screened (the regression use case).
Booking transaction data from 2015, 2016 and 2017 were considered for movie lifetime prediction. Booking behavior was observed to change within seasons across the years considered. The domain experts attributed this to the volatile nature of the data and further noted that this volatility has prevailed consistently over the last decade. The best way to model such data is therefore to use a uniform split across the 3 years for training and testing: the authors allocated 70% of the data from each year for training the model and used the rest as testing data. This way, the model captures the behavior across all the years considered.
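The per-year 70/30 split described above can be sketched as follows (hypothetical record layout; the `year_of` accessor is an assumption for illustration):

```python
import random

def uniform_yearly_split(records, year_of, train_frac=0.7, seed=42):
    # 70/30 split drawn uniformly within each year, so that train and
    # test both span every year of the volatile booking behavior.
    rng = random.Random(seed)
    train, test = [], []
    by_year = {}
    for rec in records:
        by_year.setdefault(year_of(rec), []).append(rec)
    for year in sorted(by_year):
        recs = by_year[year][:]
        rng.shuffle(recs)
        cut = int(len(recs) * train_frac)
        train.extend(recs[:cut])
        test.extend(recs[cut:])
    return train, test
```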
Classification metrics per model; each cell is: movie will not screen / movie will screen / mean.
Days  Metric     DNN                   XGB                   ET
1     Precision  0.97 / 0.91 / 0.93    0.99 / 0.84 / 0.89    1.00 / 0.90 / 0.93
1     Recall     0.78 / 0.99 / 0.93    0.55 / 1.00 / 0.87    0.74 / 1.00 / 0.92
1     F1 score   0.86 / 0.95 / 0.92    0.71 / 0.91 / 0.87    0.85 / 0.95 / 0.92
2     Precision  0.96 / 0.94 / 0.95    0.98 / 0.88 / 0.91    0.99 / 0.92 / 0.94
2     Recall     0.85 / 0.98 / 0.95    0.67 / 0.99 / 0.90    0.79 / 1.00 / 0.94
2     F1 score   0.90 / 0.96 / 0.94    0.80 / 0.93 / 0.89    0.88 / 0.96 / 0.93
3     Precision  0.95 / 0.95 / 0.95    0.96 / 0.91 / 0.93    0.98 / 0.93 / 0.95
3     Recall     0.88 / 0.98 / 0.95    0.77 / 0.99 / 0.92    0.82 / 0.99 / 0.94
3     F1 score   0.91 / 0.96 / 0.95    0.86 / 0.95 / 0.92    0.90 / 0.96 / 0.94
4     Precision  0.93 / 0.95 / 0.95    0.94 / 0.93 / 0.93    0.98 / 0.94 / 0.95
4     Recall     0.88 / 0.97 / 0.95    0.82 / 0.98 / 0.93    0.86 / 0.99 / 0.95
4     F1 score   0.91 / 0.96 / 0.95    0.88 / 0.95 / 0.93    0.91 / 0.97 / 0.95
5     Precision  0.91 / 0.95 / 0.94    0.89 / 0.94 / 0.93    0.97 / 0.95 / 0.96
5     Recall     0.89 / 0.96 / 0.94    0.86 / 0.96 / 0.93    0.88 / 0.99 / 0.96
5     F1 score   0.90 / 0.96 / 0.94    0.88 / 0.95 / 0.93    0.92 / 0.97 / 0.95
6     Precision  0.88 / 0.96 / 0.94    0.82 / 0.95 / 0.91    0.96 / 0.96 / 0.96
6     Recall     0.90 / 0.95 / 0.93    0.89 / 0.92 / 0.91    0.91 / 0.98 / 0.96
6     F1 score   0.89 / 0.95 / 0.93    0.85 / 0.93 / 0.91    0.93 / 0.97 / 0.96
7     Precision  0.77 / 0.97 / 0.91    0.77 / 0.96 / 0.90    0.88 / 0.97 / 0.95
7     Recall     0.90 / 0.91 / 0.90    0.90 / 0.89 / 0.89    0.93 / 0.95 / 0.94
7     F1 score   0.90 / 0.88 / 0.90    0.83 / 0.92 / 0.89    0.91 / 0.96 / 0.94
Regression. Since the regressors forecast the number of weeks a movie will
continue to screen, the error in lifetime is measured in terms of weeks. The
Movie Lifetime Error (MLE) is calculated as shown in Eq. 3.
MLE (weeks)   DNN (%)   XGB (%)   ET (%)
[0, 1)        64.85     59.00     55.16
[1, 2)        19.10     22.48     28.44
[2, 3)         7.82     10.71     10.58
[3, 4)         3.45      4.10      3.17
>= 4           4.77      3.70      2.65
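Assuming MLE is the absolute difference in weeks between the predicted and actual lifetime (a reading consistent with the bucket table above; Eq. 3 itself is not reproduced here), the bucketed distribution can be sketched as:

```python
def mle_distribution(actual_weeks, predicted_weeks):
    # Percentage of movies falling into each absolute-error bucket.
    labels = ["[0,1)", "[1,2)", "[2,3)", "[3,4)", ">=4"]
    counts = dict.fromkeys(labels, 0)
    for a, p in zip(actual_weeks, predicted_weeks):
        err = abs(p - a)
        # int(err) floors, so e.g. err = 2.5 lands in "[2,3)"; cap at ">=4"
        counts[labels[min(int(err), 4)]] += 1
    n = len(actual_weeks)
    return {k: 100.0 * c / n for k, c in counts.items()}
```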
7 Conclusion
Of the two approaches considered, the Two-Model Approach (Sect. 5.4) provides the highest accuracy and is the optimal solution for the two business use cases discussed. The standalone ET classifier performs better than the regressors transformed to do the classification task. The model provides an accuracy of 97% for the classification use case, which helps the multiplex accurately schedule movies on a weekly basis. The regression use case is handled by the DNN due to its superior performance, as shown in Sect. 6.2: the DNN is more robust to outliers and is able to capture the non-linear trends present in the data.
Currently, the multiplex scheduling experts consider the admissible range of error to be within 2 weeks. They estimate the lifetime of films, and which films will screen the next week, using empirical methods and heuristics. Our approach estimates the remaining lifetime with an MLE of less than 2 weeks 85% of the time. These results set a benchmark for domain experts regarding lifetime estimation. Since our method is the first of its kind, it will be tested in real-world circumstances in the next revision of strategies by the multiplex considered.
References
1. Deloitte: Economic Contribution of the Film and Television Industry in India,
2017 (2018). Accessed from https://www.mpa-i.org/wp-content/uploads/2018/
05/India-ECR-2017 Final-Report.pdf
2. Eliashberg, J., Hegie, Q., Ho, J., Huisman, D., Miller, S.J., Swami, S., Weinberg,
C.B., Wierenga, B.: Demand-driven scheduling of movies in a multiplex. Int. J.
Res. Mark. 26(2), 75–88 (2009). ISSN 0167-8116
3. Sivasantoshreddy, A., Kasat, P., Jain, A.: Box-Office opening prediction of movies
based on hype analysis through data mining. Int. J. Comput. Appl. 56(1), 1–5
(2012)
4. Sharda, R., Delen, D.: Predicting box-office success of motion pictures with neural
networks. Exp. Syst. Appl. 30(2), 243–254 (2006)
5. Jaiswal, S.R., Sharma, D.: Predicting success of bollywood movies using machine
learning techniques. In: Proceedings of the 10th Annual ACM India Compute Con-
ference (2017)
6. Ainslie, A., Drèze, X., Zufryden, F.: Modeling movie life cycles and market share.
Mark. Sci. 24(3), 508–517 (2005)
7. Ganesan, V.A., Divi, S., Moudhgalya, N.B., Sriharsha, U., Vijayaraghavan, V.: Forecasting food sales in a multiplex using dynamic artificial neural networks. In: Arai, K., Kapoor, S. (eds.) Advances in Computer Vision. CVC 2019. Advances in Intelligent Systems and Computing, vol. 944. Springer, Cham (2020)
8. Huang, L., Liu, X., Liu, Y., Lang, B., Tao, D.: Centered Weight Normalization
in Accelerating Training of Deep Neural Networks. In: 2017 IEEE International
Conference on Computer Vision (ICCV), Venice, pp. 2822–2830 (2017)
9. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn.
Res. 15(1), 1929–1958 (2014). https://dl.acm.org/citation.cfm?id=2670313
10. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006). https://link.springer.com/article/10.1007/s10994-006-6226-1
A Cost-Reducing Partial Labeling
Estimator in Text Classification Problem
Jiangning Chen, Zhibo Dai(B) , Juntao Duan, Qianli Hu, Ruilin Li,
Heinrich Matzinger, Ionel Popescu, and Haoyan Zhai
Abstract. The paper proposes a new approach to text classification problems where learning with partial labels is beneficial. Instead of offering each training sample a set of candidate labels, researchers assign negative-oriented labels to ambiguous training examples that are unlikely to fall into certain classes. Researchers construct two new maximum likelihood estimators with a self-correction property, and prove that under some conditions the new estimators converge faster. The paper also discusses the advantages of applying one of the new estimators to a fully supervised learning problem. The proposed method has potential applicability in many areas, such as crowd-sourcing, natural language processing and medical image analysis.
1 Introduction
2 Related Work
The text classification problem seeks the best way to distinguish different types of documents [5,12,25]. As a traditional natural language processing problem, it requires making full use of the words and sentences, converting them into various input features, and applying different models for training and testing. A common way to convert words into features is to encode them based
on the term frequency and inverse document frequency, as well as on the sequence of the words. There are many results about this; for example, tf-idf [19] encodes term t in document d of corpus D as
tfidf(t, d, D) = tf(t, d) · idf(t, D),
where the term frequency is tf(t, d) = |{t : t ∈ d}| / |d| and the inverse document frequency is idf(t, D) = log(|D| / |{d ∈ D : t ∈ d}|). We also have n-gram techniques, which first combine the n nearest words into a single term and then encode it with tf-idf. Recently, instead of using tf-idf, [21] defined a new feature selection score for text classification based on the KL-divergence between the distribution of words in training documents and their classes.
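The tf-idf formulas above translate directly into code; a minimal sketch over tokenized documents (assuming the queried term occurs in at least one document, so idf is defined):

```python
import math

def tf(term, doc):
    # term frequency: occurrences of `term` in `doc` over document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # inverse document frequency over the corpus (a list of token lists)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)
```

A term appearing in every document gets idf 0, so its tf-idf vanishes, which is exactly the intended down-weighting of uninformative words.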
A popular model for this task is the Naive Bayes model [6,11,20]; the label for a given document d is given by
label(d) = argmax_j P(C_j) P(d | C_j),
where C_j is the j-th class. For example, we can treat each class as a multinomial distribution whose corresponding documents are samples generated by that distribution. Under this assumption, we seek the centroid of every class, either by maximizing the likelihood function or by defining other objective functions [2], in both supervised and unsupervised versions [7]. Although the assumption is not exact for this task, Naive Bayes achieves high accuracy in practical problems.
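The argmax rule above can be sketched in log space, which avoids underflow from multiplying many small probabilities (hypothetical toy parameters; smoothing is omitted):

```python
import math

def nb_label(doc, priors, theta):
    # label(d) = argmax_j P(C_j) * prod_{w in d} theta_{j,w}, in log space.
    # priors: class -> P(C_j); theta: class -> {word: probability}.
    def log_posterior(c):
        return math.log(priors[c]) + sum(math.log(theta[c][w]) for w in doc)
    return max(priors, key=log_posterior)
```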
There are also other approaches to this problem, one of which is simply finding linear class boundaries with a support vector machine [3,9]. Recurrent Neural Networks (RNNs) [15,23] combined with word embeddings are also widely used for this problem.
In real life, one may have different types of labels [14], in which case semi-supervised learning or partial-label problems need to be considered [4]. There are several methods to encode partial label information into the learning framework. For a partially labeled dataset, one can define a new loss combining the information of all possible labels; for example, in [17] the authors modify the traditional L2 loss to
L(w) = (1/(n + m)) [ Σ_{i=1}^{n} l(x_i, y_i, w) + Σ_{i=1}^{m} l(x_i, Y_i, w) ],
where Y_i is the possible label set for x_i and l(x_i, Y_i, w) is a non-negative loss function, and in [4] a convex loss for partial labels is defined as
L_Ψ(g(x), y) = Ψ( (1/|y|) Σ_{a∈y} g_a(x) ) + Σ_{a∉y} Ψ(−g_a(x)).
approach to this problem, and [8] gives the following optimization problem using the Naive Bayes method:
θ* = argmax_θ Π_i Σ_{y_i ∈ S_i} p(y_i | x_i, θ)
3 Formulation
for y, where θ = {θ_1, θ_2, ..., θ_m} is the parameter and f_i(x; θ) is the likelihood function of sample x in class C_i. Now assume that the training set contains two types of data, S_1 and S_2, such that S = S_1 ∪ S_2:
To build the model, researchers define the following likelihood ratio function and likelihood function:
L_1(θ) = Π_{x∈S_1} Π_{i=1}^{k} f_i(x; θ)^{y_i} · Π_{x∈S_2} Π_{i=1}^{k} f_i(x; θ)^{(1−z_i)/(k−Σ_{j≠i} z_j)}, (3.1)
and
L_2(θ) = Π_{x∈S} Π_{i=1}^{k} f_i(x; θ)^{y_i(x)+t−z_i(x)}. (3.2)
A sample labeled z_i = 1 carries negative evidence for class C_i, so its factor is placed in the denominator. With t > 1, all the terms in the denominator are eventually canceled out, so that even f_i(x; θ) = 0 for some sample x ∈ S causes no trouble. Another intuition for L_2 is that it can self-correct repeated data that has been labeled incorrectly.
Taking the logarithm of both sides, we obtain the following functions:
log L_1(θ) = Σ_{x∈S_1} Σ_{i=1}^{k} y_i(x) log f_i(x, θ) + Σ_{x∈S_2} Σ_{i=1}^{k} [(1 − z_i)/(k − Σ_{j≠i} z_j)] log f_i(x, θ), (3.3)
and
log L_2(θ) = Σ_{x∈S} Σ_{i=1}^{k} (y_i(x) + t − z_i(x)) log f_i(x, θ). (3.4)
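Computationally, the log-likelihood (3.4) is just a weighted sum of per-class log-likelihoods with weights y_i(x) + t − z_i(x); a sketch with an illustrative data layout (not the authors' code):

```python
def log_l2(log_f, y, z, t):
    # log L2 (Eq. 3.4): sum over samples x and classes i of
    # (y_i(x) + t - z_i(x)) * log f_i(x, theta).
    # log_f[x][i] = log f_i(x, theta); y and z are 0/1 label indicators.
    return sum(
        (y[x][i] + t - z[x][i]) * log_f[x][i]
        for x in range(len(log_f))
        for i in range(len(log_f[x]))
    )
```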
For the Naive Bayes model, let class C_i have centroid θ_i = (θ_{i1}, θ_{i2}, ..., θ_{iv}), where v is the total number of words and θ_i satisfies Σ_{j=1}^{v} θ_{ij} = 1. Assuming independence of the words, the most likely class for a document d = (x_1, x_2, ..., x_v) is computed from the class log-likelihoods. So we have:
log f_i(d, θ) = log P(C_i) + Σ_{j=1}^{v} x_j log θ_{ij},
subject to:
Σ_{j=1}^{v} θ_{ij} = 1, θ_{ij} ≥ 0. (3.9)
The problem (3.8) can be solved explicitly with (3.7) by a Lagrange multiplier; for class C_i we have θ_i = {θ_{i1}, θ_{i2}, ..., θ_{iv}}, where
θ̂_{ij} = (Σ_{d∈C_i} x_j) / (Σ_{d∈C_i} Σ_{j=1}^{v} x_j). (3.10)
Theorem 1. The estimator θ̂_{ij} satisfies the following properties:
1. θ̂_{ij} is unbiased.
2. E[|θ̂_{ij} − θ_{ij}|²] = θ_{ij}(1 − θ_{ij}) / (|C_i| m).
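The closed-form estimator (3.10) amounts to normalized word counts within a class; a minimal sketch on count vectors (hypothetical helper name, no smoothing):

```python
def nb_centroid(class_docs):
    # theta_hat_i (Eq. 3.10): word j's share of all word occurrences
    # in the documents of class C_i; each row is a word-count vector.
    v = len(class_docs[0])
    word_totals = [sum(doc[j] for doc in class_docs) for j in range(v)]
    grand_total = sum(word_totals)
    return [w / grand_total for w in word_totals]
```

The returned vector sums to 1, matching the simplex constraint in (3.9).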
4 Main Results
From Theorem 1, we can see that the traditional Naive Bayes estimator θ̂ is an unbiased estimator with variance O(θ_{ij}(1 − θ_{ij})/(|C_i| m)). Researchers now solve for the new estimators and prove that the new estimators can use the data in dataset S_2 and perform better than the traditional Naive Bayes estimator.
Thus,
E[θ̂_{ij}^{L_1} − θ_{ij}] = |R_i| K (l_{ij} − θ_{ij}) / (|C_i| + |R_i| K).
2. As for the second part, we have
(θ̂_{ij}^{L_1})² = [1 / ((|C_i| + |R_i| K)² m²)] [ Σ_{α,β∈S_1} y_i(α) y_i(β) α_j β_j + Σ_{α,β∈S_2} Z_i(α) Z_i(β) α_j β_j + 2 Σ_{α∈S_1, β∈S_2} y_i(α) Z_i(β) α_j β_j ].
Then, by introducing C = (|C_i| + |R_i| K) m and L_{ij} = E[x_j² | Z_i(x) = K], it is true that
E[(θ̂_{ij}^{L_1})²] = (1/C²) [ Σ_{x∈S_1} E[y_i²(x) x_j²] + Σ_{α,β∈S_1, α≠β} E[y_i(α) α_j] E[y_i(β) β_j] + Σ_{x∈S_2} E[Z_i²(x) x_j²] + Σ_{α,β∈S_2, α≠β} E[Z_i(α) α_j] E[Z_i(β) β_j] + 2 Σ_{α∈S_1, β∈S_2} E[y_i(α) α_j] E[Z_i(β) β_j] ]
= (1/C²) [ |C_i| m θ_{ij}(1 − θ_{ij} + m θ_{ij}) + (|S_1|² − |S_1|) p_i² m² θ_{ij}² + |R_i| K² L_{ij} + (|S_2|² − |S_2|) K² q_i² m² l_{ij}² + 2 |C_i| |R_i| m² K θ_{ij} l_{ij} ]
= (1/C²) [ |C_i| m θ_{ij}(1 − θ_{ij} + m θ_{ij}) − |S_1| p_i² m² θ_{ij}² + |R_i| K² L_{ij} − |S_2| K² q_i² m² l_{ij}² ] + [ (|C_i| θ_{ij} m + |R_i| K l_{ij} m) / ((|C_i| + |R_i| K) m) ]².
Using the fact that E[(θ̂_{ij}^{L_1} − E[θ̂_{ij}^{L_1}])²] = E[(θ̂_{ij}^{L_1})²] − (E[θ̂_{ij}^{L_1}])², we can conclude that
E[(θ̂_{ij}^{L_1} − E[θ̂_{ij}^{L_1}])²] = [1 / ((|C_i| + |R_i| K)² m²)] [ |C_i| m θ_{ij}(1 − θ_{ij} + m θ_{ij}) − |S_1| p_i² m² θ_{ij}² + |R_i| K² L_{ij} − |S_2| K² q_i² m² l_{ij}² ]
= O(1 / (|S_1| + |S_2|)).
Comparing θ̂_{ij} and θ̂_{ij}^{L_1}, we can see that even though our estimator is biased, the variance of θ̂_{ij}^{L_1} is significantly smaller than the variance of θ̂_{ij}, which means that by using the negative sample set, θ̂_{ij}^{L_1} converges much faster than the original Naive Bayes estimator θ̂_{ij}.
Another way to use both the S_1 and S_2 datasets is to solve (3.8) with L(θ) = L_2(θ), where L_2 is defined in (3.2). Let
G_i = 1 − Σ_{j=1}^{v} θ_{ij},
subject to
Σ_{j=1}^{v} θ_{ij} = 1, ∀ 1 ≤ i ≤ k. (4.3)
Theorem 3. Assume the length of each document is normalized, that is, Σ_{j=1}^{v} x_j = m for all d. Assume the negative label has only one entry equal to 1, namely Σ_i z_i(x) = 1, ∀ x ∈ S_2. Let |C_i| denote the number of documents in class i and |D_i| the number of documents labeled not in class i, with p_i = |C_i|/|S| and q_i = |D_i|/|S|. Further, we assume that if a document x is labeled not in class i, it has equal probability of being in any other class. Then the estimator (4.4) satisfies the following properties:
E[θ̂_{ij}^{L_2}] = ( Σ_{x∈S} (y_i(x) + t − z_i(x)) E[x_j] ) / N
= ( t Σ_{x∈S_1} E[x_j] + t Σ_{x∈S_2} E[x_j] + m|C_i| θ_{ij} − m|D_i| (Σ_{l≠i} θ_{lj})/(k−1) ) / N
= ( t Σ_{l=1}^{k} |C_l| θ_{lj} + |C_i| θ_{ij} + t Σ_{l=1}^{k} |D_l| (Σ_{r≠l} θ_{rj})/(k−1) − |D_i| (Σ_{l≠i} θ_{lj})/(k−1) ) / (|C_i| − |D_i| + t|S|)
= ( t Σ_{l=1}^{k} p_l θ_{lj} + p_i θ_{ij} + t Σ_{l=1}^{k} q_l (Σ_{r≠l} θ_{rj})/(k−1) − q_i (Σ_{l≠i} θ_{lj})/(k−1) ) / (p_i − q_i + t).
Therefore, we can compute the bias:
E[θ̂_{ij}^{L_2} − θ_{ij}] = E[θ̂_{ij}^{L_2}] − (p_i − q_i + t) θ_{ij} / (p_i − q_i + t)
= ( t Σ_{l=1}^{k} p_l θ_{lj} + t Σ_{l=1}^{k} q_l (Σ_{r≠l} θ_{rj})/(k−1) − t θ_{ij} − q_i ((Σ_{l≠i} θ_{lj})/(k−1) − θ_{ij}) ) / (p_i − q_i + t) (4.5)
= O( (t + q_i) / (t + p_i − q_i) ).
where var(x_j) = E[(x_j − E[x_j])²]. For the contribution of S_1,
V_1 = |C_i|(1 + 2t) m θ_{ij}(1 − θ_{ij}) + Σ_{l=1}^{k} |C_l| t² m θ_{lj}(1 − θ_{lj}) = O(|C_i|(1 + 2t) m) + O(|S_1| t² m),
and for the contribution of S_2,
V_2 = Σ_{x∈D_i} (t − 1)² var(x_j, x ∈ D_i) + Σ_{l≠i} Σ_{x∈D_l} t² var(x_j, x ∈ D_l)
= Σ_{x∈D_i} (1 − 2t) var(x_j, x ∈ D_i) + Σ_{l=1}^{k} Σ_{x∈D_l} t² var(x_j, x ∈ D_l)
= ((1 − 2t)|D_i| / (k − 1)) Σ_{r≠i} m θ_{rj}(1 − θ_{rj}) + (t² / (k − 1)) Σ_{l=1}^{k} |D_l| Σ_{r≠l} m θ_{rj}(1 − θ_{rj}).
Hence the variance is
O( (V_1 + V_2) / (m² (p_i − q_i + t)² |S|²) )
= O( [ (1 + 2t) p_i + (|S_1|/|S|) t² + (1 − 2t) q_i + (|S_2|/|S|) t² ] / (m (p_i − q_i + t)² |S|) )
= O( ( (1 + 2t) p_i + (1 − 2t) q_i + t² ) / (m (p_i − q_i + t)² |S|) ).
Using the same strategy as in part 1, the first part of our variance estimate is of order O(1/(m|S|)), which is less than the order of the variance of the Naive Bayes estimator, O(1/|C_i|). We also showed that its order is O(1/(|S_1| + |S_2|)) < O(1/|C_i|); therefore, θ̂_{ij}^{L_2} converges faster than θ̂_{ij}.
5 Experiment
Researchers applied the new methods to the top 10 topics of single-labeled documents in the Reuters-21578 data [13] and the 20 Newsgroups data [10]. Researchers compared the results of the traditional Naive Bayes estimator θ̂_{ij} and the new estimators θ̂_{ij}^{L_1}, θ̂_{ij}^{L_2},
6 Conclusion
This paper has presented an effective learning approach with a new labeling method for partially labeled document data, where for some samples we only know that the sample surely does not belong to certain classes. Researchers encode these labels as y_i or z_i, and define the maximum likelihood estimators θ̂_{ij}^{L_1}, θ̂_{ij}^{L_2}, as well as θ̂_{ij}^{*}, for the multinomial Naive Bayes model based on L_1 and L_2. There are several further questions about these estimators:
1. We have proved that, under the multinomial Naive Bayes model, our estimators have smaller variance, which means they converge to the true parameters faster than the standard Naive Bayes estimator. An interesting question is the following: if we consider a more general situation, without the text classification background and the multinomial assumption, and solve the optimization problem (3.8) with L_1 and L_2, can we reach the same conclusion for a more general likelihood function f_i? If not, what assumptions on f_i are needed to land on a similar conclusion?
2. The effectiveness of a machine learning algorithm depends heavily on well-labeled training samples; to some extent, our new estimators can utilize incorrectly labeled or differently labeled data. Our estimator, especially the one based on $L_2$, can alleviate this problem (3.2), since incorrectly labeled data can be canceled out by correctly labeled data, so partially labeled data can still contribute. Another question is: besides $\hat\theta_{ij}^{L_1}$ and $\hat\theta_{ij}^{L_2}$, can we find other, or even better, estimators satisfying this property?
3. Based on our experiments, the traditional Naive Bayes estimator performs almost perfectly on the training set as well as during cross-validation, but its accuracy on the testing set is not ideal. To quantify this observation, we are still working on a valid justification that the traditional Naive Bayes estimator suffers from a severe over-fitting problem in the training stage.
A Proof of Theorem 1
Proof. With the assumption $\sum_{j=1}^{v} x_j = m$, we can rewrite (3.10) as:

$$\hat\theta_{ij} = \frac{\sum_{d\in C_i} x_j}{\sum_{d\in C_i} m} = \frac{\sum_{d\in C_i} x_j}{|C_i|\,m}.$$

Since $d = (x_1, x_2, \ldots, x_v)$ follows a multinomial distribution, with $d$ in class $C_i$ we have $E[x_j] = m\theta_{ij}$ and $E[x_j^2] = m\theta_{ij}(1-\theta_{ij}+m\theta_{ij})$.

1.
$$E[\hat\theta_{ij}] = E\!\left[\frac{\sum_{d\in C_i} x_j}{|C_i|m}\right] = \frac{\sum_{d\in C_i} E[x_j]}{|C_i|m} = \frac{\sum_{d\in C_i} m\theta_{ij}}{|C_i|m} = \theta_{ij}.$$
Thus $\hat\theta_{ij}$ is unbiased.

2. By (1), we have:
$$E[|\hat\theta_{ij}-\theta_{ij}|^2] = E[\hat\theta_{ij}^2] - 2\theta_{ij}E[\hat\theta_{ij}] + \theta_{ij}^2 = E[\hat\theta_{ij}^2] - \theta_{ij}^2.$$
Then
$$\hat\theta_{ij}^2 = \frac{\left(\sum_{d\in C_i} x_j\right)^2}{|C_i|^2 m^2} = \frac{\sum_{d\in C_i} x_j^2 + \sum_{d_1,d_2\in C_i} 2x_j^{d_1}x_j^{d_2}}{|C_i|^2 m^2}, \qquad (A.1)$$
where $d_i = (x_1^{d_i}, x_2^{d_i}, \ldots, x_v^{d_i})$ for $i = 1, 2$. Since:
$$E\!\left[\frac{\sum_{d\in C_i} x_j^2}{|C_i|^2 m^2}\right] = \frac{|C_i|\,m\theta_{ij}(1-\theta_{ij}+m\theta_{ij})}{|C_i|^2 m^2} = \frac{\theta_{ij}(1-\theta_{ij}+m\theta_{ij})}{|C_i|\,m},$$
and
$$E\!\left[\frac{\sum_{d_1,d_2\in C_i} 2x_j^{d_1}x_j^{d_2}}{|C_i|^2 m^2}\right] = \frac{|C_i|(|C_i|-1)m^2\theta_{ij}^2}{|C_i|^2 m^2} = \frac{(|C_i|-1)\theta_{ij}^2}{|C_i|}.$$
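The two multinomial moments used in the proof, $E[x_j] = m\theta_{ij}$ and $E[x_j^2] = m\theta_{ij}(1-\theta_{ij}+m\theta_{ij})$, can be sanity-checked by simulation ($m = 40$ and $\theta = 0.3$ are arbitrary assumed values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

m, theta_j = 40, 0.3
# Marginally, the count of word j in a multinomial document of length m
# is Binomial(m, theta_j), so we can sample it directly.
xj = rng.binomial(m, theta_j, size=200_000)

print(xj.mean())         # should be close to m * theta_j = 12
print((xj ** 2).mean())  # should be close to m*theta_j*(1 - theta_j + m*theta_j) = 152.4
```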
B Figures
See Figs. 1, 2, 3 and 4.
[Fig. 1 plot panels: (a) “training set = negative set = 10%, behavior in Reuter data”; (b) “training set = negative set = 10%, behavior in 20 news group”.]
Fig. 1. We take the 10 largest groups in the Reuters-21578 dataset (a) and the 20 Newsgroups dataset (b), and use 20% of the data as the training set, among which |S1| = |S2|. The y-axis is the accuracy, and the x-axis is the class index.
[Fig. 2 plot panels: (a) “training with only negative set = 90%, behavior in Reuter data”; (b) “training with only negative set = 90%, behavior in 20 news group”.]
Fig. 2. We take the 10 largest groups in the Reuters-21578 dataset (a) and the 20 Newsgroups dataset (b), and use 90% of the data as the S2 training set. The y-axis is the accuracy, and the x-axis is the class index.
Cost-Reducing Partial Labelling Estimator 509
[Fig. 3 plot panels: (a) “training set 10% behavior in Reuter data”; (b) “training set 10% behavior in 20 news group”.]
Fig. 3. We take the 10 largest groups in the Reuters-21578 dataset (a) and the 20 Newsgroups dataset (b), and use 10% of the data as the S1 training set. The y-axis is the accuracy, and the x-axis is the class index.
[Fig. 4 plot panels: (a) “testing set = training set, training set 10% behavior in Reuter data”; (b) “testing set = training set, training set 10% behavior in 20 news group data”.]
Fig. 4. We take the 10 largest groups in the Reuters-21578 dataset (a) and the 20 Newsgroups dataset (b), and use 10% of the data as the S1 training set. We test the result on the training set. The y-axis is the accuracy, and the x-axis is the class index.
References
1. Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classifi-
cation. Pattern Recogn. 37(9), 1757–1771 (2004)
2. Chen, J., Matzinger, H., Zhai, H., Zhou, M.: Centroid estimation based on sym-
metric KL divergence for multinomial text classification problem. arXiv preprint
arXiv:1808.10261 (2018)
3. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297
(1995)
4. Cour, T., Sapp, B., Taskar, B.: Learning from partial labels. J. Mach. Learn. Res.
12, 1501–1536 (2011)
5. Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms
and representations for text categorization. In: Proceedings of the Seventh Interna-
tional Conference on Information and Knowledge Management, pp. 148–155. ACM
(1998)
6. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach.
Learn. 29(2–3), 131–163 (1997)
7. Hofmann, T.: Probabilistic latent semantic analysis. In: Proceedings of the Fif-
teenth Conference on Uncertainty in Artificial Intelligence, pp. 289–296. Morgan
Kaufmann Publishers Inc. (1999)
8. Jin, R., Ghahramani, Z.: Learning with multiple labels. In: Advances in Neural
Information Processing Systems, pp. 921–928 (2003)
9. Joachims, T.: Text categorization with support vector machines: learning with
many relevant features. In: European Conference on Machine Learning, pp. 137–
142. Springer (1998)
10. Lang, K.: 20 newsgroups data set
11. Langley, P., Iba, W., Thompson, K., et al.: An analysis of Bayesian classifiers. In:
AAAI, vol. 90, pp. 223–228 (1992)
12. Larkey, L.S.: Automatic essay grading using text categorization techniques. In:
Proceedings of the 21st Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 90–95. ACM (1998)
13. Lewis, D.D.: Reuters-21578
14. Li, X., Liu, B.: Learning to classify texts using positive and unlabeled data. In:
IJCAI, vol. 3, pp. 587–592 (2003)
15. Liu, P., Qiu, X., Huang, X.: Recurrent neural network for text classification with
multi-task learning. arXiv preprint arXiv:1605.05101 (2016)
16. McCallum, A.: Multi-label text classification with a mixture model trained by EM.
In: AAAI Workshop on Text Learning, pp. 1–7 (1999)
17. Nguyen, N., Caruana, R.: Classification with partial labels. In: Proceedings of the
14th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 551–559. ACM (2008)
18. Belongie, S., Welinder, P., Branson, S., Perona, P.: The multidimensional wisdom
of crowds. In: Advances in Neural Information Processing Systems, pp. 2424–2432
(2010)
19. Ramos, J., et al.: Using TF-IDF to determine word relevance in document queries.
In: Proceedings of the First Instructional Conference on Machine Learning, vol.
242, pp. 133–142 (2003)
20. Rish, I., et al.: An empirical study of the Naive Bayes classifier. In: IJCAI 2001
Workshop on Empirical Methods in Artificial Intelligence, vol. 3, pp. 41–46. IBM,
New York (2001)
21. Schneider, K.-M.: A new feature selection score for multinomial Naive Bayes text
classification based on KL-divergence. In: Proceedings of the ACL 2004 on Inter-
active Poster and Demonstration Sessions, p. 24. Association for Computational
Linguistics (2004)
22. Sheng, V.S., Provost, F., Ipeirotis, P.G.: Get another label? Improving data quality
and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.
614–622. ACM (2008)
23. Tang, D., Qin, B., Liu, T.: Document modeling with gated recurrent neural network
for sentiment classification. In: Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, pp. 1422–1432 (2015)
24. Zhang, M.-L., Zhou, B.-B., Liu, X.-Y.: Partial label learning via feature-aware dis-
ambiguation. In: Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 1335–1344. ACM (2016)
25. Zhang, M.-L., Zhou, Z.-H.: ML-KNN: a lazy learning approach to multi-label learn-
ing. Pattern Recogn. 40(7), 2038–2048 (2007)
26. Zhang, M.-L., Zhou, Z.-H.: A review on multi-label learning algorithms. IEEE
Trans. Knowl. Data Eng. 26(8), 1819–1837 (2014)
Unsupervised Cross-Lingual Mapping
for Phrase Embedding Spaces
1 Introduction
To address this issue, some previous work attempted to learn a shared phrase embedding space for two languages. For instance, bilingual phrase translation is used as supervision to learn a semantic phrase embedding space between two languages via a bilingually-constrained recursive auto-encoder [10]. This was later extended to learn the inner structures and correspondences within bilingual phrases [11]. However, these methods depend on parallel data, which is not available for most language pairs.
On the other hand, unsupervised cross-lingual embedding mapping models have shown an interesting mapping strategy [12–15]. These models take independently trained monolingual word embeddings of two languages and learn a linear transformation to map them into a shared embedding space. The key insight is that the monolingual word embedding graphs are approximately isomorphic [16]. Based on this, some recent methods explored adversarial training [12] or iterative self-learning [13] to obtain cross-lingual embeddings without any supervision signal. However, these methods are limited to word-level embeddings.
In this paper, we focus on unsupervised cross-lingual mapping to explore ways to learn cross-lingual phrase embeddings. We propose a three-step process. First, we preprocess the corpus to identify phrases based on their mutual information and then combine each phrase into a single token. This helps us effectively extract meaningful phrases from raw text. Second, on the preprocessed data we train a Word2Vec1 model for phrase embeddings. Finally, a fully unsupervised linear transformation based on self-learning is used to map the phrase embeddings into a shared space. The general framework of our method is shown in Fig. 1. Our main contributions are:
– Most unsupervised cross-lingual mapping work focuses on individual word embeddings. In this work, we propose to map phrase embedding spaces.
– Naively combining words into n-grams creates meaningless phrases, which in turn causes a data sparsity problem. To mitigate this, we adopt collocation extraction methods using mutual information for phrase identification.
– Moreover, we propose an unsupervised cross-lingual phrase embedding model that can serve as input for unsupervised statistical machine translation, specifically between low-resource languages.
The rest of this paper is organized as follows. In Sect. 2 we briefly review related work. Section 3 presents our proposed method, explaining each of the three steps in detail. The experimental settings and the results and discussion appear in Sects. 4 and 5, respectively. Finally, Sect. 6 presents our conclusions and future directions.
2 Related Work
Over time, researchers have investigated how to represent cross-lingual embeddings in a shared space using different methods. In 2013, [2] came
1 Word2Vec downloaded from: https://github.com/tmikolov/word2vec.
514 A. G. Ayana et al.
up with the notion that vector spaces can encode meaningful relations between words. They also noticed that the geometric relations that hold between words are similar across languages, which indicates the possibility of transforming one language's vector space into that of another via a linear transformation. They used a dictionary of five thousand words as supervision to learn this mapping.
Consequently, several studies aimed at improving these cross-lingual word embeddings. For instance, canonical correlation analysis [3] techniques learn a transformation matrix for each language. The transformation matrix of each language is learned separately to transform the embeddings into a different representation; the new representation can then be seen as a shared embedding space. Another method relies solely on word alignments, counting the number of times each word in the source language is aligned with each word in the target language in a parallel corpus [17]. Adversarial auto-encoders [14] were proposed to work like a game: a competition between the auto-encoder, trained to reconstruct the source embeddings, and the discriminator, trained to differentiate the projected source embeddings from the actual target embeddings. On the other hand, [4] proposed normalizing word embeddings during training to resolve the inconsistency between the embedding and the distance measurement. They make the inner product equivalent to cosine similarity and then constrain the transformation matrix to be orthogonal by solving a separate optimization problem. A generalized framework was also built on the idea of linear transformation, combining orthogonal transformation, normalization, and mean centering for cross-lingual mapping [18]. However, all the above methods rely on bilingual word lexicons.
Alternatively, recent works have proposed to minimize the amount of bilingual supervision, starting from a few dictionary entries [19] and iteratively using the embedding mapping to induce a new dictionary in a self-learning fashion. The self-learning method is able to start from a weak initial solution and iteratively improve the mapping. This method later became the basis for mapping without any cross-lingual guidance, by initializing a weak mapping and exploiting the structure of the embedding spaces through a self-learning approach [13]. Other work on fully unsupervised adversarial training [12,15] introduces an unsupervised selection metric that is highly correlated with mapping quality, used both as a stopping criterion and to select the best hyper-parameters. However, all these previous works are based on cross-lingual word embeddings, aiming to induce bilingual dictionaries between the languages. There have also been limited efforts on supervised bilingual phrase embeddings. Bilingually-constrained phrase embeddings [10] use a recursive auto-encoder (RAE) to learn semantic phrase embeddings that depend on bilingual phrase translation. On the other hand, [11] extends this work to capture both the inner structures and correspondences within bilingual phrases. In this way, they also learn a bilingual embedding space. However, both methods depend on bilingual supervision to learn semantic phrase embeddings.
Unsupervised Cross-Lingual Mapping 515
3 Proposed Method
In order to induce the bilingual phrase representations, we first have to represent the phrases of each language as vectors. To convert phrases into their corresponding vector representations, we first identify phrases in each sentence based on their normalized pointwise mutual information and combine them into a single token (3.1). Then we feed the phrase-tagged corpus into the Word2Vec model to independently learn the phrase embeddings (3.2). The monolingual phrase embeddings are then mapped into a shared space using an unsupervised linear transformation (3.3). The general framework of the proposed method is shown in Fig. 1.
Fig. 1. The general framework of our proposed method. L1 is the monolingual corpus for the source language and L2 is the monolingual corpus for the target language. The output of the unsupervised linear transformation is the cross-lingual phrase embeddings for both the source and target language phrases.
The value of the mutual information will be 1 if x and y occur together and 0 otherwise. To further strengthen the result, a concrete way to evaluate the dependence
between words x and y is called pointwise mutual information (PMI), given by:
$$PMI(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)} \qquad (2)$$
It is easy to see that when two words x and y often appear together but rarely alone, PMI(x, y) will have a high value, while it will be 0 if x and y are completely independent. While PMI is a good measure of the dependence between occurrences of x and y, it has no upper bound [20]. Without a fixed upper limit, we do not know how close a bigram is to perfect correlation. We want a measure that can be compared across all bigrams, so that we can choose only bigrams above a certain threshold; in particular, we want the measure to attain a maximum value of 1 for perfectly correlated words x and y. This is called normalized (pointwise) mutual information (NPMI), formally:
$$NPMI(x, y) = \frac{\log \frac{p(x,y)}{p(x)\,p(y)}}{-\log p(x, y)} \qquad (3)$$
NPMI is 1 when two words always occur together; it is 0 when they are distributed as expected under independence, since the numerator is then 0; finally, when two words occur separately but never together, we define NPMI to be −1, as it approaches this value when p(x, y) approaches 0 with p(x) and p(y) fixed. For comparison, the corresponding values for PMI are −ln p(x, y), 0, and −∞, respectively. Based on this, we prefer NPMI as a good score to identify meaningful bigrams in sentences.
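The boundary values above are easy to verify numerically; a minimal sketch (the function names are ours, not from the paper):

```python
import math

def pmi(p_xy, p_x, p_y):
    return math.log(p_xy / (p_x * p_y))

def npmi(p_xy, p_x, p_y):
    # +1 for perfect co-occurrence, 0 under independence,
    # and it tends to -1 as p(x, y) -> 0 with p(x), p(y) fixed.
    return pmi(p_xy, p_x, p_y) / (-math.log(p_xy))

# Perfect co-occurrence: the two words only ever appear together,
# so p(x, y) = p(x) = p(y).
print(npmi(0.01, 0.01, 0.01))         # close to 1.0
# Independence: p(x, y) = p(x) * p(y).
print(npmi(0.01 * 0.02, 0.01, 0.02))  # close to 0.0
```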
Now that we have a way to extract meaningful bigrams from our large monolingual corpus, we can replace bigrams with an NPMI above a certain threshold by a single unigram; for example, "computer science" is transformed into "computer_science". It is easy to create trigrams by running the process again (with a lower threshold) on the corpus already transformed with bigrams. Similarly, we can continue this process to longer n-grams with a decreasing threshold. Pseudocode for phrase identification is given in Algorithm 1.
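Algorithm 1 itself is not reproduced here, but one pass of the merging step it describes can be sketched as follows (the underscore join, helper names, and toy corpus are our own illustration, not the paper's code):

```python
from collections import Counter
from math import log

def merge_phrases(sentences, threshold):
    """One pass of NPMI-based bigram merging.

    sentences: list of token lists; adjacent bigrams scoring above
    `threshold` are joined with '_' into a single token, greedily.
    """
    unigrams, bigrams, total = Counter(), Counter(), 0
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
        total += len(sent)

    def npmi(x, y):
        p_xy = bigrams[(x, y)] / total
        p_x, p_y = unigrams[x] / total, unigrams[y] / total
        return log(p_xy / (p_x * p_y)) / -log(p_xy)

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and npmi(sent[i], sent[i + 1]) > threshold:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged

corpus = [
    ["computer", "science", "is", "hard"],
    ["computer", "science", "is", "fun"],
    ["fun", "is", "good"],
]
print(merge_phrases(corpus, threshold=0.7))
# -> [['computer_science', 'is', 'hard'], ['computer_science', 'is', 'fun'],
#     ['fun', 'is', 'good']]
```

On this toy corpus only "computer science", whose NPMI is 1.0 (the two words never occur apart), clears the 0.7 threshold; weaker pairs like "is hard" (NPMI ≈ 0.54) are left as separate tokens. Rerunning the pass on the merged output with a lower threshold yields trigrams, as described above.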
Once we have a monolingual corpus with identified phrases, we can use it to train a Word2Vec model for phrase embeddings. Word2Vec builds on the distributional hypothesis: given a large monolingual corpus, for each word we either predict it from its context (CBOW) or predict the context given the word (skip-gram). Word2Vec is a neural network with one hidden layer (of dimension d) and a negative-sampling or hierarchical-softmax objective [16]. In the training phase, we iterate through the tokens in the corpus and look at a window of size k.
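The (center, context) pairs that the skip-gram objective is trained on can be illustrated in a few lines (a simplified sketch; real implementations also subsample frequent words and shrink the window randomly):

```python
def skipgram_pairs(tokens, k=2):
    """Generate the (center, context) pairs used by the skip-gram objective:
    for each token, every other token within a window of size k is a context."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["we", "train", "phrase", "embeddings"], k=1))
# -> [('we', 'train'), ('train', 'we'), ('train', 'phrase'), ...]
```

After phrase merging, tokens such as "computer_science" simply participate in these pairs like any other word, which is how the model learns phrase vectors without changing the training objective.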
In our case, we use the skip-gram model with a dimension of 300, a window size of 5, negative sampling of 10, 5 iterations, and subsampling disabled as the parameters to train the model. Figure 2 shows similar geometric relations between numbers and animals in English and French. This suggests that it might be possible to transform one language's vector space into that of another simply by applying a linear transformation.
Suppose we are given a source phrase embedding matrix P and a target embedding matrix Q. The goal is to find linear transformation matrices W_p and W_q such that the products P·W_p and Q·W_q lie in a common embedding space. We also induce a phrase dictionary D under the assumption that if two phrases have equivalent embeddings, they are likely translations of each other. This assumption is used to initialize the model, which is later improved by a self-learning procedure [19]: the embeddings are first normalized in a preprocessing stage, then an initial solution is created without any supervision and improved iteratively. Based on this, we feed the two independently learned phrase embeddings into vecmap2 (an open-source framework by Artetxe et al. 2018a) to map the embeddings into a shared space without any supervision.
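vecmap's actual pipeline includes additional normalization and re-weighting steps, but the core alternation it relies on, an orthogonal (Procrustes) mapping step and a nearest-neighbour dictionary-induction step, can be sketched in numpy. This is a toy illustration under the assumption that the target space is an exact rotation of the source space; the function names are ours:

```python
import numpy as np

def orthogonal_map(P, Q, pairs):
    """Procrustes step: the orthogonal W minimizing ||P[i] @ W - Q[j]||
    over the current dictionary pairs (i, j), solved via SVD."""
    X = P[[i for i, _ in pairs]]
    Y = Q[[j for _, j in pairs]]
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def induce_dictionary(P, Q, W):
    """Self-learning step: each mapped source phrase is paired with its
    nearest target phrase by cosine similarity."""
    PW = P @ W
    PW = PW / np.linalg.norm(PW, axis=1, keepdims=True)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    return [(i, int(np.argmax(Qn @ v))) for i, v in enumerate(PW)]

# Toy check: the "target" space is an exact rotation of the source space.
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 3))                  # 6 source phrase vectors
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Q = P @ R                                    # target = rotated source
W = orthogonal_map(P, Q, pairs=[(0, 0), (1, 1), (2, 2)])  # small seed dictionary
for _ in range(2):                           # alternate the two steps
    W = orthogonal_map(P, Q, induce_dictionary(P, Q, W))
print(induce_dictionary(P, Q, W))            # ideally [(0, 0), (1, 1), ..., (5, 5)]
```

In this idealized setting the small seed dictionary already pins down the rotation, and the induced dictionary matches every phrase to its true counterpart; on real embeddings the spaces are only approximately isomorphic, which is why the iterative self-learning loop is needed.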
2 Vecmap downloaded from: https://github.com/artetxem/vecmap.
Fig. 2. An example phrase embedding for English (a) and French (b) phrases. We took translated sentences from the two languages; we can observe that their embeddings occupy almost the same geometric positions.
4 Experimental Settings
For this work, we used the openly available WMT14 News Crawl shared-task monolingual dataset to evaluate our model. To build our monolingual datasets for English, French, German, and Spanish, we performed some preprocessing. First, we used the Moses tokenizer to tokenize the texts, then cleaned the data and removed duplicated sentences. We also trained a truecaser using the Moses truecasing scripts and finally prepared a cleaned version of our monolingual corpus. Table 1 shows the size of the training monolingual corpus for each language after preprocessing.
using the online Google Translator, which was later used as a gold-standard phrase translation for evaluation purposes. Our test set contains a total of 3k phrase pairs: the 1000 most frequent unigrams, 1000 bigrams, and 1000 trigrams for each language pair.
Table 2. The average result of the proposed model versus different language pairs
Table 2 shows the results of our model on different language pairs. The results in this table are the performance of our model, trained on a purely monolingual dataset for all language pairs; each result in the table is the average evaluation result for the language pair. As one can observe from the table, with 40.7% coverage of EN-FR pairs, 85.41% coverage of EN-DE pairs, and 16.92% coverage of EN-ES pairs, our method achieved 35.62%, 27.10%, and 31.08% accuracy, respectively.
Given that phrase embeddings are more difficult than word embeddings, our average result of 35.62% for the EN-FR language pair is promising for phrase-based cross-lingual embeddings.
We also compared our proposed model with a model that uses the skip-gram model to learn word and phrase embeddings at the same time [8]. We used the openly available vecmap code to train both models without any supervision. Table 3 shows the results of the proposed method in comparison to this baseline.
Table 3. Comparison of our model's performance with the state-of-the-art model; in all cases English is used as the source language.
Based on the table, our proposed system obtained better results than the other model when trained on the large training corpus. Specifically, our proposed method achieved 4.10% higher accuracy than (Artetxe et al. 2018d) on EN-FR pairs, 0.08% on EN-DE pairs, and 0.98% on EN-ES pairs. Thus, the result we achieved using NPMI-based phrase identification succeeded in all language pairs, providing improved results for phrase-based translation.
Our result is also competitive even with the state-of-the-art unsupervised cross-lingual unigram embeddings (Artetxe et al. 2018a), which, for instance, achieved a 37.33% average result for EN-ES language pairs.
To show the role of NPMI-based phrase identification in the proposed method, we performed three separate experiments based on the bigram score used for phrase identification. The first is based on a simple word-frequency score (words that frequently appear together). The second is based on the PMI score before normalization, and the third uses our newly adopted NPMI-based phrase identification. We evaluated the models on English (EN), French (FR), German (DE), and Spanish (ES) monolingual datasets, with English used as the source language in all cases. Table 4 presents a comparison of the experiments performed using random phrase identification and NPMI-based phrase identification.
From this table, we can observe that the proposed method achieved far better results than random phrase identification. Specifically, our proposed method achieved 40.26% higher accuracy than the model based on random phrase extraction on EN-FR pairs, 23.55% on EN-DE pairs, and 39.57% on EN-ES pairs. These results confirm the relevance of using mutual information for phrase identification in phrase-based translation.
The word co-occurrence frequency-based approach is simple but performed worst in our evaluation. The PMI-based score for phrase identification performed better than the frequency-based score but worse than the NPMI-based score. NPMI is a more accurate co-occurrence score, with an upper bound for identifying bigrams, as stated in Sect. 3.1, and showed relatively better results in our evaluation. It also preserves the structure of the sentence while extracting meaningful phrases that are easy to find in the phrase dictionary. This helps us capture the true phrase representations across the languages, which also affects the quality of the cross-lingual phrase embeddings.
5.4 Example
Table 5. Some English phrases generated based on their NPMI score and their French translations. We also present the original translation from the dictionary entry for comparison.
6 Conclusions
We introduced a model that can represent cross-lingual phrase embeddings without any supervision. Our model uses a three-step process to achieve this goal. First, we identify and combine the most frequent n-grams in a sentence based on their mutual information; then we use the n-gram-tagged corpus to independently learn phrase embeddings for the two languages; finally, the embeddings are mapped into a shared space using a linear transformation and a self-learning approach. Our use of normalized pointwise mutual information as a phrase identification technique helps extract meaningful phrases from the corpus. The results showed that our method succeeded in all cases, providing promising results for phrase-based cross-lingual embeddings.
In the future, we plan to apply our model to unsupervised phrase-based machine translation. It is also important to test these findings on more language pairs.
References
1. Ruder, S., Vulić, I., Søgaard, A.: A survey of cross-lingual word embedding
models. Computing Research Repository abs/1706.0 (4304), no. 661–3 (2017).
arXiv:1706.04902
2. Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among lan-
guages for machine translation. arXiv:1309.4168v1. https://doi.org/10.1162/
153244303322533223
3. Faruqui, M., Dyer, C.: Improving vector space word representations using multilin-
gual correlation. In: Proceedings of the 14th Conference of the European Chapter
of the Association for Computational Linguistics, pp. 462–471. Association for
Computational Linguistics, Stroudsburg (2014). https://doi.org/10.3115/v1/E14-
1049. http://aclweb.org/anthology/E14-1049
4. Xing, C., Wang, D., Liu, C., Lin, Y.: Normalized word embedding and orthog-
onal transform for bilingual word translation. In: Proceedings of the 2015 Con-
ference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pp. 1006–1011. Association for Com-
putational Linguistics, Stroudsburg (2015). https://doi.org/10.3115/v1/N15-1104.
http://aclweb.org/anthology/N15-1104
5. Hermann, K.M., Blunsom, P.: Multilingual models for compositional distributed
semantics. In: Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics, pp. 58–68 (2014). arXiv:1404.4641
6. Vulic, I., Moens, M.F.: Bilingual distributed word representations from
document-aligned comparable data. J. Artif. Intell. Res. 55(2), 953–994 (2016).
arXiv:1509.07308v2
7. Vyas, Y., Carpuat, M.: Sparse bilingual word representations for cross-lingual
lexical entailment. In: Proceedings of the 2016 Conference of the North Amer-
ican Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, pp. 1187–1197. Association for Computational Linguistics,
Stroudsburg (2016). https://doi.org/10.18653/v1/N16-1142. http://aclweb.org/
anthology/N16-1142
Unsupervised Cross-Lingual Mapping 523
8. Artetxe, M., Labaka, G., Agirre, E.: Unsupervised statistical machine translation.
In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, pp. 3632–3642 (2018). arXiv:1809.01272
9. Lample, G., Ott, M., Conneau, A., Denoyer, L., Ranzato, M.: Phrase-based &
neural unsupervised machine translation. In: Emperical Methods for Natural Lan-
guage Processing, vol. 25, no. 6, pp. 1109–1112 (2018). arXiv:1804.07755. https://
doi.org/10.1053/j.jvca.2010.06.032. https://arxiv.org/pdf/1804.07755.pdf
10. Zhang, J., Liu, S., Li, M., Zhou, M., Zong, C.: Bilingually-constrained phrase
embeddings for machine translation. In: Proceedings of the 52nd Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), pp. 111–
121 (2014). https://doi.org/10.3115/v1/P14-1011. http://aclweb.org/anthology/
P14-1011
11. Su, J., Xiong, D., Zhang, B., Liu, Y., Yao, J., Zhang, M.: Bilingual correspondence
recursive autoencoder for statistical machine translation. In: Proceedings of the
2015 Conference on Empirical Methods in Natural Language Processing, pp. 1248–
1258. Association for Computational Linguistics, Stroudsburg (2015). https://doi.
org/10.18653/v1/D15-1146. http://aclweb.org/anthology/D15-1146
12. Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. In: ICLR Conference Paper (2018). arXiv:1710.04087
13. Artetxe, M., Labaka, G., Agirre, E.: A robust self-learning method for fully unsu-
pervised cross-lingual mappings of word embeddings. In: Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (Long Papers),
Melbourne, Australia, pp. 789–798 (2018). arXiv:1805.06297
14. Miceli Barone, A.V.: Towards cross-lingual distributed representations without
parallel text trained with adversarial autoencoders. In: Proceedings of the 1st
Workshop on Representation Learning for NLP, pp. 121–126. Association for Com-
putational Linguistics, Stroudsburg (2016). arXiv:1608.02996. https://doi.org/10.
18653/v1/W16-1614. http://aclweb.org/anthology/W16-1614
15. Zhang, M., Liu, Y., Luan, H., Sun, M.: Adversarial training for unsupervised bilin-
gual lexicon induction. In: Proceedings of the 55th Annual Meeting of the Asso-
ciation for Computational Linguistics (Volume 1: Long Papers), pp. 1959–1970.
Association for Computational Linguistics, Stroudsburg (2017). https://doi.org/
10.18653/v1/P17-1179. http://aclweb.org/anthology/P17-1179
16. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word rep-
resentations in vector space. arXiv preprint arXiv:1301.3781. https://doi.org/10.
1162/153244303322533223
17. Guo, J., Che, W., Yarowsky, D., Wang, H., Liu, T.: Cross-lingual dependency
parsing based on distributed representations. In: Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics and the 7th Interna-
tional Joint Conference on Natural Language Processing Volume 1, pp. 1234–1244
(2015). http://www.research.philips.com/publications/downloads/martin wilcox
thesis.pdf
18. Artetxe, M., Labaka, G., Agirre, E.: Generalizing and improving bilingual word
embedding mappings with a multi-step framework of linear transformations. In:
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018). Asso-
ciation for the Advancement of Artificial Intelligence (2018)
19. Artetxe, M., Labaka, G., Agirre, E.: Learning bilingual word embeddings with
(almost) no bilingual data. In: Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pp. 451–462.
Association for Computational Linguistics, Stroudsburg (2017). https://doi.org/
10.18653/v1/P17-1042. http://aclweb.org/anthology/P17-1042
20. Bouma, G.: Normalized (pointwise) mutual information in collocation extraction.
In: International Conference of the German Society for Computational Linguistics
and Language Technology, pp. 31–40 (2009)
Sidecar: Augmenting Word Embedding
Models with Expert Knowledge
1 Introduction
This work examines three widely used word embedding models, and explores the use of a secondary, smaller, domain-specific embedding to improve the representation of domain-specific text1.
Word embedding is a technique for representing words using numeric vec-
tors, which allows calculations to be performed on a “word” in comparison to
all other words. In a good embedding, those calculations will map to semantic
relationships; the canonical example being “king - man + woman = queen” [9].
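The analogy above can be sketched with a toy vocabulary of hand-made 3-dimensional vectors (an illustration only; real models learn vectors of length 100-300 from large corpora):

```python
import numpy as np

# Toy vocabulary: hand-made 3-d vectors chosen so the analogy holds.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
    "apple": np.array([0.7, 0.2, 0.3]),
}

def nearest(v, exclude=()):
    """Most cosine-similar vocabulary word to vector v."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(v, vocab[w]))

# king - man + woman lands nearest to queen.
target = vocab["king"] - vocab["man"] + vocab["woman"]
assert nearest(target, exclude={"king", "man", "woman"}) == "queen"
```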
Unfortunately, processing a large enough corpus to generate useful vectors is computationally expensive. Therefore, common practice for practical applications is to use a pre-trained word embedding model trained on a very large general-purpose corpus.
¹ The code and training data for the best performing approach described in this work (fastText pre-trained model with fastText-based custom embedding) are available at: https://github.com/lemay-ai/sidecar/.
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 525–540, 2020.
https://doi.org/10.1007/978-3-030-39442-4_39
² Google News Vectors Binary File (2019) for Word2Vec can be found at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit.
³ More information on this can be found at https://www.auditmap.ai/.
2 Background
While a medical student may be able to memorize the same information with the same degree of accuracy as another less educated individual, a medical expert is better at identifying and recalling the important pieces of information.
In [7], expertise is summarized as:
– extensive knowledge and experience that affect the perception of systems and
the organization of information;
– the ability to recognize a situation and efficiently recall the most appropriate
knowledge to solve a specific problem.
In this work, the focus is on the second point: the ability of an expert to
discriminate between situations that may appear similar on the surface. The
Cochran-Weiss-Shanteau (“CWS”) ratio [13] is a quantitative measure of this
discriminatory ability. Suppose that a candidate is given a series of classification
tasks. The CWS ratio is defined as:
CWS = Discrimination / Inconsistency
“Discrimination” is described as the “ability to differentiate between similar,
but not identical, cases”. “Consistency” is described as the ability of the can-
didate to “repeat their judgment in a similar situation”. “Inconsistency” is the
complement of consistency.
An example of applying CWS is useful for illustrating the advantage of this
metric. Consider the task of comparing the performance of two categorical classi-
fication models on a balanced dataset that contains an equal number of samples
from each of 5 categories. Tables 1 and 2 show example results for the two mod-
els and five categories. Each cell represents a test case. The value of each cell is
the predicted category for that test case. Each column represents the true cat-
egory for each test case. Each row represents a different test of the model, as is
commonly seen with k-fold cross validation when evaluating a machine learning
model. A perfect model would contain only 1s in the first column, only 2s in the
second column, and so forth.
Consider two cells containing the same predicted category. If they are in the
same column (“within-column match”) then the candidate replied with the same
response for the same category; the answer may not be accurate, but it is at least
consistent. If the matched cells are in different columns (“cross-column match”),
then the candidate failed to discriminate between the two categories. This type of
measurement is called measuring agreement or measuring inter-rater reliability,
and there are many statistical methods that could be used for this task [12].
This leads to the CWS ratio calculation of Algorithm 1. C(n, k) is the binomial coefficient n!/(k!(n − k)!), and UNIQUE_COUNTS() is a function that takes in a table and returns the set of categories that appear in the table, along with the total count of each category in the table.
Note that both candidate models had similar overall accuracy (15 out of 25 for
both models), recall (0.60 for both models), and precision (0.64 for Candidate 1
and 0.61 for Candidate 2), but Candidate 1 was more consistent in its classifica-
tion. Table 3 walks through the CWS ratio calculation, showing the intermediate
results of each step.
As expected, Candidate 1’s more consistent classification leads to a higher
CWS ratio. If accuracy, precision, and recall were the only metrics under con-
sideration, the two candidates would be indistinguishable. However, as demon-
strated, not all wrongness is measured equally.
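The pairwise counting behind the CWS ratio can be sketched as follows. The specific formulas for discrimination and inconsistency here are illustrative assumptions in the spirit of the algorithm above, not the authors' exact UNIQUE_COUNTS-based code:

```python
from itertools import combinations

def cws_ratio(table):
    """table[r][c] = predicted category; column c is the true category."""
    cells = [(r, c, table[r][c])
             for r in range(len(table)) for c in range(len(table[0]))]
    within = cross = within_total = cross_total = 0
    for (r1, c1, p1), (r2, c2, p2) in combinations(cells, 2):
        if c1 == c2:                 # same true category: consistency pair
            within_total += 1
            within += (p1 == p2)
        else:                        # different true categories
            cross_total += 1
            cross += (p1 == p2)      # same answer = failed discrimination
    discrimination = 1 - cross / cross_total
    inconsistency = 1 - within / within_total
    return float("inf") if inconsistency == 0 else discrimination / inconsistency

# A perfectly consistent candidate (identical rows) has zero inconsistency.
assert cws_ratio([[1, 2, 3], [1, 2, 3]]) == float("inf")
```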
3 Prior Art
Word embedding models are ultimately limited by the content of the training
corpus. Retraining a model is a computationally expensive endeavor, and so
related works have proposed a number of approaches for improving the contex-
tual map (and therefore the quality) of a pre-trained word embedding model
without retraining it entirely. The approach in [5] was to define a domain-specific model and then find a mapping that translates the pre-trained word vectors into the domain-specific vector space. Another approach
to retraining [16] is to combine several large pre-trained models into a single
ensemble model with a large, multilingual vocabulary. Similar to this work, [3] combined pre-trained and custom word vectors by concatenating them to create a richer vector representation. [17] used knowledge graphs to improve the quality of word embeddings.
Prior work [6] explored a method of concatenating equally-sized vectors from
a pre-trained embedding model and from a custom-trained embedding model
for a particular dataset. In particular, [6] investigated how this method would
improve understanding of a collection of adverse drug reaction tweets, as well
as a dataset of movie reviews. The paper focuses specifically on improving
Word2Vec’s pre-trained Google News embedding when used by a simple classifi-
cation CNN. By concatenating the equally-sized pre-trained and custom vectors,
it was found that the percentage accuracy improved from 88.47% to 88.85% on
the adverse drug reaction tweets, and from 80.56% to 81.29% on the movie review
dataset. To build upon this research, this work investigates a similar technique’s
impact on multiple methods for creating embeddings, rather than just focusing
on Word2Vec. As well, by testing on a dataset with more than two possible cat-
egories, it is possible to bring in more precise methods of measuring expertise
than just percentage accuracy. This work also reduces the custom embedding's dimensionality from the 300 of [6] to 100, since a much smaller domain does not require as many dimensions as the pre-trained embedding [14].
The sidecar technique begins with training a small, custom word embedding
model (the “sidecar” model) on a corpus of domain-specific texts, which creates
word embedding vectors of length n. Since the sidecar model is trained on docu-
ments exclusively from that domain, it should learn meanings and relationships
exclusive to that domain.
When analyzing text, vectors are generated for each word using both the sidecar model and a general-purpose word embedding model, which produces embedding vectors of length m. The final word vector is created by concatenating the two vectors lengthwise, yielding a vector of length n + m.
The intuition behind this technique is that, by using both vectors, the model
gains the strengths of both while covering their weaknesses. The rest of this paper
will examine the benefits of this technique in a domain-specific text classification
task.
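The concatenation step can be sketched as follows. The lookup tables here are hypothetical stand-ins for the two trained models (a real word2vec/GloVe/fastText model would supply the vectors):

```python
import numpy as np

# Stand-ins for the two models: a general-purpose embedding of length m
# and a small domain-specific "sidecar" embedding of length n.
M, N = 300, 100
rng = np.random.default_rng(0)
words = ["stack", "overflow", "array"]
pretrained = {w: rng.standard_normal(M) for w in words}
sidecar = {w: rng.standard_normal(N) for w in words}

def sidecar_vector(word):
    """Concatenate the general and domain vectors into one length-(m + n)
    representation; out-of-vocabulary words fall back to zeros."""
    g = pretrained.get(word, np.zeros(M))
    d = sidecar.get(word, np.zeros(N))
    return np.concatenate([g, d])

assert sidecar_vector("stack").shape == (M + N,)
```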
An example post from the corpus (its code snippets are omitted here):

“was curious on how to write a method to remove all zeros from an array. If I have the array in the main method. For example my Main Method would look something like … and have the code output the length and the array without the zeros like: …”
Stack Overflow posts were collected and labelled as being related to one of ten programming languages: C++, vb.net, Java, Perl, PHP, Python, R, SQL, Javascript, and C#. 1000 posts were collected per language, for 10000 posts total. The task was to examine the content of the post and correctly classify which language it is related to.

⁴ https://stackoverflow.com.
⁵ https://cloud.google.com/bigquery/public-data/stackoverflow.
⁶ For some interesting background on programming jargon in particular, see http://catb.org/jargon/html/distinctions.html.
6 Results
Table 4 shows the results of hyperparameter tuning for the classification model using only the pre-trained word embedding models. The highest accuracy for each model is highlighted. Tables 5, 6 and 7 show the results of hyperparameter tuning for the sidecar variants. The highest accuracy for each individual model is highlighted. The hyperparameters chosen for each model are summarized in Table 8.
Table 4. Context classification accuracy for the pre-trained Word2Vec, GloVe and fastText models

#LSTM neurons \ windowSize     10     20     30     40     50     60     70     80     90    100
2                           0.671  0.705  0.679  0.733  0.745  0.741  0.752  0.756  0.741  0.739
3                           0.661  0.661  0.703  0.725  0.735  0.719  0.725  0.747  0.774  0.752
4                           0.669  0.677  0.689  0.699  0.731  0.725  0.731  0.743  0.750  0.725
5                           0.647  0.681  0.707  0.754  0.745  0.745  0.750  0.754  0.764  0.749
6                           0.621  0.725  0.739  0.754  0.733  0.731  0.731  0.766  0.756  0.764

#LSTM neurons \ windowSize     10     20     30     40     50     60     70     80     90    100
2                           0.651  0.703  0.747  0.747  0.754  0.754  0.737  0.758  0.743  0.731
3                           0.613  0.697  0.735  0.729  0.739  0.754  0.735  0.762  0.768  0.743
4                           0.673  0.695  0.721  0.727  0.750  0.754  0.733  0.772  0.741  0.782
5                           0.693  0.715  0.743  0.745  0.735  0.760  0.762  0.745  0.737  0.762
6                           0.649  0.711  0.754  0.733  0.741  0.762  0.723  0.758  0.747  0.756
Figure 2 records the overall accuracy for all three pre-trained embeddings, side-
car alone, and those embeddings enhanced with a sidecar. Unsurprisingly, the
accuracy of fastText is head and shoulders above the other pre-trained models;
the n-gram approach gives it a significant advantage when evaluating a cor-
pus containing many out-of-vocabulary words. What is more significant is that
the sidecar-enhanced Word2Vec and GloVe embeddings manage to consistently
equal or exceed the overall accuracy of the fastText model; and the sidecar-
enhanced fastText consistently exceeds fastText in overall accuracy. The sidecar
embedding performs poorly by itself compared to all other models, rejecting
the hypothesis that sidecar alone (and not the pre-trained embedding model)
was responsible for the performance improvement. This experiment has demonstrated that sidecar augments pre-trained models to provide a richer context than either model provides by itself.
Figure 3 records the CWS ratio scores for each embedding. Once again, the sidecar-enhanced embeddings consistently outperform the originals. To further illustrate the point, the CWS numerator (discrimination) and denominator (inconsistency) are shown separately in Figs. 4 and 5.
#LSTM neurons \ windowSize     10     20     30     40     50     60     70     80     90    100
2                           0.703  0.758  0.750  0.780  0.784  0.774  0.788  0.784  0.784  0.752
3                           0.681  0.752  0.776  0.770  0.794  0.766  0.758  0.796  0.788  0.766
4                           0.713  0.731  0.776  0.780  0.774  0.784  0.790  0.806  0.774  0.782
5                           0.723  0.747  0.762  0.782  0.760  0.784  0.796  0.786  0.784  0.774
6                           0.705  0.776  0.768  0.780  0.794  0.782  0.792  0.776  0.776  0.768
Fig. 2. Context classification accuracies for each model over 10 different random seeds.
Fig. 3. CWS score for each model over 10 different random seeds.
Fig. 4. CWS numerator (discrimination) for each model over 10 different random seeds.
Fig. 5. CWS denominator (inconsistency) for each model over 10 different random
seeds.
7 Conclusion
A method was presented in this work for augmenting word embedding models
with expert knowledge from a corpus of text. This work investigated a simple
method of enriching existing word embeddings with domain-specific information
using a small, custom word embedding. The advantage of this technique is that it
is less computationally expensive than retraining a full word embedding model,
and it leverages the general syntactic information encoded in large pre-trained
word embeddings. This work demonstrated quantitatively that this approach
leads to improvements in performance when classifying domain-specific text,
as measured by both common machine learning metrics such as overall accu-
racy and a quantitative measure of expertise. One major limitation is that it requires a dataset from the target domain: if there is not enough training data, the sidecar structure will not be able to improve the embeddings. Another
limitation is that this approach does require additional compute time and con-
figuration to generate the sidecar embeddings, rather than using a pre-trained
model as-is. Future work may include experiments on expert knowledge domains beyond internal audit, such as medicine, finance, and law. The
sidecar method could also be applied to other forms of embedding altogether,
such as improving the quality of image embeddings in specific contexts. Another
interesting approach would be to experiment with merging the pre-trained and
custom embeddings deeper inside the model, passing the custom and pre-trained
vectors separately into a model and then merging the layers later on, as in [6].
References
1. Anders Ericsson, K., Charness, N.: Expert performance: its structure and acquisi-
tion. Am. Psychol. 49, 725–747 (1994)
2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
subword information. arXiv preprint arXiv:1607.04606 (2016)
3. Dong, J., Huang, J.: Enhance word representation for out-of-vocabulary on Ubuntu
dialogue corpus. CoRR abs/1802.02614 (2018). http://arxiv.org/abs/1802.02614
4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9,
1735–1780 (1997)
5. Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A.,
Fidler, S.: Skip-thought vectors. In: Advances in Neural Information Processing
Systems, pp. 3294–3302 (2015)
6. Limsopatham, N., Collier, N.: Modelling the combination of generic and target
domain embeddings in a convolutional neural network for sentence classification.
Association for Computational Linguistics (2016)
7. McBride, M.F., Burgman, M.A.: What is expert knowledge, how is such knowledge
gathered, and how do we use it to address questions in landscape ecology? In:
Perera, A., Drew, C., Johnson, C. (eds.) Expert Knowledge and Its Application in
Landscape Ecology, pp. 11–38. Springer, New York (2012)
8. Merriam-Webster Online: Merriam-Webster Online Dictionary (2009). http://
www.merriam-webster.com
9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:1301.3781 (2013)
10. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: Advances in Neural
Information Processing Systems, pp. 3111–3119 (2013)
11. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word repre-
sentation. In: Empirical Methods in Natural Language Processing (EMNLP), pp.
1532–1543 (2014). http://www.aclweb.org/anthology/D14-1162
12. Saal, F.E., Downey, R.G., Lahey, M.A.: Rating the ratings: assessing the psycho-
metric quality of rating data. Psychol. Bull. 88(2), 413 (1980)
13. Shanteau, J., Weiss, D., Thomas, R., Pounds, J.: Performance-based assessment of
expertise: how to decide if someone is an expert or not. Eur. J. Oper. Res. 136(2),
253–263 (2002)
14. Shapiro, D.: Composing recommendations using computer screen images: a
deep learning recommender system for PC users. Ph.D. thesis, Université
d’Ottawa/University of Ottawa (2017)
15. Shapiro, D., Qassoud, H., Lemay, M., Bolic, M.: Visual deep learning recom-
mender system for personal computer users. In: The Second International Con-
ference on Applications and Systems of Visual Paradigms, VISUAL 2017, pp. 1–
10 (2017). https://www.thinkmind.org/index.php?view=article&articleid=visual_2017_1_10_70006
16. Speer, R., Chin, J.: An ensemble method to produce high-quality word embeddings.
arXiv preprint arXiv:1604.01692 (2016)
17. Xu, C., Bai, Y., Bian, J., Gao, B., Wang, G., Liu, X., Liu, T.Y.: RC-NET: a general
framework for incorporating knowledge into word representations. In: Proceedings
of the 23rd ACM International Conference on Information and Knowledge Man-
agement, pp. 1219–1228. ACM (2014)
Performance Analysis of Support Vector
Regression Machine Models in Day-Ahead
Load Forecasting
1 Introduction
Through an evaluation of SVRM models for day-ahead load forecasting, this research aims to contribute to the growing body of literature that explores the predictive capability of SVM, and thereby to assist the appropriate generation, transmission, and distribution of electricity.
2 Methodology
2.1 Electric Load Data Preparation
Data preparation, involving data selection, data representation, and feature selection, plays an important role in successful SVRM implementation for any forecasting purpose, as this process ensures the good quality of the data before it is fed into the machine learning framework. As shown in Table 1, electric load data in terms of kilowatts delivered (KW_DEL) at fifteen-minute resolution was acquired from a power utility, along with the corresponding kilowatt-hours delivered (KWH_DEL), kilovolt-ampere reactive hours delivered (KVARH_DEL), metering point (SEIL), and billing date (BDATE).
A total of 70,109 rows of data was then represented, scaled, and partitioned in order to be processed by the SVRM. The BDATE attribute was split into its respective day, month, and year attributes to ease machine learning and to deal with the limitation of SVRM libraries in handling data with commas, semicolons, and other marks. Additional attributes such as calendar day, holiday, and weekend indicators were added to the original dataset, as literature suggests that these are material to the performance of SVRM in load forecasting [10]. Binary variables were then used to represent non-numeric features [4, 11]. The 15-min TIME attribute was converted to an integer equivalent, with 00:15 (read as 12:15 AM) converted to 1 and 00:00 (read as 12:00 midnight) converted to 96. As shown in Eq. (1), the Min-Max normalization method was then used to scale the electric load data, since this method has been shown to yield good results in regression problems by scaling the data so that higher values do not suppress lower values, in order to retain the activation function [12].
x' = (x − min(x)) / (max(x) − min(x))    (1)
where x is the electric load, min(x) is the minimum electric load of the dataset, max(x) is the maximum electric load, and x' is the scaled electric load value. After data
representation and data scaling, the dataset was partitioned into two data sets of the
training set and validation set. The training set was used for training the formulated
SVRM models along with different parameters while the validation set was used for
testing the design of the SVRM model to confirm its predictive accuracy. As shown in Fig. 1, January 1, 2013 to November 2014 was identified as the training set, while December 2014 was set as the validation set, with the available December 2014 data covering December 1 to December 25, 2014.
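Equation (1) and the TIME-to-index encoding described above can be sketched as follows (toy values, not the utility's KW_DEL series):

```python
import numpy as np

def min_max_scale(x):
    """Min-max scaling per Eq. (1): map the series onto [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def time_to_index(hhmm):
    """Map a 15-min timestamp to 1..96: '00:15' -> 1, ..., '00:00' -> 96."""
    h, m = map(int, hhmm.split(":"))
    idx = (h * 60 + m) // 15
    return 96 if idx == 0 else idx

kw_del = np.array([120.0, 80.0, 200.0, 80.0])
scaled = min_max_scale(kw_del)  # minimum maps to 0, maximum to 1
assert scaled.min() == 0.0 and scaled.max() == 1.0
```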
Despite the small number of variables used in this study, the researchers were still conservative enough to perform feature selection, to ensure the materiality of the data to be fed into the SVRM models, by identifying the most relevant input variables within the data set and removing irrelevant, redundant, or noisy data. Proper selection of features or relevant input variables can improve the prediction performance of machine learning methods, specifically that of SVRM, which was originally designed for classification [3, 8, 13]. Given a subset of the whole data set, correlation-based filter feature selection and information gain approaches were used for the feature selection process [14–16].
The Waikato Environment for Knowledge Analysis (Weka) and the R programming language were used as tools to perform feature selection, with Pearson's correlation, the Spearman correlation matrix, and Kendall's correlation used in the correlation-based approach. The value of these correlation coefficients ranges between −1 and 1, with the strongest linear relationship indicated by a correlation coefficient of −1 or 1 and the weakest indicated by a coefficient of 0. A positive correlation means that as one variable gets bigger, the other tends to get bigger with it, while a negative correlation means that as one variable gets bigger, the other tends to get smaller along with it. On the other hand, information gain IG
(G1;G2;C) is the gain of mutual information of knowing both G1 and G2 with respect
to the class C. A positive value of IG(G1;G2;C) indicates the synergy between G1 and
G2. In other words, it measures the amount of information in bits about the class prediction, which in this case is the KW_DEL feature. The information gain measure is
based on the entropy concept. It is commonly used as the measure of feature relevance
in filter strategies that evaluate features individually with the advantage of being fast
[17]. Let D(A1, A2, …, An, C), n > 1, be a data set with n + 1 features, where C is the class attribute. Let m be the number of distinct class values. The entropy of the class distribution in D, represented by Entropy(D), is shown in Eq. (2):

Entropy(D) = − Σ_{i=1}^{m} pi log2(pi)    (2)
where pi is the probability that an arbitrary instance in D belongs to class ci. This concept is used by the single-label strategy known as Information Gain Attribute Ranking to measure the ability of a feature to discriminate between class values [16, 17]. Similar to the correlation-based approach, this relationship is constrained by −1 < r < 1.
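The entropy of Eq. (2) and information gain as entropy reduction can be sketched as follows (a minimal illustration, not the Weka/R pipeline the study actually used):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy per Eq. (2) of an observed class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """IG = Entropy(labels) - expected entropy after grouping by feature."""
    n = len(labels)
    groups = {}
    for f, c in zip(feature, labels):
        groups.setdefault(f, []).append(c)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# A uniform two-class distribution carries exactly 1 bit of entropy.
assert abs(entropy(["y", "n", "y", "n"]) - 1.0) < 1e-12
```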
The parameter C allows a tradeoff between training error and model complexity. If the value of C is too large, overfitting will occur; on the other hand, if C is too small, it may result in underfitting and increase the number of training errors [2, 5]. The parameter ε controls the width of the ε-insensitive zone used to fit the training data [5]. A larger ε results in fewer support vectors being selected and in flatter, less complex regression estimates [3, 10]. If the value of ε is too big, the separating error is high and the number of support vectors is small, and vice versa [10]. The kernel parameter γ defines the nonlinear mapping from the input space to some high-dimensional feature space [3, 9]. The optimality of the parameter values depends on their effect on the predictive accuracy of the resulting model. Using the kernel selected in the previous step, testing of parameters was performed; by comparing the predictive results, the parameters with the lowest error were selected. The acquired parameter and kernel values were then used to select which SVRM architecture would be used in the model.
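The parameter-testing loop described above can be sketched as follows; `train_and_score` is a hypothetical stand-in for fitting an RBF-kernel SVRM with the given (C, γ, ε) and returning its validation MAPE:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, the study's accuracy metric."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def grid_search(train_and_score, Cs, gammas, epsilons):
    """Evaluate every (C, gamma, epsilon) combination; keep the lowest error."""
    best_score, best_params = float("inf"), None
    for C in Cs:
        for g in gammas:
            for e in epsilons:
                score = train_and_score(C, g, e)
                if score < best_score:
                    best_score, best_params = score, {"C": C, "gamma": g, "epsilon": e}
    return best_score, best_params

# Two predictions each 5% off yield a MAPE of exactly 5%.
assert mape([100.0, 100.0], [95.0, 105.0]) == 5.0
```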
Architecture selection is the process of choosing the structure of the dataset depending on the number of previous days to be included in the training set. In this study, two SVRM architectures were developed, wherein each model produces 96 forecasted load data points at 15-min resolution for the day to be forecasted. As shown in Fig. 2, the architectures take all attributes approved by the feature selection phase as inputs.
i represents the current date of the model (i = 1, …, 365). The format of Architecture I was considered since it has been found relatively effective by researchers [1]. In Architecture II, the model initially takes one day of consumption data at 15-min resolution from the day before the forecast date. As long as a minimum error is not achieved, another day's worth of consumption data is added as input to Architecture II. This process is iterated until the optimal accuracy is found, and is represented as i−1, i−2, …, i−n, where n is the number of days. Because previous research does not establish a standard number of consumption data points used to predict the forecast, Architecture II was created to ensure that an appropriate number of days is utilized to maximize accuracy in load prediction [1, 7, 8]. The architectures were
then trained using the kernels and parameters chosen by the kernel selection phase and
the parameter selection phase. Identifying the optimally performing SVRM model was
then conducted after choosing which architecture yielded the best results.
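The iterative day-adding search behind Architecture II can be sketched as follows; `score_with_days` is a hypothetical stand-in that trains an SVRM on the last n days of 15-min consumption inputs and returns its validation MAPE:

```python
def best_day_count(score_with_days, max_days=14):
    """Try n = 1..max_days previous days of inputs; keep the lowest MAPE."""
    best_n, best_err = 1, score_with_days(1)
    for n in range(2, max_days + 1):
        err = score_with_days(n)
        if err < best_err:
            best_n, best_err = n, err
    return best_n, best_err
```

The cap `max_days` is an assumption added here so the sketch terminates; the study iterated until the smallest predictive error was found.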
of evaluated features improves the accuracy and effectiveness of the SVRM model with lower forecasting error [1, 2, 4]. This is because feature selection reduces the dimensionality of the data, enabling regression algorithms like SVRM to operate faster and more effectively. Since this study used the R programming language to implement Pearson's correlation, the Spearman correlation matrix, and Kendall's correlation, the function cor() of R was used to calculate the weighted correlation of the given data set.
Pearson’s correlation is a statistical measure of the strength of a linear relationship
between paired data [14, 15].
With Pearson’s correlation, Fig. 3(a) shows that the TIME attribute has a
0.52901261 correlation value to KW_DEL, MONTH has −0.270584464, year has
−0.0675271275, DAY TYPE −0.028027669, DATE TYPE −0.021399675, and DAY
has −0.0183968342. Thus, the time attribute has the highest correlation to KW_DEL.
Attributes YEAR, DAY TYPE, DATE TYPE, and DAY have a relatively low corre-
lation to KW_DEL since they are closer to zero.
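As a minimal illustration of the correlation computation, with numpy's corrcoef standing in for R's cor() (toy arrays, not the utility's TIME and KW_DEL columns):

```python
import numpy as np

# A short series where load rises roughly with the time-of-day index.
time_idx = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
kw_del = np.array([10.0, 12.0, 15.0, 14.0, 18.0, 21.0])
r = np.corrcoef(time_idx, kw_del)[0, 1]
assert -1.0 <= r <= 1.0 and r > 0.9  # strong positive linear relationship
```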
Figure 4 shows the Information Gain relationship between the different attributes
and KW_DEL. The orange parts of the illustration represent KW_DEL while the blue
parts represent specific attributes. Only the attributes TIME and MONTH converge to KW_DEL, signifying that time and month have a relevant relationship to KW_DEL. The
attributes DAY, YEAR, DATE TYPE, and DAY TYPE did not converge with
KW_DEL, signifying little to no relationship. Thus, in the Information Gain approach,
MONTH and TIME were selected as the features that can affect the predictive variable
while the rest of the attributes show poor correlation.
Linear Parameters | Linear MAPE Value | RBF Parameters | RBF MAPE Value
The MAPE, computed from the daily predictions for the 25 days of December 2014, shows that the RBF kernel's accuracy is superior to that of the linear kernel, which could not even produce accuracy below 5% MAPE. A study on load forecasting using SVRM used the RBF kernel in its load prediction model, resulting in an accuracy of 97% [1]. The RBF kernel was also used in a similar study, which yielded a MAPE of 2.31% [4]. With the behavior of the data used in this study, RBF was found to be more accurate. Thus, RBF was chosen as the kernel for the developed SVRM models.
In searching for the suitable SVRM parameters, this study assessed the performance
of select SVRM models. To do this, the partitioned datasets namely the training and
validation sets were used to generate the MAPE values of the tested models.
The MAPE value generated by the model comes from the 25 days of December 2014
which, in this case, was the validation set of the study. According to the model’s
performance on the validation set, the researchers inferred the proper values of the parameters. Figure 5 shows the iterated procedure conducted by this study to select the best parameters, which were then used for the best performing SVRM model.
A study suggested a similar results generation procedure, and the developed model of that study yielded a 3.62% margin of error [2]. Another study also used a similar procedure and obtained good parameters for its SVRM model [7]. The developed model of the said study yielded a predictive error of less than 5% and won first place in the EUNITE competition in 2001, marking the rise in popularity of SVM for forecasting. Table 7 shows the SVRM parameters with the lowest MAPE among the 80 tested models. These models used the RBF kernel, as decided in the kernel selection phase.
Model  C    g      e      p       MAPE
A      110  0.001  0.01   0.005   4.36%
B      120  0.001  0.01   0.005   4.10%
C      118  0.001  0.01   0.005   4.13%
D      115  0.001  0.01   0.005   4.12%
E      125  0.001  0.01   0.005   4.11%
F      130  0.001  0.01   0.005   4.16%
G      125  0.001  0.011  0.005   4.12%
H      125  0.001  0.01   0.0045  4.09%
I      127  0.001  0.01   0.005   4.12%
J      126  0.001  0.01   0.005   4.12%
K      123  0.001  0.01   0.005   4.11%
L      124  0.001  0.01   0.005   4.13%
Since there is no standard number of load consumption data points used for load forecasting, this study introduced Architecture II. Architecture II initially takes one day of consumption data at 15-min resolution from the previous date, denoted as i−1. The process is iterated, incrementing the number of previous days to be considered, until the smallest predictive error is found. Table 9 shows a portion of the Architecture II dataset.
In this research, the architecture that yielded the lowest MAPE was the one with seven days as past attributes (i−1, i−2, i−3, i−4, i−5, i−6, i−7). As shown in Table 10, the least accurate architecture was the one with five days as past attributes (i−1, i−2, i−3, i−4, i−5). Interestingly, i−1 and i−1 & i−2 have only a 0.01% difference in accuracy. The most accurate of the seven was the model with (i−1, i−2, i−3, i−4, i−5, i−6, i−7), reaching a MAPE of 4.19%. Since this architecture achieved the highest accuracy, it was the one compared against Architecture I.
To select the best performing architecture for the SVRM model, the researchers
compared the predicted values of the two architectures to the actual values of the
December 2014 consumed load. Table 11 shows that Architecture II with seven days as past attributes (i−1, i−2, i−3, i−4, i−5, i−6, i−7) has a MAPE of 4.19% for daily prediction in December 2014, while Architecture I has a MAPE of 4.03%, superior by only 0.16%. It can also be observed that there is a sharp increase in inaccuracy on Day 25, which is December 25. Architecture I generated its smallest MAPE, 1.58%, on December 11, while the smallest MAPE of Architecture II was 1.68%.
These results show Architecture I yielding the better accuracy. Despite Architecture I yielding the lowest average MAPE, it is worth noting that it differs by only 0.16% from Architecture II. This indicates that Architecture II is also a promising architecture for SVRM load prediction.
Acknowledgment. The authors would like to thank the support of the MSU-IIT Office of the
Vice Chancellor for Research and Extension through PRISM-Premiere Research Institute in
Sciences and Mathematics for their assistance in this study.
References
1. Ostojin, S., Kulić, F., Švenda, G., Bibić, R.: Short-term electrical load forecasting using
support vector machines. In: Computers and Simulation in Modern Science. Mathematics
and Computers in Science Engineering. A Series of Reference Books and Textbooks, vol. I,
pp 138–142. WSEAS Press (2008)
2. Elattar, E.E., Goulermas, J., Wu, Q.H.: Electric load forecasting based on locally weighted
support vector regression. IEEE Trans. Syst. Man Cybern. C 40(4), 438–447 (2010)
3. Velasco, L.C.P., Polestico, D.L.L., Abella, D.M.M., Alegata, G.T., Luna, G.C.: Day-ahead
load forecasting using support vector regression machines. Int. J. Adv. Comput. Sci. Appl.
(IJACSA) 9(3), 22–27 (2018)
Ecommerce Fraud Detection Through Fraud
Islands and Multi-layer Machine
Learning Model
1 Introduction
Ecommerce has caused a tectonic shift in the retail landscape and opened vast new
opportunities for retail merchants. Although it provides greater convenience for mer-
chants and customers in conducting business, it unfortunately has also exposed them to
serious threats from sophisticated fraudsters who commit various forms of online transaction fraud and online service abuse. Retail fraud amounts to tens of billions of dollars lost in the US alone. Preventing the losses caused by unforeseen fraud attacks, without pushing away the revenue from legitimate transactions, has always been a major task for online merchants around the world.
The main challenge for ecommerce transaction fraud prevention is that fraud patterns are dynamic and diverse. Fraudsters often change their attack vectors either when they sense that their malicious behaviors are being successfully prevented by the merchants or when they find a new loophole in the fraud prevention system to attack. In addition, fraudsters attempt to behave like genuine customers to stay below the radar of fraud prevention systems. Since legitimate customers, like fraudsters, also change their online transaction behaviors over time, it is arduous for online merchants to accurately distinguish fraudsters from legitimate customers. Many existing
papers that address ecommerce fraud detection focus on investigating various types of
classification machine learning modeling, e.g. logistic regression, neural network,
random forest, Support Vector Machines (SVM), etc. [1–4]. In this research, a machine learning model, specifically a gradient boosting decision tree (GBDT), was also applied for fraud detection. It is important to emphasize that, in this paper, instead of developing a new machine learning model formulation, this research focuses on improving the performance of the currently adopted machine learning model through the following two proposed approaches.
The first is that link analysis (fraud islands) is applied to investigate the relationships between different fraudulent entities and to uncover hidden complex fraud patterns through the formed network. The outcome of the link analysis can then be used as an important input feature for the machine learning model (the gradient boosting decision tree in this research). Traditionally, online merchants employed the discrete analysis method to distinguish fraudulent transactions from legitimate ones [5]. While the discrete approach is effective enough for spotting patterns and capturing fraudsters acting alone, it does not necessarily detect patterns across all the different data endpoints and is therefore not very useful in detecting elaborate crime rings. Furthermore,
sophisticated fraudsters have learnt how to co-exist with and act like real customers by making their transaction behaviors and patterns look as legitimate as possible. Therefore, another great opportunity for improving classification accuracy lies in looking beyond individual data points, through the network of associated transaction features, to uncover these larger complex patterns and reveal the tricky fraud transactions.
The second is a sequential-assembling modeling technique, called the multi-layer model, designed to advance the accuracy of the conclusive classification decision. In many fraud management systems, fraud labels are determined through different internal and external channels: banks' declination decisions, manual review agents' rejection decisions, banks' fraud alerts, and customers' chargeback requests. It can reasonably be assumed that distinct fraud patterns are caught by different fraud risk prevention forces (i.e., banks, the manual review team, and the fraud machine learning model). Therefore, in this paper, machine learning models that are respectively trained using the fraud labels tagged by the different fraud risk prevention forces are assembled to sequentially improve the final machine learning model for conclusive classification.
The rest of the paper is organized as follows. In Sect. 2, existing research in link analysis and machine learning models that is either a building block of or an important reference for this research is briefly reviewed and discussed. In Sect. 3, the main methodologies proposed in this paper, namely the fraud islands that connect and cluster entities sharing similar or identical fraud transaction features, and the multi-layer model that applies the sequential-assembling modeling technique, are presented.
558 J. Nanduri et al.
In this section, existing research that either is a building block of this proposed research or provides an important reference for the theory developed in this study is discussed. As mentioned in the previous section, two main methodologies, fraud islands and the multi-layer model, are proposed in this paper; the literature review is therefore organized around these two topics.
Step 1: The first step is to create the fraud graph from a set of known fraudulent transactions. Entities associated with each transaction, such as the account identifier, device ID, email address, and payment instrument, are extracted as nodes in the graph. The entities within a single transaction are connected through edges, and different transactions can be connected through common entities. For example, if two fraudulent transactions use the same payment instrument on two different devices, the two devices are connected via the payment instrument. After connecting all the entities from the historical fraudulent transactions, a pool of transactions represented by these entities is formed. The collection of linked entities is referred to as the "Fraud Archipelago". See Fig. 1 for how the Fraud Archipelago and Fraud Islands are constructed.
Step 2: After the Fraud Archipelago and Fraud Islands are constructed using the fraudulent transactions, non-fraudulent transactions associated with the entities in the Fraud Archipelago/Islands are added. Through the connected components, a number of statistics per entity and per Fraud Island are calculated, e.g., node count in each Fraud Island, edge count per entity, clique size per Fraud Island, centrality, and connectedness.
Step 3: At the entity level, an entity's existence in a Fraud Island is highly predictive of future fraudulent transactions that use the same entity. At the island level, statistics such as the total number of nodes by node type are calculated. Intuitively, the more interconnections exist among entities, the greater the cause for concern. To quantify the risk for these linked entities, entity-level statistics, such as good-transaction and bad-transaction counts, are calculated by aggregating the historical transactions.
Step 4: Finally, the system checks the connection between each incoming transaction and the existing Fraud Islands. The statistics retrieved at both the island and node levels are provided as extra features to the existing modelling engine, which outputs a score indicating whether the input transaction is fraudulent or not.
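The graph-construction and grouping steps above can be sketched with a union-find pass over the entities of known fraudulent transactions. Everything below (entity names, transaction fields, island statistics) is a hypothetical illustration, not the authors' implementation:

```python
from collections import defaultdict

class UnionFind:
    """Disjoint-set structure: connected components become Fraud Islands."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical fraudulent transactions with their extracted entities (Step 1).
transactions = [
    {"account": "acct1", "device": "dev1", "payment": "card9"},
    {"account": "acct2", "device": "dev2", "payment": "card9"},  # shares card9
    {"account": "acct3", "device": "dev3", "payment": "card7"},
]

uf = UnionFind()
for tx in transactions:
    entities = [f"{kind}:{value}" for kind, value in tx.items()]
    for e in entities[1:]:          # link all entities of one transaction
        uf.union(entities[0], e)

islands = defaultdict(set)          # group entities by component root (Step 2)
for e in list(uf.parent):
    islands[uf.find(e)].add(e)

# card9 links the first two transactions into one island of 5 entities.
print(sorted(len(members) for members in islands.values()))  # [3, 5]
```

Per-island statistics such as node counts by type (Step 3) can then be aggregated over each `islands` entry and attached to incoming transactions as model features (Step 4).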
Ecommerce Fraud Detection Through Fraud Islands 561
A Near Term model, which is based on the most recent few weeks' worth of data, is built. Since
this Near Term model would employ incomplete chargeback information, this model
may not reliably work as a standalone model. Combined with other models, and as an
input to the Long Term model, however, the Near Term model complements them and
improves the overall performance.
Table 2 summarizes the models that are built and the datasets that are used in this research, and Fig. 4 illustrates the overall Multi-layer model architecture. That is, given a transaction, three models are run, i.e., the Bank Auth Model, the Fraud Alert Model and the Near Term Model, and their outputs are then provided to the Final Model to generate the final prediction.
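The architecture just described can be sketched as follows; the three sub-model scoring rules and the final-model weights are invented placeholders standing in for the trained models, kept only to show the data flow:

```python
# Hypothetical stand-ins for the three trained sub-models.
def bank_auth_model(tx):
    return 0.7 if tx["amount"] > 100 else 0.2

def fraud_alert_model(tx):
    return 0.9 if tx["alerted"] else 0.1

def near_term_model(tx):
    return 0.6 if tx["new_account"] else 0.2

def final_model(scores):
    # Stand-in for the final classifier: a weighted average of sub-model outputs.
    weights = [0.3, 0.4, 0.3]
    return sum(w * s for w, s in zip(weights, scores))

tx = {"amount": 250, "alerted": True, "new_account": False}
scores = [bank_auth_model(tx), fraud_alert_model(tx), near_term_model(tx)]
print(round(final_model(scores), 2))  # 0.63
```

In the paper's setup each of the four functions would be a trained GBDT; the point of the sketch is only that the sub-model outputs become input features of the final model.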
In this section, how the proposed approach can be applied for fraud detection is illustrated, and the effectiveness of the proposed research is demonstrated through case studies and experiments on real ecommerce data obtained from ecommerce business partners.
From the ROC (Receiver Operating Characteristic) curve below (Fig. 5), it can be seen that the machine learning model (gradient boosting tree) with aggregate features from link analysis performed better than the model without them: there was a 3% improvement in AUC (area under the curve).
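An AUC comparison such as the one above can be reproduced with the rank-sum (Mann-Whitney) identity; the labels and scores below are toy values, not the paper's data:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank-sum identity
    (assumes no tied scores, for simplicity)."""
    ranked = sorted(zip(scores, labels))
    pos_rank_sum = sum(rank for rank, (_, y) in enumerate(ranked, start=1) if y == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75: one of the four pos/neg pairs is mis-ranked
```

Running this on the scores of the model with and without graph features would give the two AUC values whose difference the paper reports as roughly 3%.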
Due to limited computing power, this study was conducted at a relatively small scale and scope in terms of the number of collaborating business partners, the length of the data collection and training period, and the types of customer information used for link graph vertices. It is reasonable to expect that with data from more business partners, longer data collection and training periods, and additional customer information nodes, the proposed approach will be even more effective and helpful for ecommerce fraud detection.
Fig. 5. Performance comparison between machine learning model with and without graph
features
Fig. 6. Performance comparison of Portfolio 1 using in-time dataset between Long Term Model
and Multi-layer (ML) model
Fig. 7. Performance comparison of Portfolio 1 using out-of-time dataset between Long Term
Model and Multi-layer (ML) model
Fig. 8. Performance comparison of Portfolio 2 using in-time dataset between Long Term Model
and Multi-layer (ML) model
Fig. 9. Performance comparison of Portfolio 2 using out-of-time data between Long Term
Model and Multi-layer (ML) model
Fig. 10. Performance comparison for Portfolio 3 using in-time data between Long Term Model
and Multi-layer (ML) model
Fig. 11. Performance comparison of Portfolio 3 using out-of-time data between Long Term
Model and Multi-layer (ML) model
5 Conclusions
This paper introduces two approaches, fraud islands and the multi-layer model, to boost the fraud detection capability of the currently running machine learning model.
Through fraud islands, link graph aggregated features were created that can more effectively provide valuable information about hidden fraud patterns. In addition, through the conducted case study, it was found that these link graph aggregate features can also help improve the fraud detection accuracy of the currently adopted machine learning models. As discussed in Sect. 4.1, due to the limited availability of cloud computing power, the proposed approach could only be implemented at a relatively small scale. In future studies, the authors will try to get access to the Azure machine learning service [18], which can provide much larger computing power and will allow more data (in both volume and variety) to be incorporated to improve the proposed approach.
Using the multi-layer modelling technique, three sub-models were created for subpopulations (transactions) that received fraud labels from different risk prevention systems, i.e., risk decisions made by merchants' fraud risk systems, banks' authorization decisions, and fraud alerts from associations. It is believed that by using the fraud labels determined by different internal and external risk systems, more fraud across various kinds of fraud patterns can be caught; the case studies shown in Sect. 4 support this intuition. In future work, more representative subpopulations for the sub-models of multi-layer modelling will be investigated to achieve more accurate and conclusive fraud detection.
References
1. Kou, Y., Lu, C.T., Sirwongwattana, S., Huang, Y.P.: Survey of fraud detection techniques.
In: IEEE International Conference on Networking, Sensing and Control, vol. 2, pp. 749–754.
IEEE (2004)
2. Wang, S., Liu, C., Gao, X., Qu, H., Xu, W.: Session-based fraud detection in online e-
commerce transactions using recurrent neural networks. In: Joint European Conference on
Machine Learning and Knowledge Discovery in Databases, pp. 241–252. Springer, Cham,
September 2017
3. Şahin, Y.G., Duman, E.: Detecting credit card fraud by decision trees and support vector
machines (2011)
4. Minastireanu, E.A., Mesnita, G.: An analysis of the most used machine learning algorithms
for online fraud detection. Inform. Econ. 23(1), 5–16 (2019)
5. Omen Homepage. https://omen.sg/detect-fraud-in-real-time-with-graph-databases/. Accessed
13 June 2019
6. Shah, N., Lamba, H., Beutel, A., Faloutsos, C.: The many faces of link fraud. In: 2017 IEEE
International Conference on Data Mining (ICDM), pp. 1069–1074. IEEE, November 2017
7. Sadowski, G., Rathle, P.: Fraud detection: discovering connections with graph databases.
White Paper-Neo Technology-Graphs are Everywhere (2014)
8. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM (JACM)
46(5), 604–632 (1999)
9. Jurczyk, P., Agichtein, E.: Discovering authorities in question answer communities by using
link analysis. In: Proceedings of the Sixteenth ACM Conference on Information and
Knowledge Management, pp. 919–922. ACM, November 2007
10. Vanneschi, L., Castelli, M.: Multilayer perceptrons. Encycl. Bioinform. Comput. Biol. 1,
612–620 (2019)
11. Svozila, D., Kvasnickab, V., Pospichalb, J.: Introduction to multi-layer feed-forward neural
networks. Chemometr. Intell. Lab. Syst. 39(1), 43–62 (1997)
12. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
13. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: 22nd SIGKDD
Conference on Knowledge Discovery and Data Mining (2016)
14. Volkovs, M., Yu, G., Poutanen, T.: Content-based neighbor models for cold start in recommender systems. In: Proceedings of the Recommender Systems Challenge 2017 (RecSys Challenge 2017), Article no. 7 (2017)
15. “Machine Learning Challenge Winning Solutions” in “Awesome XGBoost”. https://github.
com/dmlc/xgboost/tree/master/demo. Accessed 14 June 2019
16. Xia, Y., Liu, C., Da, B., Xie, F.: A novel heterogeneous ensemble credit scoring model based
on bstacking approach. Expert. Syst. Appl. 93, 182–199 (2017)
17. Apache Spark Homepage. https://spark.apache.org/. Accessed 14 June 2019
18. Azure Machine Learning Service Homepage. https://azure.microsoft.com/en-in/services/
machine-learning-service/. Accessed 14 June 2019
Automated Drug Suggestion Using Machine
Learning
1 Introduction
Analytics is a method of creating insights through the effective use of data by applying qualitative and quantitative analysis to it. Technological advancement and the growth of medical data have created a good opportunity, but they have also posed challenges for protecting the privacy and security of patient and research data; working with federal and expert agencies to promote security procedures enables better scientific and medical advances. Understanding the effects of drugs helps with selecting the drug that is most effective and popular amongst patients. Although many reviews are available for each drug, people still struggle to pick the one that is best for them. We use publicly available datasets to implement exploratory, descriptive, and predictive analyses that advance the understanding of the effects of drugs and recommend the best possible drug for an individual.
Recently, there has been a significant rise in the number of data mining and analytics techniques used in the field of medicine and healthcare, which raises the need for multiple sources of data with similar attributes. For example, the dataset taken from the UCI Machine Learning Repository [1] contains over 200,000 patient drug reviews, split into two datasets: the training dataset has dimensions of 161,000 by 7 and the testing dataset 53,800 by 7, in terms of rows by columns. Table 1 gives an overview of the studied datasets. Combining multiple datasets resulted in a dataset containing 670,000 rows. The second dataset was taken into consideration because the UCI dataset [1] does not include life-threatening diseases. The final dataset [3] gave a much-needed diagnosis of the patients' conditions and symptoms.
Creating a drug recommendation assistant with the help of these datasets, with optimal accuracy, makes it necessary to test various machine learning and natural language processing techniques. The rest of this paper discusses the background information, the framework setup, the result analysis, and the exploratory, predictive, and prescriptive analyses.
2 Literature Review
Symptoms should be understood and taken care of more closely. Death due to medical conditions is on the rise, and of the ten biggest killers in the United States, eight are related to some type of medical condition. In addition, many of these diseases progressively get worse without medication or treatment. Getting proper help and medication, along with regular screening, can help patients. To aid in ending this tragedy, we utilize three different datasets and exploratory, predictive, and prescriptive analyses to fight this ongoing epidemic with a better chance of success. Whilst searching for datasets, UCI Machine Learning [1] and [3] showed tremendous potential, along with the Centers for Disease Control and Prevention (CDC) and Kaggle. The UCI dataset [1] is accompanied by a published paper that works with the data and covers in-domain and cross-domain sentiment analysis [1].
Accessing data that is both relevant and up to date is a top priority, since certain medicines will have a greater impact than they did before, and a better prescriptive model can then be created. People getting proper help should also be a priority, as there is a link between outcomes for those who receive medical attention and those who do not, as discussed by Larmuseau [3]. Viveka and Kalaavathi [4] discuss the importance of data mining as an essential part of medical analysis. The reason behind better data exploration is to complement the existing tools with more effective solutions for health
Automated Drug Suggestion Using Machine Learning 573
care professionals and to control the quality of the medicine being discovered, the
treatment, and adverse side effects of the drugs.
Being able to understand death trends caused by disease is another aspect studied in this research, in order to figure out which drug side effects can escalate into life-threatening conditions. Several studies [5, 6] suggest that CVD has been regularly decreasing over time; CVD mortality has decreased significantly, resulting in cancer surpassing CVD as the leading cause of death in high-income counties in the USA.
We also investigated the current technologies available and how they help patients. For example, from the patient records of the future, one could quickly access a list of current problems with a kind of clinical logic, the patient's health status, and information about the various treatments the patient can pursue for their condition. Easy access to and sound organization of data elements can be provided by the automation of patient records, as suggested in [7], but the availability of the data elements depends on whether practitioners are collecting and recording such data in the first place; another aspect to consider is security, as discussed in [8–10]. Thus, we can see many barriers in existing systems.
We investigated different techniques currently under research, as in [7] and [11]. It was explained in [12] that a PCA study was completed which finds the minimal number of attributes required to enhance the precision of several supervised machine learning algorithms; it proposed new techniques to train a supervised system that gains knowledge in order to predict heart disease. There are numerous data mining strategies, such as categorization and preprocessing.
Romanosky [10] suggested the use of hard and soft clustering methods for detecting patterns of medication use by patients, then checking whether they could correctly estimate the probability of a certain patient profile matching. However, our data is already labeled, so unsupervised methods would be redundant. We expand the same logic behind the sentiment prediction techniques used in [13] for classification to drug prediction.
The first dataset was taken from Kaggle and published by the UCI Machine Learning Repository; [1] contains medicine reviews with over 200,000 entries, split into two datasets. The Train dataset dimensions are 161,000 × 7, while the Test dataset is 53,800 × 7, in terms of rows by columns. The rows represent each unique case, while the columns represent a unique ID, the drug name, the condition name, the patient review, a rating out of 10 stars, the date, and the number of users who found the drug useful. Another dataset we used was originally from NCHS, but we came across it on Kaggle, named [2]. This dataset has dimensions of 206k × 13 and includes columns such as state, year of death, and age. The Kaggle dataset used [3], System Disease Sorting, contains seven different datasets within it, describing the available diagnoses for each symptom. Preprocessing steps were necessary to format the data and to map the id of each symptom to the id of the condition.
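That mapping step can be sketched as a simple dictionary lookup; the ids and column names below are hypothetical, since the datasets' exact schemas are not reproduced here:

```python
# Hypothetical symptom-id -> condition-id table built from the sorting datasets.
symptom_to_condition = {"s101": "c01", "s102": "c01", "s203": "c07"}

rows = [
    {"symptom_id": "s101", "review": "mild headache"},
    {"symptom_id": "s203", "review": "shortness of breath"},
]
for row in rows:
    # Attach the diagnosed condition to each review row during preprocessing.
    row["condition_id"] = symptom_to_condition.get(row["symptom_id"], "unknown")

print([r["condition_id"] for r in rows])  # ['c01', 'c07']
```

In practice the same join can be done with pandas across the seven sub-datasets; the dictionary form just makes the id-to-id mapping explicit.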
574 V. Doma et al.
Python was selected because of the abundance of libraries and APIs capable of
efficient and concise code generation. Some of the environments required in the
development are provided below:
Jupyter Notebook: Jupyter Notebook documents are produced by the Jupyter Notebook App and contain both computer code (e.g. Python) and rich text elements (paragraphs, equations, figures, links, etc.). Notebook documents are both human-readable documents containing the analysis description and the results (figures, tables, etc.) and executable documents that can be run to perform data analysis. The Jupyter Notebook can be installed with pip: pip3 install jupyter.
Anaconda: Anaconda is a package manager, a Python distribution, and a collection of over 1,000+ open source packages. It is free and easy to install, and it offers free community support. Over 150 packages are automatically installed with Anaconda, and a single command installs the current version of the Anaconda manager. It is also possible to make a customized package using the "conda build" command. Some of the required packages and tools were made available through Conda:
• Scikit-learn for implementing machine learning: it is an efficient tool for data mining and data evaluation, accessible to everyone and reusable in various contexts; built on NumPy, SciPy, and matplotlib; open source and commercially usable. It provides most of the built-in machine learning algorithms used here, such as SVM, multinomial NB, and KNN. At its lowest level, one can say it is a third-party extension to SciPy [15, 16].
• Pandas is one of the easiest data structure and data analysis tools for the Python programming language. Reading and analysis of CSV datasets can be done with pandas, and it has various features that allow us to format datasets and perform cleaning.
• NumPy provides high-level mathematical functions to perform operations on arrays. NumPy can also be used as an efficient multi-dimensional container of generic data, and arbitrary data types may be defined. This allows NumPy to seamlessly and rapidly integrate with a wide variety of databases.
• NLTK is one of the leading platforms for working with human language data in Python. One module in this library, the snowball stemmer, was essential while constructing the NLP block of the code. Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval.
• Matplotlib is a data visualization library in Python for 2D plots of arrays. It is a multi-platform library constructed on NumPy arrays and designed to work with the broader SciPy stack. One of its key aspects is visualization, which allows one to see large amounts of information in easily digestible visuals. Matplotlib provides numerous plot types such as line, bar, scatter, histogram, and pie, as suggested in [17–19].
4 Methods
4.1 Cleaning/Preprocessing
Before any machine learning algorithms can be used on textual data, we needed to find out which features could be useful for accurately predicting a well-labeled class. Cleaning involves removing the 'NaN' values, 0 instances, and other garbage data. Combining this phase with preprocessing made it necessary to also remove stop words and other strange entries from the 'review' column of the dataset [1]. For example, one type of entry found in a minority of the review column was "<?span> user found this comment helpful </span>" without the actual review written; as this gave no information about the actual review, cleaning techniques helped to remove rows containing these bogus entries. Removing unnecessary whitespace and other symbols was also done effectively with Pandas. The following are the techniques used to handle the cleaning and preprocessing, as suggested in [12, 14].
4.2 Tokenize
Removes the punctuation, converts to lowercase, and tokenizes the parsed review text.
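A minimal sketch of such a tokenize step (the authors' exact implementation is not shown, so this is an assumed reconstruction):

```python
import re
import string

def tokenize(text):
    """Strip punctuation, lowercase, and split into word tokens."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.findall(r"[a-z0-9]+", text)

print(tokenize("This drug WORKED great, highly recommend!"))
# ['this', 'drug', 'worked', 'great', 'highly', 'recommend']
```

This function is what gets passed to the vectorizer in the next step, so that the bag of words is built from the cleaned tokens.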
4.3 CountVectorizer
The tokenize function defined above is passed as a parameter to the CountVectorizer function; the result is the bag of words generated from the text. By passing "English stop words" as one of the parameters of the snowball stemmer function, it was possible to feed in all stop words of the English language in order to eliminate misclassification errors due to the use of pronouns, prepositions, and articles, and to strengthen the importance of adverbs, adjectives, nouns, etc.
4.7 MultinomialNB
MultinomialNB is a sklearn function that implements the Multinomial Naïve Bayes algorithm. The multinomial Naïve Bayes classifier is appropriate for discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts; in practice, however, fractional counts such as TF-IDF also work. Some of the hyperparameters for this function are the prior probabilities and the Laplace/Lidstone smoothing parameter. In general, the naïve Bayes equation is given as (5)
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (5)$$
where P(A | B) is the posterior probability, P(B | A) is the likelihood, P(A) is the class prior probability, and P(B) is the predictor prior probability.
In Multinomial Naïve Bayes, the distribution is parametrized by vectors $\theta_y = (\theta_{y1}, \ldots, \theta_{yn})$ for each class $y$, where $n$ is the number of features (essentially the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$. The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e., relative frequency counting, as (6)

$$\theta_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n} \qquad (6)$$

where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ occurs in the samples of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$. The smoothing parameter $\alpha \ge 0$ accounts for features not present in the learning samples; with $\alpha = 0$, no smoothing is applied.
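Eq. (6) can be checked with a few lines of Python; the word-count vectors below are invented for illustration:

```python
def estimate_theta(class_samples, alpha=1.0):
    """Smoothed feature probabilities for one class, per Eq. (6):
    theta_yi = (N_yi + alpha) / (N_y + alpha * n)."""
    n = len(class_samples[0])                                  # vocabulary size
    N_yi = [sum(x[i] for x in class_samples) for i in range(n)]
    N_y = sum(N_yi)
    return [(N_yi[i] + alpha) / (N_y + alpha * n) for i in range(n)]

# Hypothetical word counts for two documents of one class, 3-word vocabulary.
docs = [[2, 1, 0], [1, 0, 0]]
print([round(t, 3) for t in estimate_theta(docs)])  # [0.571, 0.286, 0.143]
```

Note that with the default Laplace smoothing ($\alpha = 1$) the third word, never seen in this class, still receives a non-zero probability, which is exactly what the smoothing term is for.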
Essentially, learning the SVM becomes an optimization problem in which $\frac{2}{\lVert w \rVert}$ represents the margin of separation, as stated by (10), (11)

$$\max_{w} \frac{2}{\lVert w \rVert} \quad \text{subject to} \quad w \cdot x_i + b \ge +1 \ \text{if}\ y_i = +1 \qquad (10)$$

$$\max_{w} \frac{2}{\lVert w \rVert} \quad \text{subject to} \quad w \cdot x_i + b \le -1 \ \text{if}\ y_i = -1, \ \text{for}\ i = 1\ \text{to}\ N \qquad (11)$$

Combining these two constraints gives the simplified Eq. (12)

$$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, N \qquad (12)$$
In terms of application, the idea was to feed in the features from TF-IDF, essentially the 'review' column, while the output y is a string from the 'condition' column.
It should also be noted that the Manhattan, Chebyshev, and Hamming distances are other metrics that can be used. The classifier algorithm follows two steps. Given a positive integer K, an unseen observation x, and a similarity metric d, the algorithm first runs through the whole dataset computing d between x and each training observation; the K points in the training data that are closest to x form the set A. Note that K is usually odd to prevent situations where a tie occurs.
The classifier then estimates the conditional probability for each class, i.e., the fraction of points in A having that class label. Here I is the indicator function, which evaluates to 1 only when its argument is true and 0 otherwise, as given by (14):

P(y = j | X = x) = (1/K) Σ_{i∈A} I(y^{(i)} = j)    (14)
Finally, the input x is assigned to the class having the largest probability.
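The two-step procedure and Eq. (14) can be sketched directly in NumPy; the points, labels, and value of K below are illustrative:

```python
# Minimal KNN classifier implementing Eq. (14): the class-conditional
# probability is the fraction of the K nearest points carrying each label.
import numpy as np

def knn_predict(X_train, y_train, x, K=3):
    d = np.linalg.norm(X_train - x, axis=1)      # Euclidean metric d
    A = np.argsort(d)[:K]                        # indices of the K closest points
    labels, counts = np.unique(y_train[A], return_counts=True)
    probs = counts / K                           # Eq. (14)
    return labels[np.argmax(probs)], dict(zip(labels, probs))

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array(["a", "a", "b", "b", "b"])
label, probs = knn_predict(X, y, np.array([1.0, 0.95]), K=3)
print(label, probs)   # the three closest points all carry class "b"
```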
The relationship between a medication's rating and the usefulness of its review was mapped; this suggested that if someone rates a drug as useful and leaves a positive review, others are likely to find that comment useful. Symptoms can indicate an underlying disease or can become one over time. From the dataset [2],
we analyzed the death rate for the top 20 most populous states in the United States.
Since the dataset presents us with five leading causes of deaths; Cancer, Chronic Lower
Respiratory Disease, Heart Disease, Stroke, and Unintentional Injury, a heatmap was
generated to better depict which states lose the most lives to certain conditions. Even though heart disease is the biggest killer in the United States [5], our graph shows that cancer causes the most deaths. This can be attributed to the fact that all forms of cancer are grouped together; the data does not distinguish between, say, throat cancer and lung cancer. Cardiovascular disease affects more people living in rural areas, while cancer affects more of those in urban areas [6].
decreased drastically. Since 2014, the drug Levonorgestrel has been the most widely used, and all five drugs were equally popular and recommended by doctors. The main motive is to predict the medical conditions that will be most common soon and the top medicines that can be taken to prevent them.
Analysis of the second dataset [2] revealed the deadliest diseases in the U.S.A. causing deaths between ages 50 and 85, from 2005 to 2015. As observed in Fig. 9, the deadliest disease was found to be cancer, followed by heart disease. Unintentional injuries fell into the least deadly category. The variation in colors depicts the change in percentage for the diseases over the years; darker shades depict the most recent years.
After the model is trained, the system allows users to enter their own feelings, which are then matched against the features corresponding to the medical conditions stored in the combined dataset. Based on that sentence, the system returns the best-predicted condition the user has, and then uses that condition to return the top 10 rated drugs (after grouping unique ‘conditions’ and ‘drug name’) used to treat that condition according to the dataset. To improve accuracy, stemming was used. Feature selection through ‘chi2’ revealed the important words describing each ‘condition’ in the form of unigrams and bigrams, as seen in Table 2. It should also be noted that this system was trained on reviews of the drugs taken; therefore, it can also predict which drug a user took, or the condition they had, from a review written after ingestion of the drug. Figure 10 presents the relationship between the popularity of a drug and its “usefulness”. The heat map in Fig. 11 shows the correct classifications and misclassifications when taking a smaller sample of each class.
We used data visualization to assess the accuracy of the resulting graphs. Analysis of the diseases and death-rate dataset revealed the leading causes of death in the United States, with cancer and heart disease being the deadliest of all, as they resulted in the most deaths. One way to mitigate this is by helping people identify a disease from its symptoms and recommending the best-rated drugs as early as possible, before the condition becomes life-threatening.
586 V. Doma et al.
Fig. 10. The relationship between ratings and how useful a drug is
Fig. 11. Confusion matrix showing correct classifications of the top 5 most common conditions
in regard to the testing set
Table 2. The most important words in a review, to which the classifier gives more preference while performing classification

Conditions           | Most important unigrams                                   | Symptoms
Acne                 | really worried, pill 039, skin, acne                      | face, dry skin, cystic acne
Birth Control        | headache, attacked, birth, control, period, relief, pain  | mood swings, birth control
Depression           | sadness, crying, hurt, relief, pain, annoying years       | anxiety depression, doc suggested, asked, depression anxiety
High Blood Pressure  | tired, relief, pain, levels, dropped, said, surgery       | high pressure, high heartbeat
Pain                 | surgery, relief, pain                                     | chronic pain, pain relief
Today, with an increase in diseases all over the world, the number of drugs for those diseases is also increasing, and the important factor is to determine which drugs are most suitable for a specific disease. The objective of our research is therefore to suggest to practitioners and medical students the best-recommended drug for a particular condition according to reviews of the drug. Medical data analysis is a research-based strategy in which specialists retrieve, clean, and visualize accessible qualitative and quantitative data from electronic health records. It was observed that ‘pain’ has the maximum number of different drugs used to treat it, as pain is the first thing that comes to mind when something is wrong. The accuracy of the machine learning models is visualized using a heat map. A symptom entered by the user returns the condition the person is suffering from, along with the top drugs that can be used to treat that condition. Since the user is returned a list of the top medications used to treat their condition, one must be wary of the possible ill effects of consuming a drug. Furthering this research could mean also returning the negative effects associated with each medication. Moreover, some medicines do not mix well with others, so precautions should be taken. Finally, every person is physiologically different: some conditions are more prevalent when gender, BMI, age, or even race is taken into consideration.
Acknowledgement. This research is partially supported by a grant from Amazon Web Services.
References
1. Gräßer, F., Kallumadi, S., Malberg, H., Zaunseder, S.: UCI Machine Learning Repository
(2019). http://archive.ics.uci.edu/ml, https://archive.ics.uci.edu/ml/datasets/Drug+Review+
Dataset+%28Drugs.com%29#Irvine. Accessed 08 Apr 2019
2. National Center for Health Statistics: NCHS Data Visualization Gallery - Potentially Excess Deaths in the United States. Centers for Disease Control and Prevention, 28 August 2017. https://www.cdc.gov/nchs/data-visualization/potentially-excess-deaths/. Accessed 27 Apr 2019
3. Larmuseau, P.: Symptom Disease sorting, Version 4 (public licence; created 4 March 2017, last updated 10 March 2017). https://www.kaggle.com/plarmuseau/sdsort/metadata
4. Viveka, S., Kalaavathi, B.: Review on clinical data mining with psychiatric adverse drug
reaction. In: 2016 World Conference on Futuristic Trends in Research and Innovation for
Social Welfare (Startup Conclave), Coimbatore, pp. 1–3 (2016). http://ieeexplore.ieee.org/
stamp/stamp.jsp?tp=&arnumber=7583945&isnumber=7583750
5. ACC News Story. CDC Report Shows CVD Still #1 Killer in US - American College of
Cardiology. American College of Cardiology, 3 December 2018. https://www.acc.org/latest-
in-cardiology/articles/2018/12/03/16/11/cdc-report-shows-cvd-still-1-killer-in-us. Accessed
8 May 2019
6. Rosenburg, J.: Cancer Surpasses CVD as Leading Cause of Death in High-Income Counties.
Ajmc.com, 13 November 2018. https://www.ajmc.com/focus-of-the-week/cancer-surpasses-
cvd-as-leading-cause-of-death-in-highincome-counties. Accessed 8 May 2019
7. Syverson, P., Reed, M., Goldschlag, D.: Private medical instances. J. Comput. Med. Data
(JCS) 5(3), 237–248 (1997)
8. Saint-Jean, F., Johnson, A., Boneh, D., Feigenbaum, J.: Private web search. In: Proceedings
of the 6th ACM Workshop on Privacy in the Electronic Society (WPES) (2007)
9. Levy, S., Gutwin, C.: Improving understanding of website privacy policies with fine-grained
policy anchors. In: Proceedings of the Conference on the World-Medical Data, pp. 480–488
(2005)
10. Romanosky, S.: FoxTor: helping protect your health while browsing online for conditions.
cups.cs.cmu.edu/foxtor
11. Khalid, S., Ali, M.S., Prieto-Alhambra, D.: Cluster analysis to detect patterns of drug use
from routinely collected medical data. In: 2018 IEEE 31st International Symposium on
Computer-Based Medical Systems (CBMS), Karlstad, pp. 194–198 (2018). http://ieeexplore.
ieee.org/stamp/stamp.jsp?tp=&arnumber=8417236&isnumber=8417175
12. Kanchan, B.D., Kishor, M.M.: Study of machine learning algorithms for special disease
prediction using principal of component analysis. In: 2016 International Conference on
Global Trends in Signal Processing, Information Computing and Communication
(ICGTSPICC), Jalgaon, pp. 5–10 (2016). http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=
&arnumber=7955260&isnumber=7955253
13. Gaydhani, A., Doma, V., Kendre, S., Bhagwat, L.: Detecting hate speech and offensive language on Twitter using machine learning: an N-gram and TFIDF based approach. CoRR abs/1809.08651 (2018)
14. Kanchan, B.D., Kishor, M.M.: Study of machine learning algorithms for special disease
prediction using principal of component analysis. In: 2016 International Conference on Global
Trends in Signal Processing, Information Computing and Communication (ICGTSPICC).
IEEE (2016). https://doi.org/10.1109/icgtspicc.2016.7955260
15. Scikit-learn developers (BSD License). SVM Example—scikit-learn 0.20.3 documentation.
Scikit-learn.org (n.d.). https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.
html#sphx-glr-auto-examples-linear-model-plot-ols-py. Accessed 20 Apr 2019
16. The SciPy community. Quickstart tutorial—NumPy v1.16 Manual. Scipy.org Sponsored by
Enthought (2019). https://docs.scipy.org/doc/numpy/user/quickstart.html. Accessed 30 Apr
2019
17. Hunter, J., Dale, D., Firing, E., Droettboom, M., Matplotlib Development Team: Beginner’s
Guide—Matplotlib 1.5.3 documentation. Matplotlib.org (2016). https://matplotlib.org/users/
beginner.html. Accessed 19 Apr 2019
18. Hunter, J., Dale, D., Firing, E., Droettboom, M., Matplotlib Development Team: Matplotlib
Pyplot Semilogx—Matplotlib 3.0.3 Documentation. Matplotlib.org (n.d.). https://matplotlib.
org/api/_as_gen/matplotlib.pyplot.semilogx.html. Accessed 26 Apr 2019
19. Droettboom, M., Matplotlib Development Team: Matplotlib Pyplot Semilogx—Matplotlib
3.0.3 Documentation. Matplotlib.org (n.d.). https://matplotlib.org/api/_as_gen/matplotlib.
pyplot.semilogx.html. Accessed 26 Apr 2019
20. Dateutil. Parser—dateutil 2.8.0 documentation. Readthedocs.io (2016). https://dateutil.
readthedocs.io/en/stable/parser.html. Accessed 30 Apr 2019
Conditional Random Fields Based
on Weighted Feature Difference Potential
for Remote Sensing Image Classification
1 Introduction
Hyper-spectral and multi-spectral image interpretation is one of the most challenging tasks in remote sensing. The goal of this task is automatic pixel-wise labeling.
Classification methods such as SVM [1,2], KNN [3,4], Boosting [5] and Random Forest [6,7] have frequently been applied in image interpretation. Yet these methods focus only on the features of each pixel, ignoring contextual information, so classification errors are particularly significant when pixel-feature ambiguity occurs. Some works based on Markov random fields (MRF) [8,9] attempted to remove feature ambiguity by introducing contextual information. However, the MRF model considers only the contextual information of the labels and ignores the observation of the image, which may cause the label bias problem [10].
To overcome this disadvantage of MRF, conditional random fields (CRF) [11] further consider the contextual information of the features of adjacent pixels, constructing a feature-difference pairwise potential to measure the similarity between two adjacent sites (pixels). However, the current pairwise potential often treats each feature as equally important, which fails to highlight the role of key features. In this work, we propose a weighted feature difference potential (WFDP) that lets key features play more important roles through different weights. That is, in the pairwise potential, different features should make different contributions, reflected by their weights.
The WFDP-based CRF model needs an efficient optimization method in practice. Maximum likelihood (ML) optimization involves evaluating the partition function of the CRF model, which is provably NP-hard. To avoid the partition function, the max-margin method [12] converts maximizing the likelihood into minimizing the energy with respect to the potential functions. Joachims et al. [13] introduced the cutting-plane algorithm [14] into the max-margin method [12], which efficiently solves the constrained optimization problem in polynomial time. Szummer et al. [15] extended this method [13] to the optimization of CRF models. We are thus inspired by [15] to use max-margin optimization with cutting planes in the proposed WFDP-CRF model. Our method thereby achieves better performance on hyper-spectral and multi-spectral remote sensing image datasets.
Here x_i denotes the ith site (pixel) and y_i the label of x_i. The site set is X = {x_i}, 1 ≤ i ≤ N, where N is the number of pixels in the image. The label set is Y = {y_i}, 1 ≤ y_i ≤ C, where C is the number of categories. ϕ(·) is the unary potential function and ψ(·) the pairwise potential function. The parameters w of the unary potential and v of the pairwise potential are collected into a set u = {w, v}.
There are a variety of definitions [12,16] of the unary potential function.
Following [12], we utilize the linear form:
ϕ(x_i, y_i, w) = Σ_{l=1}^{C} w_l^T · h(x_i) · δ(y_i = l)    (3)
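A minimal numeric sketch of Eq. (3): stacking the per-class weight vectors w_l as rows of a matrix, the δ term simply selects the inner product for the given label; all values below are illustrative:

```python
# Sketch of the linear unary potential in Eq. (3): for site x_i with feature
# vector h(x_i), the potential under label l is w_l^T h(x_i).
import numpy as np

C, d = 3, 4                        # C classes, d-dimensional features
rng = np.random.default_rng(0)
W = rng.normal(size=(C, d))        # row l holds w_l
h_xi = rng.normal(size=d)          # feature vector h(x_i) for one site

phi = W @ h_xi                     # phi[l] = w_l^T h(x_i)
y_i = 2                            # the delta term picks out the row for y_i
print(phi[y_i])                    # unary potential phi(x_i, y_i, w)
```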
592 Y. Sun et al.
3 Method
We first pursue the parameters of the CRF by optimizing a max-margin learning objective on training samples. Then, we predict the labels of testing samples with the trained CRF model.
Optimizing the likelihood function w.r.t. a graph model tends to involve evaluating the partition function Z of the CRF model, which is provably NP-hard. To avoid processing the partition function Z, Szummer et al. [15] introduced the max-margin learning method with cutting planes into the CRF model. Inspired by [15], we likewise employ the max-margin method to learn the WFDP-based CRF model. The energy functions w.r.t. the Ground Truth (GT) and the inferred labels of the training samples should satisfy the following inequality:

E(X, y*; u) ≤ E(X, ŷ; u),  for all ŷ ≠ y*    (8)

where y* denotes the GT label set of all training samples and ŷ is the estimated label set of these training samples during learning. Equation (8) means that the minimum energy is attained only at the Ground Truth label set. That is, the global probability p(y|x, u) of the CRF model on the training samples is maximized at the Ground Truth, which is the idea of maximum likelihood.
We illustrate the max-margin optimization in Algorithm 1, which is based on the cutting-plane method [13]. Before learning, we establish the WFDP-CRF model with initialized parameters and an empty constraint set S.
Then the learning process starts. First, we use the current WFDP-CRF model to infer the labels of the training samples with the Graph Cut (GC) inference method [18]. Second, we add the estimated labelings that differ from the GT labels to the constraint set S. Third, we optimize the max-margin objective function under the constraint inequality in Eq. (9) to obtain the model parameters, and update the model with them. In Eq. (9), a 0–1 loss function Δ(ŷ, y*) and a slack variable ξ_n are added to the constraint inequality. The 0–1 loss Δ(ŷ, y*) counts the number of incorrectly estimated labels, enforcing a relatively larger margin for labelings far from the ground truth, while the slack variable ξ_n relaxes the margin constraints. Note that the v ≥ 0 constraint in Eq. (9) is needed to satisfy the precondition [19] of GC inference. We repeat this iterative process until convergence.
For the quadratic programming problem in Eq. (9), an important task in this paper is to calculate the energy difference between the estimated labels and the ground truth on the constraint set S. Next, we present the derivation of this energy difference. According to Eq. (3), the energy of the unary potential on the label set {y} is written as
Σ_{i∈X} ϕ(x_i, y_i, w) = Σ_{i∈X} Σ_{l=1}^{C} w_l^T · h(x_i) · δ(y_i = l)
                       = Σ_{l=1}^{C} w_l^T · Σ_{i∈X} h(x_i) · δ(y_i = l).    (10)
Then the energy difference of the unary potential between the estimated-label set {ŷ} and the ground-truth set {y*} is

Dϕ = Σ_{i∈X} ϕ(x_i, ŷ_i, w) − Σ_{i∈X} ϕ(x_i, y_i*, w)
   = Σ_{l=1}^{C} w_l^T Σ_{i∈X} h(x_i) · δ(ŷ_i = l) − Σ_{l=1}^{C} w_l^T Σ_{i∈X} h(x_i) · δ(y_i* = l)    (11)
   = Σ_{l=1}^{C} w_l^T Σ_{i∈X} h(x_i) · [δ(ŷ_i = l) − δ(y_i* = l)].
For the pairwise potential, according to Eq. (6), the energy difference of the WFDP over all edges between {ŷ} and {y*} is

Dψ = Σ_{(i,j)∈ε} ψ(x_i, x_j, ŷ_i, ŷ_j, v) − Σ_{(i,j)∈ε} ψ(x_i, x_j, y_i*, y_j*, v)
   = Σ_{(i,j)∈ε} Σ_{k=1}^{M} v_k · e^{−(x_ik − x_jk)² / (2σ²)} · δ(ŷ_i = ŷ_j) − Σ_{(i,j)∈ε} Σ_{k=1}^{M} v_k · e^{−(x_ik − x_jk)² / (2σ²)} · δ(y_i* = y_j*)    (13)
   = Σ_{k=1}^{M} v_k Σ_{(i,j)∈ε} [δ(ŷ_i = ŷ_j) − δ(y_i* = y_j*)] · e^{−(x_ik − x_jk)² / (2σ²)}.
Thus, the total energy difference of the unary and pairwise potentials between {ŷ} and {y*} is

D = Σ_{l=1}^{C} w_l^T Σ_{i∈X} h(x_i) · [δ(ŷ_i = l) − δ(y_i* = l)] + Σ_{k=1}^{M} v_k Σ_{(i,j)∈ε} [δ(ŷ_i = ŷ_j) − δ(y_i* = y_j*)] · e^{−(x_ik − x_jk)² / (2σ²)}.    (14)
Substitute Eq. (14) into Eq. (9) to obtain the constraint inequality, and then
solve the Quadratic Program.
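The energy differences of Eqs. (11), (13) and (14) can be sketched numerically; the toy graph, features, and weights below are illustrative assumptions:

```python
# Sketch of the total energy difference in Eq. (14) between an estimated
# labeling y_hat and the ground truth y_star on a tiny chain graph.
import numpy as np

rng = np.random.default_rng(1)
N, d, C, M, sigma = 5, 3, 2, 3, 1.0
H = rng.normal(size=(N, d))         # unary features h(x_i)
F = rng.normal(size=(N, M))         # the M per-site features used by the WFDP
W = rng.normal(size=(C, d))         # unary weights w_l, one row per class
v = np.abs(rng.normal(size=M))      # pairwise weights, v >= 0 (GC precondition)
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
y_hat = np.array([0, 0, 1, 1, 1])   # estimated labeling
y_star = np.array([0, 1, 1, 1, 0])  # ground truth

# Unary energy difference, Eq. (11)
D_phi = sum(W[y_hat[i]] @ H[i] - W[y_star[i]] @ H[i] for i in range(N))

# Pairwise (WFDP) energy difference, Eq. (13)
D_psi = 0.0
for i, j in edges:
    kernel = np.exp(-(F[i] - F[j]) ** 2 / (2 * sigma ** 2))   # M Gaussian terms
    D_psi += (v @ kernel) * (float(y_hat[i] == y_hat[j]) - float(y_star[i] == y_star[j]))

D = D_phi + D_psi   # total energy difference, Eq. (14)
print(D)
```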
3.2 Inference
Fig. 1. Washington D.C. HYDICE Dataset. (a) RGB pseudo-color image and (b)
Ground-truth image.
4 Experiments
4.1 Datasets
consists of blue, green, red, and near-infrared spectral channels. The labeled-
pixel number is 50650. The dataset contains the following six classes: Corn field,
Paddy field, Cotton field, Path, Road, and Roof. The labeled-pixel number of
each category is shown in Fig. 2.
The third dataset is a multi-spectral image of the northern suburbs of Hobart, Australia, taken in August 2010 by the GEOEYE-1 satellite sensor. The image comprises 859 × 593 pixels with four bands consisting of blue, green, red, and near-infrared spectral channels. The labeled-pixel number is 58783. The image includes five land cover/use classes: Bare, Road, Roof, Tree, and Grass. The pixel number of each category is shown in Fig. 3.
Fig. 2. Wuhan Resources Satellite III Dataset. (a) RGB pseudo-color image and (b)
Ground-Truth image.
The features used differ across datasets. For the hyper-spectral image data, owing to its rich spectral information, we directly employ the spectral values of the 191 bands as features.
For the two multi-spectral image datasets, i.e. the Wuhan and Hobart datasets, whose spectral channels are both four bands, we utilize the Filter Bank feature-extraction method [20,21], in which each band is extended to 11-dimensional features. In total, 44-dimensional features are generated via the Filter Bank. The four bands themselves can also serve as features, so we obtain 48-dimensional features on the Wuhan and Hobart datasets.
Fig. 3. Hobart GEOEYE-1 Satellite Dataset. (a) RGB pseudo-color image and (b)
Ground-Truth image.
As can be seen from Table 1, the performance of WFDP-CRF is superior to that of the other three methods. The AOA of WFDP-CRF is 2.32%
¹ C_i denotes the classification accuracy of the ith category.
Fig. 4. Classification results on the Washington D.C. Mall Dataset. (a) is the results
of SSVM, (b) is the results of RF, (c) is the results of FDP-CRF, (d) is the results of
WFDP-CRF (our method), and (e) is the RGB pseudo-color image of this dataset.
higher than that of FDP-CRF, which ranked second. The KAPPA value of WFDP-CRF is 0.0243 higher than that of FDP-CRF as well. We notice that SSVM achieves only 89.80% OA and 89.12% AOA, weaker than the other methods. This may stem from the fact that SSVM does not use contextual information. Although RF also does not use any spatial dependencies, it is an ensemble learning method, in which we employ 150 decision trees to vote on the inference results; thus RF improves significantly over SSVM. Owing to its use of contextual information, i.e., the feature-distance pairwise potential, the AOA of FDP-CRF is a further 1.5% higher than that of RF. The contextual information may play an important role in eliminating ambiguity. Our method achieves 97.12% OA and 96.85% AOA, better than the 94.91% OA and 94.53% AOA of FDP-CRF, because we further assign different weights to the different feature distances in the pairwise potential. Figure 4 shows the inferred-label images of all the methods on the Washington D.C. Mall dataset.
For the Wuhan and Hobart multi-spectral datasets, we again randomly select 10% of the labeled samples as a training set, leaving the remaining labeled samples as a test set. The corresponding experimental results on the Wuhan and Hobart datasets are listed in Tables 2 and 3, respectively. As shown in Table 2, WFDP-CRF obtained 94.86% OA and 94.56% AOA, better than the 91.62% OA and 91.35% AOA of FDP-CRF. Also, for the results on the Hobart dataset in Table 3, the OA and AOA of WFDP-CRF are both over 2.1% higher than those of FDP-CRF. On these two datasets, FDP-CRF and RF have comparable performance, and SSVM is still weaker than all the other methods. The experimental results on these two datasets further demonstrate that WFDP improves the classification performance of CRF. The corresponding inferred-label images are shown in Figs. 5 and 6, respectively.
Fig. 5. Classification results on the Wuhan Dataset. (a) is the results of SSVM, (b) is
the results of RF, (c) is the results of FDP-CRF, (d) is the results of WFDP-CRF (our
method), and (e) is the RGB pseudo-color image of this dataset.
Fig. 6. Classification results on the Hobart Dataset. (a) is the results of SSVM, (b) is
the results of RF, (c) is the results of FDP-CRF, (d) is the results of WFDP-CRF (our
method), and (e) is the RGB pseudo-color image of this dataset.
5 Conclusion
This paper is devoted to studying a remote sensing image classification method based on conditional random fields. Introducing contextual information, characterized by the pairwise potential of the CRF, can effectively eliminate the ambiguity of samples. Traditional pairwise potentials such as the FDP tend to treat the distance of each feature equally, while our approach generates a more flexible contextual relationship by assigning different weights to different feature distances, improving the CRF model's ability to eliminate ambiguity. Experimental results on three remote sensing image datasets demonstrate that our method has better classification performance and support the practical application of WFDP-CRF.
References
1. Chandra, M.A., Bedi, S.S.: Survey on SVM and their application in image classification. Int. J. Inf. Technol. 1–11 (2018)
2. Mountrakis, G., Im, J., Ogole, C.: Support vector machines in remote sensing: a
review. ISPRS J. Photogram. Remote Sens. 66(3), 247–259 (2011)
3. Blanzieri, E., Melgani, F.: Nearest neighbor classification of remote sensing images
with the maximal margin principle. IEEE Trans. Geosci. Remote Sens. 46(6),
1804–1811 (2008)
4. Bo, C., Huchuan, L., Wang, D.: Spectral-spatial k-nearest neighbor approach
for hyperspectral image classification. Multimed. Tools Appl. 77(9), 10419–10436
(2018)
5. Qi, C., Zhou, Z., Sun, Y., Song, H., Lishuan, H., Wang, Q.: Feature selection and
multiple kernel boosting framework based on PSO with mutation mechanism for
hyperspectral classification. Neurocomputing 220, 181–190 (2017)
6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
7. Xia, J., Ghamisi, P., Yokoya, N., Iwasaki, A.: Random forest ensembles and
extended multiextinction profiles for hyperspectral image classification. IEEE
Trans. Geosci. Remote Sens. 56(1), 202–216 (2018)
8. Cao, X., Zhou, F., Lin, X., Meng, D., Zongben, X., Paisley, J.: Hyperspectral image
classification with markov random fields and a convolutional neural network. IEEE
Trans. Image Process. 27(5), 2354–2367 (2018)
9. Jackson, Q., Landgrebe, D.A.: Adaptive Bayesian contextual classification based
on markov random fields. IEEE Trans. Geosci. Remote Sens. 40(11), 2454–2463
(2002)
10. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML 2001: Proceedings of the Eighteenth International Conference on Machine Learning (2001)
11. Sutton, C., McCallum, A., et al.: An introduction to conditional random fields. Found. Trends Mach. Learn. 4(4), 267–373 (2012)
12. Taskar, B., Chatalbashev, V., Koller, D., Guestrin, C.: Learning structured pre-
diction models: a large margin approach. In: Proceedings of the 22nd International
Conference on Machine Learning, pp. 896–903. ACM (2005)
13. Joachims, T., Finley, T., Yu, C.-N.J.: Cutting-plane training of structural SVMs. Mach. Learn. 77(1), 27–59 (2009)
14. Kelley Jr., J.E.: The cutting-plane method for solving convex programs. J. Soc.
Ind. Appl. Math. 8(4), 703–712 (1960)
15. Szummer, M., Kohli, P., Hoiem, D.: Learning CRFs using graph cuts. In: European
Conference on Computer Vision, pp. 582–595. Springer (2008)
16. Kumar, S., Hebert, M.: Multiclass discriminative fields for parts-based object
detection. In: Snowbird Learning Workshop, vol. 164 (2004)
17. Boykov, Y.Y., Jolly, M.-P.: Interactive graph cuts for optimal boundary & region
segmentation of objects in ND images. In: Proceedings Eighth IEEE International
Conference on Computer Vision, ICCV 2001, vol. 1, pp. 105–112. IEEE (2001)
18. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction
using iterated graph cuts. In: ACM Transactions on Graphics (TOG), vol. 23, pp.
309–314. ACM (2004)
19. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via
graph cuts. In: Proceedings of the Seventh IEEE International Conference on Com-
puter Vision, vol. 1, pp. 377–384. IEEE (1999)
20. Varma, M., Zisserman, A.: A statistical approach to texture classification from
single images. Int. J. Comput. Vis. 62(1–2), 61–81 (2005)
21. Andreetto, M., Zelnik-Manor, L., Perona, P.: Unsupervised learning of categorical
segments in image collections. IEEE Trans. Pattern Anal. Mach. Intell. 34(9),
1842–1855 (2011)
22. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods
for structured and interdependent output variables. J. Mach. Learn. Res. 6, 1453–
1484 (2005)
Feature Selection Using Flower Pollination
Optimization to Diagnose Lung Cancer
from CT Images
1 Introduction
The detection of various structures in medical imaging being the need of the hour
requires high end technological advancements. There exists a range of medical imaging
techniques to carry over the detection of abnormalities in lung. One such technique
used widely by the practitioners of medicine due to the numerous benefits is Computer
Aided Diagnosis (CAD). CAD systems are developed to detect the cancerous cells
including the assessment of its growth over a period [1].
Many different medical imaging techniques are available at present, including Magnetic Resonance Imaging (MRI), X-ray, and Computerized Tomography (CT) scans [2]. In this work, CT images are used. Enormous numbers of CT image slices are produced for a
2 Literature Survey
In this section, the works related to lung segmentation and feature selection are discussed.
606 D. S. Johnson et al.
Dhalia Sweetlin et al. [19] proposed a CAD system that improves the diagnosis of
pulmonary tuberculosis. The lung regions have been extracted using region growing
and edge reconstruction algorithms from which the ROIs are extracted. A wrapper-
based approach that combines cuckoo search and one-against-all SVM classifier has
been used for optimal selection of texture features. The algorithm selected 47 features with a classification accuracy of 92.77%. The authors have also proposed a hybrid feature selection approach using ant colony optimization with tandem-run recruitment [20] to reduce the number of features used in the diagnosis of bronchitis.
Experiments on public and private datasets against Particle Swarm Optimization (PSO) [12, 21], Cuckoo Search (CS) [19] and Ant Colony Optimization (ACO) [20] have demonstrated the suitability of FPA for feature selection. The experimental setting evaluated recognition rates, convergence speed, number of selected features, and computational load. All techniques obtained similar recognition rates, showing that FPA is also suitable for feature selection tasks, since its results are comparable to those obtained by state-of-the-art evolutionary techniques.
Yang et al. [22] proposed a new feature selection algorithm for finding an optimal feature set using a metaheuristic approach called Swarm Search. Simulation experiments tested Swarm Search on a high-dimensional dataset with different classification metaheuristic algorithms, and it was observed to achieve good results. Though many works exist to diagnose cancerous lung tumors, this is the first work in the field to combine spline-curve-based segmentation with a nature-inspired feature selection approach.
The proposed approach is divided into many modules and is collectively illustrated in
Fig. 1. The modules are explained in this section.
Fig. 1. Modules of the proposed approach: spline construction, segmentation of the diseased lung, candidate feature extraction, feature-set selection, classifier training, classification, and performance evaluation.
Separation of Left and Right Lung. The left and right lungs are isolated from the binary image. This is done using a filter which, on subtraction and negation from the full image, gives the left and consequently the right lung images.
Input: A stack of binarized lung images.
Process: The largest connected component is determined in each binarized lung image; removing this component from the entire image isolates the two lungs.
Step 1: Identify the largest connected component in the image.
Step 2: Store this component and subtract this from the binarized image.
Step 3: Obtain and store the isolated image of a lung.
Step 4: Invert the image obtained in Step 3 to get the lung in black pixels and the background in white.
Step 5: Store the separated lungs.
Output: A stack of isolated left and right lungs stored in a directory.
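A sketch of the separation procedure using SciPy's connected-component labeling; the tiny binary "slice" is an illustrative stand-in for a real binarized CT image:

```python
# Sketch: drop the largest connected component, then label and split the
# remaining blobs into left and right "lungs" by centroid column.
import numpy as np
from scipy import ndimage

img = np.zeros((7, 11), dtype=int)
img[0, :] = img[-1, :] = 1          # border ring = largest component
img[:, 0] = img[:, -1] = 1
img[2:5, 2:4] = 1                   # "left lung" blob
img[2:5, 7:9] = 1                   # "right lung" blob

lbl, n = ndimage.label(img)                            # Step 1: components
sizes = ndimage.sum(img, lbl, index=range(1, n + 1))
largest = int(np.argmax(sizes)) + 1
lungs = (lbl > 0) & (lbl != largest)                   # Step 2: subtract largest

lbl2, n2 = ndimage.label(lungs)                        # Steps 3-5: isolate lungs
centroids = ndimage.center_of_mass(lungs, lbl2, range(1, n2 + 1))
order = np.argsort([c[1] for c in centroids])          # left-to-right by column
left_lung = lbl2 == order[0] + 1
right_lung = lbl2 == order[1] + 1
print(n2, left_lung.sum(), right_lung.sum())
```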
Slope(m) = (Control(1, y0) − Control(1, y)) / (Control(1, x0) − Control(1, x))    (2)
Step 7: Compute the difference in slope; if it is 0, record the first control point of that sequence of 0s.
Step 8: Fix equally spaced control points along the regular shape of the lung by dividing the major axis into 16 regions.
Step 9: Combine all such control points into a single array.
Output: An array of control points to be used for model generation.
C = ((t2 − t)/(t2 − t1)) B1 + ((t − t1)/(t2 − t1)) B2   (3)

where
B1 = ((t2 − t)/(t2 − t0)) A1 + ((t − t0)/(t2 − t0)) A2   (4)

B2 = ((t3 − t)/(t3 − t1)) A2 + ((t − t1)/(t3 − t1)) A3   (5)

A1 = ((t1 − t)/(t1 − t0)) P0 + ((t − t0)/(t1 − t0)) P1   (6)

A2 = ((t2 − t)/(t2 − t1)) P1 + ((t − t1)/(t2 − t1)) P2   (7)

A3 = ((t3 − t)/(t3 − t2)) P2 + ((t − t2)/(t3 − t2)) P3   (8)
where P0, P1, P2 and P3 represent consecutive control points and t, t0, t1, t2 and t3
represent the knot sequence that decides the interpolation.
Step 4: Construct a model plotting the interpolation of control points to define a
reference model.
Output: A model that holds the reference structure of the lung.
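Assuming a NumPy environment, Eqs. (3)–(8) transcribe directly into a point evaluator (a sketch, not the authors' implementation); with a uniform knot vector the curve passes through P1 and P2:

```python
import numpy as np

def catmull_rom_point(P, knots, t):
    """Evaluate the Catmull-Rom spline of Eqs. (3)-(8) at parameter t,
    given four control points P = [P0..P3] and knots [t0..t3]."""
    P0, P1, P2, P3 = (np.asarray(p, dtype=float) for p in P)
    t0, t1, t2, t3 = knots
    A1 = (t1 - t) / (t1 - t0) * P0 + (t - t0) / (t1 - t0) * P1    # Eq. (6)
    A2 = (t2 - t) / (t2 - t1) * P1 + (t - t1) / (t2 - t1) * P2    # Eq. (7)
    A3 = (t3 - t) / (t3 - t2) * P2 + (t - t2) / (t3 - t2) * P3    # Eq. (8)
    B1 = (t2 - t) / (t2 - t0) * A1 + (t - t0) / (t2 - t0) * A2    # Eq. (4)
    B2 = (t3 - t) / (t3 - t1) * A2 + (t - t1) / (t3 - t1) * A3    # Eq. (5)
    return (t2 - t) / (t2 - t1) * B1 + (t - t1) / (t2 - t1) * B2  # Eq. (3)
```

Sampling t in [t1, t2] for each sliding window of four control points traces the interpolating segment between P1 and P2, which is how the reference model curve is drawn through the fixed control points.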
Step 5: Segment the diseased lung based on this ratio factor, even along the
boundary, as it retains the shape characteristics of the model structure.
Output: A segmented lung.
where T, the threshold value, is set after repeated trials in this work.
Output: The ROIs are obtained for feature extraction.
Fitness = (TP + TN) / (FP + FN + TP + TN)   (11)
L(λ) = (λ Γ(λ) sin(πλ/2) / π) · (1 / s^(1+λ)),   s ≫ s0 > 0   (12)

where Γ is the gamma function and s is the step size.
Step 7: Perform the exploration described by Eq. (13) and update the values
accordingly.
xi^(t+1) = xi^t + α·L(λ)·(g − xi^t)   (13)
where g is the global best value determined and L(λ) is the Lévy step value.
Step 8: Perform exploitation (self-pollination) as in Eq. (14), where ε is drawn
from a normal distribution on [0, 1].
xi^(t+1) = xi^t + ε·(xj^t − xk^t)   (14)
Step 9: Perform Steps 7 and 8 only for those features that have been selected to
obtain the global best, or for features of the current feature set that hold better
prospects than the global best features.
Step 10: After every exploration or exploitation step, the feature is binarized for
further computation using Eqs. (15) and (16), where r is drawn from a normal
distribution on [0, 1].
S(xi^j(t)) = 1 / (1 + e^(−xi^j(t)))   (15)

xi^j(t) = 1, if S(xi^j(t)) > r; 0, otherwise   (16)
Step 11: Obtain the fitness value for that feature set and update the global best if
the fitness value is greater.
Step 12: Steps 5–10 are repeated for every random feature set chosen; its values are
changed and its corresponding binary array is updated.
Step 13: After all the iterations are complete, the global best is the accuracy of the
classifier, and the binary array corresponding to that accuracy determines the optimal
subset of features selected by BFPA.
Output: A subset of features to provide best accuracy on classification.
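The steps above can be sketched as follows, assuming NumPy; the helper name, population size and the heavy-tailed Cauchy stand-in for the Lévy step of Eq. (12) are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def bfpa_select(fitness, n_features, n_flowers=10, n_iter=30,
                p_switch=0.8, alpha=0.1, seed=0):
    """Sketch of binary flower pollination (Eqs. 13-16). `fitness` maps a
    boolean feature mask to a score to maximize."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_flowers, n_features))        # continuous positions
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))            # Eq. (15)
    masks = sig(X) > rng.random(X.shape)                # Eq. (16)
    scores = np.array([fitness(m) for m in masks])
    best = scores.argmax()
    g, best_mask, best_score = X[best].copy(), masks[best].copy(), scores[best]
    for _ in range(n_iter):
        for i in range(n_flowers):
            if rng.random() < p_switch:                 # global pollination, Eq. (13)
                X[i] += alpha * rng.standard_cauchy(n_features) * (g - X[i])
            else:                                       # local pollination, Eq. (14)
                j, k = rng.integers(0, n_flowers, size=2)
                X[i] += rng.random() * (X[j] - X[k])
            X[i] = np.clip(X[i], -10.0, 10.0)           # keep the sigmoid tame
            mask = sig(X[i]) > rng.random(n_features)   # binarize, Eqs. (15)-(16)
            score = fitness(mask)
            if score > best_score:                      # Step 11: update global best
                g, best_mask, best_score = X[i].copy(), mask.copy(), score
    return best_mask, best_score
```

In the paper, `fitness` is the wrapper objective of Eq. (11), i.e. the classification accuracy obtained when training on the masked feature subset.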
4 Results
The input CT images, containing normal and cancerous cases, were collected from
hospitals. These images are processed and their results are presented. Figure 2 shows
the input image, the binarized lung image after background removal, and the separated
right and left lungs.
Fig. 2. (a) Input image, (b) background removal, (c) isolated left, (d) isolated right
The reference lung image is selected from the CT stack as the lung with maximum area.
The left and right halves belong to the same CT stack of the same patient but may
come from different slices within the stack. The left and right lung reference images
and the corresponding zoomed images thus obtained are shown in Fig. 3(a) through (d).
Fig. 3. (a) Reference left, (b) reference right, (c) zoomed left lung with control points,
(d) zoomed right lung with control points.
The sequence of control points is interpolated using Catmull ROM spline to gen-
erate the left and right lung shape model and is shown in Fig. 4(a) and (b).
616 D. S. Johnson et al.
Fig. 4. (a) and (b) Left and right lung models generated using spline curves
The suspected and diseased CT slices are identified with the help of an expert.
These slices are preprocessed, and the model generated using the spline curve is placed
over the diseased image, as shown in Fig. 5(a) through (d).
Fig. 5. (a) Left lung with initial fit, (b) left lung with final fit, (c) right lung with initial fit,
(d) right lung with final fit
The original gray scale pixels are superimposed onto the binarized lung to facilitate
ROI extraction. This can be seen in Fig. 6(a) and (b).
Fig. 6. (a) and (b) Grayscale pixels superimposed onto the binarized lungs for ROI extraction
The fitness value of the feature selection algorithm over the various iterations is
illustrated in Fig. 8.
The binary flower pollination algorithm is compared with various filter methods
using measures such as information gain, gain ratio and Relief attribute evaluation;
considering all 56 features, these resulted in an accuracy of 80%. Principal component
analysis selected 6 features, yielding an accuracy of 68%. CFS subset evaluation
selected a single feature with an accuracy of 76%. In comparison, the binary flower
pollination algorithm selected features that yielded a better accuracy of about 84%.
The number of features selected by each algorithm is shown in Fig. 9.
Fig. 8. Fitness (classification accuracy) of the feature selection algorithm over 10 iterations, ranging from about 0.76 to 0.86.
The selected features are trained with various classifiers to find the best one. The
classifier parameters are then tuned to find the best accuracy each classifier can yield
on the data. The accuracy obtained when the data is classified with the various
algorithms is shown in Fig. 10; the SVM classifier with a linear kernel function
yielded higher accuracy than the other classifiers.
5 Conclusion
In this work, a spline curve-based segmentation approach is used to segment the lungs,
and a binary flower pollination optimization technique is used to select the optimal
image features for classification. From the extracted features, 33 are selected by the
wrapper-based BFPA, giving 84% accuracy in diagnosing lung cancer. Though there
are many existing works in this domain, this work uses a flower pollination
optimization approach to select features. The accuracy of the approach can be improved
if the segmentation technique is optimized to reduce overfitting of the reference lung
image over the diseased lung image. The features extracted from the ROIs could be
increased for more accurate results, and improvements to the BFPA algorithm are
likely to yield better results. Also, this work concentrates on lung cancer alone and
could be extended to other lung diseases.
References
1. Firmino, M., Morais, A.H., Mendoça, R.M., Dantas, M.R., Hekis, H.R., Valentim, R.:
Computer-aided detection system for lung cancer in computed tomography scans: review
and future prospects. BioMed. Eng. Online 13(41), 1–16 (2014)
2. Woods, R.E., Gonzalez, R.C.: Digital Image Processing, 3rd edn. Pearson, London (2016)
3. Dhalia Sweetlin, J., Nehemiah, H.K., Kannan, A.: Patient-specific model based segmentation
of lung computed tomographic images. J. Inf. Sci. Eng. 32, 1373–1394 (2016)
4. Ma, Z., Manuel, J., Tavares, R.S., Jorge, R.M.N.: A review on the current segmentation
algorithms for medical images. In: 1st International Conference Proceedings on Computer
Imaging Theory and Applications, pp. 135–140, Lisbon, Portugal (2009)
5. Cootes, T.F., Cooper, D., Taylor, C.J., Graham, J.: Active shape models-their training and
applications. Comput. Vis. Image Underst. 61(1), 38–59 (1995)
6. Williams, D.J., Shah, M.: A fast algorithm for active contours and curvature estimation.
J. CVGIP: Image Underst. 55, 14–26 (1992)
1 Introduction
With the rise of the internet, social media and online discussion forums, a new kind
of bullying is taking place that does not occur in the classroom, at home or in our
neighborhood; it happens online, is carried out on the internet, and is called
cyberbullying. Cyberbullying usually takes the form of hate messages sent through
e-mail or insulting online social commentary; it includes verbal attacks on one's body
type, appearance, race or color. Cyberbullying can lead to depression, low self-esteem,
low confidence, self-harm or, in some cases, suicide. To deal with this problem, the
PTA (Parent Teachers Association) in Japan started a net-patrol in which they monitor
websites for such activities and, whenever they detect one, contact the internet provider
or website administrator to remove the content [2]. This whole process, including the
record keeping, is carried out manually, demanding considerable psychological effort
and exerting a lot of mental pressure.
In this paper, a method is proposed that requires very little or no human involvement.
The authors have developed machine learning classifiers based on supervised learning;
these classifiers are trained on the dataset from [1], which contains both neutral and
insulting text with their respective labels. With the proposed technique, the authors
are able to catch cyberbullying with little or no human involvement. The proposed
mechanism achieves 84.4% accuracy in detecting cyberbullying.
2 Related Work
Various researchers have attempted to automate the detection of cyberbullying. For
example, researchers in [2] tried to detect harmful or insulting entries based on
category relevance maximization. Their method has three steps: phrase extraction,
categorization with harmfulness polarity and, in the final stage, relevance
maximization. For test data with 50% harmful or insulting entries, precision was
between 49% and 79%; for test data containing 12% harmful or insulting entries,
precision was between 10% and 61%. In [3], researchers used data gathered from
Formspring.me, a website containing a lot of bullying content. This data was labelled
using Amazon's Mechanical Turk [3], and the Weka tool [3] was then employed to
train a C4.5 decision tree model, achieving 78.5% accuracy in recognizing true
positives. Researchers in [5] took a corpus of posts from the Formspring.me website
and report results in two parts. In the first part, an experiment was performed to
identify the specific words and contexts used for bullying; the most relevant
cyberbullying terms were identified so that content can be queried to check whether it
constitutes cyberbullying. In the second part, machine learning was used to detect
cyberbullying, with the social media posts scoring highest containing the most
cyberbullying content. Most research on cyberbullying detection focuses on the
content but not on the features of the bullying [7]; the content used by a harasser
depends on features such as gender, age, race and skin tone. In [4], researchers used
the content together with the gender of the harasser to train an SVM (Support Vector
Machine) classifier, sharpening the classifier's ability to discriminate gender-specific
harassment. For the baseline, precision, recall and F-measure were 0.31, 0.15 and
0.20, respectively; with the gender-based approach they improved to 0.43, 0.16 and
0.23. The paper [6] goes one step further, proposing to incorporate user characteristics
and information as context before and after the harassing interaction; with this
approach, the accuracy of cyberbullying detection improved. In [11], researchers used
weakly supervised machine learning to train a model from a small existing vocabulary
related to bullying, and then applied that model to a huge corpus of unlabeled data to
evaluate whether an interaction is bullying. The authors in [12] proposed a unique set
of features for cyberbullying detection; compared with baseline features, the results
were markedly better, achieving an F-measure of 0.936. In our research, standard
supervised machine learning was used in combination with ensemble learning, which
helped to drastically improve the performance of the cyberbullying detection
mechanism.
Detecting Cyberbullying in Social Commentary 623
3 Text Classification
Assigning tags and categories to text based on its content is called text classification.
It is a fundamental NLP (Natural Language Processing) task; with text classification
one can perform spam detection, sentiment analysis and much more. In this age of the
Internet, where there is a lot of unstructured data in the form of text, extracting the
right information from resources such as the internet, text documents, news articles,
electronic mail and databases is challenging [8]. In this article, the goal of identifying
social commentary as bullying cannot be achieved without text classification. The
information contained in the text is extremely rich, but due to its unstructured nature
it is hard to gain insight from these comments [10]. Therefore, text classification was
used to classify and categorize the comments as bullying, combining natural language
processing and machine learning tools and techniques. Text classification produces
text classifiers, which are used to categorize, organize and structure any textual
information; here the classifier is used to classify comments. A text classifier takes a
comment's text as input, analyses its content and outputs a label. Text classification
algorithms are key to processing natural language, extracting the data and splitting it
into different classes according to the required specifications [9, 15]. Using different
machine learning tools and techniques, one can extract the information from a given
document or paragraph and classify it into different groups or classes. The authors
believe that ensemble learning can be a better solution, as it combines multiple models
to improve accuracy. In this work, two ensemble techniques, Voting and AdaBoost
classifiers, were applied, and ensemble learning was found to provide more accurate
results.
Text classification comprises several phases for filtering or categorizing a
document. It follows the series of steps given below [3, 16].
• Text Preprocessing
• Feature Extraction
• Training Classifier
• Classification Model
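The four phases above map naturally onto a scikit-learn pipeline. The snippet below is a rough sketch, not the authors' exact code; the toy comments, the English stop-word list and the logistic regression model are illustrative stand-ins:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Preprocessing + feature extraction: lowercase, drop English stopwords,
# build a bag-of-words vocabulary. Training: fit a logistic regression.
clf = Pipeline([
    ("features", CountVectorizer(lowercase=True, stop_words="english")),
    ("model", LogisticRegression(max_iter=1000)),
])

# Toy stand-in for the labelled commentary dataset (1 = insult, 0 = neutral).
comments = ["you are wonderful", "you are a total idiot",
            "nice work today", "shut up you fool"]
labels = [0, 1, 0, 1]

clf.fit(comments, labels)                    # training-classifier phase
prediction = clf.predict(["what an idiot"])  # classification-model phase
```

The fitted pipeline is the "classification model" of the last phase: it accepts raw comment strings and outputs a 0/1 label.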
4 Implementation
The dataset used in this paper is taken from [1]. It contains social commentary with
labels marking the comments that are insults. The authors performed data exploration
on the dataset, shown in Fig. 1, and a glimpse of the dataset is shown in Fig. 2. The
dataset has three columns: insult, which is either 0 or 1, with 0 representing a neutral
comment and 1 an insult; the date of the comment; and the comment itself, which is a
string. Voting and AdaBoost classifiers were used for the prediction of comments; the
following subsections elaborate on these classifiers.
Fig. 1. Orange bars show the comments that are not insults and blue bars show the comments
that are insults.
Fig. 2. Dataset
4.2 AdaBoost
The basic principle of AdaBoost is to fit a sequence of learners on repeatedly
modified versions of the data. The predictions from all learners in the sequence are
then combined through a weighted majority vote to produce the final prediction. At
each iteration, the data is reweighted and the learning algorithm applied again
[13, 15]. The ensemble technique creates a classifier after undersampling and
oversampling the data at various rates [16]. In this work, AdaBoost is applied on top
of logistic regression: it applies logistic regression to various versions of the data and
induces a model that is the best of all models. The AdaBoost classifier achieved an
accuracy of 84.04%.
The solution proposed in this paper is carried out in the several steps shown in Fig. 3.
In the first step, pandas is used to import the dataset, converting the given CSV file
into a pandas DataFrame. Once the DataFrame is created, the features and labels,
comment and insult respectively, are extracted. Stopwords are then removed from the
comments, because these are high-frequency words with little semantic weight, and
the words in each comment are stemmed to avoid variants of words with the same
root. Next, a count vectorizer is applied, which provides an easy and simple way to
build a vocabulary of known words and to tokenize the collection of texts and
documents. The authors then create standard and ensemble classifiers. Once a
classifier is created, it is trained with the training data; various percentages and cross
folds of the training data were used, as shown in Tables 1 and 2, and the classifiers
were applied to that data. After a classifier is trained, a prediction vector is produced
to see how many correct predictions it makes. Based on these readings, the authors
calculated the accuracy of the model and also generated a training curve to see
whether the model is over-fitting or under-fitting.
626 M. O. Raza et al.
In this paper, two parameters are used for evaluating the models. The first is
accuracy, defined as the sum of true positives and true negatives divided by the sum
of true positives, true negatives, false positives and false negatives, as shown in
Eq. (1). The second parameter is the learning curve, a graph used to compare the
performance of a model over the training and test sets and to see whether performance
can be increased with more data. Using both of these parameters, we discuss the
results obtained.
Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)   (1)
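Eq. (1) is straightforward to compute from a confusion matrix; the counts below are illustrative only, chosen so that 844 of 1000 predictions are correct:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (1): correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts only (not from the paper's experiments).
acc = accuracy(tp=400, tn=444, fp=90, fn=66)   # 844 correct out of 1000
```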
The system proposed in this paper for detecting cyberbullying is tested with different
train–test splits and numbers of cross folds on different classifiers, and based on these
observations it can be said that the proposed approach for detecting cyberbullying in
social commentary can be effective. Table 1 shows the accuracy of Logistic
Regression, Naïve Bayes and Random Forest at different train–test splits, and Table 2
shows their accuracy at different numbers of cross folds of the data, ranging from 10
to 90. To improve the accuracy of the models, the ensemble classifiers, namely the
voting classifier and AdaBoost, were used, and both provided better accuracy than
Logistic Regression, Naïve Bayes and Random Forest. Table 3 shows the best
performance of each classifier with the corresponding train–test split and number of
cross folds. The training curves show which model is a good, worst or best fit.
According to the training curve in Fig. 4, there is a deviation at the start between the
training and testing lines, but the lines converge, indicating a good fit that could still
be improved with more data. From Fig. 5, it is evident that the model is the best fit,
because the training and testing lines lie at almost the same points. The model
depicted in Fig. 6 is the worst fit because the testing line goes down, and this model
cannot be improved because both lines appear parallel toward the end.
6 Conclusion
Cyberbullying is one of the rising issues accompanying the increased usage of the
internet and social media. It can have very dangerous consequences, so there is a need
for a way to control, detect and remove it. The authors discussed methods used by
various researchers, then applied standard machine learning algorithms and ensemble
machine learning algorithms; the ensemble algorithms performed better. An accuracy
of 84.4% was achieved with the voting classifier, and learning curves were drawn to
compare training and test scores over various numbers of instances, concluding that
logistic regression is a good fit, random forest is less accurate but the best fit, and
Naïve Bayes is the worst fit for detecting cyberbullying. One of the most important
potential use cases for these models is applying them on a social networking site to
detect cyberbullying in real time with minimal human involvement.
Like any other system, the proposed model has certain limitations: it needs more
robustness and works only for English. For future work, we will apply other
state-of-the-art machine learning techniques to improve the model's accuracy and
thereby fully automate the process of cyberbullying detection on online forums. We
will also work on supporting multiple languages so that cyberbullying can be detected
in various languages.
References
1. Detecting Insults in Social Commentary. Kaggle.com (2019). https://www.kaggle.com/c/
detecting-insults-in-social-commentary/data. Accessed 09 Apr 2019
2. Nitta, T., et al.: Detecting cyberbullying entries on informal school websites based on
category relevance maximization. In: Proceedings of the Sixth International Joint Conference
on Natural Language Processing (2013)
3. Reynolds, K., Kontostathis, A., Edwards, L.: Using machine learning to detect cyberbul-
lying. In: 2011 10th International Conference on Machine Learning and Applications and
Workshops, vol. 2. IEEE (2011)
4. Dadvar, M., et al.: Improved cyberbullying detection using gender information. In:
Proceedings of the Twelfth Dutch-Belgian Information Retrieval Workshop (DIR 2012).
University of Ghent (2012)
5. Kontostathis, A., et al.: Detecting cyberbullying: query terms and techniques. In:
Proceedings of the 5th Annual ACM Web Science Conference. ACM (2013)
6. Dadvar, M., et al.: Improving cyberbullying detection with user context. In: European
Conference on Information Retrieval. Springer, Berlin (2013)
7. DeGregory, K.W., et al.: A review of machine learning in obesity. Obes. Rev. 19(5), 668–
685 (2018)
8. Wu, J.-Y., Hsiao, Y.-C., Nian, M.-W.: Using supervised machine learning on large-scale
online forums to classify course-related Facebook messages in predicting learning
achievement within the personal learning environment. In: Interactive Learning Environ-
ments, pp. 1–16 (2018)
9. Balyan, R., McCarthy, K.S., McNamara, D.S.: Comparing machine learning classification
approaches for predicting expository text difficulty. In: The Thirty-First International Flairs
Conference (2018)
10. Hoogeveen, D., et al.: Web forum retrieval and text analytics: a survey. Found. Trends® Inf.
Retrieval 12(1), 1–163 (2018)
11. Raisi, E., Huang, B.: Cyberbullying detection with weakly supervised machine learning. In:
Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social
Networks Analysis and Mining 2017, pp. 409–416. ACM, July 2017
12. Al-garadi, M.A., Varathan, K.D., Ravana, S.D.: Cybercrime detection in online commu-
nications: the experimental case of cyberbullying detection in the Twitter network. Comput.
Hum. Behav. 63, 433–443 (2016)
13. Randhawa, K., et al.: Credit card fraud detection using AdaBoost and majority voting. IEEE
Access 6, 14277–14284 (2018)
14. Voting Classifier. https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier.
Accessed 24 Apr 2019
15. Ensemble Methods. Scikit. https://scikit-learn.org/stable/modules/ensemble.html#adaboost.
Accessed 24 Apr 2019
16. Rahman, H.A.A., Wah, Y.B., He, H., Bulgiba, A.: Comparisons of ADABOOST, KNN,
SVM and logistic regression in classification of imbalanced dataset. In: International
Conference on Soft Computing in Data Science, pp. 54–64. Springer, Singapore, September
2015
Predicting the Risk Factor for Developing
Chronic Kidney Disease Using a 3-Stage
Prediction Model
Abstract. Chronic Kidney Disease (CKD) is considered one of the major high-
risk chronic diseases affecting human health, causing death in its late stages.
Moreover, treating CKD patients costs huge amounts depending on their stage.
It is therefore significant not only to detect the disease in its early stages, but
also to have a way to assess and predict, early on, the possibility of individuals
being affected in the future. In this research, a 3-Stage predictor is introduced to
help predict the risk factor for developing CKD during healthcare screening,
based on a questionnaire and some laboratory tests. It also aims to reduce and
eliminate unjustified test costs, by categorizing the parameters into stages so
that tests are ordered only when needed for the assessment. A comparison
between 12 classifiers led to choosing the 3 classifiers used in designing the
3-Stage model, based on the best accuracy and prediction speed. The 3-Stage
model is designed using Bagged, Boosted and Medium Trees classifiers. The
model was assessed on a dataset collected from the Centers for Disease Control
and Prevention (CDC) in the United States. The trained 3-Stage model achieved
99.97% accuracy predicting around 3K cases in comparison with a 1-Stage
model.
1 Introduction
Chronic diseases have been a burden worldwide for many years. They are the main
cause of death in many countries, especially developing ones with low and middle
income, where they cause around 80% of deaths due to poverty and economic
instability [1]. Chronic Kidney Disease (CKD) is one of these chronic diseases, in
which the kidney is damaged and its function of filtering waste from the blood is
degraded; this allows harmful materials to remain in the body and leads to other
health problems. CKD is perilous, as it can show no symptoms at all until massive,
unrecoverable damage has occurred. That is why the first step in treating a CKD
patient is to detect the disease early [2]. Moreover, the early detection of CKD can be
achieved by monitoring risk factors such as diabetes, high blood pressure,
cardiovascular (heart and blood vessel) disease, anemia and long-term analgesic use,
or by knowing that there is a family history of kidney failure [3].
In fact, CKD is ranked the 12th cause of death worldwide; in Egypt, it is ranked the
6th [4]. Statistics in the US estimate that one in every seven people is affected by
CKD: around 15% of adults, constituting around 30 million people. They also state
that about 96% of those with a damaged or reduced-function kidney are unaware of
having CKD. Further statistics reveal that the ratios of being affected by CKD due to
other diseases considered risk factors, such as diabetes or high blood pressure, are 1 in
every 3 and 1 in every 5, respectively [5].
From the economic perspective, Honeycutt et al. [6] mentioned that, based on their
research, the annual medical costs for a CKD patient were $1,700, $3,500 and
$12,700 for CKD stages 1, 2 and 3, respectively. This shows that detecting CKD
early not only helps in better addressing treatment for CKD patients, but also helps
reduce its cost by diagnosing the disease in its early stages [6]. Knowing the risks of
CKD from different aspects, such as human health and the economy, motivated
researchers to work on reducing these impacts, which is why there have been several
attempts to detect CKD through technology by engaging computer systems.
Nowadays, the world is taking huge steps in the technological revolution; with the
need to use computers in every field, the development of new and innovative
techniques to solve problems becomes a demand. Advancing computer systems to be
more capable of monitoring, analyzing, learning, predicting risks and even suggesting
solutions for real problems in different domains makes researchers more curious to
put them to the best use, solving more issues and making the world a better place.
One of these domains is the healthcare and medicine field. Much research adopts the
involvement of advanced computer systems and algorithms, such as data mining and
machine learning, in the healthcare field in different ways, for example by detecting
some diseases early and identifying their stages, or by proactively predicting the
possibility of being affected by these diseases [7].
In this paper, a 3-Stage predictor model is introduced. Section 2 briefly discusses
the background and related work on CKD and machine learning models. In Sect. 3,
the introduced predictor is explained in detail, starting from the motive and objective,
going through the designed model, dataset, techniques and methodology, and finally
discussing the experimental results. Finally, the conclusion and some
recommendations for future work are given in Sect. 4.
2 Related Work
Data mining and machine learning are predominantly used for detecting and
predicting diseases in the healthcare industry [7]. Several research papers have
predicted CKD by comparing multiple classification algorithms and reporting their
findings with regard to performance and accuracy. One of these studies included the
Naïve Bayes and Support Vector Machine (SVM) algorithms in the comparison,
finding that SVM was better in terms of performance [7].
Predicting the Risk Factor for Developing Chronic Kidney Disease 633
Another reference, by Koklu & Tutuncu [8], evaluated Naive Bayes, the C4.5
algorithm, SVM and the Multilayer Perceptron for detecting CKD, reporting correct
detection percentages of 95.00%, 97.75%, 99.00% and 99.75%, respectively, on a
dataset of 400 samples and 25 attributes [8].
Anantha Padmanaban and Parthiban [9] used the Naive Bayes and decision tree
algorithms, concluding that the accuracy was 91% for decision tree classification on a
larger dataset of 600 samples and 13 attributes. In addition, Sharma et al. [10]
evaluated 12 classification models on the CKD UCI dataset, which consists of 400
samples and 24 attributes. The tested classifiers were: Decision Tree, Linear
Discriminant, Quadratic Discriminant, Linear SVM, Quadratic SVM, Fine KNN,
Medium KNN, Cosine KNN, Cubic KNN, Weighted KNN, FFBPNN (GD) and
FFBPNN (LM). The results show that the decision tree achieved the best scores, with
an accuracy of 98.6% [10].
Moreover, Pavithra & Shanmugavadivu [2] worked on the early prediction of
kidney failure with the Fuzzy C-Means algorithm on the UCI CKD dataset. Besides
predicting CKD and its severity, the research of Tahmasebian et al. [11] focused on
determining the relation between parameters and their degree of relevance for
detecting CKD.
Inspecting the different researchers' perspectives on this topic leads to the
observation that these attempts aim to detect chronic kidney disease from a certain set
of parameters by applying a number of classification methods, but they do not
necessarily predict the possibility of being affected by CKD in the first place.
Moreover, these trials used the whole set of parameters in one stage, which might lead
to performing unnecessary laboratory tests and, consequently, unjustified cost.
3 Objective
3.1 Model
The motive of this research is to help people and governments reduce the costs of
CKD treatment by predicting the risk of being affected by CKD during regular
healthcare screening. To this end, a 3-Stage predictor model is implemented that
works on a questionnaire and basic screening checkups in its first stage; if these
inputs indicate a risk, the predictor asks for more lab tests to confirm its assessment.
The 3-Stage model is presented in Fig. 1.
The model is designed to work on 13 CKD-related data attributes, categorized per
stage as presented in Fig. 2, based on real medical diagnostic practices. The 3-Stage
predictor outputs the CKD risk values described in Table 1.
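The staged gating can be sketched as a simple cascade. The attribute names, cutoffs and risk labels below are hypothetical illustrations; in the paper each stage is a trained tree-ensemble classifier (Bagged, Boosted and Medium Trees), not a threshold rule:

```python
def stage1(answers):
    """Stage 1: questionnaire + basic checkup only (no lab cost)."""
    return answers["diabetes"] or answers["hypertension"] or answers["family_history"]

def stage2(labs):
    """Stage 2: first lab panel, ordered only if stage 1 flags a risk."""
    return labs["creatinine_mg_dl"] > 1.3        # hypothetical cutoff

def stage3(labs):
    """Stage 3: confirmatory parameter."""
    return labs["egfr"] < 60                     # hypothetical cutoff

def predict_ckd_risk(answers, order_labs):
    """Return a risk label; `order_labs` is called lazily, so laboratory
    tests are performed only when an earlier stage justifies their cost."""
    if not stage1(answers):
        return "No Risk"
    labs = order_labs()                          # tests ordered only now
    if not stage2(labs):
        return "Low Risk"
    return "At Risk" if stage3(labs) else "Low Risk"
```

The design point is the lazy `order_labs` call: a screening that stops at stage 1 never incurs the cost of the later lab panels.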
Fig. 2. The input data attributes, per stage, for the 3-Stage predictor
Predicting the Risk Factor for Developing Chronic Kidney Disease 635
3.2 Dataset
In this research, large datasets with a total of around 30K real patients' records and 239
attributes were collected to support experimentation with the proposed 3-Stage model predictor.
The datasets are part of the National Health and Nutrition Examination Survey
(NHANES), between 2011 and 2016, provided by the Centers for Disease Control and
Prevention (CDC), US [12].
Data Preparation. To ensure relevance to CKD medical diagnostics,
nephrologists were interviewed for parameter selection, focusing on adult patients
(>18 years old). Accordingly, the datasets were merged and filtered using SAS software
to produce a ready dataset of around 8K records, after data cleansing, with 16
attributes. Table 2 shows the list of attributes of the dataset.
In addition, the last attribute, CKD Risk, was populated according to nephrologists'
real medical diagnostic practices, in preparation for training the machine learning
classifiers. Figures 3 and 4 show the conditions applied to determine the CKD risk values of
the training dataset as part of preparing the data.
[Figs. 3 and 4: decision matrices mapping the Level 2 and Level 3 parameters, including a diabetes diagnosis from the questionnaire, obesity, and high systolic/diastolic blood pressure readings (2nd reading, mm Hg), to the CKD Risk labels At Risk and Low Risk.]
Classifiers Selection. The classifiers for the 3-Stage predictor model were selected
based on the best accuracy and prediction speed. Twelve different classifiers were
compared for every stage of the 3-Stage model using MATLAB. Tables 3, 4 and 5
summarize the results of the classifiers for each of the three stages.
Moreover, a comparison was made by adding all parameters into one stage, to choose
the best classifier for the 1-Stage model, as shown in Table 6. Similarly, Figs. 5 and
6 display the differences between the classifiers with respect to accuracy and
prediction speed in each stage.
The accuracy is calculated by dividing the number of correctly classified cases by the total
number of cases in the testing dataset. Equation (1) shows the accuracy
calculation, where T is the number of true predictions and N is the total number of
cases in the testing dataset:

Accuracy = T / N (1)
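Eq. (1) is a one-line computation; the example values here are illustrative, not the paper's reported figures.

```python
def accuracy(true_predictions, total_cases):
    """Eq. (1): accuracy = T / N, the share of correctly classified cases."""
    return true_predictions / total_cases

# e.g. 2985 correct predictions out of 3000 test records:
print(f"{accuracy(2985, 3000):.2%}")  # 99.50%
```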
Table 6. Comparison between classifier accuracies – all parameters in one stage for the 1-Stage model

Classifier             Accuracy (%)   Pred. speed (obs/sec)   Training time (s)
Fine Tree              99.300         170000                   1.55600
Medium Tree            98.800         170000                   0.66212
Coarse Tree            91.000         120000                   2.57550
Linear SVM             50.400          63000                   6.05280
Quadratic SVM          54.700          30000                  11.52900
Cubic SVM              55.000          29000                  15.84500
Fine Gaussian SVM      45.700           3300                  26.91200
Medium Gaussian SVM    53.700          16000                  13.40100
Coarse Gaussian SVM    48.300          12000                  19.36400
Boosted Trees          99.200          19000                  24.33800
Bagged Trees           99.100          17000                  23.81500
RUSBoosted Trees       52.800          18000                  24.73300
Fig. 5. Classifiers’ accuracy comparison for the 3-Stage and the 1-Stage models
Fig. 6. Classifiers’ prediction speed comparison for the 3-Stage and the 1-Stage models
Both the 3-Stage and the 1-Stage models were trained on around 5K records and tested
on around 3K records. The 3-Stage model shows a better accuracy of 99.97%, compared
to 99.16% for the 1-Stage model. Table 7 illustrates the accuracy differences
between the two models. Furthermore, Fig. 7 presents
the confusion matrices of both models over the predicted values for the 3K test records.
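A confusion matrix of the kind shown in Fig. 7 is just a tally of (actual, predicted) label pairs. The sketch below uses illustrative labels and counts, not the paper's actual test results.

```python
def confusion_matrix(actual, predicted, labels):
    """Count occurrences of each (actual, predicted) label pair."""
    counts = {(a, p): 0 for a in labels for p in labels}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

labels = ("At Risk", "No Risk")
actual    = ["At Risk", "At Risk", "No Risk", "No Risk"]
predicted = ["At Risk", "No Risk", "No Risk", "No Risk"]
cm = confusion_matrix(actual, predicted, labels)
# cm[("At Risk", "At Risk")] == 1  (correctly flagged)
# cm[("At Risk", "No Risk")] == 1  (missed at-risk case)
```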
Based on the experimental results, splitting the CKD attributes into stages,
according to their medical diagnostic relevance, enhanced the prediction accuracy
compared to passing them all at once to the 1-Stage predictor.
In this research paper, a 3-Stage model was designed to predict the risk factor for
developing Chronic Kidney Disease (CKD). Different classifiers were compared for
each of the three stages based on accuracy and prediction
speed. The 3-Stage model was tested against a 1-Stage model and showed better
prediction accuracy. In conclusion, multi-stage predictors can help the
healthcare and medicine fields detect and predict diseases such as CKD early, during
regular healthcare screenings. In addition, the early prediction of CKD using the
3-Stage model will help avoid unnecessary tests and cost.
One of the limitations and challenges in this research was related to the dataset used
for building the model and the experimental work. Finding a dataset that is
trusted and diverse was a limitation, since there are no digital patient records
holding information particularly relevant to CKD in Egypt. This
was resolved by obtaining the dataset from the CDC.
For future work, more trials and experiments can be conducted on
multi-stage prediction models, in which more data attributes can be
added and more classifiers compared, not only for CKD but also for other diseases
whose occurrence can be predicted during regular healthcare screening.
References
1. WHO, Preventing chronic diseases: a vital investment, World Health Organization (2005).
https://www.who.int/chp/chronic_disease_report/part1/en/
2. Pavithra, N., Shanmugavadivu, R.: Efficient early risk factor analysis of kidney disorder
using data mining technique, pp. 1690–1698 (2017)
3. American Kidney Fund, Chronic kidney disease (CKD). http://www.kidneyfund.org/kidney-
disease/chronic-kidney-disease-ckd/#what_causes_chronic_kidney_disease
4. LeDuc Media, World Rankings - Total Deaths. https://www.worldlifeexpectancy.com/
world-rankings-total-deaths
5. National Chronic Kidney Disease Fact Sheet (2017)
6. Honeycutt, A.A., Segel, J.E., Zhuo, X., Hoerger, T.J., Imai, K., Williams, D.: Medical costs
of CKD in the medicare population. J. Am. Soc. Nephrol. 24(9), 1478–1483 (2013)
7. Vijayarani, S., Dhayanand, S.: Data mining classification algorithms for kidney disease
prediction. J. Cybern. Inform. 4(4), 13–25 (2015)
8. Koklu, M., Tutuncu, K.: Classification of chronic kidney disease with most known data
mining methods. Int. J. Adv. Sci. Eng. Technol. 5(2), 14–18 (2017)
9. Anantha Padmanaban, K.R., Parthiban, G.: Applying machine learning techniques for
predicting the risk of chronic kidney disease. Indian J. Sci. Technol. 9(29), 1–5 (2016)
10. Sharma, S., Sharma, V., Sharma, A.: Performance based evaluation of various machine
learning classification techniques for chronic kidney disease diagnosis (2016)
11. Tahmasebian, S., Ghazisaeedi, M., Langarizadeh, M., Mokhtaran, M.: Applying data mining
techniques to determine important parameters in chronic kidney disease and the relations of
these parameters to each other. J. Ren. Inj. Prev. 6(2), 83–87 (2017)
12. CDC NHANES Dataset, Centers for Disease Control and Prevention (CDC). National
Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey
Data. Hyattsville, MD, U.S. Department of Health and Human Services, Centers for Disease
Control and Prevention. https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/
Classification of Diabetic Retinopathy
and Retinal Vein Occlusion in Human Eye
Fundus Images by Transfer Learning
Abstract. Sight-threatening diseases are widespread these days. Some are so
harmful that they may cause complete vision loss. Diabetic Retinopathy (DR) and
Retinal Vein Occlusion (RVO) fall into this category. The first step in curing such
diseases is to predict them accurately, and a
large number of machine/deep learning algorithms have been employed for this purpose. In this research,
we propose a DR&RVO prediction system, which may help eye specialists
predict these diseases. In the proposed methodology,
a retinal image goes through three main steps in a Deep Neural
Network (DNN): preprocessing, image segmentation, and feature extraction
and classification. For classifying the processed image into the DR, RVO,
and Normal labels, pre-trained deep neural networks (DNNs) are used. More
than 2680 eye fundus images were collected from 7 online available datasets; all
images are converted to JPG file format during the preprocessing step. After
distributing the class labels into three categories, the proposed model is trained and
then tested randomly on Inception v3, ResNet50, and AlexNet, using
a deep learning technique named 'Transfer Learning'. The accuracies
obtained from these models show that Inception v3 (85.2%) outperformed the
other two state-of-the-art models.
1 Introduction
Diabetes is a widely spreading disease all over the globe. According to the International
Diabetes Federation (IDF), 381.8 million patients were suffering from diabetes as of 2018,
and the number is expected to increase to 591.9 million (55%) by 2030 [1]. The main cause of
diabetes is excess blood sugar. It can affect the nervous system, the eyes, the kidneys,
the heart, etc. Because the eye is one of the most sensitive parts of the body, it should be
treated with care. Many patients who suffer from diabetes become victims of diabetic
retinopathy. This disease becomes active in severe cases of diabetes, which is why early
detection of diabetic retinopathy is difficult. In order to avoid serious damage to the
eye and to avoid visual impairment or complete blindness, timely detection and
treatment of DR are necessary [2]. There are many disorders of the human eye in which
complete vision loss can occur, such as diabetic retinopathy, cataract, hypertension,
macular degeneration, neovascularization, hemorrhages, glaucoma, retinal artery
occlusion, and retinal vein occlusion [3]. All the diseases mentioned here are injurious to
eyesight, but the two most common, DR and RVO, are the subject of our study [4].
People suffering from moderate DR are supposed to have good eyesight, but the DR that
can cause vision loss has two types: diabetic macular oedema (DMO) and
proliferative diabetic retinopathy (PDR) [5]. DR is a complication of diabetes
and the main cause of retinal blood vessel destruction. DR occurs in diabetes
patients: patients with weak diabetic control suffer from very high blood sugar levels
for a long period of time, which leads to retinal blood vessel destruction. DR causes
damage in up to 8 of 10 patients who suffer from diabetes for ten or more years [6]. The retina is
the layer of tissue at the back of the eye; it detects light, enabling human
beings to see. In DR, the light-sensitive tissues at the back of the retina are affected,
which may lead to vision loss.
The formation of a blood clot in the retina leads to RVO [7]. When an artery
crosses over a vein, the vein can burst, leading to hemorrhage, which further causes RVO.
It also occurs due to complete blockage of blood vessels. When a painless reduction of
eyesight occurs suddenly in elderly people, it may be a symptom of RVO, in case
one of the veins is blocked. This prevents blood transport in the eye and
further leads to fluid and blood leakage in parts of the retina, with swelling, bruising, and
oxygen deficiency at the rear side of the retina. This interferes with light
detection by the retina and causes loss of vision. The condition is uncommon before 50
years of age, but its frequency increases with age. Like DR, RVO is a
commonly occurring disease of the human retina, and it causes blindness. RVO has
three types [8]: Branch Retinal Vein Occlusion (BRVO), Central Retinal Vein
Occlusion (CRVO), and Hemi-Retinal Vein Occlusion (HRVO). BRVO occurs at a
branch vein, when one of the four veins, each of which drains almost one
quarter of the retina, is blocked. CRVO affects the central vein, which drains blood from
about the whole retinal area; it occurs as a result of central vein blockage and is, in
general, the most dangerous type for vision loss. HRVO occurs at a sub-vein. BRVO is the most
commonly occurring of all the above types. The exact cause of RVO is unknown; the formation of a
blood clot in a retinal vein disturbs the flow of blood due to high blood pressure, high
cholesterol, glaucoma, smoking, diabetes, and certain rare blood disorders. The main
difference between DR and RVO is that DR occurs in diabetes patients, and its fractal
dimensions are totally different from those of RVO patients [9].
These two diseases are classified through different supervised and unsupervised
machine learning algorithms. In most state-of-the-art work, segmentation techniques are
employed for DR prediction and fractal dimension analysis for RVO prediction.
Still, no deep/machine learning algorithm has been used to predict both
diseases through the same model with improved accuracy [10]. A neural network model is
proposed in this research that can predict multiple classes, i.e. DR, RVO, and Normal
images, using the transfer learning technique. The main objective of this research is to
644 A. Usman et al.
provide a single interface to medical specialists so that they can predict the
above-mentioned eye diseases with improved accuracy.
In Sect. 1, a brief introduction to eye diseases is provided, with the problem statement
and research objective. Section 2 covers the current state-of-the-art work with a
parametric analysis of the techniques used. The proposed methodology is described in Sect. 3.
Section 4 presents the proposed DR&RVO prediction system, results, and discussion.
Section 5 concludes and gives future directions.
2 Related Work
3 Proposed Methodology
3.1 ANN Learning
The starting point of learning the network is to determine the internal parameters and readjust
the weights and biases. Then randomly generated input data is fed to the network
for training. Weight estimation and optimization are performed by the ANN. An error function is
defined to measure the error on the training data. The training process is stopped when the
error is less than or equal to a predefined value for all patterns. A schematic
representation of the ANN learning process is shown in Fig. 1.
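The training loop just described (random initialization, forward pass, error measurement, weight readjustment, and stopping once the error falls below a predefined value) can be sketched as follows. This is a toy single linear neuron fitting y = 2x + 1, chosen for brevity; the paper's FFBPNNs stack many layers with nonlinear activations, and the learning rate and tolerance here are illustrative assumptions.

```python
import random

random.seed(0)
w, b = random.uniform(-1, 1), random.uniform(-1, 1)   # random internal parameters
data = [(x, 2 * x + 1) for x in range(-5, 6)]         # training patterns, target y = 2x + 1
lr, tolerance = 0.01, 1e-6                            # predefined error value

for epoch in range(10000):
    error = 0.0
    for x, y in data:
        pred = w * x + b              # forward pass
        e = pred - y                  # error function (residual)
        w -= lr * e * x               # readjust weight (gradient step)
        b -= lr * e                   # readjust bias
        error += e * e
    if error <= tolerance:            # cease training at the predefined value
        break
```

After the loop, `w` and `b` have converged close to the target coefficients 2 and 1.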
Instead of training a network from scratch, we can modify existing pre-trained neural networks according to our
dataset. For example, if a neural network is trained for the prediction and classification of
1000 classes, we can change the last layers of this particular network to predict the labels
in our data according to the training image data [20].
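The idea of swapping the last layer can be sketched conceptually as below. The "frozen" weights here are random NumPy placeholders standing in for a pre-trained backbone (they are not real Inception v3 weights), and the feature dimensions are illustrative; in a real framework one would load the pre-trained model and replace only its final classification layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "feature extractor" standing in for a pre-trained backbone.
W_frozen = rng.standard_normal((64, 16))
def extract_features(img):
    return np.tanh(img @ W_frozen)   # frozen: never updated during fine-tuning

# The old 16 -> 1000-class head is discarded; only this new 16 -> 3 head
# (DR, RVO, Normal) would be trained on our fundus image data.
W_head = rng.standard_normal((16, 3)) * 0.01
LABELS = ["DR", "RVO", "Normal"]

def predict(img):
    logits = extract_features(img) @ W_head
    return LABELS[int(np.argmax(logits))]
```

Only `W_head` would receive gradient updates during transfer learning, which is why far fewer labeled images suffice than for training from scratch.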
The proposed research on Retinal images is carried out in three main steps shown in
Fig. 2:
• Preprocessing
All the images collected from the different datasets were in different file formats:
PNG, PPM, JPG, TIF, and GIF. Before further processing, all these images are
converted into one file format. In our case, the image files that are not in JPG format
are first converted into JPG by a Python script, prior to further preprocessing steps.
An input image is then passed to the input layer of the CNN, namely the convolutional layer,
where various convolutional filters are applied to the image. The image is converted to
RGB channels. Then noise-removal filters and enhancement filters are applied to
conclude the first step.
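The format-normalization step can be sketched as below. The paper's actual script is not published, so this is an assumed outline: it only computes the (source, target) rename pairs, and the actual pixel conversion would be a Pillow call such as `Image.open(src).convert("RGB").save(dst, "JPEG")`.

```python
from pathlib import Path

# Non-JPG formats seen across the seven collected datasets.
FORMATS = {".png", ".ppm", ".tif", ".gif"}

def to_jpg_paths(filenames):
    """Return (source, target) pairs for files that need JPG conversion."""
    pairs = []
    for name in filenames:
        p = Path(name)
        if p.suffix.lower() in FORMATS:          # skip files already in JPG
            pairs.append((name, str(p.with_suffix(".jpg"))))
    return pairs

to_jpg_paths(["eye1.png", "eye2.jpg", "fundus.TIF"])
# -> [("eye1.png", "eye1.jpg"), ("fundus.TIF", "fundus.jpg")]
```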
• Segmentation
In the second step, the generated feature map is used for segmentation of the blood vessels
in the eye images. Eye fundus image segmentation in neural networks is generally
based on different steps, e.g. edge detection, soft and hard exudate segmentation, and
thresholding.
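Of the steps named above, thresholding is the simplest to illustrate: a per-pixel cutoff on an intensity map. The array values and threshold below are illustrative; real vessel and exudate segmentation layers far more processing on top of this basic operation.

```python
import numpy as np

def threshold_segment(gray, thresh):
    """Binary mask: 1 where intensity exceeds the threshold, else 0."""
    return (gray > thresh).astype(np.uint8)

gray = np.array([[10, 200], [180, 30]])   # toy 2x2 intensity map
threshold_segment(gray, 128)
# -> [[0, 1], [1, 0]]
```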
• Feature Extraction and Classification
At the last stage, the image is passed through the fully connected layer to generate a
final feature map of the retinal images, and an activation function is used to predict the class
label at the output layer.
In order to classify DR, RVO, and Normal eye images, we use three
pre-trained neural networks. Through transfer learning in these models, we can get the
desired results. The details of the Inception v3 architecture we used, and its
implementation, are described in the next sections.
The main aim of conducting this research is to design a system that can predict Diabetic
Retinopathy (DR), Retinal Vein Occlusion (RVO), and Normal images through a single model
with improved accuracy. For this purpose, three CNN models are tested on eye fundus
image data, i.e. Inception v3, ResNet50, and AlexNet.
The interface of the proposed DR&RVO prediction system is shown in Figs. 4, 5 and
6, where the proposed model predicts the three class labels. This interface
is built on the results obtained from the Inception v3 model.
In Fig. 4, the image is selected from the test data; based on the model's training, the
predicted class label is Normal, which is correct.
In Fig. 5, the image is selected from the test dataset; based on the training data,
the predicted class label is DR.
Figure 6 shows the predicted class label as RVO, while it is in fact a DR image; this
shows that a little disruption in the image, introduced by the image-capturing source,
can change the result.
As mentioned, we built this interface on the Inception v3 model, which gives the best
accuracy on our eye image data. To develop this interface we used Python 3.5
and imported Tkinter and other library files.
The loss function graph of our model in Fig. 8 shows that the loss is high at the start
and is minimized as the number of epochs increases.
In our proposed methodology, we discussed the steps to follow and the
detailed working of the proposed models. The dataset for this research was obtained from
seven online available datasets. Considering all these aspects, we designed a
DR&RVO prediction system. We defined three class labels, namely DR, RVO, and
Normal eye images, and used pre-trained deep learning algorithms for the prediction.
The main idea behind using these models is "Transfer Learning", in which we changed
the last layers of these networks according to our dataset. These models now
classify eye disease images with improved accuracy. Our system can be used as a
decision support system by eye specialists to predict DR- and
RVO-infected patients' eye images. In the future, this work can be extended by using a local
eye image dataset and further improving the accuracy of the model. In this research,
we predict three general eye image categories (two diseased and one normal).
This work can also be extended to predict more than three eye disease categories, or to
predict parent categories as well as subcategories of these diseases: along with DR, the model
should also predict diabetic macular oedema (DMO) and proliferative diabetic
retinopathy (PDR); similarly, the same model should also predict RVO along with
Branch Retinal Vein Occlusion (BRVO), Central Retinal Vein Occlusion (CRVO), and
Hemi-Retinal Vein Occlusion (HRVO).
Funding. This research is funded by the National Research Program for Universities (NRPU),
Higher Education Commission (HEC), Islamabad, Pakistan, grant number
20-9649/Punjab/NRPU/R&D/HEC/2017-18.
References
1. Samant, P., Agarwal, R.: Machine learning techniques for medical diagnosis of diabetes
using iris images. Comput. Methods Programs Biomed. 157, 121–128 (2018)
2. Kaur, M., Talwar, R.: Automatic extraction of blood vessel and eye retinopathy detection.
Eur. J. Adv. Eng. Technol. 2(4), 57–61 (2015)
3. Guo, J., et al.: Automatic retinal blood vessel segmentation based on multi-level
convolutional neural network. In: 2018 11th International Congress on Image and Signal
Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE (2018)
4. Qureshi, I., et al.: Computer aided systems for diabetic retinopathy detection using digital
fundus images: a survey. Curr. Med. Imaging Rev. 12(4), 234–241 (2016)
5. Solkar, S.D., Das, L.: Survey on retinal blood vessels segmentation techniques for detection
of diabetic retinopathy. Diabetes (2017)
6. Pratt, H., et al.: Convolutional neural networks for diabetic retinopathy. Procedia Comput.
Sci. 90, 200–205 (2016)
7. Nicolò, M., et al.: Real-life management of patients with retinal vein occlusion using I-
macula web platform. J. Ophthalmol. (2017)
8. Zode, J.J., Pranali, C.C.: Detection of branch retinal vein occlusions using fractal analysis.
Asian J. Convergence Technol. (AJCT)-UGC LISTED 3 (2017)
9. Fazekas, Z., et al.: Influence of using different segmentation methods on the fractal properties
of the identified retinal vascular networks in healthy retinas and in retinas with vein
occlusion, pp. 361–373 (2015)
10. Schmidt-Erfurth, U., et al.: Artificial intelligence in retina. Prog. Retinal Eye Res. (2018)
11. Ramachandran, N., et al.: Diabetic retinopathy screening using deep neural network. Clin.
Exp. Ophthalmol. 46(4), 412–416 (2018)
12. Roy, P., et al.: A novel hybrid approach for severity assessment of diabetic retinopathy in
colour fundus images. In: 2017 IEEE 14th International Symposium on Biomedical Imaging
(ISBI 2017). IEEE (2017)
13. Dutta, S., et al.: Classification of diabetic retinopathy images by using deep learning models.
Int. J. Grid Distrib. Comput. 11(1), 89–106 (2018)
14. Padmanabha, A.G.A., et al.: Classification of diabetic retinopathy using textural features in
retinal color fundus image. In: 2017 12th International Conference on Intelligent Systems
and Knowledge Engineering (ISKE). IEEE (2017)
15. Annunziata, R., et al.: Leveraging multiscale hessian-based enhancement with a novel
exudate inpainting technique for retinal vessel segmentation. IEEE J. Biomed. Health
Informat. 20(4), 1129–1138 (2016)
16. Khomri, B., et al.: Retinal blood vessel segmentation using the elite-guided multi-objective
artificial bee colony algorithm. IET Image Process. 12(12), 2163–2171 (2018)
17. Saleh, E., et al.: Learning ensemble classifiers for diabetic retinopathy assessment. Artif.
Intell. Med. 85, 50–63 (2018)
18. Hassan, G., et al.: Retinal blood vessel segmentation approach based on mathematical
morphology. Procedia Comput. Sci. 65, 612–622 (2015)
19. Zaheer, R., Humera, S.: GPU-based empirical evaluation of activation functions in
convolutional neural networks. In: 2018 2nd International Conference on Inventive Systems
and Control (ICISC). IEEE (2018)
20. Szegedy, C., et al.: Rethinking the inception architecture for computer vision. arXiv preprint
arXiv:1512.00567 (2015)
21. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (2016)
22. Krizhevsky, A., Ilya, S., Geoffrey, E.H.: Imagenet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems (2012)
Crop Monitoring Agent System Based
on Pattern Recognition Techniques
1 Introduction
Crop science grows and changes continuously, and it is necessary to assist farmers
in critical situations. Currently, several electronic devices exist, like sensors,
which can collect environmental data, along with dedicated computing tools for supporting
farmers. The acquisition of data is essential for detecting critical situations; for instance,
excessive water and low soil oxygen (O2) damage the roots of plants [3]. Manual
crop monitoring depends on the experience of farmers, but problems arise when a
farmer does not have enough expertise. Furthermore, manual monitoring takes time and
is uneconomical. Besides, a farmer may not be sure that his/her supervision
is precise and performed on time.
The two most crucial factors for plants are the crop water status and the availability
of fertilizers to meet the requirements of essential nutrients. Environmental factors
like temperature and humidity disturb the mobility of nutrients in plants. For instance,
plants may suffer from dehydration, chilling, and freezing; in addition, plants become
vulnerable to attack by pests and fungus. Traditionally, farmers use Leaf Color
Charts (LCC) to check the deficiency of nutrients (nitrogen) in plants. Manual LCC
checking requires experience. If plants have an insufficient amount of nutrients, the
yield decreases and the production cost increases. Thus, small-scale farming
requires system support to assure a good yield at a reduced cost.
Many monitoring systems exist using radar, satellite imaging [15], and aerial
photography, but they are quite expensive, sophisticated, and require expertise to
operate. By contrast, sensors are not expensive. Thus, a monitoring system using
sensor networks can maintain the optimum range of parameters (light, temperature,
humidity), protecting plants from pests and diseases [1-3].
Greenhouse and tunnel farming techniques would be more productive if they
included a remote monitoring system. Thus, this research proposes an intelligent
monitoring system for agriculturists in remote areas. This dedicated system
integrates different hardware equipment and adaptable software for sensor networks,
with agents based on a knowledge base of inference rules.
The sensors acquire data on the parameters related to the growth of
plants: humidity, temperature, and nutrients. The analysis of this information provides
appropriate measures before damage to the crop in case of risk.
The organization of the paper is as follows: Sect. 2 introduces current research
work in phytomonitoring techniques needed to understand the approach proposed in this
article. Section 3 describes the crop monitoring methodology of the present study.
Section 4 explains the experimental results. Section 5 concludes and presents future
work.
2 Related Works
3 Methodology
The most crucial atmospheric element affecting the growth and yield of plants is
temperature; the photosynthesis and respiration processes require an optimum temperature.
The second important parameter is humidity. The relative humidity is the ratio of
two water vapor contents, the real and the saturated, expressed as a
percentage and measured at the same conditions of pressure and temperature.
Plants use air carbon dioxide (CO2) for their photosynthesis process. A plant's leaves
have pores to take in CO2. During the respiration process, some
moisture in the air enters through the pores. The plant transpires moisture more slowly when the
humidity of the air is high, and vice versa: when the air is dry, plants evaporate more
water, so they become deficient in water and their leaves close their pores.
Although outgoing water then decreases, the CO2 intake reduces too.
Plants absorb water from the soil through their roots.
Crop monitoring utilizes the developed sensors to acquire information about the
environmental parameters. The closed environment in which the plants grow can use
the DS18B20 sensor for temperature and the HSU-04 sensor for humidity.
The IP camera installed in the field captures images of plants, which are compared
with the standard Leaf Color Charts (LCCs). The Electronic LCC (E-LCC) substitutes
the manual LCC used before. The E-LCC is flexible and easy to use, and assists farmers
in minimizing the production cost by indicating the dosage of the appropriate fertilizers.
Finally, the knowledge base of inference rules (KBR) includes the expertise of the
domain specialist, who provides possible solutions to farmers' problems.
The graphical user interface (GUI) (see Fig. 2) sets the ranges for temperature
and humidity. As long as the (real-time) values remain within the ranges, no intimation or
warning email is generated. By contrast, as soon as a value goes out of range,
communication with the end user (farmer) is established by an email reporting the unwanted
conditions. Some inference rules apply to such cases.
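The range-check rule behind those alerts can be sketched as follows. The range values are illustrative placeholders, not the system's configured thresholds, and the sketch returns the warning text rather than emailing it; the deployed system would pass such a message to an email sender (e.g. via `smtplib`).

```python
# Illustrative per-parameter ranges, set through the GUI in the real system.
RANGES = {"temperature": (18.0, 30.0), "humidity": (40.0, 70.0)}

def check_reading(parameter, value):
    """Return None while the reading is in range; a warning string otherwise."""
    low, high = RANGES[parameter]
    if low <= value <= high:
        return None                       # in range: no intimation generated
    return f"WARNING: {parameter} = {value} outside [{low}, {high}]"

check_reading("temperature", 35.0)
# -> "WARNING: temperature = 35.0 outside [18.0, 30.0]"
```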
When doing color image processing, it is necessary to separate the desired color
under observation from the original image. In the present case, the green color
is under study.
So firstly, the green color is identified in the original image. There are many
disturbing factors, like sunshine, shades, background, and the presence of non-green colors
in the original image, which create hurdles in detecting only the pure green color.
The developed image processing technique for identifying the green color creates a mask
of the required color and applies it to the original image. Thus, the program finds the output
for each simple color detection in the RGB model (red, green, and blue). Firstly, it extracts
the individual color bands from the image. Next, it processes the image histogram
and selects the color threshold range. The developed mask has a smooth border and filled
regions. This mask, applied to the original image, identifies the green portion of the original
image. The next step is to compare the processed image with the standard LCCs.
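The band-separation and masking steps can be sketched as below. The threshold values are illustrative assumptions (the paper selects its thresholds from the image histogram): a pixel is kept when its green band is bright enough and clearly dominates the red and blue bands.

```python
import numpy as np

def green_mask(rgb, g_min=100, margin=30):
    """Mask pixels whose green band dominates red and blue; zero the rest."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mask = (g >= g_min) & (g > r + margin) & (g > b + margin)
    out = np.zeros_like(rgb)
    out[mask] = rgb[mask]                 # keep only the green pixels
    return mask, out

# Toy 1x2 image: one green leaf pixel, one bright non-green pixel.
rgb = np.array([[[50, 200, 40], [220, 210, 200]]], dtype=np.int32)
mask, masked = green_mask(rgb)
# mask -> [[True, False]]
```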
The LCC is a straightforward technique to check for nitrogen deficiency in plants.
Thus, farmers can find out the plant's nitrogen demand, apply nitrogen fertilizer
in an appropriate amount, and minimize the production cost. Usually, there are
six shades of green on an LCC, which vary from light green to dark green. Each
shade represents a specific amount of nitrogen deficiency. Figure 3 shows the different
shades of green.
A light green color means that plants have a high deficiency of nitrogen, and a dark
green color means that plants have a sufficient amount of nitrogen. Sometimes the
color of a leaf lies between two shades and does not entirely match
any shade. This work addresses this issue: the Electronic LCC gives an actual and precise
comparison when the color lies between two shades.
Figure 4 shows the field image to be compared with the standard LCC. However,
before comparing, and to minimize the effects of non-green parts, image processing
techniques extract the green color from the image, as described above.
4 Experimental Results
Each color band of the image has 256 intensity levels, in the range [0, 255], and each color
has its own histogram. For instance, Fig. 5(a) shows the histograms of the RGB bands (red,
green, blue); Fig. 5(b) shows the histogram of the green band. After the identification
of the green color, the Electronic LCC compares it with the standard color chart.
The shades of the Electronic LCC may vary depending on the type of crop and the area in
which it grows.
Fig. 5. (a) Histogram of red, green, blue bands; (b) Histogram of green band
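A per-band histogram like those in Fig. 5 is a count of pixels at each of the 256 intensity levels; `np.bincount` computes it directly. The band values below are a toy example, not image data from the paper.

```python
import numpy as np

# Toy green band: six pixel intensities in [0, 255].
band = np.array([0, 0, 128, 255, 255, 255])

# 256-bin histogram: hist[v] is the number of pixels with intensity v.
hist = np.bincount(band, minlength=256)
# hist[0] == 2, hist[128] == 1, hist[255] == 3
```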
Firstly, the system processes the collection from a specific region together with the reference
images of the Electronic LCC; these images produce a color scheme. Next, the user can monitor
the nitrogen content of the plants by comparing, in real time, the color of the image with
the standard color chart. The system calculates the average values of the green shades and the
field images, and divides each pixel of a field image by the average value of the reference
images. However, before finding the average value, it has to use the image processing
techniques described above to extract the actual green color from the image.
Figure 6 shows the results.
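The ratio-based matching just described can be sketched as follows: the mean green intensity of the field image is divided by the mean of each reference shade, and the shade whose ratio is closest to 1 (the dotted line in Fig. 6) is the match. The shade mean values below are illustrative assumptions, not measured E-LCC data.

```python
import numpy as np

# Illustrative mean green intensities for E-LCC shades 1 (light) .. 6 (dark).
shade_means = np.array([210.0, 180.0, 150.0, 120.0, 90.0, 60.0])

def match_shade(field_green_mean):
    """Pick the reference shade whose intensity ratio is closest to 1."""
    ratios = field_green_mean / shade_means
    best = int(np.argmin(np.abs(ratios - 1.0)))
    return best + 1, ratios[best]         # 1-based shade index, its ratio

match_shade(125.0)
# -> shade 4, ratio slightly above 1 (the field color lies near shade 4)
```

A ratio exactly 1 corresponds to a circle on the dotted line in Fig. 6; a ratio above or below 1 means the leaf color lies between two shades.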
In the graph (Fig. 6), the X-axis indicates the percentage of nitrogen deficiency in
plants. There are six shades of green in the Electronic LCC, so the six vertical
lines represent the different shades. The red circle at the top of the graph
is near the dotted horizontal line (passing through the Y-axis at the value 1). The
vertical line shows the matching between the field images and the corresponding
shade. If the circle is on the dotted line, the corresponding shade from the
scheme exactly matches the field image. If it is below or above the dotted line, the
color of the field image lies between two shades. According to that information, the
farmer regulates the nitrogen fertilizer (like urea) in the soil.
The Electronic LCC provides a cost-effective, well-performing, and quick
measurement of nitrogen deficiency in green plants. The intensity of the green
color is associated with the nutrient (nitrogen) content. The critical value of
one on the LCC corresponds to yellowish-green, which indicates the lowest
nitrogen concentration; the highest concentration, indicated by the critical
value of six, corresponds to dark green. The image processing tools eliminate the
effects of disturbing factors such as
Crop Monitoring Agent System Based on Pattern Recognition Techniques 661
shadow, background, and sunlight. For instance, sunlight may cause a slight
difference in the color.
The camera must focus on the green portion of the plants and should exclude
non-green parts, such as the trunk or the soil, to obtain a better result. The
system also verifies that the picture was not taken in direct sunlight, because
it may then show a different shade than one taken in overcast weather.
Some examples of applied rules are the following:
(R7) If LCC = 5, then add 35 kg of nitrogen per hectare.
(R8) If LCC = 4, then add 20 kg of nitrogen per hectare.
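Rules of this kind reduce to a simple lookup from LCC reading to nitrogen dose. A minimal sketch, using only the two rules given in the text (the table and function names are ours; other entries would follow the same pattern):

```python
# Nitrogen dose (kg per hectare) keyed by Electronic LCC reading.
NITROGEN_DOSE_KG_PER_HA = {
    5: 35,  # R7: If LCC = 5, add 35 kg of nitrogen per hectare
    4: 20,  # R8: If LCC = 4, add 20 kg of nitrogen per hectare
}

def nitrogen_recommendation(lcc_value):
    """Return the recommended nitrogen dose (kg/ha), or None if no rule fires."""
    return NITROGEN_DOSE_KG_PER_HA.get(lcc_value)
```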
A field recorded an 87% higher yield when rule R7 was applied. In the case of
rule R8, although the yield was lower, the farmer saved 40 to 50% of the
fertilizer.
This paper has described the design, development, and validation of a crop
monitoring expert-agent system that analyzes the impacts of environmental
stresses on the growth of plants. The approach combines hardware instrumentation
with a knowledge-based agent system. The central computer service receives
crucial environmental parameter data from the sensor network; the image
processing detects nutrient deficiencies in plants and minimizes the effect of
images that do not capture a large green portion. The Electronic Leaf Color
Chart replaces the manual LCC, providing enough information for future scenarios.
The end user receives warning messages to prevent critical situations, together
with ad hoc solutions for environmental stress.
The application of Artificial Intelligence techniques such as image processing,
together with ad hoc instrumentation, can make agriculture more productive and
662 A. Hanif et al.
effective. Monitoring plants in real time through color image processing has made
the system quite useful: it compares, in real time, the image of the plant with
patterns of previous images and communicates the results by e-mail. To avoid
abrupt fluctuations in temperature and humidity values, the agents average the
previous values, thereby avoiding temporary alarm conditions.
The obtained results show that agriculture can improve day by day and that the
process of plant growth, although more sophisticated, can be better controlled.
For instance, adequate amounts of water, fertilizers, and nutrients improve the
yield while reducing cost. Besides, the monitoring system advises adjusting the
water supply, temperature, and humidity as a preventive measure, and makes it
convenient to review past trends of nitrogen deficiency in the plants. In
addition to remotely monitoring and controlling the plant environment, the
system minimizes human effort and supports farmers and landowners by indicating
the appropriate quantity of nutrients, minimizing energy consumption and thus
improving production efficiency and the economic outcome.
The obtained results are helpful for extending this research to producing new
types of crops with higher yield in remote areas; plants could grow in areas
whose environmental conditions are not suitable for a specific crop. Future
research will address improvements in data acquisition; automatically
increasing, modifying, and organizing the knowledge base of inference rules to
compensate for environmental conditions not yet considered; and testing
different types of plants (e.g., cotton), soils, and areas, as well as the
effects of volatile substances.
Small Ship Detection on Optical Satellite
Imagery with YOLO and YOLT
1 Introduction
In the field of remote sensing, ship detection can help with many problems, such
as maritime management and illegal-fishing surveillance, and there is already a
considerable body of work on it. However, the majority of object detection
methods fail at detecting small objects.
We compare the latest versions of You Only Look Once (YOLO) and You Only
Look Twice (YOLT) on the problem of detecting small ships. For this purpose,
two datasets were used for the performance metrics: the High-Resolution Ship
Collection (HRSC) dataset [27], with satellite images of ships of different
object sizes, and the Mini Ship Data Set (MSDS), built in this project, which
contains only small objects.
The Average Precision (AP) of YOLO and YOLT was evaluated on the MSDS dataset
(which contains only small ships). In this case, YOLT obtained 76% AP and YOLO
69%, which demonstrates that YOLT is a good choice for small objects. In the
case of the HRSC dataset
c Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 664–677, 2020.
https://doi.org/10.1007/978-3-030-39442-4_49
(big, medium, and small ships), YOLT obtained only 40% AP against 75% for YOLO.
YOLT's AP decreases considerably in this case: YOLT changed the network
architecture so as not to downsample the objects as much, which improves
performance on small-object detection but hurts the detection of big and medium
objects.
2 Related Works
There are plenty of methods for ship detection. We group them in two: the first
consists of methods that use statistics and image processing; the second
consists of methods based on deep learning.
The use of deep learning for object detection has outperformed the classic
methods. The top deep networks for object detection are R-CNN, Fast R-CNN,
Faster R-CNN, YOLO, and SSD. The R-CNN [22], presented in 2014, used selective
search [26] for the initial object-proposal bounding boxes; a deep convolutional
net was then applied to each proposal region, with a Support Vector Machine
(SVM) in the last layer. This method performed very well in the Visual Object
Classes (VOC) challenge. R-CNN was then improved by Fast R-CNN [23] in 2015: the
first convolution layers are shared across all proposals, and the SVM classifier
is replaced by a softmax function in the last layer. Also in 2015, Faster R-CNN
was presented [24]. The problem with R-CNN and Fast R-CNN was the method they
used for object proposal: selective search is a bottleneck and also proposes too
many candidates. For that reason, Faster R-CNN uses a convolutional net for
666 W. Nina et al.
classifying and detecting regions of interest, which reduced the processing time
considerably.
YOLO was proposed in [21]; similarly to Faster R-CNN, it uses a convolutional
net to detect and classify objects, but it treats object detection as a
regression problem: since the bounding-box coordinates are numbers, a net can
predict them directly, as regression methods do. This method has the lowest
processing time but is outperformed by Faster R-CNN in terms of accuracy.
Moreover, SSD [25] detects objects in images using a single deep neural network,
discretizing the output space of bounding boxes into a set of default boxes over
different aspect ratios and scales per feature-map location. At the end of 2016,
YOLOv2 [20] was presented, adding batch normalization, anchors,
higher-resolution input images, fine-grained features, and multi-scale training.
Finally, YOLOv3 [19] outperformed the previous versions by adding a Feature
Pyramid Network (FPN), logistic classifiers for every class instead of softmax,
and Darknet-53 instead of Darknet-19.
In 2016, Liu [8] proposed a CNN-based model to detect ships in Optical Remote
Sensing Images (ORSI), and Zhang [6] proposed an S-CNN for the same purpose.
More recently, in 2019, Zhang [7] adapted Faster R-CNN for detecting ships.
Moreover, Ma [9] proposed a model to detect and recognize ships in ordinary
images, while Zhao, in 2019 [10], proposed a model for ship detection in video.
Normally, a Convolutional Neural Network (CNN) for object detection returns
bounding boxes parameterized as (xmin, ymin, xmax, ymax) or (x, y, width,
height), but some implementations improve on this by returning rotated bounding
boxes with an added angle, e.g., (x, y, width, height, angle). For example,
Yang [12] proposed the Rotation Dense Feature Pyramid Network (R-DFPN) model to
detect rotated ships. Yang [5] and Li [4] also proposed models that detect the
position and direction of ships. Moreover, Fu [2] proposed a model based on a
Feature Fusion Pyramid Network and deep reinforcement learning (FFPN-RL), while
Dong [3] used saliency and a rotation-invariant descriptor.
3 Proposal
We evaluated YOLOv3 (YOLO version 3) and YOLT for detecting small ships on
optical satellite imagery; by "small", we refer to objects that are small
relative to the image. Top state-of-the-art CNNs for object detection, such as
Faster R-CNN, usually fail to detect small objects, as tested in [11]. YOLT was
presented as a modification of YOLOv2 for detecting small objects, but YOLOv3
has since included an FPN, so this work focuses on comparing the two at
detecting small objects.
Normally, the top state-of-the-art CNNs for object detection were trained on
datasets such as COCO [1] and VOC [31], with good performance. For example, in
Fig. 1 it can be seen that the objects usually occupy a considerable portion of
the full image, roughly 20% to 70% of the image size. The problem arises when
you need to detect small objects, as in Fig. 2: there, the image size is
2400 × 1200 pixels, but the objects are only 60 pixels wide on average, so,
comparing the widths of the image and the objects, the objects represent just
2.5% of the image.
Fig. 1. Normal size of objects in object detection methods. Source: YOLO [19].
3.2 YOLO
3.3 YOLT
YOLT was proposed to reduce model coarseness and accurately detect dense objects
(such as cars). Following [28], it implements a network architecture with 22
layers that downsamples by a factor of 16 rather than YOLO's standard 32×
downsampling; the network architecture is presented in Table 2.
Thus, a 416 × 416 pixel input image yields a 26 × 26 prediction grid. The
architecture is inspired by the 28-layer YOLO network, though this new
architecture is optimized for small, densely packed objects. The denser grid is
unnecessary for diffuse objects such as airports but improves performance in
high-density scenes such as parking lots, and the smaller number of layers
increases run speed [29].
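The relation between input size, downsampling factor, and prediction grid stated above can be checked directly (a trivial sketch; the function name is ours):

```python
def prediction_grid(input_px, downsample_factor):
    """Side length of the output prediction grid for a square input image."""
    return input_px // downsample_factor

yolo_grid = prediction_grid(416, 32)   # standard YOLO downsampling: 13x13 grid
yolt_grid = prediction_grid(416, 16)   # YOLT downsampling: 26x26, a denser grid
```

Halving the downsampling factor doubles the grid side, i.e., quadruples the number of cells available for densely packed small objects.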
4.1 Datasets
Although some satellite image datasets exist, a dataset reflecting the special
conditions of the Peruvian sea had to be built. This dataset, named the Mini
Ship Data Set (MSDS), was built with images of the Peruvian and Chilean seas
from Google Earth. Figure 2 presents some examples. As can be seen, these images
are unlike other datasets: the objects are very small relative to the full image
size.
Moreover, some images have noise, as can be seen in the left part of Fig. 2. The
dataset consists of 200 images with a resolution of 2400 × 1200 pixels, split
into 120 for training, 40 for validation, and 40 for testing. The objects have
widths from 20 to 500 pixels, with the majority between 40 and 50 pixels.
Figure 3 presents the histogram of object widths.
Fig. 2. Satellite images of MSDS dataset. Left: Images with noise. Right: Normal
images.
The High Resolution Ship Collection 2016 (HRSC2016) dataset [18] was also used.
It includes ships at sea and ships close inshore, taken from Google Earth (see
Fig. 4). In this case, the image sizes range from 300 × 300 to 1500 × 900
pixels; Fig. 5 presents the histogram of object widths. As can be seen, the HRSC
dataset contains objects of all sizes, ranging from 20 to 620 pixels.
4.2 Results
Figure 11 presents some detection results, where detected ships are marked by
red rectangles; the test image size is 2400 × 1422, and a slice size of 1500 was
applied. In the experiments on the HRSC dataset, YOLT obtained a mAP of 40.00%
while YOLOv3 obtained a mAP of 75.00%. Combined with the MSDS results, this
demonstrates that YOLT works well specifically on small objects.
          YOLO      YOLT
MSDS    69.80%    76.06%
HRSC    75.00%    40.00%

          YOLO      YOLT
Time    0.05327   0.06
4.3 Problems
One problem with YOLT is that when ships are close together it does not detect
them well; one reason is the image size used in the training stage. Figure 12
presents some examples.
5 Conclusions
Two datasets were used: the HRSC dataset, which contains objects of different
sizes, and the MSDS dataset, built for this project, which contains small ship
objects on optical satellite imagery.
YOLO and YOLT were compared on the HRSC and MSDS datasets. On the MSDS dataset,
YOLT outperformed YOLO, with 76% versus 69% AP, respectively. Meanwhile, on the
HRSC dataset, YOLO outperformed YOLT, with 75% versus 40% AP, respectively.
Therefore, this work demonstrated that YOLT is a good framework for detecting
small objects, but it falls short in other cases. Faster R-CNN was also tried,
but it did not work for very small objects.
References
1. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft COCO: common objects in context. In: European Confer-
ence on Computer Vision, pp. 740–755. Springer (2014)
2. Fu, K., Li, Y., Sun, H., Yang, X., Xu, G., Li, Y., Sun, X.: A ship rotation detection
model in remote sensing images based on feature fusion pyramid network and deep
reinforcement learning. Remote Sens. 10(12), 1922 (2018)
3. Dong, C., Liu, J., Xu, F.: Ship detection in optical remote sensing images based
on saliency and a rotation-invariant descriptor. Remote Sens. 10(3), 400 (2018)
4. Li, S., Zhang, Z., Li, B., Li, C.: Multiscale rotated bounding box-based deep learn-
ing method for detecting ship targets in remote sensing images. Sensors 18(8),
2702 (2018)
5. Yang, X., Sun, H., Sun, X., Yan, M., Guo, Z., Fu, K.: Position detection and
direction prediction for arbitrary-oriented ships via multitask rotation region con-
volutional neural network. IEEE Access 6, 50839–50849 (2018)
6. Zhang, R., Yao, J., Zhang, K., Feng, C., Zhang, J.: S-CNN-based ship detection
from high-resolution remote sensing images. In: International Archives of the Pho-
togrammetry, Remote Sensing & Spatial Information Sciences, vol. 41 (2016)
7. Zhang, S., Wu, R., Xu, K., Wang, J., Sun, W.: R-CNN-based ship detection from
high resolution remote sensing imagery. Remote Sens. 11(6), 631 (2019)
8. Liu, Y., Cui, H.-Y., Kuang, Z., Li, G.-Q.: Ship detection and classification on
optical remote sensing images using deep learning, vol. 12, p. 05015. EDP Sciences
(2017)
9. Ma, M., Chen, J., Liu, W., Yang, W.: Ship classification and detection based on
CNN using GF-3 SAR images. Remote Sens. 10(12), 2043 (2018)
10. Zhao, H., Zhang, W., Sun, H., Xue, B.: Embedded deep learning for ship detection
and recognition. Future Internet 11(2), 53 (2019)
11. Eggert, C., Brehm, S., Winschel, A., Zecha, D., Lienhart, R.: A closer look: small
object detection in faster R-CNN. In: 2017 IEEE International Conference on Mul-
timedia and Expo (ICME), pp. 421–426. IEEE (2017)
12. Yang, X., Sun, H., Fu, K., Yang, J., Sun, X., Yan, M., Guo, Z.: Automatic ship
detection in remote sensing images from google earth of complex scenes based
on multiscale rotation dense feature pyramid networks. Remote Sens. 10(1), 132
(2018)
13. Corbane, C., Pecoul, E., Demagistri, L., Petit, M.: Fully automated procedure
for ship detection using optical satellite imagery. In: Remote Sensing of Inland,
Coastal, and Oceanic Waters, vol. 7150. International Society for Optics and Pho-
tonics (2008)
14. Inggs, M.R., Robinson, A.D.: Ship target recognition using low resolution radar
and neural networks. IEEE Trans. Aerosp. Electron. Syst. 35(2), 386–393 (1999)
15. Bi, F., Zhu, B., Gao, L., Bian, M.: A visual search inspired computational model
for ship detection in optical satellite images. IEEE Geosci. Remote Sens. Lett. 9(4),
749–753 (2012)
16. Shi, H., Zhang, Q., Bian, M., Wang, H., Wang, Z., Chen, L., Yang, J.: A novel ship
detection method based on gradient and integral feature for single-polarization
synthetic aperture radar imagery. Sensors 18(2), 563 (2018)
17. Cheng, G., Han, J.: A survey on object detection in optical remote sensing images.
ISPRS J. Photogramm. Remote Sens. 117, 11–28 (2017)
18. Liu, Z.K., Weng, L.B., Yang, Y.P., et al.: A high resolution optical satellite image
dataset for ship recognition and some new baselines (2017)
19. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv (2018)
20. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. arXiv preprint
arXiv:1612.08242 (2016)
21. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified,
real-time object detection. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 779–788 (2016)
22. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accu-
rate object detection and semantic segmentation. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
23. Girshick, R.: Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015)
24. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object
detection with region proposal networks. In: Advances in Neural Information Pro-
cessing Systems, pp. 91–99 (2015)
25. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.:
SSD: single shot multibox detector. In: European Conference on Computer Vision,
pp. 21–37. Springer (2016)
26. Uijlings, J.R.R., Van De Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective
search for object recognition. Int. J. Comput. Vision 104(2), 154–171 (2013)
27. Liu, Z., Yuan, L., Weng, L., Yang, Y.: A high resolution optical satellite image
dataset for ship recognition and some new baselines. In: ICPRAM, pp. 324–331
(2017)
28. Van Etten, A.: You only look twice: rapid multi-scale object detection in satellite
imagery. CoRR, abs/1805.09512 (2018)
29. Van Etten, A.: Satellite imagery multiscale rapid detection with windowed net-
works. CoRR, abs/1809.09978 (2018)
30. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
31. Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J.,
Zisserman, A.: The Pascal visual object classes challenge: a retrospective. Int.
J. Comput. Vision 111(1), 98–136 (2015)
A Text Classification Model to Identify
Performance Bonds Requirement
in Public Bidding Notices
1 Introduction
A performance bond is a “bond taken out by the contractor (the obliging party),
usually with a financial institution or insurer, for the benefit and at the request of
the employer (the offering party), in a defined sum of liability and enforceable by
the employer in the event of the contractor’s default” [1,2]. It aims to guarantee
that a product (construction, service or goods) will be delivered in a timely and
workmanlike manner, otherwise, the contractor should pay an indemnity to the
buyer.
In Brazil, government procurement is regulated – among other rules and
laws – by Law 8,666 (see footnote 1), which establishes in its article 56 that “[...] in each
1. Lei nº 8.666, de 21 de junho de 1993 – http://www.planalto.gov.br/ccivil_03/Leis/l8666cons.htm.
c Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 678–691, 2020.
https://doi.org/10.1007/978-3-030-39442-4_50
where tf_k,j(d_j, t_k) represents the term frequency and idf_k represents the
inverse document frequency, as described in Eq. 1.

    p(c|d_j) = p(d_j|c) p(c) / p(d_j)    (3)
In the context of text classification, p(c|d_j) is the probability that the
document d_j belongs to the class c [3].
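Since p(d_j) in Eq. 3 is the same for every class, a Naive Bayes classifier only needs to compare p(d_j|c) p(c) across classes, usually in log space. A minimal sketch of that decision rule (the probability values below are illustrative toy numbers, not from the paper's data):

```python
import math

# Toy per-class log-priors and log-likelihoods for one document d_j.
log_priors = {"yes": math.log(0.63), "no": math.log(0.37)}
log_likelihoods = {"yes": math.log(0.02), "no": math.log(0.005)}

def predict(log_likelihoods, log_priors):
    """Return argmax_c [log p(d_j|c) + log p(c)]; p(d_j) cancels out."""
    return max(log_priors, key=lambda c: log_likelihoods[c] + log_priors[c])
```

Here the "yes" class wins because log(0.02 × 0.63) > log(0.005 × 0.37), exactly the comparison Eq. 3 induces once the shared denominator is dropped.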
Other novel algorithms, such as Deep Learning, Hierarchical Deep Learning for
Text Classification, and Convolutional Neural Networks, were not adopted in this
study because they present high complexity with increased computational effort
and, since our dataset is not large, their use could also lead to
overfitting [18].
3 Related Works
The work in [21] identified that many enterprise systems have to handle a large
pool of documents. Manual verification of those documents can take a huge
amount of time. Then, they proposed a rule-based framework that enables auto-
matic verification of document-based systems. The objective of the framework
is to reduce manual intervention as much as possible such that an automated
support for document verification can be provided to the customers.
In [22], researchers sought to investigate the downstream potential of
biomedical research in Indonesia based on scientific publications. To achieve
their objectives, they built a classification model designed to classify a
document into four different classes, applying four different classifiers: NB,
NB (kernel), k-NN, and SVM. The k-NN model showed better performance than the
other classifiers in accuracy, achieving 84.85%.
In [23], the development of a classifier able to identify racist and malicious
comments and posts was proposed. To train the model, the authors used a set of
documents containing different antisocial texts collected from various blogs,
speeches, and articles. The algorithm used to create the model was k-NN. From
the experimental results, the authors concluded that the task of detecting
antisocial texts can be accomplished through text mining.
684 U. C. da Cunha et al.
4 Methodology
The methodology applied in this study consists of four main steps: (1) Document
gathering – identification and collection of bidding notices (documents); (2) Pre-
processing – text extraction, restructuring, cleansing and transformation; (3)
Modeling – creation and evaluation of the models; and (4) Model evaluation –
estimation of how well the chosen model will work with new data. Figure 1 shows
the execution order of the steps.
The data used in this work consists of 478 bidding notices retrieved from the
e-BC system – a CBB’s internal document management system – and from
the Brazilian government purchases portal (see footnote 2). These are the repositories where the documents of interest can be found.
the documents of interest can be found. The bidding notices were published
from 2014 to 2018. They are distributed in two classes with percentages of 37%
and 63%, as we can see in Fig. 2. The labels of the data were obtained from
another CBB’s internal system called Sistema de Administração de Instrumentos
Contratuais (SAIC). Most of the files were available in PDF format whilst the
rest of them were available in DOCX format.
4.2 Pre-processing
Once we had all files downloaded, we started the text extraction from both PDF
and DOCX files. For PDF files, we used the program pdftotext.exe, available in
the open-source toolkit Xpdf (see footnote 3). The program takes a PDF file as
input and outputs a TXT file containing the PDF file's text content. For the
DOCX files, we used the Python library python-docx (see footnote 4), which can
be used for creating and updating Microsoft Word (.docx) files. Differently from
pdftotext, to extract text from a DOCX file using python-docx, we need to go
through every element in the file, such as paragraphs, tables, rows, and cells,
extract their content, and concatenate them into a string. After all iterations,
we save the string into a new file. At the end of this process, we have a set of
TXT files, each one containing the text of its corresponding PDF or DOCX file.
After extracting the data, it was necessary to restructure the text retrieved
from the PDF files. The reason is that, differently from the texts retrieved from
DOCX files, the texts obtained from PDF files come with hyphenation, so that
we need to identify where hyphenation occurred and try to rebuild the word
2. https://www.comprasgovernamentais.gov.br/.
3. http://www.xpdfreader.com/.
4. http://python-docx.readthedocs.io/en/latest/.
correctly, removing the hyphen and the newline character between the two parts
of the word.
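The de-hyphenation step described above can be sketched with a single regular expression over the extracted text. This is a stdlib-only sketch; the exact rules the authors used may differ (e.g., handling of legitimately hyphenated words).

```python
import re

# Matches a word split across lines by pdftotext: "word-\ncontinuation".
_HYPHEN_BREAK = re.compile(r"(\w+)-\n(\w+)")

def rejoin_hyphenated(text):
    """Remove the hyphen and newline splitting a word across two lines."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)

sample = "garantia contra-\ntual exigida"
fixed = rejoin_hyphenated(sample)   # "garantia contratual exigida"
```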
Once all texts from all files were equally structured, that is, without
hyphenation or newline characters between parts of a word, we started the
process of cleansing and transformation. First, we standardized the encoding of
all texts in order to be able to process them uniformly. The next step was to
remove diacritical marks, such as the acute accent (´) and the cedilla (¸).
After that, we removed words, expressions, abbreviations, numbers, etc. that are
non-informative in the context of the problem. The stop words for the Portuguese
language available in the Natural Language Toolkit (NLTK) (see footnote 5) were
also identified and removed from the texts. Finally, we used stemming to remove
the suffixes, keeping only the roots, or stems, of the words. Most of the
cleansing tasks were done with regular expressions, whereas stemming was done
using the RSLPStemmer method of NLTK, which stands for Removedor de Sufixos da
Língua Portuguesa, or, in English, Portuguese Language Suffix Remover.
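The diacritic-removal and stop-word steps can be sketched with the standard library alone. NLTK's full Portuguese stop-word list and the RSLP stemmer are not reproduced here; the tiny stop-word set below is illustrative only.

```python
import re
import unicodedata

# Illustrative subset of Portuguese stop words (NLTK's list is much longer).
STOP_WORDS = {"de", "a", "o", "que", "e", "do", "da"}

def strip_diacritics(text):
    """Drop combining marks such as the acute accent and the cedilla."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def cleanse(text):
    """Lowercase, remove diacritics, keep alphabetic tokens, drop stop words."""
    text = strip_diacritics(text.lower())
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

tokens = cleanse("Garantia de execução do contrato")
```

On the sample phrase this yields ["garantia", "execucao", "contrato"]: accents are stripped and the stop words "de" and "do" are discarded.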
4.3 Modeling
After all documents were properly cleansed and transformed, we started the
modeling phase. First, the data were split into training and test sets in a
proportion of 3:1, in a stratified fashion using the target class distribution
as reference. After that, the data were transformed into TF and TF-IDF vector
space representations. To build a TF representation, we applied the
CountVectorizer method from the scikit-learn library (see footnote 6) on the
documents, regardless of the modeling algorithm used. The CountVectorizer method
converts a collection of text documents to a matrix of token counts. To build a
TF-IDF representation, we also applied the TfidfTransformer method, which
transforms a count matrix to a normalized TF-IDF representation according to the
equation

    tf-idf(d, t) = tf(t) × idf(t),   where   idf(d, t) = log((n + 1) / (df(t) + 1)) + 1,

t is the term, d is the document, n is the total number of documents, and df(t)
is the number of documents that contain the term t. Here, we can see that the
equation applied by the scikit-learn library differs slightly from Eq. 2
mentioned in Sect. 2. According to the scikit-learn library documentation (see
footnote 7), the default parameter smooth_idf=True adds 1 to both numerator and
denominator in order to prevent zero divisions.
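The smoothed idf formula is easy to check by hand; a direct transcription (function name is ours):

```python
import math

def smoothed_idf(n, df):
    """scikit-learn's idf with smooth_idf=True: log((n+1)/(df+1)) + 1."""
    return math.log((n + 1) / (df + 1)) + 1

# A term appearing in every one of n=4 documents still gets a nonzero
# weight of log(1) + 1 = 1.0, so no division by zero can occur.
idf_everywhere = smoothed_idf(4, 4)
idf_rare = smoothed_idf(4, 1)   # log(5/2) + 1, larger weight for rare terms
```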
Each vector space was used separately as a training dataset, so all four
algorithms were trained with both TF and TF-IDF representations. For each
algorithm, a set of hyperparameters was defined, and all possible combinations
were tested in order to improve the chances of generating the model with the
best possible performance. To accomplish this, we used the Pipeline and
GridSearchCV methods from the scikit-learn library. The F1-score metric and
10-fold cross-validation were used to estimate the models' performance. The best
5. https://www.nltk.org/.
6. http://scikit-learn.org/stable/index.html.
7. http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction.
model was the one that presented the highest metric value, regardless of the
algorithm or vector space used. Table 1 presents the sets of hyperparameters for
each algorithm.
Algorithm                 Hyperparameters
Multinomial Naive Bayes   Default
k-Nearest Neighbors       n_neighbors: [3, 5, 7, 9]; weights: ['uniform', 'distance'];
                          algorithm: ['ball_tree', 'kd_tree']; leaf_size: [20, 30, 50]; p: [1, 2]
                          n_neighbors: [3, 5, 7, 9]; weights: ['uniform', 'distance'];
                          algorithm: ['brute']; p: [1, 2]
Support Vector Machines   C: [0.1, 0.5, 1, 1.5, 2, 2.5, 5, 10]; penalty: ['l1'];
                          loss: ['squared_hinge']; dual: [False]
                          C: [0.1, 0.5, 1, 1.5, 2, 2.5, 5, 10]; penalty: ['l2'];
                          loss: ['squared_hinge', 'hinge']; dual: [True]
Random Forest             max_features: [None, 'sqrt']; max_depth: [None, 200, 50];
                          min_samples_split: [2, 10]; min_samples_leaf: [1, 10];
                          n_estimators: [10, 50, 100]
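"All possible combinations" means the grid expands as a Cartesian product, which is what GridSearchCV enumerates internally. A stdlib-only sketch of that expansion, using one of the k-NN grids above (parameter names follow scikit-learn's spelling; the helper function is ours):

```python
from itertools import product

# One of the two k-NN blocks from Table 1.
knn_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree"],
    "leaf_size": [20, 30, 50],
    "p": [1, 2],
}

def expand(grid):
    """Enumerate every hyperparameter combination in the grid."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

candidates = expand(knn_grid)   # 4 * 2 * 2 * 3 * 2 = 96 candidate models
```

Each of the 96 candidate settings would then be fitted and scored by 10-fold cross-validation, so the grid size directly multiplies the training cost.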
Once we determined the best model with respect to the performance obtained in
cross-validation, i.e., the model whose F1-score was the highest, we estimated
the out-of-sample error using the test dataset. The main goal of evaluating the
final model against the test set is to verify that the model did not overfit the
training data and to estimate how well it will perform when classifying new
documents.
5 Results
In this section, we present the best result of each algorithm with the respective
set of parameters. The algorithms' results are presented in ascending order
according to the performance obtained in cross-validation. All methods used to
fit the data are from the scikit-learn library. Table 2 summarizes the performance
obtained with all four algorithms. Metric values are from the cross-validation applied
on the training set.
The fourth highest performance was obtained with the Multinomial Naive Bayes
classifier using the TF representation. The F1-score obtained was 0.888 and the
method used to fit the data was MultinomialNB.
The third best performance was obtained with k-NN on the TF representation.
The F1-score obtained was 0.905 and the method used to fit the data
was KNeighborsClassifier. The set of parameters of k-NN's best model was
{algorithm: ball_tree, leaf_size: 20, n_neighbors: 9, p: 1, weights: uniform}.
The second best performance was obtained with SVM on the TF-IDF representation.
The F1-score obtained was 0.924 and the method used to fit the data
was LinearSVC. The set of parameters of SVM's best model was {C: 1.5, dual:
False, loss: squared_hinge, penalty: l1}.
Finally, the best performance was obtained with Random Forest on the TF
representation. The F1-score obtained was 0.952 and the method used to fit the
data was RandomForestClassifier. The set of parameters of Random Forest's
best model was {max_depth: None, max_features: None, min_samples_leaf: 10,
min_samples_split: 2, n_estimators: 100}. When classifying new documents with
the test set, the performance obtained was 0.933. Table 3 presents the confusion
matrix generated from the classification performance of the Random Forest
classifier on the test set.

Table 3. Confusion matrix of the Random Forest classifier on the test set.

                 Predicted
             No    Yes   Total
True   No    40      5      45
       Yes    5     70      75
Total        45     75     120
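For reference, the best Random Forest configuration reported above can be instantiated directly in scikit-learn. This is a sketch only; the training data and TF vectorization step are omitted:

```python
from sklearn.ensemble import RandomForestClassifier

# Best Random Forest hyperparameters found by the grid search (see text).
best_rf = RandomForestClassifier(
    max_depth=None,        # grow trees without a depth limit
    max_features=None,     # consider all features at each split
    min_samples_leaf=10,
    min_samples_split=2,
    n_estimators=100,
)
# best_rf.fit(X_train_tf, y_train) would then fit the final model
# (X_train_tf / y_train are hypothetical names for the TF features and labels).
```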
It is worth noting that the difference between the values obtained in
cross-validation and in the final evaluation (test set) was only 0.019. This is
strong evidence that the final model did not overfit. This may be related to the
fact that bagging approaches, such as the one used by the Random Forest algorithm,
usually lead to a reduction in variance, consequently improving the overall
generalization error. This hypothesis is supported by the fact that the value of
the parameter n_estimators used in the final model was 100, that is, the largest
value available in the set of options (see Table 1).
6 Conclusion
could be. One solution to this issue could be the creation of a second model that
identifies the excerpt which defines the requirement of a bond. By combining these
two models in a pipeline, the final solution would not only classify a
bidding notice correctly but would also present to the auditor the piece of text
used to determine the class.
Future work should focus on verifying performance when applying different
vector space models, such as binary and latent semantic indexing (LSI), and new
term weighting methods, such as bi-normal separation (BNS). Another area of
improvement could be the use of n-grams with more than one item, instead of
only the unigrams used in this study.
Estimating the Time-Lapse Between
Medical Insurance Reimbursement
with Non-parametric Regression Models
1 Introduction
algorithms and more specifically, nearest neighbors, support vector machines, decision
trees and random forests. These algorithms have been extensively studied on diverse
datasets [1–3]. We attempt to contribute to the comparative literature by studying the
algorithms' performance on a medical insurance dataset. The machine learning goal is
centered around a regression model that estimates the time lapse between the date upon
which medical treatment occurs and the date when an insurance company reimburses
charges spent on the treatment.
The analysis focuses on the comparative effect of data preprocessing practices,
including feature engineering and extraction, encoding, and feature scaling, as well as
other important aspects of machine learning such as hyperparameter tuning using the
grid search algorithm. The R2 score is chosen as the primary scoring technique simply
because we wish to know how well each nonparametric regression model estimates the
real data points, that is, the goodness of fit.
Section 2 of this paper presents a brief literature review under the moniker of
related work; here we discuss some important comparative studies that make
use of nonparametric algorithms in various applications. Section 3 presents our
methodology for the comparisons; we also discuss the features of the dataset with
important data analysis and visualization. Section 4 presents our results, Sect. 5 dis-
cusses the results, and finally we conclude in Sect. 6.
2 Related Work
The "no free lunch" theorem stated by David Wolpert [4] shows that there is indeed
no single best supervised learning algorithm. What this means is that one
can only say that an algorithm performs better in some task domains with a well-
defined target label y and less so in other domains; this notion is key when carrying out
comparative studies. The goal of a comparative analysis is not to state that one algo-
rithm is better than the other but rather it is to reinforce certain expectations about the
performance of an algorithm given a certain task. Most researchers are aware of the
lack of a priori distinctions between supervised learning algorithms, and that the per-
formance of an algorithm highly depends on the nature of the target label y. This is
seen in the way the results from comparative studies are presented. In [5], the authors
found that there is no universal best algorithm; however, using datasets from different
substantive domains, their results showed that boosted trees, followed by SVMs, had the
best overall performance, while logistic regression and decision trees performed the
worst. Still on the same prediction-of-recidivism task, the results from [6] showed
that when a subset of predictors was used, traditional techniques such as logistic
regression performed better, while random forest performed best on the whole set of
predictors. More recent studies have shown, again, that the sample size is a key factor
when comparing the performance of older algorithms such as logistic regression with
newer algorithms like LogitBoost [7]. Research on breast cancer detection in [9] used
five machine learning algorithms to separately classify a multidimensional image
dataset, and the results were compared. The five machine learning classifiers used on
the Shearlet-transformed images were Support Vector Machines (SVM), Naive
Bayes, Multilayer Perceptron, k-Nearest Neighbour, and Linear Discriminant Analysis
694 M. Akinyemi et al.
classifier. The results conclude that SVM models were the best classifiers for breast
cancer detection using images with a well-defined region of interest.
The study conducted in [8] compares Gradient Boosting Machine, Random Forests,
Support Vector Machines, and a Naive Bayes model. An ensemble of these models
was fitted to create the ML model used for comparison with EUROSCORE II and
logistic regression. The models were fitted twice due to their sensitivity to the input
data; a chi-square test was applied to the input features on the second fitting, and only
relevant features were used. The performance of each machine learning model was
assessed with the area under the ROC curve; Random Forest produced the best result
regardless of whether the data was filtered. Out of the four machine learning models,
Naive Bayes produced the weakest accuracy without filtering but proved better than
SVMs with filtering; in both cases, however, logistic regression did better than both
Naive Bayes and SVMs.
However brief, the key thing to take from the review is the various reasons why a
particular learning algorithm performs better than another. While presenting our results,
we shall discuss the performance of the algorithms with respect to certain aspects of the
dataset as well as hyperparameter tuning. Although most of the papers reviewed
addressed classification problems, we are interested in seeing whether similar effects,
such as that of the training set sample size seen in [5] and [6], will also affect the
performance of our models.
3 Methodology
In this section, we present first the dataset used for the training of the models, then we
present the various techniques used to train the models, test and score their
performance.
Our goal is to augment a statistical exploratory analysis of the data with supervised
machine learning. The supervised learning task is to train a model to estimate the time
lapse between a treatment date and a payment date. To obtain this time lapse,
we calculate the number of days between the treatment date and the payment date.
This derived time lapse is then used as the target variable during training.
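The target derivation above can be sketched in pandas. The column names are hypothetical, since the source does not give the exact schema:

```python
import pandas as pd

# Hypothetical column names standing in for the insurance dataset's schema.
df = pd.DataFrame({
    "treatment_date": pd.to_datetime(["2018-01-05", "2018-02-10"]),
    "payment_date": pd.to_datetime(["2018-03-01", "2018-02-20"]),
})

# Target variable: number of days between treatment and reimbursement.
df["time_lapse"] = (df["payment_date"] - df["treatment_date"]).dt.days
```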
Descriptive Statistics. In Table 1, we present a descriptive analysis of the health
insurance dataset.
Data Visualization. The frequency distribution in Fig. 1 shows that most insurance
companies make reimbursements within 100 days.
In Fig. 2, the frequency distribution of the company codes is presented; this
indicates the number of records associated with each company. In Fig. 3, one can
observe that some insurance companies make reimbursements in less than 100 days
while others may take as long as a year.
Fig. 3. Scatter plot showing the insurance companies (company_code) and the time lapse.
Data Preprocessing. All the features except for the payment date are selected for our
training and test set. The target label is of course the time lapse. We apply label
encoding to the character features, the diagnosis and the prescription. Next, we apply
one hot encoding to ensure that categorical features are not taken as ordinal values.
This increased the number of features in the dataset to 533; we did not apply any
dimensionality reduction techniques such as principal component analysis. We split the
dataset into a training set and a test set using an 80-20 (percent) ratio. This resulted in
56,709 instances for the training set and 14,178 instances for the test set.
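The 80-20 split described above can be sketched as follows (synthetic stand-in data; the real inputs are the one-hot-encoded insurance records):

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the 70,887 encoded records.
X = [[i] for i in range(100)]
y = list(range(100))

# 80-20 split, as used for the insurance dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```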
Finally, we normalize all the values in the training and test sets to lie between 0 and
1 using the MinMaxScaler function in sklearn. A mathematical description of the
min-max scaler is shown in Eq. (1), where x is an original value and x' is the
normalized value. Normalization ensures all values fall within a particular range and
that an algorithm does not place a higher precedence on larger numerical values. The
data preprocessing functions were also carried out using the sklearn library [11].

x' = (x − min(x)) / (max(x) − min(x))    (1)
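Equation (1) corresponds to sklearn's MinMaxScaler; a minimal sketch:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # rescales each feature to [0, 1] per Eq. (1)
X = [[0.0], [50.0], [100.0]]
X_scaled = scaler.fit_transform(X)
# Each value becomes (x - min(x)) / (max(x) - min(x)).
```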
Here P is the optimization problem of finding the optimal value of w. In Eq. (2),
w is the model that minimizes the cost function, and it is a function of the optimized
parameters a and b. In Table 2, we briefly present the hyperparameters we chose for
optimization, the range of optimization values, and the effect of these choices.
In the R2 score, y_i is the estimated value of the i-th instance of n samples,
y_actual(i) is the actual i-th value, and ȳ = (1/n) Σ_{i=0}^{n-1} y_i.
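The R2 equation itself did not survive extraction; for completeness, the standard goodness-of-fit definition consistent with the symbols above is (note that in the standard definition ȳ is the mean of the actual values):

```latex
R^2 = 1 - \frac{\sum_{i=0}^{n-1}\left(y_{\mathrm{actual}(i)} - y_i\right)^2}
               {\sum_{i=0}^{n-1}\left(y_{\mathrm{actual}(i)} - \bar{y}\right)^2}
```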
4 Results
Training Results
The models are trained using the optimized parameters from Table 3. Table 4 shows
the training time for each model.
Validation Results
Table 5. The mean of the 10-fold cross validation (R2) score for each regression model.
S/N Regression model R2
1. KNN 0.4753
2. SVM 0.3631
3. Decision tree 0.6618
4. Random forest 0.7189
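The mean 10-fold R2 scores in Table 5 can be reproduced in outline with cross_val_score (synthetic stand-in data; the real features are the encoded insurance records):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the real inputs are the encoded insurance records.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(0.0, 0.01, 100)

# Mean 10-fold R2, as reported in Table 5 (small illustrative forest).
model = RandomForestRegressor(n_estimators=10, random_state=0)
scores = cross_val_score(model, X, y, cv=10, scoring="r2")
mean_r2 = scores.mean()
```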
Test Results
The k-NN model visibly performs better than the SVM model. From Table 6, it is
evident that the k-NN model makes correct predictions on close to 50% of inputs, and
predicts one particular type of reading in the data well. The issue with this model
is that its rate of correct prediction on that type of reading is still low: as Fig. 4
suggests, for roughly every 10 input data points the model appears to predict only
about 3 correctly. This uncertainty brings the accuracy of the model down; although
it performs better than the SVMs, it still does not match the Decision Tree or
Random Forest models.
Fig. 4. Plot showing the predicted time lapse over the real time lapse for the KNN regression
model.
The SVMs were not able to perform at a level above 50% accuracy. This poor
performance is observed in Fig. 5 and in the R2 shown in Table 6. The most common
input readings (readings on the lower half of the plot) are not correctly predicted by
the model. On closer inspection, the SVM model can be deemed to have badly underfit
the data, as most of the new input data points fall into the same bracket or range of
predictions. The nature of the data and the SVM model are simply not a good match.
Fig. 5. Plot showing the predicted time lapse over the real time lapse for the SVM regression
model.
The plot in Fig. 6 bears no evidence of overfitting or underfitting in the model.
The visual representation of the model does not show a uniform output across an
input range or predictions dominated by outlying data points. Instead, it shows
accurate readings of new inputs spread evenly across the test set. The model performs
very well compared to the SVMs and k-NNs, and this figure reinforces that.
Fig. 6. Plot showing the predicted time lapse over the real time lapse for the Decision tree
regression model.
The Random Forest plot in Fig. 7 also shows no signs of overfitting or underfitting;
on closer inspection, it shows a model that does not fit too closely to the data.
This suggests that the best performing model need not fit every data input closely,
but may instead apply a less granular analysis when fitting inputs. This is consistent
with the non-linearity of the data: a less rigorous check on an input from this dataset
favors a less granular analysis when predicting. Another way to see this is that a
random forest, being an average over many decision trees, trains on a range of scores
from those trees; this normalizing effect on its analysis may be the reason the model
generalizes so acceptably.
Fig. 7. Plot showing the predicted time lapse over the real time lapse for the Random Forest
regression model.
5 Discussion
The decision tree cross-validation score of 0.66 and its 67% prediction accuracy on the
test set, shown in Tables 5 and 6, are fairly good results, and they stand out when
compared to the performance of the two other dissimilar models (SVMs and k-NNs).
The data used in this experiment is highly non-linear: a unique instance of one feature
(e.g., Drug) can map to more than one unique instance of other features.
This similarity in the dataset depresses the search space of predictions, so models
act with very minimal discrimination on new inputs. Although the 'Charges sent'
feature may have a unique instance across all samples, it still does not produce a big
enough discrepancy for classification. Models like SVMs and k-NNs therefore suffer
greatly, because they assess new inputs by how close they are to a cluster of
points (SVMs) or make a prediction based on likeness to other data points (k-NNs);
most of their predictions are then fairly wrong.
Decision Trees and Random Forests, on the other hand, define the attention that
should be placed on the features of new inputs. A new input to the model traverses a
tree that classifies based on information rather than on distance or likeness to other
data points. This behavior does not suffer greatly from overly similar or intertwined
data instances, but it bears the risk of overfitting. Thus, a search method was employed
to monitor this unwanted attribute in our model: grid search was able to find an
optimal depth for the decision tree, thereby improving performance. The only model
that performed better than the decision tree is the Random Forest, which intuitively
makes sense and is expected: a random forest averages a number of these well-
performing decision trees, and so has better instances from which to make a decision.
6 Conclusion
References
1. Tang, L., Pan, H., Yao, Y.: PANK-A financial time series prediction model integrating
principal component analysis, affinity propagation clustering and nested k-nearest neighbor
regression. J. Interdiscip. Math. 21, 1–12 (2018)
2. Sun, J., Fujita, H., Chen, P., Li, H.: Dynamic financial distress prediction with concept drift
based on time weighting combined with Adaboost support vector machine ensemble.
Knowl.-Based Syst. 120, 4–14 (2017)
3. Sim, D.Y.Y., Teh, C.S., Ismail, A.I.: Improved boosted decision tree algorithms by adaptive
apriori and post-pruning for predicting obstructive sleep apnea. Adv. Sci. Lett. 24(3), 1680–
1684 (2018)
4. Wolpert, D.: The lack of a priori distinctions between learning algorithms. Neural Comput.
8, 1341–1390 (1996)
1 Introduction
In recent years, the importance of varying fields within computer science, particularly
cybersecurity and machine learning, has skyrocketed. With new systems depending on
intelligent tools that bring next-level computation and systems open to security brea-
ches, the importance of the intersection of machine learning and cybersecurity data
analysis has flourished. However, the burden of uneven cybersecurity data from a
variety of different sources often makes it difficult to develop a tool that effectively and
accurately applies machine learning to cybersecurity data. As a result, very few
end-to-end systems that can automatically classify anomalies in data exist, let alone
accurate ones.
In this paper, we propose a novel, accurate system for real-time anomaly detection.
We term this product CAMLPAD, or the Cybersecurity and Autonomous Machine
Learning Platform for Anomaly Detection. By processing a plethora of different forms
of cybersecurity data, such as YAF, BRO, SNORT, PCAP, and Cisco Meraki, in real
time using a variety of machine learning models, our system immediately determines if a
particular environment is at immediate risk for a breach as represented by presence of
anomalies. The specific machine learning algorithms utilized include Isolation Forest,
Histogram-Based Outlier Detection, Cluster-Based Local Outlier Factor, Multivariate
Gaussian, and K-Means Clustering. Once the data has been processed and anomalies
have been calculated, the CAMLPAD system utilizes Kibana to visualize outlier data
pulled from Elasticsearch and to gauge how high the outlier score is. Once a particular
threshold has been reached for this outlier score, an automated alert is sent to the
system administrator, who has the option to forward the alert to all of the employees in
the company so they are aware that a cybersecurity breach has occurred. By imple-
menting CAMLPAD as a running bash script, the CAMLPAD system immediately
recognizes anomalies and sends alerts.
1.1 Background
Cybersecurity is the practice of defending an organization's network and data from
potential attackers who have gained unauthorized access to the particular network. One measure
of determining to what extent a particular user has this type of unauthorized access is
detecting anomalies in the network traffic data, specifically the data referred to earlier
(e.g. BRO, YAF, PCAP, SNORT). A potential platform that would be able to detect
anomalies would need to process this data in real time, then use a model that learns
from past data whether the current data contains anomalies and, thus, whether the
network has an intrusion.
Specific to the CAMLPAD system, there are several ideas and terminologies with
which it would benefit the reader to be familiar. To begin with, there are a few pieces
of cybersecurity data that the CAMLPAD system makes use of. YAF, or “Yet Another
Flowmeter”, is a cybersecurity data type that processes PCAP data and exports these
flows to IPFIX Collecting Process [11]. BRO is an open source framework that ana-
lyzes network traffic and is used to detect anomalies in a network. SNORT, similarly, is
a network intrusion detection system that helps detect emerging threats. Meraki is a
cloud-based centralized management service that contains a network and organization
structure. This specifically is crucial as it further reveals the relationships between
members of a particular organization, which further assists the machine learning model
in determining where a potential anomaly may be.
Machine Learning, or ML, is the subfield within the rapidly growing field of Artificial
Intelligence primarily concerned with having a computer or machine learn to make
predictions from a set of previous data rather than being explicitly programmed. There are
several algorithms, or methods, that facilitate this type of learning. Machine learning
consists of two main categories: unsupervised and supervised. In supervised machine
learning, the ML model already has data labeled, so calculating an accuracy is as simple
as detecting whether the model has correctly predicted the labeled data. In unsupervised
CAMLPAD: Cybersecurity Autonomous Machine Learning Platform 707
machine learning, the main type of ML to be referred to in this paper, the data has not
been labeled, so alternative methods need to be used to evaluate performance. As will be
discussed in further detail in the next section, the specific machine learning models used
as part of the CAMLPAD system are Isolation Forest, Histogram-Based Outlier
Detection, Cluster-Based Local Outlier Factor, and Principal Component Analysis (Fig. 1).
Based on the cybersecurity system, data types, and machine learning algorithms,
the holistic CAMLPAD system incorporates the Elasticsearch data stream in real time
with the machine learning algorithms and, if the anomaly score reaches a particular
threshold, sends a warning to the organization. The methods section delivers further
details based on the fundamentals discussed in this section.
In the following sections, the CAMLPAD system will be thoroughly described,
along with results from comprehensive testing to gauge the performance of the
system. Next, comparisons of results will be provided in a discussion, with general
insights stated in the conclusion.
2 Methods
Before implementing the machine learning aspect of the research, the data, which
described various online transactions, had to be accurately transferred from the sensors
that were run on Linux virtual machines, to a local server, where the model can process
the data and alert the user if any anomalies are present. The data, specific to the
different sensors running on the virtual machines (including BRO, YAF, Snort, and
Meraki), are temporarily stored locally, on a machine composed of 4 Dell VRTX units, before
being uploaded to a Hadoop server consisting of one master node and three slave
nodes. After successfully uploading the data to the Hadoop server, Apache NiFi is used
to streamline and process the sensor logs before pushing the processed information into
the Kafka queue (Fig. 4). Apache NiFi, a project from the Apache Software Foundation,
was specifically designed to automate the flow of data between software systems. In
our case, the data is transferred from the virtual storage on the Hadoop server to the
Kafka queue, where it can be stored more efficiently. Specifically, the information stored
in each of the logs is queried into a JSON-like format consisting of a field, such as
MAC address or destination IP, and the actual information, such as a list of actual
addresses. Once the data has entered the queue, it is sent to the Elasticsearch database
where it is stored for future processing.
Elasticsearch is a database that parses and normalizes raw data before assigning
each query of information a unique identification number. Using this identification
number, and the index associated with the data, information from the sensor logs can be
queried for further processing using the machine learning models. However, one caveat
with Elasticsearch is that it doesn’t allow for custom processing scripts to be run inside
the database. Instead, the information must be queried based on an assigned index and
must be processed on an external server or node. That is where the current research
enters the workflow, since the machine learning algorithms utilize the indexing ability
of Elasticsearch to stream data into a separate machine. This data is streamed directly
from the database, without having to download the data as a CSV or JSON, meaning
that the data is quickly transferred from storage to a local processor on another
machine. Using a unique algorithm centered around the current date and time, all
previous data is indexed and imported into a dataframe, one of the most common
methods of storage for machine learning algorithms. The dataframe will contain the
information used for training and validating the machine learning models that were
created. Once the dataframe has been created, the current day's data is indexed and
imported into another dataframe. This dataframe will contain the latest information
used for anomaly detection based upon patterns observed in the previous data stored.
710 A. Hariharan et al.
Now that the data has been successfully imported into the respective dataframes,
the categorical data present in the sensor logs, such as the type of request or URL, must
be encoded into numerical values before further analysis by the machine learning
algorithms. After encoding the data, two methods of imputation were used: linear
regression (Fig. 5) for purely numerical values, and backfill insertion for encoded
categorical values. Now that missing or lost data has been imputed, the data can be imported into the
custom ensemble model for anomaly detection. The custom ensemble model consists
of an Isolation Forest algorithm, a histogram-based outlier detection algorithm, and a
cluster-based local outlier factor algorithm. All of these models are similar to the
implementations in the Python outlier detection library (PyOD). Once the data is fit to the overall model, the
validation data and testing data are assigned an outlier score. Based on the outlier score
and a simple PCA algorithm, clusters are developed depending on the outlier score
assigned by the respective model. Those clusters are then processed and a heat map is
created describing the various levels of outliers present in the data.
This process is repeated for each model created, resulting in three heat maps
describing the outlier scores assigned by each model for the data. After the outlier
scores have been assigned, the ensemble model is created through a democratic voting
system, where each model has an equal say on whether a data point is an outlier or an
inlier. After the voting system has been completed, the final outlier scores are run
through the PCA algorithm and a final heat map is created. The process is then repeated
for the different types of data that are stored in the Elasticsearch database, including
YAF, BRO, SNORT, and Meraki. Specifically, the BRO data is split by protocol into
DNS and CONN in order to accurately label the data. Once the final outlier scores have
been compiled for each data type, a final ensemble model is created, using a democratic
voting system in order to reclassify each data point. This final model takes into con-
sideration not only different outlier detection models that have been successful in
previous research, but also different types of sensor data that capture different layers of
internet traffic. After the model has been created, the accuracy is determined by cal-
culating the Adjusted RAND score, a common method of evaluating unsupervised
machine learning algorithms.
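The democratic voting and the Adjusted Rand evaluation can be sketched as follows; the per-model labels are hypothetical, and scikit-learn's `adjusted_rand_score` stands in for the paper's evaluation code:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def democratic_vote(label_sets):
    """Majority vote across models: each model's labels (1 = outlier,
    0 = inlier) get an equal say."""
    votes = np.asarray(label_sets)                    # (n_models, n_samples)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

# Hypothetical labels from I-Forest, HBOS, and CBLOF on six points:
iforest = [0, 1, 1, 0, 0, 1]
hbos    = [0, 1, 0, 0, 1, 1]
cblof   = [0, 1, 1, 0, 0, 0]
final = democratic_vote([iforest, hbos, cblof])

# Agreement with a reference labelling, as measured by the Adjusted Rand score:
truth = [0, 1, 1, 0, 0, 1]
score = adjusted_rand_score(truth, final)
```

Here a point is reclassified as an outlier only when at least two of the three models flag it.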
This represents the last part of the workflow, in which the data originating from different sensors has been effectively processed and anomaly scanning has been completed. After the accuracy has been tested and confirmed, the newly assigned outlier scores are indexed back into the Elasticsearch database so that visualizations of these scores can be created in Kibana. Specifically, the index is portrayed as a gauge in which the outlier score of the current day's data is compared with previous data that the model has trained on. When the gauge passes the 75th percentile, a custom alert is sent to the owner of the Apache database, warning them that there might be anomalies in the current data. The user can then respond to this alert by blocking certain destination ports or IP addresses, or perform further investigation to determine the cause of the anomaly.
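The 75th-percentile alerting rule reduces to a short check; the `should_alert` name and the score values are our illustration, not CAMLPAD's API:

```python
import numpy as np

def should_alert(historical_scores, todays_score, percentile=75):
    """Flag the current day when its outlier score exceeds the given
    percentile (75th in CAMLPAD) of previously seen scores."""
    threshold = np.percentile(historical_scores, percentile)
    return bool(todays_score > threshold)
```

In the deployed system the historical scores would come from the Elasticsearch index and a positive result would trigger the custom alert.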
CAMLPAD: Cybersecurity Autonomous Machine Learning Platform 711
3 Results
The CAMLPAD system consists of five main data components: BRO DNS, BRO CONN, YAF, SNORT, and Meraki. These components, along with the three models Isolation Forest (I-Forest), Cluster-Based Local Outlier Factor (CBLOF), and Histogram-Based Outlier Score (HBOS), are combined in a democratic voting system to determine the final outlier score. Although the user is alerted only on the final combined model rather than on each individual data type, it is interesting to note how different types of data, at both layer 3 and layer 2, show similar patterns in anomaly detection. Each heat map contains two separate sets of data points: previous data that the model trained on, represented by the smaller points, and the current day's data, represented by the larger points. In each graph, the differences or similarities between the current day and the previous data can be observed, along with other patterns that represent the level of anomalies present.
For the BRO DNS data, even where an individual algorithm does not display this pattern, it is further corroborated by the combined algorithm, which shows a dark dot, representing less of an outlier, surrounded entirely by lighter dots. Since the recent day in Kibana has a score of 0.13, which is higher than the previous days' score of 0.022, the current day is likely an outlier.
Next, the HBOS algorithm for the BRO CONN data (Fig. 7) reveals that all of the data is very light, with a light gray dot in the center, meaning that the current day's data is only slightly an outlier in comparison to previous days' data. This is roughly corroborated by the CBLOF algorithm, but the I-Forest and Combined algorithms deliver an entirely different result: in both there is first a light dot surrounded by black dots, then a black dot surrounded by white dots, meaning that there was likely an outlier on the current day. However, since the recent day in Kibana has a score of 0.025, significantly lower than the previous days' score of 5.25, the Kibana score suggests that the current day is less of an anomaly.
For the YAF data shown below (Fig. 8), throughout each of the different algorithms it is evident that the current day's data is an outlier compared with the previous days' data. For example, in the HBOS diagram there are black dots surrounded by light dots, and this repeats for CBLOF, I-Forest, and the combined algorithm. For Kibana, however, the recent day has a score of 2.734, slightly less than the previous days' score of 2.75, meaning that the most recent day is less of an outlier in comparison to previous days.
For the SNORT data shown below (Fig. 9), the HBOS, CBLOF, and Combined algorithms all have black dots surrounded by lighter dots representing previous days' data. However, the I-Forest algorithm has only light dots, meaning that it does not perceive the current day's data to be an outlier.
For Kibana, the recent day has a score of 0.961, which is less than the previous
days’ score of 4.378, meaning that the most recent day is far less of an outlier in
comparison to previous days.
For the Meraki data shown below (Fig. 10), it is clear that there are four main gray dots representing the current day's data, and these four clusters are outliers. While the HBOS model shows this data as almost equivalent to the data from previous days, the CBLOF, I-Forest, and Combined models all reveal that the clusters are indeed outliers. For Kibana, the recent day has a score of 5.287, which is greater than the previous days' score of 0.029.
Lastly, for the combined data (Fig. 11), throughout each of the different algorithms it is evident that the current day's data is an outlier compared with the previous days' data: in HBOS there is an X-like outlier pattern, in CBLOF a dark dot, in I-Forest a gray dot, and in the combined algorithm a black dot surrounded by gray points. For Kibana, the recent day has a score of 0.261, which is more than the previous days' score of 0.004, meaning that the most recent day is more of an outlier in comparison to previous days.
4 Discussion
Early work on immunity-based intrusion detection showed that there are interesting insights to be gained from computational models [2, 4]. Several
researchers have also discussed more sophisticated machine learning algorithms, such
as utilizing a spiking neural network algorithm, but even these have had limited success
[3]. With the rise of machine learning, other researchers have investigated the subfield
of deep learning, including its role as the frontier for distributed attack detection and in
varied prevention systems [5, 6, 10, 17]. In addition, while several papers have pointed
to the intriguing intersection of autonomous AI and anomaly detection, others contain a
systematic review of how cloud computing could potentially assist in the development
of a prevention system [7, 8, 16].
In addition to research papers that have investigated broad fields such as deep
learning and cloud computing, several have dug deeper into the specifics of machine
learning. For example, one paper considers the use of a TCM-KNN algorithm (Fig. 13) for supervised network intrusion detection, which provides a novel approach, since most network intrusion algorithms use unsupervised learning [9]. Another paper considers the use of the n-gram machine learning algorithm in both anomaly detection and classification [18]. On the other hand, other papers have focused on larger-scale intrusion networks and more cost-benefit-oriented operations involving research priorities and business intelligence [10, 12, 13, 15]. Real-time network flow analysis (Fig. 12) has been another well-researched field, as shown by the system named Disclosure, which can detect botnet command and control servers [14].
Lastly, several papers have investigated feature selection models and learning-based
approaches to intrusion detection, such as applications to SQL attacks [19, 20].
5 Conclusion
Security breaches and threats are growing along with the cybersecurity field. CAMLPAD is a machine learning-based platform that can effectively detect anomalies in real time, producing an outlier score. We demonstrated the vast possibilities of anomaly detection in the cybersecurity field using unsupervised machine learning models: Isolation Forest, Histogram-Based Outlier Detection, Cluster-Based Local Outlier Factor, Multivariate Gaussian, and K-Means Clustering.
CAMLPAD's pipeline starts with data streamed directly from Elasticsearch and formatted on a local notebook. From there, we run the models, which produce a visualization of the data and an outlier score. The outlier scores from the different models are averaged and displayed on the Kibana dashboard.
METS (the Metadata Encoding and Transmission Standard) helps encode descriptive and structural metadata for objects in a digital library. In addition to running the machine learning models on a METS server, we plan to clean up the data and make each field consistent. Since the data streamed from Elasticsearch differs for each group (based on day), preprocessing the data to make all data points similar will help increase efficiency.
In the future, we will use supervised machine learning models to ensure all data points are represented. For example, we plan to use a Support Vector Machine (SVM), which is helpful for classifying the outliers and anomalies. The overall model is trained on data from an earlier period of time; outliers are then detected and scored based on inconsistencies with that data.
Overall, CAMLPAD achieved an Adjusted Rand score of 95%, but with the use of additional ML models and further efficiency improvements, the accuracy of anomaly detection can increase.
Acknowledgments. We would like to thank the employees at Blue Cloak, LLC for their gen-
erous support throughout the duration of this research endeavor as well as for the cybersecurity
data and tools used.
References
1. Garcia-Teodoro, P., Diaz-Verdejo, J., Maciá-Fernández, G., Vázquez, E.: Anomaly-based
network intrusion detection: techniques, systems and challenges. Comput. Secur. 28(1–2),
18–28 (2009)
2. Dasgupta, D. (ed.): Artificial Immune Systems and their Applications. Springer, Heidelberg
(2012)
3. Demertzis, K., Iliadis, L., Spartalis, S.: A spiking one-class anomaly detection framework for
cyber-security on industrial control systems. In: International Conference on Engineering
Applications of Neural Networks, pp. 122–134. Springer, Cham (2017)
4. Dasgupta, D.: Immunity-based intrusion detection system: a general framework. In:
Proceedings of the 22nd NISSC, vol. 1, pp. 147–160 (1999)
5. Abeshu, A., Chilamkurti, N.: Deep learning: the frontier for distributed attack detection in
fog-to-things computing. IEEE Commun. Mag. 56(2), 169–175 (2018)
6. Patel, A., Qassim, Q., Wills, C.: A survey of intrusion detection and prevention systems. Inf.
Manag. Comput. Secur. 18(4), 277–290 (2010)
7. Mylrea, M., Gourisetti, S.N.G.: Cybersecurity and optimization in smart “autonomous”
buildings. In: Autonomy and Artificial Intelligence: A Threat or Savior?, pp. 263–294.
Springer, Cham (2017)
8. Patel, A., Taghavi, M., Bakhtiyari, K., Junior, J.C.: An intrusion detection and prevention
system in cloud computing: a systematic review. J. Netw. Comput. Appl. 36(1), 25–41
(2013)
9. Li, Y., Guo, L.: An active learning based TCM-KNN algorithm for supervised network
intrusion detection. Comput. Secur. 26(7–8), 459–467 (2007)
10. Diro, A.A., Chilamkurti, N.: Distributed attack detection scheme using deep learning
approach for Internet of Things. Futur. Gener. Comput. Syst. 82, 761–768 (2018)
11. Inacio, C.M., Trammell, B.: YAF: yet another flowmeter. In: Proceedings of LISA10: 24th
Large Installation System Administration Conference, p. 107 (2010)
720 A. Hariharan et al.
12. Huang, M.Y., Jasper, R.J., Wicks, T.M.: A large scale distributed intrusion detection
framework based on attack strategy analysis. Comput. Netw. 31(23–24), 2465–2475 (1999)
13. Russell, S., Dewey, D., Tegmark, M.: Research priorities for robust and beneficial artificial
intelligence. AI Mag. 36(4), 105–114 (2015)
14. Bilge, L., Balzarotti, D., Robertson, W., Kirda, E., Kruegel, C.: Disclosure: detecting botnet
command and control servers through large-scale netflow analysis. In: Proceedings of the
28th Annual Computer Security Applications Conference, pp. 129–138. ACM (2012)
15. Chen, H., Chiang, R.H., Storey, V.C.: Business intelligence and analytics: from big data to
big impact. MIS Q. 36(4) (2012)
16. Doelitzscher, F., Reich, C., Knahl, M., Passfall, A., Clarke, N.: An agent based business
aware incident detection system for cloud environments. J. Cloud Comput.: Adv. Syst. Appl.
1(1), 9 (2012)
17. Ten, C.W., Hong, J., Liu, C.C.: Anomaly detection for cybersecurity of the substations.
IEEE Trans. Smart Grid 2(4), 865–873 (2011)
18. Wressnegger, C., Schwenk, G., Arp, D., Rieck, K.: A close look on n-grams in intrusion
detection: anomaly detection vs. classification. In: Proceedings of the 2013 ACM Workshop
on Artificial Intelligence and Security, pp. 67–76. ACM (2013)
19. Aljawarneh, S., Aldwairi, M., Yassein, M.B.: Anomaly-based intrusion detection system
through feature selection analysis and building hybrid efficient model. J. Comput. Sci. 25,
152–160 (2018)
20. Valeur, F., Mutz, D., Vigna, G.: A learning-based approach to the detection of SQL attacks.
In: International Conference on Detection of Intrusions and Malware, and Vulnerability
Assessment, pp. 123–140 (2005)
A Holistic Approach for Detecting DDoS
Attacks by Using Ensemble Unsupervised
Machine Learning
Abstract. Distributed Denial of Service (DDoS) has been the most prominent attack on cyber-physical systems over the last decade. Defending against DDoS attacks is not only challenging but also strategic. Numerous new strategies and approaches have been proposed to defend against different types of DDoS attacks, and the ongoing battle between attackers and defenders is fueled by ever newer strategies and techniques. Machine learning (ML) has shown promising outcomes in different research fields, including cybersecurity. In this paper, an ensemble unsupervised ML approach is used to implement an intrusion detection system with noteworthy accuracy in detecting DDoS attacks. The goal of this research is to increase DDoS attack detection accuracy while decreasing the false positive rate. The NSL-KDD dataset and twelve feature sets from existing research are used for experimentation to compare our ensemble results with those of our individual and other existing models.
1 Introduction
From the beginning of the architectural evolution of the Internet, transmitting packets properly and reducing processing overhead were the major concerns. Cyber attackers easily exploit the existing limitations of Internet protocols (TCP, UDP, etc.) and readily available attack tools. A Distributed Denial of Service (DDoS) attack is mostly a network attack that overloads bandwidth with immense inbound or outbound traffic over the network, disrupting normal operation.
The first well-documented DDoS attack appears to have occurred in August 1999,
when a DDoS tool called ‘Trinoo’ was deployed in at least 227 systems, to flood a
single University of Minnesota computer, which was knocked down for more than 2
days. In recent years, attacks on financial systems, broadcast systems, and Internet-based services have grown exponentially [1]. Moreover, those attacks are devastating, wide-ranging, easy to implement, and difficult to detect and defend against, posing a major threat to Internet privacy and security. Today's Internet is badly plagued by DDoS attacks, which have escalated drastically over the last decade. In the last couple of years, giants such as GitHub, Amazon, Cloudflare, Facebook, and Instagram have had their services disrupted by DDoS attacks. According to the World Infrastructure Security Report 2018 [2], for the first time ever a DDoS attack reached 1 Tbps (terabit per second) in size, and the Internet has officially entered the terabit attack era. The largest attack was recorded at 1.7 Tbps. The report mentions 16,794 DDoS attacks per day, equal to roughly 700 attacks per hour or 12 attacks per minute, and predicts that this number is growing rapidly day by day.
To stay alive in this competition, defenders are developing the newest technologies and mechanisms against those attacks. Scrutiny of existing DDoS attacks shows that an attack can be mitigated by one of three defense mechanisms, namely the attacker-end, victim-end, and in-network approaches, depending on their locality of deployment. Though the attacker-end detection approach is much more challenging than the victim-end detection approach, solutions exist. On the other hand, victim-end detection is easier to implement compared to the other two. The existing detection approaches can be categorized as statistical, soft computing, clustering, knowledge-based, and hybrid, and can also be classified as supervised or unsupervised based on the type of dataset [3].
In the evolution of Intrusion Detection Systems (IDS), anomaly-based detection is more popular than signature-based detection. Machine learning (ML) has shown promising outcomes in detecting cyber-physical attacks, including DDoS, and many researchers have already used ML classifiers to build IDSs that defend against DDoS attacks. In machine learning, supervised, semi-supervised, and unsupervised methods are the three basic ways to separate anomalous packets from normal packets. Supervised methods have the privilege of differentiating anomalous and normal data from a tagged dataset. Unsupervised methods, on the other hand, group the dataset into clusters, where the strength of the clustering lies within the algorithm itself. Among these, novelty and outlier detection strategies are the unsupervised methods that have significant outcomes in detecting unseen anomalies. One-class SVM (Support Vector Machine), Local Outlier Factor, Elliptic Envelope, and Isolation Forest are among the most well-known novelty and outlier detection classifiers.
Both supervised and unsupervised classifiers are being used in developing IDSs. However, the majority of these approaches have focused on learning a single model for
intrusions. Moreover, due to the varied nature of intrusions, it may be hard to learn a
single model that generalizes to all types. For example, some types of intrusions can be
modeled using a simple linear model (e.g. logistic regression) while others may require
more complex non-linear models (e.g. support vector machines with kernels). There-
fore, the main idea of this paper is to train several models that can identify DDoS
intrusions and then combine these into a unified system based on different mechanisms.
The benefits of ensemble learning, i.e., combining multiple classifiers to form a more powerful classifier, have been well studied in the ML community. Dietterich et al. [4] noted that ensembles can perform better than a single classifier and that many classification problems have benefited from the idea of combining multiple classifiers.
In general, there are two ways to ensemble classifiers: homogeneous and heterogeneous. When similar types of classifiers are used to build a training model, it is called a homogeneous ensemble (e.g., bagging and boosting), whereas combining different types of classifiers is called a heterogeneous ensemble.
A Holistic Approach for Detecting DDoS Attacks 723
2 Literature Review
DDoS attacks have become a weapon of choice for hackers as well as cyber terrorists, and are used as a form of protest in politically unstable societies. Various detection techniques have been improvised by researchers over the years to defend against DDoS attacks, while the attack itself changes frequently to evade existing detection solutions. Based on various techniques such as cloud computing, software-defined networking (SDN), backbone web traffic, and big data strategies, DDoS attack detection can be categorized into filtering mechanisms, router functions, network flow, statistical analysis, and machine learning.
A comprehensive survey of machine-learning intrusion detection [13] and a systematic literature review and taxonomy of DDoS attacks [14] are necessary to know the state of the art of ML approaches for both IDS and DDoS. Ahmad Riza'ain et al. [14] performed an in-depth analysis of DDoS attack types as well as of existing DDoS detection and attack-prediction techniques by characterizing the attacks, and identified the factors behind those attacks. Moreover, they classified and ranked at least 53 articles from digital libraries such as Science Direct, ACM Digital Library,
724 S. Das et al.
IEEE Xplore, Springer, and Web of Science related to DDoS detection and prevention, and found that 30% of them use ML techniques as their detection or prevention strategy.
To detect DDoS attacks, supervised [15], semi-supervised [16], and unsupervised methods are used to build the training model. A combination of supervised and unsupervised ML models to detect anomalies can also be found in [17], which uses neural networks and SVM for supervised modeling, KNN for unsupervised modeling, and Principal Component Analysis (PCA) and Gradual Feature Reduction (GFR) for feature selection with the NSL-KDD dataset. However, the reason for combining supervised and unsupervised methods in that research is ambiguous.
Ensembling is a way of combining multiple classifiers for better performance than a single classifier, and many classification problems have benefited from this idea [4]. Homogeneous (combinations of similar types of classifiers) and heterogeneous (combinations of different types of classifiers) are the two major ensemble types. A detailed survey of ensemble and hybrid classifiers [5] helps in understanding the usage and shortfalls of ensemble ML in network security.
Outlier and novelty detection techniques are more efficient in detecting unknown attacks, as they use unsupervised ML models. An unsupervised ML model is used in [18] to detect high-volume DDoS attacks using an in-memory distributed graph. Jabez et al. [19] proposed an outlier detection mechanism, NOF (Neighborhood Outlier Factor), to detect anomalies. However, a single classifier has a higher chance of predicting incorrectly than a combination of classifiers, so an ensemble classifier is a better fit for predicting anomalous behavior precisely. Smyth et al. [20] showed that stacked density estimation outperforms the single best model chosen by cross-validation, combination with uniform weights, or even bias. A few hybrid supervised learning models [21] have been used to detect DDoS attacks, but realistically, for zero-day or unknown attacks, an unsupervised hybrid model has better detection accuracy.
Though most researchers have chosen a single classifier to train their model for detecting DDoS attacks, a combination of classifiers has better accuracy than a stand-alone model. Moreover, none of them has focused on either an unsupervised ensemble or an outlier-and-novelty-detection ensemble, and their work does not properly address the detection of unseen attacks. Therefore, our motivation and the goal of this paper is to build an unsupervised ensemble model that combines five different 'outlier and novelty detection' classifiers to detect unseen DDoS attacks. The use of an unsupervised ensemble model is the novelty of this research, with which better detection accuracy and lower false positive rates are obtained.
3 Proposed Method
Most existing IDSs are built around known attack types and patterns. However, with the change of attackers' motives and intentions, an adaptive IDS is in high demand in the cyber world. In this paper, an ML-based IDS is proposed whose novelty is to ensemble unsupervised classifiers based on an outlier detection approach, giving better detection accuracy with a lower false positive rate in detecting DDoS.
3.1 Dataset
In this research, the NSL-KDD [22] dataset is used for training and testing. NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set mentioned in [23]. Although McHugh discussed some problems that this new version of the KDD data set still suffers from, and it may not be a perfect representative of existing real networks due to the lack of public data sets for network-based IDSs, it can be applied as an effective benchmark data set to help researchers compare different intrusion detection methods. NSL-KDD has some major improvements over the original KDD'99 dataset [22]:
1. No redundant records in the train data, so classifiers will not be biased towards more
frequent records.
2. No duplicate records in the test data, so the performance of the learners is not biased
by the methods which have better detection rates.
3. The number of selected records from each difficulty group is inversely proportional
to the percentage of records in the original KDD’99 dataset.
4. The number of records in the train and test sets is reasonable, which makes the dataset affordable for running the experiments on the complete set.
The dataset contains eight data files of different formats that are compatible with
most experimental platforms. Table 1 shows a summary of the testing and training data
record.
Several features in the dataset are symbolic rather than numeric values. In the data preprocessing phase, for each such feature, the distinct values in that column are identified and replaced with numeric values by simple integer assignment starting from 1. The reference for this conversion is shown in Table 2.
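A minimal sketch of this integer assignment; the first-appearance ordering is our assumption, and the paper's actual mapping is the one given in Table 2:

```python
def encode_symbolic(values):
    """Replace each distinct symbolic value with an integer starting
    from 1, in order of first appearance (an illustrative convention)."""
    mapping = {}
    encoded = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1  # next unused integer, from 1
        encoded.append(mapping[v])
    return encoded, mapping
```

For example, the protocol column `["tcp", "udp", "tcp", "icmp"]` would become `[1, 2, 1, 3]`.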
The NSL-KDD dataset that we have used is a tagged dataset. Since the proposed model initially works with unsupervised methods, which do not require any class label, the 'class' column is removed from the dataset in this phase. On the other hand, logistic regression (LR) and naïve Bayes (NB), which are used to combine the outputs of those unsupervised methods, do require a class label. We therefore use the removed class label to train the LR and NB models later.
One-Class SVM. Support vector machines (SVMs) are well-known supervised learning models in ML that analyze data and recognize patterns; they can be used for both classification and regression tasks. One-class classification (OCC), also known as unary classification or class-modeling, tries to identify objects of a specific class amongst all objects, primarily by learning from a training set containing only objects of that class [25].
Therefore, in anomaly detection, a one-class SVM is trained with data that has only one class, the "non-anomalous" or "normal" class. It infers the properties of the normal class and, using these properties, predicts which examples are unlike the normal ones. This is useful for anomaly detection because anomalies are defined by the scarcity of training examples; typically, there are very few examples of network intrusion, fraud, or other anomalous behavior [26].
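With scikit-learn (the toolkit used in the experiments), training on "normal" traffic only might look like the sketch below; the synthetic data and the `nu`/`gamma` settings are illustrative, not the paper's hyperparameters:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train only on "normal" samples; the model learns their boundary and
# flags anything unlike them.
rng = np.random.RandomState(0)
normal_traffic = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal_traffic)

# predict() returns +1 for inliers ("normal") and -1 for outliers.
preds = ocsvm.predict(np.array([[0.1, -0.2], [8.0, 8.0]]))
```

The point near the training distribution is accepted, while the distant one is flagged as unlike the normal class.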
Local Outlier Factor. The LOF algorithm computes the local outlier factor score, which reflects the degree of abnormality of an observation. LOF measures the local density deviation of a given data point with respect to its neighbors (obtained from its k-nearest neighbors). As a result, it detects samples that have a substantially lower density than their neighbors. For an observation, the LOF score is the ratio of the average local density of its k-nearest neighbors to its own local density. A "normal" data point is expected to have a local density similar to that of its neighbors, while an "abnormal" one is expected to have a much smaller local density.
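A small sketch of this behavior with scikit-learn's `LocalOutlierFactor`; the one-dimensional data are illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four tightly packed points and one far away: the isolated point has a
# much lower local density than its neighbors, so its factor is large.
X = np.array([[0.0], [0.1], [0.2], [0.1], [10.0]])
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)              # 1 = inlier, -1 = outlier
scores = -lof.negative_outlier_factor_   # larger = more abnormal
```

The cluster points score close to 1 (density similar to their neighbors), while the distant point receives the highest score and the outlier label.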
Isolation Forest. Isolation Forest performs outlier detection efficiently in high-dimensional datasets. The algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature [27], using recursive partitioning to split the values. As the recursive partitioning can be illustrated by a tree structure, the number of splits required to isolate a sample is equivalent to the path length from the root node to the leaf node where it terminates. This path length, averaged over a forest of such random trees, is a measure of normality and forms the decision function. Random partitioning produces noticeably shorter paths for anomalies; therefore, samples for which the trees collectively produce shorter path lengths are highly likely to be anomalies [28].
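The short-path intuition in code, using scikit-learn's `IsolationForest`; the synthetic data and contamination rate are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Anomalies are isolated in fewer random splits, so they sit on shorter
# average root-to-leaf paths across the forest.
rng = np.random.RandomState(42)
X = np.r_[rng.normal(size=(100, 2)), [[9.0, 9.0]]]
iso = IsolationForest(n_estimators=100, contamination=0.05,
                      random_state=42).fit(X)
labels = iso.predict(X)   # -1 = anomaly (short path), 1 = normal
```

The planted point at (9, 9) is isolated almost immediately by the random splits and comes back labelled −1.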
Elliptic Envelope. In outlier detection, one common approach is to assume that the regular data come from a known distribution (e.g., a Gaussian distribution). From this assumption, a 'shape' of the data can be defined, and data points standing far enough from the fitted shape are treated as outlying observations. Elliptic Envelope assumes the data are normally distributed and, based on that assumption, 'draws' an ellipse around the data, classifying any observation inside the ellipse as an inlier (labeled +1, "normal") and any observation outside the ellipse as an outlier (labeled −1, "anomalous").
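A sketch with scikit-learn's `EllipticEnvelope`; the Gaussian toy data and contamination rate are illustrative:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Fit a Gaussian "shape" to the data and draw an ellipse around it;
# observations inside are labelled +1, observations outside -1.
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 2))
env = EllipticEnvelope(contamination=0.05, random_state=1).fit(X)
labels = env.predict(np.array([[0.0, 0.0], [6.0, 6.0]]))
```

A point at the center of the fitted ellipse is an inlier, while one far outside it is labelled anomalous.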
Ensemble Classifier. The four classifiers defined previously are used to build five training models, with One-Class SVM used twice with different hyperparameters. According to the framework, different ensemble techniques are applied on top of these five models to combine them. The majority voting mechanism is chosen as a baseline. Then logistic regression (LR) and naïve Bayes (NB), two supervised models, are
applied on top of these five classifiers for better detection accuracy and a lower false-positive alarm rate.
Majority Voting. The majority voting scheme is a very common and basic ensemble technique in ML. A majority means more than half of the total. In binary classification, an output prediction can be '1' or '0', and a majority voting mechanism can be applied to any number of classifiers' outputs: when more than half of the classifiers' predictions agree on '1' or '0', that value becomes the final output. For example, if five classifiers A, B, C, D, and E predict '1, 0, 1, 1, 0' respectively for a data instance, the final output decided by majority voting is '1'. When majority voting is used as an ensemble combination rule (which only works with nominal classes), each classifier predicts a nominal class label for a test sample, and the label predicted most often is selected as the output of the voting classifier.
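The worked example above can be sketched in a few lines; the `majority_vote` name is ours, and note that `Counter` breaks exact ties by first appearance:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most classifiers for one sample."""
    return Counter(predictions).most_common(1)[0][0]

# Five classifiers A-E predict 1, 0, 1, 1, 0 for one instance:
final = majority_vote([1, 0, 1, 1, 0])   # three of five say '1'
```

Applied column-wise over the five base models' outputs, this yields the baseline ensemble prediction for every test sample.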
Logistic Regression. Logistic regression is the go-to method for binary classification problems (problems with two class values), borrowed by ML from the field of statistics. In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event such as normal/abnormal, pass/fail, win/lose, hot/cold, etc. It can be extended to model several classes of events, where each event is assigned a probability between 0 and 1 and all probabilities sum to 1. The coefficients (beta values, b) of the logistic regression algorithm must be estimated from the training data, which is done using maximum-likelihood estimation. The best coefficients result in a model that predicts a value very close to 1 (e.g., anomalous) for the default class and a value very close to 0 (e.g., normal) for the other class. The intuition of maximum likelihood for logistic regression is that a search procedure seeks coefficient values that minimize the error between the probabilities predicted by the model and those in the data.
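Used as an ensemble combiner, LR learns coefficients over the base classifiers' votes. The toy vote vectors and labels below are illustrative, not from the NSL-KDD experiments:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is the five base classifiers' votes for one sample (the
# Step-5 prediction vector); LR fits beta values via maximum likelihood.
votes = np.array([[1, 1, 1, 0, 1], [0, 0, 1, 0, 0], [1, 1, 1, 1, 1],
                  [0, 0, 0, 0, 0], [1, 0, 1, 1, 1], [0, 1, 0, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])    # 1 = anomalous, 0 = normal
lr = LogisticRegression().fit(votes, y)
pred = lr.predict(np.array([[1, 1, 0, 1, 1]]))
```

Unlike plain majority voting, the learned coefficients can weight a reliable base classifier's vote more heavily than an unreliable one.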
Naïve Bayes. Naïve Bayes is a simple and common classifier used in many ML
problems. It is a probabilistic classifier based on Bayes’ theorem, which defines the
probability of an event based on prior knowledge of conditions associated with that
event. The goal of any probabilistic classifier (e.g., with features X0, X1, …, Xn and
classes C0, C1, …, Ck) is to determine the probability of the features occurring in each
class and to return the most likely class. The Naïve Bayes classifier assumes that the
features are independent of each other, hence the name “Naïve”.
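The conditional-independence assumption and the most-likely-class rule can be illustrated with scikit-learn's GaussianNB on invented toy data (labels and values are hypothetical):

```python
from sklearn.naive_bayes import GaussianNB

# Toy data: two numeric features, two classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
     [4.0, 5.0], [4.2, 4.8], [3.9, 5.1]]
y = ['normal', 'normal', 'normal',
     'anomalous', 'anomalous', 'anomalous']

# GaussianNB treats each feature as conditionally independent given the class
# (the "naive" assumption) and returns the most probable class per sample.
clf = GaussianNB()
clf.fit(X, y)
print(clf.predict([[1.1, 2.0], [4.1, 5.0]]))  # predicts 'normal' then 'anomalous'
```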
Figure 1 shows the process flow of the proposed framework from data preprocessing
to ensemble classification. In this framework, the NSL-KDD dataset, which contains
both training and testing data, is used as the input of Step-1. In Step-2, data are
converted into a model-readable format via data preprocessing and feature selection.
In Step-3, the processed training data is fed into five different classifiers: One-Class
SVM (OCS) with two different hyperparameters, Local Outlier Factor (LOF), Isolation
Forest (ISOF), and Elliptic Envelope (ELE). At the end of this step, all five classifiers
have built their models using the training data. Then, in Step-4, test data is passed
from the raw dataset through the same feature selection and data preprocessing
phases and is used to evaluate
730 S. Das et al.
the training models built in Step-3 to predict outcomes. In Step-5, the predictions
coming from the different training models in the previous step form a vector for each
single data instance, which is then passed to each of the three ensemble classifiers:
Majority Voting (MV), Logistic Regression (LR), and Naïve Bayes (NB). MV, LR,
and NB yield different performance measures for that prediction vector. Based on
higher detection accuracy, precision, recall, and F-1 score, and a lower false positive
rate, Step-6 selects the best ensemble algorithm, which is finally chosen for DDoS
attack detection.
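The flow from the five unsupervised detectors to an ensemble over their prediction vectors can be sketched as follows. The synthetic data, the hyperparameter values, and fitting the ensemble on preserved labels are illustrative assumptions of this sketch, not the paper's exact configuration:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(300, 2))            # mostly "normal" traffic
X_test = np.vstack([rng.normal(0, 1, size=(50, 2)),  # normal test traffic
                    rng.normal(6, 1, size=(50, 2))]) # anomalous test traffic
y_test = np.array([1] * 50 + [-1] * 50)              # 1 = normal, -1 = anomaly

# Step-3: build the five unsupervised models (OCS with two settings, LOF, ISOF, ELE).
detectors = [
    OneClassSVM(kernel='poly', nu=0.1).fit(X_train),
    OneClassSVM(kernel='linear', nu=0.1).fit(X_train),
    LocalOutlierFactor(novelty=True).fit(X_train),
    IsolationForest(random_state=0).fit(X_train),
    EllipticEnvelope(random_state=0).fit(X_train),
]

# Steps 4-5: each test instance yields a vector of five +/-1 predictions.
votes = np.column_stack([d.predict(X_test) for d in detectors])

# Ensemble the prediction vectors with logistic regression (fit here on the
# preserved test labels purely for illustration).
ensemble = LogisticRegression().fit(votes, y_test)
print(ensemble.score(votes, y_test))
```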
4 Experimental Results
The proposed model has been implemented in Python using the ML tool scikit-learn
[29], a Python library, to model the training data and to evaluate that model using test
data. In the experiments, raw data was extracted from the NSL-KDD website and then
converted into a classifier-readable format through the data preprocessing phase
(described earlier in the ‘Proposed Method’ section). The basic idea of data
preprocessing is to convert non-numeric data content to a corresponding assigned
numeric value. Since unsupervised outlier detection models were used in this
experiment, the training data should contain a majority of ‘normal’ or
‘non-anomalous’ data instances for better predictions. From the whole training dataset,
A Holistic Approach for Detecting DDoS Attacks 731
normal and anomalous data were separated, and a new training dataset was created that
contained 99% normal data and 1% anomalous data. The reason for using 1%
anomalous data in the training dataset was to make the model detect anomalies
efficiently and accurately by learning from normal behavior. Additional percentages of
noise (anomalous data) were added later to measure the framework’s efficiency. After
separating normal and anomalous data in the training data, 67,343 instances were
found to be normal (see Table 1). Anomalous data amounting to 1% of this normal
count, i.e., 673 instances, was added to create a new, modified training dataset. The
test dataset, on the other hand, was not separated; instead, 1,000 random data instances
were chosen from the test data to create a modified test dataset for evaluating the
training models. In both cases, the ‘class’ column was removed from the dataset
because unsupervised methods are used in this framework. However, the ‘class’
column of the test dataset is preserved for later use in determining the accuracy of all
three ensemble classifiers, and for training and testing the logistic regression and naïve
Bayes models.
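The construction of the modified (99% normal / 1% anomalous) training set can be sketched as follows; the function name, toy data, and label values are invented for illustration:

```python
import numpy as np

def build_modified_training_set(X, y, noise_pct=0.01, seed=0):
    """Keep all normal rows and add anomalous rows amounting to noise_pct of
    the normal count (673 instances for the 67,343 normal NSL-KDD rows)."""
    rng = np.random.default_rng(seed)
    normal = X[y == 'normal']
    anomalous = X[y != 'normal']
    n_noise = int(len(normal) * noise_pct)
    picked = rng.choice(len(anomalous), size=n_noise, replace=False)
    # The 'class' column (y) is dropped: the detectors are unsupervised.
    return np.vstack([normal, anomalous[picked]])

# Toy example: 200 normal and 50 anomalous rows; 1% noise keeps 2 anomalies.
X = np.arange(500, dtype=float).reshape(250, 2)
y = np.array(['normal'] * 200 + ['anomaly'] * 50)
X_mod = build_modified_training_set(X, y)
print(X_mod.shape)  # (202, 2)
```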
In the early phase of the experiments, five different classifiers were used to build five
different training models from the training dataset. As the goal of this research is to
detect both existing and new DDoS attack patterns, outlier- and novelty-detection
classifiers were given the highest priority in classifier selection. Four different types of
outlier and novelty detection classifiers were chosen, namely One-Class SVM, Local
Outlier Factor, Isolation Forest, and Elliptic Envelope. Since four is an even number
and majority voting could therefore end in a tie, the next odd number, five, was chosen
as the classifier count in this experiment. Initially, we experimented with different
hyperparameter values for these classifiers and, based on accuracy, selected the best
five hyperparameter combinations. The details of the classifiers and the hyperparameter
combinations used in our experiment are listed in Table 4.
By using the different combinations from Table 4, five different training models were
built from the modified training set. The test dataset was then used to evaluate each model.
Sensitivity = TPR / (TPR + FNR)    (1)

Specificity = TNR / (TNR + FPR)    (2)

Accuracy = (TPR + TNR) / (TPR + TNR + FPR + FNR)    (3)
Also, Precision, Recall, and F-Measure are three further important performance
metrics used to evaluate a model. These terms are defined in terms of TP, FP, and FN
in Eqs. (4), (5), and (6).
Precision (P) = TP / (TP + FP)    (4)

Recall (R) = TP / (TP + FN)    (5)

F-Score = 2PR / (P + R)    (6)
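Computed from raw confusion-matrix counts, these metrics can be sketched in a small helper (illustrative code, not the authors' implementation; Eqs. (1)–(3) are written in terms of rates, and the sketch computes the equivalent quantities from counts):

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, precision, recall, and F-score
    computed from raw confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                    # a.k.a. recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = sensitivity
    f_score = 2 * precision * recall / (precision + recall)
    return sensitivity, specificity, accuracy, precision, recall, f_score

sens, spec, acc, p, r, f1 = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(round(acc, 3), round(p, 3), round(f1, 3))  # 0.85 0.818 0.857
```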
Table 5 shows the detailed performance metrics of these classifiers when each model
was trained as a single classifier.
After training with single classifiers, a top-level classifier or technique was used
to ensemble the five classifiers. Majority voting was considered as a baseline; logistic
regression (LR) and naïve Bayes (NB) were then used to ensemble the classifiers.
In majority voting, the nominal class label predicted most often by the unsupervised
outputs is selected as the final output. LR and NB, on the other hand, are supervised
models that require a class label to learn. Here, we used these two models to combine
the outputs of the five unsupervised models. To train them, a tagged dataset had to be
created, using the unsupervised models’ outputs as features together with the class
label that was removed in the data preprocessing phase. Maximum-likelihood
estimation is the key mechanism by which both logistic regression and naïve Bayes
predict the ensemble outputs. Figure 2 shows a graphical comparison of the ensemble
classifiers’ performance metrics with the single classifier models.
[Figure 2 chart: performance of One-Class SVM (Poly), One-Class SVM (Linear), LOF, Isolation Forest, Elliptic Envelope, Majority Voting, Naïve Bayes, and Logistic Regression; y-axis 0–1.2]
As mentioned in Sect. 3.3, 12 different feature sets, each containing features relevant
to DDoS attacks, were considered for building training models. Figure 3 shows the
accuracy of the three ensemble techniques with respect to the 12 feature sets; feature
set 4 (FS-4) yields the best accuracy when Logistic Regression is used for the
ensemble.
As mentioned earlier, the majority of the data instances were normal, and a very small
amount (1%) of noise (abnormal data) was mixed in to build the modified training
dataset. To verify the framework’s stability and efficiency as noise increases (adding
more anomalous data to the training dataset), we varied the noise amount from 1% to
5% and measured the performance metrics. Figure 4 shows the deviation in
performance as noise is added to the training data.
[Figure 3 chart: ensemble accuracy per feature set; legend includes Majority Voting; y-axis 0–1]
[Figure 4 chart: Accuracy, Precision, Recall, F-1 Score, and FPR versus noise level 1–5%]
The proposed framework was tested and verified with both single classifiers and
ensemble classifiers to obtain higher detection accuracy with a lower false positive
rate. The Logistic Regression based ensemble model was found to have the highest
detection accuracy and the lowest false positive rate among all five single classifiers:
One-Class SVM, Local Outlier Factor, Isolation Forest, and Elliptic Envelope. Table 6
shows the overall comparison among all classifiers, including single classifiers and
ensemble classifiers, as well as some existing research outcomes. ‘N/A’ refers to
results that are not reported in those existing studies.
5 Conclusion
In this research, the goal was to detect DDoS attacks using an unsupervised ML
ensemble. Classifiers of the outlier and novelty detection type, from various classifier
families, were chosen to build the proposed framework. Initially, single classifiers
were used to measure performance metrics in detecting DDoS attacks. On top of these
five individual classifiers (One-Class SVM with two different hyperparameters, Local
Outlier Factor, Elliptic Envelope, and Isolation Forest), an ensemble with majority
voting was applied as a baseline. Naïve Bayes and logistic regression were then used
to ensemble these five classifiers again to obtain better detection accuracy. In our
experiments, the logistic regression based ensemble had the best performance
measures, not only compared to the baseline majority voting and Naïve Bayes
ensembles but also compared to the single-classifier models. In addition, it was
observed from the experimental results, and in comparison with existing research, that
the logistic regression based ensemble using feature set 4 (FSet-4) has the best
detection accuracy, high precision, recall, and F1 score, and a low false positive rate.
The proposed model is not only capable of detecting existing DDoS attacks; because it
uses outlier detection classifiers, it can also detect unseen or new DDoS attacks.
In this research, only one dataset was considered for experimentation, and we plan to
continue our experiments with other datasets using the proposed framework. The
dataset we used was offline data, hence we were limited in experimenting with online
data. Twelve different feature sets were chosen from existing research for
experimentation. In the future, we plan to reduce the features ourselves using different
feature reduction techniques and domain knowledge. With this research as the base, we
will consider deep learning methods and software agents [33] to detect DDoS attacks
more accurately.
References
1. Lee, Y.-J., Baik, N.-K., Kim, C., Yang, C.-N.: Study of detection method for spoofed IP
against DDoS attacks. Pers. Ubiquitous Comput. 22(1), 35–44 (2018)
2. NETSCOUT Report. https://www.netscout.com/report/. Accessed 10 July 2019
3. Specht, S.M., Ruby B.L.: Distributed denial of service: taxonomies of attacks, tools, and
countermeasures. In: Proceedings of the 17th International Conference on Parallel and
Distributed Computing Systems (2004)
4. Dietterich, T.G.: Ensemble methods in machine learning. In: International Workshop on
Multiple Classifier Systems. Springer, Heidelberg (2000)
5. Aburomman, A.A., Reaz, M.B.I.: A survey of intrusion detection systems based on
ensemble and hybrid classifiers. Comput. Secur. 65, 135–152 (2017)
6. Noureldien, N.A., Yousif, I.M.: Accuracy of machine learning algorithms in detecting DoS
attacks types. Sci. Technol. 6(4), 89–92 (2016)
7. Olusola, A.A., Oladele, A.S., Abosede, D.O.: Analysis of KDD’99 intrusion detection
dataset for selection of relevance features. In: Proceedings of the World Congress on
Engineering and Computer Science, WCECS, vol. 1 (2010)
8. Osanaiye, O., et al.: Ensemble-based multi-filter feature selection method for DDoS
detection in cloud computing. EURASIP J. Wirel. Commun. Netw. 2016(1), 130 (2016)
9. Ambusaidi, M.A., et al.: Building an intrusion detection system using a filter-based feature
selection algorithm. IEEE Trans. Comput. 65(10), 2986–2998 (2016)
10. Gaikwad, D.P., Thool, R.C.: Intrusion detection system using bagging ensemble method of
machine learning. In: 2015 International Conference on Computing Communication Control
and Automation. IEEE (2015)
11. Shrivas, A.K., Dewangan, A.K.: An ensemble model for classification of attacks with feature
selection based on KDD99 and NSL-KDD data set. Int. J. Comput. Appl. 99(15), 8–13
(2014)
12. Tesfahun, A., Bhaskari, D.L.: Intrusion detection using random forests classifier with
SMOTE and feature reduction. In: 2013 International Conference on Cloud & Ubiquitous
Computing & Emerging Technologies. IEEE (2013)
13. Haq, N.F., et al.: Application of machine learning approaches in intrusion detection system:
a survey. IJARAI-Int. J. Adv. Res. Artif. Intell. 4(3), 9–18 (2015)
14. Yusof, A.R., Udzir, N.I., Selamat, A.: Systematic literature review and taxonomy for DDoS
attack detection and prediction. Int. J. Digit. Enterp. Technol. 1(3), 292–315 (2019)
15. Belavagi, M.C., Muniyal, B.: Performance evaluation of supervised machine learning
algorithms for intrusion detection. Procedia Comput. Sci. 89, 117–123 (2016)
16. Ashfaq, R.A.R., et al.: Fuzziness based semi-supervised learning approach for intrusion
detection system. Inf. Sci. 378, 484–497 (2017)
17. Perez, D., et al.: Intrusion detection in computer networks using hybrid machine learning
techniques. In: 2017 XLIII Latin American Computer Conference (CLEI). IEEE (2017)
18. Villalobos, J.J., Rodero, I., Parashar, M.: An unsupervised approach for online detection and
mitigation of high-rate DDoS attacks based on an in-memory distributed graph using
streaming data and analytics. In: Proceedings of the Fourth IEEE/ACM International
Conference on Big Data Computing, Applications and Technologies. ACM (2017)
19. Jabez, J., Muthukumar, B.: Intrusion detection system (IDS): anomaly detection using outlier
detection approach. Procedia Comput. Sci. 48, 338–346 (2015)
20. Smyth, P., Wolpert, D.: Stacked density estimation. In: Advances in Neural Information
Processing Systems (1998)
21. Hosseini, S., Azizi, M.: The hybrid technique for DDoS detection with supervised learning
algorithms. Comput. Netw. 158, 35–45 (2019)
22. Canadian Institute for Cybersecurity, Datasets/NSL-KDD. https://www.unb.ca/cic/datasets/
nsl.html. Accessed 10 July 2019
23. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.: A detailed analysis of the KDD CUP 99
data set. In: Submitted to Second IEEE Symposium on Computational Intelligence for
Security and Defense Applications (CISDA) (2009)
24. Das, S., Mahfouz, A.M., Venugopal, D., Shiva, S.: DDoS intrusion detection through
machine learning ensemble. In: 2019 IEEE 19th International Conference on Software
Quality, Reliability and Security Companion (QRS-C), pp. 471–477. IEEE, July 2019
25. One-Class classification. https://en.wikipedia.org/wiki/One-class_classification. Accessed 10
July 2019
26. Microsoft, One-Class Support Vector Machine. https://docs.microsoft.com/en-us/azure/
machine-learning/studio-module-reference/one-class-support-vector-machine. Accessed 10
July 2019
27. Scikit learn, Novelty and Outlier Detection. https://scikit-learn.org/stable/modules/outlier_
detection.html. Accessed 10 July 2019
28. Scikit learn, Isolation Forest. https://scikit-learn.org/stable/modules/generated/sklearn.
ensemble.IsolationForest.html. Accessed 10 July 2019
29. Scikit learn. https://scikit-learn.org. Accessed 10 July 2019
30. Kanakarajan, N.K., Muniasamy, K.: Improving the accuracy of intrusion detection using
GAR-Forest with feature selection. In: Proceedings of the 4th International Conference on
Frontiers in Intelligent Computing: Theory and Applications (FICTA) 2015. Springer,
New Delhi (2016)
31. Pajouh, H.H., Dastghaibyfard, G.H., Hashemi, S.: Two-tier network anomaly detection
model: a machine learning approach. J. Intell. Inf. Syst. 48(1), 61–74 (2017)
32. Pervez, M.S., Farid, D.Md.: Feature selection and intrusion classification in NSL-KDD cup
99 dataset employing SVMs. In: The 8th International Conference on Software, Knowledge,
Information Management and Applications (SKIMA 2014). IEEE (2014)
33. Das, S., Shiva, S.: CoRuM: collaborative runtime monitor framework for application
security. In: 2018 IEEE/ACM International Conference on Utility and Cloud Computing
Companion (UCC Companion). IEEE (2018)
Moving Towards Open Set Incremental
Learning: Readily Discovering
New Authors
1 Introduction
Formal as well as informal textual data are over-abundant in this Internet-
connected era of democratized publishing and writing. These textual information
sources are in multiple forms such as news articles, electronic books and social
media posts. The use of text classification allows us to determine important
information about the texts that can often be used to connect to the respective
authors, naturally leading to the concept of Authorship Attribution. Authorship
Attribution is seen as the process of accurately finding the author of a piece of
text based on its stylistic characteristics [1]. Authorship Attribution is useful in
scenarios such as identification of the author of malicious texts or the analysis
of historical works with unknown authors.
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 739–751, 2020.
https://doi.org/10.1007/978-3-030-39442-4_54
740 J. Leo and J. Kalita
2 Related Work
The related work discussed here covers four topics: deep networks for open set
classification, metrics for open set classification, open set text classification, and
recent proposals to use loss functions for open set classification in the context of
computer vision.
Using deep neural networks for open set classification often requires a change in
the network model. Modern neural networks have multiple layers connected in
various ways, depending on the classifier architecture being used. Most models
eventually include a softmax layer that classifies the data to the known classes,
with an associated confidence level or probability for each class. A test example
is considered to belong to the class which has the highest probability among
Moving Towards Incremental Learning 741
all the classes. To adapt this model to the open set scenario, the softmax layer
was replaced by a unique layer named the OpenMax layer [6]. This layer esti-
mates the probability of an input being from one of the known classes as well
as an “unknown” class, which lumps together all classes unseen during training.
Thus, the network is able to recognize examples belonging to unknown classes,
enhancing the ability of the closed set classifier it starts with.
The process of open set class recognition leads to new challenges during evaluation.
There are multiple possible sources of error, including misclassification of known or
unknown classes and the determination of novel classes. Bendale and Boult (2015)
proposed a metric to evaluate how individual examples are classified. Although the
metric was originally proposed for use in computer vision, it is applicable to author
attribution as well.
Prakhya, Venkataram, and Kalita (2017) modified the single OpenMax layer proposed
by [6], replacing the softmax layer in a multi-layer convolutional neural network with
an ensemble of several outlier detectors, to obtain high accuracy scores for open set
text classification. The ensemble of classifiers uses voting among three different
approaches: Mahalanobis Weibull, Local Outlier Factor [7], and Isolation Forest [8].
The average voting method produced results that are more accurate in detecting
outliers, improving the detection of unknown classes.
A problem that often occurs in open set classification is the classifier labeling
known-class data as unknown. This typically happens when the examples of the
pre-trained classes and the unknown classes encountered during testing share similar
features. In the context of computer vision, Dhamija, Günther, and Boult (2018)
introduce what is called the Entropic Open-Set loss function, which increases the
entropy of the softmax scores for background training samples and improves the
handling of background and unknown inputs. They introduce another loss function,
called the Objectosphere loss, which further increases softmax entropy and
performance by reducing the vector magnitudes of examples of unknown classes
relative to those of the known classes, lowering the erroneous classification of
known-class data as unknown. Since this approach squashes the magnitudes of all
examples belonging to unknown classes, it makes later separation of individual
unknown classes difficult.
Fig. 1. Protocol for open set classification and incremental class learning
3 Approach
This paper explores open set classification and the process of moving towards
incremental learning of new classes. The objective is to create a classifier frame-
work that can incrementally learn and expand its knowledge base as additional
data is presented as shown in Figs. 1 and 2. The approach is also outlined in
Algorithm 1.
In prior work on open set classification, authors have focused on recognizing test
samples as belonging to classes unknown during training. Based on prior research and
knowledge, this paper is the first to instantiate new classes iteratively, extending prior
work to real incremental class learning. The approach is first summarized to provide
an easily comprehensible sketch before moving on to the details. The classifier
framework is first initialized by training it with examples from a small number of
selected classes. During testing, the trained classifier is then exposed to a mix of
examples from the already-known classes as well as unknown classes. At a certain
point, testing of the current classifier is paused, and all examples recognized as
belonging to unknown classes are clustered. Clustering allows for the grouping of
similar data and visually represents the differences between unique clusters. The
hypothesis is that, if the clustering is good, one or more of the clusters of unknown
examples can be thought of as new classes the current classifier has not seen, and
these clusters are
Fig. 2. Ensemble model and testing classifier diagram. This diagram more clearly
describes the ‘Ensemble Outlier Detector’ component from Fig. 1.
instantiated as new classes by making up new unique labels for them. At this point,
the current classifier is updated by retraining it with all examples of the old known
classes as well as the newly instantiated classes. This process of training,
accumulating outliers, clustering, and instantiating selected new classes from the
clusters is repeated a number of times, as long as the error of the entire learning
process remains acceptable.
In particular, the classifier is a multi-layer CNN structure for training purposes.
During testing, the softmax layer at the very end is replaced by an outlier ensemble,
following the work of [9]. The outlier detector ensemble consists of a Mahalanobis
model, a Local Outlier Factor model, and an Isolation Forest model, as in [9]. The
classifier model as used in training is shown in Fig. 2. Initially, the model is created
by training a classifier E_current with a given number k_seed of classes found in the
entire training data set D. Then a derived dataset D_test_current is created for testing
the model by mixing examples of k_unknown unknown classes with the previously
trained k_seed classes. The process always adds k_new classes to the number of known
classes. Thus, at the end of the i-th iteration of class learning, the classifier knows
k_seed + (i − 1)·k_new classes. The model instantiates “new” classes by choosing
dominant clusters and then retrains with these new classes. These classes are then
removed from the set of all classes, and new ones are selected for the next incremental
addition.
This paper experiments with multiple clustering techniques, including K-Means [10],
Birch [11], DBScan [12], and Spectral [13], to determine the most suitable one for
author attribution. There is also experimentation with various values of the parameters
k_seed, k_unknown, and δ.
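A sketch of comparing these clustering algorithms with scikit-learn, using synthetic blobs to stand in for the groups of "unknown author" samples (the data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans, Birch, DBSCAN, SpectralClustering
from sklearn.datasets import make_blobs

# Three well-separated blobs stand in for groups of "unknown author" samples.
X, y_true = make_blobs(n_samples=150, centers=3, random_state=0)

candidates = {
    'K-Means': KMeans(n_clusters=3, n_init=10, random_state=0),
    'Birch': Birch(n_clusters=3),
    'DBScan': DBSCAN(eps=1.5),
    'Spectral': SpectralClustering(n_clusters=3, random_state=0),
}
labels = {name: algo.fit_predict(X) for name, algo in candidates.items()}
for name, found in labels.items():
    # DBSCAN marks noise points with -1, so exclude that label when counting.
    print(name, 'found', len(set(found) - {-1}), 'clusters')
```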
Input: Training set D = {(x^(i), y^(i))}, i = 1 … N, samples from all known classes
Output: An incrementally trained classifier E on examples from a number of classes in D
1: C_all ← {C_1, …, C_n}, the set of all known classes
2: C_train_current ← (randomly) pick k_seed classes from C_all
3: D_train_current ← {(x^(i), y^(i)) | y^(i) ∈ C_train_current}, samples from classes in C_train_current
4: repeat
5:   C_unknown_current ← (randomly) pick k_unknown classes from C_all − C_train_current
6:   D_test_current ← D_train_current ∪ {(x^(i), y^(i)) | y^(i) ∈ C_unknown_current}
4 Evaluation Methods
Since the method uses clustering as well as classification in the designed protocol
for incremental classification, there needs to be evaluation of both. First, it is
required to have an outline of how clusters obtained from examples classified as
unknown are evaluated, and then it is required to have a description of how the
incremental classifier is evaluated.
There is a variety of clustering algorithms, and the model needs one that works
efficiently in the domain of author attribution. The test samples deemed to be outliers
are clustered, with the hypothesis that some of these clusters correspond to actual
classes in the original dataset. The evaluation process uses the Davies-Bouldin Index,
shown in Eq. (1), to evaluate clustering [14].
DB = (1/n) Σ_{i=1}^{n} max_{j≠i} (σ_i + σ_j) / d(c_i, c_j)    (1)
A second metric used is the V-Measure, shown in Eq. (2), which has been widely used
for clustering in natural language processing tasks when the ground truth is known,
i.e., the samples and their corresponding classes are known. This metric computes the
harmonic mean of homogeneity and completeness [15]. Homogeneity measures how
close the clustering comes to each cluster containing samples from one class only.
Completeness measures how close the clustering comes to all samples of a given class
being assigned to the same cluster. Scores close to 1 indicate better clustering. Here β
is a parameter used to weigh the two components; a higher value of β weighs
completeness more heavily than homogeneity, and vice versa.
V = ((1 + β) ∗ homogeneity ∗ completeness) / (β ∗ homogeneity + completeness)    (2)
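Both scores are available in scikit-learn; a minimal sketch on synthetic data (the blob data is an illustrative assumption):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, v_measure_score

X, y_true = make_blobs(n_samples=120, centers=3, random_state=1)
pred = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Davies-Bouldin compares within-cluster spread to between-cluster separation;
# lower values indicate better-formed clusters.
print('Davies-Bouldin:', davies_bouldin_score(X, pred))

# V-Measure (beta = 1 by default) is the harmonic mean of homogeneity and
# completeness against the ground-truth classes; closer to 1 is better.
print('V-Measure:', v_measure_score(y_true, pred))
```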
where N is the total number of samples in the dataset. When the same classifier E_n()
is tested in the context of open set classification, there is a need to keep track of the
errors that occur between known and unknown classes. When the classifier is tested on
N samples from n known classes and N′ samples from u unknown classes, the test
comprises a total of N + N′ samples over n + u classes. The open set classification
error OS for classifier E_n is given as [3]:

OS = (1 / (N + N′)) ( Σ_{i=1}^{N} [E_n(x^(i)) ≠ y^(i)] + Σ_{j=N+1}^{N+N′} [E_n(x^(j)) ≠ unknown] )    (4)
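Computing an open set error of this kind can be sketched as follows: both misclassified known-class samples and unrecognized unknown-class samples count as errors (an illustrative helper; the 'unknown' label string is an assumption of this sketch):

```python
def open_set_error(preds, truths):
    """Fraction of test samples in error: known-class samples must match their
    label, unknown-class samples must be predicted as 'unknown'."""
    wrong = 0
    for pred, true in zip(preds, truths):
        if true == 'unknown':
            wrong += pred != 'unknown'
        else:
            wrong += pred != true
    return wrong / len(truths)

# 3 known-class samples and 2 unknown-class samples:
preds  = ['a', 'b', 'a', 'unknown', 'b']
truths = ['a', 'a', 'a', 'unknown', 'unknown']
print(open_set_error(preds, truths))  # 2 errors out of 5 -> 0.4
```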
In this research, clustering is used to obtain new classes after open set recognition is
performed. This way, the new data identified for the novel classes can be used to
incrementally train the model. For the evaluation of these clusters, this paper presents
a new metric, ICA (Incremental Class Accuracy), which takes into account the specific
data from an identified cluster and averages calculations of homogeneity,
completeness, and unknown identification accuracy for the cluster. Homogeneity is
defined as the ratio of the number of data samples of the predominant class c in the
cluster k (n_{c|k}) to the total number of values in the cluster (N_k). Completeness is
defined as the ratio of the number of data samples of the predominant class c in the
cluster k (n_{c|k}) to the total number of tested samples of the same class, N_c.
Unknown identification accuracy is defined as the ratio of the number of unknown (u)
data samples in the cluster k (n_{u|k}) to the total number of values in the cluster,
N_k. The ICA equation assumes only one cluster is being evaluated, but it can be
adapted to multiple clusters by computing an ICA score for each cluster and averaging.
max(nc|k )
Homogeneity = (5)
Nk
max(nc|k )
Completeness = (6)
Nc
(nu|k )
Unknown Identification Accuracy = (7)
Nk
max(nc|k ) max(nc|k ) (nu|k ) 1
ICA = + + ∗ (8)
Nk Nc Nk 3
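The per-cluster ICA computation can be sketched as follows (illustrative code; the argument names and toy values are invented):

```python
from collections import Counter

def ica(cluster_labels, n_class_total, n_unknown_in_cluster):
    """Incremental Class Accuracy for a single cluster (Eq. 8).

    cluster_labels: true class of every sample placed in the cluster.
    n_class_total: total tested samples of the cluster's predominant class.
    n_unknown_in_cluster: samples in the cluster that were truly unknown.
    """
    n_k = len(cluster_labels)
    n_ck = Counter(cluster_labels).most_common(1)[0][1]  # predominant class count
    homogeneity = n_ck / n_k
    completeness = n_ck / n_class_total
    unknown_acc = n_unknown_in_cluster / n_k
    return (homogeneity + completeness + unknown_acc) / 3

# A cluster of 10 samples: 8 from author 'X', all 10 truly unknown earlier,
# and 16 tested samples of 'X' overall.
score = ica(['X'] * 8 + ['Y'] * 2, n_class_total=16, n_unknown_in_cluster=10)
print(round(score, 3))  # (0.8 + 0.5 + 1.0) / 3 -> 0.767
```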
Other metrics used to determine the performance of the model are accuracy and
F1-score; these inherently reflect the accuracy of the classifier as well as of novel data
detection.
5.1 Datasets
Since the objective is open set author attribution, the testing uses two datasets, each of
which contains 50 authors.
– Victorian Era Literature Data Set [16]: This dataset is a collection of
writing excerpts from 50 Victorian authors chosen from the GDELT database.
The text has been pre-processed to remove specific words that identify the
individual piece of text or author (names, author made words, etc.). Each
author has hundreds of unique text pieces with 1000 words each.
– CCAT-50 [17]: This data set is a collection of 50 authors each with 50 unique
text pieces divided for both training and testing. These texts are collections
of corporate and industrial company news stories. This data is a subset of
Reuters Corpus Volume 1.
Fig. 3. Clustering plots for Victorian Literature data with accuracy score, 5 trained
classes and 8 tested classes
Fig. 4. Clustering plots for CCAT-50 data with accuracy score, 5 trained classes and
8 tested classes
For the first experiment, the objective was to see whether the proposed method would
improve classification accuracy and to decide which clustering algorithm works best.
Both datasets are run individually, first with five known training classes and then with
ten; the model is then introduced to three unknown classes during the testing phase of
each run. The results, compared by accuracy and F1-Score, are found in Table 1; a
significant increase in these values is observed after the classifier is retrained with the
identified novel classes. The clustering evaluation metrics are found in Table 2.
V-Measure scores prove to be more useful, because the Davies-Bouldin scores do not
always indicate the highest clustering accuracy: the best-formed clusters do not
necessarily mean higher accuracy. Even though the chosen datasets have not been used
for open set classification in prior research, this paper compares the calculated open
set classification scores with the state-of-the-art closed set classification scores. Based
on prior research, the best classification F1-Score from prior work for the Victorian
Literature dataset using only a few classes is 0.808 [16], and the designed model
produces a slightly better score. Also based on prior research, the best classification
accuracy score
Table 1. Pre-trained class scores and post-open set classification scores, either 5 or 10
initial trained classes and 3 unknown added during testing
Table 2. Davies Bouldin Index and V-Measure score for clustering methods evaluated,
either 5 or 10 trained classes and 3 unknown added during testing.
for the CCAT-50 dataset using only a few classes is 86.5%, and the designed model
obtains similar results. The clustering models seem to contribute the most error for
both datasets (especially the CCAT-50 data); better clustering models would
presumably produce better results.
For the second experiment, the model is initially trained with a fixed number
of classes k_seed, and the method then incrementally adds k_unknown classes
for testing. This process is repeated to demonstrate that the model learns
incrementally as the learning and open set classification cycle is repeated. The test
is run by adding classes over multiple iterations and recording the change in the F1-
Score for the overall classification and the generation of new classes; the objective is
to run each test until the results drop significantly or until the model reaches a
maximum number of classes. Figure 5 displays the results of the incremental cycle, and it
is observed that the model achieves better results when fewer classes are added
at a time. The experiment runs tests adding 1, 2, and 3 classes at a time. The
open set error shown in Eq. 4 is also calculated for each test; this metric shows
the error of unknown data identification but not of novel class generation. A problem
noticed in the experiment is that error propagates through the process, so
as error accumulates the results deteriorate. Another observation, based on the results
from both data sets, is that adding one class per iteration gives better results
because this limits the clustering error. It is also clear that the Victorian
Literature data performs worse than the CCAT-50 data; the initial reasoning is
that the Victorian text includes words with slurs and accent-mark symbols, and
word2vec is not pre-trained on these new features.
Moving Towards Incremental Learning 749
Fig. 5. Incremental learning plots. Initially trained with 5, 10, 15, and 20 initial classes,
then tested by incrementally adding 1, 2, and 3 classes. These plots show the final F1-
scores and open set error from Eq. 4.
The CCAT-50 data tends to have very distinct authors, and its pieces of text also
tend to be more unique. Overall, based on the results, it is concluded that most
of this error can be attributed to the clustering process.
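One cycle of the train, detect-unknowns, cluster, retrain loop described above can be sketched as follows. The blob data, distance-threshold unknown detector, and logistic-regression classifier are simple stand-ins for the paper's embeddings, outlier detection, and classifier, chosen only to make the cycle runnable:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy stand-in for author embeddings: 5 seed classes plus 3 unknown ones.
centers = [(-8, -8), (-8, 8), (8, -8), (8, 8), (0, 0),   # seed classes
           (0, 8), (8, 0), (-8, 0)]                      # unknown classes
X, y = make_blobs(n_samples=800, centers=centers, cluster_std=0.6,
                  random_state=0)
seed_mask = y < 5
clf = LogisticRegression(max_iter=1000).fit(X[seed_mask], y[seed_mask])

# Open set step: flag a sample as "unknown" if it is far from every seed centroid.
centroids = np.vstack([X[seed_mask][y[seed_mask] == c].mean(axis=0)
                       for c in range(5)])
dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).min(axis=1)
unknown = X[dist > 3.0]

# Instantiate new classes by clustering the flagged samples, then retrain.
new_y = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(unknown) + 5
clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([X[seed_mask], unknown]),
    np.concatenate([y[seed_mask], new_y]))
print("classes known after one cycle:", len(clf.classes_))
```

Repeating the loop with the retrained classifier reproduces the incremental cycle; the clustering step is exactly where error can accumulate, since mis-clustered unknowns become wrong labels for the next round of training.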
As stated in the previous experiments, the clustering process tends to have
the most variance; this is evident from the low clustering accuracy due to the lack
of fully distinct clusters. Thus, there needs to be a way to evaluate the clustering.
Using the Incremental Class Accuracy (ICA) metric shown in Eq. 5, the clustering
can be evaluated with regard to homogeneity, completeness, and unknown-identification
accuracy. From the previous experiment it is also noticed
that adding one class at a time tends to produce the best results,
so the ICA score is calculated when one class is added and instantiated. The
results for both data sets are shown in Table 3. From these results it is observed
that a smaller number of initially trained k_seed classes produces better results,
which is expected as the k_unknown classes are more easily identified.
Table 3. ICA scores for 1 added class/cluster evaluation. Scores based on Eq. 5.
6 Conclusion
This research addresses open set classification for NLP text analysis in the
area of authorship attribution. The model created determines the originating
author of a piece of text based on textual characteristics. This research also
moves towards a novel incremental learning approach in which unknown authors
are identified and their data is then labeled so that the classifier expands its
knowledge. Through this process, the state-of-the-art implementation is extended
into a full-cycle model that trains on given data and then expands the trained
knowledge based on new data found during future testing.
Text-based authorship attribution can be applied to research involving security
and linguistic analysis. Some similar developing work uses related research
methods for image recognition [19]; this can be applied to facial recognition
tasks and video surveillance applications. This model can also be further
improved by developing a more precise way of distinguishing different pieces of
text. Another direction for future research is using backpropagation. Once novel
classes are identified, the model should then be able to modify the already-trained
classifier with the D_current^train data. The model can then be tested with D_current^test
to determine if it can recognize previously unknown classes. Backpropagation
in a neural network requires a fully interconnected set of layers that
allow the processing of data through either side of the model [20]. This process
would save the step of fully retraining the classifier model. A similar approach
is to add new “neurons” to a deep neural network to allow
a trained model to be extended [21]. With these future improvements the
designed model can be further improved and potentially obtain better results.
References
1. Rocha, A., Scheirer, W.J., Forstall, C.W., Cavalcante, T., Theophilo, A., Shen, B.,
Carvalho, A.R.B., Stamatatos, E.: Authorship attribution for social media foren-
sics. IEEE Trans. Inf. Forensics Secur. 12(1), 5–33 (2016)
2. Scheirer, W.J., de Rezende Rocha, A., Sapkota, A., Boult, T.E.: Toward open set
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(7), 1757–1772 (2012)
3. Bendale, A., Boult, T.: Towards open world recognition. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 1893–1902
(2015)
4. Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural
networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang.
Process. 20(1), 30–42 (2011)
5. Higashinaka, R., Imamura, K., Meguro, T., Miyazaki, C., Kobayashi, N., Sugiyama,
H., Hirano, T., Makino, T., Matsuo, Y.: Towards an open-domain conversational
system fully based on natural language processing. In: Proceedings of COLING
2014, the 25th International Conference on Computational Linguistics: Technical
Papers, pp. 928–939 (2014)
6. Bendale, A., Boult, T.E.: Towards open set deep networks. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572
(2016)
7. Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Loop: local outlier probabili-
ties. In: Proceedings of the 18th ACM Conference on Information and Knowledge
Management, pp. 1649–1652. ACM (2009)
8. Liu, F.T., Ting, K.M., Zhou, Z.-H.: Isolation forest. In: 2008 Eighth IEEE Inter-
national Conference on Data Mining, pp. 413–422. IEEE (2008)
9. Prakhya, S., Venkataram, V., Kalita, J.: Open set text classification using convo-
lutional neural networks. In: International Conference on Natural Language Pro-
cessing (2017)
10. Hartigan, J.A., Wong, M.A.: Algorithm as 136: a K-Means clustering algorithm.
J. Roy. Stat. Soc.: Ser. C (Appl. Stat.) 28(1), 100–108 (1979)
11. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering
method for very large databases. In: ACM SIGMOD Record, vol. 25, pp. 103–
114. ACM (1996)
12. Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al.: A density-based algorithm for
discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp.
226–231 (1996)
13. Stella, X.Y., Shi, J.: Multiclass spectral clustering. In: Null, p. 313. IEEE (2003)
14. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern
Anal. Mach. Intell. 2, 224–227 (1979)
15. Rosenberg, A., Hirschberg, J.: V-Measure: a conditional entropy-based external
cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empiri-
cal Methods in Natural Language Processing and Computational Natural Language
Learning (EMNLP-CoNLL), pp. 410–420 (2007)
16. Gungor, A.: Benchmarking authorship attribution techniques using over a thou-
sand books by fifty Victorian era novelists. Ph.D. thesis (2018)
17. Houvardas, J., Stamatatos, E.: N-gram feature selection for authorship identifica-
tion. In: International Conference on Artificial Intelligence: Methodology, Systems,
and Applications, pp. 77–86. Springer (2006)
18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: Advances in Neural
Information Processing Systems, pp. 3111–3119 (2013)
19. Rebuffi, S.-A., Kolesnikov, A., Sperl, G., Lampert, C.H.: iCaRL: incremental clas-
sifier and representation learning. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp. 2001–2010 (2017)
20. Hecht-Nielsen, R.: Theory of the backpropagation neural network. In: Neural Net-
works for Perception, pp. 65–93. Elsevier (1992)
21. Draelos, T.J., Miner, N.E., Lamb, C.C., Cox, J.A., Vineyard, C.M., Carlson, K.D.,
Severa, W.M., James, C.D., Aimone, J.B.: Neurogenesis deep learning: extending
deep networks to accommodate new classes. In: 2017 International Joint Conference
on Neural Networks (IJCNN), pp. 526–533. IEEE (2017)
Automatic Modulation Classification Using
Induced Class Hierarchies and Deep Learning
1 Introduction
The advantage of deep-learning-based AMC methods is in their ability to inherently learn features from the dataset and use these
learned features to form a decision. Neural networks (NNs) have been shown to be
universal function approximators [2], suggesting they might be capable of competing
with likelihood-based AMC methods. Since NNs inherently learn features from a given
dataset, there is no need for feature engineering, as with feature-based AMC methods.
The work in [3] demonstrates that convolutional neural networks (CNN) can achieve
comparable accuracy to traditional AMC methods, even in the presence of severe
channel impairments. In [4], the authors apply a hierarchical approach and provide a
proof-of-concept that hierarchical deep neural nets (DNNs) are feasible for AMC.
However, their hierarchies are manually chosen, based on expert knowledge of mod-
ulation type.
Hierarchical classification has been used to improve classification in other fields,
such as text classification. In [5], the authors developed a method to automatically
induce a hierarchy of classifiers based on the confusion matrix of a base classifier. The
hierarchical approach performed better than flat classifiers on text classification [5]. In
this work, we propose a hierarchical CNN architecture framework, h-CNN, that
leverages induced class hierarchies to completely remove the need for expert domain
knowledge for feature engineering and hierarchy determination. Moreover, to the best
of our knowledge, previous works have not applied induced class hierarchies to DL
architectures; this work provides an opportunity to explore that area.
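The induction step borrowed from [5] can be sketched as follows: turn a base classifier's confusion matrix into a class-distance matrix and cluster it hierarchically. The 4-class confusion matrix below is a made-up example, not data from this paper:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical confusion matrix of a base classifier over 4 classes (rows =
# true class, columns = predicted class, counts out of 100 samples per class).
C = np.array([[90.0,  8.0,  1.0,  1.0],
              [10.0, 85.0,  3.0,  2.0],
              [ 1.0,  2.0, 88.0,  9.0],
              [ 0.0,  3.0, 12.0, 85.0]])
C = C / C.sum(axis=1, keepdims=True)   # row-normalize to confusion rates

# Classes that are often confused are "close": symmetrize, then invert.
S = (C + C.T) / 2.0                    # symmetric similarity
D = 1.0 - S                            # distance matrix
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D, checks=False), method="average")
groups = fcluster(Z, t=2, criterion="maxclust")  # top-level binary split
print("induced top-level split:", groups)
```

The top-level split groups the mutually confused pairs (classes 0/1 and 2/3) together, which is exactly the binary root decision an h-CNN would train first.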
In this paper, related research is presented in Sect. 2; Sect. 3 overviews the dataset
and experimental framework; results are presented in Sect. 4; a discussion on the
viability of h-CNN follows in Sect. 5; and finally, Sect. 6 notes key conclusions and
outlines future work.
2 Related Work
One problem with previous research on AMC methods is the lack of uniformity with
datasets; that is, various papers generate their own datasets which undergo different
channel conditions. Thus, while the works are useful for determining which features and
classification methods perform well under certain channel conditions, it is difficult to
directly compare and evaluate results across all works. To meet this need in modulation
classification, in [3], a GNU Radio channel model is developed, and three signal datasets
are generated to create benchmarks for future work using DL techniques for AMC. The
datasets are available in [8]. Using this dataset, several candidate neural networks are
developed in [3], and a CNN of four layers performs comparably to feature-based
methods. Since CNNs use filters and convolution to remain impervious to shifting,
rotation, linear mixing, and scaling in images, they are a viable candidate to naively
learn modulation schemes for radio signals that have undergone various transformations
due to the channel [3]. Recent works on the application of DL to AMC have been based
on this pioneering work done in [3], which the authors expanded in [6, 7].
In [9], the authors expand on the work of [3] by using other DL architectures such
as ResNet and CLDNN. While the work in [9] was able to achieve significantly better
results on all four models, the use of deeper networks resulted in higher computational
complexity and a significant increase in training times compared to the original
754 T. Odemuyiwa and B. Sirkeci-Mergen
architecture in [3]. In [10], the authors further expand on the work of [9] but focus on
reducing overall training time. Various dimension reduction methods, such as principal
component analysis (PCA) and subsampling methods, are used to reduce the input
vector dimensions. Some of these methods reduced training time by up to a factor of 2,
but several of the models suffered considerable loss in accuracy, depending on the
original number of dimensions and reduction method used [10].
3 Methods
In this work, the publicly available synthetic radio dataset RadioML2016.10a from [3,
8] is used. The code used to generate this dataset is available at [11]. This dataset
comes from an effort highlighted in [3] to create a benchmark database on which
various DL AMC methods can be evaluated and compared. Two sources are
used for the dataset: an analogue source consisting of an episode from a podcast, and a
digital source composed of the entire works of Shakespeare from Gutenberg.
To simulate a real-world transmission, the signals are passed through a channel
modelled in GNU Radio [6]. The channel model is shown in Fig. 1. Five transfor-
mations are applied to the signal:
1. Sample Rate Offset: This gives an offset to the sample rate to model timing offsets
between transmitter and receiver.
2. Center Frequency Offset: This injects a randomized frequency offset to model
effects such as a mismatch between the transmitter and receiver oscillators, or the
Doppler effect due to a moving receiver or transmitter.
3. Selective Fading Model: This block captures the Rayleigh or Rician fading pro-
cesses and uses the sum of sinusoids method to generate a signal output based on
multipath propagation.
4. Additive White Gaussian Noise: This is usually the result of thermal noise.
Twenty SNR levels are used, ranging from −20 decibels (dB) to 18 dB.
Fig. 1. Channel effects applied to the transmitted signal to simulate real-world scenarios.
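Two of the listed impairments, center frequency offset and AWGN, can be sketched in a few lines of numpy. The sample rate, tone frequency, offset range, and SNR below are assumed example values, not the settings of the GNU Radio generator:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 200e3                                   # sample rate, Hz (assumed)
n = np.arange(1024)
x = np.exp(2j * np.pi * 10e3 * n / fs)       # clean complex baseband tone

# Center frequency offset: a randomized frequency rotation of the signal.
cfo_hz = rng.uniform(-500.0, 500.0)
x_rx = x * np.exp(2j * np.pi * cfo_hz * n / fs)

# Additive white Gaussian noise at a target SNR (in dB).
snr_db = 10.0
noise_power = np.mean(np.abs(x_rx) ** 2) / 10 ** (snr_db / 10)
noise = np.sqrt(noise_power / 2) * (rng.standard_normal(n.size)
                                    + 1j * rng.standard_normal(n.size))
y = x_rx + noise
print("measured noise power:", np.mean(np.abs(noise) ** 2))
```

Sample rate offset and selective fading require resampling and a multipath filter respectively; the dataset applies all of them jointly through the channel model in Fig. 1.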
functions. For example, in the case of QPSK, given a symbol c_i and a carrier waveform
with frequency f_c, the transmitted waveform x(t_i) is as follows [3]:

x(t_i) = e^{j(2\pi f_c t + \pi (2c_i + 1)/4)}, \quad c_i \in \{0, 1, 2, 3\}    (1)
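Equation (1) can be evaluated directly; the carrier frequency and time grid below are arbitrary example values:

```python
import numpy as np

fc = 1e3                        # carrier frequency f_c, Hz (example value)
t = np.arange(0, 1e-3, 1e-6)    # 1 ms of samples at a 1 MHz rate

def qpsk_waveform(c):
    """x(t_i) = exp(j(2*pi*fc*t + pi*(2c + 1)/4)) for symbol c in {0,1,2,3}."""
    return np.exp(1j * (2 * np.pi * fc * t + np.pi * (2 * c + 1) / 4))

x = qpsk_waveform(0)
print(np.abs(x[0]), np.angle(x[0]))   # unit modulus; initial phase pi/4
```

Each of the four symbols shifts the phase by a multiple of pi/2, which is the structure the CNN must learn from raw IQ samples.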
3.3 Hardware
Training and testing are conducted on the GPU machines provided by the Google
Colaboratory cloud service [12]. Hosted on the Google Cloud platform, Colaboratory
provides free access to GPU acceleration. The underlying hardware consists of an
NVIDIA Tesla K80 GPU, 12 GB of RAM, and 2496 CUDA cores [13]. All code is
written using Python 3 and the Keras library, with a TensorFlow backend.
Fig. 2. PSD waveforms versus frequency for I-phase (blue) and Q-phase (orange) at 18 dB
SNR. Sample #536.
AMC Using Induced Class Hierarchies and Deep Learning 757
Fig. 3. The baseline CNN model consists of two convolutional layers followed by two fully
connected layers.
Fig. 4. Training and validation loss over 39 epochs for the baseline model.
where x̄ and t̄ are the averages of x(i, j) and t(i, j), respectively. Overall, the
cophenetic value is a measure of how well a dendrogram remains faithful to pairwise
distances [16]. Using this value, the standardized Euclidean measure, with a coefficient
of 95.9%, was chosen as the final measure. Its resulting dendrogram is shown in Fig. 7.
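This metric-selection step can be reproduced with scipy. The random matrix below is a stand-in for whatever per-class feature rows are being clustered, and the candidate metric list is an illustrative subset:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 4))    # stand-in rows to cluster hierarchically

def coph(metric):
    """Cophenetic correlation of an average-linkage dendrogram under `metric`."""
    Y = pdist(X, metric=metric)
    return cophenet(linkage(Y, method="average"), Y)[0]

# Keep the distance measure whose dendrogram best preserves pairwise distances.
best = max(("euclidean", "seuclidean", "cityblock"), key=coph)
print("chosen metric:", best, f"(c = {coph(best):.3f})")
```

A coefficient near 1 (such as the 95.9% reported above) means the dendrogram's merge heights track the original pairwise distances closely.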
An induced hierarchy can aid in better understanding hidden features of the
modulation schemes. In this induced hierarchy, the root separates the QAM signals
from the remaining schemes, rather than analog versus digital as done in previous
papers where hierarchies were determined manually [4]. From a classifier perspective,
this suggests that the higher-order information distinguishing the QAM schemes from the
remaining schemes has higher discriminatory power than the analog-versus-digital
features of the modulation schemes. AM-DSB and WBFM are grouped together as
both contain similar inputs of silent or carrier-tone-only vectors from the original
analog signal, which confuses the base CNN classifier and affects the resulting
dendrogram.
Fig. 6. Distance matrix for the baseline CNN model in an all-SNR scenario.
For the root binary classifier, we experiment with various modifications and find
that a deeper network with three convolution layers and three dense layers produces
the highest accuracy. CNN_L1a and CNN_L1b are also trained using CNN
architectures similar to the root. Overall, the root classifier achieves an accuracy of 84%, while
CNN_L1a and CNN_L1b achieve accuracies of 53% and 65%, respectively. The total
combined training time of all three classifiers is around 568 s, cutting the training time
of the baseline model by over half. At first glance, this result appears surprising, as the
hierarchical model is a deeper architecture overall. However, each sub-CNN trains on
a smaller subset of classes, and the root is simply a binary classifier, a less complex
CNN than an N-way classifier. The overall accuracy achieved in the all-SNR scenario
is 51.1%. Though only slightly improved from the base CNN, the training time has
noticeably decreased. Moreover, at lower SNR ranges, the hierarchical classifier performs
better than the baseline CNN. Figures 9 and 10 show the confusion matrices of
the baseline CNN and Model A at 18 dB SNR. From these matrices, Model A does an
excellent job of accurately classifying 8PSK, AM-DSB, and QAM-64, at the cost of a higher
misclassification rate of WBFM, QPSK, and QAM-16. In this case, the hierarchical
classifier is biased towards AM-DSB versus WBFM, 8PSK versus QPSK, and QAM-64
versus QAM-16, to achieve higher accuracy values.
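The routing logic of the two-level model reduces to the following sketch, where logistic-regression stand-ins replace the CNN root and leaf networks, and synthetic blob data replaces the radio signals:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Six classes in two well-separated groups (playing the induced binary split).
centers = [(-6, -6), (-6, 0), (-6, 6), (6, -6), (6, 0), (6, 6)]
X, y = make_blobs(n_samples=600, centers=centers, cluster_std=0.8,
                  random_state=0)
group = (y >= 3).astype(int)

root = LogisticRegression(max_iter=1000).fit(X, group)      # binary root
leaf = [LogisticRegression(max_iter=1000).fit(X[group == g], y[group == g])
        for g in (0, 1)]                                    # per-group N-way

def predict(x):
    g = root.predict(x.reshape(1, -1))[0]        # route at the root...
    return leaf[g].predict(x.reshape(1, -1))[0]  # ...classify within the group

acc = np.mean([predict(xi) == yi for xi, yi in zip(X, y)])
print(f"hierarchical train accuracy: {acc:.2f}")
```

The sketch also shows where error propagation comes from: any sample misrouted at the root can never be recovered by the leaf classifiers.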
are kept the same as in Model A. The individual classification accuracies of CNN_L2a,
CNN_L2b, CNN_L2c, and CNN_L2d are 50%, 62%, 83%, and 55%, respectively. The
overall model achieves an accuracy of 50.7%, equivalent to the baseline model.
This degradation in accuracy from Model A can be attributed to the difficulty the
lowest-level CNNs have in naturally extracting the fine-grained features that distinguish
similar classes. For example, regardless of how deep CNN_L2a is designed for
classifying QAM-16 and QAM-64, the area under the curve (AUC) measure, taken from a
graph of true positive rates versus false positive rates, consistently oscillates around
50%, indicating the classifier cannot definitively distinguish between the two modulation
schemes. Even upon changing the representation of the input vectors from raw
IQ samples to instantaneous amplitude and phase samples, the classification results
remain the same. Previous works using the RadioML datasets have noted similar
results, and several authors suggest adding a preprocessing step to manually extract
features [10–14]. However, this involves stepping back into expert feature
engineering, which DL should not require. Figures 12 and 13 show the confusion
matrices of Model B at −8 dB SNR and 18 dB SNR. In both the high and low SNR
scenarios, there is no significant improvement over the base CNN classifier.
parameters but reduce the overall training time by nearly half and a third, respectively.
Figure 14 shows the overall performance, in terms of accuracy, of all three models.
Starting at −12 dB, Model A performs significantly better than the other models, while
Model B performs comparably to the baseline CNN model. Figures 15 and 16 depict
the overall training and performance times of the three models.
This work has provided a baseline that shows the viability and benefits of using DL
based induced hierarchical classification for AMC. First, using DL eliminates the need
for expert-based feature extraction, since the model can independently learn features
during the training phase.
Second, there are various performance benefits to using induced class hierarchies.
For the DL based model, both hierarchical models have significantly improved training
times over the base CNN. For previous works using the RadioML dataset, an
improvement in classification accuracy generally also results in an increase in training
time due to the use of deeper neural networks [9, 17]. In this work, though the overall
architecture is deeper than the base model, training time is reduced due to the modu-
larization of the N-way problem into smaller classification tasks. Moreover, the 2-level
hierarchy yields a significant increase in accuracy from the baseline model, while the
3-level hierarchy has comparable performance. However, while induced class hierarchies
improve or provide comparable performance in terms of classification accuracy and
training time, their effectiveness and depth are limited by the effects of error propagation.
The hierarchical approach for CNN shows promise in terms of reducing overall training
time, and after fine-tuning, in improving accuracy as well. In previous works, a con-
fusion matrix highlighting the areas where the base CNN classifier has issues motivated
the need for deeper architectures to extract the subtle differences between schemes. In a
sense, this work is creating deeper CNN architectures, but with separately trained CNN
classifiers, whose hierarchy is mathematically determined by the confusion matrix. By
using an induced hierarchy, the overall neural network is guided to learn generic
features at higher levels and subsequently learn more intricate features in lower levels
where the classes have higher similarity in the feature space.
In addition, from a computer architecture perspective, the chained h-CNN archi-
tecture allows for a modular hardware implementation of each sub-CNN. Rather than
accessing an entire block of memory containing weights, only the relevant memory
locations containing the weights for a sub-CNN need to be accessed. Future work can
focus on quantifying the hardware efficiency increase of an h-CNN. Moreover, each
sub-CNN model can be trained in parallel, allowing for greater use of parallelization
techniques than with traditional CNN architectures.
This work is completed using the RadioML2016.10a dataset provided in [3, 8].
Several of the recent works based on the RadioML database use RadioML2016.10b,
which is a slightly larger database; however, most works remove the AM-SSB
modulation scheme. The removal of the AM-SSB scheme yields higher accuracy values,
above 75% at high SNRs, than what is achieved in this work [4, 9, 10]. Future
experiments can run both the RadioML2016.10a dataset and RadioML2016.10b
768 T. Odemuyiwa and B. Sirkeci-Mergen
dataset on the CNN architectures presented in this work, removing the AM-SSB
scheme, to directly compare performance of these architectures against those in pre-
vious works. For true training time comparison, code-sharing of the structure of each
architecture from various works is needed, such that independent researchers can run
and verify the architectures of other works on their own machines, and directly com-
pare against their own architectures.
Finally, in a real-world system, cognitive radios are exposed to a large variety of
modulation schemes and unpredictable channel conditions. While this work focuses on
11 modulation schemes and a severely impaired channel, future work should expand
the number of modulation schemes and include multi-receiver situations that require
schemes such as OFDM. Moreover, an interesting study would be to determine the
performance of DL architectures as channel constraints are successively relaxed; that is,
beyond the impairments of sample rate offset, carrier frequency offset, fading and
multipath effects, how do other channel conditions affect a classifier's accuracy?
References
1. Dobre, O.A., Abdi, A., Bar-Ness, Y., Su, W.: Survey of automatic modulation classification
techniques: classical approaches and new trends. IET Commun. 1(2), 137–156 (2007)
2. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal
approximators. Neural Netw. 2(5), 359–366 (1989)
3. O’Shea, T.J., West, N.: Radio machine learning dataset generation with GNU Radio. In:
Proceedings of the GNU Radio Conference, vol. 1, no. 1 (2016)
4. Karra, K., Kuzdeba, S., Petersen, J.: Modulation recognition using hierarchical deep neural
networks. In: 2017 IEEE International Symposium on Dynamic Spectrum Access Networks
(DySPAN) (2017). https://doi.org/10.1109/dyspan.2017.7920746
5. Silva-Palacios, D., Ferri, C., Ramírez-Quintana, M.J.: Improving performance of multiclass
classification by inducing class hierarchies. Procedia Comput. Sci. 108, 1692–1701 (2017).
https://doi.org/10.1016/j.procs.2017.05.218
6. O’Shea, T.J., Corgan, J., Clancy, T.C.: Convolutional radio modulation recognition networks.
In: Engineering Applications of Neural Networks. Communications in Computer and
Information Science, pp. 213–226 (2016). https://doi.org/10.1007/978-3-319-44188-7_16
7. O’Shea, T., Hoydis, J.: An introduction to deep learning for the physical layer. IEEE Trans.
Cogn. Commun. Netw. 3(4), 563–575 (2017). https://doi.org/10.1109/tccn.2017.2758370
8. Datasets: DeepSig Inc. https://www.deepsig.io/datasets
9. Liu, X., Yang, D., Gamal, A.E.: Deep neural network architectures for modulation
classification. In: 2017 51st Asilomar Conference on Signals, Systems, and Computers
(2017). https://doi.org/10.1109/acssc.2017.8335483
10. Ramjee, S., Ju, S., Yang, D., Liu, X., Gamal, A., Eldar, Y.C.: Fast deep learning for
automatic modulation classification. J. Sel. Areas Commun. (2019)
11. RadioML. https://github.com/radioML
12. Colaboratory: Frequently asked questions. https://research.google.com/colaboratory/faq.html
13. Carneiro, T., Da Nobrega, R.V.M., Nepomuceno, T., Bian, G.-B., De Albuquerque, V.H.C.,
Reboucas Filho, P.P.: Performance analysis of Google colaboratory as a tool for accelerating
deep learning applications. IEEE Access 6, 61677–61685 (2018). https://doi.org/10.1109/
access.2018.2874767
Abstract. Wastewater generated from pulp and paper mills is a major pollution
source. It is important to identify the flocculation characteristics of papermaking
wastewater so that the wastewater treatment can be optimized. In this paper, the
flocculation characteristics of deinking wastewater were studied by computer
image processing. Experiments were carried out to acquire images of flocculation.
A series of image parameters related to floc sedimentation characteristics, together
with their time-varying behavior, were found. Using computer visualization technology to study
the static and dynamic behavior of wastewater flocs has many advantages; computer
visualization technology can be used to improve wastewater treatment.
1 Introduction
Wastewater generated from pulp and paper mills is a major source of industrial pollution.
As the demand for paper products continues to rise, it is imperative to optimize the
treatment of papermaking wastewater [1]. Papermaking wastewater contains both soluble
and insoluble organic and inorganic substances, which makes it a
very complex pollution source [2]. At present, the common methods for papermaking
wastewater treatment include the physical method, the physical-chemical method, the biological
method, etc. The new technologies currently being explored and developed include
the electrochemical method, the membrane separation method, and the photocatalytic
decomposition method [3]. Regardless of the method, it is essentially a process of separating the
components in the wastewater. Effective separation depends largely on the flocculation
characteristics of the various components in the wastewater. The flocculation
method is the most widely used in wastewater treatment. It is used not only in pre-treatment,
primary treatment, and final treatment of wastewater, but also in sludge
treatment [4]. It can remove high-molecular substances, plant fibers, various organic
substances, biological sludge, heavy metals such as lead, cadmium and
mercury, and other pollutants. The size of flocs formed during flocculation and the
characteristics of their sedimentation are important topics for environmental researchers
[5, 6]. Digital image processing has been used to identify flocculation and improve
wastewater treatment [7, 8].
\bar{f} = \frac{1}{N} \sum_{x=0}^{L-1} \sum_{y=0}^{L-1} f(x, y)    (1)
The average gray value is then used to calculate the floc strength value:

A = 1 - \frac{\bar{f}}{L}    (2)

where A is the floc strength.
Then the equivalent diameter of the floc was calculated from its projected area s as:

\Phi = 2 \sqrt{\frac{s}{\pi}}    (3)
The sedimentation velocity is then obtained from Stokes' law:

v = \frac{(\rho_s - \rho) g}{18 \mu} d_s^2    (4)
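Equations (1)–(4) can be chained numerically as follows. The synthetic image, gray-level range, projected floc area, and fluid properties are illustrative assumptions, not measurements from the paper:

```python
import numpy as np

L = 256                                   # number of gray levels (assumed)
rng = np.random.default_rng(0)
img = rng.integers(0, L, size=(64, 64)).astype(float)   # synthetic frame

f_bar = img.mean()                        # Eq. (1): average gray value
A = 1.0 - f_bar / L                       # Eq. (2): floc strength

s = 3.0e-8                                # projected floc area, m^2 (assumed)
d_s = 2.0 * np.sqrt(s / np.pi)            # Eq. (3): equivalent-area diameter

rho_s, rho = 1050.0, 998.0                # floc and water density, kg/m^3
g, mu = 9.81, 1.0e-3                      # gravity m/s^2, viscosity Pa*s
v = (rho_s - rho) * g * d_s ** 2 / (18 * mu)   # Eq. (4): settling velocity
print(f"A={A:.3f}  d_s={d_s * 1e6:.1f} um  v={v * 1e3:.3f} mm/s")
```

In practice the floc area s would come from segmenting the acquired images rather than being fixed, and the time series of these quantities gives the dynamic characterization discussed below.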
4 Conclusion
Wastewater generated from pulp and paper mills is a major source of industrial pollution,
and it is imperative to optimize its treatment. Characterization of the flocculation
is of great significance for studying flocculants and improving wastewater treatment.
In this paper, wastewater generated from pulp and paper mills after deinking was
investigated using digital image processing. Digital images of flocculation during
sedimentation were acquired and analyzed to identify the characteristics of
flocculation. The equivalent diameter of the flocs was calculated and, based on the
results, the sedimentation velocity was determined. Image processing can be an
effective tool for flocculation morphology analysis and dynamic characterization in
wastewater treatment.
774 M. Li et al.
References
1. Pokhrel, D., Viraraghavan, T.: Treatment of pulp and paper mill wastewater-a review. Sci.
Total Environ. 333(1–3), 37–58 (2004)
2. Kamali, M., Khodaparast, Z.: Review on recent developments on pulp and paper mill
wastewater treatment. Ecotoxicol. Environ. Saf. 114, 326–342 (2015)
3. El-Ashtoukhy, E.S., Amin, N.K., Abdelwahab, O.: Treatment of paper mill effluents in a
batch-stirred electrochemical tank reactor. Chem. Eng. J. 146(2), 205–210 (2009)
4. Nasser, M.S.: Characterization of floc size and effective floc density of industrial papermaking
suspensions. Sep. Purif. Technol. 122(10), 495–505 (2014)
5. Jenne, R., Cenens, C., Geeraerd, A.H., Impe, J.F.: Towards on-line quantification of flocs and
filaments by image analysis. Biotechnol. Lett. 24(11), 931–935 (2002)
6. Biggs, C.A.: Activated sludge flocculation: on-line determination of floc size and the effect of
shear. Water Res. 34(9), 2542–2550 (2000)
7. Juntunen, P., Liukkonen, M., Lehtola, M., Hiltunen, Y.: Characterization of alum floc in water
treatment by image analysis and modeling. Cogent Eng. 1(1), 944767 (2014)
8. García, H.L., González, I.M.: Self-organizing map and clustering for wastewater treatment
monitoring. Eng. Appl. Artif. Intell. 17(3), 215–225 (2004)
Detection of Anomalous Gait as Forensic Gait
in Residential Units Using Pre-trained
Convolution Neural Networks
1 Introduction
Malaysia is listed as one of the most rapidly evolving countries in the Southeast Asia
region. This development growth is also in line with a rise in the crime index [1–3].
Motorcycle theft, car theft, and housebreaking are the most frequent crimes, contributing
up to 56% of reported cases in Malaysia [3, 4]. Previous studies have
shown that terraced, semi-detached, and detached houses have a higher risk of
housebreaking [5, 6], and that residences without surveillance systems were at six times
higher risk of becoming victims of housebreaking [7].
Nowadays, the utilization of closed-circuit television (CCTV) surveillance
cameras is prominently increasing in public places such as streets, banks, and shop lots,
as well as in residential units, as a precautionary measure against crime incidents.
Explicitly, the main task of surveillance cameras is to monitor and detect
occurrences of anomalous behavior in their vicinity. The anomalous state of
objects and people can be defined as a pattern of state or movement that deviates
from normal behavior [8, 9]. Furthermore, the anomalous behaviors are
2 Related Work
Transfer learning is a machine learning technique that exploits knowledge learned
in one setting to improve performance when modeling a related or
different new task. Transfer learning has given a huge advantage to
users of deep learning with image data by providing well-proven pre-trained
CNN architectures and requiring less data to obtain better results [15, 42]. Practically, there are
two focal steps in implementing the transfer learning technique: (i) selecting the pre-
trained CNN, and (ii) remodeling the pre-trained CNN.
The feature extractor strategy uses the pre-trained CNN without its final layers, since the new task differs from the original setting. The final layers are replaced with shallow classifier models (SVM, OCSVM, or IPCA) or dense classification layers (fully connected layers) in order to specify the output classes of the new task [17, 18, 41]. This strategy lets the feature extraction for the new task leverage knowledge from the convolutional base of the pre-trained CNN.
The fine-tuning strategy requires more skill: the final layers are replaced with a shallow classifier or dense classification layers, and the hyperparameters are fine-tuned. The new layers are made to learn faster by adjusting their learning-rate factors, while the convolutional base is effectively frozen by fixing its weights. The training process can thus achieve better performance in less training time.
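This freeze-and-accelerate idea can be sketched in plain Python: the convolutional base receives a learning-rate factor of 0 (frozen), while the replacement layer receives a factor of 20, mirroring the factors later listed in Table 4. The layer names, weight values, and gradients below are illustrative, not taken from the paper.

```python
# Sketch of the fine-tuning strategy: frozen base, fast-learning new layer.
BASE_LR = 0.001  # small global learning rate, as used in the experiments

# Per-layer learning-rate factors: 0 freezes, 20 accelerates (cf. Table 4).
layers = {
    "conv_base": {"weight": 1.0, "lr_factor": 0.0},   # frozen pre-trained base
    "fc_new":    {"weight": 0.5, "lr_factor": 20.0},  # replacement dense layer
}

def sgd_step(layers, grads):
    """Apply one plain SGD update, scaled by each layer's learning-rate factor."""
    for name, layer in layers.items():
        layer["weight"] -= BASE_LR * layer["lr_factor"] * grads[name]

grads = {"conv_base": 0.3, "fc_new": 0.3}  # hypothetical gradients
sgd_step(layers, grads)
# conv_base stays at 1.0 (frozen); fc_new moves to 0.5 - 0.001*20*0.3 = 0.494
```

The same mechanism underlies per-layer learning-rate multipliers in most deep learning frameworks.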
4 Experiment Protocol
All experiments and analyses in this study were conducted on an HP Pavilion 15 Notebook with 8 GB of memory and a GeForce 840M graphics card, covering the data collection phase, the learning phase for all five pre-trained CNNs (which took over thirty-five hours of training time), and the testing phase. Early stopping based on the validation loss is applied during training. The trained networks are tested using two methods: (i) offline mode and (ii) real-time mode.
Fig. 1. The features of forensic gait: (a) bending, (b) squatting with heels down, (c) squatting
with heels up, and (d) kneeling with heels up.
same pace as the pre-trained layers. The regularization factor hyperparameter of the
series network is fixed throughout the learning process.
This section discusses the experimental analysis carried out and the results attained. The input to the remodeled pre-trained CNNs consists of 9558 color images per class (normal and anomaly), selected with regard to the forensic gait feature requirements. The images are collected from participant footage recorded during data collection, together with augmented images. 7000 images are randomly selected for training and the rest serve as testing images. All images are resized according to the input requirements of the pre-trained networks: (i) AlexNet, 227×227; (ii) GoogLeNet, 224×224; (iii) Inception-v3, 299×299; (iv) ResNet-50, 224×224; and (v) ResNet-101, 224×224.
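The dataset bookkeeping above can be summarized in a short sketch; all counts and sizes are copied from the text, and the dictionary lookup is merely illustrative.

```python
# Per-class train/test split described in the text.
IMAGES_PER_CLASS = 9558
TRAIN_PER_CLASS = 7000
TEST_PER_CLASS = IMAGES_PER_CLASS - TRAIN_PER_CLASS  # remaining test images

# (height, width) input requirement of each pre-trained network.
INPUT_SIZE = {
    "AlexNet": (227, 227),
    "GoogLeNet": (224, 224),
    "Inception-v3": (299, 299),
    "ResNet-50": (224, 224),
    "ResNet-101": (224, 224),
}

def target_size(network):
    """Return the size every image must be resized to for `network`."""
    return INPUT_SIZE[network]
```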
The validation frequency is the same for each remodeled pre-trained CNN, but the mini-batch sizes differ according to the available hardware memory. The total number of iterations therefore varies between the remodeled networks, although the maximum number of epochs is fixed at 1000 for each. A small learning rate of 0.001 is used to ensure steady convergence between the pre-trained layers and the newly added layers. Stochastic Gradient Descent with Momentum is chosen, with momentum β set to 0.9 for better navigation towards the global minimum, and validation is performed at regular intervals of 50 iterations.
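The optimizer configuration above can be illustrated with a minimal Stochastic Gradient Descent with Momentum loop. The quadratic objective below is a stand-in for the network loss, used only to demonstrate the update rule v ← βv + ∇f(w), w ← w − ηv with η = 0.001 and β = 0.9; the exact variant implemented by the training software may differ slightly.

```python
# Minimal sketch of SGD with Momentum using the settings from the text.
LR, BETA = 0.001, 0.9  # learning rate and momentum

def sgdm(grad, w, steps):
    """Run `steps` momentum updates: v <- beta*v + grad(w); w <- w - lr*v."""
    v = 0.0
    for _ in range(steps):
        v = BETA * v + grad(w)
        w = w - LR * v
    return w

# Illustrative objective f(w) = w^2 with gradient 2w; the minimum is at w = 0.
w = sgdm(lambda w: 2.0 * w, w=1.0, steps=2000)
```

With these settings the velocity term accumulates gradient history, so the iterate glides toward the minimum faster than plain SGD with the same small learning rate.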
The effectiveness of each remodeled pre-trained CNN is investigated using two methods: (i) offline mode, which tests the remodeled networks on video input downloaded from YouTube channels showing real housebreaking events captured by CCTV in Malaysia, and (ii) real-time mode, which tests the remodeled networks on a live webcam feed in the laboratory.
Table 4. (continued)

Inception-v3, ResNet-50, ResNet-101 (DAG networks)

Pre-trained CNNs:
  Layer (end−3): Fully Connected
    Input size 2048; Output size 1000
    Learning rate of weight 1; Learning rate of bias 1
    Regularization factor of weight 1; Regularization factor of bias 0
  Layer (end−2): Softmax Layer
  Layer (end−1): Classification Output Layer
    Output size 1000

Remodeled pre-trained CNNs:
  Layer (end+1): Fully Connected
    Input size 'auto'; Output size 2
    Learning rate of weight 20; Learning rate of bias 20
    Regularization factor of weight 1; Regularization factor of bias 0
  Layer (end+2): Softmax Layer
  Layer (end+3): Classification Output Layer
    Output size 'auto'
Table 5. Final dense layers of CNNs before and after transfer learning process.

AlexNet
  Pre-trained: layer 23 'fc8', fully connected (1000 fully connected); layer 24 'prob', softmax; layer 25 'output', classification output (crossentropyex)
  Remodeled: layer 23 'special_2', fully connected (64 fully connected); layer 24 'relu', ReLU; layer 25 'fc8_2', fully connected (2 fully connected); layer 26 'softmax', softmax; layer 27 'classoutput', classification output (crossentropyex)

GoogLeNet
  Pre-trained: layer 142 'loss3-classifier', fully connected (1000 fully connected); layer 143 'prob', softmax; layer 144 'output', classification output (crossentropyex)
  Remodeled: layer 142 'fc', fully connected (2 fully connected); layer 143 'softmax', softmax; layer 144 'classoutput', classification output (crossentropyex)

Inception-v3
  Pre-trained: layer 314 'prediction', fully connected (1000 fully connected); layer 315 'prediction_softmax', softmax; layer 316 'classificationLayer_prediction', classification output (crossentropyex)
  Remodeled: layer 314 'fc', fully connected (2 fully connected); layer 315 'softmax', softmax; layer 316 'classoutput', classification output (crossentropyex)

ResNet-50
  Pre-trained: layer 175 'fc1000', fully connected (1000 fully connected); layer 176 'fc1000_softmax', softmax; layer 177 'classificationLayer_fc1000', classification output (crossentropyex)
  Remodeled: layer 175 'fc', fully connected (2 fully connected); layer 176 'softmax', softmax; layer 177 'classoutput', classification output (crossentropyex)

786 H. A. Razak et al.

ResNet-101
  Pre-trained: layer 345 'fc1000', fully connected (1000 fully connected); layer 346 'prob', softmax; layer 347 'classificationLayer_prediction', classification output (crossentropyex)
  Remodeled: layer 345 'fc', fully connected (2 fully connected); layer 346 'softmax', softmax; layer 347 'classoutput', classification output (crossentropyex)
The remodeled pre-trained DAG networks learned the anomaly images better than the normal images. Moreover, higher numbers of trainable parameters were observed to produce fewer misclassified images: GoogLeNet (6.79 million trainable parameters) misclassified 147 images, Inception-v3 (23 million) misclassified 114 images, and ResNet-50 (25.6 million) misclassified 80 images. Figure 2 shows examples of misclassified (false positive and false negative) images from the classification process.
Fig. 2. Misclassified images of remodeled pre-trained CNNs (a) normal behavior and
(b) anomalous behavior
Fig. 3. Offline mode test to detect behaviors at the gate using remodeled pre-trained CNNs
(a) Normal behavior and (b) Anomalous behavior
Fig. 4. Real-time mode test for detecting behaviors at the gate using remodeled pre-trained CNNs: (a) normal behavior: (i) opening or sliding the gate, (ii) unlocking the padlock or latch, (iii) other activities; (b) anomalous behavior: (i) lurking or sneaking, (ii) breaking the gate using a tool, (iii) wildly shaking the gate; (c) detection in a dusky environment without the multi-condition gate; and (d) two perpetrators.
the scene of a person walking towards the gate with full bags in hand. Steps 3 to 5 and steps 9 to 11 are detected as anomalies due to the squatting posture while putting down and picking up the bags. Steps 6 to 8 are considered normal bending posture, rather than anomalous bending, during opening of the lock. Steps 1 to 2 and steps 12 to 15 are considered normal walking towards the gate and the frontage of the house. The anomalous behavior steps in Fig. 5(d) are acted out by the participant to represent typical housebreaking actions, namely wildly shaking the gate (steps 10 to 25). The first three steps represent lurking in an upright standing posture, and steps 4 to 9 are classified alternately as normal and anomalous as the participant stands and bends extensively while checking the gate locks. Detection with the remodeled GoogLeNet was faster, completing in 40 to 80 ms per frame (about 16 fps), compared to the remodeled Inception-v3, which took 0.2 s per frame (5 fps).
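The reported per-frame times convert to frame rates as follows; the 0.0625 s value is an assumed mid-range of the 40 to 80 ms interval, chosen because it reproduces the stated 16 fps.

```python
# Convert per-frame processing time to frames per second.
def fps(seconds_per_frame):
    return 1.0 / seconds_per_frame

googlenet_fps = fps(0.0625)  # assumed mid-range of 40-80 ms, giving 16 fps
inception_fps = fps(0.2)     # 0.2 s per frame, giving 5 fps
```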
Fig. 5. Detection time (s) vs. step number for (a) Course 1, (b) Course 2, (c) Course 3, and (d) Course 4. Courses 1 and 3 contain 15 steps; Courses 2 and 4 contain 25 steps.
6 Conclusion
References
1. Sidhu, A.C.P.A.S.: The rise of crime in Malaysia: an academic and statistical analysis. J. Kuala Lumpur R. Malaysia Police Coll. 4, 1–28 (2005)
2. Hamid, L.A., Toyong, N.M.P.: Rural area, elderly people and the house breaking crime.
Proc. Soc. Behav. Sci. 153, 443–451 (2014)
3. Soh, M.C.: Crime and urbanization: revisited Malaysian case. Proc. Soc. Behav. Sci. 42, 291–299 (2012)
4. Marzbali, M.H., Abdullah, A., Razak, N.A., Tilaki, M.J.M.: The relationship between socio-
economic characteristics, victimization and CPTED principles: evidence from the MIMIC
model. Crime Law Soc. Chang. 58(3), 351–371 (2012)
5. Chris, K., Natalia, C.-M., Carys, T., Rebbecca, A.: Burglary, vehicle and violent crime. In:
The 2001 British Crime Survey. First Results, England and Wales, vol. 18, pp. 23–27. Home
Office Statistical Bulletin, Queen Anne’s Gate, London (2001)
6. Van Dijk, J.J.M., Mayhew, P., Killias, M.: Victimization rates. In: Experiences of Crime
across the World: Key findings of the 1989 International Crime Survey, pp. 23–25. Kluwer
Law and Taxation Publishers, Deventer (1990)
7. Murphy, R., Eder, S.: Acquisitive and other property crime. In: Flatley, J., Kershaw, C.,
Smith, K., Chaplin, R., Moon, D. (eds.) Crime in England and Wales 2009/10, Third Edit.,
vol. 12, pp. 79–87. Home Office Statistical Bulletin, Marsham Street, London (2010)
8. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv.
1–72 (2009)
9. Lawson, W., Hiatt, L.: Detecting anomalous objects on mobile platforms. In: 2016 IEEE
Conference on Computer Vision Pattern Recognition Working, pp. 1426–1433 (2016)
10. Mohammadi, S., Perina, A., Kiani, H., Murino, V.: Angry crowd: detecting violent events in
videos. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision - ECCV 2016.
LNCS, vol. 9911, pp. 3–18. Springer, Cham (2016)
11. Sultani, W., Chen, C., Shah, M.: Real-world anomaly detection in surveillance videos. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6479–6488 (2018)
12. Tay, N.C., Tee, C., Ong, T.S., Goh, K.O.M., Teh, P.S.: A robust abnormal behavior
detection method using convolutional neural network. In: Alfred, R., Lim, Y., Ibrahim, A.,
Anthony, P. (eds.) Computational Science and Technology. Fifth International Conference
on Computational Science and Technology. Lecture Notes in Electrical Engineering, vol.
481, pp. 37–47. Springer, Singapore (2019)
13. Al-Dhamari, A., Sudirman, R., Mahmood, N.H.: Abnormal behavior detection in automated
surveillance videos: a review. J. Theor. Appl. Inf. Technol. 95(19), 5245–5263 (2017)
14. Delgado, B., Tahboub, K., Delp, E.J.: Automatic detection of abnormal human events on
train platforms. In: IEEE National Aerospace and Electronics Conference (NAECON 2014),
pp. 169–173 (2014)
15. Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
16. Almisreb, A.A., Jamil, N., Md Din, N.: Utilizing AlexNet deep transfer learning for ear
recognition. In: 2018 Fourth International Conference on Information Retrieval and
Knowledge Management, pp. 8–12 (2018)
17. Andrew, J.T.A., Tanay, T., Morton, E.J., Griffin, L.D.: Transfer representation-learning for
anomaly detection. In: Proceedings of 33rd International Conference on Machine Learning
Research, New York, USA, vol. 48, pp. 1–5 (2016)
18. Ali, A.M., Angelov, P.: Anomalous behaviour detection based on heterogeneous data and
data fusion. Soft. Comput. 22(10), 3187–3201 (2018)
19. Sabokrou, M., Fayyaz, M., Fathy, M., Moayed, Z., Klette, R.: Deep-anomaly: fully convolutional neural network for fast anomaly detection in crowded scenes. J. Comput. Vis. Image Underst. 1–30 (2018). (arXiv00866v2 [cs.CV])
20. Huang, Z., Pan, Z., Lei, B.: Transfer learning with deep convolutional neural network for
SAR target classification with limited labeled data. Remote Sens. 9(907), 1–21 (2017)
21. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural
networks? Adv. Neural. Inf. Process. Syst. 27, 1–14 (2014)
22. Chollet, F.: Deep learning for computer vision: using a pretrained convnet. In: Deep Learning with Python, pp. 143–159. Manning, Shelter Island (2018)
23. Ali, A.M., Angelov, P.: Applying computational intelligence to community policing and
forensic investigations. In: Bayerl, P.S., Karlovic, R., Akhgar, B., Markarian, G. (eds.)
Advanced Sciences and Technologies for Security Applications: Community Policing - A
European Perspective, pp. 231–246. Springer, Cham (2017)
24. Lu, J., Yan, W.Q., Nguyen, M.: Human behaviour recognition using deep learning. In: 2018
15th IEEE International Conference on Advanced Video and Signal Based Surveillance,
pp. 1–6 (2018)
25. Hospedales, T., Gong, S., Xiang, T.: A Markov clustering topic model for mining behaviour
in video. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1–8 (2009)
26. Zhang, C., Li, R., Kim, W., Yoon, D., Patras, P.: Driver behavior recognition via interwoven
deep convolutional neural nets with multi-stream inputs. arXiv:1811.09128v1 [cs.CV], pp.
1–10 (2018)
27. Pang, Y., Syu, S., Huang, Y., Chen, B.: An advanced deep framework for recognition of
distracted driving behaviors. In: 2018 IEEE 7th Global Conference on Consumer
Electronics, pp. 802–803 (2018)
28. Arifoglu, D., Bouchachia, A.: Activity recognition and abnormal behaviour detection with
recurrent neural networks. In: 14th International Conference on Mobile Systems and
Pervasive Computing (MobiSPC 2017), vol. 110, pp. 86–93 (2017)
29. Kröse, B., van Oosterhout, T., Englebienne, G.: Video surveillance for behaviour monitoring
in home health care. Proc. Meas. Behav. 2014, 2–6 (2014)
30. Leixian, S., Zhang, Q.: Fall behavior recognition based on deep learning and image
processing. Int. J. Mob. Comput. Multimed. 9(4), 1–16 (2019)
31. Xu, H., Li, L., Fang, M., Zhang, F.: Movement human actions recognition based on machine
learning. Int. J. Online Biomed. Eng. 14(4), 193–210 (2018)
32. Datta, A., Shah, M., Da Vitoria Lobo, N.: Person-on-person violence detection in video data.
In: Proceedings of International Conference on Pattern Recognition, vol. 16, no. 1, pp. 433–
438 (2002)
33. Gao, Y., Liu, H., Sun, X., Wang, C., Liu, Y.: Violence detection using oriented violent flows. Image Vis. Comput. 48–49, 37–41 (2016)
34. Kooij, J.F.P., Liem, M.C., Krijnders, J. D., Andringa, T., Gavrila, D.M.: Multi-modal human
aggression detection. Comput. Vis. Image Underst. 1–35 (2016)
35. Patil, S., Talele, K.: Suspicious movement detection and tracking based on color histogram. In:
2015 International Conference on Communication, Information and Computing Technology,
pp. 1–6 (2015)
36. Zhu, Y., Wang, Z.: Real-time abnormal behavior detection in elevator. In: Zhang, Z., Huang,
K. (eds.) Intelligent Visual Surveillance. IVS 2016. Communications in Computer and
Information Science, vol. 664, pp. 154–161. Springer, Singapore (2016)
37. Ben Ayed, M., Abid, M.: Suspicious behavior detection based on DECOC classifier. In: 18th
International Conference on Sciences and Techniques of Automatic Control and Computer
Engineering, pp. 594–598 (2017)
38. Yu, B.: Design and implementation of behavior recognition system based on convolutional
neural network. In: ITM Web Conference, vol. 12, no. 01025, pp. 1–5 (2017)
39. He, L., Wang, D., Wang, H.: Human abnormal action identification method in different
scenarios. In: Proceedings of 2011 2nd International Conference on Digital Manufacturing
and Automation ICDMA 2011, pp. 594–597 (2011)
40. Min, W., Cui, H., Han, Q., Zou, F.: A scene recognition and semantic analysis approach to
unhealthy sitting posture detection during screen-reading. Sensors (Basel) 18(9), 1–22
(2018)
41. Nazare, T.S., de Mello, R.F., Ponti, M.A.: Are pre-trained CNNs good feature extractors for
anomaly detection in surveillance videos? arXiv:1811.08495v1 [cs.CV], pp. 1–6 (2018)
42. Lee, J., Kim, H., Lee, J., Yoon, S.: Transfer learning for deep learning on graph-structured data. In:
Proceedings of Thirty-First AAAI Conference on Artificial Intelligence, pp. 2154–2160 (2017)
43. Bell, M.O.: Computational complexity of network reliability analysis: an overview. IEEE Trans. Reliab. R-35(3), 230–239 (1986)
44. The Mathworks: Series network for deep learning – MATLAB (2016). https://www.
mathworks.com/help/deeplearning/ref/seriesnetwork.html. Accessed 12 June 2019
45. Vedaldi, A., Lenc, K., Gupta, A.: MatConvNet - convolutional neural networks for
MATLAB. arXiv:1412.4564 [cs.CV], pp. 1–59 (2015)
46. The Mathworks: Directed acyclic graph (DAG) network for deep learning - MATLAB (2017). https://www.mathworks.com/help/deeplearning/ref/dagnetwork.html. Accessed 12 June 2019
47. Sahner, R.A., Trivedi, K.S.: Performance and reliability analysis using directed acyclic
graphs. IEEE Trans. Softw. Eng. SE-13(10), 1105–1114 (1987)
48. Bang-Jensen, J., Gutin, G.Z.: Acyclic digraphs. In: Digraphs: Theory, Algorithms and Applications, Second Edition. Monographs in Mathematics, pp. 32–34. Springer, London (2009)
49. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional
neural networks. In: Advance in Neural Information Processing Systems, vol. 25, pp. 1–9
(2012)
50. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. In: 3rd International Conference on Learning Representation (ICLR 2015),
pp. 1–14 (2015)
51. Szegedy, C., et al.: Going deeper with convolutions. In: IEEE Conference on Computer
Vision Pattern Recognition (CVPR 2015), pp. 1–9 (2015)
52. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J.: Rethinking the inception architecture for
computer vision. In: 2016 IEEE Conference on Computer Vision Pattern Recognition
(CVPR 2016), pp. 2818–2826 (2016)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016
IEEE Conference on Computer Vision Pattern Recognition, pp. 770–778 (2016)
54. Zagoruyko, S., Komodakis, N.: Wide residual networks. arXiv:1605.07146v4 [cs.CV], pp.
1–15 (2017)
55. Liu, Y.Y., Slotine, J.-J., Barabasi, A.-L.: Control centrality and hierarchical structure in
complex networks. PLoS One 7(9), 1–7 (2012)
Occluded Traffic Signs Recognition
1 Introduction
Automatic traffic sign recognition plays a vital role in intelligent self-driving vehicles and driver assistance systems. Especially in the greater Taipei area, including Taipei and New Taipei City, the heavy traffic of cars and motorcycles and a rapidly growing number of aged drivers make a reliable sign-recognition system even more crucial for safe driving. In real-world road scenes, the quality of traffic sign images is easily affected by conditions that are difficult to control, such as weather, illumination, or occlusion. In this article, we focus on the case of occluded traffic signs. Occlusion is a common problem caused by trees, pedestrians, or other signs, and the resulting incomplete information makes sign recognition difficult.
Recently, many object recognition tasks have been solved using convolutional neural networks (CNNs). Their high recognition rates and fast execution have made the superior performance of CNNs on various computer vision tasks widely known. We propose a CNN-based recognition system for occluded traffic signs. It utilizes multiple local masks to learn regional features of traffic signs. With auxiliary classifiers, robust local features are learned and the problem of incomplete information is greatly alleviated. In the following sections, we give a brief introduction to related works, followed by the proposed method, experiments, and, finally, conclusion and future work.
2 Related Works
The task of recognizing traffic sign images consists of detecting a sign and then recognizing the detected one. As mentioned, many issues make the automatic detection of traffic signs difficult. Hence, features from both the color and the shape of signs are used for detection [1–3]. Greenhalgh et al. proposed a system that detects candidate regions as Maximally Stable Extremal Regions (MSERs), a method robust to variations in lighting and illumination in the scene. These candidate regions were then classified using HOG features with a cascade of Random Forests [1]. Kuang et al. improved Greenhalgh's method by applying image contrast enhancement before MSER, which makes sign detection more robust to illumination and rotation scale. In the recognition phase, they also used Random Forests, but with an HSV-HOG-LBP feature descriptor [2].
Conventional methods for traffic sign recognition, such as those mentioned above, usually consist of feature extraction and classifier training. Features like color, HOG, SIFT, SURF, and LBP, or their improved versions, are combined with classifiers like LDA, SVM, AdaBoost, or Random Forest [4]. However, a convolutional neural network can be trained without hand-designed features, which makes it popular nowadays. For example, Ciresan et al. used an ensemble of CNNs whose final prediction averages the output scores of all columns, boosting classification accuracy with a low error rate [5, 6]. In 2011, a traffic sign classification challenge on the German Traffic Sign Recognition Benchmark (GTSRB) dataset was held at the International Joint Conference on Neural Networks (IJCNN), and [6] won first place with an accuracy rate better than a human's [7]. Singh et al. used HSV color features to extract sign candidates and trained a classifier with a CNN [8]; they reported 97.92% accuracy when tested on the GTSRB dataset [7].
When a sign is occluded, the recognition task becomes more difficult due to insufficient visual cues. Hou et al. [9] proposed a two-stage recognition system to solve the occlusion problem. First, it extracts HOG features to classify the shape category of the sign using pairwise one-versus-one SVMs. With the SVMs' voting strategy, the correct shape can be determined even if the sign is partially occluded. An SVM is then trained to determine the exact label of the traffic sign within that shape category.
Occlusion in tracking is a well-studied issue. Yang et al. [10] utilized a Siamese network to capture features of the exemplar image and the search image as the kernel mask and the candidate area, respectively. The score map is obtained by correlating the kernel mask with the candidate area. In addition, they used multiple (2-by-2) position-sensitive score maps to alleviate the occlusion effect. Finally, the four position-sensitive score maps are merged into a final score map to locate the target object. Our method is inspired by this viewpoint.
In this article, a CNN-based system is proposed to recognize traffic signs, especially occluded signs, in the greater Taipei area. Recognizing traffic signs in the brief time available in busy traffic is distracting, not to mention the cluttered street scenes of the area. To train a system that works for the Taipei area, it is better to use training images reflecting the true environment. In addition, the public GTSRB dataset has only limited occluded sign images. For these reasons, we collect our own
796 S.-H. Yen et al.
training images of the Taipei area. We captured some traffic sign images ourselves, but most were collected from Google Maps [11]. These images were all taken in the greater Taipei area. By analyzing the statistics of the collected data, the twenty most common traffic signs are used as our recognition targets.
Fig. 1. The architecture diagram of our method which is made of three parts: global feature
extraction, local feature extraction, and classification.
3 Methodology
The architecture of the system is shown in Fig. 1. Given an input sign image, the global
feature is first extracted, followed by five local feature extractions. The local feature
extractor together with auxiliary classifier is to learn important features if only partial
feature is available. Finally, the label of the sign is predicted by a concatenation of local
features with a Softmax classifier. To illustrate the proposed system in details, we first
explain the training data followed by system architecture. Then feature maps will be
discussed especially the differences between the global and local features.
In preparing training and testing images, we crop each image with a rectangular box roughly inscribing the sign, then resize the cropped image to 64×64 and annotate it with a label. Figure 2 shows some training images and some testing images.
Fig. 2. Some sample images from our dataset. From left to right, the signs are "Dividing Ways", "Right Bend", "Speed Limit (50 km/h)", and "Two-Stage Left Turn for Bicycles/Motorcycles". The first row shows training images and the bottom row shows testing images, which are occluded. These images are resized to 64×64.
Since the distribution of collected traffic signs is not uniform, we first enlarge the data for classes with fewer than 1000 samples, replicating images with random horizontal and/or vertical translation until each class reaches 1000 samples. We take these images of each sign as the original training images and augment them by randomly adding noise, rotating by ±10°, ±15°, or ±25°, and synthesizing occlusion blobs. In this way, each traffic sign has 9,000 images, for a total of 180,000 training images. During training, we randomly take 80% for training and 20% for validation.
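The augmentation arithmetic can be checked with a short sketch; the factor of 9 augmented variants per original image is inferred from the stated totals (9,000 per class from 1,000 originals), not given explicitly.

```python
# Bookkeeping for the augmentation pipeline described above.
ORIGINALS_PER_CLASS = 1000  # each class padded to 1000 originals
AUG_VARIANTS = 9            # variants per original (inferred: 9000 / 1000)
CLASSES = 20                # twenty most common traffic signs

per_class = ORIGINALS_PER_CLASS * AUG_VARIANTS  # 9000 images per sign
total = per_class * CLASSES                     # 180000 training images
train = total * 8 // 10                         # 80% for training
val = total - train                             # 20% for validation
```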
Fig. 3. Global feature extraction. The detected sign image of size 64×64×3 is the input to the system. After global feature extraction, the output feature maps are of size 14×14×64 (output of layer 5). In the figure, convolution, ReLU, and max-pooling operations are represented as purple, brown, and green bars, respectively. The extractor is made of 5 layers, where layers 1, 2, and 4 are 3×3 convolutions (stride 1) followed by ReLU, and layers 3 and 5 are 2×2 max pooling with stride 2.
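Assuming unpadded ("valid") convolutions, which is what reproduces the stated 64 → 14 reduction, the spatial size can be traced through the five layers:

```python
# Trace the spatial size through the five-layer global extractor of Fig. 3.
def conv3x3(n):
    """3x3 convolution, stride 1, no padding: each side shrinks by 2."""
    return n - 2

def pool2x2(n):
    """2x2 max pooling, stride 2: each side is halved."""
    return n // 2

n = 64            # input side length
n = conv3x3(n)    # layer 1: 62
n = conv3x3(n)    # layer 2: 60
n = pool2x2(n)    # layer 3: 30
n = conv3x3(n)    # layer 4: 28
n = pool2x2(n)    # layer 5: 14, matching the 14x14x64 output in the caption
```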
Fig. 4. Local feature extraction. (Top) The input comes from layer 5 described in Fig. 3. It is copied into 5 parallel networks, each with only one quarter of the features retained and zeros for the rest. (Bottom) Each network performs the operations as described. At layer 10, a dense operation maps the flattened 4096 nodes to 20 nodes.
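The quarter-feature masking can be sketched as follows. The placement of a fifth, centered mask is an assumption for illustration; the figure only states that each parallel network retains one quarter of the features and zeroes the rest.

```python
# Sketch of the local masks applied to the 14x14 global feature maps:
# each mask keeps a 7x7 window (one quarter of positions) and zeroes the rest.
N = 14       # spatial side of the global feature maps
H = N // 2   # 7: side of the retained window

def window_mask(grid, top, left):
    """Zero everything outside an HxH window whose corner is (top, left)."""
    return [[grid[r][c] if top <= r < top + H and left <= c < left + H else 0
             for c in range(N)] for r in range(N)]

grid = [[1] * N for _ in range(N)]             # dummy feature map of ones
upper_left = window_mask(grid, 0, 0)           # one of the four quadrants
center = window_mask(grid, N // 4, N // 4)     # centered window (assumed)

kept = sum(map(sum, upper_left))               # 49 of 196 positions retained
```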
Classification. The traffic sign is recognized by a Softmax classifier with the help of five auxiliary classifiers, as explained in Fig. 5. To make the subnetworks learn local features, we attach an auxiliary (Softmax) classifier to each of the five subnetworks. These local features are also fused together to classify the traffic sign. For each subnetwork, the output of layer 10, a vector of dimension 20 just before Softmax, is activated by ReLU and concatenated with the outputs of the other subnetworks, resulting in a column vector of 100 nodes. This column vector is densely connected to 20 nodes, followed by Softmax, for classification. We call this classifier the Final Classifier. The loss function therefore consists of the loss functions of the five (regional) auxiliary classifiers and that of the overall final classifier. As given in (1), all loss functions are categorical cross-entropy, where α is empirically set to 0.2.
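A minimal sketch of this combined objective follows. Since Eq. (1) is not reproduced here, the exact weighting is an assumption: α is taken to scale the five auxiliary cross-entropy terms added to the final classifier's term, and the probability vectors are illustrative.

```python
import math

ALPHA = 0.2  # empirical auxiliary-loss weight from the text

def cross_entropy(p, true_idx):
    """Categorical cross-entropy for one sample: -log p[true class]."""
    return -math.log(p[true_idx])

def total_loss(final_p, aux_ps, true_idx):
    """Final-classifier loss plus alpha-weighted auxiliary losses (assumed form)."""
    loss = cross_entropy(final_p, true_idx)
    loss += ALPHA * sum(cross_entropy(p, true_idx) for p in aux_ps)
    return loss

# Hypothetical 3-class outputs for one sample whose true class is 0.
final_p = [0.8, 0.1, 0.1]
aux_ps = [[0.6, 0.3, 0.1]] * 5
```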
Fig. 5. Auxiliary Classifiers and the Final Classifier in the system. The output of layer 10 is used for both the auxiliary classifiers and the final classifier: the former learn different local features of traffic signs, while the latter classifies the traffic sign after fusing the learnt local features.
correctly by our system. Figure 8 gives two more occluded examples; their local features show high response in the sign area and low response in the occluded area.
Fig. 6. Traffic sign "Speed Limit (50)" and the heat map of its global feature (output of layer 5). Red and blue represent high and low responses in the heat map. While the heat map in (a) captures the important features of the sign, the occluded one in (b) mistakenly responds to the green sign on the right of the image.
Fig. 7. Heat maps of local features (output of layer 8) of the occluded traffic sign from Fig. 6(b). The heat maps of the upper-left and lower-left regions give high response, the upper-right and lower-right regions give low response, and the center region gives a middle response. These heat maps indicate that the subnetworks learn the local features of the sign well.
Fig. 8. More heat maps of global and local features. The first row, (a) and (b), shows occluded images and their global features. Row (c) shows the local features of (a) "No U Turn", and row (d) the local features of (b) "Watch Out for Pedestrians".
4 Experiment
In the confusion matrix of Fig. 9, blank cells represent zeros. The diagonal entries are almost all 1s, with only three classes below 90%. Among the 20 signs, the third class ("Straight"), the thirteenth class ("Right Bend"), and the fifteenth class ("Watch Out for Children") have low accuracy rates of 0.778, 0.769, and 0.857, respectively. One main reason for the low accuracy is the small number of test samples. Figure 10 shows the erroneously classified test images of these three classes: (first row) 2 errors out of 9 test images of "Straight"; (middle row) 3 errors out of 13 test images of "Right Bend"; and (third row) 1 error out of 7 test images of "Watch Out for Children". From these erroneously classified images, we conclude that classification is prone to fail when the occlusion is severe or combined with blurring.
To further explore whether the auxiliary classifiers boost the performance of our system, we trained an identical system on the same training samples but without auxiliary classifiers. Its accuracy on the 330 occluded test images is 85%. Compared to the 97% of the proposed system, it is clear that the auxiliary classifiers help the machine learn robust and discriminative local features of different signs.
However, our system is trained on signs collected from Taipei and its vicinity, and only for 20 different signs. For comparison, we carefully designed two tests on the GTSRB dataset: one selects traffic signs from GTSRB similar to ours and fine-tunes the system; the other retrains our system on the GTSRB training data. For the first test, we select 5 signs similar to Taipei's: No Entry, Speed Limit (50), Speed Limit (70), Keep Right, and Keep Left. Among these five signs, the smallest training set in GTSRB contains 210 images, so we randomly take 210 training images of each sign for fine-tuning and test on the GTSRB test images. The result is an accuracy rate of 93.5%. Although the system is fine-tuned using only 1050 images, it still achieves a satisfactory result.
Next, we would like to know whether the proposed architecture is able to work on
the GTSRB traffic sign database in general. Among the 43 signs, we discard the 3 signs
with the fewest training images and divide the remaining data into groups of 20 signs (to
fit our architecture), then retrain the system. Without loss of generality, according to their
order in GTSRB (after the 3 signs are discarded), the dataset is divided into 4 groups of 20
signs: 1–20, 21–40, odd numbered 1, 3, …, 39, and even numbered 2, 4, …, 40. We
train the system four times using these 4 groups and test on the corresponding test
images, respectively. The test results are summarized in Table 1. The average accuracy
is 0.9803 ± 0.0023. Our result is comparable with the accuracy rates of existing
methods, Ciresan's 0.9915 [5] and Singh's 0.9792 [8], although both methods classify
all 43 classes of GTSRB, which is more difficult. We want to point out that although the
proposed system is designed to solve the occlusion problem, the experiment
demonstrates that the system can capture important features even under different
lighting, weather conditions, and/or motion blur.
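The four 20-class groupings described above can be generated mechanically. A minimal sketch, assuming the 40 retained GTSRB classes are renumbered 1–40:

```python
# Renumber the 40 retained GTSRB classes as 1..40, then form the four groups.
classes = list(range(1, 41))
groups = {
    "first half": classes[:20],    # signs 1-20
    "second half": classes[20:],   # signs 21-40
    "odd": classes[0::2],          # 1, 3, ..., 39
    "even": classes[1::2],         # 2, 4, ..., 40
}
# Every group holds exactly 20 signs, and each class is covered by
# exactly two of the four training/testing runs.
for g in groups.values():
    assert len(g) == 20
for c in classes:
    assert sum(c in g for g in groups.values()) == 2
```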
5 Conclusion
In this paper, we illustrated how to use local features to solve the occluded traffic sign
recognition problem. With five local feature extractors, each equipped with a Softmax
classifier, the system can learn discriminative partial features of traffic signs. The
experimental results are promising. However, we also found a few samples that can
be classified correctly using the global feature alone but are erroneously classified by the
system. This indicates that it should be beneficial to combine global and
local features together.
Another issue concerns the benchmark dataset. Although we collected our own dataset,
its size is too small. The GTSRB dataset currently contains mostly illumination
variation and motion blur; it does not contain enough occlusion samples. In addition, the
distribution of samples across the traffic sign classes is imbalanced, which can negatively
impact recognition performance. Last but not least, accuracy on the current dataset has
reached saturation. A new balanced dataset that contains signs from different regions,
captured against more complex backgrounds under different conditions, is needed in this
research field.
Acknowledgment. The authors would like to thank the Ministry of Science and Technology
(MOST) of R.O.C. for financially supporting this research in part under grant number MOST
107-2221-E-032-035-MY2.
References
1. Greenhalgh, J., Mirmehdi, M.: Real-time detection and recognition of road traffic signs.
IEEE Trans. Intell. Transp. Syst. 13(4), 1498–1506 (2012)
2. Kuang, X., Fu, W., Yang, L.: Real-time detection and recognition of road traffic signs using
mser and random forests. Int. J. Online Biomed. Eng. (iJOE) 14(3), 34–51 (2018)
3. Yang, Y., Luo, H., Xu, H., Wu, F.: Towards real-time traffic sign detection and
classification. IEEE Trans. Intell. Transp. Syst. 17(7), 2022–2031 (2016)
4. Saadna, Y., Behloul, A.: An overview of traffic sign detection and classification methods.
Int. J. Multimed. Inf. Retr. 6(3), 193–210 (2017)
5. Ciresan, D., Meier, U., Masci, J., Schmidhuber, J.: A committee of neural networks for
traffic sign classification. In: The 2011 International Joint Conference on Neural Networks,
pp. 1918–1921 (2011)
6. Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image
classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR
2012), pp. 3642–3649 (2012)
7. The German Traffic Sign Recognition Benchmark (GTSRB). http://benchmark.ini.rub.de/?
section=gtsrb&subsection=news. Accessed 01 Aug 2019
8. Singh, M., Pandey, M.K., Malik, L.: Traffic sign detection and recognition for autonomous
vehicles. Int. J. Adv. Res. Ideas Innov. Technol. (IJARIIT) 4(2), 1666–1670 (2018)
9. Hou, Y.L., Hao, X., Chen, H.: A cognitively motivated method for classification of occluded
traffic signs. IEEE Trans. Syst. Man Cybern.: Syst. 47(2) (2017)
10. Yang, L., Jiang, P., Wang, F., Wang, X.: Region based fully convolutional siamese networks
for robust real-time visual tracking. In: IEEE International Conference on Image Processing
(ICIP 2017)
11. Google Maps. https://maps.google.com/. Accessed 01 Aug 2019
Consumer Use Pattern
and Evaluation of Social Media
Based Consumer Information Source
1 Purpose
The characteristics of social media, such as openness, participation, and sharing, have
created a consumer-driven source of information and highly active interactions among
consumers on the network. Consumer-driven sources of information have different
attributes from the existing sources of information, which function in a one-way manner.
This study aims to explore the emerging consumer information environment by
investigating the characteristics of social media based consumer information source
(SMCIS) and analyzing consumers’ evaluation of the source.
This study first classifies the characteristics of SMCIS into information
characteristics and network characteristics to validate consumers' evaluation based on
previous research. Then we examine the patterns of using SMCISs and investigate
consumers' evaluation of the information and network characteristics of SMCIS.
Finally, we analyze consumers' evaluation of SMCIS according to their socio-economic
characteristics and usage patterns.
2 Background
Consumers use their own experience and memory, or gather the necessary information
from various channels, such as advertising, Internet sites, and people around them, in
the process of consumer information search [1]. In the participatory online
environment, online consumer-driven information, such as recommendations, user reviews,
and assessments of products from other consumers, has a significant impact on
consumers' purchase decisions [2]. Recently, the social media based consumer information
source (SMCIS) has been emerging as an interactive and participatory consumer-driven
information source.
Social media, a tool and platform that enables the transmission of documents,
pictures, videos, and music, as well as online communication, has led to the production
and sharing of information by social media users. The production and sharing of
consumer-driven information has gained momentum in an environment where consumers
can communicate and elaborate useful information by themselves. In this
context, consumers furthermore trust information received from SMCIS, such as
recommendations of products or user reviews generated by other consumers,
which also influence consumers' purchasing decisions [3].
According to the previous studies, the usefulness of SMCIS can be identified by the
evaluation of information characteristics and the evaluation of network characteristics.
The information characteristics include reliability, vividness, and variety [3–5]. The
network characteristics have been identified with its economics, interactivity, and
regenerativity [6, 7]. In this study, the evaluation of the characteristics of SMCIS will
be explored through reliability, vividness, variety, economics, interactivity, and
regenerativity. Consumers' evaluation of SMCIS will be identified through a case
study of “RED (www.xiaohongshu.com)”, China’s leading online information-sharing
community.
3 Method
3.1 Data
The number of users per region was calculated using the location information of RED
users from December 20, 2017, to December 20, 2018. The results showed that 56.5%
of the users were located in Shanghai, Guangdong, and Beijing out of the total 34
regions as shown in Fig. 1. In addition, Fig. 2 shows the user age distribution of RED.
In order to secure the validity and reliability of the survey results, the selected subjects
were Chinese consumers in their 20s to 40s in Shanghai, Guangdong, and Beijing who
had experience using SMCISs.
The responses from consumers aged 20 to 49 who have experience using SMCISs
and live in Beijing, Shanghai and Guangdong were collected by WenJuanXing, a
Chinese online research firm, for 10 days from December 1, 2018 to December 11,
2018. A total of 417 responses were collected and analyzed.
Fig. 2. User age distribution of RED users (y-axis: percentage of users, 0–60%; age groups: ≤19, 20–29, 30–39, 40–49, ≥50).
3.2 Measurement
The use pattern of SMCIS was measured by three items including Internet usage time
per day, most commonly used source of information, and the average usage of SMCIS
per week. The evaluation of the information and network characteristics of SMCIS was
measured by three subfactors of each. All measurement items were extracted from
previous research [8–11]. In this study, the reliability of information was defined as
“the extent to which the information provided by the information source is realistic,
truthful, neutral and objective.” Vividness was defined as “the extent to which the
information provides vivid feelings as if it had actually been seen and specific, detailed,
and realistic depiction.” Variety was defined as “the extent to which the information
addresses various topics, alternatives, and substantial content.” Economics was defined
as “the cost of exploring information.” Interactivity was defined as “the extent to which
the network helps them to share information among users and to build relationships.”
Regenerativity was defined as “the extent to which the network provides up-to-date
information in time.” Information characteristics include reliability (4 items), vividness
(4 items), and variety (4 items). Network characteristics include economics (3 items),
interactivity (4 items), and regenerativity (4 items). All items were measured on a 5-
point Likert scale (1 = not at all to 5 = perfectly). The measurements were examined
for the face validity and refined by two consumer researchers. A preliminary survey
was conducted on 45 consumers who have experience using SMCISs through an online
survey. The preliminary survey was conducted using test-retest on the same respon-
dents from November 16, 2018 to November 23, 2018 to verify the reliability. Items
with low reliability were deleted or modified.
3.3 Analysis
The analysis was conducted using SPSS 24.0. The Cronbach's alpha (α) coefficient,
which indicates internal consistency, was calculated to determine the reliability of the
measurements that were composed of several items. A validity test of the measurement items was
performed through exploratory factor analysis. T-test and ANOVA were performed to
examine the differences in SMCIS characteristics according to socio-demographic
characteristics.
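For reference, the reliability statistic used above is Cronbach's α = k/(k−1) · (1 − Σ item variances / variance of summed scores) for a k-item scale. A minimal pure-Python sketch (with illustrative data only, not the study's responses):

```python
from statistics import variance  # sample variance (ddof = 1)

def cronbach_alpha(rows):
    """rows: one list of k item scores per respondent.
    Returns Cronbach's alpha for the k-item scale."""
    k = len(rows[0])
    item_var_sum = sum(variance(col) for col in zip(*rows))  # sum of item variances
    total_var = variance([sum(r) for r in rows])             # variance of total scores
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Perfectly consistent responses (all items identical) give alpha of about 1.0:
alpha = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])
```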
4 Result
“I can find the information that I need without much effort” had the highest mean at
3.74, while “I can find the information that I need quickly” had the lowest at 3.72. The
highest-mean question in interactivity was “It allows me to build a bond of
sympathy with other people” at 3.83, and “It helps me build relationships” was the lowest at
3.65. The highest-mean question in regenerativity was “The information about
the provision and sharing of information among users is updated in real time” at 3.79, while
“The information provided is constantly being updated” was the lowest at 3.72.
5 Conclusions
The purpose of this study was to analyze the characteristics of SMCIS and identify how
consumers evaluate those characteristics, with the aim of helping consumers
understand social media as a newly emerging consumer information environment. In
addition, we studied the differences in the evaluation of SMCIS according to individual
characteristics and SMCIS usage patterns. The evaluation of information
characteristics consisted of three subfactors: reliability, vividness, and variety, while
the evaluation of network characteristics consisted of three subfactors: economics,
interactivity, and regenerativity.
The results of the study are as follows. First, among SMCIS users there were more
females than males, and the largest group consisted of college graduates in their 20s
with an average monthly income of more than 8000 yuan. According to the information
source usage pattern, most people spent more than five hours a day on the Internet, and
they often use online product reviews on SMCISs and community sites as sources of
information.
Second, the evaluation of the information and network characteristics of SMCIS
showed that regenerativity had the highest average, followed by economics, interactivity,
vividness, variety, and reliability.
Third, when it comes to personal characteristics, it was found that females rated
reliability higher than males, and the middle-income group rated vividness higher
than the low-income group. In terms of usage patterns, groups with an average
Internet usage time of less than three hours per day rated reliability higher than
those with more than three hours per day, and those who use SMCISs more than once
per day rated vividness higher than those who use them less than once per day.
References
1. YuJin, N., KyungJa, K.: Consumer confusion about information channels and information
contents in a multichannel environment. J. Korean Soc. Liv. Sci. 22(3), 455–471 (2013)
2. Bickart, B., Schindler, R.M.: Internet forums as influential sources of consumer information.
J. Interact. Mark. 15(3), 31–40 (2001)
3. TanWoo, P., KyungRyul, L.: Accessing social networking sites and accessing social
networking sites. Advert. Res. 100, 172–224 (2006)
4. JiSook, K., HyukKi, K.: A study on the influence of online oral information characteristics
on information acceptance and re-decision intention: focused on recipient’s expertise.
J. Korean Ind. Inf. Soc. 21(6), 81–93 (2016)
5. SangHyun, L., YongGil, J.: Influence of online oral characteristics on trust, oral acceptance
and purchase intent. J. Korean Cont. Soc. 16(9), 545–559 (2016)
6. WonWoo, J., SunJin, H., DongDong, C.: The impact of online community information
characteristics and pursuit benefits on old-fashionedness. J. Korean Assoc. Phys. Beauty Arts
16, 143–159 (2015)
7. JungMi, P., SungJin, H.: The influence of information characteristics of cosmetic blogs on
trust and oral effect of oral process. Korean Gastroenterol. 62(2), 13–25 (2012)
8. Elliott, K.M.: Understanding consumer-to-consumer influence on the web. Doctoral
Dissertation. Duke University (2002)
9. Chiou, J., Cheng, C.: Should a company have message boards on its web sites? J. Interact.
Mark. 17(3), 50–61 (2003)
10. TaeMin, L., EunYoung, L.: A study on the effects of perceived risk and perceived benefits
on the intent of mobile commerce. Asia Pac. J. Inf. Syst. 15(2), 1–21 (2005)
11. SangJo, K., SunMi, J.: Effect of the characteristics of the sender of the information on social
networking sites and the value of the information on the users’ intention to visit the
restaurant. Internet e-Commer. 14(6), 126–127 (2014)
Hardware-Software Implementation
of a McEliece Cryptosystem
for Post-quantum Cryptography
1 Introduction
of the embedded platforms or the security level (quantum or non-quantum) used in their
implementations.
This paper presents the implementation on an embedded device of a complete
McEliece cryptosystem on dedicated hardware. The security level is in accordance with
the recommendations given by ETSI for post-quantum resistant cryptosystems. The
implementation, which uses as parameters n = 2^m = 8192, m = 13, t = 315 and
k = n − m·t = 4097, is based on a hardware-software co-design that includes an ARM
Cortex-A53 microprocessor and reconfigurable hardware. The overall system is
integrated on a Xilinx Zynq UltraScale+ FPGA. The encryption and decryption
processes are carried out in 1.55 ms (throughput of 5.28 Mbit/s) and 47.39 ms
(throughput of 86.45 kbit/s), respectively.
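The stated parameters and throughputs can be sanity-checked in a few lines, assuming the throughputs are computed over the 8192-bit ciphertext per encryption and the 4097-bit plaintext per decryption (the bit counts given in the experimental section):

```python
# Sanity check of the code parameters and the quoted throughputs.
m, t = 13, 315
n = 2 ** m            # code length n = 2^m = 8192
k = n - m * t         # dimension k = n - m*t = 8192 - 4095 = 4097
assert (n, k) == (8192, 4097)

enc_ms, dec_ms = 1.55, 47.39
enc_mbit_s = n / (enc_ms * 1e-3) / 1e6   # 8192 ciphertext bits per encryption
dec_kbit_s = k / (dec_ms * 1e-3) / 1e3   # 4097 plaintext bits per decryption
# enc_mbit_s comes out near the quoted 5.28 Mbit/s, dec_kbit_s near 86.45 kbit/s
```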
This paper is organized as follows. Section 2 describes the basic theory of the
McEliece cryptosystem. Section 3 presents the hardware-software partitioning that
was performed. Section 4 describes the internal structure of the McEliece IP block
implemented in hardware. Section 5 shows the experimental results, and finally
conclusions are presented.
2 McEliece Cryptosystem
The fundamentals of the McEliece cryptosystem are well documented in [1, 13, 15], so
they will be only briefly reviewed here, to the extent needed to understand the internal
structure of the implemented coprocessor and how the software-hardware partitioning is
performed. As in the original proposal, the proposed implementation of McEliece is based on a
binary irreducible Goppa code. Thus, operations (sums and products) performed with
binary matrices are computed over F2, while the coefficients of a Goppa polynomial are
defined in F2^m.
Patterson Algorithm
The Patterson algorithm is used to compute the error-locator polynomial σ(Z) of a
codeword with errors. The polynomial σ(Z) of a vector y satisfies the property that
σ(αi) = 0 if and only if y has an error in the i-th position. The algorithm is based on the
knowledge of the syndrome Sy(Z) of the codeword, and it is computed following these steps:
1. Compute the syndrome Sy(Z).
2. Compute the inverse T(Z) = Sy(Z)⁻¹.
3. Compute the square root R(Z) = sqrt(T(Z) + Z).
4. Solve the equation a(Z) = b(Z)·R(Z).
5. Compute the error-locator polynomial σ(Z) = a²(Z) + Z·b²(Z).
6. Compute the error vector ε from σ(Z).
7. Correct the codeword y′ = y + ε.
The inverse T(Z) and the resolution of the equation a(Z) = b(Z)·R(Z) are computed
using the Extended Euclidean Algorithm.
The two initial polynomials satisfy deg(b) ≥ deg(a). The algorithm works as follows.
First, b(Z) is divided by a(Z), obtaining a quotient q(Z) and a remainder r(Z) that
satisfy b(Z) = q(Z)·a(Z) + r(Z) and deg(r) < deg(a). Since gcd(a(Z), b(Z)) = gcd(r(Z),
a(Z)), we can reduce the problem of finding gcd(a(Z), b(Z)) to determining gcd(r(Z),
a(Z)), where the degree of r(Z) is smaller than the degree of a(Z). The process is
repeated until one of the arguments is zero.
while r_i ≠ 0 do
  q_i := quotient of r_{i−1} on division by r_i;
  (r_{i+1}, λ_{i+1}, μ_{i+1}) := (r_{i−1}, λ_{i−1}, μ_{i−1}) − q_i · (r_i, λ_i, μ_i);
  i := i + 1;
end while;
return (g, λ, μ) := (r_{i−1}, λ_{i−1}, μ_{i−1});
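The same recurrence can be sketched in software. The illustrative version below runs over the integers rather than over Goppa polynomials (so it needs no field arithmetic), but the loop structure is identical:

```python
def extended_euclid(a, b):
    """Return (g, lam, mu) with g = gcd(a, b) and lam*a + mu*b == g.

    Mirrors the pseudocode above: each iteration replaces
    (r_{i-1}, r_i) by (r_i, r_{i-1} - q_i * r_i) and updates the
    Bezout coefficients (lambda, mu) with the same quotient q_i.
    """
    r_prev, r = a, b
    lam_prev, lam = 1, 0
    mu_prev, mu = 0, 1
    while r != 0:
        q = r_prev // r
        r_prev, r = r, r_prev - q * r
        lam_prev, lam = lam, lam_prev - q * lam
        mu_prev, mu = mu, mu_prev - q * mu
    return r_prev, lam_prev, mu_prev

g, lam, mu = extended_euclid(240, 46)
# g == 2 and lam*240 + mu*46 == 2
```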
3 Hardware-Software Partitioning
Usually, the parameter to be optimized is the execution time, while the main constraint
is the area needed by the coprocessor. Additionally, the success of the
hardware-software co-design depends not only on the active cooperation between the
software and hardware modules, but also on the communication protocol needed for
their proper operation.
Table 1 shows the execution time when encrypting and decrypting a randomly
chosen block of 512 characters. The table also includes the specific time needed
by each of the steps described in Sect. 2. These results were obtained by programming the
overall McEliece cryptosystem in the C language.
Table 1. Execution time for each step of the McEliece cryptosystem (encryption and
decryption). Results are given in clock cycles and milliseconds using an ARM Cortex-A53
microprocessor clocked at 1.1 GHz. Steps that will be implemented in hardware are shown in bold.

Function | No. cycles | Time (ms)
Encryption: Compute m·Gpub | 166.53 × 10⁴ T_CLK | 1.51 ms
As can be seen, the calculation of the inverse syndrome, the polynomial a(Z), and
the error vector ε takes about 98% (972.8 ms, or 1070 × 10⁶ clock cycles) of the
total time. Thus, these three steps are the candidates to be implemented in hardware.
On the other hand, the partitioning could be performed at different levels, ranging
from simple operations or instructions (fine granularity) to complex processes or
routines (coarse granularity). After analyzing the most time-consuming steps, it seems
that the optimal partitioning should mix different granularities. Thus, a
unique coprocessor is designed, which internally includes several subsystems for
implementing some basic operations. The coprocessor is able to perform operations using
Goppa polynomials. The proper sequence in which such operations should be performed
depends on the step that is being implemented by the coprocessor. Such a
sequence is generated by managing several internal signals by a control unit, which,
conveniently programmed, establishes the order in which the subsystems are activated.
The internal subsystems are also designed in a similar way, each of them including its
corresponding control unit.
4 Internal Structure of the McEliece IP Block
Figure 1 shows the internal architecture of the McEliece IP block. This IP is connected
as a slave peripheral to the AXI4 bus, which is used during the configuration process
performed by the microprocessor. Additionally, the IP is also attached to an AXI4
stream interface used for retrieving and storing polynomial data from/to memory. In
order to improve the performance, a couple of buffers are included. These buffers
temporarily store the input/output Goppa polynomials (input operands) while the arithmetic
unit is busy processing data.
Fig. 1. Internal architecture of the McEliece IP block: an AXI4 slave interface, input and output 316 × 13-bit Goppa buffers with AXI4-stream reader and writer, and a Goppa arithmetic unit containing the XOR-Mod multiplier, divisor, and error-location subsystems.
The coprocessor is designed to carry out seven different operations using Goppa
polynomials (see Table 2). Such operations are indicated by writing a specific code in
the input registers, which are used to communicate the microprocessor and the IP block.
Table 2. Codification table and Goppa polynomial operations performed by the arithmetic unit.
Note: Gp is the Goppa polynomial generator used to create the private and public keys.
Codification Operation
0000 op1 * op2
0001 (op1 * op2) mod Gp
0010 (op1 * op2) ⊕ op3
0011 ((op1 * op2) ⊕ op3) mod Gp
0100 op1/op2
1000 Error location (op1)
1001 Not used
1110 Set Gp
All these operations can be performed by designing only four basic operations:
multiplication, division, addition (xor), and modulus.
The Goppa multiplier is presented in Fig. 2. Note that its design includes 316
(t = 315) internal multipliers in F2^m that are able to perform in parallel the 316
multiplications between the coefficients that form the two input polynomials. An additional
block is added to perform the xor operation between this product and a third
polynomial when necessary. This additional operation is very common in the Extended
Euclidean Algorithm.
As shown in Fig. 3(a), a Goppa divider can be designed by including a Goppa
multiplier and a divider in F2^m. Finally, the error-location module can be implemented
by simply using a multiplier in F2^m (see Fig. 3(b)).
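In software, each GF_Mult unit corresponds to one multiplication in F2^13. The sketch below uses x^13 + x^4 + x^3 + x + 1 as the reduction polynomial; this is a common irreducible choice for F2^13 but is an assumption here, since the paper does not state which polynomial the hardware implements:

```python
M = 13
# Assumed reduction polynomial x^13 + x^4 + x^3 + x + 1 (0x201B).
POLY = (1 << 13) | (1 << 4) | (1 << 3) | (1 << 1) | 1

def gf_mult(a, b):
    """Multiply two F_2^13 elements (13-bit ints) by shift-and-xor,
    reducing whenever the running operand's degree reaches 13."""
    res = 0
    while b:
        if b & 1:
            res ^= a          # add (xor) the current shifted copy of a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY         # clears bit 13 and adds the reduction terms
    return res
```

With this polynomial, x · x^12 reduces to x^4 + x^3 + x + 1, i.e. `gf_mult(0b10, 1 << 12) == 27`.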
Fig. 2. Internal architecture of the Goppa multiplier: 316 parallel GF_Mult units operating on the polynomial coefficients, under a common control unit.
Fig. 3. Internal architecture of: (a) Goppa Xor-Mod multiplier-divisor, (b) Error location.
5 Experimental Results
The overall embedded system was implemented on a Xilinx Zynq UltraScale+ ZCU102
evaluation board, which includes an XCZU9EG-2FFVB1156 MPSoC device.
Although the MPSoC includes four ARM Cortex-A53 cores clocked at 1.1 GHz, only
one of them is used in the final implementation. Additionally, 4 GB of 64-bit DDR4
memory is available, so there is no limitation on the size of the keys to be stored.
All the hardware designs were described in VHDL and implemented using the
software tools provided by Xilinx. The results are shown in Table 3.
Table 4 shows the results for the execution time of the overall hardware-software
co-design proposed in this paper. Results are obtained when encrypting a block of 4097
bits (512 characters). Afterwards, the ciphertext of 8192 bits (1024 characters) is
decrypted, recovering the original message. The table also shows the acceleration factor
provided by the McEliece IP block for those steps implemented in hardware, against
the software version. In order to meet the maximum frequency requirements imposed
by the critical path, the McEliece IP block is clocked at 250 MHz. Note that the inverse
of the syndrome is processed in 15.67 ms, which means that it is about 20.18 times faster
than the software execution. A similar figure is obtained when calculating the
polynomial a(Z). This polynomial is computed by the coprocessor in 7.38 ms, leading to an
acceleration factor of 22.72. The error vector calculation offers the best result: this
vector is obtained in 5.21 ms, against the 488.84 ms needed by the microprocessor
(acceleration factor of 93.82). Thus, a complete decryption is performed in
approximately 48 ms.
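The acceleration factors and the overall decryption speed-up quoted above follow directly from the timings listed in Table 4:

```python
# (software-only ms, hardware ms) for the three hardware-accelerated steps:
steps = {
    "inverse syndrome T(Z)": (316.31, 15.67),
    "equation a(Z) = b(Z)*R(Z)": (167.68, 7.38),
    "error vector": (488.84, 5.21),
}
factors = {name: sw / hw for name, (sw, hw) in steps.items()}
overall = 1002.1 / 47.39   # total decryption speed-up, about 21.14x
```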
Table 3. Resources used by the McEliece IP Block when implemented on Zynq UltraScale+™
MPSoC family.
Used Available Utilization %
LUTs 77357 274080 28.22
FFs 41338 548160 7.54
Table 4. Comparison of execution times for the software-only implementation (ARM Cortex-A53
clocked at 1.1 GHz) and the hardware-software co-design including the McEliece IP
(coprocessor) clocked at 250 MHz.

Function | Time (Only Software) | Time (Hardware-Software) | Acceleration factor
Compute m·Gpub | 1.51 ms | 1.51 ms (sw) | –
Add error: r = m·Gpub + z | 0.045 ms | 0.045 ms (sw) | –
Compute r·P⁻¹ | 0.169 ms | 0.169 ms (sw) | –
Syndrome polynomial Sy(Z) | 3.189 ms | 3.189 ms (sw) | –
Inverse syndrome T(Z) = Sy(Z)⁻¹ | 316.31 ms | 15.67 ms (hw) | 20.18
Square root R(Z) = sqrt(T(Z) + Z) | 11.28 ms | 11.28 ms (sw) | –
Equation a(Z) = b(Z)·R(Z) | 167.68 ms | 7.38 ms (hw) | 22.72
Compute σ(Z) = a²(Z) + Z·b²(Z) | 10.23 ms | 10.23 ms (sw) | –
Compute error vector ε | 488.84 ms | 5.21 ms (hw) | 93.82
Correct the codeword y′ = y + ε | 0.171 ms | 0.171 ms (sw) | –
Truncate y′ to obtain m′ | 0.079 ms | 0.079 ms (sw) | –
Compute m = m′·S⁻¹ | 4.19 ms | 4.19 ms (sw) | –
Total time for encryption | 1.515 ms | 1.515 ms | –
Total time for decryption | 1002.1 ms | 47.39 ms | 21.14
6 Conclusions
References
1. McEliece, R.J.: A public key cryptosystem based on algebraic coding theory. DSN Progress
Report 42-44 (1978)
2. Berlekamp, E.R., McEliece, R.J.: On the inherent intractability of certain coding problems.
IEEE Trans. Inf. Theory 24(3), 384–386 (1978)
3. ETSI – European Telecommunications Standards Institute: Quantum Safe Cryptography
(QSC); Quantum-safe algorithmic framework. ETSI GR QSC 001 v1.1.1 (2016)
1 Introduction
With the development of technology, our daily lives have been tremendously
changed. Streets, which are in physical space, have also been significantly impacted
by these technologies, and people expect to see further changes in their forms.
The city of Los Angeles was built up around motor vehicles [1], while Amsterdam
developed the shape of its streets along its canals. The shapes of streets are strongly
related to how we move within the cities we live in. Entering the 21st
century, with systems such as the linear Chūō Shinkansen, people travel in faster and more
convenient ways. However, in the case of circulating around the neighborhood, a traveler
might gain more value from a lower-speed pace rather than pursuing
fast and convenient transportation; e.g., people might want to walk around
slowly and feel the streets when visiting a tourist attraction, and visitors may
want to enjoy the beautiful seasonal scenery when out on a spontaneous
walk. In this study, the researchers believe these low-speed moving experiences
can also, along a different vector, enrich our everyday lives.
c Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 826–835, 2020.
https://doi.org/10.1007/978-3-030-39442-4_61
2 Related Works
2.1 Developing Ride-Sharing Services Rooted in the Town Street
Ride-Share Services for Easy Transportation. Share services for electric
scooters have already been expanding worldwide; e.g., “Bird's electric
scooters [6]”, based in LA, provide a ride-share service for people to
move around the city easily and reduce LA's traffic problems, and “Telepod
[7]”, based in Singapore, gives another transportation choice for
people who don't want to walk in the hot weather. These scooter-share services
provide users easy transportation experiences in which users don't need to
pedal bicycles while sweating, and can enjoy a beautiful day in a comfortable
ride. With the application on smartphones, users can locate the scooters'
parking spots, and a ride is started simply by scanning the QR
codes on the scooters. Also, users can pay through the application and easily return
the scooters at any parking spot of the rental service.
2.3 Contributions
As a contribution, the researchers set out the possibility of constructing and
strengthening the already existing social relationships between the town's people and the
potential of creating new social bonds through this service, and captured the
traces of people moving through the town street to accumulate data for
extracting the character of the street.
3 Design
3.1 Concept
3.2 Method
ZOUGAME is based on a concept design framework, through the use of design
thinking and service-dominant logic methodology. Design thinking is a service
design methodology that co-creates values with customers by observing
and describing in fieldwork, and by making several prototypes based on the
empathy found through the fieldwork. Service-dominant logic,
a new marketing logic advocated by Stephen L. Vargo and Robert F. Lusch in
2004, holds that values are created by both customers and corporations through
multiple intersections and interactions between them.
ZOUGAME is a ride-sharing service which can enhance people's lives by
making transportation easier and more enjoyable. To make this, the researchers
extracted the actors who are needed in the service. An actor is an entity which
has agency toward a goal and acts within a structure which works as a system or a
policy. The researchers set up four actors for this service: first, the “Mobility service
provider”, who provides this service; second, the “Local resident in the street”, who
is familiar with the town street; third, the “New resident who just moved into the
town”; and fourth, the “Mobility maker/mechanic”, who produces the mobility vehicles and
does maintenance.
To realize this service, the researchers need to make these four actors feel the
values of this service. Thus, the researchers conducted an ethnographic research
to figure out the actors’ goals and their mental models, especially “Local resident
in the street” and “New resident who just moved into the town”, and observed
how residents live and interact in the town streets and how they move around
the streets in a standard day life (Fig. 2).
In this ethnographic research, the researchers visited physical therapist Y and
observed the way she gets around the street on her electric bicycle
from 10 am to 1 pm on 6 August 2018, and interviewed her after the observation.
Physical therapist Y was born in Shinagawa, Tokyo, and now works as a home
physical therapist. After she got married, physical therapist Y still lives in the
same place as before, and her work and life are all in the same street. The
nursing center where she works takes responsibility for all the patients in
Shinagawa, and for this reason she has to ride over 20 km on her electric bicycle
every day.
with a platform for loading baggage for one person, and can be connected to
another ZOUGAME as a modular vehicle system (Fig. 4).
4 Results
The researchers used the simple ZOUGAME vehicle prototype to validate the effectiveness of this service. The research members played the four actors and performed a skit, set in a scene in which a father takes his daughter to her kindergarten class in the morning (Fig. 5). The feedback from the member who played the father was “It feels good when you can have a
5 Conclusion
5.1 Value Co-creation with Research Partners
This study is collaborative research conducted by three corporations and the Keio University Graduate School of Media Design. Based on the methodology of design thinking, the members came together to clearly define the form of the service. Engineer M, who works at Denso Corporation, pointed out that “if ZOUGAME extends a cable for charging, then it will be like a normal home appliance”; drawing on his knowledge and skills in electrical engineering, he proposed an idea that makes ZOUGAME more adorable and vibrant: one of ZOUGAME’s back legs serves as a positive pole and the other as a negative pole, so that it can charge itself by stepping onto a charging board located in its storage space.
Physical therapist Y, who works at the YAKUJU Corporation, shared her feelings about her past daily routine of traveling around the city by bicycle to go to work, and offered a couple of ideas for discussion. Mechanic M, who works at SEKISUI KINZOKU, a model railroad manufacturer and dealer, shared his view of the railroad fan community. From these exchanges, the researchers concluded that ZOUGAME can deliver more value through functions that make it more pet-like. The researchers will continue to develop ZOUGAME and co-create its value with the partner corporations.
References
1. Ratti, C., Claudel, M.: The City of Tomorrow: Sensors, Networks, Hackers, and
the Future of Urban Life (The Future Series). Yale University Press, New Haven
(2016)
2. Holtzblatt, K., Beyer, H.: Contextual Design: Design for Life (Interactive Tech-
nologies), 2nd edn. Morgan Kaufmann, Burlington (2016)
3. Goodwin, K.: Designing for the Digital Age: How to Create Human-Centered Prod-
ucts and Services. Wiley, Hoboken (2009)
4. Lusch, R.F.: Service-Dominant Logic: Premises, Perspectives, Possibilities. Cam-
bridge University Press, Cambridge (2014)
5. Levy, J.: UX Strategy: How to Devise Innovative Digital Products that People
Want. O’Reilly Media, Newton (2015)
6. Bird. https://www.bird.co/
7. Telepod. https://www.telepod.co/
8. Starship Technologies. https://www.starship.xyz/business/
9. Jacobs, J.: The Death and Life of Great American Cities. Vintage Books, New
York (1961)
10. Watanabe, S., Watanabe, Y.: Community Cat Activity that Connects People:
We Develop the Foundation of Community Welfare. Research Bulletin of Kyushu
Junior College of Kinki University, Higashi-osaka (2015)
Deep Learning Based Face Recognition
Application with Augmented Reality Devices
1 Purpose
Deep learning has become a topic of intense interest in the last few years due to its
success in many different fields such as computer vision and speech recognition [1].
The performance of some deep learning algorithms has even exceeded that of humans.
The convolutional neural network (CNN), a particular class of deep neural net-
works, has met with great success in the last few years in solving image classification
problems [2]. This makes CNNs the ideal tool to classify faces in the user’s view of the
HoloLens.
There has been a resurgence in the popularity of and demand for convolutional neural networks (CNNs), driven by the ImageNet challenges held every year. The CNN outperforms the next best algorithm by a huge margin; the gap is so large that CNNs are the only favorable choice for object recognition today. A CNN is an object recognition algorithm whose design is inspired by the organization and structure of the brains of living organisms. CNNs now outperform the human eye on the ImageNet challenges.
The purpose of this research is to increase the accuracy of CNN-based object detection with additional data, specifically depth data. Usually a CNN is used to detect objects in two dimensions, such as in video streams or images. Adding depth data is a complicated task, but having the neural network accept and identify three-dimensional objects should logically give better accuracy for the objects it identifies.
Therefore, this research on CNNs with three-dimensional objects is implemented on a device called the HoloLens, created by Microsoft. The product is an augmented reality headset whose key features include spatial mapping, with which it can detect the shapes of surrounding objects and the distance between an object and the user.
© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): FICC 2020, AISC 1130, pp. 836–841, 2020.
https://doi.org/10.1007/978-3-030-39442-4_62
Deep Learning Based Face Recognition Application with Augmented Reality Devices 837
2 Background
The success of CNNs at image classification can be attributed to the unique architecture
of the CNN. CNNs use the convolution operation at each layer of the network. This
involves convolving a set of learnable filters across the width and height of the input to
the layer.
The result of this process is that the filters learn different features of the input image
and deeper layers learn progressively more abstract features of the image. The resultant
output of the last convolutional layer can then serve as input to a traditional dense feed
forward network, which can perform the classification of the initial input image.
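The layer operation described above can be sketched in a few lines of NumPy. This is a minimal illustration of a single convolutional filter pass (ours, not the authors' implementation), using a horizontal-edge kernel in the spirit of the edge-detection example in Fig. 2; note that, as in most CNN libraries, the kernel is applied as cross-correlation, i.e. without flipping.

```python
import numpy as np

def conv_layer(image, kernel):
    """Slide the kernel over the image (valid padding, stride 1).

    CNN libraries actually compute cross-correlation, i.e. the kernel is
    not flipped, and that is what we do here."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: bright top half, dark bottom half -> one horizontal edge.
img = np.vstack([np.ones((3, 5)), np.zeros((3, 5))])

# Horizontal-edge filter: responds where intensity changes vertically.
edge_kernel = np.array([[ 1.0,  1.0,  1.0],
                        [ 0.0,  0.0,  0.0],
                        [-1.0, -1.0, -1.0]])

response = conv_layer(img, edge_kernel)
# Windows straddling the edge produce a strong response; uniform regions give zero.
```

A learnable layer would hold many such kernels and adjust their weights during training; stacking layers is what lets deeper layers pick up progressively more abstract features.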
3 Methodology
A four-step process is proposed to complete the following tasks: first, detect faces in the image; second, transform the detected faces to make them suitable as input to a CNN; third, pass each face through a CNN to produce a feature vector; and finally, compare the feature vector to the feature vectors of known faces and pick the closest one as the correct face.
838 A. Kim et al.
Fig. 2. (a). A non-trivial image of a dog; (b). Horizontal Edge Detection on (a).
3.2 Preprocessing
Because the faces detected in the previous step could be looking in different directions,
they need to be transformed so that they are all oriented in approximately the same way
to serve as better input to our CNN. Therefore, the second step is a preprocessing step.
This step can be done using Face Landmark Estimation (FLE). An FLE algorithm identifies a certain number of “landmarks” of a face, such as the eyes, nose and lips. We can then apply a transform to the image, based on the positions of the landmarks, to center the face in the image as well as possible. Doing this improves the accuracy of our CNN’s prediction.
3.4 Comparison
Once the feature vectors have been computed by the CNN, the final step is quite simple. First, a set of feature vectors for known faces is computed by running them through Steps 1–3. This needs to be done only once, before any faces are classified. Then each vector to be classified is compared against the set of feature vectors of known faces, and the closest match is selected.
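The matching step amounts to a nearest-neighbour search over the feature vectors. A minimal sketch follows, with toy 3-D vectors and placeholder names standing in for real 128-dimensional CNN encodings; the 0.6 distance threshold mirrors a common face-recognition convention and is an assumption, not a value from the paper.

```python
import numpy as np

def classify_face(known_vectors, known_names, query, threshold=0.6):
    """Return the name of the closest known face, or 'unknown' if none is close.

    known_vectors: (N, D) array of feature vectors, computed once for known faces.
    query: (D,) feature vector produced by the CNN for the face to classify.
    """
    distances = np.linalg.norm(known_vectors - query, axis=1)
    best = int(np.argmin(distances))
    return known_names[best] if distances[best] <= threshold else "unknown"

# Toy 3-D "feature vectors" instead of real 128-D CNN encodings.
known = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
names = ["alice", "bob"]

print(classify_face(known, names, np.array([0.9, 0.1, 0.0])))  # -> alice
```

Thresholding the best distance is what lets the system report a face as unrecognized rather than forcing a match to the nearest known person.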
4 Implementation
Fig. 4. Face recognition with the HoloLens (input: photo of people; output: photo with bounding boxes on the detected faces and their names).
5 Results
Figure 4 shows the output of our face recognition application implemented on the augmented reality device, i.e. the HoloLens. Our application is able to detect faces, add bounding boxes to the detected faces, and display their names in real time.
The main issue with this workaround is that it doesn’t allow for a very high frame
rate on the HoloLens. This is caused by several factors. Assuming no overhead for
processing, the time delay of sending a frame to the server and receiving a response,
that is, latency, imposes an absolute upper bound on the frame rate.
Once accounting for the processing time necessary on the server, the effective
frame rate is further reduced. This can be minimized by using a very efficient GPU
based implementation on the server.
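As a back-of-the-envelope illustration of these bounds (the latency and processing figures below are hypothetical, not measured values from the paper):

```python
def max_frame_rate(round_trip_s, server_time_s=0.0):
    """Upper bound on frames per second when frames are handled one at a time:
    each frame must wait out the network round trip plus server processing."""
    return 1.0 / (round_trip_s + server_time_s)

# Hypothetical figures: 50 ms round-trip latency, 150 ms server processing.
print(max_frame_rate(0.050))         # latency alone caps the rate at 20 fps
print(max_frame_rate(0.050, 0.150))  # adding processing time lowers it to 5 fps
```

This assumes strictly sequential frames; pipelining requests could hide some of the server time, but the round-trip latency bound on responsiveness would remain.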
6 Conclusion
In summary, a deep learning based method that uses a deep Convolutional Neural
Network (CNN) with the HoloLens can successfully recognize and classify individual
faces in a scene in real time.
Our experience implementing this process on a non-GPU server resulted in a frame rate of only approximately one frame per second. It has been reported that the next version of the HoloLens will have a special coprocessor aimed at accelerating deep learning workloads. This may allow a client-side-only implementation that should run considerably faster.
Acknowledgments. We would like to thank the California State University Fresno, the
California State University East Bay, and the National Science Foundation for their financial
support (NSF Grant #DMS-1460151), the FURST program (NSF Grant #DMS-1620268), the
CSUEB FURST program (NSF Grant #DMS-1620500), and our mentors, Dr. Kamalinejad and
Dr. Zhong, for their support during the completion of the project.
References
1. Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. The MIT Press, Cambridge (2017)
2. A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN. https://
blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-
34ea83205de4. Accessed 23 June 2019
3. Performing Convolution Operations. https://developer.apple.com/library/content/
documentation/Performance/Conceptual/vImage/ConvolutionOperations/Convolution
Operations.html. Accessed 23 June 2019
4. Dither a Grayscale Image. https://codegolf.stackexchange.com/questions/26554/dither-a-
grayscale-image. Accessed 23 June 2019
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: International
Conference on Computer Vision & Pattern Recognition (CVPR 2005), vol. 1, pp. 886–893.
IEEE Computer Society, San Diego (2005)
6. Amos, B., Ludwiczuk, B., Satyanarayanan, M.: Openface: a general-purpose face recognition
library with mobile applications. CMU-CS-16-118, CMU School of Computer Science, Tech.
Rep. (2016)
7. Face Recognition. https://github.com/ageitgey/face_recognition. Accessed 23 June 2019
e-Canteen for the Smart Campus Application
Abstract. Nowadays, in the Industry 4.0 and digital-economy era, manual serving should be replaced by services that use technology as helping tools; such applications support a smart canteen in solving the problems of serving food and queuing to pay. Therefore, an e-canteen web system with a Raspberry Pi as the server is created, through which customers can order directly from their table without waiting for a waiter to bring the menu, and pay as well, because the system is compactly designed to make it easier for the cashier, the waiters and even the chef in the kitchen to coordinate with each other. Testing is conducted with the Apache JMeter software to measure the server’s ability to handle many clients, based on the server’s response time. The results show that a Raspberry Pi 3 Model B with 1 GB of RAM is able to handle more than 150 threads (users) at the same time, with response times of about 8 ms–17.6 ms and a throughput of about 132.23 requests/s. In the QoS parameter testing, when using the Raspberry Pi’s WiFi network and an ARGtek ARG-1209 wireless network adapter, the index value obtained is 3, at 87.5%, which means that the QoS of web access when using either the Raspberry Pi WiFi network or the wireless network adapter is in the good category.
1 Introduction
Consider, for instance, the culinary industry, such as canteens. A canteen is a familiar feature of office and school environments.
According to Azwar and Sapri (2012) [1], the canteen of Hasanuddin University is a place where students can spend their time eating, drinking, or just taking a rest. But the payment transaction is still a manual process using cash, which causes a queuing problem. Therefore, a Smart Card was designed as a transaction tool for the canteen; by using the Smart Card for payment, the queue at the cashier could be eliminated [2]. The queue for ordering food, however, remains unsolved: ordering food and drink still uses a manual system that takes much time.
Currently, to fulfill customers’ needs, most canteens let the customer line up in front of the cashier, or wait for a waiter to come and hand over the menu before placing an order. After finishing the meal, the customer goes to the cashier to pay the bill and lines up again. Manual ordering also tends to introduce mistakes, such as misreading the menu items ordered by the customer. The service, ordering and payment methods of this system show how ineffective and inefficient it is. Therefore, a system is needed that allows ordering directly from the customer’s table, without waiting for a waiter to bring the menu list or lining up in front of the cashier either to order or to pay the bill.
According to Vinayak Ashok Bharadi et al. (2013) [3], an e-menu provides information about the menu items with interactive pictures. Confusion often arises in the kitchen about the menu items written down by the waiter; using a tablet minimizes ordering mistakes and makes service faster as well.
Based on the description above, an effective and efficient system is designed for canteen service using a web application installed on a tablet as the ordering tool. Pictures of the food and drink, with their prices, appear on the tablet, and customers can choose the menu items they want using the tablet available at each table. After the order is finished, the kitchen receives a message listing the food and drink ordered, and the chef then prepares it. The cashier also receives the order list and the amount of the customer’s bill, so the customer can simply pay the bill directly without rechecking the order list. A Raspberry Pi is used as the server to connect the customer to the cashier and the kitchen. Therefore, the authors propose an “E-Canteen for the Smart Campus Application” as the solution to the problem of long queues in the ordering and payment process.
2 Theoretical Background
2.1 Raspberry Pi
The Raspberry Pi is a credit-card-sized single-board computer whose operating system is generally Linux-based, although it has since become capable of running a Windows IoT-based operating system. Development of the Raspberry Pi was started in 2006 by a non-profit institution, the Raspberry Pi Foundation, which consists of volunteers and technology academics in England [4].
844 Z. B. Hasanuddin et al.
2.3 PHP
PHP (PHP Hypertext Preprocessor) is a scripting language embedded in HTML. Most of its syntax is similar to that of C, Java and Perl, together with some PHP-specific functions. The main purpose of the language is to enable web developers to write dynamic web pages quickly [7].
2.4 MySQL
MySQL is one of the most popular database servers. Its popularity stems from its use of SQL as the basic language for accessing its database. SQL is a concept for database operations, especially selection and data input, that enables data operations to be performed easily and automatically [8].
2.6 Wireshark
Wireshark is a network packet analyzer application that captures and filters the packets on a network and clearly displays all the information contained in those packets. The filtered packets can be used to analyze the network. The analysis of
e-Canteen for the Smart Campus Application 845
network performance covers many activities, from capturing the data packets or information travelling over the network, to checking network security and troubleshooting, and even sniffing (obtaining sensitive information such as passwords and private data). Wireshark is an open-source network analyzer with a GUI (graphical user interface) [10].
3 Research Methodology
3.1 General Overview of the System
Figure 1 shows the system flow, from the moment a customer opens the menu web page on the tablet available at each table. Customers look at the menu shown on the tablet when they want to place an order. Once an order has been placed, the order list is sent to the kitchen for preparation and to the cashier to compute the payment. The LCDs at the cashier and in the kitchen show the customers’ order lists in real time: the kitchen LCD shows the order view list, while the cashier’s LCD shows the order view list together with the total payment. After the customer pays the bill, the flow ends.
The login menu is used to input the username and password for the customer’s table. Login serves as the system’s authentication of authorized users. If the login data entered are valid, the user enters the homepage of the website.
On the user home screen shown in Fig. 4, there are buttons to enter the customer’s name, add an order, finalize the order, and log out:
1. Enter the customer’s name inputs the customer’s name.
2. Add order moves to the next page to browse the menu list.
3. Final order ends the ordering process.
4. Logout indicates that the user is leaving the system.
In the food and drink view shown in Fig. 5, there are buttons:
1. Add to cart
Add to cart selects the food and drink the customer wants.
2. Quantity
Quantity determines the amount of each food or drink item in the customer’s order.
3. Order
Order sends the customer’s order data to the cashier and kitchen.
After ordering food and drink by clicking Order on the user page, the customer returns to the user home page to re-check the order list, as shown in Fig. 6.
After re-checking the order list, the customer clicks Final order to end the ordering process. The order data are then sent to the kitchen and the cashier.
The kitchen page shows the customers’ order list, as shown in Fig. 7, in a table with the following columns:
1. Product name: the food and drink ordered by the customer
2. Quantity: the amount ordered
3. Price details: the price of the food and drink ordered by the customer
4. Order total: the total payment for the customer’s order.
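The order total shown on the kitchen and cashier pages is a simple aggregation over the order lines. A minimal sketch (in Python for illustration; the actual system computes this in PHP against MySQL, and the products and prices below are made-up placeholders):

```python
def order_total(order_lines):
    """Sum price * quantity over the order lines, as on the cashier page."""
    return sum(line["price"] * line["quantity"] for line in order_lines)

# Hypothetical order; product names and prices are illustrative only.
order = [
    {"product": "fried rice", "price": 15000, "quantity": 2},
    {"product": "iced tea", "price": 5000, "quantity": 1},
]
print(order_total(order))  # -> 35000
```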
Food and drink management, shown in Fig. 8, is a view on the kitchen page used to disable a food or drink item on the user menu page. If the kitchen marks a food or drink item as disabled or not available, the customer cannot order it from the menu.
Login is used to input the admin’s username and password, as shown in Fig. 9. This login is an authentication system for authorized users. If the login data entered are valid, the user enters the cashier website page.
After login, the admin page appears as shown in Fig. 10. This page has a logout button, used by the admin to exit the page.
Figure 11 shows the cashier page, which has Pay and Clear data buttons. The Pay button is used when the customer has paid the bill, while the Clear data button deletes the data on the cashier and kitchen pages.
Fig. 12. The view of insert food and insert drink page
The insert food and insert drink page shown in Fig. 12 is used to add food or drink items to the e-canteen ordering system. The data that need to be input are:
1. Product name
2. Product price
3. Product image
The database of customer orders can be checked again by the admin on the data order page, as shown in Fig. 13. The data order page shows the following information:
1. Data order: the customer’s order list
2. Time: the time of the customer’s order
3. Total: the total amount of the customer’s order
4. Detail: the order details of the customers
2. The average throughput from the observations is shown in Fig. 15 below.
Based on the graphs for both conditions above, there is a clear difference between using the Raspberry Pi WiFi network and the wireless network adapter. In the testing, the response time when using the Raspberry Pi WiFi network ranged from about 8 ms to 15.2 ms, while with the wireless network adapter it ranged from about 12.8 ms to 19.6 ms. The highest response time in both conditions occurred at 5 threads (clients): 15.2 ms for the Raspberry Pi WiFi and 19.6 ms for the ARGtek adapter. Based on these results, the better response time is obtained when using the Raspberry Pi WiFi network.
The highest throughput when using the Raspberry Pi WiFi network is 132.33 requests/s, and when using the wireless network adapter it is 141.86 requests/s, both at 150 threads; the lowest throughput is 6.04 requests/s for the Raspberry Pi WiFi network and 6.06 requests/s for the wireless network adapter, both at 5 threads. Across the tests from 5 to 150 threads, it can be seen that the more samples are sent, the more requests are served each second: the server tends to keep the response time as low as possible, so the throughput keeps increasing. For throughput, a higher value is better; it means the server can execute many requests per unit of time.
Based on the test results, the web server performs better at responding to requests in the first condition, as shown by a response time that is lower than in the second condition by 4.4 ms. The second condition, in turn, performs better in throughput, which is higher than in the first condition by 0.02–9.5 requests/s.
Table 2 below gives the measured QoS parameters when using the wireless network adapter.
Referring to the QoS index table, the QoS value is determined by dividing the total index of the QoS parameters by 16. The resulting QoS index values for each condition are as follows:
1. The QoS index when using the Raspberry Pi WiFi to make requests is 14/16 = 0.875 × 100% = 87.5%, with an index value of 3, which means that the QoS is in the good category.
2. The QoS index when using the wireless network adapter to make requests is 15/16 = 0.9375 × 100% = 93.75%, with an index value of 3, which means that the QoS is in the good category, for totals of 2 and 4 users. For a total of 3 users the QoS index is 14/16 = 0.875 × 100% = 87.5%, with an index value of 3, which also means that the QoS is in the good category.
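The index arithmetic above is simply the summed parameter index divided by the maximum possible total of 16 and expressed as a percentage:

```python
def qos_percentage(index_total, max_index=16):
    """Convert a summed QoS parameter index into a percentage of the maximum."""
    return index_total / max_index * 100

print(qos_percentage(14))  # -> 87.5  (Raspberry Pi WiFi)
print(qos_percentage(15))  # -> 93.75 (wireless network adapter, 2 and 4 users)
```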
5 Conclusion
3. The QoS test results when using the Raspberry Pi WiFi network and the ARGtek ARG-1209 wireless network adapter both show an index value of 3, which means that the QoS of web access is in the good category when using either network, with QoS percentages of 87.5% for the Raspberry Pi WiFi and 93.75% for the ARGtek ARG-1209 wireless network adapter, respectively.
6 Future Works
There are many things that could be done to improve this research in the future:
1. This E-Canteen is still quite simple, especially in its appearance and security, so it is expected to be made more attractive in further development.
2. The design of this E-Canteen is expected to become more interactive, to make the information the customer needs more useful.
3. Facilities that can handle the payment process should be made available.
4. A big screen (display) should be provided for promoting new food and drink menus, along with CCTV for safety reasons, to support the E-Canteen management system.
Acknowledgment. The authors would like to thank the Hasanuddin University and the
University of Sulawesi Barat for supporting this work.
References
1. Aswar, Z.: Kartu Mahasiswa Cerdas Menggunakan Teknologi RFID, Universitas Hasanud-
din (2016)
2. Amol, S., Aayushi, V., Manali, C.: Integrated cafeteria management system using RFID.
IOSR J. Electron. Commun. Eng. (IOSR-JECE) 1, 01–06 (2017)
3. Vinayak, A., et al.: Intelligent e-Restaurant Using Android OS (2013)
4. Patulak, Mitra, Priyon: Aplikasi Raspberry Pi sebagai Webserver untuk mengendalikan
Lampu melalui Website. Skripsi. FT, Teknik Informatika. Universitas Hasanuddin (2016)
5. Rao, P.B., et al.: Raspberry Pi home automation with wireless sensors using smart phone.
Int. J. Comput. Sci. Mob. Comput. 4(5), 797–803 (2015)
6. Betha, S.: Pemrograman WEB dengan PHP. Informatika Bandung, Bandung (2014)
7. Muhammad, I.: PHP dan MySQL untuk orang Awam. Maxikom, Palembang (2003)
8. Abdul, K.: Tuntunan Praktis: Belajar Database Menggunakan MySQL. Andi, Yogyakarta
(2008)
9. Evin, Asmunin: Performance Test Dan Stress Website Menggunakan Open Source Tools,
Jurnal Manajemen Informatika 6(1) (2016). ISSN: 208-215 208. http://jurnalmahasiswa.
unesa.ac.id/index.php/jurnal-manajemeninformatika/article/view/18463/16837. Accessed 14
Mar 2018
10. Annisa, C.: Pengenalan dan Dasar Penggunaan Wireshark (2013). http://ilmukomputer.org/
2013/04/22/pengenalan-dan-dasar-penggunaan-wireshark/. Accessed 19 Mar 2018
Voluntary Information
and Knowledge “Hiding” in a Conservative
Community and Its Consequences: The Case
of the Ultra-orthodox in Israel
Moshe Yitzhaki
1 Introduction
Generally speaking, two distinct religious groups can be distinguished among the Jewish population worldwide: modern orthodox and ultra-orthodox. The term used by members of the latter community to refer to themselves is the Hebrew word ‘Haredi’ (rather than ‘ultra-orthodox’). The modern orthodox are usually much more open to the modern world, combine ‘secular’ general and academic studies in their educational system, and are more involved in the surrounding ‘secular’ world. The ultra-orthodox (‘Haredi’) groups, by contrast, are less open to the modern world and adhere strongly to traditions and old customs.
Understandably, the free flow of unfiltered information, as facilitated by current IT,
poses great challenges for a conservative community which strives to retain its
members and especially its youngsters, within a traditional lifestyle [1].
The aim of our study was to explore the ways in which a conservative minority community upholding a unique subculture copes with the challenges of unrestricted information and knowledge flow in a cyber-developed country like Israel. We tried to
3 Literature Review
After long neglect by academics and the scholarly community, the last three decades have seen an influx of academic publications, articles as well as books, not to mention popular news reports, dealing with ultra-orthodox society in its various branches and trying to decipher its roots, codes, motives, goals, behavior and future. The growing interest in this group reflects its rapid demographic growth: from a small and negligible minority at the establishment of the State of Israel in 1948, the ultra-orthodox population is now estimated at about 10% to 15% of the Israeli population, a share that grows every year.
One of the first academic experts in Israel was Friedman, of the department of sociology and anthropology at Bar-Ilan University, with his now-classic 1991 book The Ultra-Orthodox Society: Sources, Trends and Processes [1]. Friedman has since had many followers, who have used various research methods, mainly qualitative ones, to study various aspects of this unique society, which differs in many respects from the general population in Israel but cannot be ignored, given its growing political, economic and sociological influence.
To mention at least some of the studies that followed: Nurit Stadler published in 2008 her book Yeshiva Fundamentalism: Piety, Gender and Resistance in the Ultra-Orthodox World, in which she analyzes the reconstruction of masculinity in the fundamentalist world in response to the challenges of modernity. The book addresses these questions through an investigation of the redefinition of the family, work, the army and civil society in the ultra-orthodox yeshiva world in Israel [24]. In her 2012 book A Well-Worn Tallis for a New Ceremony she explores new aspects of voluntarism, citizenship, family life and the concept of freedom in ultra-orthodox culture today [25].
Hakak’s 2012 book (volume 19 in the series “Jewish Identities in a Changing World”), Young Men in Israeli Haredi Yeshiva Education: The Scholars’ Enclave in Unrest, builds on earlier papers he had published dealing with various aspects of the Israeli ultra-orthodox community [20]. Foscarini published in 2014 a long and thorough study of the ultra-orthodox community in Israel, arguing that the secular education and vocational training that ultra-orthodox Jewish women receive in order to obtain a decent job in the labor market have been a source of emancipation and modernization in their community [21].
In a recent (2015) article, Adam Ferziger reviews in detail earlier studies of the burgeoning egalitarian trends featured in the new roles for women within liberal Jewish denominations, among the Modern Orthodox, and among the ultra-orthodox. Ferziger builds upon and adds to these studies by offering an initial portrayal
Voluntary Information and Knowledge “Hiding” in a Conservative Community 857
and analysis of a relatively new phenomenon: the American female non-hasidic haredi outreach activist. He does so by locating these figures within overall trends of American ultra-orthodox Jewry, as well as in relation to the broader phenomenon of Orthodox feminism [22].
4 Methods
In the present study we made heavy use of the tools and techniques of qualitative research. Several ultra-orthodox newspapers, radio stations, public announcements and other media, as well as adult and children’s books, were reviewed. In addition, members of the community were interviewed, and various public announcements by religious leaders were analyzed using qualitative research techniques, including in-depth content analysis.
It is widely known that all forms of media, including adult and children’s literature, are not ideologically and culturally neutral; they strongly reflect the various biases of the people who created them. Consequently, the spiritual leaders of ultra-orthodox society (the “Rabbis”) insist that both adults and youngsters consume only media and literature that uphold, promote and are consistent with their values and lifestyle. These leaders strongly oppose, on ideological grounds, any use of the so-called ‘secular’ media, both printed and electronic, including children’s books.
The Internet in particular is perceived as a very serious spiritual threat and is the object of heated condemnation and total rejection. Prominent rabbinical figures from almost all ultra-orthodox circles constantly warn, in mass public gatherings as well as in all kinds of their media, of the spiritual hazards of the Internet. In public rallies the Internet is termed “today’s major threat to the souls of the young generation” [4, 5, 11–16].
Special newly written prayers against the use of the new IT are disseminated in synagogues and congregations. The following is one such prayer, said to be recommended by chief spiritual leaders (such as Rabbis Hayyim Kanievsky, Moshe Tsedaka and David Abu-Hatsira): “Lord of the universe, you know very well that
terrible dangers are hidden in most of the technological tools which are common in our
generation, and have 'killed' many. And several men, women and youngsters, boys and girls, who had been God-fearing, have strongly deteriorated due to these tools, God forbid. We beg you to help them … to fully repent soon. Have mercy on us for your holy sake, to help us as well as all your beloved children (especially to…) to keep a great distance from all the non-Kosher tools and to earn our living only by the permitted means …." Another slogan says: "Thank God we have no email and no site on the Internet (Inter-sin…), let its name be erased" [a privately owned document].
Admittedly, such harsh admonitions may attest to the contrary; perhaps the rabbinical leadership does not expect total compliance with its prohibitions and fears the consequences of dissent. After all, obedience is voluntary, and the rabbis lack tangible means of enforcement besides social and community pressure [3, 9].
As a result, at a later phase of their lives, when these boys marry and have to find a good high-income job, their options are very limited, and many of them are forced to take only low-salary jobs that do not require a higher-education background. This is a loss for them personally as well as for the whole country at the macro level. No doubt, the absence of core subjects (English, math, etc.) from the school curriculum clearly damages the chances of ultra-orthodox graduates to earn a decent salary once they leave the 'Kollel' and go out into the labor market [20].
A recent Facebook post by David Uman, a former ultra-orthodox 'Kollel' member, now recently divorced, published in the weekly modern-orthodox magazine 'Be-Sheva' (Sep. 26, 2019, p. 32) [6], reads: "This is what I told the rabbinical judges of the family court this morning, at the end of the session dealing with the monthly sum of money I have to pay to my ex-wife: You have explained very well how much I have to
pay as a father responsible for his daughters, and why 2000 Shekels (about $550) are not enough, and even though I don't have a profession, I should have one. You are 100% right. It's my responsibility and I'll do my best to live up to it. However, I have only one question for you: Now you come?! All my life I was told that I don't need a profession, that I can stay in the 'Kollel', bring home 1500 Shekels (about $400) for 10 children, and everything will be OK. And now you are telling me that in fact a father should bring 1400 Shekels for each child?! You, as judges who are familiar with the problem, are responsible for changing it, for preparing the boys long in advance for a situation in which they will have to provide a living for their future children, before they reach an impossible, critical situation" [6].
Mr. Uman, 28, was raised in a typical ultra-orthodox family, passing through all the stages of ultra-orthodox education. After marrying and having children, striving for a decent living, he tried to establish a clothing business together with his wife, but failed and fell into greater debt. He then tried to find a job, but could not find one with a decent salary due to his lack of education and training, and thus deteriorated to the situation described in his post [6].
Interviewed later by the abovementioned magazine, Mr. Uman added: "They keep telling us that we must learn in the 'Kollel' as much as possible, year after year. Core studies are strictly forbidden in both elementary and high school, and thus comes the day when a guy has 5 children and realizes that his wife's salary together with the modest allowance from the 'Kollel' are not enough. But then it's too late. He has already lost the chance to go to the labor market to earn the money he needs… The ultra-orthodox community has developed the model of helping their children by buying an apartment for the young couple. It helps the young couple at the beginning and makes them happy. Soon, however, they realize that they, in turn, will have to provide 400,000 Shekels (more than $100,000!) to each of their eight children, while they get from the 'Kollel' the 1500-Shekel allowance" [6].
The dissonance presented by Uman's story appears in a delicate and complicated form in the story of Avigdor Feldman, his classmate. He, too, encountered difficulty in covering his family's living costs on the modest allowance of a 'Kollel' member, but chose a different way of coping with it. After failing the entrance examination for academic studies five (!) times, he did not give up, finally completed a Bachelor's degree at the Open University, and will soon start studying for his Master's degree. Meanwhile, he started a huge start-up project of searching the internet for jobs for ultra-orthodox people.
Voluntary Information and Knowledge “Hiding” in a Conservative Community 861
He also started a project of teaching English among ultra-orthodox youngsters. He cites data showing that many ultra-orthodox students who
want to acquire a secular education fail due to lack of English knowledge. There is demand for English study among parents who want to enable their children a decent living. He manages to walk "between the drops": his projects are part of an unwritten and unofficial consensus between the leadership and its rank and file. The spiritual leadership is aware of his projects, which are publicized in the ultra-orthodox newspapers, but it avoids any hint that might show that it supports them. Knowing the ultra-orthodox community, Feldman knows how to conduct himself in a way that will not attract 'fire'. "I'll never attack," he says, "and never start a controversy with a rabbi concerning studying English. When parents demand it, we provide it, and the rabbis are usually aware and agree in silence. If a local rabbi or another spiritual leader publicly objects, we stay away immediately, avoiding head-to-head confrontation. This is the way the change takes place, and it occurs in great numbers" ('Be-Sheva' weekly magazine, Sep. 26, 2019, p. 32) [6].
It is worth noting that the question of introducing core subjects into the ultra-orthodox school curriculum has been very controversial, hotly debated in the Israeli political arena. Some political parties, mainly from the 'left-center' of the political map, have demanded that ultra-orthodox schools be compelled to teach core subjects if they want to receive government money. These parties have set this as a necessary condition for joining a coalition and forming a government. The ultra-orthodox parties, on the other hand, still sharply object, striving to keep their school system independent curriculum-wise, for ideological reasons. They fear that introducing the core subjects will interfere with their boys' Torah and Talmud studies and will attract some of them to academic studies [1, 19].
Another obstacle preventing ultra-orthodox adults from pursuing academic studies has been the issue of 'gender separation' in classes. Most ultra-orthodox adults prefer to study in classes separated between males and females, as they were accustomed to in their own separate educational system. As a matter of fact, most universities have opened separate tracks especially tailored for the ultra-orthodox community. This step was greatly encouraged and financially supported by various government agencies and branches, in order to help members of the ultra-orthodox community enter the labor market, earn a decent living, increase the labor force and strengthen the country's economy.
Following are two examples, out of many, of the ongoing efforts to teach ultra-orthodox adults at least vocational skills, to enable them a decent standard of living. A bulletin board in an ultra-orthodox neighborhood presents an advertisement for an evening class teaching plaster construction. Another ad announces the opening of a weekly evening class in repairing small home electric appliances. No prior knowledge is required, and graduates will receive a certificate and assistance in job placement. Moreover, it states that the class is run by "ultra-orthodox management under rabbinical supervision".
862 M. Yitzhaki
5 Conclusions
1. Being a religious-cultural minority, the ultra-orthodox community in Israel seeks to conserve a specific sub-culture that advocates maximal separation from general secular culture and adherence to tradition, retaining its youngsters within the community.
2. This has significant implications for the use of IT, which is heavily utilized in daily life in Israel, but within very clear-cut limits, such as various degrees of filtering and partial access to the Internet and other forms of IT.
3. To be sure, these practical solutions block to a great extent the free flow of information and knowledge in the ultra-orthodox community and have adverse implications both nationally and individually. However, the ultra-orthodox community has been ready to pay this price in order to conserve its old traditions.
References
1. Friedman, M.: The ultra-orthodox society: sources, trends and processes. The Jerusalem
Institute for the study of Israel, Jerusalem (1991)
2. Hovav, L.: Ultra-orthodox children literature – realistic or didactic? Sifrut Yeladim ve-Noar
20(3–4), 20–35 (1994). (in Hebrew)
3. Kaplan, K., Shtadler, N.: Leadership and authority in the ultra-orthodox society in Israel: challenges and alternatives. Van-Leer Institute, Jerusalem (2009). (in Hebrew)
4. Karmi-Laniado, M.: Ideologies and their reflections in children literature. Dvir, Tel-Aviv
(1983). (in Hebrew)
5. Regev, M.: Children literature – reflections of society, ideology and values in Israeli children
literature. Ofir, Tel-Aviv (2002). (in Hebrew)
6. Rotenberg, Y.: A remedy to the lack of a livelihood. ‘Be-Sheva’, 26 September 2019, p. 32 (2019).
(in Hebrew)
7. Segev, Y.: Literature as creating and reflecting a narrative: the national-religious children
literature as a test-case. Talelei Orot; Annual of Orot Yisrael College, vol. 15, pp. 229–242
(2009). (in Hebrew)
8. Segev, Y.: Themes and trends in the ultra-orthodox children literature from the 1990’s on.
Oreshet, 4, 327–355. (in Hebrew)
9. Stadler, N., Lomsky, F.E., Ben Ari, E.: Fundamentalist citizenships: the Haredi challenge.
In: Ben-Porat, G., Turner, B.S. (eds.) The Contradictions of Israeli Citizenship: Land,
Religion, and State. Routledge, Milton Park (2011)
10. Tsweek, G.: The state misses the ultra-orthodox 8200. Yisrael Hayom [daily newspaper], 5
July 2019, pp. 12–13 (2019)
11. Yafe, O.: Psychological aspects of Ultra-orthodox children literature: child and self concepts.
Megamot 41(1–2), 10–19 (2001). (in Hebrew)
12. Yitzhaki, M.: Censorship in Israeli high school libraries. Yad la-Kore 31, 20–33 (1998). (in
Hebrew)
13. Yitzhaki, M.: Censorship in high school libraries in Israel: an exploratory field study. In:
Education for All; Culture, Reading and Information - Selected Papers from the Proceedings
of the 27th Annual Conference of the International Association of School Librarianship, pp. 265–275. IASL, Ramat-Gan (1998)
14. Yitzhaki, M.: Censorship in high school libraries in Israel: the role of the sectorial affiliation
factor; an extended nation-wide field study. In: Hughes, P., Selby, L. (eds.) Inspiring
Connections: Learning, Libraries and Literacy; Selected Papers from the Fifth International
Forum on Research in School Librarianship, pp. 231–247. IASL, Seattle (2001)
15. Yitzhaki, M.: The Internet as viewed by school librarians in Israel: conceptions, attitudes,
use and censorship. Yad la-Kore 35, 56–73 (2003). (in Hebrew)
16. Yitzhaki, M., Sharabi, Y.: Censorship in Israeli high school libraries; analysis of complaints
and librarians’ reactions. In: Lee, A.O.S. (ed.) Information Leadership in a Culture of
Change; Selected Papers from the Ninth International Forum on Research in School
Librarianship, pp. 183–202. IASL, Hong Kong (2005)
17. Yitzhaki, M.: Free flow of information and knowledge and use of IT in a conservative
community: the case of the ultra-orthodox in Israel. In: Proceedings of Informing Science &
IT Education Conference (In SITE) 2016, pp. 159–164 (2016). http://www.
informingscience.org/Publications/3503. Accessed 3 Nov 2019
18. Zicherman, H., Kahaner, L.: Modern ultra-orthodox – a Hareidi middle class in Israel. Israel Democracy Institute, Jerusalem (2012). (in Hebrew)
19. Zicherman, H.: Black blue-white: a journey into the ultra-orthodox society in Israel, Tel-
Aviv, 360 p. (2014). (in Hebrew)
20. Hakak, Y., Rapoport, T.: Equality or excellence in the name of God? The case of ultra-
orthodox enclave education in Israel. J. Relig. 92(2), 251–276 (2011)
21. Foscarini, G.: Ultra-orthodox Jewish Women go to work: secular education and vocational
training as sources of emancipation and modernization. Annali di Ca’Foscari 50, 53–74
(2014)
22. Ferziger, A.S.: Beyond Bais Ya’akov: orthodox outreach and the emergence of Hareidi
women as religious leaders. J. Mod. Jewish Stud. 14(1), 140–159 (2015)
23. Friedman, Y.: Shtaygen. Benei-Brak (2006)
24. Stadler, N.: Yeshiva Fundamentalism: Piety, Gender and Resistance in the Ultra-Orthodox World. NYU Press, New York (2008)
25. Stadler, N.: A Well-Worn Tallis for a New Ceremony. Academic Studies Press, Brighton
(2012)
Emoji Prediction: A Transfer
Learning Approach
Abstract. We present a transfer learning model for the Emoji Prediction task described in SemEval-2018 Task 2. Given the text of a tweet, the task aims to predict the most likely emoji to be used within that tweet. The proposed method uses a pre-training and fine-tuning strategy, which applies knowledge pre-learned from several upstream tasks to the downstream Emoji Prediction task, solving the data-scarcity issue suffered by most of the SemEval-2018 participants using supervised learning strategies. Our transfer learning-based model outperforms the state-of-the-art system (the best performer at SemEval-2018) by 2.53% in macro F-score. Apart from providing details of our system, this paper also intends to provide a comparison between supervised learning models and transfer learning models in solving the Emoji Prediction task.
1 Introduction
Emojis are graphic symbols that represent ideas or concepts, used in electronic messages and web pages [1]. Currently, they are widely adopted by almost all social media services and instant messaging platforms. However, understanding the meanings of emojis is not straightforward, i.e. people sometimes have multiple interpretations of emojis beyond the designer's intent or the physical object they evoke [2]. For example, the folded-hands emoji is intended to mean "pray", but it is misused as "high five" on many occasions. A misunderstanding of emojis can reverse the meanings of sentences and mislead people. Therefore, effectively predicting emojis from text is an important step towards understanding content, especially for emoji-enriched social media messages.
SemEval-2018 Task 2 [3] introduced an Emoji Prediction task. Given a text message including an emoji, the task consists of predicting that emoji based exclusively on the textual content of that message. Specifically, the messages are selected from Twitter data, and it is assumed that only one emoji occurs inside each tweet. Figure 1 illustrates an example of a tweet message with an emoji at the end.
2 Model Description
The main structure of the proposed model is illustrated in Fig. 2. Part (a) shows the basic architecture and the pre-training step of BERT. Ei and Ti are the token embedding and the final hidden state corresponding to the input token Tok i. In particular, a special classification token ([CLS]) is always added to the beginning of the input sequence, and its corresponding hidden state C is used as the aggregate sequence representation for classification tasks. A separator token ([SEP]) is used to separate the input sentences. 12 to 24 layers of bidirectional Transformers are used as the representation layers to learn token representations (Ti) from token embeddings (Ei). For more details about the architecture and the pre-training procedure of BERT, please refer to the original paper [12].
2.1 Preprocessing
Generally speaking, preprocessing is a fundamental step for many NLP systems. It directly affects the performance of the follow-up components in the system. In particular, social media message processing is challenging, since there is large variation in the vocabulary and expressions used in such data.
We utilized the ekphrasis tool [14] as the text preprocessor. It can perform tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction for Twitter data. Table 1 shows an example of a tweet processed by ekphrasis. There are several benefits to using the preprocessed tweets as inputs. For example, ekphrasis can recognize dates (e.g. May 8, 1989) and replace them with labels (<date>). This helps the system reduce the vocabulary size without losing too much information.
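To illustrate the label-substitution idea, the following is a minimal sketch in plain regular expressions; the pattern set and label names are simplified stand-ins for what ekphrasis actually provides (the real tool also performs tokenization, hashtag segmentation and spell correction trained on Twitter data).

```python
import re

# Simplified stand-in for ekphrasis-style normalization: replace surface
# forms with category labels to shrink the vocabulary. Order matters --
# the date pattern must run before the generic number pattern.
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|"
         r"Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|"
         r"Nov(?:ember)?|Dec(?:ember)?)")
PATTERNS = [
    (re.compile(r"https?://\S+"), "<url>"),
    (re.compile(r"\S+@\S+\.\S+"), "<email>"),
    (re.compile(MONTH + r" \d{1,2},? \d{4}"), "<date>"),
    (re.compile(r"@\w+"), "<user>"),
    (re.compile(r"\d+"), "<number>"),
]

def normalize(tweet: str) -> str:
    """Replace URLs, emails, dates, mentions and numbers with labels."""
    for pattern, label in PATTERNS:
        tweet = pattern.sub(label, tweet)
    return tweet
```

For instance, `normalize("Born May 8, 1989 via http://t.co/x")` yields `"Born <date> via <url>"`, matching the behavior described above.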
Structure: The structure of the customized BERT is illustrated in Part (b) of Fig. 2. It is similar to the structure of the original BERT model, except that it accepts single sentences as input. A classification layer is built on top of the sequence representation C to generate class labels for the emojis. The parameters within this model are pre-learned in the pre-training step.
Input Sequence Representation: The preprocessed tweets need to be transformed into the BERT input format (BERT input embeddings) before being sent to the fine-tuning step. The BERT input embeddings consist of three embeddings corresponding to the input tokens: (1) pre-learned Token Embeddings; (2) Segment Embeddings, indicating whether the sentence belongs to the first sentence A or the second sentence B; and (3) Position Embeddings, indicating the positions of the tokens in the sentence. The final input representation is constructed by summing the corresponding token, segment and position embeddings. The visualization of the BERT input representation of the sentence "the dog is hairy" can be seen in Fig. 3.
868 L. Zhang et al.
Fig. 3. The representation of sentence “the dog is hairy” in BERT input format.
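The element-wise summation can be sketched as follows; the tiny vocabulary, hidden size, and randomly initialized tables are illustrative stand-ins for BERT's pre-trained weights, not the real model.

```python
import random

# Toy dimensions; real BERT uses a hidden size of 768 or 1024 and a
# ~30K-token WordPiece vocabulary.
HIDDEN = 8
VOCAB = {"[CLS]": 0, "the": 1, "dog": 2, "is": 3, "hairy": 4, "[SEP]": 5}

random.seed(0)
def rand_table(rows):
    return [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]

token_emb = rand_table(len(VOCAB))  # stand-in for pre-learned token embeddings
segment_emb = rand_table(2)         # sentence A vs. sentence B
position_emb = rand_table(16)       # one vector per token position

def bert_input(tokens, segment=0):
    """Element-wise sum of token, segment and position embeddings."""
    return [[t + s + p
             for t, s, p in zip(token_emb[VOCAB[tok]],
                                segment_emb[segment],
                                position_emb[pos])]
            for pos, tok in enumerate(tokens)]

reps = bert_input(["[CLS]", "the", "dog", "is", "hairy", "[SEP]"])
```

Each of the six tokens thus receives one summed vector, which is what the representation layers consume.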
Fine-tuning: To customize the model for the Emoji Prediction task, we fine-tuned the model parameters with the training data provided by SemEval-2018 Task 2. Cross-entropy loss is used as the objective function for the fine-tuning procedure, which is calculated as follows:
Loss = -\sum_{i=1}^{n} \sum_{j=1}^{20} y_i^j \log P_i^j    (1)
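A direct reading of Eq. (1) in code, for n tweets and the task's 20 emoji classes (variable names are ours):

```python
import math

def cross_entropy_loss(y_true, y_pred):
    """Eq. (1): Loss = -sum_i sum_j y_i^j * log P_i^j.

    y_true[i][j] is 1 if tweet i is labeled with emoji class j, else 0;
    y_pred[i][j] is the model's predicted probability for that class.
    """
    loss = 0.0
    for yi, pi in zip(y_true, y_pred):
        for yij, pij in zip(yi, pi):
            if yij:  # only the gold class contributes to the sum
                loss -= yij * math.log(pij)
    return loss
```

For a single tweet whose gold emoji receives probability 0.5, the loss is -log 0.5, about 0.693; it falls to 0 as that probability approaches 1.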
In this section, we introduce the corpus, present the experimental results, and compare our system with other learning models in the literature.
3.1 Corpus
SemEval-2018 Task 2 provided a corpus for Multilingual Emoji Prediction with roughly 500K training instances and 50K test instances in the English track. It collected tweets that include one of the twenty emojis occurring most frequently in Twitter data. The relative frequency of each emoji in the train and test sets is shown in Table 3.
Table 4. Comparison of the participating systems with our system by precision, recall and
macro F-score in the test set of SemEval-2018 task 2 English track.
Team Approach F-score Prec. Recall
Ours Pre-trained Model with BERT 38.52 40.64 41.76
Tubingen-Oslo SVMs, RNNs 35.99 36.55 36.22
NTUA-SLP RNNs 35.36 34.53 38.00
EmoNLP RNNs 33.67 39.43 33.70
UMDuluth-CS8761 SVMs 31.83 39.80 31.37
BASELINE Pre-trained Classifier with FastText [15] 30.98 30.34 33.00
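Macro F-score, the metric reported in Table 4, averages per-class F1 with equal weight for every emoji class regardless of its frequency. A minimal sketch of the computation (our own helper, not the official SemEval scorer):

```python
def macro_f1(gold, pred, n_classes=20):
    """Macro-averaged F-score over class labels 0..n_classes-1."""
    f1s = []
    for c in range(n_classes):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        # Per-class F1 is the harmonic mean of precision and recall.
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / n_classes
```

Because every class counts equally, a system must predict rare emojis well, not just the dominant ones, to score highly on this metric.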
Fig. 4. Learning curve of our model against the training set (1000 instances).
• The pre-learned knowledge may contain extra syntactic and semantic information, e.g. relations between sentences and deep bidirectional representations of a sequence, that cannot be learned directly from the training data of the downstream task.
We also observe that the size of the fine-tuning set significantly affects the quality of the fine-tuned model in both accuracy (shown in Fig. 4) and speed. According to our experiments, the speed of the fine-tuning procedure is roughly 10K tweets/min. It takes approximately 55 min to complete one epoch of fine-tuning, since there are 500K tweets in the fine-tuning set.
4 Conclusion
References
1. Cappallo, S., Mensink, T., Snoek, C.G.: Image2emoji: zero-shot emoji prediction for visual
media. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1311–
1314. ACM (2015)
2. Barbieri, F., Ballesteros, M., Saggion, H.: Are emojis predictable? In: Proceedings of the
15th Conference of the European Chapter of the Association for Computational Linguistics:
Volume 2, Short Papers, pp. 105–111. Association for Computational Linguistics, Valencia,
Spain (2017)
3. Barbieri, F., Camacho-Collados, J., Ronzano, F., Anke, L.E., Ballesteros, M., Basile, V.,
Saggion, H.: SemEval-2018 Task 2: multilingual emoji prediction. In: Proceedings of the
12th International Workshop on Semantic Evaluation (SemEval-2018). Association for
Computational Linguistics, New Orleans, LA, United States (2018)
4. Coltekin, C., Rama, T.: Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in emoji prediction. In: Proceedings of the 12th International Workshop on Semantic
Evaluation, pp. 32–36. Association for Computational Linguistics, New Orleans, LA, United
States (2018)
5. Beaulieu, J., Owusu, D.A.: UMDuluth-CS8761 at SemEval-2018 Task 2: Emojis: too many choices? In: Proceedings of the 12th International Workshop on Semantic Evaluation,
pp. 32–36. Association for Computational Linguistics, New Orleans, LA, United States
(2018)
6. Baziotis, C., Athanasiou, N., Chronopoulou, A., Kolovou, A., Paraskevopoulos, G., Ellinas, N., Narayanan, S., Potamianos, A.: NTUA-SLP at SemEval-2018 Task 1: predicting affective content in tweets with deep attentive RNNs and transfer learning. arXiv preprint arXiv:1804.06658 (2018)
7. Liu, M.: EmoNLP at SemEval-2018 Task 2: English emoji prediction with gradient boosting
regression tree method and bidirectional LSTM. In: Proceedings of the 12th International
Workshop on Semantic Evaluation, pp. 32–36. Association for Computational Linguistics,
New Orleans, LA, United States (2018)
8. Coltekin, C., Rama, T.: Discriminating similar languages with linear SVMs and neural
networks. In: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties
and Dialects, pp. 15–24. Osaka, Japan (2016)
9. Medvedeva, M., Kroon, M., Plank, B.: When sparse traditional models outperform dense
neural networks: the curious case of discriminating between similar languages. In:
Proceedings of the Fourth Workshop on NLP for similar language, Varieties and Dialects,
pp. 156–163. Association for Computational Linguistics, Valencia, Spain (2017)
10. West, J., Venture, D., Warnick, S.: Spring research presentation: a theoretical foundation for
inductive transfer. Brigham Young Univ. Coll. Phys. Math. Sci. 1, 32 (2007)
11. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018)
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
13. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: generalized
autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237
(2019)
14. Baziotis, C., Pelekis, N., Doulkeridis, C.: Datastories at semeval-2017 task 4: deep lstm with
attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th
International Workshop on Semantic Evaluation (SemEval-2017), pp. 747–754. Vancouver,
Canada (2017)
15. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)
On the Emerging Area of Biocybersecurity
and Relevant Considerations
Abstract. Biocybersecurity is a novel space for the 21st century that meets our innovations in biotechnology and computing head on. Within this space, many issues are open for, and demand, consideration as groups endeavor to develop products and policies that adequately ensure asset management and protection. Herein, a simplified and brief exploration is given, followed by some surface discussion of impacts. These impacts concern end users, ethical and legal considerations, international proceedings, business, and limitations. It is hoped that this will be helpful in future considerations of biocybersecurity policy development and implementation.
1 Introduction
Biocybersecurity presents a new way of exploring how we protect our societies. It can be thought of in part as an extension of cybersecurity, which involves the protection of systems, made of hardware and software, from unauthorized access and attacks. Biocybersecurity is alternatively referred to as Cyberbiosecurity, which, according to the Peccoud Lab, exists at the intersection of cybersecurity, cyber-physical security, and biosecurity, and focuses on mitigating risks within and relating to their intersections [1, 2]. A growing need exists for expertise in this field as we persist in a world where computer systems and biotechnology are increasingly ingrained in day-to-day life, in both developed and developing economies. Furthermore, strong lines eventually need to be drawn when determining where biocybersecurity and other fields end, in order to allot resources adequately and to focus on mapping vulnerabilities and preventing exploits that may occur and evolve [1]. To this end, we have provisionally defined biocybersecurity thus: any cybersecurity system where a biological component, target, or interlock is involved in the terminal or intermediate stages.
The first potential hurdle: is there a clear and present need for the development of an entirely new field of study? Let us first turn to the current ease of obtaining biological data. Whereas sequencing DNA used to be a rigorous process, it has gotten easier [3]. In fact, the Human Genome Project was projected to last 15 years and was completed in 13, demonstrating that the speed of computing and insights into the structure of life have made it easier than ever to obtain, disseminate, and utilize biological data [3]. Secondly, such data are easier than ever to process and access. The physical barrier to acquiring healthcare data has been demolished in the name of ease of access and digital patient-centered care. So, while the hardware to crack conventional cybersecurity barriers is more prevalent, the safeguards of that data have been chipped away. Thirdly, new computational platforms question the very separation of biology and computing, leading to a more tightly integrated biocybersecurity process [3, 4]. Some rising platforms even call into question the contemporary understanding of typical cybersecurity processes and could circumvent typical security measures, at the cost of creating entirely new (and unforeseen) problems arising from biological matter and the application of medicine being used to perform computations [1, 4]. An entirely new group of subdisciplines may be needed to understand the unknown complications that arise from the use of these platforms [5].
All these features of modern healthcare and technology combine to mean that biological data can be applied in more ways than before. For instance, implicit authentication using biological data is now possible with COTS (Commercial, Off-The-Shelf) components. A smartwatch can access multiple kinds of information, including heart rate data and more [6]. An RHR (resting heart rate) monitor can serve as a simple mode of implicit authentication, matching a set of recorded data to a user accessing a work terminal. However, this means that said data would, somehow, be accessible to other, perhaps nefarious, individuals.
As touched on above, part of the rise of potential threats in the field of biocybersecurity is the demand for easier and faster access to data. Patients desire faster, more convenient access to medical records; medical research companies need larger and more comprehensive trials; and data sets must remain viable for the more technically demanding medical interventions currently in use [7–10]. Medical companies also need to increase their awareness of the potential of malware to compromise device outputs; for example, one research team recently demonstrated an algorithm that could modify CT scans to mislead sick patients into believing that they are healthy, and vice versa [11]. Inadequate defenses against such malware and inadequate protection of patient data could brew a maelstrom of related crises on the scale of ransomware outbreaks [11]. Devices and systems that demonstrate the confluence of biology and cybersecurity include thumbprint scanners, retina scanners, digitized healthcare records, forensics databases, DNA sequencing databases, and pharmacology records. All of these could be accessed and used for threats in the biocybersecurity domain. The growing demand for biological data can be seen, for example, in the rise of ancestry and health services based on DNA sequencing and interpretation.
For a more concrete example of what is meant by this definition of biocybersecurity, let us say that a user is at a computer using implicit authentication. The computer's security system tracks the eyes of the user, and if the saccadic rhythms change, the computer locks the user out. Under our definition, this would fall under biocybersecurity, as the system is using biological inputs or data as an intermediate step, in this case as preventative interlocks.
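To make the interlock idea concrete, here is a deliberately simple sketch of implicit authentication from heart-rate data; the class name, the z-score test, and the threshold are illustrative assumptions for exposition, not a vetted biometric protocol.

```python
from statistics import mean, stdev

class HeartRateInterlock:
    """Toy implicit-authentication interlock: flag a session when live
    heart-rate readings drift too far from the enrolled user's baseline."""

    def __init__(self, baseline_bpm, z_threshold=3.0):
        self.mu = mean(baseline_bpm)       # enrolled resting mean
        self.sigma = stdev(baseline_bpm)   # enrolled variability
        self.z_threshold = z_threshold

    def still_authenticated(self, window_bpm):
        """True if the live window's mean lies within the z-score threshold."""
        z = abs(mean(window_bpm) - self.mu) / self.sigma
        return z <= self.z_threshold
```

Enrolled on readings around 60 bpm, a live window near that baseline keeps the session open, while a wildly different one would trigger the lock; a real deployment would also have to protect the stored baseline itself, which is exactly the exposure discussed above.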
enhancement of humans and their technology in often unique and beneficial ways [13,
14]. In some stories, innovations within such literature often arise from lack the of
oversight, or reduced confidence in the ability of the government to adequately assuage
needs of a growingly frantic populace in the face of ever-growing technological reli-
ance. Cyberpunk culture has also contributed to the growth of cultures like that of the
Maker-movement, which is composed of individuals that are often resisting traditional,
institutional control of technologies while self-policing [13–15]. With respect to
biology, some of them are addressing prosthetics, implantable electronics, gene editing
and protein engineering, and bio-adjacent biotechnologies with wide appeal and
potential to correct for deficiencies in their communities [14, 15]. An easy case to
consider is that of the failure of the US government to control for drug prices, leading
to some groups to take matters in their own hands to make them and analogues
themselves such as some community bio labs that have met, with increasing success at
mobilizing the community [13–15]. Less benign efforts have just been to engineer other
means of producing food, whereas others may aim to re-write some parts of life itself
through the possibility of creating synthetic organisms [13–16]. Some of these groups
may or may not apply for government funding and instead pursue their own path to
innovation through private or self-funded measures. Examples can be seen among a
few groups in the Community Bio movement which arose out of the Maker Movement,
in which groups of people have been inspired to pursue these research areas and more
through a mix of traditional and non-traditional cooperation with mixed success [17–
19]. Many of these successes resulted in the creation of start-ups that deal in large
amounts of tracked biometric data or material, whether for improving health outcomes
or expanding functions [14–19]. Much of what is thought of as cyberpunk in science
fiction has reached reality, and this implies that the time to think ahead regarding the
protection of biometric data is now. There’s little reason to suggest that these projects
won’t become even more complex. Overall, there is much to consider socially as we
pursue cyberbiotechnological policies that address our increasing reliance on, and
potential overexposure to, technology. Given that the technology already exists in
large amounts, and that data is already being generated in volumes and at rates that
have the potential to stoke negative public action, we need to bolster our focus on the
further social implications of such technology. Failing to do so could undo many
societal gains within technologically advanced nations.
jurisdiction to which said resources and/or their owners belong, to protect commercial
and academic chains of value. The Genetic Information Nondiscrimination Act, passed
in 2008, aims to protect people from DNA-based discrimination by health insurers and
employers. A core weakness is that life and disability insurance, as well as long-term
care plans, are not covered, leaving people at the mercy of state laws [2, 21]. The Dual
Use Research of Concern Policy, implemented in 2012, outlines policies to regulate life
sciences research that could be double-edged. Research that pursues overly risky or
otherwise unethical methods faces defunding and additional potential penalties [2, 22].
Lastly, the Health Insurance
Portability and Accountability Act, originally enacted in 1996, outlined a means of
protecting citizen medical data while making it available to health professionals in
order to allow adequate, if not superior, care [2, 23]. This is increasingly important in
an industry where data-driven care is used to deliver more informed treatment: health
professionals can learn of complications and patient differences, allowing for more
personalized and accurate care of an individual and reducing confusion. What can be
taken from the existence of these policies is that, in the context of cybersecurity,
biocybersecurity policies will need to be flexible, grounded in consent, and oriented
toward the benefit of those whose information is at risk. Policies not taking this into
account are likely to face considerable legal action.
make this example concrete, let’s say Nation A is particularly susceptible to type 2
diabetes mellitus, and Nation B is a producer of fructose. If Nation B increases
marketing, increases supply, and seeks trade deals to increase the uptake of fructose in
Nation A, does this count as a kind of stochastic act of war? This and similar questions
form the landscape defining the impact of biocybersecurity on international security.
2.7 Conclusion
There is little rationale for denying the currently robust, and potentially explosive,
growth of biocybersecurity as a field of thought, research, and action. The dangers
presented in this paper, as well as the complications of our world, are clear. The
question is not what to do if these problems occur, but what we do to prevent,
ameliorate, and treat them.
880 X.-L. Palmer et al.
References
1. Murch, R.S., So, W.K., Buchholz, W.G., Raman, S., Peccoud, J.: Cyberbiosecurity: an
emerging new discipline to help safeguard the bioeconomy. Front. Bioeng. Biotechnol. 6, 39
(2018). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5895716/
2. Dieuliis, D., Lutes, C.D., Giordano, J.: Biodata risks and synthetic biology: a critical
juncture. J. Bioterrorism Biodefense, 09(01) (2018). https://doi.org/10.4172/2157-2526.
1000159
3. Chial, H.: DNA sequencing technologies key to the human genome project. Nat. Educ. 1(1),
219 (2008). https://www.nature.com/scitable/topicpage/dna-sequencing-technologies-key-
to-the-human-828/
4. Trafton, A.: MIT News Office. Scientists program cells to remember and respond to series of
stimuli. http://news.mit.edu/2016/biological-circuit-cells-remember-respond-stimuli-0721.
Accessed 21 July 2016
5. Radanović, I., Likić, R.: Opportunities for use of blockchain technology in medicine. Appl.
Health Econ. Health Policy 16(5), 583–590 (2018). https://doi.org/10.1007/s40258-018-
0412-8
6. Arriba-Pérez, F.D., Caeiro-Rodríguez, M., Santos-Gago, J.: Collection and processing of
data from wrist wearable devices in heterogeneous and multiple-user scenarios. Sensors 16
(9), 1538 (2016). https://doi.org/10.3390/s16091538
7. Patient Demand for Patient-Driven Health Information (2018). https://catalyst.nejm.org/
patient-demand-for-patient-driven-health-information/
8. Providers Turn to Portals to Meet Patient Demand, Meaningful Use (2012). https://journal.
ahima.org/2012/08/23/providers-turn-to-portals-to-meet-patient-demand-meaningful-use/
9. Doi, S., Ide, H., Takeuchi, K., Fujita, S., Takabayashi, K.: Estimation and evaluation of
future demand and supply of healthcare services based on a patient access area model. Int.
J. Environ. Res. Pub. Health 14(11), 1367 (2017). https://doi.org/10.3390/ijerph14111367
10. Merrill, R.A.: Regulation of drugs and devices: an evolution. Health Aff. 13(3), 47–69
(1994). https://doi.org/10.1377/hlthaff.13.3.47
11. Mirsky, Y., Mahler, T., Shelef, I., Elovici, Y.: CT-GAN: Malicious Tampering of 3D
Medical Imagery using Deep Learning (2019). https://arxiv.org/abs/1901.03597
12. Pauwels, E., Denton, S.W.: The internet of bodies: life and death in the age of AI. Calif.
Western Law Rev. 55(1), 221 (2019). https://scholarlycommons.law.cwsl.edu/cgi/
viewcontent.cgi?article=1667
13. de Beer, J., Jain, V.: Inclusive innovation in biohacker spaces: the role of systems and
networks. Technol. Innov. Manage. Rev. 8(2) (2018). https://doi.org/10.22215/timreview/
1133
14. Parisi, L.: What can biotechnology do? Theory Cult. Soc. 26(4), 155–163 (2009). https://doi.
org/10.1177/0263276409104973
15. Wilbanks, R.: Real vegan cheese and the artistic critique of biotechnology. Engag. Sci.
Technol. Soc. 3, 180 (2017). https://doi.org/10.17351/ests2017.53
16. Agapakis, C.M.: Designing synthetic biology. ACS Synth. Biol. 3(3), 121–128 (2013).
https://doi.org/10.1021/sb4001068
17. Hyysalo, S., Kohtala, C., Helminen, P., Mäkinen, S., Miettinen, V., Muurinen, L.:
Collaborative futuring with and by makers. CoDesign 10(3–4), 209–228 (2014). https://doi.
org/10.1080/15710882.2014.983937
18. Landrain, T., Meyer, M., Perez, A.M., Sussan, R.: Do-it-yourself biology: Challenges and
promises for an open science and technology movement. Syst. Synth. Biol. 7(3), 115–126
(2013). https://doi.org/10.1007/s11693-013-9116-4
19. Gallegos, J.E., Boyer, C., Pauwels, E., Kaplan, W.A., Peccoud, J.: The open insulin project:
a case study for ‘Biohacked’ medicines. Trends Biotechnol. 36(12), 1211–1218 (2018).
https://doi.org/10.1016/j.tibtech.2018.07.009
20. About the Nagoya Protocol. https://www.cbd.int/abs/about/
21. Genetic Discrimination Fact Sheet (2008). https://www.genome.gov/10002328/genetic-
discrimination-fact-sheet/
22. United States Government Policy for Institutional (2015). https://www.phe.gov/s3/dualuse/
Documents/durc-policy.pdf
23. HHS Office of the Secretary, Office for Civil Rights, Ocr. Summary of the HIPAA Privacy
Rule (2013). https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.
html
24. Feng, Y.: Democracy, Political stability and economic growth. Br. J. Polit. Sci. 27(3), 391–
418 (1997). https://doi.org/10.1017/s0007123497000197
25. Price-Smith, A.T.: The Health of Nations: Infectious Disease, Environmental Change, and
Their Effects on National Security and Development. MIT Press, Cambridge (2002)
26. MHealth: New Horizons for Health through Mobile Technologies. World Health Organi-
zation (2011). https://www.who.int/ehealth/mhealth_summit.pdf
27. M. H. History of the Bureau of Diplomatic Security of the United States Department of State
[PDF]. United States Department of State Bureau of Diplomatic Security (2011). https://
www.state.gov/documents/organization/176589.pdf
28. Molteni, M.: Hackers Listen in on What Synthetic DNA Machines are Printing (2019).
https://www.wired.com/story/hackers-listen-synthetic-dna-machines/
The Impact of an Online Course of Inclusive
Physical Education on Teachers’ Skills
and Knowledge
1 Background
study conducted by teachers in the Arab sector in the north of the country, regarding
the ability to integrate students with disabilities in PE, a one-dimensional picture was
presented, showing low to medium values of self-efficacy (SE) [19]; the disability areas
with particularly low values were visual impairment and mobility impairment [18].
Therefore, it is worth noting that among PE teachers in Israel there is still a great lack
of knowledge of teaching methods adapted for children with disabilities, especially
children whose disabilities challenge the movement system, such as children with
severe visual impairments or children using a wheelchair.
1.2 Inclusion
Towards the beginning of the 21st century, the inclusion trend spread within educa-
tional systems worldwide [29]. Inclusion is an educational worldview that supports the
active participation of students with a variety of abilities in school culture [23]. The
inclusion trend is based on the right of every learner to quality education that enables
personal development and the realization of the child’s potential, while taking into
account the diversity of children’s environments and abilities, in order to promote
opportunities and reduce barriers to learning [1]. In order to achieve this educational
goal, the school system requires an effort that encourages teachers to develop high
expectations of all students and to ensure that the educational programs are tailored to
the child’s needs [41]. The inclusion trend is based on early conceptions of the Swedish
researcher Bengt Nirje [27], who together with Wolf Wolfensberger and other col-
leagues [42] developed the concept of normalization to establish the right of people
with disabilities to live in a normative environment alongside people with normative
behavior. The goal is to prevent the phenomenon of isolation in people with disabilities
who develop non-normative behaviors in the absence of normative models.
Some believe that the educational system does not provide enough PE for students
with disabilities, and that they must be provided with a supportive educational
environment that reduces barriers to participation, in line with the principles of
universal design, together with a set of unique adjustments according to the abilities
and needs of the child, defined as corrective teaching (Israeli Ministry of Education,
2015).
Inclusion in Physical Education. Professional perceptions that support the develop-
ment of teaching methods that are adapted to the inclusion of students with disabilities
were mentioned in the United States as early as the 1950s [38]. Further validation of the
requirement of inclusion of children with disabilities in PE and sport was provided by
the United Nations Education, Science and Culture Organization (UNESCO) [40],
which states that “it is needed to provide opportunities for participation in PE, physical
activity and sports for all people, especially children of school age, women and girls,
people with disabilities and local minorities in developed countries” [40]. Referring to
the international demand and the ongoing efforts to train teaching forces and devise
tailored teaching methods, the study presents a variety of evidence that demonstrates
success in acquiring skills and experiences within accepted frameworks. On the other
hand, evidence of negative experiences and social rejection in North America [3, 13],
in Europe [5], and in Israel [20] was also revealed.
884 N. Choresh and Y. Hutzler
Unfortunately, one of the main arguments of many teachers is the lack of adequate
training; this is also common in the United States [2, 15], where most states require a
union-based exam before teachers may tailor educational programs for students with
disabilities [22]. A survey of 129 universities sampled in 41 states characterized the
teacher training courses used in the United States [30]. The findings in this survey
indicated that in most training frameworks (69%), a single course is usually offered in a
face-to-face setting, and only one percent reported using an entirely online
environment.
Physical Education Teachers’ Self-efficacy Towards Inclusion. Self-efficacy is a
concept that Albert Bandura (1977) embedded in the social-cognitive theory of
learning, expressing the individual’s belief in his/her ability to successfully complete a
concrete task. This term is considered to be among the key variables that influence
motivation to perform a particular activity and maintain participation in the activity,
despite difficulties that may emerge during the course of time. Therefore, ensuring
teachers’ SE is a very important issue for promoting quality teaching (Bandura 1997).
Studies that have examined teachers’ perception of SE indicate that the teacher’s
professional ability is a significant asset that distinguishes teachers who will engage in
their work, know how to use effective strategies, and improve the performance of their
students from teachers who feel helpless [7, 44]. The SE of teacher educators toward
the inclusion of students with disabilities was first investigated through self-report
questionnaires, without the backing of a theoretical model [21], and later by constructing a
theoretical model that addresses students with different disabilities individually [4].
and only a minority under a teacher’s directive), it has been found that learning takes
place when different learning aids are used, and not only one type of aid [8]. In
addition, a positive relationship was also found between the use of different learning
aids and the educational level of children who studied in homeschooling or open
schools.
2 Purpose
The purpose of the study was to investigate the impact of an online course for inclusive
PE on teachers’ knowledge, skills, and perspectives about integrating children with
disabilities into regular PE classes.
3 Method
One hundred and ten PE teachers participated in five online groups, each registered to
an asynchronous online course on a Moodle platform. All the courses had the same
content: 14 units covering adapted physical education (APE) theory and philosophy,
adaptation principles, and disability-based and practice-related modules. The course
administrators were APE graduates and certified teachers who received on-the-job
training in online course administration. The training included weekly meetings where
issues were discussed before or after opening the units for learning.
4 Results
Findings indicated that teachers improved their capability for inclusion in general, as
well as in educating peers and in fitness and motor skills (3.6, 3.3, and 3.7,
respectively, on a scale of 1–5). Figure 1 shows the teachers’ perceived capability
average scores in various popular PE activities.
Fig. 1. Teachers’ scores on perceived self-efficacy or capability for including children with
disabilities in various sports activities.
Teachers reported that they gained their highest SE for the inclusion of children with
developmental coordination disorders (mean = 3.6), and the least gain was for the
inclusion of those with spinal cord injury (mean = 3.0). Figure 2 shows the teachers’
perceived SE average scores regarding the inclusion of children with various disabilities.
Fig. 2. Teachers’ scores on perceived SE or capability for inclusion of children with various
disabilities (ID, ASD, DCD, CP, SCI, VIS) in PE classes.
Findings indicate that all the teachers concluded the course with good knowledge and
skills for inclusion of children with various disabilities in a variety of sports activities.
We learned from the results that the course was fairly effective. Some areas, such as the
inclusion of children with a spinal cord injury and the training of peers to support
children with disabilities in class, need to be improved.
We suggest that training courses for applied APE include practical workshops in
addition to digital tools, in order to reach higher self-efficacy in general.
A further investigation could be a follow-up study of the teachers years after the
course, during which time they have probably taught a significant number of children
with disabilities, to determine whether they actually integrated them more effectively
in their classes.
References
1. Ainscow, M., Miles, S.: Making education for all inclusive: where next? Prospects 145(1),
15–34 (2008)
2. Ammah, J.O., Hodge, S.R.: Secondary physical education teachers’ beliefs and practices in
teaching students with severe disabilities: a descriptive analysis. High Sch. J. 89, 40–54
(2005)
3. Blinde, E.M., McCallister, S.G.: Listening to the voices of students with physical disabilities.
J. Phys. Educ. Recreat. Dance 69, 64–68 (1998)
4. Block, M., Hutzler, Y., Klavina, A., Barak, S.: Creation and validation of the situational-
specific self-efficacy scale. Adapt. Phys. Act. Q. 29, 184–205 (2013)
5. Bredahl, A.-M.: Sitting and watching the others being active: the experienced difficulties in
PE when having a disability. Adapt. Phys. Act. Q. 30, 40–58 (2013)
6. Chee, T.S., Divaharan, S., Tan, L., Mun, C.H.: Self-Directed Learning with ICT: Theory,
Practice and Assessment, pp. 1–65. Ministry of Education, Singapore (2011)
7. Dibapile, W.T.S.: A review of literature on teacher efficacy and classroom management.
J. Coll. Teach. Learn. 9(2), 79–92 (2012). http://trace.tennessee.edu/utk_educpubs/31.
Accessed 20 Sept 2018
8. Donkor, F.: The comparative instructional effectiveness of print-based and video-based
instructional materials for teaching practical skills at a distance. Int. Rev. Res. Open Distrib.
Learn. 11(1), 96–116 (2010)
9. Draus, P.J., Curran, M.J., Trempus, M.S.: The influence of instructor-generated video
content on student satisfaction with and engagement in asynchronous online classes.
J. Online Learn. Teach. 10(2), 240–254 (2014)
10. Fejgin, N., Talmor, R., Erlich, I.: Inclusion and burnout in physical education. Eur. Phys.
Educ. Rev. 11(1), 29–50 (2005)
11. Fullan, M., Langworthy, M.: Towards a New End: New Pedagogies for Deep Learning.
Collaborative Impact, Washington (2013)
12. Geri, N., Winer, A.: Patterns of online video lectures use and impact on student achievement.
In: Eshet-Alkalai, Y., Blau, I., Caspi, A., Geri, N., Kalman, Y., Silber-Varod, V. (eds.)
Proceedings of the 10th Chais Conference for the Study of Innovation and Learning
Technologies: Learning in the Technological Era, pp. 9E–15E. The Open University of
Israel, Raanana (2015). (in Hebrew)
13. Goodwin, D.L., Watkinson, E.J.: Inclusive physical education from the perspective of
students with physical disabilities. Adapt. Phys. Act. Q. 17, 144–160 (2000)
14. Hahn, E.: Video lectures help enhance online information literacy course. Ref. Serv. Rev. 40
(1), 49–60 (2012)
15. Hardin, B.: Physical education teachers’ reflections on preparation for inclusion. Phys. Educ.
62, 44–56 (2005)
16. Horizon Report. Higher Education Edition. NMC (2016)
17. Huffaker, D., Calvert, S.: The new science of learning: active learning, metacognition, and
transfer of knowledge in E-Learning applications. J. Educ. Comput. Res. 29(3), 325–334
(2003)
18. Hutzler, Y., Barak, S.: Self-efficacy of physical education teachers in including students with
cerebral palsy in their classes. Res. Dev. Disabil. 68, 52–65 (2017). https://doi.org/10.1016/j.
ridd.2017.07.005
19. Hutzler, Y., Shama, E.: Attitudes and self-efficacy of Arabic-speaking physical education
teachers in Israel toward including children with disabilities. Int. J. Soc. Sci. Stud. 5(10), 28–
42 (2017). https://doi.org/10.11114/ijsss.v5i10.2668
20. Hutzler, Y., Fliess, O., Chacham, A., van den Auweele, Y.: Perspectives of children with
physical disabilities on inclusion and empowerment: supporting and limiting factors. Adapt.
Phys. Act. Q. 19, 300–317 (2002)
21. Hutzler, Y., Zach, S., Gafni, O.: Physical education students’ attitudes and self-efficacy
towards the participation of children with special needs in regular classes. Eur. J. Spec.
Needs Educ. 20(3), 309–327 (2005)
22. Kelly, L.: Adapted Physical Education National Standards, 2nd edn. Human Kinetics,
Champaign (2006)
23. Kugelmass, J.W.: The Inclusive School: Sustaining Equity and Standards. Teachers College
Press, New York (2004)
24. Kwon, E.H., Block, M.E.: Implementing the adapted physical education E-learning program
into physical education teacher education program. Res. Dev. Disabil. 69(1), 18–29 (2017)
25. McLoughlin, C., Lee, M.J.: Personalised and self regulated learning in the Web 2.0 era:
international exemplars of innovative pedagogy using social software. Australas. J. Educ.
Technol. 26(1), 28–43 (2010)
26. Mega, C., Ronconi, L., De Beni, R.: What makes a good student? How emotions, self-
regulated learning, and motivation contribute to academic achievement. J. Educ. Psychol.
106(1), 121 (2014)
27. Nirje, B.: The normalisation principle and its human management implications. In: Kugel,
R., Wolfensberger, W. (eds.) Changing Patterns in Residential Services for the Mentally
Retarded. President’s Committee on Mental Retardation, Washington, DC, chap. 7 (1969)
28. Nodoushan, M.A.S.: Self-regulated learning (SRL): emergence of the RSRLM model. Int.
J. Lang. Stud. 6(3), 1–16 (2012)
29. Pecora, P.J., Whittaker, J.K., Maluccio, A.N., Barth, R.P.: The Child Welfare Challenge:
Policy, Practice, and Research, 4th edn. Adline Transaction, London (2012)
30. Piletic, C.K., Davis, R.: A profile of the introduction to adapted physical education course
within undergraduate physical education teacher education programs. ICHPER-SD J. Res. 5
(2), 27–32 (2010). https://files.eric.ed.gov/fulltext/EJ913329.pdf. Accessed 29 Sept 2018
31. Platt, C.A., Amber, N.W., Yu, N.: Virtually the same?: student perceptions of the
equivalence of online classes to face-to-face classes. J. Online Learn. Teach. 10(3), 489–503
(2014)
32. Rimmer, J.A., Rowland, J.L.: Physical activity for youth with disabilities: a critical need in
an underserved population. Dev. Neurorehabilitation 11(2), 141–148 (2008)
33. Rimmer, J.H., Riley, B., Wang, E., Rauworth, A., Jurkowski, J.: Physical activity
participation among persons with disabilities: barriers and facilitators. Am. J. Prev. Med. 26,
419–425 (2004)
34. Rose, K.K.: Student perceptions of the use of instructor-made videos in online and face-to-
face classes. J. Online Learn. Teach. 5(3), 487–495 (2009)
35. Sato, T., Haegele, J.A.: Professional development in adapted physical education with
graduate web-based professional learning. Phys. Educ. Sport Pedagogy 22(6), 618–631
(2017)
36. Sato, T., Haegele, J.A., Foot, R.: In-service physical educators’ experiences of online
adapted physical education endorsement courses. Adapt. Phys. Act. Q. 34, 162–178 (2017)
37. Shamir, T.A., Blau, I.: “The flipped classroom” at the open university? Promoting personal
and collaborative self-regulated learning in an academic course. In: Eshet-Elkalay, Y., Blau,
I., Caspi, N., Gerri, N., Kelmn, Y., Zilber-Warod, W. (eds.) The 11 Conference for
Innovation Study and Learning Technologies Chaise: The Learning Person on the Digital
Era, pp. 226–233 (2016). (in Hebrew)
38. Sherrill, C.: Adapted Physical Activity, Recreation, and Sport: Crossdisciplinary and
Lifespan, 6th edn. McGraw-Hill, New York (2004)
39. Svinicki, M.D.: Student learning: from teacher-directed to self-regulation. New Dir. Teach.
Learn. 2010(123), 73–83 (2010)
40. UNESCO: International Charter of Physical Education, Physical Activity and Sport (2015).
http://unesdoc.unesco.org/images/0023/002354/235409e.pdf
41. Villa, R., Thousand, J.: Restructuring for Caring and Effective Education. Brookes,
Baltimore (2000)
42. Wolfensberger, W.P., Nirje, B., Olshansky, S., Perske, R., Roos, P.: The Principle of
Normalization in Human Services. National Institute of Mental Retardation. Online: Books:
Wolfensberger Collection (1972). http://digitalcommons.unmc.edu/wolf_books/1
43. Wright, A., Roberts, E., Bowman, G., Crettenden, A.: Barriers and facilitators to physical
activity participation for children with physical disability: comparing and contrasting the
views of children, young people, and their clinicians. Disabil. Rehabil. (2018). https://doi.
org/10.1080/09638288.2018.1432702
44. Zee, M., Koomen, H.M.Y.: Teacher self-efficacy and its effects on classroom processes,
student academic adjustment, and teacher well-being: a synthesis of 40 years of research.
Rev. Educ. Res. 86(4), 981–1015 (2016)
5G Service and Discourses
on Hyper-connected Society in South Korea:
Text Mining of Online News
1 Purpose
2 Background
3 Method
3.1 Data Collection
This study used news articles available online to explore the aspects of a hyper-
connected society currently being discussed in Korea. News articles were crawled
using R 3.5.3 from BigKinds (www.bigkinds.or.kr), a database of news articles
provided by the Korea Press Promotion Foundation. A total of 918 news articles were
collected, dated from April 3, 2019, when 5G was first commercialized, to August 7,
2019.
4 Results
4.1 Results of Topic Modeling
The results of the LDA are shown in Table 1. Each of the 10 topics reflects social
discourse currently taking place about the hyper-connected society.
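The topic-extraction step can be sketched with a minimal collapsed-Gibbs LDA in Python (the study itself used R 3.5.3 on the BigKinds corpus; the toy documents, topic count of 3, and hyperparameters below are illustrative assumptions, not the authors’ pipeline):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.
    Returns per-topic word counts after the final sweep."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})            # vocabulary size
    ndk = [[0] * K for _ in docs]                    # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]       # topic-word counts
    nk = [0] * K                                     # topic totals
    z = []                                           # token assignments
    for d, doc in enumerate(docs):                   # random initialization
        zs = []
        for w in doc:
            t = rng.randrange(K)
            zs.append(t); ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):                           # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                          # remove token's counts
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + V * beta) for k in range(K)]
                t = rng.choices(range(K), weights)[0]  # resample topic
                z[d][i] = t; ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return nkw

# Toy stand-ins for the 918 crawled articles.
docs = [
    "5g network speed automatic driving mobile communication".split(),
    "cloud platform service game solution technology".split(),
    "security safety personal information issue people".split(),
] * 10
topics = lda_gibbs(docs, K=3)
for k, counts in enumerate(topics):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"Topic {k}:", top)
```

With well-separated toy documents, each topic’s top words tend to concentrate on one theme; on the real corpus the paper extracted 10 topics rather than 3.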
First, many words related to 5G next-generation communications, such as “Automatic
Driving,” “Mobile Communication,” and “Technology,” are mentioned in Topic 1.
This indicates that changes based on information and communication technology, the
basis of a hyper-connected society, are becoming an essential factor. In the industries
leading this change, one can see the emergence of communication-based products and
services, such as self-driving cars. Similarly, the commercialization of 5G as a “First”
can represent the general characteristics of the hyper-connected society itself; rapid
“Speed” and “Network” indicate that the main attributes of the changes in the hyper-
connected society are being discussed.
Second, Topic 6 presents lower-level technologies that are specifically applied to
develop solutions for corporate products and services, such as “Platform,” “Cloud,”
and “Service.” In addition, words such as “Technology,” “Solution,” and “Game” are
mentioned, which shows content on the innovative technologies that will emerge in the
hyper-connected society.
Third, Topic 4 features words such as “Industry,” “Construct,” and “Development,”
as well as information on the overall structure of the industry and on government
support to strategically create and foster it.
Fourth, Topic 3 and Topic 7 show content on societal issues and concerns in the
hyper-connected society. The words “People” and “Issue” were mentioned in Topic 3
and the words “Security” and “Safety” about “Information” emerged in Topic 7.
“IoT,” and “Big data” are linked to “Core” and “Technology.” These “Technologies”
are linked to “Services,” such as “Connected cars,” “Clouds,” and “Mobile
communications.”
5 Conclusion
As South Korea is the first country to commercialize 5G, it is leading many of the
changes brought by the new generation of telecommunications. This study therefore
attempted to explore the aspects of the hyper-connected society that have been
discussed in Korean society.
First, the results of the topic modeling show important aspects of the technologies
underlying the hyper-connected society. In particular, they show that the innovative
performance achievable by applying leading Fourth Industrial Revolution information
and communication technologies to industry is being discussed as an essential topic.
Second, the analysis shows that a discussion of both negative and positive aspects
of a hyper-connected society is taking place simultaneously. In particular, the social
environment that is evolving in a hyper-connected society revolves around technology
and relationships among people; this is being discussed as one of the primary topics.
This implies that advanced communication technology is not just a matter of speed and
accessibility; it should also be able to support human life reliably through the
implementation of secure and trustworthy network technologies.
Third, in the case of South Korea, the government actively implements policies to
promote informatization and fosters industries related to information and communication
technology. This result shows the relevant government responses, which include not
only the growth of self-sustaining industries but also the strategic creation of new
industries led by the government. It can therefore be expected that a hyper-connected
society will develop rapidly in Korea, where the government supports and fosters
changes in the information and communication environment.
Fourth, with the progress of the hyper-connected society, changes in technology and
the social environment have changed the demand for education. This has led
universities to offer new majors or strengthen relevant educational programs, and it has
highlighted the need for close collaboration between industry and academia.
These analyses show that the hyper-connected society does not mean only the
emergence and application of innovative communication technologies; it also
represents a wide range of changes encompassing human lives, social relationships,
education, industry, and more. In the future, we must prevent and adequately deal with
any problems of maladjustment, lag, or adverse effects that may arise in the hyper-
connected society. To this end, discussion of the hyper-connected society must not
focus solely on technical aspects. Comprehensive discussions are needed on “how” the
information and communication technologies that underpin human life are being used
and “what” the results will be.
This study analyzed online news articles to understand how the new era of hyper-
connectedness has been discussed. In future studies, it is necessary to consider the
responses from people who are experiencing the progress of hyper-connected society.
References
1. Seungwha (andy), C., Sunju, P., Seungyong, L.: The era of hyper-connected society and the
changes in business activities: focusing on information blocking and acquisition activities. Int.
J. Manag. Appl. Sci. 3 (2017). ISSN: 2394-7926
2. YoungSung, Y.: The Advent of Hyper-Connected Society and Our Future. Hanulbooks, Seoul
(2014)
3. The Financial News, A hyperconnected society…A new world is coming. http://www.efnews.
co.kr/news/articleView.html?idxno=78971. Accessed 26 Mar 2019
4. Etnews, First commercialization of 5G in Korea, opening of a new era. http://www.etnews.
com/20190404000302. Accessed 04 Apr 2019
5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3,
993–1022 (2003)
6. The Financial News, It’s not just communication…The “Industrial Big Bang” is coming.
http://www.efnews.co.kr/news/articleView.html?idxno=78904. Accessed 15 Mar 2019
A Digital Diagnostic Aide for Skincare: The
Role of Computer Vision and Machine
Learning in Revealing Skin Texture Changes
Abstract. Skin disease is serious and can be deadly. The prevalence of skin
disease is high and is likely to increase as the population ages. Skin disease
burdens Americans, their families, and employers. According to the American
Academy of Dermatology (AAD), nearly 25% of the population aged 0–17 was
diagnosed with skin disease in 2013, and the price tag for treatment was
$75 billion. Worldwide, an estimated 1.9 billion people suffer from a skin
condition at any given time, and a shortage of dermatologists aggravates the
issue. According to dermatologists, one of the chief early signs of a potential
skin disease is a change in the skin, ranging from discoloration to new growth.
In this paper, we discuss the application of machine learning algorithms and
computer vision techniques to analyze skin texture changes that are invisible to
the naked eye, and we propose an actionable-insights framework that would
trigger preventive treatment procedures to address any impending skin disease.
We also discuss several computer vision techniques and cognitive services that
improve the efficiency of computer vision techniques. Our goal is to develop
assistive computer vision models that could help dermatologists take proactive
healthcare actions to reduce the occurrence of skin diseases.
1 Introduction
Skin disease is serious and can be deadly. The prevalence of skin disease is high and is
likely to increase as the population ages. Skin disease burdens Americans, their
families, and employers1. Patients with skin disease and their caregivers suffered $11
billion2 in lost productivity. Market research projects that the global market for skin
disease treatment technologies will reach $20.4 billion in 20203. Prevention is better than
cure, and the application of machine learning and computer vision to analyze images
in order to predict and prevent the onset of skin disease, generally a change of texture
invisible [1] to the naked eye, would reduce overall cost and improve health outcomes (see Fig. 1).
Most artificial intelligence applications that cater to skin-related and dermatology-
based use cases fall into skin image analysis and skin care treatment
personalization4.
Additionally, computer-assisted diagnosis (CAD) systems use artificial intelligence
(AI) to analyze lesion data and arrive at a diagnosis of skin cancer5. Computer-vision-
infused AI systems could extract valuable diagnostic markers from selfies6, forecast a
potentially masked disease, and thus trigger actions to forestall emergency healthcare
incidents, saving millions of lives. Images provide valuable medical aid in diagnosing
jaundice7 (see Fig. 2) [2, 3].
1 AAD - https://www.ncmedsoc.org/wp-content/uploads/2018/07/NCDA-AADA-LC-18-Burden-of-Skin-Disease.pdf.
2 New study shows significant economic burden of skin disease in the United States - https://www.aad.org/media/news-releases/burden-of-skin-disease.
3 Face of Dermatology Industry Changing; Companies in Global Skin Disease Market Extending Products - https://www.bccresearch.com/pressroom/phm/face-of-dermatology-industry-changing-companies-in-global-skin-disease-market-extending-products.
4 Machine Learning for Dermatology - 5 Current Applications - https://emerj.com/ai-sector-overviews/machine-learning-dermatology-applications/.
5 Computer-assisted diagnosis techniques (dermoscopy and spectroscopy-based) for diagnosing skin cancer in adults - https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD013186/abstract.
6 The Role of Selfies in Creating the Next Generation Computer Vision Infused Outpatient Data Driven Electronic Health Records (EHR) - https://ieeexplore.ieee.org/document/8622458.
7 Skin - https://www.livescience.com/46868-skin-changes-signal-health-problems.html.
900 J. S. Vuppalapati et al.
8 A Gentle Introduction to Computer Vision - https://machinelearningmastery.com/what-is-computer-vision/.
9 The 5 Computer Vision Techniques That Will Change How You See The World - https://heartbeat.fritz.ai/the-5-computer-vision-techniques-that-will-change-how-you-see-the-world-1ee19334354b.
10 Microsoft Azure Vision Cognitive Services - https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/.
Objects  [ { "rectangle": { "x": 89, "y": 38, "w": 132, "h": 74 },
             "object": "Glasses",
             "parent": { "object": "Personal care", "confidence": 0.757 },
             "confidence": 0.749 } ]
Racy     false
Faces    []
Fig. 3. Selfies Parsed on Microsoft Azure (Microsoft Azure Cognitive Services - https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/)
As is clear from the table above, the API provides full details about the image (see
Fig. 4).
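To make a response shaped like the one in Fig. 3 actionable, detections can be filtered by confidence before any downstream dermatology workflow. The sketch below assumes exactly the response shape shown above; the helper name and the 0.7 threshold are our own illustrative choices, not part of the Azure API.

```python
# Minimal sketch: filter detected objects from an Azure Computer Vision
# style "analyze" response (shape as in Fig. 3; helper name is ours).
response = {
    "objects": [
        {
            "rectangle": {"x": 89, "y": 38, "w": 132, "h": 74},
            "object": "Glasses",
            "parent": {"object": "Personal care", "confidence": 0.757},
            "confidence": 0.749,
        }
    ],
    "adult": {"isRacyContent": False},
    "faces": [],
}


def confident_objects(resp, threshold=0.7):
    """Return (name, confidence) pairs for detections above the threshold."""
    return [(o["object"], o["confidence"])
            for o in resp.get("objects", [])
            if o["confidence"] >= threshold]


print(confident_objects(response))  # -> [('Glasses', 0.749)]
```

Raising the threshold to 0.8 would drop this detection, which is the kind of tuning a diagnostic pipeline would expose to the dermatologist.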
{
  "cropHintsAnnotation": {
    "cropHints": [
      {
        "boundingPoly": {
          "vertices": [
            { "x": 35 },
            { "x": 161 },
            { "x": 161, "y": 157 },
            { "x": 35, "y": 157 }
          ]
        }
      }
    ]
  },
  "labelAnnotations": [
    { "description": "Hair", "mid": "/m/03q69", "score": 0.97040063, "topicality": 0.97040063 },
    { "description": "Skin", "mid": "/m/06z04", "score": 0.9424602, "topicality": 0.9424602 },
    { "description": "Shoulder", "mid": "/m/01ssh5", "score": 0.8860629, "topicality": 0.8860629 },
    { "description": "Arm", "mid": "/m/0dzf4", "score": 0.88020605, "topicality": 0.88020605 },
    { "description": "Muscle", "mid": "/m/04_fs", "score": 0.8697178, "topicality": 0.8697178 },
    { "description": "Chest", "mid": "/m/0dzdr", "score": 0.86141694, "topicality": 0.86141694 },
    { "description": "Male", "mid": "/m/05zppz", "score": 0.8549979, "topicality": 0.8549979 }
  ]
}
Fig. 5. Selfies Parsed on Google Cloud Vision (Google Cloud Vision AI - https://cloud.google.com/vision/)
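The `labelAnnotations` array above can be used to decide which regions of a selfie are worth cropping for skin texture analysis. The sketch below follows the field names of the Fig. 5 output; the set of "skin-relevant" labels and the score cutoff are our own assumptions for illustration.

```python
# Minimal sketch: pick skin-relevant labels from a Google Cloud Vision
# response shaped like Fig. 5 (field names follow that output; the
# selection logic and label set are ours).
annotations = [
    {"description": "Hair", "score": 0.97040063},
    {"description": "Skin", "score": 0.9424602},
    {"description": "Shoulder", "score": 0.8860629},
    {"description": "Arm", "score": 0.88020605},
]

# Labels that suggest exposed skin suitable for texture analysis.
SKIN_LABELS = {"Skin", "Arm", "Shoulder"}


def skin_regions(labels, min_score=0.85):
    """Return, sorted, the skin-relevant labels detected above min_score."""
    return sorted(l["description"] for l in labels
                  if l["description"] in SKIN_LABELS and l["score"] >= min_score)


print(skin_regions(annotations))  # -> ['Arm', 'Shoulder', 'Skin']
```

With `min_score=0.9`, only the highest-confidence label, "Skin", would survive the cut.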
The CV system takes input units Uin and generates output Uout; T is the tuning vector.
Because an image may be perturbed by camera quality or other environmental factors,
the input is treated as a random variable.
11 Performance Characterization in Computer Vision: A Guide to Best Practices - http://www.tina-vision.net/docs/memos/2005-009.pdf.
import cv2
import numpy as np
from matplotlib import pyplot as plt


def get_pixel(img, center, x, y):
    # Threshold one neighbor pixel against the center pixel value:
    # 1 if the neighbor is at least as bright as the center, else 0.
    new_value = 0
    try:
        if img[x][y] >= center:
            new_value = 1
    except IndexError:
        # Neighbor lies outside the image border; treat it as 0.
        pass
    return new_value
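The helper above thresholds a single neighbor; a full Local Binary Pattern code for a pixel combines its eight neighbors, read clockwise, into one byte. The sketch below follows the same border-as-zero convention; the function name, neighbor ordering, and test patch are our own illustrative choices.

```python
def lbp_pixel(img, x, y):
    """Compute the 8-neighbor Local Binary Pattern code of pixel (x, y).

    Neighbors are read clockwise from the top-left; each comparison
    against the center contributes one bit of the resulting byte.
    """
    center = img[x][y]
    # Clockwise neighbor offsets starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        nx, ny = x + dx, y + dy
        if 0 <= nx < len(img) and 0 <= ny < len(img[0]):
            neighbor = img[nx][ny]
        else:
            neighbor = 0  # outside the border counts as 0
        if neighbor >= center:
            code |= 1 << (7 - bit)
    return code


# A uniform patch yields code 255: every neighbor equals the center.
patch = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
print(lbp_pixel(patch, 1, 1))  # -> 255
```

The histogram of these per-pixel codes over a skin region is the texture descriptor that is then compared across images.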
12 How-To: Three Ways to compare histograms using OpenCV and Python - https://www.pyimagesearch.com/2014/07/14/3-ways-compare-histograms-using-opencv-python/.
13 CompareHist - https://docs.opencv.org/2.4/modules/imgproc/doc/histograms.html?highlight=comparehist#comparehist.
14 Histogram Comparison methods - https://docs.opencv.org/2.4/modules/imgproc/doc/histograms.html?highlight=comparehist#comparehist.
15 Histogram comparison methods - https://www.pyimagesearch.com/2014/07/14/3-ways-compare-histograms-using-opencv-python/.
• High data availability even in the presence of faults in the network or computer
hardware (e.g. due to power outages, environmental disasters, and regional strife).
• High performance to ensure the system can function even under the high loads that
may arise in emergency situations (such as a pandemic, large-scale accident or war).
• Security, to protect patient data from misuse, unauthorized access, or attacks [9].
3 System Overview
In order to extract additional attributes that correlate with skin disease issues, we have
partnered with Sanjeevani Electronic Health Records, which provides de-identified
patient details. Our analysis correlates users' skin diseases with other medical conditions
(see Fig. 9).
16 MORPH II - Feature Vector Documentation - https://libres.uncg.edu/ir/uncw/f/wangy2018-1.pdf.
This paper presented a novel approach that integrates computer vision and LBP with
electronic health records to identify skin disease from selfies taken for medical imaging
purposes. The application of computer vision with OpenCV statistical methods would
provide a valuable diagnostic tool for identifying skin texture change and a prognosis
indicator for impending skin disease.
Acknowledgment. We sincerely thank the management and field staff of Sanjeevani Electronic
Health Records (www.sanjeevani-ehr.com) for their active support in providing images, de-
identified database and image analysis (see Fig. 10).
References
1. Guerrero, A.: Medicalized Smartphones: Giving New Meaning to “Selfies”. https://
technologyadvice.com/blog/healthcare/medicalized-smartphones-giving-new-meaning-to-
selfies/. Accessed 11 Aug 2018
2. Langston, J.: New app could use smartphone selfies to screen for pancreatic cancer, 28
August 2017. https://www.washington.edu/news/2017/08/28/new-app-uses-smartphone-
selfies-to-screen-for-pancreatic-cancer/. Accessed 06 Aug 2018
3. Honnungar, S., Mehra, S., Joseph, S.: Diabetic Retinopathy Identification and Severity
Classification, Fall (2016). http://cs229.stanford.edu/proj2016/report/HonnungarMehra
Joseph-DRISC-report.pdf
4. Le, J.: The 5 Computer Vision Techniques That Will Change How You See The World, 12
April 2018. https://heartbeat.fritz.ai/the-5-computer-vision-techniques-that-will-change-
how-you-see-the-world-1ee19334354b
5. Pietikäinen, M., Hadid, A., Zhao, G., Ahonen, T.: Computer Vision Using Local Binary
Patterns, 1st edn. Springer, London (2011). ISBN 978-0-85729-747-1
6. SCIKIT-Image, Local Binary Pattern for texture classification. http://scikit-image.org/docs/
dev/auto_examples/features_detection/plot_local_binary_pattern.html. Accessed 20 July
2018
7. Prakasa, E.: Texture Feature Extraction by Using Local Binary Pattern, 04 January 2017.
https://www.researchgate.net/publication/305152373_Texture_Feature_Extraction_by_
Using_Local_Binary_Pattern
8. Ojala, T., Valkealahti, K., Oja, E., Pietikäinen, M.: Texture discrimination with
multidimensional distributions of signed gray-level differences, November 1999. http://
www.ee.oulu.fi/mvg/files/pdf/pdf_58.pdf
9. Hanumayamma Innovations and Technologies, Sanjeevani Electronic Health Records &
Healthcare analytics platform. http://hanuinnotech.com/healthcare.html. Accessed 08 Jan
2017
10. Rosebrock, A.: Local Binary Patterns with Python & OpenCV, 7 December 2015. https://
www.pyimagesearch.com/2015/12/07/local-binary-patterns-with-python-opencv/
11. Hanzra, B.S.: Texture Matching using Local Binary Patterns (LBP), OpenCV, scikit-learn
and Python, 30 May 2015. http://hanzratech.in/2015/05/30/local-binary-patterns.html
12. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques, 3rd edn. Morgan
Kaufmann, Burlington (2011)
13. Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press,
New York (2011)
14. Vuppalapati, C., Ilapakurti, A., Kedari, S.: The role of big data in creating sense EHR, an
integrated approach to create next generation mobile sensor and wearable data driven
electronic health record (EHR). In: Proceedings of the 2016 IEEE Second International
Conference on Big Data Computing Service and Applications (BigDataService) (2016)
Deep-Learned Artificial Intelligence
for Semantic Communication
and Data Co-processing
Fig. 1. Natural and artificial intelligences partnership ~IN for knowledge (K) extraction by means
of data (D) processing.
920 N. Vasilyev et al.
Understanding in communication obeys the same laws (see Fig. 1). Let $\tilde I_N$ be the
human mind working jointly with its life-long partner DL IA. The scheme of future
communication between SIC persons can then be presented in the following form (see Fig. 1):
$$\tilde I{}_N^i \;\xrightarrow{\;K_i = D_j\;}\; \tilde I{}_N^j \;\longrightarrow\; K_j.$$
Universalities are interconnected. The same diagram can be used, for instance, to
introduce the notion of the inverse image of a map.
$$\mathcal{O} = \{O;\, A_1, \ldots, A_K\}, \qquad A_k : O \to O,$$
$$\begin{array}{ccc}
O & \xrightarrow{\;A\;} & O \\
\mu\big\downarrow & & \big\downarrow\mu \\
O' & \xrightarrow{\;A'\;} & O'
\end{array} \qquad (1)$$
In other words, a morphism transfers the action of the operators from one ring to
another. For instance, let $p : O \to O'$ be the projection onto an invariant subspace
$O' \subset O$ of the operators $A = A_k$. Then the map $\mu = p$ satisfies (1), i.e., $p$ is a morphism.
If an operator $A = A_k$ is invertible, then the property $\mu A^{-1} = (\mu A)^{-1}$ holds. The
scheme is universal because it is recognized in many mathematical theories and
applications. Its specialization is given in functional rings of the kind
$$O = \{f : \mathbb{R}_+ \to \mathbb{R},\; f(0) = 0\},$$
$$O' = \{F : \mathrm{Dom}\,F \to \mathbb{C},\; F(\infty) = 0\}, \qquad \mathrm{Dom}\,F = \{p \in \mathbb{C} : \operatorname{Re} p \ge \sigma(f)\}.$$
They are compared with the help of the morphism $\mu = M$, the Laplace transformation,
which takes a function $f(t)$ to $F(p)$:
$$Mf = F(p) = \int_0^{+\infty} e^{-pt} f(t)\, dt.$$
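As a quick check of the definition (a standard computation, not taken from the original text), applying $M$ to $f(t) = t$, which satisfies the requirement $f(0) = 0$, gives by integration by parts

$$Mf = \int_0^{+\infty} e^{-pt}\, t \, dt = \frac{1}{p^2}, \qquad \operatorname{Re} p > 0,$$

a function that indeed vanishes as $p \to \infty$, as required of elements of $O'$.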
The rings carry the operators
$$A_1 = D_O, \qquad A_2 f(t) := t\, f(t), \qquad A_g f = g * f, \quad g \in O.$$
$$M D_O^{-1} f = \frac{1}{p}\, Mf, \qquad M^{-1} D_{O'}^{-1} F = \frac{1}{t}\, M^{-1} F.$$
These considerations show how similar elements $S : f(t) \to f(at)$ of the object $O$ are
transformed into the elements $S' : F(p) \to \frac{1}{a} F\!\left(\frac{p}{a}\right)$ of the object $O'$.
With the help of the forgetful functor $U$, the ring $UO$ is defined without the operators
$A_k$, $k = 1, 2, \ldots, K$. The ring $UO$ is embedded in the ring of lattice functions $O_n$:
$$O = \{f : \mathbb{R}_+ \to \mathbb{R},\; f(0) = 0,\; \exists C\; |f| \le C\},$$
$$O' = \{F : \mathbb{R} \to \mathbb{C},\; F(\infty) = 0\}.$$
$$\begin{array}{ccc}
O & \xrightarrow{\;U\;} & O' \\
\big\downarrow & \swarrow & \big\downarrow \\
O & \xrightarrow{\;M\;} & O'
\end{array}$$
$$D_1 : (f : A \to B) \to (f_n : A_n \to B_n),$$
$$D_n : [O \to O'] \;\xrightarrow[\;U^{-1},\; U_{D,n}^{-1}\;]{\;U,\; U_{D,n}\;}\; [O_n \to O'_n],$$
$$U_n(f_n)\!\left(\frac{\pi n}{T}\right) = \frac{T}{2N} \sum_{k=0}^{2N-1} f(t_k)\, e^{-\frac{i\pi k n}{N}}, \qquad n = 0, 1, \ldots, 2N-1, \quad t_k = \frac{kT}{2N} \in [0, T].$$
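Up to the scaling factor $T/2N$, the lattice transform above coincides with the standard discrete Fourier transform, since $e^{-i\pi kn/N} = e^{-2\pi ikn/(2N)}$. The sketch below checks that identity numerically; the sample function, $N$, and $T$ are arbitrary choices of ours for illustration.

```python
import numpy as np

# Sample a bounded function on the lattice t_k = k*T/(2N), k = 0..2N-1.
N, T = 8, 1.0
k = np.arange(2 * N)
t = k * T / (2 * N)
f = np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)

# Direct evaluation: U_n(f_n) = (T/2N) * sum_k f(t_k) * e^{-i*pi*k*n/N}.
n = np.arange(2 * N)
direct = (T / (2 * N)) * (
    f[None, :] * np.exp(-1j * np.pi * np.outer(n, k) / N)
).sum(axis=1)

# The same sum is the standard DFT scaled by T/2N.
via_fft = (T / (2 * N)) * np.fft.fft(f)

print(np.allclose(direct, via_fft))  # -> True
```

This agreement is what makes the lattice embedding computable with the usual FFT machinery.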
$$D = D_2, \qquad D_2 : \mathrm{Ring}_M \to \mathrm{Ring}_M^{\mathrm{fin}}:$$
$$D_2 : f = \sum_{k=0}^{\infty} a_k f_k \;\to\; \sum_{k=0}^{n} a_k f_k.$$
5 Conclusions
The huge abilities of the mind are used only in infinitesimal measure. The emerging
technology of man's partnership with DL IA is meant to assist a person's rational
auto-development (self-development on rational grounds). Man is born to think in
meanings (to live sanely in thinking). Self-reflection removes the divergence between
ideal ontological constructions and the real means of knowledge implementation in
computer systems. Intellectual processes lean on CogOnt and the system axiomatic
method. DL IA will foster man's thinking in the language of categories. De docta
ignorantia and the attendant perfection of consciousness on the basis of universalities
allow breeding the SIC subject. Man's auto-molding happens at the highest genome
level of neuro-organization and in the trend of rational sophistication. Future
information processing and communication will be held at the semantic level in order
to support a person's successful trans-disciplinary activity.
References
1. Gromyko, V.I., Kazaryan, V.P., Vasilyev, N.S., Simakin, A.G., Anosov, S.S.: Artificial
intelligence as tutoring partner for human intellect. J. Adv. Intell. Syst. Comput. 658, 238–
247 (2018)
2. Gromyko, V.I., Vasilyev, N.S.: Mathematical modeling of deep-learned artificial intelligence
and axiomatic for system-informational culture. Int. J. Robot. Autom. 4(4), 245–246 (2018)
3. Vasilyev, N.S., Gromyko, V.I., Anosov, S.S.: On inverse problem of artificial intelligence in
system-informational culture. J. Adv. Intell. Syst. Comput. Hum. Syst. Eng. Des. 876, 627–
633 (2019)
4. Sadique Shaikh, Md.: Defining ultra artificial intelligence (UAI) implementation using
bionic (biological-like-electronics) brain engineering insight. MOJ Appl. Bio Biomech. 2(2),
127–128 (2018)
5. Deviatkov, V.V., Lychkov, I.I.: Recognition of dynamical situations on the basis of fuzzy
finite state machines. In: International Conference on Computer Graphics, Visualization,
Computer Vision and Image Processing and Big Data Analytics, Data Mining and
Computational Intelligence, pp. 103–109 (2017)
6. Fedotova, A.V., Davydenko, I.T., Pförtner, A.: Design intelligent lifecycle management
systems based on applying of semantic technologies. J. Adv. Intell. Syst. Comput. 450, 251–
260 (2016)
7. Volodin, S.Y., Mikhaylov, B.B., Yuschenko, A.S.: Autonomous robot control in partially
undetermined world via fuzzy logic. J. Mech. Mach. Sci. 22, 197–203 (2014)
8. Svyatkina, M.N., Tarassov, V.B., Dolgiy, A.I.: Logical-algebraic methods in constructing
cognitive sensors for railway infrastructure intelligent monitoring system. Adv. Intell. Syst.
Comput. 450, 191–206 (2016)
9. Gadamer, H.-G.: The Actuality of the Beautiful. Art, Moscow (1991)
10. Mac Lane, S.: Categories for the Working Mathematician. Phys. Math. Ed., Moscow (2004)
11. Goldblatt, R.: The Categorical Analysis of Logic. North-Holland Publishing Company,
Amsterdam (1979)
12. Husserl, E.: Ideas for a Pure Phenomenology and Phenomenological Philosophy: Book 1:
General Introduction to Pure Phenomenology. Acad. Project, Moscow (2009)
13. Pinker, S.: The Stuff of Thought: Language as a Window into Human Nature. Librokom,
Moscow (2013)
14. Cassirer, E.: Philosophy of Symbolic Forms, vol. 1: Language. Univ. Book, Saint Petersburg
(2000)
15. Courant, R., Robbins, H.: What is Mathematics?. Moscow Center of Continuous Education,
Moscow (2017)
16. Euclid: Elements. GosTechIzd, Leningrad (1949–1951)
17. Hilbert, D.: Grounds of Geometry. Tech.-Teor. Lit., Leningrad (1948)
18. Kirillov, A.: What is the Number?. Nauka, Moscow (1993)
19. Artin, E.: Geometric Algebra. Nauka, Moscow (1969)
20. Bachman, F.: Geometry Construction on the Base of Symmetry Notion. Nauka, Moscow
(1969)
21. Maltsev, A.: Algebraic Systems. Nauka, Moscow (1970)
22. Maltsev, A.I.: Algorithms and Recursive Functions. Nauka, Moscow (1986)
23. Shafarevich, I.R.: Main Notions of Algebra. Reg. and Chaos Dynam., Izhevsk (2001)
24. Kurosh, A.: Lecture Notes on General Algebra. Phys.-Mat., Moscow (1962)
25. Engeler, E.: Metamathematik der Elementarmathematik. MIR, Moscow (1987)
Author Index