Priyadarsi Nanda · Vivek Kumar Verma · Sumit Srivastava · Rohit Kumar Gupta · Arka Prokash Mazumdar, Editors
Data Engineering
for Smart Systems
Proceedings of SSIC 2021
Lecture Notes in Networks and Systems
Volume 238
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland
Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA,
School of Electrical and Computer Engineering—FEEC, University of Campinas—
UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering,
Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University
of Illinois at Chicago, Chicago, USA
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering,
University of Alberta, Alberta, Canada
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering,
KIOS Research Center for Intelligent Systems and Networks, University of Cyprus,
Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong,
Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest
developments in Networks and Systems—quickly, informally and with high quality.
Original research reported in proceedings and post-proceedings represents the core
of LNNS.
Volumes published in LNNS embrace all aspects and subfields of, as well as new
challenges in, Networks and Systems.
The series contains proceedings and edited volumes in systems and networks,
spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor
Networks, Control Systems, Energy Systems, Automotive Systems, Biological
Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems,
Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems,
Robotics, Social Systems, Economic Systems and others. Of particular value to both
the contributors and the readership are the short publication timeframe and
the world-wide distribution and exposure which enable both a wide and rapid
dissemination of research output.
The series covers the theory, applications, and perspectives on the state of the art
and future developments relevant to systems and networks, decision making, control,
complex processes and related areas, as embedded in the fields of interdisciplinary
and applied sciences, engineering, computer science, physics, economics, social, and
life sciences, as well as the paradigms and methodologies behind them.
Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of Science.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
This book is a one-stop solution for authors to showcase their work among the research communities, application warehouses and the public and private sectors. This series is an opportunity to gather scientific research related to the concept of data engineering in the context of computational intelligence, consisting of the interaction between smart devices, smart environments and smart interactions, as well as the information technology support for such areas. Data from these areas also need to be stored, once gathered, in intelligent database systems and processed using smart and intelligent approaches. The aim of this series is to make available a platform for the publication of books on all aspects of single- and multi-disciplinary research on these themes, so that the latest results are available in a readily accessible form. The innovations provide the latest software tools to be showcased in the book series through publications. The high-quality content, with a broad range of topics pertaining to the book series, will be peer-reviewed and published on suitable recommendations.
This book will provide the state of the art to research scholars, scientists, industry learners and postgraduates from the various domains of engineering and related fields, as it incorporates data science and the latest innovations in the field of engineering, with their paradigms and methods that employ knowledge and intelligence, into the research community. Its scope ranges from data science for smart systems, with built-in capabilities for handling research challenges and problems related to energy-aware sensors, smart city projects, wearable devices, smart healthcare solutions, smart e-learning initiatives and the social implications of IoT. Further, it extends its coverage to the different computational aspects involved in various domains of engineering, such as complex security solutions for data engineering, communication networks, data analytics, machine learning, integrating IoT data with external data sources and data science approaches for smart systems.
Secondly, this book provides technological solutions to the non-engineering and science domains, as it contains fundamental innovations in the field of engineering that turn out to be real solutions to their problems. Also, it includes the
Abstract We have all heard about bullying, and we know that it is an immense challenge that schools have to tackle. Many lives have been ruined by bullying, and the fear it implants in students' minds has caused many of them to fall into depression, which can lead to suicide. Traditional methods (National Academies of Sciences, Engineering, and Medicine, in Preventing bullying through science, policy, and practice. The National Academies Press, Washington, DC, [1]) need to be accompanied by modern technology to make them more effective and efficient. If real-time alerts are sent to school staff, they can identify the perpetrator and extricate the victim swiftly. In the proposed method, an AI-based solution is implemented to monitor students using standard school surveillance technologies and CCTV, in order to maintain decorum and a safe environment on the school premises. The proposed method also utilizes other unstructured sources, such as attendance records, social media activity and the general nature of the students, to deliver a quick response. Artificial intelligence (AI) techniques such as convolutional neural networks (CNNs), which include image-processing capabilities, logistic regression methods, long short-term memory (LSTM) and the pre-trained model Darknet-19 are used for classification. Further, the model also includes sentiment analysis to identify commonly used abusive terms and noisy labels, to improve overall model accuracy. The model has been trained and validated with realistic data from all the sources mentioned and has achieved a classification accuracy of 87% for detecting signs of bullying.
L. Kumar
Gazelle Information Technologies, Delhi, India
P. Goyal (B) · R. Kumar · D. Shrivastav
Mount Carmel School, Delhi, India
e-mail: palashgoyal1608@mountcarmeldelhi.com
K. Malik
Computer Science Engineering, USICT- Guru Gobind Singh Indraprastha University, Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
P. Nanda et al. (eds.), Data Engineering for Smart Systems, Lecture Notes in Networks
and Systems 238, https://doi.org/10.1007/978-981-16-2641-8_1
1 Introduction
First, we identified all the physical, mental, emotional, social and cyberbullying types and parameters that are prevalent, based on an extensive study of the existing literature [1–3]. These parameters were then studied to determine how they can be analyzed objectively, which involved identifying the data sources that can provide input signals for each bullying parameter. Once the data sources were identified, actual data was collected from these sources. The existing literature was again studied extensively [4–7] to identify AI/ML algorithms that can classify the data with a binary output indicating whether or not a situation constitutes bullying. When bullying is detected, our system detects and recognizes the faces of those involved and sends real-time alerts to the authorities, along with the identity of the perpetrator [8, 9]. The authors then combined all these elements into a working model of their solution, with a proof of concept carried out in a school. The research methodology followed is shown in Fig. 1.
Using Machine Learning, Image Processing and Neural Networks …

To identify the signs of bullying, we have taken multiple parameters into consideration, through which we can filter out genuine bullying from other behaviors that could be mistaken for it. Features such as rude behavior, regularly troubling authorities, low attendance, use of foul language, low scores, drug use and explicit content are used to filter the data. In the proposed model, it is planned to integrate AI tools with school infrastructure and data such as cameras with microphones, student portals containing community forums where students and teachers interact, students' attendance records, their scorecards, etc. The data used for analysis and the complete infrastructural setup are discussed below in Tables 1 and 2.
on ImageNet dataset), along with LSTM, which provides the capability to process sequences of image data with feedback connections. When a bullying scenario is detected by the CNN-LSTM architecture, the faces of all those involved or present in the particular frame are detected using the popular object detection algorithm You Only Look Once (YOLO). These faces are input into another simple CNN, namely a Siamese network, which tries to identify the perpetrator by matching the input against the student database already present.
Audio is mapped to text using the Google Cloud Speech API and, alongside comments on the student community portal, is used to determine the following features: tone, amplitude and pitch of voice; language used (explicit or not); uppercase text; text length and sentiment; threatening statements; trolling; unpleasant comments; and distasteful words. Other attributes such as low attendance, poor grades, enrolment in extracurricular activities, teacher–student interaction, frequency of counsellor appointments and behavior reports by staff will also be considered. Data for these inputs is first converted from physical form into digital form.
A convolutional neural network is a type of neural network (NN) that provides the capability to convert pixels into well-structured data [5]. CNNs replicate the function of the frontal lobe of the human brain (cerebral cortex), which is responsible for processing audio-visual stimuli in humans. To process images, CCTV footage is spliced into still images, and each frame is then analyzed to extract the crucial features, which can be analyzed further for more refined results.
The proposed network architecture for video classification is shown in Fig. 4. It has already been shown that when LSTM (which extracts global temporal features) is added after a CNN, the local temporal features obtained from optical flow are also of great importance [10].
For unambiguous and noise-free detection, threshold values are set in order to filter out disturbances. Only regions whose probability (p) value exceeds the threshold are considered. To further improve the quality of predictions, Intersection over Union (IoU) and non-max suppression are performed. Non-max suppression helps eliminate any bounding regions that may have detected an already detected feature or face. The final bounding boxes obtained after the complete procedure are sent as inputs to a Siamese network for facial recognition.
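The thresholding, IoU and non-max suppression steps described above can be sketched in a few lines of Python; the probability and IoU cutoffs below are illustrative assumptions, not values reported by the authors.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(detections, p_threshold=0.6, iou_threshold=0.5):
    """Keep only confident, non-overlapping boxes.
    detections: list of (probability, box) pairs."""
    # Discard regions whose probability falls below the threshold,
    # then visit the survivors in order of decreasing confidence.
    candidates = sorted(
        [d for d in detections if d[0] >= p_threshold], reverse=True)
    kept = []
    for prob, box in candidates:
        # Drop a box that overlaps too much with an already kept box,
        # i.e. one that re-detects an already detected face.
        if all(iou(box, kb) < iou_threshold for _, kb in kept):
            kept.append((prob, box))
    return kept
```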
Since only one picture of each student will be present in the database, training a deep learning algorithm to map the received images to the one stored in the dataset would not be feasible. For this reason, a Siamese network, another variation of the convolutional neural network and a one-shot learning algorithm, is used. It gives high performance in applications and scenarios where there is not a large amount of data belonging to each class.
A Siamese network consists of two parallel neural networks with the same weights and parameters. It consists of convolutional, pooling and dense layers just like a regular CNN, but what distinguishes it is that it lacks a softmax/sigmoid activation at the last layer. Thus, an encoding of the image is received as output instead of a category. The contrastive loss, or difference between the encodings received from the two parallel networks, is calculated to ascertain the similarity between the input images.
The faces inside the bounding boxes predicted by the YOLO algorithm are extracted and provided as input to one of the two parallel networks. The other network receives images from the student database, and the difference between the received encodings is calculated. The image with the smallest difference is considered "identified" as one of those present in the bullying frame. To speed up the process and reduce computational needs, the encodings of all students in the database were generated in advance using the same Siamese network, and only the difference calculation was done during the test phase. Immediate alerts, including the identity of those involved, are then sent to the authorities. Figure 6 displays the working of a Siamese network for facial recognition.
For audio analysis, we have mapped the CCTV voice output through the Google Cloud Speech API, which uses a CNN and provides real-time streaming speech recognition and conversion from audio to text. The resulting text is stored, alongside our CCTV data, in Google Cloud Storage so that sentiment analysis can be performed on it.
For text analysis, the extracted words are mapped against a sentiment analysis dictionary (the "Liu and Hu opinion lexicon", containing 6800 positive and negative words [4]). To extract features from the text, multinomial Naïve Bayes is used (refer to Fig. 7). It helps classify bad words and rude comments from the audio-to-text converted files as well as from the students' community portal, including explicit and vulgar remarks, text length, usage of upper-case letters in community forums, etc. All such attributes are then listed for feature extraction by the model.
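As a sketch of the multinomial Naïve Bayes step named above, the classifier below counts words per class with Laplace smoothing and picks the class with the highest log-probability. The tiny token lists used in the test are invented examples, not the authors' data or the Liu–Hu lexicon.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns per-class log priors
    and Laplace-smoothed per-word log-likelihoods."""
    vocab = {w for tokens, _ in docs for w in tokens}
    labels = {label for _, label in docs}
    counts = {c: Counter() for c in labels}
    n_docs = Counter(label for _, label in docs)
    for tokens, label in docs:
        counts[label].update(tokens)
    model = {}
    for c in labels:
        total = sum(counts[c].values())
        model[c] = (
            math.log(n_docs[c] / len(docs)),               # log prior
            {w: math.log((counts[c][w] + 1) / (total + len(vocab)))
             for w in vocab},                               # seen words
            math.log(1 / (total + len(vocab))),             # unseen words
        )
    return model

def classify(model, tokens):
    """Return the most probable class for a token list."""
    def score(c):
        prior, likes, unseen = model[c]
        return prior + sum(likes.get(w, unseen) for w in tokens)
    return max(model, key=score)
```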
Data such as medical records, attendance, grades and teachers' remarks from student portals is also used and processed for feature extraction. Logistic regression modeling is used for this analysis.
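A minimal sketch of how logistic regression could score such records follows; the feature names, weights and bias are invented for illustration and are not taken from the paper.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def risk_score(record, weights, bias):
    """Map a numeric student record (dict of feature -> value) to a
    probability via a fitted logistic regression model."""
    z = bias + sum(weights[f] * v for f, v in record.items())
    return sigmoid(z)
```

With hypothetical weights such as `{"low_attendance": 2.0, "bad_remarks": 1.5}`, a record with both flags set scores higher than an empty one, which is all this step needs to contribute to the overall feature set.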
solution is a one-time investment and will provide great benefit in the future, making schools bullying-free, a dream most people never thought would become a reality.
Detailed information and a diagrammatic representation of the recommended architecture are provided below:
As shown in Fig. 8, data streams from the CCTV are split into audio and video media streams. Audio is then sent to the Google speech-to-text API, whose output is used for sentiment analysis. In the case of video, clips are sliced into multiple frames based on time and frame rates. These images are analyzed to detect any signs of bullying, or any other sort of unethical action, via the defined CNN model.
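Slicing clips "based on time and frame rates" amounts to picking every k-th frame; a small sketch follows (the 30 fps and 1-second interval in the test are illustrative values, not parameters given by the authors).

```python
def frames_to_sample(duration_s, fps, every_s):
    """Indices of the frames to extract when slicing a clip of
    duration_s seconds recorded at fps, taking one frame every every_s
    seconds."""
    step = max(1, round(fps * every_s))   # frames between samples
    total = int(duration_s * fps)         # frames in the whole clip
    return list(range(0, total, step))
```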
From student portals, web scraping is implemented to obtain information about posts, tags, comments and open chats, along with records that can be taken directly from the school database. All these attributes are fed into the model (deployed on the cloud), and results can be fetched remotely via the Internet.
Fig. 9 Cross-entropy loss (train dataset is represented by blue curve, and test dataset is represented
by orange curve)
3 Result
To analyze the accuracy in the initial phases, labeled data was split into training and testing sets in a ratio of 7:3. At this initial stage, with the gathered data, our model is able to predict with an accuracy of 87%.
Cross-entropy loss was used as the loss function to analyze the performance of the deployed model [11]. It measures the performance of a classification model whose outputs are between 0 and 1. As the predicted label diverges from the actual label, the penalty, or loss, increases. Equation (1) represents the cross-entropy loss for N classes.
$$\text{C.E. Loss} = -\sum_{c=1}^{N} y_{o,c} \log(p_{o,c}) \tag{1}$$
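Equation (1) can be checked numerically with a small sketch (not the authors' code); for a one-hot ground truth the loss reduces to minus the log of the probability assigned to the true class.

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """Cross-entropy loss for one observation over N classes.
    y: one-hot ground-truth vector, p: predicted probabilities.
    eps guards against log(0)."""
    return -sum(yc * math.log(pc + eps) for yc, pc in zip(y, p))
```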
4 Improvements
Future improvements could include using the model as a measure to check for depressed victims, so as to help prevent suicide and self-harm tendencies and counter them.
Fig. 10 Classification accuracy (train dataset is represented by blue curve, and test dataset is
represented by orange curve)
5 Conclusion
Acknowledgments We would like to thank Mr. Kirit, the CEO of Gazelle Information Technologies Pvt. Ltd., for his expert advice and for supplying the resources required for the implementation of this project.
Abstract As the use of social media for information sharing grows, the need for reality checks grows with it. The more users rely on news spread on these social media platforms, the greater the chances of rumours spreading through this content. To avoid the escalation of hoaxes, content must be checked and properly labelled as fact or fake, either by human perception or by widely used classification algorithms. In this paper, we perform a feature-based survey of four popular machine learning algorithms for the classification of content on social media as credible or fake. The platform we consider for data collection is the Twitter API, which is freely distributed by Twitter. We propose feature selection using the ant colony optimization (ACO) algorithm. The algorithms we consider for comparison are decision tree, Naïve Bayes, random forest and SVM. All these algorithms are tested on some predefined features of tweets provided by the Twitter API, along with additional features extracted through feature engineering on the data. We report our findings on the basis of accuracy, precision and recall measures calculated for all the algorithms. We also applied tenfold cross-validation to these algorithms to obtain an appropriate measure of accuracy.
1 Introduction
Online social networks (OSNs) are a stage people can join that permits them to interconnect with one another and share their own views, encouraging the creation of a virtual informal community with mutual agreement and social relations among individuals who generally share the same interests, activities or real-world associations. OSNs are also affecting human social life in a big way [1].
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
P. Nanda et al. (eds.), Data Engineering for Smart Systems, Lecture Notes in Networks
and Systems 238, https://doi.org/10.1007/978-981-16-2641-8_2
14 U. Sharma and S. Kumar
OSNs nowadays attract millions of people to express their views and also to gain insight from the views of others [2]. According to Statista [3], the number of users of Facebook alone is 2.45 billion worldwide; India alone has 269 million Facebook users, a number estimated to rise to 445 million by 2023. Table 1 shows the number of active users worldwide for the major online social networking sites.
Like Facebook, there are now several online social media platforms available, such as Twitter, WhatsApp, Instagram, etc. From the statistics gathered, it is obvious that Facebook tops the chart in the number of active users worldwide, but when it comes to assessing human behaviour and the credibility of facts, researchers tend to prefer microblogging sites over Facebook. A microblogging site, just like any blogging site, allows a user to share or express his/her views in the form of text or audio/visual content, but with the restriction of a character count of 140 [4]. Microblogs are rich sources for research on trend prediction and early warnings, as well as for gauging public opinion. The most popular microblogging site available as of now is Twitter.
The number of Twitter users worldwide is 330 million and increasing. Twitter allows a user to share content with a label or tag represented by a hashtag (#); all users can post or search content related to a particular hashtag. This allows us to analyse or classify the responses of Twitter users on the basis of different hashtags. The suitability of Twitter for data analysis also comes from the fact that it allows us to extract tweets at runtime, free of cost, with only a few limitations on the number of tweets that can be accessed.
But while microblogging sites bring many benefits, they also have some severe drawbacks. People use these platforms to spread rumours and false news and to propagate hoaxes within communities, which is a major shortfall of online social networks, and it seems that, for now, this cannot be controlled.
Feature-Based Comparative Study of Machine Learning Algorithms …

Machine learning algorithms are mainly classified into two categories: (a) supervised learning and (b) unsupervised learning. The supervised learning category covers problems in which we have a set of training labels for classification and we categorize our data only into those labels. Unsupervised algorithms, on the other hand, work on problems in which we do not have any labels for classification; instead, we group our data based on its properties, with data having similar properties grouped into the same cluster, and so on.
A number of machine learning algorithms have already been used by researchers for the classification of fake content on social media. Almost all supervised machine learning algorithms have been tested for this categorization task since, clearly, we have labels in our dataset to distinguish the content as 'fake' or 'real'. In this paper, we describe the performance of four major machine learning algorithms, namely decision tree, SVM, KNN and random forest. A brief introduction of these algorithms follows:
Learning decision trees from class-labelled training tuples is known as decision tree induction, which is the most widely used approach for classification. The basic structure of a decision tree algorithm [5] is a recursively defined partitioning mechanism based on the numerical values of the nodes; the node with the higher index value is split into further nodes. The overall representation of this method looks like a flowchart, in which the root node is the target node and the leaf nodes are the class labels to be identified. The transition from the root node to the subsequent internal nodes is based on the outcome of a decision taken by testing some attribute. This process continues until a leaf node is reached from the root node, and those decisions become the basis of our classification model. The problems with the decision tree algorithm are overfitting and underfitting, which can later be overcome by using the random forest approach. A variation of the decision tree algorithm, termed the complete decision tree, is used by Djukova and Peskov [6]. The decision tree algorithm is widely used for the classification of fake news, for example by Singh et al. [7] and Adewole et al. [8].
The main feature of the decision tree method is attribute selection, a heuristic measure for choosing a criterion for splitting the data. The most common measures for attribute selection are (i) the Gini index and (ii) information gain. The formulas for calculating the Gini index and information gain are given in Eqs. 1 and 2, respectively.
$$\text{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2 \tag{1}$$

$$\text{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \tag{2}$$
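Both attribute-selection measures in Eqs. 1 and 2 can be computed directly from the class proportions p_i; a small sketch:

```python
import math

def gini(probs):
    """Gini index of a class distribution (Eq. 1): 0 for a pure node,
    maximal for an even split."""
    return 1 - sum(p * p for p in probs)

def info(probs):
    """Entropy / information of a class distribution (Eq. 2), in bits.
    Classes with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A 50/50 two-class split gives Gini 0.5 and entropy 1 bit, while a pure node gives 0 for both, matching the intuition that the split criterion should favour purer partitions.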
The support vector machine (SVM) is a classification technique that works for both linear and nonlinear data; it is also used for regression problems [5]. The main principle of SVM is that it transforms the input points into a higher-dimensional plane by some linear or nonlinear mapping. It then searches for a hyperplane that separates the data, classifying the points into different categories. With an efficient nonlinear mapping, it is always possible to find a hyperplane that separates the data. The procedure used to find the hyperplane is based on support vectors, which are training tuples, and on margins, which are defined in relation to the support vectors. There may be a scenario in which more than one hyperplane separates our points; then we need to select the most appropriate hyperplane according to the best separation, that is, the distance between the points and the plane. Here the margin feature is used: we select the hyperplane with the largest margin for categorizing future datasets. One drawback of the SVM approach is that it ignores outliers; in other words, it does not perform well when the dataset contains a high amount of noise.
In practice, SVM uses a kernel trick, termed the SVM kernel, which transforms data points from lower dimensions to higher dimensions. This is done because some points cannot be made separable in the same plane, so they need to be transformed by some polynomial function and plotted on a new plane. Basically, three types of kernel are used with the SVM algorithm, namely the linear kernel, the polynomial kernel and the radial basis function kernel. Hadeer Ahmed et al. [9] used a linear SVM algorithm with TF-IDF as the feature extraction technique for the classification of fake news articles. There are several applications of SVM in the field of classification, such as heart disease classification [10], traffic analysis and identification [11] and classification of mechanical faults [12].
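The three kernels named above can be written out directly; the hyperparameter defaults (degree, c, gamma) below are illustrative choices, not values used in the paper.

```python
import math

def linear_kernel(x, y):
    """Plain dot product: no transformation of the input space."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=3, c=1.0):
    """Implicitly maps points into a space of polynomial feature
    combinations up to the given degree."""
    return (linear_kernel(x, y) + c) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function: similarity decays with squared distance,
    corresponding to an infinite-dimensional feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```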
This algorithm uses the similarity of features to predict labels for new data points. It is a non-parametric classification algorithm, as it does not assume anything about the underlying data. Its basic principle is that it first learns the structure of the data presented for training, and then, based on the proximity and likeness of the neighbours, it chooses the appropriate class. Suppose a point 'X' has three neighbours (that 3 is the K in the algorithm's name), as shown in Fig. 1, of which two belong to class 'A' and one to class 'B'; then, based on the maximum closeness to class 'A', it will categorize 'X' as a point of class 'A'. Due to its slow pace of learning, it is also considered a lazy learner algorithm. The performance of this algorithm depends mostly on the value of 'K' we select, which is also its disadvantage, because we have to assume the value of this parameter every time. Apart from these drawbacks, one significant advantage of this algorithm is that it is easy to implement and robust to noise and outliers. According to Li et al. [15], it also outperforms the SVM algorithm for feature selection.
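A minimal sketch of the K-nearest-neighbour vote described above; the training points and k = 3 in the test are illustrative, mirroring the two-class example around point 'X'.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs. Take the k points nearest
    to the query by Euclidean distance and return the majority label."""
    neighbours = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```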
The ant colony optimization (ACO) algorithm is a probabilistic strategy for solving computational problems which can be reduced to finding good paths through graphs. ACO was introduced by Dorigo M. [16] as a novel nature-inspired metaheuristic for the solution of hard combinatorial optimization (CO) problems. ACO belongs to the class of metaheuristics, which are approximate algorithms used to obtain sufficiently good solutions to hard CO problems in a reasonable amount of computation time. The inspiring source of ACO is the foraging behavior of real ants. While searching for food, ants initially explore the area surrounding their nest in a random manner. When an ant finds a food source, it evaluates the quantity and the quality of the food and carries some of it back to the nest. During the return trip, the ant deposits a chemical pheromone trail on the ground. The quantity of pheromone deposited, which may depend on the quantity and quality of the food, will guide other ants to the food source.
Algorithm 1 The Basic ACO Algorithm [17]
• Set all initial parameters and initialize the pheromone values
• while (termination conditions not met) do
• Perform ant traversal
• Search locally for a solution (optional)
• Update local or global pheromone values
• end
As mentioned before, given a list of n features, the feature selection problem is to identify a subset of minimal size m < n, while retaining a reasonably high accuracy in representing the original n features.
For the feature selection task using the ACO algorithm, some aspects should be
kept in mind [18]. The principal issue is how the graph containing the ants and
the feature set is to be visualized, i.e. the proper mapping of the feature
selection problem onto an ACO-compatible problem. The fundamental principle of
ACO is to find a shortest path in a given cost graph. We model the features as
the nodes of the graph, and the edges connecting the nodes represent the
selection of the next feature. The final solution is an ant's traversal that
passes through a minimum number of nodes in the graph while satisfying the
stopping constraints. The major benefit of any ACO algorithm is its
heuristic-based probabilistic search for the best solution.
Any feature selection based on the ACO algorithm essentially works on three main
rules: the pheromone evaporation rule, the transition probability rule and the
pheromone deposit rule.
Feature-Based Comparative Study of Machine Learning Algorithms … 19
The probability p^k_{i,j} that ant k will include edge (i, j) in its path is
given by Eq. 3:

$$
p_{i,j}^{k} =
\begin{cases}
\dfrac{\tau_{i,j}^{\alpha}\,\mu_{i,j}^{\beta}}{\sum_{e_{iv}\in E(s_p)} \tau_{iv}^{\alpha}\,\mu_{iv}^{\beta}}, & \text{if } e_{ij}\in E(s_p)\\
0, & \text{otherwise}
\end{cases}
\qquad (3)
$$
where E(s_p) is the set of features that can still be added to the partial
solution s_p, and τ and μ represent the pheromone trail and the heuristic
information, respectively. The coefficients α and β control whether the
probability depends more on the heuristic value or on the pheromone value.
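The transition rule of Eq. 3 can be sketched as follows. This is an illustrative Python version (the paper's experiments used R), with the pheromone values `tau` and heuristic values `mu` stored as dictionaries keyed by feature index; all names are ours.

```python
def transition_probability(i, candidates, tau, mu, alpha=1.0, beta=1.0):
    """Eq. 3 sketch: probability that an ant selects feature i next out of
    the candidate set, given pheromone tau[i] and heuristic value mu[i]."""
    if i not in candidates:
        return 0.0                      # the "otherwise" branch of Eq. 3
    denom = sum((tau[v] ** alpha) * (mu[v] ** beta) for v in candidates)
    return (tau[i] ** alpha) * (mu[i] ** beta) / denom
```

By construction the probabilities over the candidate set sum to 1, so the rule defines a valid distribution for the ant's next step.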
After all the ants complete their paths, the pheromone value at each node is
updated so that the ants in the next iteration do not all converge to the path
found in the first iteration. For this, the pheromone τ_{i,j} is updated by the
rule in Eq. 4, where ρ is the evaporation rate, τ^k_{i,j} is the amount of
pheromone laid by ant k on edge (i, j) and n is the number of ants:

$$
\tau_{i,j} = (1-\rho)\cdot\tau_{i,j} + \sum_{k=1}^{n} \tau_{i,j}^{k}
\qquad (4)
$$
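Eq. 4's evaporation-plus-deposit update can be written as a one-line Python helper (an illustrative sketch; names are ours):

```python
def update_pheromone(tau, deposits, rho=0.1):
    """Eq. 4 sketch: evaporate the old pheromone by factor (1 - rho), then
    add the deposits laid on this edge by each of the n ants."""
    return (1 - rho) * tau + sum(deposits)
```

Evaporation keeps old trails from dominating forever, while the summed deposits reinforce edges used by many (or successful) ants in the current iteration.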
The amount of pheromone each ant k deposits on the edge e_{i,j} is given by
Nemati et al. [19] as:

$$
\tau_{i,j}^{k} = \phi\cdot\gamma\!\left(S^{k}(t)\right)
+ \psi\cdot\frac{n - \left|S^{k}(t)\right|}{n},
\quad \text{if } i \in S^{k}(t)
\qquad (5)
$$

In Eq. 5, γ(S^k(t)) is the performance measure of the classifier, while φ and ψ
are parameters controlling the relative influence of classifier performance and
of the length of the feature subset S^k(t). The value of φ ranges between 0 and
1, and ψ = 1 − φ. Giving more preference to classifier performance than to
subset length, we set φ = 0.7 and ψ = 0.3.
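The deposit rule of Eq. 5 can be sketched in Python (an illustrative version with our own names; the paper's experiments used R). The defaults mirror the text's choice of φ = 0.7 and ψ = 0.3:

```python
def deposit(classifier_score, subset_size, n_features, phi=0.7, psi=0.3):
    """Eq. 5 sketch: blend classifier performance gamma(S^k(t)) with a
    reward for shorter subsets, (n - |S^k(t)|) / n. Expects phi + psi = 1."""
    return phi * classifier_score + psi * (n_features - subset_size) / n_features
```

With φ = 0.7, a one-point gain in classifier accuracy outweighs the same relative reduction in subset length, which matches the preference stated in the text.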
A great deal of research has already been done on determining the truthfulness
of news content available online, and most of those approaches follow the same
flow of action. We now review the literature on credibility analysis.
Kumar et al. [20] proposed an approach that uses shallow classifiers based on
particle swarm optimization (PSO) for rumour veracity detection of tweets
collected for a particular hashtag. In their model, they collected the data from
the Twitter API, the most widely used source for collecting data for analysis.
They then extracted features from the collected tweets based on various criteria,
20 U. Sharma and S. Kumar
namely content-based features such as bag of words and part of speech, and
pragmatic features such as the presence of emoticons, anxiety-related words and
sentiments. They also considered some network-based features provided by the API
itself, such as the number of followers and the number of retweets. From all
these features they selected a subset using the PSO-based classifier and compared
performance with the full feature set against the selected subset; their results
showed a significant improvement in accuracy.
Zubiaga and Ji [21] proposed an epistemology for the information verification of
tweets. They targeted tweets about natural disasters, specifically hurricanes.
The study suggests that, apart from the features available on Twitter,
image-based features can add a substantial amount of accuracy to the results. In
their research they collected tweets containing images of hurricane-related
disasters and assigned a reputation score based on features of the image, other
text-based features and author-related information.
Shu et al. [22] presented a tool for the study of facts available online. The
tool collects data from fact-checking sites such as PolitiFact and BuzzFeed and
then, based on the topics of fake and true news, searches for tweets related to
those categories. They used the social engagement of tweets as a basic feature
and, for calculating temporal engagement, used an RNN- and LSTM-based algorithm.
Further, they used SVM, LR and GNB for feature-based detection.
On the basis of our study, we propose a model (shown in Fig. 2) for credibility
analysis of online social media content (mostly tweets) using an ACO-based
feature selector; for the classification of tweets, we use the decision tree,
KNN, SVM and random forest algorithms.
The model for this credibility analysis comprises the following components:
• Data collection for classification (Twitter API)
• Feature extraction
• Feature selection (using ACO)
• Application of a machine learning algorithm (decision tree, KNN, random forest, SVM)
• Calculation of accuracy
For data collection, we made use of fact-checking websites such as PolitiFact,
which provide facts taken from various social media platforms along with their
experts' assessment of those facts as true or fake. We crawled some news topics
that were fake, and on the basis of those topics we searched for tweets under the
corresponding hashtags (#). We manually categorized the tweets on false-news
topics as fake and the others as real.
In the feature extraction process, we make use of both the features provided by
the API and some features extracted from the tweet itself. A total of 18 features
are extracted for the classification task; the features we used are listed as
follows:
Features Provided by API
1. Retweet count,
2. Favourite count,
3. User statuses count,
4. User verified,
5. Statuses followers count,
6. Friends followers count,
7. User has URL.
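The seven API-provided features listed above could be pulled from a tweet object shaped like Twitter's v1.1 JSON. The following Python sketch is our own illustration, not the authors' code (their experiments used R); the field names follow the standard v1.1 schema, but the exact mapping is an assumption:

```python
def api_features(tweet):
    """Map a tweet dict (shaped like Twitter's v1.1 JSON) to the seven
    API-provided features; the field mapping here is our assumption."""
    user = tweet.get("user", {})
    return {
        "retweet_count": tweet.get("retweet_count", 0),
        "favourite_count": tweet.get("favorite_count", 0),
        "user_statuses_count": user.get("statuses_count", 0),
        "user_verified": int(bool(user.get("verified", False))),
        "followers_count": user.get("followers_count", 0),
        "friends_count": user.get("friends_count", 0),
        "user_has_url": int(bool(user.get("url"))),
    }
```

Boolean fields are cast to 0/1 so every feature is numeric and can feed directly into the classifiers.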
Then, by using these five features, we perform the classification using the
machine learning algorithms. The performance of the algorithms is compared on
the basis of the accuracy measure, which is computed with the help of a
confusion matrix. The formula for accuracy, as given by Chouhan et al. [23], is
shown in Eq. 6:
Accuracy = (TP + TN) / (TP + TN + FP + FN) (6)
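Eq. 6 translates directly into a small Python helper (an illustrative sketch; the paper's experiments used R):

```python
def accuracy(tp, tn, fp, fn):
    """Eq. 6: accuracy from the four confusion-matrix counts
    (true positives, true negatives, false positives, false negatives)."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, a confusion matrix with 40 true positives, 50 true negatives, 5 false positives and 5 false negatives gives an accuracy of 0.9.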
6 Results
We collected tweets related to natural disaster events, specifically hurricanes,
with the hashtag #Hurricane. To run the algorithms on the dataset, we used the R
language and the RStudio IDE. The accuracy of the algorithms is compared before
and after feature selection. A significant improvement is seen in all of the
algorithms after feature selection, the largest in the decision tree (around
21%) and the smallest in KNN (around 8%). The results with KNN vary greatly with
the number of neighbours; for our computation, we set the number of neighbours
to 10 (approximately the square root of the number of samples in the test set).
A word cloud for both categories of tweets (fake and real) is also shown in
Fig. 3.
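The square-root heuristic used above for choosing the number of neighbours can be sketched as follows (our illustration in Python; the paper's computation was done in R):

```python
import math

def choose_k(n_test_samples):
    """Heuristic from the text: set KNN's number of neighbours to roughly
    the square root of the test-set size (the authors chose k = 10)."""
    return max(1, round(math.sqrt(n_test_samples)))
```

A test set of about 100 samples therefore gives k = 10, matching the value used in the experiments.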
This research suggests that the impact of feature selection is important in
classification problems, as indicated by our results (around 21%). The
performance of SVM is found to be the best among all the algorithms for the
above-mentioned dataset (Fig. 4). The performance of all the algorithms tends
to decrease when we test them on another dataset collected for some