Detecting Fake Accounts

Hoda Mokhtar, Cairo University
All content following this page was uploaded by Neamat El-Tazi on 24 May 2019.
Abstract—Online social networks (OSNs) have become increasingly popular, and people's social lives have become more associated with these sites. They use OSNs to keep in touch with each other, share news, organize events, and even run their own e-business. The rapid growth of OSNs and the massive amount of personal data of their subscribers have attracted attackers and imposters who steal personal data, share false news, and spread malicious activities. In response, researchers have started to investigate efficient techniques to detect abnormal activities and fake accounts, relying on account features and classification algorithms. However, some of the exploited account features contribute negatively to the final results or have no impact at all, and stand-alone classification algorithms do not always reach satisfactory results. In this paper, a new algorithm, SVM-NN, is proposed to provide efficient detection of fake Twitter accounts and bots. Four feature selection and dimension reduction techniques were applied, and three machine learning classification algorithms were used to decide whether a target account is real or fake: support vector machine (SVM), neural network (NN), and our newly developed algorithm, SVM-NN, which uses a smaller number of features while still correctly classifying about 98% of the accounts of our training dataset.

1. Introduction

Online social networks (OSNs), such as Facebook, Twitter, RenRen, LinkedIn, Google+, and Tuenti, have become increasingly popular over the last few years. People use OSNs to keep in touch with each other, share news, organize events, and even run their own e-business. For the period between 2014 and 2018, around 2.53 million U.S. dollars were spent on sponsoring political ads on Facebook by non-profits [1]. The open nature of OSNs and the massive amount of personal data of their subscribers have made them vulnerable to Sybil attacks [2].

In 2012, Facebook noticed abuse on their platform, including the publishing of false news, hate speech, and sensational and polarizing content, among others [3]. However, OSNs have also attracted the interest of researchers for mining and analyzing their massive amount of data, exploring and studying users' behaviours, and detecting their abnormal activities [4]. In [5], researchers conducted a study to predict, analyze, and explain customer loyalty towards a social media-based online brand community by identifying the most effective cognitive features that predict customers' attitudes.

The Facebook community continues to grow, with more than 2.2 billion monthly active users and 1.4 billion daily active users, an increase of 11% on a year-over-year basis [6]. In the second quarter of 2018 alone, Facebook reported that its total revenue was $13.2 billion, with $13.0 billion from ads alone [6]. Similarly, in the second quarter of 2018, Twitter reported reaching about one billion subscribers, with 335 million monthly active users [7]. In 2017, Twitter reported a steady revenue growth of 2.44 billion U.S. dollars, with a profit 108 million U.S. dollars lower than that of the previous year [7].

In 2015, Facebook estimated that nearly 14 million of its monthly active users are in fact undesirable, representing malicious fake accounts that were created in violation of the website's terms of service [8]. In the first quarter of 2018, Facebook for the first time shared a report that shows the internal guidelines used to enforce community standards, covering their efforts between October 2017 and March 2018. This report illustrates the amount of undesirable content that has been removed by Facebook across six categories: graphic violence, adult nudity and sexual activity, terrorist propaganda, hate speech, spam, and fake accounts [9]. 837 million posts containing spam were taken down, and about 583 million fake accounts were disabled; Facebook also removed around 81 million pieces of content across the remaining violating content types.

However, even after removing millions of fake accounts from Facebook, it was estimated that around 88 million accounts are still fake [9]. For such OSNs, the existence of fake accounts leads advertisers, developers, and investors to distrust their reported user metrics, which would negatively impact their revenues; recently, banks and financial institutions in the U.S. have started to analyze the Twitter and Facebook accounts of loan applicants before actually granting a loan [10].
Attackers follow the notion that OSN user accounts are "keys to walled gardens" [11], so they pass themselves off as somebody else, using photos and profiles that are either snatched from a real person without his/her knowledge or generated artificially, to spread fake news and steal personal information. These fake accounts are generally called imposters [8], [12]. In both cases, such fake accounts have a harmful effect on users, and their motives would be anything other than good intentions: they usually flood spam messages or steal private data. They are keen to phish naive individual users into phony relationships that lead to sex scams, human trafficking, and even political astroturfing [13], [14], [15], [8], [12].

Statistics show that 40% of parents in the United States and 18% of teens have a great concern about the use of fake accounts and bots on social media to sell or influence products [16]. As another example, during the 2012 US election campaign, the Twitter account of challenger Romney experienced a sudden jump in the number of followers; the great majority of them were later claimed to be fake followers [17]. To enhance their effectiveness, these malicious accounts are often armed with stealthy automated tweeting programs that mimic real users, known as bots [14].

In December 2015, Adrian Chen, a reporter for the New Yorker, noted that he had seen many of the Russian accounts that he was tracking switch to pro-Trump efforts, but many of those were better described as troll accounts managed by real people and meant to mimic American social media users [18]. Similarly, before the general Italian elections of February 2013, online blogs and newspapers reported statistical data on a supposed percentage of fake followers of major candidates [19].

Detecting those threatening accounts in OSNs has become a must in order to avoid various malicious activities, ensure the security of users' accounts, and protect personal information. Researchers attempt to come up with automated detection tools for identifying fake accounts, a task which would be labor-intensive and costly if done manually. The outcome of these attempts may allow an OSN operator to detect fake accounts efficiently and effectively; it would improve the experience of its users by preventing annoying spam messages and other abusive content. The OSN operator can also increase the credibility of its user metrics and enable third parties to trust its user accounts [20].

Information security and privacy are among the primary requirements of social network users; maintaining and providing those requirements increases network credibility and subsequently its revenues. OSNs employ different detection algorithms and mitigation approaches to address the growing threat of fake/malicious accounts.

Researchers focus on identifying fake accounts by analyzing user-level activity, extracting features (e.g. number of posts, number of followers, profile attributes) from recent user activity and applying a trained machine learning technique for real/fake account classification [8], [21]. Another approach uses the graph-level structure, where the OSN is modeled as a graph, essentially a collection of nodes and edges: each node represents an entity (e.g. an account), and each edge represents a relationship (e.g. friendship) [20], [22]. Though Sybil accounts find ways to cloak their behaviour with patterns resembling real accounts [14], [23], [24], they still manifest numerous distinctive profile features and activity patterns. Even so, automated Sybil detection is not always robust against adversarial attacks and does not always yield desirable accuracy.

In this paper, a hybrid classification algorithm is proposed that runs the neural network (NN) [25] classification algorithm on the decision values resulting from the support vector machine (SVM) [8], [20], [26]. This algorithm uses a smaller number of features while still correctly classifying about 98% of the accounts of our training dataset. In addition, we also validated the detection performance of our classifiers over two other sets of real and fake accounts, disjoint from the original training dataset, as illustrated in Section 4.

The rest of this paper is organized as follows. Section 2 provides an overview of the research carried out on the Twitter network and prior research on fake profile detection. In Section 3, the Twitter dataset is described. Section 4 demonstrates how the collected data has been pre-processed and used to classify the accounts into fake and real accounts. In Section 5, the overall accuracy rates are discussed and compared across all used methods. In Section 6, we present our conclusions.

2. Related work

Inspired by the importance of detecting fake accounts, researchers have recently started to investigate efficient fake account detection mechanisms. Most detection mechanisms attempt to predict and classify user accounts as real or fake (malicious, Sybil) by analyzing user-level activities or graph-level structures. Several data mining methodologies [4] and approaches that help detect fake accounts are described in the following subsections.

2.1. Feature Based detection

This approach relies on user-level activities and account details (user logs and profiles). Unique features are extracted from recent user activities (e.g. frequency of friend requests, fraction of accepted requests), then those features are applied to a classifier that has been trained offline using machine learning techniques [20], [27], [28], [29].

In [27], the authors used a click-stream dataset provided by RenRen, a social network used in China [30], to cluster user accounts into similar behavioral groups corresponding to real or fake accounts, using the METIS clustering algorithm with both session and click features, such as:

• Average clicks per session
• Average session length
• The percentage of clicks used to send friend requests
• Visit photos
• Share contents
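The feature-based approach above can be sketched in a few lines: extract per-account behavioral features and feed them to an offline-trained classifier. The following is a minimal illustration only, not the pipeline of the cited work; the feature values are invented toy data, and an RBF-kernel SVM stands in for whichever classifier a given study used.

```python
# Minimal sketch of feature-based detection: per-account behavioral
# features -> offline-trained classifier. All numbers are invented.
import numpy as np
from sklearn.svm import SVC

# Hypothetical features per account:
# [avg clicks per session, avg session length (s), fraction of clicks
#  that send friend requests]
X = np.array([
    [12.0, 300.0, 0.02],   # real-looking: long sessions, few requests
    [90.0,  45.0, 0.60],   # fake-looking: short bursts, many requests
    [10.0, 280.0, 0.01],
    [85.0,  50.0, 0.55],
])
y = np.array([0, 1, 0, 1])  # 0 = real, 1 = fake

clf = SVC(kernel="rbf")     # train the classifier offline
clf.fit(X, y)

# Classify an unseen account from its recent activity
print(clf.predict([[80.0, 40.0, 0.5]]))
```

In practice the feature vector would be built from logs such as the click-stream data described above, and the classifier trained on ground-truth labels provided by the OSN.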
The authors were able to classify the data with a 3% false positive rate and a 1% false negative rate.
In [29], the authors used ground truth provided by RenRen to train an SVM classifier in order to detect fake accounts, using simple features such as:

• Frequency of friend requests
• Fraction of accepted requests

The authors were able to train a classifier with a 99% true-positive rate (TPR) and a 0.7% false-positive rate (FPR).
In [26], researchers used a ground truth provided by Twitter; the data were processed using two main approaches:

• Single classification rules
• Feature sets proposed in the literature for detecting spammers

Some features were taken from previous work, such as the Stateofsearch.com rule set [31] and the Socialbakers rule set [32]. The authors were able to correctly classify more than 95% of the accounts of the original training set.

2.2. Feature Reduction

High-dimensional data can be a serious problem for many classification algorithms because of their high computational cost and memory usage. On the other hand, reducing the dimension space removes noisy (i.e. irrelevant) and redundant features and leads to a better classification model and simpler visualization [33]. Feature reduction techniques can be categorized into two types:

• Dimension reduction, where the data in high-dimensional space are transformed into a space of fewer dimensions.
• Feature subset selection, where the original feature set is split into selected subsets of features to build simpler and faster models, increase the models' performance, and gain a better understanding of the data. Feature subset selection can be broken down into filtering methods and wrapping methods.

2.2.1. Principal Component Analysis (PCA). PCA is a technique used to identify the features (dimensions) that best explain the predominant normal user behavior [29], [34]. PCA projects high-dimensional data into a low-dimensional subspace (called the normal subspace) of the top-N principal components that accounts for as much variability in the data as possible.
In [29], the authors used real data from three social networks (14K Facebook users, 92K Yelp users, 100K Twitter users) to demonstrate that normal user behavior is low dimensional. They also evaluated their anomaly detection technique using extensive ground-truth data of anomalous behavior exhibited by fake, compromised, and colluding users. This approach achieves a detection rate of over 66% (covering more than 94% of misbehavior) with less than 0.3% false positives.

2.3. Neural network (NN) and Support vector machine (SVM)

In [25], the authors extracted the profile features using PCA and then applied neural networks and support vector machines to detect legitimate profiles. "Variance maximization" was selected as the mathematical way of deriving the PCA results, which were:

• Number of Languages
• Profile Summary
• Number of Connections
• Number of Skills
• Number of LinkedIn Groups
• Number of Publications

The authors applied neural networks with resilient back propagation (Rprop) and SVM with C-support vector classification and a polynomial kernel (polydot) as the kernel function. The findings of this paper showed that using PCA for dimension reduction produces better accuracy than using all the features without any selection.
Even though feature-based detection scales to large OSNs, it is still relatively easy to circumvent. Researchers have found that attackers change their behaviour to evade spam detection techniques by adversely modifying the content and activity patterns of their fakes [20]. Feature-based detection does not provide any formal security guarantees and often results in a high false positive rate in practice [21].

3. Baseline datasets

In this research we used the "MIB" dataset [26] to perform our study; it consists of three datasets collected from Twitter, as follows.

3.1. Baseline dataset of Real followers

The authors of [26] mentioned that they had two datasets of real Twitter followers.

3.1.1. The Fake Project. The Fake Project dataset, henceforth TFP, was collected in a research project owned by researchers at IIT-CNR in Pisa, Italy. It consists of 469 volunteer accounts that have been certified as human by CAPTCHA.

3.1.2. #elezioni2013 dataset. The #elezioni2013 dataset was used to support a research initiative for a sociological study in collaboration with the University of Perugia and the Sapienza University of Rome. This study added a set of 1481 Twitter accounts with different professional backgrounds, excluding the following categories: politicians, parties, journalists, and bloggers. Those selected accounts were manually labeled as human by two sociologists from the University of Perugia, Italy. In the rest of this paper the dataset of this section is referred to as E13.
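The PCA projection described in Section 2.2.1 can be sketched as follows. This is a generic illustration under assumed inputs (random data standing in for account feature vectors), not the exact setup of [29]; it simply shows high-dimensional features being projected onto the top-N principal components.

```python
# Sketch of PCA-based dimension reduction: project account features onto
# the top-N principal components (the low-dimensional "normal subspace").
# The data here are random and purely illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))          # 100 accounts, 16 raw features

pca = PCA(n_components=3)               # keep the top-3 components
X_low = pca.fit_transform(X)            # projected, low-dimensional data

print(X_low.shape)                      # (100, 3)
print(pca.explained_variance_ratio_)    # variability captured per component
```

The `explained_variance_ratio_` values quantify how much of the data's variability each retained component accounts for, which is the criterion PCA uses when choosing the normal subspace.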
3.2. Baseline dataset of fake followers
4.1. Data Pre-Processing

The Radial Basis Function (RBF) was used as the SVM classifier kernel, and the classifier was trained using the libSVM machine learning library [40]. For the NN, feed-forward back propagation was selected as the base algorithm.
In each one of the feature sets, it was noticed that there is a feature subset that provides better classification accuracy than the other subsets: for Spearman's rank-order correlation the best feature subset was (101000001000111), for multiple linear regression the best feature subset was (0011110110001111), and for Wrapper-SVM the best feature subset was (110111111110111).

Algorithm 1: SVM-NN
Result: feature subsets classification accuracy
1  Identify the list of reduced features using PCA, Correlation, Regression, and SVM;
2  Set the feature subsets to s;
3  Split the data into testing and training sets using 8-fold cross validation;
4  Set the training identifying labels to rLabel; set the testing identifying labels to sLabel;
5  for each s do
6      Use the SVM classification algorithm to train the model using the training set and the identifying labels rLabel;
7      Predict the output using the trained SVM model, and set the output decision values to decisionV;
8      Train the NN model using decisionV and the identifying labels rLabel;
9      Predict the testing set output using the trained SVM model, and set the output decision values to testingDecisionV;
10     Test the NN using testingDecisionV and the trained NN model; set the output to nnPredicted;
11     Calculate the NN prediction accuracy for each s using sLabel and nnPredicted;
12 end
13 Calculate the average accuracy for each fold

5. Performance and Evaluation

In this section the results and findings of this work are explained and evaluated.
Initially, three different classification algorithms were trained and tested using four divergent feature sets. The neural network and SVM classification algorithms have been used as the principal mining techniques in many social network studies, so they were applied to the feature sets mentioned in Section 4.2 and compared with the proposed SVM-NN algorithm. Those techniques are discussed in the following subsections.

5.1. SVM classification
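The core of Algorithm 1 (steps 6-11) can be sketched as follows: an SVM is trained, its decision values rather than its raw predictions are extracted, and a neural network is trained on those decision values. This is a hedged illustration only: it uses synthetic data, scikit-learn's `SVC` and `MLPClassifier` in place of the paper's libSVM RBF classifier and feed-forward back-propagation NN, and a single train/test split in place of the 8-fold cross validation.

```python
# Sketch of the SVM-NN hybrid in Algorithm 1: SVM decision values feed a
# neural network. Data, split, and hyper-parameters are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a labeled account-feature dataset
X, y = make_classification(n_samples=400, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf").fit(X_tr, y_tr)               # step 6: train SVM
dec_tr = svm.decision_function(X_tr).reshape(-1, 1)   # step 7: decisionV
nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                   random_state=0).fit(dec_tr, y_tr)  # step 8: train NN
dec_te = svm.decision_function(X_te).reshape(-1, 1)   # step 9: testingDecisionV
pred = nn.predict(dec_te)                             # step 10: nnPredicted
print("accuracy:", accuracy_score(y_te, pred))        # step 11
```

The design choice is that the NN acts on the SVM's signed margin (its distance from the separating hyperplane) instead of the raw feature vector, so the second-stage model sees a single, highly discriminative input per account regardless of how many features the chosen subset contains.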