
Chandigarh University

November 2023


Smart Music Recommendation System

A PROJECT REPORT

Submitted by

Shriyam Prasad
Rishabh Patel

in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE ENGINEERING


Chandigarh University

MAY 2023

BONAFIDE CERTIFICATE

Certified that this project report "Smart Music Recommendation System" is the bonafide work of "Shriyam Prasad and Rishabh Patel", who carried out the project work under my/our supervision.

SIGNATURE                                SIGNATURE

SUPERVISOR                               HEAD OF THE DEPARTMENT
Music Recommender System: Artificial Intelligence

Abstract

In this project, we have designed, implemented and analyzed a song recommendation system. We used the Million Song Dataset [1] provided by Kaggle to find correlations between users and songs and to learn from users' previous listening histories in order to recommend the songs they would most prefer to listen to. In this paper, we discuss the problems we faced, the methods we implemented, the results, and their analysis. We obtained our best results with a memory-based collaborative filtering algorithm. We believe a content-based model would have worked better if we had had enough memory and computational power to use the whole available metadata and training dataset.

A music recommender system learns from a user's past listening history and recommends songs that he would probably like to hear in the future. We implemented various algorithms to try to build an effective recommender system. We first implemented a popularity-based model, which is quite simple and intuitive. We also implemented collaborative filtering algorithms, which predict (filter) the taste of a user by collecting preferences and tastes from many other users (collaborating). Finally, we experimented with content-based models built on latent factors and metadata.

Keywords- recommendation systems, music, Million Song Dataset, collaborative filtering, content-based

1 Introduction

The rapid development of mobile devices and the internet has made it possible for us to access different music resources freely. The number of songs available exceeds the listening capacity of any single individual, and people sometimes find it difficult to choose from millions of songs. Moreover, music service providers need an efficient way to manage songs and to help their customers discover music by giving quality recommendations. Thus, there is a strong need for a good recommendation system. Currently, many music streaming services, like Pandora, Spotify, etc., are working on building high-precision commercial music recommendation systems. These companies generate revenue by helping their customers discover relevant music and charging them for the quality of their recommendation service. Thus, there is a strong, thriving market for good music recommendation systems.

2 Dataset

We used the data provided by the Million Song Dataset Challenge hosted by Kaggle. It was released by the Columbia University Laboratory for the Recognition and Organization of Speech and Audio. The data is open; metadata, audio content analysis, etc. are available for all the songs. It is also very large: it contains around 48 million (userid, songid, play count) triplets collected from the histories of over one million users, plus metadata (280 GB) for millions of songs [7]. The users are anonymous, however, so information about their demography and the timestamps of listening events is not available. The feedback is implicit, as play counts are given instead of explicit ratings. The contest was to predict one half of the listening histories of 11,000 users by training on their other half and on the full listening histories of another one million users.

Since processing such a large dataset is highly memory- and CPU-intensive, we used the validation set as our main data. It consists of 10,000,000 triplets from 10,000 users. We used the metadata of only 10,000 songs (around 3 GB). From the huge amount of song metadata, we focus only on the features that seem most relevant in characterizing a song; we decided that information like year, duration, hotness, danceability, etc. may best distinguish one song from another. To increase processing speed, we converted user and song ids from strings to integers.

3 Algorithms
We have implemented four different algorithms to build
an efficient recommendation system.

3.1 Popularity based Model

This is the most basic and simple algorithm. We find the popularity of each song by looking into the training set and counting the number of users who have listened to it. Songs are then sorted in descending order of popularity, and for each user we recommend the most popular songs, excluding those already in his profile. This method involves no personalization, and some songs may never be recommended.

3.2 Collaborative based Model

Collaborative filtering involves collecting information from many users and then making predictions based on some similarity measure between users and between items. It can be classified into user-based and item-based models.

Figure 1: Item Based

In the item-based model [Figure 1][8], it is assumed that songs which are often listened to together by some users tend to be similar, and are more likely to be listened to together in the future by other users as well.

According to the user-based similarity model [Figure 2][8], users who have similar listening histories, i.e., who have listened to the same songs in the past, tend to have similar interests and will probably listen to the same songs in the future too.

We need a similarity measure to compare two songs or two users. Cosine similarity weighs every user equally, which is usually not desirable: a user should be weighted less if he has shown interest in a wide variety of items (it shows that either he does not discern between songs based on their quality, or he just likes to explore), and weighted more if he listens to a very limited set of songs. Cosine similarity also has the drawback that songs listened to by many users get high similarity values not because they are similar and listened to together, but simply because they are more popular.

We have therefore used a conditional-probability-based model of similarity [2] between users and between items; for items i and j,

    w_ij = P(i|j)^alpha * P(j|i)^(1-alpha),  alpha in [0, 1],

and analogously for users.
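The conditional-probability similarity above can be sketched on a toy play matrix as follows. This is a minimal illustration of the measure from [2], not the paper's actual code; variable names and the toy data are our own.

```python
import numpy as np

def item_similarity(plays, alpha=0.15):
    """Conditional-probability item similarity:
    w[i, j] = P(i|j)^alpha * P(j|i)^(1-alpha),
    estimated from a user-x-song play matrix (assumes every song
    has at least one listener, so no division by zero)."""
    binary = (plays > 0).astype(float)      # implicit feedback: listened or not
    co = binary.T @ binary                  # co[i, j] = users who played both i and j
    counts = np.diag(co)                    # counts[i] = users who played song i
    p_i_given_j = co / counts[np.newaxis, :]   # P(i|j) = co[i, j] / counts[j]
    p_j_given_i = co / counts[:, np.newaxis]   # P(j|i) = co[i, j] / counts[i]
    return (p_i_given_j ** alpha) * (p_j_given_i ** (1 - alpha))

# toy matrix: 3 users (rows) x 3 songs (columns)
plays = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 1, 1]])
w = item_similarity(plays)
```

Note that this measure is deliberately asymmetric: w[i, j] differs from w[j, i] when the two songs have different listener counts, which is exactly how it corrects for popularity.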
Different values of alpha were tested to finally arrive at a good similarity measure. Then, for each new user u and song i, a user-based scoring function is calculated by aggregating the similarities between u and the users who listened to i; similarly, an item-based scoring function aggregates the similarities between i and the songs in u's listening history.

Locality of the scoring function is also necessary, to emphasize the items that are more similar. We have used an exponentiation function to determine locality,

    f(w) = w^q,

which determines how the individual scoring components affect the overall score between two items: the most similar items are emphasized, while the contribution of less similar ones drops towards zero.

After computing the user-based and item-based lists, we used stochastic aggregation to combine them. This is done by randomly choosing one of the two lists according to a probability distribution and then recommending the top-scored items from it. When the song history of a user is too small to utilize the user-based recommendation algorithm, we can offer recommendations based on song similarity, which yields better results when the number of songs is smaller than the number of users.

Figure 2: User Based

Figure 3: Matrix M

We got the best results for stochastic aggregation of the item-based model with q = 3 and alpha = 0.15, and of the user-based model with q = 5 and alpha = 0.3, giving an overall mAP of 0.08212 (see details about mAP in Sec. 3.5).

This method does not include any personalization. Moreover, the majority of songs have too few listeners, so they are the least likely to be recommended. Still, we, like the Kaggle winner, got our best results from this method among all the others. We did not use the play-count information in the final result, as it did not give good results: the similarity model becomes biased towards a few songs played many times, and calculation noise is generated by a few very popular songs.

3.3 SVD Model

Listening histories are influenced by a set of factors specific to the domain (e.g. genre, artist). These factors are in general not at all obvious, and we need to infer these so-called latent factors [4] from the data. Users and songs are then characterized by latent factors.

Here, to handle such a large amount of data, we build a sparse matrix [6] from the user-song triplets and operate directly on the matrix [Figure 3], instead of looping over millions of songs and users. We used truncated SVD for dimensionality reduction.
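The sparse-matrix construction and truncated SVD described above can be sketched with scipy. This is a minimal sketch on toy triplets (the paper's real matrix has millions of rows and columns); the toy data and k value are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# toy (user_id, song_id, play_count) triplets, ids already mapped to integers
triplets = [(0, 0, 3), (0, 1, 1), (1, 0, 2), (1, 2, 5),
            (2, 1, 1), (2, 2, 2), (3, 3, 4)]
users, songs, counts = zip(*triplets)

# build the sparse user-song matrix M directly from the triplets
M = csr_matrix((counts, (users, songs)), shape=(4, 4), dtype=float)

# truncated SVD: keep k latent factors, so M ~ U @ diag(s) @ Vt
k = 2
U, s, Vt = svds(M, k=k)

# score every song for every user from the low-rank reconstruction
scores = U @ np.diag(s) @ Vt
top_song_for_user0 = int(np.argmax(scores[0]))
```

In practice one would mask out songs already in the user's history before taking the argmax, exactly as in the popularity model.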
We used the SVD algorithm in this model as follows. First, we decompose the matrix M into a latent feature space that relates users to songs:

    M = U S V^T,  where U is m x k, S is k x k, and V^T is k x n.

Here, U represents the user factors and V represents the item factors. Then, for each user, a personalized recommendation is given by ranking the songs according to their scores in the reconstructed matrix U S V^T.

Though the theory behind SVD is quite compelling, there is not enough data for the algorithm to arrive at a good prediction. The median number of songs in a user's play-count history is fourteen to fifteen; this sparseness does not allow the SVD objective function to converge to a global optimum.

3.4 KNN Model

In this method, we utilize the available metadata. We create a space of songs according to their features from the metadata and find the neighborhood of each song. We chose some of the available features (e.g., loudness, genre, mode, etc.) which we found most relevant for distinguishing one song from another. After creating the feature space, to recommend songs to a user we look at his profile and suggest songs which are neighbors of the songs present in his listening history; we have taken the top 50 neighbors of each song. This model is quite personalized and uses metadata. But since the 280 GB metadata file takes a huge amount of time to process, we extracted features for only 3 GB of it (10,000 songs), which is less than 2% of the total. Because of this, we had features for only a small number of songs, which gives us very low precision.

3.5 Evaluation Metrics

We used mean Average Precision (mAP) as our evaluation metric, because this metric was used in the Kaggle challenge, which helps us compare our results with others'. Moreover, precision is much more important than recall here, because false positives can lead to a poor user experience. Our metric averages the proportion of correct recommendations, giving more weight to the top recommendations. There are three steps in computing mAP. First, the precision at each cutoff k is calculated; it gives the proportion of correct recommendations within the top k of the predicted rankings:

    P(k) = (number of correct recommendations in the top k) / k.

Then, for each user u, the average precision over the cutoffs is evaluated:

    AP(u) = (1 / n_u) * sum_k P(k) * rel(k),

where rel(k) is 1 if the song at rank k is correct and 0 otherwise, and n_u is the number of relevant songs. Finally, the mean over all users is taken:

    mAP = (1 / |U|) * sum_u AP(u).

Figure 4: Results

4 Results

We got the best results for the memory-based collaborative filtering algorithm. Our SVD-based latent factor model gives better results than the popularity-based model; it lags behind the collaborative filtering algorithm because the matrix was too sparse, which prevented the objective function from converging to a global optimum. Our K-NN model did not work well and performs worse than even
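The three mAP steps above can be written out directly. This is our own sketch; the recommendation lists and relevant sets below are toy examples, not the paper's data.

```python
def average_precision(recommended, relevant, tau=500):
    """AP for one user: mean of precision-at-k over the ranks k
    at which a relevant song appears, up to tau recommendations."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for k, song in enumerate(recommended[:tau], start=1):
        if song in relevant:
            hits += 1
            precision_sum += hits / k       # precision at this cutoff k
    return precision_sum / min(len(relevant), tau)

def mean_average_precision(rec_lists, rel_sets):
    """mAP: mean of the per-user average precisions."""
    aps = [average_precision(r, s) for r, s in zip(rec_lists, rel_sets)]
    return sum(aps) / len(aps)

# two toy users: ranked recommendations and their true (held-out) songs
m = mean_average_precision(
    rec_lists=[["a", "b", "c"], ["x", "y", "z"]],
    rel_sets=[{"a", "c"}, {"y"}],
)
```

Because each precision term is divided by its rank k, a correct song at rank 1 contributes far more than one at rank 100, which is what "giving more weight to the top recommendations" means.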
the popularity model. The reason is that we have the features of only 10,000 songs, which is less than 3% of the whole dataset, so only some of these 10,000 songs could ever be recommended. This huge lack of information leads to the bad performance of the method.

5 Conclusion

This is a project for our Artificial Intelligence course. We found it very valuable, as we got a chance to practice the theories we learnt in the course, to do some implementation, and to try to get a better understanding of a real artificial intelligence problem: a music recommender system. There are many different approaches to this problem, and we got to know some algorithms in detail, especially the four models we have explained in this paper. By manipulating the dataset, changing the learning and testing sets, changing some parameters of the problem, and analyzing the results, we gained a lot of practical skill. We faced many problems in dealing with this huge dataset and in exploring it well, and we also had difficulties with some programming details. However, with a lot of effort, we overcame all of these.

The best part of this project was the teamwork. The two of us come from different countries and thus have different cultures and ways of working. We took a bit of time to get to know each other, to adjust, and to perform like a team. We became much more efficient once the team spirit formed, and we also enjoyed it more. We both find this project a nice experience, and all the effort put in was worth it. We have learnt a lot from this project.

In terms of research, we still have a lot to do to improve our study. Music recommendation is such a wide, open and complicated subject that we can take further initiatives and do many more tests in the future. We also came to realize that building a recommender system is not a trivial task, and that the large-scale dataset makes it difficult in many respects. Firstly, recommending 500 correct songs out of 380 million for different users is not an easy task, so it is hard to get high precision; that is why we did not get any result better than 10%, and even the Kaggle winner only got 17%. Secondly, the metadata contains a huge amount of information, and it is difficult to extract the features relevant to a song when exploring it. Thirdly, technically speaking, processing such a huge dataset is memory- and CPU-intensive.

All these difficulties, due to the data and to the system itself, make the problem more challenging and also more attractive. We hope that we will get other opportunities in the future to work in the domain of artificial intelligence. We are certain that we can do a better job.

6 Future work

• Run the algorithms on a distributed system, like Hadoop or Condor, to parallelize the computation, decrease the runtime, and leverage distributed memory to run on the complete MSD.
• Combine different methods and learn the weight for each method according to the dataset.
• Automatically generate relevant features.
• Develop more recommendation algorithms based on different data (e.g. how the user is feeling, social recommendation, etc.).

7 Acknowledgements

We would like to acknowledge the efforts of Dr. Amitabha Mukherjee, as without his constant support and guidance this project would not have been possible.

References

[1] McFee, B., Bertin-Mahieux, T., Ellis, D. P., Lanckriet, G. R. (2012, April). The million song dataset challenge. In Proceedings of the 21st International Conference Companion on World Wide Web (pp. 909-916). ACM.

[2] Aiolli, F. (2012). A preliminary study on a recommender system for the Million Songs Dataset challenge. Preference Learning: Problems and Applications in AI.

[3] Koren, Yehuda. "Recommender system utilizing collaborative filtering combining explicit and implicit feedback with both neighborhood and latent factor models."

[4] Cremonesi, Paolo, Yehuda Koren, and Roberto Turrin. "Performance of recommender algorithms on top-n recommendation tasks." Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 2010.

[5] Bertin-Mahieux, T., et al. "The Million Song Dataset." Proceedings of the 12th International Society for Music Information Retrieval Conference, 2011.

[6] Sparse Matrices. http://docs.scipy.org/doc/scipy/reference/sparse.html

[7] Mahieux, Ellis. http://labrosa.ee.columbia.edu/millionsong/tasteprofile

[8] http://www.bridgewell.com/images en

Bot Application in Twitter

Abstract—A Twitter bot is a Twitter account programmed to automatically perform social activities by sending tweets through a scheduling program. Some bots intend to disseminate useful information such as earthquake and weather information. However, not a few bots have a negative influence, such as broadcasting false news or spam, or acting as followers to inflate an account's popularity. They can change public sentiment about an issue, decrease user confidence, or even disrupt the social order. Therefore, an application is needed to distinguish between bot and non-bot accounts. Based on these problems, this paper develops a bot detection system using machine learning for multiclass classification. The classes are human, informative, spammer, and fake follower. Model training used supervised methods based on labeled training data. First, a dataset of 2,333 accounts was pre-processed to obtain 28 feature sets for classification. This feature set came from analysis of user profiles, temporal analysis, and analysis of tweets with numeric values. Afterward, the data was partitioned, normalized with scaling, and a random forest classifier algorithm was implemented on the data. After that, the features were reselected into 17 feature sets to obtain the highest accuracy achieved by the model. In the evaluation stage, the bot detection model generated an accuracy of 96.79%, 97% precision, 96% recall, and an f-1 score of 96%; the detection model was therefore classified as having high accuracy. The completed bot detection model was then implemented on a website and deployed to the cloud. In the end, this machine learning-based web application could be accessed and used by the public to detect Twitter bots.

Keywords—Bot Detection, Multiclass Classification, Machine Learning, Supervised Learning, Twitter.

I. INTRODUCTION

One of the popular social media apps today is Twitter. It was launched in 2006 by Jack Dorsey [1]. Twitter is a micro-blog service in which users can send short messages, called tweets. Each tweet is limited to 140 characters. However, the simplicity and the capacity to send tweets as often as possible are the added value of this application. According to statistics from 2018, there have been 326 million active Twitter users each month, with an average of 500 million tweets sent every day [2].

Bots, or automated programs, are growing in popularity alongside Twitter's popularity. They are created using the Twitter Application Programming Interface (API). A study by the United States Securities and Exchange Commission in 2016 found that at least 8.5% of active Twitter users were bots [3].

Bots have various objectives. It is undeniable that some bots are beneficial, disseminating weather or earthquake information [4]. However, many malicious bots exist too. They do harm by broadcasting malware links [5], disrupting other users, broadcasting terrorist propaganda, spam, and fake news, creating hoaxes, and running political campaigns. A massive tweet volume is capable of polluting users' timelines, changing user perception, damaging user confidence, affecting the stock market, and even undermining the social order. Therefore, basic knowledge to distinguish between types of bot accounts and non-bot accounts is required.

Conventionally, bot and non-bot accounts can be identified by observing an account's activity pattern: for example, noticing that a particular account does more retweeting than creating original tweets, or writes many tweets but has only a few followers. The account may also lack a biography and a profile picture, and may write the same tweet content as another user at the same time. However, such cognitive approaches are rated inefficient and merely focus on precision.

Therefore, an approach to detecting bots with machine learning is created. Machine learning models are used because of their capacity to analyze extensive data based on parameters. According to one study, the random forest algorithm has the best accuracy, 95%, compared to the Naïve Bayes Multinomial algorithm (70%), the Naïve Bayes Gaussian algorithm (68%), and the Logistic Regression algorithm (52%) [6]. Therefore, the random forest classifier algorithm is selected in this paper. The system performs multiclass classification by assigning accounts to four different classes: human, informative, spammer, and fake follower. Then, the completed machine learning model was embedded in a web application; the website serves as an interface for end-users to use the machine learning system. In the end, the system underwent a deployment process to a cloud service.

1,2,3 Department of Electrical and Information Engineering, Faculty of Engineering, Universitas Gadjah Mada, Jln. Grafika No.2, Kampus UGM Yogyakarta 55281 INDONESIA (tlp: 0274-552305; email: 1aqilah.aini.z@mail.ugm.ac.id, 2widyawan@ugm.ac.id, 3silmi@ugm.ac.id)


• Multilabel classification is a classification process that puts samples into a set of targets. This classification predicts properties of data that are not mutually exclusive. Examples of this classification are found in document classification.

B. Machine Learning

Machine learning is a technique that enables a system to learn from data, rather than relying on direct programming, so that it can deliver relevant results [8].

C. Random Forest Algorithm

Random forest was first introduced by Leo Breiman [9]. The random forest classifier is a development of the decision tree. It consists of a combination of many decision trees, with each tree relying on independent random vector values with an equivalent distribution for each tree [9].

D. Twitter Social Media

Twitter is a micro-blogging social network that allows its users to send and read short messages of up to 140 characters, called tweets [10]. Jack Dorsey founded this social medium in 2006. Unlike social media such as Facebook or MySpace, on Twitter the relationship between an account and its followers is not reciprocal: an account can follow other accounts without automatically being followed back.

E. Bots and Twitter Bot Types

In general, a bot is an application that performs tasks automatically. In social media, the term refers to accounts programmed to perform social media activities automatically, so that they look like real humans. According to research from the University of Southern California, at least 9% to 15% of active Twitter users are bots [11]. As of 2017, there were 319 million active users each month, which means almost 48 million bot accounts are spread across the Twitter social network.

Factors that influence bot growth include Twitter API support, bot development cycles that can be completed quickly, Twitter's public platform, and the flexibility to create as many accounts as possible. According to the Digital Forensic Research (DFR) Lab of the Atlantic Council, several features indicate that an account is a bot, including amplification, anonymity, activity, similarity, and a description of "bot" in the account [12]. The Twitter bot types based on account activity are as follows.

• Informative, i.e., a bot that functions to disseminate information to users: for example, bots that publish facts or earthquake information, or write poetry and humor content [13].
• Spammers, i.e., bots that work to broadcast spam content [14].
• Fake followers, i.e., bots that act as shadow followers for an account. The purpose of using fake followers is to create the image that an account has prominent popularity [14].

F. Flask

Flask is a Python micro-framework that provides the basic functionality of a web framework [15]. It is named a micro-framework because it has a very simple core, yet it can be expanded by adding plug-ins for extra functionality and features.

G. Heroku

Heroku is a cloud-computing application that is useful for deployment and management services. As a Platform as a Service (PaaS), Heroku provides a service that allows running scripts directly without requiring complex configuration, so developers can focus on application code development without needing to think about architecture and servers.

H. Waterfall Method

This method emphasizes the planning and scheduling process before starting system development. It is best used when the product definition is clear, the project is short-lived, the technology is known, and resources are available. Its advantages are organized documentation, suitability for well-understood requirements, and ease of understanding. Its weaknesses are the need for careful management, the fact that small mistakes become big problems if not noticed from the beginning of development, high risk, and being a poor model for intricate work.

I. Classification Evaluation

A metric evaluation is a set of metrics used to measure a classifier's performance. Different metrics measure different classifier characteristics. Evaluation metrics consist of three types, namely threshold, probability, and ranking metrics.

III. METHODOLOGY

The reference method employed in this paper is the waterfall method. This method was selected because it is easy to understand; moreover, the system requirements were precise, the technology was known, and the documentation was organized. The research process flow was carried out based on the flowchart in Fig. 1.

In the initial stage, the problems occurring at present were identified. Then, a literature study was carried out to collect references related to bot detection systems. After that, existing bot detection systems were observed, namely Botometer, Bot or Not, and I Spot a Bot. The next step was to create a system design, then download data from public repositories and crawl the Twitter API for supporting data.

The downloaded data then formed the research dataset. Furthermore, data pre-processing took place in the form of data cleansing and feature engineering. Feature engineering was carried out by extracting basic parameters from the dataset and creating derived parameters through a calculation process. This feature set then fed three classes of analysis, namely temporal analysis, tweet content analysis, and user profile analysis. Then, the data were aggregated according



Fig. 1 Flowchart of research implementation.

to the average value of each username. Scaling was applied to this aggregated data to change the range of the values to between 0 and 100. The next step was data partitioning, which divided the data into training data and test data. Throughout the implementation process, it was recognized that the 80:20 partition produced the best accuracy. Table I shows the partition ratios and accuracies.

TABLE I
PARTITION AND ACCURACY RATIO

Partition Ratio   Accuracy
50:50             95.12%
60:40             96.36%
70:30             96.57%
80:20             96.79%
90:10             93.16%

After the data was partitioned with a ratio of 80:20, the training data was used in the random forest classifier implementation. The obtained results were then tested for accuracy based on the feature-importance threshold. This process was employed to re-select features in order to achieve the highest accuracy based on the existing parameters. After that, the algorithm implementation was repeated with the selected parameters. The accuracy results were evaluated by applying the multiclass confusion matrix, using (1) to (4):

    mean accuracy = (1/k) * sum_{i=1..k} (tp_i + tn_i) / (tp_i + tn_i + fp_i + fn_i)    (1)

    precision_mu = sum_{i=1..k} tp_i / sum_{i=1..k} (tp_i + fp_i)    (2)

    recall_mu = sum_{i=1..k} tp_i / sum_{i=1..k} (tp_i + fn_i)    (3)
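The micro-averaged quantities in (1) to (4) can be checked with a small sketch that computes them directly from a multiclass confusion matrix. The matrix below uses toy numbers, not the paper's Table IV.

```python
import numpy as np

def micro_scores(cm):
    """Micro-averaged precision, recall and F1 from a k x k confusion
    matrix whose rows are actual classes and columns are predictions."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp      # predicted as class i but actually another class
    fn = cm.sum(axis=1) - tp      # actually class i but predicted as another class
    precision = tp.sum() / (tp.sum() + fp.sum())        # eq. (2)
    recall = tp.sum() / (tp.sum() + fn.sum())           # eq. (3)
    f1 = 2 * precision * recall / (precision + recall)  # eq. (4)
    return precision, recall, f1

cm = np.array([[8, 1, 1],
               [0, 9, 1],
               [1, 0, 9]])
p, r, f1 = micro_scores(cm)
```

Note that for a complete confusion matrix the total false positives equal the total false negatives, so the micro-averaged precision, recall, and F1 coincide; the slightly different per-class averages reported later come from macro-style averaging.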

Fig. 2 Bot detection system scheme.

TABLE II
DATASET DETAILS

No.  Type of Data                              Number of Accounts  Number of Tweets  Percentage
1    Username data of spammer accounts         859                 42,950            36.82%
2    Username data of fake follower accounts   334                 16,700            14.32%
3    Username data of informative accounts     293                 14,650            12.56%
4    Username data of original accounts        847                 42,350            36.31%
     Total                                     2,333               116,650           100%

TABLE III
UTILIZED PARAMETERS

Types of Analysis        Parameter's Name               Type
Profile Analysis         friends_count                  Basic
                         Verified                       Basic
                         followers_count                Basic
                         URL                            Basic
                         statusses_count                Basic
                         default_profile_image          Basic
                         default_profile                Basic
                         favorites_count                Basic
                         listed_count                   Basic
                         Location                       Basic
                         contains_bot_name              Derivative
                         length_of_bio                  Derivative
                         age_in_days                    Derivative
                         ratio_favorites_per_age        Derivative
                         ratio_statusses_per_age        Derivative
                         ratio_friends_per_followers    Derivative
                         Reputation                     Derivative
Tweet Content Analysis   tweet_retweet_status           Derivative
                         Tweet_retweet_count            Derivative
                         Tweet_favorite_count           Derivative
                         Tweet_hashtag_count            Derivative
                         Tweet_urls_count               Derivative
                         Tweet_user_mentions_count      Derivative
                         Avg_word                       Derivative
                         Word_count                     Derivative
                         Char_count                     Derivative
                         Lexical                        Derivative
Temporal Analysis        Entropy                        Derivative

    F1-score_mu = 2 * precision_mu * recall_mu / (precision_mu + recall_mu)    (4)

where
    tp_i = true positives for class C_i
    tn_i = true negatives for class C_i
    fp_i = false positives for class C_i
    fn_i = false negatives for class C_i
    k = total number of classes.

The constructed machine learning model was stored in the form of a pickle (.pkl) file, a process called serialization. This was carried out because the bot detection model would be implemented on the website.

The next process was to utilize the Flask framework to create a website with the bot detection model as its back-end. After that, an interface was created as a front-end using HTML, CSS, and JavaScript. Finally, the constructed system was deployed to the cloud service using the Heroku platform. The bot detection system scheme is shown in Fig. 2.

IV. RESULTS AND DISCUSSIONS

A. System Development

1) Data Collection: The initial process in the system development was collecting data. The labeled username data were downloaded from a public repository. Then, a crawling


Fig. 3 A scatter plot comparing the number of accounts followed (friends) and followers in the dataset.

process was carried out to obtain additional data such as TABLE IV


metadata, tweets, and other attributes using the Tweepy library. MULTICLASS CONFUSION MATRIX
The data collection process employed the Jupyter Notebook as
Prediction
an Integrated Development Environment (IDE). The dataset Total
F H I S
details are described in Table II.
F 61 5 1 1 88
2) Pre-processing: The pre-processing process included a H 1 155 1 2 261
data cleansing process, i.e., clearing data by removing or Actual I 1 1 56 0 78
replacing values of the empty, invalid, or Not a Number (NaN) S 0 2 0 180 273
values. The next step was to do deletion on the duplicate data. Total Amount 700
Note: F= Fake Follower, H= Human, I= Informative, S = Spammer
The following process was defining the utilized parameters
as a set analysis feature. Processing was carried out to adjust learn using Python version 3.7. This algorithm implementation
the information structure so that it could be processed with the also included weighing in each class due to uneven data
random forest classifier algorithm. This algorithm is only distribution.
capable of processing numerical values. The utilized 5) Feature Importance: Based on the measurement of
parameters are listed in Table III.
feature importance scores, the feature with the highest weight
After the parameter analysis was carried out, the significant
is the "favorites", while the lowest is "contains_bot_name".
differences between each category were recognized according
to the following and follower parameter through a graph in Fig. When visualized, the score will look like Fig. 4. By performing
3. After the pre-processing, a data scaling process was carried an accuracy test based on the threshold feature importance in
out to change the value range to 100. This aimed to equalize the Fig. 4, a graph can be drawn as shown in Fig. 5. It shows that
values between parameters. the accuracy is relatively stable until the threshold reaches 0.01.
Therefore, to obtain the highest accuracy, the feature was
3) Data Partitioning: The data partitioning process aimed to selected by taking 20 feature sets with the most significant
separate the dataset into training data and test data with a ratio weight to form the final detection model.
of 80:20. This partitioning process was carried out using a
train_test_split() method in the sklearn library. 6) Evaluation of Bot Detection Models: The evaluation of
bot detection models employed a multiclass confusion matrix,
4) Implementation of the Random Forest Classifier
as shown in Table IV. The generated accuracy value is 96.79%,
Algorithm: The algorithm implementation was carried out by
while the classification process creates values of precision,
implementing the algorithm in the training data and testing it
with the test data. This algorithm can be accessed in the scikit- recall, and f-1 corresponds to Table V.
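The cleansing, de-duplication, and scaling described in the pre-processing step can be sketched roughly as below with pandas. The column names and values are hypothetical stand-ins (the paper's actual schema is in Table III), and scaling each column so its maximum becomes 100 is one plausible reading of "change the value range to 100".

```python
import pandas as pd

# Hypothetical account metadata; column names are illustrative only
df = pd.DataFrame({
    "username":  ["a", "b", "b", "c", "d"],
    "followers": [10, 2000, 2000, None, 50],
    "friends":   [5, 100, 100, 40, None],
})

# Data cleansing: replace empty/NaN values, here with 0
df = df.fillna(0)

# Remove duplicate rows
df = df.drop_duplicates()

# Scale each numeric column so values fall in 0..100, equalizing parameters
for col in ["followers", "friends"]:
    top = df[col].max()
    if top:
        df[col] = df[col] / top * 100

print(df["followers"].max(), len(df))  # 100.0 4
```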

Fig. 4 Feature importance result of the test parameter.
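The partitioning, class-weighted random forest, and importance-based feature selection described in the System Development subsection can be sketched with scikit-learn as follows. The data are synthetic stand-ins for the 700-account feature matrix, and the hyperparameters are assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 700 accounts, 30 numeric features, uneven classes
rng = np.random.default_rng(0)
X = rng.normal(size=(700, 30))
y = rng.choice(["fake", "human", "informative", "spammer"],
               size=700, p=[0.10, 0.23, 0.08, 0.59])

# 3) 80:20 train/test partition
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)

# 4) Random forest with per-class weighting for the uneven distribution
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)

# 5) Keep only features whose importance score clears a threshold
threshold = 0.01
selected = np.flatnonzero(clf.feature_importances_ >= threshold)
print(len(X_train), len(X_test))  # 560 140
```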

                 TABLE V
   SUMMARY OF CLASSIFICATION EVALUATION

                 Precision  Recall  F1    Support
  Fake follower    0.97      0.90   0.93    68
  Human            0.95      0.97   0.96   159
  Informative      0.97      0.97   0.97    58
  Spammer          0.98      0.99   0.99   182
  Average          0.97      0.96   0.96
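The kind of multiclass confusion matrix and per-class precision/recall/F1 summary shown in Tables IV and V can be produced with scikit-learn as sketched below. The labels are a tiny illustrative stand-in, not the paper's test set.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

labels = ["F", "H", "I", "S"]  # Fake follower, Human, Informative, Spammer
y_true = ["F", "F", "H", "H", "H", "I", "S", "S", "S", "S"]
y_pred = ["F", "H", "H", "H", "H", "I", "S", "S", "S", "F"]

# Rows are the actual class, columns the predicted class (as in Table IV)
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class precision, recall, F1, and support (as in Table V)
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels)
for cls, p, r, f, n in zip(labels, prec, rec, f1, support):
    print(f"{cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={n}")
```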

Fig. 5 Accuracy graphs based on the threshold feature importance.

7) Model Serialization: The model serialization generated a pickle file that can be deserialized; the file format was adjusted to the target deployment platform. In this paper, the file would be used on a website built with the Flask framework; therefore, the primary programming language was Python.

8) Website Implementation: The website was implemented with the Flask micro-framework using the Python programming language. Fig. 6 shows the web application's main page. Through the main page, users could input their Twitter account username without the '@'. The website server would then receive the input and process it through the machine learning model.

After the process was complete, the system produced an output in the form of a profile and the predicted category class, as in Fig. 7. The system was then deployed into
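The serialization and serving flow of steps 7) and 8) can be sketched as below: write the model to a pickle file, then have a small Flask route deserialize and use it. A plain dict stands in for the trained forest, and the route name, payload shape, and response format are assumptions about the paper's actual site.

```python
import os
import pickle
import tempfile

from flask import Flask, jsonify, request

# -- 7) Model serialization: write the trained model to a pickle file --
model = {"classes": ["fake", "human", "informative", "spammer"]}  # stub
path = os.path.join(tempfile.gettempdir(), "bot_detector.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# -- 8) Website implementation: the server deserializes the file once --
with open(path, "rb") as fh:
    detector = pickle.load(fh)

app = Flask(__name__)

@app.route("/check", methods=["POST"])
def check():
    # The real site crawls the username's metadata and runs the forest;
    # here we only echo a fixed class to show the request/response shape.
    username = request.form.get("username", "").lstrip("@")
    return jsonify({"username": username,
                    "prediction": detector["classes"][1]})

# Exercise the route without running a server
client = app.test_client()
resp = client.post("/check", data={"username": "@example_user"})
print(resp.get_json())
```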
the Heroku cloud service. Through this platform, the web application can be accessed at http://botclassifier.herokuapp.com.

Fig. 6 The Twitter Bot Detection main page.

Fig. 7 Image of the prediction result.

V. CONCLUSIONS

Based on the process and the results obtained in the system development stage, as well as the machine learning evaluation, several conclusions can be formulated. A bot detection system using machine learning with the random forest classifier has been successfully developed. The system classifies accounts into four classes, namely the non-bot (human) class, fake follower bots, spammer bots, and informative bots, and it can be accessed via the web. The training and test data partition of 80:20 produces the highest accuracy (96.79%), compared to the partitions of 50:50 (95.12%), 60:40 (96.36%), 70:30 (96.57%), and 90:10 (93.16%). The feature selection process based on the feature importance score shows that only 17 features, with a threshold of 0.017, contribute to the high accuracy. The multiclass classification using the random forest classifier on these 17 feature sets produces an accuracy of 96.79%, 97% precision, 96% recall, and an F1 score of 96%. Thus, the detection model is classified as having high accuracy.