Submitted by
Shriyam Prasad
Rishabh Patel

BACHELOR OF ENGINEERING

MAY 2023
Music Recommender System
Artificial Intelligence
Abstract

In this project, we have designed, implemented and analyzed a song recommendation system. We used the Million Song Dataset [1] provided by Kaggle to find correlations between users and songs, and to learn from the previous listening history of users in order to recommend songs which users would most prefer to listen to. In this paper, we discuss the problems we faced, the methods we implemented, and the results and their analysis. We obtained the best results with the memory-based collaborative filtering algorithm. We believe that the content-based model would have worked better if we had had enough memory and computational power to use all of the available metadata and training data.

Keywords: recommendation systems, music, Million Song Dataset, collaborative filtering, content-based

1 Introduction

The rapid development of mobile devices and the internet has made it possible for us to access different music resources freely. The number of available songs exceeds the listening capacity of any single individual, and people sometimes find it difficult to choose from millions of songs. Moreover, music service providers need an efficient way to manage songs and to help their customers discover music by giving quality recommendations. Thus, there is a strong need for a good recommendation system. Currently, there are many music streaming services, like Pandora and Spotify, which are working on building high-precision commercial music recommendation systems. These companies generate revenue by helping their customers discover relevant music and charging them for the quality of their recommendation service. Thus, there is a strong, thriving market for good music recommendation systems.

A music recommender system is a system which learns from a user's past listening history and recommends songs which they would probably like to hear in the future. We have implemented various algorithms to try to build an effective recommender system. We first implemented a popularity-based model, which is quite simple and intuitive. Collaborative filtering algorithms, which predict (filter) the taste of a user by collecting preferences and tastes from many other users (collaborating), are also implemented. We have also experimented with content-based models, based on latent factors and metadata.

2 Dataset

We used the data provided by the Million Song Dataset Challenge hosted by Kaggle. It was released by Columbia University's Laboratory for the Recognition and Organization of Speech and Audio. The data is open; metadata, audio content analysis, etc. are available for all the songs. It is also very large: it contains around 48 million (user id, song id, play count) triplets collected from the listening histories of over one million users, and metadata (280 GB) for millions of songs [7]. However, the users are anonymous, so information about their demography and the timestamps of their listening events is not available. The feedback is implicit, as play counts are given instead of explicit ratings. The contest was to predict one half of the listening histories of 11,000 users by training on their other half and on the full listening histories of another one million users.

Since processing such a large dataset is highly memory- and CPU-intensive, we used the validation set as our main data. It consists of 10,000,000 triplets from 10,000 users. We used the metadata of only 10,000 songs (around 3 GB). From the huge amount of song metadata, we focus only on the features that seem most relevant for characterizing a song. We decided that information like year, duration, hotness, danceability, etc. may best distinguish one song from another. To increase processing speed, we converted user and song ids from strings to integers.
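The string-to-integer id conversion mentioned above can be sketched as follows; the triplet values and the helper name are illustrative assumptions, not the authors' actual code.

```python
# Map opaque string user/song ids to dense integer indices so that
# triplets can be stored and compared as numbers instead of strings.
def build_id_map(ids):
    """Assign an integer index to each distinct id, in first-seen order."""
    mapping = {}
    for i in ids:
        if i not in mapping:
            mapping[i] = len(mapping)
    return mapping

# Toy triplets in the dataset's (user id, song id, play count) form.
triplets = [("user_a", "song_x", 1),
            ("user_a", "song_y", 2),
            ("user_b", "song_x", 3)]
user_map = build_id_map(u for u, _, _ in triplets)
song_map = build_id_map(s for _, s, _ in triplets)
encoded = [(user_map[u], song_map[s], c) for u, s, c in triplets]
print(encoded)  # [(0, 0, 1), (0, 1, 2), (1, 0, 3)]
```

Integer indices also make it straightforward to use the triplets as row/column coordinates of a user-song matrix later on.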
3 Algorithms
We have implemented four different algorithms to build
an efficient recommendation system.
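As a rough sketch of the simplest of these, the popularity-based model can rank songs by total play count and recommend the top songs a user has not yet heard. The data and function below are illustrative, not the paper's implementation.

```python
from collections import defaultdict

def popularity_recommend(triplets, user, n=2):
    """Recommend the n globally most-played songs the user has not heard."""
    play_counts = defaultdict(int)   # song -> total plays over all users
    heard = defaultdict(set)         # user -> songs already in their history
    for u, s, c in triplets:
        play_counts[s] += c
        heard[u].add(s)
    ranked = sorted(play_counts, key=play_counts.get, reverse=True)
    return [s for s in ranked if s not in heard[user]][:n]

triplets = [
    ("u1", "s1", 5), ("u1", "s2", 1),
    ("u2", "s1", 3), ("u2", "s3", 4),
    ("u3", "s3", 2),
]
print(popularity_recommend(triplets, "u3"))  # ['s1', 's2']
```

The same ranking is served to every user (minus songs they already know), which is why this baseline is simple and intuitive but not personalized.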
probability distribution and then recommending the top-scored items from it. When the song history of a user is too small to utilize the user-based recommendation algorithm, we can offer recommendations based on song similarity, which yields better results when the number

Figure 3: Matrix M
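Matrix M, which we take here to be the user-song play-count matrix, can be held as a SciPy sparse matrix [6] built directly from the integer-indexed triplets; the toy indices below are illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Users and songs are assumed to be already re-indexed as integers
# (rows = users, columns = songs); values are play counts.
rows = np.array([0, 0, 1, 2])    # user indices
cols = np.array([0, 1, 0, 2])    # song indices
plays = np.array([5, 1, 3, 2])   # play counts

M = csr_matrix((plays, (rows, cols)), shape=(3, 3))
print(M.toarray())
# [[5 1 0]
#  [3 0 0]
#  [0 0 2]]
```

Since almost every user has heard only a tiny fraction of the songs, the sparse representation keeps memory usage proportional to the number of triplets rather than users × songs.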
We used the SVD algorithm in this model as follows. First, we decompose the matrix M into a latent feature space that relates users to songs:

    M = U Σ Vᵀ,  where U ∈ ℝ^(m×k) and Vᵀ ∈ ℝ^(k×n).

Here, U represents the user factors and V represents the item factors. Then, for each user, a personalized recommendation is given by ranking the predicted score for each song.

The mean average precision (mAP) metric was used in the Kaggle challenge, which helps us to compare our results with those of others. Moreover, precision is much more important than recall here, because false positives can lead to a poor user experience. Our metric gives the average of the proportion of correct recommendations, giving more weight to the top recommendations. There are three steps in computing mAP. First, the precision at each rank k is calculated; it gives the proportion of correct recommendations within the top k of the rankings.
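The truncated mean-average-precision described above can be sketched as follows; this is a generic mAP@τ with illustrative names, not necessarily the exact variant used in the Kaggle challenge.

```python
def average_precision(recommended, relevant, tau=500):
    """AP@tau: average of precision-at-k over the ranks k where a hit occurs."""
    hits, precisions = 0, []
    for k, song in enumerate(recommended[:tau], start=1):
        if song in relevant:
            hits += 1
            precisions.append(hits / k)   # precision within the top k
    if not relevant:
        return 0.0
    return sum(precisions) / min(len(relevant), tau)

def mean_average_precision(all_recs, all_relevant, tau=500):
    """mAP: mean of AP@tau over all users."""
    aps = [average_precision(r, rel, tau)
           for r, rel in zip(all_recs, all_relevant)]
    return sum(aps) / len(aps)

recs = [["s1", "s2", "s3"], ["s4", "s1"]]
truth = [{"s1", "s3"}, {"s2"}]
print(mean_average_precision(recs, truth))  # 5/12 ≈ 0.4167
```

Because each hit contributes the precision at its own rank, mistakes near the top of the list cost more than mistakes near the bottom, which matches the "more weight to the top recommendations" property mentioned above.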
the popularity model. The reason behind this is that we have the features of only 10,000 songs, which is less than 3% of the whole dataset, so only some of these 10,000 songs could be recommended. This huge lack of information leads to the bad performance of this method.

5 Conclusion

This is a project for our Artificial Intelligence course. We found it very rewarding, as we got a chance to practice theories we learnt in the course, to do some implementation, and to try to get a better understanding of a real artificial intelligence problem: a music recommender system. There are many different approaches to this problem, and we got to know some algorithms in detail, especially the four models that we have explained in this paper. By manipulating the dataset, changing the learning and testing sets, changing some parameters of the problem and analyzing the results, we gained a lot of practical skills. We faced a lot of problems in dealing with this huge dataset and in exploring it in a better way, and we also had difficulties with some programming details. However, with a lot of effort, we overcame all of these.

The best part of this project was the teamwork. The two of us come from different countries and thus have different cultures and ways of working. It took us a bit of time to get to know each other, to adjust, and to perform like a team. We became much more efficient once the team spirit had formed, and we also enjoyed the work more. We both found this project a valuable experience, and all the effort put in was worth it. We have learnt a lot from this project.

In terms of research, we still have a lot to do to improve our study. The music recommender system is such a wide, open and complicated subject that we can take further initiatives and run many more tests in the future. We also came to realize that building a recommender system is not a trivial task, and the large scale of the dataset makes it difficult in many respects. Firstly, recommending 500 correct songs out of 380 million for different users is not an easy task, so achieving high precision is hard; that is why we did not get any result better than 10%, and even the Kaggle winner only got 17%. Secondly, the metadata contains a huge amount of information, and when exploring it, it is difficult to extract the features relevant to a song. Thirdly, technically speaking, processing such a huge dataset is memory- and CPU-intensive. All these difficulties, due to both the data and the system itself, make the problem more challenging and also more attractive. We hope that we will get other opportunities in the future to work in the domain of artificial intelligence. We are certain that we can do a better job.

6 Future work

• Run the algorithms on a distributed system, like Hadoop or Condor, to parallelize the computation, decrease the runtime, and leverage distributed memory to run on the complete MSD.
• Combine different methods and learn the weight for each method according to the dataset.
• Automatically generate relevant features.
• Develop more recommendation algorithms based on different data (e.g. how the user is feeling, social recommendation, etc.).

7 Acknowledgements

We would like to acknowledge the efforts of Dr. Amitabha Mukherjee; without his constant support and guidance this project would not have been possible.

References

[1] McFee, B., Bertin-Mahieux, T., Ellis, D. P., Lanckriet, G. R. (2012, April). The million song dataset challenge. In Proceedings of the 21st International Conference Companion on World Wide Web (pp. 909–916). ACM.
[2] Aiolli, F. (2012). A preliminary study on a recommender system for the million songs dataset challenge. Preference Learning: Problems and Applications in AI.
[3] Koren, Yehuda. "Recommender system utilizing collaborative filtering combining explicit and implicit feedback with both neighborhood and latent factor models."
[4] Cremonesi, Paolo, Yehuda Koren, and Roberto Turrin. "Performance of recommender algorithms on top-n recommendation tasks." Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 2010.
[5] T. Bertin-Mahieux et al., "The Million Song Dataset," Proc. of the 12th International Society for Music Information Retrieval Conference, 2011.
[6] Sparse Matrices, http://docs.scipy.org/doc/scipy/reference/sparse.html
[7] Mahieux and Ellis, 2011, http://labrosa.ee.columbia.edu/millionsong/tasteprofile
[8] http://www.bridgewell.com/images en
Abstract—A Twitter bot is a Twitter account programmed to automatically perform social activities by sending tweets through a scheduling program. Some bots disseminate useful information, such as earthquake and weather information. However, quite a few bots have a negative influence, such as broadcasting false news or spam, or acting as followers to inflate an account's popularity. They can change public sentiment about an issue, decrease user confidence, or even disturb the social order. Therefore, an application is needed to distinguish between bot and non-bot accounts. Based on these problems, this paper develops a bot detection system using machine learning for multiclass classification. The classes are human, informative, spammer, and fake follower. The model training used supervised methods based on labeled training data. First, a dataset of 2,333 accounts was pre-processed to obtain 28 feature sets for classification. This feature set came from analysis of user profiles, temporal analysis, and analysis of tweets with numeric values. Afterward, the data was partitioned and normalized with scaling, and a random forest classifier algorithm was applied to it. The features were then re-selected into 17 feature sets to obtain the highest accuracy achieved by the model. In the evaluation stage, the bot detection model achieved an accuracy of 96.79%, 97% precision, 96% recall, and an F1 score of 96%. Therefore, the detection model is classified as having high accuracy. The completed bot detection model was then implemented on a website and deployed to the cloud. In the end, this machine-learning-based web application can be accessed and used by the public to detect Twitter bots.

Keywords—Bot Detection, Multiclass Classification, Machine Learning, Supervised Learning, Twitter.

1,2,3 Department of Electrical and Information Engineering, Faculty of Engineering, Universitas Gadjah Mada, Jln. Grafika No.2, Kampus UGM Yogyakarta 55281 INDONESIA (tel: 0274-552305; email: 1aqilah.aini.z@mail.ugm.ac.id, 2widyawan@ugm.ac.id, 3silmi@ugm.ac.id)

I. INTRODUCTION

One of the popular social media apps today is Twitter. It was launched in 2006 by Jack Dorsey [1]. Twitter is a micro-blogging service in which users can send short messages, called tweets. Each tweet is limited to only 140 characters. However, the simplicity and the ability to send tweets as often as desired are the added value of this application. According to statistics from 2018, there have been 326 million active Twitter users each month, with an average of 500 million tweets sent every day [2].

Bots, or automated programs, are growing in popularity alongside Twitter's popularity. They are created using the Twitter Application Programming Interface (API). A study cited by the United States Securities and Exchange Commission in 2016 found that at least 8.5% of active Twitter users were bots [3].

Bots have various objectives. It is undeniable that some bots are beneficial, disseminating weather or earthquake information [4]. However, many malicious bots exist too. They do harm by broadcasting malware links [5], disrupting other users, broadcasting terrorist propaganda, spam, fake news and hoaxes, and running political campaigns. A massive tweet volume is capable of polluting users' timelines, changing user perception, damaging user confidence, affecting the stock market, and even undermining the social order. Therefore, basic knowledge to distinguish between types of bot accounts and non-bot accounts is required.

Conventionally, identifying bot and non-bot accounts can be carried out by observing the activity pattern of an account: for example, noticing that a particular account does more retweeting than creating original tweets, or writes many tweets but has only a few followers. Such an account may also lack a biography and a profile picture, and write the same tweet content as another user at the same time. However, such cognitive approaches are rated inefficient and merely focus on precision.

Therefore, an approach to detecting bots with machine learning is created. Machine learning models are used because of their capacity to analyze extensive data based on parameters. According to one study, the random forest algorithm has the best accuracy, at 95%, compared to the Multinomial Naïve Bayes algorithm (70%), the Gaussian Naïve Bayes algorithm (68%), and the Logistic Regression algorithm (52%) [6]. Therefore, the random forest classifier algorithm is selected in this paper. The system performs multiclass classification by classifying accounts into four different classes: human, informative, spammer, and fake follower. The completed machine learning model was then inserted into a web application; the website serves as an interface for end-users to use the machine learning system. In the end, the system underwent a deployment process to a cloud service.

II. THEORETICAL FRAMEWORK

A. Classification Type

Classification is the process of assigning a predefined category or label to data that does not yet have one. In general, there are three types of data classification process, namely binary, multiclass, and multilabel classification [7].
• Binary classification is the process of classifying each element in a group into one of two groups or categories.
• Multiclass classification is a classification process involving more than two classes. Multiclass classification assumes that each given sample is assigned to a single label (mutually exclusive).
...
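As a minimal illustration of the multiclass case, the toy nearest-centroid classifier below (hypothetical 2-D data, not the paper's feature set) assigns exactly one of four mutually exclusive labels to each sample:

```python
def nearest_centroid_fit(samples, labels):
    """Compute one centroid per class from 2-D feature vectors."""
    sums, counts = {}, {}
    for (x, y), lab in zip(samples, labels):
        sx, sy = sums.get(lab, (0.0, 0.0))
        sums[lab] = (sx + x, sy + y)
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab])
            for lab, (sx, sy) in sums.items()}

def nearest_centroid_predict(centroids, sample):
    """Return the single closest class label (mutually exclusive)."""
    x, y = sample
    return min(centroids,
               key=lambda lab: (centroids[lab][0] - x) ** 2 +
                               (centroids[lab][1] - y) ** 2)

# Four mutually exclusive classes, mirroring the paper's setting.
X = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 5), (5, 0)]
y = ["human", "human", "spammer", "spammer", "informative", "fake follower"]
model = nearest_centroid_fit(X, y)
print(nearest_centroid_predict(model, (4.5, 5.2)))  # spammer
```

A multilabel classifier would instead return a set of labels per sample; here the `min` over classes guarantees exactly one label.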
to the average value of each username. Scaling was applied to this aggregated data to change the range of the data values to between 0 and 100. The next step was data partitioning, which divided the data into training data and test data. Throughout the implementation process, it was found that the 80:20 partition produced the best accuracy. Table I shows the partition ratios and accuracies.

TABLE I
PARTITION AND ACCURACY RATIO

  Partition Ratio   Accuracy
  50:50             95.12%
  60:40             96.36%
  70:30             96.57%
  80:20             96.79%
  90:10             93.16%

After the data was partitioned with a ratio of 80:20, the training data was fed into the random forest classifier implementation. The obtained results were then tested for accuracy based on a feature-importance threshold. This process was employed to re-select features so as to achieve the highest accuracy based on the existing parameters. After that, the algorithm implementation was repeated with the selected parameters. The accuracy results were evaluated by applying the multiclass confusion matrix using (1) to (4):

  mean accuracy = (1/k) · Σᵢ₌₁ᵏ (tpᵢ + tnᵢ) / (tpᵢ + tnᵢ + fpᵢ + fnᵢ)    (1)

  precision_μ = Σᵢ₌₁ᵏ tpᵢ / Σᵢ₌₁ᵏ (tpᵢ + fpᵢ)    (2)

  recall_μ = Σᵢ₌₁ᵏ tpᵢ / Σᵢ₌₁ᵏ (tpᵢ + fnᵢ)    (3)

  F1_μ = 2 · precision_μ · recall_μ / (precision_μ + recall_μ)    (4)
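The partitioning, scaling, random forest training, and importance-based feature re-selection steps above can be sketched with scikit-learn on synthetic data; apart from the 28 features, the 80:20 split, and the 0.017 threshold taken from the text, all names and values below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 2,333-account, 28-feature dataset.
X, y = make_classification(n_samples=600, n_features=28, n_informative=10,
                           n_classes=4, random_state=0)

# Scale features into the 0-100 range, then partition 80:20.
X = MinMaxScaler(feature_range=(0, 100)).fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# First fit: rank features by importance.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Re-select features above an importance threshold and refit.
keep = rf.feature_importances_ >= 0.017   # threshold taken from the paper
rf2 = RandomForestClassifier(n_estimators=100, random_state=0)
rf2.fit(X_tr[:, keep], y_tr)
acc = rf2.score(X_te[:, keep], y_te)
print(f"{keep.sum()} features kept, accuracy {acc:.3f}")
```

On the real feature set this re-selection retained 17 of the 28 features; on synthetic data the number kept will of course differ.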
Fig. 3 A scatter plot comparing the numbers of following and follower accounts in the dataset.

TABLE V
SUMMARY OF CLASSIFICATION EVALUATION
the Heroku cloud service. Through this platform, the web application can be accessed at http://botclassifier.herokuapp.com.

V. CONCLUSIONS

Based on the process and the results obtained in the system development stage, as well as the machine learning evaluation, several conclusions can be formulated as follows. A bot detection system using machine learning with the random forest classifier has been successfully developed. The features of this system include the classification of four class types, namely the non-bot (human) class, fake-follower bot, spammer bot, and informative bot, and it can be accessed via the web. The 80:20 partition of training and test data produces the highest accuracy (96.79%), compared to the partitions of 50:50 (95.12%), 60:40 (96.36%), 70:30 (96.57%), and 90:10 (93.16%). The feature selection process based on the feature importance score shows that only 17 features, with a threshold of 0.017, contribute to increasing the accuracy. The multiclass classification process using the random forest classifier based on the 17 feature sets produces an accuracy of 96.79%, 97% precision, 96% recall, and an F1 score of 96%. Thus, the detection model is classified as having high accuracy.

REFERENCES

[1] (2014) "Jack Dorsey Biography," [Online], https://www.biography.com/business-figure/jack-dorsey, access date: 3-Sep-2019.
[2] (2018) "Twitter by the Numbers (2018): Stats, Demographics & Fun Facts," [Online], https://www.omnicoreagency.com/twitter-statistics/, access date: 13-Jun-2019.
[3] V.S. Subrahmanian, A. Azaria, S. Durst, V. Kagan, A. Galstyan, et al., "The DARPA Twitter Bot Challenge," Computer, Vol. 49, No. 6, pp. 38–46, 2016.
[4] (2019) "Nain Weather Bot (@NainWxPxBot) | Twitter," [Online], https://twitter.com/nainwxpxbot, access date: 3-Sep-2019.
[5] K. Zetter (2009) "Trick or Tweet? Malware Abundant in Twitter URLs," [Online], https://www.wired.com/2009/10/twitter-malware/, access date: 3-Sep-2019.
[6] M. Haidermota, "Classifying Twitter User as a Bot or Not and Comparing Different Classification Algorithms," Int. J. Adv. Res. Comput. Sci., Vol. 9, No. 3, pp. 29–33, 2018.
[7] M. Hossin and Sulaiman, "A Review on Evaluation Metrics for Data Classification Evaluations," Int. J. Data Min. Knowl. Manag. Process, Vol. 5, No. 2, pp. 1–11, 2015.
[8] J. Hurwitz and D. Kirsch, Machine Learning For Dummies, Hoboken, USA: John Wiley & Sons, Inc., 2018.
[9] L. Breiman, "Random Forests," Mach. Learn., Vol. 45, pp. 5–32, 2001.
[10] H. Kwak, C. Lee, H. Park, and S. Moon, "What is Twitter, a social network or a news media?," Proceedings of the 19th International Conference on World Wide Web, WWW '10, 2010, pp. 591–600.
[11] M. Newberg (2017) "Nearly 48 million Twitter accounts could be bots, says study," [Online], https://www.cnbc.com/2017/03/10/nearly-48-million-twitter-accounts-could-be-bots-says-study.html, access date: 27-Aug-2019.
[12] (2017) "#BotSpot: Twelve Ways to Spot a Bot," Medium, [Online], https://medium.com/dfrlab/botspot-twelve-ways-to-spot-a-bot-aedc7d9c110c, access date: 27-Aug-2019.
[13] S. Schreder (2018) "10 Twitter bots that actually make the internet a better place - Internet Citizen," [Online], https://blog.mozilla.org/internetcitizen/2018/01/19/10-twitter-bots-actually-make-internet-better-place/, access date: 19-Dec-2019.
[14] A. Khalil, H. Hajjdiab, and N. Al-Qirim, "Detecting Fake Followers in Twitter: A Machine Learning Approach," Int. J. Mach. Learn. Comput., Vol. 7, No. 6, pp. 198–202, 2018.
[15] F.A. Aslam and H.N.M.J.M.M.M.M.A. Gulamgaus, "Efficient Way Of Web Development Using Python And Flask," Int. J. Adv. Res. Comput. Sci., Vol. 6, No. 2, pp. 54–57, 2015.