
An Additive Approach using Machine Learning and Sentimental Analysis for Pernicious Rumour Detection in Social Media

Priyanshi Borase and Satish Kolhe

Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon, Maharashtra, India
priyanshiborase@gmail.com, srkolhe2008@gmail.com

Abstract: The speed with which rumours spread on social media platforms has a
drastic impact on how individuals think, form opinions and react to those
rumours. Refuting rumours requires detecting them before they can cause distress
and harm. This paper presents methods for rumour detection, particularly in
social media, as a means of limiting such harm. It proposes a machine learning
approach combined with sentiment analysis for rumour detection. Several
classification procedures, namely K-nearest neighbour, Decision tree and Random
forest, are implemented and achieve high accuracy.

Keywords: Rumour Detection, Sentiment Analysis, K-nearest neighbour, Decision tree, Random forest.

I. Introduction
An unverified story, news item or statement that comes from an unreliable source and has
no concrete evidence constitutes a rumour. With advances in technology and social media,
there is little control over how quickly such rumours can spread. Information can readily
take the shape of a rumour if the source itself is questionable or the information has
been altered. It is therefore imperative to investigate the authenticity of the
information.
Rumour detection is the procedure of checking the genuineness of any information or
statement. It faces several challenges before any conclusion can be drawn, such as
understanding the heterogeneous mix of data, identifying the source of the rumour and
identifying the latest rumours from time-period data.
This paper proposes a sentiment analysis technique for detecting positive or negative
sentiment in text. In NLP, sentiment analysis is concerned with deciphering the sentiment
that a writer expresses in text, here text posted on social media. Sentiment analysis
models focus on feeling and thus help find the motive of a rumour (positive, negative or
neutral). The paper applies different classification techniques, namely K-nearest
neighbour, Decision tree and Random forest, which not only help in categorizing the
status of information but also provide a comparison in terms of accuracy.
II. Background
Alessandro Bondielli and Francesco Marcelloni [29] explain that detecting rumours is
essential given the volume and velocity of user-generated information on social media.
Social media allows information to propagate regardless of the verification status of its
source and of its truth value. Forwarding and sharing content, combined with the lack of
validation, fuels rumours because it permits exchange and broadcasting at an unmatched
level. This can be harmful when users are exposed to damaging or undesirable content.
Pathak et al. [2] give an overview of rumour detection, datasets and application areas,
and perform a comparative analysis of state-of-the-art rumour detection approaches. In
addition, most social media platforms allow users to form groups based on shared
interests; however, such virtual alignments may lead to the creation of echo chambers in
which participants’ views are amplified and reinforced. Such echo chambers also make
unconfirmed posts appear more trustworthy. When a group member receives a certain piece
of information, they might think that the information is truthful because it is from
their “own” people. Akshi Kumar and Saurabh Raj Sangwan [4] propose that rumour detection
can be extended to a multi-level, fine-grained classification in which rumours are
identified as misinformation, disinformation, hoaxes and so on. Various novel and hybrid
machine learning techniques, such as fuzzy and neuro-fuzzy methods, can also be used for
detection.

III. Proposed Methodology

A. Sentiment Analysis
Information on social media is used to detect the sentiment hidden in the text. Sentiment
analysis is an NLP method that deciphers the sentiment a person expresses in written
text; in effect, it is the process of decoding and interpreting the positivity or
negativity of a text. The dictionary-based approach applied here uses three dictionaries
containing terms that convey positive, negative or harmful sentiment. The positive
dictionary holds terms that convey positive sentiment, and the negative dictionary holds
terms that convey negative sentiment. For example, the terms “good” and “excellent” carry
positivity and find a hit in the positive dictionary, while the word “bad” finds a place
in the negative dictionary. If a word does not find a hit in any of the dictionaries, it
falls under the normal category. The critical one, the harmful dictionary, refers to
words that could cause physical or mental harm to an individual, for example “beat” and
“stab”. These words are clearly negative, but they additionally carry a “harmful”
connotation. Language analysis indicates around a hundred or more words in the harmful
dictionary.
The sentiment analysis technique is implemented to classify a term into one of the three
categories. Prior to analysis, one needs to create a Sentiment Dictionary of Lists with
Parameters and Label in sorted form. Applying an Ordered Sequential Model Search Method
to the dictionaries returns the prediction: the search returns a list of [sentiment,
similarity score] pairs for a word. The word can then be categorized as Harmful, Negative
or Positive considering its TFIDF.
The predictions for some terms and their scores using the sequential search method are
explained below.

Example: Table 1.0 shows the predicted sentiment and similarity score for three
different terms, including the term “bad”.

Table 1.0 Sentiment and corresponding dictionary category.

Word        Sentiment  Score               Dictionary Category
Bad         Negative   1.0002090738030525  Negative
            Positive   0.0
            Harmful    0.0
Knock cold  Negative   0.0                 Harmful
            Positive   0.0
            Harmful    1.0121951219512195
Angel       Negative   0.0                 Positive
            Positive   1.0004985044865404
            Harmful    0.0

Each feature is thus classified as positive, negative, harmful or otherwise normal.

[Figure: pie chart titled “Sentiment Word Frequency Ratio”; legend: negative words, positive words, harmful negative words; slice labels: 1.2, 29.2, 69.6]
Fig. 1. Sentiment Word Frequency Ratio

Positive Dictionary: Opinion Lexicon: Positive. The file contains a list of 2041 words
that convey positive sentiment and are hence categorized as POSITIVE opinion words [9],
[10].
Negative Dictionary: Opinion Lexicon: Negative. The file contains a list of 4818 words
that convey negative sentiment and are hence categorized as NEGATIVE opinion words [9],
[10].
Harmful Dictionary: A new dictionary, known as the Harmful dictionary, is prepared. It
consists of words that are harmful or refer to causing an adverse effect. It reflects the
psychological condition of an individual through words. The Harmful dictionary currently
consists of more than 400 words. A word may be negative, but if it is also harmful it
causes major damage. Examples: beat, bombard, raid, stab.
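
The dictionary lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes exact set membership instead of the similarity-score search, and the three tiny word sets are stand-ins built only from the example words quoted in this section. The names `classify_term` and `classify_text` are hypothetical.

```python
# Toy stand-ins for the three dictionaries (assumption: the real lists
# hold thousands of words; only words quoted in the text are used here).
POSITIVE = {"good", "excellent", "angel"}
NEGATIVE = {"bad"}
HARMFUL = {"beat", "bombard", "raid", "stab"}


def classify_term(term: str) -> str:
    """Return the dictionary category of a single term."""
    word = term.lower().strip()
    # Harmful is checked first: a harmful word is also negative,
    # but "harmful" is the more specific label.
    if word in HARMFUL:
        return "harmful"
    if word in NEGATIVE:
        return "negative"
    if word in POSITIVE:
        return "positive"
    return "normal"


def classify_text(text: str) -> dict:
    """Count how many tokens of a text fall into each category."""
    counts = {"harmful": 0, "negative": 0, "positive": 0, "normal": 0}
    for token in text.split():
        # Strip common punctuation so "angel." still matches "angel".
        counts[classify_term(token.strip(".,!?\"'"))] += 1
    return counts
```

For instance, `classify_term("stab")` falls in the harmful category even though the word is also negative, which mirrors the priority the text gives to the harmful dictionary.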

B. Improvement in feature selection


In this paper, for testing purposes, we used an already available dataset. It contains
7528 documents collected from 20 different sources, which are treated as authorized data
repositories of news (number of documents: 7528; number of classes: 20). The available
dataset is divided into a training set and a test set. We used four filter feature
selection methods to pick out informative features: Correlation (CO), Information Gain
(IG), Gain Ratio (GR) and Symmetrical Uncertainty (SU). Feature analysis is done to
calculate parameters such as Term Frequency, Inverse Document Frequency, Term
Frequency-Inverse Document Frequency, Information Gain, Gain Ratio, Symmetrical
Uncertainty and Correlation. The number of features selected depends on these
parameters, and combinations of different parameters are used to analyze the number of
features selected.
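
As an illustration of one of the filter criteria named above, the sketch below computes Information Gain for a binary feature "term occurs in the document". It is a minimal example under stated assumptions: the four documents and their labels are invented toy data, and the function names `entropy` and `information_gain` are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d.split()]
    without = [y for d, y in zip(docs, labels) if term not in d.split()]
    n = len(labels)
    # Entropy remaining after splitting the documents on the term.
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without) / n) * entropy(without))
    return entropy(labels) - conditional


# Toy corpus (assumption, for illustration only).
docs = [
    "plane crash rumour",
    "crash confirmed official",
    "rumour spreads fast",
    "official statement issued",
]
labels = ["rumour", "non-rumour", "rumour", "non-rumour"]
```

On this toy corpus the term "rumour" separates the two classes perfectly and so has the maximum gain of 1 bit, while "crash" occurs equally in both classes and has a gain of 0. A filter method would rank terms by such scores and keep the top ones.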

C. Dataset and Annotation

For this research we used the standard PHEME dataset. This benchmark dataset is a
collection of tweets connected to breaking news events. The tweets are categorized as
‘rumour’ or ‘non-rumour’ as interpreted by expert journalists.

The following breaking news event was used as the dataset for our simulation. News event
#germanwingscrash: “On 24th March 2015, an Airbus A320-211 scheduled for the
international Germanwings Flight 9525 from Barcelona–El Prat Airport in Spain to
Düsseldorf Airport in Germany crashed 100 km (62 mi) north-west of Nice in the French
Alps, killing all 144 passengers and six crew members. The crash was a deliberate one
caused by the co-pilot diagnosed with suicidal tendencies and declared unfit for work by
his doctor”. The dataset contains 238 rumours and 231 non-rumours.

IV. Classification Algorithms


In this research we used the K-Nearest Neighbour, Decision Tree and Random Forest
classifiers. We also report metrics for the Logistic Regression and Naive Bayes
algorithms for comparison. Logistic regression is a machine learning algorithm used for
classification. In this algorithm, a model is designed based on probabilities: the
probabilities of the possible outcomes of a single trial are modelled using a logistic
function. It gives a clear account of the influence of several independent variables on
a single outcome variable.
Naive Bayes classifiers work well in many real-world situations such as document
classification and spam filtering. An advantage of this algorithm is that it needs only a
small amount of training data to estimate the necessary parameters. The classifier is
also extremely fast compared to others.
K-Nearest Neighbour is a simple learning method, as it simply stores instances of the
training data. A simple majority vote of the k nearest neighbours of each point decides
its classification. This simplicity makes it easy to implement, and a sufficiently large
training dataset makes it more effective.
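
The majority-vote rule can be sketched in a few lines. This is an illustrative example only: the two-dimensional feature vectors are invented toy values (they could stand for two term weights per message), Euclidean distance is an assumed choice of metric, and `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every stored training point with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k closest points and return the most frequent one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy training data (assumption, for illustration only).
train_points = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.3), (0.1, 0.9), (0.2, 0.8)]
train_labels = ["rumour", "rumour", "rumour", "non-rumour", "non-rumour"]
```

Note that no model is fitted: the "training" step is just keeping `train_points` and `train_labels`, and all the work happens at prediction time.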
Decision Tree is a classification procedure for data given as attributes together with
their classes. A decision tree produces a sequence of rules that can be used to classify
the data. It is simple to understand and visualize, requires little data preparation and
is reliable, which makes it an excellent choice for classification.
Random forest classifier is a meta-estimator that fits a number of decision trees on
various sub-samples of the dataset, each of the same size as the original sample. It
averages the predictions of the trees, which improves the predictive accuracy of the
model and helps control over-fitting.

IV. Results and Analysis


The performance of any classifier is judged by its metrics. Accuracy and the confusion
matrix are commonly used tools for evaluating classifiers. The accuracy and confusion
matrix calculated through simulation for Logistic Regression, Multinomial Naïve Bayes,
K-Nearest Neighbours, Decision Tree and Random Forest are put forth for “news”, and
Multinomial Naïve Bayes is found to give the best result with 97.42%.

A. Accuracy

Accuracy is defined as the ratio of correctly predicted observations to the total
observations and is calculated using the formula:
Accuracy = (True Positive + True Negative) / Total Population (1)
where
True Positive: The number of correct predictions that the occurrence is positive.
True Negative: The number of correct predictions that the occurrence is negative.
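
Formula (1) can be checked with a small worked example. The counts below are hypothetical and chosen only so that the arithmetic is easy to follow; they are not results from the paper. The total population is taken as the sum of the four outcome counts.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / total population, as in formula (1)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for an empty population")
    return (tp + tn) / total


# Hypothetical counts: 45 true positives, 50 true negatives,
# 3 false positives and 2 false negatives out of 100 messages.
example = accuracy(tp=45, tn=50, fp=3, fn=2)  # (45 + 50) / 100
```

Here 95 of the 100 hypothetical messages are classified correctly, so the accuracy is 0.95.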
Table 2. Accuracy of various classification algorithms

Classification Algorithm   Accuracy
Logistic Regression        95%
Multinomial Naïve Bayes    97.42%
K-Nearest Neighbours       96.23%
Decision Tree              96.44%
Random Forest              96.35%


[Figure: bar chart of Accuracy (%) for the different classification methods (Logistic Regression, Multinomial Bayes, K-nearest, Decision Tree, Random Forest); y-axis from 93.5 to 98; data label shown: 97.42]

Fig. 2. Comparison of accuracy of various classification algorithms

Fig. 3. Training and testing time of different classifiers

B. Confusion Matrix

The confusion matrix sets the actual values against the predicted values in matrix form.
It comprises the true negative, false positive, false negative and true positive values.
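
The four cells of a binary confusion matrix can be counted directly from the actual and predicted labels. The sketch below is illustrative: the label lists are invented toy data, treating "rumour" as the positive class is an assumption, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted, positive="rumour"):
    """Return the TN, FP, FN and TP counts for a binary labelling."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TN": tn, "FP": fp, "FN": fn, "TP": tp}


# Toy labels (assumption, for illustration only).
actual = ["rumour", "rumour", "non-rumour", "non-rumour", "rumour"]
predicted = ["rumour", "non-rumour", "non-rumour", "rumour", "rumour"]
counts = confusion_counts(actual, predicted)
```

On the toy labels, two rumours are caught, one is missed, one non-rumour is wrongly flagged and one is correctly passed.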

Fig. 4. Confusion matrices for various classification algorithms

a. Text Searching
In the text searching step we check whether a particular message, such as those given
below, is a rumour or not. Our dataset is divided into training and test sets. The
message is classified as being a rumour or not based on the authentic dataset. As per our
observations, the Multinomial Naïve Bayes algorithm is more accurate than the other
algorithms, so we used it in our implementation of rumour detection. It gives good
results in identifying information as being real or fake.
The input to our simulation is a news item in the form of text. After the classification
procedure is applied, Multinomial Naïve Bayes gives excellent results, with an accuracy
of 99%, when identifying whether the news is fake.

Input text:

test_text = "Germanwings Airbus A320 crashes in French Alps"
test_text2 = "Airbus A320 #4U9525 crash"
test_text3 = ("BREAKING We got reports the crash could be an Airbus A320 "
              "Germanwings between Barcelona and Dusseldorf.")

Output:

MultinomialNB Accuracy: 0.99079754601227
REAL
REAL
FAKE
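
To make the classification step more concrete, the following is a small from-scratch sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. It is not the authors' code, and it does not reproduce the output shown above: the class `TinyMultinomialNB`, the whitespace tokenization and the four training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)         # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (assumption, invented for illustration only).
train_texts = [
    "airbus crash confirmed by officials",
    "officials confirm flight crash in alps",
    "breaking unverified reports claim hijack",
    "unverified reports claim pilot alive",
]
train_labels = ["REAL", "REAL", "FAKE", "FAKE"]
model = TinyMultinomialNB().fit(train_texts, train_labels)
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the add-one smoothing keeps a single unseen word from driving a class probability to zero.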

b. Evaluation based on F1 score


F1 score is a metric used to evaluate the performance of a classification model, especially
when dealing with imbalanced datasets. It combines precision and recall into a single
value, providing a balanced measure of a model's accuracy. The F1 score is calculated as
the harmonic mean of precision and recall and is given by the formula:

F1 = 2 ∗ (precision ∗ recall) / (precision + recall) (2)

Where: Precision is the ratio of true positive predictions to the total number of positive
predictions made by the model. It measures the accuracy of positive predictions.

Recall is the ratio of true positive predictions to the total number of actual positives in the
dataset. It measures the model's ability to correctly identify all positive instances.
The F1 score ranges from 0 to 1, with 1 indicating perfect precision and recall, and 0
indicating the worst possible performance. A high F1 score indicates that a model is
achieving both high precision and high recall, which is desirable in many classification
tasks.
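
The harmonic mean of precision and recall described above can be computed as in the sketch below. The function names are hypothetical; the worked value uses the precision and recall reported in the second row of Table 3 (0.794 and 0.773).

```python
def precision(tp: int, fp: int) -> float:
    """True positives over all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """True positives over all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    if p + r == 0:
        return 0.0  # convention when both precision and recall are zero
    return 2 * p * r / (p + r)


# Precision and recall taken from the second row of Table 3.
f1_row2 = round(f1_score(0.794, 0.773), 3)
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot obtain a high F1 score by maximizing precision alone or recall alone.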

Table 3. Parameters: Precision (P), Recall (R) and F1 score for “news”

Methodology                 P      R      F1
TFIDF+IG-ACO NB             0.776  0.745  0.732
TFIDF_IG, CO, SU, GR, NB    0.794  0.773  0.783
[Figure: bar chart titled “Performance Metrics for "News" for Multinomial Bayes”; x-axis: different parameters (Precision, Recall, Accuracy, F1); y-axis labelled “in %”, from 0 to 1.2]

Fig. 5. Graph of different parameters for Multinomial Bayes.

[Figure: chart of Accuracy and F1 for Logistic Regression, Multinomial Bayes, K-nearest, Decision Tree and Random Forest; y-axis from 0 to 1.2]

Fig. 6. Graph showing parameters for different classification methods.


V. Research Conclusion
This paper concludes that, with the help of technology, a rumour can be detected and
dealt with well before it spreads and becomes a nuisance and a danger to society. Prior
to text classification, data pre-processing and feature extraction are performed.
Various classification algorithms are applied to the standard PHEME dataset. Training
and testing times for the dataset are calculated in order to choose an appropriate
classification algorithm. The performance of all the algorithms is measured using
accuracy and the confusion matrix. Our method is compared with the available TFIDF-IG
ACO method on the basis of accuracy and F1 score. A new message is searched in the
available authentic dataset to report whether it is a rumour or not.

References

1. Kumar, A., Bhatia, M.P.S. & Sangwan, S.R. Rumour detection using deep learning and
filter-wrapper feature selection in benchmark twitter dataset. Multimed Tools Appl (2021).
https://doi.org/10.1007/s11042-021-11340-x
2. Pathak, Ajeet & Mahajan, Aditee & Singh, Keshav & Patil, Aishwarya & Nair, Anusha. (2020).
Analysis of Techniques for Rumor Detection in Social Media. Procedia Computer Science.
167. 2286-2296. 10.1016/j.procs.2020.03.281.
3. Sarah A. Alkhodair, Steven H.H. Ding, Benjamin C.M. Fung, Junqiang Liu, Detecting
breaking news rumors of emerging topics in social media, Information Processing &
Management, Volume 57, Issue 2, 2020, 102018, ISSN 0306-4573,
https://doi.org/10.1016/j.ipm.2019.02.016.
4. Kumar, A., & Sangwan, S. R. (2018). Rumor Detection Using Machine Learning Techniques
on Social Media. Lecture Notes in Networks and Systems, 213– 221. doi:10.1007/978-981-
13-2354-6_23
5. Kumar, A., Sangwan, S.R. & Nayyar, A. Rumour veracity detection on twitter using particle
swarm optimized shallow classifiers. Multimed Tools Appl 78, 24083–24101 (2019).
https://doi.org/10.1007/s11042-019-7398-6.
6. Walaa Medhat, Ahmed Hassan, Hoda Korashy, Sentiment analysis algorithms and
applications: A survey, Ain Shams Engineering Journal, Volume 5, Issue 4, 2014
7. Alonso, M.A.; Vilares, D.; Gómez-Rodríguez, C.; Vilares, J. Sentiment Analysis for Fake
News Detection. Electronics 2021, 10, 1348.
8. V. Sivasangari, Ashok Kumar Mohan, K. Suthendran, M. Sethumadhavan, “Isolating Rumors
Using Sentiment Analysis”, Journal of Cyber Security and Mobility. 2018 7.
10.13052/jcsm2245-1439.7113.
9. Kapusta, Jozef & Benko, Ľubomír & Munk, Michal. (2020). Fake News Identification Based on
Sentiment and Frequency Analysis. 10.1007/978-3-030-36778-7_44. Pages 1093-1113
10. O. Ajao, D. Bhowmik and S. Zargari, "Sentiment Aware Fake News Detection on Online
Social Networks," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 2507-2511
11. Q. Li, S. Shah, R. Fang, A. Nourbakhsh and X. Liu, "Tweet Sentiment Analysis by
Incorporating Sentiment-Specific Word Embedding and Weighted Text Features", 2016
IEEE/WIC/ACM International Conference on Web Intelligence (WI), Omaha, NE, 2016, pp.
568-571.
12. A. J. J. Mary and L. Arockiam, "Jen-Ton: A framework to enhance the accuracy of aspect
level sentiment analysis in big data," 2017 International Conference on Inventive Computing
and Informatics (ICICI), Coimbatore, 2017, pp. 452-457.
13. Kula S., Choraś M., Kozik R., Ksieniewicz P., Woźniak M. (2020) Sentiment Analysis for
Fake News Detection by Means of Neural Networks. In: Krzhizhanovskaya V. et al. (eds)
Computational Science – ICCS 2020. ICCS 2020. Lecture Notes in Computer Science, vol
12140. Springer, Cham.
14. Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews", Proceedings of the
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-
2004), Aug 22-25, 2004, Seattle, Washington, USA
15. Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing
Opinions on the Web." Proceedings of the 14th International World Wide Web conference
(WWW-2005), May 10-14, 2005, Chiba, Japan.
16. Walaa Medhat, Ahmed Hassan, Hoda Korashy, Sentiment analysis algorithms and
applications: A survey, Ain Shams Engineering Journal, Volume 5, Issue 4, 2014
17. Alonso, M.A.; Vilares, D.; Gómez-Rodríguez, C.; Vilares, J. Sentiment Analysis for Fake
News Detection. Electronics 2021, 10, 1348.
18. V. Sivasangari, Ashok Kumar Mohan, K. Suthendran, M. Sethumadhavan, “Isolating Rumors
Using Sentiment Analysis”, Journal of Cyber Security and Mobility. 2018 7.
10.13052/jcsm2245-1439.7113.
19. Kapusta, Jozef & Benko, Ľubomír & Munk, Michal. (2020). Fake News
Identification Based on Sentiment and Frequency Analysis. 10.1007/978-3-030-36778-7_44.
Pages 1093-1113
20. O. Ajao, D. Bhowmik and S. Zargari, “Sentiment Aware Fake News Detection on Online
Social Networks,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 2507-2511
21. Q. Li, S. Shah, R. Fang, A. Nourbakhsh and X. Liu, “Tweet Sentiment Analysis by
Incorporating Sentiment-Specific Word Embedding and Weighted Text Features”, 2016
IEEE/WIC/ACM International Conference on Web Intelligence (WI), Omaha, NE, 2016, pp.
568-571.
22. A. J. J. Mary and L. Arockiam, "Jen-Ton: A framework to enhance the accuracy of
aspect level sentiment analysis in big data," 2017 International Conference on Inventive
Computing and Informatics (ICICI), Coimbatore, 2017, pp. 452-457.
23. Kula S., Choraś M., Kozik R., Ksieniewicz P., Woźniak M. (2020) Sentiment Analysis for
Fake News Detection by Means of Neural Networks. In: Krzhizhanovskaya V. et al. (eds)
Computational Science – ICCS 2020. ICCS 2020. Lecture Notes in Computer Science, vol
12140. Springer, Cham.
24. Minqing Hu and Bing Liu. “Mining and Summarizing Customer Reviews”, Proceedings of
the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA
25. Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing
Opinions on the Web.” Proceedings of the 14th International World Wide Web
conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
26. Yuhang Yu. 2021. Review of the Application of Machine Learning in Rumor Detection. In
Proceedings of the 5th International Conference on Control Engineering and Artificial
Intelligence (CCEAI '21). Association for Computing Machinery, New York, NY, USA, 46–
52. https://doi.org/10.1145/3448218.3448238
27. Rani, N, Das, P, Bhardwaj, AK. Rumor, misinformation among web: A contemporary review
of rumor detection techniques during different web waves. Concurrency Computat Pract Exper.
2022; 34(1):e6479. https://doi.org/10.1002/cpe.6479
28. Rani, Neetu and Das, Prasenjit and Bharadwaj, Amit, Rumour Detection in Online Social
Networks: Recent Trends (March 30, 2020). Proceedings of the International Conference on
Innovative Computing & Communications (ICICC) 2020,
http://dx.doi.org/10.2139/ssrn.3564070
29. Alessandro Bondielli, Francesco Marcelloni, A survey on fake news and rumour detection
techniques, Information Sciences, Volume 497, 2019, Pages 38-55, ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2019.05.035
30. Olan, F., Jayawickrama, U., Arakpogun, E.O. et al. Fake news on Social Media: the Impact on
Society. Inf Syst Front (2022). https://doi.org/10.1007/s10796-022-10242-z
31. Raza, S., Ding, C. Fake news detection based on news content and social contexts: a
transformer-based approach. Int J Data Sci Anal 13, 335–362 (2022).
https://doi.org/10.1007/s41060-021-00302-z
32. Takahashi T, Igata N (2012) Rumor detection on twitter. In: The 6th international conference
on soft computing and intelligent systems, and the 13th international symposium on advanced
intelligence systems, IEEE, pp 452–457
33. Zhao Z, Resnick P, Mei Q (2015) Enquiring minds: early detection of rumors in social media
from enquiry posts. In: Proceedings of the 24th international conference on world wide web.
International world wide web conferences steering committee, pp 1395–1405
