Fake News


a) The spreading of fake news is a pervasive social phenomenon, occurring both between individuals and through social media platforms such as Facebook and Twitter. The fake news we are interested in is one of many kinds of deception in social media, but it is the more important one because it is created with a dishonest intention to mislead people. This phenomenon has recently changed, through the means of social communication, the course of societies and people's views; for example, during the revolutions in some Arab countries, false news emerged that obscured the truth and stirred up public opinion, and fake news was also one of the factors in Trump's success in the presidential election. Techniques for fake news detection are varied, ingenious, and often exciting. The best approach is to build a classifier that can predict whether a piece of news is fake or not based only on its content, thereby approaching the problem from a purely deep learning perspective with RNN models (vanilla, GRU) and LSTMs. We will show the differences and analyse the results obtained by applying them to the dataset we used, called LIAR. We found that the results are close, but the GRU performed best (0.217), followed by the LSTM (0.2166) and finally the vanilla RNN (0.215). Given these results, I would seek to increase accuracy by applying a hybrid model combining the GRU and CNN techniques on the same dataset.

b) In recent times, we have witnessed how fake news has become a viral, targeted strategy deployed to sway and divide citizens politically, particularly on messaging platforms where fake news can easily be forwarded and shared to large groups. For example, in the 2018 Brazil elections, 6 in 10 Brazilian voters who voted for current president Jair Bolsonaro (nearly 120 million citizens) informed themselves about the election primarily via WhatsApp and group messaging, and nearly 3 in 10 of the messages they received in these groups contained misinformation.

For the most part, once this fake news is shared, it goes unchallenged. The problem is that misinformation is often not harmless. In India, 50 lynchings across the country were blamed on incendiary messages spread using the app, and the problem has become even more severe with the upcoming 2019 elections and a group-messaging user base roughly six times larger; people are being lynched because of false news shared on messaging platforms like WhatsApp.

SOLUTION:

We can build a native AI chatbot for WhatsApp and Google Assistant that identifies and shares relevant sources to contextualize a user's “argument” or possible “fake news claim” and reports how well those claims are actually supported by the sources. We can build and integrate the first truly native fact-checking experience for those platforms. All a user must do is forward (copy) their claim to either the WhatsApp bot or the Google Assistant app, and the bot instantly returns a “claim and source corroboration report/score” along with a list of relevant sources that contextualize the claim.

We can initially use UCL's MLP for FNC-1 and then iterate on the model, for example by adding an extra hidden layer; the accuracy can remain just as good, and qualitatively the results can look reasonable.
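As a rough, hedged sketch of this kind of stance classifier (not UCL's actual implementation), the snippet below feeds TF-IDF features of a claim and an article body into a small MLP with one extra hidden layer; the toy data, feature sizes and layer widths are assumptions.

# Hypothetical FNC-1-style stance classifier sketch: TF-IDF of claim and body
# concatenated and fed to a small MLP. All data below is placeholder toy data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

claims = ["Politician X banned all imports", "Politician X banned all imports"]
bodies = ["The ministry confirmed a full import ban today.",
          "Officials denied any plans to restrict imports."]
stances = ["agree", "disagree"]  # FNC-1 also has "discuss" and "unrelated"

vec = TfidfVectorizer(max_features=5000).fit(claims + bodies)
X = np.hstack([vec.transform(claims).toarray(), vec.transform(bodies).toarray()])

# Two hidden layers, i.e. one extra layer compared with a single-hidden-layer baseline.
clf = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=0)
clf.fit(X, stances)
print(clf.predict(X))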

c) We can propose an approach for the detection of clickbait posts in online social media (OSM). Clickbait posts are short, catchy phrases that attract a user's attention and entice them to click through to an article. The approach can be based on a machine learning (ML) classifier capable of distinguishing between clickbait and legitimate posts published in OSM. The classifier can use a variety of features, including image-related features, linguistic analysis, and methods for abuser detection. When evaluated, the best performance obtained by such an ML classifier can reach an AUC of 0.8, an accuracy of 0.812, a precision of 0.819, and a recall of 0.966. In addition, clickbait post titles tend to be statistically significantly shorter than legitimate post titles. Finally, we can conclude that counting the number of formal English words in the given content is useful for clickbait detection.
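As an illustrative sketch (not the full feature set described above, which would also include image-related and abuser-detection features), a feature-based clickbait classifier might combine simple title statistics with a standard classifier; the features, toy titles and labels below are assumptions.

# Hedged sketch of a feature-based clickbait classifier; data and features are toy assumptions.
from sklearn.ensemble import RandomForestClassifier

def title_features(title):
    words = title.split()
    return [
        len(title),                              # character length (clickbait titles tend to be shorter)
        len(words),                              # word count
        sum(w.isupper() for w in words),         # fully capitalised words
        title.count("!") + title.count("?"),     # sensational punctuation
    ]

titles = ["You Won't Believe What Happened Next!",
          "Government publishes annual budget report for 2019 fiscal year"]
labels = [1, 0]  # 1 = clickbait, 0 = legitimate

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit([title_features(t) for t in titles], labels)
print(clf.predict([title_features("This Simple Trick Will Shock You!")]))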

d) In order to create a hate-speech-detecting algorithm, we are going to use Python-based NLP and machine learning techniques. Machine learning is essentially teaching machines to accomplish various tasks by training them on data. Using an NLP (Natural Language Processing) technique called TF-IDF vectorization, we'll extract keywords that convey importance within hate speech.

Finally, using a machine learning technique called logistic regression, which is popular for probability estimation, we'll train the computer to classify hate speech using the features extracted from the dataset (or any other data you wish to use for training).
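A minimal sketch of that pipeline, assuming scikit-learn and a tiny illustrative dataset (in practice the texts and labels would come from the hate-speech dataset mentioned above):

# TF-IDF vectorization followed by logistic regression; toy texts stand in for the real dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["example hateful comment targeting a group",
         "example neutral comment about the weather"]
labels = [1, 0]  # 1 = hate speech, 0 = not hate speech

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict_proba(["another comment to score"]))  # class probabilities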

e) Sentiment analysis, also known as opinion mining, is a field within Natural Language Processing (NLP).

Currently, sentiment analysis is a topic of great interest and development, since it has many practical applications. Because the publicly and privately available information on the Internet is constantly growing, a large number of texts expressing opinions are available on review sites, forums, blogs, and social media.

We can use Textbox, which returns a sentiment score that can be interpreted as either positive or negative. We can then build a simple algorithm that weights the sentiments of the different types of text being extracted (title, content, author, etc.) and combines them into a global score.
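A small sketch of the weighting idea, independent of how the per-field sentiment scores are obtained (e.g. from Textbox); the field weights below are arbitrary assumptions.

# Combine per-field sentiment scores (title, content, author, ...) into one global score.
def global_sentiment(scores, weights=None):
    """scores: dict mapping field name to a sentiment score in [-1, 1]."""
    weights = weights or {"title": 0.3, "content": 0.5, "author": 0.2}  # assumed weights
    weighted = sum(weights.get(field, 0.1) * value for field, value in scores.items())
    total_weight = sum(weights.get(field, 0.1) for field in scores)
    return weighted / total_weight if total_weight else 0.0

print(global_sentiment({"title": 0.8, "content": -0.2, "author": 0.1}))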

f) I would use NLP for this process with a bag-of-words model, looking for words concerning a particular person. For example, in India fake news was spread in the name of "NAMO", which refers to Indian Prime Minister Narendra Modi.

A Porter stemmer can be used to reduce words to their roots so that variants of the desired terms are matched and irrelevant word forms are filtered out.
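A small sketch of this idea using NLTK's PorterStemmer: stem the tokens and count occurrences of the target terms. The target term list and example text are assumptions.

# Count stemmed target terms (e.g. references to "NAMO"/"Modi") in a piece of text.
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models

stemmer = PorterStemmer()
target_stems = {stemmer.stem(w) for w in ["namo", "modi"]}  # assumed target terms

def count_target_terms(text):
    stems = [stemmer.stem(tok.lower()) for tok in word_tokenize(text) if tok.isalpha()]
    return sum(1 for s in stems if s in target_stems)

print(count_target_terms("Fake news was spread in the name of NAMO"))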

g) We can create a granular taxonomy of different types and targets of online hate and train machine learning models to automatically detect and classify the hateful comments in the full dataset. Our contribution would be twofold:

1) creating a granular taxonomy for hateful online comments that includes both types and targets of hateful comments, and

2) experimenting with machine learning, including Logistic Regression, Decision Tree, Random Forest, AdaBoost, and Linear SVM, to generate a multiclass, multilabel classification model that automatically detects and categorizes hateful comments in the context of online news media.

We can find that the best performing model is Linear SVM, with an average F1 score of 0.79 using TF-IDF features. We can validate the model by testing its predictive ability and, relatedly, provide insights into the distinct types of hate speech taking place on social media.
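A hedged sketch of this multiclass, multilabel setup using TF-IDF features and a Linear SVM via one-vs-rest; the toy comments and label sets are illustrative assumptions.

# Multilabel hate classification with TF-IDF + Linear SVM (one-vs-rest over binarized labels).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

comments = ["example hateful comment about group A",
            "example threatening comment aimed at person B",
            "ordinary comment with no hate"]
label_sets = [{"type:slur", "target:group"},
              {"type:threat", "target:individual"},
              set()]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)                 # binary indicator matrix of labels
X = TfidfVectorizer().fit_transform(comments)

clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
print(mlb.inverse_transform(clf.predict(X)))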

h) NLP enables computers to perform a wide range of natural-language-related tasks at all levels, ranging from parsing and part-of-speech (POS) tagging to machine translation and dialogue systems.

The models that create word embeddings have typically been shallow neural networks, and there has not been a need for deep networks to create good embeddings. However, deep-learning-based NLP models invariably represent their words, phrases and even sentences using these embeddings. This is in fact a major difference between traditional word-count-based models and deep-learning-based models.

I will use these techniques to find the relevant terms; the library for this task would be the PorterStemmer class in NLTK.
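As a rough illustration of the shallow-embedding side of this, assuming gensim is available (the toy corpus is a placeholder):

# Train a small Word2Vec model (a shallow network) and inspect the learned embeddings.
from gensim.models import Word2Vec

sentences = [["fake", "news", "spreads", "fast", "online"],
             ["people", "share", "news", "on", "social", "media"],
             ["voters", "read", "news", "on", "social", "media"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=2)
vector = model.wv["news"]                         # 100-dimensional embedding for "news"
print(model.wv.most_similar("news", topn=3))      # nearest words in embedding space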

i) In this context, LDA-style topic models have become one of the hottest research topics in machine learning and information retrieval, and they have also been introduced for emerging-trend detection. However, the original LDA [10] is independent of time, so several topic models with temporal information have been proposed, such as DDTM (Discrete-time Dynamic Topic Model) [11], CDTM (Continuous Time Dynamic Topic Model) [12], TOT (Topics over Time) [13] and TAM (Trend Analysis Model) [14]. DDTM requires that time be discretized, and the complexity of variational inference for the DDTM grows quickly as the time granularity increases.

This is how I would approach the problem.
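A minimal sketch of fitting a plain (time-agnostic) LDA topic model with scikit-learn; the documents and number of topics are illustrative assumptions, and the temporal variants mentioned above would go beyond this.

# Fit LDA on word counts and get per-document topic distributions.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fake news about the election spreads online",
        "a new phone was released with a better camera",
        "voters discuss the election on social media"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))                      # topic mixture for each document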

j) I will be using an unsupervised learning approach to find sentence similarity and rank the sentences. One benefit of this is that we don't need to train and build a model before starting to use it in the project.

It's good to understand cosine similarity to make the best use of the code we are going to see.
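A brief sketch of the idea, assuming TF-IDF vectors: rank each sentence by its average cosine similarity to the others (the sentences are placeholders).

# Rank sentences by average cosine similarity over TF-IDF vectors (unsupervised, no training step).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["Fake news spreads quickly on social media.",
             "Social media platforms struggle to stop misinformation.",
             "The weather was pleasant yesterday."]

tfidf = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(tfidf)                    # pairwise similarity matrix
scores = sim.mean(axis=1)                         # average similarity of each sentence to the rest
print(sorted(zip(scores, sentences), reverse=True))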

k) We can use NLP to add frequently asked questions, for example by training our model on specific questions related to our articles.

2) I would use several methods for this, which I shall list one by one.

I would start with the following:

A. Dataset

There are useful datasets for studying fake detection, but their positive training data are typically collected from a simulated environment. More importantly, these datasets are not well suited to detecting fake statements, since fake news on TV and social media is much shorter than customer reviews. (William Yang et al., 2017) presents a new benchmark dataset called LIAR: a new, publicly available dataset for fake news detection. It collects a decade of 12.8K manually labeled short statements in various contexts from POLITIFACT.COM, which provides a detailed analytical report and a link to its source for each case. This dataset can be used for fact-checking research as well. The authors investigate the automatic detection of fake news based on surface-level linguistic patterns [12]. The LIAR dataset includes 12,836 short statements labeled for truthfulness, subject, context/venue, speaker, state, party, and prior history. With this size and a time span of ten years, LIAR cases are collected in more natural contexts, such as political debates, TV ads, Facebook posts, tweets, interviews, news releases, etc. In each case, the labeler provides a lengthy analysis report to ground each judgment [12]. They have evaluated several popular learning-based methods on this dataset; the baselines include logistic regression, support vector machines, LSTM and a CNN model.

B. Data Preparation

After we have the LIAR dataset, as mentioned earlier, we must preprocess it so that our system can work with it. Preprocessing makes the dataset cleaner for our algorithm by removing dummy characters, strings, and other impurities. Preprocessing works in three steps. Splitting: separate each sentence from the next so that each can be dealt with individually. Stop-word removal: remove unimportant words from each sentence. Stemming: return each word to its root.
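A minimal sketch of these three steps with NLTK (resource downloads included; the example text is a placeholder):

# Splitting, stop-word removal, and stemming for each sentence of a statement.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    cleaned = []
    for sentence in sent_tokenize(text):                          # splitting
        tokens = [t.lower() for t in word_tokenize(sentence) if t.isalpha()]
        tokens = [t for t in tokens if t not in stop_words]       # stop-word removal
        cleaned.append([stemmer.stem(t) for t in tokens])         # stemming
    return cleaned

print(preprocess("The senator said the statement was false. Voters were misled."))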

C. Deception System

The deception phenomenon spreads nowadays through traditional media and social media platforms, especially Facebook and Twitter, and influences users' choices both economically (buying products, business, etc.) and politically. To address this, we can create a system that tackles not only fake reviews but, more importantly, fake news. We take the preprocessed LIAR data, prepared as described in detail in the previous part, and apply word embedding (word vectors), which gives each word a vector whose components represent latent features of that word. The resulting word vectors are then fed into the RNN models (vanilla, GRU) and the LSTM, and we obtain results that determine whether a piece of news is deceptive or not.

We can work on this model in the following way.

First step: prepare the LIAR dataset at three levels:
- The first level is splitting, so that each sentence can be dealt with separately.
- The second level is removing stop words, which means identifying the useless words in each statement (the, a, an, etc.).
- The third level is stemming, in which every word is returned to its root form.

Second step: the output of stemming becomes the input to word embedding, which plays an important role in deep-learning-based deception analysis. It represents every single word in each sentence as a dimensional vector and captures the relation between two words, not only syntactically but also semantically (e.g. "see" and "watch" are very different syntactically, but their meanings are closely related). Another benefit is that the algorithm detects words that frequently appear together (like "wear" and "clothes"), captures their relationship, and is then able to predict the next word.

Third step: the results of the word embedding level are the input to the RNN models (vanilla, GRU) and the LSTM technique.

Fourth step: the output of the third step gives the final result, determining whether the piece of news is truthful or deceptive.
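A hedged Keras sketch of steps two to four: an embedding layer followed by a recurrent layer (GRU here; SimpleRNN or LSTM can be swapped in for the other variants) and a binary truthful/deceptive output. Vocabulary size, sequence length and layer sizes are assumptions.

# Embedding + GRU binary classifier for deception detection on padded token sequences.
from tensorflow.keras import layers, models

vocab_size, max_len, embed_dim = 10000, 50, 100   # assumed hyperparameters

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, embed_dim),      # word embedding step
    layers.GRU(64),                               # swap for layers.SimpleRNN(64) or layers.LSTM(64)
    layers.Dense(1, activation="sigmoid"),        # truthful vs. deceptive
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()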

3)

In the case of Brexit, the issue is much the same as with other fake news:

Trolls and bots have a huge and often unrecognized influence on social media. They are used to
influence conversations for commercial or political reasons. They allow small hidden groups of
people to promote information supporting their agenda at a large scale. They can push their
content to the top of people’s news feeds, search results, and shopping carts. Some say they can
even influence presidential elections.

In order to maintain the quality of discussion on social sites, it’s become necessary to screen and
moderate community content. We can use machine learning to identify suspicious posts and
comments.

As in other cases:

Trolls are dangerous online because it’s not always obvious when you are being influenced by
them or engaging with them. Posts created by Russian operatives were seen by up to 126 million
Americans on Facebook leading up to the last election. Twitter released a massive data dump of
over 9 million tweets from Russian trolls. And it’s not just Russia! There are also accounts of
trolls attempting to influence Canada after the conflict with Huawei. The problem even extends
to online shopping where reviews on Amazon have slowly been getting more heavily
manipulated by merchants. Brexit propaganda can be created in the same way.

Bots are computer programs posing as people. They can amplify the effect of trolls by engaging
or liking their content en masse, or by posting their own content in an automated fashion. They
will get more sophisticated and harder to detect in the future. Bots can now create entire
paragraphs of text in response to text posts or comments. OpenAI’s GPT-2 model can write text
that feels and looks very similar to human writing. How do we protect the Brexit debate from propaganda and spam posted by malicious trolls and bots? We could carefully investigate the background of each
poster, but we don’t have time to do this for every comment we read. The answer is to automate
the detection using big data and machine learning. Let’s fight fire with fire!

I’ll focus on Reddit because users often complain of trolls in political threads. It’s easier for trolls
to operate thanks to anonymous posting. Operatives can create dozens or hundreds of accounts
to simulate user engagement, likes and comments. Research from Stanford has shown that just
1% of accounts create 74% of conflict. There are several existing resources we can leverage. For
example, the botwatch subreddit keeps track of bots on Reddit, true to its namesake! Reddit’s
2017 Transparency Report also listed 944 accounts suspected of being trolls working for the
Russian Internet Research Agency.

Also, there are software tools for analyzing Reddit users. For example, the very nicely designed
reddit-user-analyzer can do sentiment analysis, plot the controversiality of user comments, and
more. Let’s take this a step further and build a tool that puts the power in the hands of
moderators and users.

Malicious URL detection is a broad problem; it is not a single-step process. Machine learning approaches try to analyze the information of a URL and its corresponding websites or webpages by extracting good feature representations of URLs and training a prediction model on training data of both malicious and benign URLs. There are two types of features that can be used: static features and dynamic features. In static analysis, we perform the analysis of a webpage based on information available without executing the URL (i.e., without executing JavaScript or other code). The features extracted include lexical features from the URL string, information about the host, and sometimes even HTML and JavaScript content. Since no execution is required, these methods are safer than the dynamic approaches. The underlying assumption is that the distribution of these features is different for malicious and benign URLs. Using this distribution information, a prediction model can be built which can make predictions on new URLs. Due to the relatively safer environment for extracting important information, and the ability to generalize to all types of threats (not just common ones that have to be defined by a signature), static analysis techniques have been extensively explored with machine learning; this is where machine learning has found tremendous success. Dynamic analysis techniques include monitoring the behavior of the systems that are potential victims to look for any anomaly; examples include approaches that monitor system call sequences for abnormal behavior and approaches that mine internet access log data for suspicious activity. Dynamic analysis techniques have inherent risks and are difficult to implement and generalize.
Next, we formalize the problem of malicious URL detection as a machine learning task (specifically binary classification), which allows us to generalize most of the existing work in the literature: learning a prediction function f : R^d → R which predicts the class assignment for any URL instance x using proper feature representations.

The goal of machine learning for malicious URL detection is to maximize predictive accuracy. Both parts of this task, feature representation and learning the prediction function, are important in achieving this goal. While the feature representation is often based on domain knowledge and heuristics, the second part focuses on training the classification model via a data-driven optimization approach.

The first key step is to convert a URL u into a feature vector x, where several types of information can be considered and different techniques can be used. Unlike learning the prediction model, this part cannot (for the most part) be computed directly by a mathematical function. Using domain knowledge and related expertise, a feature representation is constructed by crawling all relevant information about the URL. This ranges from lexical information (length of the URL, the words used in the URL, etc.) to host-based information (WHOIS info, IP address, location, etc.). Once the information is gathered, it is processed and stored in a feature vector x. Numerical features can be stored in x as is, and identity-related information or lexical features are usually stored through a binarization or bag-of-words (BoW) approach. Based on the type of information used, x ∈ R^d generated from a URL is a d-dimensional vector, where d can be less than 100 or on the order of millions. A unique challenge in this problem setting is that the number of features may not be fixed or known in advance. For example, using a BoW approach one can track the occurrence of every type of word that occurred in a URL in the training data. A model can be trained on this data, but at prediction time new URLs may contain words that did not occur in the training data. It is thus a challenging task to design a good feature representation that is robust to unseen data.

After obtaining the feature vector x for the training data, learning the prediction function f : R^d → R is usually formulated as an optimization problem in which the detection accuracy is maximized (or, alternately, a loss function is minimized). The function f is (usually) parameterized by a d-dimensional weight vector w, such that f(x) = w⊤x. Let ŷ_t = sign(f(x_t)) denote the class label predicted by f. The number of mistakes made by the prediction model on the entire training data is given by Σ_{t=1..T} I(ŷ_t ≠ y_t), where I is an indicator that evaluates to 1 if the condition is true and 0 otherwise. Since the indicator function is not convex, this optimization can be difficult to solve. As a result, a convex loss function ℓ(f(x), y) is often defined instead, and the entire optimization can be formulated as:

min_w Σ_{t=1..T} ℓ(f(x_t), y_t)    (1)

Several types of loss functions can be used, including the popular hinge loss ℓ(f(x), y) = max(1 − y f(x), 0) or the squared loss ℓ(f(x), y) = (f(x) − y)². Sometimes a regularization term is added to prevent over-fitting or to learn sparse models, or the loss function is modified based on the cost-sensitivity of the data (e.g., imbalanced class distribution, different costs for diverse threats).
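An illustrative sketch of the static-feature approach: a few lexical features plus a bag-of-words over URL tokens feeding a linear model. The URLs, labels and chosen features are toy assumptions.

# Lexical + bag-of-words URL features with a logistic regression classifier.
import re
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

urls = ["http://example.com/login",
        "http://paypa1-secure-update.xyz/verify?acct=1"]
labels = [0, 1]  # 0 = benign, 1 = malicious

def lexical_features(url):
    return [len(url), url.count("."), url.count("-"), sum(c.isdigit() for c in url)]

# Bag-of-words over tokens split on non-alphanumeric characters.
bow = CountVectorizer(tokenizer=lambda u: [t for t in re.split(r"[\W_]+", u) if t],
                      token_pattern=None)
X = hstack([csr_matrix(np.array([lexical_features(u) for u in urls], dtype=float)),
            bow.fit_transform(urls)])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))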

4) I haven't worked much on this area.

We can achieve this task using transfer learning. A pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use a model trained on another problem as a starting point.

For example, if you want to build a self-driving car, you can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was built on ImageNet data, to identify the objects in images.

A pre-trained model may not be 100% accurate in your application, but it saves the huge effort required to reinvent the wheel. Let me show this with a recent example.

The objective in that example was to classify images into one of 16 categories. After the basic preprocessing steps, I started off with a simple MLP with the following architecture: the input image [224 x 224 x 3] was flattened into a vector of 150,528 values and fed through three hidden layers with 500 neurons each, and the output layer had 16 neurons, corresponding to the number of categories into which we need to classify the input image.

I barely managed a training accuracy of 6.8%, which turned out to be very bad. Even after experimenting with the hidden layers, the number of neurons per hidden layer, and the dropout rates, I could not substantially increase my training accuracy. Increasing the number of hidden layers and neurons pushed a single epoch to around 20 seconds on my Titan X GPU with 12 GB of VRAM.

Below is an output of the training using the MLP model with the above architecture.

Epoch 10/10
50/50 [==============================] – 21s – loss: 15.0100 – acc: 0.0688

As you can see, the MLP was not going to give me any better results without exponentially increasing my training time. So I switched to a Convolutional Neural Network to see how it would perform on this dataset and whether I could increase my training accuracy.

The CNN had the below architecture –

I used 3 convolutional blocks, each following the architecture below:

- 32 filters of size 5 x 5
- Activation function: ReLU
- Max pooling layer of size 4 x 4

The output of the final convolutional block was flattened into a vector of size [256] and passed into a single hidden layer with 64 neurons. The output of the hidden layer was passed to the output layer after a dropout rate of 0.5.
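A hedged Keras sketch of the described CNN (the exact flattened size depends on padding choices, so treat the shapes as approximate):

# Three conv blocks (32 filters of 5x5, ReLU, 4x4 max pooling), then flatten, dense 64, dropout 0.5, 16-way softmax.
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Input(shape=(224, 224, 3)))
for _ in range(3):                                 # three convolutional blocks
    model.add(layers.Conv2D(32, (5, 5), activation="relu"))
    model.add(layers.MaxPooling2D(pool_size=(4, 4)))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation="softmax"))  # 16 output categories
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])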

The result obtained with the above architecture is summarized below-

Epoch 10/10
50/50 [==============================] – 21s – loss: 13.5733 – acc: 0.1575

Though my accuracy increased in comparison to the MLP output, a single epoch still took around 21 seconds to run.

But the major point to note was that the majority class made up around 17.6% of the dataset. So even if we had predicted every image in the training set as the majority class, we would have performed better than both the MLP and the CNN. Adding more convolutional blocks substantially increased my training time. This led me to switch to pre-trained models, where I would not have to train my entire architecture but only a few layers.

So I used the VGG16 model, which is pre-trained on the ImageNet dataset and provided in the Keras library. The only change I made to the existing VGG16 architecture was replacing the 1000-output softmax layer with a 16-category layer suitable for our problem and re-training the dense layers.

This architecture gave me an accuracy of 70%, much better than the MLP and the CNN. The biggest benefit of using the pre-trained VGG16 model was the almost negligible time needed to train the dense layers while achieving greater accuracy.

So I moved forward with this approach of using a pre-trained model, and the next step was to fine-tune the VGG16 model to suit this problem.

Ways to fine-tune the model

Feature extraction – We can use a pre-trained model as a feature extraction mechanism: we remove the output layer (the one which gives the probabilities for being in each of the 1000 classes) and then use the entire network as a fixed feature extractor for the new dataset.

Use the architecture of the pre-trained model – We use the architecture of the model while initializing all the weights randomly, and train the model again on our own dataset.

Train some layers while freezing others – Another way to use a pre-trained model is to train it partially: we keep the weights of the initial layers of the model frozen while we retrain only the higher layers. We can experiment with how many layers to freeze and how many to train, as in the sketch below.
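A hedged Keras sketch of the freezing approach: the VGG16 convolutional base is frozen and only a new dense head for the 16-category problem is trained (the head's layer sizes are assumptions).

# VGG16 base frozen as a feature extractor; only the new dense head is trained.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                             # freeze all convolutional layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(16, activation="softmax"),        # 16 categories instead of 1000
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# To later unfreeze only the top few layers of the base for partial retraining:
# for layer in base.layers[-4:]:
#     layer.trainable = True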

5) I will be doing ML research this summer as a summer intern at Kiel University, where I will be working on deep learning algorithms. I will send the offer letter along with this document.

6) I like to work in the sentiment analysis area of NLP. Sentiment analysis is extremely useful in social media monitoring, as it allows us to gain an overview of the wider public opinion behind certain topics. Social media monitoring tools like Brandwatch Analytics make that process quicker and easier than ever before, thanks to real-time monitoring capabilities.

The applications of sentiment analysis are broad and powerful. The ability to extract
insights from social data is a practice that is being widely adopted by organisations across
the world.
Shifts in sentiment on social media have been shown to correlate with shifts in the stock
market.

The Obama administration used sentiment analysis to gauge public opinion on policy announcements and campaign messages ahead of the 2012 presidential election. Being able to
quickly see the sentiment behind everything from forum posts to news articles means being
better able to strategise and plan for the future.

It can also be an essential part of your market research and customer service approach. Not
only can you see what people think of your own products or services, you can see what they
think about your competitors too. The overall customer experience of your users can be
revealed quickly with sentiment analysis, but it can get far more granular too.

The ability to quickly understand consumer attitudes and react accordingly is something
that Expedia Canada took advantage of when they noticed that there was a steady increase
in negative feedback to the music used in one of their television adverts.

That is not to say, though, that sentiment analysis is a perfect science.

Human language is complex. Teaching a machine to analyse the various grammatical nuances, cultural variations, slang and misspellings that occur in online mentions is a difficult process. Teaching a machine to understand how context can affect tone is even more difficult. Humans are fairly intuitive when it comes to interpreting the tone of a piece of writing.

Consider the following sentence: “My flight’s been delayed. Brilliant!”

Most humans would be able to quickly interpret that the person was being sarcastic. We
know that for most people having a delayed flight is not a good experience (unless there’s a
free bar as recompense involved). By applying this contextual understanding to the
sentence, we can easily identify the sentiment as negative.

Without contextual understanding, a machine looking at the sentence above might see the
word “brilliant” and categorise it as positive.
7) I have worked with almost every ML algorithm in my projects:

REGRESSION

CLASSIFICATION

These were used for predicting sales and health.

NEURAL NETWORKS

For making chatbots

NATURAL LANGUAGE PROCESSING

For analysing restaurant reviews and detecting fake news

8) No, I haven't done any web development yet. I have worked on machine learning, data science and deep learning, and I am learning IoT because it excites me.
