
FAKE NEWS DETECTION USING PYTHON

A project report in partial fulfilment for the award of the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND SYSTEMS ENGINEERING

Submitted by

Ryali Aravind (316506408006)

Under the guidance of

Prof Viziananda Row Sanapal


Professor,
DEPARTMENT OF COMPUTER SCIENCE AND SYSTEMS ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND SYSTEMS ENGINEERING,
ANDHRA UNIVERSITY COLLEGE OF ENGINEERING (A),
ANDHRA UNIVERSITY, VISAKHAPATNAM-530003
MARCH 2020
CERTIFICATE

This is to certify that the project titled "FAKE NEWS DETECTION
USING PYTHON" is a certified record of work done by R. Aravind
(316506408006), student of Bachelor of Technology (B.Tech) in the Department
of Computer Science & Systems Engineering, Andhra University College of
Engineering (A), Andhra University, Visakhapatnam, during the period 2016-2020,
in partial fulfilment of the requirements for the award of Bachelor of
Technology. This work has not been submitted to any other university for the
award of a Degree or Diploma.

Prof. VIZIANANDA ROW SANAPAL


Computer Science & Systems Engineering
Andhra University College of Engineering(A)
DECLARATION

We hereby declare that the project entitled “FAKE NEWS


DETECTION USING PYTHON” is an original and authentic work done in the
Department of Computer Science & Systems Engineering, Andhra University
College of Engineering (A), Andhra University Visakhapatnam, submitted in
partial fulfilment of the requirements for the degree of Bachelor of Technology.

R. ARAVIND (316506408006)
ACKNOWLEDGEMENT

We have immense pleasure in expressing our earnest gratitude to our Project
Supervisor, Prof. VIZIANANDA ROW SANAPAL, Andhra University, for her inspiring and
scholarly guidance. Despite her preoccupation with several assignments, she was kind
enough to spare her valuable time and give us the necessary counsel and guidance at
every stage of the planning and constitution of this work. We express sincere gratitude
for having been accorded permission to take up this project work and for her gracious
help throughout its execution.

We express sincere thanks to Prof. K. NAGESWARA RAO, Head of the Department of
CS&SE, Andhra University College of Engineering (A), for his keen interest and for
providing the necessary facilities for this project study.

We express sincere gratitude to Prof. P.V.G.D. PRASAD REDDY, Chairman, Board of


Studies, Computer Science and Systems Engineering, for having accorded us permission to
take up this Project work and for helping us graciously throughout the execution of this work.

We express sincere thanks to Prof. P. SRINIVAS RAO, Principal, Andhra University
College of Engineering (A), for his keen interest and for providing the necessary
facilities for the project study.
ABSTRACT

This research surveys the current state‐of‐the‐art technologies that are instrumental in the
adoption and development of fake news detection. “Fake news detection” is defined as the
task of categorizing news along a continuum of veracity, with an associated measure of
certainty. Veracity is compromised by the occurrence of intentional deceptions.

The nature of online news publication has changed, such that traditional fact checking and
vetting from potential deception is impossible against the flood arising from content
generators, as well as various formats and genres.

This project provides a typology of several varieties of veracity assessment methods
emerging from one major category: linguistic cue approaches (with machine learning).

We see promise in an innovative hybrid approach that combines linguistic cues and
machine learning. Although designing a fake news detector is not a straightforward
problem, we propose operational guidelines for a feasible fake news detecting system.
TABLE OF CONTENTS

CHAPTER-1-PROJECT OVERVIEW
Introduction
Purpose and Benefits
Modules and Description
Existing System & Proposed System

CHAPTER-2-PROJECT ANALYSIS
Project Analysis

CHAPTER-3- PROJECT LIFECYCLE


Project Lifecycle Details

CHAPTER-4-PROJECT DESIGN
E-R Diagram
Use Case Diagram
Sequence Diagram
Activity Diagram
Data Flow Diagram
System Architecture

CHAPTER-5-PROJECT IMPLEMENTATION
Project Implementation Procedure
Feasibility Report

CHAPTER-6- TESTING
Testing
Levels of Testing
Test Cases
CHAPTER-7-ADVANTAGES AND DISADVANTAGES
Advantages
Disadvantages

CHAPTER-8-CONCLUSION
Project Conclusion
CHAPTER-9-BIBLIOGRAPHY
Website links
Acknowledgement

We are pleased to present “FAKE NEWS DETECTION USING PYTHON”


project and take this opportunity to express our profound gratitude to all those
people who helped us in completion of this project.

We express our deepest gratitude towards our project guide for her
valuable and timely advice during the various phases of our project. We would
also like to thank her for providing us with all proper facilities and support as
the project coordinator, for her support, patience, and faith in our capabilities,
and for giving us flexibility in terms of working and reporting schedules.

We thank our college for providing us with excellent facilities that


helped us to complete and present this project. We would also like to thank
the staff members and lab assistants for permitting us to use computers in the
lab as and when required.

We would like to thank all our friends for their smiles and friendship,
which made college life enjoyable and memorable, and our family members, who
always stood beside us and provided the utmost important moral support.
Finally, we would like to thank everyone who helped us directly or
indirectly in our project.
CHAPTER – 1 – PROJECT OVERVIEW

INTRODUCTION

Fake news is a term that has been used to describe very different
issues, from satirical articles to completely fabricated news and plain
government propaganda in some outlets. Fake news, information
bubbles, news manipulation and the lack of trust in the media are
growing problems with huge ramifications in our society.
First, fake news is intentionally written to mislead readers to
believe false information, which makes it difficult and nontrivial to
detect based on news content; therefore, we need to include auxiliary
information, such as user social engagements on social media, to help
make a determination.
However, in order to start addressing this problem, we need to
have an understanding on what Fake News is. Only then can we look
into the different techniques and fields of machine learning (ML),
natural language processing (NLP) and artificial intelligence (AI) that
could help us fight this situation.
Second, exploiting this auxiliary information is challenging in and
of itself, as users' social engagements with fake news produce data
that is big, incomplete, unstructured, and noisy. The issue of fake news
detection on social media is therefore both challenging and relevant.
Social media for news consumption is a double-edged sword. On
the one hand, its low cost, easy access, and rapid dissemination of
information lead people to seek out and consume news from social
media. On the other hand, it enables the wide spread of fake news,
i.e., low quality news with intentionally false information.
Purpose and benefits

The extensive spread of fake news can have extremely negative
impacts on individuals and society. Fake news detection on social media
has therefore recently become an emerging research area attracting
tremendous attention. Fake news detection on social media presents
unique characteristics and challenges that make existing detection
algorithms from traditional news media ineffective or inapplicable.

What is fake news?


Fake news is an allegation that some story is misleading – it
contains significant omissions – or even false – it is a lie – designed to
deceive its intended audience.
“Fake news” has been used in a multitude of ways in the last half
a year, and multiple definitions have been given. For instance, the New
York Times defines it as “a made-up story with an intention to
deceive”. This definition focuses on two dimensions: the intentionality
(very difficult to prove) and the fact that the story is made up. This
implies that honest mistakes are not considered to be fake news.
Fake news is a problem that is heavily affecting society and our
perception of not only the media but also facts and opinions
themselves. I believe that this problem is solvable using AI and ML.
What truly differentiates Fake News from simple hoaxes like
“Moon landing was fake”, etc. is the fact that it carefully mimics the
“style” and “patterns” that real news usually follows. That is what
makes it so hard to distinguish for the untrained human eye.
Types of fake content:
• False Connection: Headlines, visuals or captions do not support
the content
• False Context: Genuine content is shared with false contextual
information
• Manipulated content: Genuine information or imagery is
manipulated
• Satire or Parody: No intention to cause harm but potential to
fool
• Misleading Content: Misleading use of information to frame an
issue/individual
• Imposter Content: Impersonation of genuine sources
• Fabricated content: New content that is 100% false
Information and relationship extraction are two fields of NLP that
can detect specific nuggets of information being mentioned in natural
language. For instance, given the sentence “Tim Cook is the CEO of
Microsoft”, these types of systems could automatically identify that the
article claims that Tim Cook's job at Microsoft is CEO. If
the system also had access to information about the company (e.g.,
Companies House or even Wikipedia), this claim could easily be debunked,
because Microsoft's CEO is Satya Nadella. In addition to the
academic efforts, we are seeing different companies starting to apply
these technologies for automatic fact checking. It is critical that we
understand how credible different sources are; then we can apply
our own criticism to decide whether to believe them or to find a second
source for the information. However, in a world with thousands of
publications either disappearing or being created every minute, this
cannot be done manually.
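The debunking workflow just described can be sketched with a toy, hand-built knowledge base. The `knowledge_base` dictionary and `check_claim` helper below are hypothetical, invented only to mirror the Tim Cook example; a real system would extract relations with NLP and query a source such as Wikipedia.

```python
# Toy fact-checking of a relational claim against a tiny knowledge base.
# Keys are (entity, relation) pairs; values are the known true answers.
knowledge_base = {
    ("Microsoft", "CEO"): "Satya Nadella",
    ("Apple", "CEO"): "Tim Cook",
}

def check_claim(subject, relation, claimed_value):
    """Return True if the claim matches the knowledge base,
    False if it contradicts it, and None if the fact is unknown."""
    known = knowledge_base.get((subject, relation))
    if known is None:
        return None
    return known == claimed_value

# "Tim Cook is the CEO of Microsoft" is contradicted by the knowledge base.
print(check_claim("Microsoft", "CEO", "Tim Cook"))  # False
print(check_claim("Apple", "CEO", "Tim Cook"))      # True
```

The interesting engineering problems, of course, are extracting the (subject, relation, value) triple from free text and keeping the knowledge base current.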
When it comes to handling fake news, few have put on a braver
face than Facebook. With trillions of user posts, Facebook realized
that manual fact-checking alone could not solve the fake news problem,
and turned to artificial intelligence instead. Artificial intelligence
is being leveraged to find words, or even patterns of words, that can
throw light on fake news stories.
Artificial intelligence is now looked upon as the cornerstone for
separating the good from the bad in the news field, because artificial
intelligence makes it easy to learn behaviours through pattern recognition.
Harnessing artificial intelligence's power
As the volume of data grows bigger by the day, so does the volume
of misinformation, challenging the human ability to uncover the truth.
Artificial intelligence has turned into a beacon of hope for assuring
data veracity and, more importantly, identifying fake news.
When it comes to news items, the headline is the key to capturing
the attention of the audience. It is for this reason that sensational
headlines become a handy tool to capture readers' interest. When
sensational words are used to spread fake news, they become a lure to
attract more eyeballs and spread the news faster and wider. Not
anymore: artificial intelligence has been instrumental in discovering
and flagging fake news headlines by using keyword analytics.
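As a rough illustration of keyword analytics on headlines, a naive flagger might look like the sketch below. The `SENSATIONAL_WORDS` set is a made-up placeholder; a real system would learn such indicators from labelled data rather than hard-code them.

```python
# Minimal keyword-based headline flagging (illustrative word list only).
SENSATIONAL_WORDS = {"shocking", "unbelievable", "miracle", "exposed", "secret"}

def flag_headline(headline):
    # Normalize: strip surrounding punctuation and lower-case each token.
    tokens = {w.strip(".,!?").lower() for w in headline.split()}
    # Flag if any sensational keyword appears in the headline.
    return bool(tokens & SENSATIONAL_WORDS)

print(flag_headline("SHOCKING secret cure EXPOSED!"))      # True
print(flag_headline("Parliament passes the budget bill"))  # False
```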
Modules and description

There are six modules in this project, as follows:


1. Data Exploration
2. Extracting Training Data
3. Building vectorizers
4. Comparing Models
5. Applying Classifiers
6. Prediction on Testing data
Description:
1. Data Exploration
To begin, you should always take a quick look at the data and get
a feel for its contents. To do so, use a pandas DataFrame and check the
shape and head, and apply any necessary transformations.
2.Extracting the training data
The DataFrame should now look close to what we need, so we
want to separate the labels and set up training and test datasets.
3.Building Vectorizers
Now that we have our training and testing data, we can build our
classifiers. To get a good idea of whether the words and tokens in the
articles had a significant impact on whether the news was fake or real, we
begin by using CountVectorizer and TfidfVectorizer.
Text data requires special preparation before we can start using
it for predictive modelling.
The text must first be parsed to split it into words, a step called
tokenization. Then the words need to be encoded as integers or floating-point
values for use as input to a machine learning algorithm, a step called
feature extraction (or vectorization).
The scikit-learn library offers easy-to-use tools to perform both
tokenization and feature extraction of your text data.
4.Comparing Models
We will begin with an NLP favourite, MultinomialNB. We can use
this to compare TF-IDF versus bag-of-words. Our intuition was that
CountVectorizer would perform better with this model. We used
the scikit-learn documentation to build some easily readable
confusion matrices. A confusion matrix shows the proper labels on
the main diagonal (top left to bottom right). The other cells show the
incorrect labels, often referred to as false positives or false negatives.
Depending on our problem, one of these might be more significant. For
example, for the fake news problem, is it more important that we don't
label real news articles as fake news? If so, we might want to eventually
weight our accuracy score to better reflect this concern.
5.Applying classifiers
Applying classifiers is the most important part of the entire
project. Here we classify the data into two classes: the first is the
Real class and the second is the Fake class. We use
PassiveAggressiveClassifier to classify the data.
6.Prediction on the testing data
After training, the classifier has learned how to predict whether
a given article is real or fake. We then apply this classifier to the
testing data that we separated earlier during data exploration, and
obtain a prediction of whether each tested article is real or fake.
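The classification and prediction steps above can be sketched end to end on a tiny dataset. The example texts and labels below are invented for illustration only; the real project uses the `fake_or_real_news.csv` dataset shown in later chapters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Invented toy training set: two "REAL" and two "FAKE" snippets.
train_texts = [
    "scientists publish peer reviewed study on climate data",
    "government releases official employment statistics report",
    "miracle pill cures every disease doctors hate this trick",
    "shocking secret celebrity scandal you will not believe",
]
train_labels = ["REAL", "REAL", "FAKE", "FAKE"]

# Vectorize the text, then fit the classifier on the vectors.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
clf = PassiveAggressiveClassifier(max_iter=50, random_state=0)
clf.fit(X_train, train_labels)

# Predict on unseen text: transform it with the SAME fitted vectorizer.
X_new = vectorizer.transform(["doctors hate this miracle trick"])
prediction = clf.predict(X_new)[0]
print(prediction)
```

The key design point is that the test text must be passed through `transform` on the already-fitted vectorizer, never `fit_transform`, so that both sets share one vocabulary.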
Existing System
While there exists a large body of research on the topic of machine
learning methods for deception detection, most of it has focused
on classifying online reviews and publicly available social media posts.
Particularly since late 2016, during the American presidential election,
the question of determining 'fake news' has also become the subject of
particular attention within the literature.

Conroy, Rubin, and Chen [1] outline several approaches that
seem promising towards the aim of correctly classifying misleading
articles. They note that simple content-related n-grams and shallow
part-of-speech (POS) tagging have proven insufficient for the
classification task, often failing to account for important context
information. Rather, these methods have been shown to be useful only in
tandem with more complex methods of analysis. Deep syntax analysis
using Probabilistic Context-Free Grammars (PCFGs) has been shown
to be particularly valuable in combination with n-gram methods. Feng,
Banerjee, and Choi [2] are able to achieve 85%-91% accuracy in
deception-related classification tasks using online review corpora.

Proposed system

In this project, a model is built based on a count vectorizer or a
TF-IDF matrix (i.e., word tallies relative to how often they are used in
other articles in the dataset). Since this problem is a kind of text
classification, implementing a Naive Bayes classifier is a natural
baseline, as this is standard for text-based processing. The actual work
lies in choosing the text transformation (count vectorizer vs. TF-IDF
vectorizer) and choosing which type of text to use (headlines vs. full
text). The next step is to extract the most informative features for the
count vectorizer or TF-IDF vectorizer. This is done by using the n most
used words and/or phrases, lower-casing or not, and mainly removing the
stop words – common words such as “the”, “when”, and “there” – and only
using those words that appear at least a given number of times in the
text dataset.
CHAPTER-2 –PROJECT ANALYSIS

News is a game changer in each and every aspect: we can hide anything
from the world, and we can show anything to the world, using news. If
news is misused, it creates a huge impact on society. News can be
manipulated, and it is easier to spread fake news than real news. So,
when people start reading fake news, they start believing that it is
real. That is where the actual problem starts.
We need to find a real-time solution to this problem of fake news.
It is impossible to do this manually: there is an effectively unlimited
number of news records in the world, and there will be even more in the
future. So, we need a tool which can do this work efficiently.
For that reason, we have started a project of detecting fake news
using machine learning tools. We have used some tools for the
detection of fake news. The results are not one hundred percent
accurate; we achieve about 93% accuracy using these tools.
CHAPTER -3- PROJECT LIFECYCLE

Waterfall Model

REQUIREMENT – Requirements for detecting fake news
LOADING DATA SETS – Loading the data sets
IMPLEMENTATION – Applying the classifiers
VERIFICATION – Checking results
Description
The waterfall model is a linear sequential flow, in which
progress is seen as flowing steadily downwards (like a waterfall)
through the phases of software implementation. This means
that any phase in the development process begins only if the
previous phase is complete. The waterfall approach does not
define a process to go back to a previous phase to handle
changes in requirements. The waterfall approach is the earliest
approach that was used for software development.
CHAPTER- 4- PROJECT DESIGN

E-R Diagram
(Figure: the PassiveAggressiveClassifier is trained on training data and
predicts labels for news data in the testing set.)

USE CASE DIAGRAM

(Figure: news data flows through vectorization, classification, and
prediction.)
Sequence Diagram

USER / Program interactions:
1. Loading the data into the program
2. Data is displayed in tabular form
3. Applying the CountVectorizer
4. Vectorizing the data
5. Displaying the word count
6. Applying the PassiveAggressiveClassifier
7. Data is classified into groups
8. Testing the data
9. Prediction
10. The program returns the prediction for the data given for testing
Activity Diagram

(Figure: load the datasets → apply the CountVectorizer → get the word
count → apply the PassiveAggressiveClassifier → classification
(Fake / Real) → prediction (Fake / Real).)
CHAPTER-5 -PROJECT IMPLEMENTATION

PROCEDURE-
• Download and install Python on the system.
• Now install pip on the system.
• Now set the path in the environment variables.
• Now install numpy, pandas and scikit-learn using pip (note: the PyPI
package name is scikit-learn, not sklearn):
pip install numpy pandas scikit-learn

• Now install JupyterLab on the system and launch it:

C:\Users\DataFlair>jupyter lab

Now we shall import some libraries.


import pandas as pd
import numpy as np
import itertools
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

Data Exploration
To begin, we should always take a quick look at the data and get a
feel for its contents. To do so, use a Pandas DataFrame and check the
shape, head and apply any necessary transformations.
df = pd.read_csv('fake_or_real_news.csv')
df.shape
df.head()
df = df.set_index('Unnamed: 0')
df.head()

Extracting the training data


We separate the labels and set up the training and test datasets.
y = df.label
df = df.drop('label', axis=1)
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33,
random_state=53)

Building Vectorizer Classifiers


Word Counts with CountVectorizer
The CountVectorizer provides a simple way to tokenize a
collection of text documents, build a vocabulary of known words,
and encode new documents using that vocabulary. An encoded
vector is returned with a length equal to the size of the entire
vocabulary and an integer count for the number of times each word
appeared in the document.
Because these vectors will contain a lot of zeros, we call them
sparse. Python provides an efficient way of handling sparse vectors in
the scipy.sparse package.
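A minimal illustration of this sparsity, using scipy.sparse directly on a tiny hand-written matrix:

```python
from scipy.sparse import csr_matrix

# A mostly-zero "document-term" matrix, written out densely for clarity.
dense = [[0, 0, 3, 0],
         [1, 0, 0, 0]]
sparse = csr_matrix(dense)   # stores only the non-zero entries
print(sparse.nnz)            # 2: only 2 of the 8 entries are non-zero
print(sparse.toarray())      # round-trips back to the dense form
```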
Word Frequencies with TfidfVectorizer
Word counts are a good starting point, but are very basic. One
issue with simple counts is that some words like “the” will appear
many times and their large counts will not be very meaningful in the
encoded vectors.
An alternative is to calculate word frequencies, and by far the
most popular method is called TF-IDF. This is an acronym that stands
for “Term Frequency – Inverse Document Frequency”, the two
components of the resulting scores assigned to each word.
Term Frequency: This summarizes how often a given word appears
within a document.
Inverse Document Frequency: This downscales words that appear a
lot across documents.
Without going into the math, TF-IDF scores are word frequency
scores that try to highlight words that are more interesting, e.g.
frequent in a document but not across documents.
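As a small worked example of the down-weighting, scikit-learn's default (smoothed) IDF formula is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t (the final scores are then L2-normalized per document). Sketched in plain Python on a two-document toy corpus:

```python
import math

# Two tokenized toy documents.
docs = [["the", "cat"], ["the", "dog"]]
n = len(docs)

def idf(term):
    # df = number of documents containing the term.
    df = sum(term in doc for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

# "the" appears in both documents, so it gets the minimum weight;
# the rarer terms "cat" and "dog" are weighted higher.
print(idf("the"))  # ln(3/3) + 1 = 1.0
print(idf("cat"))  # ln(3/2) + 1 ≈ 1.405
```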
The TfidfVectorizer will tokenize documents, learn the
vocabulary and inverse document frequency weightings, and allow
you to encode new documents. Alternately, if you already have a
learned CountVectorizer, you can use it with a TfidfTransformer to just
calculate the inverse document frequencies and start encoding
documents.
The same create, fit, and transform process is used as with the
CountVectorizer.
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
tfidf_vectorizer.get_feature_names()[-10:]
count_vectorizer.get_feature_names()[:10]
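The alternative path mentioned above (a fitted CountVectorizer followed by a TfidfTransformer) produces the same scores as TfidfVectorizer with default settings, sketched here on a tiny invented corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ["fake news spreads fast", "real news is verified news"]

# Path 1: counts first, then convert counts to TF-IDF scores.
counts = CountVectorizer().fit_transform(docs)
via_transformer = TfidfTransformer().fit_transform(counts)

# Path 2: TfidfVectorizer does both steps in one go.
direct = TfidfVectorizer().fit_transform(docs)

print(np.allclose(via_transformer.toarray(), direct.toarray()))  # True
```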
Now that we have vectors, we can then take a look at the vector
features, stored in count_vectorizer and tfidf_vectorizer.
There are clearly comments, measurements, and other
nonsensical words, as well as multilingual articles, in the dataset that
we have been using. Normally, you would want to spend more time
preprocessing this and removing noise, but as this is a small
proof of concept, we will see if the model can overcome the noise and
classify properly despite these issues.

Count versus TF-IDF Features


We are curious if the count and TF-IDF vectorizers had extracted
different tokens. To take a look and compare features, we can extract
the vector information back into a DataFrame to use easy Python
comparisons.
As we can see by running the cells below, both vectorizers
extracted the same tokens, but obviously have different weights.
Likely, changing the max_df and min_df of the TF-IDF vectorizer could
alter the result and lead to different features in each.

count_df = pd.DataFrame(count_train.A,
columns=count_vectorizer.get_feature_names())
tfidf_df = pd.DataFrame(tfidf_train.A,
columns=tfidf_vectorizer.get_feature_names())
difference = set(count_df.columns) - set(tfidf_df.columns)
difference

print(count_df.equals(tfidf_df))
count_df.head()
tfidf_df.head()
Comparing Models
Here, we will begin with an NLP favorite, MultinomialNB. We can
use this to compare TF-IDF versus CountVectorizer. Our intuition was
that CountVectorizer would perform better with this model.
We personally find Confusion Matrices easier to compare and
read, so we used the scikit-learn documentation to build some easily-
readable confusion matrices. A confusion matrix shows the proper
labels on the main diagonal (top left to bottom right). The other cells
show the incorrect labels, often referred to as false positives or false
negatives. Depending on our problem, one of these might be more
significant. For example, for the fake news problem, is it more
important that we don't label real news articles as fake news? If so,
we might want to eventually weight our accuracy score to better
reflect this concern.
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
clf = MultinomialNB()
clf.fit(tfidf_train, y_train)
pred = clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
clf = MultinomialNB()
clf.fit(count_train, y_train)
pred = clf.predict(count_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])

Testing Linear Models


We'll test this approach (which has some significant speed
benefits but permanent learning disadvantages) with the fake news
dataset. Here we will be comparing the PassiveAggressiveClassifier and
MultinomialNB.
linear_clf = PassiveAggressiveClassifier(max_iter=50)
linear_clf.fit(tfidf_train, y_train)
pred = linear_clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
Introspecting models
Using your best performing classifier with your TF-IDF vector
dataset (tfidf_vectorizer) and Passive Aggressive classifier (linear_clf),
inspect the top 30 vectors for fake and real news:
def most_informative_feature_for_binary_classification(vectorizer, classifier, n=100):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    topn_class1 = sorted(zip(classifier.coef_[0], feature_names))[:n]
    topn_class2 = sorted(zip(classifier.coef_[0], feature_names))[-n:]

    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)

    print()

    for coef, feat in reversed(topn_class2):
        print(class_labels[1], coef, feat)

most_informative_feature_for_binary_classification(tfidf_vectorizer, linear_clf, n=30)

So, clearly there are certain words which might show political
intent and source in the top fake features (such as the words
corporate and establishment).
Also, the real news data uses forms of the verb "to say" more
often, likely because in newspapers and most journalistic publications
sources are quoted directly ("German Chancellor Angela Merkel
said...").
To extract the full list from your current classifier and take a look
at each (or easily compare tokens from classifier to classifier), you
can easily export it like so.
feature_names = tfidf_vectorizer.get_feature_names()
tokens_with_weights = sorted(list(zip(feature_names, linear_clf.coef_[0])))

Feasibility Report
A feasibility study is a high-level capsule version of the entire
process, intended to answer a number of questions: What is the
problem? Is there any feasible solution to the given problem? Is the
problem even worth solving? A feasibility study is conducted once the
problem is clearly understood. It is necessary to determine whether
the proposed system is feasible by considering technical, operational,
and economic factors. By having a detailed feasibility study, the
management will have a clear-cut view of the proposed system. The
following feasibilities are considered for the project in order to
ensure that the project is viable and does not have any major
obstructions. The feasibility study encompasses the following:
1. Technical Feasibility
2. Economic Feasibility
3. Operational Feasibility
In this phase, we study the feasibility of all proposed systems, and
pick the best feasible solution for the problem. The feasibility is
studied based on three main factors as follows
Technical Feasibility
In this step, we verify whether the proposed systems are technically
feasible or not. i.e., all the technologies required to develop the
system are available readily or not.
Technical Feasibility determines whether the organization has the
technology and skills necessary to carry out the project and how this
should be obtained. The system can be feasible because of the
following grounds:
• All necessary technology exists to develop the system.
• The system is highly flexible and can be expanded further.
• The system can guarantee accuracy, ease of use, reliability,
and data security.
• The system can give instant responses to inquiries.
Our project is technically feasible because, all the technology
needed for our project is readily available.
Operating System : Windows 7 or higher
Languages : PYTHON
Documentation Tool : MS - Word 2019

Economic Feasibility
Economically, this project is completely feasible because it
requires no extra financial investment and with respect to time, it’s
completely possible to complete this project in 6 months.
In this step, we verify which proposal is more economical. We
compare the financial benefits of the new system with the investment.
The new system is economically feasible only when the financial
benefits exceed the investment and expenditure. Economic
feasibility determines whether the project goal can be achieved within
the resource limits allocated to it. It must determine whether it is
worthwhile to proceed with the entire project, or whether the benefits
obtained from the new system are not worth the costs. The financial
benefits must equal or exceed the costs. In this issue, we should
consider:
• The cost to conduct a full system investigation.
• The cost of h/w and s/w for the class of application being
considered.
• The development tool.
• The cost of maintenance etc...
Our project is economically feasible because the cost of
development is very minimal when compared to financial benefits of
the application.
Operational Feasibility
In this step, we verify different operational factors of the
proposed systems like man-power, time etc., whichever solution uses
less operational resources, is the best operationally feasible solution.
The solution should also be operationally possible to implement.
Operational feasibility determines whether the proposed system
satisfies user objectives and can be fitted into the current system
operation.
• The methods of processing and presentation are
completely accepted by the clients since they can meet all
user requirements.
• The clients have been involved in the planning and
development of the system.
• The proposed system will not cause any problem under any
circumstances.
Our project is operationally feasible because the time
requirements and personnel requirements are satisfied. We are a
team of four members and we worked on this project for three
working months.
CHAPTER-6-CODING AND TESTING
CODE
import pandas as pd
import numpy as np
import itertools
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import matplotlib.pyplot as plt
%pylab inline

df = pd.read_csv('fake_or_real_news.csv')
df.shape
Output: (6335, 4)

df.head()

df = df.set_index('Unnamed: 0')
df.head()

y = df.label
df = df.drop('label', axis=1)
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33, random_state=53)

count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names())
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())

difference = set(count_df.columns) - set(tfidf_df.columns)
difference
print(count_df.equals(tfidf_df))
count_df.head()
tfidf_df.head()

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
clf = MultinomialNB()
clf.fit(tfidf_train, y_train)
pred = clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])

clf = MultinomialNB()
clf.fit(count_train, y_train)
pred = clf.predict(count_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])

linear_clf = PassiveAggressiveClassifier()
linear_clf.fit(tfidf_train, y_train)
pred = linear_clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])

def most_informative_feature_for_binary_classification(vectorizer, classifier, n=100):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    topn_class1 = sorted(zip(classifier.coef_[0], feature_names))[:n]
    topn_class2 = sorted(zip(classifier.coef_[0], feature_names))[-n:]
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print()
    for coef, feat in reversed(topn_class2):
        print(class_labels[1], coef, feat)

most_informative_feature_for_binary_classification(tfidf_vectorizer, linear_clf, n=30)


TESTING

As the project is fairly large in scale, testing is essential to
make it successful. If each component works properly in all respects and
gives the desired output for all kinds of inputs, then the project is said
to be successful. So, the conclusion is: to make the project successful,
it needs to be tested.
A series of tests is carried out on the proposed system
before the system is ready for user acceptance testing.
The steps involved in testing are:

Unit Testing
Unit testing focuses verification efforts on the smallest unit of
the software design: the module. This is also known as “module
testing”. The modules are tested separately. This testing is carried out
during the programming stage itself. In this testing, each module was
found to be working satisfactorily with regard to the expected output
from the module.

Integration Testing
Data can be lost across an interface; one module can have
adverse effects on another. Integration testing is the systematic testing
for constructing the program structure while at the same time conducting
tests to uncover errors associated with the interfaces. The
objective is to take unit-tested modules and build a program structure.
All the modules are combined and tested as a whole. Here, correction
is difficult because the isolation of causes is complicated by the vast
expanse of the entire program. Thus, in the integration testing step,
all the errors uncovered are corrected before the next testing steps.
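As an illustration (again, not from the original project code), an integration test for this project could exercise the unit-tested vectorizer and classifier modules together on a tiny hand-made dataset, checking only the interface between them rather than accuracy. The texts and labels below are made up for the sketch.

```python
# Integration-test sketch: vectorizer and classifier exercised together.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Tiny hypothetical dataset, invented for illustration
texts = ["aliens built the pyramids overnight",
         "city council approves new budget",
         "miracle cure erases age instantly",
         "local team wins regional final"]
labels = ["FAKE", "REAL", "FAKE", "REAL"]

vec = TfidfVectorizer(stop_words='english')
X_train = vec.fit_transform(texts)

clf = PassiveAggressiveClassifier(random_state=0)
clf.fit(X_train, labels)

# The combined structure must carry raw text through both modules and
# emit exactly one label per document, drawn from the training labels.
pred = clf.predict(vec.transform(["budget approved by the council"]))
assert pred.shape == (1,)
assert pred[0] in {"FAKE", "REAL"}
print("integration test passed")
```

Errors such as a shape mismatch between the vectorizer's output and the classifier's input would surface here, even when each module passed its unit tests in isolation.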
System testing
System testing is the stage of implementation that is aimed at
ensuring that the system works accurately and efficiently before live
operation commences. Testing is vital to the success of the system.
System testing makes the logical assumption that if all parts of the
system are correct, the goal will be successfully achieved.

Validation Testing
At the conclusion of integration testing, the software is completely
assembled as a package, interfacing errors have been uncovered and
corrected, and a final series of software tests, validation testing, begins.
Validation testing can be defined in many ways, but the simple
definition is that validation succeeds when the software functions in a
manner that can reasonably be expected by the customer. After
validation testing has been conducted, one of two possible conditions
exists.
Either the function or performance characteristics conform to
specifications and are accepted, or a deviation from specification is
uncovered and a deficiency list is created. The proposed
system under consideration has been tested using validation
testing and found to be working satisfactorily.

Output Testing
After performing validation testing, the next step is output
testing of the proposed system, since no system can be useful if it
does not produce the required output in the specified format. The
outputs generated by the system under consideration are tested by
asking the users about the format they require. Here the output format
is considered in two ways: one is on screen and the other is the printed
format. The on-screen output format is found to be correct, as the
format was designed in the system design phase according to
user needs.
For the hard copy also, the output meets the requirements
specified by the users. Hence, output testing did not result in any
corrections to the system.

User Acceptance Testing


User acceptance of a system is the key factor in the success of
any system. The system under study was tested for user acceptance
by constantly keeping in touch with prospective system users during
development and making changes wherever required.
CHAPTER-7-ADVANTAGES AND DISADVANTAGES

Advantages
If we use the traditional method of cross-checking the
data manually, it takes a lot of time. Here we instead use machine
learning algorithms to check whether the data is fake or real, which
saves a lot of time. If we provide more data, the accuracy of the
algorithm also increases, and the time it takes to classify the data
is very small compared to humans.
Disadvantages
The one and only disadvantage of this project is that if we use
the algorithms for prediction, we may not get accurate results
when we use small datasets. So, we need a large dataset to
get good accuracy.
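The effect described above can be illustrated with a small, self-contained experiment (the dataset here is synthetic, generated by scikit-learn, not the project's news data): train the same classifier on increasingly large slices of the training set and compare held-out accuracy.

```python
# Illustrative sketch: accuracy of the same classifier at growing training sizes.
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a text-classification dataset
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for n in (50, 200, 1400):  # growing training-set sizes
    clf = PassiveAggressiveClassifier(random_state=0)
    clf.fit(X_tr[:n], y_tr[:n])
    scores[n] = accuracy_score(y_te, clf.predict(X_te))
    print(f"train size {n:5d} -> accuracy {scores[n]:.3f}")
```

On most runs the accuracy with 1,400 training examples is noticeably higher than with 50, which is exactly the behaviour described above; the exact numbers vary with the data and the random seed.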
CHAPTER-8-PROJECT CONCLUSION

The aim of the project “FAKE NEWS DETECTION USING PYTHON”
is to detect fake news using machine learning algorithms. We
used CountVectorizer and TfidfVectorizer for the vectorization
process, and PassiveAggressiveClassifier for the classification
process. The classifier outputs whether the data is real or fake.

CHAPTER-9-BIBLIOGRAPHY

• The Practical Academic (merging the best research in text analytics with practical and commercial perspectives), “How can Machine Learning and AI help solving the fake news problem?” https://miguelmalvarez.com/2017/03/23/how-can-machine-learning-and-ai-help-solving-the-fake-news-problem/
• Katharine Jarmul, “Detecting Fake News with Scikit-Learn”, DataCamp. https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
• GitHub, “Detecting Fake News with Scikit-Learn”. https://github.com/docketrun/Detecting-Fake-News-with-Scikit-Learn
• N. J. Conroy, V. L. Rubin, and Y. Chen, “Automatic deception detection: Methods for finding fake news,” Proceedings of the Association for Information Science and Technology, vol. 52, no. 1, pp. 1–4, 2015.
