Fake News Detection
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND SYSTEMS ENGINEERING
Submitted by
This is to certify that the project titled "FAKE NEWS DETECTION
USING PYTHON" is a certified record of work done by R. Aravind
(316506408006), student of Bachelor of Technology (B.Tech) in the Department
of Computer Science & Systems Engineering, Andhra University College of
Engineering (A), Andhra University, Visakhapatnam, during the period 2016-2020
in partial fulfilment of the requirements for the award of Bachelor of
Technology. This work has not been submitted to any other university for the
award of a Degree or Diploma.
R. ARAVIND (316506408006)
ABSTRACT
This research surveys the current state‐of‐the‐art technologies that are instrumental in the
adoption and development of fake news detection. “Fake news detection” is defined as the
task of categorizing news along a continuum of veracity, with an associated measure of
certainty. Veracity is compromised by the occurrence of intentional deceptions.
The nature of online news publication has changed, such that traditional fact checking and
vetting from potential deception is impossible against the flood arising from content
generators, as well as various formats and genres.
This project surveys several varieties of veracity assessment methods within
one major category: linguistic cue approaches combined with machine learning.
We see promise in an innovative hybrid approach that combines linguistic cues
with machine learning. Although designing a fake news detector is not a
straightforward problem, we propose operational guidelines for a feasible fake
news detection system.
TABLE OF CONTENTS
CHAPTER-1-PROJECT OVERVIEW
Introduction
Purpose and Benefits
Modules and Description
Existing System & Proposed System
CHAPTER-2-PROJECT ANALYSIS
Project Analysis
CHAPTER-3-PROJECT LIFECYCLE
Waterfall Model
CHAPTER-4-PROJECT DESIGN
E-R Diagram
Use Case Diagram
Sequence Diagram
Activity Diagram
Data Flow Diagram
System Architecture
CHAPTER-5-PROJECT IMPLEMENTATION
Project Implementation Procedure
Feasibility Report
CHAPTER-6-CODING AND TESTING
Testing
Levels of Testing
Test Cases
CHAPTER-7-ADVANTAGES AND DISADVANTAGES
Advantages
Disadvantages
CHAPTER-8-CONCLUSION
Project Conclusion
CHAPTER-9-BIBLIOGRAPHY
Website links
Acknowledgement
We express our deepest gratitude towards our project guide for her
valuable and timely advice during the various phases of our project. We would
also like to thank her for providing us with all proper facilities and support as
the project coordinator. We would like to thank her for her support, patience,
and faith in our capabilities, and for giving us flexibility in terms of working
and reporting schedules.
We would like to thank all our friends for their friendship, which made
college life enjoyable and memorable, and our family members, who always
stood beside us and provided the most important moral support.
Finally, we would like to thank everyone who has helped us directly or
indirectly in our project.
CHAPTER – 1 – PROJECT OVERVIEW
INTRODUCTION
Fake news is a term that has been used to describe very different
issues, from satirical articles to completely fabricated news and plain
government propaganda in some outlets. Fake news, information
bubbles, news manipulation and the lack of trust in the media are
growing problems with huge ramifications in our society.
First, fake news is intentionally written to mislead readers to
believe false information, which makes it difficult and nontrivial to
detect based on news content; therefore, we need to include auxiliary
information, such as user social engagements on social media, to help
make a determination.
However, in order to start addressing this problem, we need to
have an understanding of what fake news is. Only then can we look
into the different techniques and fields of machine learning (ML),
natural language processing (NLP), and artificial intelligence (AI) that
could help us fight this situation.
Second, exploiting this auxiliary information is challenging in and
of itself, as users' social engagements with fake news produce data
that is big, incomplete, unstructured, and noisy. This makes the issue
of fake news detection on social media both challenging and relevant.
Social media for news consumption is a double-edged sword. On
the one hand, its low cost, easy access, and rapid dissemination of
information lead people to seek out and consume news from social
media. On the other hand, it enables the wide spread of fake news,
i.e., low quality news with intentionally false information.
Purpose and benefits
Proposed system
News is a game changer in every aspect of life: we can hide anything
from the world, and we can show anything to the world, through news. If
news is misused, it creates a huge impact on society. News can be
manipulated, and it is easier to spread fake news than real news. When
people read fake news, they start believing it is real. That is where the
actual problem starts.
We need a real-time solution to the problem of fake news.
It is impossible to do this manually: there is a vast number of news
records in the world, and there will be even more in the future. So, we
need a tool which can do this work efficiently.
For that reason, we have started a project to detect fake news
using machine learning tools. The results are not a hundred percent
accurate, but we can reach about 93% accuracy using these tools.
CHAPTER -3- PROJECT LIFECYCLE
Waterfall Model
The phases of the model as applied to this project:
REQUIREMENT - requirements for detecting fake news
LOADING DATA SETS - loading the data sets
VERIFICATION - checking the results
Description
The waterfall model is a linear sequential flow in which
progress is seen as flowing steadily downwards (like a waterfall)
through the phases of software implementation. This means
that any phase in the development process begins only after the
previous phase is complete. The waterfall approach does not
define a process to go back to a previous phase to handle
changes in requirements. The waterfall approach is the earliest
approach used for software development.
CHAPTER- 4- PROJECT DESIGN
E-R Diagram
[Figure: news data -> vectorization -> PassiveAggressiveClassifier
(trained on training data, evaluated on testing data) -> classification
-> prediction]
Sequence Diagram
[Figure: sequence diagram - the USER invokes the program, which loads
the datasets, applies the CountVectorizer to get the word counts,
applies the Passive Aggressive classifier, and returns a classification
and prediction of fake or real.]
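The flow in the diagrams above can be sketched end-to-end in a few lines. This is a minimal, self-contained sketch: a tiny inline dataset stands in for the fake_or_real_news.csv file used in the implementation chapter, and the example texts are invented for illustration.

```python
# Minimal sketch of the pipeline in the diagrams above:
# news data -> vectorization -> PassiveAggressiveClassifier -> prediction.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Tiny invented dataset standing in for fake_or_real_news.csv.
df = pd.DataFrame({
    'text': [
        'Shocking miracle cure doctors hate revealed',
        'Senate passes budget bill after lengthy debate',
        'Aliens secretly control the world banks',
        'Central bank raises interest rates by a quarter point',
    ],
    'label': ['FAKE', 'REAL', 'FAKE', 'REAL'],
})

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])      # text -> tf-idf features
clf = PassiveAggressiveClassifier(max_iter=50, random_state=53)
clf.fit(X, df['label'])                       # train on labelled news

pred = clf.predict(vectorizer.transform(['Miracle cure doctors hate']))
print(pred[0])
```

In the real project the same three steps run on the full CSV, with a train/test split before fitting.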
CHAPTER-5 -PROJECT IMPLEMENTATION
PROCEDURE-
• Download and install python in the system.
• Now install pip in the system.
• Now set the path in the environmental variables.
• Now install numpy, pandas, and scikit-learn using pip.
pip install numpy pandas scikit-learn
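A quick way to confirm the installation succeeded (note that the pip package name is scikit-learn, but it is imported as sklearn):

```python
# Sanity check: the packages installed above should import cleanly.
# The pip name 'scikit-learn' corresponds to the import name 'sklearn'.
import numpy
import pandas
import sklearn

print("numpy", numpy.__version__)
print("pandas", pandas.__version__)
print("scikit-learn", sklearn.__version__)
```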
Data Exploration
To begin, we should always take a quick look at the data and get a
feel for its contents. To do so, use a Pandas DataFrame and check the
shape, head and apply any necessary transformations.
df = pd.read_csv('fake_or_real_news.csv')
df.shape
df.head()
df = df.set_index('Unnamed: 0')
df.head()
count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names())
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())
difference = set(count_df.columns) - set(tfidf_df.columns)
difference
print(count_df.equals(tfidf_df))
count_df.head()
tfidf_df.head()
Comparing Models
Here, we will begin with an NLP favorite, MultinomialNB. We can
use this to compare TF-IDF versus CountVectorizer. Our intuition was
that CountVectorizer would perform better with this model.
We personally find confusion matrices easier to compare and
read, so we used the scikit-learn documentation to build some easily
readable confusion matrices. A confusion matrix shows the correct
labels on the main diagonal (top left to bottom right). The other cells
show the incorrect labels, often referred to as false positives or false
negatives. Depending on the problem, one of these might be more
significant. For example, for the fake news problem, is it more
important that we don't label real news articles as fake news? If so,
we might want to eventually weight our accuracy score to better
reflect this concern.
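That trade-off can also be measured directly with precision and recall on the 'FAKE' class, rather than a weighted accuracy. A small sketch with hypothetical predictions:

```python
# Precision/recall make the false-positive concern explicit: treating 'FAKE'
# as the positive class, precision drops whenever REAL articles are flagged FAKE,
# and recall drops whenever FAKE articles slip through as REAL.
from sklearn import metrics

y_true = ['FAKE', 'FAKE', 'REAL', 'REAL', 'REAL']
y_pred = ['FAKE', 'REAL', 'REAL', 'FAKE', 'REAL']  # hypothetical classifier output

precision = metrics.precision_score(y_true, y_pred, pos_label='FAKE')
recall = metrics.recall_score(y_true, y_pred, pos_label='FAKE')
print("precision: %0.2f  recall: %0.2f" % (precision, recall))  # 0.50  0.50
```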
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
clf = MultinomialNB()
clf.fit(tfidf_train, y_train)
pred = clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
clf = MultinomialNB()
clf.fit(count_train, y_train)
pred = clf.predict(count_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
def most_informative_feature_for_binary_classification(vectorizer, classifier, n=100):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    topn_class1 = sorted(zip(classifier.coef_[0], feature_names))[:n]
    topn_class2 = sorted(zip(classifier.coef_[0], feature_names))[-n:]
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print()
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat)

most_informative_feature_for_binary_classification(tfidf_vectorizer, linear_clf, n=30)
So, clearly there are certain words which might show political
intent and source in the top fake features (such as the words
corporate and establishment).
Also, the real news data uses forms of the verb "to say" more
often, likely because in newspapers and most journalistic publications
sources are quoted directly ("German Chancellor Angela Merkel
said...").
To extract the full list from your current classifier and take a look
at each token (or compare tokens from classifier to classifier), you
can export it like so.
tokens_with_weights = sorted(list(zip(feature_names, clf.coef_[0])))
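As a sketch of what that export enables, the extremes of the sorted list reveal the tokens the linear classifier leans on most for each class. The weights below are hypothetical, chosen for illustration; real values come from clf.coef_[0] after training.

```python
# With (token, weight) pairs in hand, sorting by weight and taking the
# extremes shows the most FAKE-leaning and REAL-leaning tokens.
# Hypothetical weights; in this tutorial's sign convention, negative
# coefficients lean toward the FAKE class.
tokens_with_weights = [
    ('corporate', -6.2), ('establishment', -5.8), ('shocking', -5.1),
    ('said', 4.9), ('reuters', 5.5), ('tuesday', 6.0),
]
by_weight = sorted(tokens_with_weights, key=lambda tw: tw[1])
print('most FAKE-leaning:', [t for t, w in by_weight[:3]])
print('most REAL-leaning:', [t for t, w in by_weight[-3:]])
```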
Feasibility Report
A feasibility study is a high-level capsule version of the entire
process, intended to answer a number of questions: What is the
problem? Is there any feasible solution to the given problem? Is the
problem even worth solving? The feasibility study is conducted once
the problem is clearly understood. It is necessary to determine that
the proposed system is feasible by considering the technical,
operational, and economic factors. With a detailed feasibility study,
the management will have a clear-cut view of the proposed system.
The following feasibilities are considered for the project in order to
ensure that the project is viable and does not have any major
obstructions. The feasibility study encompasses the following:
1. Technical Feasibility
2. Economic Feasibility
3. Operational Feasibility
In this phase, we study the feasibility of all proposed systems and
pick the best feasible solution for the problem. The feasibility is
studied based on the following three main factors.
Technical Feasibility
In this step, we verify whether the proposed systems are technically
feasible, i.e., whether all the technologies required to develop the
system are readily available.
Technical feasibility determines whether the organization has the
technology and skills necessary to carry out the project and how
these should be obtained. The system is feasible on the following
grounds:
• All necessary technology exists to develop the system.
• The system is highly flexible and can be expanded further.
• The system can guarantee accuracy, ease of use,
reliability, and data security.
• The system can give instant responses to inquiries.
Our project is technically feasible because, all the technology
needed for our project is readily available.
Operating System : Windows 7 or higher
Languages : PYTHON
Documentation Tool : MS - Word 2019
Economic Feasibility
Economically, this project is completely feasible because it
requires no extra financial investment and with respect to time, it’s
completely possible to complete this project in 6 months.
In this step, we verify which proposal is more economical. We
compare the financial benefits of the new system with the investment.
The new system is economically feasible only when the financial
benefits exceed the investments and expenditure. Economic
feasibility determines whether the project goal can be met within the
resource limits allocated to it. It must determine whether it is
worthwhile to proceed with the entire project and whether the
benefits obtained from the new system are worth the costs. Financial
benefits must equal or exceed the costs. Here, we should
consider:
• The cost to conduct a full system investigation.
• The cost of h/w and s/w for the class of application being
considered.
• The cost of development tools.
• The cost of maintenance, etc.
Our project is economically feasible because the cost of
development is very minimal when compared to financial benefits of
the application.
Operational Feasibility
In this step, we verify different operational factors of the
proposed systems like man-power, time etc., whichever solution uses
less operational resources, is the best operationally feasible solution.
The solution should also be operationally possible to implement.
Operational feasibility determines whether the proposed system
satisfies user objectives and can be fitted into the current system
operation.
• The methods of processing and presentation are
completely accepted by the clients since they meet all
user requirements.
• The clients have been involved in the planning and
development of the system.
• The proposed system is not expected to cause problems
under normal circumstances.
Our project is operationally feasible because the time
requirements and personnel requirements are satisfied. We are a
team of four members and we worked on this project for three
working months.
CHAPTER- 6 -CODING AND TESTING
CODE
import pandas as pd
import numpy as np
import itertools
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('fake_or_real_news.csv')
df.shape
OUTPUT-(6335, 4)
df.head()
df = df.set_index('Unnamed: 0')
df.head()
y = df.label
df = df.drop('label', axis=1)
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33, random_state=53)
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
clf = MultinomialNB()
clf.fit(tfidf_train, y_train)
pred = clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
clf = MultinomialNB()
clf.fit(count_train, y_train)
pred = clf.predict(count_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
linear_clf = PassiveAggressiveClassifier()
linear_clf.fit(tfidf_train, y_train)
pred = linear_clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
Unit Testing
Unit testing focuses verification efforts on the smallest unit of
the software design, the module. This is also known as "module
testing". The modules are tested separately. This testing is carried
out during the programming stage itself. In this testing, each module
was found to be working satisfactorily with regard to the expected
output from the module.
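As a sketch, a unit test for one of this project's modules, the vectorization step, could look like the following, using Python's built-in unittest framework; the specific checks shown are illustrative, not taken from the project.

```python
# Unit-testing one module in isolation: the vectorization step.
# These checks verify expected CountVectorizer behaviour without touching
# the classifier or the dataset.
import unittest
from sklearn.feature_extraction.text import CountVectorizer

class TestVectorization(unittest.TestCase):
    def test_stop_words_removed(self):
        vec = CountVectorizer(stop_words='english')
        vec.fit(['the fake news story'])
        self.assertNotIn('the', vec.vocabulary_)   # stop word dropped
        self.assertIn('fake', vec.vocabulary_)

    def test_word_counts(self):
        vec = CountVectorizer()
        counts = vec.fit_transform(['news news fake'])
        self.assertEqual(counts.sum(), 3)          # three tokens counted

if __name__ == '__main__':
    unittest.main(exit=False)
```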
Integration Testing
Data can be lost across an interface; one module can have
adverse effects on another. Integration testing is systematic testing
for constructing the program structure while at the same time
conducting tests to uncover errors associated with the interface. The
objective is to take unit-tested modules and build a program structure.
All the modules are combined and tested as a whole. Here, correction
is difficult because isolation of the cause is complicated by the vast
expanse of the entire program. Thus, in the integration testing step,
all the errors uncovered are corrected for the next testing steps.
System testing
System testing is the stage of implementation that is aimed at
ensuring that the system works accurately and efficiently before live
operation commences. Testing is vital to the success of the system.
System testing makes the logical assumption that if all the parts of the
system are correct, the goal will be successfully achieved.
Validation Testing
At the conclusion of integration testing, the software is completely
assembled as a package, interfacing errors have been uncovered and
corrected, and a final series of software tests, validation testing,
begins. Validation testing can be defined in many ways, but a simple
definition is that validation succeeds when the software functions in a
manner that can reasonably be expected by the customer. After
validation testing has been conducted, one of two possible conditions
exists: either the function or performance characteristics conform to
specifications and are accepted, or a deviation from specification is
uncovered and a deficiency list is created. The proposed system under
consideration has been tested using validation testing and found to be
working satisfactorily.
Output Testing
After validation testing, the next step is output testing of the
proposed system, since no system can be useful if it does not produce
the required output in the specified format. The outputs generated by
the system under consideration are tested by asking the users about
the format they require. The output format is considered in two ways:
one is on screen, and the other is the printed format. The output
format on the screen was found to be correct, as the format was
designed in the system design phase according to user needs.
For the hard copy also, the output matches the requirements
specified by the users. Hence, output testing did not result in any
corrections to the system.
Advantages
If we use the traditional method of cross-checking the data
manually, it takes a lot of time. Here, instead, we use machine
learning algorithms to check whether the data is fake or real, which
saves a lot of time. If we provide more data, the accuracy of the
algorithm also increases. And the time it takes to classify the data is
very small compared to humans.
Disadvantages
The main disadvantage of this project is that if we use the
algorithms for prediction, we may not get accurate results when we
use small data sets. So, we need a lot of data to get good accuracy.
CHAPTER-8-PROJECT CONCLUSION
CHAPTER- 9 -BIBLIOGRAPHY