
BHARATI VIDYAPEETH'S COLLEGE OF ENGINEERING

A-4, PASCHIM VIHAR, ROHTAK ROAD, NEW DELHI - 110063

AFFILIATED TO

GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI

Summer Training Report


On
Machine Learning with Python

BACHELOR OF TECHNOLOGY
IN
INSTRUMENTATION AND CONTROL
ENGINEERING

Submitted by
Anshul Jain (00711503020)
ACKNOWLEDGEMENT
The success and final outcome of learning Machine Learning required a lot of guidance
and assistance from many people, and I am extremely privileged to have received it
throughout my course and projects. All that I have done is due to such supervision and
assistance, and I would like to thank everyone who provided it.
I respect and thank SOFCON Delhi for giving me the opportunity to do the course and
project work, and for the support and guidance that helped me complete the course on
time. I am extremely thankful to the course advisor, Mr. Krishna.
I am also thankful for, and fortunate to have received, constant encouragement, support
and guidance from all the teaching staff of SOFCON India Pvt. Ltd., which helped me
successfully complete my course and project work.
DECLARATION
I hereby declare that I have completed my six weeks of summer training at SOFCON
INDIA PVT. LTD., Delhi, from 1st August, 2022 to 9th September, 2022 under the
guidance of Mr. Krishna. I declare that I have worked with full dedication during
these six weeks of training and that my learning outcomes fulfill the requirements of
training for the award of the degree of Bachelor of Technology (B.Tech.) in ICE, Bharati
Vidyapeeth's College of Engineering, New Delhi, India.
Training Certificate
CONTENTS

Candidate’s Declaration
Certificate
Training completion certificate
Abstract

1. Introduction
1.1 Python
1.2 Machine learning
1.3 Types of ML
1.3.1 Supervised
1.3.2 Unsupervised
1.3.3 Reinforcement
1.3.4 Semi-supervised
1.4 Artificial intelligence
1.5 Problem Statement
1.6 Aim and Objective

2. Basic theoretical concepts
2.1 Python modules
2.2 Fake news
2.3 TF and IDF vectorizer
2.4 Passive aggressive classifiers

3. Proposed work
3.1 Detecting fake news
3.1.1 Datasets
3.2 Project prerequisites
3.3 Necessary imports
3.4 Converting data to data frames
3.5 Getting labels
3.6 Splitting data
3.6.1 Training data
3.6.2 Testing data
3.7 Initializing TfidfVectorizer
3.8 Passive aggressive classifiers

4. Conclusion
4.1 Results
4.2 Future work
1 Introduction
1.1 Python
• Python is an interpreted, object-oriented, high-level programming
language.
• Python can be used for developing websites and software, task
automation, data analysis, and data visualization. Since it's
relatively easy to learn, Python has been adopted by many
non-programmers, such as accountants and scientists, for a variety of
everyday tasks like organizing finances.
• Python is a popular programming language because it is easy to
learn.

1.2 Machine Learning


• Machine learning (ML) is a type of artificial intelligence (AI) that
allows software applications to become more accurate at predicting
outcomes without being explicitly programmed to do so.
• Machine learning enables computers to learn from past experience,
much as people do, by training them on historical data and then
using what they have learned to make predictions on new data.
• Training is the most important part of machine learning, so features
and hyperparameters should be chosen carefully.

1.3 Types:
1.3.1 SUPERVISED LEARNING
• Supervised learning is applicable when a machine has
sample data, i.e., input data together with correctly labeled
output data against which the model can be checked.
• This technique helps us predict future events with the
help of past experience and labeled examples.
• During learning, the algorithm measures the error between
its predictions and the true labels and adjusts the model
to reduce that error.

1.3.2 UNSUPERVISED LEARNING


• In unsupervised learning, a machine is trained with
input samples only; no labels are provided, so the
correct output is unknown.
• Although unsupervised learning is less common in
practical business settings, it helps in exploring the data
and can draw inferences from datasets to describe
hidden structures in unlabeled data.

1.3.3 REINFORCEMENT LEARNING


• Reinforcement learning is a feedback-based machine
learning technique.
• Computer programs (agents) perform actions and receive
rewards as feedback: for each good action they get a
positive reward, and for each bad action they get a
negative reward.
• The goal of a reinforcement learning agent is to
maximize the positive rewards.

1.3.4 SEMI - SUPERVISED LEARNING


• Semi-supervised learning is an intermediate technique
between supervised and unsupervised learning: it trains on
a small amount of labeled data together with a larger
amount of unlabeled data.
• Semi-supervised learning helps data scientists
overcome the drawbacks of supervised and unsupervised
learning.
1.4 Artificial Intelligence
• With the help of AI, you can create software or devices that
solve real-world problems in areas such as health, marketing,
and traffic easily and accurately.
• With the help of AI, you can create your own personal virtual
assistant, similar to Cortana, Google Assistant, or Siri.
• With the help of AI, you can build robots that can work in
environments where human survival would be at risk.
• AI opens a path for other new technologies, new devices, and new
opportunities.

1.5 PROBLEM STATEMENT


Detecting fake news using Python with machine learning.
Working:

The problem can be broken down into 3 statements:

1. Use NLP to check the authenticity of a news article.
2. If the user has a query about the authenticity of
a search query, he/she can search directly on our
platform, and our custom algorithm outputs a
confidence score.
3. Check the authenticity of a news source.
These statements have been implemented as search fields that
take inputs in 3 different forms in our implementation of the
problem statement.
1.6 Aim and objective
The aim and objective of this advanced Python project is to
detect fake news, so that real and fake news can be told
apart easily.

2. BASIC THEORETICAL CONCEPTS


2.1 Python modules

NumPy is a Python library used for working with arrays. It
also has functions for working in the domains of linear algebra,
Fourier transforms, and matrices. NumPy was created in 2005
by Travis Oliphant. It is an open source project and you can
use it freely. NumPy stands for Numerical Python.

pandas is a Python package that provides fast, flexible, and
expressive data structures designed to make working with
"relational" or "labeled" data both easy and intuitive. It aims
to be the fundamental high-level building block for doing
practical, real world data analysis in Python. Additionally, it
has the broader goal of becoming the most powerful and
flexible open source data analysis / manipulation tool
available in any language. It is already well on its way
towards this goal.

2.2 What is fake news?


Fake news is false or misleading information presented as
news. Fake news often has the aim of damaging the reputation
of a person or entity, or making money through advertising
revenue. Although false news has always been spread
throughout history, the term "fake news" was first used in the
1890s when sensational reports in newspapers were common.
Nevertheless, the term does not have a fixed definition and has
been applied broadly to any type of false information.

2.3 What is a TfidfVectorizer?


TF (Term Frequency): The number of times a word appears
in a document is its term frequency. A higher value means the
term appears more often than others, so the document is
a good match when the term is part of the search terms.

IDF (Inverse Document Frequency): Words that occur many
times in a document, but also occur many times in many other
documents, may be irrelevant. IDF is a measure of how significant
a term is in the entire corpus: the fewer documents a term
appears in, the higher its IDF.
The TfidfVectorizer converts a collection of raw documents
into a matrix of TF-IDF features.
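
As a rough illustration of the two quantities above, the following sketch computes TF-IDF by hand for a tiny made-up corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is what scikit-learn documents as its default, but it skips the row normalization that TfidfVectorizer also applies, so the numbers will not match the library's output exactly. The corpus and the function names are invented for this example.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); each document is a list of tokens.
docs = [
    "the election was rigged says blog".split(),
    "the election results were certified".split(),
    "the weather was sunny".split(),
]


def term_frequency(doc):
    """Raw count of each term in one document."""
    return Counter(doc)


def inverse_document_frequency(docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, with df = documents containing the term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(docs):
    """One {term: weight} dictionary per document (no normalization)."""
    idf = inverse_document_frequency(docs)
    return [{t: tf * idf[t] for t, tf in term_frequency(doc).items()} for doc in docs]


weights = tfidf(docs)
for doc_weights in weights:
    # Highest-weighted terms first.
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

In this toy corpus "the" occurs in every document, so it receives the lowest possible weight, while a word such as "rigged" that occurs in only one document is weighted higher.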

2.4 What is a Passive Aggressive Classifier?


Passive Aggressive algorithms are online learning algorithms.
Such an algorithm remains passive for a correct classification
outcome, and turns aggressive in the event of a
misclassification, updating and adjusting its weights. Unlike
most other algorithms, it does not converge. Its purpose is to
make updates that correct the loss while causing very little
change in the norm of the weight vector.
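
The passive and aggressive behaviour can be sketched in a few lines. The code below is a minimal illustration of one common formulation of the update (hinge loss with a step size capped by a constant C, often called PA-I), written for labels encoded as -1 and +1. It is not the scikit-learn implementation, and the helper names dot and pa_update are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pa_update(w, x, y, C=1.0):
    """One passive-aggressive step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on this example
    if loss == 0.0:
        return w  # passive: the margin is already satisfied
    norm_sq = dot(x, x)
    if norm_sq == 0.0:
        return w  # an all-zero feature vector gives nothing to update on
    tau = min(C, loss / norm_sq)  # aggressive: step size, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)  # wrong side of the margin: weights change
w_after_first = w
w = pa_update(w, [2.0, 0.0], +1)  # classified with enough margin: no change
w_after_second = w
w = pa_update(w, [0.0, 1.0], -1)  # margin violated again: weights change
print(w_after_first, w_after_second, w)
```

Each update is the smallest change to the weights that repairs the loss on the current example (up to the cap C), which is why the norm of the weight vector changes very little from step to step.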
3. PROPOSED WORK

3.1 Detecting Fake News with Python


The goal is to build a model that accurately classifies a piece
of news as REAL or FAKE.
This advanced Python project deals with fake and real news.
Using sklearn, we build a TfidfVectorizer on our dataset.
Then, we initialize a PassiveAggressiveClassifier and fit the
model. In the end, the accuracy score and the confusion matrix
tell us how well our model fares.

3.1.1 The fake news Dataset


The dataset we’ll use for this Python project is a file we’ll
call news.csv. This dataset has a shape of 7796×4. The first
column identifies the news, the second and third are the title
and text, and the fourth column has labels denoting whether
the news is REAL or FAKE.

3.2 Project Prerequisites


You’ll need to install the following libraries with pip:
pip install numpy pandas scikit-learn
You’ll also need Jupyter Lab to run your code. Open a
command prompt and run the following command:
C:\Users\DataFlair>jupyter lab
A new browser window will open; create a new
console and use it to run your code. To run multiple lines of
code at once, press Shift+Enter.
3.3 Make necessary imports
import itertools
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

3.4 Putting data into data frames


Let’s read the data into a DataFrame, and get the shape of the
data and the first 5 records.
#Read the data
df=pd.read_csv('D:\\DataFlair\\news.csv')

#Get shape and head
df.shape
df.head()
3.5 Getting labels
#DataFlair - Get the labels
labels=df.label
labels.head()

3.6 Splitting of data


Splitting our modelling dataset into training and testing
samples is one of the earliest pre-processing steps
that we need to undertake. Creating separate samples
for training and testing lets us evaluate model performance
on data the model has not seen.
#DataFlair - Split the dataset
x_train,x_test,y_train,y_test=train_test_split(
df['text'], labels, test_size=0.2,
random_state=7)
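
To make the effect of test_size and random_state concrete, here is a minimal standard-library sketch of a shuffled hold-out split. It is not scikit-learn's implementation and will not reproduce the same row assignment as train_test_split; the helper name simple_split and the toy data are hypothetical.

```python
import math
import random


def simple_split(features, labels, test_size=0.2, seed=7):
    """Shuffle indices with a fixed seed, then hold out a fraction for testing."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a repeatable split
    n_test = math.ceil(len(indices) * test_size)  # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Toy data standing in for the text and label columns.
texts = [f"article {i}" for i in range(10)]
labels = ["FAKE" if i % 2 else "REAL" for i in range(10)]

x_train, x_test, y_train, y_test = simple_split(texts, labels)
print(len(x_train), len(x_test))
```

Because the seed is fixed, running the split twice returns the same partition, which is the role random_state=7 plays in the call above.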

3.6.1 Training Data

Machine learning uses algorithms to learn from data in
datasets. They find patterns, develop understanding, make
decisions, and evaluate those decisions.
In machine learning, datasets are split into two subsets.
The first subset is known as the training data - it’s a portion
of our actual dataset that is fed into the machine learning
model to discover and learn patterns. In this way, it trains our
model.

3.6.2 Testing Data


Once your machine learning model is built (with your
training data), you need unseen data to test your model. This
data is called testing data, and you can use it to evaluate the
performance of the trained model and then adjust or optimize
the model for improved results.
Testing data has two main criteria. It should:

• Represent the actual dataset
• Be large enough to generate meaningful predictions
3.7 Initializing TfidfVectorizer
Initialize a TfidfVectorizer with stop words from the
English language and a maximum document frequency of
0.7 (terms with a higher document frequency will be
discarded). Stop words are the most common words in a
language, which are filtered out before processing the
natural language data. A TfidfVectorizer turns a
collection of raw documents into a matrix of TF-IDF
features. Now, fit the vectorizer on the train set and
transform it, then use the fitted vectorizer to transform
the test set.
#DataFlair - Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

#DataFlair - Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train)
tfidf_test=tfidf_vectorizer.transform(x_test)
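
The roles of stop_words, max_df, and the difference between fitting and transforming can be sketched without the library. The class below is a deliberately simplified stand-in, not TfidfVectorizer itself: it assumes a tiny hand-picked stop-word list, whitespace tokenization, and raw counts instead of TF-IDF weights. The class name TinyVectorizer and the sample headlines are invented for this example.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}  # tiny illustrative list


class TinyVectorizer:
    """Simplified stand-in that shows stop-word removal, max_df, fit and transform."""

    def __init__(self, stop_words=STOP_WORDS, max_df=0.7):
        self.stop_words = stop_words
        self.max_df = max_df
        self.vocabulary = []

    def _tokens(self, text):
        return [t for t in text.lower().split() if t not in self.stop_words]

    def fit(self, texts):
        n = len(texts)
        doc_freq = Counter(t for text in texts for t in set(self._tokens(text)))
        # Drop terms that appear in more than max_df of the training documents.
        self.vocabulary = sorted(t for t, df in doc_freq.items() if df / n <= self.max_df)
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            counts = Counter(self._tokens(text))
            # Terms outside the learned vocabulary are simply ignored.
            rows.append([counts[t] for t in self.vocabulary])
        return rows


train = ["the senate passed the budget", "budget talks stall in senate",
         "aliens endorse budget", "senate denies alien rumor"]
test = ["aliens seize the senate podium"]

vec = TinyVectorizer().fit(train)      # vocabulary comes from the train set only
train_matrix = vec.transform(train)
test_matrix = vec.transform(test)      # test texts reuse that fixed vocabulary
print(vec.vocabulary)
print(test_matrix)
```

Here "senate" and "budget" each occur in three of the four training texts (0.75 > 0.7), so they are discarded, and words that appear only in the test text never enter the vocabulary. Fitting on the train set alone keeps information from the test set out of the features.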

3.8 Passive aggressive classifiers


We’ll initialize a PassiveAggressiveClassifier and fit it
on tfidf_train and y_train.
Then, we’ll predict on the test set produced by the TfidfVectorizer
and calculate the accuracy with accuracy_score() from
sklearn.metrics.

#DataFlair - Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

#DataFlair - Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

We got an accuracy of 92.82% with this model. Finally, let’s
print out a confusion matrix to gain insight into the number of
false and true negatives and positives.

#DataFlair - Build confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
So with this model, we have 589 true positives, 587 true
negatives, 42 false positives, and 49 false negatives.
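
The reported accuracy can be cross-checked directly from these four counts. The short sketch below does the arithmetic; precision and recall are included as additional measures that follow from the same counts, under the assumption that the "positive" class is the one the text counts as true positives.

```python
# Counts quoted in the text above.
tp, tn, fp, fn = 589, 587, 42, 49

total = tp + tn + fp + fn
accuracy = (tp + tn) / total   # share of all test samples classified correctly
precision = tp / (tp + fp)     # of the predicted positives, how many were right
recall = tp / (tp + fn)        # of the actual positives, how many were found

print(f"Test samples: {total}")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")
```

The four counts add up to 1267 test samples, and (589 + 587) / 1267 gives the 92.82% accuracy reported by accuracy_score().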

4. CONCLUSION

4.1 Results
In the 21st century, the majority of tasks are done online.
Newspapers that were earlier preferred as hard copies are
now being replaced by applications like Facebook and Twitter
and by news articles read online. WhatsApp forwards are
also a major source. The growing problem of fake news only
makes things more complicated, and it changes or hampers
people's opinion of and attitude towards the use of digital
technology. When a person is deceived by fake news, they
start believing that their perceptions about a particular
topic are true as assumed.
Thus, in order to curb the phenomenon, we have developed
our fake news detection system, which takes input from the
user and classifies it as true or fake. To implement this,
various NLP and machine learning techniques have to be
used. The model is trained using an appropriate dataset, and
performance evaluation is done using various
performance measures. The best model, i.e. the model with
the highest accuracy, is used to classify the news headlines or
articles. As evident above for static search, our best model
came out to be Logistic Regression with an accuracy of 65%.
We then used grid search parameter optimization to
improve the performance of logistic regression, which
gave us an accuracy of 75%. Hence we can say that if a user
feeds a particular news article or its headline into our model,
there is a 75% chance that it will be classified correctly.
The user can check the news article or keywords online, and
can also check the authenticity of the website. The accuracy
for the dynamic system is 93%, and it increases with every
iteration.

We learned to detect fake news with Python. We took a
political dataset, implemented a TfidfVectorizer, initialized a
PassiveAggressiveClassifier, and fit our model. We ended up
obtaining an accuracy of 92.82%.

4.2 Future scope


Based on Natural Language Processing (NLP) techniques,
several lifelike text-generating systems have proliferated, and
they are becoming smarter every day. In 2020, OpenAI
announced the launch of GPT-3, a tool that produces text
so realistic that in some cases it is nearly impossible to
distinguish from human writing. GPT-3 can also figure out
how concepts relate to each other and discern context. Tools
like this one can be used to generate misinformation, spam,
phishing, abuse of legal and governmental processes, and
even fake academic essays.
Fighting fake news is a double-edged sword
We need technology to fight this battle. AI makes it
possible to find words and patterns that indicate fake news in
huge volumes of data, and tech companies are already
working on it. Google is working on a system that can detect
videos that have been altered, making its datasets open
source and encouraging others to develop deepfake detection
methods. YouTube declared that it won’t allow
election-related “deepfake” videos or anything that aims to
mislead viewers about voting procedures and how to
participate in the 2020 census.
