
EMAIL SPAM DETECTION

A Social Relevant Project Report Submitted to

Jawaharlal Nehru Technological University Anantapur

In partial fulfillment of the requirements

for the award of the degree of

BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
By
G.MADHU SUJAN - 19AK1A0593
T.KAVERI - 19AK1A0575
N.LAVANYA - 19AK1A0583
A.LARIFA - 19AK1A0581

Under the guidance of


Dr. T.Sreenivasula Reddy

Associate Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES

(AUTONOMOUS)

Venkatapuram(V), Karakambadi (Po), Renigunta (M), Tirupati-517520, A.P.

2021-2022

ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES

(AUTONOMOUS)

Venkatapuram(V), Karakambadi (Po), Renigunta(M), Tirupati-517520, A.P

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE
Certified that this is a bonafide record of the Social Relevant Project Report entitled “EMAIL SPAM DETECTION”, done by G.MADHU SUJAN (Reg. No: 19AK1A0593), T.KAVERI (Reg. No: 19AK1A0575), N.LAVANYA (Reg. No: 19AK1A0583), and A.LARIFA (Reg. No: 19AK1A0581), submitted to the faculty of Computer Science and Engineering, in partial fulfillment of the requirements for the degree of BACHELOR OF TECHNOLOGY in Computer Science and Engineering from Jawaharlal Nehru Technological University Anantapur during the years 2019-2023.

Guide:
Dr. T.Sreenivasula Reddy,
Associate Professor,
Dept. of CSE, AITS, Tirupati.

Head of the Department:
Mr. B.Ramana Reddy,
Assistant Professor & HOD,
Dept. of CSE, AITS, Tirupati.

INTERNAL EXAMINER EXTERNAL EXAMINER

Date:______________
Place:Tirupati.
ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES
(AUTONOMOUS)
Venkatapuram(V), Karakambadi (Po), Renigunta(M), Tirupati-517520, A.P
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DECLARATION

We hereby declare that the project titled “EMAIL SPAM DETECTION” is genuine project work carried out by us in the B.Tech (Computer Science and Engineering) degree course of Jawaharlal Nehru Technological University Anantapur, and has not been submitted by us to any other course or university for the award of any degree.

G.MADHU SUJAN - 19AK1A0593


T.KAVERI - 19AK1A0575
N.LAVANYA - 19AK1A0583
A.LARIFA - 19AK1A0581

ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of a task would be incomplete without mentioning the people who made it possible, whose encouragement crowns all efforts with success.

We avail this opportunity to express our deep sense of gratitude and hearty thanks to Mr. C. GANGI REDDY, Honorable Secretary of AITS, Tirupati, for providing a congenial atmosphere and encouragement.

We show our gratitude to Dr. C. NADHAMUNI REDDY, Principal, for having provided all the facilities and support.

We would like to thank Mr. B. RAMANA REDDY, Assistant Professor & HOD, Computer Science and Engineering, for his encouragement at various stages of our project.

We are thankful to our guide Dr. T.Sreenivasula Reddy, Computer Science and Engineering, for his sustained, inspiring guidance and cooperation throughout this project.

We express our deep sense of gratitude and thanks to all the Teaching and Non-Teaching Staff of
our college who stood with us during the project and helped us to make it a successful venture.

We place our highest regard for our Parents, Friends and Well-Wishers, who helped a lot in preparing the report of this project.

G.MADHU SUJAN - 19AK1A0593


T.KAVERI - 19AK1A0575
N.LAVANYA - 19AK1A0583
A.LARIFA - 19AK1A0581

CONTENTS

CHAPTER NO    NAME OF THE CHAPTER

              List of Figures
              Abstract
1             INTRODUCTION
              1.1 Introduction
              1.2 Existing System
              1.3 Proposed System
              1.4 Disadvantages of Existing System
              1.5 Advantages of Proposed System
2             ANALYSIS
              2.1 Introduction
              2.2 Requirement Specification
              2.2.1 User Requirements
              2.2.2 Hardware Specification
              2.2.3 Software Specification
              2.2.4 Explanation of Each Requirement
3             DESIGN
              3.1 Introduction
              3.2 ER Diagram Components
              3.2.1 ER Diagram
              3.2.2 UML Diagram
              3.3 Module Design and Organization
4             IMPLEMENTATION & RESULT
              4.1 Introduction
              4.2 Sample Code
              4.3 Output Screens
5             TESTING
              5.1 Introduction
              5.2 Test Cases
6             CONCLUSION & FUTURE ENHANCEMENT
              6.1 Conclusion
              6.2 Future Enhancement
7             REFERENCES

LIST OF FIGURES

FIGURE NO    FIGURE NAME

3.2.1        ER Diagram
3.2.2        UML Diagram

ABSTRACT

Social communication has evolved, with e-mail still being one of the most common means of communication, used for both formal and informal purposes. Although many languages are being digitized for the electronic world, the use of English is still abundant. However, various native languages of different regions are gradually emerging. The Urdu language, spoken in South Asia and mostly in Pakistan, is also gaining pace as a medium of communication on social media platforms, websites, and emails. With the increased usage of email, the number and variety of spam content in Urdu also increase. Spam emails are inappropriate and unwanted messages usually sent to breach security. These spam emails include phishing URLs, advertisements, and commercial segments, and are sent to a large number of indiscriminate recipients. Thus, such content is always a hazard for the user, and many studies have been carried out to detect such spam content. However, there is a dire need to detect spam emails whose content is written in the Urdu language.

The proposed system, “EMAIL SPAM DETECTION”, utilizes existing machine learning algorithms, including Naive Bayes, CNN, SVM, and LSTM, to detect and categorize e-mail content. According to our findings, the LSTM model outperforms the other models with the highest accuracy of 98.4%.

1. INTRODUCTION

1.1 INTRODUCTION

Email spam has become a major problem nowadays; with the rapid growth of internet users, email spam is also increasing. People are using it for illegal and unethical conduct, phishing, and fraud, sending malicious links through spam emails that can harm the recipient's system and sneak into it. Creating a fake profile and email account is easy for spammers, who pretend to be genuine persons in their spam emails and target people who are not aware of these frauds. So, there is a need to identify spam mails that are fraudulent.

1.2 EXISTING SYSTEM

Existing spam detection largely relies on automatic email filtering applications and on blacklist-based approaches, in which all mails are accepted other than those from blacklisted domains or email ids. Naive Bayes is one of the most well-known algorithms applied in these filters. However, spammers can nowadays easily bypass such filtering applications.

1.3. PROPOSED SYSTEM

In this proposed system, a dataset from the “Kaggle” website is used as the training dataset. The inserted dataset is first checked for duplicates and null values for better performance of the machine. The dataset is then split into two subsets, a “train dataset” and a “test dataset”, in the proportion of 70:30, and both are passed as parameters for text processing.

In text processing, punctuation symbols and words that appear in the stop-words list are removed, and the clean words are returned.
After acquiring the values from hyperparameter tuning, the machine is fitted using those values with a random state. The state of the trained model and the features are saved for future use in testing unseen data. Using classifiers from the sklearn module in Python, the machines are trained using the values obtained above.
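
A minimal sketch of the steps described above (duplicate/null checks, the 70:30 split, and text processing); the file name, column name, and helper function here are illustrative assumptions rather than the exact project code, and the NLTK stop-words corpus is assumed to be downloaded:

import string
import pandas as pd
from nltk.corpus import stopwords                        # assumes nltk.download('stopwords') has been run
from sklearn.model_selection import train_test_split

def clean_text(message):
    # remove punctuation symbols
    no_punct = ''.join(ch for ch in message if ch not in string.punctuation)
    # drop words that appear in the stop-words list and return the clean words
    return [word for word in no_punct.split() if word.lower() not in stopwords.words('english')]

data = pd.read_csv('spam.csv')                           # Kaggle dataset; file and column names are assumptions
data = data.drop_duplicates().dropna()                   # check for duplicates and null values
train_data, test_data = train_test_split(data, test_size=0.30, random_state=1)   # 70:30 split
train_tokens = train_data['Message'].apply(clean_text)   # text processing on the "train" split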

1.4 DISADVANTAGES OF EXISTING SYSTEM

Automatic email filtering may be the most effective method of detecting spam, but nowadays spammers can easily bypass all of these spam-filtering applications. Naive Bayes is one of the most well-known algorithms applied in these procedures. The blacklist approach has probably been the earliest technique pursued for filtering spam; the technique is to accept all mails other than those from the blacklisted domains/email ids.

1.5 ADVANTAGES OF PROPOSED SYSTEM

Ensemble methods, on the other hand, have proven to be useful, as they use multiple classifiers for class prediction. Nowadays lots of emails are sent and received, and it is difficult to check them all, as our project is only able to test emails using a limited corpus. Our spam detection project is thus capable of filtering mails according to the content of the email and not according to the domain names or any other criteria.

 Good efficiency
 Greater accuracy

2. ANALYSIS

2.1 INTRODUCTION

Analysis is the process of gathering and interpreting the requirements. Analysis can be done in different ways; here, it involves identifying the material that is suitable for relevant analysis. It is important to gather the necessary information before organizing or scheduling anything. System analysis is an important phase of any system development process.

2.2 REQUIREMENT SPECIFICATION

2.2.1 USER REQUIREMENTS

 Web Browser
 Internet Connection
 Laptop with good specifications

2.2.2 HARDWARE SPECIFICATION

 Processor : Intel Core i5
 RAM : 4 GB
 Processor speed : 1.70 GHz

2.2.3 SOFTWARE SPECIFICATION

Operating system : Windows 10
IDE : Google Colab
Coding language : Python

2.2.4 EXPLANATION OF EACH REQUIREMENT

What is Colab?
Colab, or "Colaboratory", allows you to write and execute Python in your browser, with

• Zero configuration required


• Access to GPUs free of charge
• Easy sharing

Whether you're a student, a data scientist or an AI researcher, Colab can make your work easier.
With Colab you can import an image dataset, train an image classifier on it, and evaluate the
model, all in just a few lines of code. Colab notebooks execute code on Google's cloud servers,
meaning you can leverage the power of Google hardware, including GPUs and TPUs, regardless
of the power of your machine. All you need is a browser.
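
For example, once a notebook is connected to a GPU runtime, the available accelerator can be checked with a couple of lines (a small illustrative snippet, not part of the project code):

import tensorflow as tf

# lists the GPU devices provided by the Colab runtime; an empty list means the notebook is running on CPU only
print(tf.config.list_physical_devices('GPU'))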

Colab is used extensively in the machine learning community with applications including:

• Getting started with TensorFlow


• Developing and training neural networks
• Experimenting with TPUs
• Disseminating AI research
• Creating tutorials

Data Preprocessing in Machine learning


Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step when creating a machine learning model.

When creating a machine learning project, we do not always come across clean and formatted data, and before doing any operation with data it is necessary to clean it and put it in a formatted way. For this, we use the data preprocessing task.

Why do we need Data Preprocessing?


Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be directly used by machine learning models. Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.

It involves the steps below; a short illustrative sketch follows the list:

o Getting the dataset
o Importing libraries
o Importing datasets
o Finding missing data
o Encoding categorical data
o Splitting the dataset into training and test sets
o Feature scaling
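
The sketch below walks through these steps on a generic tabular dataset; the file path and column names are assumptions for illustration, and all remaining feature columns are assumed to be numeric:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('dataset.csv')                              # getting and importing the dataset
df = df.fillna(df.mean(numeric_only=True))                   # finding and filling missing numeric data
df['label'] = LabelEncoder().fit_transform(df['label'])      # encoding the categorical target column
X = df.drop(columns=['label'])
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)   # train/test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)                      # feature scaling, fitted on the training data only
X_test = scaler.transform(X_test)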

What is a Machine Learning Model?


Machine learning models can be understood as programs that have been trained to find patterns within new data and make predictions. These models are represented as a mathematical function that takes requests in the form of input data, makes predictions on that input data, and then provides an output in response. First, these models are trained on a set of data; then they are provided an algorithm to reason over the data, extract patterns from the fed data, and learn from those data. Once these models are trained, they can be used to predict on unseen data.

Machine Learning Models
A machine learning model is defined as a mathematical representation of the output of the training process. Machine learning is the study of different algorithms that can improve automatically through experience and historical data and build the model. A machine learning model is similar to computer software designed to recognize patterns or behaviors based on previous experience or data. The learning algorithm discovers patterns within the training data and outputs an ML model that captures these patterns and makes predictions on new data.

Classification of Machine Learning Models:


Based on different business goals and data sets, there are three learning models for algorithms.
Each machine learning algorithm settles into one of the three models:

o Supervised Learning
o Unsupervised Learning
o Reinforcement Learning

Supervised Learning is further divided into two categories:

o Classification
o Regression

Unsupervised Learning is also divided into the below categories:

o Clustering
o Association Rule
o Dimensionality Reduction

The main aim of the linear regression model is to find the best-fit line for the data points. Linear regression is extended to multiple linear regression (finding a plane of best fit) and polynomial regression (finding the best-fit curve).
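
A brief sketch of fitting a linear regression model with sklearn; the toy data below is made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: y is roughly 2*x + 1 with a little noise
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)    # slope and intercept of the best-fit line
print(model.predict([[6.0]]))           # prediction for an unseen value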

Classification
Classification models are the second type of supervised learning technique; they are used to draw conclusions from observed values in categorical form. For example, a classification model can identify whether an email is spam or not, or whether a buyer will purchase a product or not. Classification algorithms are used to predict classes and categorize the output into different groups.

In classification, a classifier model is designed that classifies the dataset into different categories,
and each category is assigned a label.

There are two types of classifications in machine learning:

o Binary classification: if the problem has only two possible classes, the model is called a binary classifier. For example, cat or dog, yes or no.

o Multi-class classification: if the problem has more than two possible classes, it is a multi-class classifier.

Some popular classification algorithms are as below:

a) Logistic Regression

Logistic regression is used to solve classification problems in machine learning. It is similar to linear regression but is used to predict categorical variables. It can predict the output as Yes or No, 0 or 1, True or False, etc.; however, rather than giving the exact values, it provides probabilistic values between 0 and 1.
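
A minimal sklearn sketch of a logistic regression classifier; the synthetic dataset is illustrative only:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)   # synthetic two-class data
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))          # predicted classes (0 or 1)
print(clf.predict_proba(X[:5]))    # probabilistic values between 0 and 1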

b) Support Vector Machine

Support vector machine, or SVM, is a popular machine learning algorithm that is widely used for classification and regression tasks; however, it is primarily used to solve classification problems. The main aim of SVM is to find the best decision boundary in an N-dimensional space that can segregate the data points into classes; this best decision boundary is known as the hyperplane. SVM selects the extreme vectors to find the hyperplane, and these vectors are known as support vectors.
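
A short illustrative sketch of an SVM classifier in sklearn; the synthetic data and the linear kernel are example choices, not project settings:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(kernel='linear').fit(X, y)      # finds the maximum-margin hyperplane between the two classes
print(clf.support_vectors_.shape)         # the support vectors that define the decision boundary
print(clf.predict(X[:5]))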

c) Naïve Bayes

Naïve Bayes is another popular classification algorithm used in machine learning. It is called so as it is based on Bayes' theorem and follows the naïve (independence) assumption between the features, which is given as:

P(y | x1, ..., xn) = P(y) * P(x1 | y) * ... * P(xn | y) / P(x1, ..., xn)

Each naïve Bayes classifier assumes that the value of a specific feature is independent of any other feature. For example, if a fruit needs to be classified based on color, shape, and taste, then a fruit that is yellow, oval, and sweet will be recognized as a mango. Here each feature is independent of the other features.
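
A small sketch of a naïve Bayes text classifier in sklearn; the example messages, labels, and the 0 = spam / 1 = ham convention are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at ten tomorrow", "free offer claim prize", "see you at the meeting"]
labels = [0, 1, 0, 1]                            # 0 = spam, 1 = ham (assumed labelling convention)
X = CountVectorizer().fit_transform(texts)       # word-count features
clf = MultinomialNB().fit(X, labels)
print(clf.predict(X))                            # predictions on the same toy messages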

d) Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, produces the final output.

A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
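
A brief sketch of a random forest classifier in sklearn; 100 trees and the synthetic data are illustrative settings only:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)   # an ensemble of 100 decision trees
print(clf.predict(X[:5]))          # final output decided by the majority vote of the trees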

e) K-Nearest Neighbor Algorithm

This algorithm is used to solve classification problems. The K-nearest neighbor (K-NN) algorithm basically creates an imaginary boundary to classify the data. When new data points come in, the algorithm predicts their class according to the nearest side of that boundary line.

Therefore, a larger k value means smoother curves of separation, resulting in less complex models, whereas a smaller k value tends to overfit the data, resulting in more complex models.
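
A minimal sketch of a K-NN classifier in sklearn; k = 5 and the synthetic data are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # classify by the majority class of the 5 nearest points
print(clf.predict(X[:5]))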

3. DESIGN

3.1 INTRODUCTION

The aim is to develop one or more designs that can be used to achieve the desired project golas. The
main aim of the system design phase is to provide a design for specified needs of the apothecary
management system. This Management System is designed to reduce pen-paper work at the hospitals.

3.2 ER DIAGRAM COMPONENTS

ER Diagram stands for Entity Relationship Diagram; also known as an ERD, it is a diagram that displays the relationships of the entity sets stored in a database. In other words, ER diagrams help to explain the logical structure of databases. ER diagrams are created based on three basic concepts: entities, attributes, and relationships. The ER model represents real-world entities and the relationships between them, and creating an ER model in a DBMS is considered a best practice before implementing your database. Entity Relationship Diagram symbols and notations mainly contain three basic symbols, the rectangle, oval, and diamond, which represent entities, attributes, and the relationships between them; there are also some sub-elements based on the main elements of an ERD. An ER diagram is a visual representation of data that describes how data are related to each other using different ERD symbols and notations.

TABLE : ER DIAGRAM SYMBOLS

Following are the main components and their symbols in ER diagrams:
 Rectangles: this symbol represents entity types
 Ellipses: this symbol represents attributes
 Diamonds: this symbol represents relationship types
 Lines: they link attributes to entity types and entity types to other relationship types
 Primary key: attributes are underlined
 Double ellipses: represent multi-valued attributes

3.2.1 ER DIAGRAM

3.2.2 UML DIAGRAM

3.3 MODULE DESIGN AND ORGANIZATION

Step 1: E-mail Data Collection. The dataset contained in a corpus plays a crucial role in assessing the performance of any spam filter. ...

Step 2: Pre-processing of E-mail Content. ...

Step 3: Feature Extraction and Selection. ...

Step 4: Apply Different ML Models.

Step 5: Performance Analysis.

4. IMPLEMENTATION AND RESULT

4.1 INTRODUCTION

Implementation is the stage of the project when the theoretical design is turned into a working system. Thus, it can be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective. The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, designing of methods to achieve the changeover, and evaluation of changeover methods.

4.2 SAMPLE CODE


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# loading the data from csv file to a pandas DataFrame
raw_mail_data = pd.read_csv('/content/mail_data.csv')
print(raw_mail_data)

# replace null values with empty strings, and label spam mail as 0 and ham mail as 1
# (the 'Category'/'Message' column names and the 0/1 convention are assumed from the
# final prediction check, where prediction[0] == 1 is treated as ham)
mail_data = raw_mail_data.where(pd.notnull(raw_mail_data), '')
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

# separating the data into texts (X) and labels (Y)
X = mail_data['Message']
Y = mail_data['Category']

# splitting the data into training data & test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
print(X.shape)
print(X_train.shape)
print(X_test.shape)

# Feature Extraction
# transform the text data to feature vectors that can be used as input to the Logistic Regression model
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# convert Y_train and Y_test values to integers
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

# Training the Model: Logistic Regression
m1 = LogisticRegression()
# training the Logistic Regression model with the training data
m1.fit(X_train_features, Y_train)

# Evaluating the trained model

# prediction on training data
prediction_on_training_data = m1.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print('Accuracy on training data : ', accuracy_on_training_data)

# prediction on test data

prediction_on_test_data = m1.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
print('Accuracy on test data : ', accuracy_on_test_data)

# Applying Multinomial Naive Bayes Classification
from sklearn.naive_bayes import MultinomialNB
m2 = MultinomialNB()
m2.fit(X_train_features, Y_train)

# prediction on training data
prediction_on_training_data = m2.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print('Accuracy on training data : ', accuracy_on_training_data)

# Applying Random Forest Classification
from sklearn.ensemble import RandomForestClassifier
m3 = RandomForestClassifier()   # default parameters assumed
m3.fit(X_train_features, Y_train)

# prediction on training data
prediction_on_training_data = m3.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print('Accuracy on training data : ', accuracy_on_training_data)

# Applying KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
m4 = KNeighborsClassifier()     # default parameters assumed
m4.fit(X_train_features, Y_train)

# prediction on training data
prediction_on_training_data = m4.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print('Accuracy on training data : ', accuracy_on_training_data)

# Building a Predictive System
input_mail = ["hii how are you"]

# convert text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

# making prediction
prediction = m2.predict(input_data_features)
print(prediction)

if (prediction[0] == 1):
  print('Ham mail')
else:
  print('Spam mail')

4.3 OUTPUT SCREENS

From the above screen, we give the input message “hii how are you”; it is a normal message, so the ML model classifies it as ham mail.

From the above screen, we give a fake message as input, so the ML model predicts it as spam mail.

5. TESTING

5.1 INTRODUCTION

 A test case is an object for exercising the other modules in the architecture; it does not represent the interaction of a module with itself.
 Each test case is a set of sequential steps to execute a test, operating on a set of predefined inputs to produce the expected outputs.
 The table shows the test cases, the corresponding results, and the status of the test steps.
 A test case consists of the set of conditions by which a tester determines whether the system satisfies the requirements and works correctly.
 Problems in the requirements and design of the application are evaluated during the process of
developing test cases.
 The primary goal of software tests is to eliminate bugs in the code.
 However, there are additional benefits a project can gain from a good testing process.
 Benefits such as enhancing performance, user experience, and security of the overall project.
 Often, when working on big projects, the team is divided into several sections.
 Each has its development task, and each task has its standalone functionality.
 These tasks are then combined to form the overall software product.
 That’s why each part must undergo its own testing process to make sure it functions properly before
it is added to the main project.

5.2 TEST CASES

INPUT | OUTPUT | RESULT

Input correct username and password for login | Tested whether the input details are correct or not | Success

Input correct username and incorrect password for login | Tested whether the input details are correct or not; failed to get details | Fail

Both the username and password are incorrect | Tested whether the input details are correct or not; failed to get the details | Fail

Input username and password for login | Tested whether the user already exists or not; if not, registers successfully | Success

Check whether password and confirm password match for opening admin | If matched, it gets opened | Success

Check whether password and confirm password match for registration | If not matched, it won't get opened | Fail

Check whether the tickets are available or not | It shows tickets available | Success

6. CONCLUSION & FUTURE ENHANCEMENT

6.1 CONCLUSION

Accuracy percentages of the different ML models:

Logistic Regression: 96%
Multinomial Naïve Bayes Classification: 98%
Random Forest Classifier: 88%
KNN Classifier: 90%

From this we can conclude that Multinomial Naïve Bayes Classification gives the best accuracy, 98%, when compared to the remaining ML models.

6.2 FUTURE ENHANCEMENT

Efficient pattern detection plays a crucial role in spam mail filtering. Using an ML model, spam detection yields spam patterns, non-spam patterns, and general patterns, which make it easy to identify whether a mail is spam or ham. The current method of spam detection does not include the general patterns. RFD gives the general patterns, from which the user can decide whether to mark a mail as spam or non-spam, avoiding the loss of important mails. Images sent as spam can also be detected using file properties, histograms, and the Hough transform. The current proposed system is for English-language mails, but as future scope we can design the system for multiple languages.
7. REFERENCES

1. Nikhil Kumar, Sanket Sonowal, Nishant, “Email Spam Detection Using Machine Learning Algorithms”, IEEE Conference, 2020.

2. https://www.kaggle.com/venky73/spam-mails-dataset

3. https://jpinfotech.org/email-spam-detection-using-machine-learning-algorithms/
