CYBER ATTACKS CLASSIFICATION USING SUPERVISED

MACHINE LEARNING TECHNIQUES

MINI PROJECT REPORT


Submitted in partial fulfillment of the requirements
for the award of the degree of
Bachelor of Computer Applications

SUBMITTED BY
SATHYA S
211314103314

Under the guidance of


Mrs. D. GAYATHRI, M.Sc.
GUEST LECTURER
BACHELOR OF COMPUTER APPLICATIONS

GURU NANAK COLLEGE


(AUTONOMOUS)
Affiliated to University of Madras
Accredited at ‘A++’ Grade by NAAC | An ISO 9001:2015 Certified Institution
Guru Nanak Salai, Velachery, Chennai – 600 042.
MARCH- 2024
GURU NANAK COLLEGE
(AUTONOMOUS)

Affiliated to University of Madras


Accredited at ‘A++’ Grade by NAAC | An ISO 9001:2015 Certified Institution
Guru Nanak Salai, Velachery, Chennai – 600 042.

BACHELOR OF COMPUTER APPLICATIONS

BONAFIDE CERTIFICATE

This is to certify that this is a bonafide record of the work done by SATHYA S, 211314103314, for the Final Year Project during the Academic Year 2023-24.

PROJECT GUIDE HEAD OF THE DEPARTMENT

Submitted for the Project Viva Voce Examination held on ________________ at

GURU NANAK COLLEGE (Autonomous), Guru Nanak Salai, Velachery, Chennai - 600 042.

Internal Examiner External Examiner


Date: Date:
DECLARATION

I, SATHYA S, 2113141033142, studying III Year, Bachelor of Computer Applications at Guru Nanak College (Autonomous), Chennai, hereby declare that this Report of my Project entitled CYBER ATTACKS CLASSIFICATION USING SUPERVISED MACHINE LEARNING TECHNIQUES is the record of the original work carried out by me under the Guidance and Supervision of Mrs. D. GAYATHRI towards the partial fulfillment of the requirements for the award of the Degree of Bachelor of Computer Applications. I further declare that this work has not been submitted anywhere before for the award of any Degree, Diploma or other similar title.

PLACE : CHENNAI SATHYA S


DATE : 2113141003142
ACKNOWLEDGEMENT

I would like to thank the Principal Dr. T. K. Avvai Kothai and Vice Principal Dr.
Anitha Malisetty for providing the necessary resources and facilities for the completion of this
project.

I extend my deepest thanks to Dr. K. RAVIYA, Head of the Department, whose guidance,
support, and encouragement were invaluable throughout this endeavor. Her expertise and insights
have been instrumental in shaping this project and enhancing its quality.

I owe my Guide Mrs. D. GAYATHRI a debt of gratitude for her invaluable guidance,
patience, and encouragement. Her mentorship has been a beacon of light, steering me through the
complexities of this project and helping me realize my potential.

I would also like to extend my thanks to the faculty members of the BACHELOR OF COMPUTER APPLICATIONS department for their valuable suggestions during the course of my project.

Last but not least, I thank my family and friends for their unwavering encouragement
and understanding during this journey.
TABLE OF CONTENTS

1. INTRODUCTION
   1.1 Objective
   1.2 Modules of the Project

2. SYSTEM SPECIFICATION
   2.1 Hardware Requirements
   2.2 Software Requirements

3. SURVEY OF TECHNOLOGIES
   3.1 Features of the Front End
   3.2 Features of the Back End

4. SELECTED SOFTWARE
   4.1 HTML
   4.2 CSS
   4.3 JavaScript
   4.4 Bootstrap
   4.5 Python
   4.6 Django
   4.7 MySQL
   4.8 SQLite

5. SYSTEM ANALYSIS
   5.1 Existing System
   5.2 Characteristics of Proposed System

6. SYSTEM DESIGN
   6.1 Data Visualization
   6.2 Use Case Diagram
   6.3 Entity Relationship Diagram

7. PROGRAM CODING
   7.1 Source Code
   7.2 Screenshots

8. TESTING
   8.1 Software Testing
   8.2 Types of Testing

9. CONCLUSION

10. REFERENCE
ABSTRACT

This project addresses cyber-attack classification through the utilization of supervised machine learning methods. The
system is designed to categorize diverse cyber-attacks by employing a meticulously curated dataset
encompassing a wide array of attack types, including but not limited to malware, phishing, and distributed
denial-of-service (DDoS) attacks. Feature extraction techniques are applied to both network traffic data and
behavioural attributes, facilitating the training of a robust classification model. Various supervised learning
algorithms, such as decision trees, support vector machines, and neural networks, are evaluated for their
efficacy in accurately predicting attack categories. The training process involves labelling historical attack
instances, enabling the model to discern intricate patterns and subtle differentiators among attack types.
Regular model updates and retraining with new attack data ensure its relevance in dynamically evolving
threat landscapes. The system's predictive accuracy empowers cybersecurity teams to swiftly identify and
respond to cyber threats, thereby bolstering overall defense strategies. Through this research, we contribute
to the proactive identification and mitigation of cyber-attacks, ultimately fortifying digital security
frameworks.
INTRODUCTION

1. INTRODUCTION

1.1 OBJECTIVES
The objective of this research is to explore and highlight the significance of employing supervised
machine learning techniques for the classification of cyber-attacks in the realm of modern
cybersecurity. The focus is on leveraging labelled datasets to train algorithms for the swift and accurate
identification and categorization of diverse cyber threats. The ultimate goal is to enable organizations
to respond effectively, mitigate potential damage, and strengthen their overall cybersecurity defenses.

• Pivotal Aspect of Cybersecurity: Recognition of the pivotal role played by supervised machine learning techniques in the contemporary landscape of cybersecurity.

• Growing Sophistication and Frequency of Cyber Threats: Acknowledgment of the increasing complexity and frequency of cyber threats in the evolving digital environment.

• Swift and Accurate Categorization: Emphasis on the ability of supervised machine learning to provide a quick and accurate categorization of various types of cyber-attacks, aiding organizations in timely responses.

• Leveraging Labelled Datasets: Highlighting the crucial role of labelled datasets in training machine learning algorithms for effective cyber-attack classification.

• Challenges in Cybersecurity Classification: Recognition of challenges, including the diversity of attack methods, adaptability of attackers, and imbalanced data, which contribute to the complexity of cyber threat classification.

• Applications Across Cybersecurity Domains: Identification of diverse applications, ranging from intrusion detection and email filtering to malware identification and anomaly detection, showcasing the versatility of supervised machine learning in cybersecurity.
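As a minimal, hedged illustration of the supervised workflow outlined above (this sketch is not part of the project's code), a classifier can be trained on a small set of hand-labelled feature vectors and then asked to categorize a new sample; the feature values and labels are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier

# Each labelled row is a hypothetical (packet_rate, failed_logins) pair.
X_train = [[900, 0], [950, 1], [5, 40], [3, 55], [10, 0], [12, 1]]
y_train = ["DDoS", "DDoS", "Brute-force", "Brute-force", "Benign", "Benign"]

# Train on the labelled data, then categorize an unseen traffic sample.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict([[880, 2]]))  # expected output: ['DDoS']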
1.2 MODULES OF THE PROJECT

• Data Pre-processing
• Data Analysis and Visualization
• Implementing Algorithm 1
• Implementing Algorithm 2
• Implementing Algorithm 3
• Deployment

1. Data Pre-processing:

- Cleans and prepares raw data, addressing missing values and optimizing features for subsequent
analysis.

2. Data Analysis and Visualization:

- Extracts insights and patterns through statistical analysis and visualization, laying the groundwork for
informed decision-making.

3. Algorithm Implementation (1, 2, 3):

- Applies and evaluates multiple algorithms (1, 2, 3) to identify the most effective solution based on
performance metrics.

4. Deployment:

- Integrates the selected algorithm into a practical setting, ensuring it is adapted for operational use with
user interfaces and continuous monitoring.
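The sketch below shows, in highly simplified form, how these modules could be chained together; the file name, the 'Label' column, and the three candidate algorithms mirror the source code in Chapter 7, but this is a hedged outline rather than the project's exact implementation.

import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from catboost import CatBoostClassifier

# 1. Data pre-processing: clean the raw dataset.
df = pd.read_csv('CYBER.csv').dropna().drop_duplicates()

# 2. Data analysis / visualization would be performed here (see Chapter 6).

# 3. Algorithm implementation: train several models and keep the best one.
x = df.drop(labels='Label', axis=1)
y = df['Label']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

models = [GaussianNB(), AdaBoostClassifier(), CatBoostClassifier(verbose=0)]
best = max(models, key=lambda m: accuracy_score(y_test, m.fit(x_train, y_train).predict(x_test)))

# 4. Deployment: persist the winning model for use in the application.
joblib.dump(best, 'best_model.pkl')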

SYSTEM SPECIFICATION

2. SYSTEM SPECIFICATION

2.1 Hardware Requirements

Processor : Pentium IV/III

Hard disk : minimum 80 GB

RAM : minimum 2 GB

2.2 Software Requirements

Operating System : Windows

Tool : Anaconda with Jupyter Notebook

SURVEY OF
TECHNOLOGIES

3.SURVEY OF TECHNOLOGIES

3.1 FEATURES OF FRONT END

The part of an application that the user interacts with directly is termed the front end. It is also referred to as the 'client side' of the application. It includes everything that users experience directly: text colors and styles, images, graphs and tables, buttons, colors, and the navigation menu. HTML, CSS, and JavaScript are the core languages used for front-end development. The structure, design, behaviour, and content of everything seen on screen when websites, web applications, or mobile apps are opened is implemented by front-end developers. Responsiveness and performance are the two main objectives of the front end. The developer must ensure that the site is responsive, i.e. it appears correctly on devices of all sizes, and no part of the website should behave abnormally irrespective of the size of the screen. Some front-end development tools are HTML, CSS, XML, Bulma, Tailwind CSS, and Sass.

3.2 FEATURES OF BACK END

Backend is the server side of the application/website. It stores and arranges data, and also makes sure everything on the client side of the application/website works fine. It is the part of the application/website that you cannot see or interact with directly. It is the portion of software that does not come in direct contact with the users. The parts and characteristics developed by backend developers are indirectly accessed by users through a frontend application. Activities like writing APIs, creating libraries, and working with system components without user interfaces, or even systems of scientific programming, are also included in the backend. Some back-end development tools are PHP, Java, C++, Python, Firebase, and MySQL.

SELECTED SOFTWARE

4.SELECTED SOFTWARE

4.1 HTML

HTML, or HyperText Markup Language, is the fundamental language of web development. Created by
Tim Berners-Lee, HTML uses tags to structure content, define links, and incorporate multimedia. Key
features include hyperlinks for navigation, support for multimedia elements, interactive forms, semantic
markup for accessibility, and cross-browser compatibility. HTML continues to evolve, with HTML5
introducing new features like canvas for graphics and improved support for mobile devices. As the
backbone of the web, HTML is essential for creating structured and visually appealing online content.

Features:

• New features should be based on HTML, CSS, the DOM, and JavaScript.
• The need for external plug-ins (like Flash) should be reduced.
• Error handling should be easier than in previous versions.
• Scripting should be replaced by more markup.
• Some of the most interesting new features in HTML5 are:
• The <canvas> element for drawing.
• The <video> and <audio> elements for media playback.
• Support for local storage.
4.2 CSS

Cascading Style Sheets (CSS) is a vital web development technology that complements HTML by styling
web pages. Using selectors and declarations, CSS separates content and presentation, allowing developers
to define the appearance of HTML elements. Key features include layout control, responsiveness through
media queries, external style sheets for modularity, and support for animations. CSS enhances the visual
appeal and consistency of web pages, playing a crucial role in creating engaging and well-designed online
content.

KEY FEATURES:

• Selectors: CSS3 introduces several new selectors that allow you to target specific elements in a more precise way, such as :nth-child(), :not(), and :checked.
• Box model: CSS3 adds new properties for controlling the size, padding, border, and margin of boxes, such as box-sizing, border-radius, and box-shadow.
• Colors: CSS3 introduces new color formats, such as HSL and RGBA, which allow you to specify colors in a more intuitive way.
• Fonts: CSS3 adds new properties for controlling the font size, style, and weight, as well as new font formats, such as web fonts.
4.3 JAVASCRIPT

JavaScript is predominantly used for client-side scripting in web development. It runs directly
in the web browser, enabling developers to create dynamic and interactive web pages that respond to
user actions in real time without needing to communicate with the server. JavaScript is the backbone
of many modern web applications, including social media platforms, online collaboration tools, and
e-commerce websites. It allows developers to create rich, interactive user interfaces and deliver a
seamless user experience. With the advent of platforms like Node.js, JavaScript can also be used for
server-side development. Node.js allows developers to build scalable and high-performance web
servers and backend services using JavaScript. JavaScript allows manipulation of the Document
Object Model (DOM), enabling developers to dynamically update and modify the content, structure,
and style of web pages based on user actions or application state changes.

KEY FEATURES:

• Dynamic Content
• Event Handling
• Data Manipulation
• Modularity
• Event-Driven Programming

4.4 BOOTSTRAP

Bootstrap is a popular front-end framework that provides a collection of CSS, JavaScript, and HTML components for building responsive, mobile-first web applications. It was first released in 2011 by Twitter and has since become one of the most widely used front-end frameworks on the web.

KEY FEATURES:

• Responsive design: Bootstrap's grid system makes it easy to create responsive designs that adapt to different screen sizes and devices.

• Pre-designed UI components: Bootstrap includes a large set of pre-designed UI components, such as buttons, forms, tables, and navigation bars, that can be easily customized to fit your application's needs.
4.5 PYTHON

Python is a widely used, high-level programming language renowned for its readability and versatility. Created in 1991, Python's simplicity, extensive standard library, and dynamic typing make it suitable for diverse applications, including web development and data analysis. Its community-driven development, cross-platform compatibility, and ease of learning contribute to Python's popularity and widespread adoption.

KEY FEATURES:

• Readable Syntax: Clear and concise code structure for readability.

• Versatility: Supports both procedural and object-oriented programming.

• Extensive Library: Rich standard library simplifies development.

• Interpreted and Interactive: Allows rapid testing and experimentation.

4.6 Django

Django is a high-level Python web framework known for rapid development. With an MVC
architecture, built-in ORM system, and templating engine, it simplifies common tasks.
Features like an automatic admin interface, security measures, and scalability contribute to its
popularity. Supported by a vibrant community, Django is versatile, suitable for various
applications, and includes Django REST framework for API development.

Key Features :

1. Rapid Development: Facilitates quick and efficient web development.

2. MVC Architecture: Organizes code in a Model-View-Controller pattern.

3. ORM System: Simplifies database interactions with Python code.

4. Django REST framework: Enhances Django for modern API development.
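As a brief, hedged sketch of how the ORM described above might be used in this context (the model name and fields are assumptions, not part of the project), a classified attack could be stored as a Django model and queried without writing SQL:

from django.db import models

class AttackRecord(models.Model):
    # Hypothetical fields for a classified network flow.
    label = models.CharField(max_length=50)          # predicted attack category
    flow_duration = models.FloatField()              # example feature value
    detected_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return f"{self.label} at {self.detected_at}"

# Example ORM query (assumes the app is registered and migrations are applied):
# recent = AttackRecord.objects.filter(label='DDoS').order_by('-detected_at')[:10]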

4.7 MySQL

MySQL is an open-source relational database management system (RDBMS)


that uses Structured Query Language (SQL). One of its key features is its support for
ACID (Atomicity, Consistency, Isolation, Durability) properties, ensuring data
integrity and reliability even in complex transactional scenarios.

Key Features :

• Data Querying: Enables efficient retrieval and manipulation of data.

• Data Definition Language (DDL): Defines and modifies database structures.

• Data Manipulation Language (DML): Manipulates data within the database.

• Data Integrity: Ensures accuracy and consistency of stored data.

• Transaction Control: Manages transactions for data consistency.

• Security: Implements access controls and permissions for data protection.

• Scalability: Adaptable for handling growing volumes of data.
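A minimal sketch of these features from Python, assuming the mysql-connector-python package and placeholder credentials (none of these names come from the project), touches DDL, DML, transaction control, and querying:

import mysql.connector

# Placeholder connection details for illustration only.
conn = mysql.connector.connect(host='localhost', user='root',
                               password='secret', database='cyber')
cur = conn.cursor()

# DDL: define a table for classified attacks.
cur.execute("CREATE TABLE IF NOT EXISTS attacks ("
            "id INT AUTO_INCREMENT PRIMARY KEY, "
            "label VARCHAR(50), flow_duration DOUBLE)")

# DML inside a transaction: insert a record, then commit for consistency.
cur.execute("INSERT INTO attacks (label, flow_duration) VALUES (%s, %s)", ("DDoS", 1.23))
conn.commit()

# Querying: retrieve the stored rows.
cur.execute("SELECT label, flow_duration FROM attacks")
print(cur.fetchall())

cur.close()
conn.close()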

4.8 SQLite

SQLite is a lightweight and serverless relational database management system, known for its
simplicity and efficiency. With a zero-configuration approach, it operates from a single file,
making it easy to integrate into various applications. SQLite supports standard SQL syntax, is
cross-platform, and boasts a low memory footprint, making it a popular choice for embedded
systems, mobile apps, and desktop software. As open-source software, SQLite has a robust
community providing support and resources for developers.

Key Features :

• Serverless & Embedded: Lightweight and serverless, operates from a single file.

• Zero Configuration: Requires minimal setup for easy integration.

• Cross-Platform: Compatible with various operating systems.

• Transactional Support: Ensures reliable transaction processing.

• SQL Compatibility: Follows standard SQL syntax for seamless use.

• Low Memory Footprint: Designed for memory efficiency.

• Open Source: Freely available and modifiable.
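The sketch below illustrates SQLite's zero-configuration, file-based style using Python's built-in sqlite3 module; the file and table names are illustrative assumptions, not taken from the project.

import sqlite3

# The database is just a single file, created on first use.
conn = sqlite3.connect('cyber_results.db')
cur = conn.cursor()

# Standard SQL syntax works as-is; changes are wrapped in a transaction.
cur.execute("CREATE TABLE IF NOT EXISTS predictions (label TEXT, score REAL)")
cur.execute("INSERT INTO predictions VALUES (?, ?)", ("Phishing", 0.97))
conn.commit()

for row in cur.execute("SELECT label, score FROM predictions"):
    print(row)

conn.close()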

SYSTEM ANALYSIS

5.SYSTEM ANALYSIS

5.1 Existing System

The use of invariants in developing security mechanisms has become an attractive research
area because of their potential to both prevent attacks and detect attacks in Cyber-Physical
Systems (CPS). In general, an invariant is a property that is expressed using design parameters
along with Boolean operators and which always holds in normal operation of a system, in
particular, a CPS. Invariants can be derived by analysing operational data of various design
parameters in a running CPS, or by analysing the system’s requirements/design documents,
with both of the approaches demonstrating significant potential to detect and prevent cyber-
attacks on a CPS. While data-driven invariant generation can be fully automated, design-driven
invariant generation has a substantial manual intervention. In this paper, we aim to highlight
the shortcomings in data-driven invariants by demonstrating a set of adversarial attacks on such
invariants. We propose a solution strategy to detect such attacks by complementing them with
design-driven invariants. We perform all our experiments on a real water treatment testbed. We
shall demonstrate that our approach can significantly reduce false positives and achieve high
accuracy in attack detection on CPSs.

Disadvantages:

• Higher time complexity for implementation process.


• Complexity and usability.
• Accuracy was low.
• Limited scalability.

5.2 Characteristics of Proposed System

• Enhanced Efficiency: The proposed system is designed to improve overall operational efficiency by streamlining processes and reducing manual intervention.

• User-Friendly Interface: A user-friendly interface ensures ease of use for all stakeholders, promoting accessibility and reducing the learning curve.

• Scalability: The system is scalable to accommodate future growth or changes in user requirements, ensuring adaptability to evolving needs.

• Robust Security Measures: Implementation of robust security features safeguards sensitive data and protects against potential cyber threats, ensuring data integrity and user privacy.

• Integration Capabilities: The proposed system is capable of integrating seamlessly with existing systems and technologies, promoting interoperability and minimizing disruptions.

• Reliability and Stability: The system is designed for reliability, minimizing downtime and ensuring stable performance under varying conditions.

• Data Accuracy and Consistency: Measures are in place to ensure the accuracy and consistency of data through validation and verification processes.

• Real-time Reporting and Analytics: The system provides real-time reporting and analytics capabilities, empowering users with timely and meaningful insights for decision-making.

• Audit Trail and Traceability: Comprehensive audit trails are implemented to track system activities, ensuring traceability and accountability for all transactions.

• Adherence to Regulatory Standards: The proposed system complies with relevant regulatory standards and industry best practices, ensuring legal and ethical integrity.

• Flexibility and Customization: The system is flexible, allowing for customization to meet specific organizational needs and evolving business requirements.

• Automated Workflows: Automation of key workflows reduces manual tasks, minimizing errors and improving overall process efficiency.

• Collaborative Features: Collaboration tools and features are integrated to enhance communication and teamwork among users.

• Regular Updates and Maintenance: A systematic approach to updates and maintenance ensures the system's longevity and responsiveness to changing technological landscapes.

• User Training and Support: Provision of adequate training resources and ongoing support to users ensures effective utilization of the system and addresses any issues promptly.

These characteristics collectively contribute to the effectiveness and success of the proposed system, aligning it with the organization's goals and requirements.

SYSTEM DESIGN

6.SYSTEM DESIGN

6.1 Data visualization

Data visualization is an important skill in applied statistics and machine learning. Statistics does indeed focus on quantitative descriptions and estimations of data. Data visualization provides an important suite of tools for gaining a qualitative understanding. This can be helpful when exploring and getting to know a dataset and can help with identifying patterns, corrupt data, outliers, and much more. With a little domain knowledge, data visualizations can be used to express and demonstrate key relationships in plots and charts that are more visceral to stakeholders than measures of association or significance. Data visualization and exploratory data analysis are whole fields in themselves, and a deeper dive into some of the books mentioned at the end is recommended.

Sometimes data does not make sense until it can be looked at in a visual form, such as with charts and plots. Being able to quickly visualize data samples is an important skill both in applied statistics and in applied machine learning. This section covers the types of plots needed when visualizing data in Python and how to use them to better understand your own data:

➢ How to chart time series data with line plots and categorical quantities with bar charts.
➢ How to summarize data distributions with histograms and box plots.
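As a small, hedged illustration of the plot types listed above, the sketch below uses synthetic data rather than the project's CYBER.csv so that it runs on its own.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
series = pd.Series(rng.normal(100, 15, 200),
                   index=pd.date_range('2024-01-01', periods=200))
counts = pd.Series(rng.integers(10, 50, 4),
                   index=['DDoS', 'Phishing', 'Malware', 'Benign'])

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(series)                      # line plot for time series data
axes[0, 1].bar(counts.index, counts.values)  # bar chart for categorical quantities
axes[1, 0].hist(series, bins=20)             # histogram of a distribution
axes[1, 1].boxplot(series)                   # box plot of the same distribution
plt.tight_layout()
plt.show()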

MODULE DIAGRAM

Given input: data
Expected output: visualized data

6.2 Use Case Diagram

Use case diagrams are considered for high-level requirement analysis of a system. When the requirements of a system are analyzed, the functionalities are captured in use cases. It can therefore be said that use cases are nothing but the system functionalities written in an organized manner.

6.3 Entity Relationship Diagram

An entity relationship diagram (ERD), also known as an entity relationship model, is a


graphical representation of an information system that depicts the relationships among people,
objects, places, concepts or events within that system. An ERD is a data modeling technique
that can help define business processes and be used as the foundation for a relational database.
Entity relationship diagrams provide a visual starting point for database design that can also be
used to help determine information system requirements throughout an organization. After a
relational database is rolled out, an ERD can still serve as a referral point, should any debugging
or business process re-engineering be needed later.

PROGRAM CODING

7.PROGRAM CODING

7.1 Source Code

DATA PREPROCESSING AND DATA CLEANING

# Import the necessary libraries.

import pandas as pd

import numpy as np

# Avoid unnecessary warnings, (EX: software updates, version mismatch, and so on.)

import warnings

warnings.filterwarnings('ignore')

# Load the datasets

df=pd.read_csv('CYBER.csv')

# Check the top5 values

df.head()

# Check the bottom five values.

df.tail()

# Check the dimension of our datasets

df.shape

# Check the dataset size

df.size

# Check the columns of dataset

df.columns

# To know the information of our dataset

df.info()

# Check the unique values of a specific column

df['Label'].unique()

# Transform the columns value(ex: int to str, str to int) for classification purpose.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

var = ['Label']

for i in var:
    df[i] = le.fit_transform(df[i]).astype(int)

# Check the value is null or notnull

df.isnull().head()

# Remove the null value

df = df.dropna()

# Describe the dataset from a statistical point of view
df.describe()

# Check the relation between each individual columns

df.corr().head()

# Check the events for specific columns

pd.crosstab(df["'Tot Fwd Pkts'"], df["'Tot Bwd Pkts'"]).head()

# Group the data by specific columns

df.groupby(["'Flow Byts/s'","'Pkt Len Std'"]).groups

# Check the value counts for specific columns

df["Label"].value_counts()

# Check the categorical distribution of a specific column

pd.Categorical(df["'Idle Min'"]).describe()

# Check if the value is duplicated or not

df.duplicated()

# Calculate the total number of duplicated values

sum(df.duplicated())

# Remove the duplicate values

df=df.drop_duplicates()

# Calculate the total number of duplicated values

sum(df.duplicated())

DATA VISUALIZATION AND DATA ANALYSIS

# Import the necessary libraries.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# Avoid unnecessary warnings, (EX: software updates, version mismatch, and so on.)

import warnings

warnings.filterwarnings('ignore')

# Load the datasets

df=pd.read_csv('CYBER.csv')

# Check the top5 values

df.head()

# Remove the null value

df = df.dropna()

# Remove the duplicate values

df=df.drop_duplicates()

# Transform the columns value(ex: int to str, str to int) for classification purpose.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

var = ['Label']

for i in var:
    df[i] = le.fit_transform(df[i]).astype(int)

# Check whether the data is balanced or imbalanced; that is why we use a count plot.

plt.figure(figsize=(12,7))

sns.countplot(x='Label',data=df)

# Plot a Histogram

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)

plt.hist(df["'Flow Duration'"],color='red')

plt.subplot(1,2,2)

plt.hist(df["'Active Std'"],color='blue')

# Check how many columns are in datasets

df.columns

# Plot a Histogram.

df.hist(figsize=(15,55), color='green')

plt.show()

# Plot a Histogram

df["'Pkt Len Mean'"].hist(figsize=(10,5),color='yellow',bins=25)

# Check for outliers in our dataset.

plt.boxplot(df["'Pkt Size Avg'"])

# Plot a density plot

df["'Pkt Len Mean'"].plot(kind='density')

# Plot a distribution plot

sns.displot(df["'Bwd Pkt Len Mean'"], color='purple')

# Other options: barplot, boxenplot, boxplot, countplot, displot, distplot, ecdfplot, histplot, kdeplot, pointplot, violinplot, stripplot

# Plot a distribution plot.

sns.displot(df["'Pkt Len Mean'"], color='coral') # residplot, scatterplot

# Plot a heat map of the correlations between each pair of columns.

fig, ax = plt.subplots(figsize=(20,15))

sns.heatmap(df.corr(),annot = True, fmt='0.2%',cmap = 'autumn',ax=ax)

# Plot a Piechart

def plot(df, variable):

dataframe_pie = df[variable].value_counts()

ax = dataframe_pie.plot.pie(figsize=(9,9), autopct='%1.2f%%', fontsize = 10)

ax.set_title(variable + ' \n', fontsize = 10)

return np.round(dataframe_pie/df.shape[0]*100,2)

plot(df, 'Label')

GaussianNB CLASSIFIER ALGORITHM

# Import the necessary libraries.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# Avoid unnecessary warnings, (EX: software updates, version mismatch, and so on.)

import warnings

warnings.filterwarnings('ignore')

# Load the datasets

df=pd.read_csv('CYBER.csv')

# Check the top5 values

df.head()

del df["'TotLen Fwd Pkts'"]

del df["'TotLen Bwd Pkts'"]

del df["'Fwd Pkt Len Max'"]

del df["'Fwd Pkt Len Min'"]

del df["'Fwd Pkt Len Mean'"]

del df["'Fwd Pkt Len Std'"]

del df["'Bwd Pkt Len Max'"]

del df["'Bwd Pkt Len Mean'"]

del df["'Idle Std'"]

del df["'Flow Byts/s'"]

del df["'Flow IAT Std'"]

del df["'Flow IAT Min'"]

del df["'Pkt Len Max'"]

del df["'Bwd Pkt Len Min'"]

del df["'Flow IAT Max'"]

del df["'Fwd IAT Max'"]

del df["'Fwd IAT Min'"]

del df["'Bwd IAT Std'"]

del df["'Bwd IAT Max'"]

del df["'Fwd IAT Std'"]

del df["'Bwd IAT Min'"]

del df["'Bwd PSH Flags'"]

del df["'Bwd URG Flags'"]

del df["'Pkt Len Min'"]

del df["'Pkt Len Std'"]

del df["'Pkt Len Var'"]

del df["'FIN Flag Cnt'"]

del df["'RST Flag Cnt'"]

del df["'PSH Flag Cnt'"]

del df["'ACK Flag Cnt'"]

del df["'URG Flag Cnt'"]

del df["'CWE Flag Count'"]

# Remove the null value

df=df.dropna()

# Check the columns of dataset

df.columns

# Transform the columns value(ex: int to str, str to int) for classification purpose.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

var = ['Label']

for i in var:
    df[i] = le.fit_transform(df[i]).astype(int)

# Check the top5 values

df.head()

# Remove the duplicate values

df=df.drop_duplicates()

# Split the dataset into dependent and independent variables

# X is the independent variable (input features)

x1 = df.drop(labels='Label', axis=1)

# Y is the dependent variable (target variable)

y1 = df.loc[:,'Label']

# This process is executed to balance the dataset classes.

import imblearn

from imblearn.over_sampling import RandomOverSampler

from collections import Counter

ros =RandomOverSampler(random_state=42)

x,y=ros.fit_resample(x1,y1)

print("OUR DATASET COUNT : ", Counter(y1))

print("OVER SAMPLING DATA COUNT : ", Counter(y))

# Split the dataset into two parts: training and testing sets.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42, stratify=y)

print("NUMBER OF TRAIN DATASET : ", len(x_train))

print("NUMBER OF TEST DATASET : ", len(x_test))

print("TOTAL NUMBER OF DATASET : ", len(x_train)+len(x_test))

print("NUMBER OF TRAIN DATASET : ", len(y_train))

print("NUMBER OF TEST DATASET : ", len(y_test))

print("TOTAL NUMBER OF DATASET : ", len(y_train)+len(y_test))

# Implement Gaussian naive bayes algorithm learning patterns

from sklearn.naive_bayes import GaussianNB

GNB = GaussianNB()

# Fit is the training function for this algorithm.

GNB.fit(x_train,y_train)

# Predict is the test function for this algorithm

predicted = GNB.predict(x_test)

# Check classification report for this algorithm

from sklearn.metrics import classification_report

cr = classification_report(y_test,predicted)

print('THE CLASSIFICATION REPORT OF GAUSSIANNB CLASSIFIER:\n\n',cr)

# Check the confusion matrix for this algorithms

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,predicted)

print('THE CONFUSION MATRIX SCORE OF GAUSSIANNB CLASSIFIER:\n\n\n',cm)

# Check the cross value score of this algorithm.

from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(GNB, x, y, scoring='accuracy')

print('THE CROSS VALIDATION TEST RESULT OF ACCURACY :\n\n\n', accuracy*100)

# Check the accuracy score of this algorithms.

from sklearn.metrics import accuracy_score

a = accuracy_score(y_test,predicted)

print("THE ACCURACY SCORE OF GAUSSIANNB CLASSIFIER IS :",a*100)

# Check the hamming loss of this algorithm.

from sklearn.metrics import hamming_loss

hl = hamming_loss(y_test,predicted)

print("THE HAMMING LOSS OF GAUSSIANNB CLASSIFIER IS :",hl*100)

# Plot a confusion matrix for this algorithm.

def plot_confusion_matrix(cm, title='THE CONFUSION MATRIX SCORE OF GAUSSIANNB CLASSIFIER\n\n', cmap=plt.cm.cool):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

cm1 = confusion_matrix(y_test, predicted)

print('THE CONFUSION MATRIX SCORE OF GAUSSIANNB CLASSIFIER:\n\n')

print(cm)

plot_confusion_matrix(cm)

# Plot the worm plot for this model.

import matplotlib.pyplot as plt

df2 = pd.DataFrame()

df2["y_test"] = y_test

df2["predicted"] = predicted

df2.reset_index(inplace=True)

plt.figure(figsize=(20, 5))

plt.plot(df2["predicted"][:100], marker='x', linestyle='dashed', color='red')

plt.plot(df2["y_test"][:100], marker='o', linestyle='dashed', color='green')

plt.show()

ADABOOST CLASSIFIER ALGORITHM
# Import the necessary libraries.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# Avoid unnecessary warnings, (EX: software updates, version mismatch, and so on.)

import warnings

warnings.filterwarnings('ignore')

# Load the datasets

df=pd.read_csv('CYBER.csv')

# Check the top5 values

df.head()

del df["'TotLen Fwd Pkts'"]

del df["'TotLen Bwd Pkts'"]

del df["'Fwd Pkt Len Max'"]

del df["'Fwd Pkt Len Min'"]

del df["'Fwd Pkt Len Mean'"]

del df["'Fwd Pkt Len Std'"]

del df["'Bwd Pkt Len Max'"]

del df["'Bwd Pkt Len Mean'"]

del df["'Idle Std'"]

del df["'Flow Byts/s'"]

del df["'Flow IAT Std'"]

del df["'Flow IAT Min'"]

del df["'Pkt Len Max'"]

del df["'Bwd Pkt Len Min'"]

del df["'Flow IAT Max'"]

del df["'Fwd IAT Max'"]

del df["'Fwd IAT Min'"]

del df["'Bwd IAT Std'"]

del df["'Bwd IAT Max'"]

del df["'Fwd IAT Std'"]

del df["'Bwd IAT Min'"]

del df["'Bwd PSH Flags'"]

del df["'Bwd URG Flags'"]

del df["'Pkt Len Min'"]

del df["'Pkt Len Std'"]

del df["'Pkt Len Var'"]

del df["'FIN Flag Cnt'"]

del df["'RST Flag Cnt'"]

del df["'PSH Flag Cnt'"]

del df["'ACK Flag Cnt'"]

del df["'URG Flag Cnt'"]

del df["'CWE Flag Count'"]

# Remove the null value

df=df.dropna()

# Check the columns of dataset

df.columns

# Transform the columns value(ex: int to str, str to int) for classification purpose.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

var = ['Label']

for i in var:
    df[i] = le.fit_transform(df[i]).astype(int)

# Check the top5 values

df.head()

# Remove the duplicate values

df=df.drop_duplicates()

# Split the dataset into dependent and independent variables

# X is the independent variable (input features)

x1 = df.drop(labels='Label', axis=1)

# Y is the dependent variable (target variable)

y1 = df.loc[:,'Label']

# This process is executed to balance the dataset classes.

import imblearn

from imblearn.over_sampling import RandomOverSampler

from collections import Counter

ros =RandomOverSampler(random_state=42)

x,y=ros.fit_resample(x1,y1)

print("OUR DATASET COUNT : ", Counter(y1))

print("OVER SAMPLING DATA COUNT : ", Counter(y))

# Split the dataset into two parts: training and testing sets.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42, stratify=y)

print("NUMBER OF TRAIN DATASET : ", len(x_train))

print("NUMBER OF TEST DATASET : ", len(x_test))

print("TOTAL NUMBER OF DATASET : ", len(x_train)+len(x_test))

print("NUMBER OF TRAIN DATASET : ", len(y_train))

print("NUMBER OF TEST DATASET : ", len(y_test))

print("TOTAL NUMBER OF DATASET : ", len(y_train)+len(y_test))

# Implement Adaboost classifier algorithm learning patterns

from sklearn.ensemble import AdaBoostClassifier

ABC = AdaBoostClassifier()

# Fit is the training function for this algorithm.

ABC.fit(x_train,y_train)

# Predict is the test function for this algorithm

predicted = ABC.predict(x_test)

# Check classification report for this algorithm

from sklearn.metrics import classification_report

cr = classification_report(y_test,predicted)

print('THE CLASSIFICATION REPORT OF ADABOOST CLASSIFIER:\n\n',cr)

# Check the confusion matrix for this algorithms.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,predicted)

print('THE CONFUSION MATRIX SCORE OF ADABOOST CLASSIFIER:\n\n\n',cm)

# Check the cross value score of this algorithm.

from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(ABC, x, y, scoring='accuracy')

print('THE CROSS VALIDATION TEST RESULT OF ACCURACY :\n\n\n', accuracy*100)

# Check the accuracy score of this algorithms.

from sklearn.metrics import accuracy_score

a = accuracy_score(y_test,predicted)

print("THE ACCURACY SCORE OF ADABOOST CLASSIFIER IS :",a*100)

# Check the hamming loss of this algorithm.

from sklearn.metrics import hamming_loss

hl = hamming_loss(y_test,predicted)

print("THE HAMMING LOSS OF ADABOOST CLASSIFIER IS :",hl*100)

# Plot a confusion matrix for this algorithm.

def plot_confusion_matrix(cm, title='THE CONFUSION MATRIX SCORE OF ADABOOST CLASSIFIER\n\n', cmap=plt.cm.cool):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

cm1 = confusion_matrix(y_test, predicted)

print('THE CONFUSION MATRIX SCORE OF ADABOOST CLASSIFIER:\n\n')

print(cm)

plot_confusion_matrix(cm)

# Plot the worm plot for this model.

import matplotlib.pyplot as plt

df2 = pd.DataFrame()

df2["y_test"] = y_test

df2["predicted"] = predicted

df2.reset_index(inplace=True)

plt.figure(figsize=(20, 5))

plt.plot(df2["predicted"][:100], marker='x', linestyle='dashed', color='red')

plt.plot(df2["y_test"][:100], marker='o', linestyle='dashed', color='green')

plt.show()

CAT BOOST CLASSIFIER ALGORITHM

# Import the necessary libraries.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# Avoid unnecessary warnings, (EX: software updates, version mismatch, and so on.)

import warnings

warnings.filterwarnings('ignore')

# Load the datasets

df=pd.read_csv('CYBER.csv')

del df["'TotLen Fwd Pkts'"]

del df["'TotLen Bwd Pkts'"]

del df["'Fwd Pkt Len Max'"]

del df["'Fwd Pkt Len Min'"]

del df["'Fwd Pkt Len Mean'"]

del df["'Fwd Pkt Len Std'"]

del df["'Bwd Pkt Len Max'"]

del df["'Bwd Pkt Len Mean'"]

del df["'Idle Std'"]

del df["'Flow Byts/s'"]

del df["'Flow IAT Std'"]

del df["'Flow IAT Min'"]

del df["'Pkt Len Max'"]

del df["'Bwd Pkt Len Min'"]

del df["'Flow IAT Max'"]

del df["'Fwd IAT Max'"]

del df["'Fwd IAT Min'"]

del df["'Bwd IAT Std'"]

del df["'Bwd IAT Max'"]

del df["'Fwd IAT Std'"]

del df["'Bwd IAT Min'"]

del df["'Bwd PSH Flags'"]

del df["'Bwd URG Flags'"]

del df["'Pkt Len Min'"]

del df["'Pkt Len Std'"]

del df["'Pkt Len Var'"]

del df["'FIN Flag Cnt'"]

del df["'RST Flag Cnt'"]

del df["'PSH Flag Cnt'"]

del df["'ACK Flag Cnt'"]

del df["'URG Flag Cnt'"]

del df["'CWE Flag Count'"]

# Check the columns of dataset

df.columns

# Check the top5 values

df.head()

# Remove the null value

df=df.dropna()

df['Label'].value_counts()

# Transform the columns value(ex: int to str, str to int) for classification purpose.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

var = ['Label']

for i in var:
    df[i] = le.fit_transform(df[i]).astype(int)

df['Label'].value_counts()

# Check the top5 values

df.head()

# Remove the duplicate values

df=df.drop_duplicates()

# Split the dataset into dependent and independent variables

# X is the independent variable (input features)

x1 = df.drop(labels='Label', axis=1)

# Y is the dependent variable (target variable)

y1 = df.loc[:,'Label']

# This process is executed to balance the dataset classes.
import imblearn

from imblearn.over_sampling import RandomOverSampler

from collections import Counter

ros =RandomOverSampler(random_state=42)

x,y=ros.fit_resample(x1,y1)

print("OUR DATASET COUNT : ", Counter(y1))

print("OVER SAMPLING DATA COUNT : ", Counter(y))

# Split the dataset into two parts: training and testing sets.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42, stratify=y)

print("NUMBER OF TRAIN DATASET : ", len(x_train))

print("NUMBER OF TEST DATASET : ", len(x_test))

print("TOTAL NUMBER OF DATASET : ", len(x_train)+len(x_test))

print("NUMBER OF TRAIN DATASET : ", len(y_train))

print("NUMBER OF TEST DATASET : ", len(y_test))

print("TOTAL NUMBER OF DATASET : ", len(y_train)+len(y_test))

# Implement Catboost classifier algorithm learning patterns

from catboost import CatBoostClassifier

CBC = CatBoostClassifier()

# Fit is the training function for this algorithm.

CBC.fit(x_train,y_train)

# Predict is the test function for this algorithm

predicted = CBC.predict(x_test)

# Check classification report for this algorithm

from sklearn.metrics import classification_report

cr = classification_report(y_test,predicted)

print('THE CLASSIFICATION REPORT OF CAT BOOST CLASSIFIER:\n\n',cr)

# Check the confusion matrix for this algorithms.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,predicted)

print('THE CONFUSION MATRIX SCORE OF CAT BOOST CLASSIFIER:\n\n\n',cm)

# Check the cross value score of this algorithm.

from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(CBC, x, y, scoring='accuracy')

print('THE CROSS VALIDATION TEST RESULT OF ACCURACY :\n\n\n', accuracy*100)

# Check the accuracy score of this algorithms.

from sklearn.metrics import accuracy_score

a = accuracy_score(y_test,predicted)

print("THE ACCURACY SCORE OF CAT BOOST CLASSIFIER IS :",a*100)

# Check the hamming loss of this algorithm.

from sklearn.metrics import hamming_loss

hl = hamming_loss(y_test,predicted)

print("THE HAMMING LOSS OF CAT BOOST CLASSIFIER IS :",hl*100)

# Plot a confusion matrix for this algorithm.

def plot_confusion_matrix(cm, title='THE CONFUSION MATRIX SCORE OF CAT BOOST CLASSIFIER\n\n', cmap=plt.cm.cool):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

cm1 = confusion_matrix(y_test, predicted)

print('THE CONFUSION MATRIX SCORE OF CAT BOOST CLASSIFIER:\n\n')

print(cm)

plot_confusion_matrix(cm)

# Plot the worm plot for this model.

import matplotlib.pyplot as plt

df2 = pd.DataFrame()

df2["y_test"] = y_test

df2["predicted"] = predicted

df2.reset_index(inplace=True)

plt.figure(figsize=(20, 5))

plt.plot(df2["predicted"][:100], marker='x', linestyle='dashed', color='red')

plt.plot(df2["y_test"][:100], marker='o', linestyle='dashed', color='green')

plt.show()

# Save the trained CatBoost model with joblib

import joblib

joblib.dump(CBC, 'cyber1.pkl')
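A short, hedged sketch of how the saved model could be reused at deployment time is shown below; loading with joblib mirrors the dump above, and the sample rows are simply the first few test flows.

# Load the saved model and classify a few unseen flows (deployment-time sketch).

model = joblib.load('cyber1.pkl')

sample = x_test.iloc[:5]

print(model.predict(sample))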

7.2 SCREENSHOTS

(Screenshots of the application output appear here in the original report.)
TESTING

8.TESTING

8.1 Software Testing

The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests, and each test type addresses a specific testing requirement.

8.2 TYPES OF TESTING

• Unit Testing
• White-box Testing
• Black-box Testing
• Validation Testing
• Backend Testing

Unit Testing: Unit testing involves the design of test cases that validate that the internal program logic is functioning properly, and that program input produces valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application; it is done after the completion of an individual unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
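As a hedged sketch of what such a unit test could look like for the classification pipeline (using Python's built-in unittest module and a tiny synthetic dataset rather than the project's CYBER.csv), consider:

import unittest
import numpy as np
from sklearn.naive_bayes import GaussianNB

class TestClassifierUnit(unittest.TestCase):
    def setUp(self):
        # Two well-separated synthetic classes, so the model should learn them easily.
        rng = np.random.default_rng(0)
        self.x = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])
        self.y = np.array([0] * 20 + [1] * 20)
        self.model = GaussianNB().fit(self.x, self.y)

    def test_prediction_shape(self):
        # Valid input produces exactly one prediction per row.
        self.assertEqual(len(self.model.predict(self.x)), len(self.y))

    def test_known_sample(self):
        # A point near the second cluster should be assigned class 1.
        self.assertEqual(self.model.predict([[5.0, 5.0, 5.0]])[0], 1)

if __name__ == '__main__':
    unittest.main()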

White-box Testing: It is a test case design method that uses the control structure of the procedural design to derive test cases. Using white-box testing methods, it was guaranteed that most of the independent paths within modules had been exercised at least once, all logical decisions had been exercised on their true and false sides, all loops had been executed at their boundaries, and internal data structures had been exercised to ensure their data validity. White-box testing has been done to achieve the following objectives. Logic errors and incorrect assumptions are inversely proportional to the probability that a program path will be executed. Errors tend to creep into the work when we design and implement functions, conditions or control flow that are out of the mainstream. We often believe that a logical path is not likely to be executed when in fact it may be executed on a regular basis. When a program is translated into programming-language source code, it is likely that some typing errors will occur. Many will be uncovered by syntax and type-checking mechanisms, but others may go undetected until testing begins.

Black-box Testing: Although tests are designed to uncover errors, they are also used to demonstrate that the software functions are operational, that input is properly accepted and output is correctly produced, and that the integrity of external information is maintained. A black-box test examines some fundamental aspects of a system with little regard for the internal logical structure of the software. All input screens were thoroughly tested for data validity and smoothness of data-entry operations. Test cases were formulated to verify whether the system works properly in rare conditions also. Error conditions were checked. Data-entry operations are to be user-friendly and smooth; it would be easier for the operators if they could enter data through the keyboard only.

Validation Testing: Validation testing can be defined in many ways, but a simple definition is that validation succeeds when the software functions in a manner that can be reasonably expected by the customer. After validation tests have been conducted, one of two possible conditions exists: either the function or performance characteristics are acceptable and conform to specification, or a deviation from specification is uncovered and a deficiency list is created. System validation checks the quality of the software in both simulated and live environments. First the software goes through a phase in which errors and failures based on simulated user requirements are verified and studied.

Back-end Testing: Whenever an input or data is entered in a front-end application, it is stored in the database, and the testing of such a database is known as Database Testing or Backend Testing. There are different databases, such as SQL Server and MySQL. Database testing involves testing of table structure, schema, stored procedures, data structure and so on.

Functional Testing: Functional tests provide systematic demonstrations that the functions tested are available as specified by the business and technical requirements, system documentation, and user manuals. Functional testing is centered on the following items: 1. Valid Input: identified classes of valid input must be accepted. 2. Invalid Input: identified classes of invalid input must be rejected. 3. Functions: identified functions must be exercised. 4. Output: identified classes of application outputs must be exercised.

CONCLUSION

9. CONCLUSION

The analytical process started from data cleaning and processing, handling of missing values, and exploratory analysis, and finally moved to model building and evaluation. The algorithm with the highest accuracy score on the public test set is identified, and the selected model is used in the application, which can help to find the type of cyber-attack.

FUTURE WORK

• Deploying the project in the cloud.

• To optimize the work for implementation in IoT systems.

REFERENCE

10. REFERENCE

1. HTML: MDN Web Docs.


https://developer.mozilla.org/en-US/docs/Web/HTML

2. CSS: MDN Web Docs.


https://developer.mozilla.org/en-US/docs/Web/CSS

3. JavaScript: MDN Web Docs.


https://developer.mozilla.org/en-US/docs/Web/JavaScript

4. Bootstrap: Bootstrap Documentation.
https://getbootstrap.com/docs/4.0/getting-started/introduction/

5. Python: Python Software Foundation.


https://www.python.org/doc/

6. Django: Django Documentation.


https://docs.djangoproject.com/

7. MySQL: MySQL Documentation.


https://dev.mysql.com/doc/

8. SQLite: SQLite Documentation.


https://www.sqlite.org/docs.html
