
Measurement: Sensors 31 (2024) 101003

Contents lists available at ScienceDirect

Measurement: Sensors
journal homepage: www.sciencedirect.com/journal/measurement-sensors

Intrusion detection based on phishing detection with machine learning


R. Jayaraj a,*, A. Pushpalatha b, K. Sangeetha c, T. Kamaleshwar d, S. Udhaya Shree e, Deepa Damodaran f

a Data Science and Business Systems, School of Computing, SRM Institute of Science and Technology, Kattankulathur, Chennai, TN, India
b M.Tech Computer Science and Engineering, Sri Krishna College of Engineering and Technology, Coimbatore, TN, India
c Department of Computer Science and Engineering, Panimalar Engineering College, Chennai, Tamil Nadu, India
d Department of Computer Science and Engineering, Vel Tech Dr. Rangarajan Dr.Sagunthala R&D Institute of Science and Technology, Chennai, TN, India
e Department of Computer Science and Engineering, Alpha College of Engineering and Technology, Puducherry, India
f VITBS, Vellore Institute of Technology, Chennai Campus, TN, India

ARTICLE INFO

Keywords: Machine learning, Cyber attack, Phishing detection, Intrusion detection, CDF-g

ABSTRACT

Deep learning is a machine learning technique that uses artificial neural networks to learn representations. Phishing is a form of fraud in which the attacker tries to obtain credential information through websites. Web phishing aims to steal sensitive information such as usernames, passwords and credit card details by impersonating an authorized entity. Hybrid Ensemble Feature Selection (HEFS) is a new feature selection method for machine learning-based phishing detection systems. The first step of HEFS uses a novel Cumulative Distribution Function gradient (CDF-g) algorithm to generate primary feature subsets, which are then fed into a data perturbation ensemble to generate secondary feature subsets. We present the results of our approach and compare them to a few previous studies. The paper focuses primarily on phishing URLs and on detecting unauthorised ones with a phishing detection method.

1. Introduction

Network-based computer systems play an important role in modern society, but they are vulnerable to attacks from adversaries. Intrusion prevention measures include user authentication, authorization, encryption, and safe programming. Intrusion detection is a method for securing computer systems [1]. Misuse detection and anomaly detection are the two key intrusion detection methods. For example, STAT and IDIOT use patterns of assumed weak points in the system to match and detect known intrusions [2]. For instance, a signature rule can flag an attacker who is guessing a password and makes more than four failed login attempts within 2 min. Because no patterns or matching rules are available for them, misuse detection techniques are not robust against novel attacks [3] (see Table 1, Figs. 1–7).

The feature learning task is completely unsupervised when a sparse autoencoder is used, and a sparse autoencoder-based deep learning model has recently been applied to network traffic recognition in a hybrid intrusion detection system built on a deep belief network (DBN). The system provides improved IDS classification accuracy. The main goal of this project is to design computer network security using a DBN. Classification performance is analysed in terms of the detection rate for known and unknown attacks with the lowest number of false alarms. The system detects attacks and prevents them as well []. The paper covers the most common types of attacks as well as the types of attackers that Intrusion Detection Systems encounter.

Intrusion detection systems (IDS) are designed to detect attacks that can emerge from the internet or a local network and cause harm to network systems, and they analyse different packets and data to ensure data security. Their primary goal is to detect attacks and, if needed, to prevent them. Intrusion Detection Systems may also gather factual data on the most frequent types of attacks as well as their victims [4]. Nodes between the hidden layers are no longer connectionless: they are connected not only to the output of the input layer but also to the output of the last hidden layer, which operates on the hidden layer's input.

Cyber security encompasses both network and host security systems, which employ defences such as firewalls, software, and antivirus alongside intrusion detection systems to keep intruders out [5]. A network-based

* Corresponding author.
E-mail addresses: jayarajr1@srmist.edu.in (R. Jayaraj), pushpalathaa@skcet.ac.in (A. Pushpalatha), sangeethakalyaniraman@gmail.com (K. Sangeetha),
kamalesh4u2@gmail.com (T. Kamaleshwar), gvhari03@gmail.com (S. Udhaya Shree), deepa.d@vit.ac.in (D. Damodaran).

https://doi.org/10.1016/j.measen.2023.101003
Received 21 January 2023; Received in revised form 13 June 2023; Accepted 19 December 2023
Available online 21 December 2023
2665-9174/© 2023 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Table 1
Performance results.

Machine Learning Technique   Train Time (s)   Test Time (s)   Accuracy (%)   Sensitivity (%)   Specificity (%)   F-measure (%)
Decision Tree                0.021            0.009           96.5           96.1              97.83             96.4
Random Forest                0.436            0.034           96.4           97.6              96.4              96.8
k-NN                         0.112            0.38            95.8           93.7              95.9              95.8
Neural Networks              9.08             0.006           95.7           96.4              96.14             97.2
Proposed                     0.416            0.02            97.8           98.2              98.17             97.6
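
The metrics reported in Table 1 can be derived from confusion-matrix counts. The sketch below assumes the standard definitions of sensitivity, specificity and F-measure (the paper itself only states the accuracy formula), and the counts used are hypothetical values chosen for illustration, not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity and F-measure in percent.

    tp, tn, fp, fn are confusion-matrix counts, with "positive" meaning
    a phishing URL. Standard textbook definitions are assumed.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    values = {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }
    return {name: round(100 * value, 2) for name, value in values.items()}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters for phishing detection because a classifier can reach high accuracy on an imbalanced dataset while still missing many phishing URLs.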

Fig. 1. Detecting phishing URL.

Fig. 2. Proposed feature selection framework.

IDS is located at the network's demilitarised zone, where it analyses network traffic in real time to detect unwanted intrusions or malicious attacks. The two hidden layers create an undirected associative memory, while the remaining hidden layer creates a directed acyclic graph that transforms the associative memory into measurable variables like picture pixels. Network intrusion detection systems are typically signature-based and rule-based controls that are deployed at systems to detect known threats [6].
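
A signature rule of the kind mentioned in the Introduction (more than four failed login attempts within 2 min) can be expressed as a sliding-window counter. The following is a minimal sketch; the class name, method name and data layout are hypothetical and are not taken from STAT, IDIOT or any other IDS.

```python
from collections import deque


class FailedLoginRule:
    """Flag a user who exceeds max_failures failed logins inside a time window."""

    def __init__(self, max_failures=4, window_seconds=120):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = {}  # user -> deque of failure timestamps (seconds)

    def record_failure(self, user, timestamp):
        """Record one failed login and return True if the rule fires."""
        attempts = self._failures.setdefault(user, deque())
        attempts.append(timestamp)
        # Drop failures that fall outside the sliding window.
        while attempts and timestamp - attempts[0] > self.window_seconds:
            attempts.popleft()
        return len(attempts) > self.max_failures


rule = FailedLoginRule()
# Five failures in 80 seconds: the fifth one exceeds the limit of four.
alerts = [rule.record_failure("alice", t) for t in (0, 20, 40, 60, 80)]
print(alerts)
```

Such a rule only matches behaviour that was anticipated when it was written, which is why misuse detection struggles with novel attacks.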
Organisations mainly deploy cyber security to protect their systems and databases. To measure the efficiency of the various techniques, the false alarm rate and detection rate are employed in addition to the most important performance metrics. Cyber security protects the data, hardware, and software of internet-connected systems from cyber attacks. Enterprises use both cyber security and physical security to protect their data centres from unauthorised users. Information security is a subset of cyber security which is intended to protect the confidentiality, integrity, and availability of data [7]. This research paper also integrates the proposed model into a single-board computer to build an effective phishing website sensor, demonstrating the feasibility of using resource-constrained computing devices as a phishing website detection sensor.

Fig. 3. Feature subset selection using cut-off rank.

Fig. 4. Analysis performance of CDF-g.

Fig. 5. Train/Test time.

Fig. 6. Accuracy rate analysis.

Fig. 7. Comparative analysis.

2. Related works

Phishing detection approaches are divided into two groups: list-based detection systems and deep learning-based systems. List-based phishing detection uses two lists, blacklists and whitelists. A whitelist-based detection system provides the information needed to secure legitimate websites [8]; if a website is not in the whitelist, it is considered a malicious link. One approach creates a whitelist by registering the IP address of each website with a login user interface that the user visits, and the system warns the user if it detects an incompatibility during this process. A blacklist is created from URL records of known phishing websites, with list entries derived from a number of sources. Because attackers use different techniques, learning-based detection systems are developed with training data that contains more features [9].

Zhang et al. developed CANTINA, a text-based phishing detection technique in which keywords are extracted and searched using the Google search engine. Feng et al. proposed a novel neural network-based classification method for detecting malicious webpages. Basnet performed a similar study in which only two common feature selection techniques, correlation-based feature selection (CFS) and the wrapper method, were evaluated. The authors tested two feature space searching techniques (i.e., genetic algorithm and greedy forward selection) on features originating from the website itself as well as from third-party sources such as search engines [10]. The performance of the feature subsets is assessed using Naive Bayes, Logistic Regression, and Random Forest classifiers. Compared with CFS, the wrapper approach achieves higher detection accuracies, which is consistent with the findings in [11]. According to the authors, however, the wrapper method is computationally more costly, preventing it from being a feasible solution in feature selection applications [12].

Qabajeh and Thabtah tested 47 features for phishing email identification using information gain (IG), Chi-Square, and CFS. The authors discovered a large drop in filter measure values between the 20th and 21st features for both the IG and Chi-Square rankings, and used this difference as the cut-off rank to pick the top 20 features as the reduced feature set [13]. However, it is unclear how to calculate the cut-off rank from a computational perspective. Further tests were conducted with 12 common features derived from intersecting the feature sets of IG, Chi-Square, and CFS, with an average accuracy decrease of just 0.28% compared with the full feature set. Their study thus demonstrated the importance of filter measures in reducing feature dimensionality while maintaining classification accuracy [14].

3. Proposed system

This paper is mainly focused on phishing attacks. As phishing attacks continue to become more sophisticated and persistent, demand for end-to-end phishing defence solutions is at an all-time high. The PDR platform provides a comprehensive approach to stopping phishing attacks through global crowd-sourced phishing intelligence from 25 million people combined with advanced automation [15–19]. List-based phishing website detection methods normally produce two URL lists: a whitelist and a blacklist. Anti-phishing companies use reports from companies to create one whitelist and one blacklist. The computing system uses the two lists to detect malicious websites. If a URL is present in the whitelist, it is treated as a user-trusted URL, and if it is present in the blacklist, it is recognised as a malicious URL.

Detecting phishing URLs remains an ongoing challenge for list-based methods. An unknown URL that is not in any list is very difficult to classify. If a new website uses such a malicious URL, it can potentially harm users. Attackers take advantage of this loophole by continually changing URLs to ensure that the new URLs of their phishing websites are not in the blacklist. To achieve reliable and accurate phishing detection, a phishing detection sensor is used, so that anti-phishing software does not need to be installed on every single computer. In an office or household, the designed sensor is placed between the devices and the router. The proposed model can also be implemented in the router directly owing to its computational efficiency. This paper implements the integration of the proposed deep learning methods using the support vector machine to detect phishing URLs with a phishing detection prototype sensor.

4. Feature selection framework

There are two major types of ensemble feature selection techniques in the field, namely function perturbation and data perturbation. In data perturbation, the same feature selection method is applied to multiple subsets of a dataset. In function perturbation, multiple feature selection methods are applied to the same set of data. Hybrid perturbation is the combination of both function perturbation and data perturbation. A number of studies suggest that an ensemble strategy improves classification performance and yields a stable feature subset. Reduced classifier complexity and the benefits of partitioning the dataset have also been explored.

A feature that is selected in all dataset partitions is the most likely to be truly predictive. A single data perturbation cycle therefore results in a secondary feature subset, where j indexes the J dataset partitions and k indexes the filter measures:

$$\tau_k = \frac{1}{J} \sum_{j=1}^{J} \tau_{j,k}$$

$$FS_k = \bigcap_{j=1}^{J} FS_{j,k}$$

Function perturbation then aggregates the inputs of the ensemble by union to obtain the best feature subset. In this way it leverages the intelligence of the different filter measures. It is less susceptible to overfitting, leading to the baseline features.

$$\text{Baseline feature set} = \bigcup_{k=1}^{K} FS_k$$

The existing feature cut-off rank identification method has limitations. To overcome them, the novel algorithm called CDF-g is used.

5. CDF-g concepts and definitions

This section describes the theoretical background of the cumulative distribution function and how it is needed in the proposed feature selection algorithm. Let X be a discrete random variable with a set of possible values, and let x be a particular value. The probability that the random variable X takes the value x is

$$P(X = x) \tag{1}$$

and the cumulative distribution function of X is

$$F_X(t) = P(X \le t) \tag{2}$$

The gradient is computed with central differences at interior points and one-sided (forward and backward) differences at the boundaries. The gradient G(r_i) is represented as

$$
G(r_i) =
\begin{cases}
\dfrac{F_X(t_{i+1}) - F_X(t_i)}{h}, & \text{if } i = 1 \\
\dfrac{F_X(t_{i+1}) - F_X(t_{i-1})}{2h}, & \text{if } 1 < i < n \\
\dfrac{F_X(t_i) - F_X(t_{i-1})}{h}, & \text{if } i = n
\end{cases}
$$

6. CDF-g algorithm - an automatic feature cut-off rank identifier

In machine learning-based phishing detection, selecting a subset of features reduces the feature space without compromising detection accuracy. Feature subset selection approaches that utilise filter measures depend crucially on the optimal cut-off rank. The cut-off rank is a specific threshold rank in an ordered list of filter measure values. Features located beyond the cut-off rank are considered irrelevant and are discarded. The cut-off rank is illustrated in Fig. 3, where the subset of selected features is surrounded by the dashed rectangle. Existing cut-off rank identification methods have been found to be unstable and not robust. A more flexible method is desirable, making it easier to decide the best cut-off rank. To overcome the limitations of current methods, a novel algorithm known as CDF-g has been developed.

6.1. Algorithm: Feature cut-off rank identification
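
The algorithm listing itself is not reproduced in this text, so the following is only a rough sketch of the ingredients described in Sections 5 and 6: an empirical CDF of filter measure values, its finite-difference gradient, and a cut-off derived from that gradient. The function names, the evenly spaced grid, and above all the selection rule (cut at the threshold where the CDF gradient is largest) are assumptions made for illustration and should not be read as the authors' CDF-g algorithm.

```python
import numpy as np


def cdf_gradient(values, n_points=50):
    """Evaluate the empirical CDF on an evenly spaced grid and its gradient."""
    values = np.asarray(values, dtype=float)
    if values.min() == values.max():
        raise ValueError("filter measure values must not all be equal")
    t = np.linspace(values.min(), values.max(), n_points)
    # F_X(t) = P(X <= t), estimated as the fraction of values <= t.
    cdf = np.searchsorted(np.sort(values), t, side="right") / values.size
    h = t[1] - t[0]
    # np.gradient uses central differences at interior points and
    # one-sided differences at the two boundaries.
    grad = np.gradient(cdf, h)
    return t, cdf, grad


def select_features(names, values):
    """Keep features whose filter measure value exceeds the cut-off threshold."""
    t, _, grad = cdf_gradient(values)
    # ASSUMPTION: the cut-off is placed where the CDF rises fastest, i.e.
    # where low-valued features are most densely concentrated.
    threshold = t[np.argmax(grad)]
    kept = [name for name, value in zip(names, values) if value > threshold]
    return threshold, kept


# Hypothetical filter measure values for ten features.
names = [f"f{i}" for i in range(10)]
scores = [0.9, 0.8, 0.75, 0.05, 0.04, 0.03, 0.02, 0.01, 0.01, 0.0]
t, cdf, grad = cdf_gradient(scores)
threshold, kept = select_features(names, scores)
print(threshold, kept)
```

The appeal of a gradient-based rule is that the cut-off is computed from the distribution of filter measure values instead of being picked by visual inspection of a ranked list.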


7. Results analysis and discussions

7.1. Dataset preparation

The webpages were collected using an automated Python script. The related resources, such as JavaScript, CSS, and images, are downloaded along with the complete HTML document so that every downloaded webpage renders properly in the browser. A screenshot of every webpage is stored for filtering and inspection.

7.2. Experimental setup

The experiments, in which filter measure values are calculated and classifiers are trained and tested, are performed using Weka. All classifiers are deployed with the default parameters specified by Weka. Other studies have looked into the importance of fine-tuning.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The phishing detection method is based entirely (100 %) on the URL. The system was tested on 3000 database records from credential websites and achieved a satisfactory recognition rate of 96.80 %. The methods applied in the system build on strong AI tools. The Hamming distance between the phishing website and its destination is computed, and five features are extracted from the URL.

8. Conclusion

Phishing detection is an important challenge that involves finding the attackers. Such attacks are effective and have caused billions of dollars in damage in past years. The thieves resell the stolen legitimate information on third-party secondary markets, and solutions are required even for problems that do not directly affect the economy. This research used URL features and introduced a CDF-g based phishing URL detection solution.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

[1] Tonguing Zhang, Winked Lee, Intrusion detection in wireless ad-hoc networks, in: Proceedings of the 6th Annual International Conference on Mobile Computing and Networking, 2000, pp. 275–283.
[2] Tuan A. Tang, Miami Lofty, McLennan Des, Shed Ali Raze Sadie, Gogh Moonie, Deep recurrent neural network for intrusion detection based networks, in: 2018 4th IEEE Conference on Network Softwarization and Workshops, IEEE, 2018, pp. 202–206.
[3] Stefano Zanier, Sergio M. Savers, Unsupervised learning techniques for an intrusion detection system, in: Proceedings of the 2004 ACM Symposium on Applied Computing, 2004, pp. 412–419.
[4] A. Ali, Selah Ahmed, Tamer Ramadan, Multilayer perceptions networks for an intelligent adaptive intrusion detection system, International Journal of Computer Science and Network Security 10 (2) (2010).
[5] Gazed Karats, Ongar Kory, Neural network based intrusion detection systems with different training functions, in: 2018 6th International Symposium on Digital Forensic and Security (ISDFS), IEEE, 2018, pp. 1–6.
[6] Chua long Yin, Jialing Fee, A deep learning approach for intrusion detection using recurrent neural networks, IEEE Access 5 (2017) 21954–21961.
[7] Mohamed Amine Farrago, Malarias Lindros, Deep learning for cyber security intrusion detection: approaches, datasets, and comparative study, J. Inf. Secur. Appl. 50 (2020) 102419.
[8] Horne, Kurt, Maxwell, Hilbert White, Multilayer feed forward networks are universal approximates, Neural Network. 2 (5) (1989) 359–366.
[9] Geoffrey E. Hinton, R. Roslyn, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[10] G. Xiang, J. Hong, C.P. Rose, L. Cranor, Cantina+ a feature-rich machine learning framework for detecting phishing web sites, ACM Trans. Inf. Syst. Secur. 14 (2) (2011) 1–28.
[11] H. Zuhair, A. Selamat, M. Salleh, The effect of feature selection on phish website detection: an empirical study on robust feature subset selection for effective classification, Int. J. Adv. Comput. Sci. Appl. 6 (10) (2016) 221–232.
[12] O. Kaynar, A.G. Yüksek, Y. Görmez, Y.E. Işik, Intrusion detection with autoencoder based deep learning machine, in: 2017 25th Signal Processing and Communications Applications Conference (SIU), IEEE, 2017, May, pp. 1–4.
[13] K. Alrawashdeh, C. Purdy, Toward an online anomaly intrusion detection system based on deep learning, in: 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2016, December, pp. 195–200.
[14] S. Ding, G. Wang, Research on intrusion detection technology based on deep learning, in: 2017 3rd IEEE International Conference on Computer and Communications (ICCC), IEEE, 2017, December, pp. 1474–1478.
[15] A.K. Jain, B.B. Gupta, A machine learning based approach for phishing detection using hyperlinks information, J. Ambient Intell. Hum. Comput. 10 (5) (2019) 2015–2028.
[16] A.K. Jain, B.B. Gupta, PHISH-SAFE: URL features-based phishing detection system using machine learning, in: Cyber Security, Springer, Singapore, 2018, pp. 467–474.
[17] O.K. Sahingoz, E. Buber, O. Demir, B. Diri, Machine learning based phishing detection from URLs, Expert Syst. Appl. 117 (2019) 345–357.
[18] K.L. Chiew, C.L. Tan, K. Wong, K.S. Yong, W.K. Tiong, A new hybrid ensemble feature selection framework for machine learning-based phishing detection system, Inf. Sci. 484 (2019) 153–166.
[19] F. Itoo, S. Singh, Comparison and analysis of logistic regression, Naïve Bayes and KNN machine learning algorithms for credit card fraud detection, Int. J. Inf. Technol. 13 (4) (2021) 1503–1511.
