Research Proposal


Knowledge based planning of implementation of an autonomous intelligent cyber defense agent: Towards intelligent collection and dissection of phishing scam contents from the Web

Ngnintedem Teukeng Jim carlson, Master Degree, Computer Science, University of Ngaoundéré,
Cameroon, jimcarl91@gmail.com

1. Introduction
Social engineering attacks are malicious activities carried out through human interaction to influence a
person psychologically and emotionally into disclosing confidential information or breaching security
measures [1], [2]. They are among the most dangerous threats worldwide: they rarely fail, and they
affect the global economy and privacy [3]. A social engineer goes through four
phases during an attack: collecting information about the victim, gaining the victim’s trust,
manipulating the victim emotionally into providing sensitive information, and leaving without any
trace [4], [5]. Social engineering attacks are either human-based or computer-based [6]. This work deals
with human-based attacks, in which the attacker performs the attack personally by interacting
with the target to collect the desired information. Such attacks are the most dangerous and successful
when they involve human interaction (through voice calls, SMS messages and emails), during which
attackers manipulate the victim’s psychology and emotions [7]. Phishing attacks are the most common
human-based attacks performed by social engineers [8], [9]. They consist of misleading vulnerable users
into providing sensitive and confidential information (such as bank account information, secret files, etc.)
via phone calls, emails or SMS [10]–[12]. Proposals in the literature against phishing are either
technological-centred or human-centred. We believe that the first category requires users to have
a minimum of technical expertise to apply the solutions. The second category educates users and investigates
the psychological and sociological behaviour of victims; it requires learners to have a minimum level
of knowledge to understand the fundamentals. This research aims to offer a solution that requires no
background knowledge. To achieve that, we apply various artificial intelligence techniques to
web content retrieved about phishing scams, so that users can easily make decisions.
This work combines artificial, web and collective intelligence to automate the collection and organization
of phishing texts and to derive understandable insights from such data. It therefore relies on
experienced human judgement and machine-computed opinions to help users make the right decisions
when facing phishing.
The terms “scam” and “phishing scam” are used interchangeably in this document.

2. Related works
The literature includes technological and human-centred approaches against social engineering attacks.
2.1 Technological-centred approaches
Companies acquire network protection solutions (intrusion detection systems, firewalls, honeypots,
etc.) to curb spear-phishing intrusions [16], [17]. At the employee level, they opt for antivirus software [15], [16]
or browser-installed filters based on blacklists and whitelists [17]–[20]. More complex solutions are
proposed in research. They rely on artificial intelligence, including machine learning and deep learning, to generate
the intelligence needed to characterize spear-phishing activities [15], [21]–[26] from an annotated
sample of emails or URLs [27]. Other approaches seek signatures that characterize variants
of web pages or emails, in order to recognize similarities and infer their malicious character [21], [28],
[29].
2.2 Human-centred approaches
Solutions in this category include educational approaches and investigations of the psychological factors
behind scam susceptibility. On the educational side, training sessions with tools that simulate real attacks are
organized, and educational games [30]–[33] are developed for this purpose. Employees can also
voluntarily adopt educational tools such as TORPEDO [34], which assist them in real
time in handling suspicious emails. On the psychological side, several studies have examined
the factors affecting phishing susceptibility [35]–[47].
2.3 Contribution
The technological-centred approach requires users to have a minimum of technical expertise to apply
the solutions. The human-centred approach requires learners to have a minimum level of knowledge to
understand the fundamentals. Unlike these approaches, this research relies on artificial
intelligence to automatically retrieve information about scams from the Web, to derive insights and
to organize them for sound decision making. It is intended as an effective means of avoiding being
scammed, because it attaches enough intelligence to scam texts for people to judge them. In the long
term, the final system is intended to be accessible to every potential victim worldwide.
3. Problem statement and research questions
Experienced users feed the Web with judgements about the veracity of disseminated information.
However, this knowledge is not exploited to prevent phishing scams or to support users’ opinions before
they make decisions. This work addresses this issue by providing users with retrieved, structured and
valuable knowledge taken from the Web. The following questions underlie this research:
• How can the mining and retrieval of such data be automated?
• Which data structuring approaches can be used to eliminate redundant and noisy data?
• Is it possible to rely on clustering algorithms and sentiment analysis to organize and understand
the data?
• How can a new post be classified based on its similarity to known ones and on the crowdsourced
judgements of experienced people?

4. Aim and objectives


This work aims to provide a safe environment where people can get valuable information about the
veracity of texts, based on opinions mined from the Web. The objectives of this
research are as follows.
• To propose a mechanism to automate the mining and retrieval of phishing scam texts, together with
opinions, over a given period and based on selected keywords.
• To determine the best approach to structuring data with relevant information, based on feature
engineering artefacts.
• To build exploitable knowledge using sentiment analysis and supervised and
unsupervised learning schemes.
• To provide approaches for aggregating crowd opinions based on user experience, in order to
reinforce and validate retrieved judgements.
• To determine the nature of a phishing scam based on similarity measurements with known
texts.

5. Methodology
The research design has six (06) activities.
Web content mining and retrieving (A1)
This activity aims to automatically mine and retrieve information related to phishing scams. We use
web scraping techniques comprising a web crawler and a data extractor. The web crawler crawls all
the links of the web pages matching the search keywords and saves them in a database. The data
extractor then extracts data from the stored links [48]–[50]. We deal with French texts.
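The crawler and extractor roles described above can be sketched as follows. This is a minimal illustration that assumes pages are already available as HTML strings; the class name, function name, sample page and keyword are hypothetical, and a real crawler would also need HTTP fetching, politeness rules and storage.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects hyperlinks (crawler role) and visible text (extractor role)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def crawl_page(html, base_url, keywords):
    """Return (links, text) if the page mentions any keyword, else ([], "")."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    text = " ".join(parser.text_parts)
    lowered = text.lower()
    if not any(k.lower() in lowered for k in keywords):
        return [], ""
    return parser.links, text


# Hypothetical page content, standing in for a real HTTP response.
SAMPLE_HTML = """
<html><body>
<h1>Alerte arnaque</h1>
<p>Faux recrutement : ce message est une arnaque.</p>
<a href="/temoignages/1">Témoignage</a>
<script>var x = 1;</script>
</body></html>
"""

links, text = crawl_page(SAMPLE_HTML, "https://example.org/forum", ["arnaque"])
```

In this sketch, the returned links would be queued for further crawling and the text would be stored for the structuring activity (A2).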
Data structuring (A2)
Since the data come from various Web sources, we use data analysis to clean them and transform them into
knowledge that users can rely on to make decisions. Specifically, we remove redundant and noisy data and
retain meaningful characteristics using feature-engineering techniques [51], [52]. We will use
text summarization to reduce useless terms [53]. At the end of this activity, each scam text is
represented by a single, uniform feature structure.
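As an illustration of this cleaning step, the sketch below normalizes French text, drops exact duplicates after normalization, and maps each text to a bag-of-words feature structure. The stop-word list and function names are assumptions made for the example; the proposal does not fix a particular feature representation.

```python
import re
import unicodedata
from collections import Counter

# Tiny illustrative stop-word list; a real French list would be much longer.
STOP_WORDS = {"le", "la", "les", "de", "des", "un", "une", "et", "est", "ce", "pour", "vous"}


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def to_features(text):
    """Represent a text as token counts, ignoring stop words."""
    tokens = [t for t in normalize(text).split() if t not in STOP_WORDS]
    return Counter(tokens)


def deduplicate(texts):
    """Keep the first occurrence of each text, compared after normalization."""
    seen = set()
    unique = []
    for t in texts:
        key = normalize(t)
        if key and key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


# Hypothetical raw texts as they might come out of the extractor.
raw_texts = [
    "Vous avez gagné un prix !",
    "vous avez gagne un prix",
    "   ",
    "Offre de recrutement urgente",
]
unique_texts = deduplicate(raw_texts)
features = [to_features(t) for t in unique_texts]
```

Near-duplicate detection and summarization would need more than this exact-match comparison; the sketch only shows the shape of the pipeline.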
Learning knowledge (A3)
This activity has three objectives. The first is to categorize phishing scams into different clusters, such as
event (fake trainings, fake recruitments) or organization (fake ministry, fake bank, fake NGO, …). The
second is to classify a new text as scam or benign based on the stored structured data. We will make
sure that benign and scam samples are balanced; otherwise, we will gather more samples from the Web to
enrich the knowledge. To achieve these first two objectives, we use unsupervised and supervised algorithms [54]
to derive knowledge in the form of decision trees, rules or instances. The third objective investigates
whether analysing sentiment polarity can reveal information that helps identify scams, and
which level of sentiment analysis is most appropriate (document level, sentence level or aspect level)
[55], [56].
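To make the second objective concrete, here is a minimal sketch of a supervised scam/benign classifier. It uses a multinomial Naive Bayes model with add-one smoothing, chosen only for brevity; the proposal itself does not commit to this algorithm, and the training samples below are invented and balanced by construction.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, samples):
        for tokens, label in samples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, already tokenized training samples (two per class).
TRAINING = [
    (["gagnez", "prix", "envoyez", "frais"], "scam"),
    (["recrutement", "urgent", "payez", "frais", "dossier"], "scam"),
    (["reunion", "equipe", "lundi", "matin"], "benign"),
    (["rapport", "projet", "joint", "equipe"], "benign"),
]

model = NaiveBayesTextClassifier()
model.fit(TRAINING)
prediction = model.predict(["payez", "frais", "prix"])
```

The same token-count representation could feed a clustering algorithm for the first objective; that part is not shown here.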
Multi-Criteria Decision Making (A4)
This step builds an MCDM model [57]–[59], since the problem addressed in this research is
to help users decide whether or not to trust a piece of Web content. To that end, we define the objectives,
criteria and sub-criteria (such as source reliability, content faults, …) and alternatives (such as “trust the
text”, “do not trust it”, …). We then help users reach a judgement according to the defined model.
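One simple way to instantiate such a model is a weighted sum, sketched below. The weighted sum method is an assumption made for illustration (the proposal cites AHP and ANP reviews without choosing a method), and the weights, the "crowd_opinion" criterion and the scores are invented. Each score states how strongly a criterion supports an alternative, on a 0 to 1 scale.

```python
def weighted_sum_ranking(weights, scores):
    """Rank alternatives by the weighted average of their criterion scores."""
    total_weight = sum(weights.values())
    ranking = []
    for alternative, criterion_scores in scores.items():
        value = sum(weights[c] * criterion_scores[c] for c in weights) / total_weight
        ranking.append((alternative, value))
    ranking.sort(key=lambda pair: pair[1], reverse=True)
    return ranking


# Hypothetical criterion weights (they need not sum to 1; they are normalized).
WEIGHTS = {"source_reliability": 0.5, "content_faults": 0.3, "crowd_opinion": 0.2}

# Hypothetical scores for one suspicious text.
SCORES = {
    "trust the text": {"source_reliability": 0.2, "content_faults": 0.1, "crowd_opinion": 0.3},
    "do not trust it": {"source_reliability": 0.8, "content_faults": 0.9, "crowd_opinion": 0.7},
}

ranking = weighted_sum_ranking(WEIGHTS, SCORES)
recommended = ranking[0][0]
```

With these invented numbers, the alternative "do not trust it" ranks first, which is the judgement the system would suggest to the user.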
Collective aggregation (A5)
This phase takes multiple sources of opinions and produces a single reliable answer. There
are two situations to consider. In the first, several different opinions are retrieved for
the same text: some retrieved results R state that a piece of information I is a scam, while other
results R’ state that I can be trusted. In the second, people on the Web
share a single opinion about a piece of information I, but our crowdsourcers (i.e. experts registered
in our system as confirmers) give a contrary judgement. In both situations, we rely on crowdsourcing
techniques [60]–[62] and opinion aggregation models [63], [64].
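A weighted majority vote is one of the simplest aggregation schemes that covers both situations. The sketch below assumes that registered confirmers carry a higher weight than anonymous web posts; the source identifiers and weight values are hypothetical, and the cited aggregation models are more elaborate than this.

```python
from collections import defaultdict


def aggregate_opinions(opinions, weights=None, default_weight=1.0):
    """Weighted majority vote over (source, label) pairs.

    Returns (label, share of total weight), or (None, share) on a tie,
    or (None, 0.0) when there are no opinions.
    """
    weights = weights or {}
    totals = defaultdict(float)
    for source, label in opinions:
        totals[label] += weights.get(source, default_weight)
    if not totals:
        return None, 0.0
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    total_weight = sum(totals.values())
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, ranked[0][1] / total_weight
    label, weight = ranked[0]
    return label, weight / total_weight


# Hypothetical opinions about one piece of information I.
OPINIONS = [
    ("web_post_1", "trust"),
    ("web_post_2", "trust"),
    ("web_post_3", "trust"),
    ("confirmer_a", "scam"),
    ("confirmer_b", "scam"),
]

# Assumed weights: registered confirmers count double.
CONFIRMER_WEIGHTS = {"confirmer_a": 2.0, "confirmer_b": 2.0}

unweighted = aggregate_opinions(OPINIONS)
weighted = aggregate_opinions(OPINIONS, CONFIRMER_WEIGHTS)
```

Here the unweighted vote favours "trust", while weighting the confirmers reverses the outcome to "scam", which mirrors the second situation described above.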
Similarity-based classification (A6)
Here, we aim to identify the nature of an incoming piece of information I by comparing its content syntactically
and semantically with existing content. For that, we rely on text similarity approaches [65]. This is useful
for reinforcing decision-making guidance.
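The syntactic side of this comparison can be sketched with cosine similarity over token counts, assigning the incoming text the label of its closest known text. The known texts and labels below are invented, and semantic similarity (for example with word embeddings) is outside the scope of this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the token-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def most_similar(new_text, known):
    """Return (label, similarity) of the known text closest to new_text."""
    return max(
        ((label, cosine_similarity(new_text, text)) for text, label in known),
        key=lambda pair: pair[1],
    )


# Hypothetical labelled texts already stored in the system.
KNOWN = [
    ("payez les frais de dossier pour le recrutement", "scam"),
    ("reunion d equipe lundi matin", "benign"),
]

label, similarity = most_similar("recrutement urgent payez les frais", KNOWN)
```

A similarity threshold below which no label is proposed would be a natural addition, so that texts unlike anything known are flagged for crowd judgement instead.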

6. Expected outcomes
This research is expected to yield two main products.

• Corpus of scam datasets: research lacks up-to-date, meaningful, real-life datasets of
phishing scams. We aim to provide and maintain structured and unstructured samples that help
to reproduce and advance research in this area.
• A prototype of the system: a lightweight web platform built in Python that incorporates the
deliverable of each activity. We adopt Python because of its popularity in data science [66], [67].

References
[1] N. N. Pokrovskaia and S. O. Snisarenko, “Social engineering and digital technologies for the security of the
social capital’ development,” in Proceedings of the 2017 International Conference “Quality Management,
Transport and Information Security, Information Technologies”, IT and QM and IS 2017, 2017, pp. 16–18.
[2] R. Kalniņš, J. Puriņš, and G. Alksnis, “Security Evaluation of Wireless Network Access Points,” Applied
Computer Systems, vol. 21, no. 1, pp. 38–45, Jun. 2017.
[3] AranaM, “How much does a cyberattack cost companies?,” OpenData Security. [Online]. Available:
https://opendatasecurity.io/how-much-does-a-cyberattack-cost-companies/.
[4] F. Mouton, L. Leenen, and H. S. Venter, “Social engineering attack examples, templates and scenarios,”
Computers and Security, vol. 59, pp. 186–209, Jun. 2016.
[5] P. L. Gallegos-Segovia, P. E. Vintimilla-Tapia, J. F. Bravo-Torres, I. F. Yuquilima-Albarado, V. M. Larios-
Rosillo, and J. D. Jara-Saltos, “Social engineering as an attack vector for ransomware,” in 2017 CHILEAN
Conference on Electrical, Electronics Engineering, Information and Communication Technologies, CHILECON
2017 - Proceedings, 2017, vol. 2017-January, pp. 1–6.
[6] X. Liu, Q. Li, and C. Sonali, “Social engineering and insider threats,” in Proceedings - 2017 International
Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, CyberC 2017, 2018, vol. 2018-
January, pp. 25–34.
[7] P. Patil and R. Devale, “A Literature Survey of Phishing Attack Technique,” International Journal of Advanced
Research in Computer and Communication Engineering, vol. 5, pp. 198–200, 2016.
[8] S. Gupta, A. Singhal, and A. Kapoor, “A literature survey on social engineering attacks: Phishing attack,” in
Proceeding - IEEE International Conference on Computing, Communication and Automation, ICCCA 2016,
2017, pp. 537–540.
[9] E. O. Yeboah-Boateng and P. M. Amanor, “Phishing, SMiShing & Vishing: An Assessment of Threats
against Mobile Devices,” Journal of Emerging Trends in Computing and Information Sciences, vol. 5, no. 4,
2014.
[10] A. Aleroud and L. Zhou, “Phishing Environments, Techniques, and Countermeasures: A Survey,” Computers &
Security, vol. 68, pp. 160–196, Jul. 2017.
[11] K. L. Chiew, K. S. C. Yong, and C. L. Tan, “A survey of phishing attacks: Their types, vectors and technical
approaches,” Expert Systems with Applications, vol. 106, pp. 1–20, Sep. 2018.
[12] B. B. Gupta, N. A. G. Arachchilage, and K. E. Psannis, “Defending Against Phishing Attacks: Taxonomy of
Methods, Current Issues and Future Directions,” Telecommunication Systems, vol. 67, no. 2, pp. 247–267, Feb.
2018.
[13] R. Singh, H. Kumar, R. K. Singla, and R. R. Ketti, “Internet Attacks and Intrusion Detection System: A Review
of the Literature,” Online Information Review, vol. 41, no. 2, pp. 171–184, Apr. 2017.
[14] L. Santos, C. Rabadao, and R. Goncalves, “Intrusion Detection Systems in Internet of Things: A Literature
Review,” in 2018 13th Iberian Conference on Information Systems and Technologies (CISTI), 2018, pp. 1–7.
[15] T. Chin, K. Xiong, and C. Hu, “Phishlimiter: A Phishing Detection and Mitigation Approach Using Software-
Defined Networking,” IEEE Access, vol. 6, pp. 42516–42531, 2018.
[16] A. Qamar, A. Karim, and V. Chang, “Mobile Malware Attacks: Review, Taxonomy & Future Directions,”
Future Generation Computer Systems, vol. 97, pp. 887–909, Aug. 2019.
[17] A. Jamil, K. Asif, Z. Ghulam, M. K. Nazir, S. Mudassar Alam, and R. Ashraf, “MPMPA: A Mitigation and
Prevention Model for Social Engineering Based Phishing attacks on Facebook,” in 2018 IEEE International
Conference on Big Data (Big Data), 2018, pp. 5040–5048.
[18] N. Virvilis, A. Mylonas, N. Tsalis, and D. Gritzalis, “Security Busters: Web Browser Security vs. Rogue Sites,”
Computers & Security, vol. 52, pp. 90–105, Jul. 2015.
[19] A. Amran, Z. F. Zaaba, M. M. Singh, and A. W. Marashdih, “Usable Security: Revealing End-Users
Comprehensions on Security Warnings,” Procedia Computer Science, vol. 124, pp. 624–631, Jan. 2017.
[20] N. Tsalis, N. Virvilis, A. Mylonas, T. Apostolopoulos, and D. Gritzalis, “Browser Blacklists: The Utopia of
Phishing Protection,” Springer, Cham, 2015, pp. 278–293.
[21] A. K. Jain and B. B. Gupta, “A Novel Approach to Protect Against Phishing Attacks at Client Side using Auto-
updated White-List,” EURASIP Journal on Information Security, vol. 2016, no. 1, p. 9, Dec. 2016.
[22] S. Mahdavifar and A. A. Ghorbani, “Application of Deep Learning to Cybersecurity: A Survey,”
Neurocomputing, vol. 347, pp. 149–176, Jun. 2019.
[23] T. Shibahara et al., “Malicious URL Sequence Detection using Event De-noising Convolutional Neural
Network,” in 2017 IEEE International Conference on Communications (ICC), 2017, pp. 1–7.
[24] R. M. Mohammad, F. Thabtah, and L. McCluskey, “Tutorial and critical analysis of phishing websites methods,”
Computer Science Review, vol. 17, pp. 1–24, Aug. 2015.
[25] E.-S. M. El-Alfy, “Detection of Phishing Websites Based on Probabilistic Neural Networks and K-Medoids
Clustering,” The Computer Journal, vol. 60, no. 12, pp. 1745–1759, Dec. 2017.
[26] D. Sahoo, C. Liu, and S. C. H. Hoi, “Malicious URL Detection using Machine Learning: A Survey,” Jan. 2017.
[27] OpenPhish, “Timely. Accurate. Relevant Threat Intelligence.” [Online]. Available: https://www.openphish.com/.
[Accessed: 09-May-2019].
[28] S. Gupta and S. Sachdeva, “Invitation or Bait? Detecting Malicious URLs in Facebook Events,” in 2018 Eleventh
International Conference on Contemporary Computing (IC3), 2018, pp. 1–6.
[29] H. Shirazi, B. Bezawada, and I. Ray, “"Kn0w Thy Doma1n Name": Unbiased Phishing Detection
Using Domain Name Based Features,” in Proceedings of the 23nd ACM on Symposium on Access Control
Models and Technologies - SACMAT ’18, 2018, pp. 69–75.
[30] L. Caporarello, M. Magni, and F. Pennarola, “One Game Does not Fit All. Gamification and Learning: Overview
and Future Directions,” Springer, Cham, 2019, pp. 179–188.
[31] R. N. Landers, E. M. Auer, A. B. Collmus, and M. B. Armstrong, “Gamification Science, Its History and Future:
Definitions and a Research Agenda,” Simulation & Gaming, vol. 49, no. 3, pp. 315–337, Jun. 2018.
[32] G. CJ, S. Pandit, S. Vaddepalli, H. Tupsamudre, V. Banahatti, and S. Lodha, “PHISHY - A Serious Game to
Train Enterprise Users on Phishing Awareness,” in Proceedings of the 2018 Annual Symposium on Computer-
Human Interaction in Play Companion Extended Abstracts - CHI PLAY ’18 Extended Abstracts, 2018, pp. 169–
181.
[33] G. Misra, N. A. G. Arachchilage, and S. Berkovsky, “Phish Phinder: A Game Design Approach to Enhance User
Confidence in Mitigating Phishing Attacks,” in Eleventh International Symposium on Human Aspects of
Information Security & Assurance (HAISA 2017), 2017, pp. 41–51.
[34] M. Volkamer, K. Renaud, B. Reinheimer, and A. Kunz, “User experiences of TORPEDO: TOoltip-poweRed
Phishing Email DetectiOn,” Computers & Security, vol. 71, pp. 100–113, Nov. 2017.
[35] M. Alsharnouby, F. Alaca, and S. Chiasson, “Why phishing still works: User strategies for combating phishing
attacks,” International Journal of Human-Computer Studies, vol. 82, pp. 69–82, Oct. 2015.
[36] E. J. Williams, A. Beardmore, and A. N. Joinson, “Individual differences in susceptibility to online influence: A
theoretical review,” Computers in Human Behavior, vol. 72, pp. 412–421, Jul. 2017.
[37] P. Rajivan and C. Gonzalez, “Creative Persuasion: A Study on Adversarial Behaviors and Strategies in Phishing
Attacks,” Frontiers in Psychology, vol. 9, p. 135, Feb. 2018.
[38] M. Butavicius, K. Parsons, M. Pattinson, and A. McCormac, “Breaching the Human Firewall: Social engineering
in Phishing and Spear-Phishing Emails,” May 2016.
[39] M. Nicho, H. Fakhry, and U. Egbue, “Evaluating User Vulnerabilities vs Phisher Skills In Spear Phishing,”
IADIS International Journal on Computer Science and Information Systems, vol. 13, no. 2, pp. 93–108, 2018.
[40] E. J. Williams, J. Hinds, and A. N. Joinson, “Exploring susceptibility to phishing in the workplace,” International
Journal of Human Computer Studies, vol. 120, pp. 1–13, Dec. 2018.
[41] H. S. Jones, J. N. Towse, N. Race, and T. Harrison, “Email fraud: The search for psychological predictors of
susceptibility,” PLoS ONE, vol. 14, no. 1, Jan. 2019.
[42] K. Greene, M. Steves, M. Theofanos, and J. Kostick, “User Context: An Explanatory Variable in Phishing
Susceptibility,” in Proceedings 2018 Workshop on Usable Security, Network and Distributed Systems Security
(NDSS) Symposium, 2018.
[43] D. Oliveira et al., “Dissecting spear phishing emails for older vs young adults: On the interplay of weapons of
influence and life domains in predicting susceptibility to phishing,” in Conference on Human Factors in
Computing Systems - Proceedings, 2017, vol. 2017-May, pp. 6412–6424.
[44] I. Alseadoon, T. Chan, E. Foo, and J. Gonzalez Nieto, “Who is More Susceptible to Phishing Emails?: A Saudi
Arabian Study,” in 23rd Australasian Conference on Information Systems, 2012.
[45] I. Alseadoon, M. F. I. Othman, and T. Chan, “What is the Influence of Users’ Characteristics on their Ability to
Detect Phishing Emails?,” in Lecture Notes in Electrical Engineering, 2015, vol. 315, pp. 949–962.
[46] S. Kleitman, M. K. H. Law, and J. Kay, “It’s the deceiver and the receiver: Individual differences in phishing
susceptibility and false positives with item profiling,” PLoS ONE, vol. 13, no. 10, p. e0205089, 2018.
[47] S. M. Albladi and G. R. S. Weir, “User characteristics that influence judgment of social engineering attacks in
social networks,” Human-centric Computing and Information Sciences, vol. 8, no. 1, Dec. 2018.
[48] M. Salah, B. Al Okush, and M. Al Rifaee, “A Comparison of Web Data Extraction Techniques,” in 2019 IEEE
Jordan International Joint Conference on Electrical Engineering and Information Technology, JEEIT 2019 -
Proceedings, 2019, pp. 785–789.
[49] N. V. Kamanwar and S. G. Kale, “Web data extraction techniques: A review,” in IEEE WCTFTR 2016 -
Proceedings of 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare,
2016.
[50] A. K. Yatskov, M. I. Varlamov, and D. Y. Turdakov, “Extraction of Data from Mass Media Web Sites,”
Programming and Computer Software, vol. 44, no. 5, pp. 344–352, Sep. 2018.
[51] X. Deng, Y. Li, J. Weng, and J. Zhang, “Feature selection for text classification: A review,” Multimedia Tools
and Applications, vol. 78, no. 3, pp. 3797–3816, Feb. 2019.
[52] A. Zheng and A. Casari, Feature Engineering for Machine Learning: Principles and Techniques for Data
Scientists. O’Reilly Media, 2018.
[53] M. Gambhir and V. Gupta, “Recent automatic text summarization techniques: a survey,” Artificial Intelligence
Review, vol. 47, no. 1, Jan. 2017.
[54] J. Chapmann, Machine Learning: Fundamental Algorithms for Supervised and Unsupervised Learning With
Real-World Applications. CreateSpace Independent Publishing Platform, 2017.
[55] W. Medhat, A. Hassan, and H. Korashy, “Sentiment analysis algorithms and applications: A survey,” Ain Shams
Engineering Journal, vol. 5, no. 4, pp. 1093–1113, Dec. 2014.
[56] D. M. E. D. M. Hussein, “A survey on sentiment analysis challenges,” Journal of King Saud University -
Engineering Sciences, vol. 30, no. 4, pp. 330–338, Oct. 2018.
[57] E. K. Zavadskas, Z. Turskis, and S. Kildiene, “State of art surveys of overviews on MCDM/MADM methods,”
Technological and Economic Development of Economy, vol. 20, no. 1. Taylor and Francis Ltd., pp. 165–179,
2014.
[58] A. Mardani, A. Jusoh, K. M. D. Nor, Z. Khalifah, N. Zakwan, and A. Valipour, “Multiple criteria decision-
making techniques and their applications - A review of the literature from 2000 to 2014,” Economic Research-
Ekonomska Istrazivanja, vol. 28, no. 1. Taylor and Francis Ltd., pp. 516–571, 11-Sep-2015.
[59] M. R. Asadabadi, E. Chang, and M. Saberi, “Are MCDM methods useful? A critical review of Analytic
Hierarchy Process (AHP) and Analytic Network Process (ANP),” Cogent Engineering, vol. 6, no. 1, Jan. 2019.
[60] S. Chatterjee, A. Mukhopadhyay, and M. Bhattacharyya, “A Review of Judgment Analysis Algorithms for
Crowdsourced Opinions,” IEEE Transactions on Knowledge and Data Engineering, 2019.
[61] G. Xintong, W. Hongzhi, Y. Song, and G. Hong, “Brief survey of crowdsourcing for data mining,” Expert
Systems with Applications, vol. 41, no. 17. Elsevier Ltd, pp. 7987–7994, 01-Dec-2014.
[62] A. I. Chittilappilly, L. Chen, and S. Amer-Yahia, “A Survey of General-Purpose Crowdsourcing Techniques,”
IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 9, pp. 2246–2266, Sep. 2016.
[63] H. Mercier and O. Morin, “Majority rules: how good are we at aggregating convergent opinions?,” Evolutionary
Human Sciences, vol. 1, 2019.
[64] J. Y. L. Yap, C. C. Ho, and C. Y. Ting, “Aggregating multiple decision makers’ judgement,” in Lecture Notes in
Networks and Systems, vol. 67, Springer, 2019, pp. 13–21.
[65] W. H. Gomaa and A. A. Fahmy, “A Survey of Text Similarity Approaches,” International Journal of Computer
Applications, vol. 68, no. 13, pp. 13–18, Apr. 2013.
[66] A. Luashchuk, “Why I Think Python is Perfect for Machine Learning and Artificial Intelligence,”
TowardsDataScience, 2019. [Online]. Available: https://towardsdatascience.com/8-reasons-why-python-is-good-
for-artificial-intelligence-and-machine-learning-4a23f6bed2e6. [Accessed: 03-Sep-2019].
[67] R. Fabisiak, “The best programming language for Artificial Intelligence and Machine Learning,” Medium, 2019.
[Online]. Available: https://medium.com/duomly-blockchain-online-courses/the-best-programming-language-
for-artificial-intelligence-and-machine-learning-538486b462c. [Accessed: 03-Sep-2019].
