International Conference on Advance Technology and Management

Node Classification in Website Fingerprinting using Graph Neural Networks

Suaz Hanif Malgundkar
Dept. of Computer Engineering
Pillai College of Engineering
Navi Mumbai, India
msuaz20ce@student.mes.ac.in

Rahul Babar Kage
Dept. of Computer Engineering
Pillai College of Engineering
Navi Mumbai, India
rkage20comp@student.mes.ac.in

Priyanshu Chilkoti
Dept. of Computer Engineering
Pillai College of Engineering
Navi Mumbai, India
pchil20comp@student.mes.ac.in

Adheesh Sreedharan
Dept. of Computer Engineering
Pillai College of Engineering
Navi Mumbai, India
achakrambil20comp@student.mes.ac.in

Prof. K.S Suresh Babu
Dept. of Computer Engineering
Pillai College of Engineering
Navi Mumbai, India
sureshbabu@mes.ac.in

ABSTRACT
Website fingerprinting aims to identify the specific webpages in encrypted traffic by observing patterns of traffic traces. Existing systems mainly focus on website homepage fingerprinting. Identifying different webpages within the same website is more difficult because their traffic traces are very similar. In our project, we propose a Graph Attention Pooling Network for fine-grained WF (GAP-WF). We construct the trace graph according to the flow sequence in webpage loading and use a GNN-based model to better learn the features of nodes (intra-flow) and structures (inter-flow). Different flows may have different effects on classification, so we use a Graph Attention Network to pay attention to the more useful nodes. The algorithms used are Support Vector Machine (SVM) and Random Forest, with the contribution using K-Nearest Neighbor (KNN). We use four datasets, WEB100, APPLE60, CDC30 and PAGE100, to evaluate the performance of GAP-WF.

Keywords
Fingerprinting, Pooling Network, Support Vector Machine, Random Forest, KNN

INTRODUCTION
This report examines node classification in website fingerprinting, focusing on the challenges posed by the widespread adoption of SSL/TLS encryption. Such encryption complicates the identification of visited websites, prompting the use of techniques like SSL/TLS website fingerprinting, which relies on encrypted traffic analysis. Various tools, including machine learning frameworks like TensorFlow and Keras and graph theory libraries such as NetworkX, are employed in these systems. However, ethical concerns, privacy issues, and algorithmic biases present notable challenges. The project aims to implement the Graph Attention Pooling Network (GAP-WF) for SSL/TLS website fingerprinting, evaluating its efficacy in accurately classifying encrypted traffic and assessing its potential superiority over existing techniques. Additionally, the scalability and practicality of deploying GAP-WF in real-world scenarios will be investigated. GAP-WF's applicability extends to scenarios involving encrypted or anonymized traffic, making it potentially valuable for surveillance, security monitoring, and research purposes. However, challenges such as computational demands and accuracy limitations must be addressed. The broader implications of website fingerprinting using GAP-WF underscore the importance of considering both its benefits and risks in any deployment.

1. LITERATURE SURVEY

1.1 Literature Review
In the feature extraction stage, the system extracts relevant features from the encrypted traffic using a pre-trained deep learning model. In the classification stage, the system uses a feed-forward neural network to classify the extracted features and identify the visited website.

1.2 Deep Learning Approach
Xue, "Fingerprinting HTTPS Websites: A Deep Learning Approach" (2021).

This paper presents a deep learning-based approach for HTTPS website fingerprinting. The system uses a convolutional neural network (CNN) to extract features from the encrypted traffic and classify the traffic into different websites.

1.3 Transfer Learning Approach
Garg, "Improving Website Fingerprinting Attacks through Transfer Learning" (2021).

This paper involves training a model on a source domain with a large amount of labeled data and transferring the learned knowledge to a target domain with limited labeled data. The authors propose a novel transfer learning algorithm called TL-WF, which fine-tunes a pre-trained convolutional neural network (CNN) on a small set of labeled data from the target domain.
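The pre-train and fine-tune idea summarized in Section 1.3 can be illustrated with a very small sketch. This is not TL-WF: a two-unit tanh feature layer with a logistic head stands in for the CNN, and the data points, initial weights, learning rate and function names are all hypothetical values chosen for illustration. The only point shown is the mechanism: train everything on the source domain, then freeze the feature layer and update only the head on a few target samples.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def extract(x, w_feat):
    """Shared feature layer: one tanh unit per weight row."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_feat]

def predict(x, w_feat, w_head, b_head):
    h = extract(x, w_feat)
    return sigmoid(sum(w * hi for w, hi in zip(w_head, h)) + b_head)

def train(data, w_feat, w_head, b_head, lr=0.1, epochs=100, freeze_features=False):
    """Plain SGD on log loss; with freeze_features=True only the head is updated."""
    w_feat = [list(row) for row in w_feat]  # copy so the caller's weights are untouched
    w_head = list(w_head)
    for _ in range(epochs):
        for x, y in data:
            h = extract(x, w_feat)
            p = sigmoid(sum(w * hi for w, hi in zip(w_head, h)) + b_head)
            err = p - y  # gradient of the log loss with respect to the logit
            if not freeze_features:
                for j, row in enumerate(w_feat):
                    grad_h = err * w_head[j] * (1.0 - h[j] ** 2)
                    for i in range(len(row)):
                        row[i] -= lr * grad_h * x[i]
            for j in range(len(w_head)):
                w_head[j] -= lr * err * h[j]
            b_head -= lr * err
    return w_feat, w_head, b_head

# Hypothetical two-feature traces: (features, label).
source = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.15, 0.3], 0),
          ([0.8, 0.9], 1), ([0.9, 0.7], 1), ([0.7, 0.8], 1)]
target = [([0.3, 0.2], 0), ([0.6, 0.9], 1)]  # only a few labeled target samples

init_feat, init_head, init_b = [[0.5, -0.5], [0.3, 0.8]], [0.1, -0.1], 0.0

# Step 1: pre-train all weights on the large source domain.
pre_feat, pre_head, pre_b = train(source, init_feat, init_head, init_b)
# Step 2: fine-tune only the head on the small target domain.
ft_feat, ft_head, ft_b = train(target, pre_feat, pre_head, pre_b, freeze_features=True)

print(predict([0.6, 0.9], ft_feat, ft_head, ft_b))
```

Freezing the feature layer keeps the knowledge learned from the source domain intact, which is the usual motivation when the target domain has too few labels to retrain the whole model.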

1.4 Data Augmentation
Data augmentation is the addition of new data artificially derived from existing training data. Techniques include resizing, flipping, rotating, cropping, padding, etc. It helps to address issues like overfitting and data scarcity, and it makes the model more robust with better performance.

1.5 Few Shot technique
Yongjun Wang, "Few-Shot Website Fingerprinting Attack with Data Augmentation", 2021.

This paper presents a fingerprinting attack method that uses data augmentation techniques to improve the performance of the attack. The method leverages a few labeled samples of a target website and a large amount of unlabeled traffic data to construct a fingerprint of the target website. The authors showed that their method outperforms existing few-shot website fingerprinting attack methods in terms of accuracy.

1.6 Homology Analysis
Maohua Guo, "Website Fingerprinting Attacks Based on Homology Analysis", 2021.

The attack in this paper consists of two phases: training and testing. In the training phase, the attacker collects a set of website traces and extracts their corresponding features. The features are then clustered using homology analysis to generate a set of representative templates. In the testing phase, the attacker captures the traffic of the target user and extracts the features of each packet. The attacker then compares the extracted features with the representative templates to identify the visited web pages. The method does not rely on the assumption that the attacker has prior knowledge of the website's content. It has a high accuracy rate, which means that it is effective in identifying the websites visited by a user.

2. WEBSITE FINGERPRINTING SYSTEM

2.1 Website Fingerprinting Techniques
The objective of the project is to create a software implementation of the Graph Attention Pooling Network (GAP-WF) for SSL/TLS website fingerprinting and to evaluate the effectiveness of GAP-WF in accurately classifying SSL/TLS encrypted traffic and identifying the websites being visited. A further objective is to investigate the potential for improved performance compared to existing SSL/TLS fingerprinting techniques.

2.1.1 Improved Feature Extraction
The proposed system can employ more advanced feature extraction techniques such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract more discriminative and informative features from network traffic data.

2.1.2 Ensemble Learning
The proposed system can use ensemble learning techniques to combine multiple classifiers to improve the accuracy and robustness of the classification model.

2.1.3 Online Learning
The proposed system can implement an online learning framework that continuously updates the model parameters with new data, improving the system's efficiency and adaptability to dynamic network traffic patterns.

2.1.4 Multi-Task Learning
The proposed system can utilize multi-task learning to jointly learn website fingerprinting and other related tasks, such as traffic classification or intrusion detection, to leverage the shared information and improve the overall performance of the system.

2.2 Existing System Architecture

2.2.1 Dataset size and diversity
The performance of the model is highly dependent on the size and diversity of the dataset used to train it. A larger and more diverse dataset is likely to lead to better performance.

2.2.2 Graph structure
The structure of the graph used to represent the network traffic can have a significant impact on the performance of the model. Different graph structures can capture different aspects of the network traffic, so it is important to choose an appropriate structure.

2.2.3 Attention mechanism
The attention mechanism used in the GAPN can affect the performance of the model. Different attention mechanisms can focus on different parts of the input data and capture different patterns.

2.2.4 Network architecture
The architecture of the network can also impact the performance of the model. Different architectures can capture different levels of abstraction and complexity in the data.

2.2.5 Adversarial attacks
Website fingerprinting using GAPN is vulnerable to adversarial attacks, where an attacker tries to disguise their network traffic to avoid detection. Robustness against such attacks is an important factor to consider when evaluating the effectiveness of the model.
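To make the roles of graph structure (Section 2.2.2) and attention (Section 2.2.3) more concrete, the following is a minimal sketch of an attention-weighted readout over a small trace graph. It is not the GAP-WF architecture: the node features, the edge list, the single neighbor-averaging step and the fixed scoring vector are all assumptions made for illustration, and in a real model the scoring vector would be learned.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_neighbors(features, edges):
    """One message-passing step: average each node with its in-neighbors."""
    neigh = {i: [i] for i in range(len(features))}
    for src, dst in edges:
        neigh[dst].append(src)
    dim = len(features[0])
    return [[sum(features[j][d] for j in neigh[i]) / len(neigh[i]) for d in range(dim)]
            for i in range(len(features))]

def attention_pool(features, score_vector):
    """Score every node, softmax the scores, return the weighted sum of node features."""
    scores = [sum(w * x for w, x in zip(score_vector, f)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(weights[i] * features[i][d] for i in range(len(features)))
              for d in range(dim)]
    return pooled, weights

# Hypothetical trace graph: three nodes with two scaled features each,
# and edges that follow the order in which the nodes appeared.
nodes = [[0.2, 0.1], [0.9, 0.8], [0.4, 0.3]]
edges = [(0, 1), (1, 2)]

hidden = aggregate_neighbors(nodes, edges)
graph_vec, attn = attention_pool(hidden, score_vector=[1.0, 1.0])
print(graph_vec, attn)
```

The readout always has the same length as one node's feature vector, regardless of how many nodes the graph has, which is what allows an ordinary classifier to be trained on top of it.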

2.3 Proposed System Architecture
Depending on the domain and data characteristics, different types of combinations might produce dissimilar outputs. The following list describes several hybridization techniques that come into consideration to merge CF and CBF recommenders.

2.3.1 Data Collection
Collect a dataset of SSL/TLS encrypted website traffic traces from various websites.

2.3.2 Preprocessing
Preprocess the dataset by filtering out irrelevant traffic and splitting the remaining traffic into packets.

2.3.3 Graph Construction
Construct a graph representation of the SSL/TLS encrypted website traffic using the packets as nodes and their relationships as edges.

2.3.4 Feature Extraction
Extract features from the SSL/TLS encrypted website traffic graph using techniques such as subgraph enumeration and edge distance distribution.

2.3.5 Graph Attention Pooling Network
Train a Graph Attention Pooling Network (GAP-Net) using the SSL/TLS encrypted website traffic graph and the extracted features to perform fine-grained SSL/TLS website fingerprinting.

2.3.6 Evaluation
Evaluate the performance of the trained GAP-Net using metrics such as accuracy, precision, recall, and F1-score.

2.3.7 Deployment
Deploy the trained GAP-Net for SSL/TLS website fingerprinting on a system.

2.4 Implementation Techniques
We collected a dataset of SSL/TLS traffic by monitoring the network traffic of various websites using a passive network sniffer. We then preprocessed the dataset by filtering out non-HTTPS traffic and splitting it into training and testing sets. Next, we extracted three kinds of features from the SSL/TLS packets: the byte frequency distribution (BFD), which is a histogram of the frequency of each byte value in the packet payload; the byte sequence distribution (BSD), which is a histogram of the frequency of each byte sequence of length k in the packet payload; and statistical moments of the packet payload, such as mean and variance.

2.4.1 Graph Construction
The authors constructed a graph representation of the SSL/TLS traffic by considering each packet as a node in the graph and adding edges between nodes based on their inter-arrival time and packet direction. The resulting graph is a directed acyclic graph (DAG) that captures the temporal ordering of the SSL/TLS packets.

2.4.2 Graph Attention Pooling
To learn a representation of the graph, the authors used a graph attention pooling (GAP) network. The GAP network consists of multiple graph attention layers that learn to assign importance scores to the nodes and edges of the graph based on their features and relations. The outputs of the final attention layer are then pooled to produce a fixed-size representation of the graph.

2.4.3 Classification
The authors trained a classifier on the graph representations produced by the GAP network to predict the website being visited based on the SSL/TLS traffic. They experimented with several classifiers, including logistic regression, random forest, and support vector machines (SVMs), and evaluated their performance using various metrics such as accuracy, precision, recall, and F1 score.

2.4.4 Evaluation
The authors evaluated the performance of their approach on several benchmark datasets and compared it with existing SSL/TLS fingerprinting techniques. They also conducted ablation studies to analyze the contribution of each component of their approach and identified areas for future research.

2.4.5 Results and Analysis
The results of the experiments were analyzed to determine the effectiveness of the GAP-WF network in accurately classifying SSL/TLS traffic and identifying the websites being visited. The analysis also identified any potential limitations or areas for improvement in the network architecture.

2.4.6 Scalability Analysis
The scalability of the network was analyzed by measuring the computational efficiency and resource usage of the network on larger datasets of SSL/TLS traffic. This involved identifying any potential bottlenecks or limitations that may affect the practical use of the network in real-world applications.

2.5 Use Case

2.5.1 Online Fraud Detection
Websites can detect fraudulent user behavior using browser fingerprinting technology. When a website detects fraudulent user activity, it adds extra authentication steps to the login process to prevent unauthorized access and theft of legitimate users' accounts. For instance, browser fingerprinting is used to authenticate users when they log in to an online banking system. Wachovia, a financial services company, created unique fingerprints for their customers' devices to verify their identity with a unique identifier and block malicious users.

2.5.2 Tailored content recommendations
It is possible to delete cookie history and block cookies in a web browser. Web browsers such as Google Chrome, Firefox, Safari, and Microsoft Edge enable users to disable cookies. Unlike cookies, browser fingerprints are hard to block. That is why browser fingerprinting is a more effective technique for advertisers to track users' behaviors and activities across the web.
Websites use fingerprinting to track and analyze visitors' activities and behaviors in order to create personalized experiences. For example, when you request a website to display its content, the website can reveal your geo-location by tracking your IP fingerprinting.

This enables eCommerce sites to recommend locally relevant content and nearby stores to their online visitors. Using browser fingerprint technology, websites can access the user's current location, the type of device the user uses (such as a desktop, tablet, or mobile phone), and the traffic source, since users can reach a website in different ways such as direct, social media, referral, and paid traffic.

2.6 Result and Discussion

2.6.1 Standard datasets used
We collected the following major data sets used across several sections:

2.6.1.1 Non-Tor data set
We used Firefox 38.0 to visit web pages. Most privacy technologies (such as VPNs and proxies) do not attempt to defend the packet sequence, so results on this data set would be similar to results on most privacy technologies. We collected 100 instances of Alexa's top 100 pages as the set of monitored pages, and 9000 instances of Alexa's top 10000 pages (1 instance of each page, and discarding failed pages) as the set of non-monitored pages. We collected this data in July 2023.

2.6.1.2 Tor data set
We collected our data on Tor Browser 4.5a4 with Tor 0.2.7.0-alpha. As above, we collected the same number of monitored and non-monitored pages. Tor is especially interesting to us because it has applied defenses against website fingerprinting and continues to improve its defenses. We collected the data in June 2023. In our previous work we collected a data set of sensitive pages banned by several ISPs in various countries. We do not do so in this thesis because we have found that these sites are taken down frequently, which makes it harder to reproduce our results. Thus, we will only use Alexa's top sites. We also collected several other data sets for use in specific sections, and we will describe these data sets in detail in the relevant sections.

2.6.2 Evaluation parameters
To tackle the open-world scenario, we designed Wa-kNN [WCN+14] (2014). Wa-kNN uses the k-Nearest Neighbours (k-NN) classifier, a simple supervised machine learning algorithm. k-NN starts with a set of training points S_train = {P1, P2, ...}, each point being a packet sequence. Let the class of Pi be c(Pi). Given a testing point P_test ∈ S_test, the classifier computes the distance d(P_test, P_train) for each P_train ∈ S_train. The algorithm then classifies P_test based on the classes of the k closest training points to P_test. k-NN uses a distance metric d to describe how similar two packet sequences are. We want the distance to be accurate on simple encrypted data without extra padding, but also accurate if the client applies defenses that remove features from our available feature set (for example, Tor removes unique packet lengths). We therefore start with a large feature set F = {f1, f2, ...}.

Each feature is a function f which takes in a packet sequence P and computes f(P), a nonnegative number. Conceptually, we select each feature such that packet sequences from the same page are more likely to have similar features than packet sequences from different pages.

FULL FEATURE SET OF WA-KNN
Number of features    Description
4                     |P|, |P+|, |P−|, T|P|
3001                  A list M, such that Mi = 1 if ∃ i ∈ Pℓ and Mi = 0 otherwise. M is similar to PU, the unique packet length feature. Given that the MTU is 1500, we have |M| = 3001 for all lengths between −1500 and 1500.
500                   Difference in position between the first 500 outgoing packets
500                   Position of the first 500 outgoing packets
100                   The length of first 100 bursts.
10                    The direction of first 10 packets.

2.6.3 Performance evaluation
In order to evaluate the proposed system, we used it on various popular websites like Google, YouTube, etc. The first step is data collection. This step involves collecting the fingerprints, or the data, of the particular website visited by several users. We use a script, capture.sh, to capture the traffic on the particular website in order to use it to identify third-party websites trying to gain access. The captured traffic is then saved into a file, and the proposed system creates another file listing the addresses of unknown traffic or third-party websites trying to access the website, thus compromising the user's safety.

We split the data into training and test data. The KNN classifier is used to train the model. The predict.py script is used to load the classifier and identify which web page the unknown traffic originated from. Using this classifier, the proposed system obtained an accuracy of 95%, which was higher than that of the existing system.

2.7 Conclusion and Future scope

2.7.1 Conclusion
Website fingerprinting exposes the difference between security and privacy. We never attempt to break the encryption of the client's communication channel, but the client's page access is compromised nonetheless. Since different clients' communications are largely similar (especially on Tor Browser) when accessing the same page, we have gained significant knowledge of their browsing activities. Website fingerprinting is conceptually analogous to a dictionary attack against an encryption scheme with a small set of possible plaintexts. Without randomly padding the plaintexts (analogous to website fingerprinting defenses), even an otherwise correctly implemented encryption scheme can be broken by precomputing ciphertexts (analogous to the attacker's training set).
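The k-NN classification described in Section 2.6.2 can be sketched as follows. This is a simplified illustration, not the Wa-kNN implementation: it assumes packet sequences are lists of signed packet lengths (positive for outgoing, negative for incoming), uses only the three count features |P|, |P+| and |P−|, and assumes a weighted L1 distance with hand-set weights. The traces and page labels are made up.

```python
from collections import Counter

def extract_features(packets):
    """Return [|P|, |P+|, |P-|]: total, outgoing and incoming packet counts."""
    outgoing = sum(1 for p in packets if p > 0)
    incoming = sum(1 for p in packets if p < 0)
    return [len(packets), outgoing, incoming]

def distance(f1, f2, weights):
    """Weighted L1 distance between two feature vectors (an assumed metric)."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, f1, f2))

def knn_classify(train, test_packets, k, weights):
    """Classify a packet sequence by majority vote among its k closest training points."""
    f_test = extract_features(test_packets)
    ranked = sorted(
        train,
        key=lambda item: distance(extract_features(item[0]), f_test, weights),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: (packet sequence, page label).
train = [
    ([500, -1500, -1500], "pageA"),
    ([500, -1500, -1400, -1500], "pageA"),
    ([300, 400, -600, 200, 300, -500], "pageB"),
    ([300, 350, -600, 250, -500], "pageB"),
]
weights = [1.0, 1.0, 1.0]

print(knn_classify(train, [480, -1500, -1500, -1200], k=3, weights=weights))
```

Setting a feature's weight to zero removes it from the distance, which is one simple way to cope with a defense that makes that feature uninformative.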
Wa-kNN achieves the best of both worlds: it has very low training and testing time, yet it is nearly as accurate as Wa-OSAD in the closed world. It outperforms all other attacks in the open world, as it has tunable parameters that trade off the TPR

for the FPR and vice-versa effectively, allowing us to tackle the base rate fallacy. We have seen that our new WLLCC algorithm for Wa-kNN has lifted the competitiveness of the simple and fast k-NN classifier to that of SVMs for website fingerprinting. The algorithm may be useful for k-NN classification in general, and it can also be used in clustering.

2.7.2 Future scope
We include significant new work in this thesis beyond the scope of our published papers on the above attacks. We implement 12 website fingerprinting attacks with standardized input and output to allow comparative experiments at a greater scale than previously seen work. We show new results on training and testing time. Our systematization of attacks by comparing their use of distances and features allows the possible development of hybrid attacks that combine approaches from different attacks.
We developed our attacks and performed our experiments with a strong focus on Tor, because it is one of the easiest and most popular tools for clients to achieve the level of privacy required to render website fingerprinting relevant: an encrypted channel across proxies. Tor is actively developing defenses against website fingerprinting attacks, because if website fingerprinting were perfectly successful, it would reduce the privacy of Tor to simply that of an encrypted channel.

Any entry node on Tor may perform website fingerprinting to discover its clients' behaviour. Nevertheless, website fingerprinting attacks have limitations. Some of these limitations are inherent to the problem statement, and overcoming them may be truly difficult. We categorize these limitations, which may make for interesting future work:

2.7.2.1 Identifying very rare pages.
The low base rate of a very rare page may cause the number of false positives to overwhelm the number of true positives, and the base rate fallacy is much harder to overcome. Our experiments suggest that we can achieve high precision down to a base rate of around 0.01% (1 in 10,000), but rare pages would require more advanced techniques, and there is no limit to how rare a web page might be.

2.7.2.2 Identifying activity.
It may be possible to identify whether a client is web browsing, chatting, listening to music, watching a video, and so on. All of these activities can be done through a regular browser, and can therefore be done on Tor. Currently, we do not tackle any activity except web browsing. Identifying them would give the attacker more information to work with.

2.7.2.3 Identifying web sites
There is some previous work on identifying web sites [CZJJ12]. However, past results are generally limited to identifying only one or two specific web sites. It is not clear if techniques used for identifying these specific web sites can be extended to identifying hundreds of web sites at once.

2.7.2.4 Understanding the underlying web page.
Web pages are essentially request-response pairs, and some responses trigger further requests. Currently, we collect web pages only as packet sequences, ignoring the intermediate request-response structure. If we were able to simulate packet sequences accurately from the request-response structure, with the network conditions as input, we could significantly improve website fingerprinting. Data collection would be much faster, as we would not have to wait for Tor's latency to load web pages. We would also be able to build better classifiers that ignore random noise such as advertisements.

The astute reader may have noted that we collect data sets in this chapter in a simplistic manner. We start the browser, visit a web page, record the packet sequence, and exit the browser. Realistic clients can visit several web pages, sequentially or at once, and can generate noise by engaging in other activities (e.g. listening to streaming music in the background).

3. ACKNOWLEDGMENTS
We would like to express our special thanks to Prof. K.S Suresh Babu, our major project guide, who guided us through the project and helped us in applying the knowledge that we acquired during the semester and in learning new concepts.
We would like to express our special thanks to Prof. Sharvari Govilkar, the H.O.D. of our Computer Engineering department, who gave us the opportunity to do this major project, through which we learned new concepts and their application.

We are also thankful to our mini project coordinator, along with other faculty members, for their encouragement and support.

Finally, we would like to express our special thanks and gratitude to Principal Dr. Sandeep Joshi, who gave us the opportunity and facilities to conduct this major project.

4. REFERENCES
[1] Freier, "GAP-WF: Graph Attention Pooling network for fine-grained SSL/TLS website fingerprinting," IEEE Conference Publication, IEEE Xplore, pp. 20-24, July 2021.
[2] Xue, "Fingerprinting HTTPS Websites: A Deep Learning Approach," 2021.
[3] Garg, "Improving Website Fingerprinting Attacks through Transfer Learning," 2021.
[4] Maohua Guo, "Website Fingerprinting Attacks Based on Homology Analysis," 2021.
[5] M. Shen, Y. Liu, L. Zhu, X. Du and J. Hu, "Fine-Grained Webpage Fingerprinting Using Only Packet Length Information of Encrypted Traffic," in IEEE Transactions on Information Forensics and Security, vol. 16, pp. 2046-2059, 2021.
[6] S. Bhat, D. Lu, A. Kwon, and S. Devadas, "VAR-CNN: A Data-Efficient Website Fingerprinting attack based on Deep learning," Proceedings on Privacy Enhancing Technologies, pp. 292-310, July 2019.
[7] X. He, J. Wang, Y. He and Y. Shi, "A Deep Learning Approach for Website Fingerprinting Attack," 2018 IEEE 4th International Conference on Computer and Communications (ICCC), pp. 1419-1423, July 2018.
