DIGITAL VIDEO FINGERPRINTING
Technical Seminar Report
Submitted in Partial Fulfilment of the requirement for the award of the degree of
BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING
Mrs. Suguna M K
Assistant Professor
Dept of ISE, Sir MVIT
Bengaluru-562157
CERTIFICATE
Certified that the Technical Seminar entitled “Digital Video Fingerprinting”, carried out by Mr. Ayush Tiwary bearing USN 1MV19IS008, a bonafide student of Sir M V Institute of Technology, is in partial fulfilment for the award of Bachelor of Engineering in Information Science and Engineering of the Visvesvaraya Technological University, Belagavi, during the year 2022-23. It is certified that all corrections/suggestions indicated for the Technical Seminar have been executed under the directions of Mrs. Shanta Biradar H, Assistant Professor, Dept. of ISE, Sir MVIT. The report has been approved as it satisfies the academic requirements in respect of the Technical Seminar prescribed for the said degree.
DEPT OF ISE, SIR MVIT
DIGITAL VIDEO FINGERPRINTING
DECLARATION
I, Ayush Tiwary, bearing the USN 1MV19IS008, a student of 8th Semester B.E., Department of Information Science and Engineering, Sir M V Institute of Technology, Bengaluru, declare that the Technical Seminar entitled “Digital Video Fingerprinting” has been duly executed by me under the guidance of Mrs. Shanta Biradar H, Assistant Professor, Department of Information Science and Engineering, Sir MVIT, Bengaluru. The Technical Seminar report of the same is submitted in partial fulfilment of the requirement for the award of the Bachelor of Engineering degree in Information Science and Engineering by Visvesvaraya Technological University, Belagavi, during the year 2022-2023.
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of this Technical Seminar report would be incomplete without the mention of the almighty God and the people who made it possible, whose guidance and support rewarded the effort with the success of the technical seminar presentation.
We are grateful to Sir M V Institute of Technology for providing us an opportunity to enhance our
knowledge through the technical seminar.
We express our sincere thanks to Dr. Rakesh S G, Principal, Sir MVIT, for providing us an
opportunity and means to present the technical seminar.
We express our heartfelt thanks to the technical seminar guide, Mrs. Shanta Biradar H, Assistant Professor, Department of Information Science & Engineering, Sir MVIT, for her encouragement, and whose cooperation and guidance helped in nurturing this technical seminar report.
We would like to express profound thanks to Mrs. Suguna M K, Associate Professor and Technical Seminar Coordinator, Department of Information Science and Engineering, for the keen interest and encouragement in our technical seminar.
Finally, we would like to thank our family members and friends for standing with us through all
times.
Ayush Tiwary
1MV19IS008
ABSTRACT
Recently, many video streaming services, such as YouTube, Twitch, and Facebook, have
contributed to video streaming traffic, leading to the possibility of streaming unwanted and
inappropriate content to minors or individuals at workplaces. Therefore, monitoring such content is
necessary. Although the video traffic is encrypted, several studies have proposed techniques using
traffic data to decipher users’ activity on the web. Dynamic Adaptive Streaming over HTTP (DASH)
uses Variable Bit-Rate (VBR) - the most widely adopted video streaming technology, to ensure
smooth streaming. VBR causes inconsistencies in video identification in most research. This research
proposes a fingerprinting method to accommodate for VBR inconsistencies. First, bytes per second
(BPS) are extracted from the YouTube video stream.
Bytes per Period (BPP) are generated from the BPS, and then fingerprints are generated from
these BPPs. Furthermore, a Convolutional Neural Network (CNN) is optimized through
experiments. The resulting CNN is used to detect YouTube streams over VPN, Non-VPN, and a
combination of both VPN and Non-VPN network traffic.
CONTENTS
Chapter No. Chapter Name
1 INTRODUCTION
2 TECHNICAL DESCRIPTION
3 REFERENCE PAPER-I
4 REFERENCE PAPER-II
5 REFERENCE PAPER-III
6 CONCLUSION
INTRODUCTION
Video fingerprinting, or video hashing, is a class of dimensionality reduction techniques in which a system identifies, extracts, and then summarizes characteristic components of a video as a unique perceptual hash, or a set of fingerprints, enabling that video to be uniquely identified. This technology has proven to be effective at searching and comparing video files.
Recently, many video streaming services, such as YouTube, Twitch, and Facebook, have contributed
to video streaming traffic, leading to the possibility of streaming unwanted and inappropriate content
to minors or individuals at workplaces. Therefore, monitoring such content is necessary. Although the
video traffic is encrypted, several studies have proposed techniques using traffic data to decipher users’
activity on the web. Dynamic Adaptive Streaming over HTTP (DASH), the most widely adopted video streaming technology, uses Variable Bit-Rate (VBR) encoding to ensure smooth streaming. VBR causes inconsistencies in video identification in most research.
This research proposes a fingerprinting method to accommodate VBR inconsistencies. First, bytes
per second (BPS) are extracted from the YouTube video stream. Bytes per Period (BPP) are generated
from the BPS, and then fingerprints are generated from these BPPs. Furthermore, a Convolutional
Neural Network (CNN) is optimized through experiments. The resulting CNN is used to detect
YouTube streams over VPN, Non-VPN, and a combination of both VPN and Non-VPN network traffic.
With the advancement of technology and availability of mobile devices, the past few years have seen
an increase in video network traffic. CISCO claims video streaming to be the leading consumed media
that has become the major contributing factor to internet traffic [1]. For the security and privacy of
clients, internet traffic is encrypted, leaving little or no possibility of monitoring stream content. Minors
and adolescents can be induced to inappropriate content with unmonitored traffic.
TECHNICAL DESCRIPTION
• The popularity of DASH resulted in multiple industries starting to invest in this direction.
Furthermore, the Google search engine also ranks video streaming websites on the first page of
search results that adopt DASH streaming technology [5].
• Previous video identification frameworks relied on IP packet headers, ports, and content information to identify individual videos. However, with the rising popularity of such frameworks and their threats to user privacy and security, video streaming service providers started encrypting their streams to mitigate these issues. Most of the traffic flowing between clients and servers is secured by Secure Socket Layer (SSL) and Transport Layer Security (TLS) encryption over the HTTPS protocol. Such encryption restricts techniques such as Deep Packet Inspection (DPI) [6] from identifying individual videos streaming over a network.
• Furthermore, with the upsurge in the availability of free Virtual Private Networks (VPNs), more clients and adversaries are inclined to use them to hide their network activities, which further complicates streaming video identification.
METHODOLOGY
This section illustrates the methodology used for data collection, fingerprinting methodologies, and
producing a list of predictions through various classifiers as shown in Figure 3. The methodology is
defined in steps as (a) data collection, (b) preprocessing of data, (c) bytes per period, (d)
fingerprinting, and (e) summary of neural network.
• DATA COLLECTION We use Wireshark to capture the network traffic and generate packet capture
(PCAP) files against each video to generate the dataset of video streams. We utilize the Chrome
browser to play the YouTube videos and Selenium for automation. We selected 43 random videos from YouTube, and each video is downloaded 55 times. A desktop VPN client, SurfShark, is used for capturing the VPN streams.
• PRE-PROCESSING Wireshark exports the captured data in PCAP file format. Each generated PCAP file contains 120 seconds of the streaming video. As this file contains both uplink and downlink traffic, the dataset is cleaned by applying a filter on the IP address of the server (here, YouTube) and selecting only the downlink traffic. This process is done for both VPN and Non-VPN data, which mitigates the streaming noise of unrelated applications. It is achieved with the integrated Wireshark filter in the Conversations section.
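The downlink-filtering step above can be sketched in a few lines. This is only an illustration of the idea (the report does it inside Wireshark): the packet-record format and the server IP below are hypothetical.

```python
# Illustrative sketch of the pre-processing step: keep only downlink
# packets, i.e. packets whose source address is the video server.
# The record format and addresses are made up for illustration.

def filter_downlink(packets, server_ip):
    """Keep only downlink traffic (server -> client) for one server IP."""
    return [p for p in packets if p["src"] == server_ip]

# Toy capture: mixed uplink/downlink plus unrelated background traffic.
capture = [
    {"src": "173.194.0.1", "dst": "10.0.0.5", "len": 1400},  # video downlink
    {"src": "10.0.0.5", "dst": "173.194.0.1", "len": 66},    # uplink ACK
    {"src": "8.8.8.8", "dst": "10.0.0.5", "len": 120},       # unrelated noise
    {"src": "173.194.0.1", "dst": "10.0.0.5", "len": 1400},  # video downlink
]

downlink = filter_downlink(capture, "173.194.0.1")
print(len(downlink))  # 2
```

The uplink ACKs and background flows are discarded, leaving only the server-to-client byte stream from which BPS values are computed.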
• BYTES PER PERIOD (BPP) As mentioned above, DASH uses VBR encoding, which can cause irregularities in the sequence of BPS, effectively reducing the accuracy of machine learning models. For this reason, the BPS values are aggregated into periods of L seconds each, where L is a constant. In this paper, the value of L is set to 6, giving 20 BPP values for a 120-second stream.
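The BPS-to-BPP aggregation, followed by the differencing used by the Simple Difference Fingerprint (SDF) described later in this report, can be sketched as follows (a minimal sketch with toy numbers, not the paper's implementation):

```python
# Minimal sketch of the BPS -> BPP aggregation with L = 6, plus the
# difference step underlying the Simple Difference Fingerprint (SDF):
# the fingerprint is built from differences between consecutive BPPs.

L = 6  # seconds aggregated into one period

def bytes_per_period(bps, period=L):
    """Sum each consecutive block of `period` per-second byte counts."""
    return [sum(bps[i:i + period]) for i in range(0, len(bps), period)]

def sdf(bpp):
    """Differences between consecutive BPP values."""
    return [b - a for a, b in zip(bpp, bpp[1:])]

# 12 seconds of toy per-second byte counts with VBR-like fluctuation.
bps = [100, 120, 90, 110, 105, 95, 200, 180, 210, 190, 205, 195]
bpp = bytes_per_period(bps)
print(bpp, sdf(bpp))  # [620, 1180] [560]
```

With the paper's settings, a 120-second capture yields 120 / 6 = 20 BPP values, matching the CNN input size of 20 described later.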
A convolutional neural network (CNN) is a variant of the traditional neural network that can learn directly from data without manual feature extraction. A CNN generally consists of convolution layers, pooling layers, and a fully connected layer. CNNs are mainly used in pattern recognition, and their architecture makes them a preferred model for object detection in images, voice detection in audio, natural language processing (hate speech detection [2]), activity recognition (bot detection [11], human activity recognition [10], malware detection [3]), and digital signal classification.

The CNN model designed in this paper comprises four 1D convolutional layers, each having ReLU as its activation function. Each convolutional layer employs distinct kernels (also called filters) that independently convolve the input data and produce a feature map as the output. The kernel size is assigned a small number relative to the input size; the smaller kernel size helps the model learn more feature maps and improves the overall prediction accuracy. Each generated feature map is passed through an activation function (ReLU in our case) and on to the pooling layers.

The max-pooling layers separate the four convolutional layers. A pooling layer summarizes the result of the previous layer over neighboring outputs, ultimately reducing the size of the data without losing the key features. This reduction generalizes the repeating patterns while simultaneously reducing
memory requirements.

After the convolutional and max-pooling layers, a dropout layer is added. The dropout layer randomly disables some of the inputs of the previous layer to prevent the model from learning only a few input values and to restrain overfitting. The relative amount of input features disabled is defined by the probability value p of the dropout layer. The dropout layer's output is passed to the flatten layer, which converts the multidimensional pooled feature map into a one-dimensional vector so that it can be forwarded to the dense layer. The dense layer is a fully connected layer with all of its neurons connected to the neurons of the previous layer. The output of the dense layer is passed to a final dense layer, also called the output layer [51].

The fingerprint of BPP is a one-dimensional array; therefore, the input of the proposed CNN model is a one-dimensional series of BPP with a size of 20. The first convolutional layer has a kernel size of 5 with 300 filters and a stride of 1. A single neuron is connected to a cluster of five features of the input data. The output of the layer is 300 feature maps of size 19. Consequently, the first convolutional layer contains 1,800 trainable parameters (1,500 weights and 300 bias parameters). This layer is followed by a max-pooling layer that consists of 300 feature maps of size 50; each feature map of this layer is connected to two feature maps of the previous convolutional layer. The max-pooling layer has no trainable parameters. The second convolutional layer has 512 kernels, each of size 3, forming a total of 461,312 parameters that yield 512 feature maps of size 3. The trailing max-pooling layer after the second convolutional layer generates 512 feature maps of size 11. The third convolutional layer, containing 524,800 trainable parameters, has 512 kernels of size 1, which output 512 feature maps of size 1. Its successive max-pooling layer generates 512 feature maps of size 1. The last convolutional layer has 300 kernels of size 1 and 307,500 trainable parameters. Consequently, the last max-pooling layer produces 300 feature maps of size 1. After the last pooling layer, a dropout layer is added to disable arbitrary neurons from the previous layer with a probability of 0.8. The dropout layer is followed by a flatten layer that converts the pooled feature maps of the previous layer into a one-dimensional feature vector of size 3,300. The output layer, the last layer in the model, contains 141,743 trainable parameters for the 43 labels in the dataset.

The activation function selected for all the convolutional layers in this model is the ReLU function, and the softmax function is assigned as the activation function for the output layer. The ReLU function is quite simple: it outputs the input directly if it is a positive number, and zero otherwise. The softmax function is a generalization of logistic regression to handle multiple classes. For N output classes, it normalizes an N-dimensional vector of real values to an N-dimensional probability distribution vector with values in the range [0, 1]; the N-dimensional output vector is the probabilistic score of each corresponding class.

The cost function measures the difference between the model's prediction and the actual output and returns an error value. This error rate helps the model determine how much more optimization is needed. The cost function selected for the proposed model
is categorical cross-entropy. The optimization function, which assigns optimal weights to neurons in each layer, is the Adam optimizer. The complete architecture of the CNN model is shown in Figure 4. The softmax activation function is used for the dense layer, and the Adam optimizer is used for model optimization. The model is trained on three types of datasets: SDF, ADF, and DF.
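The building blocks described above (1D convolution, ReLU, max-pooling, and softmax) can be sketched without any deep-learning framework. This is a didactic sketch, not the paper's model; in practice a framework such as Keras or PyTorch would be used, and the exact feature-map sizes in the text depend on padding and stride settings the paper does not fully specify here.

```python
import math

# Didactic, dependency-free sketches of the CNN building blocks described
# above. Shapes are illustrative; a real model would use Keras/PyTorch.

def conv1d(x, kernel, bias=0.0, stride=1):
    """Valid 1D convolution of a list `x` with one kernel."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(0, len(x) - k + 1, stride)]

def relu(x):
    """Pass positives through, clamp negatives to zero."""
    return [max(0.0, v) for v in x]

def max_pool(x, size=2):
    """Keep the maximum of each non-overlapping window of `size`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def softmax(x):
    """Normalize N real values into an N-class probability distribution."""
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

# A size-5 kernel over a length-20 input yields 20 - 5 + 1 = 16 outputs
# per filter under valid convolution (no padding).
x = [1.0] * 20
fmap = relu(conv1d(x, [0.2] * 5))
pooled = max_pool(fmap)

# First-layer parameter count: 300 filters of size 5 plus one bias each,
# i.e. 5*300 + 300 = 1800 trainable parameters, matching the text.
params = 5 * 300 + 300
print(len(fmap), len(pooled), params)  # 16 8 1800
```

The softmax output always sums to 1, which is why the final layer's 43 values can be read directly as per-video probability scores.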
RESEARCH
Received 21 June 2022, accepted 15 July 2022, date of publication 19 July 2022, date of current version 26 July 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3192458
ABSTRACT Recently, many video streaming services, such as YouTube, Twitch, and Facebook, have
contributed to video streaming traffic, leading to the possibility of streaming unwanted and inappropriate
content to minors or individuals at workplaces. Therefore, monitoring such content is necessary. Although
the video traffic is encrypted, several studies have proposed techniques using traffic data to decipher users’
activity on the web. Dynamic Adaptive Streaming over HTTP (DASH), the most widely adopted video streaming technology, uses Variable Bit-Rate (VBR) encoding to ensure smooth streaming. VBR causes inconsistencies in video identification in most research. This research proposes a fingerprinting method to accommodate VBR inconsistencies. First, bytes per second (BPS) are extracted from the YouTube video stream. Bytes
per Period (BPP) are generated from the BPS, and then fingerprints are generated from these BPPs.
Furthermore, a Convolutional Neural Network (CNN) is optimized through experiments. The resulting
CNN is used to detect YouTube streams over VPN, Non-VPN, and a combination of both VPN and Non-
VPN network traffic.
INDEX TERMS Video identification, fingerprinting, deep learning, classification, variable bitrate.
The associate editor coordinating the review of this manuscript and approving it for publication was Shihong Ding. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ (Volume 10, 2022).

I. INTRODUCTION

With the advancement of technology and availability of mobile devices, the past few years have seen an increase in video network traffic. CISCO claims video streaming to be the leading consumed media that has become the major contributing factor to internet traffic [1]. For the security and privacy of clients, internet traffic is encrypted, leaving little or no possibility of monitoring stream content. Minors and adolescents can be exposed to inappropriate content through unmonitored traffic [2], [3]. Most video streaming platforms, such as YouTube, Facebook, and Twitch, have adopted Dynamic Adaptive Streaming over HTTP (DASH) technology to enhance the client's quality of experience (QoE). DASH uses the Variable Bitrate (VBR) encoding technique to stream video content to clients to ensure a smooth streaming experience [4].

The popularity of DASH resulted in multiple industries starting to invest in this direction. Furthermore, the Google search engine also ranks video streaming websites that adopt DASH streaming technology on the first page of search results [5].

The previous video identification frameworks rely on IP packet headers, ports, and content information to identify individual videos. However, with the rising popularity of such frameworks and their threats to user privacy and security, video streaming service providers started encrypting their streams to mitigate security issues. Most of the traffic flowing between clients and servers is secured by Secure Socket Layer (SSL) and Transport Layer Security (TLS) encryption technology over the HTTPS protocol. In conclusion, such encryption approaches restrict techniques
including Deep Packet Inspection (DPI) [6] to identify individual videos streaming over a network. Furthermore, with the upsurge in the availability of free Virtual Private Networks (VPNs), more clients and adversaries are inclined to use them to hide their network activities, which aggravates the streaming video identification.

FIGURE 1. Experimental platform of video identification pipeline.

VPN and SSL protocols prevent a middleman from viewing the content of network traffic between the two communicating parties. Although an SSL protocol only encrypts the packet, the whole packet is encapsulated within another packet in a VPN, which allows the client to bypass server blockage at the gateway by tunneling through a remote machine. A VPN is classified into two types based on its security protocol: SSL VPN and IPsec VPN. The VPN used in this research is OpenVPN [7], an SSL-type VPN that uses Hash-based Message Authentication Codes (HMAC) with the SHA1 hashing algorithm to ensure that the contents of the packets are intact. However, plentiful information regarding streaming video can be obtained via flow-based features, including the number of packets, packet sizes, burst sizes, and quantity of packet bursts. Machine learning and deep learning models [8], [9] can leverage these features to identify streaming videos.

Over the years, deep learning-based neural networks have outperformed traditional machine learning algorithms. A Convolutional Neural Network (CNN) is a particular type of artificial neural network that applies the convolution operation in at least one of its layers. Its high accuracy and efficiency have introduced many real-world applications, including human activity detection [10], natural language processing [2], bot detection [11], energy consumption prediction [12], smart city policing [13], risk assessment [14], and text simplification [15].

DASH streaming follows a specific pattern to send the videos to the client. In DASH, each video is divided into small segments, sometimes called chunks, and delivered to clients. This technique is used to increase the Quality of Experience (QoE) of the client. However, these segments are delivered to the client's device in a specific method, leaving a delivery pattern in the network traffic. This pattern can be used to identify the videos in the network traffic [18]–[21]. However, VBR encoding produces inconsistency in the streaming pattern as shown in Figure 2. These abnormalities sometimes create difficulties for the researchers to identify the video. Moreover, the client's variable network conditions also complicate the video identification process.

To address the aforementioned challenges, i.e., (a) video identification in the variable environment and (b) handling the abnormalities and inconsistencies cropped up by the DASH streaming technology, a fingerprinting method, Simple Difference Fingerprint (SDF), is proposed. This method is used to generate a stable fingerprint of a video. For this purpose, the Bytes per second (BPS) of the video stream are extracted, aggregated into periods, and used for video identification as shown in Figure 1. The details of the creation of periods and fingerprints are provided in Section III.

The proposed SDFs are used to train a convolutional neural network (CNN) to classify the VBR video. Initially, the CNN is fine-tuned through rigorous experiments to deduce the best hyperparameters for different layers, including convolutional, pooling, dropout, and dense layers, alongside other model hyperparameters such as batch size, optimizer, and the number of epochs. After tuning, the optimized model is used for video classification with different
III. METHODOLOGY
This section illustrates the methodology used for data
collection, fingerprinting methodologies, and producing a
list of predictions through various classifiers as shown in
Figure 3. The methodology is defined in steps as (a) data
collection, (b) preprocessing of data, (c) bytes per period, (d)
fingerprinting, and (e) summary of neural network.
A. DATA COLLECTION
We use Wireshark to capture the network traffic and generate
packet capture (PCAP) files against each video to generate
the dataset of video streams. We utilize the Chrome browser
to play the YouTube videos and Selenium for automation. We
selected 43 random videos from YouTube and each video is
downloaded 55 times. A desktop client SurfShark is used for
capturing the VPN streams. In conclusion, the resultant data
set consists of 86 total labels - 43 non-VPN titles and 43 VPN
titles and each video stream is captured for 120 seconds.
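The dataset arithmetic above (43 titles, Non-VPN and VPN modes, 55 captures each) can be made concrete with a small sketch. The label strings below are purely illustrative; the paper does not specify its label encoding.

```python
# Sketch of how the dataset described above is organized: 43 video titles,
# each captured in Non-VPN and VPN mode, giving 86 distinct labels, with
# 55 two-minute captures per video. Label names are hypothetical.

NUM_VIDEOS = 43
STREAMS_PER_VIDEO = 55

labels = [f"{mode}-video{v:02d}"
          for mode in ("nonvpn", "vpn")
          for v in range(NUM_VIDEOS)]

# One 120-second capture per (label, repetition) pair.
captures = [(lbl, rep) for lbl in labels for rep in range(STREAMS_PER_VIDEO)]

print(len(labels), len(captures))  # 86 4730
```

So the full corpus amounts to 86 labels and 86 × 55 = 4,730 captured streams.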
FIGURE 3. Architectural design of video identification pipeline.
B. PRE-PROCESSING
Wireshark exports the captured data in pcap file format. Each
generated pcap file contains 120 seconds of the streaming
video. As this file contains both uplink and downlink traffic,
this dataset is cleaned by applying a filter through the IP
address of the server, which in this case is YouTube, and
selecting only the downlinks. This process is done for VPN
and Non-VPN data. Thus, it mitigates the streaming noise of
unwanted applications. This is achieved by the integrated
Wireshark filter in the conversation section.
V. COMPARISON OF DIFFERENT FINGERPRINTING TECHNIQUES

After fine-tuning the model, the same model is applied to different datasets to check the accuracy. This section compares the accuracy of the fingerprinting techniques, i.e., SDF, ADF, and DF. For this purpose, four types of datasets are generated by following the method mentioned in Section III. The first dataset is prepared by capturing the 43 video streams in normal traffic mode (Non-VPN mode). Similarly, the same videos are captured using an encryption technique (VPN) for the second dataset. We combine the first two datasets for the third dataset and obtain a dataset containing 86 videos, 43 in normal traffic mode and 43 in encrypted mode. For the fourth dataset, we label all videos captured in normal traffic mode as Non-VPN, and videos captured using a VPN are labeled as VPN. In this case, we get a dataset containing two labels, VPN and Non-VPN, and call it a traffic dataset. The improved model is trained on the aforementioned datasets. Our proposed method, SDF, outperforms the other two techniques on all datasets, as shown in Figure 12. The results are summarized in Table 4.

FIGURE 12. Accuracy comparison of different fingerprinting techniques on different datasets.
TABLE 4. Accuracy comparison of datasets on different fingerprinting methods.

VI. DISCUSSION

From the results, the proposed framework for video and traffic identification is quite persistent. The performance of the baseline model is increased by up to 12.21% with hyperparameter tuning. The SDF technique outperforms other techniques on all datasets, as demonstrated in Section V. However, video streams over VPN are relatively more challenging to detect compared to Non-VPN streams. Moreover, the accuracy of distinguishing between VPN and Non-VPN traffic is 99%.

The proposed framework is quite feasible for detecting known videos in the network, as only 55 streams of a single video are required for training. However, in a real-world scenario, there are more unknown videos than known videos. Moreover, our proposed technique requires a huge storage capacity and computational requirement for video detection. As the proposed framework works on known videos, the model must be trained on a large dataset, limiting the scope of implementation.

Moreover, there are some shortcomings of this technique. The framework is set back by the phenomenon of 'concept drift'. Hence, the proposed model requires a substantial amount of computational and space resources at the observer's end, thus creating a challenge for large-scale deployment. Moreover, detection is only possible if the video is streamed exactly from the start for the first 120 seconds. Changing the video runtime within the first 120 seconds can lead to abnormal predictions.

VII. CONCLUSION AND FUTURE WORK

Irregularities and inconsistencies due to the VBR encoding of the video make it challenging to identify the videos in the network traffic. To address the aforementioned problem, this paper converts BPSs into BPPs and presents a stable fingerprinting method, SDF. The SDF works on the difference between the BPPs to identify the VBR video streamed in encrypted network traffic. The created SDFs are used to train the CNN model. After tuning the model's hyperparameters, the model achieves an accuracy of 90% and 99% in predicting videos and classifying traffic, respectively. Additionally, the effects of variable period length on the model's prediction accuracy are yet to be analyzed. We aim to modify the technique to cope with the concept drift problem in the future. Observing the effect of variable period length and finding the optimal value will make this technique more foolproof and increase its practical deployment applications.

REFERENCES

[1] U. Cisco. (2020). CISCO Annual Internet Report (2018–2023) White Paper. Accessed: Dec. 15, 2021. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annualinternet-report/whitepaper-c11-741490html
[2] M. U. S. Khan, A. Abbas, A. Rehman, and R. Nawaz, "HateClassify: A service framework for hate speech identification on social media," IEEE Internet Comput., vol. 25, no. 1, pp. 40–49, Jan. 2020.
[3] M. Khan, D. Baig, U. S. Khan, and A. Karim, "Malware classification framework using convolutional neural network," in Proc. Int. Conf. Cyber Warfare Secur. (ICCWS), Oct. 2020, pp. 1–7.
[4] A. Reed and B. Klimkowski, "Leaky streams: Identifying variable bitrate DASH videos streamed over encrypted 802.11n connections," in Proc. 13th IEEE Annu. Consum. Commun. Netw. Conf. (CCNC), Jan. 2016, pp. 1107–1112.
[5] R. Dubin, O. Hadar, I. Richman, O. Trabelsi, A. Dvir, and O. Pele, "Video quality representation classification of Safari encrypted DASH streams," in Proc. Digit. Media Ind. Acad. Forum (DMIAF), Jul. 2016, pp. 213–216.
Dvir, A., Marnerides, A. K., Dubin, R., Golan, N. and Hajaj, C. (2020) Encrypted video
traffic clustering demystified. Computers and Security, 96, 101917.
(doi: 10.1016/j.cose.2020.101917)
http://eprints.gla.ac.uk/221829/
Amit Dvir (1), Angelos K. Marnerides (2), Ran Dubin (1), Nehor Golan (1), Chen Hajaj (3,4)
(1) Cyber Innovation Center, Department of Computer Science, Ariel University, Israel
(2) InfoLab21, School of Computing and Communications, Lancaster University, UK
(3) Data Science and Artificial Intelligence Research Center, Ariel University, Israel
(4) Department of Industrial Engineering and Management, Ariel University, Israel
Corresponding Author: Dr. Amit Dvir, Golan Heights 65, Ariel, amitdv@g.ariel.ac.il
Abstract—Cyber threat intelligence officers and forensics investigators often require the behavioural profiling of groups based on their online video viewing activity. It has been demonstrated that encrypted video traffic can be classified under the assumption of using a known subset of video titles based on temporal video viewing trends of particular groups. Nonetheless, composing such a subset is extremely challenging in real situations. Therefore, this work exhibits a novel profiling scheme for encrypted video traffic with no a priori assumption of a known subset of titles. It introduces a seminal synergy of Natural Language Processing (NLP) and Deep Encoder-based feature embedding algorithms with refined clustering schemes from off-the-shelf solutions, in order to group viewing profiles with unknown video streams. This study is the first to highlight the most computationally effective, accurate combinations of feature embedding and clustering using real datasets, thereby paving the way to future forensics tools for automated behavioural profiling of malicious actors.

Index Terms—Encrypted Traffic, Video Title, Clustering, YouTube, NLP

1. Introduction

Governmental law enforcement bodies in developed countries, the UN Counter Terrorism Committee, and the Security Council have stressed the importance of tracking terrorists through profiling the Internet-wide behaviour of extremist propaganda groups [1]. Unfortunately, recent events such as the March 2019 mosque shootings in Christchurch, New Zealand, have prompted militant groups to broadcast their hate speech through online videos on various social networking sites and YouTube [2]. In parallel, the various shootings in the US from 2013 up to the latest one in Virginia Beach in May 2019 highlight the correlation of the shooters' behaviour with viewing content through popular video streaming sites such as YouTube and Netflix [3].

With the introduction of the General Data Protection Regulation (GDPR), there is a justified act towards the support of human rights. Nonetheless, we are also now witnessing the weakness of law enforcement: intelligence agencies, as well as forensics investigators, are no longer capable of utilizing conventional traffic analysis techniques such as Deep Packet Inspection (DPI).

Given the various unavoidable limitations on traffic classification tasks, a number of studies have looked into the analysis of encrypted Internet traffic. Nevertheless, little has been done in the context of explicitly dealing with encrypted video traffic. Most importantly, the majority of encrypted video classification and clustering studies rely on assumptions that are not necessarily true in some scenarios. For instance, the work by Dubin et al. in [4] considers a pool of video titles for conducting supervised classification of encrypted transport layer flows (e.g., TCP, UDP) obtained from YouTube video streams. The applicability of the proposed scheme was demonstrated in the scenario where an external attacker could identify a video title from the video streams as long as the title of the video stream was from a given video title set. Similarly, the studies in [5] and [6] also utilised a number of supervised Machine Learning (ML) algorithms by considering already known video titles associated with statistical properties of encrypted network flows. Hence, none of these studies consider the pragmatic scenario in which encrypted video streams may be clustered or classified with no knowledge of any video titles. In fact, the majority of encrypted video streams from most content providers include the video titles in the encrypted payload information of the network flow, hence the aforementioned techniques would be greatly affected in terms of classification accuracy.

In order to confront the challenging task of clustering encrypted video with no a priori knowledge of video titles, the preliminary work in [7] exploited the usefulness of Natural Language Processing (NLP). The proposed solution was composed of a single clustering method relying on a language derived from meta-statistics of network traffic features per encrypted video stream. However, the statistical features within the clustering process were manually selected by empirically observing the distributional behaviour. Thus, there was minimal automation, with reasonably high computational costs in some instances.

Therefore, this work goes beyond previously introduced pieces of work and extends the preliminary
preemptively tracking extremists and hate groups through
findings
the Internet. Thus profiling the online video viewing
behavior of such groups has become quite a challenging
task due to the encrypted nature of the Internet video
streams through protocols such as HTTPS 1. Therefore
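The "language" that [7] derives from traffic meta-statistics can be illustrated with a toy quantiser; the bin edges and token names below are hypothetical illustrations, not the paper's actual feature set:

```python
# Toy illustration: turn a per-second byte-count series for one encrypted
# stream into a sequence of "words" that NLP embeddings (e.g. Word2Vec or
# Bag of Words) can consume. Bin edges and tokens are hypothetical.
BINS = [0, 1, 100_000, 500_000]               # bytes/sec thresholds
TOKENS = ["idle", "low", "medium", "high"]    # one token per bin

def tokenize(bps_series):
    """Map each bytes-per-second sample to a coarse rate token."""
    words = []
    for bps in bps_series:
        idx = sum(bps >= edge for edge in BINS) - 1   # highest bin <= bps
        words.append(TOKENS[idx])
    return words

stream = [0, 5_000, 250_000, 800_000, 12_000]
print(tokenize(stream))   # ['idle', 'low', 'medium', 'high', 'low']
```

The resulting token sequences play the role of "documents", so off-the-shelf text-embedding tooling can be applied to traffic without ever seeing a video title.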
method that identifies outliers could be of use prior to the employment of the clustering process.

5.1.4. BPPs Set Intersection. The BPPs set intersection is treated here as an affinity function within formulations that parametrically accept such functions. Hence, for this technique there is only provision of results for the spectral and the AHC-based clustering algorithms, which do consider affinity functions within their corresponding formulations. By relating the results depicted in Table 4 to the desired metrics discussed in Table 3, it is clear that BPPs produce the optimal outputs through all metrics apart from the NPC. It can therefore be seen that BPP-based clustering methods are capable of producing pure clusters to which viewers are correctly allocated based on their respective video category (i.e., Clustering Purity - CP). It is evident that the composed spectral-based clusters have a high precision range due to the high scores obtained in terms of the number of cleanly clustered streams (i.e., CCS).

Furthermore, AHC-based clustering schemes utilising the BPPs set intersection affinity function do not perform equally well, despite the fact that the number of pure clusters through the NPC metric reaches 0.99. Through both schemes it is evident that the average number of streams in each cluster is 0.02. Consequently, the latter result suggests that most of those clean clusters consist of single users watching the same video, while the last, unclean cluster has all the remaining users.

5.2. Computational Performance

In order to assess the efficacy of the various algorithms and whether they can be used in practical, real-time scenarios, the study focuses on the computational time required for each to produce a clustering output. Table 5 depicts the resulting computational performance based on experiments conducted on an Intel i7 machine at 3.5 GHz with 11 CPUs and 57 GB of RAM. As illustrated, the majority of embedding and clustering combinations may adequately operate in a close to real-time scenario and provide meaningful profiling of user viewing behaviour. From a pure embedding perspective, there is evidence to show that the least intensive clustering outputs are produced when employing the Word2Vec feature engineering and embedding. By contrast, the most computationally intensive scheme results from utilising features from the Deep Auto Encoder approach.

As already discussed earlier, the spectral clustering with the BPPs set intersection outperformed all other models in terms of the CCS metric while maintaining high scores on the NPC metric (i.e., both close to 1). Nonetheless, despite the high clustering performance for some of the assessed clustering performance metrics, there is a significant computational trade-off in terms of runtime. In fact, spectral-based clustering based on BPPs set intersections requires 24.2 minutes in order to generate any clustering outputs. Hence, such an approach would not be ideal in scenarios where real-time clustering is required. In general, the conducted evaluation suggests that the most optimal solution for addressing real-time grouping of encrypted video streams would be any clustering based on Word2Vec embedding, since it takes less than 26 seconds for any algorithm to produce an output. Nonetheless, as discussed in Section 5.1, Word2Vec-based clustering also poses some trade-offs related to the cluster purity (i.e., CP metric) and also the amount of clean traffic streams that are considered to be pure within a given cluster (i.e., NSCB). Therefore, it is recommended to consider such an approach purely on a real-time basis and further compare its outputs with an offline clustering process. Based on the results discussed in Section 5.1, the evaluation identifies clustering based on BPPs set intersection embedding to have the best clustering performance metrics on average, but with a high computational cost in terms of run time.

TABLE 5: Computational Run Time Performance

Embedding               Clustering     Runtime [sec]
Word2Vec                K-Means        25.20
Word2Vec                GMM            25.25
Word2Vec                AHC (cosine)   25.24
Word2Vec                AHC (L2)       25.1
Bag of Words            K-Means        1,461.6
Bag of Words            GMM            3,565.3
Bag of Words            AHC (cosine)   2,864.7
Bag of Words            AHC (L2)       1,913.4
Deep Auto Encoder       K-Means        96,628 (~1.1 days)
BPPs Set Intersection   Spectral       1,451
BPPs Set Intersection   AHC            452.5

6. Conclusion

This work presents an extensive study highlighting a novel, synergistic use of off-the-shelf clustering schemes with NLP and learning-based embedding schemes for profiling encrypted YouTube video traffic streams. In contrast with previous studies, the authors propose clustering with no a priori assumption on a known subset of titles, and argue in terms of the potential usefulness of such an approach for automated behavioural profiling of malicious actors. The results indicate that it is feasible to obtain extremely high clustering performance under optimal computational costs, which may contribute towards forensic investigations required by law enforcement bodies. It can be seen that the combination of an NLP-based embedding scheme with either K-means or GMM-based clustering is a candidate for real-time and scalable grouping of the viewing behaviour of end-users over encrypted video streams, and that a combination of learning-based embedding, as well as spectral or hierarchical clustering under BPPs set intersection, can adequately serve the purposes of highly accurate profiling for cases with small datasets.
Abstract—Information centric networking (ICN) improves communication quality by flexible content-based control according to application-level information of video content. Adaptive video streaming is then applied to ICN to improve the client's quality of experience (QoE). In previous research on adaptive bitrate (ABR) algorithms for ICN, the buffer-based ABR algorithm (BBA), which is designed to change the bitrate according to the video buffer length of the video player in a client, is effective for the QoE of a single client. However, in terms of QoE fairness among BBA clients sharing a bottleneck, since each client adjusts its bitrate independently, the clients communicating at the lower bitrate have to decrease their bitrate even when the bottleneck link is occupied by the other clients communicating at the higher bitrate. This results in QoE unfairness and degradation of overall QoE for all clients. To improve overall QoE, this paper presents an approach to decrease bitrate differences among clients by increasing the bitrate of the lower bitrate clients before the bandwidth of the bottleneck link is excessively occupied by communications of the higher bitrate clients, in terms of QoE fairness. We then propose notification-based bitrate control in which distributed servers or routers detect congestion early and explicitly notify the lower bitrate clients to increase their bitrate before the bottleneck link is occupied, while encouraging the highest bitrate clients to decrease theirs for congestion avoidance. Through simulation experiments, the proposed system improves both QoE fairness and overall QoE metrics.

Index Terms—ICN, Adaptive Video Streaming, QoE, Fairness

I. Introduction

Information centric networking (ICN) has attracted attention as a way to accommodate the large volume of video content traffic on the Internet [1]. In ICN, the name address, which represents hierarchical information of content attributes, is given as the communication ID of a packet, and each ICN node associates the decentralized network of a content request by Interest packets and a content distribution by Data packets [2]. By understanding the content information of the Interest or Data packet with reference to the name address, ICN enables flexible control based on the content information and thus improves communication quality. Therefore, adaptive video streaming over ICN is applied to improve both the communication quality and the client's quality of experience (QoE) [3].

Adaptive video streaming finds wide use in industry because the adaptive bitrate (ABR) algorithm avoids QoE degradation due to congestion by adjusting the bitrate according to the estimated available bandwidth [4]. For adaptive video streaming, ABR algorithms are classified into two methodologies depending on the reference used for bitrate adjustment: the rate-based ABR algorithm (RBA) and the buffer-based ABR algorithm (BBA) [5]. These ABR algorithms need to jointly maximize QoE as well as ensure QoE fairness among clients [6]. In terms of a single client's QoE, BBA is basically effective because conservative buffer-based bitrate adjustment avoids the bitrate overestimation caused by the throughput-based bitrate adjustment of RBA [7], [8]. However, in terms of QoE fairness among multiple clients sharing a bottleneck, BBA clients may suffer from QoE unfairness, because each BBA client has to decrease its bitrate for congestion avoidance independently, even if the bitrate of its video communication is low and not fair among the clients. Furthermore, bitrate unfairness potentially causes overall QoE degradation by decreasing the bitrate of the lower bitrate client, because a decrease at a lower bitrate is worse than a decrease at a higher bitrate in terms of QoE [9].

For QoE fairness in adaptive video streaming, a server-assisted scheme in which a fixed server centrally manages the fair allocation of bandwidth among connected clients was proposed for TCP/IP [10], [11]. However, in ICN, since the content location is not identified for communications, these fixed-server approaches are not acceptable. In addition, although application-level information (e.g., an encoded bitrate) explicitly affects QoE fairness, those approaches are implicitly adjusted for QoE fairness by depending on short-term reference to communication control information (e.g., bandwidth). ICN, in contrast, makes it possible to apply flexible network control according to application-level information for the QoE fairness of adaptive video streaming.

To improve overall QoE among clients sharing a bottleneck, this paper presents a network-assisted scheme to encourage QoE fairness among the clients by increasing the bitrate of the lower bitrate clients before the bottleneck link is occupied, while decreasing the bitrate of the highest bitrate clients for congestion avoidance with relatively low QoE losses. We then propose notification-based bitrate control in which the server or router detects congestion at an early stage and explicitly notifies the lower bitrate clients to increase their bitrate by additionally using the bandwidth given by mitigating excessive
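QoE fairness of the kind discussed above is commonly quantified with Jain's fairness index [20]. A minimal sketch (the per-client QoE values are illustrative):

```python
def jain_fairness(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).
    1.0 means a perfectly equal allocation across the n clients;
    1/n means a single client takes everything."""
    n = len(values)
    s = sum(values)
    sq = sum(v * v for v in values)
    return (s * s) / (n * sq)

print(jain_fairness([4.0, 4.0, 4.0]))   # 1.0 (equal QoE)
print(jain_fairness([6.0, 0.0, 0.0]))   # 0.333... (one client dominates)
```

Applied to per-client QoE scores, the index makes the paper's "QoE fairness" objective a single number that a controller (or an evaluation) can track alongside overall QoE.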
[Figure 1: Overview of adaptive video streaming over ICN. The client's application layer holds the video player engine with the MPD, N video segments, and the ABR algorithm; the transport layer performs transport control; the server and a router with cache exchange Interest and Data packets through the tx queues of the downlink NICs.]

Algorithm 1 Bitrate adjustment for Bitrate_next in BBA-0.
1: {R1, ..., RM} ⇐ The given discrete bitrate levels from MPD;
2: Bitrate_prev ⇐ The previously used bitrate;
3: B_cur ⇐ The current video buffer length (sec);
4: r ⇐ The size of reservoir;
5: cu ⇐ The size of cushion;
6: f(B) ⇐ The linear mapping function from B_cur to R_m;
17: if B_cur ≤ r then
18:   Bitrate_next = min(R_m);
19: else if B_cur ≥ (r + cu) then
20:   Bitrate_next = max(R_m);
21: else if f(B_cur) ≥ Bitrate+ then
22:   Bitrate_next = max(R_m | R_m < f(B_cur));
23: else if f(B_cur) ≤ Bitrate− then
24:   Bitrate_next = min(R_m | R_m > f(B_cur));
25: else
26:   Bitrate_next = Bitrate_prev;
27: end if
28: return Bitrate_next;

improvement in QoE [3]. Figure 1 shows the overview of adaptive video streaming over ICN. In Fig. 1, the server stores the video contents composed of the media presentation description (MPD), which includes information for initializing video streaming (e.g., the number of video segments, discrete bitrate levels, audio information), and N video segments of short play time encoded in M bitrate levels¹. The server and the router with cache forward the video content to the client by store-and-forward using the transmission (tx) queue of the NIC. In addition, the router caches the Data packets to its cache storage before forwarding. In the application layer of the client, the video player engine is equipped with a fixed-length video buffer which stores downloaded video segments sequentially, and playback of a stored video segment is performed only when the video buffer length becomes larger than a predefined threshold. Additionally, the ABR algorithm is installed with the video player engine for adaptive video streaming. The ABR algorithm adjusts the bitrate for each video segment before downloading it, according to the following procedures for adaptive video streaming with transport control in the transport layer.

To start adaptive video streaming, the video player engine first requests the MPD for initialization. After receiving the MPD, the video player engine performs adaptive video streaming from the first video segment to the last video segment in sequence using the ABR algorithm as follows.
1) The ABR algorithm selects the bitrate of the video segment from the given discrete bitrate levels in the MPD.
2) The video player engine downloads the video segment of the selected bitrate.

¹The encoded bitrate represents the average throughput required to download a video segment encoded in that bitrate.
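The rate-map step of Algorithm 1 (lines 17-28, the BBA-0 scheme of [12]) can be sketched in Python. The linear map f and the parameter values below are illustrative assumptions, not the paper's settings:

```python
def bba0_next_bitrate(levels, prev, buf, reservoir, cushion):
    """BBA-0 bitrate selection (Algorithm 1, lines 17-28).
    levels: sorted discrete bitrate levels {R1..RM} from the MPD."""
    r_min, r_max = levels[0], levels[-1]

    def f(b):
        # Linear map from buffer length to a bitrate (illustrative form).
        frac = (b - reservoir) / cushion
        return r_min + frac * (r_max - r_min)

    # Bitrate+ / Bitrate- taken as the next level up/down from prev.
    i = levels.index(prev)
    up = levels[min(i + 1, len(levels) - 1)]
    down = levels[max(i - 1, 0)]

    if buf <= reservoir:              # lines 17-18: drain risk, go lowest
        return r_min
    if buf >= reservoir + cushion:    # lines 19-20: buffer full, go highest
        return r_max
    if f(buf) >= up:                  # lines 21-22: step up below the map
        return max(R for R in levels if R < f(buf))
    if f(buf) <= down:                # lines 23-24: step down above the map
        return min(R for R in levels if R > f(buf))
    return prev                       # lines 25-26: hold the previous level

levels = [350, 700, 1500, 3000]       # kbps, illustrative
print(bba0_next_bitrate(levels, prev=700, buf=5, reservoir=10, cushion=30))
# 350: buffer below the reservoir -> lowest level
```

The reservoir protects against rebuffering when the buffer is nearly empty, while the cushion region applies the linear map so the bitrate ramps up smoothly instead of oscillating.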
III. Increasing a lower bitrate with fast-retransmission to encourage fairness for QoE improvement

A. An overview

In terms of BBA clients sharing a bottleneck, all the clients have to decrease the bitrate even if the bitrate is not fair among the clients. Therefore, the bitrate differences do not decrease, resulting in degradation of QoE fairness and overall QoE due to the excessively lower bitrate. In contrast, our approach presents network-assisted control to selectively increase the lower bitrate by explicitly notifying the lower bitrate clients to additionally use bandwidth early in the congestion of the bottleneck link. We then propose the following fast-retransmission notification control (FNC) for adaptive video streaming over ICN.

Figure 2 shows the overview of FNC based on Fig. 1. In addition, to adjust the QoE fairness among the accommodated clients, a server or router newly adds the fast-retransmission notification according to the bitrate set of the clients, composed of the following procedures 1 and 2 in Fig. 2. First, to evaluate bitrate differences among the clients, the bitrate set, which is the list of bitrates of the videos currently downloaded by the clients, is constructed by the router by referring to the name address of Interest or Data packets in communications. Second, to force the lower bitrate clients to use more bandwidth before the bottleneck link is occupied by the highest bitrate clients, the fast-retransmission signal is set in the forwarded Data packets for the clients below the highest bitrate. Then, to cooperate with the fast-retransmission notification from the server or router, notification-based fast-retransmission control is performed at the client that receives the notification. Thus, the lower bitrate client increases the bitrate by additionally using the bandwidth given by mitigating the excessive traffic for the highest bitrate clients before the bottleneck link is occupied. In the following subsection, we describe the details of the fast-retransmission notification according to the bitrate set on the server or router.

B. The fast-retransmission notification according to the bitrate set on a server or router

In FNC, to decrease bitrate differences among the clients sharing a bottleneck, a server or router explicitly identifies the bitrate set of the clients with reference to the name address of the Interest or Data packets in communications. The bitrate set is periodically updated, and entries that have had no reference since the last update are deleted. Then, according to the bitrate set, the server and router explicitly notify the clients below the highest bitrate with the fast-retransmission signal upon early congestion detection based on the tx queue occupation. Algorithm 2 shows the fast-retransmission notification processed every time before forwarding a Data packet. In Algorithm 2, before forwarding the Data packet to the NIC, a bottleneck with early congestion is detected according to the tx queue occupation of the NIC and the predefined congestion threshold (lines 3-5 in Algorithm 2). If detected, the following procedures are processed according to the bitrate taken from the name address of the Data packet (lines 6-11 in Algorithm 2). First, if the bitrate is the same in the bitrate set, no control is needed for bitrate fairness. If not, the fast-retransmission signal is set in the option fields of the Data packet to force the clients below the highest bitrate to use more bandwidth. Finally, the Data packet is forwarded to the tx queue of the NIC, which is designed to send the Data packet with the fast-retransmission signal for fast download (line 13 in Algorithm 2).

Algorithm 2 The fast-retransmission notification according to the bitrate set
1: {R1, ..., RL} ⇐ The bitrate set of the clients;
2: if a Data packet arrives at a NIC then
3:   Cth ⇐ The congestion threshold in [0, 1);

QoE-lin = \sum_{n=1}^{N} q(R_n) - \lambda \sum_{n=1}^{N-1} |q(R_{n+1}) - q(R_n)| - \mu \sum_{n=1}^{N} b_n - \mu_D D    (1)
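Equation (1) can be evaluated directly. In the sketch below, q is taken as the identity, b_n is read as the rebuffering time of segment n and D as the startup delay (the standard QoE-lin reading, assumed here), and the weights lam, mu, mu_d are illustrative values, not the paper's settings:

```python
def qoe_lin(bitrates, rebuf, startup_delay,
            lam=1.0, mu=1.0, mu_d=1.0, q=lambda r: r):
    """QoE-lin of Eq. (1): utility of the N segment bitrates, minus the
    bitrate-switching penalty, minus rebuffering and startup-delay terms."""
    utility = sum(q(r) for r in bitrates)
    switching = sum(abs(q(b) - q(a)) for a, b in zip(bitrates, bitrates[1:]))
    rebuffering = sum(rebuf)
    return utility - lam * switching - mu * rebuffering - mu_d * startup_delay

# 3 segments at 2, 4, 4 Mbps; a 0.5 s stall on segment 2; 1 s startup delay
print(qoe_lin([2, 4, 4], [0, 0.5, 0], 1.0))   # 10 - 2 - 0.5 - 1 = 6.5
```

The switching sum runs over N-1 consecutive pairs, which is why a steady bitrate trace scores strictly higher than an oscillating one with the same average bitrate.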
References
[1] Cisco, "Cisco Visual Networking Index: Forecast and Trends, 2017–2022," 2017.
[2] D. Kutscher et al., "Information-Centric Networking (ICN) Research Challenges," RFC 7927, 2016.
[3] C. Westphal et al., "Adaptive Video Streaming over Information-Centric Networking (ICN)," RFC 7933, 2016.
[4] ISO/IEC 23009-1:2012, Information Technology - Dynamic Adaptive Streaming over HTTP (DASH).
[5] Y. Sani, A. Mauthe and C. Edwards, "Adaptive Bitrate Selection: A Survey," IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2985-3014, 2017.
[6] N. Barman and M. G. Martini, "QoE Modeling for HTTP Adaptive Video Streaming - A Survey and Open Challenges," IEEE Access, vol. 7, pp. 30831-30859, 2019.
[7] B. Rainer, D. Posch, and H. Hellwagner, "Investigating the performance of pull-based dynamic adaptive streaming in NDN," IEEE JSAC, vol. 34, no. 8, pp. 2130–2140, 2016.
[8] K. Goto, Y. Hayamizu, M. Bandai and M. Yamamoto, "QoE performance of adaptive video streaming in information centric networks," in Proc. of IEEE LANMAN, pp. 1-2, 2019.
[9] Z. Duanmu, A. Rehman and Z. Wang, "A Quality-of-Experience Database for Adaptive Video Streaming," IEEE Transactions on Broadcasting, vol. 64, no. 2, pp. 474-487, 2018.
[10] Y. Zhang et al., "Improving Quality of Experience by Adaptive Video Streaming with Super-Resolution," in Proc. of IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pp. 1957-1966, 2020.
[11] Y. Yuan et al., "VSiM: Improving QoE Fairness for Video Streaming in Mobile Environments," in Proc. of IEEE INFOCOM 2022 - IEEE Conference on Computer Communications, pp. 1309-1318, 2022.
[12] T. Huang, R. Johari, N. McKeown, M. Trunnell, and M. Watson, "A buffer-based approach to rate adaptation: evidence from a large video streaming service," in Proc. of the ACM conference on SIGCOMM (SIGCOMM '14), pp. 187–198, 2014.
[13] W. Afandi, S. M. A. H. Bukhari, M. U. S. Khan, T. Maqsood and S. U. Khan, "Fingerprinting Technique for YouTube Videos Identification in Network Traffic," IEEE Access, vol. 10, pp. 76731-76741, 2022.
[14] G. Carofiglio, M. Gallo, and L. Muscariello, "ICP: Design and evaluation of an interest control protocol for content-centric networking," in Proc. of IEEE INFOCOM Workshops, pp. 304-309, 2012.
[15] C. Xia, M. Xu and Y. Wang, "A loss-based TCP design in ICN," in Proc. of 22nd Wireless and Optical Communication Conference, pp. 449-454, 2013.
[16] K. Schneider, C. Yi, B. Zhang and L. Zhang, "A Practical Congestion Control Scheme for Named Data Networking," in Proc. of ACM ICN, pp. 21-30, 2016.
[17] B. Rainer, C. Kreuzberger, D. Posch, and H. Hellwagner, "Demo: AMuSt-ndnSIM - adaptive multimedia streaming simulator for NDN."
[18] MPEG-DASH at ITEC/AAU, http://dash.itec.aau.at
[19] Z. Akhtar et al., "Oboe: Auto-tuning Video ABR Algorithms to Network Conditions," in Proc. of ACM SIGCOMM, pp. 44-58, 2018.
[20] R. Jain, D.-M. Chiu and W. Hawe, "A quantitative measure of fairness and discrimination for resource allocation in shared computer systems," 1984.
CONCLUSION
Different methods exist for video fingerprinting. Van Oostveen relied on changes in patterns of image intensity over successive video frames. This makes the video fingerprint robust against limited changes in colour, or against the transformation of the original video into grayscale. Others have tried to reduce the size of the fingerprint by only looking around shot changes.
Video fingerprinting does not rely on any addition to the video stream. A video fingerprint cannot
be removed, because it is not added. In addition, a reference video fingerprint can be created at any
point, from any copy of the video.
Irregularities and inconsistencies due to the VBR encoding of the video make it challenging to identify videos in network traffic. To address this problem, this paper converts BPSs into BPPs and presents a stable fingerprinting method, SDF. SDF works on the differences between BPPs to identify the VBR video streamed in encrypted network traffic. The created SDFs are used to train the CNN model. After tuning the model's hyperparameters, the model achieves an accuracy of 90% and 99% in predicting videos and classifying traffic, respectively. The effects of variable period length on the model's prediction accuracy are yet to be analysed; we aim to modify the technique to cope with the concept drift problem in future work. Observing the effect of variable period length and finding its optimal value would make this technique more robust and broaden its practical deployment.
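The BPS-to-BPP aggregation and the difference step that SDF builds on can be sketched as follows. This is a sketch only: the period length and input series are illustrative, and the paper's SDF may apply further normalisation before training the CNN.

```python
def bps_to_bpp(bps, period):
    """Aggregate a bytes-per-second (BPS) series into bytes-per-period
    (BPP) by summing fixed-length windows."""
    return [sum(bps[i:i + period]) for i in range(0, len(bps), period)]

def sdf(bps, period):
    """Stable fingerprint sketch: consecutive differences of the BPP
    series, which damps per-second VBR jitter while preserving the
    overall bitrate shape of the video."""
    bpp = bps_to_bpp(bps, period)
    return [b - a for a, b in zip(bpp, bpp[1:])]

bps = [10, 20, 30, 40, 50, 60]      # bytes per second, illustrative
print(bps_to_bpp(bps, 2))           # [30, 70, 110]
print(sdf(bps, 2))                  # [40, 40]
```

Because two encrypted downloads of the same VBR video shift and jitter at the per-second level but keep a similar per-period shape, the difference series is a more stable input for the CNN than the raw BPS trace.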