
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

"Jnana Sangama", Belgavi-590 018, Karnataka, India

A Technical Seminar Report


On

“Digital Video Fingerprinting”

Submitted in Partial Fulfilment of the requirement for the award of the degree of

BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING

Submitted By Ayush Tiwary


1MV19IS008

Technical Seminar Coordinator

Mrs. Suguna M K
Assistant Professor
Dept of ISE, Sir MVIT
Bengaluru-562157

Department of Information Science Engineering


Sir M Visvesvaraya Institute of Technology
[Affiliated to Visvesvaraya Technological University, Belgaum]
Bengaluru-562157 2022-2023

CERTIFICATE

Certified that the Technical Seminar entitled “Digital Video Fingerprinting”, carried out by Mr. Ayush
Tiwary, bearing USN 1MV19IS008, a bonafide student of Sir M V Institute of Technology, is in partial
fulfilment for the award of the Bachelor of Engineering in Information Science and Engineering of the
Visvesvaraya Technological University, Belagavi, during the year 2022-23. It is certified that all
corrections/suggestions indicated for the TECHNICAL SEMINAR have been executed under the directions
of Mrs. Shanta Biradar H, Assistant Professor, ISE, SMVIT. The report has been approved as it satisfies the
academic requirements in respect of the TECHNICAL SEMINAR prescribed for the said degree.

____________________________ ____________________________ ____________________________


Signature of the Guide Signature of the Coordinator Signature of the HOD

Mrs. Shanta Biradar H Mrs. Suguna M K Dr. G.C.Bhanu Prakash


Assistant Professor Assistant Professor HOD and Professor
Dept of ISE, Sir MVIT Dept of ISE, Sir MVIT Dept. of ISE,Sir MVIT


DECLARATION

I, Ayush Tiwary, bearing the USN 1MV19IS008, a student of 8th Semester B.E.,
Department of Information Science and Engineering, Sir M V Institute of Technology, Bengaluru,
declare that the TECHNICAL SEMINAR entitled “Digital Video Fingerprinting” has been
duly executed by me under the guidance of Mrs. Shanta Biradar H, Assistant Professor,
Department of Information Science and Engineering, Sir MVIT, Bengaluru. The TECHNICAL
SEMINAR report of the same is submitted in partial fulfilment of the requirement for the
award of the Bachelor of Engineering degree in Information Science and Engineering
by Visvesvaraya Technological University, Belagavi, during the year 2022-2023.

Date: 02-05-2023 Ayush Tiwary


Place: Bengaluru 1MV19IS008


ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of the TECHNICAL SEMINAR report
would be incomplete without the mention of the almighty God and the people who made it possible,
whose support rewarded the effort with the success of the technical seminar presentation.
We are grateful to Sir M V Institute of Technology for providing us with an opportunity to enhance our
knowledge through the technical seminar.
We express our sincere thanks to Dr. Rakesh S G, Principal, Sir MVIT, for providing us with an
opportunity and the means to present the technical seminar.
We express our heartfelt thanks to the technical seminar guide, Mrs. Shanta Biradar H, Assistant
Professor, Department of Information Science & Engineering, Sir MVIT, for her encouragement,
cooperation, and guidance in nurturing this technical seminar report.
We would like to express profound thanks to Mrs. Suguna M K, Assistant Professor and Technical
Seminar Coordinator, Department of Information Science and Engineering, for her keen interest and
encouragement in our technical seminar.
Finally, we would like to thank our family members and friends for standing with us through all
times.

Ayush Tiwary
1MV19IS008


ABSTRACT

Recently, many video streaming services, such as YouTube, Twitch, and Facebook, have
contributed to video streaming traffic, leading to the possibility of streaming unwanted and
inappropriate content to minors or individuals at workplaces. Therefore, monitoring such content is
necessary. Although the video traffic is encrypted, several studies have proposed techniques that use
traffic data to decipher users’ activity on the web. Dynamic Adaptive Streaming over HTTP (DASH),
the most widely adopted video streaming technology, uses Variable Bit-Rate (VBR) encoding to ensure
smooth streaming. VBR causes inconsistencies in video identification in most research. This research
proposes a fingerprinting method that accommodates VBR inconsistencies. First, bytes per second
(BPS) are extracted from the YouTube video stream. Bytes per Period (BPP) are generated from the
BPS, and then fingerprints are generated from these BPPs. Furthermore, a Convolutional Neural
Network (CNN) is optimized through experiments. The resulting CNN is used to detect YouTube
streams over VPN, Non-VPN, and a combination of both VPN and Non-VPN network traffic.


CONTENTS

Chapter No.
Chapter Name Page No.
1 INTRODUCTION 7

2 TECHNICAL DESCRIPTION 8

3 REFERENCE PAPER-I 13

4 REFERENCE PAPER-II 18

5 REFERENCE PAPER-III 24

6 CONCLUSION 31


INTRODUCTION

Video fingerprinting, or video hashing, is a class of dimension reduction techniques in which
a system identifies, extracts, and then summarizes characteristic components of a video as a unique
perceptual hash, or a set of multiple hashes or fingerprints, enabling that video to be uniquely identified.
This technology has proven to be effective at searching and comparing video files.

Recently, many video streaming services, such as YouTube, Twitch, and Facebook, have contributed
to video streaming traffic, leading to the possibility of streaming unwanted and inappropriate content
to minors or individuals at workplaces. Therefore, monitoring such content is necessary. Although the
video traffic is encrypted, several studies have proposed techniques that use traffic data to decipher users’
activity on the web. Dynamic Adaptive Streaming over HTTP (DASH), the most widely adopted video
streaming technology, uses Variable Bit-Rate (VBR) encoding to ensure smooth streaming. VBR causes
inconsistencies in video identification in most research.

This research proposes a fingerprinting method that accommodates VBR inconsistencies. First, bytes
per second (BPS) are extracted from the YouTube video stream. Bytes per Period (BPP) are generated
from the BPS, and then fingerprints are generated from these BPPs. Furthermore, a Convolutional
Neural Network (CNN) is optimized through experiments. The resulting CNN is used to detect
YouTube streams over VPN, Non-VPN, and a combination of both VPN and Non-VPN network traffic.

With the advancement of technology and the availability of mobile devices, the past few years have seen
an increase in video network traffic. CISCO claims video streaming to be the leading consumed media
and the major contributing factor to internet traffic [1]. For the security and privacy of clients, internet
traffic is encrypted, leaving little or no possibility of monitoring stream content. Minors and adolescents
can be exposed to inappropriate content with unmonitored traffic.


TECHNICAL DESCRIPTION

• The popularity of DASH resulted in multiple industries starting to invest in this direction.
Furthermore, the Google search engine also ranks video streaming websites that adopt DASH
streaming technology on the first page of search results [5].

• The previous video identification frameworks rely on IP packet headers, ports, and content
information to identify individual videos. However, with the rising popularity of such frameworks
and their threats to user privacy and security, video streaming service providers started encrypting
their streams to mitigate security issues. Most of the traffic flowing between clients and servers is
secured by Secure Socket Layer (SSL) and Transport Layer Security (TLS) encryption technology
over the HTTPS protocol. In conclusion, such encryption approaches restrict techniques, including
Deep Packet Inspection (DPI) [6], from identifying individual videos streaming over a network.

• Furthermore, with the upsurge in the availability of free Virtual Private Networks (VPNs), more
clients and adversaries are inclined to use them to hide their network activities, which aggravates
streaming video identification.


METHODOLOGY

This section illustrates the methodology used for data collection, fingerprinting methodologies, and
producing a list of predictions through various classifiers as shown in Figure 3. The methodology is
defined in steps as (a) data collection, (b) preprocessing of data, (c) bytes per period, (d)
fingerprinting, and (e) summary of neural network.

• DATA COLLECTION: We use Wireshark to capture the network traffic and generate packet capture
(PCAP) files for each video to build the dataset of video streams. We utilize the Chrome
browser to play the YouTube videos and Selenium for automation. We selected 43 random videos
from YouTube, and each video is downloaded 55 times. The SurfShark desktop client is used for
capturing the VPN streams.

• PRE-PROCESSING: Wireshark exports the captured data in pcap file format. Each generated pcap
file contains 120 seconds of the streaming video. As this file contains both uplink and downlink
traffic, the dataset is cleaned by applying a filter on the IP address of the server (here, YouTube)
and selecting only the downlinks. This process is done for both VPN and Non-VPN data. Thus, it
mitigates the streaming noise of unwanted applications. This is achieved by the integrated Wireshark
filter in the Conversations section.

• BYTES PER PERIOD (BPP): As mentioned above, DASH uses VBR encoding, which can cause
irregularities in the sequence of BPS, effectively reducing the accuracy of machine learning
models. For this reason, the BPS are aggregated into L segments, where L is any constant
number. In this paper, the value of L is set to 6.

• FINGERPRINTING: Fingerprinting is the process of representing a large piece of data by a small
string of bits that uniquely identifies it. Particularly, fingerprints are small labels for large data.
Due to the effectiveness of fingerprinting, many researchers have utilized the fingerprinting
technique in different scenarios.

• CONVOLUTIONAL NEURAL NETWORK (CNN) MODEL: A convolutional neural network (CNN)
is a variant of the traditional neural network that can learn directly from data without
manual feature extraction. A CNN generally consists of convolution layers, pooling layers, and
a fully connected layer.


CONVOLUTIONAL NEURAL NETWORK (CNN) MODEL

A convolutional neural network (CNN) is a variant of the traditional neural network that can learn
directly from data without manual feature extraction. A CNN generally consists of convolution layers,
pooling layers, and a fully connected layer. CNNs are mainly used in pattern recognition, and their
architecture makes them a preferred model for object detection in images, voice detection in audio,
natural language processing (hate speech detection [2]), activity recognition (bot detection [11], human
activity recognition [10], malware detection [3]), and digital signal classification.

The CNN model designed in this paper comprises four 1D convolutional layers, each having ReLU as
its activation function. Each convolutional layer employs distinct kernels (also called filters) that
independently convolve the input data and produce a feature map as the output. The kernel size is
assigned a small number relative to the input size; a smaller kernel size helps the model learn more
feature maps and improves the overall prediction accuracy. The generated feature map is passed
through an activation function (ReLU in our case) and on to the pooling layers. The max-pooling
layers separate the four convolutional layers. A pooling layer summarizes the result of the previous
layer to its neighboring outputs, ultimately reducing the size of the data without losing the key
features. This reduction of data generalizes the repeating patterns while simultaneously reducing
memory requirements.

After the convolutional and max-pooling layers, a dropout layer is added. The dropout layer randomly
disables some of the inputs of the previous layer to prevent the model from learning only a few input
values and to restrain overfitting. The fraction of input features to disable is defined by the probability
value p of the dropout layer. The dropout layer's output is passed to the flatten layer, which converts
the multidimensional pooled feature map into a one-dimensional vector to make it compatible with the
dense layer. The dense layer is a fully connected layer with all of its neurons connected to the neurons
of the previous layer. The output of the dense layer is passed to a final dense layer, also called the
output layer [51].

The fingerprint of BPP is a one-dimensional array; therefore, the input of the proposed CNN model is
a one-dimensional series of BPP with a size of 20. The first convolutional layer has a kernel size of 5
with 300 filters and a stride of 1. A single neuron is connected to a cluster of five features of the input
data. The output of the layer is 300 feature maps of size 19. Subsequently, the first convolutional layer
contains 1800 trainable parameters (1500 weights and 300 bias parameters). This layer is followed by
a max-pooling layer that consists of 300 feature maps of size 50. Each feature map of this layer is
connected to two feature maps of the previous convolutional layer. The max-pooling layer has no
trainable parameters. The second convolutional layer has 512 kernels, each of size 3, forming a total
of 461,312 parameters that yield 512 feature maps of size 3. The trailing max-pooling layer after the
second convolutional layer generates 512 feature maps of size 11. The third convolutional layer,
containing 524,800 trainable parameters, has 512 kernels of size 1, which output 512 feature maps of
size 1. Its successive max-pooling layer generates 512 feature maps of size 1. The last convolutional
layer has 300 kernels of size 1, with 307,500 trainable parameters. Consequently, the last max-pooling
layer produces 300 feature maps of size 1. After the last pooling layer, a dropout layer is added to
disable arbitrary neurons from the previous layer with a probability of 0.8. The dropout layer is
followed by a flatten layer that converts the pooled feature maps of the previous layer into a one-
dimensional feature vector of size 3,300. The output layer, the last layer in the model, contains 141,743
trainable parameters for the 43 labels in the dataset.

The activation function selected for all the convolutional layers in this model is the ReLU function,
and the softmax function is assigned as the activation function for the output layer. The ReLU function
is quite simple: it outputs the input directly if it is positive and zero otherwise. The softmax function
is a generalization of logistic regression that handles multiple classes. For N output classes, it
normalizes an N-dimensional vector of real values to an N-dimensional probability distribution vector
with values in the range [0, 1]; the N-dimensional output vector holds the probabilistic score of each
corresponding class.

The cost function measures the difference between the model's prediction and the actual output and
returns an error value. This error rate helps the model determine how much more optimization is
needed. The cost function selected for the proposed model

is categorical cross-entropy. The optimization function, which assigns optimal weights to the neurons
in each layer, is the Adam optimizer. The complete architecture of the CNN model is shown in Figure 4.
The softmax activation function is used for the dense layer, and the Adam optimizer is used for model
optimization. The model is trained on three types of datasets: SDF, ADF, and DF.


RESEARCH

Received 21 June 2022, accepted 15 July 2022, date of publication 19 July 2022, date of current version 26 July 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3192458

Fingerprinting Technique for YouTube Videos Identification in Network Traffic

WALEED AFANDI, SYED MUHAMMAD AMMAR HASSAN BUKHARI, MUHAMMAD U. S. KHAN (Member, IEEE),
TAHIR MAQSOOD, AND SAMEE U. KHAN (Senior Member, IEEE)

Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Abbottabad 22060, Pakistan
Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS 39762, USA
Corresponding author: Muhammad U. S. Khan (ushahid@cuiatd.edu.pk)
This work was supported in part by the National Centre of Cyber Security (NCCS), Pakistan, and in part by the Higher Education
Commission (HEC) under Grant RF-NCCS-023.

ABSTRACT Recently, many video streaming services, such as YouTube, Twitch, and Facebook, have
contributed to video streaming traffic, leading to the possibility of streaming unwanted and inappropriate
content to minors or individuals at workplaces. Therefore, monitoring such content is necessary. Although
the video traffic is encrypted, several studies have proposed techniques using traffic data to decipher users’
activity on the web. Dynamic Adaptive Streaming over HTTP (DASH), the most widely adopted video
streaming technology, uses Variable Bit-Rate (VBR) encoding to ensure smooth streaming. VBR causes
inconsistencies in video identification in most research. This research proposes a fingerprinting method
that accommodates VBR inconsistencies. First, bytes per second (BPS) are extracted from the YouTube video stream. Bytes
per Period (BPP) are generated from the BPS, and then fingerprints are generated from these BPPs.
Furthermore, a Convolutional Neural Network (CNN) is optimized through experiments. The resulting
CNN is used to detect YouTube streams over VPN, Non-VPN, and a combination of both VPN and Non-
VPN network traffic.

INDEX TERMS Video identification, fingerprinting, deep learning, classification, variable bitrate.
The associate editor coordinating the review of this manuscript and approving it for publication was Shihong Ding.

I. INTRODUCTION
With the advancement of technology and the availability of mobile devices, the past few years have
seen an increase in video network traffic. CISCO claims video streaming to be the leading consumed
media and the major contributing factor to internet traffic [1]. For the security and privacy of clients,
internet traffic is encrypted, leaving little or no possibility of monitoring stream content. Minors and
adolescents can be exposed to inappropriate content with unmonitored traffic [2], [3]. Most video
streaming platforms, such as YouTube, Facebook, and Twitch, have adopted dynamic adaptive
streaming over HTTP (DASH) technology to enhance the client's quality of experience (QoE). DASH
uses the Variable Bitrate (VBR) encoding technique to stream video content to clients to ensure a
smooth streaming experience [4]. The popularity of DASH resulted in multiple industries starting to
invest in this direction. Furthermore, the Google search engine also ranks video streaming websites
that adopt DASH streaming technology on the first page of search results [5].

The previous video identification frameworks rely on IP packet headers, ports, and content
information to identify individual videos. However, with the rising popularity of such frameworks and
their threats to user privacy and security, video streaming service providers started encrypting their
streams to mitigate security issues. Most of the traffic flowing between clients and servers is secured
by Secure Socket Layer (SSL) and Transport Layer Security (TLS) encryption technology over the
HTTPS protocol. In conclusion, such encryption approaches restrict techniques

including Deep Packet Inspection (DPI) [6] from identifying individual videos streaming over a
network. Furthermore, with the upsurge in the availability of free Virtual Private Networks (VPNs),
more clients and adversaries are inclined to use them to hide their network activities, which aggravates
streaming video identification.

FIGURE 1. Experimental platform of video identification pipeline.

VPN and SSL protocols prevent a middleman from viewing the content of network traffic between
the two communicating parties. Although an SSL protocol only encrypts the packet, the whole packet
is encapsulated within another packet in a VPN, which allows the client to bypass server blockage at
the gateway by tunneling through a remote machine. A VPN is classified into two types based on its
security protocol: SSL VPN and IPsec VPN. The VPN used in this research is OpenVPN [7], an SSL-
type VPN that uses Hash-based Message Authentication Codes (HMAC) with the SHA1 hashing
algorithm to ensure that the contents of the packets are intact. However, plentiful information regarding
streaming video can be obtained via flow-based features, including the number of packets, packet sizes,
burst sizes, and quantity of packet bursts. Machine learning and deep learning models [8], [9] can
leverage these features to identify streaming videos.

Over the years, deep learning-based neural networks have outperformed traditional machine learning
algorithms. A Convolutional Neural Network (CNN) is a particular type of artificial neural network
that applies the convolutional operation in at least one of its layers. Its high accuracy and efficiency
have introduced many real-world applications, including human activity detection [10], natural
language processing [2], bot detection [11], energy consumption prediction [12], smart city
policing [13], risk assessment [14], and text simplification [15].

DASH streaming follows a specific pattern to send the videos to the client. In DASH, each video is
divided into small segments, sometimes called chunks, and delivered to clients. This technique is used
to increase the Quality of Experience (QoE) of the client. However, these segments are delivered to
the client's device in a specific method, leaving a delivery pattern in the network traffic. This pattern
can be used to identify the video in the network traffic [16], [17]. Many studies have exploited the
DASH streaming pattern to identify the videos in the network traffic [18]–[21]. However, VBR
encoding produces inconsistency in the streaming pattern, as shown in Figure 2. These abnormalities
sometimes create difficulties for the researchers to identify the video. Moreover, the client's variable
network conditions also complicate the video identification process.

FIGURE 2. Inconsistent bytes arrival time in three separate runs.

To address the aforementioned challenges, i.e., (a) video identification in the variable environment
and (b) handling the abnormalities and inconsistencies cropped up by the DASH streaming technology,
a fingerprinting method, Simple Difference Fingerprint (SDF), is proposed. This method is used to
generate a stable fingerprint of a video. For this purpose, the Bytes per second (BPS) of the video
stream are extracted, aggregated into periods, and used for video identification, as shown in Figure 1.
The details of the creation of periods and fingerprints are provided in Section III.

The proposed SDFs are used to train a convolutional neural network (CNN) to classify the VBR video.
Initially, the CNN is fine-tuned through rigorous experiments to deduce the best hyperparameters for
different layers, including convolutional, pooling, dropout, and dense layers, alongside other model
hyperparameters such as batch size, optimizer, and the number of epochs. After tuning, the optimized
model is used for video classification with different

III. METHODOLOGY
This section illustrates the methodology used for data
collection, fingerprinting methodologies, and producing a
list of predictions through various classifiers as shown in
Figure 3. The methodology is defined in steps as (a) data
collection, (b) preprocessing of data, (c) bytes per period, (d)
fingerprinting, and (e) summary of neural network.

A. DATA COLLECTION
We use Wireshark to capture the network traffic and generate
packet capture (PCAP) files against each video to generate
the dataset of video streams. We utilize the Chrome browser
to play the YouTube videos and Selenium for automation. We
selected 43 random videos from YouTube and each video is
downloaded 55 times. The SurfShark desktop client is used for capturing the VPN streams. In
conclusion, the resultant dataset consists of 86 total labels (43 non-VPN titles and 43 VPN titles),
and each video stream is captured for 120 seconds.
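To make the resulting label space concrete, here is a toy Python sketch of how 86 labels can be laid out from 43 videos in two traffic modes; the video names are placeholders, not the titles used in the paper.

```python
# 43 videos, each captured in both traffic modes; names are hypothetical.
video_ids = [f"video_{i:02d}" for i in range(43)]

# One label per (traffic mode, video) pair: 43 Non-VPN + 43 VPN = 86.
labels = [f"{vid}_{mode}" for mode in ("nonvpn", "vpn") for vid in video_ids]
print(len(labels))   # 86
print(labels[0])     # video_00_nonvpn
```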
FIGURE 3. Architectural design of video identification pipeline.

TABLE 1. List of acronyms used in the paper.

B. PRE-PROCESSING
Wireshark exports the captured data in pcap file format. Each
generated pcap file contains 120 seconds of the streaming
video. As this file contains both uplink and downlink traffic,
this dataset is cleaned by applying a filter through the IP
address of the server, which in this case is YouTube, and
selecting only the downlinks. This process is done for VPN
and Non-VPN data. Thus, it mitigates the streaming noise of
unwanted applications. This is achieved by the integrated
Wireshark filter in the conversation section.
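To make the filtering step concrete, the following is a small, self-contained Python sketch (not the authors' code) of the downlink filter and per-second byte aggregation that the pre-processing stage performs. The IP addresses are made up, and a real pipeline would read these tuples from the Wireshark/PCAP export rather than a hard-coded list.

```python
def bytes_per_second(packets, server_ip, duration=120):
    """Aggregate downlink packet sizes into a per-second BPS series.

    `packets` is an iterable of (timestamp_s, src_ip, n_bytes) tuples,
    e.g. exported from a PCAP file. Only packets coming *from* the
    streaming server (the downlink) are counted, mirroring the
    server-IP filter applied in Wireshark.
    """
    bps = [0] * duration
    for ts, src, size in packets:
        second = int(ts)
        if src == server_ip and 0 <= second < duration:
            bps[second] += size
    return bps

# Toy capture (addresses are hypothetical): three downlink packets and
# one uplink packet, which the filter drops.
capture = [
    (0.10, "203.0.113.7", 1400),   # downlink, second 0
    (0.80, "203.0.113.7", 600),    # downlink, second 0
    (0.90, "198.51.100.2", 80),    # uplink (client -> server), dropped
    (1.25, "203.0.113.7", 1200),   # downlink, second 1
]
print(bytes_per_second(capture, server_ip="203.0.113.7", duration=3))
# [2000, 1200, 0]
```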

C. BYTES PER PERIOD (BPP)


As mentioned above, DASH uses the VBR encoding, which can cause irregularities in the sequence
of BPS, effectively reducing the accuracy of machine learning models. For this reason, the BPS are
aggregated into L segments, where L is any constant number. In this paper, the value of L is set to 6.

A. FINGERPRINTING CNN model designed in this paper comprises four 1D


Fingerprinting is a process of representing a large data by a convolutional layers, each having ReLU as its activation
small bit of string, that uniquely identifies the data in a function. Each convolutional layer employs distinct kernels
process. Particularly, fingerprints are the small labels for (also called filters) that independently convolve the input
large data [47]. Due to the effectiveness of fingerprinting, data and produce a feature map as the output. The kernel size
many researchers have effectively utilized the fingerprinting is assigned a small number relative to the input size. The
technique in different scenarios. For instance, fingerprinting smaller kernel size helps the model learn more feature maps
is actively used in application discrimination [48], video and improve the overall prediction accuracy. The generated
identification [20], Web page recognition [49], user activity feature map is passed through an activation function (ReLU
monitoring [25], and mobile application identification [50]. in our case) and passes to the pooling layers.
In our paper, we utilize the fingerprinting technique for The max-pooling layers separate the four convolutional
video identification in the encrypted network traffic. For this layers. A pooling layer summarizes the result of the previous
purpose, we created fingerprints of the BPPs of a video layer to its neighboring outputs, ultimately reducing the size
stream. The created fingerprints help to differentiate of the data without losing the key features. This reduction of
individual video streams. After generating a BPP sequence data generalizes the repeating patterns while simultaneously
of a stream, all the 0 in the sequence are replaced with 1 to reducing memory requirements.
resolve the zero division problem encountered during After convolutional and max-pooling layers, a dropout
fingerprints creation. For a video consisting of n seconds, we layer is added. The dropout layer randomly disables some of
get a sequence denoted as a = (a1,a2,...,ai,...). For two the inputs of the previous layer to prevent the model from
adjacent data amounts ai−1 and ai, fingerprints can be learning only a few input values and restrain overfitting. The
generated as r = (r1,r2,...,ri,...) by applying the one of the relative amount of input features are disabled by defining the
following equations described below: probability value p in the dropout layer.
The dropout layer’s output is passed to the flatten layer,
1) SIMPLE DIFFERENCE FINGERPRINT (SDF) which performs conversion of the multidimensional pooled
feature map into a one-dimensional vector to make it
Video fingerprint ri can be calculated by subtracting the ith
compatible with forwarding into the dense layer. The dense
term of sequence a with the previous term (i−1) as shown in
layer is a fully connected layer having all of its neurons
Equation (1):
connected with the neurons of the previous layer. The output
of the dense layer is passed to a final dense layer, also called
ri = ai − ai−1 (1)
Output layer [51].
The fingerprint of BPP is a one-dimensional array;
2) ABSOLUTE DIFFERENCE FINGERPRINT (ADF)
therefore, the input of the proposed CNN model is a one-
This is a modified form of Equation (1) proposed in [20]. To dimension series of BPP with the size of 20. The first
eliminate the negative values generated by subtracting ai−1 convolutional layer has a kernel size equal to 5 with 300
from ai, we take the absolute of the difference as shown in Equation 2:

ri = |ai − ai−1| (2)

3) DIFFERENTIAL FINGERPRINT (DF)
Equation (3) is proposed by [20]. In this equation, the differential of two consecutive periods is calculated as shown below:

ri = (ai − ai−1) / ai−1 (3)

E. CONVOLUTIONAL NEURAL NETWORK (CNN) MODEL
A convolutional neural network (CNN) is a variant of the traditional neural network in that it can learn directly from data without manual feature extraction. A CNN generally consists of convolution layers, pooling layers, and a fully connected layer. CNNs are mainly used in pattern recognition, and their architecture makes them a preferred model for object detection in images, voice detection in audio, natural language processing (hate speech detection [2]), activity recognition (bot detection [11], human activity recognition [10], malware detection [3]), and the classification of digital signals.

The first convolutional layer has 300 filters of size 5 and a stride of 1, so a single neuron is connected to a cluster of five features of the input data. The output of the layer is 300 feature maps of size 19. Subsequently, the first convolutional layer contains 1800 trainable parameters (1500 weights and 300 bias parameters). This layer is followed by a max-pooling layer that consists of 300 feature maps of size 50. Each feature map of this layer is connected to two feature maps of the previous convolutional layer. The max-pooling layer has no trainable parameters. The second convolutional layer has 512 kernels, each of size 3, forming a total of 461,312 parameters that yield 512 feature maps of size 3. The trailing max-pooling layer after the second convolutional layer generates 512 feature maps of size 11. The third convolutional layer, containing 524,800 trainable parameters, has 512 kernels of size 1, which output 512 feature maps of size 1. Its successive max-pooling layer generates 512 feature maps of size 1. The last convolutional layer has 300 kernels of size 1, having 307,500 trainable parameters. Consequently, the last max-pooling layer produces 300 feature maps of size 1. After the last pooling layer, a dropout layer is added to disable arbitrary neurons from the previous layer with a probability of 0.8.
16
DEPT OF ISE, SIR MVIT
DIGITAL VIDEO FINGERPRINTING
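As a sanity check, the parameter counts quoted for the first two convolutional layers follow from standard 1-D convolution accounting (weights = filters × kernel size × input channels, plus one bias per filter), assuming a single input channel for the first layer and the 300 preceding feature maps as input channels for the second; both assumptions are inferences, not stated in the text:

```python
def conv1d_params(filters, kernel_size, in_channels):
    """Trainable parameters of a 1-D convolutional layer:
    one weight per (filter, kernel position, input channel) plus one bias per filter."""
    return filters * kernel_size * in_channels + filters

# First conv layer: 300 filters of size 5 over 1 input channel
print(conv1d_params(300, 5, 1))    # 1800 (1500 weights + 300 biases)
# Second conv layer: 512 kernels of size 3 over 300 input feature maps
print(conv1d_params(512, 3, 300))  # 461312
```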
V. COMPARISON OF DIFFERENT FINGERPRINTING TECHNIQUES

After fine-tuning the model, the same model is applied to different datasets to check the accuracy. This section compares the accuracy of the fingerprint techniques, i.e., SDF, ADF, and DF. For this purpose, four types of datasets are generated by following the method mentioned in Section III. The first dataset is prepared by capturing the 43 video streams in normal traffic mode (Non-VPN mode). Similarly, the same videos are captured using an encryption technique (VPN) for the second dataset. We combine the first two datasets for the third dataset and obtain a dataset containing 86 videos, 43 in normal traffic mode and 43 in encrypted mode. For the fourth dataset, we label all videos captured in normal traffic mode as Non-VPN, and videos captured using a VPN are labeled as VPN. In this case, we get a dataset containing two labels, VPN and Non-VPN, and call it a traffic dataset. The improved model is trained on the aforementioned datasets. Our proposed method, SDF, outperforms the other two techniques on all datasets, as shown in Figure 12. The results are summarized in Table 4.

FIGURE 12. Accuracy comparison with different fingerprinting techniques on different datasets.

TABLE 4. Accuracy comparison of datasets on different fingerprinting methods.

VI. DISCUSSION

From the results, the proposed framework for video and traffic identification is quite consistent. The performance of the baseline model is increased by up to 12.21% with hyperparameter tuning. The SDF technique outperforms the other techniques on all datasets, as demonstrated in Section V. However, video streams over VPN are relatively more challenging to detect compared to Non-VPN streams. Moreover, the accuracy of distinguishing between VPN and Non-VPN traffic is 99%.

The proposed framework is quite feasible for detecting known videos in the network, as only 55 streams of a single video are required for training. However, in a real-world scenario, there are more unknown videos than known videos. Moreover, our proposed technique requires a huge storage capacity and computational budget for video detection. As the proposed framework works on known videos, the model must be trained on a large dataset, limiting the scope of implementation.

Moreover, there are some shortcomings of this technique. The framework is set back by the phenomenon of 'concept drift'. Hence, the proposed model requires a substantial amount of computational and space resources at the observer's end, thus creating a challenge for large-scale deployment. Moreover, detection is only possible if the video is streamed exactly from the start through the first 120 seconds. Changing the video runtime within the first 120 seconds can lead to abnormal predictions.

VII. CONCLUSION AND FUTURE WORK

Irregularities and inconsistencies due to the VBR encoding of the video make it challenging to identify the videos in the network traffic. To address the aforementioned problem, this paper converts BPSs into BPPs and presents a stable fingerprinting method, SDF. The SDF works on the difference between the BPPs to identify the VBR video streamed in encrypted network traffic. The created SDFs are used to train the CNN model. After tuning the model's hyperparameters, the model achieves an accuracy of 90% and 99% in predicting videos and classifying traffic, respectively. Additionally, the effects of variable period length on the model's prediction accuracy are yet to be analyzed. We aim to modify the technique to cope with the concept drift problem in the future. Observing the effect of variable period length and finding the optimal value will make this technique more robust and increase its practical deployment applications.

REFERENCES

[1] U. Cisco. (2020). Cisco Annual Internet Report (2018–2023) White Paper. Accessed: Dec. 15, 2021. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html
[2] M. U. S. Khan, A. Abbas, A. Rehman, and R. Nawaz, "HateClassify: A service framework for hate speech identification on social media," IEEE Internet Comput., vol. 25, no. 1, pp. 40–49, Jan. 2020.
[3] M. Khan, D. Baig, U. S. Khan, and A. Karim, "Malware classification framework using convolutional neural network," in Proc. Int. Conf. Cyber Warfare Secur. (ICCWS), Oct. 2020, pp. 1–7.
[4] A. Reed and B. Klimkowski, "Leaky streams: Identifying variable bitrate DASH videos streamed over encrypted 802.11n connections," in Proc. 13th IEEE Annu. Consum. Commun. Netw. Conf. (CCNC), Jan. 2016, pp. 1107–1112.
[5] R. Dubin, O. Hadar, I. Richman, O. Trabelsi, A. Dvir, and O. Pele, "Video quality representation classification of Safari encrypted DASH streams," in Proc. Digit. Media Ind. Acad. Forum (DMIAF), Jul. 2016, pp. 213–216.
Dvir, A., Marnerides, A. K., Dubin, R., Golan, N. and Hajaj, C. (2020) Encrypted video traffic clustering demystified. Computers and Security, 96, 101917. (doi: 10.1016/j.cose.2020.101917)

There may be differences between this version and the published version. You are advised to consult the publisher's version if you wish to cite from it.

http://eprints.gla.ac.uk/221829/

Deposited on: 25 August 2020

Enlighten – Research publications by members of the University of Glasgow
http://eprints.gla.ac.uk
Encrypted Video Traffic Clustering Demystified

Amit Dvir¹, Angelos K. Marnerides², Ran Dubin¹, Nehor Golan¹, Chen Hajaj³,⁴
¹ Cyber Innovation Center, Department of Computer Science, Ariel University, Israel
² InfoLab21, School of Computing and Communications, Lancaster University, UK
³ Data Science and Artificial Intelligence Research Center, Ariel University, Israel
⁴ Department of Industrial Engineering and Management, Ariel University, Israel
Corresponding Author: Dr. Amit Dvir, Golan Heights 65, Ariel, amitdv@g.ariel.ac.il

Abstract—Cyber threat intelligence officers and forensics investigators often require the behavioural profiling of groups based on their online video viewing activity. It has been demonstrated that encrypted video traffic can be classified under the assumption of using a known subset of video titles based on temporal video viewing trends of particular groups. Nonetheless, composing such a subset is extremely challenging in real situations. Therefore, this work exhibits a novel profiling scheme for encrypted video traffic with no a priori assumption of a known subset of titles. It introduces a seminal synergy of Natural Language Processing (NLP) and Deep Encoder-based feature embedding algorithms with refined clustering schemes from off-the-shelf solutions, in order to group viewing profiles with unknown video streams. This study is the first to highlight the most computationally effective, accurate combinations of feature embedding and clustering using real datasets, thereby paving the way to future forensics tools for automated behavioral profiling of malicious actors.

Index Terms—Encrypted Traffic, Video Title, Clustering, YouTube, NLP

1. Introduction

Governmental law enforcement bodies in developed countries, the UN Counter Terrorism Committee, and the Security Council have stressed the importance of tracking terrorists through profiling the Internet-wide behaviour of extremist propaganda groups [1]. Unfortunately, recent events such as the March 2019 mosque shootings in Christchurch, New Zealand, have prompted militant groups to broadcast their hate speech through online videos in various social networking sites and YouTube [2]. In parallel, the various shootings in the US since 2013, up to the latest one in Virginia Beach in May 2019, highlight the correlation of the shooters' behaviour with viewing content through popular video streaming sites such as YouTube and Netflix [3].

With the introduction of the General Data Protection Regulation (GDPR), there is a justified act towards the support of human rights. Nonetheless, we are also now witnessing the weakness of law enforcement agencies in preemptively tracking extremists and hate groups through the Internet. Thus, profiling the online video viewing behavior of such groups has become quite a challenging task due to the encrypted nature of Internet video streams through protocols such as HTTPS¹. Therefore, intelligence agencies, as well as forensics investigators, are no longer capable of utilizing conventional traffic analysis techniques such as Deep Packet Inspection (DPI).

Given the various unavoidable limitations on traffic classification tasks, a number of studies have looked into the analysis of encrypted Internet traffic. Nevertheless, little has been done in the context of explicitly dealing with encrypted video traffic. Most importantly, the majority of encrypted video classification and clustering studies rely on assumptions that are not necessarily true in some scenarios. For instance, the work by Dubin et al. in [4] considers a pool of video titles for conducting supervised classification of encrypted transport layer flows (e.g., TCP, UDP) obtained from YouTube video streams. The applicability of the proposed scheme was demonstrated in the scenario where an external attacker could identify a video title from the video streams as long as the title of the video stream was from a given video title set. Similarly, the studies in [5] and [6] also utilised a number of supervised Machine Learning (ML) algorithms by considering already known video titles associated with statistical properties of encrypted network flows. Hence, none of these studies consider the pragmatic scenario in which encrypted video streams may be clustered or classified with no knowledge of any video titles. In fact, the majority of encrypted video streams from most content providers include the video titles in the encrypted payload information of the network flow, hence the aforementioned techniques would be greatly affected in terms of classification accuracy.

In order to confront the challenging task of clustering encrypted video with no a priori knowledge of video titles, the preliminary work in [7] exploited the usefulness of Natural Language Processing (NLP). The proposed solution was composed of a single clustering method relying on a language derived from meta-statistics of network traffic features per encrypted video stream. However, the statistical features within the clustering process were manually selected by empirically observing their distributional behaviour. Thus, there was minimal automation, with reasonably high computational costs in some instances.

Therefore, this work goes beyond previously introduced pieces of work and extends the preliminary findings
4. Methodology

Fig. 1 exhibits a high-level view of the methodology as conducted in this study. As depicted, firstly a sanitisation module is triggered, where re-transmissions and audio packets are eliminated from the video stream in a similar fashion as in the previous work in [7]. Moreover, Bits per Peak (BPPs) are computed for each encrypted video stream via re-assembling network flows by using the 5-tuple packet-level information⁴ between two Off periods within the observed peaks [7].

The following step focuses on feature engineering, such as firstly identifying the most statistically significant features related to the raw measurements in the dataset. Therefore, data mining and embedding techniques are utilised initially in order to filter out insignificant raw measurements and reduce data dimensionality. Subsequently, an NLP-based approach is adopted in order to compose meaningful meta-features envisaged for use within the clustering components, as discussed in Section 4.1.

The fourth step, as evidenced in Fig. 1, is concerned with triggering the selected unsupervised clustering algorithms discussed in Section 4.2. Essentially, the purpose of this module is to group the video viewing behaviour of the encrypted video streams and indicate the corresponding clusters of common viewing patterns.

In order to assess the significance of the clusters and validate the accuracy of viewing groups, each clustering procedure is evaluated by calculating well-known measures such as the number of pure clusters (NPC), cluster purity (CP), average silhouette value (ASV), number of streams in clean bins (NSCB), and cleanly clustered streams (CCS). Note that the methods described are different alternatives, i.e., both an embedding and a clustering algorithm must be chosen. Combining different embedding techniques, or different clustering techniques, is not considered.

TABLE 2: List of abbreviations
DPI: Deep Packet Inspection
ML: Machine Learning
NLP: Natural Language Processing
BPP: Bit Per Peak
NPC: Number of Pure Clusters
NSCB: Number of Streams in Clean Bins
CP: Clustering Purity
ASV: Average Silhouette Value
CCS: Cleanly Clustered Streams
AHC: Agglomerative Hierarchical Clustering
CNN: Convolutional Neural Network
GMM: Gaussian Mixture Models
AFF: Affinity
w_i: i'th word
win: Word2vec window size
C_i(win): Words in the i'th window
K: Number of Clusters
X': Matrix of predicted samples by the Auto Encoder
X_i: i'th sample
S, S': Streams as sets of BPPs
||.||_2: L2 Norm
m_k: Centroid of the k'th cluster
C_k: The set of samples in the k'th cluster
mu_i: The mean of the i'th component
sigma_i: The variance of the i'th component

4.1. Embedding and Feature Engineering

The direct use of raw data in a machine learning method is problematic, since raw data tend to be unstructured and normally contain redundant information. A natural, clear solution is to initially perform feature extraction and then feature selection, in order to create a structured representation that, in parallel, is tailored to the specific problem domain.

There follows a novel synergy of embedding techniques for feature extraction. The use of the base NLP and embedding algorithms is then justified, further indicating the contributions towards refining them so as to address the explicit problem of encrypted video traffic clustering. Table 2 summarizes the abbreviations and notations used in this paper.

4.1.1. Word2Vec. The core principle of NLP is to understand the meaning of a given word. Although human-like ability to understand languages remains elusive, certain methods have been successful in capturing similarities between words. Recently, approaches based on neural networks, in which words are embedded into a low-dimensional space, have been proposed by various studies [18]. The proposed models represent each word as a d-dimensional vector of real numbers, where vectors that are close to each other are shown to be semantically similar.

Mikolov et al. [19], [20] enabled an efficient embedding of words that achieves striking results on various linguistic tasks by a skip-gram with a negative-sampling training method. Unlike most previously used neural network architectures for learning word

⁴ The 5-tuple packet-level information is defined by the source/destination port, source/destination IP address and the protocol (i.e., TCP/UDP).
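To make the Bits-per-Peak computation concrete, here is a hypothetical sketch (not the authors' implementation) that groups a time-sorted packet trace into peaks separated by idle "Off" periods and sums the bits carried in each peak; the `off_gap` threshold and the example trace are assumptions for illustration:

```python
def bits_per_peak(packets, off_gap=1.0):
    """Group packets into peaks separated by idle ('Off') periods of at least
    `off_gap` seconds, and sum the bits carried in each peak.
    `packets` is a list of (timestamp_sec, size_bytes) tuples, sorted by time."""
    peaks, current, last_ts = [], 0, None
    for ts, size in packets:
        if last_ts is not None and ts - last_ts >= off_gap:
            peaks.append(current)   # an idle gap closes the current peak
            current = 0
        current += size * 8         # bytes -> bits
        last_ts = ts
    if current:
        peaks.append(current)       # flush the final peak
    return peaks

# Hypothetical trace: two bursts separated by an idle gap of 1.9 s
trace = [(0.0, 100), (0.1, 200), (2.0, 50)]
print(bits_per_peak(trace))  # [2400, 400]
```

The resulting list of per-peak bit counts is the BPP sequence on which the embedding steps of Section 4.1 operate.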
Algorithm 2 Bag of Words
Input: Streams, a list of all the video streams; each stream is a list of BPPs
Output: The embedded streams
1: Max <- MaximalPeak(Streams)
2: BagOfWords <- {}
3: for Stream in Streams do
4:   Curr <- {0, 0, ..., 0}   (an array of length Max)
5:   for BPP in Stream do
6:     Curr[BPP] <- Curr[BPP] + 1
7:   end for
8:   BagOfWords.append(Curr)
9: end for
10: return BagOfWords

4.1.3. Deep Auto Encoder embedding. As revealed through the feature engineering process, the number of possible BPP values is much higher in comparison to the number of BPPs in each encrypted stream. Consequently, every produced "Bag-of-Words" vector had high sparsity properties. In order to address matrix sparsity and further cope with the high dimensionality and lack of relevant information, Deep Auto Encoder outputs are used [24]. The goal of Deep Auto Encoders is to assemble a neural network with the aim of producing a compact data embedding output. This aim is achieved by setting each layer to be of a smaller size until reaching a bottleneck. The bottleneck is then connected to layers of increasing size, and the network output is set to be equal to the size of the original input.

Let X be the sample matrix and X' the matrix of the re-encoded samples. The loss of the neural network is obtained by:

L = ||X − X'||_2 (1)

Hence, the network will try to embed the data into a lower yet information-preserving dimension and then reassemble it. After training the network, all the layers going out of the bottleneck are removed, and each sample is run through the modified network in order to obtain a compact data embedding output.

In contrast to other pieces of work (e.g., [24]), where clustering and embedding are done separately, the Bag of Words is used in synergy with a deep auto encoder to enable a uniform embedding and clustering technique, as well as automated optimization for the clustering process. Moreover, a direct assessment of the clustering accuracy performance is achieved under the aforementioned synergy.

4.1.4. BPPs set Intersection. The earlier work in [4] suggested supervised learning techniques for classifying encrypted streams with a priori knowledge of known video titles. In parallel, an unsupervised clustering scheme derived from the k-nearest neighbors (KNN) algorithm was also utilised. In this work, this KNN formulation is exploited with a new affinity function. In practice, an affinity function is a measure of similarity between two samples. Ultimately, two distinct samples should have lower values while similar samples should have higher values.

The affinity score between two BPP sets, S and S', is defined as the cardinality of their intersection:

AFF(S, S') = |S ∩ S'| (2)

The defined affinity function is employed within the evaluation phase for the Spectral and Hierarchical clustering algorithms, as described in Sections 4.2.3 and 4.2.4.

4.2. Clustering Techniques

Unsupervised clustering methods learn a function from a set of unlabeled examples by using heuristics. Based on the problem domain reported here, unknown encrypted video streams are clustered to determine whether there are streams of the same video title, without a-priori knowledge of the video title. The following is a description of the algorithmic properties of the unsupervised learning methods employed in this work.

4.2.1. K-means. The k-means algorithm approximates a division of the dataset into K distinct clusters of equal variance, where each cluster is described by its mean. Hence, K-means attempts to find the mean values that minimize the intra-cluster sum of squares, i.e., the squared Euclidean distance between all samples within a cluster and its respective mean.

K-means splits the data into K clusters by iteratively optimizing the following problem: given a set of points X = {x_1, ..., x_n}, the algorithm tries to minimize the objective sum over k = 1..K of the terms ||x_i − m_k||^2 for all x_i in C_k, where m_k = (sum of x_i in C_k) / n_k is the centroid of cluster C_k.
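Algorithm 2 and the affinity function of Equation (2) can be transcribed directly into Python, assuming BPP values are quantized to small non-negative integers (the example streams are hypothetical); the count array here is indexed by BPP value, so it has Max + 1 entries rather than the Max of the pseudocode:

```python
def bag_of_words(streams):
    """Algorithm 2: embed each stream (a list of integer BPP 'words')
    as a vector of per-value occurrence counts."""
    max_peak = max(bpp for stream in streams for bpp in stream)
    embedded = []
    for stream in streams:
        curr = [0] * (max_peak + 1)   # one counter per possible BPP value
        for bpp in stream:
            curr[bpp] += 1
        embedded.append(curr)
    return embedded

def aff(s, s2):
    """Affinity of two streams viewed as BPP sets (Eq. 2):
    the cardinality of their intersection."""
    return len(set(s) & set(s2))

streams = [[3, 1, 3], [1, 2]]                 # hypothetical quantized BPPs
print(bag_of_words(streams))                  # [[0, 1, 0, 2], [0, 1, 1, 0]]
print(aff(streams[0], streams[1]))            # 1
```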
method that identifies outliers could be of use prior to the employment of the clustering process.

5.1.4. BPPs Set Intersection. The BPPs set intersection is treated here as an affinity function within formulations that parametrically accept such functions. Hence, for this technique there is only provision of results for the spectral and the AHC-based clustering algorithms, which do consider affinity functions within their corresponding formulations. By relating the results depicted in Table 4 to the desired metrics discussed in Table 3, it is clear that BPPs produce the optimal outputs through all metrics apart from the NPC. It can therefore be seen that BPP-based clustering methods are capable of producing pure clusters to which viewers are correctly allocated based on their respective video category (i.e., Clustering Purity, CP). It is evident that the composed spectral-based clusters have a high precision range due to the high scores obtained in terms of the number of cleanly clustered streams (i.e., CCS).

Furthermore, AHC-based clustering schemes utilising the BPPs set intersection affinity function do not perform equally well, despite the fact that the number of pure clusters through the NPC metric reaches 0.99. Through both schemes it is evident that the average number of streams in each cluster is 0.02. Consequently, the latter result suggests that most of those clean clusters have two single users watching the same video and the last unclean cluster has all the remaining users.

5.2. Computational Performance

In order to assess the efficacy of the various algorithms and whether they can be used in practical, real-time scenarios, the study focuses on the computational time required for each to produce a clustering output. Table 5 depicts the resulting computational performance results, based on experiments conducted on an Intel i7 machine at 3.5 GHz with 11 CPUs and 57 GB of RAM. As illustrated, the majority of embedding and clustering combinations may adequately operate in a close to real-time scenario and provide meaningful profiling of user viewing behaviour. From a pure embedding perspective, there is evidence to show that the least intensive clustering outputs are produced when employing the Word2Vec feature engineering and embedding. By contrast, the most computationally intensive scheme results from utilising features from the Deep Auto Encoder approach.

TABLE 5: Computational Run Time Performance

Embedding               Clustering     Runtime [sec]
Word2Vec                K-Means        25.20
Word2Vec                GMM            25.25
Word2Vec                AHC (cosine)   25.24
Word2Vec                AHC (L2)       25.1
Bag of Words            K-Means        1,461.6
Bag of Words            GMM            3,565.3
Bag of Words            AHC (cosine)   2,864.7
Bag of Words            AHC (L2)       1,913.4
Deep Auto Encoder       K-Means        96,628 (approx. 1.1 days)
BPPs Set Intersection   Spectral       1,451
BPPs Set Intersection   AHC            452.5

As already discussed earlier, the spectral clustering with the BPPs set intersection outperformed all other models in terms of the CCS metric while maintaining high scores on the NPC metric (i.e., both close to 1). Nonetheless, despite the high clustering performance for some of the assessed metrics, there is a significant computational trade-off in terms of runtime. In fact, spectral-based clustering based on BPPs set intersections requires 24.2 minutes in order to generate any clustering outputs. Hence, such an approach would not be ideal in scenarios where real-time clustering is required. In general, the conducted evaluation suggests that the most optimal solution for addressing real-time grouping of encrypted video streams would be any clustering based on Word2Vec embedding, since it takes less than 26 seconds for any algorithm to produce an output. Nonetheless, as discussed in Section 5.1, Word2Vec-based clustering also poses some trade-offs related to the cluster purity (i.e., the CP metric) and also the amount of clean traffic streams that are considered to be pure within a given cluster (i.e., NSCB). Therefore, it is recommended to consider such an approach purely on a real-time basis and further compare its outputs with an offline clustering process. Based on the results discussed in Section 5.1, the evaluation identifies clustering based on BPPs set intersection embedding to have the best clustering performance metrics on average, but with a high computational cost in terms of run time.

6. Conclusion

This work presents an extensive study highlighting a novel, synergistic use of off-the-shelf clustering schemes with NLP and learning-based embedding schemes for profiling encrypted YouTube video traffic streams. In contrast with previous studies, the authors propose clustering with no a-priori assumption on a known subset of titles, and argue in terms of the potential usefulness of such an approach for automated behavioral profiling of malicious actors. The results indicate that it is feasible to obtain extremely high clustering performance under optimal computational costs, which may contribute towards forensic investigations required by law enforcement bodies. It can be seen that the combination of an NLP-based embedding scheme with either K-means or GMM-based clustering is a candidate for real-time and scalable grouping of behavioral viewing for end-users over encrypted video streams, and that a combination of learning-based embedding, as well as spectral or hierarchical clustering under BPPs set intersection, can adequately serve the purposes of highly accurate profiling for cases with small datasets.
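The paper's exact metric definitions sit in Table 3 (not reproduced in this excerpt), but the standard cluster-purity measure that CP is based on can be sketched as follows; the assignments and labels below are hypothetical:

```python
from collections import Counter

def clustering_purity(assignments, labels):
    """Standard cluster purity: the fraction of samples that fall in the
    majority ground-truth class of their assigned cluster."""
    clusters = {}
    for cluster_id, label in zip(assignments, labels):
        clusters.setdefault(cluster_id, []).append(label)
    # Sum the size of the majority class within each cluster
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(labels)

# Two clusters over four streams; cluster 1 mixes two video categories
print(clustering_purity([0, 0, 1, 1], ["a", "a", "a", "b"]))  # 0.75
```

A purity of 1.0 means every cluster contains streams of a single video category, which is the regime the NPC and CP scores above aim to capture.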
Improving Fairness for QoE of Adaptive Video Streaming over ICN

Rei Nakagawa
Tokyo University of Agriculture and Technology, Tokyo, Japan
rnakagawa@go.tuat.ac.jp

Satoshi Ohzahata, Ryo Yamamoto
The University of Electro-Communications, Tokyo, Japan
{ohzahata, ryo-yamamoto}@uec.ac.jp
Abstract—Information centric network (ICN) improves the communication quality by flexible content-based control according to application-level information of video content. Then, adaptive video streaming is applied to ICN to improve the client's quality of experience (QoE). In previous research on adaptive bitrate (ABR) algorithms for ICN, the buffer-based ABR algorithm (BBA), which is designed to change the bitrate according to the video buffer length of the video player in a client, is effective for the QoE of a single client. However, in terms of QoE fairness among BBA clients sharing a bottleneck, since each client adjusts its bitrate independently, the clients communicating with lower bitrate video have to decrease their bitrate even when the bottleneck link is occupied by the other clients communicating with higher bitrate video. This results in QoE unfairness and degradation of overall QoE for all clients. To improve overall QoE, this paper presents an approach to decrease bitrate differences among clients by increasing the bitrate of the lower bitrate clients before the bandwidth of the bottleneck link is excessively occupied by communications for the higher bitrate clients. Then, we propose notification-based bitrate control, in which distributed servers or routers detect congestion early and explicitly notify the lower bitrate clients to increase their bitrate before the bottleneck link is occupied, while encouraging the highest bitrate clients to decrease theirs for congestion avoidance. Through simulation experiments, the proposed system improves both QoE fairness and overall QoE metrics.

Index Terms—ICN, Adaptive Video Streaming, QoE, Fairness

I. Introduction

Information centric network (ICN) has attracted attention to accommodate the large volume of video content traffic on the Internet [1]. In ICN, the name address, which represents hierarchical information of content attributes, is given as the communication ID of a packet, and each ICN node associates the decentralized network of a content request by Interest packets and a content distribution by Data packets [2]. By understanding the content information of the Interest or Data packet with reference to the name address, ICN enables flexible control based on the content information and thus improves communication quality. Therefore, adaptive video streaming over ICN is applied to improve both the communication quality and the client's quality of experience (QoE) [3].

Adaptive video streaming finds wide use in industry because the adaptive bitrate (ABR) algorithm avoids QoE degradation due to congestion by adjusting the bitrate according to the estimated available bandwidth [4]. For adaptive video streaming, the ABR algorithm is classified into two methodologies depending on the reference for bitrate adjustment: the rate-based ABR algorithm (RBA) and the buffer-based ABR algorithm (BBA) [5]. These ABR algorithms need to jointly maximize QoE as well as ensure QoE fairness among clients [6]. In terms of a single client's QoE, BBA is basically effective because conservative buffer-based bitrate adjustment avoids the bitrate overestimation caused by the throughput-based bitrate adjustment of RBA [7], [8]. However, in terms of QoE fairness among multiple clients sharing a bottleneck, BBA clients may suffer from QoE unfairness because each BBA client has to decrease its bitrate for congestion avoidance independently, even if the bitrate of its video communication is low and this is not fair among clients. Furthermore, bitrate unfairness potentially causes overall QoE degradation by decreasing the bitrate of the lower bitrate client, because the decrease of a lower bitrate is worse than the decrease of a higher bitrate in terms of QoE [9].

For QoE fairness in adaptive video streaming, server-assisted schemes have been studied in which a fixed server centrally manages the fair allocation of bandwidth among connected clients for TCP/IP [10], [11]. However, in ICN, since the content location is not identified for communications, these fixed-server approaches are not acceptable. In addition, although application-level information (e.g., an encoded bitrate) explicitly affects QoE fairness, those approaches are implicitly adjusted for QoE fairness by depending on short-term reference to communication control information (e.g., a bandwidth). Then, ICN makes it possible to apply flexible network control according to application-level information for QoE fairness of adaptive video streaming.

To improve overall QoE among clients sharing a bottleneck, this paper presents a network-assisted scheme to encourage QoE fairness among the clients by increasing the bitrate of the lower bitrate clients before the bottleneck link is occupied, while decreasing the bitrate of the highest bitrate clients for congestion avoidance with relatively low QoE losses. Then, we propose notification-based bitrate control, in which the server or router detects congestion in the early stage and explicitly notifies the lower bitrate clients to increase their bitrate by additionally using the bandwidth given by mitigating excessive
Fig. 1. An overview of adaptive video streaming over ICN.

Algorithm 1 Bitrate adjustment for Bitrate_next in BBAzero.
1: {R1, ..., RM} ⇐ the given discrete bitrate levels from the MPD;
2: Bitrate_prev ⇐ the previously used bitrate;
3: B_cur ⇐ the current video buffer length (sec);
4: r ⇐ the size of the reservoir;
5: cu ⇐ the size of the cushion;
6: f(B_cur) ⇐ the linear mapping function from B_cur to R_m;
7: if Bitrate_prev = max(R_m) then
8:   Bitrate+ = max(R_m);
9: else
10:  Bitrate+ = min(R_m | R_m > Bitrate_prev);
11: end if
12: if Bitrate_prev = min(R_m) then
13:  Bitrate− = min(R_m);
14: else
15:  Bitrate− = max(R_m | R_m < Bitrate_prev);
16: end if
17: if B_cur ≤ r then
18:  Bitrate_next = min(R_m);
19: else if B_cur ≥ (r + cu) then
20:  Bitrate_next = max(R_m);
21: else if f(B_cur) ≥ Bitrate+ then
22:  Bitrate_next = max(R_m | R_m < f(B_cur));
23: else if f(B_cur) ≤ Bitrate− then
24:  Bitrate_next = min(R_m | R_m > f(B_cur));
25: else
26:  Bitrate_next = Bitrate_prev;
27: end if
28: return Bitrate_next;

traffic for the highest bitrate clients before the bottleneck
link is occupied. In simulation experiments, we implemented
the proposed method with a representative BBA, BBAzero,
and evaluated its efficacy in improving QoE metrics in terms
of fairness.

II. Related works

Adaptive video streaming is the most popular video
streaming system because the adaptively decreased bitrate
avoids QoE degradation due to congestion [4]. Adaptive
video streaming has also been applied to ICN for further
improvement in QoE [3]. Figure 1 shows an overview of
adaptive video streaming over ICN. In Fig. 1, the server stores
the video content composed of the media presentation
description (MPD), which includes the information for initializing
video streaming (e.g., the number of video segments, the discrete
bitrate levels, and audio information), and N video segments of
short play time, each encoded in M bitrate levels¹. The server and
the router with cache forward the video content to the client by
store-and-forward using the transmission (tx) queue of the
NIC. In addition, the router caches the Data packets to its
cache storage before forwarding. In the application layer of
the client, the video player engine is equipped with a fixed-length
video buffer which stores downloaded video segments
sequentially, and playback of a stored video segment is
performed only when the video buffer length becomes larger
than a predefined threshold. Additionally, the ABR algorithm
is installed with the video player engine for adaptive video
streaming. The ABR algorithm adjusts the bitrate for each
video segment before downloading it, according to the
following procedures, for adaptive video streaming with
transport control in the transport layer.
To start adaptive video streaming, the video player engine
first requests MPD for initialization. After receiving MPD, the
video player engine starts adaptive video streaming from the
first video segment to the last video segment in a sequence
using the ABR algorithm as follows.
1) The ABR algorithm selects the bitrate of the video
segment at the given discrete bitrate levels in MPD.
2) The video player engine downloads the video segment
of the selected bitrate.
¹The encoded bitrate represents the average throughput required to
download a video segment encoded in that bitrate.
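The per-segment procedure above can be sketched as a minimal download loop. This is an illustration, not the paper's implementation; the function and parameter names (`stream`, `abr_select`, `download`) are our assumptions.

```python
def stream(mpd, abr_select, download):
    """Minimal sketch of the adaptive streaming loop described above.

    mpd        -- dict with 'bitrates' (discrete levels, Mbps) and 'num_segments'
    abr_select -- ABR algorithm: picks a bitrate from mpd['bitrates']
    download   -- fetches one segment; returns its play time (sec)
    """
    buffer_sec = 0.0
    prev_bitrate = min(mpd["bitrates"])
    for seg in range(mpd["num_segments"]):
        # 1) the ABR algorithm selects the bitrate at the given discrete levels
        bitrate = abr_select(mpd["bitrates"], prev_bitrate, buffer_sec)
        # 2) the video player engine downloads the segment of the selected bitrate
        buffer_sec += download(seg, bitrate)
        prev_bitrate = bitrate
    return prev_bitrate
```

A real player also drains the buffer during playback and stalls when it empties; this sketch only shows the select-then-download sequence.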

A. The buffer-based ABR algorithm

To estimate the available bandwidth, BBA refers to the
video buffer length in the video player engine. Huang et al.
first introduced the pure BBA, BBAzero [12].
Algorithm 1 shows the procedure of BBAzero for outputting
the bitrate of the next video segment, Bitrate_next. For Bitrate_next,
BBAzero refers to the remaining video buffer length (B_cur),
the linear mapping function from the video buffer length to the
available bitrate levels (f(B_cur)), and two buffer thresholds, the
reservoir (r) and the cushion (cu) (lines 17-27 in Algorithm 1).
First, to avoid stall time, the reservoir is set to
fill at least one chunk of the buffer. Therefore, while the
buffer is filling up the reservoir (i.e., 0 ≤ B_cur ≤ r), BBAzero
selects the minimum bitrate (lines 17-18 in Algorithm 1).
Once the reservoir is reached, BBAzero then increases the
bitrate according to f(B_cur), which increases the bitrate
conservatively as the buffer grows (lines 21-26 in
Algorithm 1). Then, when the buffer is filled up enough (i.e.,
r + cu ≤ B_cur ≤ B_max), BBAzero selects the maximum bitrate
(lines 19-20 in Algorithm 1). In this way, BBAzero is a pure
buffer-based ABR algorithm that prevents QoE degradation due
to bitrate overestimation [8]. However, since a client of
BBAzero has to adjust the bitrate independently, bitrate
differences among the clients do not decrease when the
bottleneck link is occupied by the higher bitrate clients, which
potentially causes an excessive decrease in the QoE of the lower
bitrate clients.

Fig. 2. An overview of fast-retransmission notification control in adaptive
video streaming over ICN.

B. The congestion and the implicit or explicit congestion
detection for the management of Interest packet
transmission

In adaptive video streaming, since communications have a
burst property, there are on-off periods, composed of an
on-period with burst traffic and an off-period with no traffic [13].
Therefore, the higher bitrate client requires larger bandwidth for
longer periods of time and thus frequently occupies the
bandwidth of the bottleneck link during downloads. As a
result, congestion is caused when the tx queue of the
bottleneck link is rapidly occupied by excessive network
traffic in a short period of time. To avoid congestion in ICN,
transport control manages the transmission of Interest
packets according to congestion detection. Congestion
detection methods are classified into two types, an
implicit method and an explicit one. The implicit method of
congestion detection, in which the client estimates the cause
of congestion in the network, as in TCP/IP, has been applied to ICN
[14], [15]. Furthermore, the explicit method, in which the
client explicitly learns of congestion through an explicit
congestion notification (ECN) based on the tx queue length
of a network interface, has been proposed [16]. Since the explicit
method can explicitly notify of network congestion with a
short communication delay between a client and a
router/server, the communication performance of the client
is better than that of the implicit method. However, for
clients sharing a bottleneck under control of the explicit
method, since all the clients are notified to decrease the
transmission amount of Interest packets for congestion
avoidance, the overall communication performance of all the
clients decreases as a result.
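The reservoir/cushion selection of BBAzero (Algorithm 1) can be made concrete with a short sketch. The names and the linear form of f are our reading of the algorithm, not the paper's code.

```python
def bba_zero_next_bitrate(levels, prev, b_cur, r, cu):
    """Sketch of Algorithm 1 (BBAzero): pick the next bitrate from the
    discrete levels, given the current buffer length b_cur (sec),
    the reservoir r and the cushion cu (both in sec)."""
    levels = sorted(levels)
    lo, hi = levels[0], levels[-1]

    # f(B): linear map from buffer occupancy inside the cushion to a bitrate
    def f(b):
        return lo + (hi - lo) * (b - r) / cu

    # Bitrate+ / Bitrate-: one level above / below the previous bitrate (lines 7-16)
    up = min((x for x in levels if x > prev), default=hi)
    down = max((x for x in levels if x < prev), default=lo)

    if b_cur <= r:                  # reservoir not yet filled: minimum bitrate
        return lo
    if b_cur >= r + cu:             # buffer full enough: maximum bitrate
        return hi
    if f(b_cur) >= up:              # buffer grew: step up conservatively
        return max(x for x in levels if x < f(b_cur))
    if f(b_cur) <= down:            # buffer shrank: step down
        return min(x for x in levels if x > f(b_cur))
    return prev                     # otherwise keep the previous bitrate
```

Because the decision depends only on the buffer, not on a throughput estimate, each client reacts independently, which is exactly the unfairness FNC targets.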


III. Increasing a lower bitrate with fast-retransmission to
encourage fairness for QoE improvement

A. An overview

In terms of BBA clients sharing a bottleneck, all the clients
have to decrease the bitrate even if the bitrate is not fair
among the clients. Therefore, the bitrate differences do not
decrease, resulting in degradation of QoE fairness and of
overall QoE due to the excessively lower bitrate. In contrast,
our approach presents network-assisted control to
selectively increase the lower bitrate by explicitly notifying
the lower bitrate clients to additionally use bandwidth early
in the congestion of the bottleneck link. We therefore propose
the following fast-retransmission notification control (FNC)
for adaptive video streaming over ICN.
Figure 2 shows an overview of FNC based on Fig. 1. To
adjust the QoE fairness among the accommodated clients, a
server or router newly adds fast-retransmission notification
according to the bitrate set of the clients, composed of
procedures 1 and 2 in Fig. 2. First, to evaluate bitrate
differences among the clients, the bitrate set, which is the
list of bitrates of the videos currently downloaded by the
clients, is constructed by the router by referring to the name
address of the Interest or Data packets in communications.
Second, to force the lower bitrate clients to use more
bandwidth before the bottleneck link is occupied by the
highest bitrate clients, the fast-retransmission signal is set on
the forwarded Data packets for the clients below the highest
bitrate. Then, to cooperate with the fast-retransmission
notification from the server or router, notification-based
fast-retransmission control is newly added to the client
(procedure 3 in Fig. 2). Fast-retransmission control forces
transport control to transmit all Interest packets immediately
for the not-yet-received Data packets (including retransmission
of the already transmitted Interest packets) according to the
notification. Thus, the lower bitrate client increases its
bitrate by additionally using the bandwidth freed by
mitigating the excessive traffic of the highest bitrate clients
before the bottleneck link is occupied. In the following
subsection, we describe the details of the fast-retransmission
notification according to the bitrate set on the server or
router.

Algorithm 2 The fast-retransmission notification according to
the bitrate set.
1: {R1, ..., RL} ⇐ the bitrate set of the clients;
2: if a Data packet arrives at a NIC then
3:   Cth ⇐ the congestion threshold in [0, 1);
4:   QLen ⇐ the tx queue length of the NIC;
5:   if QLen / QLen_max ≥ Cth then
6:     Bitrate_Data ⇐ the bitrate of the Data packet;
7:     if max(R_l) = min(R_l) then
8:       do nothing;
9:     else if Bitrate_Data < max(R_l) then
10:      set the "fast-retransmission" signal on the Data packet;
11:    end if
12:  end if
13:  forward the Data packet to the tx queue of the NIC;
14: end if

B. The fast-retransmission notification according to the
bitrate set on a server or router

In FNC, to decrease bitrate differences among the clients
sharing a bottleneck, a server or router explicitly identifies
the bitrate set of the clients with reference to the name
address of the Interest or Data packets in communications. The
bitrate set is periodically updated, and entries that have had no
reference since the last update are deleted. Then, according
to the bitrate set, the server and router explicitly notify the
clients below the highest bitrate with the fast-retransmission
signal, based on early congestion detection from the
tx queue occupation. Algorithm 2 shows the fast-retransmission
notification processed every time before
forwarding a Data packet. In Algorithm 2, before forwarding
the Data packet to the NIC, a bottleneck with early
congestion is detected according to the tx queue occupation
of the NIC and the predefined congestion threshold (lines 3-5
in Algorithm 2). If detected, the following procedures are
processed according to the bitrate from the name address of
the Data packet (lines 6-11 in Algorithm 2). First, if the bitrate
is the same across the bitrate set, no control is needed for
bitrate fairness. If not, the fast-retransmission signal is set in the
option fields of the Data packet to force the clients below the
highest bitrate to use more bandwidth. Finally, the Data
packet is forwarded to the tx queue of the NIC, which is
designed to send the Data packet with the fast-retransmission
signal for fast download (line 13 in Algorithm 2).

Fig. 3. An evaluation topology.

IV. Evaluation

We implement FNC on BBAzero of Algorithm 1 and
evaluate the efficacy of FNC with BBAzero on the throughput
of video segment download, the bitrate of the video segments,
and QoE in terms of fairness through amus-ndnSIM, an
event-driven ICN simulator for adaptive video streaming
[17]. Figure 3 shows the evaluation topology, composed of 5
servers, 20 routers, and 25 clients. All link speeds between
ICN nodes are 10 Mbps. Each server stores the MPD and 100
video segments.
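The marking decision of Algorithm 2 can be sketched as a small Python function. Names and the dict-based packet are our assumptions; a real ICN forwarder would set an option field in the Data packet wire format.

```python
def fnc_mark(data_packet, bitrate_set, q_len, q_len_max, c_th):
    """Sketch of Algorithm 2: decide whether a forwarded Data packet
    should carry the fast-retransmission signal.

    data_packet     -- dict with 'bitrate' (taken from the name address)
    bitrate_set     -- bitrates currently downloaded by the clients
    q_len/q_len_max -- current and maximum tx queue length of the NIC
    c_th            -- congestion threshold in [0, 1)
    """
    if q_len / q_len_max >= c_th:                  # early congestion detected (lines 3-5)
        if max(bitrate_set) != min(bitrate_set):   # bitrates differ among clients (line 7)
            if data_packet["bitrate"] < max(bitrate_set):
                # mark packets for clients below the highest bitrate (line 10)
                data_packet["fast_retransmission"] = True
    return data_packet                             # then forward to the tx queue (line 13)
```

Note that packets of the highest-bitrate clients are never marked, so only the lower-bitrate clients are told to retransmit their outstanding Interests immediately.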

A. The metrics for evaluation

To evaluate QoE, we use the QoE-lin scoring defined in
Equation (1) [19].

QoE-lin = \sum_{n=1}^{N} q(R_n) - \lambda \sum_{n=1}^{N-1} |q(R_{n+1}) - q(R_n)| - \mu \sum_{n=1}^{N} b_n - \mu_D D    (1)

In Eq. (1), {QoE-lin | −∞ < QoE-lin ≤ (4.3 × N)} is the
quantitative QoE for playback of a video content of N video
segments. {q(R_n) | 0.045 ≤ q(R_n) ≤ 4.3} is an identity function q
of the bitrate (Mbps) of the nth video segment,
{R_n | 0.045 ≤ R_n ≤ 4.3}, representing "QoE(bitrate)."
{|q(R_{n+1}) − q(R_n)| | 0 ≤ |q(R_{n+1}) − q(R_n)| ≤ (4.3 − 0.045)} is the
difference between the next segment's bitrate metric and
the current segment's bitrate metric, representing
"QoE(magnitude)." In QoE-lin, the stall time experience is
composed of the playback stop time and the waiting time for
streaming initialization. {b_n | 0 ≤ b_n < ∞} is the sum of the
playback stall time (sec) due to video buffer starvation in the
nth video segment playback. In addition, {D | 0 ≤ D < ∞} is the
sum of the stop time (sec) resulting from the initialization of
adaptive video streaming. b_n and D represent "QoE(stall)."
{λ = 1, μ = 4.3, μ_D = 4.3} are nonnegative weighting factors for
QoE(magnitude) and QoE(stall) set against the bitrate metric.
In addition, to evaluate the fairness of the throughput, bitrate,
and QoE-lin, we use Jain's fairness index in [0, 1] [20].

TABLE II
The average and Jain's fairness index of the QoE metrics.

Method   | Avg. QoE(bit.) | Avg. QoE(mag.) | Avg. QoE(stall) | Avg. QoE-lin | Fair. QoE-lin
BBAzero  | 640.973        | 46.746         | 7.805           | 586.422      | 0.837
FNCcth.0 | 763.333        | 36.875         | 11.913          | 714.546      | 0.904
FNCcth.1 | 778.102        | 38.740         | 11.766          | 727.596      | 0.908
FNCcth.3 | 658.032        | 48.231         | 8.580           | 601.221      | 0.842
FNCcth.9 | 658.448        | 48.386         | 8.268           | 601.794      | 0.840

Fig. 5. The CDF of (a) the QoE(bitrate), (b) the QoE(magnitude), (c) the
QoE(stall time) and (d) the QoE-lin.

B. The results for adaptive video streaming

First, we evaluate the efficacy of FNC on the throughput of
video segment download and the bitrate of the video segments.
Table I and Figure 4 show the comparison of the average
(Avg.), Jain's fairness index (Fair.), and the cumulative
distribution function (CDF) of the bitrate (B.) and throughput
(T.) (Mbps) of the video segments for the 5 methods over all the
played streaming sessions.
For the throughput in Fig. 4 (a), the FNCcth.3, FNCcth.9, and
BBAzero methods are almost the same. This is because, at
congestion thresholds above 0.3, the bottleneck link is always
occupied by the highest bitrate clients after congestion detection,
and thus FNC is not able to make bandwidth available
for the lower bitrate clients for most periods. On the other
hand, the FNCcth.0 and FNCcth.1 methods have a better
proportion around 3 Mbps compared to the BBAzero
method because FNC additionally transfers lower bitrate
traffic before the bottleneck link is occupied. In addition, the
FNCcth.0 and FNCcth.1 cases are not excessively large in the
proportion of
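The QoE-lin score of Eq. (1) and Jain's fairness index [20] translate directly into code. This is a minimal sketch; the list-based inputs are our assumption.

```python
def qoe_lin(q_rates, stalls, init_delay, lam=1.0, mu=4.3, mu_d=4.3):
    """QoE-lin of Eq. (1): q_rates holds q(R_n) per segment, stalls holds
    b_n (sec) per segment, and init_delay is D (sec)."""
    quality = sum(q_rates)                                     # QoE(bitrate)
    switches = sum(abs(q_rates[n + 1] - q_rates[n])            # QoE(magnitude)
                   for n in range(len(q_rates) - 1))
    stall = sum(stalls)                                        # QoE(stall)
    return quality - lam * switches - mu * stall - mu_d * init_delay

def jain_index(values):
    """Jain's fairness index in (0, 1] over per-client values [20]."""
    return sum(values) ** 2 / (len(values) * sum(v * v for v in values))
```

For example, two segments played at the maximum quality q = 4.3 with no switches or stalls give the upper bound 4.3 × N = 8.6, and identical per-client values give a fairness index of 1.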


ICN enables flexible network control according to video
content information, with reference to the name address, for
improving the QoE of adaptive video streaming. To improve
overall QoE for all clients under BBA control, the proposed FNC
over ICN enables network-assisted control that improves QoE
fairness according to the bitrate differences among the BBA clients;
the BBA clients can thus adjust the bitrate in a fair manner that
increases the lower bitrate before the bottleneck link is occupied.
In future work, we plan to apply cache control to FNC for QoE fairness.

References
[1] Cisco, "Cisco Visual Networking Index: Forecast and Trends, 2017–2022," 2017.
[2] D. Kutscher et al., "Information-Centric Networking (ICN) Research
Challenges," RFC 7927, 2016.
[3] C. Westphal et al., "Adaptive Video Streaming over Information-Centric
Networking (ICN)," RFC 7933, 2016.
[4] ISO/IEC 23009-1:2012, Information Technology. Dynamic Adaptive
Streaming over HTTP (DASH).
[5] Y. Sani, A. Mauthe and C. Edwards, "Adaptive Bitrate Selection: A Survey,"
IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2985-3014.
[6] N. Barman and M. G. Martini, "QoE Modeling for HTTP Adaptive Video
Streaming–A Survey and Open Challenges," IEEE Access, vol. 7, pp.
30831-30859, 2019.
[7] B. Rainer, D. Posch, and H. Hellwagner, "Investigating the performance
of pull-based dynamic adaptive streaming in NDN," IEEE JSAC, vol. 34,
no. 8, pp. 2130–2140, 2016.
[8] K. Goto, Y. Hayamizu, M. Bandai and M. Yamamoto, "QoE performance
of adaptive video streaming in information centric networks," in Proc. of
IEEE LANMAN, pp. 1-2, 2019.
[9] Z. Duanmu, A. Rehman and Z. Wang, "A Quality-of-Experience Database
for Adaptive Video Streaming," IEEE Transactions on Broadcasting, vol.
64, no. 2, pp. 474-487, 2018.
[10] Y. Zhang et al., "Improving Quality of Experience by Adaptive Video
Streaming with Super-Resolution," in Proc. of IEEE INFOCOM 2020 - IEEE
Conference on Computer Communications, pp. 1957-1966, 2020.
[11] Y. Yuan et al., "VSiM: Improving QoE Fairness for Video Streaming in
Mobile Environments," in Proc. of IEEE INFOCOM 2022 - IEEE Conference
on Computer Communications, pp. 1309-1318, 2022.
[12] T. Huang, R. Johari, N. McKeown, M. Trunnell, and M. Watson, "A
buffer-based approach to rate adaptation: evidence from a large video
streaming service," in Proc. of the ACM conference on SIGCOMM
(SIGCOMM '14), pp. 187–198, 2014.
[13] W. Afandi, S. M. A. H. Bukhari, M. U. S. Khan, T. Maqsood and S. U. Khan,
"Fingerprinting Technique for YouTube Videos Identification in Network
Traffic," IEEE Access, vol. 10, pp. 76731-76741, 2022.
[14] G. Carofiglio, M. Gallo, and L. Muscariello, "ICP: Design and evaluation of
an interest control protocol for content-centric networking," in Proc. of
IEEE INFOCOM Workshop, pp. 304-309, 2012.
[15] C. Xia, M. Xu and Y. Wang, "A loss-based TCP design in ICN," in Proc. of
22nd Wireless and Optical Communication Conference, pp. 449-454,
2013.
[16] K. Schneider, C. Yi, B. Zhang and L. Zhang, "A Practical Congestion Control
Scheme for Named Data Networking," in Proc. of ACM ICN, pp. 21-30,
2016.
[17] B. Rainer, C. Kreuzberger, D. Posch, and H. Hellwagner, "Demo:
amus-ndnSIM – adaptive multimedia streaming simulator for NDN."
[18] MPEG-DASH at ITEC/AAU, http://dash.itec.aau.at
[19] Z. Akhtar et al., "Oboe: Auto-tuning Video ABR Algorithms to Network
Conditions," in Proc. of ACM SIGCOMM, pp. 44-58, 2018.
[20] R. Jain, D.-M. Chiu and W. Hawe, "A quantitative measure of fairness and
discrimination for resource allocation in shared computer systems," 1984.


CONCLUSION

Different methods exist for video fingerprinting. Van Oostveen relied on changes in patterns of
image intensity over successive video frames. This makes the video fingerprint robust against
limited changes in color, or the transformation of the original video's color into grayscale. Others
have tried to reduce the size of the fingerprint by only looking around shot changes.
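An intensity-based fingerprint in this spirit can be illustrated with a toy sketch: frames are reduced to mean grayscale intensities, and the fingerprint records whether the intensity rises or falls between successive frames. The details here are ours for illustration; the published scheme works on spatio-temporal block differences.

```python
def intensity_fingerprint(frames):
    """Toy fingerprint: one bit per frame transition, set to '1' when the
    mean intensity increases. frames is a list of 2-D grayscale frames
    (lists of lists of pixel values)."""
    def mean_intensity(frame):
        pixels = [p for row in frame for p in row]
        return sum(pixels) / len(pixels)

    means = [mean_intensity(f) for f in frames]
    # '1' if intensity rose between consecutive frames, else '0'
    return "".join("1" if b > a else "0" for a, b in zip(means, means[1:]))
```

Because only intensity ordering between frames matters, color shifts or a grayscale conversion that preserve relative brightness leave the bit string unchanged, which is the robustness property the text describes.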

Video fingerprinting does not rely on any addition to the video stream. A video fingerprint cannot
be removed, because it is not added. In addition, a reference video fingerprint can be created at any
point, from any copy of the video.

Irregularities and inconsistencies due to the VBR encoding of the video make it challenging to
identify videos in network traffic. To address this problem, this paper converts
BPSs into BPPs and presents a stable fingerprinting method, SDF. SDF works on the differences
between the BPPs to identify a VBR video streamed in encrypted network traffic. The created SDFs
are used to train a CNN model. After tuning the model's hyperparameters, the model achieves
accuracies of 90% and 99% in predicting videos and classifying traffic, respectively. However, the
effect of a variable period length on the model's prediction accuracy is yet to be analysed. We aim to
modify the technique to cope with the concept drift problem in the future. Observing the effect of
variable period length and finding its optimal value will make this technique more foolproof and
increase its practical deployment applications.
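The BPS-to-BPP conversion and the difference-based fingerprint described above can be sketched as follows. The period length and function names are our assumptions for illustration.

```python
def bps_to_bpp(bps, period=3):
    """Aggregate a per-second byte trace (BPS) into bytes per period (BPP)."""
    return [sum(bps[i:i + period]) for i in range(0, len(bps), period)]

def sdf(bps, period=3):
    """Stable fingerprint: differences between consecutive BPP values,
    which smooths out the burstiness introduced by VBR encoding."""
    bpp = bps_to_bpp(bps, period)
    return [bpp[i + 1] - bpp[i] for i in range(len(bpp) - 1)]
```

The resulting difference sequence, rather than the raw byte counts, is what would be fed to the CNN, so small per-second fluctuations in the encrypted stream do not change the fingerprint as much.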
