DeviceMien: Network Device Behavior Modeling For Identifying Unknown IoT Devices
IoTDI’19, April 15–18, 2019, Montreal, QC, Canada Jorge Ortiz, Catherine Crawford, and Franck Le
Figure 1: This figure shows all the steps taken during the training phase. The first step is to train a stacked LSTM-Autoencoder, Figure 1a. The second step is to use Bayesian optimization to find the optimal clustering parameters, using DBSCAN, Figure 1b. Finally, we train a classifier based on the assigned labels from the optimal clustering algorithm, Figure 1c. (a) Train an LSTM-Autoencoder. (b) Choose optimal clustering parameters. (c) Train classifier using cluster labels.
work very well under today's assumptions, they may become obsolete over time. In [30], for example, they extract domain names, port numbers, and cipher suites from network traffic per device. There are a number of undesirable properties to this approach. New devices may not use any of these protocols to communicate. IoT devices that are deployed by new companies use different protocols, and non-commercial entities may deploy devices that never issue DNS requests. Moreover, the port numbers may be entirely unique to the application and the traffic may not be encrypted. It is not only possible, it is likely in non-commercial deployments that none of these pre-requisites will be met. Also, since a bag-of-words model is used and it relies on a dictionary of terms, new terms may be introduced that expand the dictionary, which may either require re-training or require that multiple dictionaries are managed for old and new devices. This is not only true for this particular instance of related work but for most of the existing approaches. While they certainly provide value towards a general solution, those solution proposals will require complex maintenance in production and will become unwieldy at scale.

We consider a fundamentally different approach. Our approach does not rely on hand-crafted features. Instead, it relies only on the data. We use an automatic feature learning technique to cast the traffic onto a distribution that serves as a pseudo-signature for devices and device type (IoT/Non-IoT). Our solution casts the problem as one of comparing observed distributions of device traffic, in a streaming fashion. By using probabilistic matching, we are able to match known device behaviors and give meaningful feedback about the confidence we have in that inference. In summary, we make the following contributions:

(1) We automatically learn features from the data and construct a pipeline that successfully combines deep learning elements with probabilistic inference.
(2) We provide a framework that identifies known devices with over 99% accuracy after only a few TCP-flow observations.
(3) We introduce an inexpensive modeling approach that models the distribution of TCP-flows and uses the distribution as a pseudo-signature; a general approach that works with any packet data.
(4) We can identify new devices 100% of the time and can infer the right class (IoT/Non-IoT) with over 80% F1 score on average.
(5) Our model exploits a Bayesian analytical convenience to allow for a simple, flexible representation that is easy to instantiate, update, and run in only a few lines of code.

Our approach is general and is not limited to only IoT device identification. However, we observe that it can be used in this fashion and that it is especially useful, in comparison to existing approaches, when either a new device or new device behavior is observed. By not relying on hand-crafted features, it also provides the flexibility to run on other kinds of IP traffic. In this paper, we focus on identifying new devices and providing class-level feedback about the kind of device that is being observed (IoT vs Non-IoT). In the rest of the paper we provide a brief overview of related work, especially work focused on IoT device identification. We then give a high-level description of the framework and the three phases of its construction – the training phase, the modeling phase, and the execution phase. We then describe the components used in each phase and the specific selection of technique to enable it. We describe the data sets used in our experiments and give a detailed experimental methodology. Finally, we present the results and discuss the implications and conclusions.

2 RELATED WORK
Many studies [2, 21, 31, 32, 37] have stressed the vulnerability of IoT devices to security attacks, and emphasized the need for means to detect, recognize, identify and discover IoT devices. Apthorpe et al. [2] showed that by passively monitoring IoT network traffic, one could infer user behaviors (e.g., a user's sleeping patterns) even when the traffic is encrypted. Yu et al. [37] discussed the root causes behind several reported IoT vulnerabilities (e.g., unprotected RSA key pairs, open DNS resolvers), and present potential multi-stage cross-device attacks. The authors emphasize the need to be able to identify and understand IoT device classes so that cross-device interactions can be learned, and acceptable behaviors distinguished from potential attacks.
vector of aggregate statistics and header field values. Each sample vector is constructed from a summary of a complete, semantic TCP flow¹; similar to our approach, it is not limited by this experimental design choice. Also, their technique requires labeled data and does
to match and/or rank newly observed distributions with labeled
ones. This is illustrated in Figure 2 and Figure 3, respectively.
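As a concrete illustration of this matching step, one way to score and rank candidates is to evaluate an unlabeled device's TCP-flow class counts under each labeled device's Dirichlet-multinomial posterior and sort by log-likelihood. The following is a hedged sketch in our own notation: the function and variable names, and the use of the Dirichlet-multinomial marginal likelihood as the score, are illustrative assumptions, not the paper's exact scoring function.

```python
from math import lgamma

def dm_loglik(counts, alpha):
    # Dirichlet-multinomial marginal log-likelihood of flow-class
    # counts under posterior parameters alpha. The multinomial
    # coefficient is omitted: it is identical for every candidate
    # model given the same counts, so it does not affect ranking.
    n, a0 = sum(counts), sum(alpha)
    ll = lgamma(a0) - lgamma(a0 + n)
    for c, a in zip(counts, alpha):
        ll += lgamma(a + c) - lgamma(a)
    return ll

def rank_matches(counts, posteriors):
    # Rank labeled device models (name -> alpha vector) by how well
    # each explains the unlabeled device's TCP-flow class histogram.
    return sorted(posteriors,
                  key=lambda d: dm_loglik(counts, posteriors[d]),
                  reverse=True)
```

In this sketch the top-ranked name is the labeled model that best explains the observed counts, which mirrors the ranking experiments described later in the paper.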
Our framework aims to model the behavior of a device as a
distribution over classes of sequences of packets transmitted by
a (pair of) device(s). TCP provides a natural sequence of packets
to model. We train a neural network to encode a packet sequence
into a fixed-size vector, then we cluster the vectors to find their
“natural” separation and train a classifier on the cluster labels. Once
trained, we pass all the TCP flows for each device through the encoder and classify each encoding, maintaining a count over each of the classes for that device. Finally, we model each histogram as a multinomial distribution. All traffic from previously unseen devices is similarly modeled, and we calculate the similarity between distributions to identify them.

Figure 4: A TCP-flow set consists of a series of packets in sequential order. We remove the headers from each packet and treat all payload sequences as the set that we feed into the autoencoder.
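The encode, cluster, and classify sequence above can be sketched end-to-end. This is a minimal illustration, not the paper's implementation: flow encodings are assumed to already be fixed-size vectors, the parameter grids are placeholders, a grid search stands in for the Bayesian optimization of clustering parameters, and a random forest stands in for the classifier, whose type is not specified here.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score

def train_pipeline(encodings, eps_grid, min_samples_grid):
    # Steps (b) and (c) of Figure 1: pick the DBSCAN parameters that
    # maximize the silhouette score, then fit a classifier on the
    # resulting cluster labels.
    best = (-1.0, None)
    for eps in eps_grid:
        for m in min_samples_grid:
            labels = DBSCAN(eps=eps, min_samples=m).fit_predict(encodings)
            if len(set(labels)) < 2:
                continue  # silhouette needs at least two clusters
            score = silhouette_score(encodings, labels)
            if score > best[0]:
                best = (score, labels)
    if best[1] is None:
        raise ValueError("no parameter setting produced >= 2 clusters")
    labels = best[1]
    keep = labels != -1  # drop DBSCAN noise points before training
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(encodings[keep], labels[keep])
    return clf
```

Once trained, the classifier assigns every new flow encoding to one of the discovered flow classes, which is what the per-device histograms are built from.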
The choice of technique applied in each stage is made judiciously. The output of each stage is critical for building an effective discriminative model. For example, some sequence encoders yield "uninteresting" distributions over types that make sequence classification ineffective – distributions for several devices are indiscernible. In the rest of this section, we explain each design choice and discuss some of the alternatives we experimented with.

3.1 Processing TCP Connection Data
The transmission control protocol used by most internet devices provides a natural sequence of packets transmitted between a pair of devices. The protocol consists of a connection-establishment phase, a communication phase, and a teardown phase. TCP controls the transmission rate between parties, so the number of packets in flight varies with the available bandwidth. The maximum packet size is 1500 bytes. Each packet consists of several headers, concatenated to encapsulate the MAC, network, transmission, and application-layer information. Although TCP manages the transmission rate, re-transmissions, and packet ordering, a sniffer does not. Sniffed packets may be captured and recorded out of order, duplicated, or missing altogether. We choose to model TCP connection data between devices because it provides a natural, temporal ordering of packets for any pair of devices. However, because we capture packets on the network with a sniffer, we must pre-process them to ensure the sequences are in the right order and duplicated packets are removed. We ignore missing packets.

3.2 Unsupervised Feature Learning
In this section we describe the neural network architecture we use to learn features from TCP-flow packet data. We first describe the foundational elements of the architecture and give a detailed description of how these networks are connected and trained. Our feature-learning architecture consists of two types of networks: the Long Short-Term Memory (LSTM) neural network architecture and a stacked autoencoder architecture. In the following section, we give a brief overview of a recurrent neural network (RNN) and how an LSTM is constructed from an RNN. We then give a brief overview of autoencoders and describe how this network architecture is used to learn features in an unsupervised fashion. We also describe a version of autoencoders, stacked autoencoders, that allows us to learn higher-level, semantic feature representations. We describe how we combine these architectures to learn features from samples of TCP-flow packet data. Finally, we describe the rationale for our design in Section 3.2.4 and describe how the input is trained and encoded in Section 3.2.3.

3.2.1 Recurrent Neural Network. Recurrent neural networks contain cyclic connections that make them more useful for capturing dependencies between sequences of values than feed-forward neural networks. RNNs have been used to successfully model sequences, such as handwriting recognition [10], language modeling [22] and acoustic modeling [12]. RNNs contain cyclical connections that feed the activation nodes with output from the previous step. The Long Short-Term Memory (LSTM) architecture is a type of RNN [14] which is a modification of a standard RNN.

A vanilla RNN cannot capture long-term dependencies due to the vanishing and exploding gradient problem [5, 26]. To address this issue, LSTMs were designed with special units called memory blocks in the recurrent hidden layer. The memory blocks contain memory cells with self-connections storing the temporal state of the network, in addition to special multiplicative units called gates to control the flow of information. Each memory block in the original architecture contains an input gate and an output gate. The input gate controls the flow of input activations into the memory cell. The output gate controls the output flow of cell activations into the rest of the network. Later, the forget gate was added to the memory block [8]. This addresses a weakness of LSTM models preventing them from processing continuous input streams that are not segmented into sub-sequences. The forget gate scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, therefore adaptively forgetting or resetting the cell's memory. In addition, the modern LSTM architecture contains peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs [9].

We connect an LSTM layer as input to a stacked autoencoder network – discussed in the next section (Section 3.2.2) – in order to learn a function that maps to a low-dimensional representation of a TCP-flow sample.

3.2.2 Autoencoders. Autoencoders [13, 16, 35] are a class of neural networks used to learn a compact representation of the input. The network consists of two components: an encoding component and
Figure 6: For each device, a distribution over TCP-flow types, enumerated on the horizontal axis. (a) Android phone. (b) Belkin Switch. (c) Insteon web camera.
$$P(\theta \mid D, \alpha) = \mathrm{Dir}(\alpha'), \qquad \alpha'_j = \alpha_j + \sum_{d_i \in D} d_i^{(j)} \tag{2}$$
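Equation (2) is the standard conjugate update for a multinomial with a Dirichlet prior: the posterior parameters are simply the prior pseudo-counts plus the observed TCP-flow class counts. This is the Bayesian "analytical convenience" that lets a device model be instantiated and updated in a few lines; a minimal sketch (names are ours, not the paper's):

```python
import numpy as np

def update_posterior(alpha_prior, flow_class_counts):
    # Equation (2): posterior Dirichlet parameters are the prior
    # pseudo-counts plus the observed TCP-flow class counts.
    return alpha_prior + flow_class_counts

def expected_distribution(alpha):
    # Posterior mean of the multinomial parameters theta.
    return alpha / alpha.sum()
```

Because the update is purely additive, a device model can be maintained incrementally as new flows stream in, with no re-training.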
4.1 Data Sets
We evaluate our proposed approach on the network traffic from two independent sources. The first one is the publicly available trace from the University of New South Wales [3]. A lab was set up at the campus facility, comprising both IoT and non-IoT devices, and their traffic was captured over a period of 21 days starting from September 23, 2016. More specifically, the IoT devices consist of twenty-one commercial IoT devices representing different device classes of IoT devices, e.g., cameras (Nest Dropcam, Samsung SmartCam, Netatmo Welcome, Insteon Camera, TP-Link Day Night Cloud Camera, Withings Smart Baby Monitor), switches and triggers (iHome, TP-Link Smart Plug, Belkin Wemo Motion Sensor, Belkin Wemo Switch), hubs (Smart Things, Amazon Echo), air quality sensors (NEST Protect smoke alarm, Netatmo Weather station), electronics (Triby speaker, PIXSTAR Photoframe, HP Printer), healthcare devices (Withings Smart scale, Withings Aura smart sleep sensor, Blipcare blood pressure meter) and light bulbs (LiFX Smart Bulb). The non-IoT devices included laptops, mobile phones, and tablets. The devices are identified and labeled with their MAC address.

The second source is a private lab which was set up in 2016 in North America, and where commercial IoT devices had been continuously added and removed. The devices are also identified and labeled with their MAC address, and their traffic captured at the border router. For this study, we focus on the traffic captured during the month of April 2017, as it is the most recent month in the trace. During that time, we observe a total of 72 IoT devices. Figure 8 presents a sample of the IoT devices from the second dataset. Each device is represented with the format (Vendor, Type, [DeviceId]), where Vendor is the name of the vendor, Type is the name of the device type, and DeviceId is optional and only present when there are multiple instances of the same device type in the dataset. For some device types (e.g., Google Chromecast, SmartThings hub, Samsung SmartTV), many instances of the same device were present. Also, we observe that many device types (e.g., Amazon Echo, Google Dropcam) were present in both datasets.

4.2 Training
We collected over 170,000 flows from the UNSW data and over 4.3 million samples from the private lab in the month of April 2017. We only trained our encoder with 60% of the UNSW data. That is, we learned an encoder based only on 100,000 flows from the UNSW data. We chose this approach because we have more information about the public data set than we did about the private one. In future versions of our work, especially before a live deployment, we must consider training on a larger set of samples. In spite of the relatively small sub-sample, the performance is still very good, even across the private-lab data. This suggests that the encoder generalized well and that the probabilistic framework does not require much data to effectively capture and characterize device behavior across deployments in different settings.

4.2.1 Optimal Cluster Parameters. To find the optimal clustering parameters we use Bayesian optimization to automatically explore the parameter space and find a configuration that maximizes the Silhouette score [28]. Specifically, we use Bayesian optimization with Gaussian Process priors [33]. Bayesian optimization with Gaussian Process (GP) priors is a technique that allows us to do black-box exploration of a model through its parameters and a scoring function.

5 RESULTS
In this section, we describe the results of our experiments. Our experiments are partitioned into two sections. Section 5.1 looks at how well our approach is able to match unlabeled devices with their corresponding labeled ones. This is similar to supervised learning but with some important differences. The main difference is that our match process is probabilistic and we simply do each comparison in pairs. Then, we rank the comparison results. We do not set a threshold on the comparison score; we are interested in how informative the scoring is with respect to match difference across similar (and different) distributions. We look at the comparisons across all devices and show that the more observations we have, the better we can differentiate between one device and another. In addition, we show that there are class similarities amongst devices, which suggests that we can reliably infer whether or not the device is likely to be an IoT device. In Section 5.3, we show how to compare completely new devices – devices for which we have no known counterpart in our data – to show that the scores provide valuable insight into how the device is behaving relative to known devices, whether the behavior distribution is new, and whether or not it is likely to be an IoT or non-IoT device.
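The Bayesian-optimization step of Section 4.2.1 can be sketched as follows. This is a toy, one-dimensional version: it tunes only DBSCAN's eps with an upper-confidence-bound acquisition rule over a scikit-learn Gaussian process. The paper's actual parameter space, acquisition function, and GP configuration are not specified in this excerpt, so those choices here are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.metrics import silhouette_score

def silhouette_of_eps(eps, X):
    # Black-box objective: silhouette score of DBSCAN at this eps
    # (-1.0 when DBSCAN finds fewer than two clusters).
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
    if len(set(labels)) < 2 or len(set(labels)) >= len(X):
        return -1.0
    return silhouette_score(X, labels)

def bayes_opt_eps(X, bounds=(0.1, 5.0), n_init=4, n_iter=8, seed=0):
    # GP-based Bayesian optimization of DBSCAN's eps, maximizing the
    # silhouette score via an upper-confidence-bound (UCB) rule.
    rng = np.random.default_rng(seed)
    eps_tried = list(rng.uniform(*bounds, n_init))
    scores = [silhouette_of_eps(e, X) for e in eps_tried]
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(
            np.array(eps_tried).reshape(-1, 1), scores)
        cand = np.linspace(*bounds, 200).reshape(-1, 1)
        mu, sigma = gp.predict(cand, return_std=True)
        nxt = float(cand[np.argmax(mu + 1.5 * sigma)])  # UCB acquisition
        eps_tried.append(nxt)
        scores.append(silhouette_of_eps(nxt, X))
    return eps_tried[int(np.argmax(scores))]
```

The GP surrogate makes each new evaluation cheap to choose, which matters here because every objective evaluation requires re-running the clustering.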
class' distribution shows the distribution of scores when the pair being compared are both IoT or both non-IoT devices. The 'diff class' distribution shows the distribution of scores for device pairs belonging to different device classes (i.e. one is IoT and the other is non-IoT). We include all devices in both distributions, regardless of the number of observations for each device in the pair. We can see that the distribution shifts to the right for devices in the same class. This suggests that flow behavior for devices from the same class shows more similarity than flow behavior for devices from different classes. However, both display a similar range, which indicates that even device pairs across similar (or different) classes can show similar types of traffic and may be indistinguishable from one another.

We also examine the relationship between IoT devices and their similarity within and without class membership. More specifically, we look at the (IoT, IoT) similarity score distribution and the (Non-IoT, Non-IoT) similarity score distribution. Figure 10a shows the two CDFs for each distribution. Not surprisingly, we observe a similar relationship to the one seen in Figure 10c. IoT devices resemble each other more than they resemble Non-IoT devices. However, the distribution is more skewed within class than the 'same-class' distribution observed in the more general class comparison. This suggests that many IoT devices – between 30-40% of (IoT, IoT) device pairs – have a normalized similarity score greater than or equal to 0.9. IoT devices share much of the same behavior characteristics. However, we still observe that scores span the entire range, suggesting that IoT devices can behave indistinguishably from non-IoT devices.

Figure 10d and Figure 10b show the distribution comparison within/without class with an IoT device as the point of reference, respectively. This means that the 'same class' is (IoT, IoT) and the 'diff class' is (Non-IoT, IoT). Note, for these two graphs we only look at pairwise comparisons for devices with at least 50 TCP-flow sample observations. That is, there is a marked difference in the effectiveness of the approach as more observations are available. Note that while the distributions do not change very much between Figure 10c and 10d, we do see a marked and important change in the in-class score distribution from Figure 10a to 10b. First, a smaller fraction of (IoT, IoT) pairs have high scores. Figure 10a has roughly 40% of the distribution with a score of 0.5 or higher while Figure 10b shows roughly half as many (IoT, IoT) pairs with scores higher than 0.5. The similarity score becomes more informative with each sample. It is interesting to see the shift in the distribution, yet this result is not surprising. The second observation is that the out-of-class distribution maxes out at around 0.75. There is no (IoT, Non-IoT) device pair with a score higher than 0.75 when at least 50 observations have been made for each device. The two classes are clearly distinguishable in those cases and this could be used as a filter threshold.

The fact that we get less precise results with fewer observations is a fundamental limitation of supervised methods with weak (or no) priors. We use a weak prior in our model and we use that prior for all of our devices. Different techniques could be used to infer a better prior that may require fewer observations to converge to a representative model. However, that is beyond the scope of this paper and we leave that as an exercise for future work.

5.2 Ranking
In the previous section we examined the distribution of normalized scores. However, the normalizing constant depends on the true label for the unlabeled device. This is only useful if we know what it is. For example, suppose we maintain separate models for the same device, for different times of the day or week, and we want to compare their distributions to detect any changes. In this case, we know the normalizing constant is the one associated with that particular device. However, if the address of the device changes or we are observing a new, unlabeled device, we will not know how to normalize the score. In order to make use of the scores, we must rank the results instead. In principle, ranking should allow us to create an ordered list of the devices, where the highest ranked device is the most similar to the unlabeled one.

To examine this, we create such a list without normalizing the scores and declare a match when the model with the correct label is the top-ranked device in the list associated with the corresponding unlabeled device model. Figure 11a shows the rank of the true label device relative to the number of observations made for the unlabeled device. Notice, the number of observations that need to be made varies by device. That is, some devices require fewer observations to match their true label than others. Figure 11b is a close-up of the upper left-hand corner of Figure 11a. We can see that there is some instability when few observations are made, but that for most devices that instability is bounded, with the true label appearing in the top-three ranked results. Figure 11a also shows that the rank is stable as we continue to make new observations. We also observe that relatively few observations are needed to get the labeled device to match with its corresponding unlabeled counterpart. The average number of TCP-flow samples that need to be observed before being the top-ranked match is 18.9.

Table 1 shows a subset of the devices in our data set and the number of match rounds that need to take place before it becomes a match. Note, for most IoT devices the number of rounds is small, only requiring a single round of observations (10 TCP-flow samples) before a match is made. Meanwhile, non-IoT devices – highlighted in blue in the table – typically require more observations, in some cases needing as many as 12 rounds (120 TCP-flow samples) before becoming a match. Column three of the table suggests that this is related to the complexity of the model. The more complex the model, the more observations we need to make to get a match. Non-IoT devices have more intricate distributions over TCP-flow classes, therefore they require more observations to identify correctly and with high certainty.

The expected number of observations is a function of the number of TCP-flow types and their distribution. Skewed distributions are the most challenging, since rare classes take longer to observe. It is also challenging to model devices that rarely transmit (or receive) data. The average number of TCP-flow types observed by IoT devices is 6.9 while the number of TCP-flow classes observed from Non-IoT devices is 22. Non-IoT devices are much chattier on average than IoT devices, so it is fortuitous that IoT devices are simpler to model and that we can typically do so with high certainty.

Table 2 and Table 3 show examples of IoT and non-IoT top-3 ranked matches. Observe the 2nd and 3rd ranked devices in both tables. We see that the rank is usually dominated by devices in the
same class – with one exception where a 'PIX-STAR Photo-frame' ranks an 'Android-Phone' as 3rd most similar. IoT devices show statistical similarity to other IoT devices and non-IoT devices show statistical similarity to non-IoT devices. We see this trend by using this ranking scheme, independent of the normalizing factor! This is important. It suggests that we can infer whether or not a new device 'looks' more like an IoT device or more like a non-IoT device, even without any prior knowledge about that device. In practice, this is the common case. It is most often the case that a large fraction of IoT devices plugged into networks are unregistered. This approach takes advantage of the few labeled ones and becomes more reliable as more labels are acquired.

Reference          | Rank 1             | Rank 2             | Rank 3
Android Phone      | Laptop             | Android Phone      | Laptop
Samsung Galaxy Tab | Samsung Galaxy Tab | Samsung Galaxy Tab | MacBook
Android Phone      | Android Phone      | Android Phone      | Samsung Galaxy Tab
MacBook            | MacBook            | MacBook            | Samsung Galaxy Tab
Laptop             | Laptop             | Laptop             | Android Phone

Table 3: Non-IoT devices rank high as potential matches for other non-IoT devices. This table shows a subset of non-IoT devices for which we observed at least 50 TCP-flow samples.

Figure 11: These show the rank of the true label versus the number of observations made for the unlabeled device. (a) Rank versus observations. (b) Zoomed-in slice of 11a.

5.3 Identifying Previously Unseen Devices
The common case in real deployment will not have a reference distribution to compare to. We attempt to emulate this condition by running an experiment where we compare each device to all the others. In this case, there is no true 'match' so we attempt instead to infer whether or not the device is IoT or non-IoT. In this experiment, we use the device model for every device except one, and train a one-class SVM [29] with these samples. The model produces a sequence of TCP-flow classes of size 100 and we treat this as a feature for our one-class SVM. For the device left out, we create a model using its data, draw 1000 samples from the model, and run each sample through the one-class SVM for each known device. The final score is the fraction of samples classified as IoT vs non-IoT for all known IoT and non-IoT classes. For example, if there are 10 IoT SVM models and 10 non-IoT SVM models and 800 of the 1000 samples were classified as IoT by at least 1 of the 10 IoT SVMs, then we give a score of 0.8 to IoT. The non-IoT SVM models are evaluated separately, so we could also get a score of 0.8 for the non-IoT set evaluation.

Dataset | F1   | Acc
private | 0.91 | 0.83
UNSW    | 0.79 | 0.66
all     | 0.76 | 0.61

Table 4: Performance on different data sets for determining the right class of device when the device has not been previously observed.

Table 4 summarizes the aggregate performance figures for the leave-one-out experiment on both datasets independently and on
both, together. That is, for the single-dataset case, we only consider the devices in that dataset in our comparison. For the combined one, we consider the type-level similarity for all devices and the one left out. Observe that the F1 score and accuracy are highest for the data set obtained from the private IoT deployment. The F1 score is 0.91 and accuracy is 0.83. We believe this is because this data set has many devices with lots of samples. This allows us to build better models of behavior that improve matching accuracy.

The performance of the combined dataset is affected by the addition of the UNSW data. We observe that the probabilistic comparisons become less reliable with fewer samples. One way to address this shortcoming is to use a better prior distribution for the initial device model. Text-based approaches may be used to infer the type, and the associated model parameters for that device could be used as a prior until enough samples are obtained. Table 5 shows a number of devices picked from the mixture. We see that our model is able to identify devices which do not have a similar distribution to the ones that have been seen. We also see that some devices do in fact show a similar distribution to others that have been observed. Also, note that even devices that do not share many behavior characteristics with the others tend to lean (very slightly) towards the true class type. This suggests that IoT and non-IoT devices do indeed have different distributions on average (with some exceptions, like the Amazon Echo).

The first 10 devices in the table show either no leaning or a slight leaning towards the right class. They can essentially be classified as being 'new' distributions. As 'new' distributions, the administrator could decide to go further in her investigation and either quarantine the device or kick it off the network entirely. For the others, there is a stronger lean in one direction or the other. We would like to draw your attention to the last three entries in the table. These were unlabeled in our data set. We have not been able to verify what kind of devices they are; however, our analysis suggests that at least one of them is an IoT device, although confidence in this assessment is small. The other two labeled 'None' do not resemble distributions for any of the labeled devices in the dataset.

Device                           | IoT   | Non-IoT
NEST Protect smoke alarm         | 0     | 0
Laptop                           | 0     | 0
Belkin Wemo switch               | 0     | 0
Dropcam                          | 0.009 | 0
Withings Smart scale             | 0.001 | 0
Triby Speaker                    | 0.003 | 0
MacBook/Iphone                   | 0.001 | 0
Light Bulbs LiFX Smart Bulb      | 0.006 | 0.001
Nest Dropcam                     | 0.006 | 0
IPhone                           | 0.002 | 0
Samsung SmartCam                 | 0.732 | 0
Netatmo weather station          | 0.912 | 0
Belkin wemo motion sensor        | 0.033 | 0
Amazon Echo                      | 0.453 | 0
PIX-STAR Photo-frame             | 0.727 | 0
Withings Smart Baby Monitor      | 0.188 | 0
Netatmo Welcome                  | 0.363 | 0
Withings Aura smart sleep sensor | 0.415 | 0
iHome                            | 0.747 | 0
TP-Link Day Night Cloud camera   | 0.852 | 0
HP Printer                       | 0.438 | 0
Insteon Camera                   | 0.075 | 0.003
MacBook                          | 0     | 0.019
Android Phone                    | 0     | 0.033
Samsung Galaxy Tab               | 0     | 0.101
Android Phone                    | 0.005 | 0.035
None                             | 0     | 0
None                             | 0.067 | 0
None                             | 0     | 0

Table 5: Most devices are declared 'novel' or previously unobserved. For these devices, note that the confidence score for the classes is leaning in the right direction. All the 'None' devices were positively identified as generating completely new traffic.

6 CONCLUSION
In this paper we presented a technique for modeling device behavior
using its network traffic. Our technique consists of three phases.
The first phase is the training phase, whereby we pre-process TCP-
flow data for each device and train a deep LSTM-Autoencoder
network to learn a set of representative features from the data itself. Then, we use a Bayesian hyper-parameter tuning framework to tune a clustering algorithm that separates the features into the most discernible classes, according to the cluster silhouette score. That is, we tune the clustering algorithm to maximize the separation between the different clusters. Then, we train a classifier on the labels assigned to the clusters.
In the next phase of the technique we use the classifier to produce a distribution over these classes, by classifying all the TCP-flow samples for every device. We then model this distribution as a multinomial distribution with a Dirichlet prior on the parameters. Finally, for each device we want to 'match', we generate its distribution, build a probabilistic model, and compare its distribution to the others. We show that we can match devices to their labeled counterpart with high accuracy, especially as more samples are observed for both the labeled and unlabeled device models. We also show that it is simple to create and compare models in code, only requiring a few lines to capture the distributions effectively.
We show that IoT devices and non-IoT devices exhibit distinguishable behavior through their respective distributions. However, some IoT and non-IoT devices are statistically similar and cannot be effectively classified as one or the other. We show that we can infer the class of a device never observed as well, leaning towards the correct class on average. We also present a technique that uses a mixture of one-class SVMs to infer the final class label of unseen devices and even uncover when a device's distribution is completely new. All of these capabilities happen without supervision and on unlabeled data.
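The Dirichlet-multinomial device model really does fit in a few lines. The sketch below is a minimal illustration of the idea, not the paper's exact implementation: the number of flow classes, the uniform prior strength, and the symmetrized-KL comparison between posterior means are all our own illustrative choices.

```python
import numpy as np

class DeviceModel:
    """Multinomial over K flow classes with a Dirichlet prior.
    K and the uniform prior strength are illustrative choices."""
    def __init__(self, n_classes, alpha=1.0):
        self.alpha = np.full(n_classes, alpha)  # Dirichlet prior parameters

    def update(self, class_labels):
        # Conjugacy: posterior is Dirichlet(alpha + per-class counts)
        counts = np.bincount(class_labels, minlength=len(self.alpha))
        self.alpha = self.alpha + counts
        return self

    def mean(self):
        # Posterior mean of the multinomial parameters
        return self.alpha / self.alpha.sum()

def divergence(m1, m2):
    # Symmetrized KL between posterior means: one reasonable way to
    # compare two device distributions (an illustrative choice, not
    # necessarily the metric used in the paper)
    p, q = m1.mean(), m2.mean()
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * (kl(p, q) + kl(q, p))

# Toy usage: two devices whose TCP-flows fall into K=5 classes
known = DeviceModel(5).update(np.array([0, 0, 1, 2, 0, 1]))
unknown = DeviceModel(5).update(np.array([0, 1, 0, 0, 2, 1]))
print(divergence(known, unknown))
```

Because the prior keeps every class probability strictly positive, the divergence stays finite even when a device has never produced flows of some class; updating is a single vector addition, which is why comparing a new device against a library of known device models is cheap.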
As such, in practice they will be very useful for providing meaningful information to network administrators as the number and diversity of IoT devices continue to increase. This can help them make better decisions in this setting and can be used in combination with existing approaches to provide a complete suite of tools that addresses the challenge of securing networks from the onslaught of unsafe IoT and non-IoT devices alike.

REFERENCES
[1] Apple. 2010. Bonjour Service Discovery Suite. https://developer.apple.com/bonjour/.
[2] N. Apthorpe, D. Reissman, and N. Feamster. 2016. A Smart Home is No Castle: Privacy Vulnerabilities of Encrypted IoT Traffic. In Workshop on Data and Algorithmic Transparency (DAT).
[3] UNSW Australia. 2017. Testbed Setup for IoT Data Collection. http://149.171.189.1.
[4] Avahi. 2010. Avahi Service Discovery Suite. http://www.avahi.org/.
[5] Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (March 1994), 157–166. https://doi.org/10.1109/72.279181
[6] Léon Bottou. 1991. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes 91, 8 (1991).
[7] CNBC. 2016. Suddenly hot smart home devices are ripe for hacking, experts warn. https://www.cnbc.com/2016/12/25/suddenly-hot-smart-home-devices-are-ripe-for-hacking-experts-warn.html.
[8] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to Forget: Continual Prediction with LSTM. Neural Computation 12 (1999), 2451–2471.
[9] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. 2003. Learning Precise Timing with LSTM Recurrent Networks. J. Mach. Learn. Res. 3 (March 2003), 115–143. https://doi.org/10.1162/153244303768966139
[10] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. 2009. A Novel Connectionist System for Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 5 (May 2009), 855–868. https://doi.org/10.1109/TPAMI.2008.137
[11] The Guardian. 2013. Will giving the internet eyes and ears mean the end of privacy? https://www.theguardian.com/technology/2013/may/16/internet-of-things-privacy-google.
[12] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
[13] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science (2006).
[14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[15] Diederik P. Kingma and Max Welling. 2013. Auto-Encoding Variational Bayes. CoRR abs/1312.6114 (2013).
[16] Honglak Lee, Chaitanya Ekanadham, and Andrew Y. Ng. 2008. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems. 873–880.
[17] M. Lopez-Martin, B. Carro, A. Sanchez-Esguevillas, and J. Lloret. 2017. Network Traffic Classifier With Convolutional and Recurrent Neural Networks for Internet of Things. IEEE Access 5 (2017), 18042–18050. https://doi.org/10.1109/ACCESS.2017.2747560
[18] Wired Magazine. 2014. The Internet of Things is Wildly Insecure – and Often Unpatchable. https://goo.gl/cuKnLN.
[19] Wired Magazine. 2015. Hackers Remotely Kill a Jeep on the Highway – With Me In It. https://www.wired.com/2015/07/hackers-remotely-kill-jeep-highway/.
[20] F. J. Massey. 1951. The Kolmogorov-Smirnov test for goodness of fit. J. Amer. Statist. Assoc. 46, 253 (1951), 68–78.
[21] Markus Miettinen, Samuel Marchal, Ibbad Hafeez, Tommaso Frassetto, N. Asokan, Ahmad-Reza Sadeghi, and Sasu Tarkoma. 2017. IoT Sentinel Demo: Automated Device-Type Identification for Security Enforcement in IoT. In Proc. 37th IEEE International Conference on Distributed Computing Systems (ICDCS 2017). IEEE.
[22] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura (Eds.). ISCA, 1045–1048. http://dblp.uni-trier.de/db/conf/interspeech/interspeech2010.html#MikolovKBCK10
[23] Andrew W. Moore and Denis Zuev. 2005. Internet Traffic Classification Using Bayesian Analysis Techniques. In Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '05). ACM, New York, NY, USA, 50–60. https://doi.org/10.1145/1064212.1064220
[24] BBC News. 2014. Smart meters can be hacked to cut power bills. http://www.bbc.com/news/technology-29643276.
[25] Jorge Ortiz, Catherine Crawford, Franck Le, and Ali Hasan. 2017. Strange (Internet of) Things: Towards Automatic Identification of IoT Devices in the Wild. https://goo.gl/ExWQ6A.
[26] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the Difficulty of Training Recurrent Neural Networks. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), Volume 28. JMLR.org, III-1310–III-1318. http://dl.acm.org/citation.cfm?id=3042817.3043083
[27] Gartner Research. 2017. Gartner Says 8.4 Billion Connected "Things" Will Be in Use in 2017, Up 31 Percent From 2016. http://www.gartner.com/newsroom/id/3598917.
[28] Peter Rousseeuw. 1987. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 20, 1 (Nov. 1987), 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
[29] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 13, 7 (July 2001), 1443–1471. https://doi.org/10.1162/089976601750264965
[30] A. Sivanathan, H. Habibi Gharakheili, F. Loi, A. Radford, C. Wijenayake, A. Vishwanath, and V. Sivaraman. 2018. Classifying IoT Devices in Smart Environments Using Network Traffic Characteristics. IEEE Transactions on Mobile Computing (2018), 1–1. https://doi.org/10.1109/TMC.2018.2866249
[31] Arunan Sivanathan, Daniel Sherratt, Hassan Habibi Gharakheili, Vijay Sivaraman, and Arun Vishwanath. 2016. Low-cost flow-based security solutions for smart-home IoT devices. In Advanced Networks and Telecommunications Systems (ANTS).
[32] Arunan Sivanathan, Daniel Sherratt, Hassan Habibi Gharakheili, Adam Radford, Chamith Wijenayake, Arun Vishwanath, and Vijay Sivaraman. 2017. Characterizing and Classifying IoT Traffic in Smart Cities and Campuses. In IEEE INFOCOM Workshop on SmartCity: Smart Cities and Urban Computing. Atlanta, GA.
[33] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12). Curran Associates Inc., USA, 2951–2959. http://dl.acm.org/citation.cfm?id=2999325.2999464
[34] The Verge. 2016. How an army of vulnerable gadgets took down the web today. https://www.theverge.com/2016/10/21/13362354/dyn-dns-ddos-attack-cause-outage-status-explained.
[35] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning. ACM, 1096–1103.
[36] Security Week. 2014. Hackers Attack Shipping and Logistics Firms Using Malware-Laden Handheld Scanners. https://goo.gl/BTppBy.
[37] Tianlong Yu, Vyas Sekar, Srinivasan Seshan, Yuvraj Agarwal, and Chenren Xu. 2015. Handling a Trillion (Unfixable) Flaws on a Billion Devices: Rethinking Network Security for the Internet-of-Things. In Proceedings of the 14th ACM Workshop on Hot Topics in Networks (HotNets-XIV).