MACHINE LEARNING FOR ENCRYPTED AMAZON ECHO TRAFFIC CLASSIFICATION

by
Ryan Blake Jackson

© Copyright by Ryan Blake Jackson, 2018
All Rights Reserved

A thesis submitted to the Faculty and the Board of Trustees of the Colorado School of
Mines in partial fulfillment of the requirements for the degree of Master of Science (Computer
Science).

Golden, Colorado
Date

Signed:
Ryan Blake Jackson

Signed:
Dr. Tracy K. Camp
Thesis Advisor

Golden, Colorado
Date

Signed:
Dr. Tracy K. Camp
Professor and Head
Department of Computer Science

ABSTRACT

As smart speakers like the Amazon Echo become more popular, they have given rise to
rampant concerns regarding user privacy. This work investigates machine learning techniques
to extract ostensibly private information from the TCP traffic moving between an Echo
device and Amazon’s servers, despite the fact that all such traffic is encrypted. Specifically,
we investigate two supervised classification problems using six machine learning algorithms
and three feature vectors. The “request type classification” problem seeks to determine
what type of user request is being answered by the Echo. With six classes, we achieve 97%
accuracy in this task using random forests. The “speaker identification” problem seeks to
determine who, of a finite set of possible speakers, is speaking to the Echo. In this task, with
two classes, we outperform random guessing by a small but statistically significant margin
with an accuracy of 58%. We discuss the reasons for, and implications of, these results, and
suggest several avenues for future research in this domain.

TABLE OF CONTENTS

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Amazon Echo Data Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 TCP Network Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4 Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

CHAPTER 3 RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.1 The Identification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2 Speaker Identification in VoIP . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3 Recovering User Information from the Echo . . . . . . . . . . . . . . . . . . . 18

CHAPTER 4 REQUEST TYPE CLASSIFICATION . . . . . . . . . . . . . . . . . . . 19

4.1 Our Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2.1 tcptrace Feature Vector . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.2.2 Histogram Feature Vector . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.2.3 Combined Feature Vector . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.4 Machine Learning Algorithm Hyper-parameters . . . . . . . . . . . . . . . . . 26

4.5 Relative Importance of Features . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.6 Generalization Across Networks and Users . . . . . . . . . . . . . . . . . . . . 32

CHAPTER 5 SPEAKER IDENTIFICATION . . . . . . . . . . . . . . . . . . . . . . . 36

5.1 Our Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.2 Lack of Differentiable Speech and Pause Packets . . . . . . . . . . . . . . . . . 36

5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

CHAPTER 6 CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . 43

6.1 Improving Generalizability Across Networks and Users . . . . . . . . . . . . . 44

6.2 Building Data Sets of Usage Patterns . . . . . . . . . . . . . . . . . . . . . . . 45

6.3 Exploring Similar Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

APPENDIX A SETTING UP THE PACKET CAPTURE ENVIRONMENT . . . . . . 50

APPENDIX B FEATURES FROM TCPTRACE . . . . . . . . . . . . . . . . . . . . . 52

LIST OF FIGURES

Figure 1.1 The Amazon Echo and Echo Dot . . . . . . . . . . . . . . . . . . . . . . . . 1

Figure 2.1 An example of a simple decision tree to classify a few animals. . . . . . . . . 5

Figure 2.2 A depiction of a random forest classifier. . . . . . . . . . . . . . . . . . . . . 7

Figure 2.3 The flow of data when using the Echo. This study focuses on the paths
labeled in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

Figure 2.4 Our data collection environment places a gateway machine between the
Echo and the Alexa cloud service. . . . . . . . . . . . . . . . . . . . . . . . 10

Figure 2.5 The fields in a TCP/IP packet. . . . . . . . . . . . . . . . . . . . . . . . . 11

Figure 3.1 VAD and VBR techniques (blue) vary the bandwidth used to transmit
audio with the complexity of the audio. Constant bitrate encoding
(orange) does not consider audio complexity. . . . . . . . . . . . . . . . . . 17

Figure 4.1 Confusion matrix averaged over 100 different train/test splits for random
forest using tcptrace feature vectors. . . . . . . . . . . . . . . . . . . . . . 26

Figure 4.2 Confusion matrix averaged over 100 different train/test splits for random
forest using histogram feature vectors. . . . . . . . . . . . . . . . . . . . . 27

Figure 4.3 Confusion matrix averaged over 100 different train/test splits for random
forest using the combined feature vectors. . . . . . . . . . . . . . . . . . . 28

Figure 4.4 Confusion matrix for a random forest tested on tcptrace feature vectors
from a different network with a different user. . . . . . . . . . . . . . . . . 34

Figure 4.5 Confusion matrix for k-nearest neighbors tested on tcptrace feature
vectors from a different network with a different user. . . . . . . . . . . . . 35

Figure 4.6 Confusion matrix for a random forest tested on histogram feature vectors
from a different network with a different user. . . . . . . . . . . . . . . . . 35

Figure 5.1 Histogram of packet sizes for all packets collected for our speaker
identification problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Figure 5.2 Data packet sizes over time for a representative human request. . . . . . . 39

Figure 5.3 Data packet sizes over time for a representative artificial request. . . . . . 40

Figure 5.4 Confusion matrix averaged over 100 different train/test splits for linear
kernel support vector machine using tcptrace feature vectors. . . . . . . . . 41

LIST OF TABLES

Table 4.1 Mean accuracy for a 400-tree random forest using the histogram feature
vectors with various bin sizes both with and without ACK packets.
Averaged over 100 random stratified train/test splits. . . . . . . . . . . . . 23

Table 4.2 Accuracy results for 100 trials with different train/test splits for our six
machine learning algorithms on our three different feature vectors.
Largest results are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

Table 5.1 Accuracy results for 100 trials with different train/test splits for our six
machine learning algorithms on our three different feature vectors.
Largest results are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

Table B.1 Features extracted from the Echo packet capture data by tcptrace. . . . . . 54

ACKNOWLEDGMENTS

I would like to thank LGS Innovations, especially Dr. Mary Schurgot and Dr. Kerri
Stone, for inspiring and facilitating this project. I would also like to thank Dr. Tracy Camp
for being an invaluable teacher, mentor, and role model since she introduced me to computer
science in 2010.

CHAPTER 1
INTRODUCTION

The Amazon Echo and Amazon Echo Dot (shown in Figure 1.1 and hereafter referred
to as “Echo”) are smart speakers designed and distributed by Amazon.com. Users interact
with the Echo via voice commands that are interpreted and carried out by Amazon’s cloud-
based intelligent personal assistant service Alexa. Via Alexa, an Echo is capable of voice
interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing
audiobooks, acting as a home automation hub, and providing weather, traffic and other real
time information [1].

Figure 1.1: The Amazon Echo and Echo Dot [2].

As is common for emerging Internet of things (IoT) technologies, the Echo has given rise
to privacy concerns. Historically, the level of privacy and security in collecting, processing,

and disseminating user information can make or break an IoT innovation, and the conse-
quences for inadequacy in this aspect include non-acceptance of the technology in question,
damage to company reputation, and costly lawsuits [3]. The most prevalent concerns pertain
to Amazon’s collection and treatment of Echo users’ recordings. The Echo’s seven microphones,
combined with its persistent Internet connection, make it technically possible to continuously stream
audio data to Amazon’s servers. Thus, articles in popular publications have drawn attention
to concerns regarding Amazon’s ability to eavesdrop on its users [4] [5] [6]. Amazon has
attempted to assuage these concerns by assuring users that, although the Echo is constantly
listening for its cue to activate (i.e., its “wake word”), the only data sent to the cloud is the
request following the wake word and a few moments of recorded sound immediately prior to
the wake word’s detection. There is also a physical mute button on the Echo device that
disables any form of listening [1].
A similar set of concerns stems from the possibility of malicious third parties obtaining
user data, either by gaining access to the recordings saved on Amazon’s servers or by inter-
cepting information in transit between the Echo device and the Alexa cloud service [4] [7].
Amazon guards against this potential issue by encrypting all traffic between the Echo and
the server [7]. However, as discussed in Chapter 3, the fact that a network flow is encrypted
does not necessarily mean that all sensitive information is kept private. This study aims
to use machine learning techniques to extract information from encrypted packets captured
between the Echo and the Alexa servers.
Specifically, this study investigates two classification problems. The first, which we dub
“request type classification”, is to identify what type of request is being answered (e.g.,
requests for music, weather, or information) given the encrypted packets coming from the
Alexa cloud service to the Echo device. These packets are Alexa’s response to the user’s
request. The second classification problem, which we term “speaker identification”, is to
identify who, of a finite set of possible people, is speaking in the encrypted audio data sent
from the Echo to the Alexa cloud service. The speaker identification problem is clearly

pertinent to user privacy since it would yield knowledge of who is present or not present in a
home. We note, however, that request type classification could also be used to discern user
identity information by building large data sets of usage patterns for specific users over time.
Even in the absence of ground-truthed identity labels, such data would allow for anomaly
detection in usage patterns, thus revealing changes in household dynamics.
Chapter 2 presents a general overview of concepts necessary to understand this work
including machine learning, network communication with the transmission control protocol,
and encryption. Chapter 3 describes previous studies on several similar problems and relates
them to our work. To the best of our knowledge, our particular problems have not been
previously explored in depth; however, we discuss several other popular research domains
as they bear notable similarities to this study. Chapter 4 details the methods by which we
investigated the request type classification problem and our results. Chapter 5 provides the
corresponding discussion of methodologies and results for the speaker identification problem.
Finally, we offer some concluding remarks and possible directions for future work in Chapter
6.

CHAPTER 2
BACKGROUND

We begin with a brief overview of supervised machine learning and the supervised classi-
fication algorithms that we use in Chapters 4 and 5. We then present a high-level summary
of the data paths involved in the Echo’s functionality. We conclude this chapter with a
conceptual review of TCP network traffic and encryption as they apply to this work.

2.1 Machine Learning

Supervised machine learning is the term for all algorithms that reason from externally
supplied instances to produce general hypotheses, which then make educated conjectures
about previously unseen instances [8]. A machine learning algorithm is said to be a clas-
sification algorithm when it seeks to classify an object into a finite set of categories. All
machine learning techniques discussed in this work are supervised classification algorithms.
In short, supervised classifiers aim to build a concise model of the class label distribution
based on features of the classifiable objects. This goal is accomplished via training data for
which the true classes are known. The resulting classifier is then used to predict class labels
for instances of unknown class, and evaluated by various metrics of efficacy in this task. Our
specific classification tasks are described in Chapters 4 and 5.
The aim of this study is not to investigate all classification algorithms on our data,
but rather to apply several notable classification algorithms to discern the level of threat
that intercepted data pose to Echo user privacy. To that end, we investigate the following
six machine learning algorithms: C4.5 decision trees, random forests, linear kernel support
vector machines, radial basis function kernel support vector machines, multi-layer perceptron
neural networks, and k-nearest neighbors classification. In the following discussion, we briefly
describe each of these algorithms and justify their inclusion in our work.

We begin with the decision tree, given its success in several related works. The C4.5
decision tree algorithm generates decision trees using the concept of information entropy in
the set of training data. At each node of the tree, C4.5 selects the feature of the data that
best splits the training samples into subsets, such that the subsets predominantly represent
one class. In other words, the tree is built from the top down, and, at each juncture, the tree
splits on the attribute that gives the highest normalized information gain. The algorithm
then recurs on the resultant subsets until some stopping criterion is reached [8]. Typical
stopping criteria include a maximum depth for the tree or a minimum level of purity at each
leaf. The result is a decision tree that looks like the example shown in Figure 2.1. New data
are classified by following the appropriate path down the tree from root to leaf.
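To make the splitting criterion concrete, the following minimal sketch computes the information gain of a candidate split; C4.5 proper further normalizes this gain by the entropy of the split itself (the gain ratio), which is omitted here for brevity.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

labels = ["mammal"] * 4 + ["bird"] * 4
print(information_gain(labels, [labels[:4], labels[4:]]))    # 1.0: a perfect split
print(information_gain(labels, [labels[::2], labels[1::2]])) # 0.0: an uninformative split
```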

Figure 2.1: An example of a simple decision tree to classify a few animals.

Our second machine learning model, the random forest, is an ensemble method. Machine
learning ensembles combine predictions from multiple base classifiers in an attempt to achieve
better predictive performance than any individual classifier alone. The intuition behind these
ensembles is that each constituent base classifier is trained slightly differently in the same

task. We expect occasional errors from individual classifiers, but we hope that their diversity
allows the majority to be correct despite a few misclassifications for any given input. This
idea is conceptually similar to humans making important decisions by committee rather
than leaving them up to one individual, where all committee members share a general goal.
Random forests, shown in Figure 2.2, are ensembles of decision trees. Each tree in the forest
is trained on a random subset of the training examples drawn with replacement. Typically,
these subsets are the same size as the original training data set. When training each tree,
each node considers only a proper subset of the features. In other words, instead of simply
splitting on the feature that provides the highest normalized information gain, the algorithm
randomly selects a few features and then splits on the best of those. Random forests are
typically more robust to overfitting than individual classifiers. They also generally give
error rates that compare favorably to those of other ensemble methods like Adaboost, while
being more resilient to noisy data [9]. We include random forests in this study in hopes of
improving on the performance of individual decision trees.
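As an illustration, the sketch below constructs such a forest with Scikit-learn, using the 400 trees and square-root feature subsets reported in Section 4.4; `X_train` and `y_train` are hypothetical placeholders.

```python
from sklearn.ensemble import RandomForestClassifier

# Each of the 400 trees sees a bootstrap sample of the training set (drawn with
# replacement, the same size as the original), and each split considers only a
# random subset of the features (here, the square root of the feature count).
forest = RandomForestClassifier(n_estimators=400, bootstrap=True, max_features="sqrt")
# forest.fit(X_train, y_train); a prediction is the majority vote of the 400 trees.
```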
Our third machine learning algorithm, the support vector machine, works by creating a
linear decision surface in the input space that separates training examples of different classes
by as large a margin as possible. A support vector machine can perform non-linear classifica-
tion by using a kernel function to implicitly map input points to a higher dimensional space
before learning the linear decision boundary in this higher dimensional space. In essence,
this technique results in a non-linear decision boundary in the input space. In this study, we
use both linear kernel support vector machines, which do not map the inputs to a higher di-
mensional space, and support vector machines with radial basis function kernels. Although
single support vector machines can only perform binary classification, we perform multi-
class classification by training several support vector machines via the one-vs-one method
[10]. Using this method, we train n(n−1)/2 hyperplanes to classify n possible classes (e.g.,
15 pairwise hyperplanes for a six-class problem). Stud-
ies have found the one-vs-one method to be more effective in some applications than other
available methods (like one-vs-all) for multi-class classification with support vector machines

Figure 2.2: A depiction of a random forest classifier.

[11]. We include support vector machines in this work for their high generalization ability
demonstrated in other domains [12].
Recently, artificial neural networks have become extremely fashionable within the ma-
chine learning community and popular culture [13]. Therefore, we include a form of artificial
neural network in this study. We use multi-layer perceptrons instead of developing a novel
neural network architecture custom tailored to our specific data, as this work is primarily
exploratory and concerned with feasibility. Our multi-layer perceptron networks are fully
connected feed-forward networks trained via backpropagation. Within those constraints, we
optimize across several architectures and activation functions as discussed in Chapter 4.
The final machine learning algorithm that we investigate is the k-nearest neighbors (knn)
classifier. In this method, any given input is assigned the majority class of the k training
points nearest it based on some measure of distance (e.g., Euclidean or Manhattan). It is
known that a simple majority vote between the nearest k points can be flawed when there are

large class imbalances within the data; we avoid this issue by collecting a perfectly balanced
data set. We include knn in this work for its simplicity, long-standing ubiquity, and the fact
that it has been proposed as a basic benchmark against which to evaluate other classification
methods [14]. The knn classifier was also used successfully in some related works described
in Chapter 3.

2.2 Amazon Echo Data Paths

When a user makes a verbal request to the Echo, little of the requisite computation
takes place on the Echo’s hardware. The Echo device simply listens for the wake word
“Alexa” and then streams the audio of the user request to Amazon’s servers running the
Alexa cloud service. Next, the Alexa service interprets the user’s recorded speech and decides
how to respond. The response is sent to the Echo device and delivered to the user via the
speaker. At the same time, a visual response is delivered from the Alexa service to the
user’s smartphone, tablet, or computer if it is running the Alexa companion application.
For example, in response to the question “Alexa, who is Thomas Jefferson?”, the Echo might
respond “Thomas Jefferson was the third president of the United States of America” audibly
through the speaker while simultaneously delivering a link to Jefferson’s Wikipedia page
within the companion application. This described process is depicted in Figure 2.3.
This work is specifically interested in the data moving between the Echo device and
the servers running the Alexa cloud service. We seek to explore what ostensibly private
information an eavesdropping third party could infer by intercepting these data. To that
end, our data collection environment, detailed in Appendix A, essentially amounts to placing
a computer between the Echo and the servers that records all traffic passing through on its
way to its intended destination (see Figure 2.4).

2.3 TCP Network Traffic

The Transmission Control Protocol (TCP) is a set of rules that enables a connection
between two computers and governs the delivery of data over the Internet Protocol (IP).

Figure 2.3: The flow of data when using the Echo. This study focuses on the paths labeled
in red.

TCP provides reliable, ordered, error-checked, bidirectional communication. All information
transferred via TCP is separated into packets, and each packet consists of a header and a
payload. The header is information for connection management, while the payload is the
actual data that needs to be transferred. When a device needs to send data via TCP, it
divides the data into chunks, adds a TCP header to each chunk, encapsulates the chunk and
header into an IP datagram, and sends the resultant TCP/IP packet on its way. Figure 2.5
illustrates a TCP/IP packet, and many of the fields therein are referenced in Appendix B. A
specific TCP data exchange begins with a “handshake” between the two participants, e.g.,
the Echo client and the Alexa server, and ends with a special flag called the “FIN” flag that
designates the accompanying data as the last information from the sender. The first packet
from a sender is called an “SYN” packet because it serves to synchronize the conversation
sequence between the client and server. The Echo and the Alexa service communicate with

Figure 2.4: Our data collection environment places a gateway machine between the Echo
and the Alexa cloud service.

each other via TCP.

2.4 Encryption

Encryption is the practice of encoding data in such a way that only authorized parties
can understand it. Unauthorized parties are thereby denied intelligible content. All commu-
nications between the Echo and Amazon’s servers are encrypted. In other words, the data
payloads of the transmitted TCP packets are purposefully obfuscated such that an eaves-
dropping third party cannot glean any meaningful information by inspecting them. Thus,
deep packet inspection (DPI), i.e., the practice of inspecting the data payloads of network
packets, is not useful here. We therefore use shallow packet inspection (SPI), i.e., the prac-
tice of examining packet headers and statistical information regarding traffic patterns, to
generate feature data for our machine learning classifiers. Without encryption, it would be
trivial to discern all information traveling between the Echo and Alexa. This work aims to
discover what information we can extract despite the encryption.

Figure 2.5: The fields in a TCP/IP packet.

CHAPTER 3
RELATED WORK

To the best of our knowledge, this study is the first to use machine learning to glean
information from Echo traffic. There are several related problems that are well researched
and, therefore, discussed in this chapter. Specifically, notable papers within two related
domains are discussed. We also briefly describe existing work that concerns the Echo, though
this previous research does not consider the associated network traffic.

3.1 The Identification Problem

One relatively popular research area that is related to our work is termed the identification
problem. This problem consists of identifying or classifying encrypted web traffic. The
interest in this problem stems largely from its utility for surveillance, network management,
and security applications. For example, a campus network might want to prevent online
games from hogging available bandwidth. The first step in doing so is to separate game
traffic from other traffic, even if the traffic is encrypted. In this section, we discuss several
papers that have addressed versions of this problem, their relationships to our research, and
characteristics that differentiate our work from this preexisting body of knowledge.
Li and Moore [15] investigated real-time traffic classification for network monitoring and
intrusion detection. Li and Moore prioritized low latency and high throughput in their
classification algorithms to ensure that classification would be viable in real-time without
hindering network users. The increasing ubiquity of encapsulated and encrypted Internet
traffic motivated this study. Li and Moore viewed individual TCP flows as their basic object
of classification, where flows were bi-directional sessions between two participants uniquely
identified by the host-IP address, client-IP address, host port, client port, and timestamp of
the first packet. The host and client were determined by observing an SYN packet. Li and
Moore classified these flows into 10 general categories (such as mail and games) with 99.8%

12
accuracy using a C4.5 decision tree. This particular machine learning algorithm provided
a good balance between speed and accuracy. The data analyzed were two consecutive days
of real TCP network traffic from a research facility of roughly 1000 employees. The input
features for the decision tree were derived from packet headers.
Li and Moore’s high accuracy is partially attributable to favorable conditions not present
in our Echo study. First, although [15] states that the approach does not rely on port numbers
for classification, the client and server port numbers were part of the input feature vectors.
In fact, the server port number was the most discriminative individual feature used. We
cannot consider port numbers to differentiate between our classes of Echo communications,
because the client and server ports do not change with the class. Additionally, the 10
classes used by Li and Moore separate a wide range of traffic into 10 general bins. The 10
classes are as follows: web-browsing, mail, bulk (file transfer protocol traffic), attack (port
scanning, worms, viruses, etc.), peer-to-peer, database, multimedia, service, interactive, and
games. We posit that the differences between these classes of traffic are more pronounced
than the differences between different requests to Alexa. Finally, though it is convenient to
consider TCP flows as defined by Li and Moore as fundamental objects, the Echo traffic is
not discretized into flows. In other words, there is not one TCP flow per request to Alexa.
The authors of [16] use a hybrid of the k-means clustering algorithm and the k-nearest
neighbors geometric classifier to categorize 12 million flows from two Internet edge routers.
The classes in this work are HTTP, SMTP, POP3, Skype, EDonkey (peer-to-peer file shar-
ing), Bit Torrent (peer-to-peer file sharing), Real-time Transport Protocol, and ICQ (a mes-
saging program). Again, we expect that these classes are generally more disparate than ours.
The authors’ hybrid algorithm achieves classification accuracy between 94.0% and 99.9%,
depending on the traffic class. This performance is better than the 83% average accuracy
achieved by k-means clustering alone. Furthermore, although the hybrid algorithm is less
accurate on average than the k-nearest neighbors algorithm, which achieved an overall accu-
racy of 99.1%, the hybrid algorithm is faster and can classify traffic in real-time. Also, this

technique uses only the packet headers to gather features, so it is invariant to encryption.
To emphasize this point, the authors show that their algorithm is as accurate at classify-
ing unencrypted Bit Torrent peer-to-peer traffic as it is at classifying encrypted Bit Torrent
peer-to-peer traffic. We also note that this algorithm is robust to packet padding, which is
a technique used by Bit Torrent and other applications to avoid being detected by traffic
classification algorithms that only look at the first few packets of each flow. The 17 features
used in [16] are associated with (1) the amount and rate of data transfer in each direction
and (2) the protocol used. We note that the authors of [16] do not use the port numbers that
Li and Moore found valuable. The work in [16] relies on statistical features, however, which
prompted the authors to disregard “short flows” (those with 15 or fewer data packets). Our
work does not have a minimum required size in classifying Alexa requests or responses.
Much of the literature regarding the identification problem builds upon the work of
Moore, Zuev, and Crogan [17]. This paper describes a comprehensive set of 248 features that
can be collected from TCP flows (with or without encryption) and then used for classification.
A significant amount of recent research in this area uses these features or a subset thereof.
The authors provide public access to sample data sets and software capable of producing
their feature vectors from captured packets. We cannot use their features in our study
because of the definition of a flow. TCP flows have clearly defined starts and stops due to
special packets; however, as mentioned previously, each Alexa request does not necessarily
constitute a single flow. Instead, we use a finer granularity in terms of classifiable units of
network traffic. Nonetheless, we draw inspiration from [17] when selecting our own features.
The author of [18] used a subset of the features outlined in [17] to compare several different
machine learning techniques in the task of classifying TCP flows by web application for the
16 most commonly used web services at the Air Force Institute of Technology. In this work,
the J48 decision tree (which is an open source Java implementation of the C4.5 decision
tree algorithm) and AdaBoost+J48 were the two most successful algorithms; both of these
algorithms had a 98% classification accuracy. We note that the simple J48 decision tree

had a faster average runtime than the AdaBoost+J48 ensemble. The other machine learning
algorithms evaluated in [18] follow: support vector machine, Naive Bayes classifier, and Naive
Bayes tree. The success of the decision tree in both [15] and [18] motivates our investigation
of this method, despite the differences between our data and the data used in these studies.
A narrower version of the identification problem is investigated in [19]. This study inves-
tigates AdaBoost, support vector machines, Naive Bayes, C4.5 decision trees, and RIPPER
(Repeated Incremental Pruning to Produce Error Reduction, a depth-first rule induction
based algorithm) in the tasks of identifying encrypted Skype or SSH traffic in large traces
from several research institutions. The authors of [19] found that the C4.5 decision tree
was the best algorithm. They achieved detection rates greater than 80%, with false positive
rates less than 10%, for detecting both SSH and Skype in four sets of data from different
networks. That is, the training data came from a different network than the testing data.
A variant of the identification problem is presented in [20]. This study attempts to
distinguish between roughly 100,000 different web pages based solely on information that
is available to a third party intercepting packets from users that accessed the websites via
an encrypted channel. The available information is essentially HTTP object count and size.
The authors discussed that a relatively straightforward algorithm could identify many of
the websites with low false positive rates, unless significant padding to obfuscate the actual
website content exists. As discussed previously, these data sets are fundamentally different
in nature than our Echo traffic.
An open source tool called “Pacumen” aims to solve the identification problem through
machine learning in a way that requires less training data and is easier for network ad-
ministrators to use than the previously discussed academic approaches [21]. Rather than
classifying TCP flows as done by the previously discussed studies, the authors in this work
classify temporal windows of traffic. Each time window could be the start, end, or middle of
one or more flows. We note that we can extract similar features in our work with the Echo
traffic. In terms of machine learning algorithms, the authors of [21] found top performance

with decision trees, though their custom version of a decision tree outperformed the more
common C4.5 tree on some of their data sets. The classes in this study are similar to those
in [18], though these authors introduce the browser as a differentiator between classes. In
other words, gmail on Google Chrome is a different class than gmail on Firefox. In addition
to introducing and evaluating their Pacumen tool, the authors provide a concise survey of
previous research on the identification problem as well as an example of a typical feature set
for machine learning in this domain. We draw inspiration from this typical feature set in
selecting our features in Chapters 4 and 5.

3.2 Speaker Identification in VoIP

The authors of [22] investigated speaker identification in encrypted voice over Internet
protocol (VoIP) traffic. The techniques developed in [22] exploit the concept of voice activity
detection (VAD), a widely used technique for reducing the bandwidth consumption of voice
traffic (see Figure 3.1), and show that VAD techniques create patterns in the encrypted
traffic. These patterns reflect pauses in the underlying voice stream, and can undermine the
anonymity of the speaker in encrypted voice communication. In this work, the authors used
data from 20 speakers and achieved an identification accuracy of roughly 48%. They used
clustering techniques with various “distances” between speaker profiles. The techniques
developed in [22] depend on differentiating between speech packets and pause packets by
using differences in packet size. Encoding and decoding audio data with VAD requires a VAD
codec. A codec is a program that compresses data for faster transmission and decompresses
received data. With the particular VAD codec used in this study, the authors observed that
speech packets were roughly six times the size of pause packets, so distinguishing between
the two was easy. We observe no such clear divide in packet sizes within our data. We also
note that the data in [22] never actually transited a network; instead, it was VAD encoded
as though in preparation for network transmission. Therefore, none of the noise associated
with transiting a network (e.g., dropped, delayed, or retransmitted packets) affected this
study. Lastly, we note that the VAD encoded data used in [22] were not actually encrypted.

The authors state that the techniques developed and results reported on their unencrypted
packets would be the same even if the packets were encrypted because “the encryption
schemes used for VoIP are largely length-preserving”.

Figure 3.1: VAD and VBR techniques (blue) vary the bandwidth used to transmit audio
with the complexity of the audio. Constant bitrate encoding (orange) does not consider
audio complexity.

A similar study identified the speaker in VoIP traffic via patterns caused by variable
bitrate (VBR) encoding of voice data [23]. VBR is similar in purpose to VAD; both tech-
niques vary the bandwidth used to transmit sound with the complexity of the sound. Low
complexity audio does not require as much bandwidth, while high complexity audio needs
to be transmitted in more detail (see Figure 3.1). The authors of [23] achieved identification
accuracy of 70-75% among 10 possible speakers. While 10 is a small number of possible
speakers, we note that the Echo is intended as a household device and a typical household
has fewer than 10 regular members. Similar work has attempted to determine the language
spoken and specific phrases uttered in encrypted VoIP traffic [24] [25].

3.3 Recovering User Information from the Echo

The authors of [26] examined the Echo as a potential source of forensically relevant digital
information. They concluded that the Echo serves primarily as a conduit for interfacing
with other services and were unable to obtain meaningful information from the Echo device
itself. Therefore, the authors of [26] turned their attention to the tablet running the Alexa
companion application. They learned that, because Alexa is a cloud-based service, the only
information discovered on the tablet was timestamps for the commands given to Alexa and
the responses to user commands. The authors of [26] did not investigate the network traffic
between the Echo and the Alexa service, which is the focus of our work.

CHAPTER 4
REQUEST TYPE CLASSIFICATION

This chapter describes our efforts to identify the type of request being answered by Alexa
using the encrypted packets sent from the Alexa cloud service to the Echo device in response
to the user’s request. We were largely successful in this endeavor. We begin our discussion
by describing our data set, and then cover our machine learning processes and their results.

4.1 Our Data

We collected the data used in this problem via the data capture environment described
in Appendix A. An extensive search did not discover any similar data sets available for
use in this research. We collected the following six classes of data for this classification
problem: information, quotes, weather, directions, music, and unintelligible. We collected
130 examples of each class for a total of 780 packet capture files. The information class is
comprised of questions of the form “Alexa, who is Thomas Jefferson?” We used 130 different
names to add diversity to the data set and avoid any effects of possible server-side caching.
Each user request in the quotes class is “Alexa, give me a quote.” The 130 responses from
the Alexa servers are all unique. The weather class questions have the form “Alexa, what is
the weather in Paris?” with 130 different places cited. The music class is comprised of 130
different song samples. User requests in the music class have the form “Alexa, play me a
sample of Hey Jude by The Beatles” with 130 different pairs of song title and artist. The
user requests for directions have the form “Alexa, get me directions to the Home Depot.”
When collecting data in the directions class, we noticed a perceptible difference in response
time based on our proximity to the location in question. For example, when collecting
data in Golden, Colorado, the request “Alexa, get me directions to the Colorado School
of Mines” would receive a reply markedly more quickly than “Alexa, get me directions to
Anchorage, Alaska”. To avoid this variation in response time and collect data in keeping

with what we believe to be the typical use case for the direction functionality, we restricted
our direction requests to places within roughly a 30-minute drive from our data collection
environment. While we were not aware of 130 different such places, over half of our requests
contain unique locations and we varied their order to avoid server-side caching. Since traffic
information changes over time, two direction responses from Alexa to the same location are
likely different. Thus, we believe that server-side caching was not taking place. The user
requests in the unintelligible class are 130 unique, linguistically invalid sentences including
nonsense words. Alexa’s responses to the unintelligible requests, though varied, are along
the lines of “Sorry, I didn’t quite catch that.”
We use 80% of each class (104 examples) for training our machine learning models and
20% of each class (26 examples) for testing the models. Our train/test splits are computed
randomly, so the individual data points that end up in a given data set (training and testing)
vary across different train/test splits. However, we always use stratified train/test splits
which means that the proportion of each class in the training data is always the same as the
proportion of that class in the testing data. Since we collected the same number of examples
of each class, the classes are always represented equally in the training and testing data.
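A minimal sketch of this splitting procedure with Scikit-learn follows; the feature matrix `X` is a random placeholder standing in for our extracted feature vectors.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

classes = ["information", "quotes", "weather", "directions", "music", "unintelligible"]
y = np.repeat(classes, 130)       # 780 labels, 130 per class
X = np.random.rand(780, 35)       # placeholder feature vectors

# stratify=y keeps all six classes equally represented in both subsets;
# omitting random_state gives a different random split on every trial.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

print(Counter(y_train))  # 104 examples of each class
print(Counter(y_test))   # 26 examples of each class
```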

4.2 Feature Extraction

Inspecting our data revealed that each request/response pair between the user and the
Echo does not necessarily correspond to a TCP flow as defined by [15], [16], [17], and [18].
Though the exchange occurs over TCP, it is not punctuated by the starts and stops that
delimit a flow.
and flows when collecting traffic bidirectionally. Thus, the tools discussed in [17] are not
useful. We therefore turn to other feature vectors. We extract single feature vectors from
entire packet capture files, where each file corresponds to one request/response. In this study,
we investigate three different sets of features: “tcptrace”, “histogram”, and “combined”.
Sections 4.2.1, 4.2.2, and 4.2.3 describe these three types of feature vectors.

Some machine learning algorithms, e.g., support vector machines, assume input data
features with zero mean and unit variance. To accommodate such algorithms on all three
of our feature sets, we fit a transform on the training data that standardizes each feature to
zero mean and unit variance, and then apply that transform to both the training and testing
data sets.
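A minimal sketch of this standardization step, assuming Scikit-learn's StandardScaler and placeholder feature matrices, follows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(624, 35)  # placeholder training features
X_test = np.random.rand(156, 35)   # placeholder testing features

# Fit the standardization on the training data only, then apply it to both
# sets, so no statistics of the test distribution leak into training.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```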

4.2.1 tcptrace Feature Vector

Our first feature vector is called “tcptrace” because the feature extraction code leverages
a tool called “tcptrace” to create statistical features from packet capture files [27]. We
manipulate these features into a numerical vector form that is amenable to typical machine
learning algorithms. Appendix B shows the features from [27] that we collect with a brief
description of each. We believe that the features in Appendix B are similar to the “typical”
feature set described in [21], with additional information. Of the 35 tcptrace features listed
in Appendix B, 13 were useless in our study because they were constant for all collected
data. The 13 discarded values follow: sack pkts sent, dsack pkts sent, max sack blks/ack,
zwnd probe pkts, zwnd probe bytes, SYN/FIN pkts sent, urgent data pkts, urgent data
bytes, zero win adv, stream length, missed data, truncated data, and truncated packets.
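For illustration, the sketch below shows one way to invoke tcptrace and recover a single statistic from its long-format output; the output layout varies across tcptrace versions, so the parsing pattern and the file name capture.pcap are assumptions rather than our exact extraction code.

```python
import re
import subprocess

# -l requests tcptrace's long-format per-connection statistics and -n skips
# name resolution. The output layout varies across tcptrace versions, so the
# parsing pattern below is illustrative rather than definitive.
out = subprocess.run(["tcptrace", "-l", "-n", "capture.pcap"],
                     capture_output=True, text=True, check=True).stdout

# Lines such as "max win adv:  65535 bytes" appear once per direction.
max_win_advs = [int(v) for v in re.findall(r"max win adv:\s*(\d+)", out)]
print(max_win_advs)
```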

4.2.2 Histogram Feature Vector

Our results on the speaker identification problem in Chapter 5 prompted us to explore
some alternative feature vectors that we also applied to the request type classification prob-
lem. Namely, we create a “histogram” feature vector from each packet capture file. The first
half of the features in this vector come from a normalized histogram of packet sizes. We
create this histogram based on a parameter-specified number of bins that evenly divide the
interval between the size of the smallest packet in our data set and the size of the largest
packet in our data set. Then, for each packet capture file, the histogram is created by cal-
culating the number of packets that fall into each bin. Finally, the histogram is normalized
to remove any information regarding the total number of packets in the packet capture file.

Each feature in our vector, then, corresponds to the value in one bin of the histogram. The
second half of the features in the vector come from another normalized histogram calcu-
lated in the same manner, but with packet interarrival times instead of packet sizes. Both
histograms use the same number of bins. For example, given 15 bins, the resultant feature
vector would contain 30 features, 15 for the packet size histogram and 15 for the packet
interarrival time histogram.
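The sketch below illustrates this construction, assuming Scapy for pcap parsing; the bin ranges shown in the usage comment are hypothetical placeholders for the data-set-wide extremes.

```python
import numpy as np
from scapy.all import rdpcap  # assumes Scapy is available for pcap parsing

def histogram_features(pcap_path, bins, size_range, iat_range):
    """Concatenated, normalized packet-size and interarrival-time histograms."""
    packets = rdpcap(pcap_path)
    sizes = np.array([len(p) for p in packets], dtype=float)
    times = np.sort(np.array([float(p.time) for p in packets]))
    iats = np.diff(times)

    # Fixing range=... pins the bin edges to data-set-wide extremes, so every
    # capture file is binned identically.
    size_hist, _ = np.histogram(sizes, bins=bins, range=size_range)
    iat_hist, _ = np.histogram(iats, bins=bins, range=iat_range)

    # Normalizing discards all information about the total packet count.
    size_hist = size_hist / max(size_hist.sum(), 1)
    iat_hist = iat_hist / max(iat_hist.sum(), 1)
    return np.concatenate([size_hist, iat_hist])  # 2 * bins features

# e.g., histogram_features("capture.pcap", bins=15,
#                          size_range=(40, 1514), iat_range=(0.0, 1.0))
```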
The rationale behind this histogram feature vector is twofold. First, we wanted features
that captured packet size information, especially relative frequencies of different packet sizes,
since packet size was important to previous work in speaker identification in VoIP traffic (see
Section 3.2). Second, we wanted a feature vector that contained no information regarding the
total number of packets or the total amount of information transmitted in a packet capture
file. Having such a feature vector allows us to verify whether classification is possible without
knowing the number of packets or number of bytes in a request/response pair.
We consider the histogram feature vector both with and without ACK packets. Discard-
ing the ACK packets is motivated by the idea that we are interested only in the sizes of the
data payloads, and including ACK packets would skew the packet size histogram towards
smaller sizes, which could obfuscate subtler information in the bins containing larger packet
sizes. However, we also consider the feature vector with ACK packets just in case they
provide some valuable information.
We also seek to optimize the number of bins for this feature vector. Table 4.1 shows how
a random forest classifier performs with various bin sizes. We see that, for every number
of bins under 200, more bins yields better results. However, we see drastically diminishing
returns after 15 bins, especially without ACK packets. To keep computational costs to a
reasonable level, we used 15 bins for our histogram feature vectors in the remainder of our
work. This decision reduced the experimental runtimes on our hardware by durations up to
hours, which allowed us to cover more exploratory ground than would otherwise have been
feasible. We use the histogram features without ACK packets for the remainder of this
thesis because discarding ACK packets gave better performance on our chosen number of
bins (15).

Table 4.1: Mean accuracy for a 400-tree random forest using the histogram feature vectors
with various bin sizes both with and without ACK packets. Averaged over 100 random
stratified train/test splits.

Number of Bins   Mean Accuracy with ACK Packets   Mean Accuracy without ACK Packets
5                85.16                            82.01
10               90.59                            88.00
15               92.34                            93.21
25               92.40                            93.25
50               92.97                            93.30
100              93.92                            93.96
200              92.72                            93.87

4.2.3 Combined Feature Vector

Our third and final feature vector, which we dub “combined”, is the tcptrace features
combined with the histogram features. This feature vector contains 65 elements, the 35
tcptrace features and the 30 histogram features. We included the combined vector in this
study in hopes that each constituent feature vector would contribute some unique informa-
tion allowing for better classification performance than either of the two constituent feature
vectors alone.

4.3 Results

We noticed significant variation in our results that depended on how the data were
divided into training and testing sets. For that reason, and because several of our machine
learning algorithms include some stochasticity, presenting results based on a single trial of
each algorithm with a single set of training data and a single set of testing data could be
misleading. Thus, instead, we present the aggregated results of 100 trials with random
stratified train/test splits. In each trial, we used stratified 3-fold cross validation to tune
salient model hyper-parameters via a grid search over possible hyper-parameter values. We

do not tune hyper-parameters on testing data, as doing so would overestimate model efficacy.
We note, however, that the chosen hyper-parameters are not always the same in each trial.
We believe our chosen method provides a more thorough and accurate evaluation of our
models than would be possible with the more common single-trial approach.
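The sketch below outlines this evaluation protocol with Scikit-learn, using placeholder data and, as an example grid, the linear-kernel support vector machine's C values from Section 4.4.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

X = np.random.rand(780, 35)       # placeholder feature vectors
y = np.repeat(np.arange(6), 130)  # six balanced classes

accuracies = []
for trial in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
    # Grid search with stratified 3-fold cross validation on the training data
    # only; the C grid shown is the one listed in Section 4.4.
    search = GridSearchCV(SVC(kernel="linear"),
                          param_grid={"C": [0.1, 0.5, 1.0, 5.0, 10.0]},
                          cv=StratifiedKFold(n_splits=3))
    search.fit(X_tr, y_tr)
    accuracies.append(search.score(X_te, y_te))  # accuracy on the held-out 20%

print(np.mean(accuracies), np.std(accuracies), np.median(accuracies))
```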
We leveraged the algorithmic implementations of Scikit-learn [28] for the six machine
learning models evaluated. Table 4.2 presents our accuracy results for each of our three
feature vectors with the different machine learning algorithms considered. We treat accuracy
as our metric of interest for several reasons. First, our data set is perfectly balanced, i.e., no
one class appears more often than any other (we collected 130 examples per class). Therefore,
metrics that account for class imbalances (e.g., F1-score) are unnecessary. We did, however,
evaluate F1-score and found the result extremely close to, or identical to, accuracy. Second,
no one type of prediction error is more costly than any other. Thus, we treat all errors
as equally costly and have no need for weighting. For the sake of thoroughness, we present
confusion matrices for our highest performing algorithm for each feature vector in Figure 4.1,
Figure 4.2, and Figure 4.3.
Table 4.2 shows that the random forest is the best performing classifier for our three
feature vectors. We note that, for each feature vector, the difference between the random
forest mean accuracy and the mean accuracy of the second best algorithm is statistically
significant with p < 0.0001. Furthermore, Table 4.2 indicates that the combined feature set
provides the best result, while the histogram feature vector performs the worst. The differ-
ence in the accuracy means for the random forest using tcptrace features versus histogram
features is statistically significant with p < 0.0001, as is the difference in accuracy means for
random forest using the combined feature vector versus the histogram features. However, the
difference in accuracy means for the random forest using the combined feature vector versus
the tcptrace features gives p = 0.427, and the 95% confidence interval for the difference is
-0.4869% to 0.2069%. These statistics indicate that using a random forest with the tcptrace
features alone is essentially equivalent to using a random forest with the combined features.
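As an illustration of such a comparison, the following sketch applies a two-sample t-test (one conventional choice, not necessarily the exact test used here) to placeholder per-trial accuracies drawn from the means and standard deviations reported in Table 4.2.

```python
import numpy as np
from scipy import stats

# Placeholder per-trial accuracies drawn from the means and standard
# deviations reported in Table 4.2 (random forest: combined vs. tcptrace).
acc_combined = np.random.normal(97.01, 1.14, 100)
acc_tcptrace = np.random.normal(96.87, 1.34, 100)

# Two-sample t-test on the difference in mean accuracy across the 100 trials.
t_stat, p_value = stats.ttest_ind(acc_combined, acc_tcptrace)

# Approximate 95% confidence interval for the difference of means.
diff = acc_combined.mean() - acc_tcptrace.mean()
se = np.sqrt(acc_combined.var(ddof=1) / 100 + acc_tcptrace.var(ddof=1) / 100)
print(p_value, (diff - 1.96 * se, diff + 1.96 * se))
```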

Table 4.2: Accuracy results for 100 trials with different train/test splits for our six machine
learning algorithms on our three different feature vectors. Largest results are in bold.

Feature Vector   Model              Mean Accuracy   Standard Deviation   Median Accuracy

tcptrace         Decision Tree      95.69           1.58                 95.51
                 Random Forest      96.87           1.34                 96.79
                 SVM (linear)       94.01           1.68                 94.23
                 SVM (RBF)          94.03           1.69                 94.23
                 Neural Net (MLP)   94.24           1.57                 94.23
                 KNN                94.08           1.84                 94.23

Histogram        Decision Tree      87.05           2.29                 87.18
                 Random Forest      93.23           1.82                 93.27
                 SVM (linear)       86.89           2.31                 87.18
                 SVM (RBF)          87.91           2.31                 87.82
                 Neural Net (MLP)   88.15           2.36                 88.46
                 KNN                88.87           2.09                 89.10

Combined         Decision Tree      95.42           1.68                 95.51
                 Random Forest      97.01           1.14                 97.44
                 SVM (linear)       96.12           1.47                 96.15
                 SVM (RBF)          95.56           1.23                 95.51
                 Neural Net (MLP)   95.07           1.44                 95.19
                 KNN                95.72           1.71                 95.51

The confusion matrices in Figure 4.1, Figure 4.2, and Figure 4.3 provide details on the
best performing classification algorithm (random forest) for each feature vector. The most
often misclassified class is information. Regardless of the feature vector used, the information
class is often mistaken for weather. Furthermore, when misclassifying, the classifier tends
to erroneously predict information for examples belonging to all of the other classes except
music. The music class is always classified correctly, and samples from other classes are
rarely mistaken for music. Intuitively, this result makes sense because the musical output
is fundamentally different in nature from the speech output of the other five classes. In
Chapter 6, we discuss ideas for adding other classes that are similar to the music class.

Figure 4.1: Confusion matrix averaged over 100 different train/test splits for random forest
using tcptrace feature vectors.

4.4 Machine Learning Algorithm Hyper-parameters

For the decision tree, the only hyper-parameter we tune is whether to use Gini impurity or
information gain as the splitting criterion. We require full purity in all leaves as the stopping
criterion. With the tcptrace feature vectors, we found that 55 cross validation trials selected
the Gini impurity while 45 cross validation trials selected information gain, indicating that
the two options are essentially equivalent. With the histogram feature vectors, 54 trials

Figure 4.2: Confusion matrix averaged over 100 different train/test splits for random forest
using histogram feature vectors.

chose information gain to Gini impurity’s 46, indicating again that the two are essentially
equivalent. The combined feature vectors showed a higher preference for information gain,
with 69 trials selecting information gain and only 31 trials selecting Gini impurity.
For the random forest classifier, we always use 400 trees. We believe that 400 trees is
sufficient because our results did not differ in any statistically significant way with 500 trees.
We use full leaf purity as the stopping criterion in building the decision trees that comprise the
random forest. We again use cross validation to choose between Gini impurity and entropy
as the splitting criterion for each trial. The separate tcptrace and histogram feature vectors
showed a slight preference for Gini impurity, while the combined feature vector showed a
slight preference for entropy. In regards to the number of features to consider at each node,
we test both the square root of the total number of features and the log base 2 of the total
number of features. We found that the square root of the total number of features is selected
more often for all three of our feature vectors.
For the support vector machine with linear kernel function, the only parameter that we
tune is the penalty parameter of the error term, often dubbed “C”. The parameter C governs
the trade-off between finding a hyperplane that correctly classifies as many training points as

Figure 4.3: Confusion matrix averaged over 100 different train/test splits for random forest
using the combined feature vectors.

possible, and finding a hyperplane that generally has a large margin between separate classes,
even if that means allowing some misclassifications in training. High values of C prioritize
correctly classifying all training points, but, if C is too large, the learned hyperplane may
overfit the training data by being too sensitive to outliers during training. Low values of C
prioritize a large-margin hyperplane, even if some training points are misclassified. However,
if C is too small, the learned hyperplane will needlessly misclassify many points (even with
linearly separable data) because misclassifications are not sufficiently penalized. We test the
values 0.1, 0.5, 1.0, 5.0, and 10.0 for C. We believe that these values allow us to cover a
significant range of reasonable choices, while maintaining a practical runtime for the tuning
process. We found the best cross validation performance on tcptrace features with higher
values of C, with C = 5 or C = 10 being selected in 30 and 28 trials respectively. In
comparison, C = 1 was only selected in 21 trials with the tcptrace features. There was a
preference for C = 1 and C = 0.5 on the histogram features, with those values being chosen
in 34 and 37 trials respectively. Lower values of C proved to be better for the combined
feature set, with C = 0.1 and C = 0.5 chosen in 53 and 38 trials respectively. One possible
explanation for why these lower values of C gave better performance with the combined

features than with the other two feature sets concerns outlier presence in the training data.
Perhaps greater outlier presence in the combined features made the support vector machine
more prone to overfitting with high values of C.
For the support vector machine with a radial basis kernel function, we tune the kernel
coefficient gamma in addition to the penalty parameter C. For all feature vectors, the most
often optimal value for gamma is 1 over the number of features. There is also a pronounced
tendency toward higher values of C for all feature vectors. With the tcptrace features,
C = 10 was chosen 52 times, C = 5 was chosen 43 times, and C = 1 was selected 5 times.
With the histogram features, C = 10 was chosen 61 times and C = 5 was chosen 39 times.
For the combined feature vectors, C = 10 was chosen 53 times and C = 5 was chosen 47
times.
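As a sketch, the analogous grid searches for the two support vector machine variants might look as follows (again with Scikit-learn [28], and X_train and y_train as assumed placeholders); the alternative gamma values shown are illustrative assumptions, while gamma="auto" corresponds to 1 over the number of features.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Linear kernel: only the error penalty C is tuned.
linear_search = GridSearchCV(
    SVC(kernel="linear"), {"C": [0.1, 0.5, 1.0, 5.0, 10.0]}, cv=3
)
linear_search.fit(X_train, y_train)

# Radial basis function kernel: tune C and the kernel coefficient gamma.
rbf_search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 0.5, 1.0, 5.0, 10.0], "gamma": ["auto", 0.01, 0.001]},
    cv=3,
)
rbf_search.fit(X_train, y_train)
print(linear_search.best_params_, rbf_search.best_params_)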
It was infeasible for us to explore the entire hyper-parameter space for neural networks in
this study due to the numerous architectural choices and various other quantitative hyper-
parameters. We therefore restricted our tuning to three multi-layer perceptron architectures: 1 hidden layer with 100 nodes, 1 hidden layer with 300 nodes, and 2 hidden layers with 100 nodes each. We also evaluated both the logistic sigmoid and rectified linear unit activation functions. The rest of the hyper-parameters are set to the Scikit-learn defaults [28]. We found that the choice of hidden layer architecture made no appreciable difference in performance. The rectified linear unit function is chosen as the superior activation function in the vast majority of trials for all feature vectors.
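A corresponding sketch of the multi-layer perceptron search, with all unlisted hyper-parameters left at the Scikit-learn defaults [28]:

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# The three architectures and two activation functions evaluated above.
param_grid = {
    "hidden_layer_sizes": [(100,), (300,), (100, 100)],
    "activation": ["logistic", "relu"],
}
search = GridSearchCV(MLPClassifier(), param_grid, cv=3)
search.fit(X_train, y_train)  # X_train, y_train assumed as before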
For k-nearest neighbors classification, the hyper-parameters that we tune are k (the
number of neighbors to consider), whether to weight the neighbors based on their distance
to the point being classified, and whether to use the Manhattan or Euclidean distance for
this weighting. The tcptrace and histogram feature vectors both result in a preference for
weighting points by the Manhattan distance, while the combined feature vector showed a
preference for considering the neighbors unweighted. For the tcptrace features, k = 3 and
k = 5 are the preferred values, chosen in 37 and 24 trials respectively. Cross validation
tends to select k = 5 for the other two feature vectors (in 46 trials for the histogram features
and 50 trials for the combined features). The second most common value of k for both the
histogram features and the combined features was k = 3, selected in 20 cross validation trials
for the histogram features and 35 trials for the combined features.
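The k-nearest neighbors search can be sketched the same way; the exact list of k values tried is an assumption, since only the most frequently selected values are reported above.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# p=1 selects the Manhattan distance and p=2 the Euclidean distance.
param_grid = {
    "n_neighbors": [3, 5, 7, 9],  # assumed grid; 3 and 5 were most often chosen
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
search.fit(X_train, y_train)  # X_train, y_train assumed as before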

4.5 Relative Importance of Features

An advantage of random forest classifiers is that, once trained, they can give information
about the relative discriminative power of each input feature. In the random forest imple-
mentation that we use, this is accomplished by calculating the mean decrease impurity for
each feature. The impurity decrease at any given node for the feature on which it splits is the impurity of the data that arrived at the node minus the sum of the child node impurities, each weighted by the fraction of the node’s data that it receives. The mean decrease impurity for any given feature is the average impurity decrease
over all nodes within the forest that split on that feature, weighted by the amount of data
that reaches each relevant node. The greater a feature’s mean decrease impurity, the more
discriminative that feature is. Due to the various stochastic elements of our random forest
training process, the mean decrease impurities for each feature are not consistent across dif-
ferent trials or different train/test splits. Despite this inconsistency, we can observe trends
that exist regardless of the stochasticity.
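In the Scikit-learn implementation [28], these values are exposed through the feature_importances_ attribute of a fitted forest; a minimal sketch, where feature_names is an assumed list naming each column of the feature matrix:

import numpy as np

# forest is a fitted RandomForestClassifier; feature_importances_ holds the
# normalized mean decrease impurity of each feature.
importances = forest.feature_importances_

# Rank the features from most to least discriminative.
for idx in np.argsort(importances)[::-1]:
    print(f"{feature_names[idx]}: {importances[idx]:.4f}")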
For the tcptrace features, the most important features are those pertaining to the window
advertisement. The top three most discriminative features are typically max win adv, min
win adv, and avg win adv. Initial window (either packets or bytes) is also usually in the
top six features. Using only these five window advertisement features, the random forest
algorithm achieves a mean accuracy of 91.92%. In TCP communication, the window advertisement is one party telling the other how much data it is currently willing to receive (i.e., its available receive buffer space). Since our data for this request type classification problem consists of packets coming
from Amazon’s servers to the Echo, the window advertisements specify how much data the
server is willing to receive from the Echo. One reason why the music class is so easy to
identify is that its window sizes are typically about ten times smaller than the window sizes
in the other five classes. Based on changes in the source IP address, we hypothesize that
the servers responsible for streaming music to the Echo are different from the servers that
receive, interpret, and respond to Echo voice recordings. We consistently observe a change
in server IP address when the response shifts from speech (e.g., “Here is a sample of No One
by Alicia Keys”) to music. The music streaming servers are likely not expected or equipped
to receive much data from the Echo, so they advertise a small window. It is not clear why
the window advertisements differentiate the five non-music classes so well. Since we collected
our data one class at a time, it is possible that each class has different window advertisement
characteristics because the server was under a different load when each class was collected.
We do not believe this to be the case, however, both because of significant variability in average window advertisement within classes (1013 bytes to 20576 bytes within
the information class), and because of similar average window advertisements across classes
(many window advertisements of roughly 4500 bytes appear in all classes except music and
directions).
A metric quantifying the amount of data observed, such as unique bytes sent, actual data pkts, or actual data bytes, is also typically in the top five tcptrace features. If we use only the features quantifying the amount of data (i.e., total packets, unique bytes sent, actual data pkts, and actual data bytes), the random forest algorithm achieves a mean accuracy of
86.92%. In other words, while features quantifying the amount of data are useful, they are
not sufficient to achieve peak performance. Though no feature had a mean decrease impurity
of zero, the features pertaining to data and packet retransmission are the least important.
For the histogram features, the packet size features are typically more important than
the packet interarrival time features as indicated by greater impurity decreases. With only
the interarrival time histogram, the mean random forest accuracy is 43.85%. Manual inspection of the interarrival time histogram data shows that the music class generally has longer
packet interarrival times than the other five classes. Although we were unable to differentiate
between the five non-music classes by manually inspecting the raw interarrival time data,
the random forest was able to distinguish them with higher accuracy than random guessing;
that is, if the algorithm had correctly classified only music and then guessed between the
five non-music possibilities for the other five classes, we would expect only 33.33% accuracy.
On the other hand, when the random forest algorithm only uses the packet size histogram,
the mean random forest accuracy is 92.53% with a standard deviation of 2.01%. The dif-
ference between this result and the 93.23% result in Table 4.2 (which includes the whole
histogram feature vector) is statistically significant with a 95% confidence interval of -1.23%
to -0.17%. However, the closeness of these two results indicates that the histogram feature
vector performs almost as well without considering the interarrival times. We do not see
any discernible pattern to the impurity decrease for the individual features (bins) within
each histogram (e.g., the feature corresponding to the histogram bin containing the largest
packets is not consistently the most important).
As one might expect from the results discussed previously, the mean decrease impurity
calculations for the combined feature vectors indicate that the most important tcptrace fea-
tures are the most important features overall. Similar to the separate tcptrace and histogram
feature sets, the least important features in the combined feature vector are the tcptrace fea-
tures related to data retransmission (rexmt data pkts and rexmt data bytes) and the packet
interarrival time histogram features.

4.6 Generalization Across Networks and Users

In a real eavesdropping scenario, the eavesdropper would likely not have, nor be able to
obtain, labeled training data of the people using the Echo on the target’s home network. It
seems more likely that the eavesdropper would collect labeled data on their own network,
train a machine learning model with it, and then use this model to classify data from the
target’s network. Therefore, to present a real threat, the techniques discussed within this
chapter must generalize between networks, i.e., the techniques must perform well on different
networks for training and testing. Our techniques must also generalize between Echo users;
the way that one person asks for information might be sufficiently different from the way
that another person asks for the same information to meaningfully impact Alexa’s response,
at least in timing if not in content. In cooperation with another researcher, we obtained data
for our six classes collected on another network. There are 10 examples for each class, except
for the unintelligible and music classes which each have nine examples. While this amount
of data is small, we intend this portion of our work to only be a preliminary exploration that
can guide and motivate subsequent studies to collect a more substantial multi-network data
set.
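A minimal sketch of this cross-network evaluation, assuming the feature matrices and labels from the two networks have already been extracted into the placeholder arrays X_home, y_home, X_other, and y_other:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Train on all labeled data from our own network...
forest = RandomForestClassifier(n_estimators=400)
forest.fit(X_home, y_home)

# ...and test on the data collected on the other network by the other user.
predictions = forest.predict(X_other)
print(accuracy_score(y_other, predictions))
print(confusion_matrix(y_other, predictions))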
A random forest trained on our data with the tcptrace feature vectors achieves only
48.28% accuracy when tested on the other network’s tcptrace data. With six classes, the ac-
curacy overall is much better than guessing, but is significantly worse than the performance
when trained and tested on a single network with a single user. The accuracy varies widely,
however, depending on the class being tested. Figure 4.4 shows the corresponding confusion
matrix. All examples from the information, unknown, and music classes are classified cor-
rectly, but all examples from the quotes, weather, and directions classes are misclassified.
The results in Figure 4.4 indicate that the weather and directions classes in the new data
resemble the information class from our training data more closely than they resemble their
correct classes. This trend in misclassification might be because the information class in our
training data contains a larger range of window advertisements than the other classes. Since
the window sizes for quotes, weather, and directions on the new network do not match the
values in our training data well, they may end up in the information class where window
sizes tend to be more varied in training.
Interestingly, the k-nearest neighbors algorithm (with k = 3 and no weighting on tcptrace
features) achieves performance comparable to that of the random forest, achieving 50.00%
accuracy when tested on the new data. Figure 4.5 shows the corresponding confusion matrix.
The errors are fundamentally similar to those made by the random forest. We note that the
other machine learning models evaluated were markedly worse than the random forest and
k-nearest neighbors algorithms in this task using tcptrace features. We also note that the
combined feature set gave the exact same results on the new data as the tcptrace features,
indicating that the changes to the tcptrace features across networks, combined with the importance of the tcptrace features in the combined vector (demonstrated by the mean decrease in impurity as discussed in Section 4.5), are sufficient to negate any meaningful information within the histogram features in all machine learning models evaluated.
The histogram features perform better than the tcptrace features on the new data. Specif-
ically, a random forest trained on our data with the histogram features achieves 60.34%
accuracy when tested on the other network’s histogram data. The corresponding confusion
matrix is in Figure 4.6. The random forest was the best performing machine learning algo-
rithm on the histogram data from the new network. The classification errors follow the same
general trends as with the tcptrace features, though fewer errors exist. The directions class
is the only class with no correct classifications. These results suggest that the histogram
features are more robust to changes in network and user than the tcptrace features.
We suggest several ideas for achieving better performance across networks and users in
Chapter 6 as future work. Though our results necessitate further research in this area, we
do not believe that they negate the threat to Echo user privacy.

Figure 4.4: Confusion matrix for a random forest tested on tcptrace feature vectors from a
different network with a different user.

Figure 4.5: Confusion matrix for k-nearest neighbors tested on tcptrace feature vectors from
a different network with a different user.

Figure 4.6: Confusion matrix for a random forest tested on histogram feature vectors from
a different network with a different user.

CHAPTER 5
SPEAKER IDENTIFICATION

This chapter describes our efforts to identify who, of a finite set of possible speakers, is
speaking in the encrypted audio traffic streaming from the Echo to the Alexa service. Recall
that these transmissions consist of audio recordings of users’ questions or commands. Our
results in this task are limited, perhaps due to the absence of VBR, VAD, or a similar encoding technique that would create discriminative traffic patterns based on speech pauses
(see Section 3.2). As in the previous chapter, we will first discuss the data that we collected,
and then describe our machine learning efforts and their results.

5.1 Our Data

We collected the data used in this research via the data capture environment described in
Appendix A. We collected two classes of data for this classification problem: human speech
and artificial speech (text to speech software). In our judgment, the human’s speech patterns
are noticeably different from the speech patterns created by the software in terms of pauses.
We collected 130 examples of each class for a total of 260 packet capture files. Each speaker
asked the same 130 questions of the form “Alexa, who is Thomas Jefferson?”, which are the
same questions used in the information class in Chapter 4. We again use 80% of each class
(104 examples) for training our machine learning models and 20% (26 examples) of each for
testing the models.
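As a sketch, assuming the 260 extracted feature vectors and their speaker labels are held in placeholder arrays X and y, this split can be produced as follows:

from sklearn.model_selection import train_test_split

# 80% of each class for training and 20% for testing; stratifying on y
# keeps the two classes balanced on both sides of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y
)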

5.2 Lack of Differentiable Speech and Pause Packets

The feature extraction process used in [22] involves converting a stream of packets into
an “abstract voice stream”, which is the sequence of numbers corresponding to the number
of adjacent speech packets, then the number of adjacent pause packets, then the number of
adjacent speech packets, etc. For example, the abstract voice stream (5, 10, 4, 7, 5) would
be produced from a packet capture file containing 5 speech packets, then 10 pause packets,
then 4 speech packets, then 7 pause packets, and finally 5 speech packets. Differentiating
between these two packet types (speech and pause) was straightforward in [22] because the
speech packets were roughly six times the size of the pause packets. Given the success in [22],
we would have liked to use a similar method in our work. Unfortunately, we were unable to
differentiate between speech and pause packets in our data. Figure 5.1 shows a histogram of
the packet sizes for all our speaker identification packets. With VAD, we would expect the
histogram of packet sizes in Figure 5.1 to be bimodal. Instead, our distribution of packet
sizes is unimodal, with most of the packets between 400 and 500 bytes. Specifically, 74.78%
of the packets are 491 bytes. Furthermore, the vast majority of our captured packets occur in long streams of 491 byte packets interrupted only by ACK packets. We
hypothesize that the Echo’s chunking of audio data results in a stream of 491 byte packets
containing the user’s request. This hypothesis is supported in Figure 5.2 and Figure 5.3,
which show the data packet sizes over time for a representative human request and artificial
request respectively. In both of these figures, the majority of the time is taken up by a single
long stream of 491 byte packets. In summary, we posit that the Echo’s audio stream to the
server is using constant bitrate encoding; we see no evidence of VAD, VBR, or a similar
encoding technique.
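For reference, the conversion from a packet trace to an abstract voice stream described at the start of this section can be sketched as follows; the size threshold separating speech packets from pause packets is a hypothetical parameter, since our unimodal packet size distribution offers no such cutoff.

def abstract_voice_stream(packet_sizes, threshold):
    """Collapse a sequence of packet sizes into alternating run lengths of
    speech packets (size > threshold) and pause packets (size <= threshold)."""
    stream = []
    last_was_speech = None
    for size in packet_sizes:
        is_speech = size > threshold
        if stream and is_speech == last_was_speech:
            stream[-1] += 1  # extend the current run
        else:
            stream.append(1)  # start a new run of the other packet type
        last_was_speech = is_speech
    return stream

In [22], where speech packets were roughly six times the size of pause packets, such a threshold is easy to choose; in our data, it is not.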
Figure 5.1: Histogram of packet sizes for all packets collected for our speaker identification problem.

The largest packets shown in Figure 5.1 (1500-1600 bytes) occur at the beginning of each
packet capture file; unfortunately, these packets do not seem to correspond to the input
audio in any discernible way. As shown in Figure 5.2 and Figure 5.3, each packet capture file
typically contains roughly five of these large packets within the first 0.2 seconds of streaming
a user’s request to the server. In addition, these large packets do not occur elsewhere in
the data stream. We have two hypotheses to explain the presence of these large packets.
First, they could contain state information about the Echo device making the request (e.g.,
software version, an ID number, hardware information, etc.). Second, the large packets could
contain recorded audio encompassing the utterance of the wake word “Alexa”, and the few
moments prior to that utterance. As discussed in Section 2, the audio transmitted to the
server begins a few moments prior to the wake word. Since the Echo does not begin sending
audio until it has recognized the wake word, the large packets in the beginning of the audio
stream could be the audio recorded prior to recognizing the wake word (which is sent in
larger packets to catch up to the user’s speech). We are unable to check these hypotheses because the packets are encrypted.

Figure 5.2: Data packet sizes over time for a representative human request.

5.3 Results

We use the same machine learning algorithms, feature vectors, and training and testing procedures as in Chapter 4, except for one difference. Specifically, for the tcptrace features, we found that “rexmt data pkts”, “rexmt data bytes”, “outoforder pkts”, and “mss requested” remained constant in our data set, so we discarded those four features (along with
the 13 discarded previously as they also never varied). All tcptrace features are delineated
and described in Appendix B.
Unlike our results for request type classification in Chapter 4, the random forest is not the
best algorithm in the speaker identification task. Instead, support vector machines achieve
the best performance in all three feature vectors, with the linear kernel outperforming the
radial basis function kernel for the tcptrace and histogram features. Our best result for
speaker identification, which comes from the linear kernel support vector machine using
the tcptrace features, is a mean accuracy of 58.13%, with standard deviation 5.66%, and
a median accuracy of 57.69%. We note, however, that the difference in means between
the best performing algorithm and the second best performing algorithm is not statistically significant for any of the three feature vectors, so we cannot say definitively which algorithm is best. Broadly, our results suggest that the tcptrace features are the best feature vectors
to use for this task, while the histogram features are the worst. Table 5.1 summarizes our
results.

Figure 5.3: Data packet sizes over time for a representative artificial request.

Figure 5.4 shows the confusion matrix for the linear kernel support vector machine using
tcptrace features, as this algorithm achieved the highest mean and median accuracy of all
algorithms and feature vectors evaluated (see Table 5.1). Although most of the artificial
speech examples were classified correctly, over half of the human speech examples were
misclassified. Other algorithms, such as the random forest, were more equitable in their
misclassifications, correctly classifying a small majority of each class.
Our speaker identification task is a 2 class problem with perfectly balanced classes, so random guessing would achieve 50% accuracy; our results are thus only marginally better than random guessing. However, with p < 0.0001,
it is extremely likely that the true mean accuracy is greater than 50% for the best algorithm
on each of the three feature vectors. In other words, we are outperforming random guessing
despite the appearance that VAD and VBR encoding are not being used in the traffic sent
from the Echo to the Alexa cloud service. We hypothesize that the main distinguishing factor between the two classes is that the artificial speech is somewhat faster than the human’s speech. That is, the artificial requests are generally shorter in duration than the human’s
requests. This conjecture is corroborated by the fact that the most important tcptrace
features, as ranked by the mean decrease impurity of a random forest, are “data xmit time”,
“throughput”, and “idletime max”. It is also common for some quantity describing the total
amount of data (e.g., “unique bytes sent”) to be in the top five features. Lastly, we note
that the window advertisement features that were most important in Chapter 4 are the least
important features for speaker identification.
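As an illustrative sketch (assuming a one-sample t-test as the significance test, which is one plausible choice), such a claim can be checked with SciPy; accuracies is an assumed placeholder holding the 100 per-split test accuracies.

from scipy import stats

# accuracies is an assumed array of the 100 test accuracies (in percent),
# one per train/test split; 50.0 is chance for two balanced classes.
t_stat, p_two_sided = stats.ttest_1samp(accuracies, 50.0)
p_one_sided = p_two_sided / 2  # we only ask whether the mean exceeds 50%
print(t_stat, p_one_sided)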

Figure 5.4: Confusion matrix averaged over 100 different train/test splits for linear kernel
support vector machine using tcptrace feature vectors.

Table 5.1: Accuracy results for 100 trials with different train/test splits for our six machine learning algorithms on our three different feature vectors. Largest results are in bold.

Feature Vector   Model              Mean Accuracy   Standard Deviation   Median Accuracy
tcptrace         Decision Tree      53.62           6.89                 53.85
tcptrace         Random Forest      55.69           5.19                 55.77
tcptrace         SVM (linear)       58.13           5.66                 57.69
tcptrace         SVM (RBF)          57.69           5.93                 57.69
tcptrace         Neural Net (MLP)   55.02           5.54                 53.85
tcptrace         KNN                53.92           6.66                 53.85
Histogram        Decision Tree      51.67           6.36                 51.92
Histogram        Random Forest      54.00           5.62                 53.85
Histogram        SVM (linear)       54.42           6.11                 53.85
Histogram        SVM (RBF)          54.40           6.02                 55.77
Histogram        Neural Net (MLP)   52.79           6.08                 51.92
Histogram        KNN                50.94           5.93                 51.92
Combined         Decision Tree      54.81           7.20                 55.77
Combined         Random Forest      56.87           5.82                 57.69
Combined         SVM (linear)       55.08           6.86                 55.77
Combined         SVM (RBF)          57.48           6.34                 57.69
Combined         Neural Net (MLP)   53.73           5.87                 53.85
Combined         KNN                51.19           6.73                 51.92
CHAPTER 6
CONCLUSIONS AND FUTURE WORK

The goal of our research was to extract ostensibly private information from the encrypted
TCP traffic flowing between the Amazon Echo and the Alexa cloud service. Because the
transmissions are encrypted, we must rely on shallow packet inspection techniques. Our work comprises two supervised classification problems: request type classification and
speaker identification.
In the request type classification problem (Chapter 4), we classify the encrypted infor-
mation coming from the Alexa cloud service to the Echo device by the type of user request
that is being answered. This task is conceptually similar to the well-studied “identification
problem” on general purpose network traffic. We designed a data capture environment (Ap-
pendix A) and collected six classes of data. Our classes are information, quotes, weather,
directions, music, and unintelligible. We extracted three different feature vectors from the
captured packets in this research. While all three feature vectors proved to be valuable, we
found the combined feature vector yielded slightly better request type classification perfor-
mance than the tcptrace feature vector. We tested six machine learning algorithms after
tuning their hyper-parameters via 3-fold cross validation, and found that the random forest
algorithm performed the best in this application, achieving 97.01% accuracy averaged across
100 different train/test splits using the combined feature vectors. We use our random forests
to report on the relative importance of our different features based on their mean impurity
decreases. The most important features concern the window advertisements. Overall, we
believe that this result on the Amazon Echo constitutes a credible threat to user privacy.
We also evaluated this request type classification problem with machine learning models
trained on data from a different user and network than the testing data. While the different
user and network decreased accuracy, we were still moderately successful with a random
forest, and achieved mean accuracy of 60.34% using histogram features. We found that the
histogram features were more robust to the change in user and network than the tcptrace
features. We discuss our ideas for improving this result in Section 6.1.
In the speaker identification problem (Chapter 5), we seek to determine who, of a finite
set of possible speakers, is speaking in the encrypted audio data streaming from the Echo
device to the Alexa cloud service. We collected two classes of data: artificial speech and
human speech. We were unable to replicate the feature extraction techniques of similar
studies (e.g., [22]), likely because the Echo audio data is not encoded with VAD, VBR, or
similar techniques. We used the same feature vectors and machine learning models from our
request type classification problem. While we only achieved a maximum mean accuracy of
58.13%, the difference between this result and random guessing is statistically significant.
This slightly positive result is likely due to the artificial speech generally lasting for a shorter
duration than human speech. While we do not believe that this result constitutes a threat
to user privacy, we do suggest further research in this area. For example, future work could
investigate multiple human speakers rather than a single human speaker and text-to-speech
software.

6.1 Improving Generalizability Across Networks and Users

We believe our techniques have the potential to generalize well across networks and
users. Recall that the results in Section 4.6 were from training machine learning models
on data from one network and user, and then testing the models on data from a different
network and user. While this set of results had lower classification accuracy than our results
from training and testing on a single network and user, we do not know how much of the
performance decrease is attributable to the network, the user, and/or the data collection
procedure respectively. Subsequent experiments should isolate each of these variables. Also,
with only ten examples from each class, the amount of data from the other network and user
was limited. Subsequent experiments with more data would give us a clearer idea of model
performance.
In supervised machine learning, model training is typically undertaken with some task
in mind. To maximize performance in that task, it is advisable to use training data that resemble, as closely as possible, the actual data that the model will later classify. In our
application, we have confidence that a model trained on data from many different networks
and users will perform considerably better on previously unseen networks and users than a
model trained on a single network and user.

6.2 Building Data Sets of Usage Patterns

Our success with request type classification in Chapter 4 opens the door for several
avenues of future research. We would like to collect large data sets of people using the Echo
in their homes over several weeks or months. We would then explore feature extraction and
machine learning on the collected usage patterns. We could investigate how well the usage
patterns identify specific households. A trivial example would be that household A asks for
weather at 8:00 every morning, while household B routinely asks for music at that time.
Applying clustering techniques to this data set could reveal interesting behavioral groups of
households. Applying anomaly detection techniques to this data set could allow us to detect
changes in household dynamics (e.g., divorce or a child moving away to college), which
would also constitute a privacy violation. We would also like to investigate how personal
usage patterns could identify individuals within a household. Collecting a useful amount
of individually labeled data from the households being studied would require extra work
from the household members, but could greatly augment our speaker identification results
presented in Chapter 5.
The first step in investigating the topics in the previous paragraph would be to expand
the set of request type classes to be as comprehensive as possible. Our six classes from
Chapter 4 cover a large portion of the Echo’s functionality, but they are not exhaustive. We
could add, for example, classes for shopping and listening to podcasts. We believe that the
podcast data might be similar to the music data, which would test our classification process
well. We could also include a catchall “other” class for any overlooked request types that do
not belong in another class. With these added classes, we would collect data that represents
the typical Echo user.

6.3 Exploring Similar Devices

The consumer options for virtual assistant smart speakers are rapidly increasing as many
companies work to earn a share of the emerging smart speaker market. The Google Home
is a product similar to the Echo that is already on the market [29]. Samsung and Apple are
poised to release their own smart speakers in 2018 [30, 31]. With so many similar products
competing for essentially the same customers, differences in security could play a significant
role in consumer choices. We believe that a comprehensive study comparing how vulnerable
these products are, both to the techniques developed in this thesis and to other potential
threats to user privacy, could be valuable to conscientious consumers.

REFERENCES CITED

[1] Amazon Echo product page. https://www.amazon.com/dp/B00X4WHP5E. Access date: January 25th, 2018.

[2] M. Hughes. The Amazon Echo and Echo Dot are coming to the UK and Germany. https://thenextweb.com/gaming/2016/09/14/the-amazon-echo-and-echo-dot-are-coming-to-the-uk-and-germany/#.tnw_dz0yAwD8, 2016. Access date: January 25th, 2018.

[3] J. H. Ziegeldorf, O. G. Morchon, and K. Wehrle. Privacy in the Internet of things: Threats and challenges. Security and Communication Networks, 7(12):2728–2742, 2014.

[4] S. Baral. Amazon Echo privacy: Is Alexa listening to everything you say? https://mic.com/articles/162865/amazon-echo-privacy-is-alexa-listening-to-everything-you-say#.1mt6eZ2WS. Access date: January 25th, 2018.

[5] C. Davies. How private is Amazon Echo? https://www.slashgear.com/how-private-is-amazon-echo-07354486, 2014. Access date: January 25th, 2018.

[6] S. Machkovech. Amazon announces Echo, a $199 voice-driven home assistant. https://arstechnica.com/gadgets/2014/11/amazon-announces-echo-a-199-voice-driven-home-assistant, 2014. Access date: January 25th, 2018.

[7] T. Moynihan. Alexa and Google Home record what you say, but what happens to that data? https://www.wired.com/2016/12/alexa-and-google-record-your-voice/, 2016. Access date: January 25th, 2018.

[8] S. B. Kotsiantis. Supervised machine learning: A review of classification techniques. Informatica, 31:249–268, 2007.

[9] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.

[10] S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise procedure for building and training a neural network. Neurocomputing, 68:41–50, 1990.

[11] M. Pal. Multiclass approaches for support vector machine based land cover classification. CoRR, abs/0802.2411, 2008. URL http://arxiv.org/abs/0802.2411.

[12] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2 edition, 2009.

[14] D. Coomans and D. L. Massart. Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-nearest neighbour classification by using alternative voting rules. Analytica Chimica Acta, 136:15–27, 1982.

[15] W. Li and A. W. Moore. A machine learning approach for efficient traffic classification. Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2007.

[16] R. Bar-Yanai, M. Langberg, D. Peleg, and L. Roditty. Realtime classification for encrypted traffic. Proceedings of the 9th International Conference on Experimental Algorithms (SEA'10), pages 373–385, 2010.

[17] A. W. Moore, D. Zuev, and M. L. Crogan. Discriminators for use in flow-based classification. Queen Mary and Westfield College, Department of Computer Science, 2005.

[18] W. C. Barto. Classification of encrypted web traffic using machine learning algorithms. Department of the Air Force, Air University, 2013. http://www.dtic.mil/get-tr-doc/pdf?AD=ADA585816. Access date: January 25th, 2018.

[19] R. Alshammari and A. N. Zincir-Heywood. Machine learning based encrypted traffic classification: Identifying SSH and Skype. Proceedings of the IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.

[20] Q. Sun, D. R. Simon, Y. Wang, W. Russell, V. N. Padmanabhan, and L. Qiu. Statistical identification of encrypted web browsing traffic. Proceedings of the IEEE Symposium on Security and Privacy, 2002.

[21] B. Neimczyk and P. Rao. Identification over encrypted channels. BlackHat USA, 2014.

[22] M. Backes, G. Doychev, M. Durmuth, and B. Kopf. Speaker recognition in encrypted voice streams. Proceedings of the 15th European Conference on Research in Computer Security (ESORICS'10), pages 508–523, 2010.

[23] L. A. Khan, M. S. Baig, and A. M. Youssef. Speaker recognition from encrypted VoIP communications. Digital Investigation, 2009.

[24] C. V. Wright, L. Ballard, S. E. Coull, F. Monrose, and G. M. Masson. Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations. Proceedings of the IEEE Symposium on Security and Privacy, pages 35–49, 2008.

[25] C. V. Wright, L. Ballard, F. Monrose, and G. M. Masson. Language identification of encrypted VoIP traffic: Alejandra y Roberto or Alice and Bob? Proceedings of the 16th USENIX Security Symposium, pages 1–12, 2007.

[26] Champlain College Leahy Center for Digital Investigation. Amazon Echo forensics. https://lcdiblog.champlain.edu/wp-content/uploads/sites/11/2016/05/EDITED_Amazon_Echo_Report-1.pdf, 2016. Access date: January 25th, 2018.

[27] S. Ostermann. tcptrace. http://www.tcptrace.org/, 2003. Access date: January 25th, 2018.

[28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[29] A. Gebhart. Google Home review. https://www.cnet.com/products/google-home/review/, 2017. Access date: January 25th, 2018.

[30] Z. Hall. HomePod: Everything we know about the Apple smart speaker so far. https://9to5mac.com/2017/11/14/homepod-siri-speaker-launch-details/, 2017. Access date: January 25th, 2018.

[31] M. Gurman. Samsung targets first half of 2018 for smart speaker. https://www.bloomberg.com/news/articles/2017-12-14/samsung-is-said-to-target-first-half-of-2018-for-smart-speaker, 2017. Access date: January 25th, 2018.
APPENDIX A
SETTING UP THE PACKET CAPTURE ENVIRONMENT

We collect our data using an Amazon Echo Dot. The Dot has no Ethernet port and, thus,
must interface with the Internet wirelessly. At a high level, our data collection approach is
to use a computer as a wireless access point for the Echo, record the desired packets as
they enter the computer, and then allow the traffic to continue to its destination. Following previous research on capturing packets from IoT devices, we initially attempted to configure
our wireless access point with the tool Hostapd (Host access point daemon). Hostapd is a
user space software access point capable of turning a normal network interface card into an
access point and authentication server. We then attempted to couple the Hostapd access
point with bridge forwarding through dnsmasq or isc-dhcp-server. This approach, however,
did not work; that is, the Echo would connect to our access point, but no response traffic
came through from the Alexa cloud service. To collect the necessary data, we leveraged
tools built into the Ubuntu 14.04 operating system. Our steps are outlined in detail in the
following.
First, we need to verify that our gateway computer’s WiFi hardware card supports access
point (AP) mode. We use the tool iw which shows and manipulates wireless devices and
their configurations. Running iw list outputs the relevant information. Specifically, the
Supported interface modes section of the output should list AP. If not, USB wireless cards
are readily available with this feature.
Next, we open Network (by searching “network” in Dash) and click the option “Use as
Hotspot” with the wireless network interface selected. This step starts an ad hoc network
wireless hotspot and creates its configuration file. This ad hoc hotspot is different from the
access point that we need to set up; thus, we must edit the configuration file in the next
step. Since many devices, IoT and otherwise, do not support ad hoc networking, we stop
the hotspot via the graphical user interface and proceed to the next step.
As mentioned, we must change our hotspot to access point mode. The following command
will open the configuration file for the hotspot created in the previous step:

pkexec env DISPLAY=$DISPLAY XAUTHORITY=$XAUTHORITY gedit /etc/NetworkManager/system-connections/Hotspot

We then locate the line of the configuration file that specifies the mode and change it to
mode=ap.
Finally, we start the access point by selecting “Create New Wi-Fi Network” from the
Network Indicator menu, and then selecting “Hotspot” from the dropdown menu under
“Connection:”. It is at this point that one names the network and sets up the security.
We chose WPA2 security. With the access point running, we use the Alexa companion
application to connect the Echo to the new network. Note that the computer must be
physically connected to the Internet via Ethernet. That is, the wireless card cannot act as
an access point for the Echo and simultaneously provide Internet connectivity via some other
wireless network. It should be possible to use two wireless cards, but we have not tested this
approach.
With the Echo connected to our access point and communicating with Alexa through it,
we can begin recording traffic with the tool “tcpdump” via the following command:

sudo tcpdump -i wlan0 ether host $MAC -w $FILENAME

where $MAC stands for the MAC address of the Echo device (visible under “Settings” in the
Alexa app), and $FILENAME refers to the file where the packet capture data will be saved.
Note that this command will capture traffic in both directions. If desired, we can replace
“host” with “src” to collect only traffic from the Echo, or with “dst” to collect only traffic
to the Echo. The “wlan0” component assumes that the access point is configured on the
first or only wireless card on the gateway machine.

APPENDIX B
FEATURES FROM TCPTRACE

The following table lists and describes the 35 numeric features that comprise our tcptrace
feature vector introduced in Section 4.2.1. As discussed in Chapter 4, 13 of these features
were discarded during our request type classification research because they were constant for
the whole data set. These discarded features are denoted by the ‡ symbol in the table. We
discarded 17 features, denoted by the † symbol, in Chapter 5 because they were constant in
our speaker identification data set. We note that all 13 features discarded for request type
classification were also discarded for speaker identification.

total packets: The total number of packets.
ack pkts sent: The total number of ack packets (i.e., TCP segments seen with the ACK bit set).
pure acks sent: The total number of ack packets without data payloads and without the SYN/FIN/RST flags set.
sack pkts sent: The total number of ack packets carrying TCP sack (selective acknowledgment) blocks. ‡ †
dsack pkts sent: The total number of sack packets carrying duplicate sack (D-sack) blocks. ‡ †
max sack blks/ack: The maximum number of sack blocks seen in any sack packet. ‡ †
unique bytes sent: The number of unique bytes sent (i.e., all bytes sent excluding retransmitted bytes and any window probing bytes).
actual data pkts: The count of all packets with at least a byte of TCP data payload.
actual data bytes: The total bytes of data transmitted including retransmissions and window probe packets.
rexmt data pkts: The count of all retransmitted packets. †
rexmt data bytes: The total bytes of data transmitted in the retransmitted packets. †
zwnd probe pkts: The count of all window probe packets. ‡ †
zwnd probe bytes: The total bytes of data transmitted in the window probe packets. ‡ †
outoforder pkts: The count of all packets that arrived out of order. †
pushed data pkts: The count of all packets with the PUSH bit set in the TCP header.
SYN/FIN pkts sent: The count of all packets with the SYN/FIN bits set in the TCP header. ‡ †
urgent data pkts: The count of all packets with the URG bit set in the TCP header. ‡ †
urgent data bytes: The total bytes of urgent data sent. This field is calculated by summing the urgent pointer offset values found in packets having the URG bit set in the TCP header. ‡ †
mss requested: The maximum segment size requested as a TCP option in the SYN packet opening the connection. †
max segm size: The observed maximum segment size during the connection.
min segm size: The observed minimum segment size during the connection.
avg segm size: The average segment size observed during the connection, calculated as the value reported in the actual data bytes field divided by the actual data pkts reported.
max win adv: The maximum window advertisement seen. If the connection is using window scaling (i.e., both sides negotiated window scaling during the opening of the connection), this feature is the maximum window-scaled advertisement seen in the connection.
min win adv: The minimum window advertisement seen. This feature is the minimum window-scaled advertisement seen if both sides negotiated window scaling.
zero win adv: The number of times a zero receive window was advertised. ‡ †
avg win adv: The average window advertisement, calculated as the sum of all window advertisements divided by the total number of packets. If the connection endpoints negotiated window scaling, this average is calculated as the sum of all window-scaled advertisements divided by the number of window-scaled packets.
initial window bytes: The total number of bytes sent in the initial window (i.e., the number of bytes in the initial flight of data before receiving the first ack packet from the other endpoint).
initial window pkts: The total number of segments (packets) sent in the initial window.
stream length: The theoretical stream length, calculated as the difference between the sequence numbers of the SYN and FIN packets, giving the length of the data stream. ‡ †
missed data: The missed data, calculated as the difference between the stream length and the unique bytes sent. ‡ †
truncated data: The total bytes of data truncated during packet capture. For example, when collecting data with tcpdump as described in Appendix A, one could specify the snaplen option to truncate all captured packets down to some specified maximum size. In other words, when using snaplen with maximum size n bytes, if we encounter a packet with size s > n bytes, we discard the last s − n bytes of the packet and record only the first n bytes. We do not use any truncation in this study. ‡ †
truncated packets: The total number of packets truncated. ‡ †
data xmit time: Total data transmit time, calculated as the difference between the times of the first and last packets carrying non-zero TCP data payload.
idletime max: Maximum idle time, calculated as the maximum time between consecutive packets in the direction of communication.
throughput: The average throughput, calculated as the unique bytes sent divided by the elapsed time.

Table B.1: Features extracted from the Echo packet capture data by tcptrace.
