Security in Smart Home Networks
Yan Meng
Haojin Zhu
Xuemin (Sherman) Shen
Wireless Networks
Series Editor
Xuemin (Sherman) Shen, University of Waterloo, Waterloo, ON, Canada
The purpose of Springer’s Wireless Networks book series is to establish the state
of the art and set the course for future research and development in wireless
communication networks. The scope of this series includes not only all aspects
of wireless networks (including cellular networks, WiFi, sensor networks, and
vehicular networks) but also related areas such as cloud computing and big data.
The series serves as a central source of references for wireless networks research
and development. It aims to publish thorough and cohesive overviews on specific
topics in wireless networks, as well as works that are larger in scope than survey
articles and that contain more detailed background information. The series also
provides coverage of advanced and timely topics worthy of monographs, contributed
volumes, textbooks and handbooks.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
As a typical application of the Internet of Things (IoT), the smart home system is
widely deployed and plays an important role in our daily lives. In the smart home
environment, smart devices are connected through wireless networks, provide users
with contact-free interfaces such as the voice interface, and are managed uniformly
through smart applications. Smart terminal devices (e.g., smartphones, tablets, smart
sensing devices) provide users with rich functions and can sense environmental
changes in the smart home in real time. With the popularization of the voice
interface, users can interact with the smart home system without physical contact,
and smart applications can automatically control devices and adjust the status of the
smart home system.
However, current research shows that the smart home network still faces a wide
range of security threats. First, although the wireless communication technologies
adopted by terminal devices increase the communication rate and extend the
network range, they also expose users to privacy risks from side-channel attacks.
Second, due to the open nature of the voice channel, the voice interface faces
various voice spoofing attacks. Lastly, the smart application platform suffers from
the abnormal behavior of smart applications. To ensure the security and privacy
of the smart home network, this monograph studies the corresponding key
technologies following the logic of “terminal device—user interface—application
platform.”
In Chap. 1, we first introduce the growth trend and architecture of the smart
home network. In particular, we describe the three components of the smart
home and their functionalities. In Chap. 2, we review the existing literature on
the security and privacy issues faced by these three components. More
specifically, we introduce the side-channel attacks faced by the terminal device,
the voice spoofing attacks and countermeasures in the voice interface, and the
misbehavior and defense mechanism for the application platform. In Chap. 3, at
the layer of the terminal device, we study the side-channel privacy threats caused
by the side-channel attack aiming at the wireless communication protocol and
propose an obfuscation-based countermeasure mechanism. In Chap. 4, at the layer
of the voice interface, we propose a liveness detection scheme named WSVA that
leverages the Wi-Fi signal, which is ubiquitous in the smart home environment.
WSVA uses the wireless Wi-Fi signal to characterize the user’s mouth movement
and then determines whether the voice command received by the voice interface
is authentic or spoofed by judging the consistency between the user’s mouth
movement and the voice signal. Then, in Chap. 5, to further improve the generality
of liveness detection, we propose a passive liveness detection scheme named
ArrayID that depends only on the collected voice signal. ArrayID uses the
microphone array commonly equipped in smart speakers to achieve more robust
liveness detection performance. In Chap. 6, at the layer of the smart application
platform, to address the threat of application misbehavior, we propose a third-party
anomaly detection system named HoMonit. By analyzing the side-channel
information of wireless communication traffic, HoMonit can accurately detect an
application’s misbehavior. Finally, in Chap. 7, we summarize the main content of
this monograph and introduce possible future research directions.
We hope that this monograph can provide insights into the security of smart
home networks, including terminal device security, voice interface security, and
application platform security. We would like to thank Prof. Xiaohui Liang at the
University of Massachusetts Boston, Prof. Yao Liu at the University of South
Florida, Prof. Yinqian Zhang at Southern University of Science and Technology,
Prof. Yuan Tian at the University of California, Los Angeles, Prof. Jin Li at
Guangzhou University, and Prof. Xiaokuan Zhang at George Mason University for
their contributions to this monograph. We would also like to thank all members of
the BBCR group at the University of Waterloo and the NSEC group at Shanghai
Jiao Tong University for their valuable suggestions and comments.
Special thanks to the staff at Springer Nature, Mary E. James, Brian Halm, and
Bakiyalakshmi RM for their help throughout the publication preparation process.
This work is also supported by the National Natural Science Foundation of China
(62132013, 61972453).
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Era of Smart Home and Its Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Architecture of Smart Home Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Terminal Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Voice Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Application Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Security and Privacy Challenges in Smart Home Network . . . . . . . . . . . 8
1.3.1 Terminal Device Layer: Privacy Leakages . . . . . . . . . . . . . . . . . . . . 9
1.3.2 Voice Interface Layer: Spoofing Attacks . . . . . . . . . . . . . . . . . . . . . . 10
1.3.3 Application Platform Layer: Misbehavior . . . . . . . . . . . . . . . . . . . . . 13
1.4 Aims and Organization of This Monograph . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.1 Aims of This Monograph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.2 Organization of This Monograph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Literature Review of Security in Smart Home Network . . . . . . . . . . . . . . . . 21
2.1 Side-channel Attacks Faced by Terminal Device . . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 Attacks Based on Physical Layer Side-channel Information . . . . . . . . 21
2.1.2 Attacks Based on Network Layer Side-channel Information . . . . . . . . 23
2.1.3 Other Side-channel Attack Manners . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Voice Spoofing Attacks in Voice Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.1 Voice Spoofing Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Two-Factor Authentication-Based Liveness Detection . . . . . . . 26
2.2.3 Voice Signal-Based Passive Liveness Detection . . . . . . . . . . . . . . 27
2.3 Misbehaviors in Application Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Misbehaviors of Smart Home Applications . . . . . . . . . . . . . . . . . . . 28
2.3.2 Defense Mechanisms against Misbehaviors . . . . . . . . . . . . . . . . . . . 29
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
With the development of IoT techniques, the smart home system has become
increasingly popular. With the help of smart home technology, users can connect
all kinds of smart devices together to realize home automation, remote control,
programmable control, and other functions. For example, users can achieve
all-round information interaction with, and behavior management of, smart
household appliances (e.g., smart refrigerators, smart microwave ovens, and smart
washing machines), lighting systems, temperature regulation systems (e.g., air
conditioners and heaters), and various security systems (e.g., access control and
alarm systems). According to the report Smart Home Market with COVID-19
Impact Analysis by Product, released on July 1, 2020, by Research and Markets,
one of the world’s largest market research institutions, the global smart home
market reached US $78.3 billion in 2020 and is predicted to grow at an annual rate
of 11.6%, reaching US $135.3 billion by 2025 [23].
There are several differences between the smart home network and traditional
IoT architectures. On the one hand, in the communication mode, most of the devices
in traditional IoT systems are connected through wired communication cables, while
the smart devices in the smart home network (e.g., the light bulb control system,
the user access control system, the smart home medical system, and the smart
kitchen system) are connected to each other through wireless networks. On the other
hand, in terms of platform compatibility, unlike traditional specialized IoT systems,
the smart home presents higher compatibility. To connect smart devices from
different device manufacturers, major smart home vendors have developed smart
home application platforms. In these platforms, devices developed by different
hardware manufacturers are represented uniformly (a.k.a. device abstraction).
Device abstraction means that developers need to know only a device’s functions
and properties, not its physical details, so they can easily design the corresponding
smart applications to automatically control various smart devices.
inversely cracked, revealing significant social security risks. Besides, the physical-
layer information generated during the wireless communication of smart terminal
devices (e.g., smartphones) can also be exploited by an attacker using wireless
sensing technology to obtain sensitive information such as user motion [31].
2. Voice spoofing attacks at the voice interface level. While providing conve-
nience for users, the voice interface in the smart home is also vulnerable to
spoofing attacks from unauthenticated voice commands due to the openness of
the voice channel. For example, in a replay attack, an attacker can cheat the
voice interface by recording the voice samples of a legitimate user in advance
and then playing them back with a high-quality speaker [12]. In addition,
exploiting defects in speech recognition algorithms and voice interface hardware,
new and stealthy attacks have also been proposed. For example, Carlini et al. [7]
proposed a hidden command attack, which successfully cheats the speech
recognition algorithm by converting a voice command into noise-like audio
while retaining the features required for speech recognition. Zhang et al. [34]
proposed the “Dolphin attack” based on hardware defects of the voice interface.
This attack embeds the voice command into an ultrasonic signal that the human
ear cannot detect, so the user is never alerted. Through voice spoofing attacks,
the attacker can not only query the user’s sensitive information and perform
sensitive operations (e.g., online shopping, making malicious calls) but also
force smart devices to perform improper actions (e.g., opening the door lock
when the user leaves home), posing a serious threat to the security and privacy
of the user’s smart home system.
3. Application misbehavior at the application platform level. To manage
heterogeneous IoT services, most smart home application platforms support
third-party smart applications. A smart application runs in the cloud backend,
monitors the status of a smart device (e.g., a smart motion sensor), and triggers
operations on another smart device (e.g., turning on smart lights) upon
notification of certain device events (e.g., the motion sensor detects the user’s
activities). However, the access control of the application platform is imperfect,
which opens the door to malicious attacks. Fernandes et al. [14] found that, due
to the coarse-grained authorization of the device function model, smart
applications on the Samsung SmartThings platform are often over-privileged.
More specifically, smart applications can obtain more permissions than their
users intend to grant. For example, the lock command lock.on() and the unlock
command lock.off() of a smart door lock are often authorized to a smart
application together. In this case, the application can threaten the user’s security
by abusing a command (e.g., using lock.off() to unlock the door). In addition,
malicious applications can disrupt the normal operation of the user’s smart home
system by forging device events in the cloud (e.g., falsely reporting the status of
smart sensors to trigger the smart alarm).
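The over-privilege problem above can be sketched in a few lines of Python. This is a toy model with hypothetical names (`SmartLock`, `Platform`, `night-lock-app`); the real SmartThings capability model is more elaborate, but the essence is the same: authorization is granted per device, so the lock grant implies the unlock grant.

```python
# Toy model of coarse-grained authorization for a smart lock: granting
# the whole "lock" device also grants the unlock command.

class SmartLock:
    def __init__(self):
        self.locked = True

    def on(self):    # lock command (lock.on())
        self.locked = True

    def off(self):   # unlock command (lock.off())
        self.locked = False

class Platform:
    """Coarse-grained model: apps are granted whole devices, not commands."""
    def __init__(self):
        self.grants = {}   # app name -> set of granted device ids

    def authorize(self, app, device):
        self.grants.setdefault(app, set()).add(id(device))

    def invoke(self, app, device, command):
        if id(device) not in self.grants.get(app, set()):
            raise PermissionError(f"{app} has no access to this device")
        getattr(device, command)()   # any command on the device is allowed

lock = SmartLock()
platform = Platform()
# The user installs an app that only claims to *lock* the door at night...
platform.authorize("night-lock-app", lock)
platform.invoke("night-lock-app", lock, "on")
# ...but the same grant lets a malicious version unlock it as well.
platform.invoke("night-lock-app", lock, "off")
print(lock.locked)  # False: the door is now unlocked
```

A fine-grained model would record grants per (device, command) pair, so the second `invoke` would raise `PermissionError` instead of silently unlocking the door.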
It can be seen that the security and privacy issues at the three levels of “terminal
device—voice interface—application platform” have become pressing issues in the
development of the smart home. Therefore, to enhance smart home security, this
monograph analyzes and studies the following key technologies: side-channel
privacy risks at the terminal device layer, voice spoofing attacks and defense
methods at the voice interface layer, and misbehavior detection systems at the
application platform layer.
This monograph mainly studies the smart home’s security and privacy issues at the
terminal device layer, user interface layer, and application platform layer, and then
builds reliable cross-layer security protection technologies. This section introduces
the smart home system architecture.
As shown in Fig. 1.1, a typical smart home system consists of three components:
smart terminal device, user interface, and smart application platform. This section
introduces each component one by one.
Heterogeneous IoT terminal devices occupy a key position in the smart home
network. In the human-centered smart home environment, devices are endowed
with diverse functions. Unlike traditional electronic devices, smart devices support
operating actuators through smart applications (e.g., directly opening the smart
door or turning on the smart heater).
Among the above smart devices, smart terminal devices such as smartphones
involve direct interaction with users and are often related to users’ private
information, whereas sensors, actuators, and hubs are often controlled by smart
applications. Therefore, this monograph studies the side-channel privacy attack
mechanisms and defense strategies related to smart terminals, and the misbehavior
detection schemes related to the other smart devices.
In smart home systems, users need to interact with smart devices frequently, so
a user-friendly and safe human–computer interaction mode is very important. In
addition to the traditional program interfaces on smart terminal devices, with the
development of emerging artificial intelligence technology, users can control smart
devices and cloud platforms through various interfaces (e.g., voice interaction and
image sensing). Among them, the voice interface has become the mainstream
method for controlling smart home devices due to its convenience, ease of signal
collection, and high recognition accuracy [5]. According to the report Voice
Assistant Market—Information by Technology, Hardware and Application—
Forecast till 2025, published by Market Research Future in 2020, the voice
assistant will become the most important IoT user interface, with its global market
share reaching US $1.68 billion in 2019 and predicted to reach US $7.3 billion by
2025 [21].
Figure 1.2 shows a typical voice interface scenario in the smart home. At
present, smart speakers equipped with voice interfaces have become the hubs
of mainstream smart home platforms. For example, in the Amazon Alexa smart
home platform, Amazon Echo smart speakers not only act as the voice interface
but also integrate the hub’s functions. As long as the user is within the
sound-receiving range of the smart speaker, they can remotely control household
appliances or query information. To improve the receiving quality of users’ voice
commands, current smart speakers use a microphone array. In addition, the
use of microphone arrays helps determine the direction of users to provide more
fine-grained services. After receiving the audio, the smart speaker reports it to the
cloud platform. The cloud platform uses speech recognition algorithms to
transcribe the user’s voice and guides the operation of smart devices accordingly.
For example, when the user says the command “open the window,” the Amazon
Echo device transmits the audio to the cloud platform; after analysis, the smart
application sends instructions to the door and window sensors to open the
corresponding windows.

Fig. 1.2 An illustration of the voice interface in the smart home scenario
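The recognize-then-dispatch flow described above can be sketched as follows. This is a minimal sketch with hypothetical names (`recognize`, `INTENTS`, the audio placeholder); real platforms run a full speech recognition model in the cloud, which is stubbed out here.

```python
# Minimal sketch of the cloud-side flow: the speaker uploads audio, the
# cloud transcribes it, and a smart application maps the transcript to a
# device instruction. Speech recognition is stubbed with a lookup table.

def recognize(audio: bytes) -> str:
    # Placeholder for the cloud platform's speech recognition algorithm.
    return {b"<audio:open-window>": "open the window"}.get(audio, "")

INTENTS = {
    "open the window": ("window_actuator", "open"),
    "turn on the lamp": ("lamp", "on"),
}

def handle_voice_command(audio: bytes):
    text = recognize(audio)
    if text not in INTENTS:
        return None                      # unrecognized command: do nothing
    device, action = INTENTS[text]
    return {"device": device, "action": action}   # instruction sent via hub

print(handle_voice_command(b"<audio:open-window>"))
# {'device': 'window_actuator', 'action': 'open'}
```

Note that nothing in this pipeline checks *who* spoke the command, which is exactly the gap the spoofing attacks in Sect. 1.3.2 exploit.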
At present, smart voice interfaces (e.g., Amazon Echo [20], Google Home [28],
TmallGenie smart speaker [2]) are widely used in real life. Through the voice
interface, users can make phone calls, pay online, send messages, and open
applications. These operations include many sensitive ones (e.g., online shopping,
querying bank app information), so the security of the voice interface is
particularly important.
With the popularization of smart homes, two key problems need to be solved.
First, smart devices from different manufacturers need to be able to interact with
each other. Second, to realize household intelligence and automation, standardized
control logic must be formulated for different devices. The smart application
platform plays a key role in solving both. The platform can abstract devices that
adopt different communication protocols and manufacturing methods and provide
a simple, easy-to-use development interface for third-party developers. As shown
in Fig. 1.3, the characteristics of the smart application platform are summarized as
follows.
Abstracting Heterogeneous Devices in the Application Platform In the smart
home system, because smart home devices such as smart sensors and actuators are
mostly power-constrained devices that cannot undertake complex computation, the
application platform deployed in the cloud handles most of the computing tasks.
To enable smart devices with different manufacturers and communication
protocols to work together, the smart application platform abstracts smart devices
to realize the separation between devices and applications. Specifically, as shown
in Fig. 1.3, in the cloud platform, smart devices retain their basic attributes, such as
name and function, and the same functions from different devices are given the
same function attributes in the platform. For example, in the Samsung SmartThings
smart home platform, ZigBee door/window sensors and Z-Wave door/window
sensors are connected to the cloud through a hub, and their switching functions are
abstracted as the same attribute (motion.active). This allows developers to treat a
device as a black box and conveniently develop relevant applications without
knowing its details.
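The device-abstraction idea can be illustrated with a small sketch. All class and method names here are hypothetical (real SmartThings device handlers differ); the point is that two vendor-specific interfaces are wrapped behind one uniform attribute, so an application never touches the protocol details.

```python
# Sketch of device abstraction: sensors from different vendors/protocols
# are wrapped so that applications see one uniform attribute.

class ZigbeeContactSensor:
    def read_register(self):        # vendor-specific ZigBee interface
        return 0x01                 # 0x01 = opened

class ZWaveContactSensor:
    def poll(self):                 # vendor-specific Z-Wave interface
        return "OPEN"

class AbstractContactSensor:
    """Uniform view: applications only see .is_open, never the protocol."""
    def __init__(self, device):
        self._device = device

    @property
    def is_open(self) -> bool:
        if isinstance(self._device, ZigbeeContactSensor):
            return self._device.read_register() == 0x01
        if isinstance(self._device, ZWaveContactSensor):
            return self._device.poll() == "OPEN"
        raise TypeError("unsupported device")

# An application works identically against both physical devices:
for dev in (ZigbeeContactSensor(), ZWaveContactSensor()):
    print(AbstractContactSensor(dev).is_open)   # True, True
```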
Automatically Running Application-Driven Devices At the smart application
layer, developers build customized applications and deploy them in the cloud
to realize remote control of smart devices in the physical world. Users can install
these applications to achieve specific functions. Smart applications generally
follow the “if-this-then-that” control logic and control various smart devices
through hubs: for example, when the user enters the home, turn on the lamp. In
this way, developers can make smart devices more intelligent by building diverse
smart applications. Smart application platforms have now become very popular.
Different platforms name smart applications slightly differently: Samsung
SmartThings, Amazon Alexa, and TmallGenie call them SmartApps, skills, and
voice skills, respectively, but these applications follow similar working logic.
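The “if-this-then-that” control logic above can be sketched as a tiny event dispatcher. Names (`dispatch`, the event strings) are hypothetical illustrations, not any platform’s real API.

```python
# Sketch of "if-this-then-that" control logic: a rule binds a trigger
# event to an action, and the hub-side loop dispatches incoming events.

actions_log = []

def turn_on_lamp(event):
    actions_log.append("lamp:on")

rules = [
    # (trigger event name, action callback)
    ("user.entered_home", turn_on_lamp),
]

def dispatch(event_name):
    """Run every rule whose trigger matches the incoming device event."""
    for trigger, action in rules:
        if trigger == event_name:
            action(event_name)

dispatch("user.entered_home")   # the example rule from the text fires
dispatch("user.left_home")      # no rule matches; nothing happens
print(actions_log)              # ['lamp:on']
```

Because rules fire automatically on reported events, a forged event (Sect. 1.3.3) is enough to trigger real physical actions.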
In short, the smart application platform has flourished by abstracting and
intelligently driving smart devices. However, users may install malicious
applications published by attackers, which brings serious security risks. Therefore,
it is urgent to detect and monitor application misbehavior in real time.
As shown in Fig. 1.4, there are different attack manners in the smart home.
According to the smart home architecture described in Sect. 1.2, this monograph
reviews and summarizes the relevant security research status from the three levels
of “terminal device—voice interface—application platform.”

Fig. 1.4 The security challenges faced by the smart home system
This subsection first introduces the relevant research at the terminal device level
and summarizes the research challenges. For the smart home’s terminal device
layer, traditional security threats are mainly conducted from the perspective of the
device, including jamming attacks at the signal level and intrusion attacks at the
physical device level. In the smart home scenario, terminal devices also face more
serious user privacy attacks. This monograph focuses on the wireless side-channel
attack, a new privacy threat to terminal devices in the IoT architecture. A
side-channel attack is a form of indirect attack: it targets a user or system based on
side-channel information that is not directly related to the target. In the smart home
network, wireless communication allows different devices to work together
automatically. However, due to the openness of the wireless channel, it is very easy
for attackers to sniff the communication and use the side-channel information to
threaten system security and user privacy. Academic research on side-channel
attacks mainly covers physical layer wireless sensing attacks and network layer
information inference attacks, summarized as follows:
• Side-channel attacks in the physical layer. There are a large number of wireless
devices in the smart home system. Since the attenuation, phase change, and other
characteristics experienced by a wireless signal during propagation are closely
related to the movement of people in the smart home environment, an attacker
can infer the user’s private information from them. At present, attackers can use
the physical layer information of various signals, including Wi-Fi, ultrasound,
visible light, and millimeter wave, as side-channel information to carry out
attacks. For example, Wang et al. [31] monitored the user’s movements (e.g.,
running, jumping, and lying down) using Wi-Fi signals collected by an Intel
5300 network card, which poses a threat to the user’s privacy. Ma et al. [19]
recognized the user’s gestures near solar equipment by analyzing the pattern of
the photocurrent. Li et al. [18] proposed WaveSpy, which remotely collects the
state response of an LCD screen through a millimeter-wave probe and remotely
reads the screen content. Privacy attack mechanisms based on such side-channel
information pose a serious security threat to the smart home network.
• Side-channel attacks in the network layer. Researchers have found that even if
the communication content is encrypted, information related to user privacy can
still be obtained by analyzing network layer traffic. Taking the commonly used
SSL/TLS mechanism as an example, although there is currently no technique to
directly break the TLS protocol, by analyzing TLS traffic one can obtain
information about the sender, the receiver, and the transmitted content without
decrypting the payload. For example, Li et al. [17] pointed out that encrypted
traffic generated during communication can be used by attackers to guess a
user’s gender, age, and education level. Panchenko et al. [22] pointed out that by
analyzing the inflow and outflow characteristics of a user’s data packets, it is
possible to infer, with 93% accuracy, which of more than 100 websites the user
visited. These facts show that even if traffic is encrypted, a large amount of
information can still be obtained by analyzing the metadata.
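To make the metadata-only inference concrete, here is a toy illustration with synthetic numbers (the features, labels, and nearest-centroid rule are hypothetical stand-ins for the traffic classifiers used in the literature): per-flow packet-size statistics alone separate two activities even though no payload is ever decrypted.

```python
# Toy illustration of network-layer side-channel inference: per-flow
# packet-size metadata can separate activities despite encryption.
# A nearest-centroid rule stands in for real traffic classifiers.

# Feature = (mean outgoing packet size, mean incoming packet size).
training = {
    "video_streaming": [(120, 1400), (130, 1380), (110, 1420)],
    "voice_command":   [(300, 200),  (320, 180),  (280, 220)],
}

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

centroids = {label: centroid(pts) for label, pts in training.items()}

def classify(flow_features):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(flow_features, centroids[label]))

# An observed, still-encrypted flow with small outgoing and large
# incoming packets is confidently labeled:
print(classify((125, 1390)))   # video_streaming
```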
However, serious challenges remain in side-channel privacy attacks and defenses
at the physical and network layers of the terminal device. In essence, current
research on side-channel privacy attacks treats the physical layer and the network
layer in isolation. At the physical layer, the lack of information from the network
layer makes it impossible to achieve fine-grained attacks and obtain more refined
results. Conversely, relying only on network-layer side-channel information cannot
reveal the user’s physical behavior. For example, wireless physical layer
information can expose the user’s movement and location at the terminal, but
without network layer information it is impossible to determine when the user
enters highly sensitive information such as a login or payment password. Similarly,
relying only on network-layer side-channel information, an attacker can hardly
find such attack targets due to the protection of the encryption mechanism.
However, by fusing side-channel information from both the physical and network
layers, an attacker is very likely to exceed the original attack capability and bring
huge privacy risks to users. Therefore, in terms of attack, it is urgent to study
cross-layer privacy attack mechanisms; in terms of defense, lightweight and
flexible mechanisms are lacking. The first research challenge of this monograph is
thus to conduct cross-layer research on the physical layer and network layer
wireless side-channel attacks faced by smart home terminals and design a
universal and convenient defense mechanism.
In the smart home network, the voice interface stands out from other user
interfaces because of its convenience, ease of collection, and high accuracy, and it
has become the preferred user interface of many mainstream smart home platforms.
However, the voice interface is vulnerable to various voice spoofing attacks,
including the classic replay attack [12] and new spoofing attacks at the hardware
and software levels.
• At the voice interface software level, the deep learning models used in speech
recognition and speaker recognition have been proven vulnerable to adversarial
examples. For example, Carlini et al. [8] pointed out that a voice command can
be converted into noise-like audio that the human auditory system perceives as
noise, yet the voice recognition system still correctly recognizes and executes
the corresponding malicious command. Yuan et al. [33] further proposed the
CommanderSong attack, in which malicious voice instructions are cleverly
embedded into a song. To a human listener, the modified song sounds little
different from the original and raises no alarm, but the voice recognition system
can still recognize the command. For speaker recognition, Zhang et al. [35]
proposed the VMASK attack, which embeds generated adversarial perturbations
into the audio of non-registered users and defeats Apple Siri’s speaker
recognition of registered users.
• At the voice interface hardware level, the non-linearity of the microphone’s
amplitude–frequency response enables the microphone to demodulate the
high-frequency part of a signal into the low-frequency band. Accordingly, Roy
et al. [24] proposed the Backdoor attack, which realizes a denial-of-service
attack on the voice interface by embedding audio into ultrasound. Zhang et
al. [34] proposed the “Dolphin attack,” which injects malicious voice commands
into high-frequency ultrasound signals and induces the voice interface (e.g.,
Apple Siri, Amazon Alexa) to perform sensitive operations that the user’s ears
cannot detect.
Voice spoofing attacks enable attackers to query users' sensitive information
(e.g., schedule details) through voice interfaces and even force smart devices
to perform improper actions (e.g., opening the door lock when users leave
home), posing a serious threat to the security and privacy of users' smart
home systems. To this end, researchers have proposed various defense strategies,
almost all of which exploit the fact that the sound in a spoofing attack is
played by an electronic device (e.g., the high-quality loudspeaker in the replay
attack [30] and the ultrasonic speaker in DolphinAttack [34]). Therefore, different
physical characteristics between humans and machines can be used as “liveness”
factors. Existing liveness detection schemes can be divided into two categories: two-
factor authentication and passive liveness detection schemes. Their characteristics
and limitations are summarized as follows:
• Two-factor authentication-based liveness detection. Two-factor authentication
means that, in addition to the audio collected by the voice interface,
information highly correlated with the user's spoken voice serves as a liveness
feature to distinguish the legitimate user's voice from spoofed samples
generated by an attacker. Many kinds of information can serve as the second
factor, including image or video information collected by a camera [10],
electromagnetic field changes extracted
12 1 Introduction
from the loudspeaker device [9], data collected by the accelerometer of the
user's wearable device [13], and the ultrasonic Doppler frequency shift caused
by the user's mouth movement [36]. It should be noted that existing two-factor
authentication schemes require users to carry special sensor devices (e.g., the
liveness detection mechanisms based on images [10] and acceleration [13]) or to
perform specific actions to collect the liveness information (e.g., the
detection mechanism based on ultrasound [36]). The effectiveness and
practicality of these approaches in smart home environments are therefore
limited, and a more convenient defense strategy is urgently needed.
• Voice-based passive liveness detection. Unlike two-factor authentication,
passive liveness detection analyzes only the voice signal collected by the voice
interface to determine whether the voice comes from a live human or a spoofing
device. Its advantage is that it depends only on the voice interface itself and
requires no additional equipment to obtain second-factor information, so it has
a wider scope of application. Shiota et al. [26] and Wang et al. [29] use the
"pop" noise produced by human breath while speaking to distinguish voice
commands uttered by real people from those played by devices. Blue et al. [6]
and Ahmed et al. [1] identify spoofing attacks by analyzing the spectral
patterns of mono voice signals, achieving lightweight authentication. Yan et
al. [32] proposed using two microphones to collect voice signals and defined the
concept of a "fieldprint" between the two microphone signals to detect spoofing
attacks. However, voice-signal features vary with the sound propagation channel,
and the fieldprint-based scheme [32] requires the user to remain in a fixed
position to keep its features robust; as a result, current passive detection
schemes degrade in complicated scenes (e.g., when the user walks or gestures
change). Smart speakers equipped with microphone arrays are now widely used as
the voice interface of smart homes, yet passive liveness detection based on
such arrays remains to be studied.
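As a toy illustration of the spectrum-based passive approach (synthetic signals and a threshold of our own choosing, not the actual features of [6] or [1]): small loudspeakers attenuate very low frequencies, so the fraction of spectral energy below roughly 120 Hz can separate a live voice from a replayed one.

```python
import numpy as np

fs = 16_000
t = np.arange(0, 1, 1/fs)

def low_freq_ratio(sig, cutoff=120):
    """Fraction of spectral energy below `cutoff` Hz (illustrative feature)."""
    spec = np.abs(np.fft.rfft(sig))**2
    freqs = np.fft.rfftfreq(len(sig), 1/fs)
    return spec[freqs < cutoff].sum() / spec.sum()

# Synthetic stand-ins: a "live" voice with a strong 100 Hz component, and a
# "replayed" copy whose lowest frequencies the loudspeaker has attenuated
live   = np.sin(2*np.pi*100*t) + 0.5*np.sin(2*np.pi*300*t)
replay = 0.05*np.sin(2*np.pi*100*t) + 0.5*np.sin(2*np.pi*300*t)

print(low_freq_ratio(live) > low_freq_ratio(replay))  # True
```

Real schemes combine many such spectral statistics with a trained classifier rather than a single hand-set cutoff.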
This monograph studies the security mechanisms of the voice interface. Since
two-factor authentication and passive liveness detection differ considerably in
detection principle, implementation process, and research difficulty, this
monograph studies the two schemes separately. The corresponding research
challenges are as follows.
For two-factor authentication, the second research challenge in this monograph
is how to exploit the wireless signals that are ubiquitous in smart homes as an
efficient second authentication factor, so that users can defend against voice
spoofing attacks without carrying any device. For voice-based passive liveness
detection, the third research challenge is how to use the microphone array in
the smart speaker to achieve robust and effective passive liveness detection by
relying only on the collected multi-channel voice signals.
The smart application platform carries out much of the computation in current
smart home systems. To support more IoT services, most smart home application
platforms support third-party applications and realize automatic collaboration
of multiple devices according to the applications' code logic. However, defects
in smart home platforms open the door to malicious behaviors by smart
applications. For example, the Samsung SmartThings platform abstracts each
device's functions as capabilities and regulates application behavior through
this capability model. Because the capability model is coarse-grained, however,
smart applications may obtain more permissions than they need, which can lead
to improper behavior. The malicious behavior of applications falls into two
categories: over-privileged access and event spoofing. Over-privileged access
refers to a malicious application automatically controlling a device through
the cloud without user authorization (e.g., automatically opening the door
lock). Event spoofing refers to a malicious application falsely reporting an
event to the cloud to trigger subsequent abnormal operations (e.g., forging
"high temperature" readings from a temperature sensor to induce the user's
smart air conditioner to turn on automatically). In short, application-level
flaws pose serious security and privacy risks to smart home systems and users.
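The over-privilege problem can be sketched with a toy capability model (the names and granularity are hypothetical, not SmartThings' actual API): a coarse-grained grant exposes every command bundled in a capability, while a fine-grained grant exposes only what was requested.

```python
# Commands bundled under a hypothetical "lock" capability
LOCK_CAPABILITY = {"lock", "unlock"}

def coarse_grant(requested):
    """Coarse-grained model: any command of the capability grants them all."""
    return LOCK_CAPABILITY if requested & LOCK_CAPABILITY else set()

def fine_grant(requested):
    """Fine-grained model: only the commands actually requested."""
    return requested & LOCK_CAPABILITY

app_request = {"lock"}                        # the app only needs to lock the door
print("unlock" in coarse_grant(app_request))  # True: over-privileged
print("unlock" in fine_grant(app_request))    # False
```

Under the coarse model, a seemingly benign "auto-lock" app silently gains the ability to unlock the door as well.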
At present, the monitoring and prevention of malicious behaviors of smart
applications are mainly divided into three types:
• Introducing a control mechanism for sensitive data flows by modifying the
smart home platform. For example, Fernandes et al. [15] proposed the FlowFence
system, which intercepts all data flows and requires data consumers to declare
their intended use of sensitive data before accessing it.
• Designing a context-based permission control system to achieve fine-grained
access control. For example, the ContexIoT system proposed by Jia et al. [16]
can support fine-grained identification of sensitive behaviors and report sensitive
behaviors and context information to users.
• Improving the authorization mechanism of smart applications by analyzing
application source code, comments, and description documents. Tian et al. [27]
proposed SmartAuth, which determines whether the application logic is
reasonable through static analysis of the smart application's code.
However, existing solutions either need to modify the platform itself (e.g.,
FlowFence [15], ContexIoT [16]), inject patches into smart applications (e.g.,
SmartAuth [27]), or even modify communication protocols and design new systems,
which leaves these solutions with limited generality and usability. There is an
urgent need for a novel method that allows third-party defenders—in addition
to smart home platform suppliers, smart device manufacturers, and application
developers—to monitor smart home applications without making any changes to
existing platforms. Therefore, the last research challenge of this monograph is
to monitor the misbehavior of smart applications without modifying existing
platforms, applications, or communication protocols.
To address the problem that users must carry sensor devices for two-factor
authentication of the voice interface, this monograph proposes WSVA, a voice
liveness detection system based on Wi-Fi signals. Unlike traditional two-factor
liveness detection schemes, WSVA uses the physical layer information of wireless
signals generated by Wi-Fi devices in the IoT environment as the liveness
factor, without requiring users to carry any additional devices or sensors.
Since the user's mouth movement modulates the channel state information (CSI)
of the wireless signal, the fluctuation of CSI reveals whether a voice command
received by the voice interface was actually issued by the user. We evaluate
WSVA with various real voice commands and spoofing commands in different
scenarios and show that WSVA achieves good accuracy and scalability. The main
contributions of this part include the following:
• This monograph successfully characterizes the correlation between CSI changes
in wireless signals and the user’s mouth movements and builds a liveness
detection mechanism based on two different types of signals: voice and CSI.
• WSVA proposed in this monograph is a two-factor liveness detection mechanism
that does not need additional devices and can resist voice spoofing attacks with
high efficiency.
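The core idea of correlating audio energy with CSI fluctuation can be sketched on synthetic data (the signals, rates, and thresholds below are illustrative assumptions, not WSVA's actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 100                       # assumed CSI sampling rate (Hz)
t = np.arange(0, 3, 1/fs)

# A shared "mouth movement" envelope drives both audio energy and live CSI
speech_env = np.clip(np.sin(2*np.pi*1.5*t), 0, None)

audio_energy = speech_env + 0.05*rng.standard_normal(t.size)
csi_live     = 0.3*speech_env + 0.02*rng.standard_normal(t.size)  # live user
csi_replay   = 0.02*rng.standard_normal(t.size)                   # loudspeaker replay

def liveness_score(audio, csi):
    """Pearson correlation between audio energy and CSI fluctuation."""
    return np.corrcoef(audio, csi)[0, 1]

print(liveness_score(audio_energy, csi_live) > 0.8)         # True
print(abs(liveness_score(audio_energy, csi_replay)) < 0.3)  # True
```

A live utterance yields a high audio–CSI correlation because the mouth motion co-occurs with the sound, while a loudspeaker replay leaves the CSI uncorrelated.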
To address the low robustness and flexibility of current passive liveness
detection based on voice signals, this monograph designs ArrayID, a passive
liveness detection system that uses the microphone arrays widely deployed in
mainstream smart speakers. Because the microphones in the array occupy
different positions, the multi-channel audio collected by ArrayID exhibits
greater diversity. By exploiting this diversity, ArrayID extracts more liveness
factors related to the target user and improves the robustness and accuracy of
liveness detection. Specifically, ArrayID uses the correlation between
different channels to eliminate the degradation of liveness detection
performance caused by factors such as air channel and user position changes.
This monograph then constructs a dataset containing 38,720 multi-channel voice
commands to evaluate the effectiveness of the proposed ArrayID. The main
contributions of this part include the following:
• This monograph theoretically analyzes the principle behind passive liveness
detection and designs the ArrayID to prevent voice spoofing attacks. By using
only the audio collected from the smart speaker, ArrayID does not require the
user to carry any equipment or perform other operations.
• Experimental results show that ArrayID outperforms existing schemes and
remains robust under many factors (such as distance, direction, spoofing
device, and background noise).
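One way to exploit cross-channel structure can be sketched as follows (a hypothetical feature of our own construction, not ArrayID's actual feature set): pairwise log-ratios of per-channel magnitude spectra form a vector that reflects the geometry between the sound source and the array.

```python
import numpy as np

def array_feature(channels):
    """Pairwise log-ratios of per-channel magnitude spectra.
    channels: (n_mics, n_samples) multi-channel recording."""
    spectra = np.abs(np.fft.rfft(channels, axis=1)) + 1e-9
    n = channels.shape[0]
    feats = [np.log(spectra[i] / spectra[j])
             for i in range(n) for j in range(i + 1, n)]
    return np.concatenate(feats)

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
# Hypothetical 4-mic array: each channel a delayed, attenuated copy of one source
channels = np.stack([np.roll(x, d) * g
                     for d, g in [(0, 1.0), (2, 0.9), (4, 0.8), (6, 0.7)]])
print(array_feature(channels).shape)  # (3078,): 6 mic pairs x 513 frequency bins
```

Because ratios between channels cancel out the common source and propagation effects, such features are less sensitive to the air channel than single-channel spectra.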
Considering that current application misbehavior detection methods need to
modify smart applications or platforms, this monograph proposes HoMonit, which
operates independently of the smart home system. HoMonit uses side-channel
inference to monitor application behavior based on encrypted wireless
traffic in smart homes. The core idea of HoMonit is that the behavior of each
smart home application can be described by a deterministic finite automaton (DFA)
model, where each state in DFA represents the state of the application program
and the corresponding smart device, and the transition between states represents the
interaction between the application program and the device. To this end,
HoMonit extracts a benign DFA from the source code of the benign version of the
application; it then infers the operations and device interactions of the
running application by observing the sizes and intervals of encrypted wireless
packets and converts these observations into DFA transitions. The DFA extracted
from the application is then matched against the behavior inferred from the
wireless traffic side channel; a matching failure indicates that the running
application behaves abnormally.
has abnormal behavior. This monograph implements HoMonit and demonstrates its
effectiveness in detecting smart applications with abnormal behavior. At the same
time, considering that the wireless side-channel information may contain sensitive
behaviors in smart homes, this monograph further designs a privacy enhancement
module based on traffic obfuscation for HoMonit. The privacy enhancement module
can effectively protect privacy by increasing information entropy while retaining
HoMonit’s ability to monitor applications’ misbehavior. The main contributions of
this study are:
• This monograph points out that wireless side-channel information can be used
not only for attacks but also for defense against malicious behavior, and it
proposes, for the first time, a device-state automaton matching algorithm based
on the wireless traffic side channel.
• This monograph also takes into account the risk of privacy leakage caused by
HoMonit and designs a privacy enhancement module based on traffic obfusca-
tion, which helps protect the privacy of the smart home system while ensuring
detection ability.
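A minimal sketch of the DFA-matching idea (the packet sizes, events, and automaton below are all hypothetical):

```python
# Map observed encrypted-packet sizes to inferred device events (side channel)
SIZE_TO_EVENT = {54: "motion_active", 62: "switch_on", 70: "switch_off"}

# Benign DFA extracted from the app's source: state -> {event: next_state}
DFA = {
    "idle":      {"motion_active": "triggered"},
    "triggered": {"switch_on": "lit"},
    "lit":       {"switch_off": "idle"},
}

def monitor(packet_sizes, start="idle"):
    """Return True iff the inferred event sequence is accepted by the benign DFA."""
    state = start
    for size in packet_sizes:
        event = SIZE_TO_EVENT.get(size)
        if event is None or event not in DFA.get(state, {}):
            return False  # mismatch: suspected misbehavior
        state = DFA[state][event]
    return True

print(monitor([54, 62, 70]))  # True: motion -> light on -> light off
print(monitor([62]))          # False: switch_on without motion (event spoofing)
```

The second trace is rejected because turning the switch on without a preceding motion event has no corresponding transition in the benign automaton.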
The above four research contents address four important security challenges
across the three levels of the smart home system. They are also interrelated
and together constitute the key security technologies of the smart home
network.
1.4 Aims and Organization of This Monograph 17
As shown in Fig. 1.5, this monograph is divided into seven chapters. This chapter
is an introduction, which introduces the research background of the monograph, the
architecture of the smart home system, the security challenges summarized around
the logic of “terminal equipment—voice interface—application platform,” and the
main research contents. This monograph is organized as follows:
Chapter 2 reviews related research work on the security challenges addressed
in this monograph. At the terminal device level, it surveys wireless
side-channel attacks against terminal devices, including physical layer and
network layer side-channel attacks. At the voice interface level, it introduces
the security threats identified by researchers and reviews existing work on
two-factor-based liveness detection and voice-based passive liveness detection.
At the application platform level, it summarizes smart applications'
misbehavior and the defense mechanisms proposed by researchers.
Chapter 3 presents the cross-layer privacy attack mechanism WindTalker.
WindTalker uses side-channel information at the wireless physical layer and
network layer to infer the user's payment password. A defense mechanism based
on signal obfuscation is then introduced. Finally, WindTalker and the
corresponding defense mechanism are evaluated in real environments, and their
effectiveness is demonstrated.
References
1. Ahmed, M.E., Kwak, I.Y., Huh, J.H., Kim, I., Oh, T., Kim, H.: Void: a fast and light
voice liveness detection system. In: 29th USENIX Security Symposium (USENIX Secu-
rity 20), pp. 2685–2702. USENIX Association (2020). https://www.usenix.org/conference/
usenixsecurity20/presentation/ahmed-muhammad
2. AliGenie: Tmallgenie (2021). https://aligenie.com/
3. Amazon: Amazon Alexa developer (2019). https://developer.amazon.com/alexa
4. Apple: Homekit (2018). https://www.apple.com/ios/home/
5. Associates, P.: Top 10 consumer IoT trends in 2017 (2017). http://www.parksassociates.com/
whitepapers/top10-2017
6. Blue, L., Vargas, L., Traynor, P.: Hello, is it me you’re looking for? Differentiating between
human and electronic speakers for voice interface security. In: Proceedings of the 11th ACM
Conference on Security & Privacy in Wireless and Mobile Networks, pp. 123–133. ACM
(2018)
7. Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., Zhou,
W.: Hidden voice commands. In: Proceedings of USENIX Security Symposium (USENIX
Security), pp. 513–530 (2016)
8. Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., Zhou,
W.: Hidden voice commands. In: 25th USENIX Security Symposium (USENIX Security
16), pp. 513–530. USENIX Association, Austin (2016). https://www.usenix.org/conference/
usenixsecurity16/technical-sessions/presentation/carlini
9. Chen, S., Ren, K., Piao, S., Wang, C., Wang, Q., Weng, J., Su, L., Mohaisen, A.: You can hear
but you cannot steal: Defending against voice impersonation attacks on smartphones. In: 2017
IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 183–195
(2017). https://doi.org/10.1109/ICDCS.2017.133
10. Chen, Y., Sun, J., Jin, X., Li, T., Zhang, R., Zhang, Y.: Your face your heart: secure mobile face
authentication with photoplethysmograms. In: Proceedings of IEEE Conference on Computer
Communications (INFOCOM), pp. 1–9 (2017)
11. Choi, K., Son, Y., Noh, J., Shin, H., Choi, J., Kim, Y.: Dissecting customized protocols:
automatic analysis for customized protocols based on IEEE 802.15.4. In: Proceedings of the
9th ACM Conference on Security & Privacy in Wireless and Mobile Networks, WiSec ’16,
pp. 183–193. ACM, New York (2016). https://doi.org/10.1145/2939918.2939921
12. Diao, W., Liu, X., Zhou, Z., Zhang, K.: Your voice assistant is mine: how to abuse speakers
to steal information and control your phone. In: Proceedings of the 4th ACM Workshop on
Security and Privacy in Smartphones & Mobile Devices (SPSM), pp. 63–74 (2014). https://
doi.org/10.1145/2666620.2666623
13. Feng, H., Fawaz, K., Shin, K.G.: Continuous authentication for voice assistants. In: Proceed-
ings of the 23rd Annual International Conference on Mobile Computing and Networking
(MobiCom), pp. 343–355 (2017). https://doi.org/10.1145/3117811.3117823
14. Fernandes, E., Jung, J., Prakash, A.: Security analysis of emerging smart home applications.
In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 636–654 (2016). https://doi.org/
10.1109/SP.2016.44
15. Fernandes, E., Paupore, J., Rahmati, A., Simionato, D., Conti, M., Prakash, A.: FlowFence:
Practical data protection for emerging IoT application frameworks. In: USENIX Security
Symposium (USENIX Security) (2016)
16. Jia, Y.J., Chen, Q.A., Wang, S., Rahmati, A., Fernandes, E., Mao, Z.M., Prakash, A.:
ContexIoT: Towards providing contextual integrity to appified IoT platforms. In: The Network
and Distributed System Security Symposium (NDSS) (2017)
17. Li, H., Xu, Z., Zhu, H., Ma, D., Li, S., Xing, K.: Demographics inference through wi-fi
network traffic analysis. In: IEEE International Conference on Computer Communications
(INFOCOM) (2016)
18. Li, Z., Ma, F., Rathore, A.S., Yang, Z., Chen, B., Su, L., Xu, W.: Wavespy: remote and through-
wall screen attack via mmwave sensing. In: 2020 IEEE Symposium on Security and Privacy
(SP), pp. 217–232 (2020). https://doi.org/10.1109/SP40000.2020.00004
19. Ma, D., Lan, G., Hassan, M., Hu, W., Upama, M.B., Uddin, A., Youssef, M.: Solargest:
Ubiquitous and battery-free gesture recognition using solar cells. In: The 25th Annual
International Conference on Mobile Computing and Networking, MobiCom ’19. Association
for Computing Machinery, New York (2019). https://doi.org/10.1145/3300061.3300129
20. Makwana, D.: Amazon echo smart speaker (3rd gen) review (2020). https://www.mobigyaan.
com/amazon-echo-smart-speaker-3rd-gen-review
21. Market Research Future: Voice Assistant Market - Information by Technology, Hardware and
Application - Forecast till 2025 (2020). https://www.marketresearchfuture.com/reports/voice-
assistant-market-4003
22. Panchenko, A., Lanze, F., Pennekamp, J., Engel, T., Zinnen, A., Henze, M., Wehrle, K.:
Website fingerprinting at internet scale. In: NDSS (2016)
23. Research and Markets: Global Smart Home Market with COVID-19 Impact Analysis by
Product (Lighting Control, Security & Access Control, HVAC Control, Smart Speaker, Smart
Kitchen, Smart Furniture), Software & Services, Sales Channel, and Region - Forecast
to 2026 (2021). https://www.researchandmarkets.com/reports/5448441/global-smart-home-
market-with-covid-19-impact
24. Roy, N., Hassanieh, H., Roy Choudhury, R.: Backdoor: Making microphones hear inaudible
sounds. In: Proceedings of the 15th ACM Annual International Conference on Mobile Systems,
Applications, and Services (MobiSys), pp. 2–14 (2017). https://doi.org/10.1145/3081333.
3081366
25. Samsung: SmartThings (2021). https://www.smartthings.com
26. Shiota, S., Villavicencio, F., Yamagishi, J., Ono, N., Echizen, I., Matsui, T.: Voice liveness
detection algorithms based on pop noise caused by human breath for automatic speaker
verification. In: Sixteenth Annual Conference of the International Speech Communication
Association (2015)
27. Tian, Y., Zhang, N., Lin, Y.H., Wang, X., Ur, B., Guo, X., Tague, P.: SmartAuth: user-centered
authorization for the internet of things. In: USENIX Security Symposium (USENIX Security)
(2017)
28. Tillman, M.: Google home max review: Cranking smart speaker audio to the max
(2019). https://www.pocket-lint.com/smart-home/reviews/google/143184-google-home-max-
review-specs-price
29. Wang, Q., Lin, X., Zhou, M., Chen, Y., Wang, C., Li, Q., Luo, X.: VoicePop: a pop noise
based anti-spoofing system for voice authentication on smartphones. In: IEEE INFOCOM
2019-IEEE Conference on Computer Communications, pp. 2062–2070. IEEE (2019)
30. Wang, S., Cao, J., He, X., Sun, K., Li, Q.: When the differences in frequency domain are
compensated: Understanding and defeating modulated replay attacks on automatic speech
recognition. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and
Communications Security, CCS ’20, pp. 1103–1119. Association for Computing Machinery
(2020). https://doi.org/10.1145/3372297.3417254
31. Wang, W., Liu, A.X., Shahzad, M., Ling, K., Lu, S.: Understanding and modeling of WiFi
signal based human activity recognition. In: Proceedings of the 21st Annual International
Conference on Mobile Computing and Networking, pp. 65–76 (2015)
32. Yan, C., Long, Y., Ji, X., Xu, W.: The catcher in the field: a fieldprint based spoofing
detection for text-independent speaker verification. In: Proceedings of the 2019 ACM SIGSAC
Conference on Computer and Communications Security, CCS ’19, pp. 1215–1229. Association
for Computing Machinery (2019). https://doi.org/10.1145/3319535.3354248
33. Yuan, X., Chen, Y., Zhao, Y., Long, Y., Liu, X., Chen, K., Zhang, S., Huang, H., Wang,
X., Gunter, C.A.: CommanderSong: a systematic approach for practical adversarial voice
recognition. In: 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64.
USENIX Association, Baltimore (2018). https://www.usenix.org/conference/usenixsecurity18/
presentation/yuan-xuejing
34. Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., Xu, W.: DolphinAttack: inaudible voice
commands. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Com-
munications Security (CCS), pp. 103–117 (2017). https://doi.org/10.1145/3133956.3134052
35. Zhang, L., Meng, Y., Yu, J., Xiang, C., Falk, B., Zhu, H.: Voiceprint mimicry attack towards
speaker verification system in smart home. In: IEEE INFOCOM 2020—IEEE Conference on
Computer Communications, pp. 377–386 (2020). https://doi.org/10.1109/INFOCOM41043.
2020.9155483
36. Zhang, L., Tan, S., Yang, J.: Hearing your voice is not enough: an articulatory gesture
based liveness detection for voice authentication. In: Proceedings of the 2017 ACM SIGSAC
Conference on Computer and Communications Security (CCS), pp. 57–71 (2017). https://doi.
org/10.1145/3133956.3133962
Chapter 2
Literature Review of Security in Smart Home Network
This subsection reviews the side-channel privacy disclosure risks to which
terminal devices in smart homes are vulnerable. A side-channel attack is an
indirect attack in which the attacker exploits information not directly related
to the attacked object to achieve the attack goal. In the smart home
environment, the popularity of wireless signals promotes the interconnection of
intelligent devices. However, due to the openness of wireless channels,
attackers can easily sniff wireless communications and use side-channel
information to threaten system security and user privacy. This section reviews
side-channel attacks related to wireless communication and then introduces
other side-channel attacks.
In the smart home environment, smart devices support diverse wireless
communication protocols (e.g., Bluetooth, Wi-Fi, ZigBee, Z-Wave) as well as
emerging media such as millimeter wave and visible light. To ensure normal
wireless communication, devices must interact with each other at the physical
layer. Physical layer information reflects changes in the surrounding
environment and can therefore be used to sense the environment and people.
However, the development of physical layer wireless sensing also brings
substantial privacy risks to terminal device users. This subsection first
introduces wireless sensing frameworks based on various media (e.g., Wi-Fi,
ultrasound, visible light, millimeter wave) and then summarizes the privacy
risks they cause.
Wi-Fi Based Wireless Sensing Technology and Related Privacy Risks The
physical layer information of Wi-Fi signals includes received signal strength
(RSS) and channel state information (CSI). In 2013, Adib et al. [1] showed that
RSS can be used to monitor the movement of people on the other side of a wall
and deployed the system on a software-defined radio platform. In 2015, Wang et
al. [58] derived a quantitative relationship between the CSI of Wi-Fi signals
on commercial devices and indoor user movements, establishing a mapping between
CSI changes and user activities (e.g., running, jumping, lying down) and thus
realizing activity monitoring. Zhang et al. [41] realized a long-term daily
health monitoring system for the elderly living alone using existing commercial
Wi-Fi equipment. Pierson et al. [44] proposed a fast localization algorithm for
a single Wi-Fi antenna, achieving a localization error of less than 14 cm. Qian
et al. and Zheng et al. proposed the Widar2.0 [45] and Widar3.0 [71] systems,
respectively, which use Wi-Fi to track human movement and recognize gestures.
These works not only advance wireless sensing technology but also open the door
for attackers to threaten user privacy through wireless sensing. Shi et
al. [47] showed that, in addition to software radio platforms (e.g., USRP) and
commercial network cards (e.g., the Intel 5300), the wireless CSI collected by
the user's own terminal devices in the IoT environment can achieve the same
functions. Therefore, by controlling the user's wireless device and obtaining
physical layer information, an attacker can infer the user's movement,
location, and other sensitive information.
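The basic principle behind such sensing can be sketched with a toy variance detector on synthetic CSI amplitudes (the waveforms and threshold are illustrative assumptions, not any cited system's parameters):

```python
import numpy as np

rng = np.random.default_rng(2)

def motion_detected(csi_amplitudes, threshold=0.05):
    """Flag motion when the variance of CSI amplitudes in a window exceeds
    a threshold (an assumed calibration value)."""
    return np.var(csi_amplitudes) > threshold

t = np.linspace(0, 5, 500)
static = 1.0 + 0.01*rng.standard_normal(500)                      # empty room
moving = 1.0 + 0.4*np.sin(2*np.pi*0.8*t) + 0.01*rng.standard_normal(500)

print(motion_detected(static))  # False
print(motion_detected(moving))  # True
```

Human motion perturbs the multipath channel and inflates the amplitude variance, which is exactly what makes the same signal useful to an eavesdropper.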
Sensing Technologies Based on Other Wireless Communication Media The
concept of wireless sensing has expanded from the initial Wi-Fi based
techniques to ubiquitous sensing based on millimeter wave, visible light,
ultrasound, and other media. Zhou et al. [72] use ultrasound to sense the
user's face and authenticate the user with a 3D facial contour. For visible
light sensing, Ma et al. [40] proposed recognizing gestures near solar devices
by analyzing photocurrent patterns: each gesture interferes with the light
incident on the solar panel in a unique way and thus leaves a unique signature
in the harvested photocurrent. Finally, the millimeter wave, as a novel
communication medium, has attracted extensive attention due to its high
frequency. In wireless temperature sensing, Chen et al. [12] proposed a
wireless temperature monitoring system based on the thermal scattering effect
of millimeter wave signals on cholesterol-based materials, whose molecular
modes differ at different ambient temperatures. Li et al. [35] presented
WaveSpy, an end-to-end portable through-wall screen eavesdropping system that
remotely collects the state response of an LCD screen via a millimeter wave
probe and reads the screen content. As devices with diverse communication
protocols become widespread in smart home systems, users will face ever
greater privacy risks.
Keystroke Behavior Inference Based on Wi-Fi Signals Among privacy
inference attacks, inferring the user's keystrokes on terminal devices is the
most sensitive, because keystroke behavior is highly correlated with
2.1 Side-channel Attacks Faced by Terminal Device 23
sensitive information such as the user's payment password. Using wireless
signals to infer keystroke behavior is device-free and non-intrusive, which
has attracted attention from the academic community. Ali et al. [4] proposed a
keystroke inference system called WiKey: with a Wi-Fi transmitter and receiver
placed near the user, the unique CSI waveforms generated when the user hits
the keyboard distinguish the user's keystrokes on an external keyboard. Zhang
et al. [67] proposed WiPass, which infers the user's graphical unlock pattern
on a mobile device. Tan et al. [52] proposed WiFinger, which uses CSI from
commercial devices to capture the fine-grained motion of users' fingers.
Compared with WindTalker in this monograph, these schemes rely only on
physical layer information and thus cannot determine when sensitive keystroke
input occurs.
Luzio et al. [19] and Cunche et al. [17] pointed out that even if the
communication content is encrypted, attackers may still obtain privacy-related
information by analyzing network layer traffic characteristics. Take the
widely used SSL/TLS mechanism as an example: although no practical technique
directly breaks the TLS protocol, analyzing the traffic characteristics of TLS
can still reveal the sender, the receiver, and information about the content
without decrypting the payload. Li et al. [33] pointed out that the encrypted
traffic generated during communication can be used by attackers to infer the
user's gender, age, and other attributes. Taylor et al. [53] proposed a method
that uses metadata and machine learning to detect the applications installed
on a device, with a detection success rate of more than 70%. Wang et al. [57]
analyzed anonymous network traffic and used the k-nearest neighbors algorithm
to infer the websites visited by users with 80% accuracy. Panchenko et
al. [43] analyzed the inflow and outflow characteristics of user packets and
used a support vector machine with a Radial Basis Function (RBF) kernel to
classify 100 websites with 93% accuracy. These results show that even when
traffic is encrypted, a large amount of information can still be obtained by
analyzing metadata.
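A toy sketch of metadata-based traffic classification (hypothetical traces and a simple nearest-neighbor matcher, far simpler than the k-NN and RBF-SVM attacks cited above):

```python
import numpy as np

def histogram_feature(sizes, bins=np.arange(0, 1600, 100)):
    """Normalized packet-length histogram as a traffic fingerprint."""
    h, _ = np.histogram(sizes, bins=bins)
    return h / max(h.sum(), 1)

# Hypothetical per-site profiles built from observed packet lengths
profiles = {
    "site_a": histogram_feature([1500]*20 + [60]*5),   # bulk-download pattern
    "site_b": histogram_feature([200]*15 + [600]*10),  # chatty-API pattern
}

def classify(trace):
    """Assign a new encrypted trace to the nearest known profile."""
    f = histogram_feature(trace)
    return min(profiles, key=lambda s: np.linalg.norm(profiles[s] - f))

print(classify([1500]*18 + [60]*6))  # site_a
```

No payload is inspected: packet lengths alone suffice to separate the two synthetic profiles, which is the essence of the attacks above.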
A common way to obtain network-layer side-channel information is to deploy
malicious Wi-Fi hotspots. In smart home scenarios, Wi-Fi hotspots are often
used in mobile environments where cellular networks are limited. Existing
works [14, 23, 30, 33, 34, 61] have proved the feasibility of deploying
malicious Wi-Fi hotspots. For example, when an attacker opens a passwordless
Wi-Fi hotspot on a smartphone and sets its SSID to the same name as a hotspot
in the user's home or office area (such as "Home" or "Company Free Wi-Fi"),
surrounding users will mistake it for a trusted hotspot and connect to it. In
this scenario, all of the user's wireless traffic flows through the malicious
hotspot, and the attacker can use this traffic to infer the user's private
information. Alan et al. [3] further extended the applicability of such
attacks: even with only the TCP/IP headers of traffic from the Android
application startup phase, an attacker can successfully identify the running
mobile application. Conti et al. [15] proposed a network traffic analysis
method that can not only infer the running applications but also identify
certain user operations within them.
Compared with the above attack scheme based on network layer traffic analysis,
WindTalker in Chap. 3 conducts cross-layer privacy analysis by integrating network
layer traffic and physical layer CSI information. Specifically, in the attack scenario
of WindTalker, the attacker creates a fake hotspot to attract the target user to connect.
Then, the attacker determines the sensitive input period by sniffing Wi-Fi traffic and
analyzes the CSI to infer the user's sensitive keystrokes.
In terms of defense methods, the Traffic Morphing technique proposed by Wright et
al. [60] disguises authentic traffic as traffic from other websites by constructing a
matrix transformation. Dyer et al. [22] proposed sending data packets at fixed time
intervals and with fixed sizes to hide metadata, but these methods impose a heavy
overhead on the system. In Chap. 3, this monograph proposes a lightweight defense
mechanism based on physical layer obfuscation of wireless signals.
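The fixed-size, fixed-interval idea can be sketched as follows; this is a minimal illustration of the class of padding defenses Dyer et al. [22] evaluated, not their implementation, and the cell size and tick values are invented:

```python
CELL = 1500   # constant on-the-wire packet size in bytes (illustrative)
TICK = 0.02   # constant send interval in seconds; a real sender sleeps TICK
              # between cells, so timing leaks nothing either (illustrative)

def to_cells(message):
    """Split one application message into CELL-sized chunks, padding the
    last (or an empty message) up to CELL bytes."""
    chunks = [message[i:i + CELL] for i in range(0, len(message), CELL)] or [b""]
    chunks[-1] = chunks[-1].ljust(CELL, b"\x00")
    return chunks

def observed_trace(messages, total_ticks):
    """What an eavesdropper records: one CELL-byte packet per tick, with
    all-zero dummy cells whenever no real data is queued."""
    queue = [c for m in messages for c in to_cells(m)]
    trace = []
    for _ in range(total_ticks):
        cell = queue.pop(0) if queue else b"\x00" * CELL  # cover traffic
        trace.append(len(cell))
    return trace

# Two very different payloads yield identical observable traces.
print(observed_trace([b"PIN=1234"], 4))  # [1500, 1500, 1500, 1500]
print(observed_trace([b"A" * 4000], 4))  # [1500, 1500, 1500, 1500]
```

The cost observed by Dyer et al. follows directly from the sketch: dummy cells and padding inflate bandwidth and latency, which is the heavy system burden noted above.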
touch screen and used the acoustic signals emitted by the mobile phone to realize
the KeyListener system [39], which can infer keystroke input on the QWERTY
keyboard of the user's touch screen. These works can achieve high accuracy on
numeric keypads or physical QWERTY keyboards.
Camera Video Signal-Based Attacks Yue et al. [64] proposed a camera-based
keystroke inference mechanism using commercial devices such as Google Glass.
Shukla et al. [49] proposed a video-based analysis method that infers the user's
input according to the temporal and spatial characteristics of keyboard tapping in
the video. Sun et al. [51] leveraged a camera to record the motion of the back side
of a tablet and inferred the content entered by the user.
In contrast, WindTalker proposed in Chap. 3 neither requires the target user to
type at a fixed location nor needs the sniffer device to be placed close to the user.
In addition, WindTalker can obtain network layer traffic information, greatly
improving its reliability in real environments.
This section introduces the spoofing attacks faced by the voice interface and
reviews current defense schemes, covering both two-factor authentication and
voice signal-based passive liveness detection.
Although the voice interface is considered the most promising user interface in
smart homes, it also introduces some new security problems due to the inherent
broadcast property of the voice channel, which makes it vulnerable to spoofing
attacks. The most direct form is the replay attack, in which the attacker prerecords
voice samples of a legitimate user and then replays them through high-quality
loudspeakers to deceive the voice interface [20]. The replay attack achieves the
strongest attack effect because it preserves as many properties of the original audio
as possible, but it suffers from poor concealment. Therefore, to achieve more covert
and efficient voice spoofing, researchers have proposed the following two types of
advanced attacks that exploit software and hardware flaws of the voice interface.
Adversarial Example Attacks Exploiting Software Flaws Voice interfaces
generally adopt deep learning based speech recognition and speaker recognition
algorithms. However, these algorithms are vulnerable to emerging adversarial
example attacks. In 2016, Carlini et al. [8] proposed the hidden voice command
attack, in which the attacker transforms a voice sample so that the resulting audio
is interpreted as noise by the human auditory system, while the speech recognition
system still recognizes it as a valid, malicious command. Yuan et al. [63] further
proposed the CommanderSong attack, in which malicious voice commands are
embedded into a song. To human listeners, the generated sound is similar to the
original song and thus arouses no suspicion, but after feature processing the speech
recognition system recognizes the embedded attack instructions.
The above attack methods have different limitations: the black-box variant of the
hidden voice command attack relies on inverse MFCC computation, which requires
substantial computing resources, while the CommanderSong attack only targets the
speech recognition component and does not address the speaker authentication part
of the voice control system. Furthermore, Zhang et al. [68] proposed the VMASK
attack to break speaker recognition systems. VMASK builds on the principle of
voice adversarial examples: by adding a small perturbation, it makes a voice that
human listeners perceive as user A be recognized by the system as user B, and it
can attack the Apple Siri system with a success probability of more than 20%.
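The core adversarial-example step can be illustrated with a toy linear "speaker classifier" and a gradient-sign (FGSM-style) perturbation; this is a generic sketch, not the VMASK algorithm, and all weights and features below are invented:

```python
import math

W = [1.2, -0.8]   # invented classifier weights: score > 0 -> "user B"
B = -0.1

def score(x):
    """Toy linear speaker classifier."""
    return sum(w * xi for w, xi in zip(W, x)) + B

def fgsm(x, eps):
    """Move x by eps in the direction that increases the 'user B' score.
    For a linear model d(score)/dx_i = W[i], so step eps * sign(W[i])."""
    return [xi + eps * math.copysign(1.0, w) for xi, w in zip(x, W)]

voice_A = [0.1, 0.3]          # invented features, classified as user A
adv = fgsm(voice_A, eps=0.2)  # small perturbation flips the decision
print(score(voice_A) < 0, score(adv) > 0)  # prints True True
```

Real attacks perform the same step against a deep model's gradient while constraining the perturbation to stay imperceptible to human hearing.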
Inaudible Attacks Based on Hardware Defects In addition to the adversarial
example attacks that exploit software flaws, another type of voice spoofing attack
stems from hardware flaws. A typical example is the “Dolphin Attack” proposed
by Zhang et al. [66]. In this attack, the attacker can send ultrasonic signals
that are not perceived by the human ear, thus inducing voice interfaces (e.g.,
Apple Siri, Amazon Alexa) to recognize them as voice instructions and trigger
dangerous subsequent operations. The dolphin attack exploits the nonlinear
amplitude-frequency response of microphone hardware: by modulating a low-
frequency voice command onto a high-frequency ultrasonic carrier, the attacker
causes the voice interface's microphone to demodulate the signal and recover the
low-frequency voice command. In addition, for voice interfaces that require
speaker authentication (e.g., Apple Siri), the dolphin attack spoofs the victim's
voiceprint using text-to-speech (TTS) based brute-force cracking and concatenative
synthesis. TTS-based brute-force cracking varies the TTS parameters to obtain
audio with different tones and timbres, while concatenative synthesis searches the
collected victim recordings for the phonemes required by the attack command and
stitches them into a voice command. Roy et al. [46] proposed a similar attack called
the Backdoor
attack. In addition to ultrasonic-based schemes, Sugawara et al. [50] showed that
the voice interface can be deceived by modulated laser light without attracting the
user's attention.
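The nonlinear demodulation principle can be reproduced in a short simulation; the parameters below (carrier frequency, quadratic coefficient, filter) are illustrative, not measured from real hardware:

```python
import math

FS = 192_000                      # sample rate of the simulation (Hz)
F_CMD, F_CARRIER = 1_000, 30_000  # "voice command" tone and ultrasonic carrier

def am_signal(n):
    """Attacker's transmission: the command amplitude-modulated onto an
    inaudible ultrasonic carrier."""
    t = n / FS
    return (1 + 0.8 * math.cos(2 * math.pi * F_CMD * t)) * \
           math.cos(2 * math.pi * F_CARRIER * t)

def mic_nonlinear(s, a2=0.5):
    """Microphone with a quadratic nonlinear term: y = s + a2 * s^2.
    Squaring the AM signal creates a component back at F_CMD."""
    return s + a2 * s * s

N = FS // 100                     # 10 ms of samples
win = FS // F_CARRIER             # ~one carrier period

def lowpass(sig):
    """Crude moving-average low-pass: suppresses the ultrasonic band."""
    return [sum(sig[i:i + win]) / win for i in range(len(sig) - win)]

lp_nonlinear = lowpass([mic_nonlinear(am_signal(n)) for n in range(N)])
lp_linear = lowpass([am_signal(n) for n in range(N)])

swing = max(lp_nonlinear) - min(lp_nonlinear)   # demodulated 1 kHz command
swing_lin = max(lp_linear) - min(lp_linear)     # ideal linear mic: ~nothing
print(swing > 2 * swing_lin)                    # prints True
```

With an ideal (linear) microphone, low-pass filtering removes the transmission almost entirely; only the quadratic term leaves a strong baseband component at the command frequency, which is exactly the flaw the attack exploits.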
his mouth when speaking, resulting in bursts in the collected audio. In a replay
attack, however, the frequency band where such bursts reside is suppressed by the
electrical loudspeaker, which enables liveness detection. However, this scheme
requires the user to be close to the voice interface, which suits smartphone
interaction but not the smart speaker scenario. Yan et al. [62] proposed the
"fieldprint" feature to detect spoofing attacks. The insight is that different audio
sources (i.e., different humans and loudspeaker devices) generate audio in different
ways and thus produce unique field characteristics during sound propagation.
Therefore, using a smartphone's microphone pair to measure changes in the field
pattern can detect voice spoofing attacks. However, this method also requires the
user to hold the mobile phone in a fixed manner and stay close to it, which is not
suitable for the smart speaker scenario. In
addition, Blue et al. [7] and Ahmed et al. [2] used the spectral patterns of single-
channel voice signals to identify spoofing attacks and achieve lightweight
authentication. However, due to instability in audio transmission, these schemes
still suffer from insufficient accuracy. Zhang et al. [65] proposed EarArray to
thwart dolphin attacks, but it is not designed to detect spoofing audio within the
human voice frequency range. Therefore, to overcome the issues of the above
methods, Chap. 5 proposes a robust and efficient passive detection mechanism that
relies only on the voice signal itself.
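The single-channel spectral idea can be sketched as follows; this toy detector (invented threshold and synthetic signals) only illustrates the intuition that loudspeakers reproduce very low frequencies poorly, and is not the scheme of Blue et al. [7] or Ahmed et al. [2]:

```python
import math

FS = 8_000   # sample rate (Hz)

def band_energy(sig, f_lo, f_hi):
    """Energy in [f_lo, f_hi] Hz via a direct DFT (fine for a short sketch)."""
    n = len(sig)
    total = 0.0
    for k in range(1, n // 2):
        f = k * FS / n
        if f_lo <= f <= f_hi:
            re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(sig))
            im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(sig))
            total += re * re + im * im
    return total

def is_live(sig, threshold=0.05):
    """Flag as live speech if sub-100 Hz energy is a non-trivial share of
    the total; loudspeakers attenuate this band. Threshold is invented."""
    ratio = band_energy(sig, 20, 100) / (band_energy(sig, 20, FS / 2) + 1e-9)
    return ratio > threshold

N = 400  # 50 ms
def tone(f):
    return [math.sin(2 * math.pi * f * i / FS) for i in range(N)]

human = [a + 0.3 * b for a, b in zip(tone(200), tone(80))]  # keeps sub-bass
replay = tone(200)                                          # sub-bass stripped
print(is_live(human), is_live(replay))  # prints True False
```

The instability problem noted above shows up directly in such detectors: the energy ratio drifts with distance and room acoustics, so a single fixed threshold is hard to tune in practice.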
In a smart home network, users can deploy various smart applications on the cloud
server. These applications realize home automation by automatically controlling
all kinds of smart devices at runtime according to their code rules. However,
applications may exhibit various misbehaviors that threaten the security of smart
home systems. This section summarizes the misbehaviors of smart applications
and reviews the existing defense mechanisms.
Research has shown that smart home application platforms suffer from several
flaws that allow attackers to conduct misbehavior through smart applications. For
example, Demetriou et al. [18] pointed out that the coarse-grained permission
management of the Android system enables malicious applications to access
devices such as smart blood glucose meters in the smart home environment without
restriction and to leak sensitive data to external attackers. Fernandes et al. [25]
reported that in 2016, more than 55% of smart applications (SmartApps) on the
Samsung SmartThings platform were over-privileged; that is, these SmartApps
could access permissions that users believe should not be granted to them. The
main reason is that the authorization granularity of the SmartThings device
capability model is too coarse. These design flaws enable potentially malicious
SmartApps not only to perform unauthorized controls on devices but also to
eavesdrop on device events or forge smart device events in the cloud backend.
In addition, even if the logic of a smart application itself is correct, the protocols
used by the application and its associated devices may still be attacked, causing the
application to misbehave at runtime. At the protocol level, Zillner et al. [74] and
Lomas et al. [38] highlighted several security risks in ZigBee deployments: because
manufacturers ship ZigBee devices with a default link key, the device key is easily
leaked, allowing attackers to join the network communication and forge
interactions with devices. Fouladi et al. [27] analyzed the Z-Wave protocol stack
and found a vulnerability in the AES encryption of a smart lock; exploiting it,
attackers can forge wireless application commands that open the smart lock
abnormally, posing a great threat to users' home security. Researchers have also
studied weak authentication mechanisms in Bluetooth devices [5, 6, 28]. For the
hub, the central component of the emerging smart home, researchers have likewise
revealed potential security risks. Lemos et al. [32] analyzed the security of hubs in
three smart home systems (i.e., Samsung SmartThings, Vera Control [16], and
Wink [59]) and pointed out that these hubs contain vulnerabilities that make it
easy for attackers to obtain control permissions and further violate the application
logic by sending commands to other smart devices.
In summary, smart applications in the current smart home environment exhibit
various misbehaviors, which can be grouped into two categories: over-privileged
access and event spoofing. The former means that an application obtains
permissions that should not be granted to it and misuses them; for example, an
application that controls a smart window may open the window upon receiving the
user's command to close it. The latter means that an application automatically
executes subsequent operations although no corresponding event has actually
occurred; for example, an attacker forges an event in the cloud, triggering the
smart alarm although the environment is normal. These two types of attacks share
a common feature: due to the misbehavior, the operation of the smart application
conflicts with its normal working logic. HoMonit, proposed in Chap. 6, exploits
this phenomenon and leverages side-channel information in the wireless traffic to
achieve real-time misbehavior detection.
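This logic-conformance idea can be sketched as a small state-machine check; the window app, its states, and its events below are all invented, and the sketch is only a simplified illustration of the DFA-matching intuition, not HoMonit's actual implementation:

```python
# Expected logic of a hypothetical window-control app as a tiny DFA:
# on "cmd_close" the window must end up closed; opening is only legal
# after an explicit "cmd_open".
DFA = {
    ("idle", "cmd_close"): "closing",
    ("closing", "evt_closed"): "idle",
    ("idle", "cmd_open"): "opening",
    ("opening", "evt_opened"): "idle",
}

def detect_misbehavior(observed_events, start="idle"):
    """Replay the observed (side-channel-inferred) events through the DFA.
    Return the first event that deviates from the app's declared logic."""
    state = start
    for ev in observed_events:
        nxt = DFA.get((state, ev))
        if nxt is None:
            return ev            # transition not allowed: misbehavior
        state = nxt
    return None

# Normal run: a close command followed by a "closed" event.
print(detect_misbehavior(["cmd_close", "evt_closed"]))  # prints None
# Misbehaving run: the app opens the window after a close command.
print(detect_misbehavior(["cmd_close", "evt_opened"]))  # prints evt_opened
```

Both over-privileged access and event spoofing surface as illegal transitions in such a model, which is why a single conformance check covers both categories.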
To mitigate the security and privacy risks caused by misbehaving smart home
applications, researchers have proposed several defense mechanisms.
2.4 Summary
This chapter reviews existing security and privacy issues in smart home networks.
More specifically, we elaborate on the side-channel attacks at the terminal device
layer, which inspires WindTalker in Chap. 3. We review existing spoofing attacks
at the voice interface layer and introduce two-factor authentication and passive
liveness detection, which are the basis of WSVA in Chap. 4 and ArrayID in Chap. 5,
respectively. Finally, we describe two types of misbehavior at the application
platform layer and the corresponding defense strategies, which are the background
of HoMonit proposed in Chap. 6.
References
1. Adib, F., Katabi, D.: See through walls with WiFi! In: ACM Special Interest Group on Data
Communication (SIGCOMM) (2013)
2. Ahmed, M.E., Kwak, I.Y., Huh, J.H., Kim, I., Oh, T., Kim, H.: Void: a fast and light
voice liveness detection system. In: 29th USENIX Security Symposium (USENIX Secu-
rity 20), pp. 2685–2702. USENIX Association (2020). https://www.usenix.org/conference/
usenixsecurity20/presentation/ahmed-muhammad
3. Alan, H.F., Kaur, J.: Can android applications be identified using only TCP/IP headers of their
launch time traffic? In: Proceedings of the 9th ACM Conference on Security & Privacy in
Wireless and Mobile Networks, pp. 61–66. ACM (2016)
4. Ali, K., Liu, A.X., Wang, W., Shahzad, M.: Keystroke recognition using WiFi signals.
In: Proceedings of the 21st Annual International Conference on Mobile Computing and
Networking, pp. 90–102. ACM (2015)
5. Arsene, L.: Wearable plain-text communication exposed through brute-force, bitdefender finds
(2014). https://www.hotforsecurity.com/blog/wearable-plain-text-communication-exposed-
through-brute-force-bitdefender-finds-10973.html
6. AV-TEST: Test: Fitness wristbands reveal data (2015). https://www.av-test.org/en/news/news-
single-view/test-fitness-wristbands-reveal-data/
7. Blue, L., Vargas, L., Traynor, P.: Hello, is it me you’re looking for? Differentiating between
human and electronic speakers for voice interface security. In: Proceedings of the 11th ACM
Conference on Security & Privacy in Wireless and Mobile Networks, pp. 123–133. ACM
(2018)
8. Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., Zhou,
W.: Hidden voice commands. In: Proceedings of USENIX Security Symposium (USENIX
Security), pp. 513–530 (2016)
9. Celik, Z.B., Babun, L., Sikder, A.K., Aksu, H., Tan, G., McDaniel, P., Uluagac, A.S.: Sensitive
information tracking in commodity IoT. In: 27th USENIX Security Symposium (USENIX
Security 18), pp. 1687–1704. USENIX Association, Baltimore (2018). https://www.usenix.
org/conference/usenixsecurity18/presentation/celik
10. Celik, Z.B., McDaniel, P., Tan, G.: Soteria: Automated IoT safety and security analysis. In:
2018 USENIX Annual Technical Conference (USENIX ATC 18), pp. 147–158. USENIX
Association, Boston (2018). https://www.usenix.org/conference/atc18/presentation/celik
11. Celik, Z.B., Tan, G., McDaniel, P.D.: IotGuard: Dynamic enforcement of security and safety
policy in commodity IoT. In: NDSS (2019)
12. Chen, B., Li, H., Li, Z., Chen, X., Xu, C., Xu, W.: ThermoWave: a new paradigm of wireless
passive temperature monitoring via mmWave sensing. In: Proceedings of the 26th Annual
International Conference on Mobile Computing and Networking, MobiCom ’20. Association
for Computing Machinery, New York (2020). https://doi.org/10.1145/3372224.3419184
13. Chen, S., Ren, K., Piao, S., Wang, C., Wang, Q., Weng, J., Su, L., Mohaisen, A.: You can hear
but you cannot steal: Defending against voice impersonation attacks on smartphones. In: 2017
IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 183–195
(2017). https://doi.org/10.1109/ICDCS.2017.133
14. Cheng, N., Wang, X.O., Cheng, W., Mohapatra, P., Seneviratne, A.: Characterizing privacy
leakage of public WiFi networks for users on travel. In: 2013 Proceedings IEEE INFOCOM,
pp. 2769–2777 (2013). https://doi.org/10.1109/INFCOM.2013.6567086
15. Conti, M., Mancini, L.V., Spolaor, R., Verde, N.V.: Analyzing android encrypted network
traffic to identify user actions. IEEE Trans. Inform. Forensics Secur. 11(1), 114–125 (2016)
16. Control, V.: Vera3 advanced smart home controller (2018). http://getvera.com/controllers/
vera3/
17. Cunche, M., Kaafar, M.A., Boreli, R.: Linking wireless devices using information contained
in wi-fi probe requests. Pervasive Mobile Comput. 11, 56–69 (2014). https://doi.org/10.1016/
j.pmcj.2013.04.001
18. Demetriou, S., Zhou, X.y., Naveed, M., Lee, Y., Yuan, K., Wang, X., Gunter, C.A.: What’s
in your dongle and bank account? Mandatory and discretionary protection of android external
resources. In: NDSS (2015)
19. Di Luzio, A., Mei, A., Stefa, J.: Mind your probes: de-anonymization of large crowds
through smartphone WiFi probe requests. In: IEEE INFOCOM 2016 - The 35th Annual IEEE
International Conference on Computer Communications, pp. 1–9 (2016). https://doi.org/10.
1109/INFOCOM.2016.7524459
20. Diao, W., Liu, X., Zhou, Z., Zhang, K.: Your voice assistant is mine: how to abuse speakers
to steal information and control your phone. In: Proceedings of the 4th ACM Workshop on
Security and Privacy in Smartphones & Mobile Devices (SPSM), pp. 63–74 (2014). https://
doi.org/10.1145/2666620.2666623
21. Ding, W., Hu, H.: On the safety of IoT device physical interaction control. In: Proceedings of
the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, pp.
832–846. Association for Computing Machinery, New York (2018). https://doi.org/10.1145/
3243734.3243865
22. Dyer, K.P., Coull, S.E., Ristenpart, T., Shrimpton, T.: Peek-a-boo, i still see you: why efficient
traffic analysis countermeasures fail. In: 2012 IEEE Symposium on Security and Privacy, pp.
332–346 (2012). https://doi.org/10.1109/SP.2012.28
23. Fan, Y., Jiang, Y., Zhu, H., Shen, X.: An efficient privacy-preserving scheme against traffic
analysis attacks in network coding. In: IEEE INFOCOM 2009, pp. 2213–2221 (2009). https://
doi.org/10.1109/INFCOM.2009.5062146
24. Feng, H., Fawaz, K., Shin, K.G.: Continuous authentication for voice assistants. In: Proceed-
ings of the 23rd Annual International Conference on Mobile Computing and Networking,
p. 343–355. Association for Computing Machinery (2017). https://doi.org/10.1145/3117811.
3117823
25. Fernandes, E., Jung, J., Prakash, A.: Security analysis of emerging smart home applications.
In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 636–654 (2016). https://doi.org/
10.1109/SP.2016.44
26. Fernandes, E., Paupore, J., Rahmati, A., Simionato, D., Conti, M., Prakash, A.: FlowFence:
Practical data protection for emerging IoT application frameworks. In: USENIX Security
Symposium (USENIX Security) (2016)
27. Fouladi, B., Ghanoun, S.: Security evaluation of the z-wave wireless protocol. In: Black Hat
USA (2013)
28. Ho, G., Leung, D., Mishra, P., Hosseini, A., Song, D., Wagner, D.: Smart locks: lessons for
securing commodity internet of things devices. In: ACM Asia Conference on Computer and
Communications Security (AsiaCCS) (2016)
29. Jia, Y.J., Chen, Q.A., Wang, S., Rahmati, A., Fernandes, E., Mao, Z.M., Prakash, A.:
ContexIoT: Towards providing contextual integrity to appified IoT platforms. In: The Network
and Distributed System Security Symposium (NDSS) (2017)
30. Konings, B., Bachmaier, C., Schaub, F., Weber, M.: Device names in the wild: investigating
privacy risks of zero configuration networking. In: Mobile Data Management (MDM), 2013
IEEE 14th International Conference on, vol. 2, pp. 51–56. IEEE (2013)
31. Lee, Y., Zhao, Y., Zeng, J., Lee, K., Zhang, N., Shezan, F.H., Tian, Y., Chen, K., Wang, X.:
Using sonar for liveness detection to protect smart speakers against remote attackers. Proc.
ACM Interact. Mob. Wearable Ubiquitous Technol. 4(1), 1–28 (2020). https://doi.org/10.1145/
3380991
32. Lemos, R.: Hubs driving smart homes are vulnerable, security firm finds. In: Eweek (2015)
33. Li, H., Xu, Z., Zhu, H., Ma, D., Li, S., Xing, K.: Demographics inference through wi-fi
network traffic analysis. In: IEEE International Conference on Computer Communications
(INFOCOM) (2016)
34. Li, H., Zhu, H., Du, S., Liang, X., Shen, X.: Privacy leakage of location sharing in mobile social
networks: Attacks and defense. IEEE Trans. Depend. Secure Comput. PP(99), 1–1 (2016).
https://doi.org/10.1109/TDSC.2016.2604383
35. Li, Z., Ma, F., Rathore, A.S., Yang, Z., Chen, B., Su, L., Xu, W.: WaveSpy: remote and through-
wall screen attack via mmWave sensing. In: 2020 IEEE Symposium on Security and Privacy
(SP), pp. 217–232 (2020). https://doi.org/10.1109/SP40000.2020.00004
36. Liu, J., Wang, Y., Kar, G., Chen, Y., Yang, J., Gruteser, M.: Snooping keystrokes with mm-level
audio ranging on a single phone. In: Proceedings of the 21st Annual International Conference
on Mobile Computing and Networking, pp. 142–154. ACM (2015)
37. Liu, X., Zhou, Z., Diao, W., Li, Z., Zhang, K.: When good becomes evil: keystroke inference
with smartwatch. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and
Communications Security, pp. 1273–1285. ACM (2015)
38. Lomas, N.: Critical flaw IDed in ZigBee smart home devices (2015). https://techcrunch.com/
2015/08/07/critical-flaw-ided-in-zigbee-smart-home-devices/
39. Lu, L., Yu, J., Chen, Y., Zhu, Y., Xu, X., Xue, G., Li, M.: KeyListener: inferring keystrokes
on qwerty keyboard of touch screen through acoustic signals. In: IEEE INFOCOM 2019—
IEEE Conference on Computer Communications, pp. 775–783 (2019). https://doi.org/10.1109/
INFOCOM.2019.8737591
40. Ma, D., Lan, G., Hassan, M., Hu, W., Upama, M.B., Uddin, A., Youssef, M.: SolarGest:
ubiquitous and battery-free gesture recognition using solar cells. In: The 25th Annual
International Conference on Mobile Computing and Networking, MobiCom ’19. Association
for Computing Machinery, New York (2019). https://doi.org/10.1145/3300061.3300129
41. Niu, X., Li, S., Zhang, Y., Liu, Z., Wu, D., Shah, R.C., Tanriover, C., Lu, H., Zhang, D.:
Wimonitor: Continuous long-term human vitality monitoring using commodity wi-fi devices.
Sensors 21(3) (2021). https://www.mdpi.com/1424-8220/21/3/751
42. Owusu, E., Han, J., Das, S., Perrig, A., Zhang, J.: Accessory: password inference using
accelerometers on smartphones. In: Proceedings of the Twelfth Workshop on Mobile Com-
puting Systems & Applications, pp. 1–6 (2012)
43. Panchenko, A., Lanze, F., Pennekamp, J., Engel, T., Zinnen, A., Henze, M., Wehrle, K.:
Website fingerprinting at internet scale. In: NDSS (2016)
44. Pierson, T.J., Peters, T., Peterson, R., Kotz, D.: Proximity detection with single-antenna IoT
devices. In: Proceedings of the 24th Annual International Conference on Mobile Computing
and Networking, MobiCom ’18, pp. 663–665. Association for Computing Machinery, New
York (2018). https://doi.org/10.1145/3241539.3267751
45. Qian, K., Wu, C., Zhang, Y., Zhang, G., Yang, Z., Liu, Y.: Widar2.0: passive human tracking
with a single wi-fi link. In: Proceedings of the 16th Annual International Conference on Mobile
Systems, Applications, and Services, MobiSys ’18, pp. 350–361. Association for Computing
Machinery, New York (2018). https://doi.org/10.1145/3210240.3210314
46. Roy, N., Hassanieh, H., Roy Choudhury, R.: Backdoor: Making microphones hear inaudible
sounds. In: Proceedings of the 15th ACM Annual International Conference on Mobile Systems,
Applications, and Services (MobiSys), pp. 2–14 (2017). https://doi.org/10.1145/3081333.
3081366
47. Shi, C., Liu, J., Liu, H., Chen, Y.: Smart user authentication through actuation of daily activities
leveraging WiFi-enabled IoT. In: Proceedings of the 18th ACM International Symposium on
Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 5:1–5:10 (2017). https://doi.org/
10.1145/3084041.3084061
48. Shiota, S., Villavicencio, F., Yamagishi, J., Ono, N., Echizen, I., Matsui, T.: Voice liveness
detection algorithms based on pop noise caused by human breath for automatic speaker
verification. In: Sixteenth Annual Conference of the International Speech Communication
Association (2015)
49. Shukla, D., Kumar, R., Serwadda, A., Phoha, V.V.: Beware, your hands reveal your secrets!
In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications
Security, pp. 904–917. ACM (2014)
50. Sugawara, T., Cyr, B., Rampazzi, S., Genkin, D., Fu, K.: Light commands: laser-based
audio injection attacks on voice-controllable systems. In: 29th USENIX Security Symposium
(USENIX Security 20), pp. 2631–2648. USENIX Association (2020). https://www.usenix.org/
conference/usenixsecurity20/presentation/sugawara
51. Sun, J., Jin, X., Chen, Y., Zhang, J., Zhang, R., Zhang, Y.: Visible: video-assisted keystroke
inference from tablet backside motion. In: Network and Distributed System Security Sympo-
sium, pp. 1–15 (2016)
52. Tan, S., Yang, J.: WiFinger: leveraging commodity WiFi for fine-grained finger gesture
recognition. In: Proceedings of the 17th ACM International Symposium on Mobile Ad Hoc
Networking and Computing, pp. 201–210. ACM (2016)
53. Taylor, V.F., Spolaor, R., Conti, M., Martinovic, I.: AppScanner: automatic fingerprinting of
smartphone apps from encrypted network traffic. In: Security and Privacy (EuroS&P), 2016
IEEE European Symposium on, pp. 439–454. IEEE (2016)
54. Tian, Y., Zhang, N., Lin, Y.H., Wang, X., Ur, B., Guo, X., Tague, P.: SmartAuth: user-centered
authorization for the internet of things. In: USENIX Security Symposium (USENIX Security)
(2017)
55. Wang, Q., Lin, X., Zhou, M., Chen, Y., Wang, C., Li, Q., Luo, X.: VoicePop: a pop noise
based anti-spoofing system for voice authentication on smartphones. In: IEEE INFOCOM
2019-IEEE Conference on Computer Communications, pp. 2062–2070. IEEE (2019)
56. Wang, S., Cao, J., He, X., Sun, K., Li, Q.: When the differences in frequency domain are
compensated: Understanding and defeating modulated replay attacks on automatic speech
recognition. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and
Communications Security, CCS ’20, p. 1103–1119. Association for Computing Machinery
(2020). https://doi.org/10.1145/3372297.3417254
57. Wang, T., Cai, X., Nithyanand, R., Johnson, R., Goldberg, I.: Effective attacks and provable
defenses for website fingerprinting. In: 23rd USENIX Security Symposium (USENIX Security
14), pp. 143–157. USENIX Association, San Diego (2014)
58. Wang, W., Liu, A.X., Shahzad, M., Ling, K., Lu, S.: Understanding and modeling of WiFi
signal based human activity recognition. In: Proceedings of the 21st Annual International
Conference on Mobile Computing and Networking, pp. 65–76 (2015)
59. Wink: Wink: a simpler, smarter home (2018). https://www.wink.com/
60. Wright, C.V., Coull, S.E., Monrose, F.: Traffic morphing: an efficient defense against statistical
traffic analysis. In: NDSS, vol. 9. Citeseer (2009)
61. Xia, N., Song, H.H., Liao, Y., Iliofotou, M., Nucci, A., Zhang, Z.L., Kuzmanovic, A.:
Mosaic: quantifying privacy leakage in mobile networks. In: ACM SIGCOMM Computer
Communication Review, vol. 43, pp. 279–290. ACM (2013)
62. Yan, C., Long, Y., Ji, X., Xu, W.: The catcher in the field: A fieldprint based spoofing
detection for text-independent speaker verification. In: Proceedings of the 2019 ACM SIGSAC
Conference on Computer and Communications Security, CCS ’19, pp. 1215–1229. Association
for Computing Machinery (2019). https://doi.org/10.1145/3319535.3354248
63. Yuan, X., Chen, Y., Zhao, Y., Long, Y., Liu, X., Chen, K., Zhang, S., Huang, H., Wang,
X., Gunter, C.A.: CommanderSong: A systematic approach for practical adversarial voice
recognition. In: 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64.
USENIX Association, Baltimore (2018). https://www.usenix.org/conference/usenixsecurity18/
presentation/yuan-xuejing
64. Yue, Q., Ling, Z., Fu, X., Liu, B., Ren, K., Zhao, W.: Blind recognition of touched keys on
mobile devices. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and
Communications Security, pp. 1403–1414. ACM (2014)
65. Zhang, G., Ji, X., Li, X., Qu, G., Xu, W.: EarArray: Defending against DolphinAttack via
acoustic attenuation. In: NDSS (2021)
66. Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., Xu, W.: DolphinAttack: inaudible voice
commands. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Com-
munications Security (CCS), pp. 103–117 (2017). https://doi.org/10.1145/3133956.3134052
67. Zhang, J., Zheng, X., Tang, Z., Xing, T., Chen, X., Fang, D., Li, R., Gong, X., Chen, F.: Privacy
leakage in mobile sensing: your unlock passwords can be leaked through wireless hotspot
functionality. Mobile Inform. Syst. 2016, 8793025 (2016)
68. Zhang, L., Meng, Y., Yu, J., Xiang, C., Falk, B., Zhu, H.: Voiceprint mimicry attack towards
speaker verification system in smart home. In: IEEE INFOCOM 2020—IEEE Conference on
Computer Communications, pp. 377–386 (2020). https://doi.org/10.1109/INFOCOM41043.
2020.9155483
69. Zhang, L., Tan, S., Yang, J.: Hearing your voice is not enough: An articulatory gesture
based liveness detection for voice authentication. In: Proceedings of the 2017 ACM SIGSAC
Conference on Computer and Communications Security (CCS), pp. 57–71 (2017). https://doi.
org/10.1145/3133956.3133962
70. Zhang, L., Tan, S., Yang, J., Chen, Y.: VoiceLive: A phoneme localization based liveness
detection for voice authentication on smartphones. In: Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security, CCS ’16, pp. 1080–1091. Association
for Computing Machinery (2016). https://doi.org/10.1145/2976749.2978296
71. Zheng, Y., Zhang, Y., Qian, K., Zhang, G., Liu, Y., Wu, C., Yang, Z.: Zero-effort cross-domain
gesture recognition with wi-fi. In: Proceedings of the 17th Annual International Conference
on Mobile Systems, Applications, and Services, MobiSys ’19, pp. 313–325. Association for
Computing Machinery, New York (2019). https://doi.org/10.1145/3307334.3326081
72. Zhou, B., Lohokare, J., Gao, R., Ye, F.: EchoPrint: Two-factor authentication using acoustics
and vision on smartphones. In: Proceedings of the 24th Annual International Conference on
Mobile Computing and Networking, MobiCom ’18, p. 321–336. Association for Computing
Machinery, New York (2018). https://doi.org/10.1145/3241539.3241575
73. Zhu, T., Ma, Q., Zhang, S., Liu, Y.: Context-free attacks using keyboard acoustic emanations.
In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications
Security, pp. 453–464. ACM (2014)
74. Zillner, T.: Zigbee exploited: the good the bad and the ugly. In: Black Hat USA (2015)
Chapter 3
Privacy Breaches and Countermeasures
at Terminal Device Layer
3.1 Introduction
In the smart home environment, smart terminal devices such as smartphones and
tablets are widely used. Users rely on smartphones and other terminals for many
sensitive interactions (e.g., bank transfers, payment password entry, and social
applications). Smart terminals differ greatly from traditional static devices (such
as bank ATMs): traditional static devices generally connect to a secure network
and are used in a physically secure area, whereas a smart terminal is typically
carried by a mobile user and connected to a dynamic mobile network. Therefore,
an attacker can steal the private information that a user enters via keystrokes on
the terminal in various direct or indirect ways [1, 2, 12, 13, 15, 18, 23, 30, 32].
A direct attack refers to the attacker directly observing the user's keystroke
input on the smart terminal, for example, using a camera to peek at the
victim's input on the terminal's touch screen or keyboard. A side-channel
attack refers to the attacker inferring the input of the target device from
information that is not directly related to the user's keystroke behavior. In
the side-channel attack scenario, the attacker can exploit electromagnetic
signals from the antenna [1], audio signals from the microphone [2, 12, 32],
video signals from the camera [23, 30], motion states obtained from sensors
[13, 15, 18], and other side-channel information to obtain the user's
keystrokes. Compared with the direct attack, a side-channel attack can steal
private information without the user's awareness, and it has therefore
received extensive attention.
Currently, to access side-channel information, the works mentioned above often
assume that either external signal-collecting devices are physically close to
the target device (for example, 30 cm in [1]) or the target device's sensors
are compromised. However, in a mobile scenario, both assumptions rarely hold,
and the impact of such attacks is thus limited. In addition, prior works have
studied keystroke inference aiming at a high inference accuracy over a series
of keystrokes for a relatively long time. However, the keystrokes on a mobile
device are not always highly sensitive: an eavesdropper is far more interested
in capturing a payment PIN entered in a short moment than in regular typing.
Therefore, application context information needs to be considered in the
keystroke inference framework to increase inference accuracy and efficiency.
This chapter presents WindTalker, a novel and practical keystroke inference
framework that can infer sensitive keystrokes on a mobile device from Wi-Fi
signals. WindTalker is motivated by the observation that typing on a mobile
device involves hand and finger motions, which generate significant
interference to the multi-path Wi-Fi signals between the target device and the
Wi-Fi router it connects to. The attacker can exploit the strong correlation
between CSI fluctuations and keystrokes to infer the user's number input.
Unlike prior side-channel attacks or traditional CSI-based gesture
recognition, WindTalker neither deploys external devices close to the target
device nor compromises any part of the target device. Instead, WindTalker sets
up a "rogue" hotspot to lure the target user with free Wi-Fi service, which is
easy to deploy and difficult to detect. As long as the target mobile device
connects to the rogue Wi-Fi hotspot, WindTalker intercepts the Wi-Fi traffic
and selectively collects the CSI between the target device and the hotspot.
The study of WindTalker in this chapter has four major technical challenges. (i)
The impact of keystrokes’ hand and finger movement on CSI waveforms is very
subtle. An effective signal analysis method is needed to analyze keystrokes from
the limited CSI. (ii) The prior CSI collection method requires deploying two Wi-Fi
devices (i.e., one as a signal sender and the other as a signal receiver) close to the
victim. A more flexible and practical CSI collection method is highly desirable for
the mobile device scenario. (iii) The keystroke inference must focus on
sensitive input moments, such as payment PIN entry; prior works have not
addressed such context-oriented CSI collection. (iv) We need to devise a
lightweight defense method to thwart WindTalker.
The contributions of this chapter are summarized as follows:
• We present a practical cross-layer-based approach for mobile payment password
inference on smartphones using public Wi-Fi architecture. We propose a novel
password inference model that analyzes physical layer information (CSI) and
network layer traffic.
• We present a novel ICMP-based CSI collection method without deploying an
external device very close to the victim’s device or compromising the victim’s
device. We develop an IP pool-based method to recognize the PIN input period.
• We conduct a case study on password inference against the mobile payment
  platform Alipay, which is secured by the HTTPS protocol and thus
  traditionally believed to be secure. We investigate the impact of various
  factors on WindTalker, demonstrating that WindTalker can infer the password
  with a high success rate.
• We introduce effective countermeasures to thwart the inference attack. More
  specifically, we propose a novel CSI obfuscation algorithm that prevents the
  attacker from collecting accurate CSI data without requiring the user's
  participation.
The remainder of this chapter is organized as follows. Section 3.2 introduces the
preliminary knowledge and the principles of WindTalker. In Sect. 3.3, we introduce
the architecture and design details of each module of WindTalker. We evaluate
the performance of WindTalker in the keystroke inference attack and discuss the
impact of various factors in Sect. 3.4. Section 3.5 demonstrates the
cross-layer attack ability of WindTalker on a real-world payment platform,
Alipay. Finally, Sect. 3.6
introduces the countermeasures against WindTalker and Sect. 3.7 summarizes this
chapter.
3.2 Background Knowledge and Attack Principle

This section introduces the channel state information of wireless signals, the
attack model, and the design principles behind WindTalker.
The basic insight of WindTalker is measuring the impact of hand and finger
movement on Wi-Fi signals and leveraging the correlation between CSI and the
unique hand motion to recognize the password inputted by the user. Below, we
briefly introduce the CSI-related backgrounds.
To improve the channel capacity of the wireless system, Wi-Fi standards like
IEEE 802.11n/ac support Multiple-Input Multiple-Output (MIMO) and Orthogonal
Frequency Division Multiplexing (OFDM). In a system with N_{TX} transmitter
antennas, N_{RX} receiver antennas, and N_s OFDM subcarriers, N_{TX} × N_{RX}
× N_s subcarriers are used to transmit signals at the same time. CSI measures
the channel frequency response (CFR) on the different subcarrier frequencies
f. The CFR H(f, t) represents the state of the wireless channel during a
signal transmission. Let X(f, t) and Y(f, t) denote the transmitted and
received signals at subcarrier frequency f. The receiver can calculate
H(f, t) from a known transmitted signal via

H(f, t) = \frac{Y(f, t)}{X(f, t)}. \qquad (3.1)
Currently, many commercial devices, such as Atheros 9390 [22], Atheros 9580
[29], and Intel 5300 [8] network interface cards (NICs) with special drivers,
provide open access to CSI values. In this chapter, we adopt Intel 5300 NICs,
which follow the IEEE 802.11n standard [10] and can work at 2.4 GHz or
5.8 GHz. During Wi-Fi communication, each TX–RX antenna pair carries 64 OFDM
subcarriers (56 of which are in use), and the Intel 5300 NIC reports the CSI
of a group of 30 of these subcarriers. Therefore, the CSI acquired in this
chapter is a 30-dimensional time series for a single antenna pair.
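As a toy illustration of the receiver-side CFR estimation H(f, t) = Y(f, t)/X(f, t) described above (a minimal numpy sketch with synthetic channel values; in a real attack these values come from the NIC driver):

```python
import numpy as np

# Synthetic example: 30 OFDM subcarriers for one TX-RX antenna pair,
# matching the Intel 5300 CSI layout described in the text.
num_subcarriers = 30
rng = np.random.default_rng(0)

# A made-up complex channel, known pilots X(f), and the received Y(f).
true_h = rng.normal(size=num_subcarriers) + 1j * rng.normal(size=num_subcarriers)
x = np.ones(num_subcarriers, dtype=complex)   # known transmitted pilots X(f)
y = true_h * x                                # noiseless received signal Y(f)

# Receiver-side CFR estimate: H(f) = Y(f) / X(f), one value per subcarrier.
h_est = y / x

# The CSI amplitude is the 30-dimensional feature used throughout the chapter.
csi_amplitude = np.abs(h_est)
assert csi_amplitude.shape == (num_subcarriers,)
```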
In the privacy attack scenario against smart terminals, this chapter assumes
that the victim carries a mobile terminal, such as a smartphone, and accesses
Internet services through a Wi-Fi hotspot. An attacker places a wireless
device (e.g., a smartphone or router) near the user's area to establish a
Wi-Fi hotspot. To disguise it as a public hotspot, the attacker gives it a
deceptive service set identifier (SSID), such as "public free Wi-Fi", and
requires no password to connect. In this scenario, users choose to connect to
the hotspot. Because applications use application-layer encryption protocols
(e.g., HTTPS), users believe the communication process is encrypted, i.e.,
that all private information is shared only between users and service
providers. However, in this chapter, we will reveal that user privacy cannot
be protected by application-layer encryption alone. Our WindTalker framework
presents an effective keystroke inference method targeting the mobile terminal
device.
This attack mode requires the attacker neither to physically approach the
user's mobile terminal nor to compromise the user's terminal equipment, which
makes it highly practical. We also assume that attackers can deploy a Wi-Fi
hotspot near users' homes to carry out attacks. Moreover, smart home
technology is used not only in private homes but also in semi-open
environments such as apartments, hotels, and shopping malls. In these
environments, a deployed Wi-Fi hotspot is even less noticeable to users and
more feasible for attackers. Therefore, WindTalker is feasible in multiple
types of smart home and even mobile communication scenarios.
Based on whether the terminal device participates in the CSI signal acquisition
process, we can divide the keystroke inference mechanism into two models: in-
band keystroke inference (IKI) and out-band keystroke inference (OKI). Note that
Fig. 3.1 Wi-Fi based keystroke inference models. (a) IKI attack model. (b) OKI attack model
the existing works about CSI-based side-channel attacks [1, 25, 31] choose the OKI
model. As shown in Fig. 3.1b, the adversary deploys two COTS Wi-Fi devices close
to the target device and ensures the target device is placed right between two COTS
Wi-Fi devices. One is the sender device continuously emitting signals, and the other
is the receiver device continuously receiving the signals. The keystrokes are inferred
from the multi-path distortions in signals.
Different from existing works [1, 25, 31], WindTalker chooses the IKI model. As
shown in Fig. 3.1a, WindTalker deploys one Commercial Off-The-Shelf (COTS)
Wi-Fi device close to the target device, which could be a Wi-Fi hotspot. The Wi-
Fi hotspot provides free Wi-Fi networks for nearby users. When a user connects
her device to the hotspot, the Wi-Fi hotspot can monitor the application context by
checking the pattern of the transmitted Wi-Fi traffic. In addition, the Wi-Fi hotspot
periodically sends ICMP packets to obtain the CSI information from the target
device. With the metadata of the Wi-Fi traffic collected by the hotspot, WindTalker
knows when sensitive operations (such as typing passwords) happen. And then, the
hotspot adaptively launches a CSI-based keystroke inference method to recognize
sensitive key inputs. To the best of our knowledge, the IKI method we propose is
the first one using existing network protocols of IEEE 802.11n/ac standard to obtain
both the application context and the CSI information on mobile devices.
Compared with the OKI model, the proposed IKI model has the following
advantages. Firstly, the IKI model does not require the placement of both sender
and receiver devices and can be deployed more flexibly and stealthily. Secondly,
in the OKI model, the victim’s device is not connected to the attacker’s device,
so the attacker cannot obtain the Wi-Fi traffic from the user’s device. Therefore,
the OKI model fails to differentiate the non-sensitive operations on mobile devices
(e.g., clicking the screen to open an APP or just for web browsing) from sensitive
operations (e.g., inputting the password). Instead, the IKI model allows the attacker
to obtain both unencrypted metadata traffic and the CSI data to launch a more fine-
grained attack.
Fig. 3.2 Hand movement and CSI changes during keystrokes. (a) An illustration of finger
keystroke. (b) Hand movement: vertical touch and oblique touch. (c) CSI waveforms from the
21st to the 30th subcarriers during a keystroke. (d) Impact of hand movement on CSI
Fig. 3.3 CSI changes when inputting keystrokes in the numerical keyboard. (a) Continuously
clicking on the numerical keyboard. (b) Continuously clicking in different keys
Different degrees of hand coverage lead to different fluctuation ranges of the
CSI value, which can be exploited for key inference.
Finger clicks are another important factor contributing to CSI fluctuation.
Compared with the CSI change caused by hand coverage, our experiments show
that a finger click has a more direct influence on CSI, introducing a sharp
peak (Fig. 3.3b) that corresponds to the quick click's influence on multi-path
propagation. This feature can be used to separate individual touches when the
user continuously presses the same key or adjacent keys, which produce similar
CSI values.
3.3 System Design of Cross-layer Privacy Inference Attack

The basic strategy of WindTalker is to kill two birds with one stone. On the
one hand, it analyzes the Wi-Fi traffic to identify sensitive attack windows
(e.g., PIN input) on smartphones. On the other hand, as soon as an attack
window is identified, WindTalker launches the CSI-based keystroke recognition.
As shown in Fig. 3.4, WindTalker consists of the following modules: the
Sensitive Input Window Recognition Module, which distinguishes the sensitive
input time windows; the ICMP-Based CSI Acquirement Module, which collects the
user's CSI data during access to the Wi-Fi hotspot; the Data Preprocessing
Module, which removes noise from the CSI data and reduces its dimensionality;
the Keystroke Extraction Module, which automatically determines the start and
end points of each keystroke waveform; and the Keystroke Inference Module,
which compares the different keystroke waveforms and determines the
corresponding keystroke.
Different from previous works, which rely on two devices (both a sender and a
receiver) to collect CSI data, we apply an approach that leverages the
Internet Control Message Protocol (ICMP) at the hotspot to collect CSI data
while the user accesses the pre-installed access point. In particular,
WindTalker periodically sends an ICMP Echo Request to the victim's smartphone,
which replies with an Echo Reply for each request. To acquire enough CSI
information about the victim, WindTalker needs to send ICMP Echo Requests at a
high frequency, which forces the victim's device to reply at the same
frequency. In practice, WindTalker works well for several smartphones, such as
Xiaomi, Redmi, and Samsung, at a rate of 800 packets per second. It is
important to point out that this approach does not require any permission from
the target smartphone and is difficult for the victim to detect.
The ICMP-based CSI collection approach introduces only a limited amount of
extra traffic. When sending 800 ICMP packets of 98 bytes per second to the
victim, the attack consumes only 78.4 KB/s, while IEEE 802.11n can
theoretically support transmission speeds of up to 140 Mbit/s. It is clear
that the proposed attack causes little interference to the victim's Wi-Fi
experience.
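The overhead figure quoted above follows directly from the packet size and probing rate; a back-of-the-envelope check:

```python
# Back-of-the-envelope check of the ICMP probing overhead quoted in the text.
packet_size_bytes = 98          # one ICMP echo packet
rate_pps = 800                  # probing rate, packets per second

overhead_bytes_per_s = packet_size_bytes * rate_pps
overhead_kb_per_s = overhead_bytes_per_s / 1000          # = 78.4 KB/s

link_capacity_bytes_per_s = 140e6 / 8                    # 140 Mbit/s in bytes/s
fraction = overhead_bytes_per_s / link_capacity_bytes_per_s

print(f"{overhead_kb_per_s:.1f} KB/s ({fraction:.2%} of link capacity)")
```

The probing traffic thus occupies well under one percent of the theoretical link capacity.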
CSI is influenced both by finger movement and by the body movement of nearby
people. One of the major challenges of obtaining accurate CSI data in a public
space is minimizing the interference caused by nearby people. We present a
noise reduction approach based on a directional antenna. Unlike
omnidirectional antennas, which have uniform gain in every direction, a
directional antenna's gain varies with direction. As a result, the signal
level at a receiver can be increased or decreased simply by rotating the
orientation of the directional antenna. WindTalker employs a directional
antenna to focus the energy toward the target of interest, which minimizes the
effects of nearby human body movement.
WindTalker employs a TDJ-2400BKC antenna working in 2.4 GHz to collect
CSI data of the targeted victim, whose Horizontal Plane −3 dB Power Beamwidth
and Vertical Plane −3 dB Power Beamwidth are 30◦ and 25◦ , respectively.
The comparison of CSI data when using a directional and omnidirectional
antenna is shown in Fig. 3.5. In the experiment, we recruited two volunteers.
Volunteer A acts as the target user and continuously clicks the number 1 on the
number pad of the smartphone. Volunteer B was asked to walk 1 m to the left of
volunteer A. In the experiment, the directional antenna of the Wi-Fi hotspot was
always aimed at volunteer A. As shown in Fig. 3.5a, when the distance between
the volunteer and the omnidirectional antenna of the Wi-Fi hotspot is 75 cm,
due to the movement of surrounding volunteer B, we cannot clearly observe the
fluctuation mode caused by keystroke from the collected CSI amplitude waveform.
Figure 3.5b–d show the CSI amplitude when a victim is located 75, 125, and
150 cm away from the directional antenna, respectively, while one person moves
Fig. 3.5 The comparison between omnidirectional and directional antennas on collecting CSI. (a)
Omnidirectional antenna in 75 cm. (b) Directional antenna in 75 cm. (c) Omnidirectional antenna
in 125 cm. (d) Directional antenna in 150 cm
nearby. The unique pattern caused by the finger click in number 1 can be easily
caught from the original CSI waveform without any preprocessing. Therefore, in
the subsequent experiments, we only consider the CSI acquisition method using a
directional antenna.
To extract the time window containing the sensitive input, WindTalker captures
all packets of the victim with Wireshark and records the timestamp of each CSI
data. Currently, most of the important applications are secured via HTTPS, which
provides end-to-end encryption and prevents the eavesdropper from obtaining
sensitive data such as the password. Our insight is that though HTTPS provides
end-to-end encryption, it cannot protect the metadata of the traffic, such as the IP
address of the destination server, which can be used to recognize the sensitive input
window.
Specifically, WindTalker builds a Sensitive IP Pool for the applications or
services of interest. Take Alipay as an example: during the payment process,
the data packets are directed to a limited number of IP addresses, which can
be obtained via a series of trials. The experimental evaluation shows that,
for Alipay users, the traffic of users in the same network is directed to the
same server IP, which lasts for a period (e.g., several days for one round of
experiments). Therefore, it is feasible to access the applications or services
of interest at regular intervals and append the obtained IP addresses to the
Sensitive IP Pool. This constantly updated pool allows WindTalker to figure
out the sensitive input time window.
We conduct experiments to validate the effectiveness of the above method. We
choose three popular mobile payment applications (i.e., Alipay, WeChat Pay,
and JD Pay) and capture the network traffic using Wireshark. For each
application, we complete the mobile payment ten times. As shown in Table 3.1,
for a given application, when the password input process starts, packets
destined to a specific IP address appear. This result demonstrates the
effectiveness of the sensitive IP pool-based method. Therefore, during the
attack, as soon as traffic to an IP address contained in the Sensitive IP Pool
is observed, WindTalker extracts this traffic and records the corresponding
start and end times, which serve as the start and end of the Sensitive Input
Window. It then analyzes the CSI data in this period to launch the password
inference attack via Wi-Fi signals.
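The IP pool-based window recognition can be sketched as follows (a minimal sketch; the pool addresses and the 2-second grouping gap are invented for illustration, and timestamps are plain seconds rather than capture-file datetimes):

```python
# Hypothetical sensitive IP pool for a payment service. The addresses are
# invented for illustration; in practice they are learned by repeatedly
# exercising the payment flow and recording the destination IPs.
SENSITIVE_IP_POOL = {"203.0.113.10", "203.0.113.11"}

def sensitive_windows(packets, gap=2.0):
    """Group packets destined to the sensitive IP pool into input windows.

    `packets` is an iterable of (timestamp_seconds, dst_ip) pairs, e.g.
    parsed from a Wireshark capture. Sensitive packets closer together
    than `gap` seconds belong to the same window; returns (start, end)
    pairs marking each Sensitive Input Window.
    """
    windows = []
    for ts, dst in packets:
        if dst not in SENSITIVE_IP_POOL:
            continue                      # non-sensitive traffic is ignored
        if windows and ts - windows[-1][1] <= gap:
            windows[-1][1] = ts           # extend the current window
        else:
            windows.append([ts, ts])      # open a new window
    return [tuple(w) for w in windows]
```

Each returned window would then trigger the CSI-based keystroke inference on the samples recorded inside it.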
After collecting CSI data, WindTalker needs to conduct the preprocessing before
entering the keystroke inference module. The goal of preprocessing is to remove the
noises introduced by commodity Wi-Fi NICs due to the frequent changes in internal
CSI reference levels, transmission power levels, and transmission rates. To achieve
this, WindTalker first turns to wavelet denoising to remove noises from the obtained
signals. Then, WindTalker leverages the Principal Component Analysis to reduce
the dimensionality of the feature vectors to enable better analysis of the data.
From Fig. 3.5, we can observe that the variation of CSI waveforms caused by
finger motion normally appears at the low end of the spectrogram, while the
noise occupies the high end. We do not adopt a low-pass filter, since the
high-frequency signal still includes some finger motion characteristics. In
this chapter, the wavelet denoising method is used to remove noise from the
raw signal. Unlike traditional frequency analysis such as the Fourier
transform, the Discrete Wavelet Transform (DWT) is a time–frequency analysis
with good resolution in both the time and frequency domains. WindTalker can
thus leverage DWT to analyze finger movement across frequency bands. Wavelet
denoising includes the following three main steps:
Discrete Wavelet Transform A discrete signal x[n] can be expressed in terms of
the wavelet functions by the following equation:

x[n] = \frac{1}{\sqrt{L}} \sum_{k} W_\phi[j_0, k]\, \phi_{j_0,k}[n]
     + \frac{1}{\sqrt{L}} \sum_{j=j_0}^{\infty} \sum_{k} W_\psi[j, k]\, \psi_{j,k}[n], \qquad (3.2)

where x[n] represents the original discrete signal and L represents the length
of x[n]. \phi_{j_0,k}[n] and \psi_{j,k}[n] are the wavelet basis functions,
and W_\phi[j_0, k] and W_\psi[j, k] are the wavelet coefficients. The
functions \phi_{j_0,k}[n] are scaling functions, and the corresponding
coefficients W_\phi[j_0, k] are the approximation coefficients. Similarly, the
functions \psi_{j,k}[n] are wavelet functions, and the coefficients
W_\psi[j, k] are detail coefficients. To obtain the wavelet coefficients, the
wavelet bases \phi_{j_0,k}[n] and \psi_{j,k}[n] are chosen to be orthogonal to
each other.
During the decomposition process, we divide the original signal into
approximation and detail coefficients. Then the approximation coefficients are
iteratively divided into the approximation and detail coefficients of the next
level. The approximation and detail coefficients at the j-th level can be
calculated as follows:
W_\phi[j_0, k] = \langle x[n], \phi_{j_0+1,k}[n] \rangle
              = \frac{1}{\sqrt{L}} \sum_{n} x[n]\, \phi_{j_0+1,k}[n]. \qquad (3.3)

W_\psi[j, k] = \langle x[n], \psi_{j+1,k}[n] \rangle
            = \frac{1}{\sqrt{L}} \sum_{n} x[n]\, \psi_{j+1,k}[n]. \qquad (3.4)
Threshold Selection The recursive DWT decomposition breaks the raw signal into
detail coefficients (high-frequency) and approximation coefficients
(low-frequency) at different frequency levels. A threshold is then applied to
the detail coefficients to remove their noisy part. Threshold selection is
important because a small threshold retains the noisy components, while a
large threshold loses major signal information. In this work, the minimax
threshold is chosen for its adaptivity, effectiveness, and simplicity [21].
Wavelet Reconstruction After finishing the above two steps, we reconstruct the
signal by combining the coefficients of the last approximation level with all
thresholded detail coefficients. Wavelet selection plays a key role in wavelet
decomposition and reconstruction; there are many wavelet bases, such as the
Daubechies and Haar wavelets [21]. In practice, we choose the Daubechies D4
wavelet and perform 5-level DWT decomposition in the wavelet denoising of our
study.
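The three steps can be sketched end-to-end (a minimal sketch assuming a power-of-two signal length; for brevity it uses the simple Haar wavelet rather than the Daubechies D4 wavelet adopted in this chapter, and the standard closed-form minimax threshold constants):

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_step(x):
    """One DWT level with the Haar wavelet: (approximation, detail)."""
    x = x[: len(x) // 2 * 2]                 # drop a trailing odd sample
    return (x[0::2] + x[1::2]) / SQRT2, (x[0::2] - x[1::2]) / SQRT2

def haar_inverse(approx, detail):
    """Invert one Haar DWT level."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / SQRT2
    x[1::2] = (approx - detail) / SQRT2
    return x

def wavelet_denoise(signal, levels=5):
    """Decompose, soft-threshold the detail coefficients, reconstruct."""
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):                  # step 1: decomposition
        approx, d = haar_step(approx)
        details.append(d)
    # Step 2: minimax-style threshold; the noise scale sigma is estimated
    # from the finest detail level via the median absolute deviation.
    sigma = np.median(np.abs(details[0])) / 0.6745
    n = len(signal)
    thr = sigma * (0.3936 + 0.1829 * np.log2(n)) if n > 32 else 0.0
    details = [np.sign(d) * np.maximum(np.abs(d) - thr, 0.0) for d in details]
    # Step 3: reconstruction from the coarsest level back up.
    for d in reversed(details):
        approx = haar_inverse(approx, d[: len(approx)])
    return approx
```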
Fig. 3.6 An illustration of subcarrier selection. (a) CSI waveforms while inputting a keystroke. (b)
1st subcarrier: sensitive. (c) 16th subcarrier: high sensitivity. (d) The variance of each subcarrier
With PCA, we can identify the most representative components influenced by the
victim's hand and finger movement and, at the same time, remove the noisy
components. In our experiments, the first k = 4 components capture almost all
significant changes in the CSI waveforms, and the remaining components are
noise. Moreover, we observed that the first PCA component retains most of the
keystroke-induced CSI changes while containing little ambient noise. Thus, we
take only the first PCA component into the password inference module.
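The PCA projection can be sketched with numpy's SVD (a minimal sketch; `csi` is assumed to be a samples × subcarriers matrix of denoised CSI amplitudes):

```python
import numpy as np

def pca_components(csi, k=1):
    """Project CSI samples (rows) onto their top-k principal components.

    `csi` is a (num_samples, num_subcarriers) matrix; the chapter keeps
    only the first component (k=1) for password inference.
    """
    centered = csi - csi.mean(axis=0)
    # Right-singular vectors of the centered data are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T              # shape: (num_samples, k)
```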
After data preprocessing, it is observed that the CSI data shows a strong correlation
with the keystrokes, as shown in Fig. 3.7a. In the experiment, as mentioned in
Sect. 3.2.3, the sharp rise and fall of the CSI waveform signals are observed in
coincidence with the start and end of finger touch. How to determine the start and
the end point of the CSI time series during a keystroke is essential for keystroke
recognition. However, existing burst detection schemes, such as the
Mann–Kendall test, the moving average method, and cumulative anomalies [9], do
not work well in our situation, since the CSI waveform has many change points
during the password input period. Therefore, we propose a novel detection
algorithm that automatically detects the start and end points. The proposed
algorithm includes the following three steps.
Waveform Profile Building As shown in Fig. 3.7a, there is a sharp rise and
fall corresponding to the finger motions. However, strong noise prevents us
from extracting the keystroke-related CSI waveforms of interest. This
motivates us to perform another round of noise filtering. In
Fig. 3.7 Extraction procedure of multiple keystrokes. (a) CSI waveform after data preprocessing.
(b) CSI waveform after twice filter. (c) Variance Scan. (d) The results of keystroke extraction
the experiment, we still adopt wavelet denoising to make the waveform smooth.
After being filtered, the CSI data during the keystroke period are highlighted while
the waveform during the non-keystroke period becomes smooth, which is shown in
Fig. 3.7b. It is worth noting that the noise filtering in this step will filter out some
information related to user keystrokes. Therefore, this waveform is only used to
determine the start and end time of keystrokes, and subsequent waveform extraction
is still performed in the waveform shown in Fig. 3.7a.
CSI Time Series Segmentation and Feature Segment Selection To extract the CSI
waveforms for individual keystrokes, we slice the CSI time series into
multiple segments, group them according to temporal proximity, and then choose
the center of each group as the feature waveform for a specific keystroke.
Without loss of generality, it is assumed that each segment contains W
samples. Given the sampling frequency S and the segment duration \tau, W can
be represented by S \times \tau. For a waveform with time duration T, the
number of segments N can be calculated as:

N = \frac{T \times S}{W}. \qquad (3.6)
As shown in Fig. 3.7c, the CSI segments during a keystroke period show a much
larger variance than those outside it. Motivated by this, we keep only the
segments whose variance exceeds a predetermined threshold and remove the rest.
The selected segments are grouped according to temporal proximity (in
practice, e.g., five adjacent segments form one group). Each group represents
the CSI waveform of an individual keystroke, and the center point of the group
is selected as the keystroke's feature segment. The process of time series
segmentation and feature segment selection is shown in Fig. 3.7c.
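The segmentation and variance-based selection can be sketched as follows (a minimal sketch; the parameter names are ours, and surviving segments are grouped by simple index adjacency rather than the fixed five-segment grouping used in practice):

```python
import numpy as np

def keystroke_feature_segments(signal, sample_rate, seg_duration,
                               var_threshold, group_gap=1):
    """Return the center segment index of each keystroke candidate.

    Slices the CSI series into segments of W = S * tau samples, keeps
    segments whose variance exceeds the threshold, groups adjacent
    survivors, and picks each group's center as its feature segment.
    """
    w = int(sample_rate * seg_duration)      # W = S * tau
    n = len(signal) // w                     # N = T * S / W, per Eq. (3.6)
    variances = np.array([signal[i * w:(i + 1) * w].var() for i in range(n)])
    active = np.flatnonzero(variances > var_threshold)

    groups, current = [], []
    for idx in active:
        if current and idx - current[-1] <= group_gap:
            current.append(idx)              # extend the current group
        else:
            if current:
                groups.append(current)
            current = [idx]                  # start a new group
    if current:
        groups.append(current)
    # The center segment of each group represents one keystroke.
    return [g[len(g) // 2] for g in groups]
```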
Keystroke Waveforms Extraction To extract keystroke waveforms, the key issue
is how to determine the start and the end point of the CSI time series, which could
cover as much keystroke waveform as possible while minimizing the coverage of
the non-keystroke portion.
We calculate the average value J of the segment samples and then define two
key metrics, K and L: K is the average of J and the samples' maximum value,
while L is the average of J and the samples' minimum value. The intersections
of the lines K and L with the CSI waveform serve as anchor points. On line K,
starting from the leftmost anchor point, we perform a local search and choose
the nearest local extremum below K as the first start point. Similarly, beyond
the rightmost anchor point, we choose the nearest local extremum below K as
the first end point. We also perform local searches from the two anchor points
on line L to choose two local extrema beyond L as the second start point and
the second end point. Finally, we compare these candidate points. As shown in
Fig. 3.8 Extraction procedure of a single keystroke. (a) An illustration of anchor points. (b) Start
and end points of keystroke
Figs. 3.8a, b, and 3.7d, with the lower start point and the higher end point, keystroke
waveform can be extracted.
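The anchor-line construction can be sketched as follows (a deliberately simplified stand-in: it computes J, K, and L as described above, but returns the outermost samples lying beyond the K/L lines rather than searching for the nearest local extrema around each anchor point):

```python
import numpy as np

def keystroke_bounds(seg):
    """Estimate the start/end indices of a keystroke inside a segment.

    Simplified version of the K/L anchor-point method: the bounds are the
    outermost samples above line K or below line L.
    """
    j = seg.mean()                        # J: mean of the segment samples
    k_line = (j + seg.max()) / 2          # K: average of J and the maximum
    l_line = (j + seg.min()) / 2          # L: average of J and the minimum
    beyond = np.flatnonzero((seg > k_line) | (seg < l_line))
    return int(beyond.min()), int(beyond.max())
```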
Thus, we can divide a CSI waveform into several keystroke waveforms. The i-th
keystroke waveform K_i is extracted from the k-th principal component
H_r(:, k) of the CSI waveforms as follows:

K_i = H_r(s_i : e_i, k), \qquad (3.7)

where s_i and e_i are the start and end times of the i-th keystroke. After
keystroke extraction, we use these keystroke waveforms to conduct the
recognition process.
One of the major challenges for differentiating keystrokes is how to choose the
appropriate features that can uniquely represent the keystrokes. As shown in
Fig. 3.9, it is observed that different keystrokes will lead to different waveforms,
which motivates us to choose waveform shape as a feature for keystroke clas-
sification. However, directly using the keystroke waveforms as the classification
features leads to a high computational cost in the classification process since
waveforms contain many data points for each keystroke. Therefore, we leverage the
Discrete Wavelet Transform (DWT) to compress the length of the CSI waveform by
extracting the approximate sequence. Below, we will introduce the details.
Wavelet Compression As mentioned in Sect. 3.3.3.1, a discrete keystroke wave-
form Ki [n] can be expressed by the following equation:
3.3 System Design of Cross-layer Privacy Inference Attack 53
Fig. 3.9 CSI waveform differences between two keystroke PIN numbers. (a) Two CSI waveforms
when inputting keystroke PIN number 5. (b) Two CSI waveforms when inputting keystroke PIN
number 4
Ki[n] = (1/√L) Σ_k Wφ[j0, k] φ_{j0,k}[n] + (1/√L) Σ_{j=j0}^{∞} Σ_k Wψ[j, k] ψ_{j,k}[n],   (3.8)
where L represents the length of Ki[n], and Wφ[j0, k] and Wψ[j, k] refer to the
approximation and detail coefficients, respectively. In the first DWT decomposition
step, the length of the approximation coefficients is half of L; for the jth decomposition
step, the length is half of that of the previous step. We use the approximation
coefficients to compress the original keystroke waveforms and thus reduce computational
cost. To achieve a trade-off between reducing the sequence length
and preserving the waveform information, we choose the Daubechies D4 wavelet and
perform a 3-level DWT decomposition in the classification model. Therefore, for the ith
keystroke, the third-level approximation coefficients Fi of Ki are chosen as the feature
of the keystroke. After compression, the length of the feature Fi is about 1/8 of that of Ki[n].
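The compression step can be sketched in pure Python as follows. This is an illustrative sketch, not WindTalker's code: the filter taps are the standard orthonormal 4-tap Daubechies low-pass filter, boundary handling is simply periodic, and a production system would use a wavelet library instead.

```python
import math

# Orthonormal Daubechies 4-tap low-pass analysis filter
_S3 = math.sqrt(3.0)
_H = [(1 + _S3) / (4 * math.sqrt(2)), (3 + _S3) / (4 * math.sqrt(2)),
      (3 - _S3) / (4 * math.sqrt(2)), (1 - _S3) / (4 * math.sqrt(2))]

def dwt_approx(x, levels=3):
    """Return the level-`levels` approximation coefficients of x.

    Each level filters with the low-pass taps and downsamples by 2
    (periodic boundary), so 3 levels shrink the sequence to ~1/8."""
    a = list(x)
    for _ in range(levels):
        n = len(a)
        a = [sum(h * a[(2 * i + k) % n] for k, h in enumerate(_H))
             for i in range(n // 2)]
    return a
```

For an 800-sample keystroke waveform the feature vector has 100 samples; a constant input comes out as a constant scaled by 2^(3/2), since the low-pass taps sum to √2 at each level.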
Fig. 3.10 CSI frequency domain feature difference between two PIN keystrokes. (a) Two CSI
spectrograms when inputting keystroke PIN number 5. (b) Two CSI spectrograms when inputting
keystroke PIN number 4
Besides the CSI waveform shape, the CSI frequency feature can also be used
to differentiate keystrokes. The CSI spectrograms in the frequency domain are a
stable property of CSI streams and are highly correlated to keystrokes. Figure 3.10
illustrates the CSI spectrograms corresponding to the CSI waveforms shown in
Fig. 3.9. It is observed that different keystrokes have significantly different CSI
spectrograms. Therefore, it is feasible to use CSI spectrogram information as the
feature to recognize keystrokes.
In this work, WindTalker first performs the Short-Time Fourier Transform (STFT)
to obtain the two-dimensional frequency spectrograms of CSI. Then, WindTalker
calculates the contours of the spectrograms to extract features. To extract a
contour, WindTalker first resizes the CSI spectrogram (frequencies from 0
to 30 Hz) into an m-by-n matrix MCSI(i, h) and normalizes MCSI(i, h) to the
range between 0 and 1. Note that, in MCSI(i, h), each column represents the
normalized frequency shifts during the ith time slice. Then, WindTalker chooses
a pre-defined threshold and gets the contour CCSI(i), where i = 1, ..., n: CCSI(i)
is the maximum value j which satisfies MCSI(i, j) ≥ threshold. As shown
in Fig. 3.10, the contours are marked by black lines. It is observed that, between
instances of the same keystroke, the contours of the CSI spectrograms have similar variation
trends.
3.4 System Evaluation 55
Thus, we can regard the contours as the frequency-domain features for
classification and calculate the similarity between contours for keystroke
recognition.
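The contour rule can be written down directly. In this sketch the function name and the 0.4 default are ours (the book only says a pre-defined threshold is used), and the spectrogram is assumed to be already resized and normalized to [0, 1]:

```python
def spectrogram_contour(m_csi, threshold=0.4):
    """Extract the contour C(i) from a normalized spectrogram.

    m_csi is a list of columns, one per time slice; m_csi[i][j] is the
    normalized power in frequency bin j during slice i.  C(i) is the
    highest bin whose power still reaches `threshold` (0 if none)."""
    contour = []
    for column in m_csi:
        above = [j for j, v in enumerate(column) if v >= threshold]
        contour.append(max(above) if above else 0)
    return contour
```

The resulting sequence of bin indices, one per time slice, is what gets compared between keystrokes in the frequency domain.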
WindTalker builds a classifier to recognize the keystrokes based on both the time-
domain feature and the frequency domain feature. To compare the features of
different keystrokes, WindTalker adopts Dynamic Time Warping (DTW) to measure
the similarity between two keystrokes. DTW utilizes dynamic programming to
calculate the distance between two sequences with different lengths. With DTW,
the sequences (e.g., time series and spectrogram contours) are warped non-linearly
in the time dimension to measure their similarity. The input of the DTW algorithm is
two sequences, and the output is the distance between them. A low distance indicates
that these two sequences are highly similar.
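The dynamic-programming recurrence behind DTW is compact enough to show in full. This is the textbook formulation with absolute difference as the local cost (a minimal sketch, not WindTalker's implementation):

```python
def dtw_distance(a, b):
    """DTW distance between two numeric sequences of possibly
    different lengths, via dynamic programming."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of diagonal match, insertion, and deletion
            d[i][j] = cost + min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1])
    return d[n][m]
```

Note that a sequence and a time-stretched copy of it (a sample repeated) have distance zero, which is exactly the non-linear warping property the text describes.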
By adopting DTW, the classifier gives each keystroke a set of scores, which
allows the keystrokes to be differentiated based on the user's training dataset
(keystrokes on different PIN numbers). Take the numerical keyboard (i.e., key
values "0-1-2-...-9") as an example: for the i-th keystroke Ki, the classifier first
calculates the DTW distances between the features of Ki and the features of every
key number, in the time and frequency domains, respectively. Thus, for Ki, we
get two score arrays ST = {st0, st1, . . . , st9} and SF = {sf0, sf1, . . . , sf9}, where ST and
SF represent the scores in the time and frequency domains, respectively, and stn refers
to the shortest distance between the input keystroke and the key number n
in the time domain; sfn is defined similarly in the frequency domain. Finally, the classifier
calculates the score array S = {s0, s1, . . . , s9}, where sn = stn × sfn. The lower the score
sn, the higher the possibility that the number n is the actual input. The
classifier chooses the PIN number with the minimum score (the value n for which
sn is lowest) as the predicted key number. Note that the classifier
saves all scores of each keystroke Ki in order to generate password candidates
in Sect. 3.4.2.2.
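The scoring step above can be sketched as follows. Function and parameter names are ours; `dist` would be the DTW distance in WindTalker, but any distance callable works for illustration:

```python
def classify_keystroke(time_feat, freq_feat, templates, dist):
    """Score a keystroke against per-key training templates.

    templates maps each key value to a list of (time_template,
    freq_template) pairs from the training loops.  For each key, st and
    sf are the shortest time- and frequency-domain distances, combined
    as s_n = st_n * sf_n; the key with the minimum score is predicted.
    All scores are returned so password candidates can be ranked later."""
    scores = {}
    for key, pairs in templates.items():
        st = min(dist(time_feat, t) for t, _ in pairs)
        sf = min(dist(freq_feat, f) for _, f in pairs)
        scores[key] = st * sf
    return min(scores, key=scores.get), scores
```

Multiplying the two domain scores means a keystroke must look wrong in both domains to be ranked poorly, which is why the combined features outperform the time-domain feature alone.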
In Sect. 3.2.3, we have shown that different keystrokes may be correlated with
different CSI waveforms. In this section, we aim to explore whether the differences
in keystroke waveforms are large enough to be used for recognizing different PIN
number inputs in a real-world setting. We collected training and testing data from
20 volunteers. Each volunteer first generates 50 loop samples, where a loop is
defined as the CSI waveforms collected while pressing the digits from 0 to 9 in
sequence. After that, we evaluate the classification accuracy of
WindTalker on the collected CSI data.
The classification accuracy is evaluated in terms of 10-fold cross-validation
accuracy. However, in a real-world scenario, it is not reasonable to collect 50
training samples for one specific PIN number. Therefore, we first divide these 50
loops of data into five groups evenly. Then, within every group of 10 loops, we
pick one loop in turn as the testing data and use the other 9 loops as the
training data. WindTalker adopts the classifier introduced in Sect. 3.3.4 to recognize
the keystroke. We perform the evaluation on Xiaomi, Redmi, and Samsung Note3
smartphones, all of which run Android 5.0.2. Figure 3.11a shows the average
classification accuracy across all 20 volunteers and 10 PIN numbers. It is observed that
WindTalker achieves an average classification accuracy of 93.5% using the combined
CSI features. However, if WindTalker only utilizes the time-domain feature, as in [11],
the accuracy drops to 87.3%.
Figure 3.11b shows the color map of the confusion matrix of keystroke
inference. The coordinate (X, Y) indicates the probability that WindTalker
recognizes the keystroke CSI waveform of the single-digit PIN code X as the
single-digit PIN code Y. The darker the color, the greater the probability. Intuitively, an
input number is more easily confused with its neighboring numbers during the
keystroke inference process. We further analyze the impact of the amount of training
data on the recovery rate of WindTalker. Table 3.2 shows that the keystroke inference
accuracy increases as the number of training loops increases. Even with only one training
sample per keystroke, WindTalker can still achieve an overall recovery rate of
79.5%.
In a practical setting, it may not be easy for WindTalker to obtain 9 training
samples for each PIN number. So in the remainder of this section, we only use 3 samples
per PIN number for training. To illustrate the performance of WindTalker for
password inference, in this part, we ask volunteers to press 10 randomly generated
Fig. 3.11 Classification accuracy. (a) Classification accuracy per key. (b) An illustration of
confusion matrix
6-digit passwords on a Xiaomi smartphone and use the corresponding 3 loops as the
training dataset.
We test 500 passwords, which include 3000 keys. As shown in Table 3.2, with 3
loops as training data, WindTalker achieves an average 1-digit recovery rate of
Ap1 = 86.2%. For a 6-digit password in Alipay, the theoretical recovery rate
is Ap1^6 ≈ 41%. However, in real-world scenarios, the attacker can try several
times to recover the password, which increases the success rate. Thus, we introduce
a new metric, the recovery rate with top N candidates, which indicates the rate of
successfully recovering the password within N attempts and represents a more
reasonable metric for describing the capability of the attacker in a practical setting.
As shown in Table 3.3, if we evaluate the 1-digit recovery rate under the top 2 and top
3 candidates, the recovery rate is significantly improved to 93.4% and 96.2%, respectively.
We further study how many candidates are needed for WindTalker to succeed in
predicting the right 6-digit payment password. In particular, we investigate
the inference accuracy under the top N candidates. In the experiment, each 6-digit
password is correlated with six keystrokes K = {K1, K2, . . . , K6}. For
the i-th keystroke's CSI data Ki, WindTalker calculates its corresponding score array
Si = {si,0, si,1, . . . , si,9}. Then, for a given candidate PIN
P = {p1, p2, . . . , p6}, where pi ∈ [0, 9], WindTalker calculates the likelihood
L between K and P, defined by L = Σ_{i=1}^{6} si,pi. Given a 6-digit password
K, for each keystroke Ki, we obtain the 5 candidate digits with the lowest scores si and
then generate 5^6 = 15625 candidate passwords. WindTalker then sorts these
passwords by their likelihoods in ascending order. A password inference is considered
successful if the top N candidates contain the real password. In Fig. 3.12a, we give
the password inference accuracy under the top N candidates, where N ranges from 1 to
20. The result is encouraging: given the top-1 candidate, the inference
accuracy is 41.2%, and the inference rate is significantly improved given the
top-5 or top-10 candidates, which correspond to 69.6% and 77.4%,
respectively. Figure 3.12b further shows that, given enough top N candidates
(e.g., N = 60), the inference accuracy reaches above 85%.
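The candidate-generation step can be sketched as follows. The function name is ours, and the sketch assumes the per-keystroke score dicts produced by the classifier (lower score = more likely digit); it generalizes to any password length:

```python
from itertools import product

def password_candidates(per_key_scores, keep=5):
    """Rank candidate PINs from per-keystroke score dicts.

    Keep the `keep` best digits per keystroke, enumerate every
    combination (5^6 = 15625 for a 6-digit PIN), and sort by the summed
    likelihood of the chosen digits' scores, ascending, so the attacker
    can try the top-N guesses in order."""
    shortlist = [sorted(s, key=s.get)[:keep] for s in per_key_scores]
    ranked = sorted(product(*shortlist),
                    key=lambda combo: sum(s[d] for s, d in
                                          zip(per_key_scores, combo)))
    return ["".join(str(d) for d in combo) for combo in ranked]
```

Pruning to the 5 best digits per position keeps the search space at 15625 instead of 10^6 while rarely dropping the true password, since the correct digit is usually among each keystroke's top candidates.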
There are many factors that potentially affect CSI. The performance of WindTalker is
affected by various factors, such as the relative position of the AP and the mobile device,
the CSI sampling rate, the keyboard layout, human movement, and temporal factors. Even
when clicking the same key, a different distance or direction between the AP and the
mobile device may lead to different CSI. We will investigate the impact of
these factors on WindTalker in our experiments.
Table 3.3 Relationship between the 1-digit PIN recovery rate and the number of candidates
Number of candidates            1       2       3
Recovery rate of 1-digit PIN    86.2%   93.4%   96.2%
Fig. 3.12 The relationship between the 6-digit password recovery rate and candidate numbers. (a)
Top 20 candidates. (b) Top 100 candidates
[Fig. 3.13: keystroke inference accuracy vs. distance between smartphone and antenna]
In a real scenario, the distance between the victim's mobile device and the AP is not
fixed. As shown in Fig. 3.13, the recovery rate of WindTalker decreases as
the distance increases. However, it is observed that, even if the distance
between the antenna of WindTalker and the victim's smartphone (i.e., Xiaomi)
is enlarged to 1.5 m, WindTalker still achieves a keystroke inference accuracy
of 83.5% in terms of 10-fold cross-validation, which is high enough for launching
keystroke inference.
Figure 3.14 shows that both the shape and the magnitude of the CSI waveform change
under different distances when pressing the same key. This indicates that WindTalker
needs to retrain its dataset even for the same victim at different distances.
[Fig. 3.14: CSI waveforms of the same keystroke at different distances]
When the distance between the antenna and the victim is too large, the multipath
propagation becomes more complicated. Thus the collected CSI cannot reflect
the victim's finger movement precisely, which results in inaccurate inference. In order to
partially solve these limitations, there are two possible solutions. First, the attacker
can fix the locations of the table and chairs, which makes the user's position
relatively stable. The other solution is to place the three antennas of the Intel 5300 NIC at
different locations to enlarge the effective range of WindTalker. Then, when the
victim connects to the rogue Wi-Fi, WindTalker can dynamically choose the antenna
closest to the victim to collect CSI data.
The relative direction between the victim and the attacker may seriously affect the
CSI, since different directions mean different multipath propagation between the
transmitter and the receiver. Thus, we evaluate the performance of WindTalker in
different directions. Note that the mobile device (i.e., Xiaomi in this experiment)
is in front of the victim. It is worth pointing out that, for a right-handed user,
WindTalker shows better performance when the AP is on the left side of the victim,
because it is then easier for WindTalker to sense the victim's finger clicks and
hand motion. Figure 3.15 shows the keystroke inference accuracy of WindTalker
under different antenna directions.
[Fig. 3.15: keystroke inference accuracy for antenna directions left, right, front, and behind]
The experiments in Sect. 3.4.2 are implemented on Xiaomi, Redmi, and Samsung
Note3 smartphones. To evaluate the impact of different smartphone types, we recruit
ten volunteers to generate 10 loops of keystrokes on each of the Xiaomi, Redmi, and
Samsung Note3 phones, all running Android 5.0.2. When using all 9 loops
of training data, WindTalker achieves average classification accuracies of 93.5%, 88.3%,
and 83.9% on Xiaomi, Redmi, and Samsung Note3, respectively. The experimental
results indicate that WindTalker's performance is affected by the smartphone type,
because different smartphones may have different antenna positions and
working powers. Fortunately, the accuracy of WindTalker on different smartphones
is still acceptable for password inference.
The keystroke recognition accuracy depends on the sampling rate of CSI. When the
CSI sampling rate is high, there is more information in the CSI waveform, which
increases the keystroke recognition accuracy. Thus, we are interested in how the
CSI sampling rate influences the performance of WindTalker. Figure 3.16 shows the
average classification accuracy of all volunteers with Xiaomi smartphones when
varying the sampling rate from 100 packets/s to 800 packets/s. The experiment
procedures are the same as Sect. 3.4.2 and the antenna is placed at the best
[Fig. 3.16: classification accuracy vs. CSI sampling rate (100-800 packets/s)]
position, as mentioned in Sects. 3.4.3.1 and 3.4.3.2. From Fig. 3.16, we observe
that the classification accuracy improves at higher sampling rates, but
the improvement is not significant beyond 400 packets/s. For
instance, at a sampling rate of 400 packets/s, the classification accuracy on Xiaomi
is 90.3%, only a slight drop compared to the 93.5% achieved at a sampling
rate of 800 packets/s; a rate of 400 packets/s is still enough
to capture the movement features of a keystroke. When the sampling rate is reduced
to 100 packets/s, the accuracy on Xiaomi drops significantly to 82.8%, as this
sampling rate loses the detailed features of the keystroke. In our experiments, we use
a sampling rate of 800 packets/s to achieve the best performance of WindTalker,
but when facing high packet loss, a lower sampling rate
above 100 packets/s still achieves acceptable performance.
Two different keyboard layouts can influence keystroke recognition
accuracy. As shown in Fig. 3.17, besides the numeric keyboard, which is used
in most online payment scenarios, there is the QWERTY keyboard, on which a user
can type letters, numbers, and special characters. The main difference between the
two keyboards is the key spacing. Compared with the numeric keyboard, the hand
movement tends to be more subtle when typing adjacent keys on the QWERTY keyboard,
which makes recognizing keystrokes much more difficult since the CSI waveforms
become similar.
We are interested in how the QWERTY keyboard influences keystroke recogni-
tion. For simplicity, we just focus on the digital input on the QWERTY keyboard.
We perform experiments on the Xiaomi smartphone, and the keyboard layout is
provided by the Google input method. Figure 3.18a shows average classification
accuracy on both the numeric and QWERTY keyboards. We observe that the accuracy
on the QWERTY keyboard is 67.8%, a significant drop compared to the 93.5%
of the numeric keyboard. Figure 3.18b shows the confusion matrix of the QWERTY
keyboard. We observe that most recognition errors happen between adjacent
Fig. 3.18 Keystroke recognition on QWERTY keyboard. (a) Accuracy on different layouts. (b)
Confusion matrix of QWERTY keyboard
keys. Although the accuracy on the QWERTY keyboard is lower than that on the numeric
keyboard, it is still higher than random guessing.
In some cases, CSI-based sensing may be affected by the movement of another
person nearby. Thus, we evaluate the impact of human walking and arm
movement on the performance of WindTalker. As shown in Fig. 3.19a, while
WindTalker collects the CSI data to infer the victim's keystrokes, we recruit a
volunteer to walk along four different lines (L1, L2, L3, L4), respectively. The
distances between WindTalker's antenna and the midpoints of L1, L2, L3, and
L4 are 1, 2, 3, and 4 m, respectively. The distance between the antenna and the
victim is 1 m. We conduct four experiments in total. In each experiment, we ask
the victim to continuously generate keystrokes and collect the corresponding CSI.
At the same time, the volunteer walks along one of the four lines at a speed of
0.5 m/s. Figure 3.19b shows the experimental results. When the distance between
the antenna and the midpoint of the walking person's trajectory is larger than 2 m,
Fig. 3.19 Impact of human movement. (a) The experimental environment. (b) The CSI data under
the human walking scenario. (c) The CSI data under the human arm’s movement scenario
the keystrokes can be easily found in the collected CSI waveforms. However,
when the distance is 1 m (i.e., the walking person's trajectory is very close to the
victim), it is hard to extract keystroke waveforms from the collected CSI data. The
results show that human walking introduces additional multipath effects into the
wireless transmission. Nevertheless, WindTalker remains effective as long as there is no
human walking within 2 m of WindTalker's antenna.
Besides human walking, we also consider another scenario in which a person
stays at a fixed location but waves his/her arms. We conduct four experiments. While
the victim continuously generates keystrokes, we ask the volunteer to stay at the
midpoints (i.e., C1, C2, C3, C4 in Fig. 3.19a) of the above four lines, respectively, and
wave his/her arm at an average speed of 0.91 cycles per second. As shown in
Fig. 3.19c, when the distance between the antenna and the volunteer is larger than
3 m, the keystrokes can be recognized from the collected CSI data. When the distance
is 1 m (i.e., the victim is very close to the other person), the keystrokes are hard to
extract. Therefore, WindTalker works normally if no user waves
his/her arms within 3 m of WindTalker's antenna.
3.5 When CSI Meets Public Wi-Fi: A Case Study in Mobile Payment Platforms 65
Temporal factors also affect the performance of WindTalker. Figure 3.20
shows how the CSI waveform changes across different days. We can observe that the
CSI shape patterns look different. The reason is that, on different days, the user's
typing behavior may be inconsistent and the surrounding environment may change,
which affects the constructive and destructive interference of the multipath
signals. Therefore, in its current state, for each keystroke inference, WindTalker
needs to update the user's CSI profiles to ensure its performance. We leave this
limitation for future work.
The experimental evaluation in the previous sections was mainly carried out in a
laboratory environment. To demonstrate the practicality of WindTalker, we
perform an experimental evaluation of password inference on Alipay, a popular
mobile payment platform on both Android and iOS.
Attack Deployment in Real-World Mobile Payment Platform Alipay is the
largest mobile payment company in the world and had 710 million monthly active
users in June 2020.1 As shown in Fig. 3.21, we deploy a WindTalker system at a
1 https://m.yicai.com/news/100747459.html.
Fig. 3.22 WindTalker performance in the case study. (a) Recognize the sensitive input window.
(b) Original CSI waveform: the 30th subcarrier. (c) Password inference procedure
In this section, we propose some defense strategies to thwart the attacker. We
hope that this discussion will raise privacy awareness around Wi-Fi hotspots and
inspire other researchers to develop more advanced defense techniques.
The basic idea of the proposed defense strategy is to introduce a randomly generated
CSI time series to obfuscate the original one. As shown in Fig. 3.23, during
the sensitive input time window, while the attacker collects the CSI data (the original
CSI data) from the target user, the user's device can randomly generate CSI
data (obfuscation data) to obfuscate the original CSI data and thwart the side-
channel attack. According to the IEEE 802.11n standard [10], when the user does not
launch sensitive applications, the attacker can obtain the CSI data by analyzing the
training sequence in the preamble of the Wi-Fi packets obtained from the victim's
device. Without loss of generality, the original CSI between victim and attacker
is estimated as:
H1 = Y1 / X1,   (3.9)
where X1 is the training sequence on the transmitter and Y1 is the training sequence
on the receiver. In practice, both the transmitter and receiver assume that the training
sequence X1 will not change during the whole communication process.
During the password input process for a specific mobile payment application
(e.g., Alipay), the defense strategy is launched. The attacker uses ICMP
requests to obtain Wi-Fi packets from the victim, and, at the same time, the user's
device proactively sends obfuscation packets to the attacker. For instance, the
training sequence X1 in Eq. 3.9 is changed into:
X2 = H X1.   (3.10)
Accordingly, the training sequence observed at the receiver becomes:
Y2 = H1 X2 = H1 H X1 = H2 X1.   (3.11)
From the attack’s perspective, the original and obfuscation data are indistinguish-
able. Because the attacker still utilizes the original training sequence X1 to estimate
CSI, therefore the attacker would estimate the victim’s CSI as H2 = H1 H . It
means that the original CSI data will be masked by inserting forged CSI data H2
into the original CSI sequence H1 . Thus the CSI-based side-channel attack can be
thwarted because the attacker cannot infer the user’s keystroke by analyzing CSI
data.
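The effect of Eqs. 3.9-3.11 can be illustrated with a toy scalar model. The function and variable names here are ours, and real CSI is a vector of complex values across subcarriers rather than a single scalar; the sketch only shows why the attacker's estimate becomes H2 = H1 H:

```python
def estimate_csi(x_train, y_received):
    """Channel estimate from a known training symbol: H = Y / X
    (toy single-subcarrier model)."""
    return y_received / x_train

# True channel between victim and attacker, and the training symbol
# both sides nominally agree on (illustrative values).
H1 = complex(0.8, 0.3)
X1 = complex(1.0, 0.0)

# Normal packet: the attacker recovers the true CSI H1 (Eq. 3.9).
assert abs(estimate_csi(X1, H1 * X1) - H1) < 1e-12

# Obfuscated packet: the device pre-multiplies the training sequence by
# a random H (Eq. 3.10).  The attacker, still assuming X1 was sent,
# estimates H2 = H1 * H instead of H1 (Eq. 3.11).
H = complex(1.5, -0.4)
H2 = estimate_csi(X1, H1 * (H * X1))
assert abs(H2 - H1 * H) < 1e-12
```

Because H is drawn freshly per packet, the forged estimates H2 vary randomly and drown out the keystroke-induced variations the attacker is looking for.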
3.6 Countermeasures and Discussions 71
3.7 Summary
References
1. Ali, K., Liu, A.X., Wang, W., Shahzad, M.: Keystroke recognition using WiFi signals.
In: Proceedings of the 21st Annual International Conference on Mobile Computing and
Networking, pp. 90–102. ACM (2015)
2. Balzarotti, D., Cova, M., Vigna, G.: ClearShot: eavesdropping on keyboard input from video.
In: Security and Privacy, 2008. SP 2008. IEEE Symposium on, pp. 170–183. IEEE (2008)
3. Benko, H., Wilson, A.D., Baudisch, P.: Precise selection techniques for multi-touch screens.
In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp.
1263–1272. ACM (2006)
4. Chen, B., Yenamandra, V., Srinivasan, K.: Tracking keystrokes using wireless signals. In:
Proceedings of the 13th Annual International Conference on Mobile Systems, Applications,
and Services, pp. 31–44. ACM (2015)
5. Fang, S., Markwood, I., Liu, Y., Zhao, S., Lu, Z., Zhu, H.: No training hurdles: fast training-
agnostic attacks to infer your typing. CCS ’18, p. 1747–1760. Association for Computing
Machinery, New York (2018). https://doi.org/10.1145/3243734.3243755
6. Forlines, C., Wigdor, D., Shen, C., Balakrishnan, R.: Direct-touch vs. mouse input for tabletop
displays. In: Proceedings of the SIGCHI conference on Human factors in computing systems,
pp. 647–656. ACM (2007)
7. Halperin, D.: Linux 802.11n CSI tool. http://dhalperi.github.io/linux-80211n-csitool/faq.html
8. Halperin, D., Hu, W., Sheth, A., Wetherall, D.: Tool release: gathering 802.11 n traces with
channel state information. ACM SIGCOMM Comput. Commun. Rev. 41(1), 53–53 (2011)
9. Holt, C.C.: Forecasting seasonals and trends by exponentially weighted moving averages. Int.
J. Forecasting 20(1), 5–10 (2004)
10. IEEE Std. 802.11n-2009: enhancements for higher throughput (2009). http://www.ieee802.org
11. Li, M., Meng, Y., Liu, J., Zhu, H., Liang, X., Liu, Y., Ruan, N.: When CSI meets public WiFi:
inferring your mobile phone password via WiFi signals. In: Proceedings of the 2016 ACM
SIGSAC Conference on Computer and Communications Security, CCS ’16, pp. 1068–1079.
ACM, New York (2016). https://doi.org/10.1145/2976749.2978397
12. Liu, J., Wang, Y., Kar, G., Chen, Y., Yang, J., Gruteser, M.: Snooping keystrokes with mm-level
audio ranging on a single phone. In: Proceedings of the 21st Annual International Conference
on Mobile Computing and Networking, pp. 142–154. ACM (2015)
13. Liu, X., Zhou, Z., Diao, W., Li, Z., Zhang, K.: When good becomes evil: keystroke inference
with smartwatch. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and
Communications Security, pp. 1273–1285. ACM (2015)
74 3 Privacy Breaches and Countermeasures at Terminal Device Layer
14. Mao, Y., Zhang, Y., Zhong, S.: Stemming downlink leakage from training sequences in multi-
user MIMO networks. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer
and Communications Security, pp. 1580–1590. ACM (2016)
15. Marquardt, P., Verma, A., Carter, H., Traynor, P.: (sp)iPhone: decoding vibrations from nearby
keyboards using mobile phone accelerometers. In: Proceedings of the 18th ACM Conference
on Computer and Communications Security, pp. 551–562. ACM (2011)
16. Nakhila, O., Dondyk, E., Amjad, M.F., Zou, C.: User-side Wi-Fi evil twin attack detection
using SSL/TCP protocols. In: 2015 12th Annual IEEE Consumer Communications and
Networking Conference (CCNC), pp. 239–244 (2015). https://doi.org/10.1109/CCNC.2015.
7157983
17. Nakhila, O., Zou, C.: User-side Wi-Fi evil twin attack detection using random wireless channel
monitoring. In: MILCOM 2016—2016 IEEE Military Communications Conference, pp. 1243–
1248 (2016). https://doi.org/10.1109/MILCOM.2016.7795501
18. Owusu, E., Han, J., Das, S., Perrig, A., Zhang, J.: Accessory: password inference using
accelerometers on smartphones. In: Proceedings of the Twelfth Workshop on Mobile Com-
puting Systems & Applications, pp. 1–6 (2012)
19. Postel, J.: Internet Control Message Protocol. RFC 792, Internet Engineering Task Force
(1981). http://www.rfc-editor.org/rfc/rfc792.txt
20. Ristenpart, T., Tromer, E., Shacham, H., Savage, S.: Hey, you, get off of my cloud: exploring
information leakage in third-party compute clouds. In: Proceedings of the 16th ACM Confer-
ence on Computer and Communications Security, pp. 199–212. ACM (2009)
21. Sardy, S., Tseng, P., Bruce, A.: Robust wavelet denoising. IEEE Trans. Signal Process. 49(6),
1146–1152 (2001). https://doi.org/10.1109/78.923297
22. Sen, S., Lee, J., Kim, K.H., Congdon, P.: Avoiding multipath to revive inbuilding WiFi
Chapter 4
Wireless Signal Based Two-Factor Authentication at Voice Interface Layer
4.1 Introduction
In the smart home environment, users can control their domestic appliances
(e.g., lights, temperature controllers, electronic switches, microwaves, refrigerators)
via various user interfaces such as image sensing, wireless communication, and voice
commands. As reported by Market Research Future in 2020 [16], the
voice interface is predicted to become the primary user interface for the smart home,
with global revenue reaching $7.3 billion by 2025. Currently, typical IoT
voice control systems include Amazon Alexa [3], Samsung SmartThings [28],
and Google Home [17].
However, the inherent broadcast nature of voice opens a door for adversaries
to inject malicious commands (i.e., spoofing attacks). Besides the classical replay
attack [10, 35], researchers have also proposed emerging attacks that leverage flaws
in smart speakers. On the hardware side, the nonlinearity of the microphone's
frequency response enables inaudible ultrasound-based attacks (e.g., the
Dolphin attack [40] and the BackDoor attack [27]). On the software side, the deep
learning models employed by both speech recognition and speaker verification have
been shown to be vulnerable to emerging adversarial attacks such as hidden voice
commands [5], CommanderSong [38], and user impersonation [42]. Spoofing attacks impose
safety issues (e.g., deliberately turning on the smart thermostat [11]) and
privacy risks (e.g., querying the user's schedule information) on the smart speaker and
therefore cause great concern.
To thwart voice spoofing attacks, the most intuitive defense strategy is
voice password-based access control, in which the user is required to speak a special
password before issuing voice commands [1]. However, speaking a password is either
inconvenient for the user or vulnerable to eavesdropping attacks. Therefore, liveness
detection has been proposed as an effective countermeasure. Liveness detection leverages
the fact that the voices in a spoofing attack are played by electrical devices (e.g., a high-quality
loudspeaker [35] or an ultrasonic dynamic speaker [40]). Thus, physical
characteristics that differ between humans and machines can be used as
"liveness" factors. The existing countermeasures (a.k.a. liveness detection) can
be divided into multi-factor authentication and passive schemes. In this chapter, we
focus on the former, multi-factor authentication.
Multi-factor authentication-based liveness detection exploits information
that is closely correlated with the operation of the VCS (i.e., image/video collected by a camera [7],
the magnetic field emitted by a loudspeaker [6], time-difference-of-arrival changes across the
microphones of a smartphone [39], acceleration data from the user's wearable devices [15], and
the Doppler shift of ultrasound caused by the user's mouth motion [41]) as the user's
liveness features to differentiate between the voice samples generated by the legitimate
user and by an adversary. However, the existing two-factor-based liveness detection
schemes require the user either to carry specialized sensing devices or to perform
specific actions to collect the liveness information, which limits their practicality.
More seriously, some of these schemes pose unacceptable privacy risks, since the
user's daily behaviors may be leaked from the collected information (e.g., the image
or video data in [7]).
In this chapter, we present WSVA, a wireless signal-based voice authentication
system to thwart spoofing attacks against the VCS. Unlike prior liveness detection
schemes, WSVA is a device-free system that does not require the user to carry any
additional device or sensor; instead, it leverages the prevalent wireless signals generated
by Wi-Fi devices in the IoT environment. WSVA is motivated by the following
observations. Firstly, as demonstrated by the wide application of lip-reading technology,
it is feasible to understand speech by sensing the movements of the lips, face,
and tongue. In other words, a voice command can be cross-checked against the user's
mouth motions. Secondly, prior research shows that indoor object movement
disturbs the multipath propagation of wireless signals and is reflected in the Channel
State Information (CSI) of Wi-Fi signals, so a variety of human activities can be
identified by CSI-based wireless sensing techniques. Therefore, WSVA aims
to build the correlation between the user's mouth motion and the environmental
CSI change, and to leverage this correlation to verify the liveness of voice commands
received by the voice interface.
To achieve the goal mentioned above, WSVA needs to address three challenges.
(1) The impact of mouth motion on wireless signals is subtle. Although previous
works utilize sophisticated methods such as MIMO beamforming or Frequency-Modulated
Carrier Waves (FMCW) [33, 36] to improve wireless sensing
capability, they may not suit our problem, because commercial
IoT devices are resource-constrained and cannot implement these sophisticated
wireless techniques. (2) According to our experimental results, only the jaw and
tongue movements can be recognized by wireless signals, while the vocal cord vibration,
which contributes substantially to the voice signal, cannot be distinguished. Besides,
prior works [24, 33] pointed out that not all voice syllables can be recognized
by lip-reading techniques. (3) To correlate the voice and CSI signals, selecting
appropriate features from these two heterogeneous signals remains a big
challenge.
To address the above challenges and achieve liveness detection, WSVA first
builds a new model to describe the correlation among the CSI changes, the mouth
motions, and the syllables of the received voice signals. Then, WSVA applies a
novel signal processing method to filter the noise in the collected voice and CSI signals
and to extract the syllables and mouth motions within the voice command. Further,
WSVA extracts both time-domain and frequency-domain features from the two
types of signals and performs liveness detection. We conduct experiments to
evaluate the liveness detection performance of WSVA and demonstrate its feasibility
in the IoT environment. The contributions of this chapter are summarized as follows:
• We present WSVA, a two-factor liveness detection system to thwart various
voice spoofing attacks against the VCS. By utilizing the existing wireless signals
in the IoT environment, WSVA has the advantages of being device-free, easy to
deploy, and privacy-preserving.
• We study the correlation between voice samples and wireless signals. Specifically,
we build a mapping model to correlate the syllables within the voice
command, the user's mouth motions, and their corresponding CSI change
patterns.
• We devise the architecture and algorithms of WSVA. We exploit effective
technical mechanisms to process voice samples and CSI data, design novel
algorithms to extract features from these different types of signals, and
propose the liveness decision algorithm.
• We design and implement a testbed to demonstrate the practicability of WSVA.
We evaluate the impact of various factors on WSVA, and our experimental results
on 6 volunteers show that WSVA achieves 99% liveness detection accuracy with
a 1% false accept rate.
We emphasize that WSVA does not propose to use wireless signals for lip reading,
since existing works [13, 33] have shown that lip reading accuracy via wireless signals is limited.
Instead, this chapter aims to utilize the consistency between the voice and CSI signals
to authenticate voice commands.
The remainder of this chapter is organized as follows. Section 4.2 introduces the
preliminary knowledge and the research motivation. In Sect. 4.3, we introduce the
design details of WSVA. We evaluate the performance of WSVA and discuss its
limitations in Sects. 4.4 and 4.5, respectively. Finally, Sect. 4.6 concludes this
chapter.
4.2 Preliminaries and Motivation

In this section, we introduce the background knowledge, including the attack model
and articulatory gestures. Then, to elaborate on the research motivation of WSVA, this
section performs a series of experiments to answer the following questions: (1)
Do mouth motions really correlate with changes in Wi-Fi signals?
Fig. 4.1 The spoofing attack scenario (attacker, loudspeaker, microphone, and Alexa)
(2) How can we model this correlation between the mouth motions and the CSI
vibration?
As shown in Fig. 4.1, a spoofing attack is defined as an adversary attempting to fool
the VCS by injecting pre-collected or forged voice commands. Existing
studies show that there are three major types of spoofing attacks.

Replay Attack To fool the voice interface, the attacker collects the legitimate
user's audio samples and then plays them back with a high-quality loudspeaker
[10]. For a voice interface without liveness detection functionality, the replayed
voice will be interpreted as a legitimate voice command. The victim's voice can be
recorded or captured in many ways, including from websites, daily conversations,
and phone calls.
Advanced Adversarial Attacks Even if attackers can collect only a limited
number of the target user's voice samples, by adopting the latest voice synthesis
techniques [8] it is still feasible to attack existing speech recognition and speaker
verification systems. For instance, the adversary can embed subtle perturbations into
audio (e.g., background sound [5], music [38], or a broadcaster's voice [42])
or use inaudible ultrasounds [27, 40] to launch an attack without raising the victim's
suspicion.
Without loss of generality, in the remainder of this chapter, we use spoofing
attacks to refer to all of the above kinds of attacks. Our proposed defense
scheme is based on the fact that, in spoofing attacks, the fake voice commands
are generated by a machine rather than a human, which means that there are no
corresponding mouth motions for these voice commands. This inconsistency can be
leveraged for liveness detection.

Note that the replay attack is the most effective among the various spoofing
approaches, since it preserves the most comprehensive voiceprint of the victim
and requires no cumbersome hardware configuration or software parameter fine-tuning.
As shown in Fig. 4.2a, it is well known that articulation involves multiple human organs
(e.g., the vocal cords, tongue, lips, and jaw). Voice differences depend on the motions
of these organs, which affect the vibration frequency of the air (i.e., the timbre).
According to the position of air vibration, the procedure of voice generation can be
divided into the following three stages: (1) Voice generation starts when
air is sent out from the thorax. The air passes through the vocal cords, comprising
cartilages and muscles, whose different shapes and positions have a significant
effect on air propagation. (2) The air arrives at the soft palate after passing through
the pharynx. The soft palate controls the direction and speed of the airflow and
decides whether it can enter the nasal cavity. (3) The voice wave is about to leave
the mouth when the air arrives at the oral cavity, after which the voice spreads
in the air. In this stage, the user can produce different phonemes with different
motions of the tongue, lips, and jaw, which is known as an articulatory gesture. As
Fig. 4.2 Articulatory gestures for voice pronunciation. (a) Vocal organs and consonant pronunci-
ation [25]. (b) Vowel pronunciation
shown in Fig. 4.2, according to the International Phonetic Alphabet [37], users
pronounce different phonemes with different mouth shapes. For instance, as shown
in Fig. 4.2b, the jaw is halfway open and fully open when
the user pronounces /e/ and /a/, respectively.
In this subsection, we first review the wireless signal-related knowledge introduced
in Sect. 3.2.1. Then, we explore the relationship between mouth motions and
CSI vibrations. Finally, we model the correlation among CSI vibrations, voice
syllables, and the user's mouth motions.
Following Sect. 3.2.1, for a single propagation path, the CSI of a subcarrier at
frequency f can be modeled as

H(f) = α(f) e^(−j2πf τ),  (4.1)

where α is the signal magnitude attenuation, f is the subcarrier frequency, and τ is the
time-of-flight. Given the length of the signal propagation path d, the signal wavelength λ,
and the speed of light c, τ can be calculated as τ = d/c, and Eq. 4.1 can be rewritten as:

H(f) = α(f) e^(−j2π d/λ).  (4.2)
According to Eqs. 4.1 and 4.2, when the user speaks a voice command, the
movements of the lips and the jaw change the d and α of the reflected wireless signal paths.
The resulting constructive and destructive interference of the multipath signals is
reflected as a unique pattern in the time series of CSI values, which can be related to
the presence of a legitimate voice command. In this study, CSI extraction is
straightforward: we deploy a Universal Software Radio Peripheral (USRP) [14] to extract
CSI with 52 subcarrier values.
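As a rough illustration of Eqs. 4.1 and 4.2, the following sketch models one static line-of-sight path plus one path reflected off the speaker's mouth. The path lengths, attenuations, and the two-path simplification are illustrative assumptions, not measurements from this chapter; the point is only that a millimeter-scale change in the reflected path length d perturbs the aggregate CSI amplitude through interference.

```python
import cmath

def csi_two_path(d_static, d_reflect, wavelength=0.125,
                 a_static=1.0, a_reflect=0.3):
    """Aggregate CSI of a static path plus one reflected path, per the
    simplified model H = alpha * exp(-j * 2*pi * d / wavelength)."""
    h_static = a_static * cmath.exp(-2j * cmath.pi * d_static / wavelength)
    h_reflect = a_reflect * cmath.exp(-2j * cmath.pi * d_reflect / wavelength)
    return h_static + h_reflect

# A few millimeters of jaw/lip displacement changes the reflected path
# length and therefore the interference, shifting the CSI amplitude.
amp_before = abs(csi_two_path(3.0, 3.400))
amp_after = abs(csi_two_path(3.0, 3.404))
```

With a 12.5 cm wavelength (2.4 GHz), a 4 mm path change already produces a measurable amplitude shift, which matches the intuition that mouth motion is visible in, but near the limit of, Wi-Fi sensing.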
Figure 4.3a demonstrates the typical scenario of human voice commanding in voice
interfaces such as SmartThings or Amazon Alexa platform. When a user interacts
Fig. 4.3 Illustrations of the basic idea of WSVA. (a) Human speaking scenario. (b) Voice and CSI
samples during authentic voice commanding. (c) CSI samples during voice spoofing
with the voice interface, WSVA exploits a pair of antennas on nearby IoT devices
to collect CSI data from Wi-Fi packets and simultaneously leverages a microphone
to record the voice samples. Generally speaking, since CSI reflects
the constructive and destructive interference of multipath signals in the environment,
the change of multipath propagation caused by mouth motions during
speech generates a unique pattern in the time series of CSI values.
In this case, we investigate the influence of mouth motions on the CSI, which
can be regarded as a liveness pattern of the user. As shown in Fig. 4.3b,
dramatic fluctuations in the CSI waveforms accompany the occurrence of human voice
commands. However, as shown in Fig. 4.3c, if an adversary launches the spoofing
attack described in Sect. 4.2.1, in which the spoofing voice command is injected
without any corresponding mouth motion, the attack can be easily detected due
to the lack of corresponding changes in the CSI data. Therefore, our experimental
results validate our intuition that it is feasible to leverage the consistency of
fluctuations between voice samples and CSI streams to detect spoofing attacks.
Previous works have demonstrated that human movements can be sensed via
wireless signals [26, 29, 31, 34]. However, in the IoT environment, achieving
precise speech recognition is hard, since it may go beyond the sensing capability
of the Wi-Fi signal. As shown in Eq. 4.2, the sensing capability of a wireless signal
Fig. 4.4 Four types of mouth motion shapes. (a) Hiant. (b) Grin. (c) Round. (d) Pout
depends on the wavelength of the signal. In practice, Wi-Fi signal-based sensing
mechanisms (e.g., with a 12.5 cm wavelength at 2.4 GHz) cannot accurately capture
tiny human mouth motions. To make matters worse, apart from the motions of the
tongue, lips, and jaw, Wi-Fi can hardly capture the impact of the other vocal organs.
According to the study of Dodd et al. [12], only 40% of English words can be
recognized by considering mouth motions alone.
Considering that it is not feasible to achieve accurate lip reading via Wi-Fi
signals, WSVA authenticates voice commands by checking the consistency
between the voice and CSI signals rather than accurately identifying each syllable.
Therefore, in this chapter, by analyzing the International Phonetic Alphabet, we
classify mouth motions into four categories: hiant, grin, round, and
pout, which correspond to Fig. 4.4a–d, respectively. With the exception of a few
syllables (e.g., /ə/) with non-significant mouth motions, most phonetic syllables
can be categorized into one of these types. As shown in Table 4.1, the hiant, the
motion of opening the mouth widely, produces phonemes like /a:/ and /æ/,
which can be heard in the words "bar" and "cat". The grin, the motion of grinning
as in Fig. 4.4b, produces phonemes like /e/ and /ei/, which
can be heard in "A" and "base". The round, rounding the lips at ease, generates
phonemes like /ɔ:/, which can be heard in "lot" and "saw". Finally, the pout, the
motion of pouting the lips, produces phonemes like /u:/, which can be heard
in the words "root" and "shoe". After such a classification, different types of mouth
motions can be correlated with different CSI features according to the relevant voice
syllables, as described in the following sections.
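This four-way categorization reduces to a lookup table. The sketch below is an abridged, hypothetical rendering of Table 4.1, using ASCII stand-ins for the IPA symbols ("ae" for /æ/, "o:" for /ɔ:/, "@" for /ə/):

```python
# ASCII stand-ins for IPA: "ae" = /ae/, "o:" = open-o, "@" = schwa.
MOUTH_MOTION = {
    "a:": "hiant", "ae": "hiant",  # heard in "bar", "cat"
    "e": "grin", "ei": "grin",     # heard in "A", "base"
    "o:": "round",                 # heard in "lot", "saw"
    "u:": "pout",                  # heard in "root", "shoe"
}

def motions_for_syllables(syllables):
    """Map each syllable's vowel nucleus to one of the four mouth motion
    categories; syllables with non-significant motion (e.g., schwa) map
    to None and are skipped by later matching stages."""
    return [MOUTH_MOTION.get(s) for s in syllables]
```

A full implementation would cover the complete vowel inventory, but the classification step itself stays this simple.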
4.3 WSVA: Wireless Signal Based Two-Factor Authentication
noise in CSI and segments the collected voice samples. The Feature Extraction Module
enables WSVA to select appropriate features at the macro level and the mouth motion
level, respectively. Finally, the Feature Matching Module utilizes a classification mechanism
to determine whether the received voice command is authentic or spoofed.
This subsection introduces how WSVA collects voice samples and the corresponding
CSI data. Most voice interfaces (e.g., Google Now and Amazon Alexa)
require the user to speak a pre-defined magic word as a trigger. For instance, the Apple
iPhone needs "Hey, Siri" and Amazon Alexa needs "Alexa" to initialize their voice
assistants. Only when the voice trigger is recognized by the VCS is WSVA
activated to collect the voice samples V and the CSI data H by utilizing the
microphone and the antenna pair, respectively. To collect CSI, the antennas can be
mounted on different devices or incorporated into the same IoT device. One of these
antennas acts as a transmitter to continuously send wireless packets (e.g., broadcast
packets), and the other receives the packets and extracts CSI data from the preamble
sequences of these wireless packets.
Fig. 4.6 An example of word-level segmentation. (a) Original voice samples. (b) An illustration
of the double-threshold method. (c) Inter-word segmentation on CSI data
Word Level Segmentation When the user speaks a command, there is a short interval
(e.g., 200 ms) between the pronunciations of two successive words. Therefore,
the intervals between word samples can be utilized to segment a voice command
into different word samples. WSVA exploits the double-threshold detection method.
Specifically, WSVA splits the voice samples V into frames of Nv points,
shifting by Ns points each time. In this study, Nv and Ns are set to 512 and 256,
respectively. For a total of N frames, WSVA calculates their short-term energy STE[n]
and zero-crossing rate ZCR[n], and selects two adaptive thresholds on STE[n] and
ZCR[n] to detect the start and end points sv,i and ev,i of the i-th word Wi. Then,
according to the timestamps, we can also divide the CSI data into several word
waveforms. Figure 4.6 illustrates the procedure of inter-word segmentation. For
the k-th CSI subcarrier H(:, k), the CSI waveform of the i-th word Wi can
be represented as follows:

HWi = H(sc,i : ec,i , k),  (4.3)

where sc,i and ec,i are the start and end CSI indexes of the i-th word Wi, which are
converted from the timestamps sv,i and ev,i of the voice samples. Note that sc,i and ec,i are
extended on both sides by 200 CSI indexes, respectively, due to the fact that the CSI
change introduced by the mouth motion can occur a little bit earlier or later than the
speech can be heard.
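The double-threshold detection above can be sketched as follows. The fixed thresholds and the simple rule of requiring high energy together with low zero-crossing rate are simplifying assumptions; the actual method uses adaptive thresholds, and ZCR classically also serves to extend word boundaries over unvoiced consonants.

```python
def short_term_energy(frame):
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))

def detect_words(samples, ste_thresh, zcr_thresh, n_v=512, n_s=256):
    """Split samples into frames of n_v points shifted by n_s points, mark
    a frame as voiced when STE is high and ZCR is low, then merge runs of
    voiced frames into word segments. Returns (start, end) frame indexes."""
    frames = [samples[i:i + n_v]
              for i in range(0, len(samples) - n_v + 1, n_s)]
    voiced = [short_term_energy(f) > ste_thresh and
              zero_crossing_rate(f) < zcr_thresh for f in frames]
    words, start = [], None
    for i, v in enumerate(voiced + [False]):  # sentinel closes a final word
        if v and start is None:
            start = i
        elif not v and start is not None:
            words.append((start, i - 1))
            start = None
    return words
```

Silence frames fail the energy test, so a single loud low-ZCR burst surrounded by silence yields exactly one segment.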
Phoneme Level Segmentation and Mouth Motion Inference Pronouncing a specific word
may involve more than one mouth motion. For instance,
speaking the word "open" requires the mouth motions "round" and "grin". Besides,
as mentioned in Sect. 4.2.3.3, the correlation between the categories of mouth
motions and the CSI vibration types is a key factor that can be leveraged in liveness
detection. Therefore, the next step of WSVA is to divide the given CSI word
waveforms into multiple CSI mouth motion waveforms and then calculate the
similarity between the collected CSI mouth motion waveforms and the pre-trained CSI
motion data.
Similar to word-level segmentation, WSVA processes the user's voice samples
and infers the start and end points of each mouth motion. In particular,
WSVA first utilizes automatic speech recognition to identify each word of a voice
command; the state-of-the-art system DeepSpeech [18] is utilized to perform this
task automatically. After identifying the words, WSVA then utilizes the Munich
Automatic Segmentation System (MAUS), a widely adopted phonetic segmentation
system [19]. MAUS is based on the Hidden Markov Model method, and it can label
the phonemes of voice signals by analyzing the sound file and a text description of the
voice. Specifically, based on the standard pronunciation model, the identified text
is transformed into its expected pronunciation. Then, a probabilistic graph is
generated by combining the canonical pronunciation with millions of different
accents; the graph contains all possible phoneme combinations and their corresponding
probabilities. MAUS finally adopts the Hidden Markov Model to perform a path search
and find the combination of phonetic units with the highest probability.
After combining the phonemes into syllables and inferring the mouth motions
according to the International Phonetic Alphabet, we obtain the segmented and
labeled mouth motions of the input voice command. WSVA matches the
timestamps of each segmented motion to the CSI samples to extract the CSI mouth
motion waveforms, using the method defined in Eq. 4.3. One example is illustrated in
Fig. 4.7, which segments a voice command ("Open the door") into several phonemes
and extracts the CSI mouth motion waveforms. It is worth mentioning that, since the
number of voice commands commonly used in a VCS is limited, the performance of
speech recognition can be improved according to a pre-defined common command
set. In addition, WSVA also utilizes the inter-word segmentation results to improve the
phoneme segmentation performance. After these steps, we obtain the start and end
points of all Nm mouth motions M = {M1, M2, ..., MNm} in a voice signal, and then
extract the CSI data HMi for the i-th mouth motion Mi.
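Aligning the segmented timestamps with CSI indexes reduces to a sample-rate conversion. In the sketch below, the 16 kHz audio rate and 1 kHz CSI packet rate are assumed values for illustration; the 200-sample guard on each side follows the extension rule from the word-level segmentation step.

```python
def voice_span_to_csi_span(s_v, e_v, voice_rate=16000, csi_rate=1000,
                           guard=200):
    """Convert a word/motion span given in voice-sample indexes into
    CSI-sample indexes, assuming both streams started recording at the
    same instant. The span is widened by `guard` CSI samples per side
    because mouth motion can begin slightly before, and end slightly
    after, the audible speech."""
    s_c = max(int(s_v * csi_rate / voice_rate) - guard, 0)
    e_c = int(e_v * csi_rate / voice_rate) + guard
    return s_c, e_c
```

For example, a word spanning voice samples 16000 to 32000 (seconds 1 to 2 of audio) maps to CSI samples 800 to 2200 under these assumed rates.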
Fig. 4.7 Phoneme-level segmentation of a voice command and the corresponding CSI mouth
motion waveforms (voice and CSI data)
After data cleansing and pre-processing of the CSI and voice samples, WSVA selects
appropriate features to characterize the consistency between these two types
of signals. As mentioned in Sect. 4.2.3, at the macro level, CSI variations are
observed to occur along with human pronunciation. Besides, the CSI data of different
mouth motion types show different features, which is another criterion for describing
consistency. Therefore, WSVA extracts features at both the macro level and the
mouth motion level to determine whether the voice command and the mouth motion
are consistent.
Fig. 4.8 Illustration of the macro-level feature extraction. (a) The spectrogram of voice samples.
(b) The spectrogram of CSI data in the normal scenario. (c) The spectrogram of CSI data in the
attack scenario
mouth motion, the CSI contour is disordered and not consistent with that of the voice
samples. Therefore, to measure the consistency between voice and CSI samples, an
intuitive solution is to calculate the similarity between the spectrogram contours of
these signals.
However, directly calculating the similarity between the spectrogram contours of
the whole V and HPCA is inappropriate, since the frequency shifts of these signals are affected
by different factors (i.e., voice tones for the voice and mouth movements for the CSI), which
are not necessarily related. Instead, for the Nw words W = {W1, W2, ..., WNw} in
the command, WSVA calculates the similarity between the contours of the voice and CSI
signals for each word Wi, and then combines these similarities to obtain the macro-level
similarity SMacro. For the i-th word Wi, to calculate its similarity, we first
extract its CSI and voice samples HWi and VWi:

HWi = H(sc,i : ec,i , :),  (4.4)
VWi = V(sv,i : ev,i ),  (4.5)

where sv,i and ev,i are the beginning and ending indexes of the i-th word Wi's voice samples,
and sc,i and ec,i are the beginning and ending indexes of its CSI samples. Lv,i and Lc,i
are the spans of the voice and CSI samples of Wi, in which Lv,i = ev,i − sv,i + 1 and
Lc,i = ec,i − sc,i + 1. Note that, instead of directly using Eq. 4.3, we extend both sides of
HWi and VWi to obtain more details about Wi.
Then, we extract the contour CCSI,Wi from the frequency spectrogram of the i-th
word's CSI data. Firstly, we resize the CSI spectrogram with frequencies from 0 to
30 Hz into an m-by-n matrix MCSI(j, k) and normalize MCSI(j, k) to a range
between 0 and 1. Note that, in MCSI(j, k), each column represents the normalized
frequency shifts during the j-th time slice. Then, we choose a pre-defined threshold
and obtain the contour CCSI,Wi(j), where j = 1...n. CCSI,Wi(j) is the maximum
value k which satisfies MCSI(j, k) ≥ threshold. The process of calculating the
contour CV,Wi for the voice spectrogram is similar to that for CCSI,Wi. Besides,
as mentioned in the word-level segmentation step, we can set CV,Wi(j) to 0 to
eliminate the interference of background noise if the j-th time slice is not within
the word segments.
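The contour extraction can be sketched as below; the input is assumed to be an already-resized and normalized matrix whose rows are time slices and whose columns are frequency bins from 0 to 30 Hz:

```python
def spectrogram_contour(M, threshold=0.5):
    """For each time slice (row) of the normalized spectrogram M, return
    the largest frequency-bin index whose magnitude reaches the threshold,
    or 0 when no bin qualifies (e.g., background noise only)."""
    contour = []
    for row in M:
        ks = [k for k, v in enumerate(row) if v >= threshold]
        contour.append(max(ks) if ks else 0)
    return contour
```

The same function serves both the CSI spectrogram and the voice spectrogram; only the zeroing of slices outside the detected word segments differs.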
After obtaining CCSI,Wi and CV,Wi for Wi, we measure the correlation between
these two contours by adopting the Pearson correlation coefficient [20], which is
defined as Corr(Wi). Corr(Wi) ranges from 0 to +1, where a higher value
represents a higher level of similarity. To calculate Corr(Wi), we first re-sample
CCSI,Wi and CV,Wi to the same length n, and Corr(Wi) can be represented as:

Corr(Wi) = [ Σ_{j=1}^{n} (CCSI,Wi(j) − C̄CSI,Wi)(CV,Wi(j) − C̄V,Wi) ] / [ (n − 1) δCSI δV ],  (4.6)

where n is the length of the re-sampled sequences, C̄CSI,Wi and C̄V,Wi are the sample means,
and δCSI and δV are the sample standard deviations of CCSI,Wi and CV,Wi, respectively.
After calculating the similarity Corr(Wi) for the i-th word Wi, for all Nw words
W = {W1, W2, ..., WNw} we can calculate the macro-level similarity SMacro as follows:

SMacro = ( Σ_{i=1}^{Nw} Corr(Wi) ) / Nw.  (4.7)
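Equations 4.6 and 4.7 amount to averaging per-word Pearson correlations. A pure-Python sketch, assuming the two contours of each word have already been re-sampled to the same length and are non-constant (a constant contour would make the standard deviation zero):

```python
from statistics import mean, stdev

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences (Eq. 4.6)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / ((len(x) - 1) * stdev(x) * stdev(y))

def macro_similarity(contour_pairs):
    """Eq. 4.7: average the per-word correlations between the CSI contour
    and the voice contour over all words of the command."""
    return mean(pearson(c_csi, c_voice) for c_csi, c_voice in contour_pairs)
```

In practice one would guard against constant contours (silent words) before dividing by the standard deviations.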
Fig. 4.9 Time domain of four mouth motion types. (a) Hiant: /a:/ and /la:/. (b) Grin: /e:/ and
/ge:/. (c) Round: /re:/ and /ra:/. (d) Pout: /u:/ and /gu:/
SMacro. For example, a dramatic change in the environment may generate drastic
vibrations in the CSI data, which lead to a deviated contour CCSI,Wi and a higher
similarity Corr(Wi) for the detected word Wi. Therefore, to further improve detection
performance, WSVA extracts mouth motion level features from both the time
and frequency domain perspectives in this subsection.
Time Domain Feature Extraction Figure 4.9 shows the amplitudes of CSI
syllable data belonging to the four mouth motion categories. It is observed that the
CSI waveforms belonging to the same mouth motion category have similar shapes.
For instance, in Fig. 4.9a, the waveforms of the syllables /a:/ and /la:/, which belong
to the motion "hiant", have similar waveform shapes and amplitude vibrations.
It is also observed that the ranges of CSI amplitudes from different mouth
motion categories are quite different. For instance, as shown in Fig. 4.9a, d, the CSI
amplitude ranges of the syllables /a:/ and /la:/ are much larger than those of the syllables
/u:/ and /gu:/. Thus we can extract the ranges from the CSI waveforms as their time
domain features. For a given CSI mouth motion M and its CSI data HM, the CSI
time domain feature Range(M) can be calculated as:
time domain feature Range(M) can be calculated as:
Ns
Max(HM,i ) − Min(HM,i )
Range(M) = , (4.8)
Ns × Mean(HM,i )
i=1
where Ns represents the number of CSI subcarriers and HM,i represents the i-th
subcarrier of HM. Note that the PCA-processed data HPCA is not utilized in this
scenario, since the PCA process would distort the signal's range.
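Eq. 4.8 is a per-subcarrier normalized peak-to-peak amplitude averaged over subcarriers, and translates directly to code:

```python
def csi_range_feature(H_M):
    """Eq. 4.8: H_M is a list of N_s subcarrier waveforms; for each
    subcarrier, take (max - min) normalized by its mean amplitude, and
    average across subcarriers."""
    n_s = len(H_M)
    return sum((max(h) - min(h)) / (n_s * (sum(h) / len(h))) for h in H_M)
```

A flat subcarrier contributes 0 to the feature, so a spoofed command played without any mouth motion yields a Range value near zero.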
Fig. 4.10 Frequency domain of four mouth motion types. (a) Hiant: /a:/ and /la:/. (b) Grin: /e:/
and /ge:/. (c) Round: /re:/ and /ra:/. (d) Pout: /u:/ and /gu:/
Before performing liveness detection, it is reasonable to assume that the user can
provide a total of J × N pre-collected CSI mouth motion samples HPre, which cover
J syllable categories (i.e., the four motion categories proposed in Sect. 4.2.3.3),
where each category contains N motions' CSI data HPre(i, j), j = 1, 2, ..., N.
Then, for a given voice command input containing NM mouth motions M =
{M1, M2, ..., MNM} belonging to the four motion categories, WSVA processes the
voice samples V and the CSI data H using the above-mentioned modules. After that,
WSVA obtains the macro-level similarity SMacro and the mouth motion features
Range(Mi) and FMi for each motion Mi.
Mouth Motion Feature Combination For a given motion M, WSVA first
calculates the time domain range difference SMRTime(i) between its feature
Range(HM) and the features of the pre-collected i-th mouth motion category:

SMRTime(i) = Σ_{j=1}^{N} | Range(HM) − Range(HPre(i, j)) | / N.  (4.9)
Since the corresponding motion category of M can be obtained from the voice
processing module described in Sect. 4.3.2.2, we can calculate the time domain
similarity score between M and its corresponding motion type as follows:

STime = α / ( SMRTime(type) + α ),  (4.10)

where type represents the motion type of M, which ranges from 1 to J. The
generated STime ranges from 0 to 1, and a value closer to 1 indicates a higher level
of similarity. Note that the function of the adjustment factor α is to prevent STime from
being zero, and we empirically set α to 0.1 in this study.
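A sketch of Eqs. 4.9 and 4.10 follows. Note the exact form of the score mapping is an assumption reconstructed from the stated properties (the score lies in (0, 1], and α = 0.1 keeps it strictly positive), not a formula quoted verbatim from the chapter:

```python
def time_domain_difference(range_m, pre_ranges):
    """Eq. 4.9: mean absolute difference between a motion's Range feature
    and the pre-collected Range features of one motion category."""
    return sum(abs(range_m - r) for r in pre_ranges) / len(pre_ranges)

def time_domain_score(range_m, pre_ranges_by_type, motion_type, alpha=0.1):
    """Assumed score form alpha / (diff + alpha): equals 1 for a perfect
    match and decays toward (but never reaches) 0 as the difference grows."""
    diff = time_domain_difference(range_m, pre_ranges_by_type[motion_type])
    return alpha / (diff + alpha)
```

The lookup by `motion_type` reflects the fact that the expected category is already known from the voice-side phoneme segmentation.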
Then, WSVA compares the similarity between the frequency domain feature
FM of M and the J × N features FPre(i, j) extracted from the pre-collected
motion data HPre. Different from the previous work [22], which utilizes
Dynamic Time Warping with O(N²) time complexity, in this chapter, to speed
up the computation, WSVA exploits a neural network-based solution to characterize
the similarity. WSVA utilizes the pre-collected CSI data HPre to train a feed-forward
neural network net with 20 neurons in the hidden layer. For a given
CSI sample HPre(i, j), the input to net is the frequency domain feature FPre(i, j)
and the training label is the mouth motion category number. In this study, we set the
category numbers of the motions "hiant", "grin", "round", and "pout" to 1, 2, 3, and
4, respectively. After training, in the ideal case, net maps a specific motion
feature F_M to its corresponding motion type. The similarity between M and the i-th
mouth motion category can be calculated as:

    S_MRFreq(i) = |net(F_M) − label(i)|,    (4.11)

where label(i) is the label of the i-th motion category, and net(F_M) is the prediction
of net.
Similar to Eq. 4.10, WSVA calculates the similarity score between F_M and its
corresponding motion category type as:

    S_Freq = α / (α + S_MRFreq(type)),    (4.12)

where the adjustment factor α is again set to 0.1. An S_Freq closer to 1 indicates a
higher level of similarity.
After obtaining the time domain similarity score ST ime and the frequency domain
similarity score SF req of a given mouth motion M, we can calculate the combination
mouth motion level similarity score S_Motion as:

    S_Motion = S_Time × S_Freq.    (4.13)

Finally, WSVA fuses the macro-level similarity with the similarity of each of the N_M mouth motions to obtain the overall matching score:

    Score = S_Macro × Π_{i=1}^{N_M} S_Motion(M_i).    (4.14)
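Since the exact fusion equations did not survive extraction cleanly, the sketch below assumes a simple multiplicative fusion: the per-motion score is the product of the time and frequency scores, and the final score multiplies S_Macro by the product over all motions. All names and values are illustrative.

```python
import numpy as np

def motion_similarity(s_time, s_freq):
    """Assumed form of Eq. 4.13: combine the time and frequency
    domain similarity scores of one mouth motion."""
    return s_time * s_freq

def overall_score(s_macro, per_motion_scores):
    """Assumed product form of Eq. 4.14: fuse the macro-level
    similarity with every mouth motion's combined similarity."""
    return s_macro * float(np.prod(per_motion_scores))

# Two mouth motions with (s_time, s_freq) pairs, plus a macro score.
s_motion = [motion_similarity(0.9, 0.95), motion_similarity(0.8, 0.9)]
score = overall_score(0.85, s_motion)
```

Because every factor lies in (0, 1], a single poorly matching motion pulls the overall score down, which is the desired behavior for spoofing detection.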
Fig. 4.11 Performance on thwarting spoofing attacks when using combination feature, macro-
level feature, and mouth motion feature
the TAR relying on the mouth motion feature is better than that relying on the macro-level feature. More concretely, with 1% FAR, the TAR relying on the mouth motion
feature remains above 99%, while the TAR relying on the macro-level feature is
reduced to 90.2%. The reason is that the macro-level features are more susceptible
to environmental noise. After collecting voice and CSI data, the average time delay
of performing each liveness detection is 0.26 s, which is acceptable in practice and
is smaller than that in previous work (i.e., 0.32 s in [22] with the same hardware
condition). In summary, our experimental results well validate that WSVA is highly
effective in defending against spoofing attacks, while the macro-level feature and
the mouth motion feature can complement each other to improve the detection
performance.
In the evaluations described in Fig. 4.11, for each user, WSVA performs liveness
detection based on his/her pre-collected CSI profiles. However, in some smart home
environments with multiple users, it is often impractical to collect each user's mouth
motion profiles. A more desirable design is to collect only one user's profiles but
work for multiple users. In this section, we perform experiments to evaluate the
scalability of WSVA. We first recruit a volunteer to provide WSVA with his/her
mouth motion profiles which record his/her articulatory gesture. Then we recruit
another two volunteers to perform voice commands 300 times. After that, we also
implement spoofing attacks 600 times. Figure 4.12 shows the evaluation result of
WSVA, where WSVA achieves 97.6% TAR with 1% FAR, and 97.9% TAR with
2% FAR. Note that, the mouth motion feature-based detection rate (i.e., 89.6% TAR
with 2% FAR in Fig. 4.12) is less than that in Sect. 4.4.2. The reason is that the
Fig. 4.12 Detection performance in the multiple-user scenario (true accept rate versus false accept rate for the combined, mouth motion, and macro-level features)
articulatory gesture of another volunteer is not the same as the user who provides
the pre-collected CSI profiles. However, compared with spoofing attacks, WSVA
can still achieve a high detection accuracy, which demonstrates that it is also highly
effective in multiple-user scenarios.
Fig. 4.13 Detection accuracy under different numbers of mouth motions for the word-based, syllable-based, and combined features
In the above evaluations, the volunteer is located at line-of-sight (LOS) positions relative to the antennas, and the distance between the user's mouth and the receiver antenna is set
to 20 cm. To evaluate the impact of distance on detecting voice spoofing, a volunteer
is recruited to conduct experiments at distances of 50 cm, 100 cm, and 150 cm,
respectively. For each distance, the volunteer is required to provide CSI profiles
and generate 150 voice commands, and then the loudspeaker is deployed at the
volunteer’s location to perform spoofing attacks 300 times. As shown in Fig. 4.14,
it is observed that the detection accuracy decreases when the distance becomes
greater. By using the combined feature, WSVA achieves over 99% TAR with 2%
FAR when the distance is 50 cm. However, the TAR is decreased to 98% and 96%
when the distance is 100 cm and 150 cm, respectively. Besides, the TAR under 2%
FAR decreases dramatically when only utilizing macro-level features (from 94%
at 50 cm to 80% at 150 cm) or mouth motion features (from 97% at 50 cm to
88% at 150 cm) individually. This means that as the distance increases, the impact of
mouth motion on the multiple-path propagation of CSI becomes weaker, degrading
WSVA's performance. However, when the distance is set to 1.5 m,
WSVA could still achieve satisfactory accuracy (96%) using the combined feature,
which is acceptable in most cases.
To evaluate the performance of WSVA in the non-LOS scenarios, two additional
experiments are conducted. As shown in Fig. 4.15a, the volunteer is required to stay
out of the line-of-sight area. To further stress this scenario, we consider
a more extreme case in which an obstruction board is inserted to separate the
transmitter antennas from the receiver while the volunteer stays on the same side of
the transmitter. The dataset obtained with a distance of 50 cm as shown in Fig. 4.14
is chosen as the control group. The experimental results are shown in Fig. 4.15b.
When WSVA utilizes a combined feature, with 2% FAR, the TARs of WSVA under
wood obstruction and control group are still over 99%. However, the TAR under
iron obstruction is decreased to 92.7%. More specifically, when only exploiting
Fig. 4.14 True accept rate versus false accept rate at distances of 50 cm, 100 cm, and 150 cm for the combined, macro-level, and mouth motion features
Fig. 4.15 Evaluations of obstructions. (a) The non-LOS scenarios. (b) Detection results under
different obstructions
mouth motion feature, the TARs under wood and iron obstructions are decreased to
56% and 36%, respectively. The results demonstrate that an obstruction in the LOS path
degrades the wireless sensing capability, especially when WSVA relies only
on the mouth motion feature. It is notable that WSVA's performance under iron obstruction
is much weaker than that under wood, since iron material in the LOS path causes
greater multiple-path distortions. However, WSVA remains effective under
wood obstruction, and it is feasible for users to keep the transceivers at LOS positions in
their own smart homes.
4.5 Limitations and Discussions 101
Fig. 4.16 True accept rate versus false accept rate when performing detection in real time, after 12 h, and after 24 h
In ideal cases, the collected CSI patterns should relate only to mouth motion and
should not change with time. However, as reported by previous research [2, 21], CSI
patterns do change over time in real-world scenarios. To evaluate the timeliness
of CSI profiles, we first recruit a volunteer to provide mouth motion profiles. Then,
the volunteer and adversary are required to perform 150 voice commands and 150
spoofing attacks with a time step of 12 h. Figure 4.16 shows the performance of
WSVA in real time, after 12 h, and after 24 h. It is observed that after 12 h, WSVA achieves
above 99% TAR with 1% FAR, which is similar to real-time performance. After
24 h, WSVA can still achieve 90.6% TAR with 1% FAR by utilizing the combined
feature. Note that after 24 h, the performance of the mouth motion-based feature
is decreased to 75.8% TAR with 2% FAR. The performance degradation may
be caused by the emotional changes of the user or the background environment
changes. This is an inherent drawback of CSI-based sensing, but it does not hinder
the deployment of WSVA essentially. Adaptively updating the user’s profile can
effectively avoid the effects of environmental changes [21]. The updating can
be processed during the user’s daily usage of voice commands, and the cost is
affordable for the user since we only utilize 40 mouth motion samples for training.
Antenna and User Positions In this study, the distance between the user and the
antennas of WSVA affects its performance. When the distance is too
long (depending on the hardware condition), the collected CSI cannot reflect the
mouth motion components, resulting in inaccurate judgments by WSVA. However, in
smart home environments, many voice control applications leverage voice
commands to control home appliances (e.g., lighting and temperature), which also
imposes specific requirements on environmental factors (including the distance). For
example, according to CNET's report about Amazon Echo, more than
one Echo device is needed for full coverage of a large home [9]. In practice, we can deploy
multiple antennas in a smart home to make WSVA applicable over a larger area or
distance. When the user interacts with the VCS, WSVA could dynamically choose the
antennas closest to the user to collect CSI data.
Pronunciation Behaviours Currently, WSVA can only work in the situation where
all users speak the voice commands in English strictly according to the International
Phonetic Alphabet. However, in reality, for the same phoneme, different users may
use different articulatory gestures [4, 23]. In the experiment, two volunteers are
required to pronounce the phoneme /a:/ with the standard articulatory gesture
(gesture 1 of hiant) and nonstandard gestures (gestures 2 and 3). As shown in Fig. 4.17,
although different articulatory gestures result in quite different CSI
waveforms, it is also observed that when users utilize the same articulatory gesture
(e.g., gesture 1 used by user 1 and user 2), the collected CSI still shows similar
patterns. In a family scenario with a limited number of users (generally 2–4 users), it is
feasible for these users to agree on a common articulatory gesture. The detection
accuracy is also improved by the utilization of macro-level feature. Therefore,
WSVA still has high practicality in multiple-user scenarios.
Defending Against the Insider Attack As described in Sect. 4.2.1, the adversary
can launch a more serious attack (i.e., insider attack), which is not considered
in this study. In an insider attack scenario, the adversary can approach the VCS
Fig. 4.17 CSI waveforms under different articulatory gestures. (a) Phoneme /a:/ generated by
user 1. (b) Phoneme /a:/ generated by user 2
Fig. 4.18 Voice samples and CSI data when a pre-defined additional mouth motion is performed after the voice command
physically and mimic the mouth movements of a benign user. This preserves
the consistency between the variations of the CSI data and the voice samples and thus
degrades the performance of WSVA. To reduce this risk, a potential countermeasure is for the
user to make special adjustments to WSVA. For example, the user is required
to perform some pre-defined and secret additional mouth motions after each voice
command. As shown in Fig. 4.18, WSVA can distinguish between benign users and
insider attackers by detecting the existence of these additional motions.
Comparison Between WSVA and Lip Reading Previous research has proposed
some CSI-based lip-reading methods such as WiHear [33] and WiTalk [13].
These methods attempt to infer the contents of voice samples only through the
CSI information. However, in this study, we do not propose using wireless signals
for lip reading. Instead, WSVA aims to utilize the consistency between voice and
CSI signals to authenticate the voice commands and defend against voice spoofing
attacks targeted at the voice control system.
In addition, due to the limitation of Wi-Fi signals (e.g., only 12.5 cm wavelength
for 2.4 GHz), achieving high accuracy detection in lip reading is inherently difficult.
For instance, WiHear and WiTalk can only recognize 14 and 12 different syllables,
respectively, which means that many voice syllables cannot be identified by CSI.
Furthermore, [24] shows that, in theory, not all voice syllables can be recognized by lip-reading
techniques. For instance, SilentTalk [32] shows that ultrasonic-based
lip reading can only identify 12 basic mouth motions, even though ultrasound (e.g., 17 mm
wavelength at 20 kHz) has a stronger sensing capability than CSI. However, in the
application scenarios of WSVA, the contents of the voice samples are easy to obtain.
So instead of recognizing syllables, the technical contribution of WSVA is modeling
the consistency between the voice samples and the CSI information to determine
whether a voice command is issued by an authentic user.
4.6 Summary
References
1. Aley-Raz, A., Krause, N.M., Salmon, M.I., Gazit, R.Y.: Device, system, and method of liveness
detection utilizing voice biometrics. Google Patents, US Patent 8,442,824, 14 May 2013
2. Ali, K., Liu, A.X., Wang, W., Shahzad, M.: Keystroke recognition using WiFi signals.
In: Proceedings of the 21st Annual International Conference on Mobile Computing and
Networking, MobiCom ’15, pp. 90–102. ACM, New York (2015). https://doi.org/10.1145/
2789168.2790109, http://doi.acm.org/10.1145/2789168.2790109
3. Amazon: Amazon alexa developer (2019). https://developer.amazon.com/alexa
4. Browman, C.P., Goldstein, L.: Articulatory phonology: an overview. Phonetica 49(3–4), 155–
180 (1992)
5. Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., Zhou,
W.: Hidden voice commands. In: 25th USENIX Security Symposium (USENIX Security
16), pp. 513–530. USENIX Association, Austin (2016). https://www.usenix.org/conference/
usenixsecurity16/technical-sessions/presentation/carlini
6. Chen, S., Ren, K., Piao, S., Wang, C., Wang, Q., Weng, J., Su, L., Mohaisen, A.: You can hear
but you cannot steal: defending against voice impersonation attacks on smartphones. In: 2017
IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 183–195
(2017). https://doi.org/10.1109/ICDCS.2017.133
7. Chen, Y., Sun, J., Jin, X., Li, T., Zhang, R., Zhang, Y.: Your face your heart: secure mobile face
authentication with photoplethysmograms. In: Proceedings of IEEE Conference on Computer
Communications (INFOCOM), pp. 1–9 (2017)
8. Chen, G., Chen, S., Fan, L., Du, X., Zhao, Z., Song, F., Liu, Y.: Who is real bob?
Adversarial attacks on speaker recognition systems. In: 2021 IEEE Symposium on Security and
Privacy (SP), pp. 55–72. IEEE Computer Society, Washington (2021). https://doi.org/10.1109/
SP40001.2021.00004, https://doi.ieeecomputersociety.org/10.1109/SP40001.2021.00004
9. CNET: How to bring alexa into every room of your home (2017). https://www.cnet.com/how-
to/how-to-install-alexa-in-every-room-of-your-home/
10. Diao, W., Liu, X., Zhou, Z., Zhang, K.: Your voice assistant is mine: how to abuse speakers
to steal information and control your phone. In: Proceedings of the 4th ACM Workshop on
Security and Privacy in Smartphones & Mobile Devices (SPSM), pp. 63–74 (2014). https://
doi.org/10.1145/2666620.2666623
11. Ding, W., Hu, H.: On the safety of iot device physical interaction control. In: Proceedings of
the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 832–846
(2018). https://doi.org/10.1145/3243734.3243865
12. Dodd, B., Campbell, R.: Hearing by eye: the psychology of lip-reading. Am. J. Psychol. 72(6)
(1987)
13. Du, C., Yuan, X., Lou, W., Hou, Y.T.: Context-free fine-grained motion sensing using WiFi.
In: 2018 15th Annual IEEE International Conference on Sensing, Communication, and
Networking (SECON), pp. 1–9 (2018)
14. Ettus research (2017). https://www.ettus.com/
15. Feng, H., Fawaz, K., Shin, K.G.: Continuous authentication for voice assistants. In: Proceed-
ings of the 23rd Annual International Conference on Mobile Computing and Networking
(MobiCom), pp. 343–355 (2017). https://doi.org/10.1145/3117811.3117823
16. Market Research Future: Voice Assistant Market - Information by Technology, Hardware and
Application - Forecast till 2025 (2020). https://www.marketresearchfuture.com/reports/voice-
assistant-market-4003
17. Google: Google home. https://store.google.com/product/google_home (2019)
18. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh,
S., Sengupta, S., Coates, A., Ng, A.Y.: Deep Speech: Scaling up end-to-end speech recognition
(2014). https://doi.org/10.48550/ARXIV.1412.5567
19. Kisler, T., Schiel, F., Sloetjes, H.: Signal processing via web services: the use case webmaus.
In: Proceedings of Digital Humanities, pp. 30–34 (2012)
20. Lin, L.I.K.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1),
255–268 (1989)
21. Liu, J., Liu, H., Chen, Y., Wang, Y., Wang, C.: Wireless sensing for human activity: a survey.
IEEE Commun. Surv. Tutorials 1–1 (2019). https://doi.org/10.1109/COMST.2019.2934489
22. Meng, Y., Wang, Z., Zhang, W., Wu, P., Zhu, H., Liang, X., Liu, Y.: WiVo: enhancing the
security of voice control system via wireless signal in IoT environment. In: Proceedings of the
Eighteenth ACM International Symposium on Mobile Ad Hoc Networking and Computing,
Mobihoc ’18, pp. 81–90. ACM, New York (2018). https://doi.org/10.1145/3209582.3209591,
http://doi.acm.org/10.1145/3209582.3209591
23. Mugler, E.M., Tate, M.C., Livescu, K., Templer, J.W., Goldrick, M.A., Slutzky, M.W.:
Differential representation of articulatory gestures and phonemes in precentral and inferior
frontal gyri. J. Neurosci. 38(46), 9803–9813 (2018)
24. Ostry, D., Flanagan, J.: Human jaw movement in mastication and speech. Arch. Oral Biol.
34(9), 685–693 (1989)
25. Places of articulation (2017). https://en.wikipedia.org/wiki/File:Places_of_articulation.svg
26. Qian, K., Wu, C., Yang, Z., Liu, Y., Jamieson, K.: Widar: decimeter-level passive tracking via
velocity monitoring with commodity Wi-Fi. In: Proceedings of the 18th ACM International
Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 6:1–6:10 (2017).
https://doi.org/10.1145/3084041.3084067
27. Roy, N., Hassanieh, H., Roy Choudhury, R.: Backdoor: making microphones hear inaudible
sounds. In: Proceedings of the 15th ACM Annual International Conference on Mobile Systems,
Applications, and Services (MobiSys), pp. 2–14 (2017). https://doi.org/10.1145/3081333.
3081366
28. Samsung: Smartthings (2021). https://www.smartthings.com
29. Shi, C., Liu, J., Liu, H., Chen, Y.: Smart user authentication through actuation of daily activities
leveraging WiFi-enabled IoT. In: Proceedings of the 18th ACM International Symposium on
Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 5:1–5:10 (2017). https://doi.org/
10.1145/3084041.3084061
30. SmartThings: Smartthings public GitHub Repo (2018). https://github.com/
SmartThingsCommunity/SmartThingsPublic
31. Tan, S., Yang, J.: Wifinger: leveraging commodity WiFi for fine-grained finger gesture
recognition. In: Proceedings of the 17th ACM International Symposium on Mobile Ad
Hoc Networking and Computing (MobiHoc), pp. 201–210 (2016). https://doi.org/10.1145/
2942358.2942393
32. Tan, J., Nguyen, C., Wang, X.: Silenttalk: Lip reading through ultrasonic sensing on mobile
phones. In: IEEE INFOCOM 2017 – IEEE Conference on Computer Communications, pp. 1–
9 (2017). https://doi.org/10.1109/INFOCOM.2017.8057099
33. Wang, G., Zou, Y., Zhou, Z., Wu, K., Ni, L.M.: We can hear you with Wi-Fi! In: Proceedings of
the 20th Annual International Conference on Mobile Computing and Networking (MobiCom),
pp. 593–604 (2014). https://doi.org/10.1145/2639108.2639112
34. Wang, J., Jiang, H., Xiong, J., Jamieson, K., Chen, X., Fang, D., Xie, B.: LIFS: low human-
effort, device-free localization with fine-grained subcarrier information. In: Proceedings of the
22Nd Annual International Conference on Mobile Computing and Networking (MobiCom),
pp. 243–256 (2016). https://doi.org/10.1145/2973750.2973776
35. Wang, S., Cao, J., He, X., Sun, K., Li, Q.: When the differences in frequency domain are
compensated: understanding and defeating modulated replay attacks on automatic speech
recognition. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and
Communications Security, CCS ’20, pp. 1103–1119. Association for Computing Machinery,
New York (2020). https://doi.org/10.1145/3372297.3417254
36. Wei, T., Wang, S., Zhou, A., Zhang, X.: Acoustic eavesdropping through wireless vibrometry.
In: Proceedings of the 21st Annual International Conference on Mobile Computing and
Networking (MobiCom), pp. 130–141 (2015). https://doi.org/10.1145/2789168.2790119
37. Wikipedia: International phonetic alphabet (2018). https://en.wikipedia.org/wiki/
International_Phonetic_Alphabet
38. Yuan, X., Chen, Y., Zhao, Y., Long, Y., Liu, X., Chen, K., Zhang, S., Huang, H., Wang,
X., Gunter, C.A.: Commandersong: a systematic approach for practical adversarial voice
recognition. In: 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64.
USENIX Association, Baltimore (2018). https://www.usenix.org/conference/usenixsecurity18/
presentation/yuan-xuejing
39. Zhang, L., Tan, S., Yang, J., Chen, Y.: Voicelive: a phoneme localization based liveness
detection for voice authentication on smartphones. In: Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security (CCS), pp. 1080–1091 (2016). https://
doi.org/10.1145/2976749.2978296
40. Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., Xu, W.: Dolphinattack: inaudible voice
commands. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Com-
munications Security (CCS), pp. 103–117 (2017). https://doi.org/10.1145/3133956.3134052
41. Zhang, L., Tan, S., Yang, J.: Hearing your voice is not enough: an articulatory gesture
based liveness detection for voice authentication. In: Proceedings of the 2017 ACM SIGSAC
Conference on Computer and Communications Security (CCS), pp. 57–71 (2017). https://doi.
org/10.1145/3133956.3133962
42. Zhang, L., Meng, Y., Yu, J., Xiang, C., Falk, B., Zhu, H.: Voiceprint mimicry attack towards
speaker verification system in smart home. In: IEEE INFOCOM 2020 – IEEE Conference on
Computer Communications, pp. 377–386 (2020). https://doi.org/10.1109/INFOCOM41043.
2020.9155483
Chapter 5
Microphone Array Based Passive
Liveness Detection at Voice Interface
Layer
5.1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 107
Y. Meng et al., Security in Smart Home Networks, Wireless Networks,
https://doi.org/10.1007/978-3-031-24185-7_5
In this chapter, we consider the same attack model as described in Sect. 4.2.1. In this
section, we introduce the voice commands generation and propagation processes in
the smart speaker scenario, review existing passive liveness detection schemes and
introduce our proposed array fingerprint.
Fig. 5.1 Sound generation and propagation in smart home. (a) Sound generation. (b) Sound
propagation process
gain in the sound signal modulation by the device, as shown in Fig. 5.1b. Similarly,
when a user speaks voice commands, the mouth and lips also modulate the air, and
we can use h_user(f, t) to represent the modulation gain, where the generated sound
is s(f, t) = h_user(f, t) · x(f, t). Note that, in real-world scenarios, there is no
such x(f, t) during the human voice generation process. However, the concepts of
x(f, t) and h_user(f, t) are widely used [4] and will help us understand the features
in Sect. 5.3.3. Finally, the generated sound s(f, t) spreads through the air and is
captured by the smart speaker.
Sound Transmission Process Currently, smart speakers usually have a micro-
phone array (e.g., Amazon Echo 3rd Gen [12] and Google Home Max [17]
both have 6 microphones). For a given microphone, when sound is transmitted to
it, the air pressure at the microphone’s location can be represented as y(f, t) =
hair (d, f, t) · s(f, t), where d is the distance of the transmission path between the
audio source and the microphone and hair (d, f, t) is the channel gain in the air
propagation of the sound signal.
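The multiplicative channel model above (excitation x, vocal-tract gain h_user, air-channel gain h_air) can be sketched with synthetic spectrogram-sized arrays. All shapes and values below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
freq_bins, time_frames = 64, 10

# Airflow excitation x(f, t) and the two modulation gains.
x = rng.random((freq_bins, time_frames)) + 0.1
h_user = rng.random((freq_bins, 1)) + 0.5   # vocal-tract gain, per frequency bin
h_air = 0.8                                  # air-channel gain, constant here

s = h_user * x   # sound leaving the mouth: s(f, t) = h_user(f, t) * x(f, t)
y = h_air * s    # pressure at one microphone: y(f, t) = h_air(d, f, t) * s(f, t)
```

The key point of the model is that y factors into the excitation times the two gains, so the identity-bearing term h_user survives in the received spectrogram.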
Sound Processing Within the Smart Speaker Finally, y(f, t) is converted to
an electrical signal by the microphone. Since the microphones employed by
mainstream smart speakers usually have a flat frequency response in the
frequency range of the human voice, we assume smart speakers store the original sensed
data y(f, t), an assumption also adopted by existing studies [19]. Finally, the collected
audio signal is uploaded to the smart home cloud to further influence the actions of
smart devices.
The recently proposed liveness detection schemes could be divided into two
categories: mono channel based detection (e.g., Sub-bass [4] and VOID [1]) and
fieldprint based detection (i.e., CaField [19]).
Principles As shown in Fig. 5.1a, the different sound generation principles between
real human and electrical spoofing devices could be characterized as two different
filters: huser (f, t) and hdev (f, t). If ignoring the distortion in the sound signal trans-
mission, hair (d, f, t) could be considered as a constant value A. Thus, the received
audio samples in authentic and spoofing attack scenarios are yauth (d, f, t) =
A·huser (f, t)·x(f, t) and yspoof (d, f, t) = A·hdev (f, t)·x(f, t), respectively. Since
A and x(f, t) are the same, it means that the spectrograms of the received audio
samples already contain the identity of the audio source (the real user huser (f, t) or
the spoofing one hdev (f, t)). Figure 5.2a shows the spectrums of the voice command
“OK Google” and its spoofing counterpart. It’s observed that the sub-bass spectrum
Fig. 5.2 Spectrums of authentic and spoofing voices when putting the smart speaker in different
rooms. (a) Spectrums on room A. (b) Spectrums on room B
(20–300 Hz) of the two audio samples is quite different even though the samples are
deemed similar, and this phenomenon is utilized by mono channel-based schemes
such as Sub-bass [4].
Limitations However, in a real-world environment, hair (d, f, t) cannot always be
regarded as a constant. The surrounding object’s shape and materials, the sound
transmission path, and the absorption coefficient of air all affect the value of
hair (d, f, t). As shown in Fig. 5.2a, b, the spectrograms of authentic and spoof
audio samples change drastically when putting the smart speaker in different rooms.
The experimental results from Sect. 5.4.2 and [1] demonstrate that the performance
of liveness detection degrades when handling datasets in which
audio samples are collected in complicated environments (e.g., ASVSpoofing 2017
Challenge [11], ReMasc Core [10]).
Principles The concept of fieldprint [19] is based on the assumption that audio
sources with different articulatory behaviors will cause a unique “sound field”
around them. By measuring the field characteristics around the audio source, it is
feasible to infer the audio source's identity. CaField is a typical scheme: it deploys two
microphones to receive two audio signals y1(f, t) and y2(f, t), and defines the fieldprint
as:

    Field = log( y1(f, t) / y2(f, t) ).    (5.1)
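Given magnitude spectrograms from two microphones, the fieldprint of Eq. 5.1 is an element-wise log-ratio. A minimal sketch with synthetic data follows; the small offset is only there to avoid taking the log of zero.

```python
import numpy as np

rng = np.random.default_rng(1)
y1 = rng.random((128, 20)) + 1e-3   # |spectrogram| of microphone 1
y2 = rng.random((128, 20)) + 1e-3   # |spectrogram| of microphone 2

# Eq. 5.1: Field = log(y1 / y2), computed per time-frequency bin (f, t).
fieldprint = np.log(y1 / y2)
```

Because the ratio cancels the common excitation term, the fieldprint captures how the two propagation paths differ around the source, which is exactly what makes it position-sensitive.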
Limitations Measuring a stable and accurate fieldprint requires that the relative position
between the audio source and the measuring sensors stay fixed. For
instance, CaField only performs well when the user holds a smartphone equipped
with two microphones close to the face in a fixed manner. The fieldprint struggles
in far distances (e.g., greater than 40 cm in [19]), making it unsuitable for a home
environment in which users want to communicate with a speaker across the room.
The goal of this study is to propose a novel and robust feature for passive liveness
detection.
In this subsection, we propose a novel and robust liveness feature, the array fingerprint,
and design ArrayID based on it.
Figure 5.3 illustrates the scenario when audio signals are transmitted from the source
to the microphone array. The audio source is regarded as a point with coordinate
(L, 0), and the microphones are evenly distributed on a circle. Given the k-th
microphone Mk , the collected audio data is yk (f, t) = hair (dk , f, t) · s(f, t), where
dk is the path distance from the audio source to Mk .
Inspired by the circular layout of microphones in smart speakers as shown in
Fig. 5.3, where the k-th microphone is located at M_k = (r · cos(2π(k − 1)/N), r · sin(2π(k − 1)/N)) on a circle of radius r, we define the array fingerprint AF as the aggregation over all N channels:

    AF(f, t) = (1/N) · Σ_{k=1}^{N} y_k(f, t) = s(f, t) · (1/N) · Σ_{k=1}^{N} h_air(d_k, f, t).
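Assuming the channel-average form of AF, the fingerprint reduces to the source spectrogram s(f, t) scaled by the mean air-channel gain, which illustrates why it varies little with the exact source position for a symmetric circular array. The sketch below uses entirely synthetic gains.

```python
import numpy as np

rng = np.random.default_rng(2)
n_mics, freq_bins, frames = 6, 128, 20

s = rng.random((freq_bins, frames)) + 0.1   # source spectrogram s(f, t)
# Per-microphone air-channel gains h_air(d_k, f, t): they differ per mic
# because each path length d_k differs, but their average varies little
# when the mics sit symmetrically on a circle.
h_air = 0.5 + 0.1 * rng.random((n_mics, freq_bins, frames))

y = h_air * s                    # y_k(f, t) = h_air(d_k, f, t) * s(f, t)
af = y.sum(axis=0) / n_mics      # assumed array fingerprint: channel average
```

Averaging over channels smooths out per-path fluctuations while preserving the source-dependent structure of s(f, t), consistent with the stability observed in Fig. 5.4c.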
Fig. 5.4 Illustration of stability of array fingerprint under two locations. (a) Two original authentic
audios. (b) Dynamic power differences in different microphone pairs. (c) Stable array fingerprints
Fig. 5.5 Differentiating human authentic voice from two spoofing devices via array fingerprints
under different propagation paths
quite different.1 Among different distances, the fieldprints are also quite different.
However, from Fig. 5.4c we can see that the array fingerprints for different distances
are very similar.2
To show the distinctiveness of the array fingerprint, we also conducted replay attacks
via a smartphone and an iPad (i.e., devices #8 and #3 listed in Table 5.2, Sect. 5.4.1).
The normalized array fingerprints (i.e., FSAP in Sect. 5.3.3.1) are shown in Fig. 5.5.
It is observed that the array fingerprints for the same audio sources are quite
similar, while array fingerprints for different audio sources are quite different. Our
1 The real process of extracting fieldprint is more complicated. Figure 5.4b shows the basic
principle following the descriptions in Eq. 5.1.
2 This array fingerprint is refined after extracting from Eq. 5.4. The detailed calculation steps are
experimental results demonstrate that the array fingerprint can serve as a better passive
liveness detection feature. This motivates us to design a novel, lightweight, and
robust system, which is presented in the next section.
As shown in Fig. 5.6, ArrayID consists of the following modules: Data Collection
Module, Pre-processing Module, Feature Extraction Module, and Attack Detection
Module. We will elaborate on the details of each module in this section.
Currently, most popular smart speakers, such as Amazon Echo and Google Home,
employ a built-in microphone array to collect voice audio. In this chapter, we utilize
open modular development boards with voice interfaces (i.e., the Matrix Creator
[13] and Seeed Respeaker [16]) to collect the data. Since these development boards
have similar sizes to commercial smart speakers, ArrayID evaluations on the above
devices can be applied to a smart speaker without any notable alterations. Generally
speaking, given a smart speaker with N microphones, a sampling rate of Fs , and
data collection time T , the collected voice sample is denoted as VM×N , where M =
Fs × T and we let Vi be the i-th channel’s audio V (:, i). Then, the collected V is
sent to the next module.
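The data layout above can be sketched as follows; the interleaved-capture helper and the parameter values in the example are illustrative assumptions, not ArrayID's actual capture code.

```python
import numpy as np

def collect_voice_sample(raw, n_mics, fs, duration):
    """Arrange raw interleaved samples into V with M = fs * duration rows
    and one column per microphone, so V[:, i] is the i-th channel's audio."""
    m = fs * duration
    return np.asarray(raw, dtype=float).reshape(m, n_mics)

# Example: N = 6 microphones, Fs = 16 kHz, T = 1 s (illustrative values)
fs, T, N = 16000, 1, 6
raw = np.zeros(fs * T * N)          # stand-in for a real capture buffer
V = collect_voice_sample(raw, N, fs, T)
print(V.shape)                      # (16000, 6)
```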
Fig. 5.6 Workflow of ArrayID: after pre-processing, spectrogram array features, spectrogram
distribution features, and channel LPCC features are extracted from the word-based training
dataset and the real-time command, and the classification model labels the audio as legitimate
or adversarial
As shown in Eq. 5.2, the identity (i.e., real human or spoofing device) is hidden in the
audio’s spectrogram. Therefore, before feature extraction, we conduct the frequency
analysis on each channel’s signal and detect the audio’s direction.
Frequency Analysis on Multi-channel Audio Data As described in Sect. 5.2.3,
the audio spectrogram in the time-frequency domain contains crucial features for
further liveness detection. ArrayID performs Short-Time Fourier Transform (STFT)
to obtain two-dimensional spectrograms of each channel’s audio signal. For the i-th
channel’s audio Vi , which contains M samples, ArrayID applies a Hanning window
to divide the signals into small chunks with lengths of 1024 points and overlapping
sizes of 728 points. Finally, a 4096-sized Fast Fourier Transform (FFT) is performed
for each chunk, and a spectrogram Si is obtained as shown in Fig. 5.7a.
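The STFT step can be sketched in plain NumPy with the stated parameters (1024-point Hanning window, 728-sample overlap, 4096-point FFT); the function name and the 16 kHz example signal are assumptions for illustration.

```python
import numpy as np

def channel_spectrogram(v, win_len=1024, overlap=728, n_fft=4096):
    """Hanning-windowed chunks of win_len samples with the given overlap,
    each transformed by an n_fft-point real FFT -> magnitude spectrogram."""
    hop = win_len - overlap                      # 296-sample hop
    window = np.hanning(win_len)
    n_frames = 1 + (len(v) - win_len) // hop
    frames = np.stack([v[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    # Real FFT of each chunk; transpose to (frequency bins x time chunks)
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)).T

v = np.random.randn(16000)                       # one channel, 1 s at 16 kHz
S = channel_spectrogram(v)
print(S.shape)                                   # (2049, n_frames)
```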
Direction Detection Given a collected audio VM×N , to determine the microphone
which is closest to the audio source, ArrayID firstly applies a high pass filter with
a cutoff frequency of 100 Hz to VM×N . Then, for the i-th microphone Mi , ArrayID
calculates the alignment errors Ei = mean((V (:, i − 1) − V (:, i))2 ) [15]. Finally,
from the calculated E, ArrayID chooses the microphone with minimum alignment
error as the corresponding microphone.
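A sketch of this direction-detection step is given below; the FFT-domain high-pass filter is a simple stand-in for whatever filter ArrayID actually uses, and the circular pairing of channel i with channel i−1 follows the alignment-error formula above.

```python
import numpy as np

def closest_microphone(V, fs=16000, cutoff=100):
    """Pick the microphone with minimum alignment error
    E_i = mean((V[:, i-1] - V[:, i])**2) after a 100 Hz high-pass."""
    spectrum = np.fft.rfft(V, axis=0)
    freqs = np.fft.rfftfreq(V.shape[0], d=1.0 / fs)
    spectrum[freqs < cutoff, :] = 0              # crude high-pass stand-in
    Vf = np.fft.irfft(spectrum, n=V.shape[0], axis=0)
    E = [np.mean((Vf[:, i - 1] - Vf[:, i]) ** 2) for i in range(Vf.shape[1])]
    return int(np.argmin(E))

# Toy check: make channel 3 almost identical to its neighbor, channel 2
rng = np.random.default_rng(0)
V = rng.standard_normal((16000, 6))
V[:, 3] = V[:, 2] + 1e-3 * rng.standard_normal(16000)
print(closest_microphone(V))                     # 3
```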
From Eq. 5.2, we observe that both audio spectrograms themselves and the micro-
phone array’s difference contain the liveness features of collected audio. In this
module, the following three features are selected by ArrayID: Spectrogram Array
Fingerprint FSAP , Spectrogram Distribution Fingerprint FSDP , and Channel LPCC
Features FLP C .
Fig. 5.7 Grid processing on the multiple-channel audios. (a) Original spectrograms of different
channels. (b) Spectrogram grids of different channels
Figure 5.7a illustrates the spectrograms Spec_k of three channels of the command “OK Google”. It
is observed that different channels' spectrograms are slightly different. However,
directly using such subtle differences would yield an inaccurate feature. Thus,
ArrayID transforms Spec_k into a grid matrix G_k of size M_G × N_G by dividing
Spec_k into M_G × N_G chunks and calculating the sum of magnitudes within each
chunk. The elements of G_k can be represented as:
G_k(i, j) = sum(Spec_k(1 + (i − 1) · S_M : i · S_M, 1 + (j − 1) · S_N : j · S_N)),   (5.3)

where S_M = ⌊M_spec/M_G⌋ and S_N = ⌊N_spec/N_G⌋ are the width and length of each chunk.
Note that some elements of Speck may be discarded. However, it does not affect the
feature generation, since ArrayID focuses on the differences between spectrograms
according to Eq. 5.2. In this study, MG and NG are set to 100 and 20, respectively,
and Fig. 5.7b shows the spectrogram grids from the first, third and fifth microphones.
The differences among the elements of G = [G_1, G_2, ..., G_N] are now very obvious. For
instance, the grid values in the red rectangles of Fig. 5.7b are quite different.
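The grid computation of Eq. 5.3, with M_G = 100 and N_G = 20, can be sketched as below; the function name and the random test spectrogram are illustrative.

```python
import numpy as np

def spectrogram_grid(spec, m_g=100, n_g=20):
    """Divide the spectrogram into an m_g x n_g grid and sum the magnitudes
    inside each chunk; trailing rows/columns that do not fill a whole chunk
    are discarded, as in Eq. 5.3."""
    s_m = spec.shape[0] // m_g          # chunk width  S_M
    s_n = spec.shape[1] // n_g          # chunk length S_N
    trimmed = spec[:m_g * s_m, :n_g * s_n]
    return trimmed.reshape(m_g, s_m, n_g, s_n).sum(axis=(1, 3))

spec = np.abs(np.random.randn(2049, 51))   # stand-in magnitude spectrogram
G = spectrogram_grid(spec)
print(G.shape)                              # (100, 20)
```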
Then, based on Eq. 5.2, ArrayID calculates the array fingerprint FG from the
spectrogram G. FG has the same size as Gk , and the elements of FG can be
represented as:
Fig. 5.8 Illustration of spectrogram array fingerprint feature FSAP extraction. (a) Array finger-
print extraction processing. (b) Features among different commands and distances
Fig. 5.9 Spectrogram distributions between authentic human and spoofing device. (a) The
authentic audio’s Ch. (b) The spoofing audio’s Ch. (c) FSDP between authentic and spoofing
audios
Besides FSAP , the spectrogram distribution also provides useful information related
to the identity of the audio source. Thus we also extract spectrogram distribution
fingerprint F_SDP for liveness detection. Given a spectrogram S_k from the k-th
channel, ArrayID calculates an N_G-dimension vector Ch_k in which
Ch_k(i) = Σ_{j=1}^{M_spec} S_k(j, i), where M_spec and N_G are set as 85 and 20, respectively, in this
study.
study. Note that, when calculating FSDP , we set the cutoff frequency as 1 kHz since
most human voice frequency components are located in the 0~1 kHz range and the
corresponding MSpec is 85 under the parameters in Sect. 5.3.3.1. For the audio with
N channels, the channel frequency strength Ch = [Ch1 , Ch2 , ..., ChN ] is obtained.
Figure 5.9a, b show the channel frequency strengths Ch_1 and Ch_4 of the first and fourth
channels from authentic and spoofing audios. It is observed that the Ch from a real
human and from a spoofing device are quite different. Therefore, we utilize the average
of channel frequency strengths Ch and re-sample its length to NCh as the first
component of FSDP . In this study, Ch(i) = mean([Ch1 (i), Ch2 (i), ..., ChN (i)])
and NCh is set to 20. We can also find that for the same audio, Ch from different
channels have slightly different magnitudes and distributions. To characterize the
distribution of Ch, for Chk from the k-th channel, ArrayID first calculates the
cumulative distribution function Cumk and then determines the indices μ, which
can split Cumk uniformly. As shown in Fig. 5.9a, b, the Chk are segmented into 6
bands. ArrayID sets the T hr = [0.1, 0.3, 0.5, 0.7, 0.9], and the index μ(k, i) of the
i-th T hr for Ch_k satisfies the condition Cum_k(μ(k, i) − 1) < T hr(i) ≤ Cum_k(μ(k, i)).
Fig. 5.10 LPCCs of each channel of a six-microphone array, where Mic 1 is the closest
microphone and Mic 4 is the opposite microphone
After obtaining the N × 5 indices μ, we utilize the mean value Dmean and
standard deviation Dstd among different channels as a part of the spectrogram
feature. Both Dmean and Dstd are vectors with length of 5, where Dmean (i) =
mean(μ(:, i)) and Dstd (i) = std(μ(:, i)). Finally, ArrayID obtains the spectrogram
distribution fingerprint FSDP = [Ch, Dmean , Dstd ]. Figure 5.9c illustrates the
FSDP from authentic and spoofing audios and demonstrates the robustness of FSDP .
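Putting the pieces of F_SDP together, a sketch under the stated parameters (M_spec = 85, Thr = [0.1, 0.3, 0.5, 0.7, 0.9]) might look like this. The inputs are assumed to already be per-channel (frequency × N_G) arrays, so re-sampling Ch to N_Ch = 20 is a no-op here, and `np.searchsorted` stands in for the CDF-splitting condition.

```python
import numpy as np

def sdp_fingerprint(S_list, m_spec=85, thr=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Spectrogram distribution fingerprint: average channel frequency
    strength Ch plus the mean/std (D_mean, D_std) of the CDF-splitting
    indices mu over all channels."""
    Ch = np.stack([S[:m_spec, :].sum(axis=0) for S in S_list])  # (N, N_G)
    ch_mean = Ch.mean(axis=0)                    # average strength, len N_G
    mu = np.zeros((len(S_list), len(thr)))
    for k, ch in enumerate(Ch):
        cum = np.cumsum(ch) / ch.sum()           # normalized CDF, channel k
        for i, t in enumerate(thr):
            mu[k, i] = np.searchsorted(cum, t)   # index splitting the CDF
    d_mean, d_std = mu.mean(axis=0), mu.std(axis=0)
    return np.concatenate([ch_mean, d_mean, d_std])

S_list = [np.abs(np.random.randn(100, 20)) for _ in range(6)]  # toy input
f_sdp = sdp_fingerprint(S_list)
print(f_sdp.shape)   # (30,) = 20 + 5 + 5
```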
The final feature of ArrayID is the Linear Prediction Cepstrum Coefficients (LPCC).
Since each channel has unique physical properties, retaining the LPCC which
characterizes a given audio signal could further improve the detection performance.
For the audio signal y_k(t) collected by microphone M_k, ArrayID calculates the LPCC
with order p = 15. To this end, it first calculates the Linear Prediction
Coding (LPC) coefficients a:

y_k(t) ≈ Σ_{i=1}^{p} a_i · y_k(t − i),   (5.6)

where p is the order of the LPC and the collected LPC coefficients can be represented as a =
[a_0, a_1, . . . , a_p]. Then, for the LPCC coefficients c = [c_0, c_1, . . . , c_p], we have c_0 =
ln(p), and the other elements can be calculated as:

c_i = −a_i − Σ_{k=1}^{i−1} (1 − k/i) · a_k · c_{i−k}.   (5.7)
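The LPC-to-LPCC recursion of Eq. 5.7, with c_0 = ln(p), can be written directly; the toy order p = 2 and coefficient values below are for illustration only (ArrayID uses p = 15).

```python
import math

def lpc_to_lpcc(a, p=15):
    """Convert LPC coefficients a = [a_0, ..., a_p] into LPCC coefficients
    c = [c_0, ..., c_p] following Eq. 5.7, with c_0 = ln(p)."""
    c = [math.log(p)]                            # c_0 = ln(p)
    for i in range(1, p + 1):
        acc = -a[i]
        for k in range(1, i):                    # sum over k = 1 .. i-1
            acc -= (1 - k / i) * a[k] * c[i - k]
        c.append(acc)
    return c

c = lpc_to_lpcc([1.0, 0.5, 0.25], p=2)
print(c)   # [ln(2), -0.5, -0.125]
```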
For an example multi-channel voice, Fig. 5.10 shows the LPCCs on each channel.
In this figure, when M1 is the closest microphone, for a microphone array with six
channels, the opposite microphone is M4 . The LPCCs from these two channels
are selected as FLP C in Sect. 5.3.3.3. To reduce the time overhead spent on
LPCC extraction, we only preserve the LPCCs from audios in these two channels
(M_i, M_mod(i+N/2,N)), where M_i is the closest microphone derived from Sect. 5.3.2.
Finally, we generate the final feature vector X = [FSAP , FSDP , FLP C ].
After generating the feature vector from the audio input, we choose a lightweight
feed-forward back-propagation neural network to perform liveness detection. The
neural network only contains three hidden layers with rectified-linear activation
(layer sizes: 64, 32, 16). The dropout is 20% after the 64 and 32 node layers, and the
output layer is one sigmoid-activated node. We adopt a lightweight neural network
because it can produce decisions quickly, which is essential for devices in the
smart home environment.
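A minimal NumPy sketch of this classifier's inference pass is shown below; the random weight initialization is an untrained stand-in, and dropout is omitted because it only acts during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

def init_layer(n_in, n_out):
    # Arbitrary small random weights: an untrained stand-in
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

class LivenessNet:
    """Three ReLU hidden layers (64, 32, 16) and one sigmoid output node,
    matching the architecture described above."""
    def __init__(self, n_features):
        self.layers = [init_layer(a, b) for a, b in
                       [(n_features, 64), (64, 32), (32, 16), (16, 1)]]

    def predict(self, x):
        for W, b in self.layers[:-1]:
            x = relu(x @ W + b)
        W, b = self.layers[-1]
        return 1.0 / (1.0 + np.exp(-(x @ W + b)))   # sigmoid output

net = LivenessNet(n_features=64)          # feature length is illustrative
score = net.predict(rng.normal(size=(1, 64)))
print(float(score[0, 0]))                 # a probability in (0, 1)
```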
Evaluation Metrics Similar to previous works [1, 19], in this study, we choose
accuracy, false acceptance rate (FAR), false rejection rate (FRR), and equal error
rate (EER) as metrics to evaluate ArrayID. Accuracy is the percentage of
correctly recognized samples among all samples. FAR represents the rate at
which a spoofing sample is wrongly accepted by ArrayID, and FRR characterizes
the rate at which an authentic sample is falsely rejected. EER provides a balanced
view of FAR and FRR and it is the rate at which the FAR is equal to FRR. The
evaluation metrics differ slightly from those in Sect. 4.4.1 because
ArrayID does not adopt a threshold-based classification method. Thus, we use
the EER in scenarios where the numbers of positive and negative samples
are imbalanced. EER is also used by previous works (e.g., [1]).
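The three error-rate metrics can be sketched as a threshold sweep over classifier scores; this is a generic illustration, not ArrayID's evaluation code, and it approximates the EER at the threshold where FAR and FRR are closest.

```python
import numpy as np

def far_frr_eer(authentic_scores, spoof_scores):
    """Sweep a decision threshold and report FAR/FRR at the point where
    they are closest; the EER is approximated as their average there."""
    thresholds = np.sort(np.concatenate([authentic_scores, spoof_scores]))
    best = None
    for t in thresholds:
        far = np.mean(spoof_scores >= t)      # spoofing wrongly accepted
        frr = np.mean(authentic_scores < t)   # authentic wrongly rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, far, frr)
    _, far, frr = best
    return far, frr, (far + frr) / 2

# Toy scores (higher = more likely authentic); values are illustrative
auth = np.array([0.9, 0.8, 0.95, 0.7])
spoof = np.array([0.1, 0.2, 0.3, 0.75])
far, frr, eer = far_frr_eer(auth, spoof)
print(eer)
```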
Fig. 5.12 FAR and FRR of each user
When evaluating ArrayID on our own Array dataset, we choose two-fold cross-
validation, which means the training and testing datasets are divided equally. ArrayID
achieves a detection accuracy of 99.84% and an EER of 0.17%. More specifically,
for all 32,780 samples, the overall FAR and FRR are only 0.05% (i.e., 13 out of
22,539 spoofing samples are wrongly accepted) and 0.39% (i.e., 40 out of 10,241
authentic samples are wrongly rejected) respectively. The results show that ArrayID
is highly effective in thwarting spoofing attacks.
To evaluate the performance of ArrayID on different users, we show the FAR and
FRR of each user in Fig. 5.12. Note that, for six users (i.e., users #11, #12, #15,
#16, #17, #18) who are not shown in this figure, there is no detection error. When
considering FAR, it is observed that false acceptance cases exist for only 6 users.
Even in the worst case (i.e., user #20), the false acceptance rate is still less than
0.51%. When considering FRR, the false rejection cases are distributed among 14
users. It is observed that only the FRRs of users #3 and #20 are above 1%. Although
the performance of ArrayID on different users is different, even for the worst-case
(i.e., user #20), the detection accuracy is still at 99.0%, which demonstrates the
effectiveness of ArrayID.
For a desktop with an Intel i7-7700T CPU and 16 GB RAM, the average time overheads
on 6-channel and 8-channel audios are 0.12 s and 0.38 s, respectively.

Table 5.4 The detection accuracy on both datasets

Liveness feature     Array dataset    ReMasc Core [10]
ArrayID              99.84%           97.78%
Mono feature [1]     98.81%           84.37%
Two-channel [19]     77.99%           82.44%

Note that it is easy for existing smart home systems (e.g., Amazon Alexa) to incorporate
ArrayID into their current industrial-level solutions in the near future. In that case, both
speech recognition and liveness detection can be done in the cloud [12]. Therefore,
by leveraging the hardware configuration of the smart speaker’s cloud (e.g., Amazon
Cloud [9]), which is much better than our existing one (CPU processor), we believe
that the time overhead can be further reduced and will not incur notable delays.
In this subsection, we evaluate the impact of various factors (e.g., direction, distance,
user movement, spoofing device, array type) on ArrayID.
In Sect. 5.4.1, when collecting audio samples, most participants face the smart
speaker while generating voice commands. To explore the impact of the angles
between the user’s face direction and the microphone array, we recruit 10 partic-
ipants to additionally collect authentic voice samples in four different directions
(i.e., front, left, right, back) and then the spoofing device #8 in Table 5.2 is utilized
to generate spoofing audios. As shown in Table 5.5, we collect 4,219 authentic
samples and 3,830 spoofing samples in total. Then, we use the classification model trained
in Sect. 5.4.2 to conduct liveness detection. It is observed from Table 5.5 that in
all scenarios, ArrayID achieves an accuracy above 99.3%, which means ArrayID is
robust to the change of direction.
which demonstrates that ArrayID and the array fingerprint are robust even with the
user’s movement.
Fig. 5.13 Performance under noisy environments. (a) Noise evaluation setting. (b) Accuracy and
EER
We utilize the classifier trained in Sect. 5.4.2, where the noise level is 30 dB, to conduct
liveness detection. As shown in Fig. 5.13b, when the noise level increases from
45 dB to 65 dB, the accuracy decreases from 98.8% to 86.3%. It is observed that
ArrayID can still work well when the background noise is less than 50 dB, which
also explains why ArrayID can handle the audio samples of the ReMasc Core dataset
collected in an outdoor environment. However, when there exists strong noise, since
the feature of ArrayID is only based on the collected audios, the performance of
ArrayID degrades sharply.
5.5 Discussions
when setting the training proportion as 10%, among 10,241 authentic samples from
20 users, the average number of audio samples provided by each user during the
enrollment is only 51. Since the average length of a voice command is shorter
than 3 s, the enrollment can be done in less than 3 min. Compared with the time
overhead of deploying an Alexa skill, which is up to 5 min [2], requiring 3 min for
enrollment is acceptable in real-world scenarios.
In Sect. 5.4.2, each participant is required to provide both authentic and spoofing
audio samples during enrollment. In this subsection, we consider two special
training configurations: (1) a new user provides only authentic voice samples
(without spoofing samples); (2) a new user does not participate in the training at all
(i.e., an unseen user).
In this subsection, we add an experiment to evaluate the performance of ArrayID
on participants that did not participate in the training (i.e., unseen users).
Fig. 5.14 Detection accuracy of each user under two training configurations
Fig. 5.15 Detection accuracy when the user participates in training or not
In the evaluation phase, we test the ability of ArrayID to detect attack samples of this
user and calculate the corresponding detection accuracy (i.e., true rejection rate, TRR).
Figure 5.14 illustrates the detection accuracy under two different training
configurations. For all 18 users, the overall accuracy (i.e., TRR) decreases from
99.96% in the classical training configuration described in Sect. 5.4.2 to 99.68% in
this training configuration. For 11 users (i.e., user #4, #5, #8, #9, #11, #12, #14,
#15, #16, #17, #18), the accuracy remains 100% in both scenarios. For the other 7
users, the accuracy decreases slightly due to a lack of knowledge of the user’s attack
samples in the classifier, but all of them achieve an accuracy of above 96%, which
demonstrates the effectiveness of ArrayID in this training configuration.
In the experiment, for each user in the Array dataset, we train the classifier using the
other 19 users’ legitimate and spoofing voice samples and regard the user’s samples
as the testing dataset. The detection accuracy of each user is shown in Fig. 5.15. We
also show the results described in Sect. 5.4.2 when users participate in training as a
comparison.
From Fig. 5.15, it is observed that the overall detection accuracy decreases
from 99.84% to 92.97%. In the worst case (i.e., user #12), the detection accuracy
decreases from 99.87% to 74.53%. The results demonstrate that the ability of ArrayID
to address unseen users varies across users. However, for 11 users, ArrayID
can still achieve detection accuracies higher than 95%. The overall results demon-
strate that ArrayID is still effective when addressing unseen users.
This performance degradation when addressing unseen users remains an open
problem in the area of liveness detection [1, 4, 23]. To partially mitigate this
issue, a practical solution is requiring the unseen users to provide only authentic
voice samples to enhance the classifier (i.e., the training configuration discussed in
Sect. 5.5.2.1).
5.6 Summary
In this study, we propose a novel liveness detection system, ArrayID, for thwarting
voice spoofing attacks without any extra devices. We analyze existing
popular passive liveness detection schemes and propose a robust liveness feature,
the array fingerprint. This novel feature both enhances effectiveness and broadens the
application scenarios of passive liveness detection. ArrayID is tested on both our
own dataset and another public dataset, and the experimental results demonstrate
ArrayID is superior to existing passive liveness detection schemes. Besides, we
evaluate multiple factors and demonstrate the robustness of ArrayID.
References
1. Ahmed, M.E., Kwak, I.Y., Huh, J.H., Kim, I., Oh, T., Kim, H.: Void: a fast and light voice
liveness detection system. In: 29th USENIX Security Symposium (USENIX Security 20),
pp. 2685–2702. USENIX Association, Berkeley (2020). https://www.usenix.org/conference/
usenixsecurity20/presentation/ahmed-muhammad
2. Amazon.com, Inc.: Create and manage alexa-hosted skills (2021). https://developer.amazon.
com/en-US/docs/alexa/hosted-skills/alexa-hosted-skills-create.html
3. Blue, L., Abdullah, H., Vargas, L., Traynor, P.: 2MA: verifying voice commands via two
microphone authentication. In: Proceedings of the 2018 on Asia Conference on Computer
and Communications Security, pp. 89–100. ACM, New York (2018). https://doi.org/10.1145/
3196494.3196545
4. Blue, L., Vargas, L., Traynor, P.: Hello, is it me you’re looking for? differentiating between
human and electronic speakers for voice interface security. In: Proceedings of the 11th ACM
Conference on Security & Privacy in Wireless and Mobile Networks, pp. 123–133. ACM, New
York (2018)
5. Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., Zhou,
W.: Hidden voice commands. In: 25th USENIX Security Symposium (USENIX Security
21. Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., Xu, W.: Dolphinattack: inaudible voice
commands. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Com-
munications Security (CCS), pp. 103–117 (2017). https://doi.org/10.1145/3133956.3134052
22. Zhang, L., Meng, Y., Yu, J., Xiang, C., Falk, B., Zhu, H.: Voiceprint mimicry attack towards
speaker verification system in smart home. In: IEEE INFOCOM 2020 – IEEE Conference on
Computer Communications, pp. 377–386 (2020). https://doi.org/10.1109/INFOCOM41043.
2020.9155483
23. Zhang, L., Tan, S., Wang, Z., Ren, Y., Wang, Z., Yang, J.: Viblive: a continuous liveness
detection for secure voice user interface in IoT environment. In: ACSAC ’20: Annual Computer
Security Applications Conference, pp. 884–896. ACM, New York (2020). https://doi.org/10.
1145/3427228.3427281
Chapter 6
Traffic Analysis Based Misbehavior
Detection at Application Platform Layer
6.1 Introduction
At present, the smart home market is flourishing, and many manufacturers
produce and sell various smart home devices. To promote collaborative
interaction among smart devices produced by different manufacturers,
the concept of the application platform was proposed. The application platform is
equipped with a local gateway (e.g., a base station) and a cloud backend server to
enable devices from different vendors to access the same network. In the application
platform, devices are abstracted so that developers can design smart applications
without knowing the physical details of the devices, allowing devices from different
manufacturers to work together. For example, a smart application can monitor the
status of a device (e.g., a smart motion sensor) and trigger certain operations of
another device (e.g., turning on a smart light) when receiving notification of certain
events (e.g., a motion sensor detects a user’s motions). The scalability of the smart
platform framework has greatly inspired a large number of device manufacturers
and application developers to participate in the ecosystem. Famous application
platforms include Samsung SmartThings [15], Apple HomeKit [2], and so on.
However, with the popularization of smart application platforms, security issues
have become increasingly prominent. For example, in the Samsung SmartThings
platform studied in this chapter, Fernandes et al. [7] reveal several design defects. These defects
allow smart applications (referred to as SmartApps in SmartThings) running in the
cloud backend to perform unauthorized operations on smart devices and to
eavesdrop on or even forge events generated by smart devices.
Existing solutions to SmartThings security, especially on the aspect of mis-
behaving SmartApp detection and prevention, mainly fall into three categories:
first, applying information flow control to confine sensitive data by modifying the
smart home platform [8], second, designing a context-based permission system for
fine-grained access control [12], and third, enforcing context-aware authorization
of SmartApps by analyzing the source code, annotation, and description [25].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 135
Y. Meng et al., Security in Smart Home Networks, Wireless Networks,
https://doi.org/10.1007/978-3-031-24185-7_6
136 6 Traffic Analysis Based Misbehavior Detection at Application Platform Layer
As one of the most popular smart home platforms, Samsung SmartThings provides
an attractive feature favored by many device manufacturers and software developers,
which is the separation of intelligence from devices. In particular, it offers an
abstraction of a variety of lower-layer smart devices to software developers so
that the development of software programs (i.e., SmartApps) is decoupled from the
manufacture of the smart devices. In this way, the SmartThings platform fosters a
vibrant software market, which encourages third-party software developers to enrich
the diversity of home automation functionalities. It is worth noting that the number
of SmartApps has been changing. For example, from 2016 to 2017, more than
300 SmartApps were deleted either because security vulnerabilities were reported
or because users were not interested in them [16, 21]. Therefore, in this chapter,
we focus on the SmartApps supported by SmartThings in May 2018, including 133
device types and 181 SmartApps.
Fig. 6.1 The architecture of the SmartThings platform: users configure devices and install
SmartApps through the SmartThings mobile app, which communicates with the cloud over
SSL, while the SmartThings hub talks to smart devices over AES-encrypted links

Fig. 6.2 Capabilities, attributes, and commands of a smart device: the Switch capability
exposes the switch.on command (turn on outlet) and the switch.off command (turn off outlet)
Section 1.2 introduces the overall architecture of the smart home. In this subsection,
we take Samsung SmartThings as the research object to introduce how the application
platform integrates smart devices and user interfaces.
The architecture of the SmartThings platform is shown in Fig. 6.1. Smart devices
are the key building blocks of the entire SmartThings infrastructure. They are
connected to the hub with a variety of communication protocols, including ZigBee,
Z-Wave, and Wi-Fi. In SmartThings, the hub mediates the communication with all
connected devices and serves as the gateway to the SmartThings cloud, where device
handlers and SmartApps are hosted. Device handlers are virtual representations of
physical devices, which abstract away the implementation details of the protocols
for communicating with the devices. As shown in Fig. 6.2, device handlers specify
the capabilities of these devices.
Capabilities can have commands and attributes. Commands are methods for
SmartApps to control the devices. Attributes reflect properties or characteristics
6.2 Preliminaries and Motivation 139
of the devices. For example, smart device Samsung SmartThings Outlet has 9
capabilities, among which Switch and Power Meter are the most commonly used:
Switch enables the control of the switch and it has two commands: on() and
off(), while Power Meter has one attribute power for reporting the device’s power
consumption.
Fernandes et al. [7] presented several security-critical design flaws in the Smart-
Things’ capability model and event subsystem. These design flaws may enable the
following SmartApp misbehaviors that can lead to security compromises.
enforcement mechanisms [25], and systems to enforce information flow control [8].
However, these solutions require either modification of the platform itself [8, 25]
or changes in the SmartApps [12]. A security mechanism that works on existing
platforms will be more practical as a business solution.
In this study, we propose a detection system, dubbed HoMonit, for detect-
ing misbehaving SmartApps in a non-intrusive manner by leveraging techniques
commonly used in wireless side-channel inference. HoMonit is inspired by two
observations. First, most communication protocols used in smart home environ-
ments are designed for a low transmission rate and reduced data redundancy
for low power consumption. Second, the wireless communications between the
hub and smart devices usually show unique and fixed patterns determined by
the corresponding smart devices and SmartApps. Therefore, after extracting the
working logic of SmartApps as Deterministic Finite Automatons (DFAs)—with the
help of code analysis (for open-source SmartApps) or natural language processing
techniques (for closed-source SmartApps)—an external observer can determine
which SmartApp is operating and which state this SmartApp is currently in by
monitoring only the metadata (e.g., the packet size, inter-packet timing) of the
encrypted wireless traffic. If a SmartApp deviates from its usual behavior, the pattern
of wireless traffic will also change, which can be utilized to detect misbehaving
SmartApps.
The capability of monitoring misbehaving SmartApps from encrypted traffic
enables a third-party defender—other than the smart home platform vendors,
smart device manufacturers, and SmartApp developers—to develop a smart home
anomaly detection system to detect misbehaving SmartApps at runtime. A major
advantage of a third-party defense mechanism is that no modification of the
protected platform is needed. HoMonit is designed to work without the need to
change the current SmartThings infrastructure, change the system software on the
hub or smart devices, or modify the SmartApps. HoMonit can work directly with
the existing SmartThings platform and is easily extensible to other platforms with
similar infrastructures.
We illustrate this idea using a concrete example: Brighten My Path is a SmartApp
for automatically turning on the outlet after a motion has been detected by the
motion sensor. We show the observed packet sizes of the communications between
the sensors and the hub in Fig. 6.3, in which the y-axis shows the packet sizes and
the x-axis shows the timestamps of the packets when they arrive. The SmartApp
subscribes to two capabilities, which include an attribute motion for capability
Motion Sensor and a command on() for capability Switch. In Fig. 6.3, motion.active
corresponds to packet sequence (54 ↑, 47 ↓) and switch.on corresponds to packet
sequence (50 ↓, 47 ↓, 47 ↓, 52 ↓), where ↓ means the packet is hub to device and ↑
means device to hub. The DFA consists of three states, which are connected by two
transitions (as shown in Fig. 6.3). If the events corresponding to motion.active and
switch.on are detected in a sequence, the DFA will transition from the start state to
the accept state. Thus, the behavior of the SmartApps can be inferred from the DFA
transitions: normal behavior sequences are always accepted by the DFA, whereas
abnormal ones are not.
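The inference step can be sketched as matching observed packet metadata against per-event signatures; the (size, direction) string encoding below is a simplified stand-in for the tuples read off Fig. 6.3 ("u" for device-to-hub, "d" for hub-to-device).

```python
# Hypothetical per-event signatures from the Brighten My Path example
SIGNATURES = {
    ("54u", "47d"): "motion.active",
    ("50d", "47d", "47d", "52d"): "switch.on",
}

def infer_events(packets):
    """Greedily match observed (size, direction) packets against the known
    per-event signatures to recover the event sequence."""
    events, i = [], 0
    while i < len(packets):
        for sig, event in SIGNATURES.items():
            if tuple(packets[i:i + len(sig)]) == sig:
                events.append(event)
                i += len(sig)
                break
        else:
            i += 1          # unmatched packet: skip it
    return events

trace = ["54u", "47d", "50d", "47d", "47d", "52d"]
print(infer_events(trace))   # ['motion.active', 'switch.on']
```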
Fig. 6.3 Observed packet sizes of the communications between the sensors and the hub for
the events motion.active and switch.on, and the corresponding DFA
The SmartApp Analysis Module aims to extract the expected behaviors of the
SmartApps. We utilize the Deterministic Finite Automaton (DFA) to characterize
the logic of SmartApps. We choose DFA to represent a SmartApp for two reasons:
(1) a SmartApp supervises a finite number of devices and (2) devices are driven into
a deterministic status by the SmartApp when a specific condition is satisfied. More
specifically, we formalize the SmartApp DFA as a 5-tuple M = (Q, Σ, δ, q0, F),
where Q is a finite set of states of the SmartApp; Σ is a finite set of symbols,
which correspond to attributes or commands of their capabilities; δ is the transition
function Q × Σ → Q, where Q × Σ is the set of 2-tuples (q, a) with q ∈ Q and
a ∈ Σ; q0 is the start state; and F is a set of accept states.
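The 5-tuple above maps directly onto a small data structure. The sketch below encodes the Brighten My Path example with illustrative state names q0–q2; it is not HoMonit's actual implementation.

```python
class SmartAppDFA:
    """A 5-tuple M = (Q, Sigma, delta, q0, F) with a simple runner:
    an event sequence is accepted iff it drives the DFA from the start
    state to an accept state without hitting an undefined transition."""
    def __init__(self, states, symbols, delta, start, accepts):
        self.states, self.symbols = states, symbols
        self.delta, self.start, self.accepts = delta, start, accepts

    def accepts_sequence(self, events):
        q = self.start
        for e in events:
            if (q, e) not in self.delta:
                return False        # undefined transition: misbehavior
            q = self.delta[(q, e)]
        return q in self.accepts

# Brighten My Path: start --motion.active--> q1 --switch.on--> accept
dfa = SmartAppDFA(
    states={"q0", "q1", "q2"},
    symbols={"motion.active", "switch.on"},
    delta={("q0", "motion.active"): "q1", ("q1", "switch.on"): "q2"},
    start="q0",
    accepts={"q2"},
)
print(dfa.accepts_sequence(["motion.active", "switch.on"]))  # True
print(dfa.accepts_sequence(["switch.on"]))                   # False
```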
This chapter mainly focuses on open-source SmartApps. HoMonit performs
static analysis on their source code and automatically translates them into DFAs.
We analyzed 181 open-source SmartApps from the SmartThings Public GitHub
Repository [19] to build their DFAs. All of them are official open-source SmartApps.
Building DFAs from closed-source SmartApps is out of this section's scope.
Since the open-source SmartApps are written in Groovy, to extract their logic,
we conducted a static analysis on the source code using AstBuilder [6]. Figure 6.5
shows an example of the source code of a SmartApp. HoMonit converts the source
code of the SmartApp into an Abstract Syntax Tree (AST) during the Groovy
compilation phase.
The translation from an AST to a DFA is completed in two steps. First, to
obtain the set of symbols (i.e., Σ of the DFA), HoMonit extracts the capabilities
requested by the SmartApp from the preferences block statement (Fig. 6.5, line 7).
Specifically, all available capabilities are first obtained from the SmartThings
Developer Documentation [18], and then the input method calls (Fig. 6.5, line 9
and 13) of the preferences block statement are scanned to extract the capabilities
requested by the SmartApp. SmartApps use subscribe methods to request notifica-
tions when the device’s status has been changed. These notifications will trigger the
handler methods to react to these status changes. To further determine the specific
commands or attributes (i.e., symbols of the DFA), HoMonit scans the subscribe
methods and their corresponding commands or attributes (e.g., motion.active in the
subscribe method).
The second step is to extract the state transitions (i.e., δ) from the subscribe and
handler methods. HoMonit starts from the subscribe method calls in the installed and
updated blocks. Each subscribe method in these blocks indicates one transition from
the start state to an intermediate state; by inspecting the corresponding handler
method, how the intermediate state transitions to other states can be determined.
In the example shown in Fig. 6.5, one transition (with switch.on as its symbol) moves
the DFA to an accept state. Complex handler methods may involve multiple states
and transitions before the DFA reaches the accept states. The set of states Q, start
state q0, and accept states F of the DFA are automatically constructed according to
the transition function.
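To make the two-step translation concrete, the following Python sketch shows how a DFA could be represented and exercised once the symbols and transitions have been extracted. The class and state names are illustrative, not HoMonit’s actual data structures; the transitions mirror the example of Fig. 6.5, where a motion.active subscription leads to an intermediate state and the handler’s switch.on command reaches the accept state.

```python
# Illustrative sketch (not HoMonit's actual code): a DFA assembled from the
# subscribe calls and handler commands extracted from a SmartApp's AST.

class DFA:
    def __init__(self):
        self.start = "q0"
        self.accept = set()
        self.delta = {}          # (state, symbol) -> next state

    def add_transition(self, src, symbol, dst, accepting=False):
        self.delta[(src, symbol)] = dst
        if accepting:
            self.accept.add(dst)

    def run(self, symbols):
        state = self.start
        for s in symbols:
            if (state, s) not in self.delta:
                return False     # symbol not allowed by the app's logic
            state = self.delta[(state, s)]
        return state in self.accept

# Step 1: each subscribe() call yields a transition out of the start state.
# Step 2: the commands issued inside the handler extend the path to an accept state.
dfa = DFA()
dfa.add_transition("q0", "motion.active", "q1")
dfa.add_transition("q1", "switch.on", "q2", accepting=True)

print(dfa.run(["motion.active", "switch.on"]))   # expected behavior -> True
print(dfa.run(["switch.on"]))                    # unexpected command -> False
```

An observed event sequence that the DFA rejects would then be flagged as misbehavior.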
6.3 System Design of Misbehavior Detection System 145
The DFAs of 150 out of the 181 SmartApps (82.9%) were successfully constructed.
This success rate is already very high, considering that some SmartApps are much
more complex than the one listed in Fig. 6.5. The dataset contained complex apps with
over 28 states and 40 transitions in their DFAs, and these DFAs could all be
successfully extracted and further used in detection. Most of the popular
SmartApps can be successfully handled. However, the DFA construction failed
on some SmartApps because they request capabilities that are not associated with
any device. Take the SmartApp Severe Weather Alert as an example: it only acquires
weather information from the Internet and sends weather alerts to the user’s
smartphone, so a meaningful DFA cannot be constructed.
HoMonit collects wireless traffic and conducts DFA matching according to the
traffic characteristics to identify the misbehavior of SmartApps. In this chapter,
HoMonit collects both ZigBee and Z-Wave traffic between the hub and the smart
devices, since these two standards are widely used by smart home devices, as shown
in Table 6.1.
ZigBee Traffic Collection To sniff the ZigBee traffic, HoMonit employs a com-
mercial off-the-shelf ZigBee sniffer (i.e., Texas Instruments CC2531 USB Dongle
[24]) and an open-source software tool (i.e., 802.15.4 monitor [14]) to passively
collect the ZigBee traffic. ZigBee divides the 2.4 GHz band into 16 channels, and
the SmartThings hub and its devices stay on a fixed channel (0xe in our case).
We customized the 802.15.4 monitor to achieve real-time packet capturing; the
captured packets were dumped into a log file once per second.
Z-Wave Traffic Collection HoMonit adopts Universal Software Radio Peripheral
(USRP) hardware to collect Z-Wave packets and modifies an open-source software,
Scapy-Radio [5], to automatically record them. Figure 6.6 shows the Z-Wave traffic
collection framework in the GNU Radio software. It is worth noting that some Z-
Wave devices may communicate with the hub at different channel frequencies in
different modes (e.g., sleep or active mode). Take the alarm sensor Aeotec Siren
(Gen 5) as an example. The device communicates with the hub in sleep mode at
the frequency of 908.4 MHz with a transmission rate of 40 kbps but communicates
at the frequency of 916 MHz with a transmission rate of 100 kbps rate in the active
mode. As such, to monitor both channels simultaneously, we exploited two USRPs
working at the above two frequencies to capture all Z-Wave traffic. For example,
in order to capture a data packet with a transmission rate of 100 kbps, in addition
to adjusting the frequency of the receiver, it is necessary to change the cut-off
frequency and sampling rate of the filter (i.e., changing the Omega in the Clock
Recovery MM block of the receiver to 8). After this adjustment, the USRPs can
monitor the communication of commercial Z-Wave devices normally.
The collected wireless traffic contains packets that are considered noise for our
event inference and must be filtered out. Noise packets include the following types:
• Beacon packets. Beacon packets are mainly used for acknowledging data transmission
and maintaining established connections, and they carry little side-channel
information. HoMonit discards ZigBee beacon packets and drops Z-Wave packets
with no payload.
• Retransmission packets. Retransmission packets are sent in case of transmission
failure. In ZigBee, they can be identified by checking whether two subsequent
packets share the same sequence number. In Z-Wave, retransmission packets can
be identified when two consecutive packets sent by the sending device are observed
without a response packet in between. We also discard retransmission packets to
avoid degrading the inference performance.
• Unrelated traffic. Traffic from devices using other wireless standards (e.g., Wi-Fi
and Bluetooth) or from other networks is treated as unrelated traffic. To identify
traffic from the targeted networks, ZigBee uses a unique identifier called the
Personal Area Network Identifier (PANID for short), while Z-Wave uses the Home ID,
which denotes the ID that the Primary Controller assigns to a node during the
inclusion process [9]. HoMonit filters out collected traffic whose PANID or Home ID
differs from the specified ones.
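The three filtering rules above can be sketched as a single pass over the captured packets. This is a minimal illustration with hypothetical field names (panid, seq, payload_len), not the packet structures HoMonit actually parses:

```python
# A minimal sketch of the noise-filtering rules: drop other-network traffic,
# zero-payload beacons/ACKs, and ZigBee retransmissions (repeated sequence numbers).

def filter_noise(packets, my_panid):
    out, prev = [], None
    for p in packets:
        if p.get("panid") != my_panid:        # unrelated traffic (wrong PAN ID / Home ID)
            continue
        if p.get("payload_len", 0) == 0:      # beacon / ACK packets carry no payload
            continue
        if prev is not None and p["seq"] == prev["seq"]:
            continue                          # retransmission: same sequence number
        out.append(p)
        prev = p
    return out

pkts = [
    {"panid": 0x1A62, "seq": 10, "payload_len": 54},
    {"panid": 0x1A62, "seq": 10, "payload_len": 54},   # retransmission
    {"panid": 0xBEEF, "seq": 3,  "payload_len": 40},   # other network
    {"panid": 0x1A62, "seq": 11, "payload_len": 0},    # beacon/ACK
    {"panid": 0x1A62, "seq": 12, "payload_len": 45},
]
print(len(filter_noise(pkts, 0x1A62)))   # 2 packets survive
```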
After noise filtering, we are ready to exploit the side channel of the wireless
traffic to infer DFA events. An event on the SmartThings platform can be a command
that is generated by the hub and sent to the devices, or an attribute of a device
that is reported to the hub [17]. We formally denote an event as E_t^φ, which
indicates that the event is of type φ and is generated at time t. An event type φ
is a 2-tuple (d, e), where d is the device and e is a command sent to d or an
attribute of d. We denote the set of all event types as Φ.
Each event triggers a sequence of wireless packets. We denote a wireless packet
as a quadruple f = (t, l, d_i, d_j), where f refers to the packet of length l sent
from device d_i to device d_j at time t. Here, d_i and d_j are represented using
the MAC addresses in ZigBee or the node IDs in Z-Wave. Once an event is triggered,
a sequence of n packets sent between devices d_i and d_j at a specific time t can
be monitored during a short interval, which can be represented as
S_t^{d_i↔d_j} = (f_1, f_2, . . . , f_n). Note that either d_i or d_j is the hub,
because the SmartThings framework dictates that all devices communicate with the
hub. If packets of multi-hop communications are captured, HoMonit merges these
consecutive packets from multiple hops into a single one. Therefore, there is
typically a one-to-one mapping between an event E_t^φ and a sequence of packets
S_t^{d_i↔d_j}.
For a given device, we first obtain its different types of events by referring to
its open-source device handler. For each event type φ ∈ Φ, we manually trigger the
event and collect m samples (m = 50), denoted as S^φ = {S_1^φ, S_2^φ, . . . , S_m^φ},
where S_i^φ is the sequence of packets collected in one experiment when the event
is triggered. The fingerprint F_φ of event type φ is defined as

    F_φ = argmin_{S_i^φ ∈ S^φ} (1/|S^φ|) Σ_{S_j^φ ∈ S^φ} Dis(S_i^φ, S_j^φ),    (6.1)

where Dis(S_i^φ, S_j^φ) adopts the Levenshtein distance [3] to measure the
similarity between sequences S_i^φ and S_j^φ, i.e., a small Dis(S_i^φ, S_j^φ)
means a high similarity between S_i^φ and S_j^φ. Table 6.2 lists the fingerprints
of all devices we possess (seven ZigBee devices and four Z-Wave devices, as listed
in Table 6.3).
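A direct reading of Eq. (6.1) is: among the m collected samples, pick the one with the smallest average Levenshtein distance to all samples. The sketch below implements this, abstracting each sample as a sequence of (length, direction) pairs; the sample values are illustrative, not measured fingerprints.

```python
# Sketch of Eq. (6.1): the fingerprint is the sample with the minimum average
# Levenshtein distance to all samples of the same event type.

def levenshtein(a, b):
    # classic dynamic-programming edit distance over packet sequences
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def fingerprint(samples):
    # F_phi = argmin over S_i of the average distance to every S_j
    return min(samples,
               key=lambda si: sum(levenshtein(si, sj) for sj in samples) / len(samples))

samples = [
    ((50, "down"), (47, "down"), (47, "down"), (52, "down")),
    ((50, "down"), (47, "down"), (52, "down")),               # one packet lost
    ((50, "down"), (47, "down"), (47, "down"), (52, "down")),
]
print(fingerprint(samples))   # the most representative sample
```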
Table 6.2 Fingerprints for event types that are supported by 7 ZigBee devices and 4 Z-Wave devices (arrows denote the packet direction)

Samsung SmartThings Water Leak Sensor (Water Sensor), ZigBee:
  water.wet: 54↑ 45↑
  water.dry: 54↑ 45↑
  temperature: 53↑ 45↑

Samsung SmartThings Motion Sensor (Motion Sensor), ZigBee:
  motion.active: 54↑
  motion.inactive: 54↑
  temperature: 53↑ 45↑

Samsung SmartThings Outlet (Outlet), ZigBee:
  switch.on: 50↓ 47↓ 47↓ 52↓
  switch.off: 50↓ 47↓ 47↓ 52↓

Samsung SmartThings Multipurpose Sensor 2016 (Multipurpose Sensor 2016), ZigBee:
  contact.open: 54↑
  contact.closed: 54↑
  acceleration.active: 69↑ 65↑ 65↑ 65↑ · · ·
  acceleration.inactive: occurs after event acceleration.active finishes
  temperature: 53↑

Samsung SmartThings Multipurpose Sensor 2015 (Multipurpose Sensor 2015), ZigBee:
  contact.open: 54↑ 45↑
  contact.closed: 54↑ 45↑
  acceleration.active: 69↑ 65↑ 65↑ 65↑ · · ·
  acceleration.inactive: occurs after event acceleration.active finishes
  temperature: 53↑ 45↑

Samsung SmartThings Arrival Sensor (Arrival Sensor), ZigBee:
  beep: 50↓ 45↓
  rssi: 52↑
  presence.present: 57↑ 48↑ 45↑ 45↑ 49↑ 45↑ 50↑ 45↑ 50↑ 45↑ 50↑ 45↑
  presence.not present: occurs after the periodic rssi event stops

Osram Lightify CLA 60 RGBW (Light), ZigBee:
  switch.on: 50↓ 47↓
  switch.off: 50↓ 47↓
  illuminance: 53↓ 47↓
  setColorTemperature: 54↓ 47↓ 52↓ 47↓
  setColor: 50↓ 47↓ 54↓ 47↓ 52↓ 47↓ 52↓ 47↓

Power Monitor Switch TD1200Z1 (Switch), Z-Wave:
  switch.on: 13↓ 12↓ 10↓
  switch.off: 13↓ 12↓ 10↓

Aeotec MultiSensor 6 (MultiSensor), Z-Wave:
  motion.active: 14↑ 21↑
  motion.inactive: 14↑ 21↑

Aeotec Door/Window Sensor 6 (Door/Window Sensor), Z-Wave:
  contact.open: 17↑ 17↑
  contact.closed: 17↑ 17↑

Aeotec Siren Gen 5 (Siren), Z-Wave:
  alarm.siren: 13↓ 34↓ 11↓ 33↓ 11↓ 21↓ 11↓
To infer events based on the captured wireless packets, HoMonit first partitions
the traffic flow into a set of bursts. A burst is a group of network packets in which
the interval between any two consecutive packets is less than a pre-determined
burst threshold [23]. The packets in each burst are then ordered according to their
timestamps, and the burst is represented as S_t^{d_i↔d_j}. HoMonit matches the burst
with the fingerprints of each of the known events by calculating their Levenshtein
distance, Dis(S_t^{d_i↔d_j}, F_φ). The event type with the smallest Levenshtein
distance from the packet sequence is considered the inferred event. For instance,
the burst (13↓ 34↓ 11↓ 33↓ 11↓ 21↓ 11↓) will be inferred as the event alarm.siren.
However, as shown in Table 6.2, more than one event may have exactly the
same pattern (e.g., packet sizes and directions). For instance, (50↓ 47↓ 47↓
52↓) may be inferred as either switch.on or switch.off of the Samsung
SmartThings Outlet. To correctly identify the event, we classify such ambiguous
events into two categories:
• Events of the same device. One example is switch.on and switch.off of the Samsung
SmartThings Outlet. The reason is that they are essentially the same event
message with different data fields. As these events typically exist in pairs, such
as on and off, active and inactive, and wet and dry, we use one bit to track the
current state of each device and thereby differentiate these events.
• Events of different devices. One example is water.wet of the Samsung SmartThings
Water Leak Sensor and contact.open of the Samsung SmartThings Multipurpose
Sensor (2015). We first use other unique events to identify the device and then
determine the event type. For example, if we capture acceleration.active, then
we know that the device is the Samsung SmartThings Multipurpose Sensor (2015);
therefore, the event type must be contact.open instead of water.wet.
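Putting the pieces together, the following sketch partitions a packet trace into bursts with a 1 s threshold, matches each burst to the nearest fingerprint by edit distance, and uses the per-device state bit described above to disambiguate the switch.on/switch.off pair. The fingerprint table and device names are illustrative; the switch.toggle entry is a hypothetical stand-in for the shared on/off pattern.

```python
# Sketch: burst partitioning + nearest-fingerprint matching + state-bit
# disambiguation for paired events. Not HoMonit's actual code.

def to_bursts(packets, threshold=1.0):
    bursts, cur = [], []
    for t, size in packets:                  # packets sorted by timestamp
        if cur and t - cur[-1][0] > threshold:
            bursts.append(cur)
            cur = []
        cur.append((t, size))
    if cur:
        bursts.append(cur)
    return bursts

def dist(a, b):
    # tiny recursive Levenshtein over packet-size sequences (fine for short bursts)
    if not a:
        return len(b)
    if not b:
        return len(a)
    return min(dist(a[1:], b[1:]) + (a[0] != b[0]),
               dist(a[1:], b) + 1,
               dist(a, b[1:]) + 1)

fingerprints = {
    ("outlet", "switch.toggle"): (50, 47, 47, 52),   # on and off share one pattern
    ("siren", "alarm.siren"): (13, 34, 11, 33, 11, 21, 11),
}
state = {"outlet": "off"}                            # one state bit per device

def infer(burst):
    sizes = tuple(size for _, size in burst)
    (device, event), _ = min(fingerprints.items(),
                             key=lambda kv: dist(sizes, kv[1]))
    if event == "switch.toggle":                     # ambiguous pair: use the state bit
        state[device] = "on" if state[device] == "off" else "off"
        return device, "switch." + state[device]
    return device, event

packets = [(0.00, 50), (0.05, 47), (0.10, 47), (0.15, 52),
           (3.00, 50), (3.05, 47), (3.10, 47), (3.15, 52)]
bursts = to_bursts(packets)          # two bursts under the 1-second threshold
print([infer(b) for b in bursts])    # [('outlet', 'switch.on'), ('outlet', 'switch.off')]
```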
In this section, we will discuss the potential privacy leakage arising from the
wireless side channel analysis and then present a privacy enhancement design.
To enhance the privacy of smart home environments while preserving the capability
of detecting misbehaving SmartApps via side-channel analysis, we propose
a novel solution that, by intelligently injecting dummy packets into the wireless
network, obfuscates the transmission patterns of the target smart devices and the hub.
As the dummy packets are generated by HoMonit, they can be easily filtered out by
HoMonit in its own analysis. We detail our design of the Privacy Enhancement
Module as follows.
Fig. 6.7 Illustration of privacy enhancement via decoys. (a) Real network: a central hub connects
4 smart devices. (b) Obfuscated network: each device (including the hub) has one decoy
Dummy Packet Generation As illustrated in Fig. 6.7, the key idea of the Privacy
Enhancement Module is to create fake identities for the real devices and simulate
other, non-related SmartApp activities using these fake device identities. To create
a fake identity for a device, dummy packets are generated while reusing the MAC
address of the real device. Each of these fake identities is called a decoy of the real
device. Given a pre-defined security parameter k, k × N decoys are generated to
provide (k + 1)-anonymity for the N real devices (the hub included). To make the
generated dummy packets indistinguishable from the real ones, the dummy packets
are generated with the transmission patterns (e.g., inter-packet interval) of the real
packets, which can be learned during offline training. Furthermore, to make the
encrypted payload indistinguishable, the payload of the dummy packets is of the
same length as that of the real packets. After generating a dummy packet sent from
a decoy device to a decoy hub, another dummy packet is sent from the decoy hub to
the decoy device to simulate the response.
Although there are prior works that aim to distinguish spoofed devices from real
devices by leveraging the different Received Signal Strength (RSS) values caused
by different distances [4, 13, 26], it is possible to make them indistinguishable
in terms of RSS. This can be achieved by placing the Privacy Enhancement Module
in the proximity of the real devices or by adjusting the transmission power of the
USRPs to simulate different transmission distances.
Maintaining Independent Sequence Numbers According to the ZigBee and Z-
Wave specifications, the sequence number of the packets sent from the same device
increases by 1 for each packet. Furthermore, in ZigBee packets and 916 MHz Z-
Wave packets, the size of the sequence number is 1 byte, ranging from 0x01
to 0xff; in 908.4 MHz Z-Wave packets, the sequence number takes only 4 bits,
ranging from 0x1 to 0xf. Therefore, to make the decoy devices and the real devices
indistinguishable, it is necessary to handle the sequence number properly. To do so,
we let each decoy maintain an independent sequence number and increment the
number for each packet it sends.
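The decoy behavior described above (mirrored packet lengths and timing, plus an independent sequence counter that wraps within the protocol’s range) can be sketched as follows; the field names and the timing distribution are assumptions for illustration only.

```python
# Illustrative sketch of decoy traffic generation: each decoy mirrors the real
# device's packet lengths but keeps its own sequence counter that wraps in the
# protocol's range (0x01-0xff for ZigBee/916 MHz Z-Wave, 0x1-0xf for 908.4 MHz Z-Wave).

import random

class Decoy:
    def __init__(self, mac, is_zwave_908=False):
        self.mac = mac
        self.seq = 0
        self.max_seq = 0x0F if is_zwave_908 else 0xFF   # 4-bit vs. 1-byte counter

    def next_seq(self):
        self.seq = self.seq % self.max_seq + 1          # wraps back to 0x01 / 0x1
        return self.seq

    def mimic(self, real_packet):
        # dummy payload of the same length; the inter-packet gap would be drawn
        # from the timing distribution learned offline (uniform here for simplicity)
        return {
            "src": self.mac,
            "seq": self.next_seq(),
            "length": real_packet["length"],
            "delay": random.uniform(0.01, 0.05),
        }

decoy = Decoy(mac="24:fd:5b:00:00:01")
fake = decoy.mimic({"length": 54})
print(fake["length"], fake["seq"])   # 54 1
```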
Privacy Analysis of the Privacy Enhancement Module In privacy inference
attacks, the attacker will leverage the side-channel leakage to infer (1) the devices
from which the observed traffic is generated, (2) the events that are associated with
the devices, and (3) the SmartApps that communicate with the devices with these
events. The decoy devices introduced by the Privacy Enhancement Module enhance
the privacy of the smart home environment by obfuscating the first step of the attack,
hence fundamentally thwarting the inference attacks. To evaluate the effectiveness
of our defense, we measure the entropy of each device to quantify the inference
difficulty.
For simplicity of presentation, we consider the case in which we have only one
smart device s_0; the analysis can be easily extended to the case of multiple real
devices. The Privacy Enhancement Module deploys k decoys S = {s_1, s_2, . . . , s_k}
for obfuscation. Therefore, the attacker will observe k + 1 devices S^+ = {s_0} ∪ S
in the network. In one time unit (e.g., a day), we assume that the real device s_0
generates a sequence of w_0 events E^0 = {e_1^0, e_2^0, . . . , e_{w_0}^0} and that
each decoy device s_i generates a sequence of w_i events
E^i = {e_1^i, e_2^i, . . . , e_{w_i}^i}. Let a random variable X ∈ S^+ represent the
random process in which the attacker sniffs the wireless traffic at a random time of
the day and captures packets that correspond to an event e^X generated by one of the
devices in S^+. We denote X = s_i when e^X ∈ E^i. Therefore,
P(X = s_i) = P(e^X ∈ E^i) = w_i / Σ_{j=0}^{k} w_j. The device entropy is defined as
follows:

    ε(S^+) = − Σ_{i=0}^{k} P(X = s_i) log_2 P(X = s_i).    (6.2)

From this equation, the level of obfuscation is determined not only by the number
of decoy devices that the Privacy Enhancement Module introduces but also by the
numbers of events generated by the decoys and the real device. The entropy ε(S^+)
reaches its maximum value when w_i / Σ_{j=0}^{k} w_j = 1/(k + 1) for all i and its
minimum value when the events from the real device are dominant (i.e., P(X = s_0) = 1).
We will empirically evaluate the effectiveness of the Privacy Enhancement Module in
the next section.
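Eq. (6.2) can be checked numerically. With k = 3 decoys whose event rates exactly match the real device’s, the entropy reaches its maximum of log2(k + 1) = 2 bits; when the real device dominates, it collapses toward 0. The event counts below are made up for illustration.

```python
# Numeric sketch of Eq. (6.2): device entropy for one real device plus k decoys,
# given per-device event counts w_i over one time unit.

from math import log2

def device_entropy(event_counts):
    total = sum(event_counts)
    return -sum((w / total) * log2(w / total) for w in event_counts if w > 0)

# k = 3 decoys perfectly matching the real device's event rate:
# entropy hits its maximum log2(k + 1) = 2 bits.
print(device_entropy([100, 100, 100, 100]))   # 2.0

# The real device dominates: entropy collapses toward 0.
print(round(device_entropy([1000, 1, 1, 1]), 3))
```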
6.4 Evaluation
We implemented HoMonit with a CC2531 USB dongle for sniffing ZigBee and two USRPs
for Z-Wave, as shown in Fig. 6.8a and b. The distance between the SmartThings hub
and HoMonit was about 2 m. As listed in Table 6.3, we chose 30 SmartApps from the
SmartThings Public GitHub Repository [19], which interact with, in total, 7 ZigBee
devices and 4 Z-Wave devices, as shown in Fig. 6.8c and d. The devices were located
less than 10 m away from the hub within a room of 200 m².
Evaluation Metrics We choose true positive rate (TPR), true negative rate (TNR),
precision, recall, and F1-score as the evaluation metrics, and we define misbehavior
as the positive class. Thus, TPR and TNR correspond to the TRR and TAR in the voice
liveness detection task, respectively. The precision of the inference is defined as
the fraction of correctly inferred events over all inferred events; the recall of the
inference is defined as the fraction of successfully inferred events over all events
that have been triggered. The F1-score is the harmonic mean of precision and recall.
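For concreteness, these metric definitions can be written out as follows (generic formulas, not HoMonit’s evaluation code); the example numbers are illustrative.

```python
# Generic precision / recall / F1 definitions as used in the evaluation.

def precision(correct, inferred):      # correctly inferred / all inferred events
    return correct / inferred

def recall(correct, triggered):        # correctly inferred / all triggered events
    return correct / triggered

def f1(p, r):                          # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# e.g., 48 correct inferences out of 50 inferred events and 52 triggered events
p, r = precision(48, 50), recall(48, 52)
print(round(f1(p, r), 3))   # 0.941
```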
Fig. 6.8 Wireless sniffers and smart devices. (a) ZigBee sniffer: TI CC2531 USB Dongle. (b)
Z-Wave sniffer: two USRPs. (c) Seven tested ZigBee devices. (d) Four tested Z-Wave devices
Fig. 6.9 Micro-benchmark: accuracy of event and SmartApp inference. (a) Evaluation of the
burst threshold: F1 score of event inference versus burst threshold, for ZigBee and Z-Wave.
(b) Impact of sniffer location (near the hub, behind 1 wall, behind 2 walls) on the F1 score
The burst threshold is a parameter used to cluster the captured wireless packets of
the same event, and it directly impacts the effectiveness of SmartApp inference. We
performed the following experiment: we randomly selected 4 ZigBee devices and
4 Z-Wave devices and manually triggered each event type 50 times on each of the
8 devices. The time intervals between two consecutive events were 3–10 s. We measured
the accuracy of SmartApp inference with the burst threshold set to integer values
from 0 to 10 s.
As shown in Fig. 6.9a, the F1-score of event inference reaches its maximum when
the burst threshold is 1 s. This is because a smaller burst threshold splits the
packets belonging to the same event, causing more events to be inferred than were
actually triggered, whereas a larger burst threshold merges packets of different
events into one burst, causing some events to be missed by the detector. Therefore,
in the remainder of our evaluation, the burst threshold was set to 1 s.
the recalls are sometimes lower than 1.00. An average F1-score of 0.98 for ZigBee
SmartApps and 0.96 for Z-Wave SmartApps was achieved. Among all 30 tested SmartApps,
the F1-scores of 26 are at least 0.95 (see Table 6.3). This shows that HoMonit can
accurately capture the working logic of SmartApps through DFA matching. It is also
important to point out that the major factors contributing to false inference are
packet loss, unrelated wireless traffic, and traffic jams that split a burst.
As shown in Table 6.4, the average TPR (over 40 ZigBee SmartApps) is 0.98 in
detecting over-privileged accesses, with a standard deviation of 0.03; the average
TNR of ZigBee SmartApps is 0.95, with a standard deviation of 0.07. The detection
of Z-Wave SmartApps achieves similar TPR and TNR, 0.98 and 0.92, respectively.
The major reasons for failed test cases are packet loss and occasional unexpected
wireless traffic, which influence the event inference. Besides, accidental signal
reception delays can break the consistency of frames for a device event, which may
result in false alarms for normal SmartApps.
(Figure: measured device entropy versus the number of decoys, together with the
theoretical upper bound of the entropy)
6.4.5 Discussions
6.4.5.1 Generality and Applicability
inherently suffers from a low-entropy issue. Therefore, this feature gives HoMonit
the opportunity to extract wireless fingerprints for smart devices by analyzing the
packet size and timing information.
We also investigated IFTTT [10], a task automation platform compatible with
SmartThings, to further demonstrate the applicability and generality of HoMonit
in terms of both wireless fingerprint capture and DFA construction. To capture the
wireless fingerprints in IFTTT, we developed an Applet (the IFTTT counterpart of a
SmartApp) that automatically turns on/off the Samsung SmartThings Outlet via the
ZigBee protocol, as shown in Fig. 6.8c. The events in IFTTT present the same
wireless fingerprints as in SmartThings (e.g., 50↓, 47↓, 47↓, 52↓ for switch.on or
switch.off). The reason is that IFTTT employs the same lower-layer protocols as
SmartThings, and the wireless traffic patterns are not affected by the upper-layer
platform.
The core idea of HoMonit is to compare the SmartApp activities inferred from
the encrypted traffic with the expected behaviors dictated by their source code
or UI. Therefore, acquiring the DFA of the benign version of the SmartApp (or
the groundtruth DFA) is critical for the successful deployment of HoMonit. The
simplest way to obtain such groundtruth DFAs is to download them from the official
app market, assuming the market operator has done a good job of vetting all
published SmartApps. Otherwise, a trustworthy third party must step in to vouch for
benign apps, which will help bootstrap HoMonit.
Because some device events may have the same wireless fingerprints, such as
switch.on and switch.off of the Samsung SmartThings Outlet (see Table 6.2), HoMonit
has to keep track of the current state of the device, which in turn is used to infer
the content of the event. However, a potential attack strategy is for a SmartApp to
intentionally send the same command twice to mislead HoMonit. We call this type of
attack a double-sending attack. For example, a misbehaving SmartApp may send the
command switch.off twice, hoping that the two will be confused with a sequence of
switch.off and switch.on. In reality, however, this double-sending attack does not
work, as the communication protocol of SmartThings devices is designed to deal with
duplicated messages. We performed the following experiments: (1) two events
[switch.on and switch.off] were sent by the SmartApp, and (2) two events [switch.on
and switch.on] were sent by the SmartApp. In both experiments, the initial state of
the Outlet was set to off. The first experiment represents a normal case, and the
second represents a double-sending attack. As shown in Table 6.2, the collected
traffic patterns are different: the packets in the first case are (50↓, 47↓, 47↓,
52↓), followed by (50↓, 47↓, 47↓, 52↓). In comparison, those in the second
Upon detecting any misbehavior of a specific SmartApp, HoMonit can alert the users
simply through a text message or work together with existing home safety monitoring
tools (e.g., Smart Home Monitor [22] in SmartThings) to take immediate actions.
For example, HoMonit can generate different alerts based on the detected scenarios,
such as home unoccupied, occupied, or disarmed. In addition, HoMonit can serve as a
building block for enforcing user-centric [25] or context-based [12] security
policies and can integrate with these previously proposed systems to interact with
users.
6.5 Summary
In this chapter, we presented HoMonit, an anomaly detection system for smart home
platforms that detects misbehaving SmartApps. HoMonit leverages the side-channel
information leakage in the wireless communication channel (packet size and
inter-packet timing) to infer the types of events communicated between the smart
devices and the hub, and then compares the inferred event sequences with the
expected program logic of the SmartApps to identify misbehavior. Key components of
HoMonit include techniques to extract the program logic from SmartApps’ source code
or the user interfaces of the SmartThings mobile app, as well as automated DFA
construction and matching algorithms that formalize the anomaly detection problem.
References
3. Black, P.: Levenshtein distance. In: Dictionary of Algorithms and Data Structures (2008)
4. Chen, Y., Trappe, W., Martin, R.P.: Detecting and localizing wireless spoofing attacks. In:
IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications
and Networks (SECON) (2007)
5. Cybertools: Scapy radio (2018). https://bitbucket.org/cybertools/scapy-radio/src
6. D’Arcy, H.: AstBuilder (2018). http://docs.groovy-lang.org/next/html/gapi/org/codehaus/groovy/ast/builder/AstBuilder.html
7. Fernandes, E., Jung, J., Prakash, A.: Security analysis of emerging smart home applications.
In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 636–654 (2016). https://doi.org/
10.1109/SP.2016.44
8. Fernandes, E., Paupore, J., Rahmati, A., Simionato, D., Conti, M., Prakash, A.: FlowFence:
practical data protection for emerging IoT application frameworks. In: USENIX Security
Symposium (USENIX Security) (2016)
9. Honeywell (2013). http://library.ademconet.com/MWT/fs2/L5210/Introductory-Guide-to-Z-
Wave-Technology.pdf
10. IFTTT Inc. (2018). https://ifttt.com/
11. JFR, ABR and NOBRIOT: Z-wave command class specification (2016). http://z-wave.
sigmadesigns.com/wp-content/uploads/2016/08/SDS12657-12-Z-Wave-Command-Class-
Specification-A-M.pdf
12. Jia, Y.J., Chen, Q.A., Wang, S., Rahmati, A., Fernandes, E., Mao, Z.M., Prakash, A.:
ContexIoT: Towards providing contextual integrity to appified IoT platforms. In: The Network
and Distributed System Security Symposium (NDSS) (2017)
13. Jokar, P., Arianpoo, N., Leung, V.C.M.: Spoofing Detection in IEEE 802.15.4 Networks Based
on Received Signal Strength. Elsevier, Amsterdam (2013)
14. mitshell: 802.15.4 monitor (2018). https://github.com/mitshell/CC2531
15. Samsung: SmartThings (2021). https://www.smartthings.com
16. Schaller, K.: List of all officially published apps from the more category of smart setup in the
mobile app (2015). https://community.smartthings.com/t/list-of-all-officially-published-apps-
from-the-more-category-of-smart-setup-in-the-mobile-app-deprecated/13673
17. SmartThings: Capabilities reference (2018). http://docs.smartthings.com/en/latest/capabilities-
reference.html
18. SmartThings: SmartThings architecture (2018). http://docs.smartthings.com/en/latest/architecture/index.html
19. SmartThings: SmartThings public GitHub repo (2018). https://github.com/
SmartThingsCommunity/SmartThingsPublic
20. SmartThings: Web services SmartApps (2018). http://docs.smartthings.com/en/latest/
smartapp-web-services-developers-guide/overview.html
21. SmartThings: What SmartApps are being retired from the marketplace? (2018). https://support.
smartthings.com/hc/en-us/articles/115003072406-What-SmartApps-are-being-retired-from-
the-Marketplace
22. Samsung SmartThings: Smart home monitor (2018). https://support.smartthings.com/hc/en-
us/articles/205380154
23. Taylor, V.F., Spolaor, R., Conti, M., Martinovic, I.: AppScanner: automatic fingerprinting of
smartphone apps from encrypted network traffic. In: IEEE European Symposium on Security
and Privacy (EuroS&P) (2016)
24. Texas Instruments: CC2531: system-on-chip solution for IEEE 802.15.4 and ZigBee applications (2018). http://www.ti.com/product/CC2531
25. Tian, Y., Zhang, N., Lin, Y.H., Wang, X., Ur, B., Guo, X., Tague, P.: SmartAuth: user-centered
authorization for the Internet of Things. In: USENIX Security Symposium (USENIX Security)
(2017)
26. Yang, J., Chen, Y., Trappe, W.: Detecting spoofing attacks in mobile wireless environments.
In: IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications
and Networks (SECON) (2009)
Chapter 7
Conclusion and Future Directions
This monograph focuses on the security issues in the smart home and takes “mobile
device – voice interface – application platform” as its research logic. Its main
contributions are described below.
In the terminal device layer, this monograph proposes a cross-layer privacy attack
and defense scheme for smartphones and other smart terminals. This monograph
first reveals the relationship between the dynamic changes of the wireless signal
at the physical layer and the user’s keystroke input on mobile terminal devices.
Then, we propose WindTalker, a novel attack mechanism that combines side-channel
information from both the physical layer and the network traffic layer to infer the
user’s mobile payment password. To remove the need of traditional attacks to
explicitly deploy sniffer devices around the target users or to compromise the
user’s device, WindTalker introduces a new CSI collection method based on the
Internet Control Message Protocol (ICMP). In addition, this monograph designs a
sensitive-input window identification algorithm based on an IP address pool and
proposes an efficient keystroke inference algorithm based on CSI data. This
monograph verifies the reliability of WindTalker on Alipay, the largest mobile
payment platform in China. The evaluation shows that WindTalker can bypass the
HTTPS encryption deployed by Alipay and successfully infer the user’s payment
password. This monograph studies the influence of many factors, such as distance
and direction, and shows that WindTalker can infer keystrokes with high accuracy
in various scenarios. Finally, this monograph proposes an efficient defense
mechanism against the privacy threat posed by WindTalker. This defense mechanism
can effectively prevent attackers from obtaining accurate CSI data through a CSI
obfuscation method.
In the voice interface layer, to solve the problem that existing two-factor
authentication schemes require users to carry sensor equipment, this monograph
proposes WSVA, a voice liveness detection system based on Wi-Fi signals. Unlike
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 163
Y. Meng et al., Security in Smart Home Networks, Wireless Networks,
https://doi.org/10.1007/978-3-031-24185-7_7
With the further development of smart home technology, new security problems will
continue to emerge, and security research is a dynamic and continuous process.
On the basis of this study, we make the following prospects for future research:
First, in terms of side-channel attack and defense on terminal devices, wireless
sensing media have expanded from Wi-Fi signals to ultrasound, millimeter waves,
visible light, and many other modalities, and in smart home scenarios more and more
devices use these communication protocols. In addition, this research is mainly
aimed at smart terminal devices such as smartphones, while smart devices such as
sensors and controllers using other communication protocols need further
exploration. Therefore, in the future, it is necessary to further study the
security of these ubiquitous terminal devices under various wireless media.
Second, in terms of two-factor authentication based on wireless signals, the WSVA
scheme proposed in this monograph performs well against voice spoofing attacks such
as replay attacks but lacks an efficient defense mechanism against internal attacks,
because the perception capability of Wi-Fi signals is limited. In the future, with
the application of wireless sensing media and technologies with fine-grained
resolution, we can study and build a two-factor authentication mechanism that more
deeply integrates user behavior and interface information.
Third, regarding passive liveness detection based on voice signals, the ArrayID
scheme proposed in this monograph requires users and spoofing devices to provide
genuine voice commands and spoofed voice commands, respectively, to train its
classification model. This restriction inevitably imposes a certain burden on users
in the smart home environment and also limits the deployability of ArrayID. In the
future, researchers need to mine more effective liveness detection features to remove
the dependence on training data and on individual users.
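The training-data burden noted above appears in even the simplest supervised formulation: a classifier over array-fingerprint-style features needs labeled examples of both classes before it can score a new command. The toy nearest-centroid sketch below is a hedged illustration of that dependency, not ArrayID's actual model; the feature vectors are assumed to be already extracted.

```python
import numpy as np

def train_centroids(genuine_feats, spoofed_feats):
    """Both labeled classes are required up front -- the limitation noted above."""
    return genuine_feats.mean(axis=0), spoofed_feats.mean(axis=0)

def is_live(feature, centroids):
    genuine_c, spoofed_c = centroids
    # Classify a new command by its distance to each class centroid.
    return np.linalg.norm(feature - genuine_c) < np.linalg.norm(feature - spoofed_c)

# Toy features: genuine commands cluster near 1.0, spoofed commands near 0.0.
rng = np.random.default_rng(1)
genuine = rng.normal(1.0, 0.1, (20, 8))
spoofed = rng.normal(0.0, 0.1, (20, 8))
centroids = train_centroids(genuine, spoofed)
print(is_live(rng.normal(1.0, 0.1, 8), centroids))  # True: near genuine cluster
```

Removing the `spoofed` argument from `train_centroids` breaks the scheme entirely, which is exactly why training-free liveness features are an open direction.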
Fourth, regarding misbehavior detection based on the wireless traffic side
channel, the HoMonit scheme proposed in this monograph mainly targets smart
applications employing low-power communication protocols such as ZigBee and
Z-Wave. For high-speed communication protocols such as Wi-Fi, further research
on high-speed traffic analysis is still needed.
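To illustrate the traffic side channel in its simplest form: the general approach matches the sizes (and order) of encrypted packets against per-application signatures of legitimate events, and flags traffic that fits no expected pattern. The sketch below assumes simplified exact-subsequence signatures with made-up packet lengths; HoMonit's actual matching is DFA-based and more tolerant of noise.

```python
# Hypothetical per-app signatures: expected encrypted-packet-length sequences
# for each legitimate event the app may trigger (values are illustrative).
SIGNATURES = {
    "smart_lock_app": [(54, 97, 54), (54, 121, 66, 54)],
}

def contains_subsequence(trace, pattern):
    """True if `pattern` occurs as a contiguous run inside `trace`."""
    n = len(pattern)
    return any(tuple(trace[i:i + n]) == pattern for i in range(len(trace) - n + 1))

def detect_misbehavior(app, packet_lengths):
    """Flag a sniffed trace whose lengths match no known event signature."""
    return not any(contains_subsequence(packet_lengths, sig)
                   for sig in SIGNATURES.get(app, []))

print(detect_misbehavior("smart_lock_app", [10, 54, 97, 54, 10]))  # False: benign
print(detect_misbehavior("smart_lock_app", [10, 300, 300, 300]))   # True: anomalous
```

On low-rate protocols such as ZigBee, each application event produces only a handful of packets, so such signatures are tractable; the open problem for Wi-Fi is that high packet rates and background traffic swamp this kind of matching.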
Index

A
Application misbehavior detection, 16, 165
Application platform, v, vi, 1–5, 7–8, 13–14, 17, 28–30, 135–161, 163, 165, 166
Array fingerprint, 18, 108, 109, 112–120, 124, 126, 130, 131, 164

C
Channel state information (CSI), 14, 15, 22–24, 38–54, 56–73, 78–80, 82–94, 96–99, 101–104, 107, 163, 164

H
High-speed traffic analysis, 165

K
Keystroke inference, 23–25, 37–42, 44, 47, 48, 50–61, 65, 72

M
Microphone array, vi, 6, 12, 15–16, 18, 107–131, 164, 165
Misbehavior detection, 4, 6, 16–18, 135–161, 164, 165

P
Passive liveness detection, 11, 12, 15, 17, 18, 25, 27–28, 107–131, 165
Physical layer information, 9, 10, 15, 18, 21–23, 38, 67
Privacy inference, 42–55, 143, 153

S
Side-channel attacks, v, 2–3, 9, 10, 14, 17, 21–25, 30, 37, 38, 41, 69, 70, 73, 151, 165
Signal obfuscation, 14, 17, 69–71
Smart home, v, vi, 1–14, 16–18, 21–31, 37, 40, 66, 77, 81, 85, 97, 100, 102, 107–110, 120, 124, 128, 130, 135–138, 141, 142, 145, 151, 153, 154, 161, 163–165

T
Terminal device, v, vi, 2–6, 9–10, 14–15, 17, 21–25, 30, 37–73, 163, 165
Traffic analysis, 24, 135–161, 165
Two-factor authentication, 11–12, 18, 25, 27, 31, 77–104, 131, 163–165

U
Ubiquitous sensing, 22

V
Voice control system, 26, 77, 102, 103
Voice interface, v, vi, 2–7, 9–12, 14, 15, 17, 18, 25–28, 31, 77–104, 107–131, 163–165
Voice spoofing, v, 3, 4, 11, 12, 15, 18, 25–28, 77, 79, 83, 99, 104, 126, 131, 165

W
Wireless fingerprinting, 160, 161
Wireless side-channel inference, 141, 164