
Wireless Networks

Yan Meng
Haojin Zhu
Xuemin (Sherman) Shen

Security in
Smart Home
Networks
Wireless Networks

Series Editor
Xuemin Sherman Shen, University of Waterloo, Waterloo, ON, Canada
The purpose of Springer’s Wireless Networks book series is to establish the state
of the art and set the course for future research and development in wireless
communication networks. The scope of this series includes not only all aspects
of wireless networks (including cellular networks, WiFi, sensor networks, and
vehicular networks), but related areas such as cloud computing and big data.
The series serves as a central source of references for wireless networks research
and development. It aims to publish thorough and cohesive overviews on specific
topics in wireless networks, as well as works that are larger in scope than survey
articles and that contain more detailed background information. The series also
provides coverage of advanced and timely topics worthy of monographs, contributed
volumes, textbooks and handbooks.
Yan Meng • Haojin Zhu • Xuemin (Sherman) Shen

Security in Smart Home Networks

Yan Meng
Shanghai Jiao Tong University
Shanghai, China

Haojin Zhu
Shanghai Jiao Tong University
Shanghai, China

Xuemin (Sherman) Shen


Electrical and Computer Engineering Dept
University of Waterloo
Waterloo, ON, Canada

ISSN 2366-1186 ISSN 2366-1445 (electronic)
Wireless Networks
ISBN 978-3-031-24184-0 ISBN 978-3-031-24185-7 (eBook)
https://doi.org/10.1007/978-3-031-24185-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

As a typical application of the Internet of Things (IoT), the smart home system is
widely deployed and plays an important role in daily life. In the smart home
environment, smart devices are connected through wireless networks, provide users
with contactless interfaces such as the voice interface, and are managed uniformly
through smart applications. Smart terminal devices (e.g., smartphones, tablets, and
smart sensing devices) provide users with rich functionality and can sense
environmental changes in the smart home in real time. With the popularization of the
voice interface, users can interact with the smart home system without physical
contact, and smart applications can automatically control devices and adjust the
status of the smart home system.
However, current research shows that the smart home network still faces a wide
range of security threats. Firstly, although the wireless communication technology
adopted by terminal devices speeds up network communication and expands
network range, it also exposes users to privacy risks from side-channel attacks.
Secondly, due to the open nature of the voice channel, the voice interface faces
various voice spoofing attacks. Lastly, the smart application platform suffers from
the misbehavior of smart applications. To ensure the security and privacy of a
smart home network, in this monograph we study the corresponding key
technologies following the logic of “terminal device—user interface—application
platform.”
In Chap. 1, we first introduce the growth trend and architecture of the smart
home network; in particular, we describe the three components of the smart
home and their functionalities. In Chap. 2, we review the existing literature on
security and privacy issues faced by the three components of the smart home. More
specifically, we introduce the side-channel attacks faced by the terminal device,
the voice spoofing attacks and countermeasures in the voice interface, and the
misbehavior and defense mechanism for the application platform. In Chap. 3, at
the layer of the terminal device, we study the privacy threats caused by
side-channel attacks targeting the wireless communication protocol and
propose an obfuscation-based countermeasure mechanism. In Chap. 4, at the layer
of the voice interface, we propose a liveness detection scheme named WSVA that
leverages the Wi-Fi signal, which is ubiquitous in the smart home environment.
WSVA uses the wireless Wi-Fi signal to characterize the user’s mouth movement
and then determines whether the voice command received by the voice interface is
an authentic voice command or a spoofing command by judging the consistency
between the user’s mouth movement and the voice signal. Then, in Chap. 5, to
further improve the universality of liveness detection, we propose a passive
liveness detection scheme named ArrayID that depends only on the collected voice
signal. ArrayID uses the microphone array commonly found in smart speakers
to achieve more robust liveness detection performance. In Chap. 6, at the layer of
the smart application platform, to solve the threat of application misbehavior, we
propose a third-party anomaly detection system HoMonit. By analyzing the side-
channel information of wireless communication traffic, HoMonit can accurately
detect the application’s misbehavior. Finally, in Chap. 7, we summarize the main
context of this monograph and introduce the possible future research directions.
We hope that this monograph provides insights into security in smart home
networks, including terminal device security, voice interface security, and
application platform security. We would like to thank
Prof. Xiaohui Liang at the University of Massachusetts at Boston, Prof. Yao Liu
at the University of South Florida, Prof. Yinqian Zhang at Southern University
of Science and Technology, Prof. Yuan Tian at the University of California, Los
Angeles, Jin Li at Guangzhou University, and Prof. Xiaokuan Zhang at George
Mason University for their contributions in this monograph. We would also like
to thank all members of BBCR group, University of Waterloo and NSEC group,
Shanghai Jiao Tong University for their valuable suggestions and comments.
Special thanks to the staff at Springer Nature, Mary E. James, Brian Halm, and
Bakiyalakshmi RM for their help throughout the publication preparation process.
This work is also supported by the National Natural Science Foundation of China
(62132013, 61972453).

Shanghai, China Yan Meng


Shanghai, China Haojin Zhu
Waterloo, ON, Canada Xuemin (Sherman) Shen
November 2022
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Era of Smart Home and Its Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Architecture of Smart Home Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Terminal Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Voice Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Application Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Security and Privacy Challenges in Smart Home Network . . . . . . . . . . . 8
1.3.1 Terminal Device Layer: Privacy Leakages . . . . . . . . . . . . . . . . . . . . 9
1.3.2 Voice Interface Layer: Spoofing Attacks . . . . . . . . . . . . . . . . . . . . . . 10
1.3.3 Application Platform Layer: Misbehavior . . . . . . . . . . . . . . . . . . . . . 13
1.4 Aims and Organization of This Monograph . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.1 Aims of This Monograph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.2 Organization of This Monograph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Literature Review of Security in Smart Home Network . . . . . . . . . . . . . . . . 21
2.1 Side-channel Attacks Faced by Terminal Device . . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 Attacks Based on Physical Layer Side-channel
Information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Attacks Based on Network Layer Side-channel
Information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.3 Other Side-channel Attack Manners . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Voice Spoofing Attacks in Voice Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.1 Voice Spoofing Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Two-Factor Authentication-Based Liveness Detection . . . . . . . 26
2.2.3 Voice Signal-Based Passive Liveness Detection . . . . . . . . . . . . . . 27
2.3 Misbehaviors in Application Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Misbehaviors of Smart Home Applications . . . . . . . . . . . . . . . . . . . 28
2.3.2 Defense Mechanisms against Misbehaviors . . . . . . . . . . . . . . . . . . . 29
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


3 Privacy Breaches and Countermeasures at Terminal Device Layer . . . 37


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Background Knowledge and Attack Principle . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.1 Channel State Information of Wireless Signal . . . . . . . . . . . . . . . . 39
3.2.2 Attack Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3 Principles of Privacy Inference Attack. . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 System Design of Cross-layer Privacy Inference Attack . . . . . . . . . . . . . . 43
3.3.1 ICMP-based CSI Acquirement Module . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 Sensitive Input Window Recognition Module . . . . . . . . . . . . . . . . . 46
3.3.3 Data Preprocessing Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4 Keystroke Inference Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4 System Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4.1 System Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.2 CSI-based Keystroke Inference Performance . . . . . . . . . . . . . . . . . 56
3.4.3 Impact of Various Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 When CSI Meets Public Wi-Fi: A Case Study in Mobile
Payment Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.6 Countermeasures and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.6.1 Fundamental Countermeasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.6.2 Proposed Signal Obfuscation Based Countermeasure . . . . . . . . 69
3.6.3 Limitations and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4 Wireless Signal Based Two-Factor Authentication at Voice
Interface Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Preliminaries and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2.1 Attack Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.2 User’s Articulatory Gesture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2.3 The Influence of Mouth Movement on Wireless Signals . . . . . 82
4.3 WSVA: Wireless Signal Based Two Factor Authentication . . . . . . . . . . . 85
4.3.1 Data Collection Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.2 Data Cleansing and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.3.4 Feature Matching Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4 Performance Evaluation of WSVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.4.1 System Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.2 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.3 Impact of Various Factors on WSVA’s Performance. . . . . . . . . . 98
4.5 Limitations and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5 Microphone Array Based Passive Liveness Detection at Voice


Interface Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.2 Preliminaries and Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2.1 Voice Command Generation and Propagation in
Smart Speaker Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2.2 Passive Liveness Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.3 Motivation: Array Fingerprint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3 ArrayID: Array Fingerprint Based Passive Liveness Detection . . . . . . 114
5.3.1 Multi-channel Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3.2 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.4 Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.4.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.4.2 Performance of ArrayID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.4.3 Impact of Various Factors on ArrayID . . . . . . . . . . . . . . . . . . . . . . . . 124
5.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.5.1 User Enrollment Time in Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.5.2 Handling Users Providing Incomplete Training Dataset . . . . . . 128
5.5.3 Limitations and Countermeasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.5.4 Comparison with WSVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6 Traffic Analysis Based Misbehavior Detection at Application
Platform Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.2 Preliminaries and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2.1 Samsung SmartThings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2.2 Communication Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.2.3 Application Misbehavior in Platform . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.2.4 Motivation: Monitoring SmartApp’s Behaviors
Based on Wireless Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.3 System Design of Misbehavior Detection System . . . . . . . . . . . . . . . . . . . . 142
6.3.1 DFA Building via SmartApp Analysis. . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.2 Misbehavior Detection Based on Wireless Traffic Analysis . . 145
6.3.3 Privacy Enhancement Based on Dummy Traffic . . . . . . . . . . . . . . 151
6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.4.1 Micro-benchmark: Inference of Events and SmartApps . . . . . . 154
6.4.2 Detection of Over-Privileged Accesses . . . . . . . . . . . . . . . . . . . . . . . . 157
6.4.3 Detection of Event Spoofing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.4.4 Evaluation of the Privacy Enhancement Module . . . . . . . . . . . . . . 159
6.4.5 Discussions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

7 Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163


7.1 Conclusion on Security and Privacy in Smart Home . . . . . . . . . . . . . . . . . . 163
7.2 Open Research Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Acronyms

CFR Channel Frequency Response


COTS Commercial Off-the-Shelf
CSI Channel State Information
DFA Deterministic Finite Automaton
DTW Dynamic Time Warping
DWT Discrete Wavelet Transform
EER Equal Error Rate
FAR False Acceptance Rate
FRR False Rejection Rate
ICMP Internet Control Message Protocol
IKI In-band keystroke inference
IoT Internet of Things
LOS Line of Sight
LPC Linear Prediction Coding
LPCC Linear Prediction Cepstrum Coefficients
MFCC Mel-Frequency Cepstral Coefficients
MIMO Multiple-Input Multiple-Output
NN Neural Network
OFDM Orthogonal Frequency Division Multiplexing
OKI Out-band keystroke inference
PANID Personal Area Network ID
PCA Principal Component Analysis
PIN Personal Identification Number
ROC Receiver Operating Characteristic
RSS Received Signal Strength
SSID Service Set Identifier
STFT Short-Time Fourier Transform
SVM Support Vector Machine
TAR True Acceptance Rate
TNR True Negative Rate
TPR True Positive Rate


TRR True Rejection Rate


TTS Text to Speech
USRP Universal Software Radio Peripheral
Chapter 1
Introduction

1.1 Era of Smart Home and Its Challenges

With the development of IoT techniques, the smart home system is becoming
increasingly popular. With the help of smart home technology, users can connect
all kinds of smart devices together to realize home automation, remote control,
programmable control, and other functions. For example, users can achieve
comprehensive information interaction with, and behavior management of, smart
household appliances (e.g., smart refrigerators, microwave ovens, and washing
machines), lighting systems, temperature regulation systems (e.g., air conditioners
and heaters), and various security systems (e.g., access control and alarm systems).
According to the report Smart Home Market with COVID-19 Impact Analysis by
Product, released on July 1, 2020 by Research and Market, one of the world’s
largest market research institutions, the global smart home market reached US
$78.3 billion in 2020 and is predicted to grow at an annual rate of 11.6% to reach
US $135.3 billion by 2025 [23].
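As a quick sanity check, the two reported figures are consistent with compound growth at the stated rate (the small gap comes from rounding of the growth rate):

```python
# Check: does 11.6% annual growth take the 2020 market size of
# US $78.3 billion to roughly the projected US $135.3 billion by 2025?
base_2020 = 78.3          # market size in 2020, US $ billion
annual_growth = 0.116     # 11.6% per year
years = 5                 # 2020 -> 2025

projected_2025 = base_2020 * (1 + annual_growth) ** years
print(f"Projected 2025 market size: US ${projected_2025:.1f} billion")
# Comes out near US $135.5 billion, close to the reported $135.3 billion
```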
There are several differences between the smart home network and traditional
IoT architectures. On the one hand, in the communication mode, most of the devices
in traditional IoT systems are connected through wired communication cables, while
the smart devices in the smart home network (e.g., the light bulb control system,
the user access control system, the smart home medical system, and the smart
kitchen system) are connected to each other through wireless networks. On the other
hand, in terms of platform compatibility, unlike traditional specialized IoT systems,
the smart home presents higher compatibility. In order to connect smart devices
from different device manufacturers, some major smart home manufacturers have
developed smart home application platforms. In these platforms, devices developed
by different hardware manufacturers are abstracted uniformly (a.k.a. device
abstraction). This abstraction means that developers only need to know a device’s
functions and properties, not its physical details, so they can easily design the
corresponding smart applications to automatically control various smart devices.
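The device abstraction idea can be sketched in a few lines of Python. This is an illustrative sketch only: the capability and class names below are hypothetical and do not come from any real smart home platform.

```python
# Hypothetical sketch of device abstraction: a smart application sees only
# an abstract capability, never the vendor-specific hardware details.
from abc import ABC, abstractmethod

class SwitchCapability(ABC):
    """Abstract 'switch' capability: apps see on()/off() only."""
    @abstractmethod
    def on(self) -> str: ...
    @abstractmethod
    def off(self) -> str: ...

class ZigbeeBulb(SwitchCapability):
    """One vendor's bulb; the radio protocol stays behind the capability."""
    def on(self) -> str:  return "zigbee: bulb on"
    def off(self) -> str: return "zigbee: bulb off"

class WifiPlug(SwitchCapability):
    """Another vendor's device exposing the same abstract interface."""
    def on(self) -> str:  return "wifi: plug on"
    def off(self) -> str: return "wifi: plug off"

def lights_out(devices):
    # A smart app controls heterogeneous hardware through one interface.
    return [d.off() for d in devices]

print(lights_out([ZigbeeBulb(), WifiPlug()]))
```

The application logic (`lights_out`) never changes when a new vendor implements the same capability, which is exactly what lets third-party developers ignore physical details.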


Currently, popular smart home platforms include Samsung SmartThings [25],
Apple HomeKit [4], Amazon Alexa [3], and TmallGenie [2].
As a typical system integrating the information world and the physical world, the
characteristics of the smart home are reflected in the following three aspects. Firstly,
at the terminal device level, a series of wireless communication protocols (e.g., Wi-
Fi, ZigBee, and Z-Wave) are widely used in order to enable different devices to work
together. Massive numbers of devices based on these wireless protocols can be
interconnected with local gateways (e.g., base stations and hubs). For example, Google’s smart
home platform can support more than 1500 smart devices from 200 manufacturers
to work together. Secondly, at the user interface level, with the development
of artificial intelligence technology, the smart home system provides a variety
of interaction modes, including voice interaction, image sensing, and wireless
communication. Among them, the voice interface has become the mainstream
interface of smart homes due to its properties of non-contact and convenience.
At present, smart speakers equipped with smart voice interfaces (e.g., Amazon
Echo and TmallGenie) have become the hub of mainstream smart home platforms.
Finally, at the application platform level, users can leverage smart applications
running in the cloud to realize automatic control of smart devices in the platform.
Using smart applications reduces the hardware complexity and energy consumption
of smart devices. The smart home also allows third-party developers to develop
customized applications to achieve special functions, further stimulating a large
number of developers to participate in the development of smart home systems. In
short, in the smart home system, smart devices perceive the physical world and
transmit the perceived information via wireless communication to smart applications
for processing, while the voice interface enables lightweight human–computer
interaction; together they realize information processing and environment control.
However, with the large-scale deployment of the smart home system, its security
threats and privacy risks have also received increasing attention. As a cyber-physical
system featuring ubiquitous device connectivity, frequent human–computer
interaction, and automated operation, the smart home not only faces the security
risks of traditional IoT systems but also severe security threats at the three levels
of “terminal device—user interface—application platform.”
1. Side-channel attacks at the terminal device level. In a smart home system,
various heterogeneous devices (e.g., resource-limited IoT smart devices, smart-
phones, tablets) are vulnerable to a series of attacks. In addition to the traditional
jamming attack that interferes with the wireless communication of smart devices,
the wireless communication process of many smart devices (e.g., door and
window sensors, alarms) is vulnerable to side-channel attacks. For instance, due
to the demand for low energy consumption, some smart home devices do not
deploy encryption protocols, which makes their control commands easy for
attackers to crack. Choi et al. [11] pointed out that the command protocol of the
screen doors of the Seoul Metropolitan Subway can be reverse-engineered, posing
significant public safety risks. Besides, the physical-layer information generated
during the wireless communication of smart terminal devices (e.g., smartphones)
can also be exploited by an attacker using wireless sensing techniques to obtain
sensitive information such as user motion [31].
2. Voice spoofing attacks at the voice interface level. While providing conve-
nience for users, the voice interface in the smart home is also vulnerable to
spoofing attacks from unauthenticated voice commands due to the openness of
the voice channel. For example, in a replay attack, an attacker can cheat the
voice interface by recording the voice samples of a legitimate user in advance
and then playing them back with a high-quality speaker [12]. In addition, due to
the defects of the speech recognition algorithm and the voice interface hardware,
some new and hidden attacks are also proposed. For example, Carlini et al. [7]
proposed a hidden command attack, which can successfully cheat the speech
recognition algorithm by converting the voice command into noise-like audio
while retaining the features required for speech recognition. Zhang et al. [34]
proposed “Dolphin attack” based on the hardware defects of the voice interface.
This attack embeds the voice signal into the ultrasonic signal that cannot be
detected by the human ear without causing the user’s alarm. In the voice spoofing
attack, the attacker can not only query the user’s sensitive information and
perform sensitive operations (e.g., online shopping, making malicious calls) but
also force the smart device to perform improper actions (e.g., opening the door
lock when the user leaves home), which poses a serious threat to the security and
privacy of the user’s smart home system.
3. Application misbehavior at the application platform level. In order to manage
heterogeneous IoT services, most smart home application platforms support
third-party smart applications. The smart application runs in the cloud backend,
monitors the status of the smart device (e.g., smart motion sensor), and triggers
some operations of another smart device (e.g., turning on smart lights) when
receiving the notification of certain device events (e.g., the motion sensors detect
the user’s activities). However, the access control of the application platform is
imperfect, which opens the door to malicious attacks. Fernandes et al. [14] found
that, due to the coarse-grained authorization of the device capability model, smart
applications in the Samsung SmartThings platform are often over-privileged.
More specifically, smart applications can obtain more permissions than their
users intend to grant them. For example, the lock command
lock.on() and the unlock command lock.off() of the smart door lock are often
authorized to the smart application together. In this case, the application may
cause a security threat to the user by abusing the command (e.g., using the
command lock.off() to unlock the door lock). In addition, malicious applications
can also disrupt the normal operation of the user’s smart home system by forging
device events in the cloud (e.g., falsely reporting the status of smart sensors to
trigger the operation of smart alarms).
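The over-privilege problem in item 3 can be sketched in a few lines of Python. This is a hypothetical permission model built for illustration, not the actual SmartThings authorization code: an application that requests only lock.on() is nevertheless granted the whole lock capability, so it can also invoke lock.off().

```python
# Hypothetical sketch of coarse-grained (capability-level) authorization,
# illustrating the over-privilege problem; not the real SmartThings API.

# Commands grouped by capability: granting any one grants the whole group.
CAPABILITIES = {"lock": {"lock.on", "lock.off"}}

def grant(requested_commands):
    """Coarse-grained grant: expand each request to its full capability."""
    granted = set()
    for cap, commands in CAPABILITIES.items():
        if commands & requested_commands:   # app asked for any command in cap
            granted |= commands             # ...and receives all of them
    return granted

# A SmartApp that only needs to lock the door...
granted = grant({"lock.on"})
# ...is nevertheless able to unlock it as well.
print("lock.off" in granted)  # over-privileged: prints True
```

A fine-grained model would instead grant exactly the requested command set, which is the direction later defense mechanisms take.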

It can be seen that security and privacy issues at the three levels of “terminal
device—voice interface—application platform” have become central concerns
in the development of the smart home. Therefore, to enhance smart home
security, this monograph analyzes and studies the following key technologies: side-
channel privacy risk at the terminal device layer, voice spoofing attacks and defense
methods in the voice interface layer, and the misbehavior detection system in the
application platform layer.

1.2 The Architecture of Smart Home Network

This monograph mainly studies the smart home’s security and privacy issues in the
terminal device layer, user interface layer, and application platform layer. Then, we
build cross-layer reliable security protection technologies. This section introduces
the smart home system architecture.
As shown in Fig. 1.1, a typical smart home system consists of three components:
smart terminal device, user interface, and smart application platform. Each component
is introduced in turn below.

1.2.1 Terminal Device

Heterogeneous IoT terminal devices occupy a key position in the smart home
network. In the human-centered smart home environment, devices are endowed
with diverse functions. Unlike traditional electronic devices, smart devices support

Fig. 1.1 The architecture of the smart home system



wireless network communication with diverse communication protocols (e.g., Bluetooth, Wi-Fi, ZigBee, and Z-Wave). Therefore, the emergence of smart devices
promotes the development of smart home systems. In order to realize the perception
of the smart home environment and corresponding operations, smart terminal
devices are mainly divided into the following four types: sensor, actuator, hub, and
smart terminal.
Sensor In the smart home network, sensors can sense the information change (e.g.,
environmental factor and motion) and report this information to trigger subsequent
operations by other devices. For example, the motion sensor in the smart home can
sense the motion state of the user in the home, and the temperature sensor can sense
the temperature change in the home.
Actuator The actuator in the smart home is generally controlled by the smart
application deployed on the cloud, and its function is to change the environment
state. When the smart application in the cloud backend receives the information
reported by the sensor, it will perform corresponding processing according to
the pre-defined working logic and send the final command to the actuator. The
actuator then executes the command to change the environment state. Take a smart
temperature regulator as an example. When the backend receives a temperature
change reported by the temperature sensor, it sends a command asking the smart
temperature regulator to operate so as to maintain the indoor temperature.
Hub Because various sensors and actuators in the smart home environment use
different wireless communication standards, a hub is required to act as the brain of
the wireless network. A hub usually supports a variety of communication standards
(e.g., Wi-Fi, ZigBee, Z-Wave), thus allowing different heterogeneous devices to
connect with it. Meanwhile, the hub also acts as an intermediary between general
smart devices (i.e., sensors and actuators) and smart application platforms. Hubs
such as Amazon Echo and Samsung SmartThings hub exist in the form of smart
home routers and are placed near smart devices. Usually, when the hub receives
wireless data packets sent by the sensor, it will transmit the semantic information
to the cloud. Then, according to the feedback from the cloud, the hub transmits the
command to the actuator. In addition, the voice interface of the smart home (details
are described in Sect. 1.2.2) is often loaded into the hub. For example, the voice
interface of the Amazon Alexa platform is embedded into the Amazon Echo hub.
Smart Terminal Device In the smart home environment, smart terminal devices
(e.g., smartphones and smart tablets) refer to mobile devices by which users can
configure and control the smart home environment. A common terminal device is a
smartphone. The user can control the smart devices in the smart home by installing
the relevant application programs on the smartphone and can leverage the smart
application to control the corresponding smart devices. For example, the user can
install Samsung SmartThings or Amazon Alexa applications on the smartphone to
configure the application in the smart home system, read the sensor information
(e.g., reading the temperature data collected by the temperature sensor), and directly
operate the actuator through the smart application (e.g., directly opening the smart
door and smart heater).
Among the above smart devices, smart terminal devices such as smartphones
involve the direct interaction of users and are often related to the private information
of users. However, sensors, actuators, and hubs are often controlled by smart
applications. Therefore, this monograph studies the side-channel privacy attack
mechanism and defense strategies related to smart terminals and the misbehavior
detection schemes related to other smart devices.

1.2.2 Voice Interface

In smart home systems, users need to interact with smart devices frequently, so
a user-friendly and safe human–computer interaction mode is very important. In
addition to the traditional program interface mounted on smart terminal devices,
with the development of emerging artificial intelligence technology, users can
control smart devices and cloud platforms through various interfaces (e.g., voice
interaction and image sensing). Among various interfaces, the voice interface has
become the mainstream method for controlling smart home devices due to its
convenience, ease of collection, and high recognition accuracy [5]. According
to the report Voice Assistant Market—Information by Technology, Hardware and
Application—Forecast till 2025 published by Market Research Future in 2020, the
voice assistant will become the most important IoT user interface, with its global
market value reaching 1.68 billion US dollars in 2019 and predicted to reach 7.3
billion US dollars by 2025 [21].
Figure 1.2 shows a typical voice interface scenario in the smart home. At
present, smart speakers equipped with smart voice interfaces have become the hub
of mainstream smart home platforms. For example, in the Amazon Alexa smart
home platform, Amazon Echo smart speakers not only act as the voice interface
but also integrate the hub’s functions. As long as the user is within the sound-receiving
range of the smart speaker, they can remotely control household appliances or
query information. In order to improve the receiving quality of the user’s voice
commands, currently, the smart speaker uses a microphone array. In addition, the
use of microphone arrays helps determine the direction of users to provide more
fine-grained services. After receiving the audio information, the smart speaker will
report the audio information to the cloud platform. The cloud platform uses voice

Fig. 1.2 An illustration of the voice interface in the smart home scenario

recognition algorithms to analyze the text information of users’ voices and guides
the operation of smart devices. For example, when the user says the command “open
the window,” the Amazon Echo device transmits the audio to the cloud platform.
After analysis, the smart application sends instructions to the corresponding
window actuator to open the windows.
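The flow just described (the speaker uploads audio, the cloud recognizes the text, and an application maps the text to a device command) can be sketched as below. The recognizer, rule table, and device names are illustrative stand-ins, not a real platform API:

```python
# Minimal sketch of the voice-interface pipeline (all names hypothetical):
# smart speaker -> cloud speech recognition -> smart application -> device.

def recognize(audio):
    # Stand-in for the cloud ASR model: here "audio" is already its transcript.
    return audio.lower().strip()

RULES = {
    "open the window": ("window_actuator", "open"),
    "turn on the light": ("light", "on"),
}

def handle_voice_command(audio):
    text = recognize(audio)
    if text in RULES:
        device, action = RULES[text]
        return f"{device}:{action}"       # command relayed through the hub
    return None                           # unrecognized commands are dropped
```

The point of the sketch is the trust chain: whatever text the recognizer emits is executed, which is exactly what the spoofing attacks in Sect. 1.3.2 exploit.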
At present, smart voice interfaces (e.g., Amazon Echo [20], Google Home [28],
TmallGenie smart speaker [2]) have been widely used in real life. The user can
perform operations such as making a phone call, paying online, sending information,
and opening an application through the voice interface. These operations include
many sensitive operations (e.g., online shopping, querying bank app information),
so the security of the voice interface is particularly important.

1.2.3 Application Platform

With the popularization of smart homes, the following two key problems need to
be solved. Firstly, smart devices from different manufacturers need to be able to
interact with each other. Secondly, in order to realize household intelligence and
automation, it is necessary to formulate standardized control logic for different
devices. The smart application platform plays an important role in it. The platform
can abstract devices adopting different communication protocols and manufacturing
methods and provide a simple and easy-to-use development interface for third-
party developers. As shown in Fig. 1.3, the characteristics of the smart application
platform are summarized as follows.
Abstracting Heterogeneous Devices in the Application Platform In the smart
home system, because smart home devices such as smart sensors and actuators are
mostly power-constraint devices, they cannot undertake complex computing, so the
application platform deployed in the cloud handles most of the computing tasks.
In order to enable smart devices of different manufacturers and communication
protocols to work together, the smart application platform abstracts smart devices
to realize the separation between devices and applications. Specifically, as shown
in Fig. 1.3, in the cloud platform, smart devices retain their basic attributes, such as

Fig. 1.3 An illustration of the smart application platform
name and function, and the same functions from different devices are given the same
function attributes in the platform. For example, in the Samsung SmartThings smart
home platform, ZigBee door and window sensors and Z-Wave door and window
sensors are connected to the cloud through a hub, and their switching functions are
abstracted as motion.active. This allows the user to treat the device as a black box
without knowing the details of the device so as to develop relevant applications
conveniently.
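The abstraction layer can be pictured with the following sketch: devices using different radio protocols expose the same platform-level attribute to applications, so application code never sees the protocol. The class and attribute names are illustrative, not the real SmartThings object model:

```python
# Sketch of device abstraction: heterogeneous sensors are mapped onto one
# uniform attribute name, hiding the radio protocol from applications.

class AbstractDevice:
    def __init__(self, name, protocol, attributes):
        self.name = name
        self.protocol = protocol          # hidden from applications
        self.attributes = attributes      # uniform capability attributes

    def read(self, attribute):
        return self.attributes[attribute]

zigbee_sensor = AbstractDevice("door-1", "ZigBee", {"motion.active": False})
zwave_sensor = AbstractDevice("door-2", "Z-Wave", {"motion.active": True})

def any_motion(devices):
    # Application code depends only on the abstract attribute name.
    return any(d.read("motion.active") for d in devices)
```

An application written against `motion.active` works identically for the ZigBee and Z-Wave sensors, which is the "black box" property described above.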
Automatically Running Application-Driven Devices In the smart application
layer, developers develop customized applications and deploy them in the cloud
to realize remote control of smart devices in the physical world. Users can install
customized applications to achieve specific functions. Smart applications generally
follow the control logic of “if-this-then-that” and control various smart devices
through hubs. For example, a rule may specify that when the user enters the home,
the lamp is turned on. On
the other hand, developers can make smart devices more intelligent by building
diverse smart applications. At present, smart application platforms have become
very popular. In different platforms, the names of smart applications are slightly
different. For example, Samsung SmartThings, Amazon Alexa, and TmallGenie
refer to smart applications as SmartApps, skills, and voice skills, respectively, but
these applications follow similar working logic.
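The "if-this-then-that" control logic can be sketched as a trigger–action rule engine; the event and command names below are illustrative only:

```python
# Sketch of "if-this-then-that" automation: an event reported by a sensor
# triggers a pre-defined action on an actuator (names are illustrative).

class Rule:
    def __init__(self, trigger_event, action):
        self.trigger_event = trigger_event
        self.action = action

    def fire(self, event, log):
        if event == self.trigger_event:
            log.append(self.action)

rules = [Rule("user.entered_home", "lamp.on"),
         Rule("temperature.high", "ac.on")]

def dispatch(event):
    # The hub forwards each incoming event to every installed rule.
    log = []
    for rule in rules:
        rule.fire(event, log)
    return log
```

Because actions fire purely on matching event names, whoever can inject an event controls the actuators, which foreshadows the event-spoofing risk discussed below.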
In short, by abstracting smart devices and driving them intelligently, the smart
application platform has flourished. However, users may install malicious
applications published by attackers, which brings serious security risks. Therefore,
it is urgent to detect and monitor the application’s misbehavior in real time.

1.3 Security and Privacy Challenges in Smart Home Network

As shown in Fig. 1.4, there are different attack manners in the smart home.
According to the smart home architecture described in Sect. 1.2, this monograph

Fig. 1.4 The security challenges faced by the smart home system

reviews and summarizes the relevant security research status from the three levels
of “terminal device—voice interface—application platform.”

1.3.1 Terminal Device Layer: Privacy Leakages

This subsection first introduces the relevant research around the terminal device
level and summarizes the research challenges. For the smart home’s terminal device
layer, traditional security threats mainly target the device itself, including
jamming attacks at the signal level and intrusion attacks at the physical device
level. In the smart home scenario, there are more serious
user privacy attacks faced by terminal devices. This monograph will focus on the
wireless side-channel attack, a new privacy threat to terminal devices based on the
IoT architecture. Side-channel attack is a form of indirect attack. It attacks the target
user or system based on side-channel information that is not directly related to the
target. In the smart home network, due to the application of wireless communication,
different devices can work together automatically. However, due to the openness of
the wireless channel, it is very easy for attackers to sniff the communication and
use the side-channel information to threaten system security and user privacy. The
academic research on side-channel attacks mainly includes physical layer wireless
sensing attacks and network layer information inference attacks. Their contents are
summarized as follows:
• Side-channel attacks in the physical layer. There are a large number of wireless
devices in the smart home system. However, since the attenuation, phase change,
and other information experienced by the wireless signal in the propagation
process are closely related to the movement of people in the smart home
environment, the attacker can reverse-guess the user’s privacy information. At
present, attackers can use physical layer information of various signals, including
Wi-Fi, ultrasound, visible light, and millimeter wave, as side-channel information
to carry out attacks. For example, Wang et al. [31] showed that the user’s
movement (e.g., running, jumping, and lying down) can be monitored by using the
Wi-Fi signal collected by an Intel 5300 network card, which poses a threat to the
user’s privacy. Ma et al. [19] recognized the user’s gesture near solar equipment by
analyzing the pattern of photocurrent. Li et al. [18] proposed WaveSpy, which
can remotely collect the state response of the LCD screen and remotely read
the screen content through the millimeter wave probe. Privacy attack mechanism
based on the side-channel information brings a serious security threat to the smart
home network.
• Side-channel attacks in the network layer. Researchers have found that even if
the communication content is encrypted, information related to user privacy can
still be obtained by analyzing the network layer traffic. Taking the commonly
used SSL/TLS mechanism as an example, although there is no technology to
directly crack the TLS protocol at present, by analyzing the traffic of the TLS
protocol, we can obtain information about the sender, the receiver, and the sent
content without decrypting the payload content. For example, Li et al. [17]
pointed out that encrypted traffic generated in the communication process can be
used by attackers to guess the user’s gender, age, and education level. Panchenko
et al. [22] pointed out that by analyzing the inflow and outflow characteristics of
user data packets, it is possible to infer which of more than 100 websites a user
has visited with 93% accuracy. The above facts show that even if the traffic is
encrypted, a large amount of information can still be obtained through analyzing
the metadata.
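Even with TLS, packet sizes and counts remain visible to an eavesdropper, and metadata alone can fingerprint activity. The sketch below illustrates this with purely synthetic traces and a naive nearest-fingerprint rule; real traffic-analysis attacks use far richer features and classifiers:

```python
# Sketch of network-layer side-channel inference: classify an encrypted
# trace by comparing its packet-size profile to known fingerprints.
# Traces are synthetic; payload contents are never inspected.
from statistics import mean

def profile(trace):
    # Feature vector from metadata only: mean packet size and packet count.
    return (mean(trace), len(trace))

FINGERPRINTS = {
    "video_stream": profile([1400, 1400, 1350, 1400]),
    "sensor_report": profile([90, 110, 95]),
}

def classify(trace):
    size, count = profile(trace)
    best, best_d = None, float("inf")
    for label, (fsize, fcount) in FINGERPRINTS.items():
        d = abs(size - fsize) + abs(count - fcount)
        if d < best_d:
            best, best_d = label, d
    return best
```

Because only sizes and counts are used, neither TLS nor WPA encryption interferes with this kind of inference.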
However, there are still serious security challenges in the side-channel privacy
attack and defense of the physical layer and the network layer of the terminal
device. In essence, the current research on side-channel privacy attacks is relatively
isolated in the physical layer and the network layer. In the physical layer, due to the
lack of information from the network layer, it is impossible to achieve fine-grained
attacks and obtain more refined results. Conversely, relying only on the side-channel
information at the network layer cannot reveal information related to the user’s
physical behavior. For example, the wireless physical layer information can be used
to obtain information such as the user’s movement and location on the terminal,
but the lack of network layer information makes it impossible to determine when
the user enters highly sensitive information such as login password and payment
password. Similarly, depending only on the side-channel information at the network
layer side, it is also difficult for an attacker to find the above attack targets due to
the protection of the network layer encryption mechanism. However, by fusing
the side-channel information from both the physical and network layers, an attacker
is very likely to exceed the original attack capability and bring huge privacy risks to users.
Therefore, in terms of attack, it is urgent to study the privacy attack mechanism of
cross-layer fusion. In terms of defense, there is a lack of lightweight and flexible
defense mechanisms. Therefore, the first research challenge of this monograph is to
conduct cross-layer research on the physical layer and network layer wireless
side-channel attacks faced by smart home terminals and design a universal and
convenient defense mechanism.

1.3.2 Voice Interface Layer: Spoofing Attacks

In the smart home network, the voice interface stands out from many user interfaces
because of its convenience, easy collection, and high accuracy. Now, it has become
the preferred user interface of many mainstream smart home platforms. However,
the voice interface is vulnerable to various voice spoofing attacks, including the
classic voice playback attack [12] and the new spoofing attacks from the hardware
and software levels.

• At the voice interface software level, the deep learning model used in speech
recognition and speaker recognition has been proven to be vulnerable to the threat
of adversarial examples. For example, Carlini et al. [8] pointed out that a
voice command can be converted into noise-like audio, which is perceived as
noise by the human auditory system, but the voice recognition system will still
correctly recognize it and execute the corresponding malicious attack command.
Yuan et al. [33] further proposed the CommanderSong attack. In this attack,
malicious voice instructions are cleverly embedded into a song. To human listeners,
the audio sounds almost the same as the original song and raises no alarm, but
the voice recognition system still recognizes the embedded command. In terms of
speaker recognition, Zhang et al. [35] proposed the VMASK attack, which embeds
generated adversarial perturbations into the audio of non-registered users and
thereby fools the Apple Siri speaker recognition system into accepting them as
registered users.
• At the hardware level of the voice interface, the non-linearity of the microphone’s
amplitude–frequency response enables the microphone to demodulate the high-frequency
part of a received signal down to the low-frequency band. Accordingly, Roy et al. [24]
proposed the BackDoor attack, which realizes a denial-of-service attack on the voice
interface by injecting audio modulated onto ultrasound. Zhang et al. [34] proposed
DolphinAttack, which modulates malicious voice commands onto high-frequency
ultrasound signals and induces the voice interface (e.g., Apple Siri, Amazon Alexa)
to perform sensitive operations that the user’s ears cannot detect.
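The demodulation effect behind these hardware-level attacks can be illustrated numerically. The sketch below uses synthetic signals and arbitrary illustrative coefficients (not measurements of any real microphone) to show that a quadratic term in the microphone response recreates a 400 Hz "voice" tone from a purely ultrasonic input:

```python
# Sketch of the nonlinearity behind inaudible-command attacks: a quadratic
# microphone response term demodulates an AM ultrasonic signal to baseband.
import math

FS = 192_000                      # sample rate (Hz)
F_CARRIER, F_VOICE = 30_000, 400  # ultrasonic carrier, "voice" tone (Hz)
N = FS // 100                     # 10 ms of samples

def tone(freq, n=N):
    return [math.sin(2 * math.pi * freq * i / FS) for i in range(n)]

def band_power(signal, freq):
    # Correlate the signal with a probe tone to measure energy at `freq`.
    probe = tone(freq, len(signal))
    return abs(sum(s * p for s, p in zip(signal, probe))) / len(signal)

voice = tone(F_VOICE)
am = [(1 + v) * c for v, c in zip(voice, tone(F_CARRIER))]  # inaudible input
recorded = [0.1 * x + 0.05 * x * x for x in am]             # mic nonlinearity

# The quadratic term recreates the voice tone at 400 Hz in the recording.
demodulated_power = band_power(recorded, F_VOICE)
```

The AM input itself carries essentially no energy at 400 Hz (only at the carrier and its sidebands), yet after the quadratic term the 400 Hz component reappears, which is why the injected command is inaudible to humans but audible to the microphone.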
Voice spoofing attacks enable attackers to query users’ sensitive information
(e.g., query users’ schedule information) through voice interfaces and even force
smart devices to perform improper actions (e.g., open the door lock when users
leave home), which poses a serious threat to the security and privacy of users’ smart
home systems. To this end, researchers have proposed various defense strategies,
almost all of which take advantage of the fact that the sound in the spoofing attack is
played by electronic devices (e.g., the high-quality speaker in the replay attack [30]
and the ultrasonic dynamic speaker in DolphinAttack [34]). Therefore, different
physical characteristics between humans and machines can be used as “liveness”
factors. Existing liveness detection schemes can be divided into two categories: two-
factor authentication and passive liveness detection schemes. Their characteristics
and limitations are summarized as follows:
• Two-factor authentication-based liveness detection. Two-factor authentication
means that in addition to the audio information collected by the voice interface,
the information highly related to the voice spoken by the user is used as the
liveness feature of the user to distinguish the spoofing attack samples generated
by the legitimate user and the attacker. There are many kinds of information that
can be used for two-factor authentication, including image or video information
collected by the camera [10], electromagnetic field change information extracted
from the loudspeaker device [9], data information collected by the acceleration
sensor of the user’s wearable device [13], and the ultrasonic Doppler frequency
shift information caused by the user’s mouth movement [36]. It should be
pointed out that existing two-factor authentication schemes require users to
carry special sensor devices (e.g., the liveness detection mechanisms based on
image [10] and acceleration [13]) or perform specific actions to collect the
liveness information (e.g., the detection mechanism based on ultrasound [36]).
At present, the effectiveness and practicality of these schemes in the smart
home environment are very limited, and a more convenient defense strategy is
urgently needed.
• Voice-based passive liveness detection. Unlike liveness detection based on
two-factor authentication, passive liveness detection only analyzes the voice
signal collected by the voice interface to determine whether the voice is from
a deceptive attacker. Its advantage is that it only depends on the voice interface
itself and does not need to deploy any additional equipment to obtain two-factor
information, so it has a wider scope of application. Shiota et al. [26] and Wang et
al. [29] use the “pop” noise when humans speak to distinguish voice commands
generated by real people and devices. Blue et al. [6] and Ahmed et al. [1]
identify spoofing attacks by analyzing the spectrum pattern of the collected
mono voice signals so as to achieve lightweight authentication. Yan et al. [32]
creatively proposed to use two microphones to collect voice signals and defined
the concept of “fieldprint” between two microphone signals to detect spoofing
attacks. However, since the features of the voice signal are easy to change with
the change of the sound propagation channel and the scheme based on the
fieldprint feature [32] requires the user to maintain a fixed manner to ensure the
robustness of the features, the current passive detection scheme faces the problem
of performance degradation in complicated scenes (e.g., the user walks, gestures
change). At present, smart speakers equipped with microphone arrays are widely
used in the voice interface of smart homes, while passive liveness detection based
on smart speakers remains to be studied.
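The intuition behind the two-microphone "fieldprint" idea can be sketched with a toy feature: a human mouth close to the array produces a larger inter-channel energy imbalance than a distant loudspeaker, whose sound reaches both microphones almost equally. The signals and threshold below are synthetic illustrations, not the actual fieldprint feature of [32]:

```python
# Sketch of a two-microphone liveness check: compare per-channel energies.
# A nearby mouth yields a strong imbalance; a far loudspeaker does not.
# Signals and the threshold are synthetic illustrations.

def energy(channel):
    return sum(s * s for s in channel) / len(channel)

def energy_imbalance(ch1, ch2):
    e1, e2 = energy(ch1), energy(ch2)
    return abs(e1 - e2) / max(e1, e2)

def detect_live_source(ch1, ch2, threshold=0.2):
    return energy_imbalance(ch1, ch2) > threshold

near_mouth = ([0.9, -0.8, 0.85], [0.4, -0.35, 0.38])   # strong imbalance
far_speaker = ([0.5, -0.5, 0.5], [0.48, -0.47, 0.49])  # nearly equal
```

As the text notes, such single-feature checks degrade when the user moves, since the imbalance also depends on the user's position relative to the microphones.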
This monograph will study the security mechanism of voice interface. Since the
above two-factor authentication and passive liveness detection are quite different in
terms of detection principle, implementation process, and research difficulties, this
monograph will study the two schemes, respectively. The corresponding research
challenges are as follows.
For two-factor authentication, the second research challenge in this monograph
is how to make use of the ubiquitous wireless signals in the smart home so
that users can leverage the efficient two-factor authentication information to
defend against voice spoofing attacks without carrying any device. For voice-
based passive liveness detection, the third research challenge is how to use the
microphone array in the smart speaker to achieve passive liveness detection
with high robustness and effectiveness by relying only on the collected multi-
channel voice signals.

1.3.3 Application Platform Layer: Misbehavior

The smart application platform undertakes a lot of computing tasks in the current
smart home system. In order to support more IoT services, most smart home
application platforms support third-party applications and realize automatic col-
laboration of multiple devices according to the application’s code logic. However,
there are some defects in the smart home platform, which open the door to
malicious behaviors of smart applications. For example, the Samsung SmartThings
smart application platform abstracts the corresponding capabilities for each device
and regulates the behavior of smart applications through the constraint function
model. However, because the functional model adopts a coarse-grained management
method, smart applications may obtain too many permissions, which may lead to
improper behavior. The malicious behavior of applications can be divided into two
categories: over-privileged access and event spoofing. Over-privileged access refers
to the behavior of a malicious application automatically controlling the device in the
cloud without user authorization (e.g., automatically opening the door lock). Event
spoofing refers to malicious applications falsely reporting an event in the cloud and
triggering subsequent abnormal operations (e.g., forgery of “high temperature” data
generated by a temperature sensor to induce the user’s intelligent air conditioner to
automatically turn on). In short, the risks at the application level have caused huge
security and privacy risks to the smart home system and users.
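Event spoofing can be illustrated with a toy event bus in which events are trusted purely by name, so a forged sensor reading triggers the same rule as a genuine one. The class and event names are hypothetical simplifications:

```python
# Sketch of event spoofing: the platform trusts events by name, so a
# malicious app can publish a forged sensor reading and trigger a rule.

class EventBus:
    def __init__(self):
        self.actions = []
        self.rules = {"temperature.high": "ac.on"}

    def publish(self, event, source):
        # Flaw: the source is recorded but never checked against the event,
        # so any app may impersonate any sensor.
        if event in self.rules:
            self.actions.append((self.rules[event], source))

bus = EventBus()
bus.publish("temperature.high", source="temperature_sensor")  # legitimate
bus.publish("temperature.high", source="malicious_app")       # forged
```

Both publications trigger `ac.on`, which is precisely the authorization gap that event-spoofing attacks exploit.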
At present, the monitoring and prevention of malicious behaviors of smart
applications are mainly divided into three types:
• Introducing a control mechanism for sensitive data information flow by modi-
fying the smart home platform. For example, Fernandes et al. [15] proposed a
FlowFence system, which intercepts all data flows and requires downstream data
consumers to declare their intended use before accessing sensitive data.
• Designing a context-based permission control system to achieve fine-grained
access control. For example, the ContexIoT system proposed by Jia et al. [16]
can support fine-grained identification of sensitive behaviors and report sensitive
behaviors and context information to users.
• Improving the authorization mechanism of smart applications by analyzing
the application source code, comments, and description documents. Tian et
al. [27] proposed SmartAuth, which analyzes whether the application logic is
reasonable through static analysis of the smart application’s code.
However, existing solutions either need to modify the platform itself (e.g.,
FlowFence [15], ContextIoT [16]) or inject patches into smart applications (e.g.,
SmartAuth [27]), or even need to modify communication protocols and design new
systems, which leaves these solutions with limited versatility and usability.
There is an urgent need for a novel method to allow third-party defenders—in addi-
tion to smart home platform suppliers, smart device manufacturers, and application
developers—to monitor smart home applications without making any changes to
existing platforms. Therefore, the last research challenge of this monograph is to
propose a third-party system independent of the smart home system to realize
real-time monitoring of malicious behavior in smart home applications.

1.4 Aims and Organization of This Monograph

In view of the above-mentioned security challenges in “terminal device—voice
interface—application platform,” this monograph carries out security research at
each of the three levels.

1.4.1 Aims of This Monograph


1.4.1.1 Terminal Device Level: Cross-layer Privacy Attack and Defense
Based on Wireless Side-Channel Information

In terms of the security challenges of wireless side-channel attacks at the terminal
device level, this monograph proposes a cross-layer privacy attack and defense
scheme for smart terminal devices such as smartphones. Specifically, this monograph
proposes WindTalker, which obtains sensitive keystroke behaviors such as
passwords input by users on smartphones through channel state information (CSI)
at the physical layer and network layer sides of wireless communication. First of all,
in the physical layer, since the user’s keystroke behavior on the terminal screen will
lead to different hand coverage and finger movements, this will introduce unique
interference to multi-path signals and can be reflected through CSI. The attacker
can infer the user’s password input by taking advantage of the strong correlation
between CSI fluctuations and keystrokes. In the network layer, WindTalker attracts
user terminal devices to access a Wi-Fi hotspot deployed by the attacker, which not
only avoids physically approaching the target device but also determines when the
user inputs sensitive information such as passwords by analyzing the Wi-Fi traffic
side-channel information. We tested WindTalker on various types of smartphones
and verified the practicability of WindTalker on the Alipay platform through a case
study. In terms of defense, this monograph proposes a strategy based on CSI signal
obfuscation to prevent side-channel attacks. The main contributions of this part
include the following:
• This monograph leverages wireless signal-based side-channel information in the
physical layer and the traffic information in the network layer to implement
the privacy attack system WindTalker. WindTalker can infer the user’s mobile
payment password, and its effectiveness is validated in practical platforms such
as Alipay.
• In order to thwart WindTalker, this monograph proposes a defense mechanism
based on physical layer signal obfuscation to prevent attackers from obtaining
effective CSI side-channel information. This defense mechanism enhances the
security of mobile smart terminal devices.
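The core observation that keystrokes perturb CSI can be sketched as a moving-variance detector over a CSI amplitude stream: finger motion produces short bursts of variance against an otherwise stable channel. The trace and threshold below are synthetic, not WindTalker's actual signal processing:

```python
# Sketch of keystroke detection from CSI amplitude: finger motion raises
# the short-term variance of the stream. Trace and threshold are synthetic.
from statistics import pvariance

def keystroke_windows(csi, window=4, threshold=0.01):
    # Return start indices of windows whose variance exceeds the threshold.
    hits = []
    for i in range(len(csi) - window + 1):
        if pvariance(csi[i:i + window]) > threshold:
            hits.append(i)
    return hits

quiet = [1.00, 1.01, 0.99, 1.00, 1.01, 1.00]
typing = [1.00, 1.01, 1.30, 0.70, 1.25, 1.00]   # burst during a keystroke
```

Locating such bursts is the physical-layer half of the attack; the network-layer traffic analysis then tells the attacker which bursts coincide with password entry.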

1.4.1.2 Two-Factor Liveness Detection Based on Wi-Fi Signals

In view of the problem that users need to carry sensor equipment for the two-
factor authentication of voice interface, this monograph proposes WSVA, a voice
liveness detection system based on Wi-Fi signals. Unlike traditional two-factor
liveness detection schemes, WSVA uses the physical layer information of wireless
signals generated by Wi-Fi devices in the IoT environment as the liveness factor without
requiring users to carry any additional devices or sensors. Since the user’s mouth
movement will modulate the wireless signal’s CSI, it can be determined whether
the voice command of the voice interface is actually issued by the user based on the
fluctuation of CSI. We use various real voice commands and spoofing commands to
evaluate WSVA in different scenarios and prove that WSVA has good accuracy and
scalability. The main contributions of this part include the following:
• This monograph successfully characterizes the correlation between CSI changes
in wireless signals and the user’s mouth movements and builds a liveness
detection mechanism based on two different types of signals: voice and CSI.
• WSVA proposed in this monograph is a two-factor liveness detection mechanism
that does not need additional devices and can resist voice spoofing attacks with
high efficiency.
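The consistency check at the heart of this idea can be sketched as a correlation between the voice energy envelope and the CSI fluctuation envelope: live speech is accompanied by mouth motion that tracks the audio, while a replayed command is not. The envelopes and threshold below are synthetic illustrations, not WSVA's actual features:

```python
# Sketch of the audio/CSI consistency check: live speech produces CSI
# fluctuations (mouth motion) aligned with the voice envelope; a replayed
# command does not. Envelopes are synthetic per-frame energies.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def is_live(audio_env, csi_env, threshold=0.5):
    return pearson(audio_env, csi_env) > threshold

audio = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]
csi_live = [0.2, 0.7, 0.8, 0.3, 0.6, 0.2]    # tracks mouth motion
csi_replay = [0.3, 0.3, 0.2, 0.3, 0.3, 0.2]  # no correlated motion
```

Because the second factor comes from ambient Wi-Fi rather than a worn sensor, the user carries nothing extra, which is the key difference from prior two-factor schemes.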

1.4.1.3 Passive Voice Liveness Detection Based on Microphone Array

Aiming at the problems of low robustness and low flexibility of current passive
liveness detection based on voice signals, this monograph designs a passive
liveness detection system ArrayID by using the microphone array widely used
by mainstream smart speakers in smart home. Because the microphones in the
microphone array have different positions, the multi-channel audio collected by
ArrayID will have better diversity. By using audio diversity, ArrayID can extract
more liveness factors related to the target user and improve the robustness and
accuracy of liveness detection. Specifically, ArrayID can use the correlation between
different channel data to eliminate the degradation of liveness detection performance
caused by factors such as air channel and user position changes. Subsequently, this
monograph constructs a dataset containing 38,720 multi-channel voice commands
to evaluate the effectiveness of the proposed ArrayID. The main contributions of this
part include the following:
• This monograph theoretically analyzes the principle behind passive liveness
detection and designs the ArrayID to prevent voice spoofing attacks. By using
only the audio collected from the smart speaker, ArrayID does not require the
user to carry any equipment or perform other operations.

• Through experimental results, this monograph demonstrates that ArrayID is superior to existing schemes and remains robust under many factors (such as distance, direction, spoofing device, and background noise).
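As a toy illustration of why a microphone array helps (the feature below is a simplified stand-in, not ArrayID's actual array fingerprint): pairwise spectral ratios between channels capture how a sound source excites the array, and this inter-channel pattern differs between a near-field human mouth and a loudspeaker.

```python
import numpy as np

def array_feature(multichannel_audio, n_fft=256):
    """Build a cross-channel feature from microphone-array audio.

    multichannel_audio: array of shape (n_channels, n_samples).
    For every pair of channels, take the log ratio of magnitude spectra.
    Because the microphones sit at different positions, these ratios
    carry liveness cues that a single microphone cannot provide, and
    ratios largely cancel effects common to all channels (e.g., the
    air channel between source and array).
    """
    spectra = np.abs(np.fft.rfft(multichannel_audio, n=n_fft, axis=1)) + 1e-9
    n_ch = spectra.shape[0]
    pairs = [np.log(spectra[i] / spectra[j])
             for i in range(n_ch) for j in range(i + 1, n_ch)]
    return np.concatenate(pairs)
```

A classifier trained on such cross-channel features would then separate live commands from spoofed ones; the choice of the log-ratio feature here is an assumption for illustration only.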

1.4.1.4 Application Misbehavior Detection Based on Wireless Traffic Side-Channel Information
Considering the fact that current application misbehavior detection methods need
to modify smart applications or platforms, this monograph proposes HoMonit,
which is independent of smart home systems. HoMonit uses side-channel inference
technology to monitor the behavior of applications based on encrypted wireless
traffic in smart homes. The core idea of HoMonit is that the behavior of each
smart home application can be described by a deterministic finite automaton (DFA)
model, where each state in DFA represents the state of the application program
and the corresponding smart device, and the transition between states represents the
interaction between the application program and the device. To this end, HoMonit extracts a benign DFA of the application from the source code of its benign version, infers the operations and interaction states of the devices associated with the smart application by observing the sizes and intervals of encrypted wireless packets, and converts these observations into a DFA. The DFA extracted from the application is then matched against the DFA inferred from the wireless traffic side channel; if the matching fails, it indicates that the running application has abnormal behavior. This monograph implements HoMonit and demonstrates its
effectiveness in detecting smart applications with abnormal behavior. At the same
time, considering that the wireless side-channel information may contain sensitive
behaviors in smart homes, this monograph further designs a privacy enhancement
module based on traffic obfuscation for HoMonit. The privacy enhancement module
can effectively protect privacy by increasing information entropy while retaining
HoMonit’s ability to monitor applications’ misbehavior. The main contributions of
this study are:
• This monograph points out that wireless side-channel information can be used not only for attack but also for defense against malicious behavior and, for the first time, proposes an automatic device-state matching algorithm based on the wireless traffic side channel.
• This monograph also takes into account the risk of privacy leakage caused by
HoMonit and designs a privacy enhancement module based on traffic obfusca-
tion, which helps protect the privacy of the smart home system while ensuring
detection ability.
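The DFA-matching idea can be sketched as follows. The example app, its states, and the packet-size signatures are all hypothetical, and real traffic-to-event inference is far more involved than the lookup shown here:

```python
# Benign DFA of a hypothetical door-lock app: state -> {event: next_state}.
BENIGN_DFA = {
    "idle":     {"motion_detected": "alerting"},
    "alerting": {"lock_door": "locked"},
    "locked":   {"timeout": "idle"},
}

def infer_event(pkt_size, interval):
    """Map an encrypted packet's metadata to a device event.
    The size/interval signatures below are made up for illustration."""
    if 90 <= pkt_size <= 110:
        return "motion_detected"
    if 140 <= pkt_size <= 160:
        return "lock_door"
    if pkt_size < 60 and interval > 5.0:
        return "timeout"
    return "unknown"

def monitor(packets, dfa=BENIGN_DFA, start="idle"):
    """Walk the benign DFA along events inferred from sniffed packets.
    Returns True if the traffic matches the benign model; a transition
    the DFA does not allow signals application misbehavior."""
    state = start
    for size, interval in packets:
        event = infer_event(size, interval)
        if event not in dfa.get(state, {}):
            return False
        state = dfa[state][event]
    return True
```

For example, a trace that triggers "motion" twice without ever locking the door leaves the DFA and is flagged, even though every packet is encrypted.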
The four research topics above address the four major security challenges across the three levels of the smart home system. These topics are interrelated and together constitute the key security technologies of the smart home network.

1.4.2 Organization of This Monograph

As shown in Fig. 1.5, this monograph is divided into seven chapters. This chapter
is an introduction, which introduces the research background of the monograph, the
architecture of the smart home system, the security challenges summarized around
the logic of “terminal equipment—voice interface—application platform,” and the
main research contents. This monograph is organized as follows:
Chapter 2 reviews the research related to the security challenges addressed in this monograph. At the terminal device level, it surveys the wireless side-channel attacks faced by terminal devices, including physical-layer and network-layer side-channel attacks. At the voice interface level, it introduces the security threats discovered by researchers and reviews existing work on two-factor-based liveness detection and voice-based passive liveness detection. At the application platform level, it summarizes smart applications' misbehavior and the defense mechanisms proposed by researchers.
Chapter 3 presents the cross-layer privacy attack mechanism WindTalker, which uses side-channel information at the wireless physical layer and network layer to infer the user's payment password. A defense mechanism based on signal obfuscation is then introduced. Finally, WindTalker and the corresponding defense mechanism are evaluated in real environments, and their effectiveness is demonstrated.

Fig. 1.5 The structure of this monograph: Chapter 1 (Introduction); Chapter 2 (Literature Review of Security in Smart Home Network); Chapter 3 (Terminal Device Security: Privacy Breaches and Countermeasures); Chapter 4 (Voice Interface Security I: Wireless Signal Based Two-Factor Authentication); Chapter 5 (Voice Interface Security II: Microphone Array Based Passive Liveness Detection); Chapter 6 (Application Platform Security: Traffic Analysis Based Misbehavior Detection); Chapter 7 (Conclusion and Future Directions). Chapters 3–6 correspond to the terminal device, voice interface, and application platform levels of the smart home.

In Chap. 4, at the voice interface layer, a two-factor authentication mechanism WSVA based on Wi-Fi signals is proposed. Firstly, the idea of recognizing the
user’s mouth movement by using the physical layer information of the Wi-Fi signal
is introduced. Then, the manner by which WSVA determines whether the voice
command is from the voice spoofing attack is introduced. Finally, the effectiveness
is demonstrated via a series of real-world experiments.
Furthermore, in Chap. 5, a novel passive liveness detection mechanism named
ArrayID based on the smart speaker's microphone array is proposed. First, through a theoretical analysis of the sound propagation process, a new liveness feature, the "array fingerprint," is constructed. Then, the workflow of ArrayID is presented. Finally,
the effectiveness and robustness of ArrayID are verified on the self-built dataset and
the third-party dataset.
In Chap. 6, a third-party smart application’s misbehavior detection mechanism,
HoMonit, which is independent of the smart home system, is proposed. This chapter
first reveals how the side-channel information at the wireless network layer is related
to the smart home’s device behaviors, then introduces the description of application
behavior and the processing of wireless traffic by HoMonit, and finally deploys
HoMonit in Samsung SmartThings smart home and verifies its effectiveness.
Chapter 7 summarizes the research of this monograph and looks forward to future
research.

References

1. Ahmed, M.E., Kwak, I.Y., Huh, J.H., Kim, I., Oh, T., Kim, H.: Void: a fast and light
voice liveness detection system. In: 29th USENIX Security Symposium (USENIX Secu-
rity 20), pp. 2685–2702. USENIX Association (2020). https://www.usenix.org/conference/
usenixsecurity20/presentation/ahmed-muhammad
2. AliGenie: Tmallgenie (2021). https://aligenie.com/
3. Amazon: Amazon Alexa developer (2019). https://developer.amazon.com/alexa
4. Apple: Homekit (2018). https://www.apple.com/ios/home/
5. Associates, P.: Top 10 consumer IoT trends in 2017 (2017). http://www.parksassociates.com/
whitepapers/top10-2017
6. Blue, L., Vargas, L., Traynor, P.: Hello, is it me you’re looking for? Differentiating between
human and electronic speakers for voice interface security. In: Proceedings of the 11th ACM
Conference on Security & Privacy in Wireless and Mobile Networks, pp. 123–133. ACM
(2018)
7. Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., Zhou,
W.: Hidden voice commands. In: Proceedings of USENIX Security Symposium (USENIX
Security), pp. 513–530 (2016)
8. Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., Zhou,
W.: Hidden voice commands. In: 25th USENIX Security Symposium (USENIX Security
16), pp. 513–530. USENIX Association, Austin (2016). https://www.usenix.org/conference/
usenixsecurity16/technical-sessions/presentation/carlini
9. Chen, S., Ren, K., Piao, S., Wang, C., Wang, Q., Weng, J., Su, L., Mohaisen, A.: You can hear
but you cannot steal: Defending against voice impersonation attacks on smartphones. In: 2017
IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 183–195
(2017). https://doi.org/10.1109/ICDCS.2017.133

10. Chen, Y., Sun, J., Jin, X., Li, T., Zhang, R., Zhang, Y.: Your face your heart: secure mobile face
authentication with photoplethysmograms. In: Proceedings of IEEE Conference on Computer
Communications (INFOCOM), pp. 1–9 (2017)
11. Choi, K., Son, Y., Noh, J., Shin, H., Choi, J., Kim, Y.: Dissecting customized protocols:
automatic analysis for customized protocols based on IEEE 802.15.4. In: Proceedings of the
9th ACM Conference on Security & Privacy in Wireless and Mobile Networks, WiSec ’16,
pp. 183–193. ACM, New York (2016). https://doi.org/10.1145/2939918.2939921
12. Diao, W., Liu, X., Zhou, Z., Zhang, K.: Your voice assistant is mine: how to abuse speakers
to steal information and control your phone. In: Proceedings of the 4th ACM Workshop on
Security and Privacy in Smartphones & Mobile Devices (SPSM), pp. 63–74 (2014). https://
doi.org/10.1145/2666620.2666623
13. Feng, H., Fawaz, K., Shin, K.G.: Continuous authentication for voice assistants. In: Proceed-
ings of the 23rd Annual International Conference on Mobile Computing and Networking
(MobiCom), pp. 343–355 (2017). https://doi.org/10.1145/3117811.3117823
14. Fernandes, E., Jung, J., Prakash, A.: Security analysis of emerging smart home applications.
In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 636–654 (2016). https://doi.org/
10.1109/SP.2016.44
15. Fernandes, E., Paupore, J., Rahmati, A., Simionato, D., Conti, M., Prakash, A.: FlowFence:
Practical data protection for emerging IoT application frameworks. In: USENIX Security
Symposium (USENIX Security) (2016)
16. Jia, Y.J., Chen, Q.A., Wang, S., Rahmati, A., Fernandes, E., Mao, Z.M., Prakash, A.:
ContexIoT: Towards providing contextual integrity to appified IoT platforms. In: The Network
and Distributed System Security Symposium (NDSS) (2017)
17. Li, H., Xu, Z., Zhu, H., Ma, D., Li, S., Xing, K.: Demographics inference through wi-fi
network traffic analysis. In: IEEE International Conference on Computer Communications
(INFOCOM) (2016)
18. Li, Z., Ma, F., Rathore, A.S., Yang, Z., Chen, B., Su, L., Xu, W.: Wavespy: remote and through-
wall screen attack via mmwave sensing. In: 2020 IEEE Symposium on Security and Privacy
(SP), pp. 217–232 (2020). https://doi.org/10.1109/SP40000.2020.00004
19. Ma, D., Lan, G., Hassan, M., Hu, W., Upama, M.B., Uddin, A., Youssef, M.: Solargest:
Ubiquitous and battery-free gesture recognition using solar cells. In: The 25th Annual
International Conference on Mobile Computing and Networking, MobiCom ’19. Association
for Computing Machinery, New York (2019). https://doi.org/10.1145/3300061.3300129
20. Makwana, D.: Amazon echo smart speaker (3rd gen) review (2020). https://www.mobigyaan.
com/amazon-echo-smart-speaker-3rd-gen-review
21. Market Research Future: Voice Assistant Market - Information by Technology, Hardware and
Application - Forecast till 2025 (2020). https://www.marketresearchfuture.com/reports/voice-
assistant-market-4003
22. Panchenko, A., Lanze, F., Pennekamp, J., Engel, T., Zinnen, A., Henze, M., Wehrle, K.:
Website fingerprinting at internet scale. In: NDSS (2016)
23. Research and Markets: Global Smart Home Market with COVID-19 Impact Analysis by
Product (Lighting Control, Security & Access Control, HVAC Control, Smart Speaker, Smart
Kitchen, Smart Furniture), Software & Services, Sales Channel, and Region - Forecast
to 2026 (2021). https://www.researchandmarkets.com/reports/5448441/global-smart-home-
market-with-covid-19-impact
24. Roy, N., Hassanieh, H., Roy Choudhury, R.: Backdoor: Making microphones hear inaudible
sounds. In: Proceedings of the 15th ACM Annual International Conference on Mobile Systems,
Applications, and Services (MobiSys), pp. 2–14 (2017). https://doi.org/10.1145/3081333.
3081366
25. Samsung: SmartThings (2021). https://www.smartthings.com
26. Shiota, S., Villavicencio, F., Yamagishi, J., Ono, N., Echizen, I., Matsui, T.: Voice liveness
detection algorithms based on pop noise caused by human breath for automatic speaker
verification. In: Sixteenth Annual Conference of the International Speech Communication
Association (2015)

27. Tian, Y., Zhang, N., Lin, Y.H., Wang, X., Ur, B., Guo, X., Tague, P.: SmartAuth: user-centered
authorization for the internet of things. In: USENIX Security Symposium (USENIX Security)
(2017)
28. Tillman, M.: Google home max review: Cranking smart speaker audio to the max
(2019). https://www.pocket-lint.com/smart-home/reviews/google/143184-google-home-max-
review-specs-price
29. Wang, Q., Lin, X., Zhou, M., Chen, Y., Wang, C., Li, Q., Luo, X.: VoicePop: a pop noise
based anti-spoofing system for voice authentication on smartphones. In: IEEE INFOCOM
2019-IEEE Conference on Computer Communications, pp. 2062–2070. IEEE (2019)
30. Wang, S., Cao, J., He, X., Sun, K., Li, Q.: When the differences in frequency domain are
compensated: Understanding and defeating modulated replay attacks on automatic speech
recognition. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and
Communications Security, CCS ’20, pp. 1103–1119. Association for Computing Machinery
(2020). https://doi.org/10.1145/3372297.3417254
31. Wang, W., Liu, A.X., Shahzad, M., Ling, K., Lu, S.: Understanding and modeling of WiFi
signal based human activity recognition. In: Proceedings of the 21st Annual International
Conference on Mobile Computing and Networking, pp. 65–76 (2015)
32. Yan, C., Long, Y., Ji, X., Xu, W.: The catcher in the field: a fieldprint based spoofing
detection for text-independent speaker verification. In: Proceedings of the 2019 ACM SIGSAC
Conference on Computer and Communications Security, CCS ’19, pp. 1215–1229. Association
for Computing Machinery (2019). https://doi.org/10.1145/3319535.3354248
33. Yuan, X., Chen, Y., Zhao, Y., Long, Y., Liu, X., Chen, K., Zhang, S., Huang, H., Wang,
X., Gunter, C.A.: CommanderSong: a systematic approach for practical adversarial voice
recognition. In: 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64.
USENIX Association, Baltimore (2018). https://www.usenix.org/conference/usenixsecurity18/
presentation/yuan-xuejing
34. Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., Xu, W.: DolphinAttack: inaudible voice
commands. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Com-
munications Security (CCS), pp. 103–117 (2017). https://doi.org/10.1145/3133956.3134052
35. Zhang, L., Meng, Y., Yu, J., Xiang, C., Falk, B., Zhu, H.: Voiceprint mimicry attack towards
speaker verification system in smart home. In: IEEE INFOCOM 2020—IEEE Conference on
Computer Communications, pp. 377–386 (2020). https://doi.org/10.1109/INFOCOM41043.
2020.9155483
36. Zhang, L., Tan, S., Yang, J.: Hearing your voice is not enough: an articulatory gesture
based liveness detection for voice authentication. In: Proceedings of the 2017 ACM SIGSAC
Conference on Computer and Communications Security (CCS), pp. 57–71 (2017). https://doi.
org/10.1145/3133956.3133962
Chapter 2 Literature Review of Security in Smart Home Network

2.1 Side-channel Attacks Faced by Terminal Device

This subsection reviews the risk of side-channel privacy disclosure that terminal
devices in smart homes are vulnerable to. A side-channel attack is an indirect attack in which the attacker exploits information that is not directly related to the attacked object to achieve the attack goal. In the smart home
environment, the popularity of wireless signals promotes the interconnection of
intelligent devices. However, due to the openness of wireless channels, attackers can
easily sniff wireless communications and use side-channel information to threaten
system security and user privacy. This section will review the side-channel attacks
related to wireless communication and introduce other side-channel attacks.

2.1.1 Attacks Based on Physical Layer Side-channel Information

In the smart home environment, smart devices support diverse wireless communication protocols (e.g., Bluetooth, Wi-Fi, ZigBee, Z-Wave) and new communication media such as millimeter wave and visible light. To ensure normal wireless communication, devices need to interact with each other at the physical layer. Physical layer information reflects changes in the surrounding environment, so it can be used to sense the environment and people. However, the development of physical layer wireless sensing technology also brings significant privacy risks to terminal device users. This subsection first introduces wireless sensing frameworks based on various media (e.g., Wi-Fi, ultrasound, visible light, millimeter wave) and then summarizes the privacy risks they cause.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 21


Y. Meng et al., Security in Smart Home Networks, Wireless Networks,
https://doi.org/10.1007/978-3-031-24185-7_2

Wi-Fi Based Wireless Sensing Technology and Related Privacy Risks The
physical layer information of Wi-Fi signals includes received signal strength (RSS)
and channel state information (CSI). In 2013, Adib et al. [1] proposed that the RSS
can be used to monitor the movement of people on the other side of the wall and
deployed in the software-defined radio platform. In 2015, Wang et al. [58] gave a
quantitative relationship between the CSI of Wi-Fi signals in commercial devices
and indoor user movements and established a mapping relationship between CSI
changes and user movements (e.g., running, jumping, lying down), thus realizing the
monitoring of user movements. Zhang et al. [41] realized a long-term daily health
monitoring system for the elderly living alone by using existing Wi-Fi commercial
equipment. Pierson et al. [44] proposed a fast localization algorithm for a single Wi-
Fi antenna and realized localization with an error of less than 14 cm. Qian et al. and
Zheng et al. put forward the Widar2.0 [45] and Widar3.0 [71] systems, respectively,
which realized the use of Wi-Fi to track human movement and recognize gestures.
These works not only advance wireless sensing technology but also open the door for attackers to threaten user privacy by using wireless sensing. Shi et al. [47] proposed
that in addition to using software radio platforms (e.g., USRP) and commercial
network cards (e.g., Intel 5300 network cards) to obtain physical layer information,
in the IoT environment, the wireless CSI collected by user’s terminal devices can
also achieve the above functions. Therefore, by controlling the user's wireless device and obtaining the physical layer information, the attacker can infer the user's movement, location, and other sensitive information.
Sensing Technologies Based on Other Wireless Communication Media The
concept of wireless sensing has expanded from the initial Wi-Fi based sensing
technology to ubiquitous wireless sensing technologies based on the millimeter
wave, visible light, ultrasound, and other media. Zhou et al. [72] used ultrasound to sense the user's face and used the 3D facial contour to authenticate the user. In the area of visible light sensing, Ma et al. [40] proposed recognizing gestures near solar-powered devices by analyzing the pattern of the photocurrent: each gesture interferes with the light incident on the solar panel in a unique way, thus leaving its unique signature in the harvested photocurrent. Finally, millimeter wave, as a novel
communication mode, has attracted extensive attention due to its high frequency.
For example, in the field of wireless temperature control, Chen et al. [12] proposed
a wireless temperature monitoring system based on the thermal scattering effect of
millimeter wave signals and cholesterol-based materials with different molecular
modes at different ambient temperatures. Li et al. [35] put forward WaveSpy, an
end-to-end portable through wall screen eavesdropping system, which can remotely
collect the state response of the LCD screen and remotely read the screen content
through the millimeter wave probe. In the smart home system, with the wide use
of devices with diversified communication protocols, users will face greater privacy
risks.
Keystroke Behavior Inference Based on Wi-Fi Signals Among the inferences of user privacy, inferring the user's keystroke behavior on the terminal device is the most sensitive, because such behavior is highly correlated with sensitive information such as the user's payment password. The use of wireless
signals to infer keystroke behavior has the advantages of being device-free and non-intrusive, which has attracted attention from the academic community. Ali et
al. [4] proposed a keystroke inference system called WiKey. A Wi-Fi transmitting
and receiving device is arranged near the user. The unique waveform of the wireless
signal CSI generated when the user hits the keyboard is used to distinguish the
user’s keystroke input on the external keyboard. Zhang et al. [67] proposed WiPass,
which can infer the user's graphic unlock password on the mobile device. Tan et al. [52] proposed WiFinger, which uses CSI signals of commercial devices to capture the fine-grained motion of users' fingers. Compared with WindTalker in this monograph, these schemes rely only on physical layer information, so they cannot determine when the user's sensitive keystroke input occurs.

2.1.2 Attacks Based on Network Layer Side-channel Information

Luzio et al. [19] and Cunche et al. [17] pointed out that even if the communication
content is encrypted, attackers may still obtain information related to user privacy by
analyzing the network layer traffic characteristics. Take the widely used SSL/TLS
mechanism as an example: although no known technique can directly break the TLS protocol, analyzing the traffic characteristics of TLS can still reveal information about the sender, the receiver, and the content without decrypting the user payload. Li et al. [33] pointed out that the
encrypted traffic generated in the communication process can be used by attackers to
speculate the user’s gender, age, and other information. Taylor et al. [53] proposed
a method to use metadata and machine learning to detect applications installed
on equipment, with a detection success rate of more than 70%. Wang et al. [57]
analyzed anonymous network traffic and used the k-nearest neighbors algorithm to
speculate website records visited by users with 80% accuracy. Panchenko et al. [43]
analyzed the inflow and outflow characteristics of user packets and used the Radial
Basis Function (RBF) kernel-based support vector machine technology can classify
100 websites with an accuracy rate of 93%. The above facts show that even if the
traffic is encrypted, a large amount of information can still be obtained through the
analysis of metadata.
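The metadata-only fingerprinting described above can be illustrated with a small sketch. The feature set and classifier below are deliberately tiny stand-ins for the cited systems, which use far richer features:

```python
import numpy as np

def trace_features(sizes):
    """Summarize one encrypted trace by metadata only.

    Packet sizes are signed: positive = outgoing, negative = incoming
    (a common website-fingerprinting convention). The 4-dimensional
    feature vector (packet counts and byte totals per direction) is
    chosen purely for illustration.
    """
    s = np.asarray(sizes, dtype=float)
    out_s, in_s = s[s > 0], -s[s < 0]
    return np.array([len(out_s), len(in_s), out_s.sum(), in_s.sum()])

def knn_predict(train_feats, train_labels, feat, k=3):
    """Classify a trace by majority vote among its k nearest neighbors,
    in the spirit of the k-NN fingerprinting approach."""
    dist = np.linalg.norm(train_feats - feat, axis=1)
    votes = [train_labels[i] for i in np.argsort(dist)[:k]]
    return max(set(votes), key=votes.count)
```

Even this crude summary separates traces with different size profiles, which is exactly why encrypted payloads alone do not guarantee privacy.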
In terms of the manners to obtain side-channel information on the network layer
side, a common method is to use malicious Wi-Fi hotspots. In smart home scenarios,
Wi-Fi hotspots are often used in mobile environments where cellular networks are
limited. The existing works [14, 23, 30, 33, 34, 61] have proved the feasibility of
deploying malicious Wi-Fi hotspots. For example, when an attacker turns on a Wi-
Fi hotspot without a password on a smartphone and changes the SSID name to the
same name as the Wi-Fi hotspot in the user’s home or office area (such as “Home”
or “Company Free Wi-Fi”), the surrounding users will mistakenly think that the Wi-
Fi hotspot is a trusted Wi-Fi hotspot and access the hotspot. In this scenario, all the
traffic of the user’s wireless communication will flow through the malicious Wi-Fi
hotspot, and the attacker can use this Wi-Fi network traffic to infer the user’s private
information. Alan et al. [3] further expanded the applicability of such attacks: even when obtaining only the TCP/IP headers of the traffic during the Android application startup phase, the attacker can successfully identify the running mobile application. Conti et al. [15] proposed a new network traffic analysis method, which can not only infer the running applications but also identify some operations performed by the user within those applications.
Compared with the above attack scheme based on network layer traffic analysis,
WindTalker in Chap. 3 conducts cross-layer privacy analysis by integrating network
layer traffic and physical layer CSI information. Specifically, in the attack scenario
of WindTalker, the attacker creates a fake hotspot to attract the target user to connect.
Then, the attacker determines the sensitive time period by sniffing Wi-Fi traffic and analyzes CSI information to infer the user's sensitive keystrokes.
In terms of defense methods, the Traffic Morphing method proposed by Wright et al. [60] disguises authentic traffic as traffic from other websites by constructing a matrix transformation. Dyer et al. [22] proposed sending data packets at fixed time intervals and with fixed sizes to hide metadata, but these methods impose a heavy burden on the system. In Chap. 3, this monograph proposes a lightweight defense mechanism based on physical layer obfuscation of wireless signals.
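A common lightweight variant of the fixed-size idea discussed above is padding each packet up to the next of a few fixed bucket sizes, trading bandwidth overhead for metadata privacy. The bucket boundaries below are illustrative:

```python
def pad_to_bucket(size, buckets=(128, 256, 512, 1024, 1500)):
    """Pad a packet length up to the next fixed bucket so that many
    distinct true lengths become indistinguishable on the wire.
    Packets larger than the last bucket are returned unchanged here;
    in practice they would be fragmented."""
    for b in buckets:
        if size <= b:
            return b
    return size
```

For example, 100-byte and 120-byte packets are both emitted as 128 bytes, so an eavesdropper observing sizes can no longer tell them apart; the cost is the padding overhead on every packet.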

2.1.3 Other Side-channel Attacks

In addition to wireless signals, various other sources of side-channel information exist (e.g., motion sensors, cameras, audio information). Since the main goal of
WindTalker proposed in Chap. 3 of this monograph is to realize a cross-layer side-
channel keystroke inference mechanism, this section will mainly introduce related
research work around keystroke inference.
Motion Sensor Information Based Attacks Owusu et al. [42] proposed a keystroke inference method based on the acceleration sensor, which can recover the 6-digit numeric password on a smartphone. Liu et al. [37] realized keystroke inference for smartwatches: their method uses the acceleration sensor to track the movement of the user's hand on the smartwatch screen and achieves 65% keystroke inference accuracy.
Acoustic Signal-Based Attacks Zhu et al. [73] proposed a text-independent
keystroke inference system. The system uses the microphone of the smartphone
to measure the sound attenuation caused by keystrokes and infers the keystrokes of
users from this. The experimental results show that the system can achieve 72.2%
accuracy. Liu et al. [36] further used audio hardware to achieve millimeter-level
discrimination of sound sources and thus constructed a keystroke inference system
with an accuracy of 94%. Lu et al. [39] placed a mobile phone near the user's touch screen and used the acoustic signals emitted by the phone to build the KeyListener system, which can infer keystroke input on the QWERTY keyboard of the user's touch screen. These works can achieve high accuracy on numeric keyboards or physical QWERTY keyboards.
Camera Video Signal-Based Attacks Yue et al. [64] proposed a camera-based
keystroke inference mechanism using commercial devices such as Google Glass.
Shukla et al. [49] proposed a video-based analysis method. This method infers the
user’s input according to the temporal and spatial characteristics of the keyboard
tapping in the video. Sun et al. [51] leveraged a camera to record the motion of the back side of a tablet and inferred the content entered by the user.
In contrast, WindTalker proposed in Chap. 3 neither requires the target user to hit the keyboard at a fixed location nor needs the sniffing device to be placed close to the user. In addition, WindTalker can obtain network layer traffic information, which greatly improves its reliability in real environments.

2.2 Voice Spoofing Attacks in Voice Interface

This section will introduce the spoofing attacks faced by the voice interface and
review the current defense schemes, including both two-factor authentication and
voice signal-based passive liveness detection.

2.2.1 Voice Spoofing Attacks

Although the voice interface is considered the most promising user interface in smart homes, it also introduces new security problems due to the inherent broadcast property of the voice channel, which makes it vulnerable to spoofing attacks. The most direct one is the replay attack, in which the attacker prerecords voice samples of legitimate users and then replays them through high-quality speakers to deceive the voice interface [20]. The replay attack is the most effective because it preserves as many properties of the original audio as possible; however, it is also poorly concealed. Therefore, to achieve more covert and efficient voice spoofing attacks, researchers have proposed the following two types of advanced attacks that exploit the software and hardware defects of the voice interface.
Adversarial Example Attacks Exploiting Software Defects Voice interfaces generally adopt deep learning based speech recognition and speaker recognition algorithms. However, these algorithms are vulnerable to emerging adversarial example attacks. In 2016, Carlini et al. [8] proposed the hidden voice command attack, in which the attacker transforms a voice sample so that the converted audio will

be interpreted as noise by the human auditory system, but the speech recognition
system will still recognize it as a valid and malicious attack command. Yuan et
al. [63] further proposed the CommanderSong attack. In this attack, malicious
voice commands are embedded into a song. To human listeners, the generated audio sounds similar to the original song and thus arouses no suspicion, but the speech recognition system will still recognize the embedded attack instructions after feature processing.
The above attack methods have different limitations: the black-box variant of the hidden voice command attack uses an inverse-MFCC method, which requires substantial computing resources, while the CommanderSong attack only targets the speech recognition component and does not address the user authentication component of the voice control system. Furthermore, Zhang et al. [68] proposed the VMASK attack to break the speaker recognition system. Based on the principle of voice adversarial examples, VMASK adds a small perturbation so that a voice perceived by human hearing as user A is recognized by the speaker recognition system as user B; it can attack the Apple Siri system with a success probability of more than 20%.
Inaudible Attacks Based on Hardware Defects In addition to the adversarial
example attacks from the software flaw, another type of voice spoofing attack
comes from the hardware flaw. A typical example is the “Dolphin Attack” proposed
by Zhang et al. [66]. In this attack, the attacker can send ultrasonic signals
that are not perceived by the human ear, thus inducing voice interfaces (e.g.,
Apple Siri, Amazon Alexa) to recognize them as voice instructions and trigger
dangerous subsequent operations. The principle of the dolphin attack is that the amplitude-frequency response of microphone hardware is nonlinear. Therefore, by modulating low-frequency voice commands onto high-frequency ultrasound, the attacker ensures that the microphone of the voice interface still demodulates the
ultrasonic signal to recover the low-frequency voice commands. In addition, for
voice interfaces that require speaker authentication (e.g., Apple Siri), the
dolphin attack spoofs the authentication with the victim's voiceprint via
text-to-speech (TTS) based brute-force cracking and splicing synthesis.
TTS-based brute-force cracking uses different TTS parameters to obtain audio with different tones
and timbres. The splicing synthesis method searches the phonemes required for the
synthesis command from the collected victim recordings and then generates voice
commands. Roy et al. [46] proposed a similar attack called BackDoor. In addition
to ultrasonic-based schemes, Sugawara et al. [50] showed that the voice
interface can be deceived by modulated laser light without attracting the user's attention.
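The nonlinearity principle can be illustrated with a rough numerical simulation, assuming a simple square-law distortion model (the real hardware characterized in [66] is more complex): a command tone amplitude-modulated onto an ultrasonic carrier reappears at baseband once the quadratic term of the microphone nonlinearity demodulates the envelope.

```python
import numpy as np

fs = 192_000                        # simulation sample rate (Hz)
t = np.arange(0, 0.05, 1 / fs)

f_cmd, f_carrier = 1_000, 30_000    # toy "voice command" tone; ultrasonic carrier
baseband = np.cos(2 * np.pi * f_cmd * t)

# Amplitude-modulate the command onto the inaudible carrier.
transmitted = (1 + 0.8 * baseband) * np.cos(2 * np.pi * f_carrier * t)

# Model the microphone as y = x + a2*x^2: the quadratic term
# demodulates the envelope back down to audible frequencies.
received = transmitted + 0.1 * transmitted**2

spectrum = np.abs(np.fft.rfft(received))
freqs = np.fft.rfftfreq(len(received), 1 / fs)

# Energy reappears at f_cmd even though only ultrasound was transmitted.
band = (freqs > 500) & (freqs < 1500)
peak = freqs[band][np.argmax(spectrum[band])]
print(round(peak))   # → 1000
```

The speech recognizer then sees an ordinary 1 kHz component, while a human listener hears nothing, since all transmitted energy sits near 30 kHz.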

2.2.2 Two-Factor Authentication-Based Liveness Detection

In order to thwart voice spoofing attacks, researchers have proposed a variety
of schemes. Almost all schemes are based on the fact that voice commands
in voice spoofing attacks are played by electronic devices (e.g., high-quality
speakers [56], ultrasonic dynamic speakers [66]), while real voice is generated by
the movement of the user’s mouth. Therefore, the different physical characteristics
between humans and electrical machines can be used as liveness factors for liveness
detection. The existing schemes can be divided into two categories: two-factor
authentication-based liveness detection and passive liveness detection. This
subsection introduces the former, and the latter will be introduced in Sect. 2.2.3.
The two-factor authentication-based liveness detection scheme refers to using
some other information highly related to the voice interface as the user’s liveness
factor to determine whether the voice command is sent by the real user. Depending
on the second factor selected, there are several types of two-factor
authentication. Chen et al. [13] pointed out that when an attacker tries to spoof
the voice interface, the vibration of the loudspeaker causes changes in the
surrounding magnetic field; therefore, electromagnetic equipment can be used to
capture and analyze such changes for liveness detection. Feng et al. [24] used
wearable devices, such as glasses, earphones, and necklaces, to collect the user’s
body acceleration data and match it with the user’s voice signal to achieve liveness
detection. Zhang et al. [69, 70] used a smartphone to send ultrasonic waves to the
user’s face and monitor its reflection, captured the ultrasonic Doppler frequency
shift caused by the user’s mouth movement, and used it to judge the authenticity
of voice commands. However, this solution requires users to hold smartphones to
ensure stable and reliable liveness detection. Lee et al. [31] require the
user to move while speaking voice commands and use the ultrasound emitted by the
smart speaker to sense the user's moving direction. By comparing the
consistency of the user’s body movement perceived by the ultrasound with the user’s
voice direction change collected by the voice interface, liveness detection is realized.
It should be pointed out that these existing voice interface defense methods either
need to introduce some special sensing devices or require users to interact in a
fixed way, and the newly introduced devices will also bring new privacy risks to
users. The effectiveness and practicality of existing schemes in the smart home
network thus need further enhancement, so Chap. 4 proposes a more convenient
defense strategy.
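As a back-of-the-envelope check of why such ultrasonic sensing works (this is just the standard reflection Doppler relation, not the processing pipeline of [69, 70]), the frequency shift caused by a mouth or body moving at speed v toward the transceiver is approximately 2·v·f0/c:

```python
C = 343.0            # speed of sound in air, m/s

def doppler_shift(f0_hz: float, v_ms: float) -> float:
    """Frequency shift of ultrasound reflected by a target moving
    at v_ms toward the transceiver: 2 * v * f0 / c."""
    return 2.0 * v_ms * f0_hz / C

# A 20 kHz probe tone and mouth movement of a few cm/s yield a shift of
# several hertz -- small but resolvable with a long enough FFT window.
print(round(doppler_shift(20_000, 0.05), 1))   # → 5.8
```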

2.2.3 Voice Signal-Based Passive Liveness Detection

Although the two-factor authentication-based defense mechanism has achieved
good results, it still needs to use information different from the voice signal, and its
flexibility is limited. Therefore, how to rely only on the audio collected by the voice
interface for passive liveness detection has become an important topic. Researchers
have made a series of achievements around this topic.
Shiota et al. [48] and Wang et al. [55] used the pop noise of human speech to
distinguish between voice commands generated by authentic people and devices.
These schemes are based on the observation that a real user breathes and
slightly moves the mouth when speaking, producing bursts in the collected audio,
whereas in a replay attack the frequency band where the bursts are located is
suppressed by the electrical loudspeaker, which enables liveness detection. However,
this scheme requires users to be close to the voice interface, which is suitable for
smartphone interaction scenarios but not for the smart speaker scenario. Yan et al. [62]
proposed the concept of “fieldprint” feature to detect spoofing attacks. The insight
is that different audio sources (i.e., different humans and loudspeaker devices)
generate audio in different ways, which will produce unique field characteristics in
the sound transmission process. Therefore, leveraging the smartphone’s microphone
pair to measure the change in the field pattern can detect the voice spoofing attack.
However, this method also requires users to hold the mobile phone in a fixed way
and stay close to it, which is not suitable for the smart speaker scenario. In
addition, Blue et al. [7] and Ahmed et al. [2] use the spectrum pattern of mono
voice signals to identify spoofing attacks to achieve lightweight authentication.
However, due to the instability in audio transmission, these two types of schemes
still face the problem of insufficient accuracy. Zhang et al. [65] proposed EarArray
to thwart dolphin attacks, but it is not intended to detect spoofing audio with human
voice frequency. Therefore, to overcome the above-mentioned methods’ issues, in
Chap. 5, we propose a robust and efficient passive detection mechanism that only
relies on voice.
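As a toy illustration of spectrum-based passive detection in the spirit of [2, 7] (the actual systems use much richer features and trained classifiers; the signals and cutoff frequency below are synthetic assumptions), one can compare how much of a mono signal's energy lies in the low band, which small loudspeakers tend to attenuate:

```python
import numpy as np

def band_energy_ratio(signal, fs, split_hz=1000.0):
    """Fraction of spectral energy below split_hz -- a crude liveness cue:
    small loudspeakers often attenuate low frequencies relative to a
    live human voice (illustrative feature only)."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    return spec[freqs < split_hz].sum() / spec.sum()

fs = 16_000
t = np.arange(0, 0.5, 1 / fs)
# Synthetic "live" voice: strong fundamental at 200 Hz plus a harmonic.
live = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)
# Synthetic "replayed" voice: low end attenuated by the loudspeaker.
replay = 0.2 * np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

print(band_energy_ratio(live, fs) > band_energy_ratio(replay, fs))   # → True
```

A real detector would of course learn the decision boundary from data rather than rely on a single hand-picked ratio.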

2.3 Misbehaviors in Application Platform

In the smart home network, users can deploy various smart applications on the
cloud server. These smart applications realize home automation by automatically
controlling all kinds of smart devices according to their code rules at runtime.
However, applications may exhibit various misbehaviors, which threaten the
security of smart home systems. This section summarizes the misbehaviors of
smart applications and reviews the existing defense mechanisms.

2.3.1 Misbehaviors of Smart Home Applications

Research shows that smart home application platforms have several defects,
which allow attackers to use smart applications to conduct misbehavior. For
example, Demetriou et al. [18] pointed out that the coarse-grained permission
management of the Android system will enable malicious applications to access
devices such as smart blood glucose meters in the smart home environment without
restriction and leak sensitive data to external attackers. Fernandes et al. [25] pointed
out that in 2016, more than 55% of smart applications (SmartApps) on the Samsung
SmartThings smart home platform were over-authorized. More specifically, the
over-privileged SmartApps can actually access permissions that users think should
not be granted to them. The main reason is that the authorization granularity
of the device function model of the SmartThings platform is too coarse.
These design flaws can enable potentially malicious SmartApps not only to perform
other controls on devices that have not been authorized but also to bring about the
potential risk of eavesdropping on device events or forging smart device events in
the cloud backend.
In addition, even if the logic of a smart application itself is normal, the
protocol used by the application and its associated devices may still be
attacked, making the application misbehave during operation. At the protocol
level, Zillner et al. [74] and Lomas [38] highlighted several security risks in
ZigBee deployment: because manufacturers ship ZigBee devices with a default link
key, the device key is easily leaked, allowing attackers to join the network
communication and forge interactions with devices. Fouladi
et al. [27] analyzed the Z-Wave protocol stack and found a vulnerability in the
AES encryption of a smart lock, which lets attackers forge wireless application
commands and open the smart lock abnormally, gravely harming users' home
security. Researchers have also studied
the weak authentication mechanisms in Bluetooth devices [5, 6, 28]. For the hub,
the central component of the emerging smart home, researchers have revealed
potential security risks. Lemos [32] analyzed the security of hubs in three
smart home systems (i.e., Samsung SmartThings, Vera Control [16], and Wink
[59]) and pointed out security vulnerabilities that make it easy for attackers
to obtain control permissions and further violate the application logic to send
commands to other smart devices.
In summary, smart applications in the current smart home environment exhibit
various misbehaviors, which can be grouped into two categories: over-privileged
access and event spoofing. In the former, the application obtains permissions
that should not be granted and misbehaves; for example, an application
controlling a smart window may open the window upon receiving the user's command
to close it. In the latter, the application automatically executes subsequent
operations although no corresponding event has actually occurred; for example,
an attacker forges an event in the cloud, triggering the smart alarm even though
the environment is normal. These two types of attacks share a common feature:
due to the misbehavior, the operation of the smart application conflicts with
its normal working logic. HoMonit, proposed in Chap. 6, exploits this phenomenon
and leverages side-channel information in the wireless traffic to achieve
real-time misbehavior detection.
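The detection idea can be sketched as conformance checking of observed events against an application's expected state machine (a toy model with hypothetical state and event names; HoMonit in Chap. 6 additionally has to infer the events themselves from encrypted wireless traffic):

```python
# Toy logic-conformance check: compare an observed event sequence
# against the app's expected state machine for a smart-window app.
EXPECTED = {            # state -> {event: next_state}
    "idle":    {"cmd_close": "closing", "cmd_open": "opening"},
    "closing": {"window_closed": "idle"},
    "opening": {"window_opened": "idle"},
}

def conforms(events, start="idle"):
    state = start
    for ev in events:
        if ev not in EXPECTED.get(state, {}):
            return False          # observed behavior deviates from app logic
        state = EXPECTED[state][ev]
    return True

print(conforms(["cmd_close", "window_closed"]))   # normal run → True
print(conforms(["cmd_close", "window_opened"]))   # misbehavior → False
```

Both over-privileged access and event spoofing surface as transitions the automaton does not allow, which is why a single conformance check covers both categories.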

2.3.2 Defense Mechanisms against Misbehaviors

In order to solve the security and privacy risks caused by the misbehavior of
smart home applications, researchers have proposed several defense mechanisms.
In terms of security enhancement, current research mainly focuses on checking
the workflow of smart applications. Fernandes et al. [26] proposed the FlowFence
system, which embeds data flow control components in the smart application
framework and requires data consumers to declare sensitive data flows before
use; otherwise, the data is isolated and protected in a sandbox.
Jia et al. [29] proposed the ContexIoT system. The system designs a data flow
control mechanism in the Samsung SmartThings platform, identifies the sensitive
operations through context information, and feeds back the sensitive operations
to users to help users control the behavior of smart homes. Tian et al. [54]
proposed SmartAuth, a user-centered IoT authorization system for the Samsung
SmartThings smart home application platform. It collects security-related
information from an application's description documents, code, and code
comments, and generates an authorization interface for users, resolving the
inconsistency between the user's understanding of the application's permissions
and its actual operations. Celik et al. [10] proposed SOTERIA, which extracts a
state model of Samsung SmartThings devices, generates security logic from the
security requirements of the smart home (e.g., the smart door lock should be
closed when the user leaves home), and determines whether a device is in a safe
state by matching the security logic against the device state model. Ding et al. [21] put
forward the IoTMon framework on the basis of SOTERIA. It mines all potential
physical interaction chains between applications by analyzing the code of smart
applications on the SmartThings platform, thereby safeguarding the smart home
environment.
In terms of privacy protection for smart applications, Celik et al. [9] proposed
SAINT, a sensitive information tracking mechanism for Samsung SmartThings
applications; it analyzed more than 230 real smart applications and revealed 138
that may cause user privacy disclosure. As an intervention mechanism against
application misbehavior, Celik et al. [11] further proposed IoTGuard on the
basis of SAINT, which statically analyzes smart applications, builds security
rules, and terminates, through a network service, the execution of code that
violates these rules.
However, most of the existing solutions need to modify the platform itself,
inject patches into the running smart application, or even modify the
communication protocol, limiting their generality and usability. The HoMonit
system proposed in Chap. 6 is a third-party system that detects smart
application misbehavior without any modification to existing platforms or
applications.

2.4 Summary

This chapter reviews existing security and privacy issues in smart home networks.
More specifically, we elaborate on the side-channel attacks at the terminal device
layer, which inspires WindTalker in Chap. 3. We review existing spoofing attacks
at the voice interface layer and introduce two-factor authentication and passive
liveness detection, which are the basis of WSVA in Chap. 4 and ArrayID in Chap. 5,
respectively. Finally, we describe two types of misbehavior at the application
platform layer and the corresponding defense strategies, which are the background
of HoMonit proposed in Chap. 6.

References

1. Adib, F., Katabi, D.: See through walls with WiFi! In: ACM Special Interest Group on Data
Communication (SIGCOMM) (2013)
2. Ahmed, M.E., Kwak, I.Y., Huh, J.H., Kim, I., Oh, T., Kim, H.: Void: a fast and light
voice liveness detection system. In: 29th USENIX Security Symposium (USENIX Secu-
rity 20), pp. 2685–2702. USENIX Association (2020). https://www.usenix.org/conference/
usenixsecurity20/presentation/ahmed-muhammad
3. Alan, H.F., Kaur, J.: Can android applications be identified using only TCP/IP headers of their
launch time traffic? In: Proceedings of the 9th ACM Conference on Security & Privacy in
Wireless and Mobile Networks, pp. 61–66. ACM (2016)
4. Ali, K., Liu, A.X., Wang, W., Shahzad, M.: Keystroke recognition using WiFi signals.
In: Proceedings of the 21st Annual International Conference on Mobile Computing and
Networking, pp. 90–102. ACM (2015)
5. Arsene, L.: Wearable plain-text communication exposed through brute-force, bitdefender finds
(2014). https://www.hotforsecurity.com/blog/wearable-plain-text-communication-exposed-
through-brute-force-bitdefender-finds-10973.html
6. AV-TEST: Test: Fitness wristbands reveal data (2015). https://www.av-test.org/en/news/news-
single-view/test-fitness-wristbands-reveal-data/
7. Blue, L., Vargas, L., Traynor, P.: Hello, is it me you’re looking for? Differentiating between
human and electronic speakers for voice interface security. In: Proceedings of the 11th ACM
Conference on Security & Privacy in Wireless and Mobile Networks, pp. 123–133. ACM
(2018)
8. Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., Zhou,
W.: Hidden voice commands. In: Proceedings of USENIX Security Symposium (USENIX
Security), pp. 513–530 (2016)
9. Celik, Z.B., Babun, L., Sikder, A.K., Aksu, H., Tan, G., McDaniel, P., Uluagac, A.S.: Sensitive
information tracking in commodity IoT. In: 27th USENIX Security Symposium (USENIX
Security 18), pp. 1687–1704. USENIX Association, Baltimore (2018). https://www.usenix.
org/conference/usenixsecurity18/presentation/celik
10. Celik, Z.B., McDaniel, P., Tan, G.: Soteria: Automated IoT safety and security analysis. In:
2018 USENIX Annual Technical Conference (USENIX ATC 18), pp. 147–158. USENIX
Association, Boston (2018). https://www.usenix.org/conference/atc18/presentation/celik
11. Celik, Z.B., Tan, G., McDaniel, P.D.: IotGuard: Dynamic enforcement of security and safety
policy in commodity IoT. In: NDSS (2019)
12. Chen, B., Li, H., Li, Z., Chen, X., Xu, C., Xu, W.: ThermoWave: a new paradigm of wireless
passive temperature monitoring via mmWave sensing. In: Proceedings of the 26th Annual
International Conference on Mobile Computing and Networking, MobiCom ’20. Association
for Computing Machinery, New York (2020). https://doi.org/10.1145/3372224.3419184
13. Chen, S., Ren, K., Piao, S., Wang, C., Wang, Q., Weng, J., Su, L., Mohaisen, A.: You can hear
but you cannot steal: Defending against voice impersonation attacks on smartphones. In: 2017
IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 183–195
(2017). https://doi.org/10.1109/ICDCS.2017.133
14. Cheng, N., Wang, X.O., Cheng, W., Mohapatra, P., Seneviratne, A.: Characterizing privacy
leakage of public WiFi networks for users on travel. In: 2013 Proceedings IEEE INFOCOM,
pp. 2769–2777 (2013). https://doi.org/10.1109/INFCOM.2013.6567086
15. Conti, M., Mancini, L.V., Spolaor, R., Verde, N.V.: Analyzing android encrypted network
traffic to identify user actions. IEEE Trans. Inform. Forensics Secur. 11(1), 114–125 (2016)
16. Control, V.: Vera3 advanced smart home controller (2018). http://getvera.com/controllers/
vera3/
17. Cunche, M., Kaafar, M.A., Boreli, R.: Linking wireless devices using information contained
in wi-fi probe requests. Pervasive Mobile Comput. 11, 56–69 (2014). https://doi.org/10.1016/
j.pmcj.2013.04.001
18. Demetriou, S., Zhou, X.y., Naveed, M., Lee, Y., Yuan, K., Wang, X., Gunter, C.A.: What’s
in your dongle and bank account? Mandatory and discretionary protection of android external
resources. In: NDSS (2015)
19. Di Luzio, A., Mei, A., Stefa, J.: Mind your probes: de-anonymization of large crowds
through smartphone WiFi probe requests. In: IEEE INFOCOM 2016 - The 35th Annual IEEE
International Conference on Computer Communications, pp. 1–9 (2016). https://doi.org/10.
1109/INFOCOM.2016.7524459
20. Diao, W., Liu, X., Zhou, Z., Zhang, K.: Your voice assistant is mine: how to abuse speakers
to steal information and control your phone. In: Proceedings of the 4th ACM Workshop on
Security and Privacy in Smartphones & Mobile Devices (SPSM), pp. 63–74 (2014). https://
doi.org/10.1145/2666620.2666623
21. Ding, W., Hu, H.: On the safety of IoT device physical interaction control. In: Proceedings of
the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, pp.
832–846. Association for Computing Machinery, New York (2018). https://doi.org/10.1145/
3243734.3243865
22. Dyer, K.P., Coull, S.E., Ristenpart, T., Shrimpton, T.: Peek-a-boo, i still see you: why efficient
traffic analysis countermeasures fail. In: 2012 IEEE Symposium on Security and Privacy, pp.
332–346 (2012). https://doi.org/10.1109/SP.2012.28
23. Fan, Y., Jiang, Y., Zhu, H., Shen, X.: An efficient privacy-preserving scheme against traffic
analysis attacks in network coding. In: IEEE INFOCOM 2009, pp. 2213–2221 (2009). https://
doi.org/10.1109/INFCOM.2009.5062146
24. Feng, H., Fawaz, K., Shin, K.G.: Continuous authentication for voice assistants. In: Proceed-
ings of the 23rd Annual International Conference on Mobile Computing and Networking,
p. 343–355. Association for Computing Machinery (2017). https://doi.org/10.1145/3117811.
3117823
25. Fernandes, E., Jung, J., Prakash, A.: Security analysis of emerging smart home applications.
In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 636–654 (2016). https://doi.org/
10.1109/SP.2016.44
26. Fernandes, E., Paupore, J., Rahmati, A., Simionato, D., Conti, M., Prakash, A.: FlowFence:
Practical data protection for emerging IoT application frameworks. In: USENIX Security
Symposium (USENIX Security) (2016)
27. Fouladi, B., Ghanoun, S.: Security evaluation of the z-wave wireless protocol. In: Black Hat
USA (2013)
28. Ho, G., Leung, D., Mishra, P., Hosseini, A., Song, D., Wagner, D.: Smart locks: lessons for
securing commodity internet of things devices. In: ACM Asia Conference on Computer and
Communications Security (AsiaCCS) (2016)
29. Jia, Y.J., Chen, Q.A., Wang, S., Rahmati, A., Fernandes, E., Mao, Z.M., Prakash, A.:
ContexIoT: Towards providing contextual integrity to appified IoT platforms. In: The Network
and Distributed System Security Symposium (NDSS) (2017)
30. Konings, B., Bachmaier, C., Schaub, F., Weber, M.: Device names in the wild: investigating
privacy risks of zero configuration networking. In: Mobile Data Management (MDM), 2013
IEEE 14th International Conference on, vol. 2, pp. 51–56. IEEE (2013)
31. Lee, Y., Zhao, Y., Zeng, J., Lee, K., Zhang, N., Shezan, F.H., Tian, Y., Chen, K., Wang, X.:
Using sonar for liveness detection to protect smart speakers against remote attackers. Proc.
ACM Interact. Mob. Wearable Ubiquitous Technol. 4(1), 1–28 (2020). https://doi.org/10.1145/
3380991
32. Lemos, R.: Hubs driving smart homes are vulnerable, security firm finds. In: Eweek (2015)
33. Li, H., Xu, Z., Zhu, H., Ma, D., Li, S., Xing, K.: Demographics inference through wi-fi
network traffic analysis. In: IEEE International Conference on Computer Communications
(INFOCOM) (2016)
34. Li, H., Zhu, H., Du, S., Liang, X., Shen, X.: Privacy leakage of location sharing in mobile social
networks: Attacks and defense. IEEE Trans. Depend. Secure Comput. PP(99), 1–1 (2016).
https://doi.org/10.1109/TDSC.2016.2604383
35. Li, Z., Ma, F., Rathore, A.S., Yang, Z., Chen, B., Su, L., Xu, W.: WaveSpy: remote and through-
wall screen attack via mmWave sensing. In: 2020 IEEE Symposium on Security and Privacy
(SP), pp. 217–232 (2020). https://doi.org/10.1109/SP40000.2020.00004
36. Liu, J., Wang, Y., Kar, G., Chen, Y., Yang, J., Gruteser, M.: Snooping keystrokes with mm-level
audio ranging on a single phone. In: Proceedings of the 21st Annual International Conference
on Mobile Computing and Networking, pp. 142–154. ACM (2015)
37. Liu, X., Zhou, Z., Diao, W., Li, Z., Zhang, K.: When good becomes evil: keystroke inference
with smartwatch. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and
Communications Security, pp. 1273–1285. ACM (2015)
38. Lomas, N.: Critical flaw IDed in ZigBee smart home devices (2015). https://techcrunch.com/
2015/08/07/critical-flaw-ided-in-zigbee-smart-home-devices/
39. Lu, L., Yu, J., Chen, Y., Zhu, Y., Xu, X., Xue, G., Li, M.: KeyListener: inferring keystrokes
on qwerty keyboard of touch screen through acoustic signals. In: IEEE INFOCOM 2019—
IEEE Conference on Computer Communications, pp. 775–783 (2019). https://doi.org/10.1109/
INFOCOM.2019.8737591
40. Ma, D., Lan, G., Hassan, M., Hu, W., Upama, M.B., Uddin, A., Youssef, M.: SolarGest:
ubiquitous and battery-free gesture recognition using solar cells. In: The 25th Annual
International Conference on Mobile Computing and Networking, MobiCom ’19. Association
for Computing Machinery, New York (2019). https://doi.org/10.1145/3300061.3300129
41. Niu, X., Li, S., Zhang, Y., Liu, Z., Wu, D., Shah, R.C., Tanriover, C., Lu, H., Zhang, D.:
Wimonitor: Continuous long-term human vitality monitoring using commodity wi-fi devices.
Sensors 21(3) (2021). https://www.mdpi.com/1424-8220/21/3/751
42. Owusu, E., Han, J., Das, S., Perrig, A., Zhang, J.: Accessory: password inference using
accelerometers on smartphones. In: Proceedings of the Twelfth Workshop on Mobile Com-
puting Systems & Applications, pp. 1–6 (2012)
43. Panchenko, A., Lanze, F., Pennekamp, J., Engel, T., Zinnen, A., Henze, M., Wehrle, K.:
Website fingerprinting at internet scale. In: NDSS (2016)
44. Pierson, T.J., Peters, T., Peterson, R., Kotz, D.: Proximity detection with single-antenna IoT
devices. In: Proceedings of the 24th Annual International Conference on Mobile Computing
and Networking, MobiCom ’18, pp. 663–665. Association for Computing Machinery, New
York (2018). https://doi.org/10.1145/3241539.3267751
45. Qian, K., Wu, C., Zhang, Y., Zhang, G., Yang, Z., Liu, Y.: Widar2.0: passive human tracking
with a single wi-fi link. In: Proceedings of the 16th Annual International Conference on Mobile
Systems, Applications, and Services, MobiSys ’18, pp. 350–361. Association for Computing
Machinery, New York (2018). https://doi.org/10.1145/3210240.3210314
46. Roy, N., Hassanieh, H., Roy Choudhury, R.: Backdoor: Making microphones hear inaudible
sounds. In: Proceedings of the 15th ACM Annual International Conference on Mobile Systems,
Applications, and Services (MobiSys), pp. 2–14 (2017). https://doi.org/10.1145/3081333.
3081366
47. Shi, C., Liu, J., Liu, H., Chen, Y.: Smart user authentication through actuation of daily activities
leveraging WiFi-enabled IoT. In: Proceedings of the 18th ACM International Symposium on
Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 5:1–5:10 (2017). https://doi.org/
10.1145/3084041.3084061
48. Shiota, S., Villavicencio, F., Yamagishi, J., Ono, N., Echizen, I., Matsui, T.: Voice liveness
detection algorithms based on pop noise caused by human breath for automatic speaker
verification. In: Sixteenth Annual Conference of the International Speech Communication
Association (2015)
49. Shukla, D., Kumar, R., Serwadda, A., Phoha, V.V.: Beware, your hands reveal your secrets!
In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications
Security, pp. 904–917. ACM (2014)
50. Sugawara, T., Cyr, B., Rampazzi, S., Genkin, D., Fu, K.: Light commands: laser-based
audio injection attacks on voice-controllable systems. In: 29th USENIX Security Symposium
(USENIX Security 20), pp. 2631–2648. USENIX Association (2020). https://www.usenix.org/
conference/usenixsecurity20/presentation/sugawara
51. Sun, J., Jin, X., Chen, Y., Zhang, J., Zhang, R., Zhang, Y.: Visible: video-assisted keystroke
inference from tablet backside motion. In: Network and Distributed System Security Sympo-
sium, pp. 1–15 (2016)
52. Tan, S., Yang, J.: WiFinger: leveraging commodity WiFi for fine-grained finger gesture
recognition. In: Proceedings of the 17th ACM International Symposium on Mobile Ad Hoc
Networking and Computing, pp. 201–210. ACM (2016)
53. Taylor, V.F., Spolaor, R., Conti, M., Martinovic, I.: AppScanner: automatic fingerprinting of
smartphone apps from encrypted network traffic. In: Security and Privacy (EuroS&P), 2016
IEEE European Symposium on, pp. 439–454. IEEE (2016)
54. Tian, Y., Zhang, N., Lin, Y.H., Wang, X., Ur, B., Guo, X., Tague, P.: SmartAuth: user-centered
authorization for the internet of things. In: USENIX Security Symposium (USENIX Security)
(2017)
55. Wang, Q., Lin, X., Zhou, M., Chen, Y., Wang, C., Li, Q., Luo, X.: VoicePop: a pop noise
based anti-spoofing system for voice authentication on smartphones. In: IEEE INFOCOM
2019-IEEE Conference on Computer Communications, pp. 2062–2070. IEEE (2019)
56. Wang, S., Cao, J., He, X., Sun, K., Li, Q.: When the differences in frequency domain are
compensated: Understanding and defeating modulated replay attacks on automatic speech
recognition. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and
Communications Security, CCS ’20, p. 1103–1119. Association for Computing Machinery
(2020). https://doi.org/10.1145/3372297.3417254
57. Wang, T., Cai, X., Nithyanand, R., Johnson, R., Goldberg, I.: Effective attacks and provable
defenses for website fingerprinting. In: 23rd USENIX Security Symposium (USENIX Security
14), pp. 143–157. USENIX Association, San Diego (2014)
58. Wang, W., Liu, A.X., Shahzad, M., Ling, K., Lu, S.: Understanding and modeling of WiFi
signal based human activity recognition. In: Proceedings of the 21st Annual International
Conference on Mobile Computing and Networking, pp. 65–76 (2015)
59. Wink: Wink: a simpler, smarter home (2018). https://www.wink.com/
60. Wright, C.V., Coull, S.E., Monrose, F.: Traffic morphing: an efficient defense against statistical
traffic analysis. In: NDSS, vol. 9. Citeseer (2009)
61. Xia, N., Song, H.H., Liao, Y., Iliofotou, M., Nucci, A., Zhang, Z.L., Kuzmanovic, A.:
Mosaic: quantifying privacy leakage in mobile networks. In: ACM SIGCOMM Computer
Communication Review, vol. 43, pp. 279–290. ACM (2013)
62. Yan, C., Long, Y., Ji, X., Xu, W.: The catcher in the field: A fieldprint based spoofing
detection for text-independent speaker verification. In: Proceedings of the 2019 ACM SIGSAC
Conference on Computer and Communications Security, CCS ’19, pp. 1215–1229. Association
for Computing Machinery (2019). https://doi.org/10.1145/3319535.3354248
63. Yuan, X., Chen, Y., Zhao, Y., Long, Y., Liu, X., Chen, K., Zhang, S., Huang, H., Wang,
X., Gunter, C.A.: CommanderSong: A systematic approach for practical adversarial voice
recognition. In: 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64.
USENIX Association, Baltimore (2018). https://www.usenix.org/conference/usenixsecurity18/
presentation/yuan-xuejing
64. Yue, Q., Ling, Z., Fu, X., Liu, B., Ren, K., Zhao, W.: Blind recognition of touched keys on
mobile devices. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and
Communications Security, pp. 1403–1414. ACM (2014)
65. Zhang, G., Ji, X., Li, X., Qu, G., Xu, W.: EarArray: Defending against DolphinAttack via
acoustic attenuation. In: NDSS (2021)
66. Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., Xu, W.: DolphinAttack: inaudible voice
commands. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Com-
Chapter 3
Privacy Breaches and Countermeasures
at Terminal Device Layer

3.1 Introduction

In the smart home environment, smart terminal devices such as smartphones and
tablets are widely used. Users rely on these terminals for many sensitive
interactions, such as bank transfers, payment password entry, and social
applications. Smart terminals differ greatly from traditional static devices
(such as bank ATMs): a static device generally connects to a secure network and
is used within a secure physical area, whereas a smart terminal is carried by a
mobile user and connects to dynamic mobile networks. Therefore, an attacker can steal
the user’s private information entered by keystroke on the terminal in various direct
or indirect ways [1, 2, 12, 13, 15, 18, 23, 30, 32].
Direct attack refers to the attacker’s direct observation of the user’s keystroke
input on the intelligent terminal, such as peeking at the victim’s input on the touch
screen or keyboard of the terminal through the camera. Side-channel attack refers to
the attacker inferring the input information of the target device by using information
that is not directly related to the user’s keystroke behavior. In the side-channel attack
scenario, the attacker can use the electromagnetic signal of the antenna [1], the
audio signal of the microphone [2, 12, 32], the video signal of the camera [23, 30],
and the motion state obtained by the sensor [13, 15, 18] and other side-channel
information to obtain the user’s keystroke information. Compared with the direct
attack, the side-channel attack can steal private information without the user’s
awareness, so it has received extensive attention.
Currently, to access the side-channel information, the works mentioned above
often assume either external signal collector devices are physically close to the target
device (for example, 30 cm in [1]) or the target device’s sensors are compromised.
However, in a mobile scenario, both assumptions are hardly true, and the impact of
attacks is thus limited. In addition, prior works have studied keystroke inference
aiming at achieving a high inference accuracy on a series of keystrokes for a


relatively long time. However, the keystrokes on a mobile device are not always
highly sensitive. The eavesdropping attacker has a greater interest in obtaining the
payment PIN in a short moment than regular typing information. Therefore, the
application context information needs to be considered in the keystroke inference
framework to increase the inference accuracy and efficiency.
This chapter presents WindTalker, a novel and practical keystroke inference
framework that can infer sensitive keystrokes on a mobile device based on Wi-Fi
signals. The proposed WindTalker is motivated by an observation that the typing
behavior on mobile devices involves hand and finger motions, which generate
significant interference to the multi-path Wi-Fi signals from the target device to
the Wi-Fi router that connects to the device. The attacker can exploit the strong
correlation between the CSI fluctuation and the keystrokes to infer the user’s
number input. Unlike prior side-channel attacks or traditional CSI-based gesture
recognition, WindTalker neither deploys external devices close to the target device
nor compromises any part of the target device. Instead, WindTalker setups a “rogue”
hotspot to lure the target user with free Wi-Fi service, which is easy to deploy and
difficult to detect. As long as the target mobile device connects to the rogue Wi-
Fi hotspot, WindTalker intercepts the Wi-Fi traffic and selectively collects the CSI
between the target device and the hotspot.
The study of WindTalker in this chapter has four major technical challenges. (i)
The impact of keystrokes’ hand and finger movement on CSI waveforms is very
subtle. An effective signal analysis method is needed to analyze keystrokes from
the limited CSI. (ii) The prior CSI collection method requires deploying two Wi-Fi
devices (i.e., one as a signal sender and the other as a signal receiver) close to the
victim. A more flexible and practical CSI collection method is highly desirable for
the mobile device scenario. (iii) The keystroke inference must select some sensitive
moments, such as the payment PIN. Prior works have not addressed such context-
oriented CSI collection. (iv) We need to devise a lightweight defense method to
thwart WindTalker.
The contributions of this chapter are summarized as follows:
• We present a practical cross-layer-based approach for mobile payment password
inference on smartphones using public Wi-Fi architecture. We propose a novel
password inference model that analyzes physical layer information (CSI) and
network layer traffic.
• We present a novel ICMP-based CSI collection method without deploying an
external device very close to the victim’s device or compromising the victim’s
device. We develop an IP pool-based method to recognize the PIN input period.
• We conduct a case study on password inference at the mobile payment platform
Alipay, which is secured by the HTTPS protocol and thus traditionally believed
to be secure. We investigate the impact of various factors on WindTalker,
demonstrating that WindTalker can infer the password with a highly successful
rate.
• We introduce some effective countermeasures to thwart the inference attack.
More specifically, we propose a novel CSI obfuscation algorithm to prevent the

attacker from collecting accurate CSI data without the requirements of the user’s
participation.
The remainder of this chapter is organized as follows. Section 3.2 introduces the
preliminary knowledge and the principles of WindTalker. In Sect. 3.3, we introduce
the architecture and design details of each module of WindTalker. We evaluate
the performance of WindTalker in the keystroke inference attack and discuss the
impact of various factors in Sect. 3.4. Section 3.5 shows the cross-layer attack ability
of WindTalker on a real-world payment platform, Alipay. Finally, Sect. 3.6
introduces the countermeasures against WindTalker and Sect. 3.7 summarizes this
chapter.

3.2 Background Knowledge and Attack Principle

This section introduces the channel state information of wireless signals, the attack
model, and the design principles behind WindTalker.

3.2.1 Channel State Information of Wireless Signal

The basic insight of WindTalker is measuring the impact of hand and finger
movement on Wi-Fi signals and leveraging the correlation between CSI and the
unique hand motion to recognize the password inputted by the user. Below, we
briefly introduce the CSI-related backgrounds.
To improve the channel capacity of the wireless system, Wi-Fi Standards like
IEEE 802.11n/ac support Multiple-Input Multiple-Output (MIMO) and Orthogonal
Frequency Division Multiplexing (OFDM). In a system with transmitter antenna
number NT X , receiver antenna number NRX and OFDM subcarriers number Ns ,
NT X × NRX × Ns subcarriers will be used to transmit signal at the same time.
CSI measures the channel frequency response (CFR) in different subcarriers f .
CFR H (f, t) represents the state of the wireless channel in a signal transmission
process. Let X (f, t) and Y (f, t) represent the transmitted and received signal with
different subcarrier frequency. H (f, t) can be calculated in the receiver using a
known transmitted signal via

Y (f, t) = H (f, t) × X (f, t) . (3.1)

The received signal reflects the constructive and destructive interference of
several multi-path signals scattered from the walls/objects and the movements of the
fingers. Thus, the password input can generate a unique pattern in the time series of
CSI values, which can be used for keystroke recognition.
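Equation (3.1) can be illustrated with a short numeric sketch: given known transmitted symbols and the received symbols on each subcarrier, the receiver recovers the CFR by element-wise division. All values below are toy numbers, not measured CSI.

```python
import numpy as np

# Known transmitted symbols X(f) and received symbols Y(f) on a toy
# channel with four subcarriers (complex baseband values; illustrative).
X = np.array([1 + 0j, 1 + 0j, -1 + 0j, -1 + 0j])
H_true = np.array([0.9 + 0.1j, 0.8 - 0.2j, 1.1 + 0.05j, 0.7 + 0.3j])
Y = H_true * X                      # Eq. (3.1): Y(f, t) = H(f, t) x X(f, t)

# The receiver recovers the CFR per subcarrier by element-wise division.
H_est = Y / X

# The CSI amplitude analyzed throughout this chapter is |H(f, t)|.
amplitude = np.abs(H_est)
```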

Currently, many commercial devices such as Atheros 9390 [22], Atheros 9580
[29], and Intel 5300 [8] network interface cards (NICs) with special drivers provide
open access to CSI values. In this chapter, we adopt Intel 5300 NICs, which follow
the IEEE 802.11n standard [10] and can work at 2.4 GHz or 5.8 GHz. During Wi-Fi
communication, each TX-RX antenna pair carries 64 OFDM subcarriers, of which
56 are in use; the Intel 5300 NIC extracts a group of 30 of these subcarriers and
returns the corresponding CSI values. Therefore, in this chapter, the CSI acquired
for a single antenna pair is a 30-dimensional time series.

3.2.2 Attack Model


3.2.2.1 Attack Assumption and Scenario

In the privacy attack scenario against smart terminals, this chapter assumes that
the victim user carries a mobile terminal such as a smartphone and accesses
Internet services through a Wi-Fi hotspot. An attacker places a wireless device
(e.g., a smartphone, router) near the user’s area to establish a Wi-Fi hotspot. To
disguise the hotspot as a public hotspot, the service set identifier (SSID) of the
hotspot is a deceptive name (such as “public free Wi-Fi”), and users do not need
to enter a password to connect. In this scenario, users choose to connect to the
hotspot. Because the traffic is protected by application layer encryption protocols
(e.g., HTTPS), users believe the communication process is secure; in other words,
they believe all private information will be shared only between themselves and the
service providers.
However, in this chapter, we will reveal that user privacy cannot be protected only
by application layer encryption. Our WindTalker framework presents an effective
keystroke inference method targeting the mobile terminal device.
This attack mode does not require the attacker to physically approach the user’s
mobile terminal, nor does it require the attacker to invade the user’s terminal
equipment, so it is highly practical. We also assume that attackers can deploy
a Wi-Fi hotspot near users' homes to carry out attacks. In addition, smart home
technology is used not only in private homes but also in semi-open environments
such as apartments, hotels, and shopping malls. In these environments, a deployed
Wi-Fi hotspot is harder for users to notice and easier for attackers to operate.
Therefore, WindTalker is feasible in multiple types of smart home and even mobile
communication scenarios.

3.2.2.2 Keystroke Inference Model

Based on whether the terminal device participates in the CSI signal acquisition
process, we can divide the keystroke inference mechanism into two models: in-
band keystroke inference (IKI) and out-band keystroke inference (OKI). Note that

Fig. 3.1 Wi-Fi based keystroke inference models. (a) IKI attack model. (b) OKI attack model

the existing works about CSI-based side-channel attacks [1, 25, 31] choose the OKI
model. As shown in Fig. 3.1b, the adversary deploys two COTS Wi-Fi devices close
to the target device and ensures the target device is placed right between two COTS
Wi-Fi devices. One is the sender device continuously emitting signals, and the other
is the receiver device continuously receiving the signals. The keystrokes are inferred
from the multi-path distortions in signals.
Different from existing works [1, 25, 31], WindTalker chooses the IKI model. As
shown in Fig. 3.1a, WindTalker deploys one Commercial Off-The-Shelf (COTS)
Wi-Fi device close to the target device, which could be a Wi-Fi hotspot. The Wi-
Fi hotspot provides free Wi-Fi networks for nearby users. When a user connects
her device to the hotspot, the Wi-Fi hotspot can monitor the application context by
checking the pattern of the transmitted Wi-Fi traffic. In addition, the Wi-Fi hotspot
periodically sends ICMP packets to obtain the CSI information from the target
device. With the metadata of the Wi-Fi traffic collected by the hotspot, WindTalker
knows when sensitive operations (such as typing passwords) happen. The hotspot
then adaptively launches a CSI-based keystroke inference method to recognize
sensitive key inputs. To the best of our knowledge, the IKI method we propose is
the first to use existing network protocols of the IEEE 802.11n/ac standard to obtain
both the application context and the CSI information on mobile devices.
Compared with the OKI model, the proposed IKI model has the following
advantages. Firstly, the IKI model does not require the placement of both sender
and receiver devices and can be deployed more flexibly and stealthily. Secondly,
in the OKI model, the victim’s device is not connected to the attacker’s device,
so the attacker cannot obtain the Wi-Fi traffic from the user’s device. Therefore,
the OKI model fails to differentiate the non-sensitive operations on mobile devices
(e.g., clicking the screen to open an APP or just for web browsing) from sensitive
operations (e.g., inputting the password). Instead, the IKI model allows the attacker
to obtain both unencrypted metadata traffic and the CSI data to launch a more fine-
grained attack.

3.2.3 Principles of Privacy Inference Attack

In this subsection, we use real-world experiments to illustrate the rationale behind
CSI-based keystroke inference on the terminal device (e.g., smartphone). Figure 3.2a
shows the sketch of a typical touch screen during PIN entry for mobile payment
(e.g., Alipay or WeChat Pay). We particularly focus on the vertical
and the oblique touch, the two most common touching gestures [3, 6, 27]. As shown
in Fig. 3.2b, oblique touch is the most common typing gesture, which happens when
people press different keys, and vertical touch usually happens when the human
continuously presses the same key.
Figure 3.2c shows the original CSI waveforms from the 21st subcarrier to the
30th subcarrier during a given keystroke. We can observe that CSI waveforms
collected by Intel 5300 NIC are affected by the keystroke, and the fluctuations of
these ten waveforms are similar. Figure 3.2d shows the processed CSI value for the
keystroke. We find that the pattern of processed CSI value is very closely related to
oblique and vertical touch, which can characterize the corresponding keystroke.
We further investigate how these two common typing gestures influence CSI.
Generally speaking, since CSI reflects the constructive and destructive interference
of several multi-path signals, the change of multi-path propagation during the PIN
entry can generate a unique pattern in the time series of CSI values, which can be
used for keystrokes inference. Our experiments demonstrate that hand coverage and
finger click are two main factors contributing to CSI changes.
Hand coverage and finger position on a smartphone touchscreen are among the
major factors that cause the fluctuation of the CSI waveform. Since the time series
of the CSI waveform reflects the interference of several multi-path signals, different
finger positions and coverages will inevitably introduce interference to the Wi-Fi
signals and thus lead to changes in the CSI. We further demonstrate the relationship
between CSI variation and finger position/coverage via experiments. Figure 3.3a
shows a CSI waveform when continuously pressing different numbers from 1 to 9,
followed by 0, while each number is clicked five times. We can see that the different


Fig. 3.2 Hand movement and CSI changes during keystrokes. (a) An illustration of finger
keystroke. (b) Hand movement: vertical touch and oblique touch. (c) CSI waveforms from the
21st to the 30th subcarriers during a keystroke. (d) Impact of hand movement on CSI


Fig. 3.3 CSI changes when inputting keystrokes in the numerical keyboard. (a) Continuously
clicking on the numerical keyboard. (b) Continuously clicking in different keys

coverage leads to the different fluctuation ranges of the CSI value, which can be
exploited for key inference.
Finger click is another important factor contributing to the fluctuation of CSI.
Compared with the CSI changes caused by hand coverage, the experiment shows
that a finger click has a more direct influence on CSI by introducing a sharp
peak, as shown in Fig. 3.3b, which corresponds to the quick click's influence on
multi-path propagation. This feature can be used to distinguish keystrokes in the
case that the user continuously presses the same key or adjacent keys, which
produce similar CSI values.
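The sharp peak that a quick finger click introduces into the CSI amplitude can be located with a very simple detector. The sketch below uses synthetic data and a hypothetical absolute threshold; it is not the exact WindTalker segmentation algorithm.

```python
import numpy as np

def detect_clicks(csi, threshold=1.0):
    """Flag sample indices where the CSI amplitude jumps by more than
    `threshold` between consecutive samples (a hypothetical rule,
    not WindTalker's actual detector)."""
    return np.where(np.abs(np.diff(csi)) > threshold)[0]

# Synthetic CSI amplitude: slow drift plus one click-like sharp peak.
t = np.arange(500)
csi = 30 + 0.5 * np.sin(t / 50.0)
csi[250:255] += 8.0                  # sharp peak caused by a quick finger click

clicks = detect_clicks(csi)
print(clicks)                        # edges of the peak: [249 254]
```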

3.3 System Design of Cross-layer Privacy Inference Attack

The basic strategy of WindTalker is to kill two birds with one stone. On the one
hand, it analyzes the Wi-Fi traffic to identify the sensitive attack windows (e.g.,
PIN number) on smartphones. On the other hand, as long as an attack window
is identified, WindTalker starts to launch the CSI-based keystroke recognition. As
shown in Fig. 3.4, WindTalker consists of the following modules: Sensitive Input
Window Recognition Module, which is responsible for distinguishing the sensitive
input time windows, ICMP-Based CSI Acquirement Module, which collects the

Fig. 3.4 The framework of WindTalker

user’s CSI data during his access to Wi-Fi hotspot, Data Preprocessing Module,
which preprocesses the CSI data to remove the noises and reduce the dimension,
Keystroke Extraction Module, which enables WindTalker to automatically determine
the start and the end point of keystroke waveform, and Keystroke Inference
Module, which compares the different keystroke waveforms and determines the
corresponding keystroke.
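The modular pipeline described above can be outlined as a skeleton; the class and method names below are our own and purely illustrative, not the authors' implementation.

```python
class WindTalkerPipeline:
    """Illustrative wiring of the five WindTalker modules; every method
    body is a placeholder."""

    def recognize_sensitive_window(self, wifi_traffic):
        """Match destination IPs against the Sensitive IP Pool and
        return the (start, end) timestamps of the PIN-entry window."""
        raise NotImplementedError

    def acquire_csi(self, window):
        """Send high-rate ICMP Echo Requests and record the CSI of the
        replies that fall inside the sensitive window."""
        raise NotImplementedError

    def preprocess(self, csi_series):
        """Wavelet denoising followed by PCA dimension reduction."""
        raise NotImplementedError

    def extract_keystrokes(self, csi_clean):
        """Segment the series into per-keystroke waveforms."""
        raise NotImplementedError

    def infer_password(self, keystrokes):
        """Compare waveforms against keystroke profiles to recover the PIN."""
        raise NotImplementedError
```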

3.3.1 ICMP-based CSI Acquirement Module


3.3.1.1 Collecting CSI Data by Enforcing ICMP Reply

Different from the previous works, which rely on two devices, including both the
sender and the receiver, to collect CSI data, we apply an approach that leverages
the Internet Control Message Protocol (ICMP) at the hotspot to collect CSI data
while the user accesses the pre-installed access point. In particular, WindTalker
periodically sends an ICMP Echo Request to the victim's smartphone, which returns
an Echo Reply for each request. To acquire enough CSI information about the victim,
WindTalker needs to send ICMP Echo Requests at a high frequency, which forces
the victim to reply at the same frequency. In practice, WindTalker can work well
for several smartphones, such as Xiaomi, Redmi, and Samsung, at the rate of 800
packets per second. It is important to point out that this approach does not require
any permission from the target smartphone and is difficult for the victim to detect.
The ICMP-based CSI collection approach introduces only a limited amount of extra
traffic. For 98-byte ICMP packets sent at 800 packets per second, the attack
consumes only 78.4 KB/s, whereas IEEE 802.11n can theoretically support
transmission speeds of up to 140 Mbit/s. It is clear that the proposed attack causes
little interference to the victim's Wi-Fi experience.
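The bandwidth overhead quoted above is straightforward to verify with a little arithmetic:

```python
# Verifying the probing overhead of the ICMP-based CSI collection.
packet_size_bytes = 98          # one ICMP Echo Request
rate_pps = 800                  # requests per second

overhead_bytes = packet_size_bytes * rate_pps     # bytes per second
overhead_kb = overhead_bytes / 1000               # 78.4 KB/s
capacity_bps = 140e6                              # 802.11n, theoretical

fraction = overhead_bytes * 8 / capacity_bps      # share of link capacity
print(overhead_kb, round(fraction * 100, 3))      # 78.4 0.448
```

That is, the probing traffic occupies well under 1% of the theoretical link capacity.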

3.3.1.2 Reducing Noise via Directional Antenna

CSI will be influenced by both finger movement and people’s body movement.
One of the major challenges of obtaining the exact CSI data in public space is
how to minimize the interference caused by nearby human beings. We present
a noise reduction approach by adopting the directional antenna. Different from
omnidirectional antennas that have a uniform gain in each direction, directional
antennas have a different antenna gain in each direction. As a result, the signal level
at a receiver can be increased or decreased simply by rotating the orientation of the
directional antenna. WindTalker employs a directional antenna to focus the energy
toward the target of interest, which is expected to minimize the effects of the nearby
human body movement.
WindTalker employs a TDJ-2400BKC antenna working at 2.4 GHz to collect
CSI data of the targeted victim; its horizontal-plane and vertical-plane −3 dB
power beamwidths are 30° and 25°, respectively.
The comparison of CSI data when using a directional and omnidirectional
antenna is shown in Fig. 3.5. In the experiment, we recruited two volunteers.
Volunteer A acts as the target user and continuously clicks the number 1 on the
number pad of the smartphone. Volunteer B was asked to walk 1 m to the left of
volunteer A. In the experiment, the directional antenna of the Wi-Fi hotspot was
always aimed at volunteer A. As shown in Fig. 3.5a, when the distance between
the volunteer and the omnidirectional antenna of the Wi-Fi hotspot is 75 cm,
due to the movement of surrounding volunteer B, we cannot clearly observe the
fluctuation mode caused by keystroke from the collected CSI amplitude waveform.
Figure 3.5b–d show the CSI amplitude when the victim is located 75, 125, and
150 cm away from the directional antenna, respectively, while another person
moves nearby.

Fig. 3.5 The comparison between omnidirectional and directional antennas on collecting CSI. (a)
Omnidirectional antenna at 75 cm. (b) Directional antenna at 75 cm. (c) Directional antenna at
125 cm. (d) Directional antenna at 150 cm

The unique pattern caused by the finger click on number 1 can be easily
caught from the original CSI waveform without any preprocessing. Therefore, in
the subsequent experiments, we only consider the CSI acquisition method using a
directional antenna.

3.3.2 Sensitive Input Window Recognition Module

To extract the time window containing the sensitive input, WindTalker captures
all packets of the victim with Wireshark and records the timestamp of each CSI
data. Currently, most of the important applications are secured via HTTPS, which
provides end-to-end encryption and prevents the eavesdropper from obtaining
sensitive data such as the password. Our insight is that though HTTPS provides
end-to-end encryption, it cannot protect the metadata of the traffic, such as the IP
address of the destination server, which can be used to recognize the sensitive input
window.
Specifically, WindTalker builds a Sensitive IP Pool for the applications or services
of interest. Take Alipay as an example: during the payment process, the data packets
will be directed to a limited number of IP addresses, which can be obtained via a
series of trials. In the experimental evaluation, it is shown that, for Alipay users,
the traffic of users under the same network is directed to the same server
IP, which remains stable for a period (e.g., several days for one round of experiments).
Therefore, it is feasible to try to access interesting applications or services at
regular intervals and append the obtained IP addresses to the Sensitive IP Pool.
This constantly updated pool allows WindTalker to figure out the sensitive input
time window.
We conduct experiments to validate the effectiveness of the method as mentioned
above. We choose three popular mobile payment applications (i.e., Alipay, Wechat
Pay, and JD Pay) and capture the network traffic using Wireshark. For each
application, we complete the mobile payment ten times. As shown in Table 3.1,
for a certain application, when the password input process starts, packets destined
to a specific IP address appear. This result demonstrates the effectiveness of the
sensitive IP pool-based method. Therefore, during the attack process, as long as the
traffic to the IP addresses contained in the Sensitive IP Pool is observed, WindTalker
will extract this traffic and then record the corresponding start time and the end time,
which serve as the start and the end of the Sensitive Input Window. Then, it starts
to analyze the CSI data in this period to launch the password inference attack via
Wi-Fi signals.

Table 3.1 Payment applications and their sensitive IP addresses

Payment application    IP address
Alipay                 110.76.15.1xx & 110.75.236.xx
Wechat Pay             182.254.78.1xx
JD Pay                 111.13.142.x
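The sensitive-window extraction can be sketched as follows: given per-packet metadata (timestamp, destination IP) and the prefixes in the Sensitive IP Pool, the attacker takes the first and last matching packets as the window boundaries. The prefixes and the toy capture below are illustrative.

```python
SENSITIVE_IP_POOL = ("110.76.15.", "110.75.236.")  # e.g., Alipay-style prefixes

def sensitive_window(packets, pool=SENSITIVE_IP_POOL):
    """packets: iterable of (timestamp, dst_ip) tuples. Returns the
    (start, end) timestamps of the sensitive input window, or None
    if no packet matches the pool."""
    times = [ts for ts, ip in packets
             if any(ip.startswith(prefix) for prefix in pool)]
    return (min(times), max(times)) if times else None

# Toy capture: ordinary browsing, then a payment to a sensitive server.
capture = [
    (10.0, "93.184.216.34"),   # ordinary web traffic
    (12.4, "110.76.15.101"),   # payment traffic starts
    (13.1, "110.76.15.101"),
    (15.9, "110.75.236.7"),    # payment traffic ends
    (17.2, "93.184.216.34"),
]
print(sensitive_window(capture))  # (12.4, 15.9)
```

The CSI samples whose timestamps fall inside the returned window are then passed on for keystroke inference.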

3.3.3 Data Preprocessing Module

After collecting CSI data, WindTalker needs to conduct the preprocessing before
entering the keystroke inference module. The goal of preprocessing is to remove the
noises introduced by commodity Wi-Fi NICs due to the frequent changes in internal
CSI reference levels, transmission power levels, and transmission rates. To achieve
this, WindTalker first turns to wavelet denoising to remove noises from the obtained
signals. Then, WindTalker leverages the Principal Component Analysis to reduce
the dimensionality of the feature vectors to enable better analysis of the data.

3.3.3.1 Wavelet Denoising

From Fig. 3.5, we can observe that the variation of CSI waveforms caused by
finger motion normally appears at the low end of the spectrogram while the
frequency of the noise occupies the high end of the spectrogram. We do not adopt
a low-pass filter since the high-frequency signal includes some finger motion
characteristics. In this chapter, the wavelet denoising method is used to remove noise
from the raw signal. Different from the traditional frequency analysis such as Fourier
Transform, Discrete Wavelet Transform (DWT) is a time–frequency analysis that
has good resolution in both the time and frequency domains. WindTalker can
thus leverage DWT to analyze the finger movement in varied frequency domains.
Wavelet denoising includes three main steps as follows:
Discrete Wavelet Transform A discrete signal x[n] can be expressed in terms of
the wavelet function by the following equation:

x[n] = \frac{1}{\sqrt{L}} \sum_{k} W_{\phi}[j_0, k]\, \phi_{j_0,k}[n] + \frac{1}{\sqrt{L}} \sum_{j=j_0}^{\infty} \sum_{k} W_{\psi}[j, k]\, \psi_{j,k}[n], \quad (3.2)

where x[n] represents the original discrete signal and L represents the length of
x[n]. φj0 ,k [n] and ψj,k [n] refer to wavelet basis. Wφ [j0 , k] and Wψ [j, k] refer
to the wavelet coefficients. The functions φj0 ,k [n] refer to scaling functions and
the corresponding coefficients Wφ [j0 , k] refer to the approximation coefficients.
Similarly, functions ψj,k [n] refer to wavelet functions and coefficients Wψ [j, k]
refer to detail coefficients. In order to obtain the wavelet coefficients, the wavelet
basis φj0 ,k [n] and ψj,k [n] are chosen to be orthogonal to each other.
During the decomposition process, we divide the original signal into approxi-
mation and detail coefficients. Then the approximation coefficients are iteratively
divided into the approximation and detail coefficients of the next level. The
approximation and the detail coefficients in the j-th level can be calculated as follows:

W_{\phi}[j_0, k] = \langle x[n], \phi_{j_0+1,k}[n] \rangle = \frac{1}{\sqrt{L}} \sum_{n} x[n]\, \phi_{j_0+1,k}[n]. \quad (3.3)

W_{\psi}[j, k] = \langle x[n], \psi_{j+1,k}[n] \rangle = \frac{1}{\sqrt{L}} \sum_{n} x[n]\, \psi_{j+1,k}[n]. \quad (3.4)

Threshold Selection The recursive DWT decomposition breaks the raw signal into
detail coefficients (high-frequency) and approximation coefficients (low-frequency)
at different frequency levels. Then, the threshold is applied to the detail coefficients
to remove their noisy part. The threshold selection is important because a small
threshold will retain the noisy components while a large threshold will lose the
major information of the signal. In this work, the minimax threshold is chosen
for its adaptivity, effectiveness, and simplicity [21].
Wavelet Reconstruction After finishing the above two steps, we reconstruct
the signal to achieve noise removal by combining the coefficients of the last
approximation level with all details which have applied the threshold. Wavelet
selection plays a key role in wavelet decomposition and reconstruction. There are
many wavelet bases, such as Daubechies and Haar wavelet [21]. In practice, we
choose Daubechies D4 wavelet and perform 5-level DWT decomposition in wavelet
denoising in our study.
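The three steps above can be sketched in a compact form. For brevity, this sketch uses the simpler Haar wavelet, a fixed soft threshold, and 3 decomposition levels, whereas the chapter's implementation uses the Daubechies D4 wavelet, the minimax threshold, and 5-level decomposition.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar DWT: returns (approximation, detail)."""
    x = np.asarray(x, dtype=float)
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def haar_idwt(a, d):
    """Inverse of one Haar DWT level."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def wavelet_denoise(x, levels=3, thresh=0.5):
    """Decompose, soft-threshold the detail coefficients, reconstruct."""
    a, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        a, d = haar_dwt(a)
        # Soft thresholding: shrink detail coefficients toward zero.
        details.append(np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0))
    for d in reversed(details):
        a = haar_idwt(a, d)
    return a

# A toy CSI-like series: smooth variation plus additive noise.
rng = np.random.default_rng(0)
true = 30 + np.sin(np.arange(256) / 10.0)
noisy = true + 0.3 * rng.standard_normal(256)
clean = wavelet_denoise(noisy)
```

Without thresholding, decomposition followed by reconstruction recovers the input exactly; with thresholding, the high-frequency noise components are suppressed while the smooth trend is preserved.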

3.3.3.2 Dimension Reduction

To facilitate subsequent keystroke inference, we reduce the dimensions of the CSI
data in this part. As mentioned in Sect. 3.2.1, for a communication system with one
transmitting antenna and one receiving antenna, the dimensions of CSI collected
by the Intel 5300 network card are 30 dimensions. These 30 dimensions of CSI data
represent the user keystroke behavior perceived by different frequency subcarriers.
It is important to reduce the dimensionality of the CSI information obtained from
30 subcarriers and recognize those subcarriers which show the strongest correlation
with the hand and finger movements. WindTalker adopts principal component
analysis (PCA) to choose the most representative or principal components from all
CSI time series. PCA is also expected to remove the uncorrelated noisy components.
The procedure of dimension reduction of CSI time series based on PCA includes the
following steps.
Subcarrier Selection We observe that the CSI waveforms from different sub-
carriers have different sensitivities to CSI variation caused by keystrokes due to
frequency diversity. As shown in Fig. 3.6a–c, the amplitudes of some subcarriers
vary noticeably with keystrokes, while others are insensitive. As shown in Fig. 3.6d,
we calculate the variance of each subcarrier and find that subcarriers 10–19 have
lower variances, i.e., lower sensitivity to keystrokes. We therefore discard the ten
subcarriers with the lowest variances before PCA.

Fig. 3.6 An illustration of subcarrier selection. (a) CSI waveforms while inputting a keystroke. (b)
The 1st subcarrier: high sensitivity. (c) The 16th subcarrier: low sensitivity. (d) The variance of each subcarrier

Finishing PCA Process We use a matrix H to represent the original CSI waveform
data. For example, in a system with one pair of TX-RX antennas, we will get 30
CSI waveforms from 30 subcarriers. After removing ten subcarriers as mentioned
above, H will contain 20 CSI waveforms. Thus, with sampling rate S and time T ,
H has dimension of M × 20, where M = S × T . Then we calculate the mean value
of each column in H and subtract the corresponding mean values in every column.
After the centralization step, we get a processed matrix Hp . After that, we calculate
the correlation matrix of Hp as Hp^T × Hp and its corresponding eigenvalues and
eigenvectors. We sort the eigenvalues in descending order and choose the k largest
eigenvalues. The corresponding k eigenvectors are used as the column vectors to
form an eigenvector matrix of dimension 20 × k. Finally, the processed CSI data
stream is denoted as Hr with the dimension of M × k as below:

Hr (M × k) = Hp (M × 20) × EigenVectors(20 × k). (3.5)

With PCA, we can identify the most representative components influenced by the
victim's hand and finger movements and remove the noisy components at the same
time. In our experiment, it is observed that the first k = 4 components show the
most significant changes in the CSI waveforms, while the remaining components
are noise. We also observed that the first PCA component preserves the most
changes in CSI while its ambient noise is weak. Thus we only take the first PCA
component in the password inference module.
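The subcarrier selection and PCA steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than WindTalker's actual implementation; the function name `pca_reduce` and the `n_discard` parameter are our own, while the 20 retained subcarriers and k = 4 follow the text:

```python
import numpy as np

def pca_reduce(csi, n_discard=10, k=4):
    """Sketch of WindTalker's dimension reduction (Eq. 3.5).

    csi: M x 30 array of CSI amplitudes (30 subcarriers, M samples).
    Discards the n_discard lowest-variance subcarriers, then projects
    the remaining waveforms onto their top-k principal components.
    """
    # Subcarrier selection: drop the least sensitive subcarriers.
    keep = np.argsort(csi.var(axis=0))[n_discard:]
    H = csi[:, keep]                        # M x 20
    # Centralization: subtract each column's mean.
    Hp = H - H.mean(axis=0)
    # Eigendecomposition of the correlation matrix Hp^T * Hp.
    eigvals, eigvecs = np.linalg.eigh(Hp.T @ Hp)
    # eigh returns ascending eigenvalues; take the k largest.
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Hp @ top                         # Hr: M x k, as in Eq. (3.5)
```

The returned columns are ordered by decreasing eigenvalue, so keeping only the first column corresponds to using the first PCA component, as the text does.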

3.3.4 Keystroke Inference Module


3.3.4.1 Keystroke Extraction

After data preprocessing, it is observed that the CSI data shows a strong correlation
with the keystrokes, as shown in Fig. 3.7a. In the experiment, as mentioned in
Sect. 3.2.3, the sharp rise and fall of the CSI waveform signals are observed in
coincidence with the start and end of finger touch. How to determine the start and
the end point of the CSI time series during a keystroke is essential for keystroke
recognition. However, the existing burst detection schemes, such as Mann-Kendall
test, moving average method, and cumulative anomalies [9] do not work well in our
situation since the CSI waveform has many change points during the password input
period. Therefore, we propose a novel detection algorithm to automatically detect
the start and end points. The proposed algorithm includes the following three steps.
Waveform Profile Building As shown in Fig. 3.7a, it is observed that there is a
sharp rise and fall, which correspond to the finger motions. However, there is a
strong noise that prevents us from extracting interested CSI waveforms related to
the keystrokes. This motivates us to perform another round of noise filtering.

Fig. 3.7 Extraction procedure of multiple keystrokes. (a) CSI waveform after data preprocessing.
(b) CSI waveform after the second round of filtering. (c) Variance scan. (d) The results of
keystroke extraction

In the experiment, we still adopt wavelet denoising to make the waveform smooth.
After being filtered, the CSI data during the keystroke period are highlighted while
the waveform during the non-keystroke period becomes smooth, which is shown in
Fig. 3.7b. It is worth noting that the noise filtering in this step will filter out some
information related to user keystrokes. Therefore, this waveform is only used to
determine the start and end time of keystrokes, and subsequent waveform extraction
is still performed on the waveform shown in Fig. 3.7a.
CSI Time Series Segmentation and Feature Segment Selection To extract the
CSI waveforms for individual keystrokes, we slice the CSI time series into multiple
segments, which are grouped together according to temporal proximity, and then
choose the center of the segment as the feature waveform for a specific keystroke.
Without loss of generality, it is assumed that each segment contains W samples.
Given the sampling frequency S and the time duration τ , W can be represented by
S × τ . For the waveform with a time duration of T , the number of segments N can
be calculated as below:

N = ⌈(T × S)/W⌉. (3.6)

As shown in Fig. 3.7c, it is observed that the CSI segments during the keystroke
period show a much larger variance than those outside the period. Motivated by
this, we keep only the segments whose variance exceeds a predetermined threshold
and remove the rest. The selected segments are then grouped according to temporal
proximity (e.g., five adjacent segments grouped into one group in practice). Each
group represents the CSI waveform of an individual keystroke, and the center point
of this group is selected as the feature segment of this keystroke. The process of
time series segmentation and feature segment selection is shown in Fig. 3.7c.
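The variance-scan segmentation and grouping described above can be sketched as follows; the function name, the `group_gap` tolerance, and the threshold value are illustrative assumptions:

```python
import numpy as np

def find_keystroke_segments(x, W, threshold, group_gap=2):
    """Variance-based feature-segment selection, a sketch of Eq. (3.6).

    x: 1-D array, e.g., the first PCA component of the CSI stream.
    W: samples per segment, i.e., sampling rate S times duration tau.
    threshold: variance level separating keystroke segments from idle
        ones; group_gap is an assumed grouping tolerance (the text
        groups about five adjacent segments per keystroke).
    Returns the center sample index of each detected keystroke.
    """
    n_seg = len(x) // W                           # N in Eq. (3.6)
    variances = np.array([x[i * W:(i + 1) * W].var()
                          for i in range(n_seg)])
    active = np.flatnonzero(variances > threshold)
    groups, current = [], list(active[:1])
    for idx in active[1:]:
        if idx - current[-1] <= group_gap:        # temporally adjacent
            current.append(idx)
        else:
            groups.append(current)
            current = [idx]
    if current:
        groups.append(current)
    # The center of each group marks one individual keystroke.
    return [int(np.mean(g) * W + W // 2) for g in groups]
```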
Keystroke Waveforms Extraction To extract keystroke waveforms, the key issue
is how to determine the start and the end point of the CSI time series, which could
cover as much keystroke waveform as possible while minimizing the coverage of
the non-keystroke portion.
We calculate the average value J of the segment samples and then choose two
key metrics K and L: K is the average of J and the samples' maximum value,
while L is the average of J and the samples' minimum value. The intersections of
lines K and L with the CSI waveform serve as the anchor points. On line K, starting from
the leftmost anchor point, it performs a local search and chooses the nearest local
extremum, which is below K, as the first start point. Similarly, beyond the rightmost
anchor point, it can choose the nearest local extremum, which is below K, as the
first end point. Also, we can perform local searches from the two anchor points on
line L in order to choose two local extrema beyond L as the second start point and
the second end point. Finally, we compare these candidate points. As shown in
Figs. 3.8a, b, and 3.7d, with the lower start point and the higher end point, the
keystroke waveform can be extracted.

Fig. 3.8 Extraction procedure of a single keystroke. (a) An illustration of anchor points. (b) Start
and end points of keystroke
Thus, we can divide a CSI waveform into several keystroke waveforms. The
ith keystroke waveform Ki is extracted from the kth principal component Hr (:, k)
of the CSI waveforms as follows:

Ki = Hr (si : ei , k), (3.7)

where si and ei are the start and the end time of the ith keystroke. After keystroke
extraction, we use these keystroke waveforms to conduct the recognition process.
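A simplified sketch of the anchor-point search is given below. For brevity it takes the outermost crossings of lines K and L as the candidate boundaries instead of refining them to the nearest local extrema, so it only approximates the procedure in the text:

```python
import numpy as np

def keystroke_bounds(seg):
    """Simplified sketch of the anchor-point boundary search.

    seg: 1-D CSI slice around one keystroke (e.g., centered on a
    feature segment). Computes the mean J, the upper line K (midway
    between J and the maximum) and the lower line L (midway between
    J and the minimum), then returns the widest (start, end) pair
    among the crossings of K and L.
    """
    J = seg.mean()
    K = (J + seg.max()) / 2.0
    L = (J + seg.min()) / 2.0
    above_K = np.flatnonzero(seg >= K)      # anchor points on line K
    below_L = np.flatnonzero(seg <= L)      # anchor points on line L
    starts = [a[0] for a in (above_K, below_L) if len(a)]
    ends = [a[-1] for a in (above_K, below_L) if len(a)]
    # The lower start point and the higher end point cover the
    # full keystroke waveform.
    return min(starts), max(ends)
```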

3.3.4.2 Keystroke Time-Domain Feature Extraction

One of the major challenges for differentiating keystrokes is how to choose the
appropriate features that can uniquely represent the keystrokes. As shown in
Fig. 3.9, it is observed that different keystrokes will lead to different waveforms,
which motivates us to choose waveform shape as a feature for keystroke clas-
sification. However, directly using the keystroke waveforms as the classification
features leads to a high computational cost in the classification process since
waveforms contain many data points for each keystroke. Therefore, we leverage the
Discrete Wavelet Transform (DWT) to compress the length of the CSI waveform by
extracting the approximate sequence. Below, we will introduce the details.
Wavelet Compression As mentioned in Sect. 3.3.3.1, a discrete keystroke wave-
form Ki [n] can be expressed by the following equation:
Fig. 3.9 CSI waveform differences between two keystroke PIN numbers. (a) Two CSI waveforms
when inputting keystroke PIN number 5. (b) Two CSI waveforms when inputting keystroke PIN
number 4

Ki [n] = (1/√L) Σk Wφ [j0 , k] φj0 ,k [n] + (1/√L) Σ∞j=j0 Σk Wψ [j, k] ψj,k [n], (3.8)

where L represents the length of Ki [n], Wφ [j0 , k] and Wψ [j, k] refer to the
approximation and detail coefficients, respectively. In the first DWT decomposition
step, the length of approximation coefficients is half of L. For the j th decomposition
step, the length is half of the previous decomposition step. We use the approximation
coefficients to compress the original keystroke waveforms to reduce computational
costs. In order to achieve the trade-off between the sequence length reduction
and preserving the waveform information, we choose Daubechies D4 wavelet and
perform 3-level DWT decomposition in the classification model. Therefore, for the
ith keystroke, the third-level approximation coefficients Fi of Ki are chosen as the
feature of the keystroke. After compression, the length of the feature Fi is about 1/8
that of Ki [n].
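The wavelet compression step can be sketched with the classic four-tap Daubechies (D4) low-pass filter. This hand-rolled version uses periodic boundary extension, which is our assumption; a production implementation would typically rely on a wavelet library such as PyWavelets:

```python
import numpy as np

# Four-tap Daubechies (D4) scaling (low-pass) filter coefficients.
_S3 = np.sqrt(3.0)
_D4 = np.array([1 + _S3, 3 + _S3, 3 - _S3, 1 - _S3]) / (4 * np.sqrt(2.0))

def dwt_approx(x, levels=3):
    """Approximation coefficients of a 3-level D4 DWT (a sketch).

    At each level the signal is convolved with the low-pass filter and
    downsampled by two (periodic boundary), so after three levels the
    output length is about 1/8 of the input, matching the compression
    ratio described in the text.
    """
    x = np.asarray(x, dtype=float)
    for _ in range(levels):
        # Periodic extension so the filter can wrap around the end.
        ext = np.concatenate([x, x[:len(_D4) - 1]])
        full = np.convolve(ext, _D4, mode="valid")
        x = full[::2]                       # keep every second sample
    return x
```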

Fig. 3.10 CSI frequency domain feature difference between two PIN keystrokes. (a) Two CSI
spectrograms when inputting keystroke PIN number 5. (b) Two CSI spectrograms when inputting
keystroke PIN number 4

3.3.4.3 Keystroke Frequency Domain Feature Extraction

Besides the CSI waveform shape, the CSI frequency feature can also be used
to differentiate keystrokes. The CSI spectrograms in the frequency domain are a
stable property of CSI streams and are highly correlated to keystrokes. Figure 3.10
illustrates the CSI spectrograms corresponding to the CSI waveforms shown in
Fig. 3.9. It is observed that different keystrokes have significantly different CSI
spectrograms. Therefore, it is feasible to use CSI spectrogram information as the
feature to recognize keystrokes.
In this work, WindTalker first performs Short-Time Fourier Transform (STFT)
to obtain the two-dimensional frequency spectrograms of CSI. Then, WindTalker
calculates the contours of the spectrograms to extract features. To extract the
contour, WindTalker first resizes the CSI spectrograms with frequencies from 0
to 30 Hz into an m-by-n matrix MCSI (i, h) and normalizes MCSI (i, h) to the
range between 0 and 1. Note that, in MCSI (i, h), each column represents the
normalized frequency shifts during the ith time slice. Then, WindTalker chooses
a pre-defined threshold and gets the contour CCSI (i), where i = 1, ..., n. CCSI (i)
is the maximum value j which satisfies MCSI (i, j ) ≥ threshold. As shown
in Fig. 3.10, the contours are marked by black lines. It is observed that, between
the same keystrokes, the contours of CSI spectrograms have similar variation
trends. Thus we can regard the contours as the frequency domain features of
the classification and calculate the similarity between the contours for keystroke
recognition.
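The spectrogram-contour extraction can be sketched as below. The window length, hop size, and threshold are illustrative assumptions, and the resizing of the spectrogram to a fixed m-by-n matrix is omitted for brevity:

```python
import numpy as np

def spectrogram_contour(x, win=64, hop=32, threshold=0.3):
    """Sketch of the spectrogram-contour feature (assumed parameters).

    Computes a magnitude STFT with a Hann window, normalizes it to
    [0, 1], and for each time slice returns the highest frequency bin
    whose normalized magnitude still reaches the threshold.
    """
    window = np.hanning(win)
    frames = [x[i:i + win] * window
              for i in range(0, len(x) - win + 1, hop)]
    # Magnitude spectrogram: rows = frequency bins, columns = time.
    M = np.abs(np.fft.rfft(np.array(frames), axis=1)).T
    M /= M.max() + 1e-12                    # normalize to [0, 1]
    contour = np.zeros(M.shape[1], dtype=int)
    for i in range(M.shape[1]):
        hot = np.flatnonzero(M[:, i] >= threshold)
        contour[i] = hot[-1] if len(hot) else 0
    return contour
```

A keystroke with faster finger motion shifts energy to higher frequency bins, so its contour sits higher, which is what makes the contour usable as a frequency-domain feature.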

3.3.4.4 Keystroke Recognition

WindTalker builds a classifier to recognize the keystrokes based on both the time-
domain feature and the frequency domain feature. To compare the features of
different keystrokes, WindTalker adopts Dynamic Time Warping (DTW) to measure
the similarity between two keystrokes. DTW utilizes dynamic programming to
calculate the distance between two sequences with different lengths. With DTW,
the sequences (e.g., time series and spectrogram contours) are warped non-linearly
in the time dimension to measure their similarity. The input of the DTW algorithm is
two sequences, and the output is the distance between them. A low distance indicates
that these two sequences are highly similar.
By adopting DTW, the classifier gives each keystroke a set of scores, which
allows the keystrokes to be differentiated based on the user's training dataset
(keystrokes on different PIN numbers). Take the numerical keyboard (i.e., key
values are "0-1-2-· · · -9") as an example. For the ith keystroke Ki , the classifier
first calculates the DTW distances between the features of Ki and the features of
every keystroke number in the time and frequency domains, respectively. Thus, for
Ki , we obtain two score arrays ST = {st0 , st1 , . . . , st9 } and SF = {sf 0 , sf 1 , . . . ,
sf 9 }, where ST and SF represent the scores in the time and frequency domains,
respectively, and stn refers to the shortest distance between the input keystroke
and the key number n in the time domain; sf n is defined similarly in the frequency
domain. Finally, the classifier calculates the score S = {s0 , s1 , . . . , s9 }, where
sn = stn × sf n . The lower the score sn , the more likely it is that the number n
is the actual input. The classifier chooses the PIN number with the minimum score
(the value n for which sn is lowest) as the predicted key number. Note that the
classifier saves all scores of each keystroke Ki in order to generate password
candidates in Sect. 3.4.2.2.
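The DTW comparison and the score fusion sn = stn × sf n can be sketched as follows; the template layout (a dict of training feature pairs per key) is an assumption made for illustration:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    return D[n, m]

def classify_keystroke(time_feat, freq_feat, templates):
    """Score-fusion sketch: s_n = st_n * sf_n, lowest score wins.

    templates: dict mapping key number n to a list of
    (time_feature, frequency_feature) training pairs.
    Returns the predicted key and the full score dict (the scores are
    kept for later password-candidate generation).
    """
    scores = {}
    for n, pairs in templates.items():
        st = min(dtw_distance(time_feat, tf) for tf, _ in pairs)
        sf = min(dtw_distance(freq_feat, ff) for _, ff in pairs)
        scores[n] = st * sf
    return min(scores, key=scores.get), scores
```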

3.4 System Evaluation

This section will evaluate the performance of WindTalker on inferring the user's
keystrokes and the impacts of various factors on WindTalker's performance.

3.4.1 System Setup

WindTalker is built with commercial off-the-shelf hardware: a commercial laptop
computer equipped with an Intel 5300 NIC and one external
directional antenna. WindTalker also serves as the Wi-Fi hotspot to attract users
to access the Wi-Fi. The laptop runs Ubuntu 14.04 LTS with a modified Intel driver
to collect CSI data. To collect the CSI data related to the user’s touchscreen clicks,
WindTalker uses ICMP echo and reply to achieve the sampling rate of 800 packets/s.
In this evaluation, the distance between the mobile user and the AP is 75 cm, and
the AP is placed on the left side of the mobile phone.
In the experiments, we recruit 20 volunteers to join our evaluation, including 17
males and 3 females. All of the volunteers are right-handed, and they perform the
touchscreen operations by following their own fashions. During the experiment, the
volunteers should participate in the data training phase and keystroke recognition
phase by inputting the numbers according to the system hints. In the data training
phase, WindTalker records each input and its corresponding CSI data. In the test
phase, WindTalker infers the input data based on the observed CSI time series. The
training data and testing data collection should be finished within 30 minutes since
the CSI may change as the environment changes.
We start the evaluation by testing the classification accuracy and the 6-digit
password inference accuracy. Then we perform a more specific case study by
inferring the password of mobile payment for Alipay in Sect. 3.5. Afterward, we
investigate various metrics that may influence the inference accuracy of WindTalker,
including the distance, the direction, and the human movement in Sect. 3.4.3. In the
current stage evaluation, we only perform user-specific training and will discuss its
limitation in Sect. 3.6.3.

3.4.2 CSI-based Keystroke Inference Performance


3.4.2.1 Classification Accuracy for Single Keystroke

In Sect. 3.2.3, we have shown that different keystrokes may be correlated with
different CSI waveforms. In this section, we aim to explore whether the differences
in keystroke waveforms are large enough to be used for recognizing different PIN
number inputs in a real-world setting. We collected training and testing data from
20 volunteers. Each volunteer first generates 50 loop samples, where a loop is
defined as the CSI waveforms of the keystroke numbers from 0 to 9, collected by
pressing each corresponding digit. After that, we evaluate the classification accuracy of
WindTalker through the collected CSI data.
The classification accuracy is evaluated in terms of 10-fold cross-validation
accuracy. However, in a real-world scenario, it is not reasonable to collect 50
training samples for one specific PIN number. Therefore, we first divide these 50
loops of data into five groups evenly. Then, for every 10 loops of CSI data, we
pick up one loop in turn for the testing data and choose the other 9 loops as the
training data. WindTalker adopts the classifier introduced in Sect. 3.3.4 to recognize
the keystroke. We perform the evaluation on Xiaomi, Redmi, and Samsung Note3
smartphones. All of them run with Android 5.0.2. Figure 3.11a shows the average
classification accuracy of all 20 volunteers in 10 PIN numbers. It is observed that
WindTalker achieves an average classification accuracy of 93.5% using combined
CSI features. However, if WindTalker only utilizes the time-domain feature as in [11],
the accuracy drops to 87.3%.
Figure 3.11b describes the color map of the confusion matrix of keystroke
inference. The coordinate (X, Y ) indicates the probability that WindTalker
recognizes the keystroke CSI waveform of the single-digit PIN code X as the
single-digit PIN code Y . The darker the color, the greater the probability.
Intuitively, an input number is easily confused with its neighboring numbers during the
data on the recovery rate in WindTalker. Table 3.2 shows the keystroke inference
accuracy increases with the training loop increases. Even if there is only one training
sample for one keystroke, WindTalker can still achieve a whole recovery rate of
79.5%.

3.4.2.2 Inference Accuracy for 6-digit Password

In a practical setting, it may not be easy for WindTalker to get 9 training
samples for each PIN number. So in the remainder of this section, we only use 3 samples
per PIN number for training. To illustrate the performance of WindTalker for
password inference, in this part, we ask volunteers to press 10 randomly generated
6-digit passwords on a Xiaomi smartphone and use the corresponding 3 loops as the
training dataset.

Fig. 3.11 Classification accuracy. (a) Classification accuracy per key. (b) An illustration of
confusion matrix

Table 3.2 Relationship between accuracy and training loop times

Loop times        One      Three    Five     Nine
Recovery rate     79.5%    86.2%    88.5%    93.5%
We test 500 passwords, which include 3000 keys. As shown in Table 3.2, with 3
loops as training data, WindTalker can achieve an average 1-digit recovery rate of
Ap1 = 86.2%. For a 6-digit password in Alipay, theoretically the recovery rate
is Ap1^6 ≈ 41%. However, in real-world scenarios, the attacker can try several
times to recover the password at an increased success rate. Thus, we introduce
a new metric, the recovery rate with top N candidates, which indicates the rate of
successfully recovering the password within N tries and represents a more
reasonable metric to describe the capability of the attacker in a practical setting.
As shown in Table 3.3, if we evaluate the 1-digit recovery rate under top 2 and top
3 candidates, the recovery rate can be significantly improved to 93.4% and 96.2%.
We further study how many candidates can help WindTalker to succeed in
predicting the right 6-digit payment password. In particular, we will investigate
the inference accuracy under top N candidates. In the experiment, each 6-digit
password will be correlated with six keystrokes K = {K1 , K2 , . . . , K6 }. For
the ith keystroke's CSI data Ki , WindTalker calculates its corresponding score
array Si = {si,0 , si,1 , . . . , si,9 }. Then, for a given password candidate PIN number
P = {p1 , p2 , . . . , p6 }, where pi ∈ [0, 9], WindTalker calculates the likelihood
L between K and P , defined by L = Σ6i=1 si,pi . Given a 6-digit password
K, for each keystroke Ki , we can obtain the 5 candidates with the lowest si and
then generate 5^6 = 15625 candidate passwords. Then WindTalker sorts these
passwords by their likelihoods in ascending order. A password inference is
successful if the top N candidates contain the real password. In Fig. 3.12a, we give
the password inference accuracy under top N candidates, where N ranges from 1 to
20. The result is encouraging. It is shown that, given the top-1 candidate, the
inference accuracy is 41.2%. The inference rate can be significantly improved given
the top-5 or top-10 candidates, which correspond to 69.6% and 77.4%,
respectively. It is also shown in Fig. 3.12b that, given enough top N candidates
(e.g., setting N to 60), the inference accuracy can reach above 85%.
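The candidate-generation procedure above (5 digits per keystroke, 5^6 combinations ranked by likelihood) can be sketched as follows; the function name and interface are illustrative assumptions:

```python
import itertools
import numpy as np

def rank_passwords(scores, per_key=5, top_n=10):
    """Sketch of WindTalker's candidate-password generation.

    scores: 6 x 10 array; scores[i][d] is the score s_{i,d} of digit d
    for the ith keystroke (lower means more likely).
    Keeps the per_key lowest-scoring digits per keystroke, forms all
    per_key**6 combinations, and ranks them by the likelihood
    L = sum_i s_{i, p_i} in ascending order.
    """
    scores = np.asarray(scores, dtype=float)
    digit_pool = [np.argsort(row)[:per_key] for row in scores]
    candidates = []
    for combo in itertools.product(*digit_pool):
        likelihood = sum(scores[i, d] for i, d in enumerate(combo))
        candidates.append(("".join(str(d) for d in combo), likelihood))
    candidates.sort(key=lambda c: c[1])
    return [pwd for pwd, _ in candidates[:top_n]]
```

With per_key = 5 this enumerates 15625 combinations, which is cheap enough to rank exhaustively.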

3.4.3 Impact of Various Factors

There are many factors potentially affecting CSI. Accordingly, the performance of
WindTalker is affected by various factors, such as the relative position of the AP
and mobile device, the CSI sampling rate, the keyboard layout, human movement,
and temporal factors. Even
clicking at the same key, different distances and directions between the AP and the
mobile device may also lead to different CSI. We will investigate the impact of
these factors on WindTalker in our experiments.

Table 3.3 Relationship between 1-digit PIN recovery rate and candidate numbers

Number of candidates              1        2        3
Recovery rate of 1-digit PIN      86.2%    93.4%    96.2%

Fig. 3.12 The relationship between the 6-digit password recovery rate and candidate numbers. (a)
Top 20 candidates. (b) Top 100 candidates

Fig. 3.13 Distance's influence on WindTalker

3.4.3.1 Impact of Distance

In a real scenario, the distance between the victim’s mobile device and AP is not
fixed. As shown in Fig. 3.13, the recovery rate of WindTalker will decrease along
with the increase of the distance. However, it is observed that, even if the distance
between the antenna of WindTalker and the victim’s smartphone (i.e., Xiaomi)
is enlarged to 1.5 m, WindTalker can still achieve a keystroke inference accuracy
of 83.5% in terms of 10-fold cross-validation, which is high enough for launching
keystroke inference attacks.
Figure 3.14 shows that both the shape and the amplitude of the CSI waveform
will change under different distances when pressing the same key. This indicates
that WindTalker needs to retrain the dataset even for the same victim at a different
distance. When the distance between the antenna and the victim is too long, the
multipath propagation will become more complicated. Thus the collected CSI
cannot reflect the victim's finger movement precisely, resulting in inaccurate
inference results.

Fig. 3.14 CSI shape change by distance: PIN 1

In order to
partially solve these limitations, there are two possible solutions. Firstly, the attacker
can fix the location of the table and chairs, which will make the user’s position
relatively stable. The other solution is placing the three antennas of the Intel 5300
NIC at different locations to enlarge the effective range of WindTalker. Therefore,
when the victim connects to the rogue Wi-Fi, WindTalker could dynamically choose
the antenna closest to the victim to collect CSI data.

3.4.3.2 Impact of Direction

The relative direction between the victim and attacker may seriously affect the
CSI since different directions mean different multi-path propagation between the
transmitter and the receiver. Thus, we show the performance of WindTalker in
different directions. Note that the mobile device (i.e., Xiaomi in this experiment)
is in front of the victim. It is important to point out that, for a right-handed user,
WindTalker shows better performance when the AP is on the left side of the victim.
This is because it is easier for WindTalker to sense the victim’s finger clicks and the
hand motion. Figure 3.15 shows the keystroke inference accuracy of WindTalker
in different directions in terms of 10-fold cross-validation. It is interesting that
WindTalker can achieve high performance even when the AP is deployed behind
the victim (i.e., 81%), which means that the proposed CSI-based keystroke
inference can work well even if the attacker is behind the user without visually
seeing the clicking actions. This represents one of the significant merits that cannot
be achieved by any previous work. In the real world, the attacker can adjust the
position and orientation of the directional antenna to overcome the limitations of
distance and direction.

Fig. 3.15 Accuracy in different directions

3.4.3.3 Impact of Smart Terminal Device’s Type

The experiments in Sect. 3.4.2 are implemented on Xiaomi, Redmi, and Samsung
Note3 smartphones. To evaluate the impact of different smartphone types, we recruit
ten volunteers to generate 10 loop keystrokes on Xiaomi, Redmi, and Samsung
Note 3. All of these mobile phones run with Android 5.0.2. When using all 9 loops
of data, WindTalker achieves the average classification accuracy of 93.5%, 88.3%,
and 83.9% on Xiaomi, Redmi, and Samsung Note3, respectively. The experimental
result indicates that the WindTalker performance is affected by the smartphone type
because different smartphones may have different relative positions of antennas and
working powers. Fortunately, the accuracy of WindTalker on different smartphones
is still acceptable for password inference.

3.4.3.4 Impact of CSI Sampling Rate

The keystroke recognition accuracy depends on the sampling rate of CSI. When the
CSI sampling rate is high, there is more information in the CSI waveform, which
increases the keystroke recognition accuracy. Thus, we are interested in how the
CSI sampling rate influences the performance of WindTalker. Figure 3.16 shows the
average classification accuracy of all volunteers with Xiaomi smartphones when
varying the sampling rate from 100 packets/s to 800 packets/s. The experiment
procedures are the same as in Sect. 3.4.2, and the antenna is placed at the best
position as mentioned in Sects. 3.4.3.1 and 3.4.3.2.

Fig. 3.16 Effect of CSI sampling rate

From Fig. 3.16, we observe
that the classification accuracy improves when the sampling rate is higher, but
the improvement is not significant beyond the sampling rate of 400 packets/s. For
instance, with a sampling rate of 400 packets/s, the classification accuracy of Xiaomi
is 90.3%, which is only a slight drop compared to 93.5% achieved for a sampling
rate of 800 packets/s. Note that the sampling rate of 400 packets/s is still enough
to capture the movement feature of the keystroke. When the sampling rate reduces
to 100 packets/s, the accuracy of Xiaomi reduces significantly to 82.8%, as this
sampling rate loses the detailed feature of the keystroke. In our experiment, we use
the sampling rate of 800 packets/s to achieve the best performance of WindTalker.
But when facing a high packet loss rate situation, we can use a lower sampling rate
above 100 packets/s to achieve an acceptable performance.

3.4.3.5 Impact of Keyboard Layout

There are two different keyboard layouts that influence keystroke recognition
accuracy. As shown in Fig. 3.17, besides the numeric keyboard, which is used
in most online payment scenarios, there is the QWERTY keyboard on which a user
can type letters, numbers, and special characters. The main difference between the
two keyboards is the key space. Compared with the numerical keyboard, the hand
movement tends to be subtle when typing adjacent keys on the QWERTY keyboard,
which makes recognizing keystrokes much more difficult since the CSI waveforms
become similar.
We are interested in how the QWERTY keyboard influences keystroke recogni-
tion. For simplicity, we just focus on the digital input on the QWERTY keyboard.
We perform experiments on the Xiaomi smartphone, and the keyboard layout is
provided by the Google input method. Figure 3.18a shows average classification
accuracy on both numeric and QWERTY keyboards. We observe that the accuracy
of the QWERTY keyboard is 67.8%, a significant drop compared to the 93.5% of
the numeric keyboard. Figure 3.18b shows the confusion matrix of the QWERTY
keyboard. We observe that most recognition errors happen between adjacent
keys. Although the accuracy of the QWERTY keyboard is lower than that of the
numeric keyboard, it is still higher than a random guess.

Fig. 3.17 Different keyboard layouts

Fig. 3.18 Keystroke recognition on QWERTY keyboard. (a) Accuracy on different layouts. (b)
Confusion matrix of QWERTY keyboard

3.4.3.6 Impact of Surrounding Human Movement

In some cases, the CSI-based sensing may be affected by the movement of another
nearby human. Thus, we evaluate the impact of human walking and human arm
movement on the performance of WindTalker. As shown in Fig. 3.19a, while
WindTalker collects the CSI data to infer the victim’s keystroke, we recruit a
volunteer to walk along four different lines (L1, L2, L3, L4), respectively. The
distances between WindTalker’s antenna and the midpoints of L1, L2, L3, and
L4 are 1, 2, 3, and 4 m, respectively. The distance between the antenna and the
victim is 1 m. We conduct four experiments in total. In each experiment, we ask
the victim to continuously generate keystrokes and collect the corresponding CSI.
At the same time, the volunteer walks along one of the four lines at the speed of
0.5 m/s. Figure 3.19b shows the experimental result. When the distance between
the antenna and the midpoint of the walking man’s trajectory is larger than 2 m,
64 3 Privacy Breaches and Countermeasures at Terminal Device Layer

Fig. 3.19 Impact of human movement. (a) The experimental environment. (b) The CSI data (first PCA component vs. CSI sample index) under the human walking scenario at distances of 1, 2, 3, and 4 m. (c) The CSI data under the human arm movement scenario at the same distances

the keystrokes can be easily identified in the collected CSI waveforms. However,
when the distance is 1 m (i.e., the walking person's trajectory is very close to the
victim), it is hard to extract keystroke waveforms from the collected CSI data. The
results show that human walking introduces additional multipath effects into the
wireless transmission. Nevertheless, WindTalker remains effective as long as no one
walks within 2 m of its antenna.
Besides human walking, we also consider another scenario in which a human
stays at a fixed location but waves his/her arms. We conduct four experiments. When
the victim continuously generates keystrokes, we ask the volunteer to stay at the
midpoints (i.e., C1, C2, C3, C4 in Fig. 3.19a) of the above four lines, respectively, and

Fig. 3.20 CSI waveforms (first PCA component) of PIN number 1 on different days

wave his/her arms at an average speed of 0.91 cycles per second. As shown in
Fig. 3.19c, when the distance between the antenna and the volunteer is larger than
3 m, the keystrokes can still be recognized from the collected CSI data. When the
distance is 1 m (i.e., another person is very close to the victim), the keystrokes are
hard to extract. Therefore, WindTalker works normally as long as no one waves
his/her arms within 3 m of its antenna.

3.4.3.7 Impact of Temporal Factors

Temporal factors also affect the performance of WindTalker. Figure 3.20 shows
how the CSI waveform of the same PIN number changes across different days. We
can observe that the CSI shape patterns look different. The reason is that, on
different days, the user's typing behavior may be inconsistent and the surrounding
environment may change, which affects the constructive and destructive interference
of the multipath signals. Therefore, in its current state, WindTalker needs to update
the user's CSI profiles before each keystroke inference to ensure its performance.
We leave this limitation for future work.

3.5 When CSI Meets Public Wi-Fi: A Case Study in Mobile


Payment Platforms

The experimental evaluation in the previous section was mainly carried out in a
laboratory environment. To demonstrate the practicality of WindTalker, we perform
an experimental evaluation of password inference on Alipay, a popular mobile
payment platform available on both Android and iOS.
Attack Deployment in Real-World Mobile Payment Platform Alipay is the
largest mobile payment company in the world and had 710 million monthly active
users in June 2020.1 As shown in Fig. 3.21, we deploy a WindTalker system at a

1 https://m.yicai.com/news/100747459.html.

Fig. 3.21 Real case scenario

smart home-powered public environment (e.g., apartment building, cafe-like room)


and release an authentication-free Wi-Fi hotspot. The AP (including the Intel 5300
NIC and its antennas) is set up behind the counter, which makes it less likely to be
noticed visually. The victim is 1 m away from our deployed Wi-Fi devices. While we
collect the data, one volunteer walks past the victim, but no volunteer walks between
the victim and the AP.
In order to simulate real-world attack scenarios, the recruited volunteers are
required to access this free Wi-Fi access point and perform the following three
phases: (1) Online Training Phase: the volunteers are required to input some
randomly generated numbers by following a similar way as Text Captcha. This
phase is designed to collect the user’s input number and the corresponding CSI
data to finish the data training. (2) Normal Use Phase: the volunteers perform
the online browsing or use the applications as normal users. (3) Mobile Payment
Phase: the volunteers use online shopping applications, and each shopping session
ends with a mobile payment. All online shopping and mobile payments are secured
with the HTTPS protocol. According to Alipay's mobile payment policy, mobile
users must input a 6-digit password to complete a mobile payment transaction. The
goal of the attacker is to recover the volunteers' mobile payment passwords.
Operations of WindTalker After the volunteers connect to the authentication-
free Wi-Fi hotspot, WindTalker triggers the ICMP-based CSI Acquirement Module
to collect CSI data at a sampling rate of 800 packets/s, recording a timestamp for
each CSI measurement. Simultaneously, WindTalker utilizes Wireshark to capture
and record the Wi-Fi traffic packets and their corresponding timestamps. During the
real-world experiment, WindTalker collects the Wi-Fi traffic data and CSI data in
the online phase; after the data collection, it infers the user's mobile payment
password in the offline phase.
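As a rough illustration of the acquisition loop's timing (not the authors' implementation), the sketch below paces a caller-supplied `send_echo` callback at a fixed rate such as 800 requests per second. Actually emitting ICMP echo requests requires raw sockets and elevated privileges, so the callback is left abstract here.

```python
import time

def probe_at_rate(send_echo, duration_s, rate_hz=800):
    """Invoke send_echo() at a fixed rate for duration_s seconds,
    mimicking the 800 packets/s pace of the CSI acquirement module."""
    interval = 1.0 / rate_hz
    next_t = time.perf_counter()
    end_t = next_t + duration_s
    sent = 0
    while next_t < end_t:
        send_echo()          # e.g., transmit one ICMP echo request
        sent += 1
        next_t += interval
        delay = next_t - time.perf_counter()
        if delay > 0:        # sleep off the remainder of this time slot
            time.sleep(delay)
    return sent

# Dry run with a no-op callback: roughly 40 calls over 50 ms at 800 Hz
count = probe_at_rate(lambda: None, duration_s=0.05)
```

Scheduling by accumulating `next_t` (rather than sleeping a fixed interval) keeps the long-run rate stable even when individual iterations are delayed.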

Fig. 3.22 WindTalker performance in the case study. (a) Recognize the sensitive input window. (b) Original CSI waveform: the 30th subcarrier. (c) Password inference procedure: keystroke extraction and keystroke inference on the first PCA component within the sensitive input window

While collecting ICMP reply packets, WindTalker also receives additional
network traffic packets from the users' apps. As pointed out in [8] and [7], only some
particular types of packets (e.g., ICMP packets using the “HT” rate) are measured
by the Linux 802.11n CSI Tool. In our real-world experiments, CSI values are not
extracted from the packets generated by the users' apps. Besides, since CSI is
physical-layer information that reflects the wireless channel environment, the CSI
measurements are irrelevant to the type of the network traffic packets. Thus, even if
some additional packets were measured by the Linux 802.11n CSI Tool, they would
only cause the CSI sampling rate to vary slightly. Because we record the timestamp
of each CSI value, we can use the timestamps to reconstruct the CSI stream and
eliminate the impact of sampling-rate variation. Figure 3.22 shows the CSI
waveforms reconstructed according to the timestamps.
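The timestamp-based reconstruction can be sketched as follows. This is a minimal illustration rather than WindTalker's actual code: it assumes one amplitude value per captured packet together with its timestamp, and linearly interpolates the jittered stream onto a uniform 800 Hz grid.

```python
import numpy as np

def reconstruct_csi_stream(timestamps, amplitudes, target_rate=800.0):
    """Resample an unevenly sampled CSI amplitude stream onto a uniform
    time grid, using the per-packet timestamps recorded at capture time."""
    t0, t1 = timestamps[0], timestamps[-1]
    n = int((t1 - t0) * target_rate) + 1
    grid = t0 + np.arange(n) / target_rate
    # Linear interpolation smooths out the jitter caused by extra
    # app-generated packets that slightly perturb the sampling rate.
    return grid, np.interp(grid, timestamps, amplitudes)

# Jittered timestamps (seconds) around the nominal 800 packets/s rate
t = np.array([0.0, 0.0012, 0.0026, 0.0037, 0.0051, 0.0063])
amp = np.array([20.0, 21.0, 19.5, 20.5, 22.0, 21.5])
grid, uniform_amp = reconstruct_csi_stream(t, amp)
```

The same resampling would be applied per subcarrier before the PCA step.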
Recognizing the Sensitive Input Window To determine the sensitive input win-
dow, WindTalker utilizes Wireshark to collect the metadata (e.g., IP addresses) of
the Wi-Fi traffic while collecting the CSI data. The metadata collected by Wireshark
is shown in Fig. 3.22a. We find that, in the experiment, the Alipay application routes
its data to servers at fixed IP addresses such as “110.75.xx.xx”. These IP addresses
are used by the Alipay service provider and do not change for one to two weeks.
With the traffic metadata, as shown in Fig. 3.22a, WindTalker obtains the rough
start time and end time of the sensitive input window by searching for packets

whose destination is “110.75.xx.xx”. Then, according to the timestamps of the CSI
data, WindTalker locks onto the CSI data within this period of time.
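A minimal sketch of this filtering step (the tuple format, function name, and sample addresses below are our own illustration, not WindTalker's code): packets whose destination matches the payment provider's address prefix bound the sensitive input window.

```python
def sensitive_input_window(packets, dst_prefix="110.75."):
    """Given Wireshark-style metadata as (timestamp, dst_ip) tuples,
    return the rough (start, end) of the sensitive input window, i.e.,
    the span of packets routed to the payment provider's servers."""
    hits = [ts for ts, dst in packets if dst.startswith(dst_prefix)]
    if not hits:
        return None
    return min(hits), max(hits)

packets = [
    (10.0, "157.240.1.1"),   # unrelated traffic
    (12.4, "110.75.23.8"),   # first packet toward the payment servers
    (13.1, "110.75.23.8"),
    (18.9, "110.75.40.2"),   # last packet toward the payment servers
    (25.0, "157.240.1.1"),
]
window = sensitive_input_window(packets)   # -> (12.4, 18.9)
```

The returned interval is then used to select the CSI samples whose timestamps fall inside it.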
CSI-Based Password Inference Figure 3.22b shows the original CSI data of the
30th subcarrier in the sensitive input window. After data preprocessing, Fig. 3.22c
shows the first three principal components of the CSI data after PCA. We find in
the real-world experiment that, besides inputting the payment password, the victim
may perform other operations during the sensitive input window, such as selecting
a credit card for payment. To handle this situation, WindTalker only needs to find
a continuous keystroke sequence of a certain length. In our case, we are interested
in a continuous 6-digit password input, since Alipay uses a 6-digit mobile payment
password. Thus, after the keystroke extraction and recognition process, WindTalker
is able to list possible password candidates ranked by probability. The top three
password candidates in this case are 773919, 773619, and 773916, while the actual
password is 773919. We carried out the real-world experiment ten times, with a
different password each time. Our experimental results show that the attacker can
successfully recover 6, 8, and 9 of the passwords when allowed 5, 10, and 50
password attempts (i.e., the top-5, top-10, and top-50 candidates), respectively. This
further demonstrates the practicality of the proposed attack in a practical
environment.

3.6 Countermeasures and Discussions

In this section, we propose some defense strategies to thwart the attacker. We hope
that this discussion will raise privacy awareness around Wi-Fi hotspots and inspire
other researchers to develop more advanced defense techniques.

3.6.1 Fundamental Countermeasures

Randomizing the Keyboard Layout One of the most straightforward defense


strategies is to randomize the layout of the PIN keyboard such that the attacker
cannot recover the typed PIN number even if he can infer the keystroke positions
on the touchscreen. As pointed out by Yue et al. [30], randomizing the keyboard is
effective, but at the cost of user experience, since the user must locate every key on
a freshly randomized layout for each keystroke.
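A sketch of the idea (our own illustration, not the scheme in [30]): each PIN-entry session gets a fresh permutation of the digits, so a recovered touch position maps to a different digit every time.

```python
import random

def randomized_pin_layout(rng=None):
    """Return a random permutation of the ten digits; digits[i] is the
    label displayed at pad position i for this PIN-entry session."""
    digits = list("0123456789")
    (rng or random).shuffle(digits)
    return digits

layout = randomized_pin_layout()
# An attacker who learns "position 3 was pressed" only learns layout[3],
# and the layout changes on every invocation.
```

Passing a seeded `random.Random` instance makes the permutation reproducible for testing.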
Changing Typing Gesture For WindTalker, collecting accurate CSI data is essen-
tial to achieving a high inference success rate. Thus, the user can intentionally
change his typing gestures or clicking patterns to introduce unexpected interference
into the CSI data. For example, randomized human behaviors (e.g., body movement)
have a larger impact on the wireless signal than finger clicks do, which reduces the
adversary's chance of success.

Refusing to Connect to Rogue Wi-Fi The most thorough defense strategy is
refusing to connect to a rogue Wi-Fi hotspot. For instance, [16] and [17] proposed
methods to detect rogue Wi-Fi hotspots. These detection systems assume that the
rogue hotspot and the legitimate hotspot share the same SSID. However, if the
attacker uses a new SSID that the detection system has not observed before, they
will fail as well.
Blocking the ICMP Echo Request Our CSI-based typing inference requires
collecting CSI data at a high frequency. According to [19], the data received in
the echo message must be returned in the echo reply message, which means the
victim's device must reply to the attacker's Wi-Fi hotspot whenever it receives an
ICMP echo request. A countermeasure for the user is to configure a firewall that
detects and blocks high-frequency ICMP echo requests. However, this counter-
measure is rarely used on Android smartphones, because it must be implemented
at the operating-system level, to which common users have no access [26]. In the
three mainstream unrooted smartphones we tested (Xiaomi, Redmi, and Samsung),
none blocks this kind of ICMP echo request, because doing so would prevent other
hosts from pinging the device and degrade the user experience.
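The firewall-level policy can be sketched in user space as a sliding-window rate limiter (an illustration of the rule's logic, not an actual Android firewall configuration): normal occasional pings pass, while the hundreds of requests per second WindTalker needs are dropped.

```python
from collections import deque

class IcmpEchoGuard:
    """Sliding-window rate limiter for ICMP echo requests: a source that
    exceeds max_per_second requests within the last second is dropped."""
    def __init__(self, max_per_second=10):
        self.max_per_second = max_per_second
        self.recent = {}   # source address -> deque of request times

    def allow(self, src, now):
        q = self.recent.setdefault(src, deque())
        while q and now - q[0] > 1.0:   # keep only the last second
            q.popleft()
        if len(q) >= self.max_per_second:
            return False                # drop: likely a CSI probe flood
        q.append(now)
        return True

guard = IcmpEchoGuard(max_per_second=5)
# An 800 Hz probe burst: only the first five requests in the window pass
verdicts = [guard.allow("attacker", i / 800.0) for i in range(10)]
```

A threshold of a few requests per second preserves ordinary diagnostics while starving the high-rate CSI acquisition.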

3.6.2 Proposed Signal Obfuscation Based Countermeasure

In this subsection, we propose a novel obfuscation strategy to defend against CSI-
based side-channel attacks. Our goal is to prevent attackers from collecting the
accurate CSI data induced by the user's password input. Ideally, the strategy can be
implemented and deployed on the user's side and triggered in a user-transparent
way whenever a sensitive input time window is observed. The strategy does not
require the user's participation and thus minimizes its impact on the user
experience.

3.6.2.1 Overview of the Basic Idea

The basic idea of the proposed defense strategy is to introduce a randomly generated
CSI time series to obfuscate the original one. As shown in Fig. 3.23, during the
sensitive input time window, while the attacker collects the CSI data (the original
CSI data) from the target user, the user's device randomly generates some CSI data
(the obfuscation data) to obfuscate the original CSI data and thwart the side-
channel attack. According to the IEEE 802.11n standard [10], when the user has not
launched a sensitive application, the attacker can obtain the CSI data by analyzing
the training sequence in the preamble of the Wi-Fi packets obtained from the
victim's device. Without loss of generality, the original CSI between the victim and
the attacker is estimated as:

Fig. 3.23 Obfuscation defense strategy: the victim's device always answers the hotspot's ICMP requests with ICMP replies; once a sensitive application is detected, it additionally sends obfuscation packets to the attacker's hotspot

H1 = Y1 / X1, (3.9)

where X1 is the training sequence at the transmitter and Y1 is the received training
sequence at the receiver. In practice, both the transmitter and the receiver assume
that the training sequence X1 does not change during the whole communication
process.
During the password input process for a specific mobile payment application
(e.g., Alipay), the defense strategy is launched. The attacker uses ICMP requests
to obtain Wi-Fi packets from the victim and, at the same time, the user's device
proactively sends obfuscation packets to the attacker. For instance, the training
sequence X1 in Eq. 3.9 is changed into:

X2 = H · X1. (3.10)

The revised training sequence will be received by the attacker as follows:

Y2 = H1 · X2 = H1 · H · X1 = H2 · X1. (3.11)

From the attacker's perspective, the original data and the obfuscation data are
indistinguishable. Because the attacker still uses the original training sequence X1
to estimate the CSI, he would estimate the victim's CSI as H2 = H1 · H. This means
the original CSI data are masked by the forged CSI data H2 inserted into the
original CSI sequence H1. Thus, the CSI-based side-channel attack is thwarted,
because the attacker can no longer infer the user's keystrokes by analyzing the CSI
data.
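The effect of Eqs. 3.9–3.11 can be checked numerically with synthetic channels; the random complex gains below merely stand in for the true channel H1 and the obfuscation factor H, and are not measured values.

```python
import numpy as np

rng = np.random.default_rng(7)
n_sub = 30   # subcarriers of one 802.11n stream

# Synthetic channels: H1 is the true victim-attacker channel; H is the
# random factor applied by the obfuscation strategy.
H1 = rng.normal(1.0, 0.2, n_sub) + 1j * rng.normal(0.0, 0.2, n_sub)
H = rng.normal(1.0, 0.5, n_sub) + 1j * rng.normal(0.0, 0.5, n_sub)

X1 = np.ones(n_sub, dtype=complex)   # training sequence known to both sides
X2 = H * X1                          # revised training sequence (Eq. 3.10)
Y2 = H1 * X2                         # signal seen by the attacker (Eq. 3.11)

H_est = Y2 / X1                      # attacker still assumes X1 was sent
assert np.allclose(H_est, H1 * H)    # estimate is H2 = H1*H, not H1
```

Since the attacker divides by the unmodified X1, the keystroke-induced structure in H1 is multiplicatively scrambled by the random factor H.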

Fig. 3.24 The 4th subcarrier waveform in the experiment: three distinguishable keystroke waveforms without obfuscation, and two illegible keystroke waveforms due to the obfuscation device

3.6.2.2 Experiment Evaluations

We perform an experiment to verify the effectiveness of the proposed strategy.
Ideally, the user's device itself would generate the obfuscation data. In our
experiments, we adopt one mobile phone as the target device and another phone
as the obfuscation device to perform a proof-of-concept evaluation of the proposed
defense strategy. In practice, to implement this defense strategy on the user's device,
Software Defined Radios (SDRs) can be used to revise the training sequence of
mobile devices [14].
In our experiment, both devices connect to a Wi-Fi hotspot released by
WindTalker. WindTalker uses the ICMP-based CSI Acquirement Module to obtain
the CSI data H1 from the victim, and during this period the victim continuously
types PIN numbers. When the victim launches a sensitive application (e.g., Alipay),
the obfuscation device continuously sends packets (e.g., UDP packets) to
WindTalker, so that WindTalker receives mixed CSI data. Note that the obfuscation
device is placed at different locations to obtain different estimated CSI values H2.
The result is shown in Fig. 3.24. Without the involvement of the obfuscation
device, WindTalker works normally, and the finger clicks are easily distinguished
in the CSI H1. With the obfuscation device involved, the finger-click patterns are
obfuscated by the forged CSI measurements H2, which demonstrates the
effectiveness of this defense strategy.

3.6.3 Limitations and Discussions

In this section, we discuss the main limitations of WindTalker. WindTalker's high
performance is achieved in an experimental environment; to apply WindTalker at
any time and in any place, the following limitations need to be overcome.
Hardware Limitations In WindTalker, we use the Intel 5300 NIC and the Linux
802.11n CSI Tool [8]. In our experiments, we observe that the system may crash
when collecting CSI data for an iPhone or some versions of Android smartphones.
According to the statement of the CSI Tool, it easily crashes when one Intel 5300
NIC works with other NICs (e.g., an iPhone's). However, our implementation and
evaluation on a wide range of smartphones (including Xiaomi, Redmi, and Samsung
phones) demonstrate the practicality of the proposed CSI-based keystroke inference
method. We leave improving the compatibility of the Intel 5300 NIC with a wider
range of mobile devices to future work.
Fixed Typing Gesture Currently, WindTalker only works when the victim touches
the screen with a relatively fixed gesture and the phone is placed in a relatively
stable environment (e.g., on a table). In reality, the user may type in an ad hoc
way (e.g., the victim may hold and shake the phone or perform other actions while
typing). We argue that this is a common problem for most side-channel-based
keystroke inference schemes such as [1, 13, 18]. This problem can be partially
circumvented by profiling the victim in advance or by performing a targeted attack
that applies the relevant movement model, as pointed out by Liu et al. [13].
User-Specific Training WindTalker needs to extract the keystroke samples from
the victim before launching a password inference attack. This requirement is
a common assumption for most of the side-channel keystroke inference attacks
such as [1, 4, 15, 20, 24, 28]. To launch a real-world attack, the attacker can
consider the following two strategies. First, WindTalker can leverage social
engineering methods to collect training data from the victim. For example, the
attacker can implement online training by mimicking a Text Captcha that requires
the victim to input chosen numbers. As shown in Sect. 3.4.2.2, given three training
samples per key, WindTalker achieves a 6-digit password inference accuracy of
69.6% under the top-5 password candidates. The second strategy is to use the self-
contained structure of the collected CSI data. For example, Fang et al. [5] propose
a training-free CSI-based keystroke inference system in which the attacker extracts
correlations between the CSI features of keystrokes and then maps the collected CSI
data to a word in a pre-defined dictionary. Applying this idea to our 6-digit password
inference scenario may be a potential solution, and we leave it for future work.

3.7 Summary

In this chapter, we have presented a novel side-channel attack named WindTalker,
which can infer a victim's mobile password via Wi-Fi signals. WindTalker is a
cross-layer inference system that utilizes both network-layer traffic information and
physical-layer CSI information. Our experiments on Alipay show that WindTalker
is effective in recognizing the victim's password on smartphones. Compared with
previous works, WindTalker neither deploys external devices close to the target
device nor compromises the target device. Furthermore, we proposed a CSI
obfuscation-based countermeasure and experimentally demonstrated its
effectiveness.

References

1. Ali, K., Liu, A.X., Wang, W., Shahzad, M.: Keystroke recognition using WiFi signals.
In: Proceedings of the 21st Annual International Conference on Mobile Computing and
Networking, pp. 90–102. ACM (2015)
2. Balzarotti, D., Cova, M., Vigna, G.: ClearShot: eavesdropping on keyboard input from video.
In: Security and Privacy, 2008. SP 2008. IEEE Symposium on, pp. 170–183. IEEE (2008)
3. Benko, H., Wilson, A.D., Baudisch, P.: Precise selection techniques for multi-touch screens.
In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp.
1263–1272. ACM (2006)
4. Chen, B., Yenamandra, V., Srinivasan, K.: Tracking keystrokes using wireless signals. In:
Proceedings of the 13th Annual International Conference on Mobile Systems, Applications,
and Services, pp. 31–44. ACM (2015)
5. Fang, S., Markwood, I., Liu, Y., Zhao, S., Lu, Z., Zhu, H.: No training hurdles: fast training-
agnostic attacks to infer your typing. In: Proceedings of the 2018 ACM SIGSAC Conference
on Computer and Communications Security, CCS ’18, pp. 1747–1760. Association for
Computing Machinery, New York (2018). https://doi.org/10.1145/3243734.3243755
6. Forlines, C., Wigdor, D., Shen, C., Balakrishnan, R.: Direct-touch vs. mouse input for tabletop
displays. In: Proceedings of the SIGCHI conference on Human factors in computing systems,
pp. 647–656. ACM (2007)
7. Halperin, D.: Linux 802.11n CSI tool. http://dhalperi.github.io/linux-80211n-csitool/faq.html
8. Halperin, D., Hu, W., Sheth, A., Wetherall, D.: Tool release: gathering 802.11n traces with
channel state information. ACM SIGCOMM Comput. Commun. Rev. 41(1), 53–53 (2011)
9. Holt, C.C.: Forecasting seasonals and trends by exponentially weighted moving averages. Int.
J. Forecasting 20(1), 5–10 (2004)
10. IEEE Std. 802.11n-2009: enhancements for higher throughput (2009). http://www.ieee802.org
11. Li, M., Meng, Y., Liu, J., Zhu, H., Liang, X., Liu, Y., Ruan, N.: When CSI meets public WiFi:
inferring your mobile phone password via WiFi signals. In: Proceedings of the 2016 ACM
SIGSAC Conference on Computer and Communications Security, CCS ’16, pp. 1068–1079.
ACM, New York (2016). https://doi.org/10.1145/2976749.2978397
12. Liu, J., Wang, Y., Kar, G., Chen, Y., Yang, J., Gruteser, M.: Snooping keystrokes with mm-level
audio ranging on a single phone. In: Proceedings of the 21st Annual International Conference
on Mobile Computing and Networking, pp. 142–154. ACM (2015)
13. Liu, X., Zhou, Z., Diao, W., Li, Z., Zhang, K.: When good becomes evil: keystroke inference
with smartwatch. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and
Communications Security, pp. 1273–1285. ACM (2015)

14. Mao, Y., Zhang, Y., Zhong, S.: Stemming downlink leakage from training sequences in multi-
user MIMO networks. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer
and Communications Security, pp. 1580–1590. ACM (2016)
15. Marquardt, P., Verma, A., Carter, H., Traynor, P.: (sp) iPhone: decoding vibrations from nearby
keyboards using mobile phone accelerometers. In: Proceedings of the 18th ACM Conference
on Computer and Communications Security, pp. 551–562. ACM (2011)
16. Nakhila, O., Dondyk, E., Amjad, M.F., Zou, C.: User-side Wi-Fi evil twin attack detection
using SSL/TCP protocols. In: 2015 12th Annual IEEE Consumer Communications and
Networking Conference (CCNC), pp. 239–244 (2015). https://doi.org/10.1109/CCNC.2015.
7157983
17. Nakhila, O., Zou, C.: User-side Wi-Fi evil twin attack detection using random wireless channel
monitoring. In: MILCOM 2016—2016 IEEE Military Communications Conference, pp. 1243–
1248 (2016). https://doi.org/10.1109/MILCOM.2016.7795501
18. Owusu, E., Han, J., Das, S., Perrig, A., Zhang, J.: Accessory: password inference using
accelerometers on smartphones. In: Proceedings of the Twelfth Workshop on Mobile Com-
puting Systems & Applications, pp. 1–6 (2012)
19. Postel, J.: Internet Control Message Protocol. RFC 792, Internet Engineering Task Force
(1981). http://www.rfc-editor.org/rfc/rfc792.txt
20. Ristenpart, T., Tromer, E., Shacham, H., Savage, S.: Hey, you, get off of my cloud: exploring
information leakage in third-party compute clouds. In: Proceedings of the 16th ACM Confer-
ence on Computer and Communications Security, pp. 199–212. ACM (2009)
21. Sardy, S., Tseng, P., Bruce, A.: Robust wavelet denoising. IEEE Trans. Signal Process. 49(6),
1146–1152 (2001). https://doi.org/10.1109/78.923297
22. Sen, S., Lee, J., Kim, K.H., Congdon, P.: Avoiding multipath to revive inbuilding WiFi
localization. In: Proceeding of the 11th Annual International Conference on Mobile Systems,
Applications, and Services, pp. 249–262. ACM (2013)
23. Shukla, D., Kumar, R., Serwadda, A., Phoha, V.V.: Beware, your hands reveal your secrets!
In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications
Security, pp. 904–917. ACM (2014)
24. Sun, J., Jin, X., Chen, Y., Zhang, J., Zhang, R., Zhang, Y.: Visible: video-assisted keystroke
inference from tablet backside motion. In: Network and Distributed System Security Sympo-
sium, pp. 1–15 (2016)
25. Tan, S., Yang, J.: WiFinger: leveraging commodity WiFi for fine-grained finger gesture
recognition. In: Proceedings of the 17th ACM International Symposium on Mobile Ad Hoc
Networking and Computing, pp. 201–210. ACM (2016)
26. Vecchiato, D., Martins, E.: Experience report: A field analysis of user-defined security
configurations of android devices. In: 2015 IEEE 26th International Symposium on Software
Reliability Engineering (ISSRE), pp. 314–323 (2015). https://doi.org/10.1109/ISSRE.2015.
7381824
27. Wang, F., Cao, X., Ren, X., Irani, P.: Detecting and leveraging finger orientation for interaction
with direct-touch surfaces. In: Proceedings of the 22nd Annual ACM Symposium on User
Interface Software and Technology, pp. 23–32. ACM (2009)
28. Wang, J., Zhao, K., Zhang, X., Peng, C.: Ubiquitous keyboard for small mobile devices:
harnessing multipath fading for fine-grained keystroke localization. In: Proceedings of the 12th
Annual International Conference on Mobile Systems, Applications, and Services, pp. 14–27.
ACM (2014)
29. Xie, Y., Li, Z., Li, M.: Precise power delay profiling with commodity WiFi. In: Proceedings of
the 21st Annual International Conference on Mobile Computing and Networking, MobiCom
’15, pp. 53–64. ACM, New York (2015). https://doi.org/10.1145/2789168.2790124
30. Yue, Q., Ling, Z., Fu, X., Liu, B., Ren, K., Zhao, W.: Blind recognition of touched keys on
mobile devices. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and
Communications Security, pp. 1403–1414. ACM (2014)

31. Zhang, J., Zheng, X., Tang, Z., Xing, T., Chen, X., Fang, D., Li, R., Gong, X., Chen, F.: Privacy
leakage in mobile sensing: your unlock passwords can be leaked through wireless hotspot
functionality. Mobile Inform. Syst. 2016, 8793025 (2016)
32. Zhu, T., Ma, Q., Zhang, S., Liu, Y.: Context-free attacks using keyboard acoustic emanations.
In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications
Security, pp. 453–464. ACM (2014)
Chapter 4
Wireless Signal Based Two-Factor
Authentication at Voice Interface Layer

4.1 Introduction

In the smart home environment, users can control their domestic appliances
(e.g., lights, temperature controllers, electronic switches, microwaves, refrigerators)
via various user interfaces such as image sensing, wireless communication, and
voice commands. As reported by Market Research Future in 2020 [16], the voice
interface is predicted to become the primary user interface for the smart home, with
global revenue reaching $7.3 billion by 2025. Typical IoT voice control systems
include Amazon Alexa [3], Samsung SmartThings [28], and Google Home [17].
However, the inherent broadcast nature of the voice channel opens a door for
adversaries to inject malicious commands (i.e., spoofing attacks). Besides the
classical replay attack [10, 35], researchers have proposed emerging attacks that
leverage flaws in smart speakers. On the hardware side, the nonlinearity of the
microphone's frequency response opens the door to inaudible ultrasound-based
attacks (e.g., the Dolphin attack [40] and the BackDoor attack [27]). On the software
side, the deep learning models employed by both speech recognition and speaker
verification have been shown to be vulnerable to emerging adversarial attacks such
as hidden voice commands [5], CommanderSong [38], and user impersonation [42].
Spoofing attacks impose safety issues (e.g., deliberately turning on a smart
thermostat [11]) and privacy risks (e.g., querying the user's schedule information)
on the smart speaker and therefore cause great concern.
In order to thwart voice spoofing attacks, the most intuitive defense strategy is
voice password-based access control, in which the user is required to speak a special
password before inputting voice commands [1]. However, speaking a password is
both inconvenient for the user and vulnerable to eavesdropping attacks. Therefore,
liveness detection has been proposed as an effective countermeasure. Liveness
detection leverages the fact that the voices in a spoofing attack are played by
electronic devices (e.g., a high-

quality loudspeaker [35] or an ultrasonic dynamic speaker [40]). Thus, physical
characteristics that differ between humans and machines can be used as “liveness”
factors. The existing countermeasures (a.k.a. liveness detection) can be divided into
multi-factor authentication and passive schemes. In this chapter, we focus on the
former: multi-factor authentication.
Multi-factor authentication-based liveness detection exploits information that is
closely correlated with the operation of the VCS (i.e., images/videos collected by
a camera [7], the magnetic field emitted by a loudspeaker [6], time-difference-of-
arrival changes across the different microphones of a smartphone [39], acceleration
data from the user's wearable devices [15], and the Doppler shift of ultrasound
caused by the user's mouth motion [41]) as the user's liveness features to
differentiate between voice samples generated by a legitimate user and by an
adversary. However, the existing two-factor-based liveness detection schemes
require the user to either carry specialized sensing devices or perform specific
actions to collect the liveness information, which limits their practicality. More
seriously, some of these schemes pose unacceptable privacy risks, since the user's
daily behaviors may be leaked from the collected information (e.g., the image or
video data in [7]).
In this chapter, we present WSVA, a wireless signal-based voice authentication
system to thwart the spoofing attacks aiming at VCS. Unlike prior liveness detection
schemes, WSVA is a device-free system without requiring the user to carry any
additional device or sensor, and it leverages the prevalent wireless signals generated
by Wi-Fi devices in IoT environment. WSVA is motivated by the following
observations. Firstly, inspired by the wide application of lip-reading technology,
it is feasible to understand speech by sensing the movements of the lips, face,
and tongue; in other words, a voice command can be cross-checked against the
user's mouth motions. Secondly, prior research shows that indoor object movement
disturbs the multipath propagation of wireless signals and is reflected in the Channel
State Information (CSI) of Wi-Fi signals, so a variety of human activities can be
identified using CSI-based wireless sensing techniques. Therefore, WSVA builds
the correlation between the user's mouth motion and the environmental CSI change,
and leverages this correlation to verify the liveness of voice commands received by
the voice interface.
To achieve the goal mentioned above, WSVA needs to address three challenges.
(1) The impact of mouth motion on wireless signals is subtle. Although previous
works utilize sophisticated methods such as MIMO beamforming or Frequency-
Modulated Continuous Wave (FMCW) [33, 36] to improve the wireless sensing
capability, they may not be suitable for our problem because the commercial
IoT devices are resource-constrained and cannot implement these sophisticated
wireless techniques. (2) According to our experimental results, only the jaw and
tongue movements can be recognized by wireless signals, while the vocal vibration,
which contributes a lot to the voice signal, cannot be distinguished. Besides,
prior works [24, 33] pointed out that not all voice syllables can be recognized
by lip-reading techniques. (3) To correlate the voice and CSI signals, how to
select appropriate features from these two types of signals remains a significant
challenge.
To address the above challenges and achieve liveness detection, WSVA firstly
builds a new model to describe the correlation among the CSI changes, the mouth
motions, and the syllables of the received voice signals. Then, WSVA proposes a
novel signal processing method to filter the noises of collected voice and CSI signals
and to extract syllables and mouth motions within the voice command. Further,
WSVA utilizes a novel method to extract both time-domain and frequency-domain
features of the two types of signals and performs liveness detection. We conduct experiments to
evaluate the liveness detection performance of WSVA and demonstrate its feasibility
in IoT environment. The contributions of this chapter are summarized as follows:
• We present WSVA, a two-factor liveness detection system to thwart the various
voice spoofing attacks aiming at VCS. By utilizing the existing wireless signals
in IoT environments, WSVA shows its advantages of being device-free, easy to
deploy, and privacy-preserving.
• We study the correlation between voice samples and wireless signals. Specif-
ically, we build a mapping model to correlate the syllables within the voice
command, the user’s mouth motions, and their corresponding CSI change
patterns.
• We devise the architecture and algorithms of WSVA. We exploit some effective
technical mechanisms to process voice samples and CSI data, design novel
algorithms to extract the features from these different types of signals and
propose the liveness decision algorithm.
• We design and implement a testbed to demonstrate the practicability of WSVA.
We evaluate the impact of various factors on WSVA, and our experimental results
on 6 volunteers show that WSVA achieves 99% liveness detection accuracy with
1% false accept rate.
We point out that WSVA does not propose to use wireless signals for lip reading
since the existing works [13, 33] have shown that the lip reading accuracy is limited.
Instead, this chapter aims to utilize the consistency between voice and CSI signals
to authenticate the voice commands.
The remainder of this chapter is organized as follows. Sect. 4.2 introduces the
preliminary knowledge and the research motivation. In Sect. 4.3, we introduce the
design details of WSVA. We evaluate the performance of WSVA and discuss its
limitations in Sect. 4.4 and Sect. 4.5, respectively. Finally, Sect. 4.6 concludes this
chapter.

4.2 Preliminaries and Motivation

In this section, we introduce the background knowledge, including the attack model
and articulatory gestures. Then, to elaborate the research motivation of WSVA, this
section performs a series of experiments to answer the following questions: (1)
Do the mouth motions really have a correlation with the change of Wi-Fi signals?
80 4 Wireless Signal Based Two-Factor Authentication at Voice Interface Layer

Fig. 4.1 Illustrations of the attacks on the voice interface (an attacker uses a loudspeaker to
launch replay or adversarial example attacks against the microphone of a voice interface such as
Siri or Alexa)

(2) How can we model this correlation between the mouth motions and the CSI
vibration?

4.2.1 Attack Model

As shown in Fig. 4.1, a spoofing attack is defined as the adversary trying to fool
the VCS by injecting pre-collected or forged voice commands. The existing
studies show that there are three major types of spoofing attacks.
Replay Attack To fool the voice interface, the attacker collects the legitimate
user’s audio samples and then plays them back with a high-quality loudspeaker
[10]. For the voice interface without liveness detection functionality, the replayed
voice will be translated as the voice command. The victim’s voice audio can be
recorded or captured in many manners, including but not limited to websites, daily
conversations, and phone calls.
Advanced Adversarial Attacks Even if attackers can only collect a limited
number of the target user's voice samples, by adopting the latest voice synthesis
techniques [8] it is still feasible to attack existing speech recognition and speaker
verification systems. For instance, the adversary can embed subtle perturbations
into the audio (e.g., background sound [5], music [38], or a broadcaster's voice [42])
or use inaudible ultrasounds [27, 40] to launch an attack without raising the victim's
concern.
Without loss of generality, in the remainder of this chapter, we use spoofing
attacks to represent the above-mentioned kinds of attacks. Our proposed defense
scheme is based on the fact that, in the spoofing attacks, the fake voice commands
are generated by the machine rather than the human, which means that there are no
corresponding mouth motions for these voice commands. This inconsistency can be
leveraged for performing liveness detection.
Note that, the replay attack is the most effective among various spoofing
approaches since it preserves the most comprehensive voiceprint of the victim
and requires no cumbersome hardware configurations and software parameter fine-
tuning. Therefore, we mainly investigate how to leverage liveness detection to
thwart replay attacks since most of the existing voice biometric-based authentication
(human speaker verification) systems are vulnerable to this kind of replay attack.
Besides, there exists another attack type—insider attack, which means the adversary
can break into the home and impersonate a real user to inject fake voice command.
However, this attack model has a very strong assumption and is less practical
in smart home environment. Therefore, this attack type is not considered in our
proposed liveness detection scheme, and we will discuss its defense strategy in
Sect. 4.5.

4.2.2 User’s Articulatory Gesture

As shown in Fig. 4.2a, it is well known that articulation is related to human organs
(e.g., vocal cords, tongue, lips, jaw). The voice differences depend on the motions
of organs, which could affect the vibration frequency of the air (i.e., the timbre).
According to the air vibration position, the procedure of voice generation can be
divided into the following three stages: (1) Voice generation procedure starts when
the air is sent out from the thorax. The air passes through the vocal cords,
comprising cartilages and muscles, whose different shapes and positions have a significant
effect on air propagation. (2) The air arrives at the soft palate after passing through
the pharynx. The soft palate controls the direction and speed of the airflow and
decides whether it can enter the nasal cavity. (3) The voice wave is about to leave
the mouth when the air arrives at the oral cavity, after which the voice is spread
in the air. In this period, the user can produce different phonemes with different
motions of the tongue, lips, and jaw, which is known as articulatory gesture. As

Fig. 4.2 Articulatory gestures for voice pronunciation. (a) Vocal organs and consonant pronunciation [25]. (b) Vowel pronunciation

shown in Fig. 4.2, according to the International Phonetic Alphabet [37], the users
pronounce different phonemes with different mouth shapes. For instance, as shown
in Fig. 4.2b, the position of the jaw can be halfway opened and fully opened when
the user pronounces /e/ and /a/, respectively.

4.2.3 The Influence of Mouth Movement on Wireless Signals

In this subsection, we first review the wireless signal related knowledge mentioned
in Sect. 3.2.1. Then, we explore the relationship between the mouth motion and
the CSI vibration. Finally, we model the correlation among CSI vibrations, voice
syllables, and user mouth motions.

4.2.3.1 Wireless Signals in Voice Interface

In this chapter, we consider the Wi-Fi wireless communication protocol, which is
widely applied by many IoT devices (e.g., smart cameras and smart alarms) [30]. In
Wi-Fi communication, CSI characterizes Channel Frequency Response (CFR) H in
different subcarriers. In this chapter, we only consider the system with a single
antenna pair, and thus the CSI data extracted from a packet can be represented by
an Ns-dimension vector. For the i-th subcarrier, the CSI value Hi can be defined as:

Hi = |Hi | e^{j ∠Hi} = αe^{−j 2πf τ}, (4.1)

where α is the signal magnitude attenuation, f is the frequency, and τ is the time-
of-flight. Given the signal propagation path length d, the signal wavelength λ, and
the speed of light c, τ can be calculated as τ = d/c, and Eq. 4.1 can be rewritten as:

Hi = αe−j 2π cτ/λ = αe−j 2π d/λ . (4.2)

According to Eqs. 4.1 and 4.2, when the user speaks a voice command, the
movements of the lips and the jaw will change the d and α of the wireless signal.
These constructive and destructive interferences of several multi-path signals will be
reflected by a unique pattern in the time series of CSI values, which can be related to
the presence of the legitimate voice command. In this study, CSI extraction is quite
easy: we can deploy Universal Software Radio Peripheral (USRP) [14] to extract
CSI with 52 subcarrier values.

4.2.3.2 The Influence of Mouth Motion on CSI

Figure 4.3a demonstrates the typical scenario of human voice commanding in voice
interfaces such as SmartThings or Amazon Alexa platform. When a user interacts
Fig. 4.3 Illustrations of the basic idea of WSVA. (a) Human speaking scenario. (b) Voice and CSI
samples during authentic voice commanding. (c) CSI samples during voice spoofing

with the voice interface, WSVA exploits a pair of antennas of the IoT devices in the
proximity to collect the CSI data from Wi-Fi packets and leverage a microphone
to record the voice samples simultaneously. Generally speaking, since CSI reflects
the environmental constructive and destructive interference on several multi-path
signals, the change of multi-path propagation caused by the mouth motions during
the voice speaking can generate a unique pattern in the time series of CSI values.
In this case, we investigate the influence of the mouth motions on the CSI, which
can be regarded as a liveness pattern of the user. As shown in Fig. 4.3b, the
dramatic fluctuations of CSI waveforms happen with the occurrence of human voice
commands. However, as shown in Fig. 4.3c, if an adversary launches the spoofing
attack described in Sect. 4.2.1, in which the spoofing voice command is injected
without any corresponding mouth motion, the attacks can be easily detected due
to the lack of the corresponding changes in CSI data. Therefore, our experimental
results validate our intuition that it is feasible to leverage the consistency of
fluctuations between voice samples and CSI streams to detect spoofing attacks.

4.2.3.3 Modeling the Correlation Among CSI Vibrations, Voice Syllables, and User Mouth Motions

Previous works have demonstrated that human movements can be sensed via
wireless signals [26, 29, 31, 34]. However, in IoT environment, achieving very
precise speech recognition is hard since it may go beyond the sensing capability
of Wi-Fi signal. As shown in Eq. 4.2, the sensing capability of wireless signal

Fig. 4.4 Four types of mouth motion shapes. (a) Hiant. (b) Grin. (c) Round. (d) Pout

depends on the wavelength of the signals. In practice, the Wi-Fi signal (e.g., 12.5 cm
wavelength for 2.4 GHz) based sensing mechanisms cannot accurately capture the
tiny human mouth motions. To make matters worse, apart from the motions of the
tongue, lips, and jaw, Wi-Fi can hardly recognize the impact of other vocal organs.
According to the study of Dodd et al. [12], only 40% of English words can be
recognized by considering mouth motions alone.
Considering that it is not feasible to achieve accurate lip reading via Wi-Fi
signals, WSVA authenticates the voice commands by checking the consistency
between voice and CSI signals rather than accurately identifying each syllable.
Therefore, in this chapter, by analyzing the International Phonetic Alphabet, we
classify the mouth motions into four categories, including hiant, grin, round, and
pout, which correspond to Fig. 4.4a–d, respectively. With the exception of a few
syllables (e.g., /ə/) with non-significant mouth motions, most phonetic syllables
can be categorized into one of these types. As shown in Table 4.1, the hiant, the
motion of opening the mouth largely, can pronounce the phonemes like /a:/ and /æ/,
which can be heard in the words "bar" and "cat". The grin, the motion of grinning
as shown in Fig. 4.4b, can pronounce phonemes like /e/ and /ei/, which
can be heard in "A" and "base". The round, the motion of rounding the lips at ease,
can generate phonemes like /ɔ:/, which can be heard in "lot" and "saw". Finally,
the pout, the motion of pouting the lips, can send out phonemes like /u:/, which can be heard
in words “root” and “shoe”. After such a classification, different types of mouth
motions can be correlated with different CSI features according to relevant voice
syllables, as mentioned in the following sections.

Table 4.1 Four typical mouth movements and their corresponding syllables

Mouth motion    Syllables        Words
Hiant           /a:/ /æ/ /ai/    bar, cat
Grin            /e/ /ei/         A, base
Round           /ɔ/ /ɔ:/         lot, saw
Pout            /u:/ /ʊ/         root, shoe
Non-obvious     /ə/ /iə/         sir, here

Fig. 4.5 Workflow of WSVA (Data Collection Module: CSI and voice collection; Data Cleansing
and Preprocessing Module: wavelet de-noising and PCA on CSI, word and syllable segmentation
on voice; Feature Extraction Module: macro-level STFT contour features and time/frequency
mouth motion features; Feature Matching Module: similarity comparison yielding a legitimate or
adversarial decision)

4.3 WSVA: Wireless Signal Based Two Factor Authentication

The basic strategy of WSVA to detect whether a voice command is authentic is to
check the consistency between the voice samples and the corresponding CSI data
introduced by mouth motions. The CSI data can be collected via a specialized device
(e.g., USRP) or the COTS device. In the context of the smart home, with the
prevalence of IoT platforms such as Samsung SmartThings, which controls smart
devices with wireless signals, it is technically feasible to take advantage of these
existing wireless infrastructures to collect the voice samples and their corresponding
CSI data simultaneously.
WSVA consists of the following four modules as shown in Fig. 4.5. In Data
Collection Module, when human voice is detected by the voice interface, WSVA
collects the voice samples and its corresponding CSI data. In Data Cleansing
and Preprocessing Module, WSVA exploits wavelet-based method to remove the

noise in CSI and segments the collected voice samples. Feature Extraction Module
enables WSVA to select appropriate features from macro-level and mouth motion,
respectively. Finally, Feature Matching Module utilizes a classification mechanism
to determine whether the received voice command is authentic or spoofing.

4.3.1 Data Collection Module

This subsection introduces how to collect voice samples and the corresponding
CSI data. Most of the voice interfaces (e.g., Google Now and Amazon Alexa)
require the user to speak a pre-defined magic word as a trigger. For instance, Apple
iPhone needs “Hey, Siri” and Amazon Alexa needs “Alexa” to initialize their voice
assistants. Only when the voice trigger is recognized by the VCS, WSVA will be
activated and start to collect voice samples V and CSI data H by utilizing the
microphone and antenna pair, respectively. To collect CSI, the antennas can be
equipped by different devices or incorporated into the same IoT device. One of these
antennas acts as a transmitter to continuously send wireless packets (e.g., broadcast
packets), and the other receives packets and extracts CSI data from the preamble
sequences of these wireless packets.

4.3.2 Data Cleansing and Preprocessing


4.3.2.1 CSI Denoising

To improve the liveness detection performance, we should remove the background
noise from the originally collected CSI data. In this chapter, for each subcarrier of
the collected CSI data H , WSVA leverages wavelet-based denoising to eliminate
high-frequency noises. The manner is identical to that described in Sect. 3.3.3.1. As
shown in Fig. 4.6c, after wavelet-based denoising, most of the burst noises in Hi can
be removed.
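The wavelet denoising step can be sketched as follows. This is a minimal one-level Haar decomposition with soft thresholding written for illustration only; the chapter's actual method (Sect. 3.3.3.1) uses a deeper wavelet decomposition, and the threshold value here is arbitrary:

```python
import numpy as np

def haar_denoise(x, thr):
    """One-level Haar wavelet denoising: decompose into approximation and
    detail coefficients, soft-threshold the details (burst noise concentrates
    there), then reconstruct."""
    n = len(x) - len(x) % 2                      # even length for pairing
    a = (x[:n:2] + x[1:n:2]) / np.sqrt(2)        # approximation coefficients
    d = (x[:n:2] - x[1:n:2]) / np.sqrt(2)        # detail coefficients
    d = np.sign(d) * np.maximum(np.abs(d) - thr, 0.0)   # soft thresholding
    y = np.empty(n)
    y[0::2] = (a + d) / np.sqrt(2)               # inverse Haar transform
    y[1::2] = (a - d) / np.sqrt(2)
    return y

spike = np.zeros(8)
spike[3] = 1.0                                   # a burst noise sample
smoothed = haar_denoise(spike, 0.5)
```

A smooth signal passes through unchanged, while the isolated spike is attenuated, which is the behavior desired for CSI burst noise.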

4.3.2.2 Voice Samples Segmentation

It is observed that after performing wavelet-based denoising, the CSI waveform
shows a strong correlation with mouth motions. In order to verify the consistency
between CSI and voice samples, it is critical to detect the start and end points
of each mouth motion sample. However, since the CSI vibration is caused by the
movement of the surrounding object, it does not mean all vibrations are related to
voice generation. Therefore, we first perform word level segmentation and phoneme
level segmentation on voice samples and then extract the corresponding CSI data
according to the timestamps. The detailed steps are introduced below.

Fig. 4.6 An example of word-level segmentation. (a) Original voice samples. (b) An illustration
of the double-threshold method. (c) Inter-word segmentation on CSI data

Word Level Segmentation When the user speaks a command, there is a short inter-
val (e.g., 200 ms) between the pronunciation of two successive words. Therefore,
the interval between two word samples can be utilized to segment voice command
into different word samples. WSVA exploits the double-threshold detection method.
Specifically, WSVA splits the voice samples V into frames of Nv points, shifted
by Ns points each time. In this study, Nv and Ns are set to 512 and 256,
respectively. For a total of N frames, WSVA calculates their short-term energy STE[n]
and zero-crossing rate ZCR[n], and selects two adaptive thresholds on STE[n] and
ZCR[n] to detect the start and end points sv,i and ev,i of the i-th word Wi . Then,
according to the timestamps, we can also divide the CSI data into several word
waveforms. Figure 4.6 illustrates the procedure of inter-word segmentation. For
the k-th CSI subcarrier H (:, k), the CSI waveform HW,i corresponding to the i-th
word Wi can be represented as follows.

HW,i = H (sc,i : ec,i , k), (4.3)



where sc,i and ec,i are the start and end CSI indexes of the i-th word Wi , which are
converted from the timestamps sv,i and ev,i on the voice samples. Note that sc,i and
ec,i are extended on both sides by 200 CSI indexes, respectively, because the CSI
change introduced by the mouth motion can occur slightly earlier or later than the
speech can be heard.
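The double-threshold idea above can be sketched in a few lines. For brevity this sketch applies only the STE threshold (the full method also thresholds ZCR); the frame parameters follow the values above, while the synthetic signal and the simple midpoint-adaptive threshold are our own illustrative choices:

```python
import numpy as np

def segment_words(v, nv=512, ns=256):
    """Word-level segmentation sketch: frame the signal, compute short-term
    energy (STE), threshold it, and return (start, end) sample indexes of
    each voiced run. The adaptive threshold here is a simple midpoint."""
    n_frames = (len(v) - nv) // ns + 1
    ste = np.array([np.sum(v[i * ns : i * ns + nv] ** 2) for i in range(n_frames)])
    thr = 0.5 * (ste.max() + ste.min())
    active = ste > thr
    words, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                                     # word begins
        elif not a and start is not None:
            words.append((start * ns, (i - 1) * ns + nv))  # word ends
            start = None
    if start is not None:
        words.append((start * ns, (n_frames - 1) * ns + nv))
    return words

# Synthetic signal: 0.5 s silence, 0.5 s of a 440 Hz tone, 0.5 s silence
fs = 8000
t = np.arange(fs // 2) / fs
v = np.concatenate([np.zeros(fs // 2), 0.5 * np.sin(2 * np.pi * 440 * t),
                    np.zeros(fs // 2)])
words = segment_words(v)
```

On this signal the single detected word spans roughly samples 4000 to 8000, matching the tone burst.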
Phoneme Level Segmentation and Mouth Motion Inference For a specific word,
its pronouncing behavior may involve more than one mouth motion. For instance,
speaking the word “open” needs the mouth motions of “round” and “grin”. Besides,
as mentioned in Sect. 4.2.3.3, the correlation between different categories of mouth
motion and CSI vibration types is a key factor that can be leveraged in liveness
detection. Therefore, the next step of WSVA is dividing the given CSI word
waveforms into multiple CSI mouth motion waveforms and then calculating the
similarity between the collected CSI mouth motion waveforms and pre-trained CSI
motion data.
Similar to word level segmentation, WSVA processes the voice samples of
the user and infers the start and end points of each mouth motion. In particular,
WSVA first utilizes automatic speech recognition to identify each word of a voice
command. The state-of-the-art system DeepSpeech [18] is utilized to perform such
a task automatically. After identifying existing words, WSVA then utilizes Munich
Automatic Segmentation System (MAUS), a widely adopted phonetic segmentation
system [19]. MAUS is based on the Hidden Markov Model method, and it can label
the phonemes of voice signals by analyzing the sound file and text description of the
voice. Specifically, based on the standard pronunciation model, the identified text
will be transformed into expected pronunciation. Then, a probabilistic graph will
be generated by combining the canonical pronunciation with millions of different
accents, which contains all possible phoneme combinations and the corresponding
probabilities. MAUS finally adopts Hidden Markov Model to perform path search
and find the combination of phonetic units with the highest probability.
After combining phonemes into syllables and inferring the mouth motions
according to International Phonetic Alphabet, we can obtain the segmented and
labeled mouth motions of the input voice command. WSVA matches the
timestamps of each segmented motion to the CSI samples to extract the CSI mouth
motion waveforms using the method defined in Eq. 4.3. One example is illustrated in
Fig. 4.7, which segments a voice command (“Open the door”) into several phonemes
and extracts the CSI mouth motion waveforms. It is worth mentioning that since the
number of voice commands commonly used in VCS is limited, the performance of
speech recognition can be improved according to a pre-defined common command
set. In addition, WSVA also utilizes the inter-word segmentation results to improve the
phoneme segmentation performance. After these steps, we obtain the start and end
points of all Nm mouth motions M = {M1 , M2 , ..., MNm } in a voice signal, and then
extract the CSI data HMi for the i-th mouth motion Mi .
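The final mapping step can be sketched as a simple lookup. The dictionary below is a hypothetical, simplified ASCII phoneme-to-motion table following Table 4.1; the real pipeline first obtains timed phoneme labels from DeepSpeech and MAUS:

```python
# Hypothetical phoneme-to-mouth-motion lookup following Table 4.1
# (simplified ASCII keys, not the real IPA output of MAUS).
MOUTH_MOTION = {
    "a:": "hiant", "ae": "hiant", "ai": "hiant",
    "e": "grin", "ei": "grin",
    "o": "round", "o:": "round",
    "u": "pout", "u:": "pout",
}

def motions_for(phonemes):
    """Map timed phonemes [(label, start, end), ...] to labeled mouth-motion
    segments, dropping phonemes with non-obvious motions (e.g., schwa)."""
    return [(MOUTH_MOTION[p], s, e) for p, s, e in phonemes if p in MOUTH_MOTION]

segs = motions_for([("o", 0.0, 0.2), ("e", 0.3, 0.5), ("@", 0.5, 0.6)])
# segs -> [("round", 0.0, 0.2), ("grin", 0.3, 0.5)]
```

The schwa-like label is dropped, mirroring how WSVA skips syllables with non-significant mouth motions.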

Fig. 4.7 An example of the mouth motion detection (the command "open the door" is segmented
into the phonemes /əʊ/ /p/ /ən/ /ðə/ /dɔr/, which correspond to the mouth motions round, grin,
non-significant, and round)

4.3.3 Feature Extraction

After data cleansing and pre-processing on CSI and voice samples, WSVA selects
the appropriate features to characterize the consistency between these two types
of signals. As mentioned in Sect. 4.2.3, at the macro level, it is observed that CSI
variation occurs along with human pronunciation. Besides, the CSI data of different
mouth motion types show different features, which is another criterion to describe
consistency. Therefore, WSVA extracts features from both the macro level and
mouth motion level to determine whether the voice command and the mouth motion
are consistent.

4.3.3.1 Macro-level Feature Extraction

In particular, after performing wavelet denoising, the CSI data still consist of Ns
subcarriers (e.g., Ns = 52 in this chapter). To remove the DC components in all
subcarriers and extract the strongest correlation component with mouth motions,
WSVA adopts PCA to extract the first principal component HPCA of all CSI
subcarriers H . The details of PCA are already described in Sect. 3.3.3.2, and we skip
them in this chapter. Then, WSVA adopts Short-Time Fourier Transform (STFT) on
both the CSI data HPCA and the voice samples V to obtain their two-dimensional frequency
spectrograms. Figure 4.8a, b show the frequency shifts on the voice samples and the
corresponding CSI data spectrograms in non-attack scenarios. It is observed that in a
non-attack scenario, the contours (marked by black lines) of CSI and voice samples
have similar variation trends. However, as shown in Fig. 4.8a, c, in the spoofing
scenario, since the voice samples are injected by the adversary without any user’s

W1 W2 W3 W4
Frequency (Hz)

2
1000
1

5 Time (s) 10 15
(a)

W1 W2 W3 W4 100
Frequency (Hz)

20 80
60
10 40
20

5 Time (s) 10 15

(b)

W1 W2 W3 W4
Frequency (Hz)

20 3

2
10
1

5 Time (s) 10 15
(c)
Fig. 4.8 Illustration of the macro-level feature extraction. (a) The spectrogram of voice samples.
(b) The spectrogram of CSI data in normal scenario. (c) The spectrogram of CSI data in attack
scenario

mouth motion, the CSI contour is disordered and inconsistent with that of the voice
samples. Therefore, to measure the consistency between voice and CSI samples, an
intuitive solution is to calculate the similarity between the spectrogram contours of
these signals.
However, directly calculating the similarity between the spectrogram contours of
V and HPCA is inappropriate since the frequency shifts of these signals are affected
by different factors (i.e., voice tones for the voice and mouth movements for the CSI), which
are not necessarily related. Instead, for the Nw words W = {W1 , W2 , ..., WNw } in
the command, WSVA calculates the similarity between contours of voice and CSI
signals for each word Wi , and then combines these similarities to obtain the macro-
level similarity SMacro . For the i-th word Wi , to calculate its similarity, we first
extract the CSI and voice samples HWi and VWi , which are represented as:

HWi = HPCA (sc,i − Lc,i : ec,i + Lc,i ), (4.4)

VWi = V (sv,i − Lv,i : ev,i + Lv,i ), (4.5)

where sv,i and ev,i are the beginning and ending indexes of the i-th word Wi 's voice
samples, respectively; sc,i and ec,i are the beginning and ending indexes of the i-th
word Wi 's CSI samples. Lv,i and Lc,i are the spans of the voice and CSI samples of
Wi , in which Lv,i = ev,i − sv,i + 1 and Lc,i = ec,i − sc,i + 1. Note that, instead
of directly using the Eq. 4.3, we extend both sides of HWi and VWi to obtain more
details about the Wi .
Then, we extract the contour CCSI,Wi from the frequency spectrogram of i-th
word’s CSI data. Firstly, we resize the CSI spectrogram with frequency from 0 to
30 Hz into an m-by-n matrix MCSI (j, k) and normalize MCSI (j, k) to the range
between 0 and 1. Note that, in MCSI (j, k), each column represents the normalized
frequency shifts during the j-th time slice. Then, we choose a pre-defined threshold
and obtain the contour CCSI,Wi (j ), where j = 1...n; CCSI,Wi (j ) is the maximum
value k that satisfies MCSI (j, k) ≥ threshold. The process of calculating
contours CV ,Wi for the voice spectrogram is similar to calculating CCSI,Wi . Besides,
as mentioned in Sect. 4.3.2.2, we can set the value CV ,Wi (j ) to 0 to
eliminate the interference of background noise if the j-th time slice is not within
the word segments.
After obtaining CCSI,Wi and CV ,Wi for Wi , we measure the correlation between
these two contours by adopting Pearson correlation coefficient [20], which is
defined as Corr(Wi ). Corr(Wi ) ranges from 0 to +1, where a higher value
represents a higher level of similarity. To calculate Corr(Wi ), we first re-sample
CCSI,Wi and CV ,Wi to the same length, and then Corr(Wi ) can be represented as:

Corr(Wi ) = Σ_{i=1}^{n} (CCSI,Wi (i) − C̄CSI,Wi )(CV ,Wi (i) − C̄V ,Wi ) / ((n − 1) δCSI δV ), (4.6)

where n is the length of the re-sampled sequences CCSI,Wi and CV ,Wi , C̄CSI,Wi and
C̄V ,Wi denote their means, and δCSI and δV are the sample standard deviations of
CCSI,Wi and CV ,Wi , respectively.
After calculating the similarity Corr(Wi ) for i-th word Wi , for total words W =
{W1 , W2 , ..., WNw } we could calculate the macro-level similarity SMacro as follows:

SMacro = Σ_{i=1}^{Nw} Corr(Wi ) / Nw . (4.7)
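The contour extraction and the macro-level similarity of Eqs. 4.6 and 4.7 can be sketched as follows, assuming each word's spectrogram is given as a (frequency x time) magnitude matrix; `np.corrcoef` computes the Pearson coefficient, and the demo matrix is our own toy example:

```python
import numpy as np

def contour(spec, threshold=0.5):
    """Contour of a (freq x time) spectrogram: for each time slice, the largest
    frequency-bin index whose normalized magnitude reaches the threshold."""
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-12)
    c = np.zeros(spec.shape[1])
    for j in range(spec.shape[1]):
        above = np.flatnonzero(spec[:, j] >= threshold)
        c[j] = above.max() if above.size else 0
    return c

def macro_similarity(csi_specs, voice_specs):
    """S_Macro (Eq. 4.7): average Pearson correlation (Eq. 4.6) between the CSI
    and voice contours of each word, after resampling to a common length."""
    corrs = []
    for sc, sv in zip(csi_specs, voice_specs):
        cc, cv = contour(sc), contour(sv)
        n = min(len(cc), len(cv))
        cc = np.interp(np.linspace(0, len(cc) - 1, n), np.arange(len(cc)), cc)
        cv = np.interp(np.linspace(0, len(cv) - 1, n), np.arange(len(cv)), cv)
        corrs.append(np.corrcoef(cc, cv)[0, 1])   # Pearson coefficient
    return float(np.mean(corrs))

# Identical rising contours in both spectrograms -> S_Macro of 1
demo = np.zeros((10, 8))
for j in range(8):
    demo[: j + 2, j] = 1.0
s_macro = macro_similarity([demo], [demo.copy()])
```

In an attack, the disordered CSI contour decorrelates from the voice contour and the averaged score drops.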

4.3.3.2 Mouth Motion Feature Extraction

In the previous subsection, we discussed how to obtain the macro-level feature
SMacro from CSI data and voice samples during the voice command pronunciation.
However, it may not be enough to perform liveness detection only relying on

Fig. 4.9 Time domain of four mouth motion types. (a) Hiant: /a:/ and / la:/. (b) Grin: /e:/ and
/ge:/. (c) Round: /re:/ and /ra:/. (d) Pout: /u:/ and /gu:/

SMacro . For example, the dramatic change of environment may generate the drastic
vibrations of CSI data, which lead to a deviated contour CCSI,Wi and a higher
similarity Corr(Wi ) for detected word Wi . Therefore, to further improve detection
performance, WSVA will extract the mouth motion level features from both time
and frequency domain perspectives in this subsection.
Time Domain Feature Extraction Figure 4.9 shows the amplitudes of CSI
syllable data belonging to four mouth motion categories. It is observed that the
CSI waveforms belonging to the same mouth motion category have similar shapes.
For instance, in Fig. 4.9a, the waveforms of syllables /a:/ and / la:/, which belong
to the motion "hiant", have similar waveform shapes and amplitude vibrations.
And it is also discovered that the ranges of CSI amplitudes from different mouth
motion categories are quite different. For instance, as shown in Fig. 4.9a, d, the CSI
amplitude ranges of syllables /a:/ and / la:/ are much larger than syllables /u:/
and /gu:/. Thus we can extract the ranges from the CSI waveforms as their time
domain features. For a given CSI mouth motion M and its CSI data HM , the CSI
time domain feature Range(M) can be calculated as:


Range(M) = Σ_{i=1}^{Ns} (Max(HM,i ) − Min(HM,i )) / (Ns × Mean(HM,i )), (4.8)

where Ns represents the number of CSI subcarriers and HM,i represents the i-th
subcarrier of HM . Note that the PCA-processed data HPCA is not utilized in this
scenario, since the PCA process would distort the signal's range.
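Eq. 4.8 translates directly into code; a minimal sketch, assuming the per-subcarrier amplitude waveforms of one motion are given as an (Ns x T) array:

```python
import numpy as np

def csi_range_feature(h_m):
    """Time domain feature Range(M) of Eq. 4.8: h_m is an (Ns, T) array holding
    the amplitude waveform of each CSI subcarrier for one mouth motion M."""
    ns = h_m.shape[0]
    per_sub = (h_m.max(axis=1) - h_m.min(axis=1)) / (ns * h_m.mean(axis=1))
    return float(per_sub.sum())

# Two subcarriers: each contributes (max - min) / (Ns * mean) = 1/3
demo = np.array([[1.0, 2.0], [2.0, 4.0]])
r = csi_range_feature(demo)
# r -> 2/3
```

Dividing by the per-subcarrier mean makes the feature insensitive to the absolute CSI amplitude level.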

Frequency Domain Feature Extraction In the time domain feature extraction part,
the CSI waveform shape changes over time. However, the experimental results
show that the frequency shifts of CSI data caused by mouth movement have a
relatively stable pattern. Figure 4.10 shows the STFT spectrograms of syllables
which are displayed in Fig. 4.9, and the contours of frequency spectrograms with
three different thresholds (i.e., Thr1 = 0.2, Thr2 = 0.5, Thr3 = 0.8) are marked
as solid lines. It is observed that the CSI syllable data from different mouth motion
categories have quite different contours. For instance, the contours of syllables /a:/
and / la:/ are wider than those of /e:/ and /ge:/. Therefore, for a given mouth
motion M and its CSI data HM , we can utilize the contours of CSI spectrogram as
the frequency features. In this study, we choose these three thresholds {Thr1, Thr2,
Thr3} and obtain the corresponding contours CM,Thr1 , CM,Thr2 , and CM,Thr3 . After
that, each contour CM,Thri is compressed to Nc points (e.g., Nc is 5 in this study),
and WSVA merges these compressed contours into the feature vector FM with
3 × Nc elements.
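A sketch of building the feature vector FM follows (three thresholded contours, each compressed to Nc points); compression by linear interpolation is our assumption, since the chapter does not specify the compression method:

```python
import numpy as np

def freq_feature_vector(spec, thresholds=(0.2, 0.5, 0.8), nc=5):
    """Frequency domain feature F_M: the spectrogram contour at three
    thresholds, each compressed to nc points, concatenated (3 * nc elements).
    Compression by linear interpolation is an assumption of this sketch."""
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-12)
    feats = []
    for thr in thresholds:
        c = np.zeros(spec.shape[1])
        for j in range(spec.shape[1]):
            above = np.flatnonzero(spec[:, j] >= thr)
            c[j] = above.max() if above.size else 0
        feats.append(np.interp(np.linspace(0, len(c) - 1, nc),
                               np.arange(len(c)), c))
    return np.concatenate(feats)

# A spectrogram whose magnitude rises linearly over bins 0..9 at every time
# slice: each contour sits at the top bin (index 9), so all 15 values are 9
f_m = freq_feature_vector(np.outer(np.arange(10.0), np.ones(6)))
```

Compressing each contour to a fixed Nc points yields feature vectors of equal length regardless of the motion's duration.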
Fig. 4.10 Frequency domain of four mouth motion types. (a) Hiant: /a:/ and /la:/. (b) Grin: /e:/ and /ge:/. (c) Round: /re:/ and /ra:/. (d) Pout: /u:/ and /gu:/

4.3.4 Feature Matching Module

Before performing liveness detection, it is reasonable to assume that the user can provide a total of $J \times N$ pre-collected CSI mouth motion data $H_{Pre}$, which contain $J$ syllable categories (i.e., the four motion categories proposed in Sect. 4.2.3.3), and each category contains $N$ motions' CSI data $H_{Pre}(i,j)$, where $j = 1, 2, \ldots, N$. Then, for a given voice command input containing $N_M$ mouth motions $\mathbf{M} = \{M_1, M_2, \ldots, M_{N_M}\}$, which belong to the four motion categories, WSVA processes the voice samples $V$ and CSI data $H$ using the aforementioned modules. After that, WSVA obtains the macro-level similarity $S_{Macro}$ and the mouth motion features $Range(M_i)$ and $F_{M_i}$ for each motion $M_i$.
Mouth Motion Feature Combination For a given motion $M$, WSVA first calculates the time-domain range difference $SMR_{Time}(i)$ between its feature $Range(H_M)$ and the pre-collected $i$-th mouth motion category's features. $SMR_{Time}(i)$ can be calculated as:

$$SMR_{Time}(i) = \sum_{j=1}^{N} \frac{\left| Range(H_M) - Range(H_{Pre}(i,j)) \right|}{N}. \qquad (4.9)$$

Since the corresponding motion category of $M$ can be obtained from the voice processing module as described in Sect. 4.3.2.2, we can calculate the time-domain similarity score between $M$ and its corresponding motion type as follows:

$$S_{Time} = \mathrm{Min}\left\{1 - \frac{SMR_{Time}(type)}{\mathrm{Max}(SMR_{Time})} + \alpha,\ 1\right\}, \qquad (4.10)$$

where $type$ represents the motion type of $M$, which ranges from 1 to $J$. The generated $S_{Time}$ ranges from 0 to 1, and a value closer to 1 indicates a higher level of similarity. Note that the function of the adjustment factor $\alpha$ is to prevent $S_{Time}$ from being zero, and we empirically set $\alpha$ to 0.1 in this study.
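Eqs. (4.9) and (4.10) together can be sketched as below. The absolute value in the range difference is an assumption based on the garbled source layout, and the category index is 0-based here (the text uses 1 to $J$):

```python
import numpy as np

def time_similarity(range_M, ranges_pre, motion_type, alpha=0.1):
    """Time-domain similarity score S_Time, per Eqs. (4.9)-(4.10).

    range_M: Range(H_M) of the incoming motion (scalar).
    ranges_pre: array of shape (J, N), row i = Range() values of the
                pre-collected profiles for motion category i.
    motion_type: 0-based index of M's category from the voice module.
    """
    ranges_pre = np.asarray(ranges_pre, dtype=float)
    # Eq. 4.9: mean absolute range difference to each category
    smr_time = np.abs(range_M - ranges_pre).mean(axis=1)
    # Eq. 4.10: normalize so the true category yields a score near 1;
    # alpha keeps the score from reaching exactly zero
    s_time = 1.0 - smr_time[motion_type] / smr_time.max() + alpha
    return min(s_time, 1.0)
```

A motion whose range matches its claimed category well scores close to 1; a motion closest to a different category scores close to $\alpha$.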
Then, WSVA compares the similarity between the frequency-domain feature $F_M$ of $M$ and the $J \times N$ features $F_{Pre}(i,j)$ extracted from the pre-collected motion data $H_{Pre}$. Different from the previous work [22], which utilizes Dynamic Time Warping with $O(N^2)$ time complexity, in this chapter, to speed up the computation, WSVA exploits a neural network-based solution to characterize the similarity. WSVA utilizes the pre-collected CSI data $H_{Pre}$ to train a feed-forward neural network $net$ with 20 neurons in the hidden layer. For a given CSI data $H_{Pre}(i,j)$, the input of $net$ is the frequency-domain feature $F_{Pre}(i,j)$ and the training label is the mouth category number. In this study, we set the category numbers for motions "hiant", "grin", "round", and "pout" to 1, 2, 3, and 4, respectively. After training, in the ideal case, $net$ can map a specific motion

feature $F_M$ to its corresponding motion type. The similarity between $M$ and the $i$-th mouth motion category can be calculated as:

$$SMR_{Freq}(i) = \left| net(F_M) - label(i) \right|, \qquad (4.11)$$

where $label(i)$ is the label of the $i$-th motion category, and $net(F_M)$ is the prediction of $net$.
Similar to Eq. 4.10, WSVA calculates the similarity score between $F_M$ and its corresponding motion category $type$ as:

$$S_{Freq} = \mathrm{Min}\left\{1 - \frac{SMR_{Freq}(type)}{\mathrm{Max}(SMR_{Freq})} + \alpha,\ 1\right\}, \qquad (4.12)$$

where the adjustment factor $\alpha$ is set to 0.1. A result $S_{Freq}$ closer to 1 indicates a higher level of similarity.
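The frequency-domain scoring of Eqs. (4.11) and (4.12) can be sketched as below. Here `net` stands in for the trained 20-neuron feed-forward network described in the text; any callable mapping a contour feature vector to a scalar category prediction works, and motion types are numbered 1 to $J$ as in the text:

```python
import numpy as np

def freq_similarity(net, F_M, motion_type, J=4, alpha=0.1):
    """Frequency-domain similarity score S_Freq, per Eqs. (4.11)-(4.12).

    net: callable, net(F_M) -> scalar prediction of the category number
         ('hiant'=1, 'grin'=2, 'round'=3, 'pout'=4).
    motion_type: category (1..J) of M from the voice processing module.
    """
    pred = float(net(F_M))
    labels = np.arange(1, J + 1)
    smr_freq = np.abs(pred - labels)                       # Eq. 4.11
    s_freq = 1.0 - smr_freq[motion_type - 1] / smr_freq.max() + alpha
    return min(s_freq, 1.0)                                # Eq. 4.12
```

A prediction close to the claimed category's label yields a score near 1, while a prediction closest to some other label yields a score near $\alpha$.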
After obtaining the time-domain similarity score $S_{Time}$ and the frequency-domain similarity score $S_{Freq}$ of a given mouth motion $M$, we can calculate the combined mouth-motion-level similarity score $S_{Motion}$ as:

$$S_{Motion}(M) = S_{Time} \times S_{Freq}. \qquad (4.13)$$

Liveness Detection For a voice command containing $N_M$ mouth motions belonging to the four motion categories, after performing mouth motion feature combination, we obtain its macro-level feature $S_{Macro}$ and the mouth motion score $S_{Motion}(M_i)$ of each motion $M_i$, where $i = 1, 2, \ldots, N_M$. Then, the final decision score of the input is calculated as follows:

$$Score = S_{Macro} \times \sum_{i=1}^{N_M} S_{Motion}(M_i). \qquad (4.14)$$

We utilize a threshold-based mechanism to perform human liveness detection in this chapter. For a given voice command input, if its $Score$ is larger than the pre-defined verification threshold, WSVA regards it as an authentic voice command. Otherwise, WSVA judges it as a fake command and refuses to execute it. In the next section, we give a detailed experimental evaluation.
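The final combination and threshold test can be sketched as follows. The per-motion scores are combined with a sum here, one reading of Eq. (4.14) whose combining operator is garbled in the source (the layout would also admit a product); the threshold value itself is a tunable system parameter:

```python
import numpy as np

def liveness_decision(s_macro, s_time_list, s_freq_list, threshold):
    """Combine per-motion scores and test against the verification
    threshold, per Eqs. (4.13)-(4.14).

    s_macro: macro-level similarity of the whole command.
    s_time_list / s_freq_list: per-motion S_Time / S_Freq scores.
    Returns (score, is_authentic).
    """
    # Eq. 4.13: per-motion combined score S_Motion = S_Time * S_Freq
    s_motion = np.multiply(s_time_list, s_freq_list)
    # Eq. 4.14: weight the accumulated motion scores by S_Macro
    score = float(s_macro) * float(np.sum(s_motion))
    return score, score > threshold
```

Commands whose voice and CSI evidence agree at both the macro and per-motion levels exceed the threshold and are accepted; spoofed commands, lacking consistent mouth motion, fall below it.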

4.4 Performance Evaluation of WSVA

In this section, we conduct a series of experiments to evaluate the performance of WSVA in different scenarios and explore the implementation of WSVA in a real-world IoT environment.

4.4.1 System Setup

Hardware WSVA consists of two hardware components: (1) a Universal Software Radio Peripheral (USRP) N210 device, which connects to two commercial Wi-Fi antennas, and (2) a microphone, responsible for collecting voice samples. In the experiment, the distance between the antennas and the human is 20 cm. When a user speaks a voice command, the USRP N210 collects CSI data at a rate of 1000 packets/second on the 2.4 GHz Wi-Fi frequency with the 1/2 BPSK modulation mechanism, and the microphone collects the voice samples simultaneously. We exploit the USRP rather than a COTS device (e.g., Intel 5300 NIC) to collect more stable CSI data, since some commercial devices change their power adaptively, resulting in unstable CSI measurements. Essentially, however, USRP and COTS devices have the same wireless function.
Data Collection Our experiment recruits 6 volunteers in total. Before performing voice commands, each volunteer was required to perform the four categories of mouth motions (i.e., the corresponding syllables) 10 times as WSVA's pre-collected CSI profiles. Then, each volunteer performs voice commands, and the adversary performs spoofing attacks using this volunteer's voice profiles. WSVA finally performs liveness detection by analyzing the collected CSI data and voice samples against the volunteer's mouth motion profiles.
Evaluation Metrics To assess the performance of WSVA, we choose the False Accept Rate (FAR) and the True Accept Rate (TAR) as evaluation metrics. TAR is the rate at which WSVA detects the authentic user correctly, while FAR characterizes the rate at which an attacker is wrongly accepted by the system and considered an authentic user. Both FAR and TAR are influenced by the pre-defined verification threshold, and we show their relationship using the Receiver Operating Characteristic (ROC) curve. In our experiments, we adjust the verification threshold of WSVA to obtain more comprehensive results.
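The TAR/FAR sweep behind the ROC curves can be sketched as follows; a minimal illustration assuming each command has been reduced to a scalar decision $Score$ as in Eq. (4.14):

```python
import numpy as np

def roc_points(genuine_scores, attack_scores, thresholds):
    """(FAR, TAR) pairs over candidate verification thresholds.

    TAR: fraction of genuine commands whose Score exceeds the threshold.
    FAR: fraction of spoofed commands whose Score exceeds the threshold.
    """
    g = np.asarray(genuine_scores, dtype=float)
    a = np.asarray(attack_scores, dtype=float)
    pts = []
    for thr in thresholds:
        tar = float(np.mean(g > thr))
        far = float(np.mean(a > thr))
        pts.append((far, tar))
    return pts
```

Sweeping the threshold from low to high traces the curve from the top-right (accept everything) toward the bottom-left (reject everything); the operating point is then chosen at an acceptable FAR, e.g., 1%.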

4.4.2 Overall Performance

In this subsection, we evaluate the effectiveness of WSVA in defending against spoofing attacks. First, each volunteer is required to provide his/her pre-collected CSI profiles and speak 150 legitimate voice commands. After that, we perform spoofing attacks as described in Fig. 4.3 by using each volunteer's voice profiles 750 times. In total, we collect 5400 voice commands, and in a given command, the number of mouth motions belonging to the four types ranges from 4 to 8. The ROC curve of WSVA in detecting live users in non-attack and spoofing attack scenarios is depicted in Fig. 4.11. We can observe that with 1% FAR, the TAR is as high as 99.2% when WSVA exploits the combined feature $Score$. Moreover, we find that
Fig. 4.11 Performance on thwarting spoofing attacks when using combination feature, macro-
level feature, and mouth motion feature

the TAR relying on the mouth motion feature is better than that relying on the macro-level feature. More concretely, with 1% FAR, the TAR relying on the mouth motion feature still stays above 99%, while the TAR relying on the macro-level feature drops to 90.2%. The reason is that the macro-level features are more susceptible to environmental noise. After collecting the voice and CSI data, the average time delay of performing each liveness detection is 0.26 s, which is acceptable in practice and is smaller than that of previous work (i.e., 0.32 s in [22] under the same hardware condition). In summary, our experimental results validate that WSVA is highly effective in defending against spoofing attacks, and that the macro-level feature and the mouth motion feature complement each other to improve the detection performance.
In the evaluations described in Fig. 4.11, for each user, WSVA performs liveness detection based on his/her pre-collected CSI profiles. However, in some smart home environments with multiple users, it is often impractical to collect each user's mouth motion profiles. A more desirable design is to collect only one user's profiles but work for multiple users. Here, we perform experiments to evaluate the scalability of WSVA. We first recruit a volunteer to provide WSVA with his/her mouth motion profiles, which record his/her articulatory gestures. Then we recruit another two volunteers to perform voice commands 300 times. After that, we also implement spoofing attacks 600 times. Figure 4.12 shows the evaluation results of WSVA, where WSVA achieves 97.6% TAR with 1% FAR, and 97.9% TAR with 2% FAR. Note that the mouth motion feature-based detection rate (i.e., 89.6% TAR with 2% FAR in Fig. 4.12) is lower than that in Sect. 4.4.2. The reason is that the
Fig. 4.12 Scaling up to multiple users

articulatory gestures of the other volunteers are not the same as those of the user who provided the pre-collected CSI profiles. However, compared with spoofing attacks, WSVA can still achieve a high detection accuracy, which demonstrates that it is also highly effective in multiple-user scenarios.

4.4.3 Impact of Various Factors on WSVA's Performance

4.4.3.1 Impact of Mouth Motion Number

In this subsection, we investigate how different mouth motion numbers contained
in the voice command affect the performance of WSVA. In our experiments, the
motion numbers range from 4 to 8, and their corresponding accuracy is depicted in
Fig. 4.13. The accuracy is the rate of successfully detecting authentic and spoofing
commands among all commands under the 2% FAR. It is observed that with the
increase of mouth motion number, the accuracy slightly rises from 97.5% to 99.8%.
This result indicates that a higher number of mouth motions can reduce the impact
of a single mouth motion misjudgment. In the experiment, the accuracy decreases
slightly when the mouth motion number is greater than 6. This phenomenon
is caused by unsatisfactory data quality during the process of data collection.
Moreover, the accuracy exceeds 99% when the mouth motion number is greater than
4, which means the features extracted from mouth motion by WSVA are accurate
enough for liveness detection.
Fig. 4.13 The impact of syllable length

4.4.3.2 Impact of Distance and LOS/NLOS

In the above evaluations, the volunteer is located at line-of-sight (LOS) positions relative to the antennas, and the distance between the user's mouth and the receiver antenna is set to 20 cm. To evaluate the impact of distance on detecting voice spoofing, a volunteer is recruited to conduct experiments at distances of 50 cm, 100 cm, and 150 cm, respectively. For each distance, the volunteer is required to provide CSI profiles and generate 150 voice commands, and then a loudspeaker is deployed at the volunteer's location to perform spoofing attacks 300 times. As shown in Fig. 4.14, the detection accuracy decreases as the distance grows. By using the combined feature, WSVA achieves over 99% TAR with 2% FAR when the distance is 50 cm. However, the TAR decreases to 98% and 96% when the distance is 100 cm and 150 cm, respectively. Besides, the TAR under 2% FAR decreases dramatically when only utilizing the macro-level feature (from 94% at 50 cm to 80% at 150 cm) or the mouth motion feature (from 97% at 50 cm to 88% at 150 cm) individually. This means that as the distance increases, the impact of mouth motion on the multipath propagation of CSI becomes weaker, causing the degradation of WSVA's performance. However, when the distance is set to 1.5 m, WSVA can still achieve a satisfactory accuracy (96%) using the combined feature, which is acceptable in most cases.
To evaluate the performance of WSVA in non-LOS scenarios, two additional experiments are conducted. As shown in Fig. 4.15a, the volunteer is required to stay out of the line-of-sight area. To further stress this scenario, we consider a more extreme case in which we insert an obstruction board to separate the transmitter antennas from the receiver, while the volunteer is on the same side as the transmitter. The dataset obtained at a distance of 50 cm, as shown in Fig. 4.14, is chosen as the control group. The experimental results are shown in Fig. 4.15b. When WSVA utilizes the combined feature, with 2% FAR, the TARs of WSVA under wood obstruction and in the control group are still over 99%. However, the TAR under iron obstruction decreases to 92.7%. More specifically, when only exploiting the
Fig. 4.14 The impact of distance on WSVA

Fig. 4.15 Evaluations of obstructions. (a) The non-LOS scenarios. (b) Detection results under
different obstructions

mouth motion feature, the TARs under wood and iron obstructions decrease to 56% and 36%, respectively. The results demonstrate that an obstruction in the LOS can degrade the wireless sensing capability, especially when only the mouth motion feature extraction of WSVA is used. It is notable that WSVA's performance under iron obstruction is much weaker than that under wood, since iron material in the LOS causes greater multipath distortions. However, WSVA is still effective under wood obstruction, and it is feasible for users to stay at LOS positions in their own smart homes.
Fig. 4.16 The impact of time

4.4.3.3 Timeliness of Pre-collected CSI Profiles

In ideal cases, the collected CSI patterns should be related only to mouth motion and should not change with time. However, as reported by previous research [2, 21], CSI patterns change over time in real-world scenarios. To evaluate the timeliness of CSI profiles, we first recruit a volunteer to provide mouth motion profiles. Then, the volunteer and the adversary are required to perform 150 voice commands and 150 spoofing attacks with a time step of 12 h. Figure 4.16 shows the performance of WSVA in real time, after 12 h, and after 24 h. It is observed that after 12 h, WSVA achieves above 99% TAR with 1% FAR, which is similar to the real-time performance. After 24 h, WSVA can still achieve 90.6% TAR with 1% FAR by utilizing the combined feature. Note that after 24 h, the performance of the mouth motion-based feature decreases to 75.8% TAR with 2% FAR. The performance degradation may be caused by emotional changes of the user or changes in the background environment. This is an inherent drawback of CSI-based sensing, but it does not essentially hinder the deployment of WSVA. Adaptively updating the user's profile can effectively mitigate the effects of environmental changes [21]. The update can be processed during the user's daily usage of voice commands, and the cost is affordable for the user since we only utilize 40 mouth motion samples for training.

4.5 Limitations and Discussions

The performance evaluation demonstrates the effectiveness of WSVA in thwarting spoofing attacks. However, there are some limitations that may degrade the detection accuracy of WSVA and leave possibilities for the adversary to successfully attack the voice interface.

Antenna and User Positions In this study, the distance between the user and the antennas of WSVA affects the performance of WSVA. When the distance is too large (depending on the hardware condition), the collected CSI cannot reflect the mouth motion components, resulting in inaccurate judgments by WSVA. However, in the smart home environment, many applications of the voice control system leverage voice commands to control home appliances (e.g., light bulbs and temperature control), which also have specific requirements on environmental factors (including the distance). For example, according to CNET's report about Amazon Echo, more than one Echo device is needed for full coverage of a large home [9]. In practice, we can deploy multiple antennas in a smart home to make WSVA applicable over a larger area or distance. When the user interacts with the VCS, WSVA can dynamically choose the antennas closest to the user to collect CSI data.

Pronunciation Behaviors Currently, WSVA only works in situations where all users speak the voice commands in English strictly according to the International Phonetic Alphabet. However, in reality, for the same phoneme, different users may use different articulatory gestures [4, 23]. In the experiment, two volunteers are required to pronounce the phoneme /a:/ with the standard articulatory gesture (gesture 1 of hiant) and strange gestures (gestures 2 and 3). As shown in Fig. 4.17, although different articulatory gestures result in quite different CSI waveforms, it is also observed that when users utilize the same articulatory gesture (e.g., gesture 1 used by user 1 and user 2), the collected CSI still has similar patterns. In a family scenario with a limited number of users (generally 2-4 users), it is feasible for these users to agree on a common articulatory gesture. The detection accuracy is also improved by the utilization of the macro-level feature. Therefore, WSVA still has high practicality in multiple-user scenarios.
Defending Against the Insider Attack As described in Sect. 4.2.1, the adversary
can launch a more serious attack (i.e., insider attack), which is not considered
in this study. In an insider attack scenario, the adversary can approach the VCS

Fig. 4.17 CSI waveforms under different articulatory gestures. (a) Phoneme /a:/ generated by
user 1. (b) Phoneme /a:/ generated by user 2
Fig. 4.18 An example of the strategy to defend insider attack

physically and mimic the mouth movements of a benign user. This preserves the consistency between the vibrations in the CSI data and the voice samples and thus decreases the performance of WSVA. To reduce this risk, a potential method is for the user to make special adjustments to WSVA. For example, the user can be required to perform some pre-defined, secret additional mouth motions after each voice command. As shown in Fig. 4.18, WSVA can then distinguish between benign users and insider attackers by detecting the existence of these additional motions.

Comparison Between WSVA and Lip Reading Previous research has proposed CSI-based lip-reading methods such as WiHear [33] and WiTalk [13]. These methods attempt to infer the contents of voice samples only through the CSI information. In this study, however, we do not propose using wireless signals for lip reading. Instead, WSVA aims to utilize the consistency between voice and CSI signals to authenticate voice commands and defend against voice spoofing attacks targeted at the voice control system.

In addition, due to the limitations of Wi-Fi signals (e.g., a 12.5 cm wavelength at 2.4 GHz), achieving high-accuracy lip reading is inherently difficult. For instance, WiHear and WiTalk can only recognize 14 and 12 different syllables, respectively, which means that many voice syllables cannot be identified via CSI. Furthermore, [24] shows that not all voice syllables can be recognized by lip-reading techniques in theory. For instance, SilentTalk [32] shows that ultrasonic-based lip reading can only identify 12 basic mouth motions, even though ultrasound (e.g., a 17 mm wavelength at 20 kHz) has a stronger sensing capability than CSI. However, in the application scenarios of WSVA, the contents of the voice samples are easy to obtain. So instead of recognizing syllables, the technical contribution of WSVA is modeling the consistency between the voice samples and the CSI information to determine whether a voice command is issued by an authentic user.

4.6 Summary

In this chapter, we propose WSVA, a device-free liveness detection system to thwart spoofing attacks aimed at the VCS. WSVA utilizes the prevalent wireless signals in the IoT environment to sense human mouth motion and then verifies the liveness of a voice command according to the consistency between voice samples and CSI data. WSVA does not require the user to carry any device or demand a large amount of training data. We implement WSVA to demonstrate its feasibility, and the results show that WSVA can achieve 99% detection accuracy with a 1% false accept rate.

References

1. Aley-Raz, A., Krause, N.M., Salmon, M.I., Gazit, R.Y.: Device, system, and method of liveness
detection utilizing voice biometrics. Google Patents, US Patent 8,442,824, 14 May 2013
2. Ali, K., Liu, A.X., Wang, W., Shahzad, M.: Keystroke recognition using WiFi signals.
In: Proceedings of the 21st Annual International Conference on Mobile Computing and
Networking, MobiCom ’15, pp. 90–102. ACM, New York (2015). https://doi.org/10.1145/
2789168.2790109, http://doi.acm.org/10.1145/2789168.2790109
3. Amazon: Amazon alexa developer (2019). https://developer.amazon.com/alexa
4. Browman, C.P., Goldstein, L.: Articulatory phonology: an overview. Phonetica 49(3–4), 155–
180 (1992)
5. Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., Zhou,
W.: Hidden voice commands. In: 25th USENIX Security Symposium (USENIX Security
16), pp. 513–530. USENIX Association, Austin (2016). https://www.usenix.org/conference/
usenixsecurity16/technical-sessions/presentation/carlini
6. Chen, S., Ren, K., Piao, S., Wang, C., Wang, Q., Weng, J., Su, L., Mohaisen, A.: You can hear
but you cannot steal: defending against voice impersonation attacks on smartphones. In: 2017
IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 183–195
(2017). https://doi.org/10.1109/ICDCS.2017.133
7. Chen, Y., Sun, J., Jin, X., Li, T., Zhang, R., Zhang, Y.: Your face your heart: secure mobile face
authentication with photoplethysmograms. In: Proceedings of IEEE Conference on Computer
Communications (INFOCOM), pp. 1–9 (2017)
8. Chen, G., Chen, S., Fan, L., Du, X., Zhao, Z., Song, F., Liu, Y.: Who is real bob?
Adversarial attacks on speaker recognition systems. In: 2021 IEEE Symposium on Security and
Privacy (SP), pp. 55–72. IEEE Computer Society, Washington (2021). https://doi.org/10.1109/
SP40001.2021.00004, https://doi.ieeecomputersociety.org/10.1109/SP40001.2021.00004
9. CNET: How to bring alexa into every room of your home (2017). https://www.cnet.com/how-
to/how-to-install-alexa-in-every-room-of-your-home/
10. Diao, W., Liu, X., Zhou, Z., Zhang, K.: Your voice assistant is mine: how to abuse speakers
to steal information and control your phone. In: Proceedings of the 4th ACM Workshop on
Security and Privacy in Smartphones & Mobile Devices (SPSM), pp. 63–74 (2014). https://
doi.org/10.1145/2666620.2666623
11. Ding, W., Hu, H.: On the safety of iot device physical interaction control. In: Proceedings of
the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 832–846
(2018). https://doi.org/10.1145/3243734.3243865
12. Dodd, B., Campbell, R.: Hearing by eye: the psychology of lip-reading. Am. J. Psychol. 72(6)
(1987)

13. Du, C., Yuan, X., Lou, W., Hou, Y.T.: Context-free fine-grained motion sensing using WiFi.
In: 2018 15th Annual IEEE International Conference on Sensing, Communication, and
Networking (SECON), pp. 1–9 (2018)
14. Ettus research (2017). https://www.ettus.com/
15. Feng, H., Fawaz, K., Shin, K.G.: Continuous authentication for voice assistants. In: Proceed-
ings of the 23rd Annual International Conference on Mobile Computing and Networking
(MobiCom), pp. 343–355 (2017). https://doi.org/10.1145/3117811.3117823
16. Market Research Future: Voice Assistant Market - Information by Technology, Hardware and
Application - Forecast till 2025 (2020). https://www.marketresearchfuture.com/reports/voice-
assistant-market-4003
17. Google: Google home. https://store.google.com/product/google_home (2019)
18. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh,
S., Sengupta, S., Coates, A., Ng, A.Y.: Deep Speech: Scaling up end-to-end speech recognition
(2014). https://doi.org/10.48550/ARXIV.1412.5567
19. Kisler, T., Schiel, F., Sloetjes, H.: Signal processing via web services: the use case webmaus.
In: Proceedings of Digital Humanities, pp. 30–34 (2012)
20. Lin, L.I.K.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1),
255–268 (1989)
21. Liu, J., Liu, H., Chen, Y., Wang, Y., Wang, C.: Wireless sensing for human activity: a survey.
IEEE Commun. Surv. Tutorials 1–1 (2019). https://doi.org/10.1109/COMST.2019.2934489
22. Meng, Y., Wang, Z., Zhang, W., Wu, P., Zhu, H., Liang, X., Liu, Y.: WiVo: enhancing the
security of voice control system via wireless signal in IoT environment. In: Proceedings of the
Eighteenth ACM International Symposium on Mobile Ad Hoc Networking and Computing,
Mobihoc ’18, pp. 81–90. ACM, New York (2018). https://doi.org/10.1145/3209582.3209591,
http://doi.acm.org/10.1145/3209582.3209591
23. Mugler, E.M., Tate, M.C., Livescu, K., Templer, J.W., Goldrick, M.A., Slutzky, M.W.:
Differential representation of articulatory gestures and phonemes in precentral and inferior
frontal gyri. J. Neurosci. 38(46), 9803–9813 (2018)
24. Ostry, D., Flanagan, J.: Human jaw movement in mastication and speech. Arch. Oral Biol.
34(9), 685–693 (1989)
25. Places of articulation (2017). https://en.wikipedia.org/wiki/File:Places_of_articulation.svg
26. Qian, K., Wu, C., Yang, Z., Liu, Y., Jamieson, K.: Widar: decimeter-level passive tracking via
velocity monitoring with commodity Wi-Fi. In: Proceedings of the 18th ACM International
Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 6:1–6:10 (2017).
https://doi.org/10.1145/3084041.3084067
27. Roy, N., Hassanieh, H., Roy Choudhury, R.: Backdoor: making microphones hear inaudible
sounds. In: Proceedings of the 15th ACM Annual International Conference on Mobile Systems,
Applications, and Services (MobiSys), pp. 2–14 (2017). https://doi.org/10.1145/3081333.
3081366
28. Samsung: Smartthings (2021). https://www.smartthings.com
29. Shi, C., Liu, J., Liu, H., Chen, Y.: Smart user authentication through actuation of daily activities
leveraging WiFi-enabled IoT. In: Proceedings of the 18th ACM International Symposium on
Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 5:1–5:10 (2017). https://doi.org/
10.1145/3084041.3084061
30. SmartThings: Smartthings public GitHub Repo (2018). https://github.com/
SmartThingsCommunity/SmartThingsPublic
31. Tan, S., Yang, J.: Wifinger: leveraging commodity WiFi for fine-grained finger gesture
recognition. In: Proceedings of the 17th ACM International Symposium on Mobile Ad
Hoc Networking and Computing (MobiHoc), pp. 201–210 (2016). https://doi.org/10.1145/
2942358.2942393
32. Tan, J., Nguyen, C., Wang, X.: Silenttalk: Lip reading through ultrasonic sensing on mobile
phones. In: IEEE INFOCOM 2017 – IEEE Conference on Computer Communications, pp. 1–
9 (2017). https://doi.org/10.1109/INFOCOM.2017.8057099

33. Wang, G., Zou, Y., Zhou, Z., Wu, K., Ni, L.M.: We can hear you with Wi-Fi! In: Proceedings of
the 20th Annual International Conference on Mobile Computing and Networking (MobiCom),
pp. 593–604 (2014). https://doi.org/10.1145/2639108.2639112
34. Wang, J., Jiang, H., Xiong, J., Jamieson, K., Chen, X., Fang, D., Xie, B.: LIFS: low human-
effort, device-free localization with fine-grained subcarrier information. In: Proceedings of the
22Nd Annual International Conference on Mobile Computing and Networking (MobiCom),
pp. 243–256 (2016). https://doi.org/10.1145/2973750.2973776
35. Wang, S., Cao, J., He, X., Sun, K., Li, Q.: When the differences in frequency domain are
compensated: understanding and defeating modulated replay attacks on automatic speech
recognition. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and
Communications Security, CCS ’20, pp. 1103–1119. Association for Computing Machinery,
New York (2020). https://doi.org/10.1145/3372297.3417254
36. Wei, T., Wang, S., Zhou, A., Zhang, X.: Acoustic eavesdropping through wireless vibrometry.
In: Proceedings of the 21st Annual International Conference on Mobile Computing and
Networking (MobiCom), pp. 130–141 (2015). https://doi.org/10.1145/2789168.2790119
37. Wikipedia: International phonetic alphabet (2018). https://en.wikipedia.org/wiki/
International_Phonetic_Alphabet
38. Yuan, X., Chen, Y., Zhao, Y., Long, Y., Liu, X., Chen, K., Zhang, S., Huang, H., Wang,
X., Gunter, C.A.: Commandersong: a systematic approach for practical adversarial voice
recognition. In: 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64.
USENIX Association, Baltimore (2018). https://www.usenix.org/conference/usenixsecurity18/
presentation/yuan-xuejing
39. Zhang, L., Tan, S., Yang, J., Chen, Y.: Voicelive: a phoneme localization based liveness
detection for voice authentication on smartphones. In: Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security (CCS), pp. 1080–1091 (2016). https://
doi.org/10.1145/2976749.2978296
40. Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., Xu, W.: Dolphinattack: inaudible voice
commands. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Com-
munications Security (CCS), pp. 103–117 (2017). https://doi.org/10.1145/3133956.3134052
41. Zhang, L., Tan, S., Yang, J.: Hearing your voice is not enough: an articulatory gesture
based liveness detection for voice authentication. In: Proceedings of the 2017 ACM SIGSAC
Conference on Computer and Communications Security (CCS), pp. 57–71 (2017). https://doi.
org/10.1145/3133956.3133962
42. Zhang, L., Meng, Y., Yu, J., Xiang, C., Falk, B., Zhu, H.: Voiceprint mimicry attack towards
speaker verification system in smart home. In: IEEE INFOCOM 2020 – IEEE Conference on
Computer Communications, pp. 377–386 (2020). https://doi.org/10.1109/INFOCOM41043.
2020.9155483
Chapter 5
Microphone Array Based Passive
Liveness Detection at Voice Interface
Layer

5.1 Introduction

Nowadays, voice assistance-enabled smart speakers serve as the hub of popular


smart home platforms (e.g., Amazon Alexa, Google Home) and allow the user
to remotely control home appliances (e.g., smart lighter, locker, thermostat) or
query information (e.g., weather, news) as long as it can hear the user. However,
as analyzed in Chap. 4, the inherent broadcast nature of voice unlocks a door for
adversaries to inject malicious commands (i.e., spoofing attack) including replay
attacks [7, 18], adversarial example attacks [5, 20, 22], inaudible attacks [14, 21],
and so on.
Voices in the spoofing attack are played by electrical devices (e.g., high-
quality loudspeaker [18], ultrasonic dynamic speaker [21]). Thus, the physical
characteristics, which are different between humans and machines, could be used as
the “liveness” factors. The existing liveness detection schemes can be divided into
multi-factor authentication and passive schemes. In Chap. 4, a Wi-Fi signal based
liveness detection system, WSVA, is proposed. However, to capture the liveness
factors of a real human, multi-factor authentication requires the user to carry
specialized sensors (e.g., an accelerometer [8] or magnetometer [6]), which adds an
additional burden for users. Although WSVA does not require users to carry additional
devices, it needs wireless IoT devices deployed in the smart home environment to
collect CSI, which also imposes a deployment burden.
By contrast, the passive scheme only considers the audio data collected by
the smart speaker. Its key insight is that the difference in articulatory manners
between real humans (i.e., vocal vibration and mouth movement) and electrical
machines (i.e., diaphragm vibration) will result in subtle but significant differences
in the spectrograms of the collected audio. Passive schemes based on mono audio [1, 4]
and two-channel audio [3, 19] have already been proposed and could be directly
incorporated into the smart speaker’s software level.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 107
Y. Meng et al., Security in Smart Home Networks, Wireless Networks,
https://doi.org/10.1007/978-3-031-24185-7_5

However, the existing passive liveness detection schemes face a series of
challenges in usability and efficiency, which seriously hinder their real-world
deployment. Some passive schemes leverage the sub-bass low-frequency area
(20–300 Hz in [4]) or the voice area (below 10 kHz in [1]) of the mono audio’s
spectrum as the liveness factor and are therefore vulnerable to changes in the
sound propagation channel. Another scheme [19], which extracts the audio’s
fieldprint from two-channel audio, requires the user to maintain a fixed posture
to ensure the robustness of such fingerprints. As a result, it is difficult to
deploy in many practical scenarios (e.g., users walking or changing gestures).
Therefore, it is desirable to propose a novel passive liveness detection scheme
with the following merits: (1) Device-free: performing passive detection relying
only on the collected audio; (2) Resilient to environment change: being robust to
dynamic sound propagation channels and user movement; (3) High accuracy: achieving
high accuracy compared to existing works.
To achieve a device-free, robust passive liveness detection, in this chapter, we
propose ArrayID, a microphone array-based liveness detection system to effectively
defend against spoofing attacks. ArrayID is motivated by the basic observation
that the microphone array has been widely adopted by the mainstream smart
speakers (e.g., the Amazon Echo 3rd Gen [12] and Google Home Max [17]
both have 6 microphones), which is expected to significantly enhance the diversity of
the collected audios thanks to the different locations and mutual distances of the
microphones in this array. By exploiting the audio diversity, ArrayID can extract
more useful information related to the target user, which is expected to significantly
improve the robustness and accuracy of the liveness detection.
For the multi-channel audio collected by the smart speaker, we give a formal
definition of array fingerprint, the main feature used by ArrayID, and discuss the
theoretic performance gain of adopting microphone array, which can leverage the
relationship among different channels’ data to eliminate the distortions caused by
factors including air channel and user’s position changes. We collect and build
the first array fingerprint-based open dataset containing multi-channel voices from
38,720 voice commands. To evaluate the effectiveness of ArrayID, we compare
ArrayID with previous passive schemes (i.e., CaField [19], and Void [1]) on
both our dataset and a third-party dataset called ReMasc Core dataset [10].
ArrayID achieves the authentication accuracy of 99.84% and 97.78% on our dataset
and ReMasc Core dataset, respectively, while the best performance of existing
schemes [1, 19] on these two datasets are 98.81% and 84.37% respectively. The
experimental results well demonstrate the effectiveness and robustness of ArrayID.
ArrayID is the first to exploit the circular microphone array of the smart
speaker to perform passive liveness detection in a smart home environment. The
contributions of this chapter are summarized as follows:
• Novel system. We design, implement, and evaluate ArrayID for thwarting voice
spoofing attacks. By only using audio collected from a smart speaker, ArrayID
does not require the user to carry any device or conduct additional action.

• Effective feature. We give an analysis of the principles behind passive detection


and propose a robust liveness feature: the array fingerprint. This novel feature
both enhances effectiveness and broadens the application scenarios of passive
liveness detection.
• Robust performance. Experimental results on both our dataset and a third-
party dataset show that ArrayID outperforms existing schemes. Besides, we
evaluate multiple factors (e.g., distance, direction, spoofing devices, noise) and
demonstrate the robustness of ArrayID.
The remainder of this chapter is organized as follows. Sect. 5.2 introduces the
preliminaries and motivations. Sect. 5.3 presents the system design of ArrayID,
followed by the evaluation and discussions in Sect. 5.4 and Sect. 5.5, respectively.
Finally, Sect. 5.6 summarizes this chapter.

5.2 Preliminaries and Motivations

In this chapter, we consider the same attack model as described in Sect. 4.2.1. In this
section, we introduce the voice commands generation and propagation processes in
the smart speaker scenario, review existing passive liveness detection schemes and
introduce our proposed array fingerprint.

5.2.1 Voice Command Generation and Propagation in Smart


Speaker Scenarios

Before reviewing existing passive liveness detection schemes, it is important to


describe the sound generation and propagation process.
Sound Generation Process As shown in Fig. 5.1a, voice commands are generated
by a human or electrical device (i.e., loudspeaker). For the electrical loudspeaker,
given an original voice command signal x(f, t), where f represents the frequency
and t is time, the loudspeaker utilizes electromagnetic field changes to vibrate the
diaphragm. The movement of the diaphragm pushes the surrounding air to generate the
sound wave s(f, t) = hdev (f, t) · x(f, t), where hdev (f, t) represents the channel

authenc smart speaker: ( , ) cloud


user: ℎ ( , )
or Mic
sound: ( , )
audio: ( , ) array
device: ℎ ( , ) air channel: ℎ ( , , )
spoofing smart d
devices
i

(a) (b)

Fig. 5.1 Sound generation and propagation in smart home. (a) Sound generation. (b) Sound
propagation process

gain in the sound signal modulation by the device as shown in Fig. 5.1b. Similarly,
when a user speaks voice commands, their mouth and lips also modulate the air, and
we can use huser (f, t) to represent the modulation gain, where the generated sound
is s(f, t) = huser (f, t) · x(f, t). Note that, in the real-world scenario, there is no
such x(f, t) during the human voice generation process. However, the concepts of
x(f, t) and huser (f, t) are widely used [4] and will help us understand the features
in Sect. 5.3.3. Finally, the generated sound s(f, t) is spread through the air and
captured by the smart speaker.
Sound Transmission Process Currently, smart speakers usually have a micro-
phone array (e.g., Amazon Echo 3rd Gen [12] and Google Home Max [17]
both have 6 microphones). For a given microphone, when sound is transmitted to
it, the air pressure at the microphone’s location can be represented as y(f, t) =
hair (d, f, t) · s(f, t), where d is the distance of the transmission path between the
audio source and the microphone and hair (d, f, t) is the channel gain in the air
propagation of the sound signal.
Sound Processing Within the Smart Speaker Finally, y(f, t) is converted to
an electrical signal by the microphone. Since the microphones employed by
mainstream smart speakers usually have a flat frequency response curve in the
frequency area of the human voice, we assume smart speakers save the original sensed
data y(f, t), an assumption also adopted by existing studies [19]. Finally, the collected
audio signal is uploaded to the smart home cloud to further influence the actions of
smart devices.

5.2.2 Passive Liveness Detection

The recently proposed liveness detection schemes could be divided into two
categories: mono channel based detection (e.g., Sub-bass [4] and VOID [1]) and
fieldprint based detection (i.e., CaField [19]).

5.2.2.1 Mono Channel Based Detection

Principles As shown in Fig. 5.1a, the different sound generation principles between
real human and electrical spoofing devices could be characterized as two different
filters: huser (f, t) and hdev (f, t). If ignoring the distortion in the sound signal trans-
mission, hair (d, f, t) could be considered as a constant value A. Thus, the received
audio samples in authentic and spoofing attack scenarios are yauth (d, f, t) =
A·huser (f, t)·x(f, t) and yspoof (d, f, t) = A·hdev (f, t)·x(f, t), respectively. Since
A and x(f, t) are the same, it means that the spectrograms of the received audio
samples already contain the identity of the audio source (the real user huser (f, t) or
the spoofing one hdev (f, t)). Figure 5.2a shows the spectra of the voice command
“OK Google” and its spoofing counterpart. It is observed that the sub-bass spectra
(20–300 Hz) of the two audio samples are quite different even though the samples
sound similar, and this phenomenon is utilized by mono-channel-based schemes
such as Sub-bass [4].

Fig. 5.2 Spectra of authentic and spoofing voices when putting the smart speaker in different
rooms. (a) Spectra in room A. (b) Spectra in room B
Limitations However, in a real-world environment, hair (d, f, t) cannot always be
regarded as a constant. The surrounding object’s shape and materials, the sound
transmission path, and the absorption coefficient of air all affect the value of
hair (d, f, t). As shown in Fig. 5.2a, b, the spectrograms of authentic and spoof
audio samples change drastically when putting the smart speaker in different rooms.
The experimental results from Sect. 5.4.2 and [1] demonstrate that the performance
of liveness detection degrades when handling datasets in which audios are collected
from complicated environments (e.g., the ASVspoof 2017 Challenge [11] and ReMasc
Core [10]).

5.2.2.2 Fieldprint Based Detection

Principles The concept of fieldprint [19] is based on the assumption that audio
sources with different articulatory behaviors will cause a unique “sound field”
around them. By measuring the field characteristics around the audio source, it is
feasible to infer the audio’s identity. CaField is a typical scheme that deploys two
microphones to receive two audios y1 (f, t) and y2 (f, t), and defines the fieldprint
as:

Field = log( y1 (f, t) / y2 (f, t) ).    (5.1)
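As a toy illustration of Eq. 5.1, the fieldprint can be sketched in Python/NumPy by comparing the magnitude spectra of the two channels. The function name and FFT length below are illustrative and not part of CaField's published implementation:

```python
import numpy as np

def fieldprint(y1, y2, nfft=1024, eps=1e-12):
    """Per-frequency fieldprint log(|Y1|/|Y2|) between two microphone channels.

    y1, y2: time-domain samples from the two microphones (hypothetical inputs;
    the full CaField pipeline involves additional processing steps).
    """
    Y1 = np.abs(np.fft.rfft(y1, nfft)) + eps
    Y2 = np.abs(np.fft.rfft(y2, nfft)) + eps
    return np.log(Y1 / Y2)
```

If the second channel is simply an attenuated copy of the first (a crude stand-in for a longer path to the second microphone), the fieldprint is a constant offset across all frequency bins, matching the intuition behind Eq. 5.1.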

Limitations Measuring a stable and accurate fieldprint requires the relative
position between the audio source and the measuring microphones to stay fixed. For
instance, CaField only performs well when the user holds a smartphone equipped
with two microphones close to the face in a fixed manner. The fieldprint also
degrades at longer distances (e.g., greater than 40 cm in [19]), making it unsuitable
for a home environment in which users want to communicate with a speaker across
the room.
The goal of this study is to propose a novel and robust feature for passive liveness
detection.

5.2.3 Motivation: Array Fingerprint

In this subsection, we propose a novel and robust liveness feature array fingerprint
and design ArrayID based on it.

5.2.3.1 Definition of Array Fingerprint

Figure 5.3 illustrates the scenario when audio signals are transmitted from the source
to the microphone array. The audio source is regarded as a point with coordinate
(L, 0), and the microphones are evenly distributed on a circle. Given the k-th
microphone Mk , the collected audio data is yk (f, t) = hair (dk , f, t) · s(f, t), where
dk is the path distance from the audio source to Mk .
Inspired by the circular layout of microphones in smart speaker as shown in
Fig. 5.3, we define the array fingerprint AF as below:

AF = std(log[y1 , y2 , ..., yN ]). (5.2)
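The robustness property behind Eq. 5.2 can be checked with a few lines of NumPy: any gain common to all channels (e.g., a change in the air channel shared by the microphones) becomes an additive constant under the logarithm and is removed by the per-bin standard deviation. The function below is a simplified sketch operating directly on per-channel magnitude spectra:

```python
import numpy as np

def array_fingerprint(Y):
    """Array fingerprint AF = std over channels of log-magnitudes (Eq. 5.2).

    Y: array of shape (N, F) holding per-channel magnitude spectra
    (a simplified stand-in for the grid values used in Sect. 5.3.3.1).
    """
    return np.std(np.log(Y), axis=0)

# A common gain g multiplies every channel equally; log(g * Y_k) =
# log(g) + log(Y_k), so the per-bin standard deviation cancels log(g).
```

This invariance to common-channel gain is exactly why the array fingerprint stays stable across the distance changes shown in Fig. 5.4c.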

We further validate the effectiveness of the proposed array fingerprint via a


series of real-world case studies. In the experiment, the participant is required to
speak the command “Ok Google” at distances of 0.6 m and 1.2 m, respectively.
Figure 5.4a shows the audio signal clips collected by a microphone array with six
microphones, and the audio difference between different channels is obvious. When
employing the concept of fieldprint defined in Eq. 5.1, it is observed from Fig. 5.4b
that the fieldprints extracted from microphone pair (M1 , M2 ) and (M1 , M5 ) are


Fig. 5.3 Sound propagation in microphone array scenario



Fig. 5.4 Illustration of stability of array fingerprint under two locations. (a) Two original authentic
audios. (b) Dynamic power differences in different microphone pairs. (c) Stable array fingerprints


Fig. 5.5 Differentiating human authentic voice from two spoofing devices via array fingerprints
under different propagation paths

quite different.1 Among different distances, the fieldprints are also quite different.
However, from Fig. 5.4c we can see that the array fingerprints for different distances
are very similar.2
To show the distinctiveness of array fingerprint, we also conducted replay attacks
via smartphones and iPad (i.e., device #8 and #3 listed in Table 5.2, Sect. 5.4.1).
The normalized array fingerprints (i.e., FSAP in Sect. 5.3.3.1) are shown in Fig. 5.5.
It is observed that the array fingerprints for the same audio sources are quite
similar, while array fingerprints for different audio sources are quite different. Our

1 The real process of extracting fieldprint is more complicated. Figure 5.4b shows the basic
principle following the descriptions in Eq. 5.1.
2 This array fingerprint is refined after extracting from Eq. 5.4. The detailed calculation steps are

described in Sect. 5.3.3.1.



experimental results demonstrate the array fingerprint can serve as a better passive
liveness detection feature. This motivates us to design a novel, lightweight and
robust system which will be presented in the next section.

5.3 ArrayID: Array Fingerprint Based Passive Liveness


Detection

As shown in Fig. 5.6, ArrayID consists of the following modules: Data Collection
Module, Pre-processing Module, Feature Extraction Module, and Attack Detection
Module. We will elaborate on the details of each module in this section.

5.3.1 Multi-channel Data Collection

Currently, most popular smart speakers, such as Amazon Echo and Google Home,
employ a built-in microphone array to collect voice audio. In this chapter, we utilize
open modular development boards with voice interfaces (i.e., the Matrix Creator
[13] and Seeed Respeaker [16]) to collect the data. Since these development boards
have similar sizes to commercial smart speakers, ArrayID evaluations on the above
devices can be applied to a smart speaker without any notable alterations. Generally
speaking, given a smart speaker with N microphones, a sampling rate of Fs , and
data collection time T , the collected voice sample is denoted as VM×N , where M =
Fs × T and we let Vi be the i-th channel’s audio V (:, i). Then, the collected V is
sent to the next module.

Fig. 5.6 System overview



5.3.2 Data Pre-processing

As shown in Eq. 5.2, the identity (i.e., real human or spoofing device) is hidden in the
audio’s spectrogram. Therefore, before feature extraction, we conduct the frequency
analysis on each channel’s signal and detect the audio’s direction.
Frequency Analysis on Multi-channel Audio Data As described in Sect. 5.2.3,
the audio spectrogram in the time-frequency domain contains crucial features for
further liveness detection. ArrayID performs Short-Time Fourier Transform (STFT)
to obtain two-dimensional spectrograms of each channel’s audio signal. For the i-th
channel’s audio Vi , which contains M samples, ArrayID applies a Hanning window
to divide the signals into small chunks with lengths of 1024 points and overlapping
sizes of 728 points. Finally, a 4096-sized Fast Fourier Transform (FFT) is performed
for each chunk, and a spectrogram Si is obtained as shown in Fig. 5.7a.
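Using the parameters above (Hanning window of 1024 points, 728-point overlap, 4096-point FFT), the per-channel STFT can be sketched with NumPy alone; the function name is ours:

```python
import numpy as np

def channel_spectrogram(v, win=1024, overlap=728, nfft=4096):
    """STFT magnitude spectrogram of one channel with the chapter's parameters:
    Hanning window of 1024 points, 728-point overlap, 4096-point FFT."""
    hop = win - overlap                      # 296-sample hop between chunks
    w = np.hanning(win)
    n_frames = 1 + (len(v) - win) // hop
    frames = np.stack([v[i * hop:i * hop + win] * w for i in range(n_frames)])
    # rfft along each windowed chunk; transpose to (frequency, time)
    return np.abs(np.fft.rfft(frames, nfft, axis=1)).T
```

For one second of audio at 48 kHz, this yields a 2049-bin spectrogram over roughly 160 time chunks.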
Direction Detection Given a collected audio VM×N , to determine the microphone
which is closest to the audio source, ArrayID firstly applies a high pass filter with
a cutoff frequency of 100 Hz to VM×N . Then, for the i-th microphone Mi , ArrayID
calculates the alignment error Ei = mean((V (:, i − 1) − V (:, i))^2 ) [15]. Finally,
from the calculated E, ArrayID chooses the microphone with minimum alignment
error as the corresponding microphone.
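A minimal NumPy sketch of this direction-detection step is below. The FFT-domain mask is a crude stand-in for the 100 Hz high-pass filter, and the alignment-error formula follows the text; the 0-based index convention is ours:

```python
import numpy as np

def closest_microphone(V, fs=48000, cutoff=100.0):
    """Pick the microphone closest to the source via pairwise alignment errors.

    V: (M, N) multi-channel recording. E_i = mean((V[:, i-1] - V[:, i])^2)
    follows Sect. 5.3.2; the channel with the minimum error is returned.
    """
    M, N = V.shape
    freqs = np.fft.rfftfreq(M, d=1.0 / fs)
    spec = np.fft.rfft(V, axis=0)
    spec[freqs < cutoff, :] = 0.0            # crude high-pass at 100 Hz
    Vh = np.fft.irfft(spec, n=M, axis=0)
    E = [np.mean((Vh[:, i - 1] - Vh[:, i]) ** 2) for i in range(N)]
    return int(np.argmin(E))
```

Note that for i = 0 the formula wraps around to the last channel, which is natural for a circular microphone array.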

5.3.3 Feature Extraction

From Eq. 5.2, we observe that both audio spectrograms themselves and the micro-
phone array’s difference contain the liveness features of collected audio. In this
module, the following three features are selected by ArrayID: Spectrogram Array
Fingerprint FSAP , Spectrogram Distribution Fingerprint FSDP , and Channel LPCC
Features FLP C .

5.3.3.1 Spectrogram Array Feature

After obtaining the spectrogram S = [S1 , S2 , . . . , SN ] from N channels’ audio


data V = [V1 , V2 , ..., VN ], ArrayID firstly exploits the array fingerprint which
is proposed in Sect. 5.2.3 to extract the identity of the audio source. To reduce
the computation overhead, for Sk with size Ms × Ns , we only preserve the
components in which frequency is less than the cutoff frequency fsap . In this
study, we empirically set fsap as 5 kHz. The resized spectrograms are denoted
as Spec = [Spec1 , Spec2 , ..., Speck ], where Speck = Sk (: Mspec , :). In this
study, with sampling rate Fs = 48 kHz and FFT size Nfft = 4096, Mspec is
[fsap × Nfft / Fs ] = 426.


Fig. 5.7 Grid processing on the multiple-channel audios. (a) Original spectrograms of different
channels. (b) Spectrogram grids of different channels

Figure 5.7a illustrates Spec of three channels of the command “OK Google”. It
is observed that different channels’ spectrograms are slightly different. However,
directly using such subtle differences would cause an inaccurate feature. Thus,
ArrayID transforms Speck into a grid matrix Gk with size MG × NG by dividing
Speck into MG × NG chunks and calculates the sum of magnitudes within each
chunk. The element of Gk could be represented as:

Gk (i, j ) = sum(Speck (1 + (i − 1) · SM : i · SM , 1 + (j − 1) · SN : j · SN )),    (5.3)

where SM = [Mspec /MG ] and SN = [Nspec /NG ] are the width and length of each chunk.
Note that some elements of Speck may be discarded. However, it does not affect the
feature generation, since ArrayID focuses on the differences between spectrograms
according to Eq. 5.2. In this study, MG and NG are set to 100 and 20, respectively,
and Fig. 5.7b shows the spectrogram grids from the first, third and fifth microphones.
The difference among elements in G = [G1 , G2 , ..., GN ] is now very obvious. For
instance, the grid values in the red rectangles of Fig. 5.7b are quite different.
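The grid construction of Eq. 5.3 reduces to a truncate–reshape–sum in NumPy; this sketch assumes the magnitude spectrogram has already been cut at fsap:

```python
import numpy as np

def spectrogram_grid(spec, mg=100, ng=20):
    """Summarize a spectrogram into an MG x NG grid of summed magnitudes (Eq. 5.3).

    spec: (Mspec, Nspec) magnitude spectrogram of one channel; trailing rows and
    columns that do not fill a whole chunk are discarded, as in the text.
    """
    sm = spec.shape[0] // mg                 # chunk width  S_M
    sn = spec.shape[1] // ng                 # chunk length S_N
    trimmed = spec[:sm * mg, :sn * ng]
    # group into (MG, S_M, NG, S_N) chunks and sum within each chunk
    return trimmed.reshape(mg, sm, ng, sn).sum(axis=(1, 3))
```

With the chapter's sizes (Mspec = 426, MG = 100, NG = 20), each grid cell aggregates a 4-by-4 block of spectrogram magnitudes.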
Then, based on Eq. 5.2, ArrayID calculates the array fingerprint FG from the
spectrogram G. FG has the same size as Gk , and the elements of FG can be
represented as:


Fig. 5.8 Illustration of spectrogram array fingerprint feature FSAP extraction. (a) Array fingerprint
extraction processing. (b) Features among different commands and distances

FG (i, j ) = std([G1 (i, j ), G2 (i, j ), ..., GN (i, j )]). (5.4)

Figure 5.8a illustrates the FG containing NG chunks calculated from spectrogram


grids as shown in Fig. 5.7b. However, we find that in different time chunks, the
FG (:, i) varies. The reason is that different phonemes are pronounced by different
articulatory gestures, which can be mapped to a different huser (f, t) function in
Sect. 5.2.1. To solve this problem, we leverage the idea that even though different
phonemes contain different gestures, there are common components over a long
duration of time. Therefore, ArrayID averages FG across the time axis to obtain the
averaged fingerprint, as shown in Fig. 5.8a. ArrayID then performs a 5-point moving
average and normalization on it to remove noise and generate the spectrogram array
fingerprint FSAP .
Figure 5.8b gives a simple demonstration about the effectiveness of the FSAP
feature generation process. We test three voice commands “OK Google”, “Turn
on Bluetooth” and “record a video”, while the distances between the speaker and
microphone array are set as 0.6 m and 1.2 m in the first two commands and the
last command, respectively. In Fig. 5.8b, it is observed that the different commands
result in a similar array fingerprint, and the feature difference between authentic
audio and spoofing audio is clear. Finally, since ArrayID requires a fast response
time, the feature should be lightweight; thus, FSAP is re-sampled to a length of
NSAP points. In this study, we empirically choose NSAP as 40.
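Putting the steps of this subsection together, a compact sketch of the FSAP computation (per-cell standard deviation, time averaging, smoothing, normalization, and resampling to NSAP = 40) might look as follows; the exact normalization and resampling details in ArrayID may differ:

```python
import numpy as np

def sap_feature(grids, n_sap=40, smooth=5):
    """Spectrogram array fingerprint F_SAP from per-channel grids (Eq. 5.4 onward).

    grids: (N, MG, NG) stack of the N channels' spectrogram grids.
    """
    fg = np.std(grids, axis=0)               # (MG, NG) array fingerprint, Eq. 5.4
    avg = fg.mean(axis=1)                    # average across the time chunks
    kernel = np.ones(smooth) / smooth
    sm = np.convolve(avg, kernel, mode='same')   # 5-point moving average
    sm = sm / (np.max(np.abs(sm)) + 1e-12)       # normalize
    x_old = np.linspace(0.0, 1.0, len(sm))
    x_new = np.linspace(0.0, 1.0, n_sap)
    return np.interp(x_new, x_old, sm)           # resample to N_SAP points
```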


Fig. 5.9 Spectrogram distributions between authentic human and spoofing device. (a) The
authentic audio’s Ch. (b) The spoofing audio’s Ch. (c) FSDP between authentic and spoofing
audios

5.3.3.2 Spectrogram Distribution Feature

Besides FSAP , the spectrogram distribution also provides useful information related
to the identity of the audio source. Thus we also extract spectrogram distribution
fingerprint FSDP for liveness detection. Given a spectrogram Sk from the k-th
channel, ArrayID calculates an NG -dimension vector Chk in which Chk (i) =
Σj=1..Mspec Sk (j, i), where Mspec and NG are set as 85 and 20, respectively, in this
study. Note that, when calculating FSDP , we set the cutoff frequency as 1 kHz since
most human voice frequency components are located in the 0~1 kHz range and the
corresponding MSpec is 85 under the parameters in Sect. 5.3.3.1. For the audio with
N channels, the channel frequency strength Ch = [Ch1 , Ch2 , ..., ChN ] is obtained.
Figure 5.9a, b show channel frequency strength Ch1 and Ch4 of first and fourth
channels from authentic and spoofing audios. It is observed that Ch from real
human and spoofing device are quite different. Therefore, we utilize the average
of channel frequency strengths Ch and re-sample its length to NCh as the first
component of FSDP . In this study, Ch(i) = mean([Ch1 (i), Ch2 (i), ..., ChN (i)])
and NCh is set to 20. We can also find that for the same audio, Ch from different
channels have slightly different magnitudes and distributions. To characterize the
distribution of Ch, for Chk from the k-th channel, ArrayID first calculates the
cumulative distribution function Cumk and then determines the indices μ, which
can split Cumk uniformly. As shown in Fig. 5.9a, b, the Chk are segmented into 6
bands. ArrayID sets the T hr = [0.1, 0.3, 0.5, 0.7, 0.9], and the index μ(k, i) of the
i-th T hr for Chk satisfies the following condition:

Cumk (μ(k, i)) ≤ T hri ≤ Cumk (μ(k, i) + 1). (5.5)




Fig. 5.10 LPCC in each channel

After obtaining the N × 5 indices μ, we utilize the mean value Dmean and
standard deviation Dstd among different channels as a part of the spectrogram
feature. Both Dmean and Dstd are vectors with length of 5, where Dmean (i) =
mean(μ(:, i)) and Dstd (i) = std(μ(:, i)). Finally, ArrayID obtains the spectrogram
distribution fingerprint FSDP = [Ch, Dmean , Dstd ]. Figure 5.9c illustrates the
FSDP from authentic and spoofing audios and demonstrates the robustness of FSDP .
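The construction of FSDP can be sketched as follows. Here np.searchsorted is used as a convenient way to find the indices μ satisfying Eq. 5.5, and the exact index convention in ArrayID may differ by one:

```python
import numpy as np

def sdp_feature(Ch, thr=(0.1, 0.3, 0.5, 0.7, 0.9), n_ch=20):
    """Spectrogram distribution fingerprint F_SDP = [mean Ch, D_mean, D_std].

    Ch: (N, B) channel frequency strengths, one row per channel.
    """
    thr = np.asarray(thr)
    mu = []
    for ch in Ch:
        cum = np.cumsum(ch) / (np.sum(ch) + 1e-12)   # cumulative distribution
        mu.append(np.searchsorted(cum, thr))         # indices splitting Cum_k
    mu = np.asarray(mu, dtype=float)                 # (N, 5) index matrix
    ch_bar = Ch.mean(axis=0)                         # average channel strength
    x_old = np.linspace(0.0, 1.0, len(ch_bar))
    x_new = np.linspace(0.0, 1.0, n_ch)
    ch_bar = np.interp(x_new, x_old, ch_bar)         # re-sample to N_Ch = 20
    return np.concatenate([ch_bar, mu.mean(axis=0), mu.std(axis=0)])
```

The resulting vector has length 20 + 5 + 5 = 30, matching the three components of FSDP described above.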

5.3.3.3 Channel LPCC Features

The final feature of ArrayID is the Linear Prediction Cepstrum Coefficients (LPCC).
Since each channel has unique physical properties, retaining the LPCC which
characterizes a given audio signal could further improve the detection performance.
For the audio signal yk (t) collected by microphone Mk , ArrayID calculates the
LPCC with order p = 15. To do so, we first calculate the Linear Prediction
Coding (LPC) coefficients a:

a = LP C(yk (t), p), (5.6)

where p is the order of the LPC, and the computed LPC coefficients are a =
[a0 , a1 , . . . , ap ]. Then, for the LPCC coefficients c = [c0 , c1 , . . . , cp ], we have c0 =
ln(p), and the other elements are calculated as:

cn = −an − Σk=1..n−1 (1 − k/n) · ak · cn−k .    (5.7)

For an example multi-channel voice, Fig. 5.10 shows the LPCCs on each channel.
In this figure, when M1 is the closest microphone, for a microphone array with six
channels, the opposite microphone is M4 . The LPCCs from these two channels

Fig. 5.11 Microphone array: Matrix Creator and Seeed ReSpeaker Core v2

are selected as FLP C . To reduce the time overhead spent on LPCC extraction, we
only preserve the LPCCs from the audios of these two channels (Mi , Mmod(i+N/2,N ) ),
where Mi is the closest microphone derived in Sect. 5.3.2. Finally, we generate the
final feature vector X = [FSAP , FSDP , FLP C ].
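The LPCC recursion of Eq. 5.7 is straightforward to implement. This sketch takes the LPC vector a as given; obtaining a itself (e.g., via Levinson–Durbin) is omitted:

```python
import numpy as np

def lpcc_from_lpc(a):
    """LPCC coefficients from LPC coefficients a = [a0, a1, ..., ap].

    Uses c0 = ln(p) and c_n = -a_n - sum_{k=1}^{n-1} (1 - k/n) a_k c_{n-k},
    following the recursion in Eqs. 5.6-5.7.
    """
    p = len(a) - 1
    c = np.zeros(p + 1)
    c[0] = np.log(p)
    for n in range(1, p + 1):
        acc = sum((1.0 - k / n) * a[k] * c[n - k] for k in range(1, n))
        c[n] = -a[n] - acc
    return c
```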

5.3.4 Classification Model

After generating the feature vector from the audio input, we choose a lightweight
feed-forward back-propagation neural network to perform liveness detection. The
neural network only contains three hidden layers with rectified-linear activation
(layer sizes: 64, 32, 16). The dropout is 20% after the 64 and 32 node layers, and the
output layer is one sigmoid-activated node. We adopt a lightweight neural network
because it can produce decisions quickly, which is essential for devices in the
smart home environment.
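For concreteness, the forward pass of such a network can be sketched in NumPy. The 102-dimensional input (40 + 30 + 2 × 16 from the features above) and the weight initialization are our assumptions; in practice the network would be trained with back-propagation in any standard framework:

```python
import numpy as np

def mlp_liveness(x, params, rng=None, train=False):
    """Forward pass of the lightweight detector: 64-32-16 ReLU hidden layers,
    20% dropout after the 64- and 32-node layers (training only), sigmoid output.

    x: feature vector X = [F_SAP, F_SDP, F_LPC]; params: list of (W, b) tuples.
    """
    h = x
    for i, (W, b) in enumerate(params[:-1]):
        h = np.maximum(W @ h + b, 0.0)                 # ReLU hidden layer
        if train and i < 2 and rng is not None:        # dropout after 64/32 layers
            h *= (rng.random(h.shape) >= 0.2) / 0.8    # inverted dropout, p = 0.2
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(W @ h + b)))          # sigmoid output node

def init_params(sizes=(102, 64, 32, 16, 1), seed=0):
    """Random initialization for the layer sizes above (illustrative only)."""
    rng = np.random.default_rng(seed)
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]
```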

5.4 Performance Evaluation

5.4.1 Experiment Setup

Hardware Setup To collect multi-channel audios, as shown in Fig. 5.11, we


employ two open modular development boards (i.e., Matrix Creator and Seeed
ReSpeaker Core v2) with a sampling rate of 48 kHz to serve as smart speakers.
The number of microphones in the Matrix and ReSpeaker are 8 and 6, respectively,
and their radii are 5.4 cm and 4.7 cm, respectively. For the spoofing device, we
employ 14 different electrical devices with various sizes and audio qualities whose
detailed parameters are shown in Table 5.2.

Data Collection Procedure We recruit 20 participants to provide the multi-channel


audio data. The data collection procedure consists of two phases: (1) Authentic audio

Table 5.1 The commands in experiments


No. Voice command No. Voice command
1 OK Google 11 Decrease volume
2 Turn on Bluetooth 12 Turn on flashlight
3 Record a video 13 Set the volume to full
4 Take a photo 14 Mute the volume
5 Open music player 15 What’s the definition of transmit?
6 Set an alarm for 6:30 am 16 Call Pizza Hut
7 Remind me to buy coffee at 7 am 17 Call the nearest computer shop
8 What is my schedule for tomorrow? 18 Show me my messages
9 Square root of 2105? 19 Translate please give me directions to Chinese
10 Open browser 20 How do you say good night in Japanese?

Table 5.2 Loudspeaker used for generating spoofing attacks


No. Type Manufacture Model Size (L*W*H in cm)
1 Loudspeaker Bose SoundLink Mini 5.6 × 18.0 × 5.1
2 Tablet Apple iPad 6 24.0 × 16.9 × 0.7
3 Tablet Apple iPad 9 24.0 × 16.9 × 0.7
4 Loudspeaker GGMM Ture 360 17.5 × 10.9 × 10.9
5 Smartphone Apple iPhone 8 Plus 15.8 × 7.8 × 0.7
6 Smartphone Apple iPhone 8 13.8 × 6.7 × 0.7
7 Smartphone Apple iPhone 6s 13.8 × 6.7 × 0.7
8 Smartphone Xiaomi MIX2 15.2 × 7.6 × 0.8
9 Loudspeaker Amazon Echo Dot (2nd Gen) 8.4 × 3.2 × 8.4
10 Laptop Apple MacBook Pro (2017) 30.4 × 21.2 × 1.5
11 Loudspeaker VicTsing SoundHot 12.7 × 12.2 × 5.6
12 Loudspeaker Ultimate Ears Megaboom 8.3 × 8.3 × 22.6
13 Loudspeaker Amazon Echo Plus (1st Gen) 23.4 × 8.4 × 8.4
14 Smartphone Xiaomi Mi 9 15.8 × 7.5 × 0.8

collection: in this phase, the participant speaks 20 different voice commands as


listed in Table 5.1, and the experimental session can be repeated multiple times
by this participant. We pre-define four distances (i.e., 0.6 m, 1.2 m, 1.8 m, and 2.4 m)
between the microphone array and the participant, and the participant can choose any
of them in each session. Regarding speaking behavior, we ask the participant to speak
commands as she/he likes and do not specify any fixed speed or tone. (2) Spoofing audio collection:
in this phase, similar to the manners adopted by the previous works [1, 19, 23],
after collecting the authentic voice samples, we utilize the spoofing devices as
listed in Table 5.2 to automatically replay the samples without the participant’s
involvement. When replaying a voice command, the electrical device is placed at
the same location as the corresponding participant.
122 5 Microphone Array Based Passive Liveness Detection at Voice Interface Layer

Table 5.3 Detailed information of the dataset


User # # Authentic samples # Spoofing samples Distance (cm) Spoofing devices
1, 7 1200 3600 60, 120, 180 SoundLink Mini, iPad 6, iPhone 8 Plus
2 600 1079 60, 120, 180 Ture360, iPhone 6s
3 533 904 60, 120, 180 Ture360, iPad 9
4–6, 8 2305 6415 60, 120, 180 iPad 9, Ture360, MIX2
9–12 3211 3198 60, 120, 180, 240 Echo Plus (1st Gen)
13–18 1191 4577 180 iPad 9, Mi 9, Echo Plus (1st Gen)
19 591 1767 60, 120, 180 iPhone 8, Echo Dot (2nd Gen), MacBook Pro (2017)
20 610 998 60, 120, 180 SoundHot, Megaboom

Dataset Description After finishing experiments, we utilize pyAudioAnalysis tool


to split the collected audio into multiple voice command samples.3 After removing
incorrectly recognized samples, we get a dataset containing 32,780 audio samples.
We refer to this dataset as Array dataset and utilize it to assess ArrayID. The details
of the dataset are shown in Table 5.3. For instance, user #7 provides 600 authentic
samples at three different positions (i.e., distances of 0.6 m, 1.2 m, and 1.8 m), and
we replay these samples through three spoofing devices (i.e., SoundLink Mini, iPad 6,
iPhone 8 Plus) to generate 1800 spoofing samples.
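As an illustration of the splitting step, the following simplified energy-based segmenter (a stand-in for pyAudioAnalysis' silence-removal routine, not the exact tool used here) recovers per-command segments from a long recording; the frame length and energy threshold are illustrative choices:

```python
import numpy as np

def split_commands(signal, fs, frame_len=0.05, energy_ratio=4.0, min_gap=0.5):
    """Split a recording into voice-command segments by short-time energy.

    Frames whose energy exceeds `energy_ratio` times the median frame energy
    are treated as speech; speech segments closer than `min_gap` seconds are
    merged. Returns a list of (start_sec, end_sec) tuples.
    """
    n = int(frame_len * fs)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    energy = (frames.astype(float) ** 2).mean(axis=1)
    active = energy > energy_ratio * (np.median(energy) + 1e-12)

    segments, start = [], None
    for i, is_speech in enumerate(active):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append([start * frame_len, i * frame_len])
            start = None
    if start is not None:
        segments.append([start * frame_len, len(active) * frame_len])

    # merge segments separated by short pauses
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_gap:
            merged[-1][1] = seg[1]
        else:
            merged.append(seg)
    return [tuple(s) for s in merged]
```

In practice one would run such a segmenter on each 48 kHz channel and keep only segments that a speech recognizer transcribes correctly, mirroring the filtering step described above.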

Evaluation Metrics Similar to previous works [1, 19], in this study, we choose
accuracy, false acceptance rate (FAR), false rejection rate (FRR), and equal error
rate (EER) as metrics to evaluate ArrayID. Accuracy is the percentage of correctly
recognized samples among all samples. FAR is the rate at which a spoofing sample
is wrongly accepted by ArrayID, and FRR is the rate at which an authentic sample
is falsely rejected. EER provides a balanced view of FAR and FRR: it is the rate at
which FAR equals FRR. These metrics differ slightly from those in Sect. 4.4.1
because ArrayID does not adopt a threshold-based classification method. We
therefore report the EER for scenarios where the numbers of positive and negative
samples are imbalanced; EER is also used by previous works (e.g., [1]).

3 PyAudioAnalysis website: https://pypi.org/project/pyAudioAnalysis/.
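For score-based detectors, these metrics can be computed directly from the detector's output scores. The following sketch (our illustration, not ArrayID's code) sweeps all candidate thresholds and reads off the EER at the point where FAR and FRR meet:

```python
import numpy as np

def far_frr_eer(genuine_scores, spoof_scores):
    """Compute FAR/FRR curves over all thresholds and the EER.

    Higher score = more likely authentic. FAR is the fraction of spoofing
    samples accepted (score >= threshold); FRR is the fraction of authentic
    samples rejected (score < threshold). The EER is taken at the threshold
    where |FAR - FRR| is smallest, which stays meaningful even when the two
    classes are imbalanced.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, spoof]))
    far = np.array([(spoof >= th).mean() for th in thresholds])
    frr = np.array([(genuine < th).mean() for th in thresholds])
    i = np.argmin(np.abs(far - frr))
    eer = (far[i] + frr[i]) / 2
    return far, frr, eer
```

For perfectly separable scores the EER is 0; overlapping score distributions yield the trade-off point between the two error rates.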



Fig. 5.12 Per-user breakdown analysis: false acceptance and false rejection rates (%) per user

5.4.2 Performance of ArrayID


5.4.2.1 Overall Accuracy

When evaluating ArrayID on our own Array dataset, we adopt two-fold cross-
validation, i.e., the dataset is divided equally into training and testing sets. ArrayID
achieves the detection accuracy of 99.84% and the EER of 0.17%. More specifically,
for all 32,780 samples, the overall FAR and FRR are only 0.05% (i.e., 13 out of
22,539 spoofing samples are wrongly accepted) and 0.39% (i.e., 40 out of 10,241
authentic samples are wrongly rejected) respectively. The results show that ArrayID
is highly effective in thwarting spoofing attacks.
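The two-fold protocol above can be sketched as follows; the nearest-centroid classifier is only an illustrative stand-in, not ArrayID's actual classifier:

```python
import numpy as np

def nearest_centroid(X, y):
    """Toy stand-in classifier: predicts the class of the closest centroid."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    def predict(Xt):
        dists = ((Xt[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        return classes[dists.argmin(axis=1)]
    return predict

def two_fold_accuracy(features, labels, train_fn, seed=0):
    """Two-fold cross-validation: shuffle, split the samples into equal
    halves, train on one half and test on the other, swap, and average."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    half = len(labels) // 2
    folds = (idx[:half], idx[half:])
    accuracies = []
    for train, test in (folds, folds[::-1]):
        model = train_fn(features[train], labels[train])
        accuracies.append(float((model(features[test]) == labels[test]).mean()))
    return sum(accuracies) / len(accuracies)
```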

5.4.2.2 Per-user Breakdown Analysis

To evaluate the performance of ArrayID on different users, we show the FAR and
FRR of each user in Fig. 5.12. Note that, for six users (i.e., users #11, #12, #15,
#16, #17, #18) who are not shown in this figure, there is no detection error.
Regarding FAR, false acceptance cases occur for only 6 users, and even in the worst
case (i.e., user #20), the false acceptance rate is below 0.51%. Regarding FRR, false
rejection cases are distributed among 14 users, and only the FRRs of users #3 and
#20 exceed 1%. Although ArrayID's performance varies across users, even in the
worst case (i.e., user #20) the detection accuracy remains 99.0%, which demonstrates
the effectiveness of ArrayID.

5.4.2.3 Time Overhead

For a desktop with an Intel i7-7700T CPU and 16 GB RAM, the average time overheads
on 6-channel and 8-channel audios are 0.12 s and 0.38 s, respectively. Note that it

Table 5.4 The detection accuracy on both datasets
Liveness feature Array dataset ReMasc Core [10]
ArrayID 99.84% 97.78%
Mono feature [1] 98.81% 84.37%
Two-channel [19] 77.99% 82.44%

is easy for existing smart home systems (e.g., Amazon Alexa) to incorporate
ArrayID into their current industrial-level solutions in the near future. In that case, both
speech recognition and liveness detection can be done in the cloud [12]. Therefore,
by leveraging the hardware configuration of the smart speaker’s cloud (e.g., Amazon
Cloud [9]), which is much better than our existing one (CPU processor), we believe
that the time overhead can be further reduced and will not incur notable delays.

5.4.2.4 Comparison with Previous Works

We further compare the performance of ArrayID with existing works to demonstrate


the superiority of the proposed array fingerprints. Besides our collected Array
dataset, we also exploit a third-party public dataset, ReMasc Core [10], which
contains 12,023 multi-channel audio samples from 40 different users. Note that,
we only consider the audio samples collected by circular microphone arrays in the
experiment. We re-implement mono audio-based scheme Void [1] and two-channel
audio-based scheme CaField [19].
As shown in Table 5.4, ArrayID is superior to previous works in both datasets.
Especially for the ReMasc Core dataset, in which half of the audio samples are
collected in outdoor environments, ArrayID is the only scheme that achieves an
accuracy above 97%. The two-channel-based scheme CaField gets relatively low
performance on both the Array dataset and the ReMasc dataset. This is expected,
since CaField requires the user to hold the device with a fixed gesture at a short
distance. In summary, these results demonstrate that compared with mono audio-
based or two-channel-based schemes, exploiting multi-channel features achieves
superior performance in the liveness detection task.

5.4.3 Impact of Various Factors on ArrayID

In this subsection, we evaluate the impact of various factors (e.g., direction, distance,
user movement, spoofing device, array type) on ArrayID.

Table 5.5 Performance under different directions
Direction Front Left Right Back
# authentic samples 1020 1004 1195 1000
# spoofing samples 980 947 971 932
Accuracy (%) 100 99.69 99.31 99.74
EER (%) 0 0.59 1.08 0.43

Table 5.6 Performance when changing the distance
Training position (m) 1.2 1.8 2.4
Accuracy (%) 99.41 99.53 99.66
EER (%) 1.11 0.93 0.69

5.4.3.1 Impact of Changing Direction

In Sect. 5.4.1, when collecting audio samples, most participants face the smart
speaker while generating voice commands. To explore the impact of the angles
between the user’s face direction and the microphone array, we recruit 10 partic-
ipants to additionally collect authentic voice samples in four different directions
(i.e., front, left, right, back) and then the spoofing device #8 in Table 5.2 is utilized
to generate spoofing audios. As shown in Table 5.5, we collect 4219 authentic
samples and 3830 spoofing samples in total. Then, we use the classification model trained
in Sect. 5.4.2 to conduct liveness detection. It is observed from Table 5.5 that in
all scenarios, ArrayID achieves an accuracy above 99.3%, which means ArrayID is
robust to the change of direction.

5.4.3.2 Impact of Changing Distance

To evaluate the performance of ArrayID at distances unseen during training, we
recruit four participants to attend experiments at three different locations (i.e., 1.2 m,
1.8 m, 2.4 m). In total, we collect 2410 authentic and 2379 spoofing audio samples. For a
given distance, the classifier is trained with audios at this distance and tested on
audios at other distances. As shown in Table 5.6, compared with the performance
in Sect. 5.4.2, ArrayID’s performance undergoes degradation when the audio source
(i.e., the human or the spoofing device) changes its location. However, in all cases,
ArrayID achieves an accuracy above 99.4%, which demonstrates that ArrayID is
robust to the distance.

5.4.3.3 Impact of User Movement

Similar to the above experiments, we recruit 10 participants to speak while walking.
Then, each participant walks while holding a spoofing device (i.e., Amazon Echo)
that plays spoofing audios. We collect 1999 authentic and 1799 spoofing samples,
and the classifier is the same as that in Sect. 5.4.2. The detection accuracy is 98.2%,

Table 5.7 The FAR of each voice spoofing device
Device # 1 4 8 9 10
FAR (%) 0.09 1.04 0.05 0.55 0.96
Device # 11 12 13 14 Others
FAR (%) 3.15 4.14 0.79 0.76 0

which demonstrates that ArrayID and the array fingerprint are robust even with the
user’s movement.

5.4.3.4 Impact of Changing Environment

To evaluate the impact of different environments on ArrayID, we recruit 10


participants to speak voice commands and use device #8 to launch voice spoofing
in a room different from that in Sect. 5.4.2. We collect 1988 authentic samples and
1882 spoofing samples respectively. When utilizing the classifier in Sect. 5.4.2, the
detection accuracy is 99.30%, which shows ArrayID can effectively thwart voice
spoofing under various environments.
It is well known that different devices have different frequency-amplitude
response properties and may differ in attack capability. To evaluate ArrayID's
performance in thwarting different spoofing devices, we conduct an experiment
based on the Array dataset containing 14 spoofing devices as listed in Table 5.2. As
discussed in Sect. 5.5.1, to reduce the user's enrollment burden, we set the training
proportion to 10%.
Table 5.7 illustrates the false acceptance rate (FAR) of ArrayID on each device
in this case. It is observed that among 14 devices, the overall FAR is 0.58%
(i.e., 117 out of 20,290 spoofing samples are wrongly accepted). Besides, ArrayID
achieves overall 100% detection accuracy on 5 devices (i.e., devices #2, #3, #5,
#6, #7). Even in the worst case (i.e., device #12 Megaboom), the true rejection
rate is still 95.86%. Furthermore, as shown in Sect. 5.4.2, when increasing the
training proportion to 50%, the FAR of ArrayID is only 0.05%.
In summary, ArrayID is robust to various spoofing devices.

5.4.3.5 Liveness Detection in Noisy Environments

We conduct an experiment to evaluate the impact of background noise. As shown in
Fig. 5.13a, to ensure the noise level is consistent while the user is speaking a voice
command, we place a noise generator that plays white noise during the data collection.
We utilize an advanced sound level meter (i.e., Smart Sensor AR824) with A-weighting
to measure the background noise level. The noise levels at the microphone array are
set to 45 dB, 50 dB, 55 dB, 60 dB, and 65 dB, respectively, and a total of 4528 audio
samples are collected from 10 participants and spoofing device #13 (i.e., Amazon
Echo Plus).

Fig. 5.13 Performance under noisy environments. (a) Noise evaluation setting. (b) Detection accuracy and EER versus background noise level (dB)

We utilize the classifier from Sect. 5.4.2, which is trained at a noise level of 30 dB,
to conduct liveness detection. As shown in Fig. 5.13b, when increasing the noise level
from 45 dB to 65 dB, the accuracy decreases from 98.8% to 86.3%. It is observed that
ArrayID can still work well when the background noise is less than 50 dB, which
also explains why ArrayID can handle the audio samples of the ReMasc Core dataset
collected in an outdoor environment. However, when there exists strong noise, since
the feature of ArrayID is only based on the collected audios, the performance of
ArrayID degrades sharply.

5.5 Discussions

5.5.1 User Enrollment Time in Training

Impact of Training Dataset Size To reduce the user’s registration burden, we


explore the impact of training data size on the performance of ArrayID. For our
Array dataset, we set the training dataset proportion as 10%, 20%, 30%, and 50%,
respectively. The results are shown in Table 5.8. It is observed that the detection
accuracy increases from 99.14% to 99.84% as more training samples are involved.
Note that, even if we only choose 10% of the samples for training, ArrayID still
achieves an accuracy of 99.14% and an EER of 0.96%.
Time Overhead of User’s Enrollment As mentioned in Sect. 5.4.1, the participant
does not need to provide spoofing audio samples. Besides, as shown in Table 5.8,

when setting the training proportion as 10%, among 10,241 authentic samples from
20 users, the average number of audio samples provided by each user during the
enrollment is only 51. Since the average length of a voice command is shorter
than 3 s, the enrollment can be done in less than 3 min. Compared with the time
overhead of deploying an Alexa skill, which is up to 5 min [2], requiring 3 min for
enrollment is acceptable in real-world scenarios.
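The mm:ss enrollment times in Table 5.8 follow directly from this 3 s per-command upper bound; for example:

```python
def enrollment_time(n_samples, secs_per_command=3):
    """Upper-bound enrollment time in mm:ss, assuming each voice command
    takes at most `secs_per_command` seconds (cf. Table 5.8)."""
    total = n_samples * secs_per_command
    return f"{total // 60:02d}:{total % 60:02d}"
```

With this bound, 51 enrollment samples at the 10% training proportion cost 02:33, matching the first row of Table 5.8.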

5.5.2 Handling Users Providing Incomplete Training Dataset

In Sect. 5.4.2, each participant is required to provide both authentic and spoofing
audio samples during enrollment. In this subsection, we consider two special
settings of training configuration: (1) a new user provides only authentic voice
samples (without spoofing samples); (2) a new user did not participate in the training
(i.e., unseen user).

5.5.2.1 Handling a New User with Only Authentic Samples (Without


Spoofing Samples)

We consider a special setting of training configuration in which a new user provides


only authentic voice samples (without spoofing samples). We conduct an experiment
by leveraging the Array dataset. Note that we assume the attacker only utilizes
existing devices in the smart home to conduct spoofing. Thus, a total of 18 users are
selected (i.e., users #1–#18), whose spoofing devices are listed in Table 5.2. During
the experiment, for each selected user, ArrayID is trained with this user's authentic
voice samples and generic spoofing samples provided by the other 17 users. Then, in

Table 5.8 Enrollment times per user


Training proportion Authentic samples Time (mm:ss) Accuracy (%) EER (%)
10% 51 02:33 99.14 0.96
20% 103 05:09 99.47 0.55
30% 155 07:45 99.63 0.43
50% 263 13:09 99.84 0.17

Fig. 5.14 Detection performance under different training configurations (classic training configuration vs. only authentic samples)


Fig. 5.15 Detection accuracy when the user participates in training or not

the evaluation phase, we test the ability of ArrayID to detect attack samples of this
user and calculate the corresponding detection accuracy (i.e., true rejection rate).
Figure 5.14 illustrates the detection accuracy under two different training
configurations. For all 18 users, the overall accuracy (i.e., TRR) decreases from
99.96% in the classical training configuration described in Sect. 5.4.2 to 99.68% in
this training configuration. For 11 users (i.e., user #4, #5, #8, #9, #11, #12, #14,
#15, #16, #17, #18), the accuracy remains 100% in both scenarios. For the other 7
users, the accuracy decreases slightly due to a lack of knowledge of the user’s attack
samples in the classifier, but all of them achieve an accuracy of above 96%, which
demonstrates the effectiveness of ArrayID in this training configuration.

5.5.2.2 Handling Unseen Users

In the experiment, for each user in the Array dataset, we train the classifier using the
other 19 users’ legitimate and spoofing voice samples and regard the user’s samples
as the testing dataset. The detection accuracy of each user is shown in Fig. 5.15. We
also show the results described in Sect. 5.4.2 when users participate in training as a
comparison.
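The leave-one-user-out protocol described above can be expressed as a small split helper (a sketch; `user_ids` is a hypothetical per-sample user-label array):

```python
import numpy as np

def leave_one_user_out(user_ids):
    """Yield (held_out_user, train_idx, test_idx) splits: the classifier is
    trained on all other users' samples and tested on the held-out user's
    samples, modeling an unseen user."""
    user_ids = np.asarray(user_ids)
    for user in np.unique(user_ids):
        test_idx = np.flatnonzero(user_ids == user)
        train_idx = np.flatnonzero(user_ids != user)
        yield user, train_idx, test_idx
```

For the Array dataset, this yields 20 splits, one per held-out user.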
From Fig. 5.15, it is observed that the overall detection accuracy decreases
from 99.84% to 92.97%. In the worst case (i.e., user #12), the detection accuracy

decreases from 99.87% to 74.53%. The results demonstrate that the ability of ArrayID
to handle unseen users varies across users. However, for 11 users, ArrayID
can still achieve detection accuracies higher than 95%. The overall results demon-
strate that ArrayID is still effective when addressing unseen users.
This performance degradation when addressing unseen users remains an open
problem in the area of liveness detection [1, 4, 23]. To partially mitigate this
issue, a practical solution is requiring the unseen users to provide only authentic
voice samples to enhance the classifier (i.e., the training configuration discussed in
Sect. 5.5.2.1).

5.5.3 Limitations and Countermeasures

In this subsection, we discuss the limitations of ArrayID.


The User’s Burden on the Enrollment We can incorporate the enrollment into
daily use to reduce the user’s time overhead on training ArrayID. Firstly, the
evaluation results from Sect. 5.4.3 show that ArrayID is robust to the change of
user’s position, direction, and movement. That means the user can participate in the
enrollment anytime. Then, to achieve this goal, we divide ArrayID into working and
idle phases. In the working phase, when a user generates a voice command, ArrayID
collects the audio and saves the extracted features. During the idle phase, ArrayID
can automatically update the classifier based on these newly generated features.
These steps can be done automatically without human involvement, which means
ArrayID can continuously improve its performance along with daily use.
Impact of Noise and Other Speakers During the user's enrollment, we assume the
environment is silent and no other user is talking. As shown in Sect. 5.4.3.5, since
ArrayID is a passive liveness detection scheme that depends only on audio, strong
noise or another speaker's voice in the collected audio will inevitably degrade its
performance. Therefore, the existence of noise or other talking users will increase
the enrollment time. Fortunately, since ArrayID is designed for the smart home or
office environment, asking users to keep the environment silent during enrollment
is a reasonable assumption.
Temporal Stability of Array Fingerprint To evaluate the temporal stability of
ArrayID, we recruit a participant to provide 100 authentic voice commands and
launch voice spoofing every 24 h. When using the classification model described in
Sect. 5.4.2 on the audio datasets collected 24 h and 48 h later, ArrayID still achieves
over 98% accuracy. We admit that the generated feature may vary when the
participant changes her/his speaking manner or suffers from mood swings. As
mentioned above, a feasible solution to this issue is incorporating the enrollment
into the user's daily use to ensure the freshness of ArrayID's classification model.

5.5.4 Comparison with WSVA

As a passive liveness detection system, ArrayID has advantages and disadvantages


compared with WSVA, which is based on two-factor authentication. On the one hand,
ArrayID does not require deploying any extra device to collect data related to user
pronunciation, and it can perform speaker recognition, so it can resist insider
attacks. On the other hand, ArrayID needs the user to provide training data before
use; otherwise, it cannot obtain any characteristics of the user. However, as
described in the analysis of the multi-user scenario in
Sect. 4.4.2, WSVA can implement the liveness detection task without requiring users
to provide registered samples. This is because WSVA mainly focuses on whether the
user has mouth movements that are compatible with voice commands. In practice,
we can consider using WSVA and ArrayID in different scenarios.

5.6 Summary

In this study, we propose a novel liveness detection system ArrayID for thwarting
voice spoofing attacks without any extra devices. We analyze existing popular
passive liveness detection schemes and propose a robust liveness feature, the array
fingerprint. This novel feature both enhances the effectiveness and broadens the
application scenarios of passive liveness detection. ArrayID is tested on both our
own dataset and another public dataset, and the experimental results demonstrate
ArrayID is superior to existing passive liveness detection schemes. Besides, we
evaluate multiple factors and demonstrate the robustness of ArrayID.

References

1. Ahmed, M.E., Kwak, I.Y., Huh, J.H., Kim, I., Oh, T., Kim, H.: Void: a fast and light voice
liveness detection system. In: 29th USENIX Security Symposium (USENIX Security 20),
pp. 2685–2702. USENIX Association, Berkeley (2020). https://www.usenix.org/conference/
usenixsecurity20/presentation/ahmed-muhammad
2. Amazon.com, Inc.: Create and manage alexa-hosted skills (2021). https://developer.amazon.
com/en-US/docs/alexa/hosted-skills/alexa-hosted-skills-create.html
3. Blue, L., Abdullah, H., Vargas, L., Traynor, P.: 2MA: verifying voice commands via two
microphone authentication. In: Proceedings of the 2018 on Asia Conference on Computer
and Communications Security, pp. 89–100. ACM, New York (2018). https://doi.org/10.1145/
3196494.3196545
4. Blue, L., Vargas, L., Traynor, P.: Hello, is it me you’re looking for? differentiating between
human and electronic speakers for voice interface security. In: Proceedings of the 11th ACM
Conference on Security & Privacy in Wireless and Mobile Networks, pp. 123–133. ACM, New
York (2018)
5. Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., Zhou,
W.: Hidden voice commands. In: 25th USENIX Security Symposium (USENIX Security

16), pp. 513–530. USENIX Association, Austin (2016). https://www.usenix.org/conference/


usenixsecurity16/technical-sessions/presentation/carlini
6. Chen, S., Ren, K., Piao, S., Wang, C., Wang, Q., Weng, J., Su, L., Mohaisen, A.: You can hear
but you cannot steal: defending against voice impersonation attacks on smartphones. In: 2017
IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 183–195
(2017). https://doi.org/10.1109/ICDCS.2017.133
7. Diao, W., Liu, X., Zhou, Z., Zhang, K.: Your voice assistant is mine: how to abuse speakers
to steal information and control your phone. In: Proceedings of the 4th ACM Workshop on
Security and Privacy in Smartphones & Mobile Devices (SPSM), pp. 63–74 (2014). https://
doi.org/10.1145/2666620.2666623
8. Feng, H., Fawaz, K., Shin, K.G.: Continuous authentication for voice assistants. In: Proceed-
ings of the 23rd Annual International Conference on Mobile Computing and Networking,
pp. 343–355. Association for Computing Machinery, New York (2017). https://doi.org/10.
1145/3117811.3117823
9. Gonfalonieri, A.: How amazon alexa works? Your guide to natural language process-
ing (AI) (2018). https://towardsdatascience.com/how-amazon-alexa-works-your-guide-to-
natural-language-processing-ai-7506004709d3
10. Gong, Y., Yang, J., Huber, J., MacKnight, M., Poellabauer, C.: ReMASC: realistic replay attack
corpus for voice controlled systems. In: Proceedings of the Interspeech 2019, pp. 2355–2359
(2019). https://doi.org/10.21437/Interspeech.2019-1541
11. Kinnunen, T., Sahidullah, M., Delgado, H., Todisco, M., Evans, N., Yamagishi, J., Lee, K.A.:
The ASVspoof 2017 challenge: assessing the limits of replay spoofing attack detection. In:
Proceedings of the Interspeech 2017, pp. 2–6 (2017). https://doi.org/10.21437/Interspeech.
2017-1111
12. Makwana, D.: Amazon echo smart speaker (3rd gen) review (2020). https://www.mobigyaan.
com/amazon-echo-smart-speaker-3rd-gen-review
13. Matrix: Matrix Creator (2020). https://matrix-io.github.io/matrix-documentation/matrix-
creator/overview/
14. Roy, N., Hassanieh, H., Roy Choudhury, R.: Backdoor: making microphones hear inaudible
sounds. In: Proceedings of the 15th ACM Annual International Conference on Mobile Systems,
Applications, and Services (MobiSys), pp. 2–14 (2017). https://doi.org/10.1145/3081333.
3081366
15. Shen, S., Chen, D., Wei, Y.L., Yang, Z., Choudhury, R.R.: Voice localization using nearby wall
reflections. In: Proceedings of the 26th Annual International Conference on Mobile Computing
and Networking, MobiCom ’20. Association for Computing Machinery (2020). https://doi.org/
10.1145/3372224.3380884
16. Seeed Studio: ReSpeaker Core v2.0 (2020). http://wiki.seeedstudio.com/ReSpeaker_Core_
v2.0/
17. Tillman, M.: Google home max review: cranking smart speaker audio to the max
(2019). https://www.pocket-lint.com/smart-home/reviews/google/143184-google-home-max-
review-specs-price
18. Wang, S., Cao, J., He, X., Sun, K., Li, Q.: When the differences in frequency domain are
compensated: understanding and defeating modulated replay attacks on automatic speech
recognition. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and
Communications Security, CCS ’20, p. 1103–1119. Association for Computing Machinery
(2020). https://doi.org/10.1145/3372297.3417254
19. Yan, C., Long, Y., Ji, X., Xu, W.: The catcher in the field: A fieldprint based spoofing
detection for text-independent speaker verification. In: Proceedings of the 2019 ACM SIGSAC
Conference on Computer and Communications Security, CCS ’19, pp. 1215–1229. Association
for Computing Machinery, New York (2019). https://doi.org/10.1145/3319535.3354248
20. Yuan, X., Chen, Y., Zhao, Y., Long, Y., Liu, X., Chen, K., Zhang, S., Huang, H., Wang,
X., Gunter, C.A.: Commandersong: a systematic approach for practical adversarial voice
recognition. In: 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64.
USENIX Association, Baltimore (2018). https://www.usenix.org/conference/usenixsecurity18/
presentation/yuan-xuejing

21. Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., Xu, W.: Dolphinattack: inaudible voice
commands. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Com-
munications Security (CCS), pp. 103–117 (2017). https://doi.org/10.1145/3133956.3134052
22. Zhang, L., Meng, Y., Yu, J., Xiang, C., Falk, B., Zhu, H.: Voiceprint mimicry attack towards
speaker verification system in smart home. In: IEEE INFOCOM 2020 – IEEE Conference on
Computer Communications, pp. 377–386 (2020). https://doi.org/10.1109/INFOCOM41043.
2020.9155483
23. Zhang, L., Tan, S., Wang, Z., Ren, Y., Wang, Z., Yang, J.: Viblive: a continuous liveness
detection for secure voice user interface in IoT environment. In: ACSAC ’20: Annual Computer
Security Applications Conference, pp. 884–896. ACM, New York (2020). https://doi.org/10.
1145/3427228.3427281
Chapter 6
Traffic Analysis Based Misbehavior
Detection at Application Platform Layer

6.1 Introduction

At present, the smart home market is flourishing, and many manufacturers produce
and sell various smart home devices. To promote collaborative interaction among
smart devices produced by different manufacturers, the concept of the application
platform was proposed. The application platform is
equipped with a local gateway (e.g., base station) and cloud backend server to
enable devices from different vendors to access the same network. In the application
platform, devices are abstracted so that developers can design smart applications
without knowing the physical details of devices, so that devices from different
manufacturers can work together. For example, a smart application can monitor the
status of a device (e.g., a smart motion sensor) and trigger certain operations of
another device (e.g., turning on a smart light) when receiving notification of certain
events (e.g., a motion sensor detects a user’s motions). The scalability of the smart
platform framework has greatly inspired a large number of device manufacturers
and application developers to participate in the ecosystem. Famous application
platforms include Samsung SmartThings [15], Apple HomeKit [2], and so on.
However, with the popularization of smart application platforms, security issues
have become increasingly prominent. For example, in the Samsung SmartThings
platform studied in this chapter, Fernandes et al. [7] reveal design flaws that
allow smart applications (referred to as SmartApps in SmartThings) running in the
cloud backend to perform unauthorized operations on smart devices and to
eavesdrop on or even forge events generated by smart devices.
Existing solutions to SmartThings security, especially on the aspect of mis-
behaving SmartApp detection and prevention, mainly fall into three categories:
first, applying information flow control to confine sensitive data by modifying the
smart home platform [8]; second, designing a context-based permission system for
fine-grained access control [12]; and third, enforcing context-aware authorization
of SmartApps by analyzing their source code, annotations, and descriptions [25].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 135
Y. Meng et al., Security in Smart Home Networks, Wireless Networks,
https://doi.org/10.1007/978-3-031-24185-7_6

However, these existing solutions either require modification of the platform


itself [8, 25] or need to patch the SmartApps [12]. It is desirable to have a novel
approach that allows a third-party defender—other than the smart home platform
vendors, smart device manufacturers, and app developers—to monitor the smart
home apps without making any changes to the existing platform. However, without
access to the smart sensors, gateway devices, and cloud backend servers, the only
avenue that permits third-party monitoring is through the communication traffic,
which, however, is encrypted using industry standards. Whether one could monitor
the behavior of smart home apps from encrypted traffic remains an open research
question.
In this chapter, we present HoMonit, a system for monitoring smart home
apps from encrypted wireless traffic. We particularly demonstrate the concept by
implementing HoMonit to work with Samsung’s SmartThings framework and detect
the misbehaving SmartApps. At the core of HoMonit is a Deterministic Finite
Automaton (DFA) matching algorithm. Our intuition is that every smart home app’s
behavior follows a certain DFA model, in which each state represents the status
of the app and the corresponding smart devices, and the transitions between states
indicate interactions between the app and the devices.
To do so, HoMonit first extracts DFAs from the source code of the apps or the
text information embedded in their descriptions and user interfaces (UIs) of the
SmartThings Android app. HoMonit then leverages wireless side-channel analysis
to monitor the encrypted wireless traffic to infer the state transition of the DFA.
Our key insight is that the smart home traffic is particularly vulnerable to side-
channel analysis; the encrypted content can be inferred by observing the size and
interval of encrypted wireless packets. Then, HoMonit applies the DFA matching
algorithm to compare the inferred transition with the expected DFA transitions
of all installed SmartApps. If the DFA matching fails, with a high probability
the behavior of the SmartApp has deviated from the expectation—a misbehaving
SmartApp is detected. Our evaluation suggests that HoMonit can effectively detect
several types of SmartApp abnormal behaviors, including over-privileged accesses
and event spoofing.
Nevertheless, the wireless traffic side-channel leakage also allows an adversary
who is able to place a sniffing device in the proximity of a home to monitor the
interaction between SmartApps and the smart devices and conduct inference attacks.
These attacks severely compromise user privacy, including, but not limited to, (1)
daily routines, (2) home occupancy, and (3) health conditions. To address this
privacy concern, we further enhanced HoMonit by introducing an extra Privacy
Enhancement Module which dynamically injects dummy traffic into the smart home
network to obfuscate the communication patterns of the SmartApps, thwarting the
attacker's inference attacks on users' privacy. Meanwhile, the dummy traffic can be
easily filtered out from the captured traffic as it is injected by HoMonit itself. In this
way, HoMonit preserves the capability of detecting misbehaved SmartApps.
We implemented HoMonit and evaluated its effectiveness in detecting misbehaving
SmartApps. In total, 60 misbehaving SmartApps were developed by altering the
code of open-source SmartApps to perform the evaluation. The results suggest that
HoMonit detects misbehaving SmartApps effectively: it achieves an average recall
of 0.98 for SmartApps with over-privileged accesses and an average recall of 0.99
for SmartApps conducting event-spoofing attacks. We also evaluated the Privacy
Enhancement Module and showed that HoMonit could effectively ensure privacy
protection with an increased entropy while preserving the misbehavior detection
functionality.
The contributions of this chapter include the following:
• Novel techniques. We developed techniques for extracting DFAs from the source
code of the SmartApps or the UI of SmartThings’ mobile app and methods for
inferring SmartApp activities using wireless side-channel analysis.
• New systems. We designed, implemented, and evaluated HoMonit for detecting
misbehaving SmartApps in the SmartThings platform, which operates without
cooperation from the platforms, device vendors, or SmartApp developers.
• Open-source dataset. A dataset of 60 misbehaving SmartApps will become
publicly available, which can be used by researchers, vendors, and developers
to evaluate their security measures.
The remainder of this chapter is organized as follows. Section 6.2 introduces the
preliminaries about the smart application platform and misbehavior. Section 6.2.4
introduces the design motivation of HoMonit. Section 6.3 introduces each compo-
nent of HoMonit. We evaluate HoMonit, discuss its limitations, and conclude this
chapter in Sects. 6.4, 6.4.5, and 6.5, respectively.

6.2 Preliminaries and Motivation

6.2.1 Samsung SmartThings

As one of the most popular smart home platforms, Samsung SmartThings provides
an attractive feature favored by many device manufacturers and software developers,
which is the separation of intelligence from devices. In particular, it offers an
abstraction of a variety of lower-layer smart devices to software developers so
that the development of software programs (i.e., SmartApps) is decoupled from the
manufacture of the smart devices. In this way, the SmartThings platform fosters a
vibrant software market, which encourages third-party software developers to enrich
the diversity of home automation functionalities. It is worth noting that the number
of SmartApps in SmartThings has been changing. For example, from 2016 to 2017, more than
300 SmartApps were deleted either because security vulnerabilities were reported
or because users were not interested in them [16, 21]. Therefore, in this chapter,
we focus on the SmartApps supported by SmartThings in May 2018, including 133
device types and 181 SmartApps.

[Figure: the SmartThings cloud backend runs SmartApps and Device Handlers in Groovy sandboxes, performing subscription processing and capability checks; the SmartThings hub relays AES-encrypted traffic between smart devices and the cloud, while the SmartThings mobile app configures devices and installs SmartApps over SSL.]

Fig. 6.1 System architecture of SmartThings platform

[Figure: the smart device Samsung SmartThings Outlet maps to capabilities such as Switch, with commands Switch.on (turn on outlet) and Switch.off (turn off outlet), and PowerMeter, with attribute PowerMeter.power (device power consumption).]

Fig. 6.2 Relationship between device and capability

Section 1.2 introduces the overall architecture of the smart home. In this subsection,
we take Samsung SmartThings as an example to introduce how the application
platform integrates smart devices and user interfaces.
The architecture of the SmartThings platform is shown in Fig. 6.1. Smart devices
are the key building blocks of the entire SmartThings infrastructure. They are
connected to the hub with a variety of communication protocols, including ZigBee,
Z-Wave, and Wi-Fi. In SmartThings, the hub mediates the communication with all
connected devices and serves as the gateway to the SmartThings cloud, where device
handlers and SmartApps are hosted. Device handlers are virtual representations of
physical devices, which abstract away the implementation details of the protocols
for communicating with the devices. As shown in Fig. 6.2, device handlers specify
the capabilities of these devices.
Capabilities can have commands and attributes. Commands are methods for
SmartApps to control the devices. Attributes reflect properties or characteristics

Table 6.1 Protocols supported by SmartThings devices

Protocol | Supported devices
ZigBee   | 48/133 ≈ 36.1%
Z-Wave   | 58/133 ≈ 43.6%
Others   | 27/133 ≈ 20.3%

of the devices. For example, the smart device Samsung SmartThings Outlet has 9
capabilities, among which Switch and Power Meter are the most commonly used:
Switch enables control of the switch via two commands, on() and off(), while
Power Meter has one attribute, power, for reporting the device's power
consumption.
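As an illustration only (our own sketch, not any SmartThings SDK), the capability–command–attribute relationship of Fig. 6.2 can be modeled as:

```python
from dataclasses import dataclass, field

@dataclass
class Capability:
    """A capability groups commands (methods) and attributes (reported state)."""
    name: str
    commands: set = field(default_factory=set)
    attributes: set = field(default_factory=set)

# The two most commonly used capabilities of the Samsung SmartThings Outlet.
SWITCH = Capability("Switch", commands={"on", "off"})
POWER_METER = Capability("PowerMeter", attributes={"power"})

# A device handler exposes a device through the capabilities it implements.
OUTLET_CAPABILITIES = {c.name: c for c in (SWITCH, POWER_METER)}
```

This mirrors how SmartApps interact with devices: they invoke commands (e.g., on()) and subscribe to attribute changes (e.g., power).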

6.2.2 Communication Protocols

SmartThings supports a variety of communication protocols. In particular, ZigBee
and Z-Wave are typically characterized as protocols with low power consumption,
low data rates, and short communication range. As shown in Table 6.1, among
the 133 smart devices we have surveyed, ZigBee and Z-Wave are two dominant
wireless protocols in SmartThings device market, which together contribute to about
79.7% of market share.
• ZigBee. ZigBee is a wireless communication specification following the IEEE
802.15.4 standard [1]. ZigBee devices in SmartThings are on the 2.4 GHz
band with 250 kbps data rate. ZigBee supports encrypted and authenticated
communications. Encryption is enabled by default at the network layer with
128-bit AES-CCM* encryption mode. Application support sublayer (APS) also
allows optional encryption with a link key, which is a shared key between two
devices in the same personal area network (PAN).
• Z-Wave. Z-Wave is another popular low-power consumption communication
protocol [11]. Z-Wave is implemented by following the ITU-T G.9959 rec-
ommendation. It has different working frequencies in different regions. In the
United States, the Z-Wave devices are specified to work on the frequency of
908.4 MHz with 40 kbps data rate and 916 MHz with 100 kbps data rate. Z-Wave
supports strong encryption and authentication via Z-Wave S2 security solution,
which implements 128-bit AES encryption.

6.2.3 Application Misbehavior in Platform

Fernandes et al. [7] revealed several security-critical design flaws in the SmartThings
capability model and event subsystem. These design flaws may enable the
following SmartApp misbehaviors that can lead to security compromises.

Over-Privileged Accesses The capability model of SmartThings grants coarse-grained
capabilities to SmartApps: (1) a SmartApp that only needs a certain
command or attribute of a capability will always be granted the capability as a
whole [18] and (2) a SmartApp that is granted permission to some capabilities
of a device can gain access to all capabilities of the device. Hence, a malicious
SmartApp could be significantly over-privileged and take control of a whole device
even if it only asks for permission to access partial information.
For example, an auto-lock SmartApp which only requires the lock() command
of Lock capability can also get access to the unlock() command. Moreover, even
for benign SmartApps, it is still possible for the attackers to launch command
injection attacks to trick them into performing unintended actions. For example, in
WebService SmartApps [20], if the developers implement the HTTP endpoints
using the dynamic method invocation feature of Groovy, the SmartApp will be
vulnerable to command injection attacks [7].
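The risk of dynamic method invocation can be illustrated with a Python analog (Groovy's dynamic dispatch behaves similarly; the Outlet class and its method names below are invented for illustration):

```python
# Python analog of the dynamic-method-invocation flaw: an HTTP endpoint that
# dispatches on an attacker-controlled string can be coerced into calling any
# method of the device object, not just the intended commands.
class Outlet:
    def on(self):  return "on"
    def off(self): return "off"
    def _reset_firmware(self): return "reset!"   # never meant to be exposed

def handle_request(device, action):
    # Vulnerable pattern: dynamic invocation of a client-supplied name.
    return getattr(device, action)()

def handle_request_safe(device, action):
    # Safer pattern: an explicit allow-list of intended commands.
    allowed = {"on", "off"}
    if action not in allowed:
        raise ValueError("command not permitted")
    return getattr(device, action)()
```

An allow-list of intended commands closes the hole, while the vulnerable variant lets an attacker reach arbitrary methods.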
Event Spoofing The SmartThings framework does not protect the integrity of the
events, which allows event-spoofing attacks. An event object is a data object created
and processed in the SmartThings cloud, which contains information associated
with the event, such as the (128-bit) identifiers of the hub and the device, as well
as the state information. A malicious SmartApp with the knowledge of the hub
and device identifiers, which are easy to learn, can spoof an arbitrary event. The
event will be deemed legitimate by the SmartThings cloud and propagated to all
SmartApps that subscribe to the corresponding capability related to the event [7].
For example, an alarm panel SmartApp can raise a siren alarm when the CO detector
is triggered. However, an attacking SmartApp can spoof a fake event for the CO
detector, causing the alarm panel SmartApp to activate the siren alarm mistakenly.
In this chapter, we consider attackers who exploit these vulnerabilities to turn a
benign SmartApp into a malicious application whose behavior deviates from the
original design goal. This chapter assumes that SmartApps from SmartThings
Marketplace and SmartThings Public GitHub Repository [19] are trusted and can
be used to extract benign DFAs. Note that detecting vulnerabilities of SmartApps
by analyzing their source code is an important research topic [7, 25], which is worth
studying separately and is beyond the scope of this monograph. In this study, we do
not consider attacks against device hardware (e.g., smart devices or hubs) because
hardware-related attacks usually require physical access to these devices, which is
beyond the scope of this study.

6.2.4 Motivation: Monitoring SmartApp's Behaviors Based on Wireless Traffic

Given the aforementioned threats from misbehaving SmartApps [7], systems
for mitigating such threats are of practical importance. Previous studies have
proposed context-based permission systems [12], user-centered authorization and
enforcement mechanisms [25], and systems to enforce information flow control [8].
However, these solutions require either modification of the platform itself [8, 25]
or changes in the SmartApps [12]. A security mechanism that works on existing
platforms will be more practical as a business solution.
In this study, we propose a detection system, dubbed HoMonit, for detect-
ing misbehaving SmartApps in a non-intrusive manner by leveraging techniques
commonly used in wireless side-channel inference. HoMonit is inspired by two
observations. First, most communication protocols used in smart home environ-
ments are designed for a low transmission rate and reduced data redundancy
for low power consumption. Second, the wireless communications between the
hub and smart devices usually show unique and fixed patterns determined by
the corresponding smart devices and SmartApps. Therefore, after extracting the
working logic of SmartApps as Deterministic Finite Automatons (DFAs)—with the
help of code analysis (for open-source SmartApps) or natural language processing
techniques (for closed-source SmartApps)—an external observer can determine
which SmartApp is operating and which state this SmartApp is currently in by
monitoring only the metadata (e.g., the packet size, inter-packet timing) of the
encrypted wireless traffic. If a SmartApp deviates from its usual behavior, the pattern
of wireless traffic will also change, which can be utilized to detect misbehaving
SmartApps.
The capability of monitoring misbehaving SmartApps from encrypted traffic
enables a third-party defender—other than the smart home platform vendors,
smart device manufacturers, and SmartApp developers—to develop a smart home
anomaly detection system to detect misbehaving SmartApps at runtime. A major
advantage of a third-party defense mechanism is that no modification of the
protected platform is needed. HoMonit is designed to work without the need to
change the current SmartThings infrastructure, change the system software on the
hub or smart devices, or modify the SmartApps. HoMonit can work directly with
the existing SmartThings platform and is easily extensible to other platforms with
similar infrastructures.
We illustrate this idea using a concrete example: Brighten My Path is a SmartApp
for automatically turning on the outlet after a motion has been detected by the
motion sensor. We show the observed packet sizes of the communications between
the sensors and the hub in Fig. 6.3, in which the y-axis shows the packet sizes and
the x-axis shows the timestamps of the packets when they arrive. The SmartApp
subscribes to two capabilities, which include an attribute motion for capability
Motion Sensor and a command on() for capability Switch. In Fig. 6.3, motion.active
corresponds to packet sequence (54 ↑, 47 ↓) and switch.on corresponds to packet
sequence (50 ↓, 47 ↓, 47 ↓, 52 ↓), where ↓ means the packet is sent from the hub to
the device and ↑ from the device to the hub. The DFA consists of three states, which are connected by two
transitions (as shown in Fig. 6.3). If the events corresponding to motion.active and
switch.on are detected in a sequence, the DFA will transition from the start state to
the accept state. Thus, the behavior of the SmartApps can be inferred from the DFA
transitions: normal behavior sequences are always accepted by the DFA, whereas
abnormal ones are not.

[Figure: packet sizes (bytes) versus timestamps (s) of the traffic between the hub, the motion sensor, and the outlet, with the bursts corresponding to motion.active and switch.on marked; below, the three-state DFA S0 --motion.active--> S1 --switch.on--> S2 from the start state to the accept state.]

Fig. 6.3 A motivating SmartApp: Brighten My Path

[Figure: workflow of HoMonit: DFAs extracted from SmartApps are matched against the sniffed wireless traffic (ZigBee & Z-Wave) to detect over-privileged accesses and event spoofing, and dummy traffic is injected to output protected wireless traffic.]

Fig. 6.4 Workflow of HoMonit

6.3 System Design of Misbehavior Detection System

We present HoMonit, a system to monitor behaviors of smart devices using side-channel
information from the encrypted wireless traffic. The smart home network
is centralized by the hub, which connects to all of the smart devices via wireless
communications (e.g., ZigBee or Z-Wave) and to the cloud backend via the Internet.
HoMonit is equipped with multiple wireless interfaces to collect the wireless
packets, including both ZigBee and Z-Wave packets. In practice, eavesdropping
devices should be put near the hub to ensure that the wireless packets can be
correctly collected.

As illustrated in Fig. 6.4, the HoMonit system is comprised of three major
components. The SmartApp Analysis Module extracts the expected DFA logic of the
installed SmartApps from their source code. The Misbehavior Detection Module
identifies the misbehavior of the SmartApps by conducting side-channel inference
on the sniffed wireless traffic and comparing the inferred behavior with the expected
DFA models. Finally, considering that side-channel information may cause privacy
leakage, the Privacy Enhancement Module injects dummy traffic to thwart potential
privacy inference attacks.

6.3.1 DFA Building via SmartApp Analysis

The SmartApp Analysis Module aims to extract the expected behaviors of the
SmartApps. We utilize the Deterministic Finite Automaton (DFA) to characterize
the logic of SmartApps. We choose DFA to represent a SmartApp for two reasons:
(1) a SmartApp supervises a finite number of devices and (2) devices are driven into
a deterministic status by the SmartApp when a specific condition is satisfied. More
specifically, we formalize the SmartApp DFA as a 5-tuple M = (Q, Σ, δ, q0, F),
where Q is a finite set of states of the SmartApp; Σ is a finite set of symbols,
which correspond to attributes or commands of the capabilities; δ is the transition
function δ: Q × Σ → Q, where Q × Σ is the set of 2-tuples (q, a) with q ∈ Q and
a ∈ Σ; q0 is the start state; and F is the set of accept states.
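For illustration, the Brighten My Path example of Sect. 6.2.4 instantiates this 5-tuple as follows (a minimal Python sketch; the state names S0–S2 follow Fig. 6.3):

```python
# The 5-tuple M = (Q, Sigma, delta, q0, F) of the Brighten My Path DFA.
Q = {"S0", "S1", "S2"}                       # states
SIGMA = {"motion.active", "switch.on"}       # symbols: attributes and commands
DELTA = {                                    # transition function Q x Sigma -> Q
    ("S0", "motion.active"): "S1",
    ("S1", "switch.on"): "S2",
}
Q0 = "S0"                                    # start state
F = {"S2"}                                   # accept states

# Consuming motion.active in the start state moves the DFA to S1.
assert DELTA[(Q0, "motion.active")] == "S1"
```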
This chapter mainly focuses on open-source SmartApps. HoMonit performs
static analysis on their source code and automatically translates them into DFAs.
We analyzed 181 open-source SmartApps to build their DFAs from SmartThings
Public GitHub Repository [19]. All of them are official open-source SmartApps.
Building DFAs from closed-source SmartApps is out of this section's scope.
Since the open-source SmartApps are written in Groovy, to extract their logic,
we conducted a static analysis on the source code using AstBuilder [6]. Figure 6.5
shows an example of the source code of a SmartApp. HoMonit converts the source
code of the SmartApp into an Abstract Syntax Tree (AST) during the Groovy
compilation phase.
The translation from an AST to a DFA is completed in two steps. First, to
obtain the set of symbols (i.e., Σ of the DFA), HoMonit extracts the capabilities
requested by the SmartApp from the preferences block statement (Fig. 6.5, line 7).
Specifically, all available capabilities are first obtained from the SmartThings
Developer Documentation [18], and then the input method calls (Fig. 6.5, lines 9
and 13) of the preferences block statement are scanned to extract the capabilities
requested by the SmartApp. SmartApps use subscribe methods to request notifica-
tions when the device’s status has been changed. These notifications will trigger the
handler methods to react to these status changes. To further determine the specific
commands or attributes (i.e., symbols of the DFA), HoMonit scans the subscribe
methods and their corresponding commands or attributes (e.g., motion.active in the
subscribe method).

Fig. 6.5 A SmartApp code example: Smart Light

The second step is to extract the state transitions (i.e., δ) from subscribe and
handler methods. HoMonit starts from the subscribe method call in installed and
updated block. Each subscribe method in these blocks indicates one transition from
the start state to an intermediate state; by inspecting the corresponding handler
method, how the intermediate state will transition to other states can be determined:
in the example shown in Fig. 6.5, one transition (with switch.on as its symbol) moves
the DFA to an accept state. Complex handler methods may involve multiple states
and transitions before the DFA reaches the accept states. The set of states Q, start
state q0 , and accept states F in the DFA are automatically constructed according to
the transition function.
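A simplified sketch of this two-step extraction, using regular expressions as a lightweight stand-in for the AstBuilder-based AST analysis (the Groovy-like snippet below is modeled on Fig. 6.5, not taken from it, and a real extractor would resolve device variables to capability commands):

```python
import re

# A toy Groovy-like SmartApp, modeled on the Smart Light example of Fig. 6.5.
SRC = '''
preferences {
    input "motion1", "capability.motionSensor"
    input "switch1", "capability.switch"
}
def installed() {
    subscribe(motion1, "motion.active", motionActiveHandler)
}
def motionActiveHandler(evt) {
    switch1.on()
}
'''

def extract_dfa(src):
    # Step 1: requested capabilities from the preferences block -> alphabet.
    caps = re.findall(r'input\s+"[^"]+",\s*"capability\.(\w+)"', src)
    # Step 2a: subscribe() calls -> transitions out of the start state.
    subs = re.findall(r'subscribe\(\s*\w+,\s*"([\w.]+)",\s*\w+\s*\)', src)
    # Step 2b: commands invoked in handlers -> transitions toward accept states.
    cmds = [f"{d}.{c}" for d, c in
            re.findall(r'(\w+)\.(on|off|lock|unlock)\(\)', src)]
    delta, state = {}, 0
    for sym in subs + cmds:
        delta[(f"S{state}", sym)] = f"S{state + 1}"
        state += 1
    return caps, delta, "S0", {f"S{state}"}
```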

The DFAs for 150 out of 181 SmartApps were successfully constructed (82.9%).
The success rate is already very high, considering that some SmartApps are much
more complex than the one listed in Fig. 6.5. There were complex apps with over
28 states and 40 transitions in their DFAs in the dataset, and these DFAs could
all be successfully extracted and further used in detection. The DFAs of most popular
SmartApps could be successfully constructed. However, the DFA construction failed
on some SmartApps because they request capabilities that are not associated with
any device. Take a SmartApp Severe Weather Alert as an example. It only acquires
weather information from the Internet and sends weather alerts to the user's smartphone;
thus, a meaningful DFA cannot be constructed.

6.3.2 Misbehavior Detection Based on Wireless Traffic Analysis

6.3.2.1 Wireless Traffic Collection

HoMonit collects wireless traffic and conducts DFA matching according to the
traffic characteristics to identify the misbehavior of SmartApps. In this chapter,
HoMonit collects both ZigBee and Z-Wave traffic between the hub and the smart
devices, since these two standards are widely used by smart home devices as shown
in Table 6.1.
ZigBee Traffic Collection To sniff the ZigBee traffic, HoMonit employs a com-
mercial off-the-shelf ZigBee sniffer (i.e., Texas Instruments CC2531 USB Dongle
[24]) and an open-source software tool (i.e., 802.15.4 monitor [14]) to passively
collect the ZigBee traffic. ZigBee breaks the 2.4 GHz band into 16 channels,
where the SmartThings hub and its devices are on a fixed channel (0xe in our case).
We customized the 802.15.4 monitor to achieve real-time packet capturing. The
captured packets were dumped into a log file once per second.
Z-Wave Traffic Collection HoMonit adopts Universal Software Radio Peripheral
(USRP) hardware to collect Z-Wave packets and modifies an open-source software tool,
Scapy-Radio [5], to automatically record them. Figure 6.6 shows the Z-Wave traffic
collection framework in the GNU Radio software. It is worth noting that some Z-
Wave devices may communicate with the hub at different channel frequencies in
different modes (e.g., sleep or active mode). Take the alarm sensor Aeotec Siren
(Gen 5) as an example. The device communicates with the hub in sleep mode at
the frequency of 908.4 MHz with a transmission rate of 40 kbps but communicates
at the frequency of 916 MHz with a transmission rate of 100 kbps in the active
mode. As such, to monitor two channels simultaneously, we exploited two USRPs
working at the above two frequencies to capture all Z-Wave traffic. For example,
in order to capture a data packet with a transmission rate of 100 kbps, in addition
to adjusting the frequency of the receiver, it is necessary to change the cut-off
frequency and sampling rate of the filter (i.e., changing the Omega parameter in the
Clock Recovery MM block of the receiver to 8). After adjustment, the USRP can
normally monitor the communication of commercial Z-Wave devices.

Fig. 6.6 Workflow of Z-Wave traffic collection

6.3.2.2 Filtering Noise Traffic

During the event inference process, the collected wireless traffic contains packets
that are considered noise for event inference and must be filtered out. Noise
packets include the following types:
• Beacon packets. Beacon packets are mainly used for acknowledging data trans-
mission and maintaining established connections, which carry little side-channel
information. HoMonit discards ZigBee beacon packets and drops Z-Wave pack-
ets with no payload.
• Retransmission packets. Retransmission packets will be sent in cases of trans-
mission failure. In ZigBee, they can be identified by checking if two subsequent
packets share the same sequence number. In Z-Wave, retransmission packets can
be identified if two consecutive packets sent by the sending device are observed
without having a response packet in between. We also discard the retransmission
packets to avoid affecting inference performance.
• Unrelated traffic. Traffic from devices using other wireless standards (e.g., Wi-Fi
and Bluetooth) or from other networks is treated as unrelated traffic. To identify
traffic from targeted networks, ZigBee uses a unique identifier called Personal
Area Network Identifier (or PANID for short), while Z-Wave uses Home ID,
which denotes the ID that the Primary Controller assigns to the node during the
inclusion process [9]. HoMonit filters out collected traffic that has different
PANIDs or Home IDs from the specified ones.
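The three filtering rules can be sketched as follows (our own minimal illustration; the packet dictionary fields are invented, and a real implementation would parse them from the ZigBee/Z-Wave frame headers):

```python
def filter_noise(packets, network_id):
    """Drop beacon/no-payload, retransmitted, and unrelated packets.

    Each packet is a dict with illustrative keys:
      'pan_id'      -- ZigBee PANID (or Z-Wave Home ID)
      'seq'         -- frame sequence number
      'src'         -- sender address
      'payload_len' -- payload length in bytes
    """
    kept, last_seq = [], {}
    for p in packets:
        if p["pan_id"] != network_id:            # unrelated traffic
            continue
        if p["payload_len"] == 0:                # beacon/ACK-style, no payload
            continue
        if last_seq.get(p["src"]) == p["seq"]:   # retransmission: repeated seq
            continue
        last_seq[p["src"]] = p["seq"]
        kept.append(p)
    return kept
```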

6.3.2.3 Constructing Fingerprints for Device Events

After noise filtering, we are ready to exploit the side channel of wireless traffic to
infer DFA events. An event on the SmartThings platform can be a command that is
generated by the hub and sent to the devices or an attribute of the device that is
reported to the hub [17]. We formally denote an event as E_t^φ, which indicates
that the event is of type φ and is generated at time t. An event type φ is a 2-tuple
(d, e), where d is the device and e is a command sent to d or an attribute of d. We
denote the set of all event types as Φ.
Each event will trigger a sequence of wireless packets. We denote a wireless
packet as a quadruple f = (t, l, d_i, d_j), where f refers to the packet of length l
sent from device d_i to device d_j at time t. Here, d_i and d_j are represented using
the MAC addresses in ZigBee or node IDs in Z-Wave. Once an event is triggered, a
sequence of n packets sent between devices d_i and d_j at a specific time t can be
monitored during a short interval, which can be represented as S_t^{d_i↔d_j} =
(f_1, f_2, . . . , f_n). Note that either d_i or d_j is the hub because the SmartThings
framework dictates that all the devices communicate with the hub. If packets for
multi-hop communications are captured, HoMonit merges these consecutive packets
from multiple hops into a single one. Therefore, there is typically a one-to-one
mapping between an event E_t^φ and a sequence of packets, S_t^{d_i↔d_j}.
For a given device, we first obtain its different types of events by referring to its
open-source device handler. For each event type φ ∈ Φ, we manually trigger the
event and collect m samples (m = 50), denoted as S_φ = {S_1^φ, S_2^φ, . . . , S_m^φ},
where S_i^φ is a sequence of packets collected in one experiment when the event is
triggered. The fingerprint F_φ of event type φ is defined as

    F_φ = argmin_{S_i^φ ∈ S_φ} (1/|S_φ|) Σ_{∀S_j^φ ∈ S_φ} Dis(S_i^φ, S_j^φ),        (6.1)

where Dis(S_i^φ, S_j^φ) adopts the Levenshtein Distance [3] to measure the sequence
similarity between S_i^φ and S_j^φ, i.e., a small Dis(S_i^φ, S_j^φ) means a high
similarity between S_i^φ and S_j^φ. Table 6.2 illustrates the device fingerprints of all devices we
possess (including seven ZigBee devices and four Z-Wave devices, as listed in
Table 6.3).
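Equation (6.1) selects the medoid of the collected samples, i.e., the sample with the minimum average Levenshtein Distance to all others. A minimal sketch with a textbook edit-distance implementation (the sample traces below are invented):

```python
def levenshtein(a, b):
    """Classic edit distance between two packet-size sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fingerprint(samples):
    """Eq. (6.1): the sample with minimum average distance to all samples."""
    return min(samples,
               key=lambda s: sum(levenshtein(s, t) for t in samples) / len(samples))
```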

Table 6.2 Fingerprints for event types that are supported by 7 ZigBee devices and 4 Z-Wave devices (↑: device → hub; ↓: hub → device)

Samsung SmartThings Water Leak Sensor (Water Sensor), ZigBee
  water.wet:              54 ↑ 45 ↑
  water.dry:              54 ↑ 45 ↑
  temperature:            53 ↑ 45 ↑
Samsung SmartThings Motion Sensor (Motion Sensor), ZigBee
  motion.active:          54 ↑
  motion.inactive:        54 ↑
  temperature:            53 ↑ 45 ↑
Samsung SmartThings Outlet (Outlet), ZigBee
  switch.on:              50 ↓ 47 ↓ 47 ↓ 52 ↓
  switch.off:             50 ↓ 47 ↓ 47 ↓ 52 ↓
Samsung SmartThings Multipurpose Sensor 2016 (Multipurpose Sensor 2016), ZigBee
  contact.open:           54 ↑
  contact.closed:         54 ↑
  acceleration.active:    69 ↑ 65 ↑ 65 ↑ 65 ↑ · · ·
  acceleration.inactive:  occurs after event acceleration.active finishes
  temperature:            53 ↑
Samsung SmartThings Multipurpose Sensor 2015 (Multipurpose Sensor 2015), ZigBee
  contact.open:           54 ↑ 45 ↑
  contact.closed:         54 ↑ 45 ↑
  acceleration.active:    69 ↑ 65 ↑ 65 ↑ 65 ↑ · · ·
  acceleration.inactive:  occurs after event acceleration.active finishes
  temperature:            53 ↑ 45 ↑
Samsung SmartThings Arrival Sensor (Arrival Sensor), ZigBee
  beep:                   50 ↓ 45 ↓
  rssi:                   52 ↑
  presence.present:       57 ↑ 48 ↑ 45 ↑ 45 ↑ 49 ↑ 45 ↑ 50 ↑ 45 ↑ 50 ↑ 45 ↑ 50 ↑ 45 ↑
  presence.not present:   occurs after there is no periodic event rssi
Osram Lightify CLA 60 RGBW (Light), ZigBee
  switch.on:              50 ↓ 47 ↓
  switch.off:             50 ↓ 47 ↓
  illuminance:            53 ↓ 47 ↓
  setColorTemperature:    54 ↓ 47 ↓ 52 ↓ 47 ↓
  setColor:               50 ↓ 47 ↓ 54 ↓ 47 ↓ 52 ↓ 47 ↓ 52 ↓ 47 ↓
Power Monitor Switch TD1200Z1 (Switch), Z-Wave
  switch.on:              13 ↓ 12 ↓ 10 ↓
  switch.off:             13 ↓ 12 ↓ 10 ↓
Aeotec MultiSensor 6 (MultiSensor), Z-Wave
  motion.active:          14 ↑ 21 ↑
  motion.inactive:        14 ↑ 21 ↑
Aeotec Door/Window Sensor 6 (Door/Window Sensor), Z-Wave
  contact.open:           17 ↑ 17 ↑
  contact.closed:         17 ↑ 17 ↑
Aeotec Siren Gen 5 (Siren), Z-Wave
  alarm.siren:            13 ↓ 34 ↓ 11 ↓ 33 ↓ 11 ↓ 21 ↓ 11 ↓

6.3.2.4 Inferring Events

To infer the events based on the captured wireless packets, HoMonit first partitions
the traffic flow into a set of bursts. A burst is a group of network packets in which
the interval between any two consecutive packets is less than a pre-determined
burst threshold [23]. The packets in each burst are then ordered according to the
timestamps, and the burst is represented as S_t^{d_i↔d_j}. HoMonit matches the burst
with the fingerprints of each of the known events by calculating their Levenshtein
Distance, Dis(S_t^{d_i↔d_j}, F_φ). The event type with the smallest Levenshtein Distance
from the packet sequence is considered as the inferred event. For instance, the burst
(13 ↓ 34 ↓ 11 ↓ 33 ↓ 11 ↓ 21 ↓ 11 ↓) will be inferred as the event alarm.siren.
However, as shown in Table 6.2, there can be more than one event with exactly the
same pattern (e.g., packet sizes and directions). For instance, (50 ↓ 47 ↓ 47 ↓
52 ↓) may be inferred as either switch.on or switch.off of the device Samsung
SmartThings Outlet. To correctly identify the event, we classify them into two
categories:
• Events of the same device. One example is switch.on and switch.off of Samsung
SmartThings Outlet. The reason is that they are essentially the same event
message with different data fields. As these events typically exist in pairs, such
as on and off, active and inactive, wet and dry, we use one bit to trace the current
state of each device to differentiate these events.
• Events of different devices. One example is water.wet of Samsung SmartThings
Water Leak Sensor and contact.open of Samsung SmartThings Multipurpose
Sensor (2015). We first use other unique events to identify the device and then
determine the event type. For example, if we captured acceleration.active, then
we know that this device is Samsung SmartThings Multipurpose Sensor (2015).
Therefore, the event type must be contact.open instead of water.wet.
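The burst segmentation and fingerprint matching described above can be sketched as follows (a simplified illustration; the two fingerprints come from Table 6.2 with packet directions omitted, while the burst threshold and the per-device state bit are our own choices):

```python
def lev(a, b):
    """Compact Levenshtein Distance between two packet-size sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

# Fingerprints taken from Table 6.2 (directions omitted for brevity).
FINGERPRINTS = {
    ("Motion Sensor", "motion.active"): (54,),
    ("Outlet", "switch.on"): (50, 47, 47, 52),
}
# Paired events share a fingerprint; a per-device bit tells them apart.
PAIRED = {
    ("Motion Sensor", "motion.active"): ("Motion Sensor", "motion.inactive"),
    ("Outlet", "switch.on"): ("Outlet", "switch.off"),
}

def split_bursts(packets, gap=1.0):
    """Group (timestamp, size) packets into bursts separated by >= gap seconds."""
    bursts, cur = [], []
    for t, size in packets:
        if cur and t - cur[-1][0] >= gap:
            bursts.append(cur)
            cur = []
        cur.append((t, size))
    if cur:
        bursts.append(cur)
    return bursts

def infer_event(burst, device_state):
    """Nearest fingerprint by Levenshtein Distance, then stateful disambiguation."""
    seq = tuple(size for _, size in burst)
    best = min(FINGERPRINTS, key=lambda ev: lev(seq, FINGERPRINTS[ev]))
    if best in PAIRED:
        device = best[0]
        if device_state.get(device):      # device currently in its 'active/on' state
            best = PAIRED[best]
        device_state[device] = not device_state.get(device, False)
    return best
```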

6.3.2.5 SmartApp Misbehavior Detection Based on Event Fingerprint

To detect misbehaving SmartApps, in the Misbehavior Detection Module of
HoMonit, we propose the DFA matching algorithm. This module passively sniffs
the wireless traffic between the SmartThings hub and devices and tries to match it
with the working logic of the installed SmartApps. An alarm will be raised once the verification
fails.
Formally, the input of the algorithm is a sequence of events $E = \{E_{t_1}^{\phi_1}, E_{t_2}^{\phi_2}, \ldots, E_{t_n}^{\phi_n}\}$ that is inferred from the encrypted wireless traffic and the
DFA $M = (Q, \Sigma, \delta, q_0, F)$ of the target SmartApp. The algorithm transitions the
DFA from state $S_i$ to $S_{i+1}$ by consuming each of the events $E_{t_i}^{\phi_i}$ in the order they
appear in the sequence, if $E_{t_i}^{\phi_i} \in \Sigma \wedge \delta(S_i, E_{t_i}^{\phi_i}) = S_{i+1} \in Q$. Initially, $S_0 = q_0$. If
the sequence of events finally transitions the DFA into one of the accepted states,
that is, $S_n \in F$, the behavior of the SmartApp is accepted. Otherwise, a misbehaving
SmartApp is detected.
In this work, we focus on detecting the over-privilege and spoofing misbehavior of
SmartApps. Since we assume the original SmartApp is benign, we aim to detect the
misbehavior with the DFA matching method. The over-privilege problem arises from
coarse-grained permission control, which allows a SmartApp to raise extra device
events compared with the original SmartApp. This misbehavior can be detected when
new transitions are observed in the DFA. The spoofing problem occurs when a SmartApp
is notified of fake events generated by other malicious SmartApps. This misbehavior
can be detected when there is only partial DFA matching (skipped states resulting from
spoofed events in the cloud). We will give detection details in the following evaluation
section.
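The DFA matching above can be sketched as follows. This is an illustrative toy, with a hypothetical DFA for a Brighten-My-Path-style app; the real HoMonit DFAs are extracted from SmartApp source code or UIs:

```python
# Illustrative DFA matcher for SmartApp behavior verification.
# M = (Q, Sigma, delta, q0, F); delta maps (state, event) -> next state.

def check_smartapp(events, delta, q0, accepting):
    """Replay an inferred event sequence against the SmartApp's DFA.
    Returns (ok, reason): an unknown transition suggests over-privilege
    (an event the benign app never produces); ending outside the accepting
    states suggests skipped transitions, e.g. from event spoofing."""
    state = q0
    for ev in events:
        if (state, ev) not in delta:
            return False, f"unexpected transition on {ev!r} (possible over-privilege)"
        state = delta[(state, ev)]
    if state not in accepting:
        return False, "sequence ended in non-accepting state (possible spoofing)"
    return True, "behavior matches DFA"

# Hypothetical DFA for a "Brighten My Path" style app:
# motion.active -> switch.on -> motion.inactive -> switch.off -> back to start
delta = {
    ("idle", "motion.active"): "triggered",
    ("triggered", "switch.on"): "lit",
    ("lit", "motion.inactive"): "cooldown",
    ("cooldown", "switch.off"): "idle",
}
```

An over-privileged variant that issues switch.off after motion.active triggers the "unexpected transition" branch, while a spoofed run that jumps over intermediate events leaves the DFA in a non-accepting state.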

6.3.3 Privacy Enhancement Based on Dummy Traffic

In this section, we will discuss the potential privacy leakage arising from the
wireless side channel analysis and then present a privacy enhancement design.

6.3.3.1 Privacy Leakage due to Side-Channel Attacks

Side-channel information leakage is a double-edged sword. It not only enables our
detection of misbehaved SmartApps but also allows an attacker who can place a
wireless sniffer in the proximity of the smart devices to launch inference attacks to
learn the private information of the residents. In this section, we use three examples
to show that Zigbee/Z-Wave sniffing can reveal such private information:
• Daily routines. For example, Good Night is a SmartApp that changes its mode
when there is no human activity in the home after some time during the night. The
attacker can spy on the victim’s daily activities and learn his sleeping patterns by
monitoring this SmartApp behavior.
• Home occupancy. For example, a SmartApp named Vacation Lighting Director
  deceptively turns lights on/off while the residents are away. By inferring the
  existence of such a SmartApp, the attacker can infer whether the house is vacant.
• Health conditions. For example, Elder Care: Slip & Fall monitors the behavior
of aged people. Detecting such a SmartApp may leak age information about the
residents.

6.3.3.2 Privacy Enhancement

To enhance the privacy of the smart home environments while preserving the capa-
bility of detecting misbehaved SmartApps via side-channel analysis, we propose

Fig. 6.7 Illustration of privacy enhancement via decoys. (a) Real network: a central hub connects
4 smart devices. (b) Obfuscated network: each device (including the hub) has one decoy

a novel solution that, by intelligently injecting dummy packets into the wireless
network, obfuscates the transmission patterns of the target smart device and the hub.
As the dummy packets are generated by HoMonit, they can be easily filtered out in
its own analysis. We detail our design of the Privacy Enhancement
Module as follows.
Dummy Packet Generation As illustrated in Fig. 6.7, the key idea of the Privacy
Enhancement Module is to create fake identities for the real devices and simulate
other, non-related SmartApp activities using these fake device identities. To create
a fake identity for a device, dummy packets are generated while reusing the MAC
address of the real device. Each of these fake identities is called a decoy of the real
device. Given a pre-defined security parameter k, k × N decoys are generated to
provide the (k + 1)-anonymity for N real devices (the hub is included). To make the
generated dummy packets indistinguishable from the real ones, the dummy packets
are generated with the transmission patterns (e.g., inter-packet interval) of the real
packets, which can be learned during offline training. Furthermore, to make the
encrypted payload indistinguishable, the payload of the dummy packets is of the
same length as that of the real packets. After generating a dummy packet sent from
a decoy device to a decoy hub, another dummy packet is sent from the decoy hub to
the decoy device to simulate the response.
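A minimal sketch of this decoy traffic generation, under our own assumptions about how a device profile (MAC address plus the inter-packet intervals and payload sizes learned offline) is represented:

```python
import random

# Hypothetical sketch of decoy traffic generation: dummy packets reuse the
# real device's MAC address and replay its learned timing/size profile.

def make_decoy_schedule(real_profile, k, duration, rng=random.Random(0)):
    """real_profile: {"mac": ..., "intervals": [...], "sizes": [...]} learned
    during offline training. Returns a time-ordered list of dummy
    (time, mac, size, direction) tuples for k decoys; each decoy-to-hub
    packet is followed by a simulated hub response (matching length here
    for simplicity). Fixed seed only for reproducibility of the sketch."""
    schedule = []
    for _ in range(k):
        t = rng.uniform(0, max(real_profile["intervals"]))
        while t < duration:
            size = rng.choice(real_profile["sizes"])
            schedule.append((t, real_profile["mac"], size, "to_hub"))
            schedule.append((t + 0.01, real_profile["mac"], size, "from_hub"))
            t += rng.choice(real_profile["intervals"])
    return sorted(schedule)
```

Because every dummy packet carries the real MAC, matches real payload lengths, and follows the learned inter-packet intervals, the sniffer cannot separate the decoys from the real device by these features alone.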
Although there are some prior works that aim to distinguish the spoofed devices
from the real devices by leveraging the different Received Signal Strength (RSS)
values caused by the different distances [4, 13, 26], it is possible to make
them indistinguishable in terms of RSS. This can be achieved by placing the
Privacy Enhancement Module in the proximity of the real devices or adjusting the
transmission power of the USRPs to simulate the different transmission distances.
Maintaining Independent Sequence Numbers According to the ZigBee and Z-
Wave specifications, the sequence number of the packets sent from the same device
increases by 1 for each packet. Furthermore, in ZigBee packets and 916 MHz Z-
Wave packets, the size of the sequence number is 1 byte, ranging from 0x01
to 0xff; in 908.4 MHz Z-Wave packets, the sequence number takes only 4 bits,
ranging from 0x1 to 0xf. Therefore, to make the decoy devices and the real devices
indistinguishable, it is necessary to handle the sequence number properly. To do so,
we let each decoy maintain an independent sequence number and increment the
number for each packet it sends.
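The wraparound behavior can be captured by a tiny per-decoy counter (a sketch; the protocol labels "zwave_908" and "zigbee" are our own shorthand):

```python
# Per-decoy sequence counters with protocol-specific wraparound: ZigBee and
# 916 MHz Z-Wave use a 1-byte counter (0x01-0xff); 908.4 MHz Z-Wave uses a
# 4-bit counter (0x1-0xf). Illustrative sketch only.

class DecoySeq:
    def __init__(self, protocol):
        self.max = 0xF if protocol == "zwave_908" else 0xFF
        self.cur = 0  # next() yields 0x01 first

    def next(self):
        self.cur = self.cur % self.max + 1  # wraps back to 0x01 after max
        return self.cur
```

Each decoy owns one such counter, so its packet stream shows the same monotonically increasing, wrapping sequence numbers as a real device.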
Privacy Analysis of the Privacy Enhancement Module In privacy inference
attacks, the attacker will leverage the side-channel leakage to infer (1) the devices
from which the observed traffic is generated, (2) the events that are associated with
the devices, and (3) the SmartApps that communicate with the devices with these
events. The decoy devices introduced by the Privacy Enhancement Module enhance
the privacy of the smart home environment by obfuscating the first step of the attacks
and hence fundamentally thwart the inference attacks. To evaluate the effectiveness
of our defense, we measure the entropy of each device to quantify the inference
difficulty.
For simplicity of presentation, we consider the case in which we have
only one smart device $s_0$; the analysis can be easily extended to the case of multiple real
devices. The Privacy Enhancement Module deploys $k$ decoys $S = \{s_1, s_2, \ldots, s_k\}$
for obfuscation. Therefore, the attacker will observe $k + 1$ devices $S^+ = \{s_0\} \cup S$
in the network. In one time unit (e.g., a day), we assume that the real device $s_0$
generates a sequence of $w_0$ events $E = \{e_1, e_2, \ldots, e_{w_0}\}$ and that the decoy device $s_i$
generates a sequence of $w_i$ events $E^i = \{e_1^i, e_2^i, \ldots, e_{w_i}^i\}$. Let a random variable
$X \in S^+$ represent a random process in which the attacker sniffs the wireless traffic
at a random time of the day and captures packets that correspond to an event $e_X$
generated by one of the devices in $S^+$. We denote $X = s_i$ when $e_X \in E^i$. Therefore,
$P(X = s_i) = P(e_X \in E^i) = w_i / \sum_{i=0}^{k} w_i$. The device entropy is defined as
follows:


$$\varepsilon(S^+) = -\sum_{i=0}^{k} P(X = s_i) \log_2 P(X = s_i). \qquad (6.2)$$

From this equation, the level of obfuscation is determined not only by the number
of decoy devices that the Privacy Enhancement Module introduces but also by the
number of events generated by the decoys and the real devices. The entropy $\varepsilon(S^+)$
reaches its maximum value when $w_i / \sum_{i=0}^{k} w_i = 1/(k+1)$ for all $i$ and its minimum
value when the events from the real device are dominant (i.e., $P(X = s_0) = 1$). We
will empirically evaluate the effectiveness of the Privacy Enhancement Module in
the next section.
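Equation (6.2) is a routine Shannon entropy over the per-device event shares; as a sketch:

```python
from math import log2

# Device entropy from Eq. (6.2): event_counts = [w_0, w_1, ..., w_k],
# the number of events generated by the real device and each decoy.

def device_entropy(event_counts):
    total = sum(event_counts)
    return -sum(w / total * log2(w / total) for w in event_counts if w > 0)
```

With one real device and $k$ equally active decoys, the entropy reaches the upper bound $\log_2(k+1)$; when the real device's events dominate, it approaches 0.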

6.4 Evaluation

Hardware Platform To evaluate the effectiveness and efficiency of HoMonit, we
built a prototype system with off-the-shelf hardware: a laptop equipped with
wireless sniffer interfaces, including a Texas Instruments CC2531 USB Dongle for
ZigBee and two USRPs for Z-Wave, as shown in Fig. 6.8a and b. The distance
between the SmartThings hub and HoMonit was about 2 m. As listed in Table 6.3,
we chose 30 SmartApps from the SmartThings Public GitHub Repository [19],
which interact with, in total, 7 ZigBee devices and 4 Z-Wave devices, as shown in
Fig. 6.8c and d. The devices were located less than 10 m away from the hub within
a room of 200 m2 .
Evaluation Metrics We choose true positive rate (TPR), true negative rate (TNR),
precision, recall, and F1-score as the evaluation metrics. We define a positive
as misbehavior. Thus, TPR and TNR are equal to the TRR and TAR in the voice
liveness detection task, respectively. The precision of the inference is defined as
the fraction of correctly inferred events over all inferred events; the recall of the
inference is defined as the fraction of successfully inferred events over all events
that have been triggered. The F1-score is the harmonic mean of precision
and recall.
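Under these definitions (positive = misbehavior), the metrics follow directly from the raw counts; a sketch with hypothetical counts:

```python
# Evaluation metrics from raw counts (tp/fp/tn/fn); positive = misbehavior.

def metrics(tp, fp, tn, fn):
    tpr = tp / (tp + fn)             # recall with respect to misbehavior
    tnr = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)  # harmonic mean
    return {"TPR": tpr, "TNR": tnr, "precision": precision, "F1": f1}
```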

6.4.1 Micro-benchmark: Inference of Events and SmartApps

In this subsection, we evaluate the accuracy of inferring the SmartApps installed
in the smart home environment from the sniffed wireless traffic. Although it is
not the design goal of HoMonit (because we already assume the knowledge of
the installed SmartApps), the accuracy of SmartApps inference measures the basic
capability of DFA construction and DFA matching. Therefore, we use a set of
SmartApp inference tests as micro-benchmarks. We also discuss the impact of a few
key parameters of HoMonit, including the burst threshold and sniffer distance and
wireless obstacles, on the accuracy of SmartApp inference. The evaluation metrics
are precision, recall, and F1-score.

Fig. 6.8 Wireless sniffers and smart devices. (a) ZigBee sniffer: TI CC2531 USB Dongle. (b)
Z-Wave sniffer: two USRPs. (c) Seven tested ZigBee devices. (d) Four tested Z-Wave devices

Table 6.3 SmartApps used in the evaluation

SmartApp                  Device abbreviation                      Protocol  Precision  Recall  F1-score
Lights Off When Closed    Light, Multipurpose Sensor 2015          ZigBee    1.00       0.90    0.95
Turn It On When It Opens  Outlet, Multipurpose Sensor 2016         ZigBee    1.00       0.95    0.97
Darken Behind Me          Light, Motion Sensor                     ZigBee    1.00       0.95    0.97
Let There Be Light        Light, Multipurpose Sensor 2015          ZigBee    1.00       0.95    0.97
Monitor On Sense          Outlet, Multipurpose Sensor 2016         ZigBee    1.00       0.80    0.89
Big Turn Off              Outlet                                   ZigBee    1.00       1.00    1.00
Big Turn On               Outlet                                   ZigBee    1.00       1.00    1.00
Presence Change Push      Arrival Sensor                           ZigBee    1.00       1.00    1.00
Door Knocker              Multipurpose Sensor 2016                 ZigBee    1.00       0.95    0.97
Let There Be Dark         Light, Multipurpose Sensor 2015          ZigBee    1.00       0.80    0.89
Flood Alert               Water Sensor                             ZigBee    1.00       1.00    1.00
Turn It On When I'm Here  Outlet, Arrival Sensor                   ZigBee    1.00       1.00    1.00
The Gun Case Moved        Multipurpose Sensor 2015                 ZigBee    1.00       1.00    1.00
It Moved                  Multipurpose Sensor 2016                 ZigBee    1.00       1.00    1.00
Light Follows Me          Light, Motion Sensor                     ZigBee    1.00       0.95    0.97
Undead Early Warning      Light, Multipurpose Sensor 2016          ZigBee    1.00       0.90    0.95
Cameras On When I'm Away  Outlet, Arrival Sensor                   ZigBee    1.00       0.95    0.97
Brighten My Path          Light, Motion Sensor                     ZigBee    1.00       1.00    1.00
Dry The Wetspot           Multipurpose Sensor 2016, Water Sensor   ZigBee    1.00       0.95    0.97
Curling Iron              Arrival Sensor, Outlet, Motion Sensor    ZigBee    1.00       1.00    1.00
Big Turn On               Switch                                   Z-Wave    1.00       0.90    0.95
Brighten My Path          MultiSensor, Switch                      Z-Wave    1.00       0.85    0.92
Darken Behind Me          MultiSensor, Switch                      Z-Wave    1.00       0.85    0.92
Forgiving Security        MultiSensor, Siren, Switch               Z-Wave    1.00       0.90    0.95
Let There Be Dark         Door/Window Sensor, Switch               Z-Wave    1.00       0.95    0.97
Let There Be Light        Door/Window Sensor, Switch               Z-Wave    1.00       0.95    0.97
Light Follows Me          MultiSensor, Switch                      Z-Wave    1.00       0.90    0.95
Lights Off When Closed    Door/Window Sensor, Switch               Z-Wave    1.00       0.95    0.97
Smart Security            MultiSensor, Siren, Door/Window Sensor   Z-Wave    1.00       1.00    1.00
Turn It On When It Opens  Door/Window Sensor, Switch               Z-Wave    1.00       0.95    0.97


Fig. 6.9 Micro-benchmark: accuracy of event and SmartApp inference. (a) Evaluation of burst
threshold. (b) Impact of sniffer distance and wireless obstacles

6.4.1.1 Determining the Burst Threshold

The burst threshold is a parameter used to cluster captured wireless packets for the
same event, which directly impacts the effectiveness of SmartApp inference. We
performed the following experiments: we randomly selected 4 ZigBee devices and
another 4 Z-Wave devices. We manually triggered each event type 50 times on each
of the 8 devices. The time intervals between two consecutive events were 3–10 s.
We measured the accuracy of SmartApp inference when the burst threshold was
selected as integer values from 0 to 10 s.
As shown in Fig. 6.9a, the F1-score of event inference reaches its maximum
when the burst threshold is 1 s. This is because a smaller burst threshold splits
the packets belonging to the same event, causing more events to be inferred than
were actually triggered, whereas a larger threshold merges packets from consecutive
events, causing some events to be missed by the detector. Therefore, in the
remainder of our evaluation, the burst threshold was chosen as 1 s.

6.4.1.2 SmartApp Inference Accuracy

We chose 20 SmartApps that work with ZigBee devices and 10 SmartApps
that connect to Z-Wave devices (listed in Table 6.3). Each SmartApp is invoked by
manually triggered events 20 times. During the experiment, the sniffer was placed
2 m away from the hub, and the burst threshold was set to 1 s. Table 6.3 contains
the evaluation results for each individual SmartApp. Similar to event inference,
the precision is defined as the number of correctly inferred SmartApp invocations
over the total number of inferences made; the recall is defined as the number
of successfully inferred SmartApp invocations over the 20 invocations of each
SmartApp. It is worth noting that the precisions for all SmartApps are 1.00, while
the recalls are sometimes lower than 1.00. An average F1-score of 0.98 for ZigBee
SmartApps and an average F1-score of 0.96 for Z-Wave SmartApps were achieved.
Among all 30 tested SmartApps, the F1-scores of 26 are at least 0.95 (see Table 6.3).
This shows that HoMonit can accurately capture the working logic of SmartApps
through DFA matching. It is also important to point out that the major factors
contributing to false inferences are packet loss, unrelated wireless traffic,
and traffic congestion splitting a burst.

6.4.1.3 Impact of Distance and Wireless Obstacles

In practice, the effectiveness of SmartApp inference may be affected by environmental
conditions, such as the distance between the sniffer and the devices and
various wireless obstacles (e.g., walls) that block the wireless signals. Thus, we
evaluated the effectiveness of SmartApp inference with different distances and
wireless obstacles: (1) 2 m without walls, (2) 5 m with 1 wall, and (3) 10 m with 2
walls. As shown in Fig. 6.9b, although the recalls of the inference drop with longer
distance and more wireless obstacles, in all three cases, the recalls are above 0.88;
the precisions are all 1.00, and the F1-scores are all above 0.94.

6.4.2 Detection of Over-Privileged Accesses

We first developed over-privileged versions of the original benign SmartApps
by adding malicious code to cause unintended operations. We used modified
SmartApps for evaluation because there is no existing public malware dataset for the
SmartThings platform. We developed the misbehaving SmartApps following [7].
HoMonit will detect any malware that has over-privilege or spoofing problems. For
example, a SmartApp named Brighten My Path, which is used to turn the light on
when motion is detected, only requires the on() command of the switch capability according
to its description. We developed an over-privileged version of this SmartApp in
either of two ways: (1) illegally obtaining access to the off() command and (2)
illegally gaining access to all the capabilities of the devices for which the user grants
the SmartApp only the switch capability.
We selected 20 SmartApps that work with different categories of ZigBee devices
and 10 SmartApps for Z-Wave devices (listed in Table 6.3). By following the
above example, a total of 30 over-privileged SmartApps were developed as the
misbehaving SmartApps for evaluation. Therefore, the dataset used in our evaluation
contains 30 benign SmartApps and 30 misbehaving SmartApps. We recruited 3
volunteers to simulate residents of the home. For each SmartApp (including both the
benign and over-privileged versions), the related devices were manually triggered
20 times by the volunteers for 20 min. During this period, HoMonit continuously
monitors the wireless channel and detects the misbehaviors in real time.
Table 6.4 Monitoring result of misbehavior occurrence in SmartApps


ZigBee SmartApp Z-Wave SmartApp
Over-privileged accesses Event spoofing Over-privileged accesses Event spoofing
TPR 0.98 0.99 0.98 0.99
TNR 0.95 0.95 0.92 0.92

As shown in Table 6.4, the average TPR (over 40 ZigBee SmartApps) is 0.98 in
detecting over-privileged accesses, with a standard deviation of 0.03; the average
TNR of ZigBee SmartApps is 0.95, with a standard deviation of 0.07. The detection
of Z-Wave SmartApps achieves similar TPR and TNR, which are 0.98 and 0.92,
respectively. The major reasons for failed test cases are packet loss and occasional
unexpected wireless traffic that influences the event inference. Besides, accidental
signal reception delay can break the consistency of frames for a device event, which
may result in a false alarm for normal SmartApps.

6.4.3 Detection of Event Spoofing

We first developed event-spoofing versions of the benign SmartApps by adding
malicious code to cause unintended operations. Specifically, the attackers exploit
the insufficient event protection of SmartThings to spoof a physical device event
and trigger the SmartApps which subscribe to this event. For example, Flood Alert
is a SmartApp that triggers a siren alarm when the water sensor detects the wet state.
In SmartThings, each connected device is assigned a 128-bit device identifier when
it is paired with a hub. Once a SmartApp acquires the identifier for a device, it can
spoof all the events of that device without possessing any of the capabilities that the
device supports. By imitating the attacker, we raised a fake water sensor event in
the cloud with a malicious SmartApp, causing the flood alert SmartApp to react and
raise an alert.
In this experiment, we developed 30 misbehaving SmartApps which spoofed the
device events by modifying the same set of 30 benign SmartApps (see Table 6.3).
We performed an experimental evaluation on detecting the event-spoofing attacks
by following similar procedures as the previous section. As shown in Table 6.4, the
average TPR (over 40 ZigBee SmartApps) is 0.99 in detecting event spoofing; the
average TNR of ZigBee SmartApps is 0.95. The detection of Z-Wave SmartApps
achieves a similar TPR and a slightly lower TNR, which are 0.99 and 0.92,
respectively.
Fig. 6.10 Evaluations on the Privacy Enhancement Module: entropy of each device
(door sensor, motion sensor, switch, siren) versus the number of decoys, together
with the theoretical upper bound

6.4.4 Evaluation of the Privacy Enhancement Module

To evaluate the effectiveness of the Privacy Enhancement Module, we performed the
following experiments. The attacker adopted a TI CC2531 USB Dongle for ZigBee
sniffing and two USRPs for Z-Wave sniffing. These sniffers were positioned outside
the house, 10 m away from the hub. The Privacy Enhancement Module generates
dummy packets with 3 USRPs for obfuscation: one for ZigBee packets generation
and another two for Z-Wave packets. We used 4 smart devices to communicate with
the hub. We introduced 6 different types of decoy devices; each device may have 0
to 6 decoys.
Figure 6.10 shows the entropy for each device with a different number of decoys.
The entropy for each device with no decoy is 0. With one decoy for each device, the
entropy increases to 1.0. The entropy increases as the number of decoys increases
and reaches 2.0 when the decoy number is 6. The line in Fig. 6.10 represents the
theoretical upper bound of the entropy estimation. In the experiments, since the
types of decoy devices are randomly generated, the empirical entropy is lower than
the theoretical upper bound in practice. With the increased entropy, it is difficult
for the attacker to accurately infer the smart devices used in the home; SmartApp
inference based on inaccurate smart device inference is even more challenging.

6.4.5 Discussions

6.4.5.1 Generality and Applicability

Though HoMonit mainly investigates the SmartThings platform, the presented
approach can potentially be applied to other IoT systems. This is because, in the IoT
environment, most devices are power-constrained, and the employed wireless
protocols are lightweight. These lightweight protocols are typically designed for low
transmission rates and reduced data redundancy to achieve low power consumption,
and thus inherently suffer from a low-entropy issue. This feature gives HoMonit
the opportunity to extract wireless fingerprints for smart devices by analyzing the
packet size and timing information.
We also investigate IFTTT [10], an open platform compatible with
SmartThings, to further demonstrate the applicability and generality of HoMonit
in terms of both wireless fingerprint capture and DFA building.
To capture the wireless fingerprints in IFTTT, we develop an Applet (a SmartApp in
IFTTT) that automatically turns on/off the Samsung SmartThings Outlet via the ZigBee
protocol, as shown in Fig. 6.8c. It is shown that the events in IFTTT present the
same wireless fingerprints as in SmartThings (e.g., 50 ↓, 47 ↓, 47 ↓, 52 ↓ for
switch.on or switch.off ). The reason is that IFTTT employs the same lower-layer
protocols as SmartThings, and the wireless traffic patterns are not affected by the
upper-layer platforms.

6.4.5.2 Generating DFA from Benign SmartApps

The core idea of HoMonit is to compare the SmartApp activities inferred from
the encrypted traffic with the expected behaviors dictated by their source code
or UI. Therefore, acquiring the DFA of the benign version of the SmartApp (i.e.,
the groundtruth DFA) is critical for the successful deployment of HoMonit. The
simplest way to obtain such groundtruth DFAs is to download them from the official
app market, assuming the market operator has properly vetted all
published SmartApps. Otherwise, a trustworthy third party must step in to vouch for
benign apps, which will help bootstrap HoMonit.

6.4.5.3 Double-Sending Attacks

Because some device events may have the same wireless fingerprints, such as
switch.on and switch.off of Samsung SmartThings Outlet (see Table 6.2), HoMonit
has to keep track of the current state of the device, which can be used in turn to infer
the content of the event. However, a potential attack strategy is that a SmartApp may
intentionally send the same commands twice to mislead HoMonit. We call this type
of attack a double-sending attack. For example, a misbehaving SmartApp may send
the command switch.off twice, hoping that they will be confused with a sequence
of switch.off and switch.on. However, in reality, this double-sending attack does not
work as the communication protocol of SmartThings devices is designed to deal
with duplicated messages. We performed the following experiments: (1) two events
[switch.on and switch.off ] were sent by the SmartApp and (2) two events [switch.on
and switch.on] were sent by the SmartApp. In both experiments, the initial state
of the Outlet was set as off. The first experiment represents a normal case, and the
second represents a double-sending attack. As shown in Table 6.5, the collected
traffic patterns are different: the packets in the first case are (50 ↓, 47 ↓, 47 ↓
, 52 ↓), followed by (50 ↓, 47 ↓, 47 ↓, 52 ↓). In comparison, those in the second
Table 6.5 Wireless fingerprints in normal and double-sending attack scenarios

Case    Event            Fingerprint
Normal  1st: switch.on   50 ↓ 47 ↓ 47 ↓ 52 ↓
        2nd: switch.off  50 ↓ 47 ↓ 47 ↓ 52 ↓
Attack  1st: switch.on   50 ↓ 47 ↓ 47 ↓ 52 ↓
        2nd: switch.on   50 ↓ 47 ↓

are (50 ↓, 47 ↓, 47 ↓, 52 ↓), followed by (50 ↓, 47 ↓). We speculate that this is
because, under double-sending attacks, when the hub receives the second switch.on
command from the SmartApp, as the hub knows the outlet's status (on), it regards
this command as duplicated and alters the message sent to the outlet (Table 6.5).
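This speculated deduplication can be mimicked with a toy model, purely for illustration (the real SmartThings hub logic is not public, and the fingerprints below are taken from the observations above):

```python
# Toy model of the speculated hub-side deduplication: a repeated command
# that matches the device's known state is treated as a duplicate, so the
# hub-to-device exchange is shortened (50 ↓ 47 ↓ instead of the full
# 50 ↓ 47 ↓ 47 ↓ 52 ↓). Entirely illustrative, not SmartThings code.

FULL = ["50v", "47v", "47v", "52v"]
SHORT = ["50v", "47v"]

def hub_handle(command, device_state, outlet="outlet"):
    """Return the wireless fingerprint a sniffer would observe."""
    want_on = command == "switch.on"
    if device_state.get(outlet) == want_on:   # duplicate of current state
        return SHORT
    device_state[outlet] = want_on
    return FULL
```

Under this model the double-sending attack is visible to HoMonit: the second, duplicated command yields the shorter fingerprint rather than the full one.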

6.4.5.4 Alerting Users After Detection

When detecting any misbehavior of a specific SmartApp, HoMonit can alert the
users simply through text message or work together with existing home safety mon-
itoring tools (e.g., Smart Home Monitor [22] in SmartThings) to take immediate
actions. For example, HoMonit can generate different alerts based on the detected
scenarios, such as home unoccupied, occupied, or disarmed. In addition, HoMonit
can serve as a building block for enforcing user-centric [25] or context-based [12]
security policies and integrate with these previously proposed systems to interact
with users.

6.5 Summary

In this chapter, we present HoMonit, an anomaly detection system for smart home
platforms to detect misbehaving SmartApps. HoMonit leverages the side-channel
information leakage in the wireless communication channel—packet size and inter-
packet timing—to infer the type of communicated events between the smart devices
and the hub and then compares the inferred event sequences with the expected
program logic of the SmartApps to identify misbehavior. Key to HoMonit includes
techniques to extract the program logic from SmartApps’ source code or the
user interfaces of SmartThings’ mobile app and automated DFA construction and
matching algorithms that formalize the anomaly detection problem.

References

1. Alliance, Z.: Zigbee specification (2012). http://www.zigbee.org/wp-content/uploads/2014/11/
docs-05-3474-20-0csg-zigbee-specification.pdf
2. Apple: Homekit (2018). https://www.apple.com/ios/home/

3. Black, P.: Levenshtein distance. In: Dictionary of Algorithms and Data Structures (2008)
4. Chen, Y., Trappe, W., Martin, R.P.: Detecting and localizing wireless spoofing attacks. In:
IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications
and Networks (SECON) (2007)
5. Cybertools: Scapy radio (2018). https://bitbucket.org/cybertools/scapy-radio/src
6. D’Arcy, H.: Astbuilder (2018). http://docs.groovy-lang.org/next/html/gapi/org/codehaus/
groovy/ast/builder/AstBuilder.html
7. Fernandes, E., Jung, J., Prakash, A.: Security analysis of emerging smart home applications.
In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 636–654 (2016). https://doi.org/
10.1109/SP.2016.44
8. Fernandes, E., Paupore, J., Rahmati, A., Simionato, D., Conti, M., Prakash, A.: FlowFence:
practical data protection for emerging IoT application frameworks. In: USENIX Security
Symposium (USENIX Security) (2016)
9. Honeywell (2013). http://library.ademconet.com/MWT/fs2/L5210/Introductory-Guide-to-Z-
Wave-Technology.pdf
10. IFTTT Inc. (2018). https://ifttt.com/
11. JFR, ABR and NOBRIOT: Z-wave command class specification (2016). http://z-wave.
sigmadesigns.com/wp-content/uploads/2016/08/SDS12657-12-Z-Wave-Command-Class-
Specification-A-M.pdf
12. Jia, Y.J., Chen, Q.A., Wang, S., Rahmati, A., Fernandes, E., Mao, Z.M., Prakash, A.:
ContexIoT: Towards providing contextual integrity to appified IoT platforms. In: The Network
and Distributed System Security Symposium (NDSS) (2017)
13. Jokar, P., Arianpoo, N., Leung, V.C.M.: Spoofing Detection in IEEE 802.15.4 Networks Based
on Received Signal Strength. Elsevier, Amsterdam (2013)
14. mitshell: 802.15.4 monitor (2018). https://github.com/mitshell/CC2531
15. Samsung: SmartThings (2021). https://www.smartthings.com
16. Schaller, K.: List of all officially published apps from the more category of smart setup in the
mobile app (2015). https://community.smartthings.com/t/list-of-all-officially-published-apps-
from-the-more-category-of-smart-setup-in-the-mobile-app-deprecated/13673
17. SmartThings: Capabilities reference (2018). http://docs.smartthings.com/en/latest/capabilities-
reference.html
18. SmartThings: Smartthings architecture (2018). http://docs.smartthings.com/en/latest/
architecture/index.html
19. SmartThings: SmartThings public GitHub repo (2018). https://github.com/
SmartThingsCommunity/SmartThingsPublic
20. SmartThings: Web services SmartApps (2018). http://docs.smartthings.com/en/latest/
smartapp-web-services-developers-guide/overview.html
21. SmartThings: What SmartApps are being retired from the marketplace? (2018). https://support.
smartthings.com/hc/en-us/articles/115003072406-What-SmartApps-are-being-retired-from-
the-Marketplace
22. Samsung SmartThings: Smart home monitor (2018). https://support.smartthings.com/hc/en-
us/articles/205380154
23. Taylor, V.F., Spolaor, R., Conti, M., Martinovic, I.: AppScanner: automatic fingerprinting of
smartphone apps from encrypted network traffic. In: IEEE European Symposium on Security
and Privacy (EuroS&P) (2016)
24. Texas Instrument: CC2531: system-on-chip solution for IEEE 802.15.4 and ZigBee applica-
tions (2018). http://www.ti.com/product/CC2531
25. Tian, Y., Zhang, N., Lin, Y.H., Wang, X., Ur, B., Guo, X., Tague, P.: SmartAuth: user-centered
authorization for the Internet of Things. In: USENIX Security Symposium (USENIX Security)
(2017)
26. Yang, J., Chen, Y., Trappe, W.: Detecting spoofing attacks in mobile wireless environments.
In: IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications
and Networks (SECON) (2009)
Chapter 7
Conclusion and Future Directions

7.1 Conclusion on Security and Privacy in Smart Home

This monograph focuses on the security issues in the smart home and follows the
research thread of “mobile device—voice interface—application platform.” Its main
contributions are described below.
At the terminal device layer, this monograph proposes a cross-layer privacy attack and defense scheme for smartphones and other smart terminals. It first reveals the relationship between dynamic changes of the wireless signal at the physical layer and the user's keystroke input on mobile terminal devices. Building on this observation, we propose WindTalker, a novel attack mechanism that combines side-channel information from both the physical layer and the network traffic layer to infer the user's mobile payment password. Whereas traditional attacks must explicitly deploy sniffer devices around the target user or compromise the user's device, WindTalker introduces a new CSI collection method based on the Internet Control Message Protocol (ICMP). This monograph also designs a sensitive-input-window identification algorithm based on an IP address pool and proposes an efficient keystroke inference algorithm based on CSI data. The reliability of WindTalker is verified on Alipay, the largest mobile payment platform in China: the implementation shows that WindTalker can bypass the HTTPS encryption deployed by Alipay and successfully infer the user's payment password. This monograph further studies the influence of factors such as distance and direction, demonstrating that WindTalker infers keystrokes with high accuracy in various scenarios. Finally, it proposes an efficient defense mechanism against the privacy threat posed by WindTalker, which prevents attackers from obtaining accurate CSI data through CSI obfuscation.
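The intuition behind CSI-based keystroke inference can be illustrated with a minimal sketch. This is not WindTalker's actual algorithm: it simply assumes that keystrokes show up as high-variance windows in a single CSI amplitude stream, and the window size and threshold below are hypothetical.

```python
import numpy as np

def detect_keystrokes(csi_amp, win=50, thresh=3.0):
    """Flag windows whose CSI amplitude variance exceeds `thresh` times
    the median window variance -- a crude proxy for finger motion."""
    n = len(csi_amp) // win
    variances = np.array([np.var(csi_amp[i * win:(i + 1) * win]) for i in range(n)])
    baseline = np.median(variances)
    return [i for i, v in enumerate(variances) if v > thresh * baseline]

# Synthetic trace: flat noise with a motion burst (a "keystroke") in window 2.
rng = np.random.default_rng(0)
trace = rng.normal(0, 0.1, 500)
trace[100:150] += rng.normal(0, 2.0, 50)   # injected motion burst
print(detect_keystrokes(trace))             # -> [2]
```

In the real attack, such candidate windows would then be classified against per-key CSI waveform profiles rather than a simple variance threshold.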
In the voice interface layer, to address the problem that existing two-factor authentication schemes require users to carry sensor equipment, this monograph proposes WSVA, a voice liveness detection system based on Wi-Fi signals. Unlike

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 163
Y. Meng et al., Security in Smart Home Networks, Wireless Networks,
https://doi.org/10.1007/978-3-031-24185-7_7
traditional two-factor authentication methods, WSVA requires no dedicated equipment and is easy to deploy, because it reuses the wireless signals that already exist in the IoT environment. Since the movement of the user's mouth will
modulate the wireless signal CSI, it is feasible to determine whether the voice
command of the voice interface is actually sent by the user based on the fluctuation
of CSI. This monograph studies the correlation between voice samples and wireless
signals. Then, it establishes a mapping model to correlate syllables in voice
commands, user mouth movements, and their corresponding CSI change patterns.
In this monograph, we apply effective technical mechanisms to process voice samples and CSI data, design novel algorithms to extract features from these heterogeneous signals, and propose a decision algorithm for liveness detection. This monograph
evaluates WSVA in different scenarios with various authentic and spoofed voice commands. Experimental results with six volunteers show that WSVA achieves more than 99% accuracy in liveness detection at a false acceptance rate of 1%.
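The core idea, that mouth motion modulates CSI in step with the spoken command, can be sketched as a correlation check. This is an illustrative toy rather than WSVA's actual feature extraction and decision algorithm; the envelopes, signals, and thresholds below are synthetic.

```python
import numpy as np

def liveness_score(audio_env, csi_env):
    """Pearson correlation between the audio energy envelope and the CSI
    fluctuation envelope; a live command should yield a high score."""
    a = (audio_env - audio_env.mean()) / (audio_env.std() + 1e-9)
    c = (csi_env - csi_env.mean()) / (csi_env.std() + 1e-9)
    return float(np.mean(a * c))

t = np.linspace(0, 1, 200)
speech = np.abs(np.sin(2 * np.pi * 3 * t))                   # syllable-like envelope
live_csi = speech + np.random.default_rng(1).normal(0, 0.05, 200)    # mouth motion present
replay_csi = np.random.default_rng(2).normal(0, 0.05, 200)           # replayed: no motion

print(liveness_score(speech, live_csi) > 0.9)    # True
print(liveness_score(speech, replay_csi) < 0.3)  # True
```

A replayed recording produces the audio but not the correlated CSI fluctuation, so its score stays near zero and the command is rejected.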
Further, to address the limited robustness and flexibility of current voice-signal-based passive liveness detection schemes, this monograph designs ArrayID, a passive liveness detection system built on the microphone arrays already adopted by mainstream smart speakers. ArrayID performs liveness detection using only the audio collected by the smart speaker and requires users neither to deploy sensor devices (e.g., wireless antennas) nor to perform any actions. It exploits the diversity of multi-channel audio, which arises from the different positions of the microphones in the array, to extract more effective liveness features. In this monograph, we first use wave propagation theory to analyze existing schemes and then propose a powerful discriminative feature, the array fingerprint. A dataset containing 38,720 multi-channel voice commands is collected and constructed, and experimental results on this self-built dataset and another public dataset show that ArrayID outperforms existing schemes. In addition, we evaluate the impact of multiple factors (e.g., distance, direction, spoofing device, and background noise) on ArrayID and demonstrate its robustness.
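The array-fingerprint idea, inter-channel diversity as a liveness cue, can be sketched as follows. The per-band magnitude ratios against a reference channel used here are only a toy stand-in for ArrayID's actual fingerprint, and the two-microphone signals are synthetic.

```python
import numpy as np

def array_fingerprint(channels, n_bands=8):
    """Per-band spectral magnitude ratio of each microphone channel against
    channel 0 -- a toy stand-in for an inter-channel diversity feature."""
    specs = [np.abs(np.fft.rfft(ch)) for ch in channels]
    ref = specs[0] + 1e-9
    feats = []
    for spec in specs[1:]:
        ratio = spec / ref
        bands = np.array_split(ratio, n_bands)
        feats.extend(b.mean() for b in bands)
    return np.array(feats)

rng = np.random.default_rng(3)
sig = rng.normal(0, 1, 1024)
# Two-mic toy array: the second channel is attenuated, as if farther from the source.
fp = array_fingerprint([sig, 0.5 * sig])
print(np.allclose(fp, 0.5))   # True: uniform 0.5 ratio across all bands
```

A loudspeaker replaying a recording illuminates the array differently from a human mouth, so such inter-channel ratios shift in a way a classifier can learn to separate.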
Finally, to overcome the limitation that existing work on monitoring the malicious behavior of smart applications requires modifying the applications or platforms, this monograph proposes HoMonit, a misbehavior detection system that is independent of the smart home architecture. HoMonit uses wireless side-channel inference to monitor application behavior from encrypted wireless traffic. It first extracts the deterministic finite automaton (DFA) of an application from the source code of its benign version. It then infers the operation and interaction states of the application's devices by observing the sizes and intervals of the encrypted wireless packets. Next, HoMonit matches the DFA extracted from the application against the DFA inferred from the wireless traffic; a matching failure indicates that the running application exhibits malicious behavior. We implemented HoMonit and evaluated its effectiveness: in the experimental evaluation, this monograph developed 60 malicious applications by modifying source code and showed that HoMonit can effectively detect SmartApps with misbehavior. At the same time, considering that the side-channel information may be leveraged by external attackers, this monograph further designs a privacy enhancement module based
on traffic obfuscation. The evaluation of this module shows that HoMonit can effectively protect privacy by increasing information entropy while retaining the ability to monitor application misbehavior.
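The DFA-matching step can be illustrated with a toy sketch. The automaton below is entirely hypothetical (the states, packet sizes, and event mapping are invented for illustration), but it shows how a trace of observed encrypted-packet sizes either follows the expected automaton or is flagged as misbehavior.

```python
# Hypothetical DFA for a door-lock SmartApp: packet sizes stand in for
# the ZigBee events HoMonit would observe on the encrypted channel.
DFA = {
    ("idle", 54): "motion_reported",      # 54-byte packet ~ motion event
    ("motion_reported", 62): "lock_cmd",  # 62-byte packet ~ lock command
    ("lock_cmd", 54): "idle",             # ack returns the app to idle
}

def conforms(packet_sizes, start="idle"):
    """Replay observed packet sizes through the DFA; any observation with
    no matching transition flags the trace as misbehavior."""
    state = start
    for size in packet_sizes:
        key = (state, size)
        if key not in DFA:
            return False
        state = DFA[key]
    return True

print(conforms([54, 62, 54]))      # True: benign trace
print(conforms([54, 62, 62, 98]))  # False: unexpected packets after lock
```

In practice the matching also incorporates packet intervals and tolerates inference noise, but the principle is the same: traffic that cannot be explained by the benign automaton is reported.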
In summary, for the terminal device layer, this monograph reveals a cross-layer privacy attack and defense mechanism based on wireless side-channel information. For the voice interface, it proposes two-factor liveness detection based on Wi-Fi wireless sensing and passive liveness detection based on the microphone array. Finally, at the application platform layer, it proposes an application misbehavior detection system based on wireless traffic analysis. Through these key security technologies, this monograph constructs a cross-layer, comprehensive security guarantee for the smart home network.

7.2 Open Research Problems

As smart home technology continues to develop, new security problems will keep emerging, and security research is therefore a dynamic and continuous process. Building on this study, we outline the following directions for future research:
First, regarding side-channel attack and defense on terminal devices, wireless sensing media have expanded from Wi-Fi signals to ultrasound, millimeter waves, visible light, and many other modalities, and more and more devices in smart home scenarios use these communication protocols. In addition, this research mainly targets smart terminal devices such as smartphones, whereas smart devices such as sensors and controllers that use other communication protocols require further exploration. Future work should therefore study the security of these ubiquitous terminal devices across diverse wireless media.
Second, regarding two-factor authentication based on wireless signals, the WSVA scheme proposed in this monograph performs well against voice spoofing attacks such as imitation and replay attacks but lacks an efficient defense mechanism against insider attacks, because the sensing resolution of Wi-Fi signals is limited. In the future, as wireless sensing media and technologies with fine-grained resolution become available, two-factor authentication mechanisms that more deeply integrate user behavior and interface information can be studied and built.
Third, regarding passive detection based on voice signals, the ArrayID system proposed in this monograph requires genuine voice commands from users and spoofed commands from attack devices to train its classification model. This restriction inevitably places a certain burden on users in the smart home environment and also limits the wide adoption of ArrayID. In the future, researchers need to mine more effective liveness detection features to remove the dependence on training data and on individual users.
Fourth, regarding misbehavior detection based on the wireless traffic side channel, the HoMonit system proposed in this monograph mainly targets smart applications employing low-power communication protocols such as ZigBee and Z-Wave. For high-speed communication protocols such as Wi-Fi,
it is currently difficult for HoMonit to construct effective device behavior fingerprints. In addition, HoMonit currently extracts working logic mainly from open-source smart
applications, and its ability to extract working logic from closed-source applications needs to be improved. Therefore, future research should address high-speed communication protocols and closed-source smart application platforms.
Index

A
Application misbehavior detection, 16, 165
Application platform, v, vi, 1–5, 7–8, 13–14, 17, 28–30, 135–161, 163, 165, 166
Array fingerprint, 18, 108, 109, 112–120, 124, 126, 130, 131, 164

C
Channel state information (CSI), 14, 15, 22–24, 38–54, 56–73, 78–80, 82–94, 96–99, 101–104, 107, 163, 164

H
High-speed traffic analysis, 165

K
Keystroke inference, 23–25, 37–42, 44, 47, 48, 50–61, 65, 72

M
Microphone array, vi, 6, 12, 15–16, 18, 107–131, 164, 165
Misbehavior detection, 4, 6, 16–18, 135–161, 164, 165

P
Passive liveness detection, 11, 12, 15, 17, 18, 25, 27–28, 107–131, 165
Physical layer information, 9, 10, 15, 18, 21–23, 38, 67
Privacy inference, 42–55, 143, 153

S
Side-channel attacks, v, 2–3, 9, 10, 14, 17, 21–25, 30, 37, 38, 41, 69, 70, 73, 151, 165
Signal obfuscation, 14, 17, 69–71
Smart home, v, vi, 1–14, 16–18, 21–31, 37, 40, 66, 77, 81, 85, 97, 100, 102, 107–110, 120, 124, 128, 130, 135–138, 141, 142, 145, 151, 153, 154, 161, 163–165

T
Terminal device, v, vi, 2–6, 9–10, 14–15, 17, 21–25, 30, 37–73, 163, 165
Traffic analysis, 24, 135–161, 165
Two-factor authentication, 11–12, 18, 25, 27, 31, 77–104, 131, 163–165

U
Ubiquitous sensing, 22

V
Voice control system, 26, 77, 102, 103
Voice interface, v, vi, 2–7, 9–12, 14, 15, 17, 18, 25–28, 31, 77–104, 107–131, 163–165
Voice spoofing, v, 3, 4, 11, 12, 15, 18, 25–28, 77, 79, 83, 99, 104, 126, 131, 165

W
Wireless fingerprinting, 160, 161
Wireless side-channel inference, 141, 164

