Understanding and Mitigating Security Risks of Voice-Controlled Third-Party Functions On Virtual Personal Assistant

UNDERSTANDING AND MITIGATING SECURITY RISKS OF VOICE-CONTROLLED THIRD-PARTY FUNCTIONS ON VIRTUAL PERSONAL ASSISTANT

Nan Zhang, Xianghang Mi, Xuan Feng, XiaoFeng Wang, Yuan Tian and Feng Qian

TEAM 7
LAKSHMISAGAR A S
LOHITH J R
MANAVI SATTIGARAHALLI
 Virtual Personal Assistants (VPAs) rely on voice commands to communicate with users.
 Smart speakers allow third parties to publish functions (skills or actions) that serve users.
 VPAs are vulnerable to security threats such as voice squatting and voice masquerading, which aim to steal personal information.
 The paper proposes techniques that automatically capture ongoing attacks, addressing these threats within Amazon Echo and Google Home.
CONTENT

 Introduction
 Background
 Exploiting VPA Voice control
 Defending voice squatting
 Defending voice masquerading
 Discussion
 Conclusion
INTRODUCTION

 Smart speakers interact with users through a Voice User Interface (VUI).
 The VPA ecosystem is enhanced by letting third-party developers build functions, called skills by Amazon and actions by Google.
 IoT and VPA technologies are growing at a rapid rate (Fig I).
 VPA services are provided by smart speakers, which interact with users through the VUI (Fig II).
 Currently there are 25,000 skills and 1,000+ actions (Fig III).

Fig I : Growth of VPA

Fig II : VPA usage per day

Fig III : Third party usage


Security risks in VPA

 Attackers can impersonate the voice-controlled system, as no protection is in place to authenticate the parties.
 It is difficult for a user to determine whether they are talking to the right skill.
 Attackers can collect private information through conversations with the system.

Voice-based attacks
 The adversary exploits variations in how commands are spoken and how a skill is invoked.
 Attackers target the interactions between users and the VPA, manipulating the hand-over commands.
 The VPA responds with the skill that contains the most matching words or the longest matching pattern.
 Alexa and Google Home cannot always recognize invocation names accurately, making skills susceptible to hijacking.

Experiment

 Studied the popular skill “Sleep and Relaxation Sounds”.
 Invoked by 2,699 users with 21,308 commands within a month.
 Found strong evidence that both voice squatting and voice masquerading can happen.
 Experiments prove that existing skills are susceptible to these threats, with consequences including disclosure of private information.
 Further, a scanner was built that converts the invocation name of a skill into a phonetic expression (similar sounds, subset relations, etc.).
 Ran against 19,670 skills; discovered 4,718 skills with squatting risk.
BACKGROUND

2015: Amazon Echo    2016: Google Home    2016: Amazon Echo Dot    2017: Google Home Mini

 Affordable alternatives
 Variations in designs and colors
 Integrated into IoT devices from other vendors
 Hands-free user experience
 Accommodates the conventional I/O experience
 360-degree audio pickup
 Beamforming for far-field voice recognition
VPA skills

 Skills are third-party apps offering a variety of services to customers.
 Skills can be developed using the Alexa Skills Kit and Actions on Google.
 Skill markets have categories based on function: Amazon has 23 and Google has 18 categories.
 Skills can be invoked by the user calling the skill’s name (explicit invocation) or a related task (implicit invocation).
 When developing, intents (open, tell, start, ask, etc.) and sample utterances need to be defined for mapping purposes.

intent/utterance + invocation name + extra words
e.g., Read/Open + Messages/BoFa + please/now
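The command structure above can be sketched as a tiny parser. This is purely illustrative: the intent list and known-skill list here are hypothetical placeholders, not Amazon's or Google's actual sets.

```python
# Minimal sketch of splitting a voice command into the three parts shown
# above: intent/utterance + invocation name + extra words.
# INTENTS and KNOWN_SKILLS are illustrative, not the platforms' real sets.
INTENTS = ("open", "tell", "start", "ask", "read", "play")
KNOWN_SKILLS = ("messages", "bofa", "sleep and relaxation sounds")

def parse_command(command: str):
    words = command.lower().split()
    intent = words[0] if words and words[0] in INTENTS else None
    rest = " ".join(words[1:]) if intent else " ".join(words)
    # Greedily match the longest known invocation name, mirroring the
    # "longest matching pattern" behavior described in the deck.
    name = max((s for s in KNOWN_SKILLS if rest.startswith(s)),
               key=len, default=None)
    extra = rest[len(name):].strip() if name else rest
    return intent, name, extra

print(parse_command("open messages please"))  # ('open', 'messages', 'please')
```

The greedy longest-name match is what makes word squatting possible: a malicious "messages please" skill would win over "messages".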
VPA skills invocation

1. Captured voice commands are sent to the VPA service provider’s cloud.
2. The cloud performs speech recognition and translates voice to text.
3. It finds the invoked skill by pattern matching.
4. It delivers the request, with timestamp and other meta information, to the skill’s service.
5. An SSML response is received from the service.
6. The response is converted into speech by the cloud and played to the user.
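The SSML response in step 5 can be illustrated with a minimal Alexa-style payload. The field names follow the public Alexa Skills Kit response schema; the spoken content itself is made up.

```python
# Minimal sketch of the JSON/SSML response a skill's service returns
# (step 5). Field names follow the public Alexa Skills Kit response
# schema; the spoken content is illustrative.
response = {
    "version": "1.0",
    "response": {
        "outputSpeech": {
            "type": "SSML",
            "ssml": "<speak>Welcome back. What would you like to do?</speak>",
        },
        # False keeps the session open and the skill listening -- the
        # flag a masquerading skill can abuse to fake termination.
        "shouldEndSession": False,
    },
}
print(response["response"]["outputSpeech"]["type"])  # SSML
```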
EXPLOITING VPA VOICE CONTROL

 In Alexa, skill names need not be unique; multiple skills with the same name are allowed in the market.
 Skills are chosen based on the longest matching pattern and some undisclosed policies.
 Google does not allow duplicate invocation names, but pronunciations can be misleading.
 As only one skill runs at a time and there is no strong indication of whether a skill has quit, VPAs are vulnerable.

 More than 85% of users invoke skills using natural language.
 More than half of users try to switch skills in the middle of an interaction.
 Around 30% of participants found it difficult to terminate a skill.
 78% did not rely on the light indicator to confirm skill termination.
Voice Squatting Attack
 The adversary intentionally induces confusion by using names similar to those of target skills.
 Alexa allows similarly named skills; Google requires unique names, yet an adversary can still cause confusion through pronunciation.
 An adversary exploits how a skill is invoked by utilizing variations in the ways the command is spoken (accent, courteous expressions, etc.).
 There can be voice squatting (phonetically similar names) and word squatting (strategically added extra words).
 According to a study, Alexa and Google mistakenly invoked around 50% of skills.
Voice Masquerading Attack
 A running skill can pretend to hand over control or fake termination in order to impersonate another skill or the VPA.
 The response content is a JSON object for each command. By giving a “Goodbye” or silent response, the skill can fake termination.
 After 6 seconds of silence on Alexa and 8 seconds on Google, skills are forcibly terminated. To continue faking, the silent audio needs to be re-prompted.
 Sensitive data or information can be obtained by the attacking skill impersonating a legitimate skill.
 Analysis shows that out of 9,582 commands, 52 were meant to switch skills and 485 to terminate in the middle, and these could easily be exploited.
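The fake-termination trick can be sketched as a response that says "Goodbye" but keeps the session open, re-prompting with near-silent audio before the platform's idle timeout fires. Field names follow the Alexa Skills Kit response schema; the audio URL is a placeholder, not a real resource.

```python
# Sketch of a fake-termination response (voice masquerading).
# The skill says "Goodbye" but sets shouldEndSession to False, then
# re-prompts with near-silent audio so the idle timeout (reportedly
# ~6 s on Alexa, ~8 s on Google) never terminates it.
# The audio URL is a placeholder.
fake_goodbye = {
    "version": "1.0",
    "response": {
        "outputSpeech": {"type": "PlainText", "text": "Goodbye!"},
        "reprompt": {
            "outputSpeech": {
                "type": "SSML",
                "ssml": '<speak><audio src="https://example.com/silence.mp3"/></speak>',
            }
        },
        "shouldEndSession": False,  # session stays alive despite the "Goodbye"
    },
}
```

To the user the skill appears to have quit, while it actually keeps listening to subsequent commands.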

 The attacker targets the interactions between the user and the VPA system, including those supposed to be processed by the VPA system itself (switch and terminate commands).
 Takes advantage of improper responses, or provides fake command responses, from the VPA.
DEFENDING VOICE SQUATTING

 Data Collection
 Ran web crawlers to collect metadata such as name, author, invocation name, sample utterances, description and reviews.
 Gathered around 23,758 skills, 19,670 of them being third-party skills.

 Methodology
 Built a scanner to capture Competitive Invocation Names (CINs) – confusable names for a given invocation name.
 Two parts within defending against voice squatting:
 Utterance paraphrasing – identifies suspicious variations of an invocation name.
 Pronunciation comparison – finds the similarity in pronunciation between two different names.
 Utterance paraphrasing
 Paraphrases common invocation utterances of the target skill using bilingual pivoting and deep neural networks.
 Gathered 11 prefixes and 6 suffixes and applied them to build variations recognizable to the VPA system:
please, my, can you, it, the, for me, app, some, few, etc.
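The prefix/suffix expansion can be sketched in a few lines. The lists below are a subset of the ones mentioned above, used only for illustration.

```python
# Sketch of the utterance-paraphrasing step: generate recognizable
# variations of a target invocation name by attaching common prefixes
# and suffixes (a subset of the 11 prefixes and 6 suffixes cited above).
PREFIXES = ["please", "my", "the", "some"]
SUFFIXES = ["please", "app", "for me", "now"]

def invocation_variants(name: str):
    variants = {name}
    variants.update(f"{p} {name}" for p in PREFIXES)
    variants.update(f"{name} {s}" for s in SUFFIXES)
    return sorted(variants)

for v in invocation_variants("cat facts"):
    print(v)
```

Each generated variant is then checked against the market: a skill registered under "cat facts please" would be a word-squatting candidate for "cat facts".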
 Pronunciation comparison
 Converts a name into a phonemic representation using the ARPABET phoneme code.
 The CMU Pronouncing Dictionary, which includes 134,000 words, is used.
 Trains a grapheme-to-phoneme model using a recurrent neural network with long short-term memory (LSTM) units.
 Every phonemic representation is compared with the target name, and their edit distance is measured within the scanner.
 A weighted cost matrix is obtained for the operations on different phoneme pairs.
 The Needleman-Wunsch algorithm is used to identify the minimum edit distance and the related edit path.
 )=1-
Evaluation
 Calculated the cost for transforming misrecognized invocations names to identified from voice commands.
 3655 out of 19,670 Alexa skills have CINs which include identical invocation names.
 After removing skills with identical names, 531 skills have CINs with average of 1.31 CINs.
 Invocation name with most CINs is “cat facts” with 66 skills.
 345 skills CINs are the utterance paraphrasing of other skills’ names.
 Google has only 1,001 skills and does not allow them to have identical invocation names.
 Only 4 skills have similarly pronounced CINs.
 Some skills deliberately utilize the invocation names unrelated to their functionalities and follow popular skills

invocation name.
DEFENDING VOICE MASQUERADING

 Built a context-sensitive detector upon the VPA infrastructure.
 Takes a skill's response as input and determines whether an impersonation risk is present.
 Two parts in defending against the masquerading attack:
 Skill Response Checker – captures suspicious responses from a malicious skill.
 User Intention Classifier – checks the voice commands issued by the user.
 Skill Response Checker
 Maintains a set of common utterance templates used by the VPA system.
 An alarm is triggered if a skill's response is found to be similar to one of the templates.
 A sentence embedding model is trained using a recurrent neural network with bidirectional LSTM units.
 Calculates absolute cosine similarity as sentence relevance.
 A threshold of 0.8 is set to differentiate suspicious responses from legitimate ones.
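The similarity test can be sketched as follows. The toy vectors stand in for the BiLSTM sentence embeddings; only the cosine-plus-threshold logic reflects the description above.

```python
# Sketch of the Skill Response Checker's test: absolute cosine
# similarity between a response embedding and each template embedding,
# flagged if any exceeds the 0.8 threshold. The vectors are toy
# stand-ins for the BiLSTM sentence-embedding model's output.
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

THRESHOLD = 0.8

def is_suspicious(resp_vec, template_vecs):
    # Absolute similarity, so strongly anti-correlated responses also flag.
    return any(abs(cosine(resp_vec, t)) >= THRESHOLD for t in template_vecs)

templates = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.0]]
print(is_suspicious([0.9, 0.1, 0.25], templates))  # True: near the first template
print(is_suspicious([0.0, 0.0, 1.0], templates))   # False
```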

 User Intention Classifier
 Aims at automatically detecting erroneous commands from the user.
 Uses a learning-based approach that utilizes contextual information to identify the user's intention.
 Compares the user's utterance to both system commands and the running skill's context.
 The utterance's maximum and average sentence-relevance (SR) scores are used as features for classification.
 A user's command for a skill is related to the skill's prior communication with the user and its stated functionalities.
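The feature extraction can be sketched as below. A simple word-overlap score stands in for the embedding-based sentence relevance (SR); only the max/average feature construction mirrors the description above, and the example sentences are invented.

```python
# Sketch of the User Intention Classifier features: score the user's
# utterance against each context sentence (system commands, the skill's
# prior utterances, its description) and keep the maximum and average
# as classifier features. word_overlap is a toy stand-in for the
# embedding-based SR measure.
def word_overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def sr_features(utterance: str, context_sentences):
    scores = [word_overlap(utterance, s) for s in context_sentences]
    return max(scores), sum(scores) / len(scores)

context = ["what sound would you like to hear",
           "play thunderstorm sounds all night"]
mx, avg = sr_features("play thunderstorm sounds", context)
print(round(mx, 3), round(avg, 3))
```

A low maximum SR suggests the command does not fit the running skill's context, i.e., the user likely intended a system command or a different skill.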
Evaluation

 Selected 10 popular skills and collected 61 utterances.
 Built 10 fake-termination attacks and crafted 15 skill-switch instances.
 The detector successfully detected all 25 context-switching impersonation instances.
 Evaluated on 9,582 real-world conversations.
 The detector identified 341 context-switch instances, of which 326 were confirmed to be actual.
 The UIC component achieved a precision of 95.60%.
 An average detection latency of 0.003 s was observed, which is negligible.
DISCUSSION

 Limitation of defense – The data sets may not be comprehensive enough to cover all real-world attack cases that could happen.

 Future direction – Manual inspection of each skill is infeasible, as a skill's internal logic is invisible to the VPA system.

A potential future direction is to develop a lightweight and effective dynamic analysis system.


CONCLUSION

 The VPA ecosystem is vulnerable to two attacks: VSA and VMA.
 A scanner was built and run against Amazon and Google skills, leading to the finding that a large number of Alexa skills are at risk and problematic skill names are already published in the market.
 The implemented context-sensitive detector mitigates the threat with 95% precision.
 Future work is needed to better protect the voice channel and to authenticate the parties.
THANK YOU

ANY
QUESTIONS?
