
Hear No Evil and See No Evil - Black-Box Attacks on Speech Recognition


- By Hadi Abdullah, Muhammad Sajidur Rahman, Washington Garcia, Logan Blue, Kevin Warren, Anurag Swarnim Yadav, Tom Shrimpton and Patrick Traynor, University of Florida




The telephony network is still the most widely used mode of audio communication on the planet, with billions of phone calls occurring every day within the USA alone. Such a degree of activity makes the telephony network a prime target for mass surveillance by governments. However, hiring individual human listeners to monitor these conversations cannot scale. To overcome this bottleneck, governments have used Machine Learning (ML) based Automatic Speech Recognition (ASR) systems and Automatic Voice Identification (AVI) systems to conduct mass surveillance of their populations. Governments accomplish this by employing ASR systems to flag anti-state phone conversations and AVI systems to identify the participants. The ASR systems convert the phone call audio into text. Next, the government can use keyword searches on the audio transcripts to flag potentially dissenting conversations. Similarly, AVI systems identify the participants of the phone call using voice signatures.

Currently, there does not exist any countermeasure for a dissident attempting to circumvent this mass surveillance infrastructure. Several targeted attacks against ASR and AVI systems exist in the current literature. However, none of these consider the limitations of the dissident (near-real-time operation, no access to or knowledge of the state's ASR and AVI systems, success over the telephony network, limited queries, transferability, high audio quality). Targeted attacks either require white-box knowledge, generate noisy audio, are query intensive, or are not resistant to the churn of the telephony network. For a comprehensive overview of the current attacks with respect to our own, we refer the reader to Table IV in the Appendix.

The paper proposes the first near-real-time, black-box, model-agnostic method to help evade the ASR and AVI systems employed as part of the mass telephony surveillance infrastructure. (Recently, we have seen a number of attack papers against ASR and AVI systems; to better understand why our work is novel and to clearly differentiate it from existing literature, we encourage readers to review Table IV in the Appendix.) Using our method, a dissident can force any ASR system to mistranscribe their phone call audio and an AVI system to misidentify their voice. As a result, governments will lose trust in their surveillance models and have to invest greater resources to account for our attack. Additionally, by forcing mistranscriptions, our attack will prevent governments from successfully flagging the conversation. Our attack is untargeted, i.e., it cannot produce selected words or impersonate specific speakers. However, in the absence of any technique that can attain these goals in the severely constrained setting of the dissident, our method's ability to achieve a limited set of goals (i.e., evasion) is still valuable. It can be used by dissidents as a first line of defence.

Our attack is specifically designed to address the needs of the dissident attempting to evade the audio surveillance infrastructure. The following are the key contributions of our work:

Our attack can circumvent any state-of-the-art ASR and AVI system in a near-real-time, black-box, transferable manner: A dissident attempting to evade mass surveillance will not have access to the target ASR or AVI systems. A key contribution of this work is the ability to generate audio samples that will induce errors in a variety of ASR and AVI systems in a black-box setting, where the adversary has no knowledge of the target model. Current black-box attacks against audio models use genetic algorithms, which still require hundreds of thousands of queries and days of execution to generate a single attack sample. In contrast, our attack can generate a sample in fewer than 15 queries to the target model. Additionally, we show that if the dissident cannot query the target model, which is most likely the case, our adversarial audio samples are still transferable, i.e., they evade unknown models.

Attack does not significantly impact human-perceived quality or comprehension and works in real audio environments: The dissident must be confident that the attack perturbations will maintain the quality of the phone call audio, survive the churn of the telephony network, and still evade the ASR and AVI systems. Therefore, we design our attack to introduce imperceptible changes, such that there is no significant degradation of the audio quality. We substantiate this claim by conducting an Amazon Mechanical Turk user study. Similarly, we test our attacks over the cellular network, which introduces significant audio quality degradation due to transcoding, jitter and packet loss. We show that even after undergoing such serious degradation and loss, our attack audio samples are still effective in tricking the target ASR and AVI systems. To our knowledge, our work is the first to generate attack audio that is robust to the cellular network. Therefore, our attack ensures that the dissenter's adversarial audio will not suffer any significant loss of quality and will evade the surveillance models after being intercepted within the telephony network.

Robust to existing adversarial detection and defence mechanisms: Finally, we evaluate our attack against existing techniques for detecting or defending against adversarial samples. For the former, we test the attack against the temporal-dependency-based detection method, which has shown excellent results against traditional adversarial attacks. We show that this method has limited effectiveness against our attack: it performs no better than randomly guessing whether an attack is in progress. Regarding defences, we test our attack against adversarial training, which has shown promise in the adversarial image space. We observe that this method slightly improves model robustness, but at the cost of a significant decrease in model accuracy.
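To make the detection baseline concrete, the sketch below shows one common form of temporal-dependency checking: the transcription of an audio prefix is compared against the matching prefix of the full transcription, and large disagreement is flagged. This is only an illustration under assumptions; transcribe(audio) is a placeholder for a black-box ASR query, and the flagging threshold is arbitrary rather than taken from the paper.

```python
def temporal_consistency_score(audio, transcribe, k=0.5):
    """Word-level disagreement between the transcription of the first k fraction
    of the audio and the corresponding prefix of the full transcription."""
    full_words = transcribe(audio).split()
    prefix_words = transcribe(audio[: int(len(audio) * k)]).split()
    reference = full_words[: len(prefix_words)]
    mismatches = sum(a != b for a, b in zip(prefix_words, reference))
    mismatches += abs(len(prefix_words) - len(reference))
    return mismatches / max(len(reference), 1)

def flag_as_adversarial(audio, transcribe, threshold=0.5):
    # Benign audio tends to produce consistent prefixes; adversarial audio often
    # does not. The 0.5 cutoff here is purely illustrative.
    return temporal_consistency_score(audio, transcribe, k=0.5) > threshold
```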


Hypothesis and Threat Model



The figure shows the steps involved in generating an attack audio sample. First, the target audio sample is passed through a signal decomposition function (a), which breaks the input signal into components. Next, subject to some constraints, a subset of the components is discarded during thresholding (b). A perturbed audio sample is reconstructed (c) using the remaining weights from (a) and (b). The audio sample is then passed to the ASR/AVI system (d) for transcription. The difference between the transcription of the perturbed audio and the original audio is measured (e). The thresholding constraints are updated accordingly and the entire process is repeated.
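The following Python sketch illustrates this loop under simplifying assumptions: the decomposition is a plain DFT, the "constraint" being searched is a single magnitude percentile, and transcribe(audio) stands in for one black-box query to the target ASR/AVI system. The function names and parameters are placeholders, not the paper's implementation; the binary search is one plausible way to realize the "fewer than 15 queries" behaviour described above.

```python
import numpy as np

def dft_threshold_perturb(audio, percentile):
    """Discard DFT components whose magnitude falls below the given percentile,
    an illustrative version of the decompose (a) / threshold (b) / rebuild (c) steps."""
    spectrum = np.fft.rfft(audio)
    magnitudes = np.abs(spectrum)
    cutoff = np.percentile(magnitudes, percentile)
    spectrum[magnitudes < cutoff] = 0.0
    return np.fft.irfft(spectrum, n=len(audio))

def evade(audio, transcribe, max_queries=15):
    """Binary-search the discard threshold using only black-box queries, steps (d)/(e)."""
    original_text = transcribe(audio)          # baseline transcription
    lo, hi = 0.0, 100.0                        # percentile search bounds
    best = None
    for _ in range(max_queries):
        mid = (lo + hi) / 2.0
        candidate = dft_threshold_perturb(audio, mid)
        if transcribe(candidate) != original_text:
            best, hi = candidate, mid          # mistranscribed: try a gentler threshold
        else:
            lo = mid                           # still transcribed correctly: discard more
    return best                                # None if no threshold succeeded
```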

Results



Phoneme vs. Word Level Perturbation

Our attack aims to force a mistranscription while still being indistinguishable from the original audio to a human listener. Our results indicate that at both the phoneme-level and the word-level, the attack is able to fool black-box models while keeping audio quality intact. However, the choice of performing word-level or phoneme-level attacks is dependent on factors such as attack success guarantee, audible distortion, and speed. The adversary can achieve guaranteed attack success for any word in the dictionary if word-level perturbations are used. However, this is not always true for a phoneme-level perturbation, particularly for phonemes which are phonetically silent. An ASR system may still properly transcribe the entire word even if the chosen phoneme is maximally perturbed. Phoneme-level perturbations may introduce less human-audible distortion to the entire word, as the human brain is well suited to interpolate speech and can compensate for a single perturbed phoneme. In terms of speed, creating word-level perturbations is significantly slower than creating phoneme-level perturbations. This is because a phoneme-level attack requires perturbing only a fraction of the audio samples needed when attacking an entire word.
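To make the difference concrete, the sketch below perturbs only a selected time span of the waveform, reusing the hypothetical dft_threshold_perturb helper from the pipeline sketch above. The word and phoneme boundaries here are invented values; in practice they would come from a forced aligner or manual annotation.

```python
# Hypothetical time-aligned boundaries (in seconds).
word_span    = (1.20, 1.78)   # the whole word
phoneme_span = (1.31, 1.44)   # just the vowel inside it

def perturb_span(audio, sample_rate, span, percentile=85.0):
    """Apply the DFT thresholding only to the selected slice of the waveform."""
    start, end = int(span[0] * sample_rate), int(span[1] * sample_rate)
    out = audio.copy()
    out[start:end] = dft_threshold_perturb(audio[start:end], percentile)
    return out

# A phoneme-level attack touches far fewer samples than a word-level one,
# which is why it is faster to generate and usually less audible.
```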

Steps to Maximize Attack Success

An adversary wishing to launch an Over-Cellular evasion attack on an ASR system would be best off using the DFT-based phoneme-level attack on vowels, as it guarantees a high level of attack success. Our transferability results show that an attacker can generate samples for a known 'hard' model such as Google (Phone) and have reasonable confidence that the attack audio will transfer to an unknown ASR model. From our ASR poisoning results, we observe that an adversary does not have to perturb every word to achieve 100% mistranscription of the utterance. Instead, the attacker can perturb a vowel of every other word in the worst case, and of every fifth word in the best case. The ASR poisoning effect will ensure that the non-perturbed words are also mistranscribed. Finally, the attack audio samples have a high probability of surviving the compression of a cellular network, which enables the success of our attack over lossy and noisy mediums.
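As a rough illustration of that perturbation budget, the loop below perturbs a vowel in every other word, reusing the hypothetical perturb_span helper from the earlier sketch. The alignment tuples are invented for the example and would normally come from a forced aligner.

```python
# Hypothetical forced-alignment output: (word, vowel_start_s, vowel_end_s).
alignment = [
    ("the",  0.10, 0.15), ("state", 0.32, 0.41), ("is", 0.55, 0.60),
    ("not",  0.71, 0.78), ("listening", 0.90, 0.97),
]

def perturb_every_other_vowel(audio, sample_rate, alignment, percentile=85.0):
    """Worst-case budget from the ASR poisoning results: perturbing a vowel in
    every other word is enough to mistranscribe the full utterance."""
    out = audio
    for i, (_, start, end) in enumerate(alignment):
        if i % 2 == 0:                       # every other word
            out = perturb_span(out, sample_rate, (start, end), percentile)
    return out
```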

In contrast to the ASR system attack, an adversary looking to execute an evasion attack on an AVI system would prefer the SSA-based phoneme-level attack. Similar to ASR poisoning, we observe that an adversary does not have to perturb the entire sentence to cause a misidentification, but rather just a single phoneme of a word in the sentence. Based on our results, the attacker would need to perturb on average one phoneme every 8 words (33 phonemes) to ensure a high likelihood of attack success. The attack audio samples are generated in an identical manner for both the ASR and AVI system attacks, so the AVI attack audio should also be robust against lossy and noisy mediums (e.g., a cellular network).
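For completeness, here is a minimal sketch of what an SSA-style decomposition of a phoneme segment can look like: the segment is embedded in a trajectory matrix, the low-energy singular components are discarded, and the waveform is rebuilt by diagonal averaging. The window length and number of retained components are illustrative choices, not the paper's settings.

```python
import numpy as np

def ssa_perturb(segment, window=64, keep=8):
    """Singular Spectrum Analysis sketch: embed the segment in a trajectory
    matrix, keep only the highest-energy components, and rebuild the waveform."""
    n = len(segment)
    k = n - window + 1
    # Trajectory (Hankel) matrix whose columns are lagged windows of the signal.
    traj = np.column_stack([segment[i:i + window] for i in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    s[keep:] = 0.0                          # discard low-intensity components
    approx = (u * s) @ vt
    # Diagonal averaging (Hankelization) back to a one-dimensional signal.
    out, counts = np.zeros(n), np.zeros(n)
    for col in range(k):
        out[col:col + window] += approx[:, col]
        counts[col:col + window] += 1.0
    return out / counts
```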

Why the Attack Works

Our attacks exploit a fundamental difference in how the human brain and ASR/AVI systems process speech. Specifically, our attack discards low-intensity components of an audio sample that the human brain is primed to ignore. The remaining components are enough for a human listener to correctly interpret the perturbed audio sample. On the other hand, the ASR or AVI systems have unintentionally learned to depend on these low-intensity components for inference. This explains why removing such insignificant parts of the speech signal confuses the model and causes a mistranscription or misidentification. Additionally, this may also explain some portion of the ASR and AVI error on regular test data sets. Future work may use these insights to build more robust models and to explain and reduce ASR and AVI system error.

Audio CAPTCHAs

In addition to helping dissidents overcome mass surveillance, our attack has other applications, specifically in the domain of audio CAPTCHAs. These are often used by web services to validate the presence of a human. CAPTCHAs rely on humans being able to transcribe audio better than machines, an assumption that modern ASR systems call into question. Our attack could potentially be used to intelligently distort audio CAPTCHAs as a countermeasure to modern ASR systems.

Related Work

Machine Learning (ML) models, and in particular deep learning models, have shown great performance advances in previously complex tasks, such as image classification and speech recognition. However, previous work has shown that ML models are inherently vulnerable to a class of attacks known as Adversarial Machine Learning (AML).

Early AML techniques focused on visually imperceptible changes to an image that cause the model to incorrectly classify the image. Such attacks target either specific pixels, or entire patches of pixels. In some cases, the attack generates entirely new images that the model would classify to an adversary’s chosen target.

However, the success of these attacks is the result of two restrictive assumptions. First, the attacks assume that the underlying target model is a form of neural network. Second, they assume the model can be influenced by changes at the pixel level. These assumptions prevent image attacks from being used against ASR models. ASR systems have found success across a variety of ML architectures, from Hidden Markov Models (HMMs) to Deep Neural Networks (DNNs). Further, since audio data is normally preprocessed for feature extraction before entering the statistical model, the models operate at a higher level than the 'pixel level' of their image counterparts.

To overcome these limitations, previous works have proposed several new attacks that exploit the behaviours of particular models. These attacks can be categorized into three broad techniques that generate audio which is: a) inaudible to the human ear but still detected by the speech recognition model, b) noisy, such that it sounds like noise to a human but is correctly deciphered by the ASR system, or c) pristine, such that the audio sounds normal to a human but is deciphered as a different, chosen phrase. Although attacks in the third category may seem the most useful, they are limited in their success, as they often require white-box access to the model.

Attacks against image recognition models are well studied, giving attackers the ability to execute targeted attacks even in black-box settings. This has not yet been possible against speech models, even for untargeted attacks in a query-efficient manner. That is, both targeted and untargeted attacks require knowledge of the model internals (such as architecture and parameterization) and a large number of queries to the model. In contrast, we propose a query-efficient black-box attack that is able to generate an attack audio sample that will be reliably mistranscribed by the model, regardless of architecture or parameterization. Our attack can generate an attack audio sample in logarithmic time while leaving the audio quality mostly unaffected.

Conclusion

Automatic speech recognition systems are playing an increasingly important role in security decisions. As such, the robustness of these systems (and the foundations upon which they are built) must be rigorously evaluated. We perform such an evaluation in this paper, with particular focus on speech transcription. By exhibiting black-box attacks against multiple models, we demonstrate that such systems rely on audio features which are not critical to human comprehension and are therefore vulnerable to mistranscription attacks when such features are removed. We then show that such attacks can be efficiently conducted as perturbations to certain phonemes (e.g., vowels) that cause significantly greater misclassification of the words that follow them. Finally, we not only demonstrate that our attacks work across models but also show that the generated audio has no impact on understandability for users. This detail is critical, as attacks that simply obscure audio and make it useless to everyone are not particularly useful to the adversaries we consider. While adversarial training may provide partial mitigation, we believe that more substantial defences are ultimately required to defend against these attacks.
