Six researchers from Zhejiang University published an excellent paper describing DolphinAttack: a new attack against voice-based assistants such as Siri or Alexa. As usual, the objective is to force the assistant to accept a command that the owner of the assistant did not issue. The attack is more powerful if the owner does not detect its occurrence (except, of course, through the potential consequences of the accepted command). The owner should not hear a recognizable command or, even better, should hear nothing at all.
Many attacks target the speech recognition system itself, crafting sounds that the underlying machine learning model accepts as a command even though they contain no actual phonemes. The proposed approach is different. The objective is to fool the audio capture system rather than the speech recognition.
Humans do not hear ultrasounds, i.e., frequencies greater than 20 kHz. Speech is usually in the range of a few hundred Hz up to 5 kHz. The researchers’ great idea is to exploit the characteristics of the acquisition system.
- The acquisition system consists of a microphone, an amplifier, a low-pass filter (LPF), and an analog-to-digital converter (ADC), regardless of the speech recognition system in use. The LPF filters out the frequencies over 20 kHz and the ADC samples at 44.1 kHz.
- Any electronic system creates harmonics due to non-linearity. Thus, if you modulate a signal of frequency fm with a carrier at fc, many components will appear in the Fourier domain, such as fc - fm, fc + fm, and fc, as well as their multiples.
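To see where the extra frequencies come from, here is a short worked derivation. It assumes a simplified quadratic model of the non-linearity with hypothetical gain coefficients A and B, a single-tone command m(t), and a modulation depth of 1. Real hardware is more complicated, so this is only a sketch of the mechanism.

```latex
\begin{align*}
s_{\mathrm{in}}(t)  &= \bigl(1 + m(t)\bigr)\cos(2\pi f_c t), \qquad m(t) = \cos(2\pi f_m t) \\
s_{\mathrm{out}}(t) &= A\, s_{\mathrm{in}}(t) + B\, s_{\mathrm{in}}^2(t) \\
B\, s_{\mathrm{in}}^2(t) &= \frac{B}{2}\bigl(1 + m(t)\bigr)^2 \bigl(1 + \cos(4\pi f_c t)\bigr) \\
&= \underbrace{\frac{3B}{4} + B\cos(2\pi f_m t) + \frac{B}{4}\cos(4\pi f_m t)}_{\text{baseband}}
 + \underbrace{\frac{B}{2}\bigl(1 + m(t)\bigr)^2 \cos(4\pi f_c t)}_{\text{around } 2 f_c}
\end{align*}
```

Under these assumptions, the linear term A·s_in only contains fc, fc - fm, and fc + fm, all of them ultrasonic. The squared term is what brings a copy of the command back at fm, together with a weaker distortion component at 2·fm.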
You may have guessed the trick. If the attacker modulates the command (fm) onto an ultrasound carrier fc, then the resulting signal is inaudible. However, the non-linearity of the microphone and amplifier recreates a copy of the command at its original frequency. The LPF then removes the carrier frequency before the signal reaches the ADC. The residual command remains in the filtered signal and may be understood by the speech recognition system. Of course, real commands are more complicated than a single frequency, but the principle still holds.
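The trick can be checked numerically with a small simulation. The sketch below is not the authors' code: the quadratic non-linearity, the coefficients `A` and `B`, the 400 Hz test tone, the 25 kHz carrier, and the ideal FFT-based low-pass filter are all illustrative assumptions.

```python
import numpy as np

FS = 192_000        # simulation rate standing in for the analog domain (assumption)
DURATION = 0.1      # seconds
F_M = 400.0         # single-tone stand-in for the voice command (assumption)
F_C = 25_000.0      # ultrasound carrier (assumption)
CUTOFF = 20_000.0   # low-pass filter cutoff
A, B = 1.0, 0.1     # hypothetical linear and quadratic gains of the hardware


def tone_amplitude(signal, freq, fs=FS):
    """Amplitude of the sinusoidal component at `freq`, read from the FFT bin."""
    spectrum = np.fft.rfft(signal)
    k = int(round(freq * len(signal) / fs))
    return 2.0 * np.abs(spectrum[k]) / len(signal)


def ideal_low_pass(signal, cutoff, fs=FS):
    """Zero every FFT bin above `cutoff` (an idealized brick-wall LPF)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[freqs > cutoff] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))


def simulate():
    """Return (transmitted ultrasound signal, signal after non-linearity and LPF)."""
    n = round(FS * DURATION)
    t = np.arange(n) / FS
    command = np.cos(2 * np.pi * F_M * t)
    # Amplitude modulation: only fc, fc - fm and fc + fm are emitted.
    transmitted = (1.0 + command) * np.cos(2 * np.pi * F_C * t)
    # Quadratic model of the microphone/amplifier non-linearity.
    captured = A * transmitted + B * transmitted ** 2
    filtered = ideal_low_pass(captured, CUTOFF)
    return transmitted, filtered


if __name__ == "__main__":
    transmitted, filtered = simulate()
    print("command level in the air:     ", tone_amplitude(transmitted, F_M))
    print("command level after the LPF:  ", tone_amplitude(filtered, F_M))
    print("carrier level after the LPF:  ", tone_amplitude(filtered, F_C))
```

In this model the transmitted signal has essentially no energy at 400 Hz, while the filtered signal contains the tone with amplitude close to `B` and no carrier. The test frequencies were chosen to fall exactly on FFT bins so that the amplitudes can be read without windowing.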
They modulated the amplitude of a carrier frequency with a vocal command. The carrier was in the range 20 kHz to 25 kHz. They experimented with many hardware devices and speech recognition systems. As we may guess, the attack is highly hardware dependent. There is an optimal carrier frequency that is device dependent (due to the various microphones). Nevertheless, with the right parameters for a given device, they seem to have fooled most devices. Of course, the optimal equipment requires an ultrasound speaker and an adapted amplifier. Usually, speakers have a response curve that cuts off before 20 kHz.
I love this attack because it thinks outside the box and exploits “characteristics” of the hardware. It is also a good illustration of Law N°6: Security is not stronger than its weakest link.
A good paper to read.
Zhang, Guoming, Chen Yan, Xiaoyu Ji, Taimin Zhang, Tianchen Zhang, and Wenyuan Xu. “DolphinAttack: Inaudible Voice Commands.” In ArXiv:1708.09537 [Cs], 103–17. Dallas, Texas, USA: ACM, 2017. http://arxiv.org/abs/1708.09537