
Spectral differences between keywords

Introduction
This document shows the spectral differences between the keywords, both as they are captured and
after they have been processed by the ESP32.
To generate the data for this document, we prepared two devices. One was fully encased in plastic
and placed inside a pine box. The other was disassembled to expose the microphone.
Once the devices were ready, we captured all the keywords required for the LITE I solution
simultaneously from both devices. To add variety, we took measurements with a male and a female
voice.
[Figures: spectrograms of each keyword: Ey Lola, Enciende, Apaga, Sube, Baja, Para, Ayuda]
Conclusions
In the spectrogram representations it is easy to see how the attenuation affects every keyword,
especially those spoken with a higher pitch, such as the female voice. Fortunately, since the
attenuation occurs above 1000 Hz, most of the energy of each word is preserved, particularly in the
vowels, which are the most important features for keyword recognition.
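The claim that most of the energy lies below 1000 Hz can be checked on any recording by comparing band energies. The following is a minimal sketch, not the procedure used for this document: it uses a naive DFT on a synthetic two-tone signal as a stand-in for a vowel, and the sample rate, tone frequencies, and function names are assumptions chosen for illustration.

```python
import cmath
import math


def dft_power(samples):
    """Power of each non-negative frequency bin, via a naive O(n^2) DFT."""
    n = len(samples)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        power.append(abs(acc) ** 2)
    return power


def energy_fraction_below(samples, sample_rate, cutoff_hz):
    """Fraction of spectral energy in bins whose frequency is below cutoff_hz."""
    power = dft_power(samples)
    bin_hz = sample_rate / len(samples)
    low = sum(p for k, p in enumerate(power) if k * bin_hz < cutoff_hz)
    total = sum(power)
    return low / total if total else 0.0


# Synthetic stand-in for a vowel (assumed values, not measured data):
# a strong 250 Hz component plus a weaker 2000 Hz component.
SAMPLE_RATE = 8000
N = 256
signal = [
    math.sin(2 * math.pi * 250 * i / SAMPLE_RATE)
    + 0.3 * math.sin(2 * math.pi * 2000 * i / SAMPLE_RATE)
    for i in range(N)
]

fraction = energy_fraction_below(signal, SAMPLE_RATE, 1000)
print(f"Energy below 1000 Hz: {fraction:.1%}")
```

Running the same function on the recordings from the enclosed and the exposed device would show how much of each keyword's energy falls in the attenuated region.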
On the other hand, the effect of the attenuation is less obvious in the log-spectrogram. Because it
places emphasis on the lower frequencies, which are more relevant to human speech, the 1000-2000 Hz
band is compressed and moved upward, to a point where the attenuation is not as noticeable.
Thanks to this algorithm, a constrained system such as the ESP32 can make much better use of
its resources.
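To illustrate why the attenuated region is less prominent in a log-spectrogram, the sketch below compares where 1000 Hz falls on a linear and on a logarithmic frequency axis. It assumes that the log-spectrogram uses a logarithmic frequency axis, which this document does not spell out, and the axis limits (62.5 Hz to 8000 Hz) are hypothetical values picked for the example.

```python
import math


def linear_position(freq_hz, f_max):
    """Relative height (0..1) of a frequency on a linear axis from 0 to f_max."""
    return freq_hz / f_max


def log_position(freq_hz, f_min, f_max):
    """Relative height (0..1) of a frequency on a log axis from f_min to f_max."""
    return math.log(freq_hz / f_min) / math.log(f_max / f_min)


# Assumed axis limits for illustration only (7 octaves).
F_MIN, F_MAX = 62.5, 8000.0

lin_1k = linear_position(1000, F_MAX)
log_1k = log_position(1000, F_MIN, F_MAX)

# Share of the vertical axis taken up by frequencies above 1000 Hz.
share_above_linear = 1 - lin_1k
share_above_log = 1 - log_1k

print(f"1000 Hz sits at {lin_1k:.1%} of a linear axis, {log_1k:.1%} of a log axis")
print(f"Above 1000 Hz: {share_above_linear:.1%} (linear) vs {share_above_log:.1%} (log)")
```

Under these assumptions the 1000 Hz mark moves from the lower part of the plot to above the midpoint, so the region where the attenuation occurs takes up a much smaller share of the image.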
Finally, although the attenuation does not seem to be as detrimental to the device’s performance as
previously thought, we still believe it is important either to obtain data that suffers from the
same attenuation or to correct the data that does not. This gives the models a better chance of
finding features within the attenuated bands.
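One possible way to correct recordings that do not suffer from the attenuation is to low-pass filter them so that they resemble those from the enclosed device. The sketch below is only an approximation under stated assumptions: a first-order filter with a 1000 Hz cutoff stands in for the enclosure's real response, which would have to be measured, and the 16 kHz sample rate and all function names are hypothetical.

```python
import math


def one_pole_lowpass(samples, sample_rate, cutoff_hz):
    """First-order (RC-style) low-pass filter; a stand-in for the enclosure."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out = []
    y = 0.0
    for x in samples:
        y += alpha * (x - y)  # move the output a fraction alpha toward the input
        out.append(y)
    return out


def rms(samples):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(freq_hz, sample_rate, n):
    """Unit-amplitude sine tone of n samples."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


SAMPLE_RATE = 16000  # assumed; not stated in this document
CUTOFF_HZ = 1000     # taken from the attenuation onset described above

low_tone = tone(250, SAMPLE_RATE, 1600)
high_tone = tone(3000, SAMPLE_RATE, 1600)

low_gain = rms(one_pole_lowpass(low_tone, SAMPLE_RATE, CUTOFF_HZ)) / rms(low_tone)
high_gain = rms(one_pole_lowpass(high_tone, SAMPLE_RATE, CUTOFF_HZ)) / rms(high_tone)

print(f"Gain at 250 Hz: {low_gain:.2f}, gain at 3000 Hz: {high_gain:.2f}")
```

The low tone passes almost unchanged while the high tone is strongly reduced, which is the qualitative behaviour described for the enclosed device. A filter fitted to the measured difference between the two devices would be a better match than this generic one.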
