This article, which I wrote with Arun Narayanan, Eric Fosler-Lussier, and DeLiang Wang, has just been published in the IEEE Transactions on Audio, Speech, and Language Processing. The official version can be found on the IEEE website, but there is also a free preprint available on my website. I recently wrote about the process of publishing this article, but I also wanted to discuss the article itself.
If I had to sum up the article in one sentence, it would be: “We show that, contrary to prior belief, the ideal binary mask can be used directly in automatic speech recognition systems without additional compensation.”
Our work presents a large amount of experimental evidence to support this claim. We also provide a thorough analysis of why the direct masking approach works and why it may have been missed by other researchers.
In the fields of speech separation, speech enhancement, and robust ASR, the ideal binary mask (IBM) has been proposed as a solution. Assume we have a sound signal that is composed of speech and some type of background noise. If the sound signal is represented in a time-frequency representation, then each pixel in the representation contains some energy from the speech and some energy from the noise. A ratio mask can be defined as the ratio of speech energy to noise energy in any pixel. The IBM is simply a thresholded ratio mask. The basic principle is that we want to keep time-frequency units that are mostly speech energy and discard all other information.
Many interesting perceptual experiments have been conducted with the IBM over the years. It is clear that if the IBM can be computed and applied to a noisy signal, it can greatly enhance both the quality and intelligibility of the signal. In fact, a properly formatted binary mask applied to a noise signal containing no speech can produce a signal that sounds like a speech utterance.
While this work is very compelling in terms of human speech perception, the application of the IBM in automatic speech recognition has been lacking. Direct use of a masked signal in ASR systems did not seem to provide any benefit. Alternative approaches to applying the IBM have been proposed, but they inherently sacrifice the simplicity of the IBM-based approach.
Our work demonstrates that by making a small change to the features used in ASR—a change that most modern ASR systems incorporate anyway—the IBM masked signal can be used directly. In fact, this direct masking approach outperforms some other previously proposed compensation methods.
The small change we refer to is variance normalization. It is well known that additive noise reduces the variance of cepstral features for ASR, and variance normalization helps alleviate this issue. We demonstrate that the binary masking of speech has the opposite effect—it increases the variance of the features—but variance normalization also solves this problem.
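To make the idea concrete, here is a minimal sketch of mean and variance normalization applied to a matrix of cepstral features. It assumes the statistics are computed per utterance, which is one common choice and not necessarily the exact scheme used in the paper. The function name `mean_variance_normalize` is hypothetical, and the "compressed" and "expanded" arrays are synthetic stand-ins: real noise and masking do not act as a simple rescaling of the features.

```python
import numpy as np

def mean_variance_normalize(features):
    """Normalize each coefficient to zero mean and unit variance.

    features has shape (num_frames, num_coeffs). Statistics are taken
    over the frames of a single utterance (an assumption for this sketch).
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    # Guard against constant coefficients with zero variance.
    return (features - mean) / np.maximum(std, 1e-8)

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 13))  # stand-in for clean features

# Crude stand-ins for the two effects described above.
compressed = 0.3 * reference + 2.0  # reduced variance, as with additive noise
expanded = 3.0 * reference - 1.0    # increased variance, as with binary masking

norm_reference = mean_variance_normalize(reference)
norm_compressed = mean_variance_normalize(compressed)
norm_expanded = mean_variance_normalize(expanded)
```

In this idealized case all three normalized matrices coincide, because the normalization undoes any per-coefficient shift and positive scaling. With real features the match is only approximate, but the same operation pulls the variance back toward a common range whether it was pushed down or up.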
This paper adds to the established literature on binary-mask-based approaches to robust ASR. The direct masking approach is arguably the simplest way to use the IBM in ASR; its absence from the literature was a mystery. Our work explains this absence and demonstrates the viability of the approach.