This post refers to the paper: W. Hartmann and E. Fosler-Lussier. “ASR-Driven Top-Down Binary Mask Estimation using Spectral Priors.” Proceedings of IEEE ICASSP, pp. 4685-4688, 2012. (Preprint, Postprint)
I also have a brief description of the general project on my website. While this paper was an initial pilot study, it set the stage for the work that comprised the majority of my thesis.
Much of the work in binary masking is noise-centric; the goal is often to estimate the noise or the instantaneous SNR. I wanted to attempt a much more speech-centric approach. The difficulty is that since the ideal binary mask (IBM) is defined in terms of SNR, it is not obvious how to incorporate speech information into the estimation.
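For concreteness, the SNR-based IBM can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 0 dB local criterion and the power-spectrogram representation are common conventions I am assuming here.

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Ideal binary mask: keep a time-frequency cell when its local
    SNR exceeds the local criterion (0 dB here); mask it otherwise.
    Inputs are power spectrograms of shape (frames, freq_bins)."""
    snr_db = 10.0 * np.log10(speech_power / np.maximum(noise_power, 1e-12))
    return (snr_db > lc_db).astype(np.uint8)
```

Note that computing this mask requires the separated speech and noise signals, which is exactly why the IBM is an oracle quantity and why estimating it from noisy speech is hard.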
Instead, I wanted to define a binary mask that was determined only by the underlying linguistic information; no noise information was considered. In hindsight, this was probably an unnecessary and unrealistic constraint; however, I was so intrigued by the idea of a speech enhancement approach that ignored noise that I followed through—legitimate practical concerns of my advisors be damned.
My assumption was that, for a given speech event, if a frequency band typically contained a large amount of speech energy, then it should be left unmasked regardless of the amount of noise present in the same frequency band. In contrast, if a frequency band typically had little speech energy, then masking should not be detrimental regardless of the amount of noise energy. This assumption was partially supported by experiments I conducted on clean speech; I found that I could mask large portions of the speech signal, as long as the energy was low, without much loss in recognition accuracy.
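The clean-speech experiment can be sketched as masking every time-frequency cell that falls far enough below the utterance's peak energy. The 40 dB margin and the function name below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def low_energy_mask(power_spec, floor_db=40.0):
    """Mask (zero out) every time-frequency cell whose power is more
    than `floor_db` below the utterance's peak; keep the rest.  The
    specific margin is an assumption for illustration."""
    p_db = 10.0 * np.log10(np.maximum(power_spec, 1e-12))
    return (p_db > p_db.max() - floor_db).astype(np.uint8)
```

Applying such a mask to clean speech removes a large fraction of cells while discarding little of the total energy, which is what made the accuracy loss small.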
Given this assumption, a speech event or underlying linguistic information needs to be more specifically defined. Since I use an HMM-based speech recognition system, an obvious choice was to consider a speech event to be one of the sub-phonetic states modeled by the recognizer. For each sub-phonetic unit, I collected statistics about the distribution of energy across all frequencies. This information, combined with a simple threshold, could be used to define a binary mask. My experiments showed that using this ASR-Driven binary mask (using oracle information about the sub-phonetic states) could significantly improve results over the baseline, but was not as good as the IBM.
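One way to realize this is to estimate a mean power spectrum per sub-phonetic state from state-aligned clean training data, then keep only the bins near each state's typical peak. This is a hedged sketch: the mean-spectrum statistic, the relative-dB threshold, and all names here are my assumptions, not the paper's exact formulation.

```python
import numpy as np

def state_spectral_priors(frames, states, n_states):
    """Mean power spectrum per sub-phonetic state, estimated from
    state-aligned clean training frames (T, bins) and labels (T,)."""
    n_bins = frames.shape[1]
    sums = np.zeros((n_states, n_bins))
    counts = np.zeros(n_states)
    np.add.at(sums, states, frames)   # unbuffered accumulation per state
    np.add.at(counts, states, 1)
    return sums / np.maximum(counts[:, None], 1.0)

def asr_driven_mask(state_seq, priors, rel_db=20.0):
    """For each frame, keep the bins where the state's typical energy
    is within `rel_db` of that state's peak; mask the rest."""
    spec_db = 10.0 * np.log10(np.maximum(priors[state_seq], 1e-12))
    peak = spec_db.max(axis=1, keepdims=True)
    return (spec_db > peak - rel_db).astype(np.uint8)
```

Given an oracle state sequence, `asr_driven_mask` produces the mask evaluated above; crucially, it never looks at the noisy observation.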
This is the point where I typically get a confused look and someone asks, “so these are the results you can get assuming you already know the words that were spoken?” My response is, “yes, I understand how ridiculous that sounds, but I promise we are almost at the interesting part.” The point is that we can now use information about the underlying speech to estimate a mask, not that the mask requires oracle knowledge of the speech.
In this paper, I proposed a very simple approach to estimating the mask—a more sophisticated approach would be presented in a later work.
- Perform recognition on the unenhanced noisy speech.
- Generate the ASR-Driven mask based on the hypothesized sub-phonetic unit at each frame.
- Generate new features using the masked speech signal.
- Recognize the now enhanced speech.
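The four steps above can be sketched end to end with toy stand-ins. Everything named here is an illustrative assumption: the real system uses an HMM recognizer over standard ASR features, whereas `decode` below is a trivial nearest-spectrum classifier that merely plays the recognizer's role in the two-pass loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and per-state spectral priors (all assumed values).
N_STATES, N_BINS, T = 3, 8, 50
state_spectra = rng.random((N_STATES, N_BINS)) + 0.1

def decode(power_spec):
    """Stand-in 'recognizer': label each frame with the state whose
    prior spectrum it most resembles (an HMM decoder in the paper)."""
    dists = ((power_spec[:, None, :] - state_spectra[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def asr_driven_mask(states, rel_db=6.0):
    """Keep bins within `rel_db` of each hypothesized state's peak
    typical energy; the threshold value is an assumption."""
    spec_db = 10.0 * np.log10(state_spectra[states])
    peak = spec_db.max(axis=1, keepdims=True)
    return (spec_db > peak - rel_db).astype(float)

# 1. Perform recognition on the unenhanced noisy speech.
noisy = rng.random((T, N_BINS)) + 0.05
first_pass_states = decode(noisy)

# 2. Generate the ASR-driven mask from the hypothesized states.
mask = asr_driven_mask(first_pass_states)

# 3. Generate new features from the masked speech signal.
enhanced = noisy * mask

# 4. Recognize the now-enhanced speech.
second_pass_states = decode(enhanced)
```

Note that the mask in step 2 depends only on the hypothesized states and the learned spectral priors; the noise never enters the mask computation, which is the point of the approach.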
Surprisingly, this simple approach—one which completely ignores the interfering noise—improves both the recognition accuracy and the SNR of the signal. While the absolute improvements were modest and would not necessarily be competitive with more sophisticated approaches, I still think this is a fascinating result.