I recently read the paper MVA Processing of Speech Features by Chia-Ping Chen and Jeff A. Bilmes—sorry, I cannot find a free version at the moment. It provides a good analysis of the effects of noise on feature calculation for ASR. The paper also reminded me why analyzing the effects of additive noise can be deceptively difficult.

The basic assumption in additive noise is that the observed signal is $y = x + n$, where $x$ is the clean speech signal and $n$ is the noise signal. This assumption generally holds in practice, and holds exactly in the synthetically created datasets commonly used in the literature, where noise is added to clean recordings.

The first step in the feature calculation process is to convert the signal from the time domain to the spectral domain. This is usually accomplished by applying the Fourier transform. We will use capital letters for the spectral versions of the signals, so $Y=X+N$; because the Fourier transform is linear, an addition in the time domain is still an addition in the spectral domain.
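As a quick sanity check of that linearity claim, here is a minimal sketch using NumPy. The toy sine wave, the Gaussian noise, and the 512-sample frame length are my own stand-ins for illustration, not anything taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-ins: one 512-sample frame of "speech" and of "noise".
t = np.arange(512)
x = np.sin(2 * np.pi * 0.05 * t)      # toy clean signal
n = 0.3 * rng.standard_normal(512)    # toy additive noise
y = x + n                             # noisy signal in the time domain

# The DFT is linear, so transforming the sum equals summing the transforms.
X = np.fft.rfft(x)
N = np.fft.rfft(n)
Y = np.fft.rfft(y)

# True up to floating-point rounding.
spectra_add = np.allclose(Y, X + N)
print(spectra_add)
```

Note that `X`, `N`, and `Y` are complex arrays here, which matters for the later steps.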

The next step is to take the squared magnitude of the signal, that is, the power spectrum. This is where things begin to get complicated. Most analyses make the simple assumption that $Y^2 = X^2 + N^2$. This assumption is very convenient: it allows any interaction between the clean speech and noise terms to be ignored when looking at the spectral magnitudes.

Usually we are interested in the log-spectra, so the next step is to take the log of the signal. Even with our basic assumption, we can no longer easily separate the speech and noise interaction. After another step or two the final features will be a complicated nonlinear function of both the speech and noise.
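To make the log step concrete, here is the standard rearrangement under the additive assumption $Y^2 = X^2 + N^2$ (my own working, assuming $X^2 > 0$, not an equation from the paper):

$$\log Y^2 = \log\left(X^2 + N^2\right) = \log X^2 + \log\left(1 + \frac{N^2}{X^2}\right)$$

The first term is the clean log-spectrum, but the second term depends on the ratio of noise to speech power, so the "noise" contribution is not a function of the noise alone.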

If we go back to taking the magnitude of the signal, we can see that things actually become more complicated even earlier. By removing the earlier assumption, we get $Y^2 = X^2 + 2XN + N^2$. Even at this early stage in the process, we cannot separate the speech and noise terms. The above-mentioned Chen and Bilmes paper starts from this point.

Though not explicitly stated in their paper, this analysis still makes an assumption. While $Y^2 = X^2 + 2(XN) + N^2$ is true if we are dealing with real numbers, the signals in the spectral domain are actually complex numbers. We are actually taking the square of the complex magnitude $|Y|^2 = |X + N|^2$.
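For complex values, the textbook expansion is $|X + N|^2 = |X|^2 + |N|^2 + 2\,\mathrm{Re}(X\overline{N})$, where the last term equals $2|X||N|\cos\theta$ and $\theta$ is the phase difference between speech and noise in that bin. The sketch below checks this numerically. The random complex spectra and the 257-bin size are hypothetical stand-ins of mine, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Hypothetical complex spectra for one frame (257 bins), standing in for X and N.
X = rng.standard_normal(257) + 1j * rng.standard_normal(257)
N = 0.5 * (rng.standard_normal(257) + 1j * rng.standard_normal(257))
Y = X + N

power_Y = np.abs(Y) ** 2
power_sum = np.abs(X) ** 2 + np.abs(N) ** 2

# Cross term from expanding |X + N|^2; it depends on the relative phase of X and N.
cross = 2.0 * np.real(X * np.conj(N))

# Same cross term written with magnitudes and the per-bin phase difference.
theta = np.angle(X) - np.angle(N)
cross_polar = 2.0 * np.abs(X) * np.abs(N) * np.cos(theta)

identity_holds = np.allclose(power_Y, power_sum + cross)  # exact expansion
additive_holds = np.allclose(power_Y, power_sum)          # the convenient assumption
print(identity_holds, additive_holds)
```

With these toy spectra the exact expansion holds while the purely additive assumption does not, since the cross term swings positive or negative from bin to bin with the phase difference.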

The absolute value sign is the problem: the interaction between the two terms now depends on the relative phase of the speech and noise in every frequency bin. We can no longer cleanly separate the two terms; the first step past the conversion to the spectral domain already introduces problems for analysis. If we assume the final features are determined by a function $C$ (I choose $C$ as in cepstral), then the most we can really say is $C(Y) = C(X) + C(F(X,N))$, where $F(X,N)$ represents some complicated nonlinear relationship between the speech and noise signals.

Even with the simplifying assumptions commonly seen in the literature, analysis of the final features is difficult. Once we remove those assumptions, a satisfactory analysis becomes almost impossible.