Interspeech was in Portland this year. Compared to other conferences I have attended in the last few years, it was practically in my backyard—as long as you ignore the fact I was supposed to be in Paris.
I’ve included a short list of papers whose presentations piqued my interest. Obviously the list is heavily biased by my own interests, and many good papers have certainly been omitted. Also, I did not want to appear biased, so I have not included any work by colleagues or labmates (even award-winning work).
Context-Dependent MLPs for LVCSR: TANDEM, Hybrid or Both? by Zoltán Tüske, Ralf Schlüter, Hermann Ney, Martin Sundermeyer: With the success of deep neural networks, there has been some renewed interest in Tandem and Hybrid systems. The paper seems to give a good overview of the different systems and tests a wide variety of setups.
Can Modified Casual Speech Reach The Intelligibility of Clear Speech? by Maria Koutsogiannaki, Michelle Pettinato, Cassie Mayo, Varvara Kandia, Yannis Stylianou: In this paper, they showed that artificially slowing down fast speech produced big improvements according to standard objective intelligibility metrics. Listening tests with human subjects, however, showed the exact opposite result: artificially slowing down the speech actually makes it less intelligible for humans. The takeaway for me was that objective intelligibility metrics can be severely lacking, and that there is more wrong with fast speech than simply its speed.
MAP Estimation of Whole-Word Acoustic Models with Dictionary Priors by Keith Kintzley, Aren Jansen, Hynek Hermansky: I have always liked the work with Point Process Models. One drawback was the difficulty of building these whole-word models from very few examples. They present a novel way of compensating for the low-resource case.
Estimating Word-Stability During Incremental Speech Recognition by Ian McGraw, Alex Gruenstein: Word stability is not a problem I have thought about before. The idea is that an incremental speech recognizer is less useful if the words constantly change as a new word comes in. They propose a measure for the stability of a word and only display the next word when its stability passes a certain threshold.
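The display policy can be sketched in a few lines. This is a toy illustration only, not the paper's actual stability estimator (they learn a stability measure from features of the hypothesis): here I simply treat a word as stable once it has survived, at the same position, across the last few partial hypotheses.

```python
def stable_prefix(hypotheses, threshold=3):
    """Return the words of the latest partial hypothesis that have
    persisted, position by position, through the last `threshold`
    partial results. (Toy stability measure, not the paper's.)"""
    if len(hypotheses) < threshold:
        return []
    recent = hypotheses[-threshold:]
    latest = recent[-1]
    shown = []
    for i, word in enumerate(latest):
        # A word is "stable" if every recent hypothesis agrees on it.
        if all(len(h) > i and h[i] == word for h in recent):
            shown.append(word)
        else:
            break  # stop at the first unstable word
    return shown

# Partial results as they might arrive from an incremental recognizer:
partials = [
    ["the"],
    ["the", "quick"],
    ["the", "quick", "brown"],
    ["the", "quick", "browse"],   # last word is still flickering
    ["the", "quick", "brown", "fox"],
]
print(stable_prefix(partials, threshold=3))  # → ['the', 'quick']
```

Only "the" and "quick" get displayed, since the third word changed within the last three hypotheses; a higher threshold trades display latency for fewer visible revisions.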
Longer Features: They do a Speech Detector Good by TJ Tsai, Nelson Morgan: I’m always interested in work exploring features other than the standard frame-level cepstral features. They found impressive performance using Gabor-style features. Since the task was speech activity detection, though, it is possible the features were only able to capture general speech characteristics and not more fine-grained phonetic distinctions.
Estimating Classifier Performance in Unknown Noise by Ehsan Variani, Hynek Hermansky: They consider methods to identify regions of speech that have distortions not seen in the training set. I like that the approach does not require an initial labeling of the words or phones in the test utterance. Assuming this classifier works well, it could potentially be useful in many downstream processes.
I would also like to present my favorite paper title of the conference. Nice to see a title that also functions as an abstract.
Correlation Between Vocal Tract Length, Body Height, Formant Frequencies, and Pitch Frequency for the Five Japanese Vowels Uttered by Fifteen Male Speakers by Hiroaki Hatano, Tatsuya Kitamura, Hironori Takemoto, Parham Mokhtari, Kiyoshi Honda, Shinobu Masaki