I realize this post is about six months late, but I wanted to get it out there anyway.
I attended ICASSP 2012 in Kyoto, Japan, in late March; I had a great time and saw many interesting presentations and posters. The hot topics this year appeared to be subspace GMMs, deep belief networks (DBNs), and non-negative matrix factorization (NMF). Without at least a high-level understanding of these topics, many of the papers in the sessions of interest to me would have been difficult to follow.
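For readers unfamiliar with NMF, here is a minimal sketch of the classic multiplicative-update algorithm (Lee & Seung) in numpy. The matrix sizes, rank, and iteration count are arbitrary illustrative choices, not taken from any particular ICASSP paper; in speech work V would typically be a magnitude spectrogram.

```python
import numpy as np

rng = np.random.default_rng(0)

V = rng.random((20, 30))          # non-negative data matrix (e.g. a magnitude spectrogram)
k = 5                             # number of latent components (illustrative choice)
W = rng.random((20, k)) + 1e-3    # basis vectors, initialized strictly positive
H = rng.random((k, 30)) + 1e-3    # activations, initialized strictly positive

eps = 1e-9                        # guard against division by zero
for _ in range(200):
    # Multiplicative updates for the Frobenius objective ||V - WH||^2;
    # they preserve non-negativity because every factor is non-negative.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

error = np.linalg.norm(V - W @ H)
```

The appeal for audio is that the learned bases and activations are additive parts, which often correspond to interpretable spectral events.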
Below is a list of papers I found particularly interesting. These are not necessarily the best papers from the conference, but simply papers where I remember being intrigued by the poster or presentation.
Sequential Deep Belief Networks by Galen Andrew, Jeff Bilmes: This is the first paper I have seen adapting the recently popular DBN for sequence modeling tasks. They demonstrate improved phone recognition results on TIMIT.
Understanding How Deep Belief Networks Perform Acoustic Modelling by Abdel-Rahman Mohamed, Geoffrey Hinton, Gerald Penn: Spectral features are shown to perform better than cepstral features on the TIMIT phone recognition task. They attempt to explain these surprising results through several additional experiments and visualizations of the high-dimensional features. The results raise many interesting questions about feature choice in the context of DBNs: for instance, is the common practice of altering the features to match the model's assumptions still appropriate?
Multilevel Speech Intelligibility for Robust Speaker Recognition by Sridhar Krishna Nemala, Mounya Elhilali: If I remember correctly, they attempt to build a classifier that matches human subject judgements about the intelligibility of speech segments in noise. A signal can then be segmented by its intelligibility. The technique is presented as an alternative to the typical voice activity detection used in speaker recognition systems. Their results were intriguing and I wonder if this technique could be incorporated into a speech recognition system.
Classification and Recognition with Direct Segment Models by Geoffrey Zweig: His previous paper on segmental conditional random fields presented an exciting approach to applying CRFs to word recognition. One major limitation is the initial identification of segments; his original work starts from an HMM lattice. This paper is a step toward removing that limitation. Results are again presented for phone recognition on TIMIT; it will be interesting to see whether this can be further adapted to word recognition.
Discriminative Training for Speech Recognition is Compensating for Statistical Dependence in the HMM Framework by Dan Gillick, Larry Gillick, Steven Wegmann: A paper that attempts to deepen our understanding of standard techniques. They found that if they resampled the test data so that the frames were truly conditionally independent, not only did performance improve dramatically, but discriminative training also performed no better than generative training. I suspect this paper will lead to some very interesting follow-up work.
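To make the resampling idea concrete, here is a toy sketch of how I understood it: given a frame-level state alignment and per-state (diagonal) Gaussian emission models, each frame is drawn independently from its state's distribution, so the resampled frames are conditionally independent given the state sequence, exactly as the HMM assumes. The alignment and Gaussian parameters below are made-up placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

dim = 13                                   # e.g. MFCC dimensionality
n_states = 3
means = rng.normal(size=(n_states, dim))   # per-state emission means (toy values)
stds = np.abs(rng.normal(size=(n_states, dim))) + 0.1  # per-state std devs (toy values)

# Toy forced alignment: which HMM state generated each of 7 frames
alignment = np.array([0, 0, 1, 1, 1, 2, 2])

# Resample: frame t ~ N(means[s_t], diag(stds[s_t]^2)), independently per frame,
# so any dependence between neighboring frames beyond the state sequence is gone.
resampled = rng.normal(means[alignment], stds[alignment])
print(resampled.shape)  # (7, 13)
```

On such data the model's independence assumption actually holds, which is what lets the authors isolate how much of discriminative training's usual gain is just compensation for that assumption being violated on real speech.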