I am not sure if the general topics were more in my current areas of interest or if the quality of the work was higher than normal, but I found a large number of papers very interesting this year—many more than I could possibly describe. Below are just a few that I found particularly interesting.
Language Independent and Unsupervised Acoustic Models for Speech Recognition and Keyword Spotting by Kate M. Knill, Mark J.F. Gales, Anton Ragni, Shakti P. Rath:
Building truly multilingual speech recognition systems has long been a goal in the field. This work explores building multilingual acoustic models for a highly challenging dataset, the IARPA Babel data. All languages are mapped to a common phone set. Surprisingly, the acoustic models actually produce results that are not garbage, even without any in-language training data. They further use the language-independent system to transcribe in-language data—as with other work on this project, they assume a limited lexicon and language model exist. This further improves results and actually meets the targets for the task on some of the easier Babel languages. This is particularly impressive when you consider how poor the language model actually is.
Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling by Haşim Sak, Andrew Senior, Françoise Beaufays:
I think this was a highly anticipated talk. The entire room was filled and plenty of people were standing too (though it was like that for much of this session). The LSTM has been applied to LVCSR before, but this paper presented Google’s current best results on their in-house dataset. Two aspects of the results are particularly interesting. Unlike standard DNN architectures, LSTMs do not need a large window of frames as input; only a single frame is given at each step. And Google’s best-performing system not only beats their traditional DNN system, but does so with only 10% of the parameters.
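The single-frame input makes the parameter savings easy to see with back-of-the-envelope arithmetic. Below is a toy comparison, counting weights only (all layer sizes are my own illustrative assumptions, not the paper's actual configurations, and biases and any projection layers are ignored):

```python
def dnn_params(window, feat_dim, hidden, layers, outputs):
    """Weight count for a fully connected DNN on a stacked frame window."""
    sizes = [window * feat_dim] + [hidden] * layers + [outputs]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

def lstm_params(feat_dim, cells, outputs):
    """Weight count for one LSTM layer: 4 gates, each with input and
    recurrent weight matrices, plus an output layer."""
    return 4 * (feat_dim * cells + cells * cells) + cells * outputs

# Illustrative sizes: 11-frame window of 40-dim features vs. one frame.
dnn = dnn_params(window=11, feat_dim=40, hidden=2048, layers=4, outputs=8000)
lstm = lstm_params(feat_dim=40, cells=800, outputs=8000)
print(dnn, lstm)  # the LSTM is far smaller
```

With these made-up sizes the LSTM needs roughly a third of the DNN's weights; the 10% figure Google reports of course depends on their actual architectures.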
Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR by Zoltán Tüske, Pavel Golik, Ralf Schlüter, Hermann Ney:
Note that this paper also won a best student paper award. Recognition with DNNs from the raw time signal has been tried several times before, but the results have been poor. In the cases where reasonable results were obtained, some type of transformation was actually applied to the time signal, so it was not truly the raw time signal. In this work, the only transformation applied to the time signal is mean and variance normalization. Their initial results found the raw signal performed dramatically worse than typical speech features. Instead of being satisfied with a negative result, they continued their investigation. By switching to rectified linear unit (ReLU) activations, they saw a rather large increase in performance. When they increased the amount of training data (from 50 hours to 250 hours), they saw another large gain. It may be that the DNN simply requires more data to learn when not supplied with expertly crafted features. In the end, their best raw-signal result was only 10% relative worse than standard MFCC features, an impressive and exciting result.
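Since mean and variance normalization is the only preprocessing the raw-signal system gets, it is worth seeing just how little that is. A minimal sketch (the per-signal scope, epsilon, and sample values are my assumptions):

```python
import numpy as np

def normalize(signal, eps=1e-8):
    """Zero-mean, unit-variance normalization of a raw time signal.
    This is the entire feature pipeline in the raw-signal setup."""
    signal = np.asarray(signal, dtype=np.float64)
    return (signal - signal.mean()) / (signal.std() + eps)

x = normalize(np.array([0.1, -0.3, 0.5, 0.2]))
print(x.mean(), x.std())  # approximately 0 and 1
```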
Word Embeddings for Speech Recognition by Samy Bengio, Georg Heigold:
While many labs have recently moved to using DNNs for acoustic modeling in place of GMMs, the same tied-state structure remains. This work attempts to partially change this situation by using a DNN to directly predict words. It reminds me of the segmental CRF work in the sense that it tries to break out of the frame, but it still requires an initial segmentation. An obvious approach would be to train a DNN with a separate target for each word. The downside is the inability of such a system to handle OOV words. This paper avoids this pitfall by embedding words in a feature space where any word can be represented. While the system does not outperform a baseline, the results are good enough to make the approach intriguing. Combined with a standard system, they obtain a tiny gain.
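As a rough illustration of why an embedding sidesteps the OOV problem, here is a toy sketch. The character-bigram featurization and all names here are my invention, not the paper's embedding (which is learned); the point is only that any string gets a vector, so any word, including one never seen in training, can be scored against a prediction:

```python
import numpy as np

def word_embedding(word, dim=64):
    """Toy deterministic embedding built from character bigrams
    (purely illustrative; the paper learns its embedding)."""
    vec = np.zeros(dim)
    padded = f"#{word}#"
    for a, b in zip(padded, padded[1:]):
        vec[(ord(a) * 31 + ord(b)) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def nearest_word(predicted, vocab):
    """Pick the vocabulary word whose embedding best matches the
    predicted embedding (cosine similarity on unit vectors)."""
    return max(vocab, key=lambda w: predicted @ word_embedding(w))

vocab = ["speech", "speak", "recognition"]
print(nearest_word(word_embedding("speach"), vocab))
```

Nothing restricts `vocab` to words seen in training, which is exactly what the one-target-per-word DNN cannot offer.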
Autoregressive Product of Multi-Frame Predictions Can Improve the Accuracy of Hybrid Models by Navdeep Jaitly, Vincent Vanhoucke, Geoffrey Hinton:
One of the reasons I liked this work is that it uses the DNN to jointly predict the states for a sequence of frames. While not the first work to explore this, it was the first time I had seen it. During decoding the system still only moves through the signal one step at a time, which produces multiple estimates of the target at each frame. By combining the scores with the geometric mean, they found a nice gain over the baseline system. There probably was not room in the paper, but I would have liked to see further analysis. For instance, what were the frame error rates when predicting at the various offsets? I am curious what the difference in accuracy is between the model that predicts the frame at t = -3 and at t = 0.
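The combination step, taking the geometric mean of the per-frame state posteriors produced at the different offsets, can be sketched as follows (the array contents are illustrative, not from the paper):

```python
import numpy as np

def geometric_mean_combine(posteriors, eps=1e-12):
    """Combine several posterior estimates for the same frame.
    posteriors: (num_predictors, num_states), each row a distribution.
    The geometric mean is an average in log space, renormalized."""
    log_p = np.log(np.asarray(posteriors) + eps)
    combined = np.exp(log_p.mean(axis=0))
    return combined / combined.sum()

# Three offsets each predicting the same frame's 3-state posterior.
p = geometric_mean_combine([[0.7, 0.2, 0.1],
                            [0.6, 0.3, 0.1],
                            [0.8, 0.1, 0.1]])
print(p)  # state 0 dominates after combination
```

Working in log space makes the geometric mean numerically cheap, and, unlike the arithmetic mean, it penalizes any state that even one predictor considers unlikely.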