Interspeech and ISCSLP 2014 Overview

The 2014 installment of the Interspeech conference was held in Singapore. Below is a brief recap and links to more detailed notes regarding the talks and papers.

On the weekend prior to the conference, the International Symposium on Chinese Spoken Language Processing (ISCSLP) was also held. Each day began with a keynote, and all three speakers were well-known researchers in the field who gave interesting talks. Many of the papers were quite interesting and would have appealed to the broader Interspeech audience; the techniques just happened to be applied to Chinese-language datasets.

The main Interspeech conference also had its share of interesting keynotes. While the papers covered a wide range of topics, the lion’s share of the interest was in the deep learning sessions. Low-resourced languages and keyword spotting were also major topics this year. I had been attributing this solely to the IARPA Babel project, but there were a fair number of papers not attached to the IARPA project. A few of the papers were of particular interest to me.

And as always, a shameless plug for my own work.


Comparing Decoding Strategies for Subword-based Keyword Spotting in Low-Resourced Languages

I presented this paper at the 2014 Interspeech conference[1]. The work was done with Viet-Bac Le, Abdel Messaoudi, Lori Lamel, and Jean-Luc Gauvain. It was supported by the IARPA Babel project.

This is a continuation of some of our previous work in detecting out-of-vocabulary (OOV) keywords in the context of the IARPA project. We more fully explored how the types of subword units used in decoding affected the final results.

The first approach is to simply decode using the standard word-based lexicon. To detect OOV keywords, we convert the word lattice to a subword lattice and then build a consensus network from it. The consensus network introduces sequences of subwords that were not present in the original lattice; if we searched only the converted subword lattice itself, we would not expect to find many OOV keywords. This process recovers some of the keywords, but performance is still much worse than for in-vocabulary keyword detection.

An alternative approach is to decode using subword units. There are many possible ways to segment the original lexicon into a set of subwords; we explore several approaches in this work. In all cases, decoding with subword units and then searching provides a large improvement over the lattice conversion approach. The downside is that multiple decodings are now required—a word-based decoding for in-vocabulary keywords and a subword decoding for OOV keywords. We also found that combining the results from multiple decodings using different types of subword units further improves results. Of course, this increases the number of decodings required, increasing the computational cost. In the end, it becomes a trade-off between performance and speed.
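The paragraph above does not spell out how hits from the different decodings are merged, so the sketch below is only my own illustration of one common strategy: pool the detection lists and average the scores of same-keyword hits that overlap in time. The Hit structure, the overlap threshold, and the score averaging are all assumptions for the example, not the exact method from the paper.

```python
from collections import namedtuple

Hit = namedtuple("Hit", ["keyword", "start", "end", "score"])

def overlaps(a, b, min_ratio=0.5):
    """True if two hits overlap sufficiently in time."""
    inter = min(a.end, b.end) - max(a.start, b.start)
    return inter > min_ratio * min(a.end - a.start, b.end - b.start)

def combine(system_hits):
    """Merge hit lists from several decodings; average the scores of
    overlapping hits for the same keyword."""
    merged = []  # list of [representative_hit, list_of_scores]
    for hits in system_hits:
        for h in hits:
            for rep, scores in merged:
                if rep.keyword == h.keyword and overlaps(rep, h):
                    scores.append(h.score)
                    break
            else:
                merged.append([h, [h.score]])
    return [Hit(r.keyword, r.start, r.end, sum(s) / len(s)) for r, s in merged]

# Two decodings detecting the same putative OOV keyword at slightly different times
sys_a = [Hit("keyword_a", 12.30, 12.85, 0.64)]
sys_b = [Hit("keyword_a", 12.35, 12.90, 0.72)]
print(combine([sys_a, sys_b]))
```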

The final approach attempts to reduce the number of decodings required. Instead of using a single type of subword unit, all types (including the original words) are combined into a single language model. Now, when decoding is performed, all subword units can appear in the lattice. Unfortunately, this approach was not as good as combining multiple decodings. The performance is similar to the average of all the other systems—better than the worst system, but not as good as the best. We also see a small drop in performance for the in-vocabulary keywords. Overall, it provides the best trade-off between in-vocabulary and OOV performance for a single system.

In all cases, we search for keywords by looking for exact matches. Other approaches also consider inexact matches. While we do not report results in this work, we have done some preliminary experiments using inexact matches. When carefully tuned, it does provide nice gains over simply searching for exact matches. It will be interesting to see if our conclusions in this paper regarding the performance of various decoding approaches still hold when allowing inexact matches.

[1] W. Hartmann, V.-B. Le, A. Messaoudi, L. Lamel, and J.-L. Gauvain, “Comparing Decoding Strategies for Subword-based Keyword Spotting in Low-Resourced Languages,” Proceedings of Interspeech, pp. 2764-2768, 2014. (Preprint, Postprint).


Interesting Papers from Interspeech 2014

I am not sure if the general topics were more in my current areas of interest or if the quality of the work was higher than normal, but I found a large number of papers very interesting this year—many more than I could possibly describe. Below are just a few that I found particularly interesting.

Language Independent and Unsupervised Acoustic Models for Speech Recognition and Keyword Spotting by Kate M. Knill, Mark J.F. Gales, Anton Ragni, Shakti P. Rath:
Building truly multilingual speech recognition systems has long been a goal in the field. This work explores building multilingual acoustic models for a highly challenging dataset, the IARPA Babel data. All languages are mapped to a common phone set. Surprisingly, the acoustic models produce results that are not garbage, even without any in-language training data. They further use the language-independent system to transcribe in-language data—as with other work on this project, they assume a limited lexicon and language model exist. This further improves results and actually meets the targets for the task on some of the easier Babel languages. This is particularly impressive when you consider how poor the language model actually is.

Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling by Haşim Sak, Andrew Senior, Françoise Beaufays:
I think this was a highly anticipated talk. The entire room was filled and plenty of people were standing too (though it was like that for much of this session). LSTMs have been applied to LVCSR before, but this paper presented Google’s current best results on their in-house dataset. Two aspects in particular are interesting. Unlike standard DNN architectures, LSTMs do not need a large window of frames as input; only a single frame is given at each step. Also, Google’s best-performing system not only beats their traditional DNN system, but does so with only 10% of the parameters.

Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR by Zoltán Tüske, Pavel Golik, Ralf Schlüter, Hermann Ney:
Note that this paper also won a best student paper award. Recognition with DNNs from the raw time signal has been tried several times before; however, the results have been poor. In the cases where reasonable results were obtained, some type of transformation was actually applied to the time signal, so it was not truly the raw signal. In this work, the only transformation applied to the time signal is mean and variance normalization. Their initial results found the raw signal performed dramatically worse than typical speech features. Instead of being satisfied with a negative result, they continued their investigation. By switching to rectified linear unit (ReLU) activations, they saw a rather large increase in performance. When they increased the amount of training data (from 50 hours to 250 hours), they saw another large increase. It may be that the DNN simply requires more data to learn when not supplied with expertly crafted features. In the end, they obtained a result only 10% worse relative to standard MFCC features, an impressive and exciting result.
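For concreteness, here is a minimal sketch of the kind of preprocessing described, i.e. nothing beyond mean and variance normalization of the waveform samples. The function name and the per-utterance granularity are my own assumptions, not the authors’ code.

```python
import numpy as np

def normalize_waveform(samples):
    """Per-utterance mean and variance normalization of raw time-signal samples."""
    samples = np.asarray(samples, dtype=np.float64)
    return (samples - samples.mean()) / (samples.std() + 1e-8)

# Windows of the normalized samples are then fed directly to the network,
# with no filterbank or MFCC extraction in between.
```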

Word Embeddings for Speech Recognition by Samy Bengio, Georg Heigold:
While many labs have recently moved to using DNNs for acoustic modeling as opposed to GMMs, the same tied-state structure remains. This work attempts to partially change that by using a DNN to directly predict words. It reminds me of the segmental CRF work in the sense that it tries to break out of the frame, but it still requires an initial segmentation. An obvious approach would be to train a DNN with a separate target for each word. The downside is the inability of such a system to handle OOV words. This paper avoids that pitfall by embedding the word in a feature space in which any word can be represented. While the system does not outperform a baseline, the results are good enough to be intrigued by the approach. Combined with a standard system, they obtain a tiny gain.

Autoregressive Product of Multi-Frame Predictions Can Improve the Accuracy of Hybrid Models by Navdeep Jaitly, Vincent Vanhoucke, Geoffrey Hinton:
One of the reasons I liked this work was that it used the DNN to jointly predict the states for a sequence of frames. While not the first work to explore this, it was the first time I had seen it. During decoding, the system still moves through the signal one step at a time, which produces multiple estimates for the target at each frame. By combining the scores using the geometric mean, they found a nice gain over the baseline system. There probably was not room in the paper, but I would have liked to see further analysis. For instance, what were the frame error rates when predicting at the various offsets? I am curious what the difference in accuracy is between the model that predicts the frame at t = -3 and the one at t = 0.
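To make the combination step concrete, here is a small sketch of what a geometric-mean combination of per-frame posterior estimates could look like; the renormalization and the array shapes are my own choices for illustration, not details from the paper.

```python
import numpy as np

def combine_geometric(posteriors):
    """Combine several posterior estimates for the same frame with a geometric mean.

    posteriors: shape (num_offsets, num_states); each row is the prediction for
    this frame made by the network at a different offset.
    """
    log_post = np.log(np.clip(posteriors, 1e-12, None))
    combined = np.exp(log_post.mean(axis=0))
    return combined / combined.sum()  # renormalize into a distribution

# Three offsets predicting the same frame over four states
p = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.6, 0.2, 0.1, 0.1],
              [0.5, 0.3, 0.1, 0.1]])
print(combine_geometric(p))
```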


Keynotes from Interspeech 2014

Interspeech had a great selection of speakers this year. My thoughts on the individual talks are below. Note that Interspeech had five keynotes, but I only describe three; unfortunately, I was unable to attend the other two. Their absence here does not indicate a lack of quality or interest, just my inability to wake up.

Anne Cutler, ISCA Medalist
Learning about Speech

As the newest ISCA Medalist, Anne had the privilege of giving the first talk of the conference. She began by advertising a large number of PhD and postdoc positions (though they do not appear to be posted yet), thanks to a recent large grant.

Much of her work deals with language learning of infants. She played a recording from inside the womb of a mother—not sure how they get that microphone in there—and some aspects were remarkably clear. While the speech itself was not understandable, the gender of the speaker and the prosody were obvious. Infants begin speech learning in the womb, specifically during the final trimester.

When infants are born, they already have a preference for known speakers and for languages that are similar to their native tongue. Contrary to popular belief, infants can cope with continuous speech at a very young age (10 months): they are able to pick out and recognize words presented in a continuous utterance. Even more interesting, the type of language used when speaking to infants (Motherese) appears to be controlled by feedback from the infant. Your child controls your speech.

She also presented some interesting results demonstrating how learning can shape perception. For instance, babies seem to be able to discriminate between speakers only when those speakers are using the babies’ native language.

The adaptation experiments were also interesting. They were based on the distinction between /s/ and /f/. Listeners were given a few examples of words where the /s/ phone had been replaced with a phone more similar to /f/—and vice versa—using words where the /s/–/f/ distinction does not produce a confusable pair. Afterwards, they were presented with words that were confusable, and it was shown that their perception of the phone had shifted. Humans can quickly adapt their speech perception to cope with new speakers, a task that is still very difficult for machines.

Her final comment contrasted human perception with machine learning. I believe her point was that human beings are perceptual animals and that we are highly motivated to learn in this setting; maybe our machine learning algorithms lack this motivation. I am not sure if it was a tongue-in-cheek comment or a serious one. If she is implying that our objective functions are not ideal, then there is some truth in that.

Lori Lamel
Language Diversity: Speech Processing In A Multi-Lingual Context

Lori began by discussing multilingual models. Multilingual modeling was mostly a string of failures until recently; some labs are now starting to see improvements with multilingual bottleneck features and even hybrid acoustic models. The current resurgence may be thanks in part to the development of standardized corpora. The number of available datasets in a variety of languages continues to increase, and publishing on standard datasets is always easier than trying to publish on your own private data. There is still a heavy reliance on annotated language resources, which requires a large amount of human effort.

The focus then changed to unsupervised acoustic model training, which has been a focus at Limsi for more than ten years now. She showed some examples of why pronunciation models are crucial: with incorrect pronunciations, the alignments will be incorrect, and this carries over to the acoustic model, leading to poor models for certain contexts. I can see how this can be a problem in the supervised case, so I understand why it may be even worse in the unsupervised case.

A brief overview of the IARPA Babel project followed. One point was to highlight how much worse the performance for Babel is compared to previous CTS work. More interesting were her language analysis results. One example was the breathiness at the end of French words. This is a property that appeared relatively recently and has slowly increased through the years. Their analysis of French broadcast news data confirmed this.

Li Deng
Achievements and Challenges of Deep Learning – From Speech Analysis And Recognition To Language And Multimodal Processing

This was a very dense talk. Luckily Li Deng is a very engaging speaker, so the audience tried to keep up. His first point was that too many people equate Deep Learning with Deep Neural Networks. They are not the same thing; a DNN is only one kind of deep model. Deep generative models also exist. Li referred to the large amount of work presented at this past ICML.

Another detailed slide was on the differences between generative models and neural networks, focusing on their strengths and weaknesses. One of the obvious—and frequently talked about—advantages of generative models is their interpretability. Complaints that you cannot know what a neural network is doing are common. He highlighted this advantage not only because it is intellectually satisfying, but also because it means you can more easily add explicit knowledge constraints to the model. Another advantage is the ability of generative models to handle uncertainty.

A major cited advantage of neural networks is their ease of computation. This may seem counterintuitive considering how much effort it takes to train the models, but he was referring to the fact that neural networks basically require performing the same operation billions of times. This can be heavily parallelized, and GPUs can be used to greatly speed up the entire process.

It is difficult to give an overview of the talk as it was so detailed and he hit so many major points. If there was a main point, I think it was that there is more to deep learning than DNNs. Also, the combination of neural networks with generative models is an exciting and promising direction.


Interesting Papers from ISCSLP 2014

Below are several papers from the 9th International Symposium on Chinese Spoken Language Processing (ISCSLP) that I found particularly interesting.

Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition by Shaofei Xue, Hui Jiang and Lirong Dai:
Adapting DNNs to speaker or environment characteristics is difficult: the large number of parameters means a large amount of adaptation data is required. The authors propose a method that handles the adaptation with far less data. The trick is to decompose each weight matrix with SVD; the adaptation is then performed only on the singular values from the decomposition. The authors demonstrate modest gains on Switchboard using only a small amount of adaptation data.
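A rough sketch of the idea as I read it; the layer size, the retained rank, and the update details below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def svd_factorize(W, k):
    """Factor a weight matrix W ~= U_k @ diag(s_k) @ Vt_k using the top k singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Adaptation then re-estimates only the k singular values (or a small k x k
# matrix inserted between U_k and Vt_k), keeping U_k and Vt_k fixed, so only
# a handful of parameters per layer are learned from the adaptation data.
W = np.random.randn(1024, 1024)           # illustrative layer size
U_k, s_k, Vt_k = svd_factorize(W, k=128)  # illustrative rank
adapted_s = s_k.copy()                    # these values would be updated on adaptation data
W_adapted = U_k @ np.diag(adapted_s) @ Vt_k
```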

Research on Deep Neural Network’s Hidden Layers in Phoneme Recognition by Yuan Ma, Jianwu Dang and Weifeng Li:
Previously, there has been work trying to understand what the different layers of a DNN correspond to in computer vision applications like facial recognition and digit recognition. The authors attempted to perform a similar study for phonetic recognition. I appreciated the idea, though I do think a different approach will be required to discover the relationship between the hidden layers and the phonetics. The presenter also had my favorite response of the whole workshop. Someone questioned the purpose of the study and wanted to know why it was of interest. She responded with, “Um, I think, because it is science.”

Decision Tree based State Tying for Speech Recognition using DNN Derived Embeddings by Xiangang Li and Xihong Wu:
This paper is in line with some recent work on removing the dependence on the GMM for building a CD-DNN-HMM. In this case, they perform the state clustering using the DNN instead of the GMM. By taking the final hidden layer of the DNN, they create an embedding for the individual states. The state tying is then performed by clustering these embeddings; states which are confusable in the original model end up being clustered together.
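As a toy illustration of clustering DNN-derived state embeddings, here is a plain k-means sketch. Note that the paper itself builds a decision tree over the embeddings, so k-means is just a simpler stand-in to show the general idea, and the embedding dimensions and counts below are invented.

```python
import numpy as np

def tie_states(state_embeddings, num_tied_states, num_iters=20, seed=0):
    """Cluster per-state embeddings (e.g. averaged final-hidden-layer activations)
    so that states in the same cluster share one tied output target."""
    rng = np.random.default_rng(seed)
    X = np.asarray(state_embeddings, dtype=np.float64)
    centers = X[rng.choice(len(X), size=num_tied_states, replace=False)]
    for _ in range(num_iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)                 # nearest center per state
        for c in range(num_tied_states):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return assign  # tied-state id for every context-dependent state

# Random embeddings standing in for real averaged hidden-layer activations
emb = np.random.randn(500, 128)
print(tie_states(emb, num_tied_states=32)[:10])
```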

Speech Separation Based on Improved Deep Neural Networks with Dual Outputs of Speech Features for Both Target and Interfering Speakers by Yanhui Tu, Jun Du, Yong Xu, Lirong Dai and Chin-Hui Lee:
The authors presented an approach for speech separation based on DNNs. Given the mixture, the DNN produces both target and interference signals. During testing it is assumed the system has seen the target speaker before, but not the interfering signal. It is interesting work, but I do not know how it generalizes to the case when the target speaker has never been seen.


Keynotes from ISCSLP 2014

The 9th International Symposium on Chinese Spoken Language Processing (ISCSLP) was held in Singapore on the weekend before Interspeech. Each day began with a keynote talk. I enjoyed all three talks, and describe them below.

Michiel Bacchiani
Large Scale Neural Network Optimization for Mobile Speech Recognition Applications

While the title mentions optimizations for mobile speech applications, that topic seemed absent from the talk. It was mostly an advertisement-style talk, similar to other talks from Google I have seen. However, in addition to the standard Google advertisement, there was some interesting information.

Michiel claimed this was the “golden age of speech recognition”. Of course, people have been saying this for many years, but he did try to provide some evidence. For instance, he showed a clip of Saturday Night Live mocking standard dialogue systems from 2007. It highlighted not only how poorly the systems performed, but also that the general population recognized it. Contrast that with today, where voice is a commonly accepted way to interact with our phones and other devices, and it has become profitable for companies to develop devices where speech is the primary modality for interaction.

Google has also recently been working on removing the GMM from speech recognition. Until now, a DNN required an initial GMM to perform the initial labeling of the data along with the state tying. They can now get similar performance by training the DNN from a flat start without requiring an initial GMM. This is a more complicated process than you may realize because of the priors. Since the DNN estimates state posteriors rather than the likelihoods a GMM provides, a state prior is required to convert the posteriors into scaled likelihoods. A poor prior can lead to a poor DNN model due to errors in alignment, so they find it important to frequently update the prior during training.
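To make the prior issue concrete, here is a minimal sketch of the standard hybrid-system bookkeeping (my own illustration, not Google’s recipe): posteriors are divided by a state prior to get scaled likelihoods, and the prior is re-estimated from the current alignments.

```python
import numpy as np

def pseudo_log_likelihoods(log_posteriors, log_prior):
    """The DNN outputs state posteriors p(s|x); decoding and alignment want
    scaled likelihoods p(x|s) proportional to p(s|x) / p(s), i.e. subtract the log prior."""
    return log_posteriors - log_prior

def estimate_log_prior(alignment_counts, smoothing=1.0):
    """Re-estimate the state prior from the current alignment counts."""
    counts = np.asarray(alignment_counts, dtype=np.float64) + smoothing
    return np.log(counts / counts.sum())

# In a flat-start setup the prior is refreshed often as the alignments change;
# a stale prior skews the pseudo-likelihoods and hence the next alignment.
```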

Finally, he discussed some more recent work with long short-term memory (LSTM) models. The major takeaway is that they can achieve performance similar to a DNN with a much smaller model. I think this will be an active area of research in the future: finding alternative models that are similar to DNNs but require fewer parameters.

Tanja Schultz
Multilingual Automatic Speech Recognition for Code-switching Speech

Multilingual speakers outnumber monolingual speakers worldwide. Given this, it is surprising how little work has been done in the area of code-switching. It is a difficult task due to the lack of training data and the strong speaker dependency. This was a perfect talk for Singapore: the mixture of languages here and the fluency with which many people switch back and forth (even at the word level) is a perfect illustration of the problem.

While I tend to only think about acoustic modeling issues within this domain, Tanja discussed difficulties in designing a lexicon (including the acoustic unit inventory), training an acoustic model, and building a language model.

In designing a lexicon, the simplest approach is to merge lexicons from multiple languages. This raises two main issues. First, two languages may share homographs whose pronunciations and semantics differ greatly. Second, there is the question of the acoustic unit set: even if you use something like IPA, it is questionable whether nominally identical phones from different languages are actually identical. Tanja also introduced a tool developed in her lab for dealing with these issues, the Rapid Language Adaptation Toolkit (RLAT).
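The homograph problem is easy to show with a toy example. The sketch below is my own illustration (not from the talk) of one simple bookkeeping trick, tagging words and phones with their language; whether units should instead be shared across languages is exactly the open question above. The phone symbols are made up for the example.

```python
# Merging pronunciation lexicons from two languages while tagging entries by
# language, so homographs with different pronunciations and meanings do not collide.
lexicon_en = {"chat": ["CH", "AE", "T"]}   # English "chat" (to talk)
lexicon_fr = {"chat": ["SH", "AA"]}        # French "chat" (cat)

merged = {}
for lang, lex in (("en", lexicon_en), ("fr", lexicon_fr)):
    for word, phones in lex.items():
        # Tag the word (and its phones) with the language so identically
        # spelled words keep their own pronunciations and acoustic units.
        merged[f"{word}_{lang}"] = [f"{p}_{lang}" for p in phones]

print(merged)
```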

For acoustic models, all previous work has basically shown that a monolingual model outperforms multilingual models. She claimed that this was not true of more recent work. I know there has been some success in using multilingual bottleneck features, but I do not think the evidence for acoustic models is clear yet.

Language modeling is probably the least investigated aspect of this task, but potentially the most difficult. Since code-switching is a phenomenon of conversational speech, finding adequate amounts of text for language model training is nearly impossible. In addition, there are no fixed rules for code-switching, and the variation between speakers makes general models of code-switching ineffective.

Yifan Gong
Selected Challenges and Solutions for DNN Acoustic Modeling

The talk was mostly a brief synopsis of multiple DNN research topics being investigated at Microsoft, and I found a couple of topics particularly interesting. As with Tanja’s talk, Yifan discussed multilingual acoustic models. The standard DNN approach was used: a single DNN trained on multiple languages, with a language-dependent final output layer. In this case, they had hundreds of hours of audio in other languages and very limited training data in the target language. I do wonder if the improvements disappear in the case where you have greater amounts of untranscribed in-language data.

He also presented work on reducing the size of DNNs while minimizing the impact on accuracy. One drawback of DNNs is their large number of parameters, which is especially problematic for mobile applications. In this work, they trained a large DNN and a small DNN jointly: after updating each network for a particular batch, they apply an additional update to the small DNN to minimize the KL divergence between the two networks’ outputs. Although this approach does not reduce the cost of training the initial system, it allows a smaller model to be used during decoding.
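A rough sketch of the extra loss term as I understood the description; the direction of the KL, the batching, and how it is weighted against the usual cross-entropy are my assumptions rather than details from the talk.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_to_teacher(small_logits, large_logits):
    """Mean KL(teacher || student) between the large and small networks' output
    distributions for a batch of frames; this term is added to the small model's loss."""
    p = softmax(large_logits)   # teacher (large DNN) posteriors
    q = softmax(small_logits)   # student (small DNN) posteriors
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1).mean()

# 4 frames, 10 output states, random logits just to exercise the function
print(kl_to_teacher(np.random.randn(4, 10), np.random.randn(4, 10)))
```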

Yifan ended his presentation by stressing that robustness is still a major research area for DNNs. He supported this with experimental results showing that DNNs are not necessarily more robust to variation than GMMs (overall performance is better, but the relative effects of different types of variation are similar).


The 2014 Workshop on Spoken Language Technology for Under-Resourced Languages (SLTU)

I recently attended the SLTU workshop in St. Petersburg, Russia, where I presented my first paper on keyword spotting. As is obvious from the name, the focus was on under-resourced languages. The workshop was spread over 2.5 days, and every paper was presented as a talk.

Overall, I enjoyed the workshop. I met some interesting people and saw some good presentations. A few of the papers that were particularly interesting to me are described below.

Speech Recognition and Keyword Spotting for Low Resource Languages: Babel Project Research at CUED by Mark J.F. Gales, Kate M. Knill, Anton Ragni, and Shakti P. Rath:
This paper was of particular interest to me since I also work on the IARPA Babel Project. It is always interesting to get the perspective of a different lab. The major topics covered were deep neural network acoustic models, data augmentation, and zero-resource training. Surprisingly, they were able to build reasonable systems with no in-language acoustic training data.

Query-by-Example Spoken Term Detection Evaluation on Low-Resource Languages by Xavier Anguera Miro, Luis Javier Rodriguez Fuentes, Igor Szöke, Andi Buzo, Florian Metze, and Mikel Penagarikano:
As opposed to the keyword spotting work done in the context of the IARPA Babel project, this paper describes a query-by-example keyword spotting challenge. The main difference is that search is not performed based on the lexical representation of a word; instead, each query is given as an acoustic example. For the challenge described in the paper, no knowledge of the text, or even of the language, is given. Participants must search through 20 hours of untranscribed audio in a variety of languages.

Adapting Multilingual Neural Network Hierarchy to a New Language by Frantisek Grezl and Martin Karafiat:
BUT has been achieving tremendous results with their DNN systems—both in the tandem and hybrid frameworks. When they write a paper describing their work with DNNs, it is usually well worth the read. This work focuses on issues surrounding limited in-domain training data and the desire to quickly build new recognition systems. They compensate for the limited in-domain data by first training a large bottleneck feature system using data from a variety of languages. When a new language is presented, the system is quickly adapted with a small amount of in-domain data.

Combining Grapheme-to-Phoneme Convertor Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios by Tim Schlippe, Wolf Quaschningk, and Tanja Schultz:
This work was interesting for two reasons. First, they show G2P results for a variety of approaches (including Moses, Sequitur, and Phonetisaurus); it is always nice to see comparisons like this. Given all of these approaches, they present a method for combining the outputs of multiple systems. While it does improve the pronunciations in terms of phone accuracy when compared to a gold standard, it seems to have little effect on ASR performance. However, the improved accuracy could be important for speech synthesis, and the combination could also improve pronunciations for long-tail words that have little effect on the traditional WER metric.

Cross-Language Mapping for Small-Vocabulary ASR in Under-Resourced Languages: Investigating the Impact of Source Language Choice by Anjana Vakil and Alexis Palmer:
The most interesting aspect of this work is that it succeeded in making me care about a small-vocabulary task. The motivation for the work is strong and well presented. The premise is that even small-vocabulary ASR systems can be of some benefit—consider domains like banking or domain-specific information retrieval. For most of the world’s languages, there is not enough motivation or resources to build complete ASR systems; however, a small-vocabulary system can be simulated using a system from another language. A pronunciation lexicon is built using the acoustic units of another language and used for recognition. Obviously this approach is not ideal for traditional ASR tasks, but it can provide a low-error-rate system for a small-vocabulary task.

The NCHLT Speech Corpus of the South African Languages by Etienne Barnard, Marelie Davel, Charl van Heerden, Febe de Wet, and Jaco Badenhorst:
This paper introduced a corpus that I did not know existed. The corpus consists of 50 hours of transcribed speech for each of the 11 official languages of South Africa. They actually collected far more data, but this was the data remaining after cleaning and verification. The difficulties surrounding the collection of the corpus were quite interesting. For instance, in such a diverse language environment, it can be difficult to determine what exactly is the mother tongue of a given speaker.
