There were more talks than the previous day, so I will not go into quite as much detail.
“Unsupervised Acoustic Model Training with Limited Linguistic Resources” presented by Lori Lamel
This was mostly a talk about the history of unsupervised training at Limsi. Being funded by many EU projects, Limsi has worked on a large variety of languages, and an impressive array of results was shown using models trained without transcribed data. An initial comment was interesting: why did we spend so many years working on the recognition of read speech? If speech is being read from a transcript, then it does not need to be transcribed.
The work described was performed on the broadcast news task for a host of languages (Luxembourgish, Latvian, Korean, Hungarian, etc.). Recognizers were bootstrapped using seed models from other languages. In most cases the in-language training data had no transcripts. An iterative process was used to decode the acoustic training data and then retrain using the decoded speech. Performance began to asymptote after about 50 hours of speech. Going to 200 hours or more produced minimal gains.
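The decode-then-retrain loop described above can be sketched as follows. This is a hypothetical illustration, not Limsi's actual system: a toy nearest-centroid "acoustic model" over feature vectors stands in for a real recognizer, and the function names (`train`, `decode`, `self_train`) are my own. Only the loop structure — bootstrap from seed models, transcribe the unlabeled data, retrain on the decoded output, repeat — reflects the talk.

```python
def train(examples):
    """Fit one centroid per label from (features, label) pairs."""
    sums, counts = {}, {}
    for feats, label in examples:
        s = sums.setdefault(label, [0.0] * len(feats))
        for i, f in enumerate(feats):
            s[i] += f
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}

def decode(model, feats):
    """Label features with the nearest centroid (squared Euclidean distance)."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(model[lab], feats)))

def self_train(seed_examples, unlabeled, iterations=3):
    """Bootstrap from seed models, then iteratively relabel and retrain."""
    model = train(seed_examples)  # seed models (a tiny labeled set stands in for another language)
    for _ in range(iterations):
        pseudo = [(f, decode(model, f)) for f in unlabeled]  # "transcribe" the training data
        model = train(seed_examples + pseudo)                # retrain on the decoded data
    return model
```

In a real system each iteration would decode tens of hours of audio and retrain full acoustic models, which is where the 50-hour asymptote mentioned above was observed.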
Prior to training, it was important to gather a good lexicon and a strong language model. Without those resources, the procedure would likely not have worked as well. Significant gains were also seen by adding acoustic transcripts to the language model training data. Using purely written text is not good enough.
The conclusions touched on pushing unsupervised training toward even lower resource levels. Open questions include how to handle languages without a written form and how to determine which units are actually meaningful when doing automatic lexical unit discovery. She also stressed that, though punctuation is rarely discussed, it is an important and challenging part of automatic transcription.
“Building Speech Recognition Systems with Low Resources” presented by Tanja Schultz
She plugged a paper soon to be published in Speech Communication, “Automatic Speech Recognition for Under-Resourced Languages: A Survey”. The GlobalPhone corpus was briefly described (21 languages and approximately 450 hours of acoustic data). An approach to unsupervised training was presented. Assume acoustic models have already been trained in a variety of languages. Use a lexicon—expressed in terms of units from each of the previously seen languages—and a language model trained on the target language. As in the previous talk, the goal is to transcribe the acoustic training data and then use it for training. In this case, acoustic models from a set of other languages produce the hypotheses, and transcripts are then chosen by consensus.
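The consensus step might look something like the sketch below. This is my own guess at the mechanics, not the system from the talk: each cross-lingual recognizer decodes every utterance, and only utterances where enough decoders agree on the same hypothesis are kept as training transcripts. The function name and the `min_votes` parameter are assumptions.

```python
from collections import Counter

def consensus_transcripts(hypotheses, min_votes=2):
    """hypotheses: {utterance_id: [hyp from decoder 1, hyp from decoder 2, ...]}.
    Returns {utterance_id: transcript} for utterances with enough agreement."""
    selected = {}
    for utt, hyps in hypotheses.items():
        transcript, votes = Counter(hyps).most_common(1)[0]
        if votes >= min_votes:  # keep only utterances the decoders agree on
            selected[utt] = transcript
    return selected
```

Utterances with no agreement are simply dropped, which trades training-set size for transcript quality.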
The second half of the talk dealt with learning lexicons. They found that learning pronunciations from wiktionary worked quite well. Attempts to crowdsource the learning of pronunciations were less successful. Most languages do not have a writing system, and an approach to handling recognition of those languages was also presented. The basic idea was to transcribe the speech as phonetic units; each word in the lexicon would simply be a string of phonetic units. I think the assumption is the “words” would be passed through a text-to-speech system instead of being shown in a written form. It was interesting work, but it is clear nobody really knows how to handle languages without written forms, at least in terms of automatic transcription. Example-based keyword spotting might be a more feasible application.
“The Babel Program and Low Resource Speech Technology” presented by Mary Harper
Mary Harper is currently a program manager at IARPA. The Babel program is the large, multi-team research program she is managing. A major portion of the project deals with data collection. By the end of the project, 26 languages will be included with approximately 100 hours of transcribed data per language. One very exciting prospect is the idea that the data will eventually be made freely available to the research community.
The motivation is to develop the algorithms and technology for quickly building a recognizer for a new language, with limited data, that can accurately perform keyword spotting. Complete transcription is not the goal; teams are judged on their ability to detect keywords. Currently, building a recognizer for a new language that achieves good performance can take a typical research lab months to years. Eventually the teams will be expected to build competitive systems within a matter of days.
“Zero to One Hour of Resources: A Self-Organizing Unit Approach to Training Speech Recognizers” presented by Herb Gish
The same general principles described in the previous talks were used in this talk. Take untranscribed data, transcribe it with an acoustic model, and use the new transcriptions to retrain the model. The main difference was that he was working from zero resources: all he had was untranscribed audio. Without any prior models or knowledge, he attempted to learn acoustic units and pseudo-words for the language.
The technologies were referred to as sequential Gaussian mixture models and self-organizing units. Learning acoustic units did require an initial segmentation of the data; his approach was a basic technique based on spectral discontinuities. My guess is that this would depend heavily on having an initial set of data with little noise and a high speech-to-silence ratio. He presented some interesting results on topic id, speaker id, acoustic event classification, and keyword discovery.
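A minimal sketch of segmentation by spectral discontinuity, under my own assumptions rather than the talk's actual method: hypothesize a unit boundary wherever the frame-to-frame distance between feature vectors is a local peak above a threshold. Real systems would use spectral features such as MFCCs; plain lists stand in here, and `segment` and `threshold` are names I made up.

```python
def spectral_distance(a, b):
    """Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def segment(frames, threshold):
    """Hypothesize boundaries at local peaks in the frame-to-frame
    spectral distance that exceed the threshold."""
    dists = [spectral_distance(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    boundaries = []
    for i in range(1, len(dists) - 1):
        if dists[i] > threshold and dists[i] >= dists[i - 1] and dists[i] >= dists[i + 1]:
            boundaries.append(i + 1)  # boundary falls just before frame i + 1
    return boundaries
```

The threshold makes the noise sensitivity concrete: in noisy audio, spurious spectral jumps clear it just as easily as true phone transitions, which is why clean data with little silence would matter.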
“Recent Progress in Unsupervised Speech Processing” presented by Jim Glass
I did not take many notes for this lecture as I was already familiar with the papers he was pulling his information from. Jim did have some of the best slides of any of the presenters. They were very visual and clearly illustrated what he was describing.
The majority of the talk dealt with automatically learning acoustic units and pronunciations from audio transcribed at the word level using a Bayesian framework. One of the more interesting aspects of the work is that it does not need to know the number of acoustic units a priori; they are learned from the data. While the current models do not yet outperform an expert lexicon, he believes they eventually could. Finally, he called for researchers to chase human abilities in their research—learning in a nearly completely unsupervised setting.
“Reverse Engineering Infant Language Acquisition” by Emmanuel Dupoux
This was an unexpected bonus talk. There were a few interesting points. He believes that phonemes are essentially unlearnable from acoustics (at least without some very sophisticated processing or prior knowledge). Based on work I have done, I would tend to agree with him. Without extra knowledge, how do you know whether a spectral change is a phone boundary or just a change within a phone?
I liked his idea for the evaluation of unsupervised tasks. Just check whether the learned units and pronunciations allow for discrimination between words that should be different. For instance, the learned pronunciations for “dog” and “doll” should be different.
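The evaluation idea reduces to a very small check, sketched below under my own naming: given a learned lexicon mapping words to unit strings, flag the word pairs whose pronunciations came out identical when they should differ. The `discriminates` function and the toy unit strings are assumptions for illustration.

```python
def discriminates(lexicon, word_pairs):
    """lexicon: {word: tuple of learned units}. Returns the pairs that the
    learned pronunciations fail to keep distinct."""
    return [(w1, w2) for w1, w2 in word_pairs if lexicon[w1] == lexicon[w2]]
```

An empty result means the learned units pass this discrimination test for the chosen pairs; any returned pair pinpoints where the unit discovery collapsed a real contrast.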