I recently attended the SLTU (Spoken Language Technologies for Under-resourced Languages) workshop in St. Petersburg, Russia, where I presented my first paper on keyword spotting. As is obvious from the name, the focus was on under-resourced languages. The workshop was spread over 2.5 days and every paper was presented as a talk.
Overall, I enjoyed the workshop. Met some interesting people and saw some good presentations. A few of the papers that were particularly interesting to me are below.
Speech Recognition and Keyword Spotting for Low Resource Languages: Babel Project Research at CUED by Mark J.F. Gales, Kate M. Knill, Anton Ragni, and Shakti P. Rath:
This paper was of particular interest to me since I also work on the IARPA Babel Project. It is always interesting to get the perspectives of a different lab. The major topics covered were deep neural network acoustic models, data augmentation, and zero-resource training. Surprisingly, they were able to build reasonable systems with no in-language acoustic training data.
Query-by-Example Spoken Term Detection Evaluation on Low-Resource Languages by Xavier Anguera Miro, Luis Javier Rodriguez Fuentes, Igor Szöke, Andi Buzo, Florian Metze, and Mikel Penagarikano:
As opposed to the keyword spotting work done in the context of the IARPA Babel project, this paper describes a query-by-example keyword spotting challenge. The main difference is that search is not performed based on the lexical representation of a word. Instead, each query is given as an acoustic example. For the challenge described in the paper, no knowledge of the text, or even of the language, is given. Participants must search through 20 hours of untranscribed audio in a variety of languages.
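To make the idea of searching with an acoustic example concrete, here is a minimal sketch of subsequence dynamic time warping (DTW), one common way to match a spoken query against longer audio without any text. This is my own illustration, not the method of any particular challenge participant. It assumes the audio has already been converted to per-frame feature vectors, and the function names and toy one-dimensional features are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (one per audio frame)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_dtw(query, utterance):
    """Find the best match of `query` anywhere inside `utterance`.

    Both arguments are lists of feature vectors. Returns a tuple
    (cost, end_frame): the accumulated frame distance divided by the
    query length, and the utterance frame index where the match ends.
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    # Row 0 is all zeros so that a match may start at any utterance frame.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], utterance[j - 1])
            # Allowed steps: vertical, diagonal, horizontal.
            curr[j] = d + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    # The match may also end at any utterance frame.
    best_end = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[best_end] / n, best_end - 1


# Toy example with 1-dimensional "features" (assumed, for illustration only).
query = [[1.0], [2.0], [3.0]]
utterance = [[9.0], [9.0], [1.0], [2.0], [3.0], [9.0]]
cost, end_frame = subsequence_dtw(query, utterance)
print(cost, end_frame)
```

In a real system the score would be thresholded to decide whether the query occurs, and the search would be repeated for every utterance in the collection.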
Adapting Multilingual Neural Network Hierarchy to a New Language by Frantisek Grezl and Martin Karafiat:
BUT (Brno University of Technology) has been achieving tremendous results with their DNN systems, in both the tandem and hybrid frameworks. When they write a paper describing their work with DNNs, it is usually well worth the read. This work focuses on issues surrounding limited in-domain training data and the desire to quickly build new recognition systems. They augment the original training data by first training a large bottleneck feature system using data from a variety of languages. When a new language is presented, the system is quickly adapted with a small amount of in-domain data.
Combining Grapheme-to-Phoneme Convertor Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios by Tim Schlippe, Wolf Quaschningk, and Tanja Schultz:
This work was interesting for two reasons. First, they show G2P results for a variety of approaches (including Moses, Sequitur, and Phonetisaurus). It is always nice to see comparisons like this. Second, they present a method of combining the outputs of multiple systems. While it does improve the pronunciations in terms of phone accuracy when compared to a gold standard, it seems to have little effect on ASR performance. However, the improved accuracy could be important for speech synthesis. The combination could also improve pronunciations for long-tail words that have little effect on the traditional WER metric.
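For readers unfamiliar with how pronunciations are scored against a gold standard, here is a small sketch of a phone error rate computed with edit distance (phone accuracy is then roughly one minus this value). It is only my illustration of the metric, not the paper's evaluation code. The function names, the two-word lexicon, and the phone symbols are made up for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(gold, predicted):
    """Total phone edits divided by total gold phones, over all gold words.

    `gold` and `predicted` map each word to a list of phones. A word
    missing from `predicted` counts as all of its phones deleted.
    """
    errors = sum(edit_distance(gold[w], predicted.get(w, [])) for w in gold)
    total = sum(len(phones) for phones in gold.values())
    return errors / total


# Hypothetical lexicon entries, for illustration only.
gold = {"cat": ["k", "ae", "t"], "dog": ["d", "ao", "g"]}
predicted = {"cat": ["k", "ae", "t"], "dog": ["d", "aa", "g"]}
print(phone_error_rate(gold, predicted))  # one substitution out of six phones
```

A measure like this is computed per G2P system and for the combined output, which is how the improvement over the individual systems can be quantified.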
Cross-Language Mapping for Small-Vocabulary ASR in Under-Resourced Languages: Investigating the Impact of Source Language Choice by Anjana Vakil and Alexis Palmer:
The most interesting aspect of this work is that it succeeded in making me care about a small-vocabulary task. Motivation for the work is strong and well presented. The premise is that even small-vocabulary ASR systems can be of some benefit (consider domains like banking or domain-specific information retrieval). For most of the world’s languages, there is not enough motivation or resources to build complete ASR systems. However, a small-vocabulary system could be simulated using a system from another language. A pronunciation lexicon is built using the acoustic units of another language and used for recognition. Obviously this approach is not ideal for traditional ASR tasks, but it can provide a low error rate system for a small-vocabulary task.
The NCHLT Speech Corpus of the South African Languages by Etienne Barnard, Marelie Davel, Charl van Heerden, Febe de Wet, and Jaco Badenhorst:
This paper introduced a corpus that I did not know existed. The corpus consists of 50 hours of transcribed speech for each of the 11 official languages of South Africa. They actually collected far more data, but this was the data remaining after cleaning and verification. The difficulties surrounding the collection of the corpus were quite interesting. For instance, in such a diverse language environment, it can be difficult to determine what exactly is the mother tongue of a given speaker.