I presented this paper at the 2014 Workshop on Spoken Language Technology for Under-Resourced Languages (SLTU). The work was done with Lori Lamel and Jean-Luc Gauvain. It was supported by the Quaero project and the IARPA Babel project.
Keyword spotting is the task of detecting specific words or sequences of words in audio. The task has seen a resurgence in popularity thanks to the IARPA Babel project, which focuses on a setting where the keywords only become available after the audio has been processed. The standard approach is to generate lattices during decoding and then search them once the keywords arrive.
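To make the lattice-search step concrete, here is a minimal sketch of finding a multi-word keyword in a lattice. The arc representation (tuples of source state, destination state, word label, log-probability) and the function name are illustrative assumptions for this post, not the toolkit's actual data structures.

```python
from collections import defaultdict

def find_keyword(arcs, keyword):
    """Return the accumulated log-probability of each lattice path whose
    arc labels match the keyword word sequence exactly.

    arcs: list of (src_state, dst_state, word, logprob) tuples.
    keyword: list of words to match in order.
    """
    by_src = defaultdict(list)
    for src, dst, word, lp in arcs:
        by_src[src].append((dst, word, lp))

    hits = []
    # Try to start a match at every arc labeled with the first keyword word.
    for src, dst, word, lp in arcs:
        if word != keyword[0]:
            continue
        stack = [(dst, 1, lp)]  # (current state, next keyword index, score)
        while stack:
            state, i, score = stack.pop()
            if i == len(keyword):
                hits.append(score)
                continue
            for nxt, w, alp in by_src[state]:
                if w == keyword[i]:
                    stack.append((nxt, i + 1, score + alp))
    return hits
```

In a real system the scores would be turned into posteriors and thresholded; this sketch only shows why exact-match search fails outright when a keyword's words never appear as arc labels.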
Given this setup, it is important to generate lattices from which as many keywords as possible can be accurately detected. If a keyword is out-of-vocabulary, exact-match search cannot find it at all. One solution is a more sophisticated search: use similar in-vocabulary words as proxies, or apply some type of phonetic confusion model. In our work, we instead attempt to minimize the number of out-of-vocabulary words by generating lattices over subword units.
Many keyword spotting systems in the literature use subword units; however, those units are almost always single characters or phones. We use longer subword units in this work. As implied by the title, we also explore subword units that cross word boundaries. This work was a preliminary exploration using the Kaldi speech recognition toolkit. We have since improved the results with better features and a more intelligent search strategy.
Our results, along with more recent work, show that no single type of subword unit consistently outperforms the others. However, the different subword unit types provide complementary information and combine quite well. Most of the subword units are character n-grams. For a given n (3, 5, or 7 in this work), we find all character sequences of length n in the training transcripts. Based on this set of subwords, we build a unigram language model. The training data is then segmented using the language model, which is equivalent to selecting the segmentation that minimizes the number of segments. Once the data has been initially segmented, a new trigram language model is built, and the training data is segmented again using this larger model. The process is repeated until convergence.
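The minimum-segment criterion for the initial segmentation can be sketched with a short dynamic program. One assumption made here so that every word stays segmentable: the inventory keeps all character sequences of length 1 up to n, not only length exactly n. The function names are mine, not the paper's.

```python
def build_inventory(corpus, n):
    """All character sequences of length 1..n seen in the corpus words.

    (Assumption for this sketch: shorter units are kept as back-off so
    that any word can be segmented.)
    """
    inv = set()
    for word in corpus.split():
        for k in range(1, n + 1):
            for i in range(len(word) - k + 1):
                inv.add(word[i:i + k])
    return inv

def min_segments(word, inventory, max_len):
    """Segment `word` into the fewest units drawn from `inventory`.

    Dynamic programming over prefix lengths: best[i] is a minimum-count
    segmentation of word[:i], or None if none exists.
    """
    best = [None] * (len(word) + 1)
    best[0] = []
    for i in range(1, len(word) + 1):
        for j in range(max(0, i - max_len), i):
            if best[j] is not None and word[j:i] in inventory:
                cand = best[j] + [word[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[-1]
```

After this initial pass, the trigram model replaces the "fewest segments" objective with a maximum-likelihood one, but the same dynamic-programming structure applies.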
For each subword unit type (a total of seven are used in this work), a language model and pronunciation dictionary are produced. The data is decoded separately with each subword unit type using the Kaldi speech recognition toolkit. Keyword spotting is performed on each system's output, and the results are combined.
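One simple way to combine the per-system keyword hits is to merge detections of the same keyword that overlap in time and average their scores across systems. This is a common merging scheme, sketched here as an illustration; the paper's exact combination method may differ, and the hit-tuple layout is an assumption of this sketch.

```python
def combine_hits(hit_lists):
    """Merge keyword detections from several systems.

    hit_lists: one list per system of (keyword, start, end, score) hits.
    Overlapping hits for the same keyword are clustered; scores are summed
    and divided by the number of systems, so a system that misses a hit
    implicitly contributes zero.
    """
    n_systems = len(hit_lists)
    all_hits = sorted(
        (h for hits in hit_lists for h in hits),
        key=lambda h: (h[0], h[1]))  # sort by keyword, then start time

    merged, cur = [], None
    for kw, start, end, score in all_hits:
        if cur and cur[0] == kw and start <= cur[2]:
            # Overlaps the current cluster: extend it and accumulate score.
            cur = (kw, cur[1], max(cur[2], end), cur[3] + score)
        else:
            if cur:
                merged.append((cur[0], cur[1], cur[2], cur[3] / n_systems))
            cur = (kw, start, end, score)
    if cur:
        merged.append((cur[0], cur[1], cur[2], cur[3] / n_systems))
    return merged
```

Averaging in this way rewards keywords detected by several systems, which is where the complementarity of the different subword unit types pays off.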
Combining multiple systems significantly improves performance on both in-vocabulary and out-of-vocabulary keywords. The downside is the increased computational cost. Our future work will focus on obtaining the performance of multi-system combination while requiring only a single decoding of the data.
W. Hartmann, L. Lamel, and J.-L. Gauvain, “Cross-Word Sub-Word Units for Low Resource Keyword Spotting,” Proceedings of SLTU, pp. 112–117, 2014. (Preprint, postprint not yet available)