I presented this paper at the 2014 Interspeech conference. The work was done with Viet-Bac Le, Abdel Messaoudi, Lori Lamel, and Jean-Luc Gauvain. It was supported by the IARPA Babel project.
This is a continuation of some of our previous work on detecting out-of-vocabulary (OOV) keywords in the context of the IARPA Babel project. Here we explore more fully how the type of subword unit used in decoding affects the final results.
The first approach is to simply decode using the standard word-based lexicon. To detect OOV keywords, we convert the word lattice to a subword lattice and then create a consensus network from it. Converting to a consensus network introduces sequences of subwords that were not present in the original lattice. If we instead searched only the converted subword lattice, we would not expect to find many OOV keywords. This process recovers some of the OOV keywords, but performance is still much worse than for in-vocabulary keyword detection.
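To illustrate why the consensus network helps, here is a minimal sketch in Python. It is not the system used in the paper. It assumes a consensus network can be represented as a list of slots, each mapping a subword to its posterior, and it scores a match as the product of the slot posteriors. The function name, the toy subwords, and the scoring rule are all hypothetical.

```python
from math import prod

def search_consensus_network(network, keyword_units):
    """Return (start_slot, score) for each exact match of keyword_units.

    network: list of slots, each a dict {subword: posterior} (assumed format).
    keyword_units: the keyword as a list of subwords.
    The score is the product of the posteriors along the match (an assumption).
    """
    hits = []
    n = len(keyword_units)
    for start in range(len(network) - n + 1):
        posteriors = [
            network[start + i].get(unit, 0.0)
            for i, unit in enumerate(keyword_units)
        ]
        # A match needs every subword to be present in its slot.
        if all(p > 0.0 for p in posteriors):
            hits.append((start, prod(posteriors)))
    return hits

# Toy example: suppose the lattice only contained the paths "ka ta" and "ko ti".
# Once collapsed into slots, the sequence "ka ti" also becomes searchable.
network = [
    {"ka": 0.6, "ko": 0.4},
    {"ta": 0.7, "ti": 0.3},
]
hits = search_consensus_network(network, ["ka", "ti"])
```

In this toy case the keyword "ka ti" is found at slot 0 even though neither original path contained it, which is the effect described above.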
An alternative approach is to decode using subword units. There are many possible ways to segment the original lexicon into a set of subwords; we explore several approaches in this work. In all cases, decoding with subword units and then searching provides a large improvement over the lattice conversion approach. The downside is that multiple decodings are now required: a word-based decoding for in-vocabulary keywords and a subword decoding for OOV keywords. We also found that combining the results from multiple decodings using different types of subword units further improves results. Of course, this increases the number of decodings required and therefore the computational cost. In the end, it becomes a trade-off between performance and speed.
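As a rough illustration of combining results from multiple decodings, the sketch below merges keyword hit lists from several systems. It is a simplified stand-in and not the combination method from the paper. It assumes each hit carries a keyword, start and end times, and a score, that hits for the same keyword are merged when they overlap in time, and that the combined score is the summed score divided by the number of systems. The `Hit` class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    keyword: str
    start: float  # seconds
    end: float    # seconds
    score: float

def overlaps(a, b):
    """True if two hits are for the same keyword and overlap in time."""
    return a.keyword == b.keyword and a.start < b.end and b.start < a.end

def combine_hit_lists(hit_lists):
    """Merge overlapping hits across systems (assumed scheme).

    The combined score is the sum of the merged scores divided by the number
    of systems, so a hit found by only one system is down-weighted.
    """
    num_systems = len(hit_lists)
    merged = []  # list of (representative hit, summed score)
    for hits in hit_lists:
        for hit in hits:
            for i, (rep, total) in enumerate(merged):
                if overlaps(rep, hit):
                    # Keep the time span of the higher-scoring hit.
                    best = rep if rep.score >= hit.score else hit
                    merged[i] = (best, total + hit.score)
                    break
            else:
                merged.append((hit, hit.score))
    return [
        Hit(rep.keyword, rep.start, rep.end, total / num_systems)
        for rep, total in merged
    ]

# Hypothetical hit lists from two subword decodings.
system_a = [Hit("kw", 1.0, 1.5, 0.75)]
system_b = [Hit("kw", 1.1, 1.6, 0.25), Hit("kw", 7.0, 7.4, 0.5)]
combined = combine_hit_lists([system_a, system_b])
```

Each additional hit list comes from a separate decoding pass, which is where the extra computational cost comes from.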
The final approach attempts to reduce the number of decodings required. Instead of using a single type of subword unit, all types (including the original words) are combined into a single language model. Now when decoding is performed, all unit types can appear in the lattice. Unfortunately, this approach was not as good as combining multiple decodings. Its OOV performance is similar to the average of all the other systems: better than the worst system, but not as good as the best. We also see a small drop in performance for the in-vocabulary keywords. Overall, it provides the best trade-off between in-vocabulary and OOV performance for a single system.
In all cases, we search for keywords by looking for exact matches. Other approaches also consider inexact matches. While we do not report results in this work, we have done some preliminary experiments using inexact matches. When carefully tuned, inexact matching does provide gains over searching only for exact matches. It will be interesting to see whether our conclusions in this paper regarding the performance of the various decoding approaches still hold when inexact matches are allowed.
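For readers unfamiliar with the distinction, the sketch below shows one simple way inexact matching could work. It is not the method from our preliminary experiments. It assumes the hypothesis is a flat sequence of subword units and accepts any window whose edit distance to the keyword is at most a threshold. The function names and the `max_distance` parameter are hypothetical, and fixing the window to the keyword length is a simplification that mostly captures substitutions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def fuzzy_search(hypothesis, keyword, max_distance=1):
    """Return (start, distance) for windows within max_distance of keyword.

    Windows have the same length as the keyword (a simplifying assumption).
    Setting max_distance=0 reduces this to exact matching.
    """
    n = len(keyword)
    hits = []
    for start in range(len(hypothesis) - n + 1):
        distance = edit_distance(hypothesis[start:start + n], keyword)
        if distance <= max_distance:
            hits.append((start, distance))
    return hits

# Toy example: the decoder output has "a" where the keyword has "i".
hypothesis = ["s", "a", "l", "a", "m", "a"]
keyword = ["s", "a", "l", "i", "m"]
exact_hits = fuzzy_search(hypothesis, keyword, max_distance=0)
fuzzy_hits = fuzzy_search(hypothesis, keyword, max_distance=1)
```

Here the exact search finds nothing, while allowing one edit recovers the keyword at position 0. The threshold is the kind of parameter that needs careful tuning, since a looser threshold also admits more false alarms.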
W. Hartmann, V.-B. Le, A. Messaoudi, L. Lamel, and J.-L. Gauvain, “Comparing Decoding Strategies for Subword-based Keyword Spotting in Low-Resourced Languages,” Proceedings of Interspeech, pp. 2764-2768, 2014.