This post refers to the paper: W. Hartmann, A. Roy, L. Lamel, and J.-L. Gauvain, “Acoustic Unit Discovery and Pronunciation Generation from a Grapheme-Based Lexicon,” Proceedings of IEEE ASRU, pp. 380-385, 2013. (Preprint, postprint not yet available)
I also have a brief description of the general project on my website. We recently presented this work at the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. In general, the reception from the attendees was positive and encouraging. We were even selected to present a synopsis of our work at the end of the day devoted to Limited Resources.
Lately there has been a renewed interest in speech recognition for low-resource languages. In those cases, a hand-crafted dictionary may not exist. Furthermore, the phonetic inventory of the language may not be completely known. Instead of relying on a pre-existing lexicon, we only assume that we have acoustic data with some type of orthographic transcription. While the motivation for this work is low-resource languages, we hope to eventually apply this procedure to languages with expert-defined lexicons.
We first build a grapheme-based recognizer. For this study, we used English, where grapheme-based systems perform considerably worse than phone-based systems. Once a baseline grapheme-based system has been built, we discover acoustic units and generate pronunciations in two separate stages.
Acoustic units are generated by clustering the grapheme-based, context-dependent HMMs. Other work has performed a similar clustering, but on individual HMM states. By clustering whole HMMs, we obtain units that more closely resemble phones. Although the number of clusters (that is, the number of acoustic units) must be fixed in advance, we found improved performance across a wide range of unit counts.
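To make the clustering step concrete, here is a minimal sketch. It is not the implementation from the paper: it assumes each context-dependent HMM has already been summarized as a fixed-length vector (for example, concatenated state means), and it uses plain k-means with a farthest-first initialization as a stand-in for whatever distance and clustering method the paper uses. The function names and the `left-center+right` model names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cluster_hmms(hmm_vectors, num_units, max_iterations=50):
    """Assign every HMM to exactly one of num_units clusters.

    hmm_vectors: dict mapping an HMM name to its summary vector (assumed).
    Returns a dict mapping each HMM name to a cluster index.
    """
    names = sorted(hmm_vectors)

    # Farthest-first initialization: deterministic, spreads the seeds out.
    centroids = [list(hmm_vectors[names[0]])]
    while len(centroids) < num_units:
        farthest = max(
            names,
            key=lambda n: min(squared_distance(hmm_vectors[n], c) for c in centroids),
        )
        centroids.append(list(hmm_vectors[farthest]))

    assignment = {}
    for _ in range(max_iterations):
        # Assignment step: each HMM goes to its nearest centroid.
        new_assignment = {}
        for name in names:
            distances = [squared_distance(hmm_vectors[name], c) for c in centroids]
            new_assignment[name] = distances.index(min(distances))
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment

        # Update step: move each centroid to the mean of its members.
        for k in range(num_units):
            members = [hmm_vectors[n] for n in names if assignment[n] == k]
            if members:
                centroids[k] = mean_vector(members)
    return assignment


# Toy data (made up): two models that sound alike, two that sound different.
toy_vectors = {
    "a-c+t": [0.0, 0.1],
    "a-k+t": [0.1, 0.0],
    "a-s+t": [5.0, 5.1],
    "a-z+t": [5.1, 5.0],
}
toy_units = cluster_hmms(toy_vectors, num_units=2)
```

In this toy case the two "c/k" models end up sharing one unit and the two "s/z" models share the other, which is the kind of merging that can make the discovered units behave more like phones than like graphemes.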
Since each context-dependent HMM is assigned to a single cluster, we can map the original pronunciations to the new acoustic units. Each mapped pronunciation still has the same number of units as the grapheme-based pronunciation it came from. A new system can now be built from this new lexicon. We show a significant improvement in WER from using the discovered acoustic units.
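The mapping itself is a simple lookup. The sketch below is an illustration under assumptions: the `left-center+right` naming of context-dependent models, the `#` word-boundary symbol, the `u<N>` unit labels, and the cluster indices are all hypothetical and chosen only for the example.

```python
def context_dependent_names(word, boundary="#"):
    """Expand a word's graphemes into context-dependent model names.

    Uses a hypothetical left-center+right naming scheme with a
    boundary symbol at the word edges.
    """
    padded = [boundary] + list(word) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def map_word(word, unit_of):
    """Replace each context-dependent grapheme model by its cluster label.

    unit_of: dict mapping a model name to the index of its cluster.
    The result has exactly one unit per grapheme in the word.
    """
    return [f"u{unit_of[name]}" for name in context_dependent_names(word)]


# Made-up cluster assignments for the three models in "cat".
toy_unit_of = {"#-c+a": 3, "c-a+t": 7, "a-t+#": 1}
toy_pronunciation = map_word("cat", toy_unit_of)
```

Because every model belongs to exactly one cluster, the lookup is deterministic and the mapped pronunciation has one unit per grapheme.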
While the new acoustic units improve performance, we believed the mapped pronunciations did not make optimal use of them. In a second stage, we transform the mapped pronunciations to further improve results. We take a statistical machine translation (SMT)-based approach, similar to previous work on building grapheme-to-phoneme (G2P) systems. The main difference is that we are learning a transformation between two identical unit sets: a G2G system.
The first step is to generate data to train the SMT-based system. This is accomplished by decoding the training data to generate a set of pronunciation hypotheses for each word. A phrase translation table is trained using the pronunciation hypotheses. The original pronunciations can be transformed by applying the learned phrase translation table. Unfortunately, this actually decreased performance in our experiments.
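To illustrate what "applying the learned phrase translation table" can look like, here is a deliberately simplified sketch. A full SMT decoder searches over many segmentations and weighs competing rules; this example instead scans left to right and greedily applies the longest matching rule. The table contents, the `u<N>` unit labels, and the function name are hypothetical.

```python
def apply_phrase_table(pronunciation, phrase_table, max_phrase_len=3):
    """Rewrite a pronunciation with greedy longest-match rule application.

    pronunciation: list of unit labels.
    phrase_table: dict mapping a source tuple of units to a target tuple.
    Units not covered by any rule are copied through unchanged.
    """
    output = []
    i = 0
    while i < len(pronunciation):
        longest = min(max_phrase_len, len(pronunciation) - i)
        for length in range(longest, 0, -1):
            source = tuple(pronunciation[i:i + length])
            if source in phrase_table:
                output.extend(phrase_table[source])
                i += length
                break
        else:
            # No rule matched at this position: keep the original unit.
            output.append(pronunciation[i])
            i += 1
    return output


# Made-up rules: one substitution and one insertion.
toy_table = {
    ("u3", "u7"): ("u3", "u9"),
    ("u1",): ("u1", "u4"),
}
toy_rewritten = apply_phrase_table(["u3", "u7", "u1"], toy_table)
```

Unlike the one-to-one cluster mapping, rules of this kind can change the length of a pronunciation, which is what lets the second stage move away from one unit per grapheme.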
We found the set of pronunciation hypotheses to be very noisy, resulting in a phrase table containing many rules that increase WER. Our solution was to rescore every rule. After applying each rule individually, we measured the change in likelihood on the training set. This change in likelihood became the new score for that rule. By pruning low-scoring rules from the phrase table, we significantly improved the final pronunciations.
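The rescoring and pruning loop can be sketched as follows. This is an outline under assumptions, not the paper's code: in a real system the training-set likelihood would come from aligning the training audio with the acoustic models, whereas here `toy_log_likelihood` is a made-up stand-in so the example runs on its own. The rule format, threshold, and all names are hypothetical.

```python
def apply_rule(pronunciation, rule):
    """Apply a single (source, target) rule everywhere it matches.

    Assumes the source side of the rule is non-empty.
    """
    source, target = rule
    output, i = [], 0
    while i < len(pronunciation):
        if tuple(pronunciation[i:i + len(source)]) == source:
            output.extend(target)
            i += len(source)
        else:
            output.append(pronunciation[i])
            i += 1
    return output


def rescore_rules(rules, lexicon, log_likelihood):
    """Score each rule by the likelihood change it causes when applied alone."""
    baseline = log_likelihood(lexicon)
    scores = {}
    for rule in rules:
        candidate = {word: apply_rule(pron, rule) for word, pron in lexicon.items()}
        scores[rule] = log_likelihood(candidate) - baseline
    return scores


def prune_rules(scores, threshold=0.0):
    """Keep only rules whose likelihood change exceeds the threshold."""
    return {rule for rule, delta in scores.items() if delta > threshold}


def toy_log_likelihood(lexicon):
    """Made-up stand-in for a training-set likelihood.

    Penalizes each mismatch against a pronunciation we pretend fits the audio.
    """
    best_fit = ["u3", "u9", "u1"]
    pron = lexicon["cat"]
    mismatches = sum(a != b for a, b in zip(pron, best_fit))
    mismatches += abs(len(pron) - len(best_fit))
    return -float(mismatches)


toy_lexicon = {"cat": ["u3", "u7", "u1"]}
toy_rules = [(("u7",), ("u9",)), (("u1",), ("u2",))]
toy_scores = rescore_rules(toy_rules, toy_lexicon, toy_log_likelihood)
toy_kept = prune_rules(toy_scores)
```

In the toy data the first rule moves the pronunciation closer to the one that fits and gets a positive score, while the second moves it away and is pruned. Scoring each rule in isolation keeps the procedure simple, at the cost of ignoring interactions between rules.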
We proposed a method for acoustic unit discovery and a method for pronunciation generation that each improve WER individually and produce even greater gains when combined. We are currently working on applying these techniques to low-resource languages.