Below are several papers from the 9th International Symposium on Chinese Spoken Language Processing (ISCSLP) that I found particularly interesting.
Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition by Shaofei Xue, Hui Jiang and Lirong Dai:
Adapting DNNs to speaker or environment characteristics is difficult: the large number of parameters means a large amount of adaptation data is required. The authors propose a method that handles the adaptation with far less data. The trick is to decompose each weight matrix with SVD. The adaptation is then performed only on the singular values from the decomposition. The authors demonstrate modest gains on Switchboard using only a small amount of adaptation data.
Research on Deep Neural Network’s Hidden Layers in Phoneme Recognition by Yuan Ma, Jianwu Dang and Weifeng Li:
Previously, there has been work trying to understand what the different layers of a DNN correspond to in computer vision applications like facial recognition and digit recognition. The authors attempted to perform a similar study for phonetic recognition. I appreciated the idea, though I do think a different approach will be required to discover the relationship between the hidden layers and the phonetics. The presenter also had my favorite response of the whole workshop. Someone questioned the purpose of the study and wanted to know why it was of interest. She responded with, “Um, I think, because it is science.”
Decision Tree based State Tying for Speech Recognition using DNN Derived Embeddings by Xiangang Li and Xihong Wu:
This paper is in line with some recent work on removing the dependence on the GMM when building a CD-DNN-HMM. In this case, they perform the state clustering using the DNN instead of the GMM. By taking the final hidden layer of the DNN, they create an embedding for each individual state. The state tying is performed by clustering these embeddings. In this approach, states which are confusable in the original model end up being clustered together.
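A toy sketch of the clustering step, assuming each state's embedding is something like its mean final-hidden-layer activation (the dimensions and the plain k-means loop are my own illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in: one embedding per context-dependent state,
# e.g. the mean final-hidden-layer activation over aligned frames.
n_states, dim, n_tied = 200, 32, 8
embeddings = rng.standard_normal((n_states, dim))

# Plain k-means over the embeddings; states whose embeddings are
# close (i.e. confusable under the DNN) land in the same tied cluster.
centers = embeddings[rng.choice(n_states, n_tied, replace=False)]
for _ in range(20):
    dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=-1)
    assign = dists.argmin(axis=1)
    centers = np.stack([
        embeddings[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
        for k in range(n_tied)
    ])

print(assign.shape)  # one tied-cluster label per original state
```

The actual paper uses a decision tree for the tying, so questions can be asked of unseen contexts; the flat clustering above only shows why nearby embeddings end up shared.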
Speech Separation Based on Improved Deep Neural Networks with Dual Outputs of Speech Features for Both Target and Interfering Speakers by Yanhui Tu, Jun Du, Yong Xu, Lirong Dai and Chin-Hui Lee:
The authors presented an approach to speech separation based on DNNs. Given the mixture, the DNN produces both the target and the interference signals. During testing it is assumed the system has seen the target speaker before, but not the interfering signal. It is interesting work, but I do not know how it generalizes to the case where the target speaker has never been seen.
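The dual-output structure can be sketched as a network with shared hidden layers and two output heads, one per source. This is only a skeleton forward pass with made-up dimensions, not the authors' architecture or training recipe:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sizes for spectral feature frames (hypothetical).
feat_dim, hidden_dim = 64, 128

# One shared hidden layer, plus an output head per source.
W_hidden = 0.1 * rng.standard_normal((feat_dim, hidden_dim))
W_target = 0.1 * rng.standard_normal((hidden_dim, feat_dim))
W_interf = 0.1 * rng.standard_normal((hidden_dim, feat_dim))

def separate(mixture_frame):
    """Forward pass: mixture features in, two feature estimates out."""
    h = np.maximum(0.0, mixture_frame @ W_hidden)  # shared ReLU layer
    return h @ W_target, h @ W_interf              # dual output heads

mix = rng.standard_normal(feat_dim)
target_est, interf_est = separate(mix)
print(target_est.shape, interf_est.shape)
```

Training would minimize a joint loss against the two reference signals, so the shared layers must learn features useful for reconstructing both sources at once.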