Articulatory Information and Multiview Features for Large Vocabulary Continuous Speech Recognition
Abstract
End-to-end neural networks have shown promising results on large vocabulary continuous speech recognition (LVCSR) systems. However, it is challenging to integrate domain knowledge into such systems. Specifically, articulatory features (AFs) which are inspired by the human speech production mechanism can help in speech recognition. This paper presents two approaches to incorporate domain knowledge into end-to-end training: (a) fine-tuning networks which reuse hidden layer representations of AF extractors as input for ASR tasks; (b) progressive networks which combine articulatory knowledge by lateral connections from AF extractors. We evaluate the proposed approaches on the speech Wall Street Journal corpus and test on the eval92 standard evaluation dataset. Results show that both fine-tuning and progressive networks can integrate articulatory information into end-to-end learning and outperform previous systems.
Keywords
- Articulatory features
- Automatic speech recognition
- Deep neural networks (DNN)
- End-to-end learning
References
-
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
-
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of the ICLR (2015)
-
Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: Proceedings of ICCV-2011, pp. 1457–1464 (2011)
-
Miao, Y., Metze, F.: End-to-End Architectures for Speech Recognition. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds.) New Era for Robust Speech Recognition, pp. 299–323. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64680-0_13
-
Graves, A., Fernández, S., Gomez, F., et al.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of ICML-2006, pp. 369–376 (2006)
-
Zweig, G., Yu, C., Droppo, J., et al.: Advances in all-neural speech recognition. In: Proceedings of ICASSP-2017, pp. 4805–4809 (2017)
-
King, S., Taylor, P.: Detection of phonological features in continuous speech using neural networks. Comput. Speech Lang. 14(4), 333–353 (2000)
-
Kirchhoff, K.: Robust speech recognition using articulatory information. Ph.D. thesis, University of Bielefeld (1999)
-
Yu, D., Siniscalchi, S.M., Deng, L., et al.: Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition. In: Proceedings of ICASSP-2012, pp. 4169–4172 (2012)
-
Sak, H., Senior, A., Rao, K., et al.: Learning acoustic frame labelling for speech recognition with recurrent neural networks. In: Proceedings of ICASSP-2015, pp. 4280–4284 (2015)
-
Chorowski, J.K., Bahdanau, D., Serdyuk, D., et al.: Attention-based models for speech recognition. In: Advances in Neural Information Processing Systems, pp. 577–585 (2015)
-
Bahdanau, D., Chorowski, J., Serdyuk, D., et al.: End-to-end attention-based large vocabulary speech recognition. In: Proceedings of ICASSP-2016, pp. 4945–4949 (2016)
-
Chan, W., Jaitly, N., Le, Q., et al.: Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: Proceedings of ICASSP-2016, pp. 4960–4964 (2016)
-
Lee, C.-H., et al.: An overview on automatic speech attribute transcription (ASAT). In: Proceedings of INTERSPEECH-2007, pp. 1825–1828 (2007)
-
Siniscalchi, S.M., Lee, C.-H.: A study on integrating acoustic-phonetic information into lattice rescoring for automatic speech recognition. Speech Commun. 51, 1139–1153 (2009)
-
Siniscalchi, S.M., Lyu, D.C., Svendsen, T., et al.: Experiments on cross-language attribute detection and phone recognition with minimal target-specific training data. IEEE Trans. Audio Speech Lang. Process. 20(3), 875–887 (2012)
-
Ananthakrishnan, S., Narayanan, S.: Improved speech recognition using acoustic and lexical correlates of pitch accent in a n-best rescoring framework. In: Proceedings of ICASSP-2007, vol. 4, pp. IV-873–IV-876 (2007)
-
Rusu, A.A., Rabinowitz, N.C., Desjardins, G., et al.: Progressive neural networks. arXiv preprint arXiv:1606.04671 (2016)
-
Amodei, D., Ananthanarayanan, S., Anubhai, R., et al.: Deep speech 2: end-to-end speech recognition in English and Mandarin. In: Proceedings of ICML-2016, pp. 173–182 (2016)
-
Sainath, T.N., Vinyals,. O., Senior, A., et al.: Convolutional, long short-term memory, fully connected deep neural networks. In: Proceedings of ICASSP-2015, pp. 4580–4584 (2015)
-
Jozefowicz, R., Zaremba, W., Sutskever, I.: An empirical exploration of recurrent network architectures. In: Proceedings of ICML-2015, pp. 2342–2350 (2015)
-
Paul, D.B., Baker, J.M.: The design for the wall street journal-based CSR corpus. In: Proceedings of the Workshop on Speech and Natural Language, pp. 357–362 (1992)
-
Abdel-Hamid, O., Mohamed, A., Jiang, H., et al.: Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In: Proceedings of ICASSP-2012, pp. 4277–4280 (2012)
-
Hannun, A.Y., Maas, A.L., Jurafsky, D., et al.: First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. arXiv preprint arXiv:1408.2873 (2014)
-
Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In: Proceedings of ICASSP-2015, pp. 357–366 (1980)
-
Lee, L., Rose, R.: A frequency warping approach to speaker normalization. IEEE Trans. Speech Audio Process. 6(1), 49–60 (1998)
-
Veselý, K., Ghoshal, A., Burget, L., et al.: Sequence-discriminative training of deep neural networks. In: Proceedings of INTERSPEECH-2013, pp. 2345–2349 (2013)
Acknowledgements
The authors gratefully acknowledge partial support from the China Scholarship Council (CSC), the German Research Foundation DFG under project CML (TRR 169), and the European Union under project SECURE (No. 642667).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Qu, L., Weber, C., Lakomkin, E., Twiefel, J., Wermter, S. (2018). Combining Articulatory Features with End-to-End Learning in Speech Recognition. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds) Artificial Neural Networks and Machine Learning – ICANN 2018. ICANN 2018. Lecture Notes in Computer Science(), vol 11141. Springer, Cham. https://doi.org/10.1007/978-3-030-01424-7_49
Download citation
- .RIS
- .ENW
- .BIB
-
DOI : https://doi.org/10.1007/978-3-030-01424-7_49
-
Published:
-
Publisher Name: Springer, Cham
-
Print ISBN: 978-3-030-01423-0
-
Online ISBN: 978-3-030-01424-7
-
eBook Packages: Computer Science Computer Science (R0)
Source: https://link.springer.com/chapter/10.1007/978-3-030-01424-7_49