Abstract:
To address the problems that the overly flexible alignment of the Attention mechanism adapts poorly to complex environments and that simple end-to-end models do not fully exploit language features, a speech recognition method based on a multi-task loss with an additional language model was investigated. By analyzing the characteristics of the speech signal, features carrying more information were selected for training. Starting from the Attention-based Conformer end-to-end model, the model was trained with a multi-task loss in which a CTC loss assists the pure Conformer (Attention) loss, yielding the Conformer-CTC speech recognition model. On top of the Conformer-CTC model, after analyzing and comparing the characteristics and effects of several language models, a Transformer language model was added to the training of the above model through a re-scoring mechanism, yielding the Conformer-CTC-Transformer speech recognition model. Experiments on these models were carried out on the AISHELL-1 dataset. The results show that, compared with the pure Conformer (Attention) model, the character error rate (CER) of the Conformer-CTC model on the test set is reduced by 0.49%, and the CER of the Conformer-CTC-Transformer model on the test set is reduced by 0.79% compared with the Conformer-CTC model. The CTC loss can improve the adaptability of the Attention alignment in complex environments, and re-scoring the Conformer-CTC model with the Transformer language model further increases recognition accuracy by 0.30%. Compared with some existing end-to-end models, the Conformer-CTC-Transformer model achieves better recognition performance, indicating that the model is effective.
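For concreteness, the sketch below illustrates one common way to realize the multi-task training objective summarized above, namely interpolating a CTC loss with the attention decoder's cross-entropy loss. It is a minimal PyTorch sketch under assumed settings: the weight lambda_ctc, tensor shapes, vocabulary size, and the function name hybrid_loss are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a hybrid CTC/attention multi-task loss, assuming the common
# interpolation form L = lambda * L_CTC + (1 - lambda) * L_attention.
# lambda_ctc, shapes, and vocabulary size are illustrative assumptions.
import torch
import torch.nn as nn


def hybrid_loss(ctc_log_probs, ctc_input_lengths,
                dec_logits, targets, target_lengths,
                lambda_ctc=0.3, blank_id=0):
    """Weighted sum of a CTC loss (encoder branch) and a cross-entropy loss
    (attention decoder branch); padding/sos/eos handling is omitted here."""
    ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
    att = nn.CrossEntropyLoss()

    # CTC branch: alignment-free loss on frame-level encoder outputs.
    loss_ctc = ctc(ctc_log_probs, targets, ctc_input_lengths, target_lengths)

    # Attention branch: token-level cross entropy on decoder output logits.
    loss_att = att(dec_logits.reshape(-1, dec_logits.size(-1)),
                   targets.reshape(-1))

    return lambda_ctc * loss_ctc + (1.0 - lambda_ctc) * loss_att


# Toy usage with random tensors: batch of 2, 50 encoder frames, 20 target
# tokens, and an arbitrary toy vocabulary of 4000 characters plus blank id 0.
B, T, U, V = 2, 50, 20, 4001
ctc_log_probs = torch.randn(T, B, V).log_softmax(dim=-1)  # (T, N, C) for nn.CTCLoss
dec_logits = torch.randn(B, U, V)                          # decoder output logits
targets = torch.randint(1, V, (B, U))                      # labels, no blanks
loss = hybrid_loss(ctc_log_probs, torch.full((B,), T, dtype=torch.long),
                   dec_logits, targets, torch.full((B,), U, dtype=torch.long))
print(loss.item())
```

The same interpolation idea extends to decoding, where candidate hypotheses can be re-scored by combining acoustic-model scores with an external Transformer language model score, as in the re-scoring mechanism described in the abstract.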