Speech recognition method based on multi-task loss with additional language model

柳永利; 张绍阳; 王裕恒; 解熠

Abstract: To address the poor adaptability of Attention's overly flexible alignment in complex environments and the insufficient use of language features by simple end-to-end models, a speech recognition method based on multi-task loss with an additional language model was investigated. The characteristics of the speech signal were analyzed, and features carrying more information were selected for training. Starting from the Attention-based Conformer end-to-end model, the model was trained with a multi-task loss in which CTC loss assists the pure Conformer (Attention) loss, yielding the Conformer-CTC speech recognition model. On top of the Conformer-CTC model, the characteristics and effects of several language models were analyzed and compared, and a Transformer language model was added to the training of the above model through a re-scoring mechanism, yielding the Conformer-CTC-Transformer speech recognition model. The models were evaluated on the AISHELL-1 data set. The results show that, compared with the pure Conformer (Attention) model, the character error rate (CER) of the Conformer-CTC model on the test set is reduced by 0.49%, and the CER of the Conformer-CTC-Transformer model on the test set is reduced by 0.79% compared with the Conformer-CTC model. CTC loss can improve the adaptability of Attention alignment in complex environments, and re-scoring the Conformer-CTC model with the Transformer language model raises the recognition accuracy by a further 0.30%. The Conformer-CTC-Transformer model recognizes speech better than some existing end-to-end models, which indicates that the model is effective.
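The two mechanisms named in the abstract, a multi-task loss in which CTC assists the Attention loss and a language-model re-scoring step, can be illustrated with a minimal sketch. The abstract does not give the loss weighting, the re-scoring formula, or any hyperparameter values, so the linear interpolation forms below, the function names `multitask_loss` and `rescore_nbest`, and the default weights `ctc_weight=0.3` and `lm_weight=0.5` are all assumptions chosen for illustration, not the authors' configuration.

```python
from typing import List, Tuple


def multitask_loss(ctc_loss: float, attention_loss: float,
                   ctc_weight: float = 0.3) -> float:
    """Linearly interpolate a CTC loss and an attention loss.

    ctc_weight is an assumed hyperparameter; the abstract does not state it.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


def rescore_nbest(nbest: List[Tuple[str, float]], lm_scores: List[float],
                  lm_weight: float = 0.5) -> List[Tuple[str, float]]:
    """Re-rank n-best hypotheses by adding a weighted language-model score.

    nbest holds (text, acoustic log-score) pairs and lm_scores holds the
    language-model log-probability of each text, in the same order.
    lm_weight is an assumed hyperparameter.
    """
    if len(nbest) != len(lm_scores):
        raise ValueError("nbest and lm_scores must have the same length")
    rescored = [(text, am_score + lm_weight * lm_score)
                for (text, am_score), lm_score in zip(nbest, lm_scores)]
    # Higher combined log-score is better, so sort in descending order.
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

In this sketch a hypothesis that the recognizer ranks second can move to first place when the language model scores it as more plausible text, which is the general effect a re-scoring pass is meant to have.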
