Fine-grained named entity recognition of invasive alien plants using multi-feature fusion

尚俊平; 程春畅; 卢洋; 席磊; 程金鹏; 刘合兵

doi:10.11975/j.issn.1002-6819.202502078

Abstract

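Abstract: Named entity recognition (NER) for invasive alien plants is a key step toward further mining information about invasive plants. To address scarce training data, single-perspective character-level vector representations, and insufficient recognition accuracy for specialized entities in this domain, a fine-grained NER model for invasive alien plants based on multi-feature fusion (invasive alien plant fine-grained named entity recognition model based on multi-feature fusion, IAP-MFF) was constructed. First, the RoBERTa (Robustly optimized BERT approach) pre-trained model was adopted as the base architecture, and a domain-specific dictionary was built whose lexical feature vectors were fused in to strengthen the model's representation of low-frequency words and technical terms. Second, a dual-channel feature extraction layer was designed, in which a bi-directional long short-term memory network (BiLSTM) extracts long-sequence semantic features and a convolution residual structure (CRS) captures more fine-grained features. Third, a hierarchical feature fusion mechanism was designed that uses multi-head self-attention to perform a weighted fusion of the two feature vectors, building a multi-dimensional semantic representation. Finally, a conditional random field (CRF) was used to optimize sequence decoding. Drawing on expert knowledge, an invasive alien plant NER dataset with 24 fine-grained entity labels was constructed. Experiments show that IAP-MFF achieved a precision of 91.51%, a recall of 92.51%, and an F1 score of 92.01% on this dataset, improvements of 4.40, 3.39, and 3.91 percentage points over the baseline model, respectively, markedly improving recognition of few-shot fine-grained entities. On the public Weibo and Resume datasets, the F1 scores reached 72.75% and 97.15%, respectively, indicating the model's generalization and strong performance. By fusing multiple features, including domain knowledge, IAP-MFF effectively improves entity recognition accuracy and generalization and lays a technical foundation for building a knowledge graph of invasive alien plants.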
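The reported metrics can be cross-checked with a few lines of arithmetic. The sketch below assumes the F1 value is the usual harmonic mean of precision and recall; the function name `f1_score` and the "implied baseline" values are illustrative and are derived only from the numbers quoted in the abstract, not taken from the paper's tables.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for IAP-MFF, in percent.
precision, recall = 91.51, 92.51
f1 = f1_score(precision, recall)
print(round(f1, 2))  # 92.01, matching the reported F1

# Baseline implied by the reported gains of 4.40 and 3.39 percentage points
# (an inference from the abstract, not a figure stated in it).
baseline_p = round(precision - 4.40, 2)  # 87.11
baseline_r = round(recall - 3.39, 2)     # 89.12
baseline_f1 = f1_score(baseline_p, baseline_r)
print(round(baseline_f1, 2))  # 88.1, i.e. 92.01 - 3.91
```

Under that assumption the three reported gains are mutually consistent: the implied baseline precision and recall reproduce the implied baseline F1.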

Abstract: Biological invasions have become an increasingly severe problem worldwide in recent years, creating a strong need for efficient extraction of information on invasive alien plants. Named entity recognition (NER) is one of the most crucial techniques for extracting key information from unstructured text. However, three challenges remain. First, domain-specific training data are scarce, so existing models cannot effectively learn the features of entities related to invasive alien plants. Second, single character-vector representations are limited in capturing the subtle linguistic differences of the domain. Third, recognition accuracy for specialized domain entities is low. In this study, an improved NER method based on domain dictionary enhancement and multi-feature fusion was proposed to improve the accuracy and generalization of information extraction from texts on invasive alien plants. The RoBERTa (Robustly optimized BERT approach) pre-trained model was adopted as the foundational framework to address data scarcity and limited domain understanding. A feature representation was constructed from a domain dictionary of invasive alien plants and incorporated into the model to help recognize domain-specific entities, and semantic information from different dimensions was integrated. In addition, fine-grained definitions of the entity types of invasive alien plants were developed in collaboration with domain experts, and a high-quality NER dataset covering 24 entity categories was constructed. The results show that the improved model outperformed the other mainstream NER models on the invasive alien plant corpus.
It achieved a precision of 91.51%, a recall of 92.51%, and an F1 score of 92.01%, which are 4.40, 3.39, and 3.91 percentage points higher than the baseline model, respectively. The better performance was attributed to four design choices: 1) Integration of the pre-trained model and the domain dictionary: the feature representation built from the domain dictionary of invasive alien plants was fused with the RoBERTa outputs, which significantly improved recognition of rare and domain-specific entities; 2) Optimized feature extraction layer: a bi-directional long short-term memory network (BiLSTM) was used to capture long-sequence semantic features and dependencies among text elements, and a convolutional residual structure (CRS) was used to extract more fine-grained information; 3) Automatic feature fusion layer: to enhance the representation and interaction of feature vectors from different sources, a weighted fusion layer based on multi-head self-attention replaced the direct concatenation used in many previous studies. Contextual and semantic information was explicitly integrated to adaptively weight the features from the BiLSTM and CRS components, giving a more effective fusion of domain knowledge and textual semantics; 4) Decoding strategy optimization: a conditional random field (CRF) was employed to decode the feature vectors and generate globally optimal label sequences, which improved NER accuracy. Furthermore, the model showed strong transferability and generalization, with F1 scores of 72.75% and 97.15% on the public Weibo and Resume datasets, respectively. The model thus achieves high recognition accuracy in the invasive alien plant domain and is also applicable to other NER tasks.
The findings can provide technical and data support for constructing knowledge graphs of invasive alien plants.
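To illustrate what generating a globally optimal label sequence with a CRF means, here is a minimal Viterbi decoding sketch for a linear-chain CRF. It is not the authors' implementation. The tag names (`B-PLANT`, `I-PLANT`, `O`), the function `viterbi_decode`, and all scores are hypothetical toy values chosen for illustration; in the described model the emission scores would come from the fused BiLSTM and CRS features, and the transition scores would be learned.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    start: Dict[str, float],
) -> List[str]:
    """Return the highest-scoring tag sequence under a linear-chain CRF."""
    tags = list(start)
    # score[t]: best score of any tag path that ends in tag t at this position.
    score = {t: start[t] + emissions[0][t] for t in tags}
    backpointers: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: score[p] + transitions[p][cur])
            new_score[cur] = score[prev] + transitions[prev][cur] + emit[cur]
            pointers[cur] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy scores for a three-character input (all values are made up).
start = {"B-PLANT": 0.0, "I-PLANT": -10.0, "O": 0.0}
transitions = {
    "B-PLANT": {"B-PLANT": -2.0, "I-PLANT": 1.0, "O": 0.0},
    "I-PLANT": {"B-PLANT": -2.0, "I-PLANT": 0.5, "O": 0.0},
    "O": {"B-PLANT": 0.0, "I-PLANT": -10.0, "O": 0.5},
}
emissions = [
    {"B-PLANT": 1.0, "I-PLANT": 0.0, "O": 1.2},
    {"B-PLANT": 0.0, "I-PLANT": 2.0, "O": 0.5},
    {"B-PLANT": 0.0, "I-PLANT": 0.1, "O": 1.5},
]

best_path = viterbi_decode(emissions, transitions, start)
# Per-position argmax ignores transitions and can yield an invalid sequence.
greedy_path = [max(e, key=e.get) for e in emissions]
print(best_path)    # ['B-PLANT', 'I-PLANT', 'O']
print(greedy_path)  # ['O', 'I-PLANT', 'O']
```

In this toy case the per-position argmax starts an entity with `I-PLANT` directly after `O`, whereas Viterbi decoding uses the transition scores to return a well-formed sequence.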
