Abstract:
Named entity recognition (NER) plays a crucial role in downstream tasks such as vertical-domain information extraction, knowledge graph construction, and intelligent question answering. To address the low recognition accuracy in the apple cultivation domain, which is caused by scarce annotated data, single-dimensional character embedding representations, and insufficient mining of multi-dimensional features, a Chinese apple cultivation named entity recognition model (ACNM) based on data augmentation and multi-feature fusion was proposed. First, focusing on the primary production processes in apple cultivation, an Apple Cultivation NER Dataset (ACND) covering 14 entity categories was constructed, and a data augmentation layer was designed to perform entity-level and sentence-level data augmentation. Second, a multi-feature layer combining a pre-trained model, glyph, radical, and lexicon features (MPGRL) was designed to extract and dynamically integrate the multi-dimensional features of apple cultivation texts, namely character, glyph, radical, and lexicon embeddings. The semantic representation of characters was thus enhanced with dynamic word representations, the visual morphological features of Chinese characters, the internal structures of Chinese characters, and lexical boundary information. The BERT with whole word masking (BERT-WWM) pre-trained model was employed to obtain character embeddings, incorporating word-level semantic information and mitigating the problem of polysemy. The Vision Transformer model was used to obtain glyph embeddings by modeling and learning the visual features of glyphs. The Mamba model was applied to extract radical embeddings while preserving the radical features, which carry richer semantic information. The SoftLexicon method was adopted to obtain lexicon embeddings and enhance lexical boundary information.
Third, the receptance weighted key value (RWKV) model, which combines the parallel training of the Transformer with the efficient inference of RNNs, was adopted as the encoding layer to fully extract the semantic information from the MPGRL and thus better capture the long-range contextual semantics of apple cultivation text. Finally, a conditional random field (CRF) was used to learn the constraints between labels in a sequence, thereby obtaining the optimal label sequence for the apple cultivation NER task. The experimental results showed that data augmentation combined with the multi-feature fusion strategy effectively improved the NER accuracy of the model. The F1 value of the ACNM model on the ACND dataset reached 97.02%, which was 2.93 to 7.80 percentage points higher than those of the compared models. This indicated that the ACNM model could efficiently extract the rich semantic information from the MPGRL, ultimately improving NER accuracy for apple cultivation. The ablation results showed that removing the data augmentation layer, the MPGRL module, and the RWKV module from the ACNM model one at a time decreased the F1 value by 2.11, 3.43, and 0.77 percentage points, respectively, suggesting that each module made a positive contribution to the ACNM model. Relative to the fusion of the three types of features adopted in this study, removing the glyph, radical, or lexicon feature individually decreased the F1 value by 0.85, 0.33, and 2.02 percentage points, respectively.
This indicated that all features had a positive effect on entity recognition accuracy. The combination of glyph and radical features encoded Chinese characters at two granularities and in two modalities, so the two played complementary roles, and the combination of the three external features modeled the semantics of apple cultivation text more comprehensively, improving recognition accuracy. Experiments were also conducted on three publicly available datasets, namely CLUENER2020, CCKS2017, and Boson, on which the F1 values reached 79.33%, 95.20%, and 83.06%, respectively. These values also exceeded those of the other compared models, indicating that the model has a degree of generalization ability. This study has practical value for the construction of an apple cultivation knowledge graph and can also provide a technical reference for NER research on other crops.
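
As an illustration of how a CRF layer can select an optimal label sequence under label constraints, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF. The abstract does not state which decoding procedure ACNM uses, so Viterbi is assumed here as the standard choice. The label set (`O`, `B-VAR`, `I-VAR`), the emission scores, and the transition scores are all hypothetical values chosen for the example and are not taken from the ACND dataset or the trained model.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at position t (from the encoder)
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(labels)
    # Best score of any path ending in each label at the first position.
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Best previous label for a path that ends in label j here.
            best_prev = max(
                range(n_labels), key=lambda i: scores[i] + transitions[i][j]
            )
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[i] for i in path]


# Hypothetical BIO labels for a single entity type.
LABELS = ["O", "B-VAR", "I-VAR"]
# Hypothetical transition scores: the move O -> I-VAR is heavily penalized.
TRANSITIONS = [
    [0.0, 0.0, -1e4],  # from O
    [0.0, 0.0, 0.0],   # from B-VAR
    [0.0, 0.0, 0.0],   # from I-VAR
]
# Hypothetical emission scores for three characters.
EMISSIONS = [
    [1.0, 0.8, 0.0],
    [0.0, 0.1, 2.0],
    [1.5, 0.0, 0.2],
]

greedy = [LABELS[max(range(3), key=lambda j: row[j])] for row in EMISSIONS]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, LABELS)
print(greedy)   # ['O', 'I-VAR', 'O']
print(decoded)  # ['B-VAR', 'I-VAR', 'O']
```

In this toy case, picking the best label per character independently yields `O, I-VAR, O`, in which an inside tag follows an outside tag. The transition penalty steers the jointly decoded sequence to `B-VAR, I-VAR, O` instead, which is the kind of constraint between labels that a CRF layer is meant to capture.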