Abstract:
Biological invasions have become an increasingly severe problem worldwide in recent years, and efficient extraction of information about invasive alien plants is urgently needed. Named entity recognition (NER) is one of the most important techniques for extracting key information from unstructured text. However, three challenges remain. First, domain-specific training data are scarce, so existing models cannot effectively learn the features of entities related to invasive alien plants. Second, simple character-vector representations are limited in capturing the subtle linguistic differences of the domain. Third, the low recognition rate restricts the extraction of specialized domain entities. In this study, an improved NER method based on domain dictionary enhancement and multi-feature fusion was proposed to improve the accuracy and generalization of information extraction from texts on invasive alien plants. The RoBERTa (Robustly optimized BERT approach) pre-trained model was adopted as the foundational framework to address data scarcity and limited semantic comprehension. A feature representation was constructed from a domain dictionary of invasive alien plants and incorporated into the model to help recognize domain-specific entities, so that semantic information from different dimensions was integrated. In addition, fine-grained entity types for invasive alien plants were defined in collaboration with domain experts, and a high-quality NER dataset covering 24 entity categories was constructed. The results show that the improved model outperformed the other mainstream NER models on the invasive alien plant corpus, with a precision of 91.51%, a recall of 92.51%, and an F1 score of 92.01%, which were 4.40, 3.39, and 3.91 percentage points higher than those of the baseline model, respectively. The improvement was attributed to four design choices: 1) Integration of the pre-trained model and the domain dictionary: the feature representation built from the domain dictionary of invasive alien plants was fused with the RoBERTa outputs, which significantly improved the recognition of rare and domain-specific entities; 2) Optimization of the feature extraction layer: a bidirectional long short-term memory network (BiLSTM) was used to capture the local character-vector features and the dependencies among text elements, and a convolutional residual structure (CRS) was used to extract more detailed information; 3) Automatic feature fusion: to enhance the representation and interaction of feature vectors from different sources, a weighted automatic feature fusion layer based on a multi-head self-attention mechanism was introduced in place of the direct feature concatenation used in many previous studies. The layer adaptively weighted the features from the BiLSTM and CRS components and explicitly integrated contextual and semantic information, yielding a more effective fusion of domain knowledge and textual semantics; 4) Optimization of the decoding strategy: a conditional random field (CRF) was employed to decode the feature vectors and generate the globally optimal label sequence, which improved NER accuracy. Furthermore, the model showed strong transferability and generalization, with F1 scores of 72.75% and 97.15% on the public Weibo and Resume datasets, respectively. The model thus achieves high recognition accuracy in the invasive alien plant domain and is also applicable to other NER tasks. The findings can provide technical and data support for constructing knowledge graphs of invasive alien plants.
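
The reported metrics can be cross-checked with a few lines of arithmetic. The sketch below assumes that the F1 score is the usual harmonic mean of precision and recall computed over the whole test set. The baseline values are not stated in the abstract; they are derived here by subtracting the reported gains, and the names `f1_score`, `improved`, `gains`, and `baseline` are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the improved model (percent).
improved = {"precision": 91.51, "recall": 92.51, "f1": 92.01}

# Reported gains over the baseline model (percentage points).
gains = {"precision": 4.40, "recall": 3.39, "f1": 3.91}

# Baseline values implied by the gains (derived, NOT stated in the abstract).
baseline = {key: round(improved[key] - gains[key], 2) for key in improved}

# Recompute F1 from precision and recall for both models.
improved_f1 = round(f1_score(improved["precision"], improved["recall"]), 2)
baseline_f1 = round(f1_score(baseline["precision"], baseline["recall"]), 2)

print("improved:", improved_f1, "reported:", improved["f1"])
print("baseline:", baseline_f1, "implied:", baseline["f1"])
```

Under these assumptions, the recomputed F1 values agree to two decimal places with the reported F1 of the improved model and with the implied F1 of the baseline, so the three reported gains are mutually consistent.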