Abstract:
Cattle health knowledge, for both beef and dairy production, is one of the most important components of intelligent, data-driven livestock farming. It spans multiple dimensions, including the housing environment, disease prevention, breeding, nutrition, and feed and water regulation, all of which are closely related to animal welfare and productivity in sustainable production. However, high-quality Chinese textual resources for cattle health research are still lacking. In particular, the scarcity of annotated corpora for named entity recognition (NER) has limited knowledge extraction, intelligent monitoring, and decision making in precision livestock farming. Compared with general text, NER on cattle health data is characterized by highly diverse entity types, complex and nested entity structures, uneven data distribution, and frequent domain-specific terminology. General-purpose models, such as BERT, cannot fully meet the requirements of accurate entity recognition, and domain adaptation is often required to identify long-tail or low-frequency entities with high performance. In this study, a Chinese NER corpus was constructed for cattle health. The dataset covers 17 entity categories, including diseases, drugs, feed, physiological indicators, operations, and environmental factors. A multi-feature fusion NER model was then proposed using the Livestock Enhanced Representation for Text (LERT). At the representation layer, LERT was employed as the pre-trained language model to enhance Chinese semantic comprehension and effectively capture the long-range contextual dependencies specific to the cattle domain.
At the feature extraction layer, a Bi-directional Long Short-Term Memory (BiLSTM) network and an Iterated Dilated Convolutional Neural Network (IDCNN) were combined to integrate global and local context during representation learning: the BiLSTM modeled the long-range dependencies, while the IDCNN efficiently extracted the local features. Furthermore, a scaled dot-product multi-head attention mechanism was introduced at the feature fusion layer to strengthen the perception of long-distance dependencies for boundary and category identification, and a Conditional Random Field (CRF) layer was applied at the decoding stage to globally optimize the label sequences and keep the outputs structurally consistent. Experimental evaluations showed that the model achieved a precision of 90.45%, a recall of 90.76%, and an F1-score of 90.57% on the corpus, outperforming baseline models such as BERT and RoBERTa. Precision exceeded 80% for every entity category, indicating strong and stable recognition. Ablation experiments verified that both the multi-head attention mechanism and the combination of BiLSTM with IDCNN contributed significantly to feature fusion and overall performance. The proposed model offers a high-precision, domain-adaptive approach to Chinese NER in the field of cattle health, and can also offer valuable insights for natural language processing in intelligent livestock farming.
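
For readers unfamiliar with how precision, recall, and F1-score are typically computed for NER, the following is a minimal sketch of entity-level scoring. It is an illustration only, under stated assumptions: a BIO tagging scheme, exact-match spans, and micro-averaging over all entities. The abstract does not state which tagging scheme or averaging method was used, and the labels `DISEASE` and `DRUG` as well as the function names are hypothetical.

```python
from typing import List, Set, Tuple

# (start index, exclusive end index, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        else:
            # "O", or an "I-" tag with no matching open entity, is ignored.
            start, etype = None, None
    return spans


def precision_recall_f1(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged, exact-match entity-level precision, recall, and F1."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = extract_spans(gold_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(gold_spans & pred_spans)   # same boundaries and same type
        fp += len(pred_spans - gold_spans)   # predicted but not in gold
        fn += len(gold_spans - pred_spans)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the model finds the disease mention but misses the drug mention.
gold = [["B-DISEASE", "I-DISEASE", "O", "B-DRUG"]]
pred = [["B-DISEASE", "I-DISEASE", "O", "O"]]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # prints "1.00 0.50 0.67"
```

Under exact-match scoring, a prediction counts as correct only when both its boundaries and its category agree with the annotation, which is why boundary identification and category identification both affect the reported scores.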