Abstract:
Constructing a knowledge graph for sheep diseases is a crucial prerequisite for disease prevention, control, and intelligent diagnosis. Sheep disease texts commonly exhibit ambiguous semantic boundaries, overlapping entity roles, and complex relational semantics. To address these challenges, this paper proposes a knowledge graph construction method based on the CaRoMHPE model, a CasRel-based model that combines RoBERTa, a multi-scale cross-attention mechanism, and hybrid position encoding in multi-head attention. First, based on the characteristics of sheep disease corpora, a dedicated dataset containing 9 entity types and 8 relationship types was constructed. It covers the key entities and their relationships across the entire process from disease diagnosis to treatment and provides substantial data support for the entity-relationship extraction task. Subsequently, building on the cascade relational triple extraction (CasRel) framework, the robustly optimized BERT approach with whole word masking (RoBERTa-wwm-ext) replaced the original bidirectional encoder representations from transformers (BERT) as the text encoder, improving the model's understanding of contextual semantics and its handling of complex linguistic structures. A multi-scale cross-attention mechanism was introduced after the subject annotation module. This mechanism takes subject embeddings and full-sentence embeddings as inputs, computes interaction scores using multi-head attention, fuses multi-source information through residual connections and layer normalization, and extracts high-level semantic representations via a feed-forward network (FFN). The FFN consists of two fully connected layers with a GELU activation function in between.
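As a minimal sketch of the FFN just described (two fully connected layers with a GELU in between), the following pure-Python code may help fix ideas. The helper names `gelu`, `linear`, and `feed_forward`, the toy weights, and the choice of the exact erf-based GELU (rather than the tanh approximation) are illustrative assumptions and are not taken from the paper.

```python
import math


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF (assumed variant).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    # Fully connected layer; `weight` holds one row of coefficients per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def feed_forward(x, w1, b1, w2, b2):
    # First layer expands the features, GELU is applied element-wise,
    # and the second layer projects back to the input width.
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Toy example (hypothetical sizes): 2-dimensional input, 3 hidden units.
w1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
b1 = [0.0, 0.1, -0.1]
w2 = [[0.2, -0.5, 0.3], [0.7, 0.1, -0.6]]
b2 = [0.0, 0.0]
y = feed_forward([1.0, -1.0], w1, b1, w2, b2)
print(y)
```

In the model itself, such a block would follow the residual connection and layer normalization applied to the cross-attention output; those steps are omitted here for brevity.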
The multi-scale feature fusion module obtains low-dimensional features through linear projection, concatenates the original and compressed features along the last dimension, and restores the original dimensionality through a linear transformation, thereby integrating global and local semantic information. This strengthens the model's perception of long-range dependencies and local semantic features and allows a finer characterization of the semantic relationships between entities. Additionally, a hybrid position encoding (HPE) method was incorporated into the multi-head attention mechanism to combine the advantages of absolute and relative position encoding. HPE comprises a relative position bias and rotary position encoding (RoPE). The relative position bias dynamically adjusts the shape of its bias matrix according to sequence length and maps positional relationships into the attention scores between queries and keys. RoPE applies sine and cosine functions to the even and odd dimensions of the query and key vectors as a rotational transformation, explicitly encoding positional information into the attention calculation. Together, these components model structural relationships between words more accurately, strengthen the model's perception of positional information, and significantly improve entity boundary recognition and role differentiation. Experimental results show that the proposed model achieved precision, recall, and F1 scores of 94.70%, 94.04%, and 94.37%, respectively, on the knowledge extraction task, representing improvements of 9.14%, 9.21%, and 9.18% over the original CasRel model. These results demonstrate a substantial improvement in entity-relationship extraction for the sheep disease domain. Finally, for the structured triples extracted from the text, semantic embedding vectors generated by the RoBERTa model were used to calculate cosine similarity for synonym discrimination and entity fusion.
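The RoPE component of HPE described earlier in this abstract can be illustrated with a short sketch. It rotates each (even, odd) pair of dimensions by a position-dependent angle, so that the query-key dot product depends only on the relative offset between the two positions. The function names, the frequency base of 10000, and the toy vectors are assumptions made for illustration; the relative position bias is not shown.

```python
import math


def apply_rope(vec, position, base=10000.0):
    # Rotate each (even, odd) pair of dimensions by an angle that grows
    # with the token position; `base` is an assumed, commonly used default.
    if len(vec) % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x_even, x_odd = vec[i], vec[i + 1]
        rotated.append(x_even * cos_t - x_odd * sin_t)
        rotated.append(x_even * sin_t + x_odd * cos_t)
    return rotated


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy query and key vectors (hypothetical values).
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.4, 0.1]

# Both pairs of positions are 2 tokens apart, so the scores should agree
# up to floating-point error.
score_a = dot(apply_rope(q, 5), apply_rope(k, 3))
score_b = dot(apply_rope(q, 12), apply_rope(k, 10))
print(score_a, score_b)
```

Because each pair is only rotated, vector norms are preserved, and the positional information enters the attention score purely through the relative angle between query and key.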
This fusion method uses a corresponding knowledge fusion corpus for each entity type, effectively identifying and unifying synonymous expressions while eliminating duplicate entities and potential ambiguities. After a retrieval check confirmed that no duplicates remained, the fused triples were stored in a Neo4j graph database, yielding a standardized, consistent, and well-structured sheep disease knowledge graph. This knowledge graph provides reliable data support for subsequent intelligent diagnosis and treatment of sheep diseases.
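The similarity-based fusion step can be sketched as follows. This is a simplified illustration under stated assumptions: the three-dimensional vectors stand in for RoBERTa embeddings, the threshold of 0.9, the greedy first-match grouping, the relation name `has_symptom`, and the entity names are all hypothetical, and the per-entity-type fusion corpora and the Neo4j storage step are not shown.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fuse_entities(embeddings, threshold=0.9):
    # Greedily map each entity name to the first earlier name whose
    # embedding is at least `threshold` similar (assumed strategy).
    canonical = {}
    representatives = []
    for name, vec in embeddings.items():
        for rep in representatives:
            if cosine_similarity(vec, embeddings[rep]) >= threshold:
                canonical[name] = rep
                break
        else:
            representatives.append(name)
            canonical[name] = name
    return canonical


def fuse_triples(triples, canonical):
    # Rewrite subjects and objects to canonical names and drop duplicates.
    seen, fused = set(), []
    for subj, rel, obj in triples:
        triple = (canonical.get(subj, subj), rel, canonical.get(obj, obj))
        if triple not in seen:
            seen.add(triple)
            fused.append(triple)
    return fused


# Toy stand-ins for RoBERTa embeddings (hypothetical values).
embeddings = {
    "sheep pox": [0.90, 0.10, 0.00],
    "ovine pox": [0.88, 0.12, 0.01],
    "fever": [0.00, 0.20, 0.95],
}
triples = [
    ("sheep pox", "has_symptom", "fever"),
    ("ovine pox", "has_symptom", "fever"),
]
canonical = fuse_entities(embeddings)
fused = fuse_triples(triples, canonical)
print(fused)
```

In this toy run the two synonymous disease names collapse to a single node, so only one triple remains to be written to the graph database.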