
Sheep disease knowledge graph construction method based on CaRoMHPE

  • Abstract: A knowledge graph for the sheep disease domain is a prerequisite for sheep disease prevention, control, and intelligent diagnosis and treatment. To address the ambiguous semantic boundaries, overlapping entity roles, and complex relational semantics of sheep disease texts, a knowledge graph construction method based on the CaRoMHPE model (CasRel-based model combined with RoBERTa, multi-scale cross-attention mechanism, and hybrid position encoding in multi-head attention) is proposed. First, according to the characteristics of sheep disease corpora, a sheep disease dataset containing 9 entity types and 8 relation types was constructed, covering the key entities and relations across the entire diagnosis-and-treatment process and providing data support for the entity-relation extraction task. Then, taking CasRel (cascade relational triple extraction) as the base model, RoBERTa-wwm-ext (robustly optimized BERT approach with whole word masking) replaced BERT (bidirectional encoder representations from transformers) as the pre-trained encoder, strengthening the model's contextual understanding and its handling of complex linguistic structures; a multi-scale cross-attention mechanism was added after the subject-tagging module to better refine the semantic relations between entities, and hybrid position encoding (HPE) was incorporated into the multi-head attention mechanism to improve entity boundary delimitation and role discrimination in the relation extraction task. Results show that the model achieved precision, recall, and F1 scores of 94.70%, 94.04%, and 94.37% in knowledge extraction, improvements of 9.14, 9.21, and 9.18 percentage points over the CasRel model, enhancing entity-relation extraction for sheep disease information. Finally, on the basis of the extracted triples, semantic embedding and a cosine-similarity algorithm were combined to eliminate synonym duplication and resolve potential ambiguities, yielding a normalized knowledge graph that provides strong support for intelligent sheep disease diagnosis and treatment.

     

    Abstract: The construction of a knowledge graph in the field of sheep diseases is a crucial prerequisite for disease prevention, control, and intelligent diagnosis. This paper addresses the common challenges in sheep disease texts, such as ambiguous semantic boundaries, overlapping entity roles, and complex relational semantics, by proposing a knowledge graph construction method based on the CaRoMHPE model (CasRel-based model combined with RoBERTa, multi-scale cross-attention mechanism, and hybrid position encoding in multi-head attention). First, based on the characteristics of sheep disease corpora, a dedicated dataset containing 9 types of entities and 8 types of relationships was constructed, covering key entities and their relationships throughout the entire process from disease diagnosis to treatment, providing substantial data support for entity-relationship extraction tasks. Subsequently, building on the cascade relational triple extraction (CasRel) framework, the robustly optimized BERT approach with whole word masking (RoBERTa-wwm-ext) was used to replace the traditional bidirectional encoder representations from transformers (BERT) as the text encoder, enhancing the model's understanding of contextual semantics and its ability to handle complex linguistic structures. After the subject annotation module, a multi-scale cross-attention mechanism was introduced. This mechanism takes subject embeddings and full-sentence embeddings as inputs, computes interaction scores using multi-head attention, fuses multi-source information through residual connections and layer normalization, and further extracts high-level semantic representations via a feed-forward network (FFN). The FFN consists of two fully connected layers with a GELU activation function in between.
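The cross-attention step described above (subject embeddings as queries against full-sentence embeddings, followed by residual connections, layer normalization, and a GELU feed-forward network) can be sketched in a minimal single-head numpy form. This is an illustrative reconstruction only: the dimensions and random weights are placeholders, not the paper's trained parameters, and the real model uses multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def cross_attention_block(subject, sentence, d_ff=32, seed=0):
    """subject: (m, d) subject-span embeddings used as queries;
       sentence: (n, d) full-sentence embeddings used as keys/values."""
    d = subject.shape[-1]
    scores = subject @ sentence.T / np.sqrt(d)   # (m, n) interaction scores
    attended = softmax(scores) @ sentence        # (m, d) attention output
    x = layer_norm(subject + attended)           # residual + layer norm
    # position-wise FFN: two fully connected layers with GELU in between
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((d, d_ff)) / np.sqrt(d)
    W2 = rng.standard_normal((d_ff, d)) / np.sqrt(d_ff)
    return layer_norm(x + gelu(x @ W1) @ W2)     # second residual + norm

rng = np.random.default_rng(1)
out = cross_attention_block(rng.standard_normal((2, 8)),   # 2 subject tokens
                            rng.standard_normal((6, 8)))   # 6 sentence tokens
print(out.shape)  # (2, 8)
```

The output keeps the subject-side shape, so the fused representation can be passed directly to the downstream object/relation tagger.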
The multi-scale feature fusion module obtains low-dimensional features through linear projection, concatenates the original and compressed features along the last dimension, and then restores the original dimensionality through linear transformation, thereby integrating global and local semantic information. This enhances the model's perception of long-range dependencies and local semantic features, allowing for a more refined characterization of semantic relationships between entities. Additionally, a hybrid position encoding (HPE) method was incorporated into the multi-head attention mechanism, combining the advantages of absolute and relative position encoding; it comprises a relative position bias and rotary position encoding (RoPE). The relative position bias dynamically adjusts the bias matrix shape based on sequence length and maps positional relationships into the attention scores between queries and keys. RoPE applies sine and cosine functions to the even and odd dimensions of query and key vectors for rotational transformation, explicitly encoding positional information into the attention calculation. This enables more accurate modeling of structural relationships between words, enhances the model's perception of positional information, and significantly improves entity boundary recognition and role differentiation. Experimental results show that the proposed model achieved precision, recall, and F1 scores of 94.70%, 94.04%, and 94.37%, respectively, in the knowledge extraction task, representing improvements of 9.14, 9.21, and 9.18 percentage points over the original CasRel model. These results demonstrate a significant optimization in entity-relationship extraction for the sheep disease domain. Finally, based on the structured triples extracted from the text, semantic embedding vectors generated by the RoBERTa model were used to calculate cosine similarity for synonym discrimination and entity fusion.
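The hybrid position encoding described above can be illustrated with a numpy sketch: RoPE rotates the (even, odd) dimension pairs of each query/key vector by a position-dependent angle, and a relative position bias (sliced to the current sequence length) is added to the resulting attention scores. The frequency base, bias range, and random bias values below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rope(x):
    """Rotary position encoding: rotate (even, odd) dimension pairs of each
    position's vector by an angle that grows with the position index."""
    n, d = x.shape
    pos = np.arange(n)[:, None]                    # (n, 1) absolute positions
    inv_freq = 10000.0 ** (-np.arange(0, d, 2) / d)  # (d/2,) per-pair frequencies
    ang = pos * inv_freq                           # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def attention_scores_with_hpe(q, k, max_rel=4, seed=0):
    """Scores = RoPE(q)·RoPE(k)^T / sqrt(d) + relative-position bias,
    where the bias matrix is built to match the current sequence length."""
    n, d = q.shape
    scores = rope(q) @ rope(k).T / np.sqrt(d)      # RoPE encodes position in the dot product
    rng = np.random.default_rng(seed)
    bias_table = rng.standard_normal(2 * max_rel + 1) * 0.02  # one bias per clipped offset
    rel = np.clip(np.arange(n)[:, None] - np.arange(n)[None, :], -max_rel, max_rel)
    return scores + bias_table[rel + max_rel]      # (n, n) biased attention scores

rng = np.random.default_rng(1)
q, k = rng.standard_normal((5, 8)), rng.standard_normal((5, 8))
scores = attention_scores_with_hpe(q, k)
print(scores.shape)  # (5, 5)
```

A useful property of the rotary part is that the dot product between two rotated vectors depends only on their relative offset, which is what lets the model reason about word-to-word structural relationships independently of absolute position.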
This method leverages corresponding knowledge fusion corpora for different entity types, effectively identifying and unifying synonymous expressions while eliminating duplicate entities and potential ambiguities. After confirming no duplicates through retrieval, the fused triples were stored in a Neo4j graph database, resulting in a standardized, consistent, and well-structured sheep disease knowledge graph. This provides reliable data support for subsequent intelligent diagnosis and treatment of sheep diseases.
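The synonym-fusion step described above can be sketched as a greedy cosine-similarity merge: each entity embedding is compared against the canonical entities kept so far, and anything above a similarity threshold is mapped to the existing canonical form. The toy vectors, entity names, and the 0.9 threshold below are illustrative stand-ins for the RoBERTa embeddings and tuning the paper would use.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def merge_synonyms(entities, embed, threshold=0.9):
    """Greedy entity fusion: an entity whose embedding is cosine-similar
    (above `threshold`) to an already-kept canonical entity is mapped to it;
    otherwise it becomes a new canonical entity."""
    canonical, mapping = [], {}
    for name in entities:
        match = next((c for c in canonical
                      if cosine(embed[name], embed[c]) >= threshold), None)
        mapping[name] = match if match is not None else name
        if match is None:
            canonical.append(name)
    return mapping

# toy embeddings standing in for RoBERTa semantic vectors (hypothetical values)
emb = {"sheep pox": np.array([1.0, 0.1, 0.0]),
       "ovine pox": np.array([0.98, 0.12, 0.01]),   # near-duplicate of "sheep pox"
       "foot rot":  np.array([0.0, 1.0, 0.2])}
mapping = merge_synonyms(list(emb), emb)
print(mapping["ovine pox"])  # sheep pox
```

After this mapping is applied to the subjects and objects of every triple, the deduplicated triples can be loaded into Neo4j (e.g. via `MERGE` statements) so that each canonical entity appears as a single node.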
