Abstract:
Pig pose estimation is fundamental to intelligent livestock management, welfare assessment, and precision agriculture, especially as modern pig farms continue to grow in scale and automation. However, in real-world farm environments, frequent mutual occlusion between animals, self-occlusion due to non-frontal postures, and the scarcity of annotated data make accurate pose estimation a significant challenge. Traditional heatmap-based approaches, though effective under clear and unobstructed conditions, do not explicitly model the anatomical and structural dependencies among keypoints, leading to substantial performance degradation when body parts are obscured. Addressing these limitations is crucial for developing robust monitoring systems that operate reliably in the complex and dynamic conditions of commercial pig production. This study presents DLV-Pose, a two-stage occlusion-robust pig pose estimation framework based on neural discrete representation learning. In the first stage, an encoder, codebook, and decoder are jointly optimized to learn a structured prior of pig poses by projecting each pose into multiple discrete latent vectors. These vectors capture local structural patterns and the dependencies among both visible and occluded keypoints, enabling reliable reconstruction of complete poses even under severe occlusion. In the second stage, a classification head is trained to map features extracted by a pretrained backbone (such as Swin-Transformer-B) to codebook indices. To enhance the discriminative power of the feature representation, a contrastive learning module is introduced that uses a cosine annealing strategy to dynamically adjust the contrastive loss weight. This strategy encourages a well-structured feature space early in training and supports fine-grained classification as the model converges.
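The two mechanisms described above can be illustrated with a minimal sketch: mapping continuous latent vectors to their nearest codebook entries, and a cosine-annealed weight for the contrastive loss term. Function names, dimensions, and schedule endpoints here are hypothetical illustrations, not the authors' implementation:

```python
import math

def quantize(latents, codebook):
    """Map each continuous latent vector to the index (and entry) of its
    nearest codebook vector under squared Euclidean distance."""
    def nearest(v):
        return min(range(len(codebook)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])))
    indices = [nearest(v) for v in latents]
    return indices, [codebook[i] for i in indices]

def contrastive_weight(step, total_steps, w_start=1.0, w_end=0.1):
    """Cosine-annealed contrastive loss weight: starts at w_start and
    decays smoothly to w_end as training proceeds (hypothetical endpoints)."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return w_end + (w_start - w_end) * cos

# Toy codebook with 3 entries of dimension 2; two latent vectors snap to
# entries 0 and 1 respectively.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
indices, vectors = quantize([[0.1, 0.1], [0.9, 1.1]], codebook)
```

In the actual model the codebook would be learned jointly with the encoder and decoder (e.g. with a straight-through gradient estimator), and the second-stage classifier would predict these indices directly from backbone features.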
For comprehensive evaluation, a dataset of 2,708 pig images covering both daytime and nighttime scenes in real farm settings was constructed to validate DLV-Pose. To simulate real-world occlusion, synthetic stripe-shaped masks were superimposed on the validation and test sets, yielding a keypoint occlusion ratio of 21.1%. Extensive experiments demonstrate that DLV-Pose consistently and significantly outperforms conventional heatmap-based methods. With Swin-Transformer-B as the backbone, DLV-Pose achieves an average precision (AP) of 44.0% on the occluded test set, an improvement of 13.1 percentage points over the baseline. The contrastive learning module further raises AP by 2.2 points. Ablation experiments confirm the effectiveness of both the discrete latent representation and contrastive learning, and show that 34 latent vectors with a codebook of 1,024 entries yield the optimal performance. Visualization results further demonstrate that DLV-Pose produces accurate and stable pose predictions across various occlusion types, target sizes, and farm scenarios. Although the discrete latent vectors and contrastive learning module increase the parameter count and reduce inference speed relative to baseline methods, DLV-Pose still maintains a processing rate suitable for real-time farm monitoring. This work presents a novel and scalable solution to animal pose estimation under occlusion and data scarcity, with strong potential for deployment in intelligent pig farming, health surveillance, and automated welfare assessment. Future work will focus on optimizing model efficiency for edge devices and extending the approach to more diverse farm environments and animal species.