Abstract:
Pig pose estimation is fundamental to intelligent livestock management, welfare assessment, and precision agriculture, especially as modern pig farms continue to grow in scale and automation. However, in real-world farm environments, frequent mutual occlusion between animals, self-occlusion due to non-frontal postures, and the scarcity of annotated data make accurate pose estimation a significant challenge. Traditional heatmap-based approaches, though effective under clear and unobstructed conditions, do not explicitly model the anatomical and structural dependencies among keypoints, leading to substantial performance degradation when body parts are obscured. Addressing these limitations is crucial for developing robust monitoring systems that operate reliably in the complex and dynamic conditions of commercial pig production. This study presents DLV-Pose, a two-stage occlusion-robust pig pose estimation framework based on neural discrete representation learning. In the first stage, an encoder, codebook, and decoder are jointly optimized to learn a structured prior of pig poses by projecting each pose into multiple discrete latent vectors. These vectors capture local structural patterns and the dependencies among both visible and occluded keypoints, enabling reliable reconstruction of complete poses even under severe occlusion. In the second stage, a classification head is trained to map features extracted by a pretrained backbone (such as Swin-Transformer-B) to codebook indices. To enhance the discriminative power of the feature representation, a contrastive learning module is introduced that uses a cosine annealing strategy to dynamically adjust the contrastive loss weight. This strategy encourages a well-structured feature space early in training and supports fine-grained classification as the model converges.
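The two mechanisms described above can be illustrated with a minimal sketch: mapping continuous latent vectors to their nearest codebook entries, and a cosine-annealed weight for the contrastive loss term. Function names, dimensions, and schedule endpoints here are hypothetical illustrations, not the authors' implementation:

```python
import math

def quantize(latents, codebook):
    """Map each continuous latent vector to the index (and entry) of its
    nearest codebook vector under squared Euclidean distance."""
    def nearest(v):
        return min(range(len(codebook)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])))
    indices = [nearest(v) for v in latents]
    return indices, [codebook[i] for i in indices]

def contrastive_weight(step, total_steps, w_start=1.0, w_end=0.1):
    """Cosine-annealed contrastive loss weight: starts at w_start and
    decays smoothly to w_end as training proceeds (hypothetical endpoints)."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return w_end + (w_start - w_end) * cos

# Toy codebook with 3 entries of dimension 2; two latent vectors snap to
# entries 0 and 1 respectively.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
indices, vectors = quantize([[0.1, 0.1], [0.9, 1.1]], codebook)
```

In the actual model the codebook would be learned jointly with the encoder and decoder (e.g. with a straight-through gradient estimator), and the second-stage classifier would predict these indices directly from backbone features.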
For comprehensive evaluation, a dataset of 2,708 pig images covering both daytime and nighttime scenes in real farm settings was constructed to validate DLV-Pose. To simulate real-world occlusion, synthetic stripe-shaped masks were superimposed on the validation and test sets, yielding a keypoint occlusion ratio of 21.1%. Extensive experiments demonstrate that DLV-Pose consistently and significantly outperforms conventional heatmap-based methods. With Swin-Transformer-B as the backbone, DLV-Pose achieves an average precision (AP) of 44.0% on the occluded test set, an improvement of 13.1 percentage points over the baseline. The contrastive learning module further raises AP by 2.2 points. Ablation experiments confirm the effectiveness of both the discrete latent representation and contrastive learning, and show that 34 latent vectors with a codebook of 1,024 entries yield the optimal performance. Visualization results further demonstrate that DLV-Pose produces accurate and stable pose predictions across various occlusion types, target sizes, and farm scenarios. Although the discrete latent vectors and contrastive learning module increase the parameter count and reduce inference speed relative to baseline methods, DLV-Pose still maintains a processing rate suitable for real-time farm monitoring. This work presents a novel and scalable solution to animal pose estimation under occlusion and data scarcity, with strong potential for deployment in intelligent pig farming, health surveillance, and automated welfare assessment. Future work will focus on optimizing model efficiency for edge devices and extending the approach to more diverse farm environments and animal species.