Abstract
Mounting posture is one of the most intuitive, reliable, and visually salient indicators of estrus in dairy cows, and its accurate recognition is widely used to determine whether a cow is in heat. However, recognizing this posture in group-housed dairy environments remains challenging because of complex practical conditions, diverse individual appearances, and frequent occlusions among cows. The visual features of individual cows often blend with complex backgrounds, such as bedding, fences, and other animals. Moreover, conventional vision algorithms cannot accurately differentiate individuals or posture transitions because of the high similarity and ambiguity of coat patterns and body structures. In addition, substantial scale variations among key anatomical regions, such as the head, limbs, and torso, make it difficult to detect and localize keypoints accurately, especially under varying camera distances and viewing angles. Consequently, conventional pose estimation cannot fully meet the practical requirements of robustness, accuracy, and generalization, since critical keypoints are frequently misidentified or lost. In this study, a lightweight framework, termed Frequency-Spatial and Multi-scale Self-calibrated Pose Estimation (FSMCPose), was proposed to estimate cow mounting posture in group-housed dairy scenarios, targeting both high detection accuracy and computational efficiency. A lightweight backbone network (CowMountNet) was incorporated to extract visual features efficiently, with low computational complexity and a small memory footprint. The Spatial-Frequency Enhancement Block (SFEB) served as the first critical component of the overall architecture: multi-scale wavelet decomposition was used to capture fine-grained features at different spatial frequencies, and the smoothing properties of Gaussian distributions were exploited to suppress background noise while preserving feature continuity.
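As a hedged illustration only (not the authors' implementation), the SFEB idea of combining wavelet decomposition with Gaussian smoothing can be sketched in NumPy: a one-level Haar transform separates a feature map into low- and high-frequency bands, a Gaussian filter suppresses noise in the low-frequency band, and the inverse transform recombines the bands with fine detail intact. The Haar level count, the sigma value, and the choice to smooth only the LL band are assumptions for this sketch.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar decomposition into approximation (LL)
    and detail (LH, HL, HH) sub-bands; x must have even H and W."""
    p, q = x[0::2, 0::2], x[0::2, 1::2]
    r, s = x[1::2, 0::2], x[1::2, 1::2]
    ll = (p + q + r + s) / 2
    lh = (p - q + r - s) / 2  # horizontal detail
    hl = (p + q - r - s) / 2  # vertical detail
    hh = (p - q - r + s) / 2  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def gaussian_smooth(x, sigma=1.0):
    """Separable Gaussian filtering with edge padding."""
    r = max(1, int(3 * sigma))
    t = np.arange(-r, r + 1)
    k = np.exp(-t * t / (2 * sigma * sigma))
    k /= k.sum()
    xp = np.pad(x, r, mode="edge")
    xp = np.apply_along_axis(np.convolve, 1, xp, k, mode="same")
    xp = np.apply_along_axis(np.convolve, 0, xp, k, mode="same")
    return xp[r:-r, r:-r]

def sfeb_denoise(x, sigma=1.0):
    """Smooth only the low-frequency band, so background noise is
    suppressed while high-frequency (fine-grained) detail is preserved."""
    ll, lh, hl, hh = haar_dwt2(x)
    return haar_idwt2(gaussian_smooth(ll, sigma), lh, hl, hh)
```

Because only the LL band is filtered, edges and small joint structures carried by the detail bands pass through unchanged, which matches the stated goal of background suppression without losing fine-grained features.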
Collaborative enhancement between spatial- and frequency-domain representations effectively improved foreground-background separation while preserving the lightweight nature of the model. Following the SFEB, a Receptive-field Adaptive Block (RAB) was employed to extract multi-scale contextual features. Multiple parallel branches were introduced after pointwise convolutions, each configured with a different dilation rate to capture contextual information under varying receptive fields. This structure strengthened sensitivity to small-scale keypoints, such as subtle joints or limb regions, while maintaining the integrity of large-scale anatomical features of the cow's body. Finally, the Spatial-Channel Self-calibration Head (SCSCHead) integrated spatial and channel attention mechanisms to enhance both the spatial awareness and the semantic discrimination of high-level features. In addition, a self-calibration branch was introduced to compensate for potential structural deviations caused by occlusions, overlapping individuals, or motion blur, further improving the stability and precision of keypoint predictions. Experimental evaluations demonstrated that FSMCPose achieved improvements of 1.2, 3.0, 0.9, and 1.8 percentage points in AP, AP75, AR, and AR75, reaching 89.0%, 92.5%, 89.9%, and 97.7%, respectively, compared with the baseline combination of MobileNet as the backbone and RTMPose as the detection head. Furthermore, the framework reduced the number of parameters by 80.01% relative to the baseline model, to only 2.698 M, while the floating-point operations were limited to 0.354 G, indicating an excellent balance between performance and efficiency. FSMCPose provides robust and accurate keypoint localization under complex background interference, individual overlap, and scale variation, serving as an efficient upstream module for subsequent behavior recognition and estrus detection.
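The RAB's use of parallel branches with different dilation rates can be illustrated with a minimal 1-D NumPy sketch (the paper's block operates on 2-D feature maps after pointwise convolutions; the 1-D form, the shared kernel, and the dilation rates below are simplifying assumptions). A dilated convolution spaces its taps d samples apart, widening the receptive field to d*(k-1)+1 without adding parameters, and summing the branches fuses small- and large-scale context.

```python
import numpy as np

def dilated_conv1d(x, w, d):
    """'Same'-padded 1-D convolution whose taps are spaced d samples
    apart; effective receptive field is d * (len(w) - 1) + 1.
    Assumes an odd-length kernel so the output stays centered."""
    r = d * (len(w) - 1) // 2
    xp = np.pad(x, r, mode="constant")
    out = np.zeros(len(x))
    for i, wi in enumerate(w):
        out += wi * xp[i * d : i * d + len(x)]
    return out

def rab_multiscale(x, w, dilations=(1, 2, 4)):
    """Fuse parallel branches that share one kernel but use different
    dilation rates, capturing small and large contexts simultaneously."""
    return sum(dilated_conv1d(x, w, d) for d in dilations)
```

Feeding an impulse through a dilation-2 branch spreads its response to offsets of +/-2, showing how each branch sees a different neighborhood of the same input.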
Consequently, these findings can also provide a reliable reference for cow monitoring in scalable precision dairy farming.
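To make the SCSCHead's recalibration concrete, the following NumPy sketch combines a squeeze-excite-style channel gate, a spatial gate, and a residual branch that adds the gated features back onto the input. The specific gating functions and the residual form are assumptions for illustration; the paper's head may differ in detail.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat):
    """Gate each channel by a sigmoid of its global average,
    sharpening semantic discrimination across channels."""
    g = sigmoid(feat.mean(axis=(1, 2)))  # (C,)
    return feat * g[:, None, None]

def spatial_attention(feat):
    """Gate each spatial location by a sigmoid of the
    cross-channel mean, enhancing spatial awareness."""
    m = sigmoid(feat.mean(axis=0))       # (H, W)
    return feat * m[None, :, :]

def scsc_head(feat):
    """Channel + spatial recalibration with a residual
    'self-calibration' branch: the gated features are added
    back onto the input, so structural deviations are corrected
    without discarding the original representation."""
    calibrated = spatial_attention(channel_attention(feat))
    return feat + calibrated
```

The residual addition means the head can only refine, never erase, the incoming features, which is one plausible way to stabilize keypoint predictions under occlusion or motion blur.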