Dairy cow pose estimation algorithm based on frequency-spatial feature fusion and multi-scale self-calibration

刘海洋; 丁新鑫; 王荣; 李方静; 李奇峰; 高荣华

doi:10.11975/j.issn.1002-6819.202508231

Cattle pose estimation algorithm based on frequency–spatial fusion and multi-scale self-calibration

Abstract

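Abstract: In group-housed scenes, the body-surface features of dairy cows blend into the environment, key regions differ greatly in scale, and pose structures are easily confused, so existing models produce false and missed detections when extracting cow poses. To address these problems, this study proposes FSMCPose (frequency-spatial and multi-scale self-calibrated pose estimation), a cow pose estimation method based on frequency-spatial feature fusion and multi-scale self-calibration. First, a spatial-frequency enhancement block (SFEB) is constructed to fuse frequency-domain and spatial-domain information, which alleviates the blending of target and background features. Second, a receptive aggregation block (RAB) is adopted to improve the model's scale adaptability. Finally, a spatial channel self-calibration detection head (spatial channel self-calibration module head, SCSCHead) is designed; it introduces a joint spatial and channel self-calibration mechanism to improve the model's ability to distinguish keypoints when individual cows overlap. The results show that, at an object keypoint similarity threshold of 0.75, the average precision of FSMCPose reaches 92.5%, 1.9 percentage points higher than the baseline model RTMPose; the computational cost is 0.354 G; and the number of parameters is 2.698 M, 80.01% fewer than RTMPose. The average precision reaches 89.0%, which is 1.8, 1.0, 16.8, 1.2, 1.6, and 0.7 percentage points higher than the mainstream pose estimation models DEKR, CID, CoupledEmbedding, RTMO, SimCC, and DWPose, respectively. Visualization results and heatmap analysis show that the model locates keypoints more stably in scenes where group-housed cows overlap, and that its responses are more concentrated on key regions. The study can provide technical support for downstream tasks such as cow behavior recognition.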
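The average precision at a threshold of 0.75 reported above depends on object keypoint similarity (OKS). The following is a minimal sketch of the OKS definition used in COCO-style keypoint evaluation, not the authors' evaluation code. The function names and the per-keypoint constants passed as `k` are hypothetical; the abstract does not give the constants used for cow keypoints.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def object_keypoint_similarity(
    pred: Sequence[Point],
    gt: Sequence[Point],
    visible: Sequence[bool],
    area: float,
    k: Sequence[float],
) -> float:
    """COCO-style OKS between one predicted and one ground-truth pose.

    area is the object area (s^2 in the COCO formula) and k[i] is the
    per-keypoint falloff constant (hypothetical values in this sketch).
    Only keypoints labeled visible in the ground truth are scored.
    """
    total = 0.0
    count = 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visible, k):
        if not v:
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        count += 1
    return total / count if count else 0.0


def is_match(oks: float, threshold: float = 0.75) -> bool:
    """A prediction counts as correct at this threshold if OKS >= threshold."""
    return oks >= threshold
```

A prediction that coincides with the ground truth scores 1.0, and the score decays as keypoints drift relative to the object size, so a threshold of 0.75 is a stricter test of localization than a threshold of 0.50.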

Abstract: Mounting posture is one of the most intuitive and reliable visual indicators of estrus in dairy cows, and estimating it accurately helps determine whether a cow is in heat. However, recognizing posture in group-housed dairy environments remains challenging because of complex practical conditions, diverse individual appearances, and frequent occlusions among cows. The visual features of individual cows often blend with complex backgrounds such as bedding, fences, and other animals. Moreover, conventional vision algorithms cannot accurately differentiate individuals or posture transitions because coat patterns and body structures are highly similar or ambiguous. In addition, substantial scale variations among key anatomical regions (such as the head, limbs, and torso) make it difficult to detect and locate keypoints accurately, especially as camera distance and viewing angle vary. As a result, conventional pose estimation frequently misidentifies or loses critical keypoints and cannot fully meet practical requirements for robustness, accuracy, and generalization. In this study, a lightweight framework, termed Frequency-Spatial and Multi-scale Self-calibrated Pose Estimation (FSMCPose), was proposed to estimate cow mounting posture. The framework was designed for group-housed dairy scenarios, with high detection accuracy and computational efficiency as goals. A lightweight backbone network (CowMountNet) was incorporated to extract visual features efficiently, with low computational complexity and memory footprint. The Spatial-Frequency Enhancement Block (SFEB) served as the first critical component of the overall architecture. It used the multi-scale decomposition of wavelet transforms to capture fine-grained features at different spatial frequencies, combined with the smoothing properties of Gaussian distributions to suppress background noise and preserve feature continuity. This collaborative enhancement between spatial- and frequency-domain representations improved background separation while preserving the lightweight nature of the model. Following SFEB, the Receptive-field Adaptive Block (RAB) was employed to extract multi-scale contextual features. Multiple parallel branches were introduced after pointwise convolutions, each configured with its own dilation rate to capture contextual information under varying receptive fields. This structure strengthened sensitivity to small-scale keypoints, such as subtle joints or limb regions, while maintaining the integrity of large-scale anatomical features of the cow’s body. Finally, the Spatial-Channel Self-calibration Head (SCSCHead) integrated spatial and channel attention mechanisms to enhance both the spatial awareness and the semantic discrimination of high-level features. In addition, a self-calibration branch was introduced to compensate for potential structural deviations caused by occlusions, overlapping individuals, or motion blur, further improving the stability and precision of keypoint predictions. Experimental evaluations demonstrated that FSMCPose reached 89.0%, 92.5%, 89.9%, and 97.7% in AP, AP₇₅, AR, and AR₇₅, respectively, improvements of 1.2, 3.0, 0.9, and 1.8 percentage points over the baseline that combines MobileNet as the backbone with RTMPose as the detection head. Furthermore, the framework reduced the number of parameters by 80.01% relative to the baseline model, to only 2.698 M, and limited the floating-point operations to 0.354 G, indicating a good balance between performance and efficiency. FSMCPose can provide robust and accurate keypoint localization under complex background interference, individual overlap, and scale variation, making it an efficient upstream module for subsequent behavior recognition and estrus detection. Consequently, the findings can also provide a reliable reference for cow monitoring in scalable precision dairy farming.
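The parallel dilated branches described for the RAB can be illustrated with a one-dimensional toy example. This is only a sketch of the general technique, not the authors' module: the function names, the kernel, the dilation rates, and the choice to fuse branches by element-wise summation are all assumptions, since the abstract specifies none of them.

```python
from typing import List, Sequence


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input samples spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(
    signal: Sequence[float], kernel: Sequence[float], dilation: int
) -> List[float]:
    """Same-length 1D cross-correlation with zero padding and dilation.

    Assumes an odd kernel length so the kernel is centered on each sample.
    """
    half_span = dilation * (len(kernel) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - half_span + j * dilation
            if 0 <= idx < len(signal):  # positions outside the signal act as zeros
                acc += w * signal[idx]
        out.append(acc)
    return out


def parallel_dilated_branches(
    signal: Sequence[float], kernel: Sequence[float], dilations: Sequence[int]
) -> List[float]:
    """Run one branch per dilation rate and sum the outputs (assumed fusion)."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(values) for values in zip(*branches)]
```

With a 3-tap kernel, dilation rates of 1, 2, and 3 span 3, 5, and 7 input samples at the same parameter count, which is why adding branches widens the context without adding weights per branch.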
