
Individual body mass estimation of group-housed beef cattle based on attention mechanism and cross-modal hierarchical feature fusion

  • Abstract: To address the complexity and low accuracy of weighing individual beef cattle in group-housed scenes, this study proposes CMHFF-ResNet (cross-modal hierarchical feature fusion ResNet), a model based on an attention mechanism. First, RGB (red-green-blue) and depth images of beef cattle during daily activity were collected contactlessly from a top-down viewpoint, and a YOLOv8 network with oriented bounding boxes (OBB) was used for rotated object detection and identification, accurately locating individual targets in the group-housed scene. Second, a dual-stream weight-estimation model was built on a ResNet50 backbone to extract RGB and depth modal features separately, and the CBAM (convolutional block attention module) attention mechanism was introduced to strengthen the expression of key features. A cross-modal hierarchical feature fusion was designed to combine the RGB-stream and depth-stream features effectively and make full use of shallow features. Third, cattle identity information was introduced so that the network could learn the correspondence between an animal's identity and its body mass; to improve model efficiency, the fully connected layer was replaced with a KAN (Kolmogorov-Arnold network), significantly reducing the parameter count. Finally, the outputs of the two streams were fused to regress the body mass of each animal. In the experiments, a dataset of 2 546 RGB-D image pairs was constructed, comprising 2 373 training pairs and 173 validation pairs. CMHFF-ResNet achieved a mean absolute error of 14.19 kg on the validation set. Compared with single-stream models using RGB or depth alone, the dual-stream model reduced the mean absolute error by 16.943% and 26.133%, respectively. The method also outperformed existing beef-cattle weight-estimation approaches: multiple linear regression, an improved MobileNetV2, an improved DenseNet201, and the improved cross-modal feature fusion model CFF-ResNet, reducing the mean absolute error by 57.233%, 34.699%, 24.761%, and 20.991%, respectively. The model improves the accuracy and generalization of individual body mass estimation in group-housed environments and effectively learns cross-modal hierarchical feature representations, providing a reference for high-precision estimation of individual beef cattle body mass in large-scale group housing.
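The detection step predicts oriented (rotated) bounding boxes rather than axis-aligned ones, so each detection must be converted to its four corner points before an individual animal can be cropped out of the frame. The sketch below shows that conversion for a (cx, cy, w, h, angle) box; the parameterisation mirrors the format used by YOLOv8's OBB head, but the exact function and values here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def obb_corners(cx, cy, w, h, theta):
    """Return the 4 corner points (4, 2) of an oriented bounding box.

    (cx, cy): box centre in pixels; (w, h): box width/height;
    theta: rotation angle in radians. Illustrative sketch only.
    """
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    # Half-extent corner offsets of the axis-aligned box,
    # rotated about the centre and then shifted to (cx, cy).
    offsets = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    return offsets @ rot.T + np.array([cx, cy])

# Hypothetical detection of one animal in a 640x480 top-down frame.
corners = obb_corners(cx=320.0, cy=240.0, w=200.0, h=80.0, theta=np.pi / 6)
```

Rotating the corner offsets rather than the image keeps the crop tight around the animal's body axis, which matters when cattle stand at arbitrary angles under a fixed overhead camera.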

     

    Abstract: Weighing individual beef cattle in herd-breeding scenes is complicated and often inaccurate. In this study, a cross-modal hierarchical feature fusion model (CMHFF-ResNet) based on an attention mechanism was proposed. Firstly, a camera was installed directly above the drinking area of the cattle shed, and RGB (red-green-blue) and depth images of beef cattle during their daily activities were collected from a top-down perspective. A YOLOv8 network with an oriented bounding box (OBB) head was used to detect and identify the rotated targets, accurately locating individual cattle in the herd scene. The depth images were denoised with morphological opening and closing operations, yielding higher-quality depth maps after optimization. Secondly, a dual-stream weight-estimation model was built on a ResNet50 backbone to extract RGB and depth modal features, and the CBAM (convolutional block attention module) was introduced to enhance the expression of key features. A cross-modal hierarchical feature fusion was designed to combine the RGB and depth streams effectively and make full use of shallow features. Thirdly, the identity information of the cattle was fed into the network so that it could learn the relationship between cattle identity and body mass, and the fully connected layer was replaced by a KAN (Kolmogorov-Arnold network) to optimize model efficiency, significantly reducing the number of parameters. Finally, the outputs of the two streams were fused to regress the body mass of the cattle. In the experiments, a dataset of 2 546 pairs of RGB-D images was constructed, including 2 373 training pairs and 173 validation pairs. The mean absolute error of CMHFF-ResNet on the validation set was 14.19 kg. Compared with single-stream models using RGB or depth alone, the dual-stream model reduced the mean absolute error by 16.943% and 26.133%, respectively. Compared with the original dual-stream baseline, adding the CBAM, the CMHFF module, the identity information, and the KAN head reduced the mean absolute error by 1.19, 3.23, 1.14, and 0.14 kg, respectively. The method also outperformed existing beef-cattle weight-estimation approaches: compared with multiple linear regression, ridge regression, quadratic polynomial regression, an improved EfficientNetV2, an improved MobileNetV2, an improved DenseNet201, an improved ResNeXt101, and the improved cross-modal feature fusion model CFF-ResNet, the mean absolute error was reduced by 57.233%, 57.169%, 47.984%, 31.250%, 34.699%, 24.761%, 23.751%, and 20.991%, respectively. In addition, the mean absolute error of CMHFF-ResNet on a cross-breed beef cattle dataset was 13.25 kg. The model improves the accuracy and generalization of individual body mass estimation in herd environments and can effectively learn cross-modal hierarchical feature representations, providing a strong reference for high-precision estimation of individual beef cattle body mass in large-scale herd environments.
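The abstract's largest single ablation gain comes from attention: CBAM first reweights channels using globally pooled descriptors, then reweights spatial positions. The following NumPy sketch shows the channel-attention branch on one (C, H, W) feature map; the layer sizes, reduction ratio, and weights are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """CBAM-style channel attention on a (C, H, W) feature map.

    Average- and max-pooled channel descriptors pass through a shared
    two-layer MLP (w1: (C//r, C), w2: (C, C//r)), are summed, squashed
    with a sigmoid, and used to rescale the channels. Sketch only.
    """
    avg = feat.mean(axis=(1, 2))                    # (C,) global average pool
    mx = feat.max(axis=(1, 2))                      # (C,) global max pool
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # shared MLP with ReLU
    weights = sigmoid(mlp(avg) + mlp(mx))           # (C,) in (0, 1)
    return feat * weights[:, None, None]

# Toy example: 8 channels, reduction ratio 2, random feature map/weights.
rng = np.random.default_rng(0)
C, r = 8, 2
feat = rng.standard_normal((C, 4, 4))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out = channel_attention(feat, w1, w2)
```

Because the sigmoid keeps every channel weight in (0, 1), the branch can only attenuate uninformative channels, never amplify them, which is why it slots safely between residual blocks of each stream.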

     

