Abstract:
Individual weighing of beef cattle in herd breeding scenes has long been hampered by complex procedures and low accuracy. In this study, a cross-modal hierarchical feature fusion model (CMHFF-ResNet) based on an attention mechanism was proposed. Firstly, a camera was installed directly above the drinking area of the cattle shed, and RGB (red-green-blue) and depth images of beef cattle were collected from a top-down perspective during daily activities. A YOLOv8 network with oriented bounding boxes (OBB) was used to detect and recognize the rotated targets, so that individual cattle were accurately located in the herd breeding scene. The depth images were denoised with morphological opening and closing operations, and high-resolution depth images were obtained after optimization. Secondly, a dual-stream weight-estimation model was established to extract RGB and depth modal features with a ResNet-50 backbone network. A convolutional block attention module (CBAM) was introduced to enhance the expression of key features, and a cross-modal hierarchical feature fusion module was designed to effectively combine the RGB and depth streams and make full use of shallow features. Thirdly, the identity information of the beef cattle was introduced into the network to learn the relationship between cattle identity and body mass, and the fully connected layer was replaced by a KAN (Kolmogorov-Arnold network) to improve model efficiency and significantly reduce the number of parameters. Finally, the outputs of the two streams were fused to regress the body mass of the beef cattle. In the experiments, a dataset of 2,546 pairs of RGB-D images was constructed, comprising 2,373 pairs of training data and 173 pairs of validation data. The mean absolute error (MAE) of CMHFF-ResNet on the validation set was 14.19 kg.
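The morphological opening and closing used to denoise the depth images can be sketched as follows. This is a minimal NumPy illustration with a 3x3 square structuring element (an assumption; the kernel shape and size are not specified above), not the authors' implementation.

```python
import numpy as np

def dilate(img, k=3):
    """Grayscale dilation: each pixel becomes the max of its k x k neighborhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def erode(img, k=3):
    """Grayscale erosion: each pixel becomes the min of its k x k neighborhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out

def denoise_depth(depth, k=3):
    """Closing (dilate then erode) fills small holes; opening (erode then
    dilate) removes isolated spikes."""
    closed = erode(dilate(depth, k), k)
    return dilate(erode(closed, k), k)

# Toy depth map: uniform 100 with a single dropout pixel (0), a typical
# depth-sensor artifact.
depth = np.full((5, 5), 100, dtype=np.uint16)
depth[2, 2] = 0
print(denoise_depth(depth)[2, 2])  # the hole is filled: 100
```

Closing first fills the small zero-valued holes common in depth maps, and opening then suppresses isolated high-valued speckles, which matches the open/close denoising step described above.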
Compared with the RGB and depth single-stream models, the dual-stream model reduced the MAE by 16.943% and 26.133%, respectively. Compared with the original dual-stream baseline model, the CBAM, the CMHFF module, the addition of identity information, and the replacement of the fully connected layer with the KAN reduced the MAE by 1.19, 3.23, 1.14, and 0.14 kg, respectively. The model also outperformed existing body-mass estimation methods for beef cattle: compared with multiple linear regression, ridge regression, quadratic polynomial regression, an improved EfficientNetV2, an improved MobileNetV2, an improved DenseNet201, an improved ResNeXt101, and the improved cross-modal feature fusion model CFF-ResNet, the MAE was reduced by 57.233%, 57.169%, 47.984%, 31.250%, 34.699%, 24.761%, 23.751%, and 20.991%, respectively. In addition, the MAE of CMHFF-ResNet on a cross-breed beef cattle dataset was 13.25 kg. The accuracy and generalization of individual body-mass estimation for beef cattle in the herd environment were thus improved, showing that the improved model can effectively learn cross-modal hierarchical feature representations. The findings can provide a strong reference for high-precision estimation of individual body mass of beef cattle in large-scale herd environments.
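For reference, the mean absolute error metric used throughout the comparisons above, together with the relative-reduction calculation behind the reported percentages, can be sketched as follows; the body-mass values are illustrative placeholders, not data from the study.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |truth - prediction|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def reduction_pct(baseline_mae, new_mae):
    """Relative MAE reduction, as used in the percentage comparisons."""
    return (baseline_mae - new_mae) / baseline_mae * 100

# Illustrative body masses in kg (placeholders, not experimental data).
true_kg = [452.0, 510.5, 388.0]
pred_kg = [440.0, 520.0, 400.0]
print(round(mae(true_kg, pred_kg), 2))  # 11.17
print(reduction_pct(20.0, 15.0))        # 25.0
```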