
Cow re-identification in natural scenes using a CNN-Transformer hybrid model

  • Abstract: On large-scale dairy farms, individual identification is a prerequisite for behavior monitoring and fine-grained management. Computer-vision-based cow identification is a current research focus in smart livestock farming. To improve the accuracy of cross-camera cow identification in natural scenes, this paper proposes a cow re-identification algorithm based on a hybrid convolutional neural network (CNN) and Transformer model (CNN-Transformer). The CNN branch extracts local features such as texture details, while the Transformer branch captures holistic cow features through a global self-attention mechanism, modeling global dependencies. A cross-dimensional multi-scale feature fusion module fuses features at three corresponding semantic levels of the Transformer and CNN branches, balancing spatial structure with semantic expression and enabling dynamic interaction between global and local features. In addition, a Token-SE attention module is built at the eighth layer (semantic level) of the Transformer branch to enhance channel selectivity and sharpen the model's focus on key semantic features. A dataset was constructed from images of 21 cows in a calving area captured by 11 cameras, comprising 7,371 images with different viewpoints, different postures, and occlusions. Re-identification experiments show that the proposed model reaches 86.2%, 93.1%, 95.7%, and 45.1% in Rank-1, Rank-5, Rank-10, and mAP, respectively, exceeding the baseline Transformer model by 8.6, 6.0, 1.7, and 5.5 percentage points. Attention heatmaps, t-SNE feature-embedding visualization, Top-10 retrieval visualization, and feature-distance heatmaps further confirm the model's strong feature-learning ability in cross-camera re-identification tasks under complex environments. The proposed model can serve as a technical reference for cow re-identification in complex natural scenes.

     

    Abstract: Individual cow identification can greatly contribute to behavior monitoring, precision feeding, and fine-grained health management on large-scale dairy farms. Cow re-identification aims to recognize the same individual across different cameras and time periods using machine vision. However, cross-camera cow re-identification remains a great challenge in natural barn environments, due to high inter-individual similarity, large intra-individual variations in posture and viewpoint, frequent occlusions, and complex illumination. In this study, a cow re-identification algorithm was proposed using a hybrid Convolutional Neural Network and Transformer model (CNN–Transformer). A dual-branch backbone was adopted to model long-range dependencies over the entire cow body. The CNN branch was used to extract local texture details, such as hair, spots, and body edges, while the Transformer branch employed global self-attention mechanisms to capture the holistic body shape and spot distribution patterns. Both branches were trained in a unified re-identification framework. A combination of cross-entropy and triplet loss was used to obtain compact intra-class clustering and large inter-class separation in the embedding space. A cross-dimensional multi-scale feature fusion module was inserted at three semantic levels of the backbone to enhance the complementarity between global and local representations. Feature maps from the Transformer and CNN branches were first rescaled to a consistent spatial resolution and then aligned along the channel dimension at the shallow and intermediate stages. The fusion module then performed multi-scale pooling and cross-channel rearrangement. Global semantic cues were obtained to guide the selection of informative local textures, while noisy or redundant local patterns were suppressed under cluttered backgrounds or partial occlusions.
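The loss combination described above can be sketched as follows. The abstract names only "cross-entropy and triplet loss"; the batch-hard mining strategy and the margin value of 0.3 used here are common re-identification defaults, not details given in the paper:

```python
import numpy as np

def triplet_loss(features, labels, margin=0.3):
    """Batch-hard triplet loss on L2-normalized embeddings.

    For each anchor, the hardest positive (farthest same-ID sample) and
    hardest negative (closest different-ID sample) are mined within the
    batch; assumes each identity appears at least twice per batch, as in
    standard PK sampling. margin=0.3 is a common default, not from the paper.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Pairwise Euclidean distance matrix between all embeddings.
    d = np.linalg.norm(f[:, None, :] - f[None, :, :], axis=2)
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(labels)):
        hardest_pos = d[i][same[i] & (np.arange(len(labels)) != i)].max()
        hardest_neg = d[i][~same[i]].min()
        losses.append(max(0.0, margin + hardest_pos - hardest_neg))
    return float(np.mean(losses))

def id_cross_entropy(logits, labels):
    """Softmax cross-entropy over identity classes (numerically stable)."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(labels)), labels].mean())
```

In training, the two terms would simply be summed per batch; well-separated identity clusters drive the triplet term to zero while the cross-entropy term keeps the embedding discriminative for the known identities.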
The terminal outputs of both branches were fused at the final stage. The Transformer branch aggregated global semantic information, and the CNN branch concentrated on rich local spatial details. A unified feature map was produced to jointly encode the overall body structure, spot patterns, and multi-scale contextual cues. The fused feature map was subsequently fed into a global average pooling and normalization pipeline. A discriminative identity descriptor was obtained for cosine-similarity Query–Gallery retrieval in re-identification. In addition, a Token-SE attention module was introduced at the eighth semantic layer of the Transformer branch for channel-wise selectivity. Ablation experiments showed that the Token-SE module worked synergistically with the CNN branch and the cross-dimensional multi-scale fusion module to strengthen the focus on key semantic information. The experimental dataset was collected in a real calving area of a dairy farm using 11 fixed surveillance cameras continuously monitoring 21 cows under natural conditions. The 7,371 annotated cow images contained diverse viewpoints, postures, and occlusion patterns. Images from 10 cows were used for training, and those from the remaining 11 cows were used for testing. During evaluation, the test data was organized into a Query set and a Gallery set according to the standard re-identification protocol. The CNN–Transformer hybrid model achieved Rank-1, Rank-5, Rank-10, and mean average precision (mAP) of 86.2%, 93.1%, 95.7%, and 45.1% on the dataset, outperforming the baseline Transformer model by 8.6, 6.0, 1.7, and 5.5 percentage points, respectively. The cross-dimensional multi-scale feature fusion and the Token-SE attention module, together with the CNN branch, significantly improved the joint modeling of global and local features, as well as the robustness to cross-view and cross-camera variations.
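The abstract does not specify the internal design of the Token-SE module; a minimal NumPy sketch of one plausible reading, a standard squeeze-and-excitation channel gate adapted to a Transformer token sequence, is:

```python
import numpy as np

def token_se(tokens, w1, w2):
    """SE-style channel gate over a token sequence (illustrative only).

    tokens: (N, C) array of N patch tokens with C channels.
    w1: (C, C//r) squeeze weights; w2: (C//r, C) excitation weights,
    where r is the usual SE reduction ratio.
    One gate vector is shared across all tokens, reweighting channels.
    """
    s = tokens.mean(axis=0)              # squeeze: per-channel average over tokens
    h = np.maximum(s @ w1, 0.0)          # excitation: bottleneck FC + ReLU
    g = 1.0 / (1.0 + np.exp(-(h @ w2)))  # sigmoid gate, each entry in (0, 1)
    return tokens * g                    # broadcast channel reweighting
```

Because the gate lies in (0, 1) per channel, the module can only attenuate uninformative channels relative to informative ones, which matches the stated goal of channel-wise selectivity.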
Furthermore, qualitative analysis was also conducted to validate the effectiveness of the model. Attention heatmaps showed that the model increasingly focused on key regions, such as the head, back, and characteristic spot areas. The t-SNE visualization of the feature embeddings showed better inter-class separability and intra-class compactness, compared with the baseline. Top-10 retrieval examples and pairwise distance heatmaps showed that the target individuals were correctly retrieved under challenging illumination, occlusion, and appearance-similarity scenarios. These findings can provide a technical reference for cow re-identification under complex natural scenes in smart livestock farming.
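The Query–Gallery evaluation protocol behind the Rank-k and mAP figures can be sketched as follows; this is the generic re-identification metric computation, with the same-camera filtering used in full protocols omitted for brevity:

```python
import numpy as np

def evaluate(query, q_ids, gallery, g_ids, ks=(1, 5, 10)):
    """Rank-k (CMC) and mAP for cosine-similarity Query-Gallery retrieval.

    Embeddings are L2-normalized so the dot product equals cosine
    similarity; assumes every query identity appears in the gallery.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                        # (num_query, num_gallery) similarities
    ranks = np.argsort(-sims, axis=1)     # gallery indices, most similar first
    cmc = {k: 0.0 for k in ks}
    aps = []
    for i in range(len(q)):
        hits = g_ids[ranks[i]] == q_ids[i]
        for k in ks:
            cmc[k] += float(hits[:k].any())  # any correct match in top k
        pos = np.flatnonzero(hits)
        # Average precision: precision evaluated at each correct match's rank.
        aps.append(np.mean((np.arange(len(pos)) + 1) / (pos + 1)))
    return {f"Rank-{k}": cmc[k] / len(q) for k in ks}, float(np.mean(aps))
```

Rank-k counts a query as solved if any correct gallery image appears in its top-k list, while mAP additionally rewards placing all correct images near the top, which is why the two numbers can diverge sharply (86.2% Rank-1 vs. 45.1% mAP in this study).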

     
