QIN Lifeng, ZHOU Xinyi, GAO Yannian, et al. Cow re-identification in natural scenes using a CNN-transformer hybrid model[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2026, 42(5): 299-311. DOI: 10.11975/j.issn.1002-6819.202508021

Cow re-identification in natural scenes using a CNN-transformer hybrid model

Individual cow identification contributes greatly to behavior monitoring, precision feeding, and fine-grained health management on large-scale dairy farms. Cow re-identification aims to recognize the same individual across different cameras and time periods using machine vision. However, cross-camera cow re-identification remains a great challenge in natural barn environments, due to high inter-individual similarity, large intra-individual variations in posture and viewpoint, frequent occlusions, and complex illumination. In this study, a cow re-identification algorithm was proposed using a hybrid Convolutional Neural Network and Transformer model (CNN–Transformer). A dual-branch backbone was adopted to model long-range dependencies over the entire cow body: the CNN branch extracted local texture details, such as hair, spots, and body edges, while the Transformer branch employed global self-attention to capture the holistic body shape and spot distribution patterns. Both branches were trained in a unified re-identification framework. A combination of cross-entropy and triplet loss was used to obtain compact intra-class clustering and large inter-class separation in the embedding space. A cross-dimensional multi-scale feature fusion module was inserted at three semantic levels of the backbone to enhance the complementarity between global and local representations. Feature maps from the Transformer and CNN branches were first rescaled to a consistent spatial resolution and then aligned along the channel dimension at the shallow and intermediate stages. The fusion module then performed multi-scale pooling and cross-channel rearrangement, so that global semantic cues guided the selection of informative local textures, while noisy or redundant local patterns were suppressed under cluttered backgrounds or partial occlusions.
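The combined objective described above can be sketched as follows. This is a minimal numpy illustration, assuming a standard softmax cross-entropy on identity logits plus a batch-hard triplet loss on L2-normalized embeddings; the function names and the margin value of 0.3 are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cross_entropy(logits, labels):
    # numerically stable softmax cross-entropy, averaged over the batch
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def batch_hard_triplet(embeddings, labels, margin=0.3):
    # margin=0.3 is a hypothetical choice, not the paper's setting
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(labels)):
        # hardest positive: farthest same-ID sample;
        # hardest negative: closest other-ID sample
        hardest_pos = dist[i][same[i]].max()
        hardest_neg = dist[i][~same[i]].min()
        losses.append(max(0.0, margin + hardest_pos - hardest_neg))
    return float(np.mean(losses))
```

In training, the two terms are simply summed, so the classifier shapes identity-separable logits while the triplet term directly pulls same-ID embeddings together and pushes different-ID embeddings apart.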
The outputs of both branches were fused at the final stage, where the Transformer branch aggregated global semantic information and the CNN branch contributed rich local spatial details. A unified feature map was produced to jointly encode the overall body structure, spot patterns, and multi-scale contextual cues. The fused feature map was subsequently fed into a global average pooling and normalization pipeline, yielding a discriminative identity descriptor for cosine-similarity Query–Gallery retrieval in re-identification. In addition, a Token-SE attention module was introduced at the eighth semantic layer of the Transformer branch for channel-wise selectivity. Ablation experiments showed that the Token-SE module worked synergistically with the CNN branch and the cross-dimensional multi-scale fusion module to strengthen the focus on task-relevant semantic information. The experimental dataset was collected in a real calving area of a dairy farm, where 11 fixed surveillance cameras continuously monitored 21 cows under natural conditions. The 7,371 annotated cow images contained diverse viewpoints, postures, and occlusion patterns. Images from 10 cows were used for training, and those from the remaining 11 cows were used for testing. During evaluation, the test data was organized into a Query set and a Gallery set, according to the standard re-identification protocol. The CNN–Transformer hybrid model achieved Rank-1, Rank-5, and Rank-10 accuracies and a mean Average Precision (mAP) of 86.2%, 93.1%, 95.7%, and 45.1% on the dataset, respectively, outperforming the baseline Transformer model by 8.6, 6.0, 1.7, and 5.5 percentage points. The cross-dimensional multi-scale feature fusion and the Token-SE attention module, together with the CNN branch, significantly improved the joint modeling of global and local features, as well as the robustness to cross-view and cross-camera variations.
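The channel-wise selectivity of a Token-SE module can be sketched as below. This assumes the module follows the standard squeeze-and-excitation design applied to a Transformer token sequence (tokens × channels): average over tokens, a bottleneck MLP, and sigmoid gates that rescale each channel. The weights, reduction ratio, and dimensions here are random placeholders, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_se(tokens, w1, w2):
    # squeeze: average over the token dimension -> one descriptor per channel
    squeezed = tokens.mean(axis=0)                        # shape (C,)
    # excite: bottleneck MLP + sigmoid gives per-channel gates in (0, 1)
    gates = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))  # shape (C,)
    # rescale every token channel-wise, emphasizing informative channels
    return tokens * gates[None, :]

rng = np.random.default_rng(0)
T, C, r = 16, 8, 2                      # tokens, channels, reduction ratio (assumed)
tokens = rng.standard_normal((T, C))
w1 = rng.standard_normal((C // r, C))   # bottleneck down-projection
w2 = rng.standard_normal((C, C // r))   # up-projection back to C channels
out = token_se(tokens, w1, w2)          # same shape as the input tokens
```

Because the gates lie in (0, 1), the module can only attenuate channels, which is how it suppresses uninformative channels while preserving the ones carrying identity cues.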
Furthermore, qualitative analysis was conducted to validate the effectiveness of the model. Attention heatmaps increasingly focused on key regions, such as the head, back, and characteristic spot areas. The t-SNE visualization of the feature embeddings showed better inter-class separability and intra-class compactness than the baseline. Top-10 retrieval examples and pairwise distance heatmaps confirmed that the individual targets were correctly retrieved under challenging scenarios involving varying illumination, occlusion, and high appearance similarity. The findings can provide a technical reference for cow re-identification in complex natural scenes in smart livestock farming.
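The Query–Gallery evaluation used throughout can be sketched as follows: rank gallery descriptors by cosine similarity to each query and score the ranking with Rank-k accuracy (CMC) and mean Average Precision. The features and IDs below are toy data, and the logic is the standard re-identification protocol rather than the authors' exact evaluation script.

```python
import numpy as np

def evaluate(query_feats, query_ids, gallery_feats, gallery_ids, ks=(1, 5)):
    # cosine similarity via dot products of L2-normalized descriptors
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T
    rank_hits = {k: 0 for k in ks}
    aps = []
    for i, qid in enumerate(query_ids):
        order = np.argsort(-sims[i])           # most similar gallery item first
        matches = gallery_ids[order] == qid
        for k in ks:
            # Rank-k: a correct ID appears among the top-k retrieved items
            rank_hits[k] += int(matches[:k].any())
        # average precision over the ranked list for this query
        hit_ranks = np.where(matches)[0]
        precisions = (np.arange(len(hit_ranks)) + 1) / (hit_ranks + 1)
        aps.append(precisions.mean())
    cmc = {k: rank_hits[k] / len(query_ids) for k in ks}
    return cmc, float(np.mean(aps))

query_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
query_ids = np.array([0, 1])
gallery_feats = np.array([[1.0, 0.1], [0.1, 1.0], [0.5, 0.5]])
gallery_ids = np.array([0, 1, 1])
cmc, mAP = evaluate(query_feats, query_ids, gallery_feats, gallery_ids, ks=(1, 2))
```

The gap between high Rank-1 and a much lower mAP, as in the reported results, arises because Rank-k only needs one correct match near the top, whereas mAP penalizes every correct gallery image that is ranked low.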