Abstract:
In aquaculture, real-time, accurate identification of farm workers not only improves production management efficiency and reduces operating costs, but also provides a basis for traceability when serious aquatic product safety incidents occur. Previous identification methods, such as manual recording and facial recognition, typically require user cooperation, incur high costs, and suffer from low efficiency and unreliable input; they are therefore unsuitable for long-distance identification and cannot meet the actual needs of modern intelligent farming scenarios. This paper therefore proposes a multimodal gait recognition model, P3D-ECAFormer (Pseudo-3D ResNet with Efficient Channel Attention for Swin Transformer), which identifies farm workers at long range in uncontrolled aquaculture scenes. The approach is as follows. First, the video sequences are preprocessed to obtain two modalities: silhouettes and human parsing maps. Three data quality enhancement steps, namely connected silhouette extraction, mask-enhanced parsing, and dynamic template filtering, are introduced. Connected silhouette extraction reduces the interference of complex backgrounds and local noise, yielding cleaner silhouettes. Mask-enhanced parsing repairs incomplete human parsing results, improving the structural integrity of the body regions. Dynamic template filtering eliminates low-quality and abnormal frames, enhancing the temporal consistency and validity of the input sequence. Together, these steps provide a more stable data foundation for subsequent feature extraction and identity recognition. Then, in the feature modeling stage, building on the GaitBase model, a channel-attention-enhanced P3D-ECANet is proposed to extract short-term dynamic responses, enabling the model to capture identity-related, fine-grained local motion differences more sensitively.
It is combined with a self-attention-enhanced 3D Swin Transformer, which suppresses noise interference and adaptively adjusts positional biases, strengthening the model's ability to globally model the gait cycle under non-standard gaits and scene perturbations. Finally, the local and global spatio-temporal features jointly yield a highly discriminative gait representation, and triplet loss and cross-entropy loss are combined to learn the gait embedding for identity recognition. The results show that the proposed model achieves a rank-1 identification rate (Rank-1) of 96.53%, a mean average precision (mAP) of 94.57%, and a mean inverse negative penalty (mINP) of 91.16% on the Gaitfisher dataset of aquaculture workers. Compared with the baseline model (GaitBase), these metrics increase by 3.22, 7.57, and 17.1 percentage points, respectively; compared with the second-best model (DeepGaitV2), they increase by 1.98, 2.13, and 2.87 percentage points, respectively. Inference accounts for only 1.8% of the 3 s clip duration, demonstrating that the model meets the efficiency requirements of online monitoring and traceability recording in practical aquaculture scenarios. For the silhouette and human parsing modalities, the data quality enhancement strategy improves mAP by 1.45 and 4.73 percentage points, respectively; introducing P3D-ECANet increases mINP by 11.70 and 1.65 percentage points, respectively; and the improved 3D Swin Transformer raises Rank-1 by 0.99 and 3.22 percentage points, respectively. These results verify the effectiveness of the proposed multimodal feature fusion and attention enhancement mechanisms in improving the accuracy and robustness of gait recognition.
The proposed model thus achieves high-precision, non-contact identity recognition of farm workers in real aquaculture scenarios, providing reliable technical support for aquatic product quality and safety traceability and facilitating the construction of a smart fishery management system.
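The abstract states that triplet loss and cross-entropy loss are combined to learn the gait embedding. A minimal sketch of such a combined objective is given below; the function names, the margin value, and the weights `w_tri` and `w_ce` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on embedding vectors:
    pull same-identity pairs together, push different identities apart."""
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

def cross_entropy_loss(logits, label):
    """Softmax cross-entropy for identity classification."""
    shifted = logits - logits.max()            # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def combined_loss(anchor, positive, negative, logits, label,
                  w_tri=1.0, w_ce=1.0):
    """Weighted sum of the two objectives (weights are assumptions)."""
    return (w_tri * triplet_loss(anchor, positive, negative)
            + w_ce * cross_entropy_loss(logits, label))
```

In this formulation the triplet term shapes the embedding space for retrieval-style metrics such as Rank-1 and mAP, while the cross-entropy term supervises identity classification directly.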