Abstract:
Accurate, real-time identification of farming workers can improve efficiency and reduce costs in aquaculture, and it also supports traceability when serious aquatic product safety incidents occur. Manual recording and facial recognition cannot fully meet the needs of modern intelligent farming, because they depend on user cooperation and suffer from high costs, low efficiency, and unreliable input. Moreover, identification is often required at long distance. In this study, a multimodal gait recognition model, P3D-ECAFormer (Pseudo-3D ResNet with Efficient Channel Attention for Swin Transformer), was proposed for long-distance identification of farming workers in uncontrolled aquaculture scenarios. The implementation was as follows: 1) The video sequences were preprocessed to obtain two modalities: silhouette and human parsing. Data enhancement, including connected silhouette extraction, mask-enhanced parsing, and dynamic template filtering, was introduced to improve the data quality. Connected silhouette extraction reduced the interference of complex backgrounds and local noise, yielding cleaner silhouettes. Mask-enhanced parsing repaired incomplete human parsing results, improving the integrity of the human body structure. Dynamic template filtering eliminated low-quality and abnormal frames, improving the temporal consistency and validity of the input sequence. Together, these steps provided more stable data for subsequent feature extraction and identity recognition. 2) For feature modeling, a channel-attention-enhanced P3D-ECANet network, built on the GaitBase model, was used to extract short-term dynamic features, so that local fine-grained motion differences relevant to identity were captured more sensitively. A 3D Swin Transformer network enhanced with self-attention was combined with it to model the gait cycle globally under non-standard gait and scene perturbation.
Its position biases were adjusted adaptively to reduce noise interference. 3) A highly discriminative gait representation was generated through the synergistic interaction of local and global spatial-temporal features. Triplet loss and cross-entropy loss were combined to learn the gait feature vector for identity recognition. The results show that the improved model achieved a rank-1 identification rate (Rank-1) of 96.53%, a mean average precision (mAP) of 94.57%, and a mean inverse negative penalty (mINP) of 91.16% on the Gaitfisher dataset of aquaculture workers. Compared with the baseline model (GaitBase), these metrics increased by 3.22, 7.57, and 17.1 percentage points, respectively. Compared with the second-best model (DeepGaitV2), they increased by 1.98, 2.13, and 2.87 percentage points, respectively. The inference time accounted for only 1.8% of the 3 s sequence clip duration (about 54 ms), meeting the efficiency requirements of online monitoring and traceability recording in practical aquaculture. Furthermore, the dataset enhancement improved mAP by 1.45 and 4.73 percentage points in the silhouette and human parsing modalities, respectively. The mINP increased by 11.70 and 1.65 percentage points, respectively, after the P3D-ECANet was introduced. In addition, the improved 3D Swin Transformer increased Rank-1 by 0.99 and 3.22 percentage points, respectively. These results verify that multimodal feature fusion and the attention enhancement mechanism improve the accuracy and robustness of gait recognition. High-precision, non-contact identity recognition of farming workers was thus realized in actual aquaculture scenarios. The findings can also provide reliable technical support for aquatic product quality and safety traceability in smart fishery.
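For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how Rank-1, mAP, and mINP are commonly computed in retrieval-style identification. It assumes that, for each query sequence, the gallery has already been sorted by ascending distance and reduced to a list of match flags. The function names `query_metrics` and `evaluate` are hypothetical, and the exact evaluation protocol used in this study may differ.

```python
from typing import List, Sequence, Tuple


def query_metrics(ranked_matches: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (rank-1 hit, average precision, inverse negative penalty) for one query.

    ranked_matches[i] is True when the gallery item at rank i + 1
    (sorted by ascending distance to the query) has the query's identity.
    """
    num_correct = sum(ranked_matches)
    if num_correct == 0:
        raise ValueError("query has no correct match in the gallery")

    # Rank-1: is the closest gallery item a correct match?
    rank1 = 1.0 if ranked_matches[0] else 0.0

    hits = 0
    precision_sum = 0.0
    hardest_rank = 0  # rank of the last (hardest) correct match
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this correct match
            hardest_rank = rank

    average_precision = precision_sum / num_correct
    # INP = |G| / R_hard: equals 1 only when all correct matches rank first.
    inverse_negative_penalty = num_correct / hardest_rank
    return rank1, average_precision, inverse_negative_penalty


def evaluate(all_ranked_matches: List[Sequence[bool]]) -> Tuple[float, float, float]:
    """Average the per-query values to obtain (Rank-1, mAP, mINP)."""
    per_query = [query_metrics(matches) for matches in all_ranked_matches]
    n = len(per_query)
    return tuple(sum(q[i] for q in per_query) / n for i in range(3))
```

Under these definitions, mINP penalizes a model whose hardest correct match is ranked far down the list, which is why it can change much more than Rank-1 between models.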