Abstract:
In aquaculture, real-time, accurate identification of farm workers not only improves production management efficiency and reduces operating costs, but also provides a basis for traceability when serious aquatic product safety incidents occur. Previous identification methods, such as manual recording and facial recognition, typically require user cooperation, incur high costs, and suffer from low efficiency and unreliable input; they are therefore unsuitable for long-distance identification and cannot meet the actual needs of modern intelligent farming scenarios. This paper therefore proposes a multimodal gait recognition model, P3D-ECAFormer (Pseudo-3D ResNet with Efficient Channel Attention for Swin Transformer), which identifies farm workers at long range in uncontrolled aquaculture scenes. The approach is as follows. First, the video sequences are preprocessed to obtain two modalities: silhouettes and human parsing maps. Three data quality enhancement steps, namely connected silhouette extraction, mask-enhanced parsing, and dynamic template filtering, are introduced. Connected silhouette extraction reduces the interference of complex backgrounds and local noise, yielding cleaner silhouettes. Mask-enhanced parsing repairs incomplete human parsing results, improving the structural integrity of the body regions. Dynamic template filtering eliminates low-quality and abnormal frames, enhancing the temporal consistency and validity of the input sequence. Together, these steps provide a more stable data foundation for subsequent feature extraction and identity recognition. Then, in the feature modeling stage, building on the GaitBase model, a channel-attention-enhanced P3D-ECANet is proposed to extract short-term dynamic responses, enabling the model to capture identity-related, fine-grained local motion differences more sensitively.
It is combined with a self-attention-enhanced 3D Swin Transformer, which suppresses noise interference and adaptively adjusts positional biases, strengthening the model's ability to globally model the gait cycle under non-standard gaits and scene perturbations. Finally, the local and global spatio-temporal features jointly yield a highly discriminative gait representation, and triplet loss and cross-entropy loss are combined to learn the gait embedding for identity recognition. The results show that the proposed model achieves a rank-1 identification rate (Rank-1) of 96.53%, a mean average precision (mAP) of 94.57%, and a mean inverse negative penalty (mINP) of 91.16% on the Gaitfisher dataset of aquaculture workers. Compared with the baseline model (GaitBase), these metrics increase by 3.22, 7.57, and 17.1 percentage points, respectively; compared with the second-best model (DeepGaitV2), they increase by 1.98, 2.13, and 2.87 percentage points, respectively. Inference accounts for only 1.8% of the 3 s clip duration, demonstrating that the model meets the efficiency requirements of online monitoring and traceability recording in practical aquaculture scenarios. For the silhouette and human parsing modalities, the data quality enhancement strategy improves mAP by 1.45 and 4.73 percentage points, respectively; introducing P3D-ECANet increases mINP by 11.70 and 1.65 percentage points, respectively; and the improved 3D Swin Transformer raises Rank-1 by 0.99 and 3.22 percentage points, respectively. These results verify the effectiveness of the proposed multimodal feature fusion and attention enhancement mechanisms in improving the accuracy and robustness of gait recognition.
The proposed model thus achieves high-precision, non-contact identity recognition of farm workers in real aquaculture scenarios, providing reliable technical support for aquatic product quality and safety traceability and facilitating the construction of a smart fishery management system.
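The abstract states that triplet loss and cross-entropy loss are combined to learn the gait embedding. A minimal sketch of such a combined objective is given below; the function names, the margin value, and the weights `w_tri` and `w_ce` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on embedding vectors:
    pull same-identity pairs together, push different identities apart."""
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

def cross_entropy_loss(logits, label):
    """Softmax cross-entropy for identity classification."""
    shifted = logits - logits.max()            # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def combined_loss(anchor, positive, negative, logits, label,
                  w_tri=1.0, w_ce=1.0):
    """Weighted sum of the two objectives (weights are assumptions)."""
    return (w_tri * triplet_loss(anchor, positive, negative)
            + w_ce * cross_entropy_loss(logits, label))
```

In this formulation the triplet term shapes the embedding space for retrieval-style metrics such as Rank-1 and mAP, while the cross-entropy term supervises identity classification directly.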