
Multimodal gait recognition method for aquaculture workers fusing silhouette and human parsing

Uncontrolled identification of farmers' identities in fish farm using a multimodal gait recognition network

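  • Abstract: In aquaculture, real-time identification of farm workers not only improves production efficiency but also provides a traceable basis when serious aquatic product quality and safety incidents occur. Existing approaches such as manual recording and face recognition usually require user cooperation, are inefficient, rely on unreliable input, and are not suitable for long-distance identification. This study therefore proposes a multimodal gait recognition model, P3D-ECAFormer (pseudo-3D ResNet with efficient channel attention for Swin Transformer), for long-distance, uncontrolled identification of farm workers. First, video sequences are preprocessed to obtain two input modalities, human silhouette and human parsing, and connected-component contour extraction, mask-enhanced parsing, and dynamic template filtering are introduced to improve data quality. Second, with GaitBase as the baseline model, a channel-attention-enhanced P3D-ECANet is proposed that applies a dynamic response after short-term filtering, making the model more sensitive to identity-related, local, fine-grained motion differences. It is combined with a self-attention-enhanced 3D Swin Transformer that reduces noise interference and adaptively adjusts position biases, improving global modeling of the gait cycle under non-standard gait and scene perturbations. Finally, through the joint effect of local and global spatiotemporal modeling, the model outputs a discriminative composite feature vector, and triplet loss and cross-entropy loss are used together to learn the mapping from gait to identity. On Gaitfisher, a gait dataset of aquaculture workers, the proposed model reaches a rank-1 identification rate (Rank-1) of 96.53%, a mean average precision (mAP) of 94.57%, and a mean inverse negative penalty (mINP) of 91.16%, which are 3.22, 7.57, and 17.1 percentage points higher than the baseline model (GaitBase). Inference takes only 1.8% of the duration of a 3 s sequence clip, which meets the online monitoring needs of real aquaculture scenarios. The dataset quality strategies raise the mAP of the human parsing modality by 4.73 percentage points; P3D-ECANet raises the mINP of the silhouette modality by 11.70 percentage points; and the improved 3D Swin Transformer raises the Rank-1 of the human parsing modality by 3.22 percentage points. These results verify that the method can effectively identify aquaculture workers and can provide technical support for aquatic product quality and safety traceability.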
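The three reported metrics (Rank-1, mAP, mINP) can be illustrated with a minimal sketch. It assumes the usual retrieval-style definitions, where each query's gallery is sorted by ascending distance, and uses INP = (number of correct matches) / (rank of the last correct match). The abstract does not describe the evaluation protocol used on Gaitfisher, so the function names and the toy labels below are hypothetical.

```python
def rank_metrics(ranked_labels, query_label):
    """Return (rank1, ap, inp) for one query.

    ranked_labels: gallery identity labels sorted from nearest to farthest.
    """
    # 1-based rank positions of every correct match in the ranked gallery.
    hits = [i + 1 for i, lab in enumerate(ranked_labels) if lab == query_label]
    if not hits:
        return 0.0, 0.0, 0.0
    rank1 = 1.0 if hits[0] == 1 else 0.0
    # Average precision: mean of precision measured at each correct match.
    ap = sum((k + 1) / pos for k, pos in enumerate(hits)) / len(hits)
    # Inverse negative penalty: penalises a late "hardest" correct match.
    inp = len(hits) / hits[-1]
    return rank1, ap, inp


def mean_metrics(queries):
    """Average (Rank-1, mAP, mINP) over (ranked_labels, query_label) pairs."""
    results = [rank_metrics(ranked, label) for ranked, label in queries]
    n = len(results)
    return tuple(sum(column) / n for column in zip(*results))


if __name__ == "__main__":
    toy_queries = [
        (["a", "b", "a", "c"], "a"),  # correct matches at ranks 1 and 3
        (["b", "a"], "a"),            # single correct match at rank 2
    ]
    print(mean_metrics(toy_queries))
```

Because mINP depends on the position of the hardest correct match, it is the metric most sensitive to a few badly ranked sequences, which is consistent with it showing the largest gain over the baseline in the reported results.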

     

    Abstract: Accurate, real-time identification of farm workers can improve production efficiency and reduce costs in aquaculture, and it can provide a traceable basis when serious aquatic product safety incidents occur. Manual recording and facial recognition cannot fully meet the needs of modern intelligent farming, because they require user cooperation, are costly and inefficient, and rely on unreliable input. They are also unsuitable for long-distance identification. In this study, a multimodal gait recognition model, P3D-ECAFormer (Pseudo-3D ResNet with Efficient Channel Attention for Swin Transformer), was proposed for long-distance identification of farm workers in uncontrolled aquaculture scenarios. The implementation was as follows: 1) The video sequences were preprocessed to obtain two modalities: silhouette and human parsing. Three data quality strategies were introduced: connected silhouette extraction, mask-enhanced parsing, and dynamic template filtering. Connected silhouette extraction reduced the interference of complex backgrounds and local noise, yielding a cleaner silhouette. Mask-enhanced parsing repaired incomplete human parsing results, improving the integrity of the body structure. Dynamic template filtering eliminated low-quality and abnormal frames, improving the temporal consistency and validity of the input sequence. Together, these strategies provided more stable data for subsequent feature extraction and identity recognition. 2) With GaitBase as the baseline model, a channel-attention-enhanced P3D-ECANet was proposed for feature modeling. The network applied a dynamic response after short-term filtering, so that identity-related, local, fine-grained motion differences were captured more sensitively. It was combined with a self-attention-enhanced 3D Swin Transformer, which reduced noise interference and adaptively adjusted the position biases, to model the gait cycle globally under non-standard gait and scene perturbations. 3) A highly discriminative gait feature representation was generated through the synergy of local and global spatiotemporal modeling. Triplet loss and cross-entropy loss were combined to learn the mapping from gait to identity. The results show that the improved model achieved a rank-1 identification rate (Rank-1) of 96.53%, a mean average precision (mAP) of 94.57%, and a mean inverse negative penalty (mINP) of 91.16% on Gaitfisher, a gait dataset of aquaculture workers. Compared with the baseline model (GaitBase), these metrics increased by 3.22, 7.57, and 17.1 percentage points, respectively. Compared with the second-best model (DeepGaitV2), they increased by 1.98, 2.13, and 2.87 percentage points, respectively. The inference time accounted for only 1.8% of the 3 s sequence clip duration (about 0.054 s per clip), meeting the efficiency requirements of online monitoring and traceability recording in practical aquaculture. Furthermore, the dataset quality strategies improved mAP by 1.45 and 4.73 percentage points in the silhouette and human parsing modalities, respectively. Introducing P3D-ECANet increased mINP by 11.70 and 1.65 percentage points in the two modalities, respectively. The improved 3D Swin Transformer increased Rank-1 by 0.99 and 3.22 percentage points, respectively. These results verify that multimodal feature fusion and attention enhancement improve the accuracy and robustness of gait recognition. High-precision, non-contact identification of farm workers was achieved in real aquaculture scenarios. The findings can provide reliable technical support for aquatic product quality and safety traceability in smart fisheries.
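The connected silhouette extraction step can be pictured with a minimal sketch. The abstract does not specify the procedure, so this example assumes one common reading: treat the silhouette as a binary mask and keep only its largest 4-connected foreground region, discarding small isolated blobs caused by background clutter. The function name `largest_component` and the toy mask are hypothetical.

```python
from collections import deque


def largest_component(mask):
    """Keep only the largest 4-connected foreground region of a binary mask.

    mask: list of equal-length rows containing 0 (background) or 1 (foreground).
    Returns a new mask of the same shape.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    best = []  # pixel coordinates of the largest region found so far

    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            region = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region

    cleaned = [[0] * width for _ in range(height)]
    for y, x in best:
        cleaned[y][x] = 1
    return cleaned


if __name__ == "__main__":
    noisy = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],  # the lone pixel on the right is treated as noise
        [0, 0, 0, 0],
        [0, 0, 1, 0],  # so is this one
    ]
    for row in largest_component(noisy):
        print(row)
```

A production pipeline would more likely use an image-processing library and might keep several regions or apply an area threshold; the pure-Python flood fill is chosen here only to keep the sketch self-contained.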

     
