Acoustic recognition of largemouth bass feeding based on lightweight multi-feature fusion

王志俊; 赵霞

doi:10.11975/j.issn.1002-6819.202601260

Lightweight multi-feature fusion-based acoustic recognition of feeding behavior in Micropterus salmoides

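Abstract: To address the complex background noise of largemouth bass culture environments, the limited representational power of single features, and the difficulty of deploying computationally heavy models on embedded devices, this study proposes a feeding behavior recognition method based on passive acoustic signals and a lightweight edge-deployable adaptive multi-feature fusion network (LE-AMFNet). First, a largemouth bass feeding sound dataset graded into four levels ("none", "weak", "medium", and "strong") was constructed, and spectral subtraction was used to suppress background noise. Second, a multi-branch feature extraction architecture was designed to process in parallel three inputs: acoustic statistical features reduced by principal component analysis, Mel-frequency cepstral coefficient feature maps, and time-frequency spectrograms. An adaptive attention mechanism was introduced to achieve dynamically weighted fusion of these multi-source heterogeneous features. Visualization analysis showed that the model adaptively focuses on feeding-signal regions according to signal intensity and offers good interpretability. In laboratory offline tests, LE-AMFNet had only 0.25 M parameters and reached a recognition accuracy of 93.11%. The model was then deployed on an edge computing device with an RK3588 chip for field testing: the converted model file was only 0.89 MB, a single inference on the edge device took only 8.5 ms, and the field recognition accuracy reached 90.8%, balancing high accuracy against low computational cost. The results provide a feasible solution for real-time monitoring of largemouth bass feeding behavior.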
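The spectral subtraction step mentioned above can be illustrated with a minimal sketch. It assumes magnitude spectra are already available as plain lists (the transform itself is omitted), and the function names, the over-subtraction factor `alpha`, and the spectral `floor` are illustrative assumptions, not values or code from the study.

```python
from typing import List, Sequence


def estimate_noise_spectrum(noise_frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the magnitude spectra of non-feeding frames, bin by bin."""
    if not noise_frames:
        raise ValueError("need at least one noise frame")
    n_bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / len(noise_frames)
            for k in range(n_bins)]


def spectral_subtract(noisy_mag: Sequence[float],
                      noise_mag: Sequence[float],
                      alpha: float = 1.0,
                      floor: float = 0.01) -> List[float]:
    """Subtract the scaled noise magnitude from each bin.

    Bins that would fall below `floor * noisy value` are clamped to that
    floor so no magnitude becomes negative. `alpha` and `floor` are
    assumed, illustrative settings.
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(y - alpha * n, floor * y) for y, n in zip(noisy_mag, noise_mag)]


# Toy example: three frequency bins, with noise dominant in the lowest bin.
noise_frames = [[4.0, 1.0, 0.5], [6.0, 1.0, 0.5]]
noise_mag = estimate_noise_spectrum(noise_frames)
cleaned = spectral_subtract([5.5, 3.0, 0.4], noise_mag)
```

In a full pipeline the cleaned magnitudes would typically be recombined with the original phase before inverting the transform; that part is not shown here.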

Abstract: Accurate, real-time recognition of fish feeding behavior is often required for precision feeding and intelligent aquaculture in recirculating systems. The feeding activity of largemouth bass (Micropterus salmoides), for instance, is commonly accompanied by acoustic emissions during swallowing and chewing. However, practical acoustic monitoring remains challenging because of complex environmental background noise and the limited representational power of single acoustic features. In addition, computationally intensive deep learning models are difficult to deploy on low-power edge devices. In this study, an acoustic recognition framework was proposed that combines passive hydroacoustic sensing with a lightweight edge-deployable adaptive multi-feature fusion network (LE-AMFNet). A largemouth bass feeding sound dataset was constructed in a recirculating aquaculture tank at the Fishery Machinery and Instrument Research Institute, Chinese Academy of Fishery Sciences, China. Approximately 200 fish with an average body length of 25 cm and body weight of 315 g were cultured in a 1 m deep tank. Acoustic signals were collected using a digital hydrophone (OceanSonics SC2-ETH) with a sampling rate of 512 kS/s and 24-bit resolution. Fish were fasted for 24 h prior to the experiments to ensure sufficient feeding motivation. Feeding trials were conducted every hour, with 50 g of pellet feed supplied each time. Acoustic signals were recorded for 5 min before and after feeding in each feeding event, resulting in a total of 120 min of audio data. Continuous recordings were segmented using a 5 s sliding window with a 0.25 s overlap, yielding 1 524 samples. Each segment was manually annotated with one of four feeding intensity levels: none, weak, medium, or strong. Spectral subtraction was adopted for denoising in view of the frequency-domain characteristics of the feeding sounds and the background noise (concentrated below 1 kHz).
The stationary background noise was estimated from non-feeding segments and subtracted from the noisy spectrum. This effectively enhanced the feeding acoustic components while keeping computational complexity moderate. A hybrid acoustic feature set was constructed to characterize feeding sounds. Global statistical features from the time, frequency, and cepstral domains were first extracted and reduced to 11 dimensions using principal component analysis (PCA), serving as a stable baseline representation. Mel-frequency cepstral coefficient (MFCC) maps and short-time Fourier transform (STFT) time-frequency spectrograms were then generated and resized to 128 × 128 pixels to preserve perceptual and physical acoustic information. These heterogeneous features were fed into LE-AMFNet, which consisted of three parallel branches: a fully connected branch for the PCA features and two lightweight convolutional neural network branches for the MFCC and spectrogram images. An adaptive attention fusion module dynamically weighted the multi-branch features to suppress redundant information and emphasize discriminative cues. Experimental results showed that LE-AMFNet achieved a laboratory validation accuracy of 93.11% with a total of only 0.25 M parameters. Ablation experiments showed that the PCA branch provided a robust baseline, whereas the MFCC and spectrogram branches contributed additional accuracy gains of 5.21 and 7.85 percentage points, respectively. The attention module further improved performance by 2.86 percentage points. Compared with raw audio input and conventional band-pass filtering, spectral subtraction improved recognition accuracy by 8.61 and 3.93 percentage points, respectively. Class activation mapping analysis showed that the model consistently focused on physically meaningful frequency regions associated with feeding activity, supporting its interpretability and practical applicability. LE-AMFNet was then deployed on an edge device with the domestically produced RK3588 chip.
The model file required only 0.89 MB of storage after conversion, and a single inference on the edge device took just 8.5 ms. Field tests yielded a recognition accuracy of 90.8%, achieving a balance between high accuracy and low computational cost. Overall, the proposed method offers efficient and accurate acoustic recognition of feeding behavior in largemouth bass aquaculture. The framework also provides a feasible technical pathway for real-time monitoring and intelligent feeding in precision aquaculture.
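The abstract describes an adaptive attention module that dynamically weights the features of three branches, without specifying its internal form. As a rough illustration only, the sketch below uses a generic softmax-weighted fusion: each branch embedding gets a scalar score, the scores are normalized, and the fused vector is the weighted sum. The function names, the scoring vector (which would be learned in a real network), and the toy embeddings are all assumptions and do not represent the authors' LE-AMFNet implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(branches: Sequence[Sequence[float]],
                   score_weights: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Fuse equal-length branch embeddings with softmax attention weights.

    Each branch is scored by a dot product with `score_weights` (a stand-in
    for learned parameters), the scores are normalized with softmax, and the
    fused embedding is the weighted sum of the branch embeddings.
    """
    dim = len(branches[0])
    if any(len(b) != dim for b in branches) or len(score_weights) != dim:
        raise ValueError("all embeddings and the scoring vector must share one length")
    scores = [sum(w * x for w, x in zip(score_weights, b)) for b in branches]
    weights = softmax(scores)
    fused = [sum(w * b[i] for w, b in zip(weights, branches)) for i in range(dim)]
    return fused, weights


# Hypothetical 4-dimensional embeddings from the PCA, MFCC, and spectrogram branches.
pca_emb = [0.2, 0.1, 0.0, 0.3]
mfcc_emb = [0.5, 0.4, 0.1, 0.0]
spec_emb = [0.9, 0.8, 0.2, 0.1]
fused, weights = attention_fuse([pca_emb, mfcc_emb, spec_emb], [1.0, 1.0, 1.0, 1.0])
```

Because the weights depend on the input embeddings, a branch that responds strongly to a given segment contributes more to the fused representation, which is the general idea behind input-dependent weighting.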
