WANG Zhijun, ZHAO Xia. Lightweight multi-feature fusion-based acoustic recognition of feeding behavior in Micropterus salmoides[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2026, 42(10): 193-203. DOI: 10.11975/j.issn.1002-6819.202601260

Lightweight multi-feature fusion-based acoustic recognition of feeding behavior in Micropterus salmoides

  • Accurate, real-time recognition of fish feeding behavior is a prerequisite for precision feeding and intelligent aquaculture in recirculating systems. The feeding activity of largemouth bass (Micropterus salmoides) is commonly accompanied by acoustic emissions during swallowing and chewing. However, practical acoustic monitoring remains challenging because of complex environmental background noise and the limited representational power of single acoustic features. Moreover, computationally intensive deep learning models are difficult to deploy on low-power edge devices. In this study, an acoustic recognition framework was proposed that combines passive hydroacoustic sensing with a lightweight adaptive multi-feature fusion network, termed LE-AMFNet. A largemouth bass feeding sound dataset was constructed in a recirculating aquaculture tank at the Fishery Machinery and Instrument Research Institute, Chinese Academy of Fishery Sciences, China. Approximately 200 fish with an average body length of 25 cm and body weight of 315 g were cultured in a 1 m deep tank. Acoustic signals were collected using a digital hydrophone (OceanSonics SC2-ETH) with a sampling rate of 512 kS/s and 24-bit resolution. Fish were fasted for 24 h prior to the experiments to ensure sufficient feeding motivation. Feeding trials were conducted every hour, with 50 g of pellet feed supplied each time. Acoustic signals were recorded for 5 min before and after feeding in each feeding event, resulting in a total of 120 min of audio data. Continuous recordings were segmented using a 5 s sliding window with a 0.25 s overlap, yielding 1 524 samples. Each segment was manually annotated with one of four feeding intensity levels: none, weak, medium, or strong. Spectral subtraction was adopted for noise suppression, based on the frequency-domain characteristics of the feeding sounds and the background noise (concentrated below 1 kHz). 
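The sliding-window segmentation described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `segment_signal`, the decision to drop a trailing partial window, and the toy 1 kHz sampling rate are all assumptions, and the sketch does not attempt to reproduce the reported count of 1 524 samples.

```python
import numpy as np

def segment_signal(signal, sample_rate, window_s=5.0, overlap_s=0.25):
    """Split a 1-D signal into fixed-length windows overlapping by overlap_s seconds.

    Assumption: a trailing partial window is discarded rather than padded.
    """
    signal = np.asarray(signal)
    window = int(round(window_s * sample_rate))             # samples per window
    hop = int(round((window_s - overlap_s) * sample_rate))  # step between window starts
    if hop <= 0:
        raise ValueError("overlap must be shorter than the window")
    if len(signal) < window:
        return np.empty((0, window), dtype=signal.dtype)
    n_segments = 1 + (len(signal) - window) // hop
    starts = np.arange(n_segments) * hop
    return np.stack([signal[s:s + window] for s in starts])

# Toy example: 60 s of synthetic data at 1 kHz (far below the 512 kS/s hydrophone rate).
toy = np.arange(60_000, dtype=np.float64)
segments = segment_signal(toy, sample_rate=1_000)
print(segments.shape)  # (12, 5000)
```

With a 5 s window and a 0.25 s overlap, consecutive windows start 4.75 s apart, so most of each segment is new audio and only the window edges are shared.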
The stationary background noise was estimated from non-feeding segments and subtracted from the noisy spectrum. This enhanced the feeding acoustic components while keeping the computational complexity moderate. A hybrid acoustic feature set was then constructed to characterize the feeding sounds. Global statistical features from the time, frequency, and cepstral domains were first extracted and reduced to 11 dimensions using principal component analysis (PCA), serving as a stable baseline representation. Mel-frequency cepstral coefficient (MFCC) maps and short-time Fourier transform (STFT) time–frequency spectrograms were generated and resized to 128 × 128 pixels to preserve perceptual and physical acoustic information. These heterogeneous features were fed into LE-AMFNet, which consisted of three parallel branches: a fully connected branch for the PCA features and two lightweight convolutional neural network branches for the MFCC and spectrogram images. An adaptive attention fusion module dynamically weighted the multi-branch features, suppressing redundant information and emphasizing discriminative cues. LE-AMFNet achieved a laboratory validation accuracy of 93.11% with a total of only 0.25 M parameters. Ablation experiments showed that the PCA branch provided a robust baseline, whereas the MFCC and spectrogram branches contributed additional accuracy gains of 5.21 and 7.85 percentage points, respectively. The attention module further improved performance by 2.86 percentage points. Compared with raw audio input and conventional band-pass filtering, spectral subtraction improved recognition accuracy by 8.61 and 3.93 percentage points, respectively. Class activation mapping analysis showed that the model consistently focused on physically meaningful frequency regions associated with feeding activity, supporting its interpretability. To validate its practical applicability, LE-AMFNet was deployed on an edge device based on the domestic RK3588 chip. 
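The spectral subtraction step (estimate a stationary noise profile from non-feeding audio, then subtract it from the noisy spectrum) can be sketched as below. This is a generic magnitude-domain version under stated assumptions, not the authors' implementation: the helper names `stft_mag` and `spectral_subtract`, the over-subtraction factor `alpha`, the spectral `floor`, the FFT settings, and the synthetic 16 kHz test tones are all hypothetical.

```python
import numpy as np

def stft_mag(x, n_fft=1024, hop=256):
    """Magnitude spectrogram (frames x frequency bins) using a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, floor=0.01):
    """Subtract the time-averaged noise magnitude from every frame.

    alpha scales the noise estimate; floor keeps a small fraction of the
    original magnitude so that no bin becomes negative.
    """
    noise_profile = noise_mag.mean(axis=0, keepdims=True)  # one value per frequency bin
    cleaned = noisy_mag - alpha * noise_profile
    return np.maximum(cleaned, floor * noisy_mag)

# Synthetic illustration: a 200 Hz hum stands in for low-frequency background
# noise and a 3 kHz tone stands in for a feeding sound (both are assumptions).
sr = 16_000
t = np.arange(sr) / sr
hum = np.sin(2 * np.pi * 200 * t)
noisy = hum + 0.5 * np.sin(2 * np.pi * 3000 * t)

noise_mag = stft_mag(hum)    # estimated from a "non-feeding" recording
noisy_mag = stft_mag(noisy)  # recording that contains the target sound
cleaned_mag = spectral_subtract(noisy_mag, noise_mag)
```

Because the noise profile is a single average per frequency bin, the per-frame cost is one subtraction and one comparison per bin, which is consistent with the moderate computational complexity noted above.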
The converted model file required only 0.89 MB of storage, and a single inference on the edge device took 8.5 ms. Field tests yielded a recognition accuracy of 90.8%, balancing high accuracy against low computational cost. Overall, the proposed method offers efficient and accurate acoustic recognition of feeding behavior in largemouth bass aquaculture. The framework also provides a feasible technical pathway toward real-time monitoring and intelligent feeding in precision aquaculture.
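The abstract states that an adaptive attention fusion module dynamically weights the three branch outputs but does not give its exact form. The sketch below shows one common way such a module can work, offered as an assumption rather than the LE-AMFNet design: each branch embedding receives a scalar score, the scores are normalized with a softmax across branches, and the embeddings are summed with those weights. The names `attention_fuse` and `score_weights`, the tanh scoring, and the 16-dimensional embeddings are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(branches, score_weights, score_bias=0.0):
    """Fuse branch embeddings with per-sample attention weights.

    branches: list of arrays of shape (batch, dim), one per branch.
    score_weights: array of shape (dim,), a stand-in for a learned scoring vector.
    Returns the fused features (batch, dim) and the weights (batch, n_branches).
    """
    stacked = np.stack(branches, axis=1)                    # (batch, n_branches, dim)
    scores = np.tanh(stacked) @ score_weights + score_bias  # (batch, n_branches)
    weights = softmax(scores, axis=1)                       # sums to 1 across branches
    fused = (weights[..., None] * stacked).sum(axis=1)      # weighted sum over branches
    return fused, weights

# Random stand-ins for the PCA, MFCC, and spectrogram branch outputs.
rng = np.random.default_rng(0)
pca_feat, mfcc_feat, spec_feat = (rng.normal(size=(4, 16)) for _ in range(3))
fused, weights = attention_fuse([pca_feat, mfcc_feat, spec_feat], rng.normal(size=16))
```

In a trained network the scoring vector would be learned jointly with the branches, so the weights can shift toward whichever representation is most informative for a given segment.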