

Frustum-based camera-radar fusion for 3D object detection in agricultural scenes

  • Abstract: To address 3D object detection for the autonomous driving of agricultural machinery in complex agricultural scenes, this study proposes a low-cost perception system that integrates millimeter-wave radar and a monocular camera. First, a multimodal agricultural-scene perception dataset is constructed, containing LiDAR, integrated navigation (INS), camera, and millimeter-wave radar data. Then a neural network model, CFPNet, is built on a middle-fusion strategy: an improved center-point detection network performs a preliminary detection, the detections are matched with radar returns, and a feature-extraction module extracts radar point-cloud features within the frustum region of interest to generate radar information that supplements the image features. Finally, the preliminary image detections and the radar information are fused for a secondary detection, while the depth, direction, velocity, and other attributes of the 3D targets are regressed simultaneously. The results show that the network achieves a detection precision of about 86.5% on the self-built agricultural-scene dataset and a detection speed of about 7.4 frames/s in agricultural deployment tests. CFPNet maintains detection speed in agricultural scenes while coping with visual obstructions in the agricultural environment and the sparsity of radar echoes, and it can directly return target velocity information, providing technical support for the autonomous driving of agricultural machinery in agricultural scenes.


    Abstract: A perception system is one of the most important components for the autonomous driving of agricultural machinery. However, few perception datasets are designed specifically for agricultural scenarios, which differ from the typical urban scenarios of previous studies. Unlike urban applications, agricultural applications often operate under harsh working conditions, which places additional demands on perception sensors and algorithms. In this study, a low-cost perception system was proposed for two-stage 3D object detection using millimeter-wave radar and a monocular camera, targeting the autonomous driving of agricultural machinery in agricultural scenarios. Firstly, a multimodal perception dataset of agricultural scenes was constructed, incorporating LiDAR (light detection and ranging), INS (inertial navigation system), camera, and millimeter-wave radar data with hardware-level data synchronization and target-level annotation. Then a middle-fusion strategy was used to build a neural network model, known as CFPNet. Preliminary detection of the target was implemented with an improved center-point detection network. The radar point-cloud features were then extracted from the frustum region of interest to supplement the image features. Finally, the preliminary detection information and the radar features were combined to perform a secondary detection, while the 3D object attributes (depth, direction, and velocity) were regressed concurrently. The results show that the mAP (mean average precision) of CFPNet on the self-built multimodal agricultural perception dataset was 86.5%, which was 5.5 percentage points higher than the baseline, and the mATE (mean average translation error) was 0.197 m lower than the baseline. An additional experiment on small-object detection verified the effectiveness of CFPNet: the recall rate for the selected small objects reached 1, which was 0.3 higher than before the improvement, indicating better detection performance. Deployment experiments were conducted to test the applicability of CFPNet: in low-computing-power agricultural scenarios it ran at 7.4 frames per second, 211% of the baseline frame rate. Experiments on public datasets were conducted to test CFPNet in other scenarios. Favorable performance was achieved on the NuScenes public dataset, with an mATE, mASE (mean average scale error), and mAVE (mean average velocity error) of 0.792 m, 0.236, and 0.52 m/s, respectively; since CFPNet was designed specifically for monocular cameras, its mAP lagged behind. Furthermore, CFPNet can directly provide the speed of a target without requiring preceding and following frames. These findings can provide a feasible solution and technical support for 3D object detection in agricultural scenarios, especially under low computing power.
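The core fusion step described above — matching radar returns to a preliminary 2D detection through a frustum region of interest — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the pinhole-projection association rule, and the simple feature set (mean depth, mean radial velocity, point count) are all assumptions for demonstration.

```python
import numpy as np

def frustum_radar_features(box2d, radar_points, K, depth_range=(0.5, 80.0)):
    """Hypothetical frustum association of radar returns with a 2D detection.

    box2d:        (x1, y1, x2, y2) image-plane box from the preliminary
                  center-point detector.
    radar_points: (N, 4) array of [x, y, z, v_r] in camera coordinates,
                  where v_r is the radial (Doppler) velocity.
    K:            (3, 3) camera intrinsic matrix.
    Returns [mean depth, mean radial velocity, point count] for the returns
    whose projections fall inside the box, or None if there are none.
    """
    x1, y1, x2, y2 = box2d
    z = radar_points[:, 2]
    # Keep points in front of the camera, within the working depth range.
    valid = (z > depth_range[0]) & (z < depth_range[1])
    if not valid.any():
        return None
    pts = radar_points[valid]
    # Pinhole projection: homogeneous image coords = K @ [x, y, z]^T.
    proj = (K @ pts[:, :3].T).T
    u = proj[:, 0] / proj[:, 2]
    v = proj[:, 1] / proj[:, 2]
    # A radar point lies in the frustum iff its projection is inside the box.
    in_box = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
    if not in_box.any():
        return None
    roi = pts[in_box]
    return np.array([
        roi[:, 2].mean(),      # mean depth of associated returns
        roi[:, 3].mean(),      # mean radial velocity
        float(in_box.sum()),   # number of radar hits in the frustum
    ])
```

In a middle-fusion network such as the one described, a vector like this (or a learned embedding of the per-point features) would be concatenated with the image features of the corresponding detection before the secondary detection head regresses depth, direction, and velocity.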

