Abstract:
Conventional mass estimation of banana hands and fingers is hampered by their complex, interlocked structure and frequent occlusions. This research aimed to develop a non-destructive, accurate, and rapid method for estimating the mass of entire banana hands and individual fingers, using imaging and computational techniques as an alternative to manual or inadequate automated approaches, thereby facilitating more efficient post-harvest processing and grading in the banana industry. First, morphological and geometric information was acquired to predict mass: registered color and depth (RGB-D) images of banana hands were captured from the convex and concave viewpoints using a controlled laboratory setup. Second, individual fingers were precisely segmented using the Segment Anything Model (SAM), a zero-shot instance segmentation approach that handles occlusions without requiring task-specific training datasets. The depth information was converted into three-dimensional (3D) point clouds, whose computational cost was reduced by voxel grid down-sampling and whose integrity was preserved by statistical outlier removal. Third, comprehensive features were extracted: two-dimensional (2D) morphology and size descriptors (such as pixel area, contour perimeter, and aspect ratio) from the color images, and 3D geometric properties (including principal dimensions, surface area, and convex hull volume) from the processed point clouds. These multi-modal features were generated for the whole banana hands and for each segmented finger. Finally, mass prediction models were developed, with a multiple linear regression (MLR) model as a baseline and five non-linear machine learning algorithms: support vector regression (SVR), k-nearest neighbors (KNN), gradient boosting (GB), random forest (RF), and backpropagation neural network (BPNN).
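The point-cloud preprocessing named above (voxel grid down-sampling followed by statistical outlier removal) can be sketched as follows. This is a minimal NumPy/SciPy illustration with assumed parameter values (voxel size, neighbor count, standard-deviation ratio); libraries such as Open3D provide equivalent built-in routines.

```python
import numpy as np
from scipy.spatial import cKDTree

def voxel_downsample(points, voxel_size=0.005):
    """Replace all points falling in the same voxel with their centroid."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    # group points by voxel index
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    centroids = np.zeros((counts.size, 3))
    np.add.at(centroids, inverse, points)
    return centroids / counts[:, None]

def remove_statistical_outliers(points, nb_neighbors=20, std_ratio=2.0):
    """Drop points whose mean distance to their nearest neighbors exceeds
    the global mean by more than std_ratio standard deviations."""
    tree = cKDTree(points)
    # k+1 neighbors because each point's nearest neighbor is itself
    dists, _ = tree.query(points, k=nb_neighbors + 1)
    mean_d = dists[:, 1:].mean(axis=1)
    threshold = mean_d.mean() + std_ratio * mean_d.std()
    return points[mean_d <= threshold]

# toy example: a dense noisy blob plus two gross outliers
rng = np.random.default_rng(0)
cloud = rng.normal(scale=0.01, size=(5000, 3))
cloud = np.vstack([cloud, [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]])
down = voxel_downsample(cloud, voxel_size=0.005)
clean = remove_statistical_outliers(down)
print(cloud.shape[0], down.shape[0], clean.shape[0])
```

The down-sampling step is what makes the later 3D feature extraction tractable; the outlier filter removes isolated depth-noise points before hull and surface computations.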
Model performance was assessed using the coefficient of determination (R²), root mean squared error (RMSE), and mean absolute percentage error (MAPE). Furthermore, recursive feature elimination guided by feature-importance analysis, focusing primarily on the RF model, was carried out to identify the most influential predictors and to construct optimized models. Comparative analysis showed that the RF model outperformed the others, indicating that non-linear approaches are required to capture the relationships between the extracted features and banana mass. For whole-hand mass estimation, the concave view achieved the better performance
(R² = 0.984, RMSE = 77.78 g, and MAPE = 5.37%) with the optimized RF model. The 3D features, particularly surface area and convex hull volume, were the most important for accurate prediction of hand mass. For individual finger mass, the RF model was more accurate for the exposed outer fingers (viewed convexly; optimized RF:
R² = 0.794, RMSE = 13.14 g, MAPE = 6.12%) than for the occluded inner fingers (viewed concavely; optimized RF:
R² = 0.668, RMSE = 17.47 g, MAPE = 9.07%). Interestingly, the 2D features (such as pixel area and contour perimeter) dominated mass prediction for the outer fingers, revealing a differential feature importance that depends on finger position and visibility. Two mass-estimation strategies were also evaluated: deriving an average finger mass (the predicted total hand mass divided by the actual finger count) and directly predicting individual finger mass. Both the derived average finger mass (using the best hand model) and the directly predicted outer finger mass achieved high accuracy (relative error <10% for ~80% of samples). The average mass method was better suited to assessing overall quality, whereas direct prediction offered detailed data for the accessible outer fingers. In terms of computational efficiency, direct finger mass estimation was faster (~7.7 s per hand, including the SAM segmentation step of ~1 s) than average mass estimation on the complex point cloud of the entire hand (~76.6 s per hand), because computing 3D features from multiple simple point clouds is less demanding than from a single, large, intricate one; this is the "divide and conquer" benefit of SAM-based segmentation. In summary, RGB-D imaging and machine learning were integrated and validated for accurate, non-destructive mass determination of banana hands and fingers. The results demonstrate the utility of SAM for complex fruit segmentation in agricultural contexts, identify RF as the best modeling choice among those tested, and quantify the differential contributions of 2D and 3D features across viewpoints. Direct estimation of individual finger mass offers both detailed information and higher computational efficiency, supporting the development of banana grading systems that enhance post-harvest operations.
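As a sketch of the modeling step, the fragment below trains a random forest regressor inside scikit-learn's recursive feature elimination on synthetic stand-in features. The feature names, data-generating relationship, and coefficients are illustrative assumptions, not the paper's dataset or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300
# synthetic stand-ins for the extracted descriptors (names are illustrative)
surface_area = rng.uniform(200, 900, n)
hull_volume = rng.uniform(500, 4000, n)
pixel_area = rng.uniform(1e4, 8e4, n)
noise_feat = rng.normal(size=(n, 3))          # uninformative columns
X = np.column_stack([surface_area, hull_volume, pixel_area, noise_feat])
# assumed (not the paper's) relationship: mass driven by the 3D features
y = 0.8 * surface_area + 0.3 * hull_volume + rng.normal(scale=20, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0)
# recursively drop the least important feature until 3 remain
selector = RFE(rf, n_features_to_select=3).fit(X_tr, y_tr)
pred = selector.predict(X_te)
r2 = r2_score(y_te, pred)
rmse = mean_squared_error(y_te, pred) ** 0.5
print("kept features:", np.flatnonzero(selector.support_),
      f"R2={r2:.3f} RMSE={rmse:.1f}")
```

On this synthetic data, RFE retains the two informative 3D-style features, mirroring the abstract's finding that surface area and hull volume drive hand-mass prediction.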
However, 3D feature computation from complex point clouds remains the primary bottleneck for real-time industrial applications.
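For reference, the 3D geometric descriptors discussed above (principal dimensions, convex hull surface area, and convex hull volume) can be computed from a point cloud roughly as follows. This is an illustrative sketch; the paper's exact surface-area method is not specified here and may differ.

```python
import numpy as np
from scipy.spatial import ConvexHull

def geometric_features(points):
    """3D descriptors of a point cloud: extents along the principal axes,
    plus convex-hull surface area and volume."""
    centered = points - points.mean(axis=0)
    # principal axes via SVD (PCA): extents along the singular vectors
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt.T
    length, width, height = proj.max(axis=0) - proj.min(axis=0)
    hull = ConvexHull(points)           # hull.area is surface area in 3D
    return {"length": length, "width": width, "height": height,
            "surface_area": hull.area, "volume": hull.volume}

# sanity check on a unit cube: volume 1, surface area 6
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                dtype=float)
feats = geometric_features(cube)
print(feats)
```

Because hull and surface computations scale with cloud size and complexity, running them on several small per-finger clouds can be cheaper than on one large whole-hand cloud, consistent with the timing gap reported above.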