Abstract:
Genotype imputation (GI) is a technique that uses observed genotype information to infer unobserved or incomplete genotypes. This study explores efficient imputation methods for incomplete soybean genomic sequencing data, with the goal of improving data processing speed and efficiency. Real soybean reference panel genotype data were used, and genotypes were masked completely at random at missingness rates of 2%, 5%, 10%, and 25%. A GPU-accelerated random forest algorithm was used to construct imputation models and fill in the missing data at each missingness rate. The imputation accuracy and computational performance obtained on different processors were also compared. The results show that the GPU-accelerated random forest algorithm achieves high imputation accuracy on the soybean genome. Compared with mainstream genotype imputation software, the method reduces computation time by at least a factor of four. The GPU-accelerated genotype imputation strategy can therefore be applied to large-scale genotype data, improving the speed and efficiency of soybean genotype data processing while reducing computational time and resource consumption.
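
The masking and evaluation protocol described above can be illustrated with a minimal sketch. It is not the study's pipeline: the panel is synthetic toy data, the function names (`mask_mcar`, `impute_mode`, `concordance`) are hypothetical, and a simple per-marker mode imputer stands in for the GPU-accelerated random forest, which is not reproduced here. The sketch only shows how genotypes might be masked completely at random at the four stated rates and how accuracy could then be scored on the masked calls.

```python
import random


def mask_mcar(genotypes, rate, rng):
    """Copy the matrix and set a fraction `rate` of cells to None,
    chosen uniformly at random (missing completely at random)."""
    n_rows, n_cols = len(genotypes), len(genotypes[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    masked_cells = rng.sample(cells, round(rate * len(cells)))
    masked = [row[:] for row in genotypes]
    for i, j in masked_cells:
        masked[i][j] = None
    return masked, masked_cells


def impute_mode(masked):
    """Placeholder imputer (assumption, not the paper's random forest):
    fill each missing call with the most common observed call at that marker."""
    filled = [row[:] for row in masked]
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        # Fall back to 0 if every call at this marker was masked.
        mode = max(set(observed), key=observed.count) if observed else 0
        for row in filled:
            if row[j] is None:
                row[j] = mode
    return filled


def concordance(truth, imputed, masked_cells):
    """Fraction of masked calls that were imputed to the true genotype."""
    hits = sum(truth[i][j] == imputed[i][j] for i, j in masked_cells)
    return hits / len(masked_cells)


rng = random.Random(0)
# Synthetic toy panel: 200 samples x 50 markers, genotypes coded 0/1/2.
panel = [[rng.choice([0, 0, 0, 1, 2]) for _ in range(50)] for _ in range(200)]

results = {}
for rate in (0.02, 0.05, 0.10, 0.25):
    masked, cells = mask_mcar(panel, rate, rng)
    imputed = impute_mode(masked)
    results[rate] = (len(cells), concordance(panel, imputed, cells))

for rate, (n_masked, acc) in results.items():
    print(f"rate={rate:.2f} masked={n_masked} concordance={acc:.3f}")
```

Scoring only the masked cells against the held-back true calls keeps the accuracy estimate independent of the calls the imputer was allowed to see. Any other imputer, including a random forest, could be substituted for `impute_mode` without changing the masking or scoring steps.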