Research on Soybean Genotype Imputation Based on GPU-Accelerated Random Forest Algorithm

Li Mingliang; Li Zhuo; Huang Bin; Yu Jun; Xin Peng; Zhang Jicheng; Tang You

Abstract

Genotype imputation (GI) is a technique that uses existing genotype information to infer unobserved or incomplete genotypes. This study explores efficient imputation methods for handling incomplete data in soybean genome sequencing, with the goal of improving data processing speed and efficiency. Real soybean reference panel genotype data were used, and genotypes were masked completely at random at rates of 2%, 5%, 10%, and 25%. A GPU-accelerated random forest machine learning algorithm was used to build imputation models and to fill in the missing data at each missing rate. The accuracy and performance obtained on different processors were also compared. The results show that the GPU-accelerated random forest algorithm achieves high imputation accuracy on the soybean genome. Compared with mainstream genotype imputation software, the method is at least four times faster in computation time. The GPU-accelerated genotype imputation strategy can therefore be applied to large-scale genotype data processing, improving the speed and efficiency of soybean genotype data processing while reducing computation time and resource consumption.
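The evaluation protocol described in the abstract (mask known genotypes completely at random at 2%, 5%, 10%, and 25%, impute them, then score the imputed calls against the held-back truth) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the 0/1/2 genotype coding, the toy random data, the function names, and the use of simple concordance as the accuracy measure. The per-marker mode imputer is only a stand-in so the sketch runs with the standard library; it is not the GPU-accelerated random forest used in the study.

```python
import random


def mask_mcar(genotypes, rate, rng):
    """Copy the matrix and set a `rate` fraction of cells to None, chosen
    completely at random. Returns the masked copy and the masked positions."""
    n_rows, n_cols = len(genotypes), len(genotypes[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    positions = rng.sample(cells, round(rate * len(cells)))
    masked = [row[:] for row in genotypes]
    for i, j in positions:
        masked[i][j] = None
    return masked, positions


def impute_mode(masked):
    """Stand-in imputer (NOT the study's random forest): fill each missing
    call with the most frequent observed call at the same marker (column)."""
    filled = [row[:] for row in masked]
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        # Ties are broken toward the smallest genotype code; 0 if nothing observed.
        mode = max(sorted(set(observed)), key=observed.count) if observed else 0
        for row in filled:
            if row[j] is None:
                row[j] = mode
    return filled


def concordance(truth, imputed, positions):
    """Fraction of masked cells whose imputed call equals the true call."""
    if not positions:
        return 1.0
    hits = sum(truth[i][j] == imputed[i][j] for i, j in positions)
    return hits / len(positions)


# Toy data: 50 samples x 40 markers, genotypes coded 0/1/2 (assumed coding).
rng = random.Random(0)  # fixed seed so repeated runs mask the same cells
truth = [[rng.choice([0, 1, 2]) for _ in range(40)] for _ in range(50)]

results = {}
for rate in (0.02, 0.05, 0.10, 0.25):
    masked_data, positions = mask_mcar(truth, rate, rng)
    imputed = impute_mode(masked_data)
    results[rate] = concordance(truth, imputed, positions)
    print(f"missing rate {rate:.0%}: concordance {results[rate]:.3f}")
```

Because the toy genotypes are independent random draws, the concordance printed here says nothing about real soybean data. To compare imputers under this protocol, one would replace `impute_mode` with the model of interest and keep the masking and scoring steps unchanged.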
