Performance Evaluation of Methods for Integrating Multi-modal Omics Data

杨品; 任振华; 袁增强

doi:10.13417/j.gab.043.001196

Abstract: The joint use of multimodal omics data is important for revealing cellular heterogeneity and for elucidating the mechanisms that regulate cell fate, and many methods have been developed to integrate data from different omics modalities. This study evaluates the performance of data integration methods on several integration tasks to provide a reference for research in related fields. First, 16 integration methods for paired single-cell multimodal data were tested on 6 joint-sequencing datasets across 2 types of integration task. Then, 6 spatial transcriptomics deconvolution methods were assessed on 4 simulated datasets and 1 real dataset. In the RNA and ATAC paired-data integration task, MOFA+, SCOIT, and Cobolt achieved the best performance on the PBMC, BMMC, and SNARE datasets, respectively; SCOIT ranked in the top three by aggregate score on all three datasets, and MMDVAE and DAE stood out among the autoencoder (AE)-based fusion algorithms. In the RNA and protein paired-data integration task, Cobolt, MOFA+, and Seurat achieved the best performance on the P5_CITE, BM_CITE, and COVID datasets, respectively; totalVI ranked near the top by aggregate score on all three datasets, and efMMDVAE and lfMMDVAE performed best among the AE-based fusion algorithms. In the evaluation of spatial transcriptomics deconvolution methods, Cell2location and SPACEL outperformed the other methods on both the simulated and the real data, with Cell2location performing best on the real dataset by correctly inferring the proportions of two types of cardiomyocytes in the ventricles. In addition, the methods differed in how well they adapted to different datasets in the paired-data integration tasks. SCOIT and totalVI were the stable, high-performing methods for RNA with ATAC and RNA with protein integration, respectively, whereas the performance of Seurat and MOFA+ depended strongly on the dataset.

Performance Evaluation of Methods for Integrating Multi-modal Omics Data