Comparison of Cell Type Identification Methods for Single-cell RNA-sequencing Data

朱晓姝; 滕飞; 廖燕莹; 谢妙; 杨朝义

doi:10.13417/j.gab.043.000195

Abstract

Single-cell RNA sequencing (scRNA-seq) provides gene expression profiles at single-cell resolution, which helps reveal cellular heterogeneity more accurately. Clustering is the main approach for identifying cell types in biological tissues, and choosing a suitable clustering algorithm can improve the analysis of scRNA-seq data. This paper describes eight typical single-cell clustering algorithms, namely k-means, hierarchical clustering (HC), Leiden, SC3, SCENA, LAK, SIMLR, and dropClust, and compares them on 12 scRNA-seq datasets with ground-truth labels. Eight evaluation measures are used to assess the eight algorithms: the silhouette coefficient, the Calinski-Harabasz index, the adjusted Rand index, adjusted mutual information, the Fowlkes-Mallows index (FMI), V-measure, the Jaccard coefficient, and the coefficient of variation. The experimental results show that HC, SC3, k-means, and SCENA have the best generality and robustness. SIMLR performs best on large-scale datasets. Leiden performs best on small-scale datasets, but it depends on the neighbor-node parameter and has low stability. dropClust is the worst in terms of generality and robustness. In addition, the performance of all eight methods is related to data quality: when the coefficient of variation of the data is low, the evaluation scores of the clustering algorithms are generally higher, and vice versa.
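Three of the external measures named in the abstract (the adjusted Rand index, the Fowlkes-Mallows index, and the Jaccard coefficient) compare predicted clusters with ground-truth labels by counting pairs of cells that are grouped together. The following is a minimal sketch of those pair-counting definitions in plain Python. It is not the authors' code: the function names are hypothetical, the toy labels are illustrative rather than data from the study, and the return values chosen for degenerate inputs are a convention assumed here.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_true, labels_pred):
    """Count cell pairs: (together in both, together in truth,
    together in prediction, all pairs)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    joint = Counter(zip(labels_true, labels_pred))
    both = sum(comb(c, 2) for c in joint.values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    return both, same_true, same_pred, comb(len(labels_true), 2)


def adjusted_rand_index(labels_true, labels_pred):
    """Pair agreement corrected for the agreement expected by chance."""
    both, same_true, same_pred, total = pair_counts(labels_true, labels_pred)
    if total == 0:
        return 1.0  # assumed convention: fewer than two cells
    expected = same_true * same_pred / total
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # assumed convention: both partitions are trivial
    return (both - expected) / (max_index - expected)


def fowlkes_mallows_index(labels_true, labels_pred):
    """Geometric mean of pairwise precision and pairwise recall."""
    both, same_true, same_pred, _ = pair_counts(labels_true, labels_pred)
    if same_true == 0 or same_pred == 0:
        return 0.0  # assumed convention: no co-clustered pairs
    return both / sqrt(same_true * same_pred)


def jaccard_coefficient(labels_true, labels_pred):
    """Pairs together in both, over pairs together in at least one."""
    both, same_true, same_pred, _ = pair_counts(labels_true, labels_pred)
    union = same_true + same_pred - both
    if union == 0:
        return 0.0  # assumed convention: no co-clustered pairs
    return both / union


if __name__ == "__main__":
    truth = [0, 0, 1, 1]      # toy ground-truth cell types
    predicted = [0, 0, 1, 2]  # toy clustering that splits one type
    print(adjusted_rand_index(truth, predicted))
    print(fowlkes_mallows_index(truth, predicted))
    print(jaccard_coefficient(truth, predicted))
```

All three measures depend only on which cells share a cluster, so renaming the cluster labels does not change the scores.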
