library('devtools')
install_github("CBIIT-CGBB/scCorr")
One of the challenges in single-cell RNA-sequencing analysis is the abundance of zero values, which biases the estimation of gene-gene correlations in downstream analyses. Here, we present a novel graph-based k-partitioning method that merges “homologous” cells to reduce the zero values. The method is robust and reliable for detecting correlated gene pairs, which is fundamental for network construction, gene-gene interaction, and cellular -omic analyses. The associated publication is "A novel graph-based k-partitioning approach improves the detection of gene-gene correlations by single-cell RNA sequencing", BMC Genomics, 2022.
Example R code is provided for: t-SNE, k-partitioning, merging clusters, cluster ID renaming, and correlation analysis. More examples, labeled as R code in each section, are at the end of this page (please download the data for them).
A total of 21,430 genes have zero values in at least one cell (A), and more than 95% of the 15,973 cells show zero values in at least one gene (B). Among a set of 347 genes from KEGG, all genes have a zero value in at least one cell (C), and 95% of the 15,973 cells contain a zero value in at least one gene (D).
E-G show the reduction of zero values in merged cells. The percentage of zero values across the 21,430 genes is markedly reduced in the merged cells. The reduction is approximately 50% among 50 merged cells (E). Similarly, zero values of the 347 genes selected from KEGG are reduced in merged cells (F). The reduction of zero values in merged cells is consistently observed across six different cell-set sizes (G).
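The effect described above can be reproduced in miniature with base R. This is a minimal sketch on simulated Poisson counts, assuming cells are merged by averaging within arbitrary fixed-size groups; scCorr itself groups neighboring cells on the t-SNE plot, so the grouping, the matrix sizes, and the variable names here are illustrative assumptions only.

```r
set.seed(1)

# Simulated sparse count matrix: 200 genes x 1000 cells (illustrative only).
n_genes <- 200
n_cells <- 1000
counts <- matrix(rpois(n_genes * n_cells, lambda = 0.3),
                 nrow = n_genes, ncol = n_cells)

# Hypothetical grouping: 50 groups of 20 cells each (not a scCorr partition).
group <- rep(seq_len(50), each = 20)

# Merge cells by averaging expression within each group.
# rowsum() sums rows by group, so cells are moved to rows with t().
merged <- t(rowsum(t(counts), group)) / 20

# Fraction of zero entries before and after merging.
zero_single <- mean(counts == 0)
zero_merged <- mean(merged == 0)
cat(sprintf("zero fraction, single cells: %.3f\n", zero_single))
cat(sprintf("zero fraction, merged cells: %.3f\n", zero_merged))
```

A gene is zero in a merged cell only if it is zero in every member cell, which is why the zero fraction drops as more cells are merged.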
H-L present the workflow and features of the scCorr method. First, dimensionality reduction and cell classification are performed with t-SNE, and cell types are identified using a marker-gene approach (H). Second, cells are partitioned on the t-SNE plot by scCorr with different numbers of clusters (I: k=100; J: k=1,000). The average number of cells per cluster is shown (K).
R code and one full example (R code from clustering to tree plotting)
scCorr makes it possible to trace the evolutionary process of each partitioned cluster (L).
Correlated genes are shown as -log10 p values (A) and r values (B). Gene-gene correlations from the two methods are in the same direction in some cases (C) and in opposite directions in others (D).
E and F show the top 10 correlated genes across different numbers of clusters partitioned by scCorr among CD4 T cells, evaluated by -log10 p value (E) and r value (F). The performance of scCorr for identifying CD4 T cells is shown in G (k=117) and H (k=10). The area under the curve (AUC) was greater with scCorr (AUC: 0.97 and 0.96) than with unclustered single cells (AUC=0.55).
Distributions of zero-value expression in four simulated datasets (A) and in the scRNA-seq dataset with 21,430 genes and 15,973 cells (B).
k-partitioning clusters based on the t-SNE plot. All cells are clustered into 50, 100, and 1,000 groups (A). The same clusters are shown in dot-plot views (B), where each dot represents a cluster and its size is proportional to the cluster size.
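A dot-plot view of this kind can be sketched with base R. The example below uses `kmeans()` as a stand-in partitioner on simulated 2-D coordinates; it is not scCorr's graph-based k-partitioning, and the data, the value of `k`, and the `cluster_summary` table are assumptions made for illustration.

```r
set.seed(42)

# Simulated 2-D embedding standing in for t-SNE coordinates (3000 cells).
xy <- cbind(x = rnorm(3000, sd = 20), y = rnorm(3000, sd = 20))

# Stand-in partitioning: k-means with k = 50 (NOT scCorr's graph-based method).
k <- 50
km <- kmeans(xy, centers = k, nstart = 5, iter.max = 50)

# One row per cluster: centroid coordinates and number of cells.
cluster_summary <- data.frame(
  cluster = seq_len(k),
  x = km$centers[, "x"],
  y = km$centers[, "y"],
  n_cells = km$size
)

# Dot-plot view: one dot per cluster, dot area proportional to cluster size.
plot(cluster_summary$x, cluster_summary$y,
     cex = 3 * sqrt(cluster_summary$n_cells / max(cluster_summary$n_cells)),
     pch = 16, xlab = "t-SNE 1", ylab = "t-SNE 2")
```

Scaling `cex` by the square root of the cell count makes the dot area, rather than its radius, track the cluster size.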
Tree-based visualization of cell clusters from the k-partitioning algorithm (A: ladder clusters, N=20-40; B: circle clusters, N=20-40; C: circle clusters, N=100-1,000). The size of each dot is proportional to the number of cells in the cluster. Each line connects two closest clusters.
Correlation of two co-expressed gene pairs, the MAPK1 pair and the DUSP2 pair, by the non-clustered correlation method (A) and by the scCorr clustered method (B).
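The contrast between the two approaches can be illustrated on simulated data. This is a minimal sketch, assuming two placeholder genes driven by a shared per-cluster activity level and cluster values computed as within-cluster means; it does not use scCorr functions, and `gene_a`, `gene_b`, and `activity` are hypothetical names.

```r
set.seed(7)

n_clusters <- 40
cells_per_cluster <- 50
cluster_id <- rep(seq_len(n_clusters), each = cells_per_cluster)

# Hypothetical shared activity level per cluster that drives both genes.
activity <- runif(n_clusters, min = 0.2, max = 3)[cluster_id]

# Sparse counts for two illustrative genes (placeholder names).
gene_a <- rpois(length(cluster_id), lambda = activity)
gene_b <- rpois(length(cluster_id), lambda = activity)

# Non-clustered approach: correlate across individual cells.
r_single <- cor.test(gene_a, gene_b)

# Clustered approach: average each gene within clusters, then correlate.
mean_a <- as.numeric(tapply(gene_a, cluster_id, mean))
mean_b <- as.numeric(tapply(gene_b, cluster_id, mean))
r_cluster <- cor.test(mean_a, mean_b)

cat(sprintf("single cell: r = %.2f, p = %.2g\n",
            r_single$estimate, r_single$p.value))
cat(sprintf("cluster:     r = %.2f, p = %.2g\n",
            r_cluster$estimate, r_cluster$p.value))
```

Averaging within clusters removes much of the per-cell count noise and many of the zeros, so the shared signal between the two genes is easier to detect at the cluster level.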
Correlation of the top 10 co-expressed gene pairs from cluster 40 across different numbers of partitioned clusters, evaluated by p values and correlation coefficients. In each title (n#), the number is the threshold for cluster merging: if a cluster contains fewer cells than the threshold, it is merged into the adjacent cluster.
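The merging rule can be sketched as follows. This is an illustration under stated assumptions, not the package's `merge_list` implementation: "adjacent" is taken to mean the cluster with the nearest centroid, and `merge_small_clusters` and `min_cells` are hypothetical names.

```r
# Merge every cluster with fewer than `min_cells` cells into the cluster
# whose centroid is nearest (an assumed reading of "adjacent").
merge_small_clusters <- function(xy, cluster, min_cells) {
  cluster <- as.character(cluster)
  repeat {
    sizes <- table(cluster)
    small <- names(sizes)[sizes < min_cells]
    if (length(small) == 0 || length(sizes) == 1) break
    # Handle the smallest cluster first, then recompute sizes and centroids.
    target <- small[which.min(sizes[small])]
    centers <- t(sapply(names(sizes), function(id) {
      colMeans(xy[cluster == id, , drop = FALSE])
    }))
    others <- setdiff(names(sizes), target)
    # Euclidean distance from the target centroid to every other centroid.
    diffs <- sweep(centers[others, , drop = FALSE], 2, centers[target, ])
    nearest <- others[which.min(sqrt(rowSums(diffs^2)))]
    cluster[cluster == target] <- nearest
  }
  cluster
}

set.seed(3)
xy <- rbind(
  cbind(rnorm(30, mean = 0),   rnorm(30, mean = 0)),    # cluster 1: 30 cells
  cbind(rnorm(30, mean = 10),  rnorm(30, mean = 10)),   # cluster 2: 30 cells
  cbind(rnorm(3, 1.5, 0.5),    rnorm(3, 1.5, 0.5))      # cluster 3: 3 cells
)
cluster <- rep(c(1, 2, 3), times = c(30, 30, 3))

merged <- merge_small_clusters(xy, cluster, min_cells = 5)
print(table(merged))
```

With a threshold of 5, the three-cell cluster is absorbed by the cluster closest to it, while the two large clusters are left unchanged.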
The xy.coordinate parameter sets the range used for scaling. For example, if xy.coordinate is 50, the scaling range is from -50 to 50, and so on. We suggest an xy.coordinate of 300 or 400 for roughly 5,000 to 15,000 single cells. The value can be increased if you have more single cells.
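As an illustration of what such a scaling range means, the sketch below linearly rescales each axis to [-limit, limit]. It assumes simple min-max scaling per axis; the package's own `scale_v` may behave differently, and `rescale_to_limit` is a hypothetical helper.

```r
# Linearly rescale a numeric vector so that its minimum maps to -limit
# and its maximum maps to +limit (assumed min-max scaling).
rescale_to_limit <- function(v, limit) {
  rng <- range(v)
  (v - rng[1]) / (rng[2] - rng[1]) * 2 * limit - limit
}

# Toy coordinates standing in for a t-SNE embedding.
xy <- cbind(x = c(-3, 0, 1, 7), y = c(10, 20, 25, 30))

# Apply the same limit to both axes, as with xy.coordinate = 50.
scaled <- apply(xy, 2, rescale_to_limit, limit = 50)
print(range(scaled[, "x"]))  # -50 50
print(range(scaled[, "y"]))  # -50 50
```

A wider range spreads the cells over more coordinate units, which is consistent with the suggestion to increase the value for larger datasets.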
c_list : Graph-based k-partitioning method with scaling
d_list : Merges homologous single cells along one coordinate using a density method
GCluster : Graph-based clustering
get_value : Converts a single-cell-based matrix to a cluster-based matrix
m_list : Merges homologous single cells along one coordinate using window sizes
merge_list : Merges a cluster into the adjacent cluster if its number of single cells is below a cutoff
mgGCLuster : Merges clusters given the IDs of the clusters to merge
scale_v : Scaling function
tj_list : Merges homologous single cells by trajectory analysis
tjGCluster : Trajectory analysis function for tj_list
tjGCluster2 : Trajectory analysis function II for tj_list
r_c : Rotates coordinates