Clustering

cellula offers a wrapper around clustering functions. For now, only SNN-based Louvain[11] and Leiden[12] clustering are implemented. The makeGraphsAndClusters() function allows users to do parameter sweeps along either the number of neighbors or the resolution of the clustering.

In this example we sweep along the value of the resolution parameter for a Louvain clustering. For the SNN graph constructions, edges are weighted according to the jaccard index of their shared neighbors, mimicking the Seurat graph construction and clustering procedure:

sce <- makeGraphsAndClusters(sce, k = c(0.1, 0.25, 0.5, 0.75, 1),
                             dr = "PCA_Harmony", ndims = 20,
                             sweep_on = "clustering", method = "louvain", 
                             weighting_scheme = "jaccard", prefix = "SNN_",
                             verbose = TRUE)

If another integration method has been run on the same object (e.g. Seurat integration), then the clustering can be performed on that integrated space by specifying the dr argument (in this case, "PCA_Seurat"):

sce <- makeGraphsAndClusters(sce, k = c(0.1, 0.25, 0.5, 0.75, 1), 
                             dr = "PCA_Seurat", ndims = 20,
                             sweep_on = "clustering", method = "louvain", 
                             weighting_scheme = "jaccard", prefix = "Seurat_SNN_",
                             verbose = TRUE)

The default value for space is NULL and will use the "PCA" slot from the reducedDim() accessor.

Plotting clustering results

You can visualize clustering results on the UMAP using the plot_UMAP() function, adding labels if desired:

plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "SNN_0.5", label_by = "SNN_0.5")

If using the clustree package [13], the clustering tree can be visualized by using the same prefix defined in makeGraphsAndClusters():

library(clustree)
clustree(sce, prefix = "SNN_")

The default arguments to the clustering wrapper include the generation of modularity and approximate silhouette scores for every clustering round. These will be stored in the metadata of the SCE, named according to the prefix, the resolution, and the "modularity_" and "silhouette_" prefixes. Silhouette and modularity can be visualized by using the dedicated functions:

plotSilhouette(sce, "SNN_0.5")

plotModularity(sce, "SNN_0.5")

As seen in the Plotting section, you can plot cluster markers using the presto package and the plot_dots() function:

# Install {presto}
remotes::install_github("immunogenomics/presto")

# Quick and dirty marker calculation
markers = presto::wilcoxauc(sce, group_by = "SNN_0.5")
markerlist = split(markers, markers$group)

for(i in seq_len(length(markerlist))) {
  markerlist[[i]]$deltapct = markerlist[[i]]$pct_in - markerlist[[i]]$pct_out
  markerlist[[i]] = markerlist[[i]][order(markerlist[[i]]$deltapct, decreasing = TRUE),]
}

top5 = lapply(markerlist, function(x) x$feature[1:5])
markergenes =  Reduce(union, top5)

plot_dots(sce, genes = top5, group_by = "SNN_0.5")

Metaclusters

Aditionally, metaclusters can also be identified. A metacluster is a cluster of clusters obtained by different clustering methods. Clusters across methods are linked acording to how many cells they share, and these links become edges of a graph. Then, Louvain clustering is run on the graph and the communities that are identified are metaclusters. These metaclusters show the relationship between clustering methods. Moreover, they can be used to understand cluster stability along different parameters and/or integration methods. A cell can belong to different clusters according to the clustering method (i.e. to the resolution or to the space that was used). If a cell belongs to clusters that are consistently included in a metacluster, then that cell belongs to a “stable” cluster. If instead the cell belongs to clusters that have different metacluster assignments, then it’s in an “unstable” position, meaning it may be clustered differently according to integration methods and/or resolutions.

The metaClusters() function takes a clusters argument, which is a vector of column names from the colData of the SCE where clustering results are stored. In this example it is easy to isolate by using grep() and searching for the prefix “SNN_”.

clusterlabels <- colnames(colData(sce))[grep("SNN_", colnames(colData(sce)))]
sce <- metaCluster(sce, clusters = clusterlabels)

Every cell will belong to a series of clusters, which in turn belong to a metacluster. For every cell, we count how many times they are assigned to a particular metacluster, and the maximum metacluster is assigned, together with a “metacluster score” (i.e. the frequency of assignment to the maximum metacluster) and whether this score is above or below a certain threshold (0.5 by default). These columns are saved in the colData slot of the SCE.

plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "metacluster_score", label_by = "SNN_0.5")