Trajectory inference

At the time of writing cellula only implements a wrapper around the slingshot method[20] and the monocle3 method [21] for pseudotemporal trajectory inference, and the testPseudotime() method from TSCAN[22] for differential expression along a trajectory. More trajectory inference and differential expression methods will be implemented in the future.

The monocle3 method has been originally developed to work on 2D embeddings such as UMAP, but given the distortion introduced by these embeddings this function allows users to input any embedding (such as PCA). Moreover, the monocle3 method originally allows partitions to be specified. In the current implementation partitions have been disabled.

The findTrajectories() function takes a SingleCellExperiment object as input, and requires the user to specify the method (one of "slingshot" or "monocle"), the cluster label which will be used as an input to the MST creation in slingshot, or to identify the starting node in monocle3.

Other important parameters are:

The output is the same object used in the input, with some additional fields.

In this example we download a dataset that captures early haematopoietic differentiation from Nestorowa et al. [23] and we do a quick processing and clustering.

sce2 = scRNAseq::NestorowaHSCData()

sce2 = cellula(sce2, name = "nestorowa", do_qc = FALSE)

sce2 = makeGraphsAndClusters(sce2, space = reducedDim(sce2, "PCA")[,1:20])

plot_UMAP(sce2, umap_slot = "UMAP", color_by = "SNN_0.64", label_by = "SNN_0.64")

Then we take an arbitrary starting point (cluster 7) and calculate trajectories with both slingshot and monocle using the same space (PCA, first 20 components).

sce2 = findTrajectories(sce2, dr = "PCA", 
                        method = "monocle",
                        ndims = 20, clusters = "SNN_0.64", 
                        dr_embed = "UMAP", start = 7)
                        
sce2 = findTrajectories(sce2, dr = "PCA", 
                        method = "slingshot",
                        ndims = 20, clusters = "SNN_0.64", 
                        dr_embed = "UMAP", start = 7)           

Now we can visualize results on the UMAP since we specified a dr_embed parameter. Notice how slingshot will identify different pseudotimes (one per lineage) while monocle3 will identify a single pseudotime. Moreover, the plot_UMAP() function can include a trajectory argument which points to the name of the element in the metadata slot containing segments to draw trajectories.

 plot_UMAP(sce2, umap_slot = "UMAP", color_by = "slingPseudotime_1", 
           label_by = "SNN_0.64", trajectory = "Slingshot_embedded_curves")
 plot_UMAP(sce2, umap_slot = "UMAP", color_by = "monoclePseudotime", 
           label_by = "SNN_0.64", trajectory = "Monocle_embedded_curves")

Since the 2D embedding of monocle3 PCA-derived trajectories may be hard to understand, given the distortions introduced by UMAP, cellula includes an additional 2D embedding method, dr_embed = "FR", inspired båy the PAGA embedding initialization technique [24].

Briefly, once the principal graph has been calculated, it is laid out in 2D using the Fruchterman-Reingold algorithm. Then each cell is randomly plotted around their closest vertex in the graph, and reordered according to pseudotime value. This semi-random layout is used as an initialization for UMAP, which will optimize the point positions. The resulting layout is more visually pleasing and reflects more accurately the positions of cells with respect to the trajectory (although not necessarily to each other).

This FR-initialized UMAP is stored in a reducedDim slot named UMAP_FR.

sce2 = findTrajectories(sce2, clusters = "SNN_0.64", method = "monocle",
        dr = "PCA", ndims = 20, start = "7", dr_embed = "FR")
        
plot_UMAP(sce2, umap_slot = "UMAP_FR", color_by = "monoclePseudotime", 
           label_by = "SNN_0.64", trajectories = "Monocle_embedded_curves")

It should be noted that this layout is, in a way, optimized for trajectories rather than global cell-cell similarity and as always should only be treated as a visualization tool.

Metacells

In order to speed up calculations and overcome sparsity, cells can be aggregated into metacells using k-means clustering with a high k.

Clustering is carried out in the reduced dimensional space of choice selected through the space argument, and it is carried out in each level of group separately.

Rather than selecting the number of clusters k, the function takes an average number of cells per cluster w which is used to determine k (default is w = 10 cells per cluster). Read counts are aggregated by gene across all cells within each cluster, resulting in metacells. These can be used as input to findTrajectories() or other operations such as clustering, downsampling, signature scoring, etc.

In this example we create metacells aggregating (on average) 10 cells, within each cluster from the “SNN_0.5” clustering results:

sce_meta <- makeMetacells(sce, group = "SNN_0.5", space = "PCA_Harmony", w = 10)