Trajectory inference

At the time of writing cellula only implements a wrapper around the slingshot method[20] and the monocle3 method [21] for pseudotemporal trajectory inference, and the testPseudotime() method from TSCAN[22] for differential expression along a trajectory. More trajectory inference and differential expression methods will be implemented in the future.

The monocle3 method has been originally developed to work on 2D embeddings such as UMAP, but given the distortion introduced by these embeddings this function allows users to input any embedding (such as PCA). Moreover, the monocle3 method originally allows partitions to be specified. In the current implementation partitions have been disabled.

The findTrajectories() function takes a SingleCellExperiment object as input, and requires the user to specify the method (one of "slingshot" or "monocle"), the cluster label which will be used as an input to the MST creation in slingshot, or to identify the starting node in monocle3.

Other important parameters are:

dr - the reduced dimensional embedding in which trajectories will be estimated (default is “PCA”)
start - the starting cluster for trajectory estimation, defaulting to “auto” for an entropy-based method
omega - slingshot only. Whether or not to use a synthetic cluster to estimate disjointed trajectories, as detailed in ?slingshot::getLineages
Monocle_lg_control - monocle only. A list of control parameters for the learn_graph() function from monocle3.
invert - monocle only. Whether or not to invert the direction of the pseudotime vector.
dr_embed - an additional 2D dimensional reduction slot in which to embed the trajectories (slingshot curves or monocle principal graph). Yields slightly different results according to the method.
doDE - whether or not to perform differential expression on every trajectory.

The output is the same object used in the input, with some additional fields.

the colData slot will contain a column containing the slingPseudotime_N pseudotime values for each lineage, where N is the number of the lineage (slingshot) or just a column named monoclePseudotime with pseudotime values for each cell (monocle).
the metadata slot will several new elements depending on the method.

For slingshot:
- Slingshot_lineages: the output of slingLineages(), i.e. the result of the MST construction
- Slingshot_embedded_curves: the output of embedCurves(), i.e. the projection of low-dimensional principal curves onto a 2D embedding such as UMAP. Optional and only created if the parameter dr_embed is an available dimensionality reduction slot in the object.
- Slingshot_MST: the MST used to determine lineages by slingshot. Optional and only added if the parameter add_metadata is set to TRUE.
- Slingshot_curves: the principal curves constructed by slingshot. Optional and only added if the parameter add_metadata is set to TRUE.
- Slingshot_weights: the weights assigned per cell within each lineage by slingshot. Optional and only added if the parameter add_metadata is set to TRUE.
- Slingshot_params: the params used in the slingshot call. Optional and only added if the parameter add_metadata is set to TRUE.
- pseudotime_DE: the results of the differential expression test. This is a list where each lineage is tested separately. Optional and only created if doDE is set to TRUE.
For monocle:
- Monocle_embedded_curves: the projection of the nodes of the principal graph to their nearest neighbor in 2D. Optional and only created if the parameter dr_embed is an available dimensionality reduction slot in the object.
- Monocle_principal_graph: the igraph object with the principal graph.
- Monocle_principal_graph_aux: the rest of the CellDataSet object with additional information on the principal graph.

In this example we download a dataset that captures early haematopoietic differentiation from Nestorowa et al. [23] and we do a quick processing and clustering.

sce2 = scRNAseq::NestorowaHSCData()

sce2 = cellula(sce2, name = "nestorowa", do_qc = FALSE)

sce2 = makeGraphsAndClusters(sce2, space = reducedDim(sce2, "PCA")[,1:20])

plot_UMAP(sce2, umap_slot = "UMAP", color_by = "SNN_0.64", label_by = "SNN_0.64")

Then we take an arbitrary starting point (cluster 7) and calculate trajectories with both slingshot and monocle using the same space (PCA, first 20 components).

sce2 = findTrajectories(sce2, dr = "PCA", 
                        method = "monocle",
                        ndims = 20, clusters = "SNN_0.64", 
                        dr_embed = "UMAP", start = 7)
                        
sce2 = findTrajectories(sce2, dr = "PCA", 
                        method = "slingshot",
                        ndims = 20, clusters = "SNN_0.64", 
                        dr_embed = "UMAP", start = 7)

Now we can visualize results on the UMAP since we specified a dr_embed parameter. Notice how slingshot will identify different pseudotimes (one per lineage) while monocle3 will identify a single pseudotime. Moreover, the plot_UMAP() function can include a trajectory argument which points to the name of the element in the metadata slot containing segments to draw trajectories.

 plot_UMAP(sce2, umap_slot = "UMAP", color_by = "slingPseudotime_1", 
           label_by = "SNN_0.64", trajectory = "Slingshot_embedded_curves")
 plot_UMAP(sce2, umap_slot = "UMAP", color_by = "monoclePseudotime", 
           label_by = "SNN_0.64", trajectory = "Monocle_embedded_curves")

Since the 2D embedding of monocle3 PCA-derived trajectories may be hard to understand, given the distortions introduced by UMAP, cellula includes an additional 2D embedding method, dr_embed = "FR", inspired båy the PAGA embedding initialization technique [24].

Briefly, once the principal graph has been calculated, it is laid out in 2D using the Fruchterman-Reingold algorithm. Then each cell is randomly plotted around their closest vertex in the graph, and reordered according to pseudotime value. This semi-random layout is used as an initialization for UMAP, which will optimize the point positions. The resulting layout is more visually pleasing and reflects more accurately the positions of cells with respect to the trajectory (although not necessarily to each other).

This FR-initialized UMAP is stored in a reducedDim slot named UMAP_FR.

sce2 = findTrajectories(sce2, clusters = "SNN_0.64", method = "monocle",
        dr = "PCA", ndims = 20, start = "7", dr_embed = "FR")
        
plot_UMAP(sce2, umap_slot = "UMAP_FR", color_by = "monoclePseudotime", 
           label_by = "SNN_0.64", trajectories = "Monocle_embedded_curves")

It should be noted that this layout is, in a way, optimized for trajectories rather than global cell-cell similarity and as always should only be treated as a visualization tool.

Metacells

In order to speed up calculations and overcome sparsity, cells can be aggregated into metacells using k-means clustering with a high k.

Clustering is carried out in the reduced dimensional space of choice selected through the space argument, and it is carried out in each level of group separately.

Rather than selecting the number of clusters k, the function takes an average number of cells per cluster w which is used to determine k (default is w = 10 cells per cluster). Read counts are aggregated by gene across all cells within each cluster, resulting in metacells. These can be used as input to findTrajectories() or other operations such as clustering, downsampling, signature scoring, etc.

In this example we create metacells aggregating (on average) 10 cells, within each cluster from the “SNN_0.5” clustering results:

sce_meta <- makeMetacells(sce, group = "SNN_0.5", space = "PCA_Harmony", w = 10)