Usage

The pipelines in cellula assume that you are working with the output of CellRanger (or something similar) and you imported it into a SingleCellExperiment object (hereafter SCE) using the TENxIO package. This is relevant for gene identifiers, since the rowData slot of the SCE will have a “Symbol” and a “ID” column.

For demo purposes we can use a publicly available dataset, Segerstolpe et al. 2016 [2], which we retrieve using the scRNAseq package:

# BiocManager::install("scRNAseq") 
sce <- scRNAseq::SegerstolpePancreasData()
colnames(rowData(sce)) = c("Symbol", "ID")

Assuming that you have formed a SCE object containing the “individual” column that identifies different batches, you can run an integration pipeline as follows:

sce <- cellula(sce, name = "myproject", batch = "individual", 
                integration_method = "Harmony",
                verbose = TRUE, save_plots = TRUE)

The name argument defines the name of the folder that will be created to store files and plots. We set verbose = TRUE to print the progress of the pipeline. Setting save_plots = TRUE will create a few QC plots in the name/plots folder: total UMI, total genes detected, UMI x genes; optionally % MT, % Ribo and %MALAT1, total UMI x doublet class. Plots are separated according to whether the cells were discarded or not in the filtering step.

The cellula() function is a wrapper around a few modules or sub-pipelines that have different degrees of customization.

There are other independent functions that are not run through cellula() as they need some user input, e.g. findTrajectories() requires the user to specify the starting cluster (through makeGraphsAndClusters()), or the cluster labels to use.

The scheme is:

cellula()
    ├── Quality Control [QC]
    |    ├── run emptyDrops (optional) [QC/EMPTY]
    |    ├── score mito/ribo/malat1 subsets (optional)
    |    ├── filter out (optional)
    |    └── doublet finding (optional) [QC/DBL]
    ├── Normalization and dimensionality reduction [NOR]
    |    ├── pre-clustering
    |    ├── computing pooled factors
    |    ├── log-normalization (simple or multi-batch)
    |    ├── HVG finding (simple or multi-batch)
    |    ├── PCA
    |    └── UMAP
    ├── Integration [INT] (optional) - choose one method
    |    ├── fastMNN [INT/MNN]
    |    |    ├── integration
    |    |    └── UMAP
    |    ├── Seurat [INT/SEURAT]
    |    |    ├── conversion to Seurat
    |    |    ├── normalization and HVG finding
    |    |    ├── find integration anchors
    |    |    ├── integrate data
    |    |    ├── scale data
    |    |    ├── PCA
    |    |    └── UMAP
    |    ├── LIGER [INT/LIGER]
    |    |    ├── conversion to LIGER
    |    |    ├── normalization
    |    |    ├── HVG finding
    |    |    ├── scale data
    |    |    ├── NMF
    |    |    ├── quantile normalization
    |    |    └── UMAP
    |    ├── Harmony [INT/HARMONY]
    |    |    ├── Harmony matrix (on PCA)
    |    |    └── UMAP
    |    └── Regression [INT/regression]
    |    |    ├── regression on logcounts
    |    |    ├── PCA
    |    |    └── UMAP
    |    ├── scMerge2 [INT/scMerge2]
    |    |    ├── Pseudobulk and RUV
    |    |    └── UMAP
    |    └── STACAS [INT/STACAS]
    |         ├── conversion to Seurat
    |         ├── normalization and HVG finding
    |         ├── STACAS integration
    |         ├── PCA
    |         └── UMAP    
    └── Cell type annotation [ANNO] (optional) - choose one method
              ├── Seurat AddModuleScore
              ├── ssGSEA
              ├── UCell
              ├── AUCell
              └── Jaitin
              
  makeGraphsAndClusters()
    └── Multi-resolution clustering [CLU]
          ├── sweep on Louvain/Leiden resolution or SNN neighbor numbers
          ├── calculate modularity (optional)
          └── calculate silhouette (optional)
              
  findTrajectories()            
    └── Trajectory estimation [TRAJ]
          ├── slingshot [TRAJ/slingshot]
          |   ├── get lineages
          |   ├── calculate principal curves
          |   ├── embed in 2D (optional)
          |   └── calculate per-lineage DE (optional)
          └── monocle3 [TRAJ/monocle]
              ├── convert to CellDataSet
              ├── learn graph
              └── embed in 2D (optional)
                   ├── populate FR layout (if dr_embed = "FR")
                   └── UMAP on FR layout (if dr_embed = "FR")

Most of the choices can be made around the integration method. cellula has implemented 5 methods: fastMNN[3],[4] and regressBatches from the batchelor package, Harmony[5], the CCA-based Seurat[6] method, non-negative matrix factorization (NMF) from LIGER[7] through the rliger and RccPlancs packages, a pseudobulk and RUV-based method, scMerge2 from the scMerge package[8], and the Seurat-based STACAS integration method from the eponymous package[9].

LIGER and Seurat integration methods require an intermediate step where package-specific objects are created and some pre-processing steps are repeated again according to the best practices published by the authors of those packages.

Each step of the pipeline can be called independently on the object:

sce <- doQC(sce, name = "segerstolpe", batch = "individual", save_plots = TRUE)
sce <- doNormAndReduce(sce, name = "segerstolpe", batch = "individual")
sce <- integrateSCE(sce, batch = "individual", method = "Seurat")

Doublet identification is carried out through the scDblFinder package[10] using standard defaults.