Assigning identities

cellula implements five methods for automated cell identity assignment: four are signature-based and one is reference-based. They build on the Bioconductor AUCell package[14], the GSVA implementation of ssGSEA[15], the Seurat AddModuleScore() function, the UCell method[16] and the Jaitin method[17].

The four signature-based methods (built on AUCell, ssGSEA, AddModuleScore and UCell) require user-defined genesets, i.e. a named list in which each element is a vector of genes used to score every single cell. These can be obtained through other packages, e.g. msigdbr. For instance, to take all the Muraro et al.[18] signature genes, which are part of the C8 collection, we would do:

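library(msigdbr)

# C8 is the MSigDB cell type signature collection
# (recent msigdbr releases name this argument `collection` instead of `category`)
type_genes = msigdbr(species = "Homo sapiens", category = "C8")

# one character vector of gene symbols per gene set, named by gene set
genesets = lapply(split(type_genes, type_genes$gs_name), function(x) x$gene_symbol)

# keep only the gene sets whose name contains "MURARO"
muraro_genes = genesets[grep("MURARO", names(genesets))]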
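The genesets object is an ordinary named list of character vectors, so it can also be written by hand when no curated collection fits. A minimal sketch follows; the list name `custom_genesets` and the choice of genes are illustrative assumptions, not a curated signature.

```r
# Hypothetical hand-written gene sets: one character vector of gene symbols
# per cell type. The genes are common pancreatic markers, chosen only to
# illustrate the expected structure.
custom_genesets <- list(
  ALPHA_CELL = c("GCG", "TTR"),
  BETA_CELL  = c("INS", "IAPP"),
  DELTA_CELL = c("SST", "RBP4")
)

# Sanity checks before scoring: the list is named and every element
# is a character vector.
stopifnot(!is.null(names(custom_genesets)),
          all(vapply(custom_genesets, is.character, logical(1))))

# Drop duplicated symbols within each set.
custom_genesets <- lapply(custom_genesets, unique)
```

Such a list has the same shape as muraro_genes and could be passed to the genesets argument in the same way.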

Then, we would use the assignIdentities() function from cellula to calculate signature scores:

sce = assignIdentities(sce, 
                       genesets = muraro_genes, 
                       method = "AUC")

The other signature-based values accepted by the method argument are "Seurat", "UCell", and "ssGSEA".

A reference-based method, "Jaitin", is also available. This method uses a gene expression matrix as a reference and calculates, for each cell in the sce object, the posterior probability that its transcriptome matches each of the reference profiles. The reference with the highest probability is selected as the best label. Importantly, for the "Jaitin" method it is possible to choose the assay to be used. If the user supplies a matrix of log-normalized counts as a reference, the assay argument should point to similarly normalized data, e.g. "logcounts".
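The idea of picking the reference with the highest posterior probability can be sketched on toy data. This is not cellula's implementation: the sketch assumes a multinomial model over raw counts with a uniform prior over references, and every gene name, cell name and value in it is made up.

```r
# Toy reference: rows are genes, columns are reference profiles (made-up counts).
ref <- matrix(c(90,  5,  5,
                 5, 90,  5,
                 5,  5, 90),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              c("typeX", "typeY", "typeZ")))

# Toy query: raw counts for two cells over the same genes (genes x cells).
cells <- matrix(c(20,  1,
                   2,  1,
                   1, 15),
                nrow = 3, byrow = TRUE,
                dimnames = list(rownames(ref), c("cell1", "cell2")))

# Turn each reference profile into gene probabilities;
# the pseudocount of 1 avoids log(0).
ref_prob <- sweep(ref + 1, 2, colSums(ref + 1), "/")

# Multinomial log-likelihood of each cell under each reference, up to a
# constant: sum over genes of count * log(probability). Result: cells x references.
loglik <- t(cells) %*% log(ref_prob)

# Posterior under a uniform prior: normalize the likelihoods within each cell
# (subtracting the maximum first for numerical stability).
posterior <- t(apply(loglik, 1, function(l) {
  p <- exp(l - max(l))
  p / sum(p)
}))

# Best label: the reference with the highest posterior for each cell.
labels <- colnames(posterior)[max.col(posterior, ties.method = "first")]
names(labels) <- rownames(posterior)
```

Whatever model is used, the cell data and the reference need to be on the same scale, which is why the assay argument should match the normalization of the reference.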

assignIdentities() will create a column named "labels_AUC" (or any other name the user sets through the name argument) in colData(sce). Assignments can be plotted:

plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "labels_AUC")

Assigned labels do not always line up with clusters on the UMAP, which is why it can be useful to plot a confusion matrix of clusters against labels as a heatmap:

plot_Coldata(sce, x = "SNN_0.5", y = "labels_AUC")
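The numbers behind such a heatmap are a cross-tabulation of two colData columns. A minimal base R sketch on toy vectors follows; the cluster and label values are hypothetical stand-ins for the real columns.

```r
# Toy stand-ins for a clustering column and a label column (hypothetical values).
clusters <- c("1", "1", "2", "2", "2", "3")
labels   <- c("ALPHA", "ALPHA", "BETA", "BETA", "ALPHA", "DELTA")

# Confusion matrix: rows are clusters, columns are assigned labels,
# entries are cell counts.
confusion <- table(cluster = clusters, label = labels)

# Row-wise proportions show which label dominates each cluster.
round(prop.table(confusion, margin = 1), 2)
```

With the real object, the same table could be built from colData(sce)[["SNN_0.5"]] and colData(sce)[["labels_AUC"]].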

A single signature can also be used as input. In that case the score itself, rather than a label assignment, is added directly to the colData of the SCE:

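# list elements keep their full MSigDB names, so the beta cell signature
# is selected by its complete name
sce <- assignIdentities(sce, 
                        genesets = muraro_genes[["MURARO_PANCREAS_BETA_CELL"]], 
                        method = "AUC", 
                        name = "Beta_Cell_signature")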

plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "Beta_Cell_signature")

The "UCell" method works well with small signatures (even 2 or 3 genes). It allows you to mark genes as positive or negative, which is useful when you are sure the identity of a cell type depends on the lack of expression of certain markers (as in hematopoietic lineages). To do so, append "+" or "-" to each gene symbol.
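Signed signatures of this kind can be assembled with a few lines of base R. In the sketch below, the helper `signed_signature()` and the list `signed_genesets` are hypothetical names, and the marker genes are illustrative rather than a validated panel.

```r
# Hypothetical helper: append "+" to markers expected to be expressed
# and "-" to markers expected to be absent.
signed_signature <- function(positive, negative = character(0)) {
  # Guard against empty inputs: paste0() would otherwise return a bare "+" or "-".
  pos <- if (length(positive) > 0) paste0(positive, "+") else character(0)
  neg <- if (length(negative) > 0) paste0(negative, "-") else character(0)
  c(pos, neg)
}

# Illustrative lymphoid signatures: each lineage is defined by the markers
# it expresses and by the markers of the other lineage that it lacks.
signed_genesets <- list(
  T_CELL = signed_signature(c("CD3D", "CD3E"), negative = c("CD19", "MS4A1")),
  B_CELL = signed_signature(c("CD19", "MS4A1"), negative = c("CD3D", "CD3E"))
)
```

The resulting list could then be scored with assignIdentities(sce, genesets = signed_genesets, method = "UCell").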