Assigning identities
cellula implements 5 methods for automated cell identity assignment, 4 of which are signature-based and one of which is reference-based. The methods are based on the Bioconductor AUCell package[14], the GSVA ssGSEA implementation[15], the Seurat AddModuleScore() function, the UCell method[16] and the Jaitin method[17].
For the first 4 methods (AUCell, ssGSEA, AddModuleScore and UCell, selected with method = "AUC", "ssGSEA", "Seurat" and "UCell" respectively) the function requires user-defined genesets, i.e. a named list in which each element contains the genes to be used for scoring every single cell. These can be obtained through other packages, e.g. msigdbr. For instance, if we wanted to take all the Muraro et al.[18] signature genes, present in the C8 collection, we would do:
library(msigdbr)
# retrieve the C8 (cell type signature) collection for human
type_genes = msigdbr("Homo sapiens", category = "C8")
# build a named list: one character vector of gene symbols per gene set
genesets = lapply(split(type_genes, type_genes$gs_name), function(x) x$gene_symbol)
# keep only the gene sets whose name contains "MURARO"
muraro_genes = genesets[grep("MURARO", names(genesets))]
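The split() and lapply() pattern used to build the named list can be tried without downloading anything. The following is a minimal sketch on a toy table; the gene set names and gene symbols are made up for illustration and are not taken from the C8 collection.

```r
# Toy stand-in for the msigdbr output: one row per gene / gene set pair.
# Gene set names and gene symbols are illustrative assumptions.
toy_genes <- data.frame(
  gs_name = c("MURARO_TOY_ALPHA", "MURARO_TOY_ALPHA",
              "MURARO_TOY_BETA", "OTHER_TOY_SET"),
  gene_symbol = c("GCG", "TTR", "INS", "ALB"),
  stringsAsFactors = FALSE
)

# Split the rows by gene set, then keep only the gene symbols of each set
toy_sets <- lapply(split(toy_genes, toy_genes$gs_name),
                   function(x) x$gene_symbol)

# Keep the sets whose name contains "MURARO"
toy_muraro <- toy_sets[grep("MURARO", names(toy_sets))]

# Inspect the result: a named list of character vectors
str(toy_muraro)
```

The result has the shape expected for genesets: each name is a signature, each element a character vector of genes.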
Then, we would use the assignIdentities() function from cellula to calculate signature scores:
# score every cell against each Muraro signature using the AUC method
sce = assignIdentities(sce,
                       genesets = muraro_genes,
                       method = "AUC")
Other signature-based methods are "Seurat", "UCell", and "ssGSEA".
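To give an intuition of how per-cell signature scores can become a label, here is a minimal sketch in base R. It assumes that each cell simply receives the name of its highest-scoring signature; this is an illustrative assumption and not necessarily the rule applied by assignIdentities(). The score values and signature names are made up.

```r
# Hypothetical score matrix: rows are cells, columns are signatures
scores <- rbind(
  cell1 = c(0.8, 0.1, 0.2),
  cell2 = c(0.1, 0.7, 0.3),
  cell3 = c(0.2, 0.3, 0.9),
  cell4 = c(0.6, 0.5, 0.1)
)
colnames(scores) <- c("ALPHA", "BETA", "DELTA")

# Assumed rule: label each cell with its highest-scoring signature
toy_labels <- apply(scores, 1, function(s) names(s)[which.max(s)])
toy_labels
```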
A reference-based method, "Jaitin", is also available. This method uses a matrix of gene expression as a reference, and calculates, for each cell in the sce object, the posterior probability that its transcriptome matches each of the reference profiles. The reference with the highest probability is selected as the best label. Importantly, for the "Jaitin" method it is possible to choose the assay to be used. If the user supplies a matrix of log-normalized counts as a reference, the assay argument should point to similarly normalized data, e.g. "logcounts".
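The idea of a posterior probability per reference profile can be sketched in base R as follows. This is a toy illustration and not the cellula implementation: it assumes a multinomial model on raw counts, a uniform prior over reference types and a small pseudocount, and all names and values are made up.

```r
# Hypothetical reference: genes in rows, reference cell types in columns
reference <- cbind(typeA = c(50, 5, 1), typeB = c(2, 40, 10))
rownames(reference) <- c("gene1", "gene2", "gene3")

# Turn each reference profile into gene probabilities
# (the pseudocount avoids log(0); its value is an arbitrary assumption)
pseudo <- 1e-3
ref_prob <- sweep(reference + pseudo, 2, colSums(reference + pseudo), "/")

# Counts observed in one cell, in the same gene order as the reference
cell <- c(gene1 = 9, gene2 = 1, gene3 = 0)

# Multinomial log-likelihood of the cell under each reference profile,
# up to a constant that is identical for all profiles
loglik <- colSums(cell * log(ref_prob))

# Posterior under a uniform prior, computed in a numerically stable way
post <- exp(loglik - max(loglik))
post <- post / sum(post)

# The reference with the highest posterior becomes the label
best <- names(which.max(post))
best
```

Because the cell and the reference are compared gene by gene, both must be on the same scale, which is why the assay should match the normalization of the reference.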
assignIdentities() will create a column named "labels_AUC" (or anything else the user determines using the name argument) in the colData(sce). Assignments can be plotted:
plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "labels_AUC")
You can now see why it can be useful to plot a confusion matrix as a heatmap:
plot_Coldata(sce, x = "SNN_0.5", y = "labels_AUC")
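The table underlying such a heatmap can be sketched in base R with table(). The cluster and label vectors below are made up; in practice they would be two columns of colData(sce), such as SNN_0.5 and labels_AUC.

```r
# Hypothetical cluster assignments and labels for six cells
toy_clusters <- c("1", "1", "2", "2", "2", "3")
toy_labels <- c("ALPHA", "ALPHA", "BETA", "BETA", "ALPHA", "DELTA")

# Confusion matrix: number of cells for each cluster / label pair
confusion <- table(cluster = toy_clusters, label = toy_labels)

# Row-wise proportions: the fraction of each cluster given each label
confusion_prop <- prop.table(confusion, margin = 1)
confusion_prop
```

A cluster whose row is dominated by a single label has a clear identity, while a row spread over several labels suggests a mixed or ambiguous cluster.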
You can also use a single signature as input, which adds the score directly to the colData slot of the SCE, rather than an assignment:
# a single gene set (a character vector of genes) yields a score column
sce <- assignIdentities(sce,
                        genesets = muraro_genes$BETA_CELL,
                        method = "AUC",
                        name = "Beta_Cell_signature")
plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "Beta_Cell_signature")
The "UCell" method works well when you have small signatures (e.g. even 2 or 3 genes). It allows you to specify positive and negative markers, which is useful when you are sure the identity of a cell type depends on the lack of expression of certain markers (see hematopoietic lineages). To do so, you can add "+" or "-" to each gene.
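As an illustration of the signed notation, the sketch below builds a small signature and separates its positive and negative markers in base R. The gene choice is only an example, and the parsing is shown to make the convention explicit; it is not a claim about how the scoring itself is carried out.

```r
# Illustrative signed signature: "+" marks genes expected to be expressed,
# "-" marks genes expected to be absent (the gene choice is an example)
signed_signature <- c("CD3D+", "CD4+", "CD8A-")

# Read the trailing sign and strip it to recover the gene symbols
n <- nchar(signed_signature)
sign_char <- substr(signed_signature, n, n)
genes <- sub("[+-]$", "", signed_signature)

positive_genes <- genes[sign_char == "+"]
negative_genes <- genes[sign_char == "-"]

positive_genes
negative_genes
```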