Downsampling
There are two ways to downsample data in cellula: downsampling reads and downsampling cells.
The first approach simulates reads by randomly sampling a fixed total number of counts for each cell, using a vector of probabilities equal to the per-gene proportion of reads within that cell.
Briefly, let’s consider a cell C in which genes a, b, and c have been quantified with 50, 30, and 20 counts, respectively (100 counts in total). This is equivalent to a bag of marbles in which the probability of randomly picking an a marble is 50/100 = 0.5, a b marble is 0.3, and a c marble is 0.2.
If we want to downsample C to a total of 40 counts (yielding the downsampled C’), we randomly pick 40 counts from a (0.5, 0.3, 0.2) vector of probabilities.
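The marble example can be sketched in a few lines of Python. This is only an illustration of the sampling idea, not the cellula implementation: the function name `downsample_cell`, its arguments, and the choice to sample with replacement from the probability vector are assumptions made for the example.

```python
import random
from collections import Counter

def downsample_cell(counts, target, seed=None):
    """Hypothetical helper: draw `target` reads for one cell, with
    probabilities equal to the per-gene proportion of counts."""
    rng = random.Random(seed)
    genes = list(counts)
    total = sum(counts.values())
    # Per-gene probabilities, e.g. (0.5, 0.3, 0.2) for counts (50, 30, 20).
    probs = [counts[g] / total for g in genes]
    # Assumption: reads are drawn with replacement from the probability vector.
    draws = rng.choices(genes, weights=probs, k=target)
    sampled = Counter(draws)
    # Genes that were never drawn get an explicit zero.
    return {g: sampled.get(g, 0) for g in genes}

cell_c = {"a": 50, "b": 30, "c": 20}
cell_c_prime = downsample_cell(cell_c, target=40, seed=1)
print(cell_c_prime, sum(cell_c_prime.values()))
```

The per-gene counts in the downsampled cell vary from run to run unless a seed is set, but they always sum to the requested total.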
This is a form of downsampling by simulation; it is described in Scott Tyler’s work[19] and reimplemented in cellula with a slight speed optimization.
The downsampleCounts() function uses a user-defined minimum count number (by default, the minimum total count number in the dataset) and returns a SingleCellExperiment object with the same number of cells as the input and a downsampled count matrix in which every cell has the same total number of counts.
The second approach randomly selects cells from within groups such as clusters, batches, or a combination of the two.
Cells are randomly selected so that they represent a user-defined fraction of the within-group total, with a lower bound to ensure that small groups are still represented. For example, if a rare cluster contains only 9 cells and we want to downsample the dataset to 10%, proportional sampling alone would keep about one cell from it; setting the minimum to 5 cells ensures that the rare label remains adequately represented.
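The group-wise selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cellula code: the function name `downsample_cells`, rounding the per-group target up, and capping it at the group size when the minimum exceeds the number of available cells are all choices made for the example.

```python
import math
import random
from collections import defaultdict

def downsample_cells(labels, proportion, min_cells, seed=None):
    """Hypothetical helper: return the indices of the cells kept after
    sampling a fraction of each group, with a per-group lower bound."""
    rng = random.Random(seed)
    # Collect cell indices by group label (cluster, batch, or a combination).
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    kept = []
    for members in groups.values():
        # Assumption: the proportional target is rounded up.
        n_keep = max(math.ceil(proportion * len(members)), min_cells)
        # Assumption: a group smaller than the minimum is kept whole.
        n_keep = min(n_keep, len(members))
        kept.extend(rng.sample(members, n_keep))
    return sorted(kept)

# 200 cells in a common cluster and 9 cells in a rare one.
labels = ["common"] * 200 + ["rare"] * 9
kept = downsample_cells(labels, proportion=0.1, min_cells=5, seed=1)
kept_labels = [labels[i] for i in kept]
print(kept_labels.count("common"), kept_labels.count("rare"))
```

With these inputs the common cluster is reduced to 10% of its cells, while the rare cluster keeps 5 of its 9 cells because of the lower bound.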
The downsampleCells() function returns a SingleCellExperiment object with fewer cells than the input, as defined by the proportion and min parameters.