Downsampling

There are two ways to downsample data in cellula: downsampling reads and downsampling cells.

The first approach simulates reads by randomly sampling a fixed total number of counts for each cell, using as sampling probabilities the per-gene proportions of reads within that cell.

Briefly, let’s consider a cell C in which genes a, b, and c have been quantified with 50, 30, and 20 counts respectively (100 counts in total). This is equivalent to a bag of marbles in which the probability of randomly picking an a marble is 50/100 = 0.5, a b marble is 30/100 = 0.3, and a c marble is 20/100 = 0.2.

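If we want to downsample C to a total of 40 counts (yielding the downsampled C’), we draw 40 counts at random, assigning each one to a, b, or c with probabilities (0.5, 0.3, 0.2). On average, C’ will then hold about 20, 12, and 8 counts for a, b, and c, with some random variation around these values.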
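The marble example can be sketched in a few lines of Python. This is only an illustration of the idea, not cellula's code: the function name downsample_cell is hypothetical, and the sketch assumes that counts are drawn with replacement according to the probability vector, as the example above suggests.

```python
import random
from collections import Counter

def downsample_cell(counts, target, rng):
    """Draw `target` reads, each assigned to a gene with probability
    proportional to that gene's original count (hypothetical helper)."""
    genes = list(counts)
    weights = [counts[g] for g in genes]          # 50, 30, 20 -> 0.5, 0.3, 0.2
    draws = rng.choices(genes, weights=weights, k=target)
    tally = Counter(draws)
    # Keep every gene in the output, even if it received no reads.
    return {g: tally.get(g, 0) for g in genes}

cell_c = {"a": 50, "b": 30, "c": 20}
rng = random.Random(0)                            # fixed seed for reproducibility
cell_c_prime = downsample_cell(cell_c, 40, rng)
print(cell_c_prime, "total:", sum(cell_c_prime.values()))
```

The per-gene values change with the seed, but the total of C’ is always 40.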

This form of downsampling by simulation is described in Scott Tyler’s work[19] and is reimplemented in cellula with a slightly faster implementation.

The downsampleCounts() function takes a user-defined target count number (by default, the minimum total count number in the dataset) and returns a SingleCellExperiment object with the same number of cells as the input and a downsampled count matrix in which every cell has the same total number of counts.
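The behaviour described above can be mimicked on a toy matrix as follows. This is a sketch under stated assumptions, not the downsampleCounts() implementation: the name downsample_counts, its target and seed arguments, and the dictionary representation of the matrix are all hypothetical, and sampling is again assumed to be with replacement.

```python
import random

def downsample_counts(matrix, target=None, seed=0):
    """Bring every cell to the same total count (hypothetical sketch).

    matrix: dict mapping a cell name to its list of per-gene counts.
    target: desired total per cell; defaults to the smallest cell total.
    """
    rng = random.Random(seed)
    if target is None:
        target = min(sum(counts) for counts in matrix.values())
    out = {}
    for cell, counts in matrix.items():
        new_counts = [0] * len(counts)
        # Each draw picks a gene index with probability proportional to its count.
        for g in rng.choices(range(len(counts)), weights=counts, k=target):
            new_counts[g] += 1
        out[cell] = new_counts
    return out

toy = {"cell1": [50, 30, 20], "cell2": [10, 45, 5], "cell3": [120, 0, 80]}
down = downsample_counts(toy)        # default target: min total = 60
print({cell: sum(c) for cell, c in down.items()})
```

The output has the same cells as the input, and every cell total equals the smallest original total (60 in this toy example). Genes with zero counts stay at zero because their sampling probability is zero.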

The second approach randomly selects cells from within groups such as clusters, batches, or a combination of the two.

Cells are randomly selected so that they represent a user-defined fraction of the within-group total, with a lower bound to ensure that small groups remain represented. For example, if a rare cluster label contains only 9 cells and we want to downsample the dataset to 10%, the fraction alone would keep about one cell from that cluster; setting the minimum to 5 cells ensures the rare label is still adequately represented.
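The group-wise selection with a lower bound can be sketched as below. The function name downsample_cells and its arguments are hypothetical stand-ins, and both the rounding rule and the choice to keep a whole group when it is smaller than the minimum are assumptions of this sketch rather than documented cellula behaviour.

```python
import random
from collections import defaultdict

def downsample_cells(labels, proportion, min_cells, seed=0):
    """Return indices of cells kept after group-wise downsampling
    (hypothetical sketch)."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, label in enumerate(labels):
        by_group[label].append(idx)
    kept = []
    for label, members in by_group.items():
        n = round(proportion * len(members))   # assumed rounding rule
        n = max(n, min_cells)                  # lower bound for small groups
        n = min(n, len(members))               # cannot keep more than exist
        kept.extend(rng.sample(members, n))    # sampling without replacement
    return sorted(kept)

labels = ["common"] * 200 + ["rare"] * 9
kept = downsample_cells(labels, proportion=0.10, min_cells=5)
n_common = sum(labels[i] == "common" for i in kept)
n_rare = sum(labels[i] == "rare" for i in kept)
print(n_common, n_rare)
```

With these inputs the sketch keeps 20 of the 200 common cells and 5 of the 9 rare cells, instead of the single rare cell that a plain 10% fraction would give.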

The downsampleCells() function returns a SingleCellExperiment object with fewer cells than the input, as defined by the proportion and min parameters.