Looking at batch effect through scRNA-seq data

What Is Batch Effect?

In biological research, a batch effect occurs when factors unrelated to the biology under study cause changes in the experimental data.

Common cases arise in gene expression studies. When characterizing a disease, researchers look for differences in mRNA levels between samples from patients and samples from healthy people. However, such differences can be obscured by technical factors, such as laboratory techniques, measurement devices, or variations in the materials used. As a result, the gene expression of biologically equivalent samples can vary significantly across experiments and even within a single experiment. This phenomenon is called the batch effect, and it makes the actual differences harder to identify.

In this blog post, we discuss batch effects in single-cell RNA sequencing (scRNA-seq). Why scRNA-seq?

Single-Cell RNA Sequencing And Batch Effect

scRNA-seq is a sequencing technique that measures the gene expression of individual biological cells. Because it allows characterization at the cellular level, it has been widely adopted by academic institutions, biotechs, and pharmaceutical companies.

While a mammalian cell might have over 20,000 genes that need to be measured, its measurable material, the mRNA, is present only in tiny amounts. Therefore, scRNA-seq technology relies on an amplification step to enhance sensitivity. This enhancement, however, is a double-edged sword: not only are gene expression signals amplified, but technical artifacts also accumulate along the way. As a result, scRNA-seq data is highly sensitive to batch effects.

In the next sections, we discuss how batch effects in scRNA-seq can hinder discoveries.

Introducing Batch Effect To A Truth Set Of scRNA-seq Data

To illustrate the batch effect in scRNA-seq, we simulated a count matrix of 2300 cells and 5000 genes using splatter [1], a framework for simulating scRNA-seq data, including batch effects. The simulated data mimics a common experimental design, where cells are harvested from different patients (3 patients in this example) and sequenced independently. For each patient, two types of tissue are collected: tumor and adjacent non-tumor. The research question is: which genes are biologically different between tumor and adjacent non-tumor cells?

The simulation process starts from a gamma-distributed expression level for each of the 5000 genes. Then, we added biological differences to 117 genes so that they are either up- or down-regulated between the 2 tissue types. These genes are the answer to the research question above.

To introduce the batch effect to the data, we added two major types of technical variation. First, a distortion was added to the distribution of all 5000 genes (including the 117 genes) to mimic the variation between sequencing batches. In this case, the sequencing batches are the 3 patients. This batch effect introduces a chance that genes are expressed differently between patients, regardless of tissue type. After that, expression values are generated from the distorted distributions for each cell of a tissue type and patient, creating 2300 cells in total. A second technical effect is then applied to the expression values of these individual cells, imitating mRNA/cDNA that is missed during capture and amplification (e.g. dropouts). The result is a new expression value called the “observed expression”: the expression that end users (e.g. researchers) use for their analyses.
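The layered process described above can be sketched in a few lines of NumPy. This is not splatter itself and not the code behind this post: it is a simplified illustration in which the sizes are scaled down, and the fold changes, the log-normal batch factors, and the dropout formula are all assumed values chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes are scaled down from the post's 2300 cells x 5000 genes (assumption).
n_genes, n_de_genes = 500, 12
n_patients, cells_per_group = 3, 20
tissues = ("tumor", "non-tumor")

# 1. Baseline mean expression per gene, drawn from a gamma distribution.
base_mean = rng.gamma(shape=0.6, scale=3.0, size=n_genes)

# 2. Biological signal: scale a subset of genes up or down in tumor cells.
de_genes = rng.choice(n_genes, size=n_de_genes, replace=False)
tissue_factor = np.ones((len(tissues), n_genes))
tissue_factor[0, de_genes] = rng.choice([0.25, 4.0], size=n_de_genes)

# 3. First technical layer: a per-patient (batch) distortion on every gene.
batch_factor = rng.lognormal(mean=0.0, sigma=0.4, size=(n_patients, n_genes))

blocks, tissue_labels, patient_labels = [], [], []
for p in range(n_patients):
    for t, tissue in enumerate(tissues):
        mean = base_mean * tissue_factor[t] * batch_factor[p]
        # Expression values for each cell of this patient and tissue type.
        true_counts = rng.poisson(mean, size=(cells_per_group, n_genes))
        # 4. Second technical layer: dropouts. Lowly expressed genes are
        #    dropped more often (assumed form, not splatter's exact model).
        drop_prob = 1.0 / (1.0 + mean)
        kept = rng.binomial(1, 1.0 - drop_prob, size=true_counts.shape)
        blocks.append(true_counts * kept)
        tissue_labels += [tissue] * cells_per_group
        patient_labels += [p] * cells_per_group

# Rows are cells, columns are genes: the "observed expression".
observed_expression = np.vstack(blocks)
```

Setting `batch_factor` to all ones and `drop_prob` to zero in this sketch would give the equivalent of data with no batch effect, which is useful as a baseline for comparison.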

To illustrate how batch effects can impact data interpretation, we are going to apply common analysis approaches to this “observed expression” and try to rediscover the 117 genes.

Batch Effect In Data Visualization

Before jumping into the gene discovery part, let's check what the batch effect looks like when the data is visualized.

For data visualization, we use t-SNE, a common technique for reducing the number of dimensions so that each cell can be visualized as a data point in a 2-dimensional space. When the 5000 genes are compressed into x and y axes, cells that are next to each other in the 5000-dimensional space (i.e. have similar expression profiles) stay close to each other in the compressed space.
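For readers who want to try this themselves, here is a minimal sketch of a t-SNE embedding using scikit-learn. It is not the code used for the plots in this post: the toy data, the number of genes, and the `perplexity` value are assumptions made only to keep the example small and self-contained.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_genes, cells_per_group = 50, 20

# Toy log-expression data (assumption): one shift for tissue type and
# one shift per patient, the latter standing in for the batch effect.
tissue_shift = {"tumor": rng.normal(0.0, 1.0, n_genes), "non-tumor": np.zeros(n_genes)}
patient_shift = [rng.normal(0.0, 1.0, n_genes) for _ in range(3)]

rows, tissue_labels, patient_labels = [], [], []
for p, p_shift in enumerate(patient_shift):
    for tissue, t_shift in tissue_shift.items():
        noise = rng.normal(0.0, 0.5, size=(cells_per_group, n_genes))
        rows.append(t_shift + p_shift + noise)
        tissue_labels += [tissue] * cells_per_group
        patient_labels += [p] * cells_per_group
expression = np.vstack(rows)  # cells x genes

# Compress the gene dimensions into 2 coordinates per cell.
embedding = TSNE(n_components=2, perplexity=15, init="pca", random_state=0).fit_transform(expression)
```

Each row of `embedding` is the (x, y) position of one cell. Plotting those points twice, once colored by `tissue_labels` and once by `patient_labels`, gives the two views discussed in this section.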

On the left (if you are browsing on a PC), you can see the t-SNE plot of the data with no batch effect, colored in 2 different ways: by tissue type and by patient. When there is no batch effect, the difference between cells is the biological difference between tissue types. This is reflected in the 2 clusters of cells for tumor and non-tumor tissues.

When the batch effect is added (click on Add batch effect), cells are no longer clustered by tissue type alone. Coloring by patient label shows that each patient forms its own cluster containing both tissue types. This t-SNE plot was constructed using the “observed expression” that we created in the previous section.

In the next section, we perform a differential expression analysis on the “observed expression” and examine how many of the 117 genes can be found.

The Batch Effect In Differential Expression Analysis

Among the 117 genes that we are looking for, 54 are up-regulated and 63 are down-regulated in adjacent non-tumor tissue. We call these genes the truth in this section.

To find all significantly different genes between the tissue types, we applied the default differential expression procedure in Seurat [2]. The result gives 49 up-regulated and 63 down-regulated genes. To see how these newly found genes overlap with the 117 genes above, we visualize the result with a Venn diagram.
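Seurat's default test for differential expression is a Wilcoxon rank-sum test. The following Python sketch shows the general idea using SciPy on toy data. It is not the Seurat pipeline used in this post: the data, the Bonferroni correction, and the 0.05 cutoff are assumptions for illustration, and Seurat applies additional filters that are omitted here.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
n_cells, n_genes = 100, 200

# Toy log-normalized expression for the two tissue types (assumption).
# Gene 0 is truly up-regulated and gene 1 truly down-regulated in non-tumor.
tumor = rng.normal(1.0, 0.5, size=(n_cells, n_genes))
non_tumor = rng.normal(1.0, 0.5, size=(n_cells, n_genes))
non_tumor[:, 0] += 1.0
non_tumor[:, 1] -= 1.0

# Two-sided rank-sum test per gene; axis=0 compares the cells of each gene column.
p_values = mannwhitneyu(non_tumor, tumor, axis=0, alternative="two-sided").pvalue

# Bonferroni correction across all tested genes.
adjusted = np.minimum(p_values * n_genes, 1.0)

# Direction of change, relative to tumor cells.
difference = non_tumor.mean(axis=0) - tumor.mean(axis=0)
up = np.flatnonzero((adjusted < 0.05) & (difference > 0))
down = np.flatnonzero((adjusted < 0.05) & (difference < 0))
```

Here `up` and `down` hold the indices of the genes called up- and down-regulated in non-tumor cells. Comparing them against the known truth genes is what produces the counts of false negatives and false positives.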

The table below summarizes the up- and down-regulated genes from the differential expression analysis.

Type	Found in observed expression	The truth	False negatives	False positives
Up-regulated genes	49 genes	54 genes	5 genes	0 genes
Down-regulated genes	63 genes	63 genes	4 genes	4 genes
TOTAL	112 genes	117 genes	9 genes	4 genes

Among the up-regulated genes, 49 of the 54 truth genes were correctly identified as differentially expressed genes (DEGs), with no misidentifications. Among the down-regulated genes, only 59 of the 63 truth genes were correctly identified, and an additional 4 genes were falsely identified as down-regulated.

From the differential expression analysis, 9 of the 117 truth genes (7.7%; 5 up and 4 down) are missing (false negatives). And among the 112 genes that we discovered, 4 (3.6%) are not in the truth (false positives).
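The percentages follow directly from the counts in the table. As a quick check, the arithmetic can be reproduced in a few lines of Python; the variable names are ours, and the numbers are taken from the table above.

```python
# Counts from the table: truth genes, genes found, and false positives.
truth = {"up": 54, "down": 63}
found = {"up": 49, "down": 63}
false_positive = {"up": 0, "down": 4}

# Correctly identified genes are the found genes minus the false positives.
true_positive = {k: found[k] - false_positive[k] for k in truth}
# Missed genes are the truth genes that were not correctly identified.
false_negative = {k: truth[k] - true_positive[k] for k in truth}

fn_total = sum(false_negative.values())
fp_total = sum(false_positive.values())

# Share of the truth that was missed, and share of the findings that is wrong.
fn_rate = fn_total / sum(truth.values())
fp_share = fp_total / sum(found.values())

print(f"missed: {fn_total} genes ({fn_rate:.1%})")
print(f"incorrect: {fp_total} genes ({fp_share:.1%})")
```

Note that the two percentages use different denominators: the missed genes are measured against the 117 truth genes, while the incorrect genes are measured against the 112 genes that were found.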

In this simple example, we tried to find the biological difference in the data in the presence of a batch effect. Not only did we miss 7.7% of the genes that indicate the actual difference, but 3.6% of what we found is also incorrect. Are these fractions concerning to you?

Conclusion

For the sake of simplicity, our simulated data does not take into account the complexity of having multiple cell types in a tissue or the stochasticity of gene expression and cell states. The reality can be much more complicated.

The good news is that this is not a new topic, especially in scRNA-seq. Researchers have been continuously inventing new strategies to mitigate the distortion that batch effects cause in the data, from experimental design and sequencing strategies to mathematical models. However, batch effects exist in many different forms. Their impact grows even larger when you combine datasets from multiple studies, such as in databases or atlases. This is why we should be skeptical of our findings when they are based on composite knowledge.

At DataXight, we help our clients build disease-specific knowledge bases from clinical and omics data to accelerate drug development.
Contact us for more information.
