The Virtual Cell is a concept at the intersection of computational biology and systems science. At its core, it is a predictive model of how a living cell responds to internal and external cues. In its ideal form, a Virtual Cell captures every molecular detail: dynamic proteins, metabolic fluxes, physical interactions, and more. Building such a fully mechanistic model remains beyond current computational capabilities.

A more tractable approximation has emerged over the past decade: the single-cell transcriptome. By treating a cell’s gene expression profile as a compact proxy for its global state, scRNA-seq offers a practical foundation for building a Virtual Cell. This raises an essential question for us at DataXight: What information do we really need from scRNA-seq data to construct a Virtual Cell?

To explore this, we formulated a minimal theoretical model of gene expression under perturbation:

Gene-wise mean expression constitutes the principal component of the transcriptional response to a perturbation and is therefore sufficient to approximate the perturbed cell state.

We performed the following experiment to test this hypothesis:

  • For each perturbation (e.g. drug treatment), we applied the simplest statistical assumption: we computed the average raw UMI count (X_g) of every gene g across all perturbed cells.
  • Using these mean values, we generated simulated expression profiles by sampling from a Poisson distribution (Poisson(lambda = X_g)). This step models the natural stochasticity of transcription.
  • We then compared these simulated profiles directly to the ground-truth expression data for the corresponding perturbation.

Figure: Workflow for testing the hypothesis that gene expression follows a Poisson distribution. Source code: poisson_validation.ipynb
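
The first two steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the contents of poisson_validation.ipynb: the function name `simulate_poisson_profiles`, the `n_cells` and `seed` parameters, and the toy data are all hypothetical, and the input is assumed to be a dense cells-by-genes matrix of raw UMI counts with one perturbation label per cell.

```python
import numpy as np

def simulate_poisson_profiles(counts, labels, n_cells=100, seed=0):
    """Return {perturbation: simulated (n_cells x n_genes) count matrix}.

    counts: (cells x genes) raw UMI counts (assumed dense for simplicity).
    labels: one perturbation label per cell.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    simulated = {}
    for pert in np.unique(labels):
        # Gene-wise mean of raw UMI counts across all cells of this perturbation (X_g).
        gene_means = counts[labels == pert].mean(axis=0)
        # One independent Poisson draw per simulated cell and gene, with lambda = X_g.
        simulated[str(pert)] = rng.poisson(lam=gene_means, size=(n_cells, gene_means.size))
    return simulated

# Toy data (hypothetical): 200 cells, 3 genes, two perturbations of 100 cells each.
toy_rng = np.random.default_rng(1)
toy_counts = toy_rng.poisson(lam=np.array([0.5, 2.0, 10.0]), size=(200, 3))
toy_labels = np.array(["drugA"] * 100 + ["drugB"] * 100)
sim = simulate_poisson_profiles(toy_counts, toy_labels, n_cells=5000)
```

Because a Poisson distribution has a single parameter, each simulated gene has variance equal to its mean; no dispersion or gene-gene correlation is modeled, which is what keeps the baseline so minimal.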

For data, we used the training dataset and the three evaluation metrics introduced by the 2025 Virtual Cell Challenge (VCC):

  • Perturbation Discrimination Score (PDS): This metric measures a model's ability to distinguish between different perturbations. It functions by ranking the predicted perturbation profiles according to their similarity to the true perturbational effect, irrespective of the magnitude of the effect size.
  • Mean Absolute Error (MAE): This metric quantifies the average magnitude of the error between the predicted and true values. It is calculated as the mean of the absolute differences between the two pseudo-bulk profiles.
  • Differential Expression Score (DES): This score evaluates how accurately a model predicts differential gene expression when compared to the control group.
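
To make the first two metrics concrete, here is a simplified sketch of a pseudo-bulk MAE and a rank-based discrimination score. It is not the cell-eval implementation: the function names are hypothetical, the use of cosine similarity (chosen because it ignores effect magnitude) is an assumption, and the score is normalized as 1 minus the fraction of other perturbations that rank ahead of the correct one.

```python
import numpy as np

def pseudobulk_mae(pred_cells, true_cells):
    """Mean absolute difference between two pseudo-bulk (gene-wise mean) profiles."""
    pred_bulk = np.asarray(pred_cells, dtype=float).mean(axis=0)
    true_bulk = np.asarray(true_cells, dtype=float).mean(axis=0)
    return float(np.abs(pred_bulk - true_bulk).mean())

def discrimination_scores(pred_effects, true_effects, eps=1e-12):
    """Simplified rank-based discrimination score, one value per perturbation.

    pred_effects, true_effects: (perturbations x genes) effect profiles
    (e.g. pseudo-bulk minus control), with rows in the same order.
    """
    pred = np.asarray(pred_effects, dtype=float)
    true = np.asarray(true_effects, dtype=float)
    # Normalize rows so that similarity does not depend on effect magnitude.
    pred = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    true = true / (np.linalg.norm(true, axis=1, keepdims=True) + eps)
    sims = pred @ true.T                      # cosine similarity, pred i vs. true j
    own = np.diag(sims)                       # similarity to the matching true effect
    # Rank = number of true effects that look more similar than the correct one.
    ranks = (sims > own[:, None]).sum(axis=1)
    return 1.0 - ranks / pred.shape[0]

# Toy example (hypothetical): three perturbations with orthogonal effects.
true_eff = np.eye(3)
perfect = discrimination_scores(2.0 * true_eff, true_eff)       # scaled but correct
swapped = discrimination_scores(true_eff[[1, 0, 2]], true_eff)  # first two mixed up
mae = pseudobulk_mae([[1, 2], [3, 4]], [[2, 2], [2, 2]])
```

In the toy example, rescaling the predictions leaves every score at 1, while swapping two perturbations lowers only the scores of the swapped pair.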

Below are summary statistics of the scores for the simulated data across the 150 perturbations, generated with cell-eval v0.6.5.

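| statistic  | DES           | MAE            | PDS            |
|------------|---------------|----------------|----------------|
| count      | 150           | 150            | 150            |
| null_count | 0             | 0              | 0              |
| mean       | 0.3939272016  | 0.04163619777  | 0.9996444444   |
| std        | 0.20447462    | 0.001779971019 | 0.002012143101 |
| min        | 0.03722084367 | 0.03977333009  | 0.98           |
| 25%        | 0.2364217252  | 0.04082875699  | 1              |
| 50%        | 0.3322203673  | 0.04122887179  | 1              |
| 75%        | 0.544565459   | 0.04181614518  | 1              |
| max        | 0.8721075372  | 0.05821436644  | 1              |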

To test whether raw expression data alone carry useful signal, the simulation followed a deliberately minimal approach: it estimated each gene’s expected expression from the perturbation-level averages and introduced variability using a simple Poisson model. Everything operated directly on raw UMI counts, with no normalization, no batch correction, and no attempt to model dispersion or higher-order structure. Despite this pared-down setup, the method achieved near-perfect PDS (mean 0.9996, minimum 0.98) across all 150 perturbations. Put differently, the mean expression profile alone contains enough signal to tell perturbations apart.

Many strategies have been proposed to conceptualize a Virtual Cell, ranging from models that capture gene-gene interactions and dynamic state transitions to those that represent higher-order variability through generative models. Our experiment does not challenge this complexity. Instead, it asks a more modest question: how much predictive signal is already contained in the transcriptional mean? The result indicates that the mean might be a useful baseline for the Virtual Cell problem. For an interesting parallel perspective, I recommend Kris Szalay’s recent blog post, in which his team at Turbine independently explores how simple yet powerful a mean predictor can be.

It is also important to note that our experiment was performed using the VCC training dataset, which benefits from exceptionally high sequencing depth and overall data quality. These qualities may amplify the predictive value of first-order statistics. Whether this hypothesis holds across noisier or more heterogeneous datasets remains an open question, one we plan to investigate and share updates on in future blog posts.