Summary

Our benchmarking reveals a surprising truth: in the race to translate massive perturbation datasets into discovery, the most effective mathematical "lens" isn't the most complex one. While sophisticated metrics like Wasserstein or Mean Pairwise are often favored due to their mathematical impressiveness, we found that E-distance and Euclidean distance provide the superior balance of speed and signal resolution for high-throughput pipelines. By delivering sharper biological contrast at a fraction of the computational cost, these pragmatic engines allow us to uncover critical signatures that more intensive methods often bury.

Context: The test case of Saquinavir 

We revisit the case of Saquinavir, a protease inhibitor used in HIV treatment. Its unexpected functional similarity to adrenoceptor agonists, identified and discussed on the Tahoe blog, provides a rigorous benchmark for our current evaluation: a metric’s true value lies in its ability to resolve subtle mechanistic nuances.

We used Pertpy to benchmark five distinct mathematical distance metrics. Our objective was to determine which of them can effectively extract signals from high-dimensional transcriptomic data and reproduce the hidden adrenergic signal.

We reused Tahoe Plate 6—subsampling 3,000 cells per drug across 62,710 genes. We categorized nine treatments into three groups:

  • The Targets: Saquinavir and Terfenadine (drugs with "hidden" cardiovascular side effects).
  • The Positive Controls: Vilanterol and Norepinephrine (the ground-truth adrenergic signature).
  • The Reference Background: A diverse mechanistic set including Furosemide (diuretic), Clonidine (alpha-2 adrenergic), Verapamil (calcium channel blocker), Lidocaine (sodium channel blocker), and DMSO (vehicle).
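The per-drug subsampling step can be sketched as follows. This is a minimal illustration using only the standard library and toy labels; the function name `subsample_per_group`, the seed, and the toy cell counts are assumptions for illustration and are not taken from the actual pipeline.

```python
import random
from collections import defaultdict


def subsample_per_group(labels, n_per_group, seed=0):
    """Return sorted cell indices, keeping at most n_per_group cells per label."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_group = defaultdict(list)
    for idx, label in enumerate(labels):
        by_group[label].append(idx)

    kept = []
    for indices in by_group.values():
        k = min(n_per_group, len(indices))  # small groups are kept whole
        kept.extend(rng.sample(indices, k))
    return sorted(kept)


# Toy labels (hypothetical counts): one treatment label per cell.
labels = ["DMSO"] * 50 + ["Saquinavir"] * 30 + ["Lidocaine"] * 8
kept = subsample_per_group(labels, n_per_group=10)

counts = {}
for i in kept:
    counts[labels[i]] = counts.get(labels[i], 0) + 1
print(counts)
```

Equalizing the number of cells per drug keeps group size from influencing the distance estimates; with 3,000 cells for each of the nine treatments, this yields the 27,000 cells used in the benchmark.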

For the distance metrics, we used the following:

  1. Euclidean Distance: The straight-line distance between the mean expression vectors of two populations.
  2. E-distance (Energy Distance): A statistical measure that compares full distributions to capture both shifts in mean and population shape.
  3. Mean Pairwise Distance: Cell-to-cell averaging to capture global population shifts.
  4. Wasserstein Distance: An optimal transport model measuring the cell-state transitions.
  5. Maximum Mean Discrepancy (MMD): A kernel-based approach designed to detect shifts in specific sub-populations.
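To make the difference between the first two metrics concrete, here is a minimal pure-Python sketch on toy 2-D "cells". It assumes the textbook energy distance with plain Euclidean norms, 2·E|X−Y| − E|X−X'| − E|Y−Y'|, averaged over all pairs; Pertpy's own implementation may differ in details such as the underlying distance or the feature space, so treat this as an illustration rather than a reproduction of the library.

```python
import math
from itertools import product


def mean_vector(cells):
    """Per-feature mean of a population of cells."""
    n = len(cells)
    return [sum(col) / n for col in zip(*cells)]


def euclidean_between_means(x, y):
    """Straight-line distance between the two population means."""
    return math.dist(mean_vector(x), mean_vector(y))


def mean_cross_distance(x, y):
    """Average Euclidean distance over all (cell in x, cell in y) pairs."""
    total = sum(math.dist(a, b) for a, b in product(x, y))
    return total / (len(x) * len(y))


def energy_distance(x, y):
    """Between-group distance minus both within-group spreads."""
    return (2 * mean_cross_distance(x, y)
            - mean_cross_distance(x, x)
            - mean_cross_distance(y, y))


# Two toy populations with the SAME mean (1, 0) but different shapes.
horizontal = [(0.0, 0.0), (2.0, 0.0)]
vertical = [(1.0, -1.0), (1.0, 1.0)]

print(euclidean_between_means(horizontal, vertical))  # means coincide
print(energy_distance(horizontal, vertical))          # shape difference is detected
```

Because both toy populations share the same mean, the distance between means sees no difference, while the energy distance is positive. That is the sense in which E-distance captures population shape as well as shifts in the mean.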

Results

1. Biological Resolution: Separating Signal from Background

All five metrics consistently mirror the Tahoe team's earlier discovery. As shown in Figure 1, the metrics successfully "de-noised" the transcriptomic data to reveal four distinct relationship patterns:

  • The Adrenergic Core: Beta-agonists, such as Vilanterol and Norepinephrine, showed high similarity to Saquinavir, confirming the hidden beta-adrenergic signature.
  • The Expected Outliers: Furosemide (a diuretic) was correctly pushed to the periphery by every metric, showing the greatest distance from Saquinavir.
  • The Mechanistic Nuance: Clonidine (an alpha-2 agonist) maintained a moderate distance. This is a critical test of whether the metrics are sensitive enough to differentiate between a general adrenergic response and the specific beta-signature we were tracking.
  • The Lidocaine Surprise: Lidocaine showed an unexpectedly high similarity to Saquinavir across every approach, ranking nearly as close as the beta-agonists. This shared transcriptomic footprint suggests a potential off-target effect that warrants further investigation.

Figure 1. The heatmap shows the distances between drug pairs, measured using 5 distance types in the Pertpy library. The numbers in the cells are the calculated distances from the corresponding metrics.

2. Ranking Consistency: Where the First Cracks Appear

While most metrics agree on the "big picture", the real test is whether that lens stays in focus when the signals get subtle.

When we ranked the top five drug pairs by similarity, Euclidean, E-distance, and MMD were perfectly aligned, producing an identical hierarchy. This consistency gives us confidence that the biological signal is dominant and reproducible.

However, as we move toward the more computationally expensive metrics, that stability begins to fracture:

Rank   Euclidean, E-distance, and MMD   Wasserstein                 Mean Pairwise
1      Saquinavir-Lidocaine             Vilanterol-Norepinephrine   Saquinavir-Lidocaine
2      Saquinavir-Norepinephrine        Saquinavir-Norepinephrine   Vilanterol-Lidocaine
3      Vilanterol-Norepinephrine        Saquinavir-Lidocaine        Saquinavir-Vilanterol
4      Norepinephrine-Lidocaine         Saquinavir-Vilanterol       Norepinephrine-Lidocaine
5      Saquinavir-Vilanterol            Norepinephrine-Lidocaine    Saquinavir-Norepinephrine

Table 1. The top 5 shortest distances measured by each metric.

The divergence in data resolution reveals the trap in modern analysis: higher cost does not always mean better data. While Mean Pairwise and Wasserstein demand significant computing power, the results demonstrate that their complexity does not yield superior biological resolution in this context. In fact, they introduce a layer of statistical noise, blurring the clear, meaningful insights that simpler engines like Euclidean and E-distance surface with ease.
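The degree of disagreement in Table 1 can be quantified with a small sketch. The rankings below are copied from Table 1; the two helper functions and their names are illustrative choices, not part of the original analysis.

```python
# Top-5 rankings copied from Table 1 (rank 1 first).
consensus = [  # shared by Euclidean, E-distance, and MMD
    "Saquinavir-Lidocaine",
    "Saquinavir-Norepinephrine",
    "Vilanterol-Norepinephrine",
    "Norepinephrine-Lidocaine",
    "Saquinavir-Vilanterol",
]
wasserstein = [
    "Vilanterol-Norepinephrine",
    "Saquinavir-Norepinephrine",
    "Saquinavir-Lidocaine",
    "Saquinavir-Vilanterol",
    "Norepinephrine-Lidocaine",
]
mean_pairwise = [
    "Saquinavir-Lidocaine",
    "Vilanterol-Lidocaine",
    "Saquinavir-Vilanterol",
    "Norepinephrine-Lidocaine",
    "Saquinavir-Norepinephrine",
]


def positional_matches(a, b):
    """Number of ranks at which both lists name the same drug pair."""
    return sum(p == q for p, q in zip(a, b))


def shared_pairs(a, b):
    """Number of drug pairs present in both top-5 lists, ignoring order."""
    return len(set(a) & set(b))


for name, ranking in [("Wasserstein", wasserstein), ("Mean Pairwise", mean_pairwise)]:
    print(name,
          "same rank:", positional_matches(consensus, ranking),
          "shared pairs:", shared_pairs(consensus, ranking))
```

Read this way, Wasserstein keeps the same five pairs but agrees with the consensus at only one rank, while Mean Pairwise agrees at two ranks and swaps Vilanterol-Norepinephrine out for Vilanterol-Lidocaine.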

3. The Mathematics of Signal Strength: Why Metrics Fail or Thrive

Why do some metrics deliver a clear discovery while others struggle with noise? The answer often lies in how each formula handles within-group variance in single-cell data.

To benchmark the effectiveness of each metric, we calculated a Signal Strength Ratio:

Signal Strength Ratio = distance(Saquinavir, DMSO) / distance(Saquinavir, adrenergic agonists)

This ratio measures the distance of our target drug (Saquinavir) from the inert solvent (DMSO) relative to its distance from the known positive controls (the adrenergic agonists). As shown in Figure 2, the results reveal a clear hierarchy in detection power across the metrics:

  • The Noise Trap: Mean Pairwise and Wasserstein (1.04x – 1.10x). These metrics barely clear the 1.0 line, meaning Saquinavir looks almost as far from the adrenergic agonists as it does from DMSO. While appealing in theory, in practice they fall into a technical noise trap and produce relatively weak signals.
  • The Baseline: Euclidean Distance (2.93x). Euclidean distance provides a visible separation by collapsing cell populations into a single mean vector. While this "shortcut" ignores the nuance of cell-to-cell variance, it captures the dominant transcriptomic shift and outperforms the more complex optimal transport models.
  • The High-Contrast Engines: MMD and E-distance (8.61x – 11.19x). These metrics deliver a massive surge in contrast. By explicitly accounting for, and subtracting, internal spread (within-group variance), E-distance and MMD reward "focused" drug responses while penalizing scattered, random stress responses. As a result, they isolate the true transcriptomic signature, producing a ratio roughly three to four times the Euclidean value and about 9 to 11 times the 1.0 no-separation line.

Figure 2. Relative signal strength produced across five distance metrics. The bar plot displays the ratio of the calculated distance from the target drug (Saquinavir) to the vehicle (DMSO) against the distance to known adrenergic agonists. A ratio exceeding the red dashed line (1.0) suggests the target's transcriptomic profile aligns more closely with the positive controls than with the inert solvent.
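As a rough sketch, the Signal Strength Ratio can be computed as below. Two things here are assumptions: the distances to the two agonists are combined by a simple average (the text does not say how they were aggregated), and the input numbers are hypothetical toy values rather than results from the benchmark.

```python
from statistics import mean


def signal_strength_ratio(dist_to_vehicle, dists_to_controls):
    """Distance from target to vehicle, relative to its distance from the controls.

    dist_to_vehicle:   distance(target, DMSO) under one metric
    dists_to_controls: distance(target, each positive control) under the same metric
    Averaging the control distances is an assumption made for this sketch.
    """
    return dist_to_vehicle / mean(dists_to_controls)


# Hypothetical toy distances, NOT values from the benchmark.
ratio = signal_strength_ratio(dist_to_vehicle=6.0, dists_to_controls=[2.0, 4.0])
print(f"{ratio:.2f}x")
```

A ratio near 1.0 means the metric places the target about as far from the positive controls as from the vehicle; larger values mean the target sits closer to the controls than to DMSO.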

We further examined the discrimination power of these distance metrics by evaluating the resolution gap using the Coefficient of Variation (CV%), which measures the relative dispersion of the distances among the top five drug pairs for each metric:

CV% = (standard deviation / mean) × 100

Euclidean and E-distance, with CV% values ranging from 12.7% to 33.5%, provide enough range to clearly distinguish between these top candidates. Wasserstein and Mean Pairwise, by contrast, have a CV% near zero, compressing the top drug pairs into a single, indistinguishable cluster. This lack of separation substantially limits their utility for prioritizing biologically meaningful leads.

Figure 3. Metric Discrimination Power (CV%). High variance in Euclidean and E-distance (12.7–33.5%) provides the contrast needed to differentiate the top 5 shortest-distance drug pairs.
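The CV% calculation is simple enough to sketch directly. The example below uses the sample standard deviation, which is an assumption (the text does not say whether the sample or population form was used), and the two distance lists are hypothetical toy values chosen only to contrast a well-spread top five with a compressed one.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation: sample standard deviation as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)


# Hypothetical top-5 distances, NOT values from the benchmark.
well_spread = [1.0, 2.0, 3.0, 4.0, 5.0]       # candidates are easy to rank
compressed = [3.00, 3.01, 3.02, 3.03, 3.04]   # candidates are nearly tied

print(f"well spread: {cv_percent(well_spread):.1f}%")
print(f"compressed:  {cv_percent(compressed):.1f}%")
```

Because CV% is scale-free, it allows metrics with very different absolute ranges to be compared on how well they separate their own top candidates.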

4. The Scalability Wall: Where Mathematical Elegance Meets Computational Reality

In the high-stakes world of drug discovery, a metric’s "biological resolution" is only half the story. For bioinformatics scientists building production-grade pipelines, biological resolution must be balanced with computational scalability.

To test the limits of our distance metric candidates, we benchmarked them on a dedicated server equipped with 96 CPU cores and 125GB of RAM.

Using the dataset of 27,000 cells (3,000 per drug), the results revealed a profound "Scalability Wall" (Figure 4). Euclidean finished in a staggering 0.11 seconds. E-distance (1.7s) and MMD (3.6s) followed closely, performing so efficiently that they are essentially "computationally free" even in high-throughput contexts. However, on the identical input, the "sophisticated" metrics began to struggle. Mean Pairwise required 174.7s, while Wasserstein demanded a grueling 557.9s, roughly 1,600× and 5,000× slower than Euclidean, respectively, turning a few seconds of processing into nearly ten minutes of wait time for a single experiment.

This performance gap is decisive: because Mean Pairwise and Wasserstein are orders of magnitude slower than Euclidean, E-distance, or MMD, these "sophisticated" metrics hit a scalability wall when applied to libraries of thousands of perturbations and millions of cells in real-world drug discovery.

Figure 4. Bar plot of the running time for each metric and the corresponding number of cells.
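The slowdown factors follow directly from the reported runtimes. The short sketch below redoes that arithmetic with Euclidean as the reference; only the timings quoted in the text are used.

```python
# Runtimes in seconds on the 27,000-cell dataset, as reported in the text.
runtimes_s = {
    "Euclidean": 0.11,
    "E-distance": 1.7,
    "MMD": 3.6,
    "Mean Pairwise": 174.7,
    "Wasserstein": 557.9,
}

baseline = runtimes_s["Euclidean"]
slowdown = {name: seconds / baseline for name, seconds in runtimes_s.items()}

# Print metrics from fastest to slowest with their slowdown relative to Euclidean.
for name, factor in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{name:<14} {runtimes_s[name]:>7.2f} s  {factor:>6.0f}x")
```

E-distance and MMD land within a factor of a few tens of Euclidean, while Mean Pairwise and Wasserstein are three orders of magnitude slower on the same input.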

Conclusion: From Mathematical Choice to Biological Discovery

The choice of a distance metric is far from a mere technicality; it is the fundamental "lens" through which we interpret complex high-dimensional data. As our analysis in the case of Saquinavir demonstrates, high computational cost does not guarantee signal clarity. In fact, "sophisticated" metrics like Wasserstein and Mean Pairwise can often bury critical biological signatures under background noise in high-dimensional data.

By contrast, E-distance and Euclidean distance emerge as the most pragmatic "engines" for high-throughput single-cell perturbation analysis because they detect high-resolution biological signals at a low computational cost. By choosing the right metrics, we can reveal similarities between transcriptomic profiles that might otherwise be missed, supporting critical applications:

  • Drug Repurposing at Scale: Scalability is required to scan through the large volumes of data needed to identify similar molecular signatures and uncover hidden pharmaceutical functions.
  • High-Resolution Toxicity Detection: Strong sensitivity to biological signals is necessary for early detection of potential side effects of drug candidates before they reach clinical trials.
  • Mechanistic Interpretation: Computational efficiency allows for the mapping of complex biological pathways to understand drug action, facilitating faster mechanistic interpretation and hypothesis generation.

Building a system that handles data at scale isn't about finding the most complex math; it’s about choosing the most effective tool for the job. By prioritizing scientific truth, speed, stability, and signal resolution, we can bridge the gap between massive raw datasets and definitive pharmacological insights. In the race to turn perturbations into discoveries, the most powerful tool is the one that provides the highest resolution with the lowest overhead—ensuring the journey from data to insight is both scientifically rigorous and computationally seamless.