Single-cell transcriptomics (scRNA) studies now profile millions of cells, revealing cell identity, state, and tissue heterogeneity, and creating unprecedented opportunities to extract biological insights that would be invisible in smaller studies. Tahoe-100M, a groundbreaking resource hosted by the Arc Institute, is one such study: it contains 100 million cells covering 379 distinct drugs and 50 cancer cell lines.

At Tahoe-100M scale, however, even routine queries pose significant computational challenges. For example, filtering for a specific drug treatment (e.g., Dabrafenib-treated cells) requires pulling from 3 of the Tahoe-100M files (totaling ~57 GB), and loading them into memory would exceed what is available on most individual computers. Servers or cloud instances with more RAM can ease this constraint, but researchers still need optimized data-loading tools, the right analysis frameworks, and reproducible environments. We were curious about the status quo of tools available to researchers who want to leverage this dataset. In this blog post, we share our experience running three publicly available notebooks built on the Tahoe-100M dataset:

  1. Out-of-core PCA - demonstrates scaling PCA and UMAP dimensionality reduction beyond memory limits, on both CPUs and GPUs.
  2. scDataset - demonstrates efficient data loading for model training in PyTorch.
  3. STATE - trains a modern virtual cell model and generates an inference file for submission to the Virtual Cell Challenge (VCC).

TLDR (Key Takeaways)

All three benchmark notebooks ran successfully, albeit with non-trivial challenges: ensuring sufficient disk space and memory, installing the necessary packages, updating function calls after API changes, and tuning parameters (such as the number of workers, i.e., the parallel processes used for data loading, preprocessing, and other tasks). These hurdles highlight the practical barriers scientists face when working with Tahoe-100M, mirror the challenges they would face with any dataset of this scale, and show why improvements in tooling are urgently needed.

At DataXight, our goal is to reduce friction at every stage of the data lifecycle in order to streamline workflows and accelerate impactful discoveries. To that end, we have partnered with DNAnexus to provide an analysis environment that addresses the infrastructure challenges of analyzing billion-cell scRNA datasets at scale. This push-button environment includes:

  • Ready-to-use data, in this case the Tahoe-100M dataset;
  • Optimized data loaders that handle the memory constraints of large datasets;
  • Preinstalled tools, including scRNA and AI/ML tools with pinned versions and all required dependencies;
  • Hardware-aware preflight checks that size RAM, disk, and shared memory and suggest safe defaults;
  • Clear guides with quick checklists, parameter cheat sheets, troubleshooting notes, and prompt support.

 Contact us at solutions@dataxight.com to get started today.

Technical Notes

This section outlines our experience running the notebooks, highlights the main pitfalls we encountered, and discusses where tooling can be refined. These notes are written primarily for readers with a background in data engineering, machine learning, or systems infrastructure; they prioritize depth and specificity over introductory explanations.

Out-of-core PCA and UMAP

UMAP generated from plates 3 and 9

This notebook, developed by the Fabian Theis lab at Helmholtz Munich, is available in their vevo_Tahoe_100m_analysis GitHub repository. It demonstrates out-of-core PCA on Tahoe-100M by using AnnData with Dask and RAPIDS to perform scalable, memory-efficient dimensionality reduction.

Using anndata.experimental.read_elem_as_dask, large arrays (e.g., X and obsm["X_pca"]) are loaded as Dask arrays for lazy access. The workflow computes PCA across ~100M cells from different cancer cell lines and drug conditions, then visualizes a subset of the embedding. The same pattern could also be applied to downstream tasks, from deep learning to linear models, without loading the full dataset. Additionally, users can choose to compute on CPUs or GPUs.
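
For orientation, the core pattern looks roughly like the minimal sketch below. It is written against the 0.11-era anndata.experimental API that the notebook uses (the lazy reader is renamed in newer releases; see the API-changes pitfall below), and the file path, chunk size, and worker count are placeholders to be tuned to your hardware.

```python
import h5py
import anndata as ad
import scanpy as sc
from anndata.experimental import read_elem_as_dask  # renamed to read_elem_lazy in newer AnnData
from dask.distributed import Client, LocalCluster

# Placeholder settings; start small and scale up (see the chunk-size pitfall below).
SPARSE_CHUNK_SIZE = 20_000
client = Client(LocalCluster(n_workers=4, threads_per_worker=1))

path = "plate3.h5ad"                      # placeholder for one Tahoe-100M plate file
meta = ad.read_h5ad(path, backed="r")     # obs/var come into memory, X stays on disk

f = h5py.File(path, "r")                  # keep the handle open while Dask reads blocks
X = read_elem_as_dask(f["X"], chunks=(SPARSE_CHUNK_SIZE, meta.n_vars))

adata = ad.AnnData(X=X, obs=meta.obs.copy(), var=meta.var.copy())
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)              # PCA runs block by block over the Dask-backed matrix
```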

Pitfalls to watch for

  • Dependencies by trial and error. The notebook imports many packages but provides no install requirements, so the practical path was to run each cell and pip install after every ModuleNotFoundError. It shows a package table, but the table does not separate direct from transitive dependencies, so users may end up guessing dependency chains and manually pinning pip versions, which is slow and error prone. In practice, the CPU path only needs dask, scanpy, and dask[distributed].
  • Choosing SPARSE_CHUNK_SIZE and cluster size. The defaults (SPARSE_CHUNK_SIZE = 100,000; LocalCluster(n_workers=16)) come with limited documentation on how to pick values that fit the available memory. Starting from 100,000, we kept lowering the chunk size until the KilledWorker errors stopped. Since then, we have created a guide that maps RAM per worker to recommended SPARSE_CHUNK_SIZE and n_workers values.
  • API changes. The notebook calls anndata.experimental.read_elem_as_dask, which recent AnnData versions have removed, causing an AttributeError. It exists only in older builds (e.g., 0.11.0rc3); newer releases provide anndata.experimental.read_elem_lazy (or read_lazy), and the referenced tutorial is no longer available. New users must choose between pinning AnnData or migrating to the new lazy APIs.
  • Beginner misunderstanding of GPU setup. The notebook reads as if setting use_gpu=True is enough, but a full RAPIDS stack is required (cuDF, CuPy, Dask-CUDA, RMM) along with rapids-singlecell. Missing packages only surface once GPU mode is enabled, forcing a second round of installs. One fix involved correcting imports, for example replacing the deprecated cupyx.scipy.spx with cupyx.scipy.sparse. Additionally, installing RAPIDS via pip frequently leads to CUDA mismatches such as “Failed to dlopen libcudart.so.12”. The reliable path is to set up a fresh conda environment from the RAPIDS channels with rapids-singlecell, cudf, dask-cuda, cupy, and rmm, plus anndata, scanpy, and dask, then run the notebook with that kernel.
  • Other GPU pitfalls
    • sc.pp.normalize_total(adata) can fail with TypeError: spmatrix.sum() got an unexpected keyword argument 'keepdims' when library versions do not match (CuPy, CUDA, Scanpy, Dask). This issue can be fixed by aligning versions or converting the matrix to a supported sparse type before normalization.
    • Moving X_pca to the GPU with map_blocks can raise ValueError: Unsupported initializer format if an empty CSR matrix is created instead of passing real data. This can be addressed by following the Dask map_blocks function-and-meta pattern so that each block carries actual values (a sketch follows this list).
    • The notebook assumes X_umap is on GPU, but it's actually on CPU. Users may not realize where the data resides, and the code doesn't handle this correctly, leading to errors when calling .get() on a NumPy array.
    • rapids_singlecell.tl.umap can produce a few extreme outliers that do not appear with scanpy.tl.umap on CPU.  
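
For the X_pca transfer mentioned above, a minimal sketch of the function-and-meta pattern is shown below, assuming adata.obsm["X_pca"] is a Dask array of NumPy blocks; passing real blocks and a matching meta avoids the empty-CSR initializer error.

```python
import cupy as cp

# Hedged sketch: convert each NumPy block of X_pca into a CuPy block on the GPU.
X_pca = adata.obsm["X_pca"]                  # assumed to be a Dask array of NumPy blocks
adata.obsm["X_pca"] = X_pca.map_blocks(
    cp.asarray,                              # each real block is passed to the function
    dtype=X_pca.dtype,
    meta=cp.empty((0, X_pca.shape[1]), dtype=X_pca.dtype),  # tells Dask the output block type
)
```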

scDataset: Scalable scRNA data loading on PyTorch Lightning

High-level view of scDataset (credits to the scDataset GitHub repo)

In “scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics”, authors Davide D’Ascenzo and Sebastiano Cultrera di Montesano introduce the scDataset package to speed up loading of scRNA data for model training with PyTorch Lightning.

To improve model generalization, a best practice is to ensure batch diversity during training, which is typically accomplished by shuffling the dataset so that examples arrive in a random order. For a large collection of sparse matrices such as Tahoe-100M, loading the entire dataset into memory just to form diverse, shuffled batches is both compute and time intensive.

scDataset, available in their GitHub repository, implements a PyTorch IterableDataset that operates directly on AnnData files, enabling high-throughput training workflows without full in-memory loading or format conversion. Their notebook demonstrates how to read directly from AnnData files, replacing slow random reads with block-wise reads and in-memory mixing that keep batches diverse while using the disk efficiently.
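
To make the block-wise idea concrete, here is a deliberately simplified sketch of the pattern (it is not scDataset's actual API; the file name, block sizes, and the dense conversion are illustrative): read contiguous blocks with fast sequential I/O, pool several blocks in memory, and shuffle within the pool before yielding samples.

```python
import numpy as np
import anndata as ad
import torch
from torch.utils.data import IterableDataset, DataLoader

class BlockShuffleDataset(IterableDataset):
    """Illustrative only: block-wise reads from a backed AnnData file plus in-memory mixing."""

    def __init__(self, h5ad_path, block_size=256, blocks_per_pool=16):
        self.adata = ad.read_h5ad(h5ad_path, backed="r")   # X stays on disk
        self.block_size = block_size
        self.blocks_per_pool = blocks_per_pool

    def __iter__(self):
        starts = np.arange(0, self.adata.n_obs, self.block_size)
        np.random.shuffle(starts)                          # visit blocks in random order
        pool = []
        for s in starts:
            block = self.adata.X[s : s + self.block_size]  # one contiguous (sequential) read
            pool.append(np.asarray(block.todense(), dtype=np.float32))
            if len(pool) == self.blocks_per_pool:
                yield from self._drain(pool)
                pool = []
        if pool:
            yield from self._drain(pool)

    @staticmethod
    def _drain(pool):
        rows = np.concatenate(pool)
        np.random.shuffle(rows)                            # mix rows across blocks
        return (torch.from_numpy(r) for r in rows)

# Worker sharding is omitted for brevity, so keep num_workers=0 in this sketch.
loader = DataLoader(BlockShuffleDataset("plate3.h5ad"), batch_size=64, num_workers=0)
```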

According to the scDataset preprint, this new method significantly outperforms existing data-loading tools: up to about 48× higher single-core throughput than anndata.experimental.AnnLoader (a common data loader for single-cell RNA data), about 27× faster than HuggingFace Datasets (a popular library for managing and loading machine-learning datasets), and about 18× faster than BioNeMo. Furthermore, with multiprocessing enabled, the total time for one full training epoch fell from more than 58 days to less than 11 hours, a massive improvement in data throughput and training efficiency. In our own testing on a mem2_ssd1_gpu_x16 instance (16 vCPU, 64.5 GB RAM, 210 GB disk) with two plates (3 and 9), one epoch of training and evaluation took up to 5 hours.

Pitfalls to watch for 

  • Shared memory limits: On a mem2_ssd1_gpu_x16 instance (16 vCPU, 64.5 GB RAM, 210 GB disk), loading the Plate 3 file (12.27 GB) of Tahoe-100M failed with the default /dev/shm (64 MB). Increasing /dev/shm to 4 GB resolved the DataLoader crashes caused by temp-file and bus errors. In principle, memory needs can be estimated from row size, batch size, block size, and number of workers (e.g., ~127 MB per worker at batch_size=64 and block_size=8; see the rough sizing sketch after this list). Clear guidance would avoid the brute-force trial and error of setting /dev/shm to 1 GB, 2 GB, 3 GB, and so on. A concise sizing guide that maps CPU, RAM, disk, and shared memory to batch_size, block_size, fetch_factor, and num_workers, with minimums for both one file and the full set of 14 files in Tahoe-100M, would be invaluable.
  • Unclear DataLoader messages for a new user: Errors like AssertionError: can only test a child process are often benign multiprocessing noise, but their connection to shared-memory or worker settings is not obvious. Short guidance on when to ignore versus when to adjust /dev/shm, num_workers, or fetch_factor would reduce friction.
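
As a starting point, a rough, hedged back-of-the-envelope estimate of the shared memory in flight can be computed from the parameters above; the gene count and the float32, densified-row assumptions below are ours, not from the notebook.

```python
# Assumed values for illustration only; swap in your dataset's and loader's settings.
n_genes = 62_000                    # approximate genes per Tahoe-100M plate (our assumption)
bytes_per_cell = n_genes * 4        # one row densified to float32
batch_size, block_size = 64, 8      # values from the example above
fetch_factor, num_workers = 1, 8    # fetch_factor > 1 multiplies the per-worker footprint

per_fetch = batch_size * block_size * bytes_per_cell   # ~127 MB, matching the estimate above
per_worker = per_fetch * fetch_factor
total = per_worker * num_workers

print(f"~{per_worker / 1e6:.0f} MB per worker, ~{total / 1e9:.1f} GB total; "
      f"size /dev/shm comfortably above this")
```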

Adapting STATE for Virtual Cell Challenge

STATE transition model (credits to the Arc Institute's blog post on Hugging Face)

STATE is the Arc Institute’s first virtual cell model (press release), “designed to predict how various stem cells, cancer cells, and immune cells respond to drugs, cytokines, or genetic perturbations[, and] trained on observational data from nearly 170 million cells and perturbational data from over 100 million cells across 70 cell lines, including data from the Arc Virtual Cell Atlas”.

Arc Institute developed a notebook that adapts STATE for context generalization in the Arc Virtual Cell Challenge (VCC) and released it in their GitHub repository, along with a Colab walkthrough of training and inference for the VCC. STATE was not built directly for transcriptome-wide effect prediction or for forecasting responses to unseen perturbations; the notebook therefore modifies its architecture to fit the VCC's purpose by simplifying the basal encoder to a linear layer and moving the residual connection to the final expression-prediction space. Besides the training and validation profiles from H1 cells provided by the challenge, the notebook co-trains on the Replogle genome-wide and essential CRISPR screens, restricting the auxiliary set to the 200 perturbations that also appear in the challenge's train and validation split.
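
To make the architectural change concrete, here is a schematic sketch of the idea described above, not the actual STATE code: the basal state is encoded with a single linear layer, combined with a perturbation embedding, and the residual connection is applied in the final expression-prediction space. Layer names and sizes are our own placeholders.

```python
import torch
import torch.nn as nn

class SimplifiedStateHead(nn.Module):
    """Schematic sketch only; see the Arc Institute notebook for the real implementation."""

    def __init__(self, n_genes: int, n_perturbations: int, d_model: int = 512):
        super().__init__()
        self.basal_encoder = nn.Linear(n_genes, d_model)        # simplified to one linear layer
        self.pert_embedding = nn.Embedding(n_perturbations, d_model)
        self.decoder = nn.Linear(d_model, n_genes)

    def forward(self, basal_expr: torch.Tensor, pert_idx: torch.Tensor) -> torch.Tensor:
        h = self.basal_encoder(basal_expr) + self.pert_embedding(pert_idx)
        delta = self.decoder(h)
        return basal_expr + delta                               # residual in expression space
```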

Pitfalls to watch for

  • Memory and storage considerations. Running the STATE notebook demands significant disk space. The Virtual Cell Challenge environment pulls in PyTorch, CUDA, RAPIDS, and other GPU libraries and typically occupies 10 to 12 GB. The six AnnData train/validation files add ~29 GB. With the defaults (max_steps=40000, ckpt_every_n_steps=2000), training writes 20 checkpoints of 1.7 GB each. Total disk use can exceed 60 GB, so users should ensure at least 100 GB of free space before running the notebook. While this is within the capacity of most modern systems, it is an important reminder to plan storage in advance.
  • Evaluation for VCC submission. To evaluate model performance, the inference portion of the notebook writes a .vcc file of ~3.8 GB. This file then needs to be uploaded to the VCC, where it is scored by the challenge's evaluation pipeline and the results are posted to the Virtual Cell Challenge leaderboard.
  • Tuning the model with Tahoe-100M. Adapting the workflow to Tahoe-100M raises several additional challenges. The VCC expects predictions on 18,080 genes, but Tahoe-100M covers only 17,741 of them, leaving 339 missing and causing input-shape errors when combining datasets. Additionally, the STATE notebook was built around the VCC dataset, which uses gene perturbations, whereas Tahoe-100M contains only drug perturbations. Gene perturbations directly target a specific gene, whereas drug perturbations reflect dose, time, and off-target effects, resulting in different biological contexts and metadata profiles. When matching target genes to drugs in Tahoe-100M, some drugs have either no listed targets or multiple targets. Metadata issues also arise, such as an extra space in a drug name (e.g., Erdafitinib), which breaks alignment. These gaps require careful mapping, cleaning, and harmonization before training with the competition data (a small cleanup sketch follows this list).
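
As a starting point for the harmonization above, a small, hedged cleanup sketch is shown below; the file names, the obs column "drug", and the VCC gene-list format are placeholders for the actual metadata, and zero-filling the missing genes is just one simple option.

```python
import numpy as np
import pandas as pd
import anndata as ad

# Placeholder inputs: a Tahoe-100M subset and the 18,080-gene panel expected by the VCC.
tahoe = ad.read_h5ad("tahoe_subset.h5ad")
vcc_genes = pd.read_csv("vcc_gene_list.csv")["gene"].tolist()

# Tidy drug names so an extra space around e.g. Erdafitinib no longer breaks alignment.
tahoe.obs["drug"] = tahoe.obs["drug"].astype(str).str.strip()

missing = [g for g in vcc_genes if g not in set(tahoe.var_names)]   # the ~339 absent genes
print(f"{len(vcc_genes) - len(missing)} shared genes, {len(missing)} missing from Tahoe-100M")

# Reindex to the full VCC panel, filling the missing genes with zeros so input shapes match.
expr = tahoe.to_df().reindex(columns=vcc_genes, fill_value=0.0)
aligned = ad.AnnData(
    X=expr.to_numpy(dtype=np.float32),
    obs=tahoe.obs.copy(),
    var=pd.DataFrame(index=pd.Index(vcc_genes, name="gene")),
)
```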

Analyzing large-scale scRNA datasets?

Tahoe-100M highlights both the opportunities and the practical barriers of working at single-cell scale. As we refine tools and workflows for scRNA analysis, input from the community is critical.

If you are interested in leveraging large scRNA datasets such as Tahoe-100M in your work, we invite you to join us in making large-scale single-cell analysis more accessible. Email us at solutions@dataxight.com.