The single-cell RNA sequencing (scRNA-seq) field is experiencing unprecedented growth in dataset sizes, driven by ambitious molecular perturbation studies that promise to revolutionize our understanding of cellular responses to interventions. Projects like CZI's CellxGene, Arc Institute and Tahoe Bio's Tahoe-100M, and Xaira Therapeutics' X-Atlas/Orion represent a new frontier in scale, but they also pose significant computational challenges that traditional tools weren't designed to handle.

Both the Tahoe-100M and X-Atlas/Orion datasets are organized and stored as H5AD files, the on-disk format for AnnData (annotated data) objects, a widely used data structure for handling single-cell RNA sequencing data. Analyzing these datasets with existing tools requires access to massive computational resources. Having spent considerable time wrestling with these datasets, we want to share our experience exploring an alternative to the standard workflow: converting the data to a DataFrame format and leveraging modern distributed computing tools like Daft and Ray. Our hope is that this helps democratize large-scale scRNA-seq analysis for scientists with less powerful computers.

The Scale Challenge: When AnnData Hits Limits

The excitement around molecular perturbation studies is well-founded: they address one of the core questions in biology. Perturbation combined with scRNA-seq technologies allows us to understand how individual cells respond to chemical or genetic interventions. Industrialized platforms from Tahoe Bio and Xaira have accelerated the growth of single-cell perturbation experiments and produced gigantic datasets. The ability to analyze intervention effects at this unprecedented scale opens doors to discoveries that were simply impossible with smaller datasets. However, this scale comes with a price: computational complexity that pushes traditional tools to their breaking point.

Consider the Tahoe-100M dataset: 300GB of data stored as 14 H5AD files, each containing an AnnData sparse matrix and ranging from 12GB to 35GB. A sparse matrix is a matrix in which most of the elements are zero; such matrices are commonly used for scRNA-seq data. While sparse matrices are optimized to save disk space (these matrices are typically >95% zeros), the computational reality is more complex. When you need to perform analyses, that sparse data often needs to be densified, and suddenly your 300GB dataset can easily require upwards of 1TB of memory when fully loaded.
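
To make the densification penalty concrete, here is a toy illustration. The matrix dimensions and density below are made up for the example (not taken from Tahoe-100M), but the blow-up ratio they demonstrate is the same phenomenon:

```python
import numpy as np
from scipy import sparse

# Small synthetic stand-in for an expression matrix: 10,000 "cells" x 2,000 "genes",
# with ~2% non-zero entries (real scRNA-seq matrices are typically >95% zeros).
rng = np.random.default_rng(0)
X = sparse.random(10_000, 2_000, density=0.02, format="csr",
                  random_state=rng, dtype=np.float32)

# CSR storage only keeps the non-zero values plus their index arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.toarray().nbytes  # what densification actually costs

print(f"sparse (CSR): {sparse_bytes / 1e6:.1f} MB")
print(f"dense       : {dense_bytes / 1e6:.1f} MB")
print(f"blow-up     : {dense_bytes / sparse_bytes:.0f}x")
```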

This is where existing AnnData tools struggle. The existing ecosystem of tools was not designed with this scale in mind, and several fundamental limitations become apparent:

  • Memory constraints: Support for out-of-core processing is limited. Out-of-core (or external-memory) processing handles data that is too large to fit into a computer's internal memory at once.
  • Single-machine assumptions: Most tools are implemented to execute on a single computer, while large-scale data usually requires distributed processing across many machines to achieve the necessary computational power and storage capacity. 
  • File-backed limitations: File-backed access keeps data on disk rather than loading it entirely into RAM during analysis. While file-backed mode is somewhat helpful for those with limited computational resources, it is still constrained for operations that require densification (see the sketch after this list).
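
For concreteness, here is a minimal sketch of what file-backed access looks like with the anndata library, and where it runs into trouble. The file path is a placeholder:

```python
import anndata as ad

# File-backed mode: obs/var metadata load into memory, the expression matrix stays on disk.
adata = ad.read_h5ad("plate7.h5ad", backed="r")  # placeholder path

print(adata.shape)       # cheap: only metadata is touched
print(adata.obs.head())  # cheap: obs is an ordinary in-memory DataFrame

# The trouble starts when an operation needs the dense matrix:
# X_dense = adata.X[:].toarray()  # pulls everything into RAM and densifies it
```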

There have been growing efforts to address these issues using Dask and Rapids-SingleCell. Beyond those approaches, we wanted to explore one that leverages the mature, battle-tested tools of the broader data science and machine learning ecosystem.

Overcoming the Scale Challenge: The Parquet Paradigm Shift

Our approach centers on a simple but powerful idea: convert AnnData's internal objects to a DataFrame format and store the data as Parquet files. This idea is not particularly radical. An AnnData object consists of a central expression matrix X plus metadata describing different aspects of that matrix, and since AnnData already stores most of its metadata in DataFrames, we are essentially extending the pattern to the expression data itself while harmonizing the metadata alongside it. Furthermore, because Parquet has a vibrant ecosystem, this approach lets each researcher choose their favorite Parquet library for their analysis.
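
To illustrate the idea, here is a minimal sketch, assuming an AnnData object small enough to fit in memory, of how the expression matrix and metadata can be flattened into one long-format DataFrame and written to Parquet. The column names are our own choices for this example, not a standard schema:

```python
import anndata as ad
import pandas as pd
from scipy import sparse

adata = ad.read_h5ad("small_plate.h5ad")  # placeholder path

# Flatten the sparse expression matrix X into (cell, gene, value) triplets.
coo = sparse.coo_matrix(adata.X)
df = pd.DataFrame({
    "cell_id": adata.obs_names[coo.row],   # cell barcodes from obs
    "gene": adata.var_names[coo.col],      # gene symbols from var
    "expression": coo.data,                # non-zero expression values
})

# Cell-level metadata (e.g. perturbation labels) can be joined in from adata.obs.
obs = adata.obs.rename_axis("cell_id").reset_index()
df = df.merge(obs, on="cell_id", how="left")

df.to_parquet("small_plate.parquet", index=False)
```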

The Conversion Process: Lessons Learned

Our initial attempt used the pqdata library by Danila Bredikhin (currently a postdoc at Stanford), which converts AnnData into a directory structure of Parquet files that mirrors the internal AnnData structure. We immediately encountered a compatibility issue: the library assumes sparse matrices are SciPy sparse matrices, the specialized data structures the Python SciPy library provides for efficiently storing and manipulating sparse data. The Python anndata library's API has since evolved, and the data is now instantiated as anndata.abc.CSRDataset objects, which causes pqdata to error out. After implementing a fix, we tried the conversion again. This time we ran into out-of-memory errors: the code would crash on even the smallest of the Tahoe-100M plate files (plate 3, at 12 GB).

We pivoted to a batch processing approach, reading data in batches of 25,000 rows and writing an individual Parquet file for each batch. This strategy has a crucial advantage: it makes the conversion process accessible even on modest hardware. On a 5-year-old MacBook M1 with just 8GB of RAM, we could convert each plate of the Tahoe-100M data in 2.5 to 7.5 hours. No specialized hardware required.
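
The sketch below captures the general shape of that batch loop, assuming a file-backed H5AD as input and pandas/pyarrow for the Parquet writes. It is a simplification, not our exact script; the full conversion code will be part of the open-source release mentioned at the end of this post:

```python
import anndata as ad
import pandas as pd
from scipy import sparse

BATCH = 25_000  # cells per output Parquet file

adata = ad.read_h5ad("plate3.h5ad", backed="r")  # placeholder path; X stays on disk
n_cells = adata.n_obs

for start in range(0, n_cells, BATCH):
    stop = min(start + BATCH, n_cells)

    # Load only this slice of cells, then flatten it to (cell, gene, value) triplets.
    chunk = adata[start:stop].to_memory()
    coo = sparse.coo_matrix(chunk.X)

    pd.DataFrame({
        "cell_id": chunk.obs_names[coo.row],
        "gene": chunk.var_names[coo.col],
        "expression": coo.data,
    }).to_parquet(f"plate3_batch_{start:09d}.parquet", index=False)
```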

Daft + Ray: The Scalability Sweet Spot

Once the data is converted into Parquet files, we can leverage the Python Daft and Ray libraries for analysis. Our experience suggests that this combination offers a couple of advantages over traditional Pandas-based workflows:

Lazy Evaluation: Daft's lazy evaluation model means operations are optimized and executed only when results are needed, dramatically reducing memory overhead.
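
A small sketch of what lazy evaluation looks like in practice (the file glob and column names are placeholders):

```python
import daft

# Building the query does not read any data: Daft only records a logical plan.
df = daft.read_parquet("plate7_batch_*.parquet")     # placeholder glob
query = df.where(daft.col("gene") == "C1orf112")

# Inspect the optimized plan without executing it.
query.explain(show_all=True)

# Only here do the Parquet reads and the filter actually run.
result = query.collect()
```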

Seamless Cloud Scaling: Ray is an open-source framework designed for scaling Python applications—from single machines to large clusters—without requiring major code changes. The tight integration with Ray is perhaps the most compelling feature. You can develop and test your analysis pipeline locally on a subset of data, then seamlessly scale to a Ray cluster in the cloud when you're ready to process the full dataset.
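
To our understanding, switching from local execution to a Ray cluster is essentially a one-line change with the Daft API at the time of writing; a hedged sketch, with placeholder cluster address and data location:

```python
import daft

# Local development uses Daft's default runner on the current machine.
# To process the full dataset, point Daft at a Ray cluster before building any queries.
daft.context.set_runner_ray(address="ray://head-node:10001")  # placeholder address

# The analysis code itself does not change.
df = daft.read_parquet("s3://your-bucket/tahoe/plate*.parquet")  # placeholder location
print(df.count_rows())
```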

Performance Results: Where the Rubber Meets the Road

To validate this approach, we ran simple aggregation operations comparing an implementation using the anndata library (in file-backed mode) against our Daft-Parquet implementation, querying a single plate (plate 7, 15.4 GB of data). The code is minimalist: roughly 25 lines, including timing and logging of the results.
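
The benchmark script will ship with the open-source release; the sketch below conveys the gist of the Daft side (file glob and column names are placeholders, not our exact code):

```python
import time

import daft

df = daft.read_parquet("plate7_batch_*.parquet")   # placeholder glob
gene = df.where(daft.col("gene") == "C1orf112")    # placeholder column name

# Count rows for the gene.
t0 = time.time()
print(f"count rows: {gene.count_rows()} ({time.time() - t0:.2f}s)")

# Max and mean expression value for the gene.
t0 = time.time()
max_val = gene.agg(daft.col("expression").max().alias("v")).to_pydict()["v"][0]
print(f"max value: {max_val} ({time.time() - t0:.2f}s)")

t0 = time.time()
mean_val = gene.agg(daft.col("expression").mean().alias("v")).to_pydict()["v"][0]
print(f"mean value: {mean_val} ({time.time() - t0:.2f}s)")
```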

We performed three test aggregations on a MacBook M1: counting rows and computing the maximum and mean value for gene C1orf112 in the dataset. The results speak for themselves:

AnnData Performance: This implementation crashes on all three test aggregations (count, max, mean).

Daft Performance (no Ray cluster):

| Operation | Time (seconds) | Result |
| --- | --- | --- |
| Count rows | 11.03 | 5,692,117 |
| Max value (Gene expression score) | 11.14 | 29.0 |
| Mean value (Gene expression score) | 10.93 | 0.042 |

All three simple operations completed in approximately 11 seconds each on hardware where attempting the same computations with the Python anndata library (with minimal densification) crashed.

Having shown the feasibility of this approach on a single plate, we repeated the experiment on two plates (adding plate 3, for a total of 28 GB of data). Here are the results:

| Operation | Time (seconds) | Result |
| --- | --- | --- |
| Count rows | 20.9 | 10,397,519 |
| Max value (Gene expression score) | 20.8 | 29.0 |
| Mean value (Gene expression score) | 20.3 | 0.0395 |

Looking Forward: Implications and Next Steps

This exploration represents more than just a technical workaround—it's a paradigm shift toward treating scRNA data as part of the broader big data ecosystem and making the data more accessible to a broader group of scientists.  By converting to standard formats like Parquet and leveraging mature distributed computing frameworks, we gain access to:

  • Proven scalability patterns from other domains
  • Cloud-native architectures that can handle petabyte-scale datasets
  • Ecosystem compatibility with modern Machine Learning Operations (MLOps) and data engineering tools
  • Cost efficiency through better resource utilization

Large-scale perturbation studies generating these massive single-cell datasets represent the future of understanding functional genomics and context-specific mechanisms. To fully realize and utilize their potential, we need computational approaches that can scale with the ambition of science. Our experience suggests that looking beyond the traditional bioinformatics toolkit—toward the tools and practices of modern data science—may be essential for unlocking the insights hidden in these unprecedented datasets.

The field is still in the early stages of grappling with this scale challenge, and there's much more work to be done. But the initial results are promising: with the right tools and approaches, even modest hardware can tackle analyses that would have been impossible just a few years ago.

Open Source Commitment

We are strong believers in open science and the power of community-driven development. The code and conversion scripts we've developed for this Parquet-based scRNA analysis approach will be made available as open source in the coming weeks. We're currently finalizing documentation and preparing comprehensive examples to ensure the tools are accessible to the broader research community.

If you are working with large-scale scRNA datasets and would like an advanced look at our implementation, or if you'd like to discuss potential collaborations, we encourage you to reach out.  We are always glad to connect with fellow researchers tackling similar computational challenges and would welcome the opportunity to share our work before the official release.

Let us know at solutions@dataxight.com