Single-cell RNA sequencing (scRNA-seq) is revolutionizing our understanding of cellular biology, but the computational challenges of processing these massive datasets continue to evolve. As datasets grow from thousands to millions of cells, the choice of data format and processing pipeline becomes critical. Parquet files, with their columnar storage and excellent compression ratios, seem like a natural fit for intermediate data storage in machine learning workflows. In a previous blog post, we shared some initial promising results. As we explored further, however, we encountered a memory leak bug in the Apache Arrow library that has made Parquet files untenable for large-scale scRNA-seq data processing, at least for now. This post details our investigation into the issue.

Exploring Parquet for scRNA-seq data

Parquet files offer several compelling advantages for genomics data:

  • Columnar storage: Perfect for gene expression matrices where operations often involve entire genes (columns) across cells
  • Excellent compression: Critical for sparse scRNA-seq data with many zero values
  • Cross-language compatibility: Works seamlessly across Python, R, and other analysis environments
  • Accelerated query performance: Predicate pushdown enables efficient filtering without loading entire datasets

For scRNA-seq datasets, these features could translate into faster data loading, reduced storage costs, and more efficient memory usage during analysis. The format seems particularly promising for preprocessing pipelines that convert raw count matrices into PyTorch tensors for deep learning applications. Our initial investigation, which we shared in the blog post “Democratizing Single-Cell RNA Analyses” (link), also found using Parquet files to be promising.
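
To make that concrete, the snippet below is a minimal sketch of how column pruning and predicate pushdown look with pyarrow. The file name "cells.parquet", the gene columns, and the "total_counts" filter column are illustrative placeholders, not part of our actual pipeline:

```python
# Minimal sketch: read only selected gene columns and push a row filter
# down to the Parquet row groups, rather than loading the whole matrix.
import pyarrow.parquet as pq

table = pq.read_table(
    "cells.parquet",                          # hypothetical per-cell matrix
    columns=["cell_id", "CD19", "MS4A1"],     # column pruning: only these genes
    filters=[("total_counts", ">=", 1000)],   # predicate pushdown
)
print(table.num_rows, table.schema.names)
```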

Encountering a memory leak bug  

The Tahoe100M dataset, with its 100+ million cells and comprehensive perturbation profiles, represents the cutting edge of precision medicine datasets. Each cell contains expression profiles for 62,710 genes, creating a massive sparse matrix ideal for Parquet's columnar compression. Following our previous blog post, we attempted to implement a PyTorch DataLoader that reads Parquet files in chunks and converts them to tensors for ML training.
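
The loader we sketched looked roughly like the following. This is a simplified illustration rather than our production code; the file path, gene columns, and batch size are placeholders:

```python
# Simplified sketch of a chunked Parquet -> PyTorch tensor loader.
import numpy as np
import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, IterableDataset

class ParquetChunkDataset(IterableDataset):
    """Streams record batches from a Parquet file and yields dense tensors."""

    def __init__(self, path, gene_columns, batch_size=1024):
        self.path = path
        self.gene_columns = gene_columns
        self.batch_size = batch_size

    def __iter__(self):
        pf = pq.ParquetFile(self.path)
        # Stream record batches instead of materializing the whole table.
        for batch in pf.iter_batches(batch_size=self.batch_size,
                                     columns=self.gene_columns):
            dense = np.column_stack(
                [col.to_numpy(zero_copy_only=False) for col in batch.columns]
            ).astype(np.float32)
            yield torch.from_numpy(dense)

# batch_size=None because batches are already formed by the dataset itself.
loader = DataLoader(ParquetChunkDataset("cells.parquet", ["CD19", "MS4A1"]),
                    batch_size=None)
```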

In doing so, we encountered a memory leak bug in the Apache Arrow library that underlies most Parquet implementations in Python. What should have been a straightforward data loading operation became a deep dive into memory profiling as we tried numerous alternatives for converting a Parquet file into a PyTorch tensor.

Regardless of the approach, we observed that reading 1GB of Parquet files quickly grew roughly 10x, to 10GB of memory usage. For Tahoe100M, which is 300GB on disk, memory usage quickly overwhelms the system. The bug appears to have been identified and documented (it is currently an open issue): https://github.com/apache/arrow/issues/47266
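
The instrumentation we used to watch this happen was nothing exotic; a sketch along the following lines (the path is a placeholder, and psutil is only used to read the process RSS) is enough to see resident memory climbing far beyond the data actually being read:

```python
# Watch process RSS versus Arrow's own allocation counter while streaming.
import psutil
import pyarrow as pa
import pyarrow.parquet as pq

proc = psutil.Process()
pf = pq.ParquetFile("cells.parquet")  # placeholder path
for i, batch in enumerate(pf.iter_batches(batch_size=1024)):
    if i % 100 == 0:
        rss_gb = proc.memory_info().rss / 1e9
        arrow_gb = pa.total_allocated_bytes() / 1e9
        print(f"batch {i}: rss={rss_gb:.2f} GB, arrow_allocated={arrow_gb:.2f} GB")
```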

The culprit appears to be the use of mimalloc in the latest Arrow release, version 21.0: one user reports that building pyarrow with ARROW_MIMALLOC=OFF fixes the issue, and another reports that setting the ARROW_DEFAULT_MEMORY_POOL environment variable to “system” also addresses it.
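
For anyone who wants to try the environment-variable workaround without rebuilding pyarrow, a minimal sketch is below. Note that ARROW_DEFAULT_MEMORY_POOL is read when pyarrow is loaded, so it must be set before the import (or exported in the shell before launching Python); we have not verified this fix ourselves:

```python
# Switch Arrow's default allocator to the system allocator.
# Must be set before pyarrow is imported.
import os
os.environ["ARROW_DEFAULT_MEMORY_POOL"] = "system"

import pyarrow as pa
print(pa.default_memory_pool().backend_name)  # expect "system"
```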

Once we traced the issue to Apache Arrow’s memory leak, we realized that our options were to wait for a bug fix in a future release or to find an alternative approach. We decided to take the AND approach (rather than the EITHER/OR approach) and do both. While we wait for the fix in a future release, we have started working on an alternative that operates directly on AnnData files to achieve the desired improvements in efficiency and performance. Stay tuned for some exciting news that we have on this front!


Have you encountered similar memory management issues in your scRNA-seq processing pipelines? What strategies have you found effective for handling large-scale densification?  Email us at solutions@dataxight.com - we’d love to hear from you!