Single-cell RNA sequencing is fueling a rapid expansion of large-scale omics data. Projects such as Tahoe 100M, which profiled 100 million cells across 50 cancer cell lines, are already opening new avenues for discovery, including predicting cellular responses to drug treatments. Not long after, the Billion Cells Project partnered with more than 10 leading laboratories and institutions across the United States, aiming to generate nearly 500 million single-cell profiles in its first year and ultimately to build a resource of one billion profiles that advances virtual cell models and accelerates the use of AI in biology.

These datasets offer tremendous opportunities, but they also pose major computational challenges. Traditional workflows cannot keep pace with the scale, turning data storage, processing, and analysis into bottlenecks. Addressing these challenges requires new approaches, and GPU acceleration is emerging as a transformative solution.

In September 2025, the DataXight team joined the Accelerate Omics Hackathon, a three-week virtual event organized by NVIDIA and scverse. The hackathon brought together 11 teams of researchers, developers, and open-source contributors from around the world to explore how GPUs can advance the performance and scalability of omics workflows.

The DataXight multidisciplinary team brought together expertise in biology, bioinformatics, data science, and software engineering. This positioned us well to take on the challenge with both scientific insight and technical depth. We came into the hackathon with two clear goals: 

  • Contribute to the open-source tools that the scientific community depends on, in line with our mission to support open science; and
  • Deepen our understanding of how GPU technologies can help overcome computational bottlenecks in large-scale biological data, particularly single-cell RNA sequencing.

The Problem

Our team focused primarily on the Tahoe 100M dataset, a landmark single-cell resource designed for studying drug perturbations. The dataset spans 316 GB across 14 separate AnnData files, which immediately creates hurdles for machine learning workflows. To obtain meaningful results, researchers must test multiple models and tune parameters, yet even a single training epoch on this dataset can take many hours, and completing a full set of experiments may require days or even weeks. These long training times slow the pace of discovery, making it difficult to iterate quickly and explore new ideas. As data volumes continue to grow, especially with new large-scale resources for perturbation studies, training time will remain a critical barrier that must be addressed.

PROTOplast in Action: Enhancing Performance During the Hackathon

Through experimentation, we decomposed the data-loading bottleneck into four subproblems:

  • Loading from disk is slow.
  • Transferring data from system memory to GPU memory is slow.
  • Batch sizes must balance limited RAM against limited VRAM.
  • Data must be distributed evenly across workers to maximize parallelization.

Making loading from disk faster

To speed up data loading, we use multiple workers that fetch data sequentially in controlled batches, with the optimal batch size determined by the system's available memory. This significantly improves loading speed because sequential disk reads are much faster than random reads.

Parallel processing, where multiple workers read data chunks concurrently, effectively hides I/O latency. This is particularly beneficial with SSDs or distributed storage systems such as S3. Although distributed storage incurs higher latency than local disks, it handles many concurrent reads well, so throughput scales with the number of workers.
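
As a minimal sketch of this pattern (not PROTOplast's actual implementation), a plain PyTorch IterableDataset can have each worker read contiguous row slices from a backed AnnData file; the file name, chunk size, and worker count below are placeholders.

```python
import anndata as ad
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ContiguousAnnDataset(IterableDataset):
    """Each worker reads contiguous row slices from a backed AnnData file,
    so disk access stays sequential rather than random."""

    def __init__(self, path, batch_size=1024):
        self.path = path
        self.batch_size = batch_size

    def __iter__(self):
        adata = ad.read_h5ad(self.path, backed="r")   # lazy: data stays on disk
        n_obs = adata.n_obs
        starts = range(0, n_obs, self.batch_size)

        info = get_worker_info()
        if info is not None:                          # interleave chunks across workers
            starts = list(starts)[info.id :: info.num_workers]

        for start in starts:
            stop = min(start + self.batch_size, n_obs)
            yield adata.X[start:stop]                 # scipy.sparse slice; densify later on the GPU

# Hypothetical file path and worker count, for illustration only.
loader = DataLoader(ContiguousAnnDataset("plate_3.h5ad"), batch_size=None, num_workers=12)
```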

Reducing data transfer between system memory and GPU

For models requiring a densified matrix, our experiments indicate that it is more efficient to transfer the matrix in sparse format and densify it on the GPU. This approach is preferred over densifying on the CPU and then transferring the dense matrix to the GPU, primarily because the sparse representation is significantly smaller.
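
A minimal sketch of this idea, assuming the batch arrives as a SciPy CSR matrix (as AnnData typically stores X): the data is shipped to the GPU in sparse form and densified there.

```python
import numpy as np
import torch
from scipy import sparse

def sparse_batch_to_dense_gpu(batch: sparse.csr_matrix, device: str = "cuda") -> torch.Tensor:
    """Move a sparse batch to the GPU and densify it there, instead of
    densifying on the CPU and transferring a much larger dense array."""
    coo = batch.tocoo()
    indices = torch.from_numpy(np.vstack([coo.row, coo.col])).long()
    values = torch.from_numpy(coo.data).float()
    x_sparse = torch.sparse_coo_tensor(indices, values, size=coo.shape)
    return x_sparse.to(device).to_dense()   # densification happens in VRAM
```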

Adjustable system memory and GPU memory usage

The batch size dictates how much data is transferred from disk to system memory; users with memory constraints can reduce this value. The mini-batch parameter governs how much data is sent to the GPU, and its optimal size depends on the model and the available GPU memory. Crucially, avoid processing one sample at a time: GPUs excel at large matrix multiplications and perform far worse on many small ones.
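
Reusing the illustrative `loader` and `sparse_batch_to_dense_gpu` helpers from the sketches above, and a hypothetical `model`, the interplay of the two knobs looks roughly like this:

```python
batch_size = 1024      # rows fetched from disk into system memory per read (bounded by RAM)
minibatch_size = 256   # rows densified and pushed through the model per step (bounded by VRAM)

for batch in loader:                                   # sparse chunk held in system memory
    for start in range(0, batch.shape[0], minibatch_size):
        mb = batch[start : start + minibatch_size]
        x = sparse_batch_to_dense_gpu(mb)              # dense data exists only in VRAM
        logits = model(x)                              # one large matmul per mini-batch,
        # ... loss and backward pass as usual          # not many tiny per-sample matmuls
```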

Distributing the data prior to training initiation

Before training, each dataset file is divided into fixed-size batches of observations, with adjustments made to handle cases where batch counts do not evenly match the number of workers (via padding, duplication, or dropping, subject to user-specified thresholds). The total batches across files are then proportionally partitioned into training, validation, and test sets according to user-defined ratios. Finally, the splits are balanced so that each worker receives an equal share of data, ensuring efficient and consistent distributed training across multiple nodes and files.
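
The partitioning idea can be sketched as follows, assuming per-file observation counts are known up front; the function name and the choice to drop the remainder (rather than pad or duplicate) are illustrative, not PROTOplast's actual API.

```python
import random

def plan_batches(n_obs_per_file, batch_size, num_workers, ratios=(0.8, 0.1, 0.1)):
    """Pre-compute (file_idx, start, stop) batches across all files, split them
    into train/val/test, and deal each split evenly across workers."""
    batches = [
        (file_idx, start, min(start + batch_size, n_obs))
        for file_idx, n_obs in enumerate(n_obs_per_file)
        for start in range(0, n_obs, batch_size)
    ]
    random.shuffle(batches)

    # Proportional split according to user-defined ratios.
    n_train = int(len(batches) * ratios[0])
    n_val = int(len(batches) * ratios[1])
    splits = {
        "train": batches[:n_train],
        "val": batches[n_train : n_train + n_val],
        "test": batches[n_train + n_val :],
    }

    # Balance each split so every worker receives the same number of batches.
    # Here the remainder is simply dropped; padding or duplication are the
    # alternatives mentioned above, chosen via user-specified thresholds.
    plan = {}
    for name, split in splits.items():
        usable = len(split) - len(split) % num_workers
        plan[name] = [split[w:usable:num_workers] for w in range(num_workers)]
    return plan

# Example: three files of different sizes, batches of 1,024 cells, 4 workers.
plan = plan_batches([4_700_000, 5_100_000, 6_300_000], batch_size=1024, num_workers=4)
```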

Performance Benchmarking with the Tahoe 100M Dataset

During the hackathon, our team achieved a key milestone: a seven-fold improvement in data loading performance, demonstrated through benchmarking against widely used pipelines such as scDataset and AnnLoader. To validate this advance, we designed a benchmarking setup that reflected the demands of large-scale training.

The experiments were conducted on an NVIDIA A100-SXM4 GPU with 80 GB of memory, using the Tahoe 100M dataset (14 AnnData files totaling 316 GB). The task involved data loading and one training epoch with a simple two-layer MLP classifier. We configured the system with a batch size of 1024, 12 workers, and a fetch factor of 16, and evaluated performance based on runtime, peak RAM usage, and peak GPU memory consumption. PROTOplast was then compared against established baselines including AnnData, AnnLoader, scVI, and scDataset.
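
For context, the classifier itself was kept deliberately simple so that data loading dominated the measurement. A sketch of a comparable two-layer MLP is below; the hidden width and number of output classes are assumptions, not the exact benchmark configuration.

```python
import torch.nn as nn

class TwoLayerMLP(nn.Module):
    """Minimal classifier used only to exercise the data pipeline:
    a gene-expression vector in, a class label out."""

    def __init__(self, n_genes: int, n_classes: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)
```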

Table 1: Benchmarking End-to-End Workflows: Data Loading and One-Epoch MLP Training on Tahoe 100M (100M Cells)

| Pipeline   | Description        | Time (s)  | Peak RAM (MB) | Peak GPU (MB) |
|------------|--------------------|-----------|---------------|---------------|
| anndata    | single file only¹  | -         | -             | -             |
| annloader  | multi-file         | running²  | -             | -             |
| scvi       | single file only³  | -         | -             | -             |
| scDataset  | multi-file         | 10,735    | 56,323        | 1,600         |
| PROTOplast | multi-file         | 1,404     | 68,228        | 2,774         |

¹ The anndata implementation supports only a single file; ³ scvi's AnnCollection implementation crashes on the full Tahoe 100M.
² annloader reached only 10% of the epoch after 24 hours.

Because AnnData and scVI only support single-file input, they were not included in the full 14-plate benchmark. To ensure a fair comparison, we also evaluated all pipelines on Plate 3 of Tahoe 100M (4.7 million cells). This setup made it possible to directly compare PROTOplast with both single-file and multi-file workflows under the same conditions.

Table 2: End-to-End Benchmark on Tahoe 100M Plate 3 (4.7M Cells)

| Pipeline   | Time (s) | Peak RAM (MB) | Peak GPU (MB) |
|------------|----------|---------------|---------------|
| anndata    | -¹       | 68,680¹       | 764¹          |
| annloader  | 47,846   | 11,319        | 1,711         |
| scvi       | 277      | 19,661        | 1,711         |
| scDataset  | 368      | 29,414        | 1,749         |
| PROTOplast | 79       | 27,910        | 2,557         |

¹ AnnData crashed with an out-of-memory error during benchmarking on Plate 3; peak RAM and peak GPU were recorded before the crash.

The results show that while pipelines such as AnnData and scVI are restricted to single-file datasets, PROTOplast delivers substantially faster runtimes with efficient memory usage, even at the single-plate scale of 4.7 million cells.

During the three-week hackathon, we also performed preliminary multi-GPU benchmarks on the Tahoe 100M dataset, which showed clear scaling benefits. Running on NVIDIA L40S instances, training time for Plate 3 dropped from 82 seconds on a single GPU to 58 seconds on four GPUs, while the full-dataset runtime decreased from 1,434 to 870 seconds. By adjusting the worker allocation per GPU, we maintained efficient throughput, highlighting PROTOplast's ability to leverage parallelization and accelerate large-scale single-cell analysis. Following the hackathon, we plan to investigate multi-GPU scaling more deeply to further optimize performance.

Table 3: Preliminary benchmark results on the Tahoe 100M dataset using PROTOplast on multiple GPUs (unoptimized)

| Plate 3 time (s) | Entire dataset time (s) | Number of GPUs | Workers / GPU |
|------------------|-------------------------|----------------|---------------|
| 82               | 1,434                   | 1              | 36            |
| 70               | 1,141                   | 2              | 18            |
| 63               | 970                     | 3              | 12            |
| 58               | 870                     | 4              | 9             |

Key Insights from Our Hackathon Experience

Participating in the Accelerate Omics Hackathon reinforced valuable insights for technical development and scientific collaboration.

On the technical side of machine learning workflows for large-scale scRNA-seq datasets, our takeaways are:

  • Data loading remains the primary bottleneck. Even with powerful GPUs, the speed of moving data from disk to memory dictates overall performance. Optimizing this step has the greatest impact on training efficiency.
  • GPU-based operations outperform CPU-based approaches. Performing sparse-to-dense conversion directly on the GPU proved far more efficient than CPU operations, especially at large scale.
  • Batch size matters. Increasing batch sizes, up to a reasonable threshold, significantly accelerated data loading and improved throughput without overwhelming system memory.
  • Scalability requires thoughtful design. Efficient handling of multi-file AnnData and distributed workers is essential to ensure workflows can keep pace with the rapid growth of omics datasets.

Beyond the technical takeaways, the hackathon provided valuable opportunities for learning and exchange. Access to office hours with NVIDIA and scverse teams and sharing sessions with other teams deepened our understanding of workflow pipelines for large-scale datasets. In addition, having access to Brev instances allowed us to test our pipelines across different environments, which gave us practical insights into scalability and performance. These experiences enriched the hackathon and reinforced the importance of community and shared infrastructure in advancing open-source solutions.

Many thanks to NVIDIA and the scverse community for organizing and supporting the Accelerate Omics Hackathon. The opportunity to learn, collaborate, and contribute to open science was invaluable, and we look forward to building on these insights in future work.

Next Steps for PROTOplast

Building on the initial success of improving training times for large datasets such as Tahoe 100M, we released an early developer preview of PROTOplast to support participants in the Virtual Cell Challenge (VCC). The library incorporates the solutions developed during the hackathon and is designed to accelerate ML training on large single-cell RNA sequencing datasets. PROTOplast supports multi-file AnnData and enables faster data loading, efficient tensor conversion, and memory-aware batch management. With the final VCC test set scheduled for release in late October and only a short time before submissions close, reducing training time becomes especially critical. This creates an opportunity for the community to experiment with PROTOplast in a competitive setting and help uncover additional performance bottlenecks.

Looking ahead, our roadmap for acceleration includes investigation into:

  • Direct data loading: enabling data to move directly from disk to GPU, reducing latency and bypassing CPU bottlenecks.
  • Scalable training: extending support to multi-GPU and multi-node environments to accommodate larger datasets and more complex models.
  • Accessible integration: providing a streamlined API that integrates smoothly with PyTorch Lightning, lowering barriers for adoption and community use.

Through these steps, our goal is to ensure that PROTOplast is flexible, robust, and aligned with the evolving needs of the single-cell research community.

Call to Action

PROTOplast is designed to accelerate large-scale single-cell analysis and streamline workflows for researchers. This first version is an early step, and we invite researchers and developers to try it out, explore its capabilities, and share feedback to help us refine and improve it. Together, we can make PROTOplast a stronger, more accessible tool that benefits the entire community and drives faster progress in single-cell research.