We're excited to announce the early developer preview of PROTOplast, our new Python library designed for fast, scalable analysis of molecular data. PROTOplast addresses the unique challenges of working with large-scale molecular datasets while maintaining the flexibility needed for cutting-edge research.

What is PROTOplast?

PROTOplast is an open-source Python library, released under the Apache License 2.0, that bridges the gap between molecular data analysis and modern machine learning infrastructure. Its first major use case focuses on ML training for single-cell RNA (scRNA) sequencing data, with particular emphasis on perturbation prediction models.

We've accelerated the timeline for this developer preview to support contestants in the Arc Institute's Virtual Cell Challenge (VCC), recognizing the immediate need for robust tools in this rapidly evolving field. If you are participating in the VCC, try out PROTOplast and share your experience with us, along with suggestions for improvements to support your ML training process!

The Problems We're Solving

Working with molecular data at scale presents unique challenges that traditional ML pipelines weren't designed to handle:

Data Management Challenges

  • Staging data adds overhead: The anndata library reads AnnData files from local disk only, requiring data to be copied to the compute instance prior to analysis.
  • Loading data is time consuming: Large molecular datasets can take hours or days to load locally.
  • Densification is computationally heavy: Sparse matrices, which optimize the amount of storage for scRNA datasets, typically require densification for analyses.
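To make the densification point concrete, here is a small sketch (illustrative only, not PROTOplast code) of why scRNA count matrices are stored sparse and why models force a per-batch densification step:

```python
# Sketch: why densification is the heavy step for scRNA data.
# scRNA count matrices are mostly zeros, so they are stored sparse (CSR);
# many models, however, expect dense minibatches, so each batch must be
# densified on the fly. (Illustrative only -- not PROTOplast code.)
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_cells, n_genes = 10_000, 2_000

# ~95% zeros, typical of scRNA count data
dense = rng.poisson(0.05, size=(n_cells, n_genes))
counts = sparse.csr_matrix(dense)

sparse_mb = counts.data.nbytes + counts.indices.nbytes + counts.indptr.nbytes
print(f"sparse: {sparse_mb / 1e6:.1f} MB vs dense: {dense.nbytes / 1e6:.1f} MB")

# Densify only one minibatch at a time instead of the whole matrix
batch = counts[:1024].toarray()   # shape (1024, n_genes), dense
print(batch.shape)
```

Densifying one minibatch at a time keeps peak memory proportional to the batch size rather than the full dataset, which is why the dataloader, not the model, is usually the bottleneck.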

Scalability Challenges

  • Constrained memory: Bottlenecks occur when the size of the data greatly exceeds the amount of physical memory available on a machine.
  • Multi-worker coordination & cluster management: Managing distributed workloads across multiple workers requires specialized expertise.
  • Multiple code environments: Rewriting entire analysis pipelines is often necessary when scaling to cluster environments.

Key Features

PROTOplast addresses these challenges with a comprehensive set of features designed for molecular data workflows:

🚀 Fast Data Reading

PROTOplast includes a high-performance dataloader specifically optimized for reading directly from AnnData files (no conversion needed!), whether stored on local or cloud storage. Benchmarks show that PROTOplast achieves up to 7× faster model training than scDataset and over 1300× faster than the standard anndata library implementation (more details below). By reading directly from AnnData files, our implementation dramatically reduces the I/O bottlenecks that typically slow down molecular data analysis.

đŸ§© Flexible Model Training

PROTOplast’s modular design enables flexible customization for model training. Users can easily drop in preferred components that conform to standardized APIs, such as:

  • Model training plan: Encapsulate one's "secret sauce" by subclassing PyTorch Lightning’s LightningModule.
  • Data loader: Integrate custom data loading strategies that leverage PROTOplast's accelerated AnnData reading.
  • Compute resources: Specify the desired compute environment (e.g., CPUs vs. GPUs, number of workers, instance types).
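The plug-in pattern behind these components can be sketched with nothing but the standard library. Every name below (`TrainingPlan`, `ComputeConfig`, `run_training`) is a hypothetical illustration of "components that conform to standardized APIs", not PROTOplast's actual interface; in PROTOplast, the training plan is a PyTorch Lightning `LightningModule` subclass:

```python
# Sketch of the plug-in pattern described above, using only the stdlib.
# All names here are hypothetical illustrations -- NOT PROTOplast's API.
from abc import ABC, abstractmethod
from dataclasses import dataclass

class TrainingPlan(ABC):
    """Stand-in for a LightningModule-style training plan."""
    @abstractmethod
    def training_step(self, batch): ...

@dataclass
class ComputeConfig:
    """Stand-in for a compute-resource spec (workers, devices, ...)."""
    num_workers: int = 4
    use_gpu: bool = False

class MyPlan(TrainingPlan):
    def training_step(self, batch):
        # your "secret sauce" goes here; return a loss-like value
        return sum(batch) / len(batch)

def run_training(plan: TrainingPlan, loader, config: ComputeConfig):
    # the framework owns the loop; user code only fills in the pieces
    return [plan.training_step(batch) for batch in loader]

losses = run_training(MyPlan(), loader=[[1, 2, 3], [4, 5, 6]],
                      config=ComputeConfig(num_workers=2))
print(losses)
```

The design choice here is inversion of control: the framework owns orchestration, while users supply only the pieces that differ between experiments.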

📈 Effortless Scalability

From a laptop to a multi-node cluster, PROTOplast leverages Ray to scale with a single, simple syntax. No need to rewrite training loops or data pipelines—the same code runs consistently, while the framework handles orchestration, parallelism, and reproducibility.

Preliminary Benchmarks

Here is an example of PROTOplast's performance, from benchmarking we conducted as part of our participation in NVIDIA's Accelerate Omics hackathon.

Our benchmarking setup:

  • Compute environment: Brev - Massed Compute
  • Instance type: 1× NVIDIA A100-SXM4-80GB GPU
    • RAM: 98.3 GB
    • Disk speed: 3.3 GB/s
  • Data: Tahoe 100M (316GB)
  • Task: data loading and 1 training epoch
  • Simple classifier (2 layer MLP)
    • Batch size = 1024
    • Fetch factor = 16
    • Num workers = 12
  • The competition: anndata, annloader, scvi, scDataset
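The "simple classifier" used in the benchmark can be sketched as a two-layer MLP forward pass in NumPy. This is illustrative only: the hidden size, gene count, and class count below are hypothetical stand-ins, not the actual benchmark settings, and the real model would be trained with PyTorch:

```python
# Illustrative 2-layer MLP forward pass in NumPy. Layer sizes, gene count,
# and class count are hypothetical -- not the actual benchmark settings.
import numpy as np

rng = np.random.default_rng(42)
batch_size, n_genes, hidden, n_classes = 1024, 2000, 256, 50

x = rng.standard_normal((batch_size, n_genes)).astype(np.float32)
W1 = rng.standard_normal((n_genes, hidden)).astype(np.float32) * 0.01
W2 = rng.standard_normal((hidden, n_classes)).astype(np.float32) * 0.01

h = np.maximum(x @ W1, 0.0)            # layer 1: linear + ReLU
logits = h @ W2                        # layer 2: linear
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)   # softmax over classes
print(probs.shape)
```

With a model this small, almost all of the epoch time goes to loading and densifying batches, which is exactly what the benchmark is designed to measure.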

Benchmarking results on the full Tahoe-100M dataset.
The anndata implementation supports single files only, while scvi's AnnCollection implementation crashes.
annloader completed only 10% of the epoch after 24 hours; PyTorch estimated it would finish in 540.5 hours.

Furthermore, preliminary benchmarking results for multi-GPU training show that PROTOplast can train one epoch on the entire Tahoe-100M dataset in 14.5 minutes using four NVIDIA A100 GPUs. We expect even higher speedups with additional optimizations.

Our intent is to conduct a full suite of benchmarks that compares additional approaches. Stay tuned for more details in the near future!

Getting Started

PROTOplast is now available through the following channels:

We have also created notebooks to serve as usage tutorials. Access them here: https://protoplast.dataxight.com/tutorials/

We Want Your Feedback

This is just the beginning of our vision for PROTOplast. As an early preview release, PROTOplast will benefit tremendously from community input. Whether you're working on perturbation prediction, cell type classification, or other molecular ML applications, we want to hear about your experience:

  • How much does PROTOplast improve your ML training process?
  • What bottlenecks do you face in your everyday work that could be accelerated?
  • What features would make your workflow more efficient?
  • What additional data formats or model architectures should we prioritize?
  • How can we make the library even easier to use for researchers new to distributed computing?

For technical documentation, tutorials, and examples, visit our documentation site. Connect with the community on our GitHub page.