PROTOplast
Accelerate your scRNA ML training.

A lightweight, open-source Python library for fast data loading, cloud-native workflows, and scalable ML training on massive datasets such as Tahoe-100M and X-Atlas/Orion.

Challenges

The Problems We're Solving.

Working with molecular data at scale presents unique challenges that traditional ML pipelines weren't designed to handle:

Data Management

Staging data adds overhead
AnnData reads from local file paths, so data must be copied to the compute instance before analysis.
Loading data is time-consuming
Large scRNA datasets remain slow to load, often taking hours to days, even on cloud or HPC systems.
Densification is costly
Sparse matrices, which reduce the storage footprint of scRNA datasets, must be densified before training, which costs time and memory.
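
To make the densification cost concrete, here is a rough back-of-the-envelope sketch in plain Python. The matrix shape (1,000,000 cells by 20,000 genes), the 5% non-zero fraction, and the CSR layout with 4-byte values, 4-byte column indices, and 8-byte row pointers are all illustrative assumptions, not figures from any particular dataset.

```python
def csr_bytes(n_rows, n_cols, density, value_bytes=4, index_bytes=4, indptr_bytes=8):
    """Approximate size of a CSR sparse matrix: values + column indices + row pointers."""
    nnz = round(n_rows * n_cols * density)  # number of stored non-zero entries
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * indptr_bytes


def dense_bytes(n_rows, n_cols, value_bytes=4):
    """Size of the same matrix once every entry, zero or not, is stored."""
    return n_rows * n_cols * value_bytes


# Hypothetical dataset shape, chosen only for illustration.
n_cells, n_genes, density = 1_000_000, 20_000, 0.05

sparse_size = csr_bytes(n_cells, n_genes, density)
dense_size = dense_bytes(n_cells, n_genes)

print(f"sparse (CSR): {sparse_size / 1e9:.1f} GB")
print(f"dense:        {dense_size / 1e9:.1f} GB")
print(f"blow-up:      {dense_size / sparse_size:.1f}x")
```

Under these assumptions the dense copy is about ten times larger than the sparse one, which is why densifying a whole dataset at once quickly becomes impractical.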

Scalability

Memory is constrained
Bottlenecks occur when the data no longer fits in a machine's physical memory.
Cluster management is complex
Managing distributed workloads across multiple workers requires specialized expertise.
Code environments are fragmented
Rewriting entire analysis pipelines is often necessary when scaling to cluster environments.
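
One generic way around the memory limit is to process rows in fixed-size batches, so that only one batch is ever held in dense form. The sketch below illustrates that general idea in plain Python; it is not PROTOplast's implementation, and the helper names `batch_ranges` and `peak_dense_bytes` are hypothetical.

```python
from typing import Iterator, Tuple


def batch_ranges(n_rows: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) row ranges that cover n_rows in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


def peak_dense_bytes(batch_size: int, n_genes: int, value_bytes: int = 4) -> int:
    """Memory needed to hold a single densified batch."""
    return batch_size * n_genes * value_bytes


# Ten rows in batches of four: the last batch is shorter.
ranges = list(batch_ranges(10, 4))
print(ranges)

# Assumed sizes for illustration: 1,024 cells per batch, 20,000 genes, float32.
print(peak_dense_bytes(1024, 20_000))
```

With batching, peak dense memory depends on the batch size rather than on the size of the whole dataset.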

How PROTOplast helps

PROTOplast was built to remove these bottlenecks.

1300X faster I/O than standard AnnData: training 1 epoch on the entire Tahoe-100M dataset once required 22.5 days and now takes only 14.5 minutes on an instance with 4 L40S GPUs, turning large-scale ML training from a bottleneck into a routine step.
Workflow              Elapsed         # of workers
AnnLoader (AnnData)   22.5 days       12
PROTOplast            14.5 minutes    12

* The benchmark was timed on 1 epoch, 2 MLP classifier, 4 NVIDIA L40S GPUs (see benchmarking scripts)
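
As a quick sanity check, the wall-clock ratio implied by the two elapsed times in the table above can be computed directly. This is only arithmetic on the published numbers; the 1300X headline is stated for I/O, which is presumably measured differently from end-to-end epoch time, so the two figures need not match.

```python
from datetime import timedelta

# Elapsed times taken from the benchmark table.
annloader_elapsed = timedelta(days=22.5)
protoplast_elapsed = timedelta(minutes=14.5)

# Dividing two timedeltas yields a plain float ratio.
speedup = annloader_elapsed / protoplast_elapsed
print(f"end-to-end epoch time ratio: about {speedup:.0f}x")
```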

Seamless integration: Easily plug in your secret sauce by subclassing PyTorch Lightning's LightningModule. This keeps full compatibility with the PyTorch ecosystem while giving you the flexibility to build specialized models for your molecular and single-cell data.
from state.tx.models.embed_sum import EmbedSumPerturbationModel
from protoplast import RayTrainRunner
trainer = RayTrainRunner(
   EmbedSumPerturbationModel,
   ...
)
Read the tutorial
Native cloud integration: Stream data directly from remote storage (S3, GCS, Azure) without intermediate downloads.
trainer.train([
   "s3://collaborator-1/cohort_1.h5ad",
   "gcs://collaborator-2/cohort_2.h5ad",
   "adl://collaborator-3/cohort_3.h5ad",
   "dnanexus://project-xxx:/cohort_4.h5ad",
], ...)
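
When a training run mixes sources like this, it can help to see up front which storage backends the paths touch, for example to confirm that credentials exist for each one. The sketch below groups paths by URI scheme using only the standard library; `group_by_scheme` is a hypothetical helper, not part of PROTOplast, and treating scheme-less paths as local files is an assumption.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_scheme(paths):
    """Group file paths by URI scheme; paths without a scheme count as local files."""
    groups = defaultdict(list)
    for path in paths:
        scheme = urlparse(path).scheme or "file"
        groups[scheme].append(path)
    return dict(groups)


paths = [
    "s3://collaborator-1/cohort_1.h5ad",
    "gcs://collaborator-2/cohort_2.h5ad",
    "adl://collaborator-3/cohort_3.h5ad",
    "dnanexus://project-xxx:/cohort_4.h5ad",
    "/data/local/cohort_5.h5ad",  # extra local path added for illustration
]

for scheme, members in sorted(group_by_scheme(paths).items()):
    print(f"{scheme}: {len(members)} file(s)")
```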
With PROTOplast, you can use 4 NVIDIA L40S GPUs to train 1 epoch on the entire Tahoe-100M dataset in 14.5 minutes, something that was previously infeasible.

Quick Start

(It’s simple)


Installation guide:

pip install protoplast

A minimal end-to-end example:

from protoplast import RayTrainRunner, DistributedCellLineAnnDataset, LinearClassifier
import glob

trainer = RayTrainRunner(
   LinearClassifier,  # replace with your own model
   DistributedCellLineAnnDataset,  # replace with your own Dataset
   ["num_genes", "num_classes"],  # change according to what you need for your model
)

file_paths = glob.glob("/data/tahoe100/*.h5ad")
trainer.train(file_paths)

That’s it — no extra code, no tuning. PROTOplast automatically scales across GPUs, nodes, or clusters.


Resources

{ 1 } Examples

Training perturbation prediction models on scRNA-seq data
Advancing precision in drug and gene response modeling.

Use with any classification model
Seamless integration with external and custom models.

Create a submission to the Virtual Cell Challenge
Step-by-step guide to packaging and submitting your model for evaluation.

{ 2 } Get started

Documentation
More Tutorials & Examples
Join our community: GitHub

© 2026 DataXight