Proteomics, the large-scale study of proteins, plays a critical role in early-stage research and development (R&D) by providing valuable insights into the molecular mechanisms driving biological processes. Proteins are the primary functional molecules in cells, and understanding their expression, modifications, and interactions is key to deciphering disease pathways, identifying biomarkers, and discovering potential drug targets. In the early phases of R&D, proteomics enables researchers to identify and quantify proteins across different conditions, offering a deeper understanding of cellular functions and responses. This knowledge is essential for developing new therapies, enhancing drug efficacy, and predicting patient outcomes. Moreover, proteomics complements genomic data by translating genetic information into functional biology, making it indispensable for advancing innovative treatments and precision medicine.
Target audience: This blog post is for data teams, scientists, and team leaders interested in building self-service, no-code exploration dashboards on large datasets for their end users.
User challenges in proteomics
In the rapidly evolving landscape of scientific research, data has become both our most valuable resource and our greatest challenge. As datasets grow larger and more complex, scientists increasingly struggle to extract meaningful, actionable insights from them. The challenges they face include data heterogeneity and compatibility, scalability, reproducibility, tool availability, and usability. As such, proteomics data analysis is a prime candidate to benefit from modern computational approaches that can handle extraordinary data complexity.
Overcoming these challenges typically requires expert software engineering skills and processes, which are often unavailable to scientists. To bridge this gap, we need a simple, approachable solution that enables scientists to do the following:
- Process enormous volumes of information quickly
- Handle complex, multi-structured data
- Create compelling, interactive visualizations
- Provide intuitive interfaces for data exploration
Our approach
Our technology stack offers an integrated and approachable solution for addressing these challenges. The stack comprises
- Python as the programming language
- Ray for write-once, run-everywhere scalable distributed computing
- Daft for scalable, flexible data processing
- Taipy for easy-to-create scalable user interfaces
- Plotly for scalable, interactive data visualization
Python as the programming language
Python is highly suitable for biomedical data analysis due to its versatility, rich ecosystem of libraries, and ease of use. Key libraries like NumPy offer powerful data manipulation and analysis capabilities, and SciPy and Statsmodels offer robust statistical tools for hypothesis testing and data normalization. For specialized proteomics workflows, libraries such as pyteomics and tools from the Mann Lab (e.g. alphapept, alphatims, alphamap, alpharaw) enable efficient access to and visualization of raw mass spectrometry data as well as quantified peptides and proteins. Additionally, Python's integration with machine learning libraries like Scikit-learn and deep learning frameworks such as TensorFlow and PyTorch makes it ideal for predictive modeling, biomarker discovery, and complex data visualization. With a broad range of resources and active community support, Python is a powerful, accessible language for proteomics researchers.
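As a quick illustration, here is a minimal sketch of a differential-abundance test using NumPy and SciPy; the intensity values are made up for illustration.

```python
import numpy as np
from scipy import stats

# log2-transformed intensities for one protein, measured in
# control and treated samples (illustrative values)
control = np.array([21.3, 20.8, 21.6, 21.1])
treated = np.array([23.0, 22.6, 23.4, 22.9])

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
log2_fold_change = treated.mean() - control.mean()

print(f"log2 fold change: {log2_fold_change:.2f}, p-value: {p_value:.4f}")
```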
Ray as the distributed computing framework
Ray is a distributed computing framework designed to scale effortlessly across multiple processors or nodes, making it ideal for parallelizing proteomics workflows that require significant computational resources, such as protein identification, quantification, and data normalization. By leveraging Ray, researchers can efficiently process and analyze massive proteomics datasets, speeding up tasks like statistical analysis, machine learning, and data integration. Whether running on a single machine or across a cloud cluster, Ray provides an excellent framework for speeding up proteomics data analysis while maintaining scalability and flexibility. Complex protein interaction analyses that would traditionally take weeks can be completed in hours or even minutes.
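For example, here is a minimal sketch of fanning per-protein work out across all available cores or nodes with Ray; the `analyze_protein` body is a placeholder for a real computation.

```python
import ray

ray.init()  # connects to a cluster if one is configured, else runs locally

@ray.remote
def analyze_protein(protein_id: str) -> tuple[str, float]:
    # stand-in for an expensive per-protein computation
    score = sum(ord(c) for c in protein_id) / 1000.0
    return protein_id, score

protein_ids = [f"P{i:05d}" for i in range(10_000)]

# Launch tasks in parallel, then gather the results
futures = [analyze_protein.remote(pid) for pid in protein_ids]
results = ray.get(futures)
print(results[:3])
```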
Daft as the high-performance query engine
We selected Daft as our data processing engine. From its website:
Daft is a unified data engine for data engineering, analytics, and ML/AI. It exposes both SQL and Python DataFrame interfaces as first-class citizens and is written in Rust. Daft provides a snappy and delightful local interactive experience, but also seamlessly scales to petabyte-scale distributed workloads.
In proteomics, datasets can include millions of protein interactions, post-translational modifications, and multi-dimensional experimental results. Daft provides the computational muscle to handle these challenges, as its capabilities include
- Scalable processing of massive datasets
- Efficient handling of complex, nested data structures
- Lazy query evaluation that optimizes computational resources
- Parallel computing capabilities for rapid analysis
- Seamless integration with existing Python scientific libraries
Imagine processing protein interaction networks from thousands of experimental samples: Daft makes this feasible by optimizing the query plan and parallelizing execution.
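To give a flavor of the API, here is a minimal sketch of a lazy Daft query; the file name and column names are hypothetical.

```python
import daft
from daft import col

# Lazily scan a (potentially huge) table of peptide-level measurements
df = daft.read_parquet("peptide_intensities.parquet")

# Filter low-confidence identifications and aggregate to protein level;
# nothing executes until the query is collected
summary = (
    df.where(col("q_value") < 0.01)
      .groupby("protein_id")
      .agg(col("intensity").mean().alias("mean_intensity"))
)

summary.show()  # triggers execution of the optimized query plan
```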
Taipy as the framework for GUI data applications
Taipy is a versatile framework designed for building data-driven applications. Its seamless integration with Python's data analysis ecosystem lets developers easily build web-based GUI applications for visualizing proteomics data, and lets researchers process and analyze proteomics data from the web browser. Additionally, Taipy has a rich set of features for data pipeline orchestration, enabling tight integration between the data backend and the dashboard frontend.
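Here is a minimal sketch of a Taipy page with a two-way-bound selector; the gene list is made up.

```python
from taipy.gui import Gui

genes = ["TP53", "EGFR", "BRCA1"]
selected_gene = genes[0]

# Taipy pages are declared as Markdown with visual-element tags;
# {selected_gene} is two-way bound to the Python variable above
page = """
## Protein expression explorer

Select a gene: <|{selected_gene}|selector|lov={genes}|dropdown|>

You picked: <|{selected_gene}|text|>
"""

if __name__ == "__main__":
    Gui(page).run()
```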
Plotly as the visualization framework
Plotly is an excellent tool for proteomics data analysis, offering powerful, interactive visualizations that are ideal for exploring complex biological datasets. In proteomics, where researchers often deal with high-dimensional data such as protein abundances, peptide intensities, or post-translational modifications, Plotly enables dynamic visual exploration through heatmaps, scatter plots, volcano plots, and interactive 3D surface plots. These visualizations help identify patterns, clusters, and outliers in large-scale proteomics datasets, which are crucial for tasks like biomarker discovery and pathway analysis.
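As an example, here is a minimal sketch of an interactive volcano plot with Plotly Express, using synthetic data in place of real measurements.

```python
import numpy as np
import pandas as pd
import plotly.express as px

# Synthetic differential-abundance results for 500 proteins
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "protein": [f"P{i:04d}" for i in range(500)],
    "log2_fc": rng.normal(0, 1.5, 500),
    "p_value": rng.uniform(1e-6, 1.0, 500),
})
df["neg_log10_p"] = -np.log10(df["p_value"])
df["significant"] = (df["p_value"] < 0.05) & (df["log2_fc"].abs() > 1)

fig = px.scatter(
    df, x="log2_fc", y="neg_log10_p",
    color="significant", hover_name="protein",
    labels={"log2_fc": "log2 fold change", "neg_log10_p": "-log10(p)"},
)
fig.show()
```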
As of December 2024, Plotly integrates natively, via Narwhals, with a variety of dataframe libraries rather than Pandas alone, which also benefits our use of Daft.
Poetry for software dependency management
Finally, we want to acknowledge our use of Poetry. From its website:
Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and it will manage (install/update) them for you. Poetry offers a lockfile to ensure repeatable installs, and can build your project for distribution.
While Poetry is technically part of our development process (and not the application stack per se), we mention it because it is the tool we use to automate the installation of the aforementioned libraries and ensure they work well together. This includes ensuring that when we upgrade a library, its dependencies are transitively upgraded as needed. For proteomics researchers, using Poetry means that complex analysis pipelines can be easily shared, reproduced, and scaled across different computational environments.
Putting it all together
Let's explore a concrete scenario of how this technology stack can advance proteomics research. We have created a toy example that uses the aforementioned libraries to build a data application for biomarker investigation. The dataset contains staining profiles for proteins in 20 human tumor tissues, based on immunohistochemistry using tissue microarrays. The data was downloaded from the Human Protein Atlas version 24.0 and Ensembl version 109 as a TSV file. To navigate the dashboard, the user selects a gene to see the count of patients annotated for different protein expression levels ("High", "Medium", "Low", and "Not detected") across different cancer types.
The code itself is succinct: about 100 lines in total, including whitespace. This simplicity promotes rapid development, reducing the time and effort required to build functional software. By streamlining complex processes and providing pre-built components, the stack enhances productivity and minimizes the need for repetitive coding. It also lowers the barrier to entry, making it easier for beginners to create applications without deep knowledge of programming intricacies.
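To give a sense of the pattern, here is a condensed sketch of the dashboard's core logic; the file and column names are assumptions, and the full example on GitHub differs in detail.

```python
import daft
from daft import col
import plotly.express as px
from taipy.gui import Gui

# Load the tab-separated Human Protein Atlas pathology data once at startup
data = daft.read_csv("pathology.tsv", delimiter="\t")

genes = sorted(data.select("Gene name").distinct().to_pydict()["Gene name"])
selected_gene = genes[0]

def make_figure(gene: str):
    # Filter to the chosen gene, then reshape the patient counts per
    # expression level and cancer type for plotting
    subset = data.where(col("Gene name") == gene).to_pandas()
    long = subset.melt(
        id_vars=["Cancer"],
        value_vars=["High", "Medium", "Low", "Not detected"],
        var_name="Expression level",
        value_name="Patients",
    )
    return px.bar(long, x="Cancer", y="Patients", color="Expression level")

figure = make_figure(selected_gene)

def on_change(state, var_name, value):
    # Re-render the chart whenever the user picks a new gene
    if var_name == "selected_gene":
        state.figure = make_figure(value)

page = """
## Biomarker explorer
<|{selected_gene}|selector|lov={genes}|dropdown|>
<|chart|figure={figure}|>
"""

if __name__ == "__main__":
    Gui(page).run()
```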
Try it out and let us know what you think, whether via GitHub or by emailing us directly at solutions@dataxight.com.
Also share it with your network!
Summary
The synergistic combination of Python, Ray, Daft, Taipy, and Plotly represents more than a technology stack: it's a paradigm shift in how we manage biological complexity with simple, approachable in-silico tools. This integrated approach offers some key advantages, including
- Computational scalability for massive datasets
- Reproducible research environments
- Interactive, intuitive data exploration
- Rapid iteration of scientific hypotheses
- Enhanced collaboration through shareable applications
By providing researchers with powerful, flexible tools, we're accelerating the pace of scientific discovery, turning massive, opaque datasets into clear, actionable insights.