The Clinical Data Analysis Pipelines (CDAP), originally developed by the NIH Office of Cancer Clinical Proteomics Research (OCCPR), formerly the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and now hosted by the NIH Proteomic Data Commons (PDC), standardize proteomics data processing to reduce variability and enable cross-dataset comparisons. Public dissemination of these Galaxy workflows on GitHub is part of the NIH's support of FAIR data principles. While these pipelines represent a promising beginning, scientists still face obstacles when attempting to implement the workflows in their own settings. This post shares our experience reproducing the CDAP pipelines to reanalyze public datasets: the challenges we encountered and the solutions we implemented to build a reproducible, validated pipeline for proteomics studies, particularly when augmenting one's own data with public data from the PDC.

Understanding PDC CDAP: Structure and Goals

The CDAP workflow is structured to execute in two stages, for both label-free and TMT-labeled datasets:

  1. The analysis workflow handles the fundamental data processing tasks. Raw mass spectrometry files are first converted into open file formats such as MGF or mzML. Then, peptide identification is performed using a search engine with curated protein databases, and quantification integrates spectral and identification data to produce interpretable results.
  2. The second workflow processes and annotates peptide-spectrum matches, performs gene and protein inference, and estimates the false discovery rate to ensure confident protein identification. It also produces summary and quantification outputs, including spectral-count and precursor-area reports, which can be used for downstream analysis.
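To illustrate the quantification idea in the first stage (integrating spectral and identification data), the sketch below tallies spectral counts per protein from toy PSM records. The record layout, accessions, and values are invented for illustration; the actual CDAP tools and reports are far richer.

```python
from collections import Counter

# Toy PSM records: (spectrum id, peptide sequence, inferred protein).
# These are hypothetical values, not real CDAP output.
psms = [
    ("scan1", "PEPTIDEK", "P12345"),
    ("scan2", "PEPTIDEK", "P12345"),
    ("scan3", "ANOTHERK", "P67890"),
]

def spectral_counts(psms):
    """Tally peptide-spectrum matches per protein (spectral counting)."""
    return dict(Counter(protein for _spectrum, _peptide, protein in psms))

print(spectral_counts(psms))  # {'P12345': 2, 'P67890': 1}
```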

Each workflow is described in the publication "A Description of the Clinical Proteomic Tumor Analysis Consortium (CPTAC) Common Data Analysis Pipeline" (Rudnick et al., J. Proteome Res., 2016). The tools and workflows were designed to run on the Galaxy platform and are publicly available on GitHub at https://github.com/cptac3-cdap.

The Challenges and Our Approaches

While the CDAP workflows are well-designed, running them outside the PDC environment reveals several technical challenges that make the analysis hard to reproduce and apply to new data.

Challenge 1: Legacy tool availability

ReAdw4Mascot and ProMS are no longer maintained or available in the Galaxy Tool Shed, which means they cannot simply be installed through Galaxy's standard mechanisms.

  • While ReAdw4Mascot is still available for download from the NIST software portal, it is a single, older version last updated on 2013-06-04, whereas the Galaxy workflow references multiple versions.
  • ProMS, another NIST tool, has been discontinued, and the specific version referenced in the CDAP PDC pipeline is likewise not readily available.

Our approach: Our first priority was the tool-availability problem. We began by cloning the CDAP PDC tool repository to retrieve what we could, and eventually located the multiple required versions of the tools through the Edwards Lab (Georgetown University Medical Center). Having located the executables, we containerized them for integration into the Galaxy Docker environment, which ensured compatibility with the CDAP workflows and stable execution despite the tools being Windows-only and unsupported.

Challenge 2: Disparate execution environments 

Several tools essential to CDAP workflows are Windows-only applications (e.g., msconvert, ReAdw4Mascot, and ProMS). While users can run these tools locally on Windows machines, processing large datasets often requires deployment in the cloud, where Linux is the preferred platform for automation and scalability, creating challenges for integrating Windows-only tools into a cloud-based workflow. Additionally, the tools in the workflows require a mixture of software versions, most notably with some components requiring Python 2 and others needing Python 3.

Our approach: We successfully installed and configured all required tools for both label-free and TMT datasets within a local Galaxy instance running on Ubuntu 24.04. Python 2 scripts were modernized for Python 3 using 2to3 for syntax updates and black for code formatting. To enable a reproducible deployment environment, we built a Docker image with all CDAP-related dependencies, combining R packages, Python environments, and Windows-only tools. Tools like msconvert, which require Windows, were executed using the default CDAP setup via a Wine-based Docker image. These changes allowed us to run the complete workflow reproducibly on any system in an automated manner.
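For a flavor of what the Python 2 to 3 migration involved, the toy snippets below show the kinds of rewrites 2to3 applies: print statements, dict iteration methods, and integer division. These are illustrative examples, not the actual CDAP scripts.

```python
# Illustrative Python 2 -> 3 rewrites of the kind 2to3 performs
# (invented snippets, not taken from the CDAP codebase):

d = {"spectra": 10, "peptides": 7}

# Python 2: d.iteritems() and d.has_key("spectra")
items = list(d.items())       # .iteritems() becomes .items()
has_key = "spectra" in d      # has_key() becomes the `in` operator

# Python 2: 7 / 2 == 3 (floor division); Python 3 needs // for that
half = 7 // 2                 # 3, matching the Python 2 behavior
ratio = 7 / 2                 # 3.5, true division in Python 3

# Python 2: print "done" (a statement); Python 3: print() function
print("done")
```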

Challenge 3: Workflow design limitations

The image above shows the complete CDAP pipeline as a single Galaxy workflow, formed by connecting its two constituent parts.

The first step of the CDAP PDC pipeline, "Download CPTAC data files," was designed to download raw files from the CPTAC repository one file at a time. For researchers using their own datasets, such as multiple local RAW files uploaded to Galaxy, this design presents a major limitation: the workflow provides no built-in support for batch processing.

Another notable design limitation lies at the conclusion of the first workflow: it produces several output files but does not automatically aggregate them into a single input for the summary report workflow. Users must instead manually combine and re-upload the data to Galaxy. This manual intervention is error-prone, time-consuming, and disrupts automation, particularly when dealing with a large volume of samples.

Our approach: To make the workflow scalable for large studies and compatible with users' own data, we prepared a more flexible input step that lets researchers combine all data into one collection, which is then processed in parallel within a single run, eliminating the need for per-file execution. We also modified the final step of the first workflow to automatically aggregate all individual output files into a single, structured input that passes directly to the second workflow, removing the need for manual curation and input preparation. Additionally, we linked the two previously separate workflows into one unified pipeline, allowing users to execute the complete process from raw data to summary report in a single run.
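As a rough sketch of the aggregation idea (not the actual Galaxy tool we built), the helper below merges per-sample TSV reports into one table with a leading sample column. The column names and values are invented for illustration; the real CDAP reports contain many more fields.

```python
import csv
import io

def aggregate_reports(per_sample):
    """Merge per-sample TSV reports (each with its own header) into one
    table prefixed with a `sample` column. Simplified stand-in for the
    aggregation step we inserted between the two workflows."""
    out = io.StringIO()
    writer = None
    for sample, tsv_text in per_sample.items():
        reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
        for row in reader:
            if writer is None:
                fields = ["sample"] + reader.fieldnames
                writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t")
                writer.writeheader()
            writer.writerow({"sample": sample, **row})
    return out.getvalue()

# Hypothetical per-sample outputs from the first workflow:
merged = aggregate_reports({
    "S1": "protein\tcount\nP12345\t8\n",
    "S2": "protein\tcount\nP12345\t5\nP67890\t2\n",
})
```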

To ensure the pipeline generates valid results after our modifications, we tested the complete CDAP PDC workflows using representative datasets from both label-free and TMT experiments with published results available on the PDC, allowing us to compare our outputs against established benchmarks. This not only confirmed that our environment and tool configurations were functioning correctly but also helped us understand the expected behavior and output characteristics of each workflow and type of dataset. 
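A minimal sketch of the kind of benchmark check this enables, assuming spectral counts keyed by protein accession. The function name, tolerance, and data are illustrative assumptions; our actual validation compared the full published report tables.

```python
def flag_discrepancies(ours, published, tolerance=0.05):
    """Return proteins whose counts deviate from published PDC values
    by more than `tolerance` (relative difference). Illustrative only;
    missing proteins are treated as an observed count of zero."""
    flagged = {}
    for protein, expected in published.items():
        observed = ours.get(protein, 0)
        if expected and abs(observed - expected) / expected > tolerance:
            flagged[protein] = (observed, expected)
    return flagged

# Hypothetical comparison against published counts:
result = flag_discrepancies({"A": 100, "B": 50}, {"A": 101, "B": 80})
print(result)  # {'B': (50, 80)}
```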

Challenge 4: Limited documentation 

While the CDAP PDC pipeline is supported by publications and workflow overviews (2014, 2016, 2018), some tool-specific parameter setups are missing. The 2014 summary report offers the most insight, yet it is not linked in the GitHub repository, requiring users to search for it externally. Without a clear README or user guide outlining the workflow structure, tool dependencies, or file organization, new users struggle to navigate or adapt the pipeline.

A key issue is the absence of official documentation for ProMS, a critical tool. Users must guess how to configure it correctly, as the only reference, ParametersForProMS.txt, contains vague and unexplained entries. Additionally, ProMS accepts input only from ReAdw4Mascot (MGF and mzXML), not from other converters such as msconvert or ThermoRawFileParser (a wrapper for Thermo Fisher's RawFileReader released in 2021).

Reference database documentation is also limited. The CDAP paper describes a single composite FASTA file (RefSeq H. sapiens, S. scrofa trypsinogen, M. musculus), with M. musculus sequences excluded for TCGA human samples. However, additional FASTA files are referenced with minimal explanation or documentation regarding their purpose. 

The summary report workflow, much like the main workflow, lacks clear instructions regarding input formats, parameters, and expected outputs, which complicates interpretation and validation.

Our approach: To address the documentation limitation, we developed comprehensive documentation through testing, manual inspection, and literature review. This includes:

  • Detailed tool documentation covering dependencies, input/output formats, and parameters.
  • Clarified undocumented tool dependencies (e.g., ReAdw4Mascot's output being the required input for ProMS).
  • Comprehensive workflow-level guides for both label-free and TMT datasets, including analysis and reporting steps.
  • Adopted a wiki to centralize, share, and preserve knowledge within our team.

Suggestions and tips for reproducing CDAP PDC workflows

Reproducing the CDAP PDC pipelines outside the original environment can be rewarding but requires thoughtful planning and careful execution. Based on our experience, here are some practical suggestions for teams looking to reproduce these workflows:

  • Start with a clean Galaxy environment: Avoid configuration conflicts by using a fresh Galaxy instance, ensuring a consistent baseline and simplifying dependency management.
  • Test each component independently: Before running full pipelines, test each tool individually with representative input. Validate that intermediate outputs match expected formats and values to identify problems early.
  • Document every configuration and change: Maintain a detailed changelog of tool versions, installation steps, environment modifications, and parameter values. This documentation is essential for debugging, team collaboration, and future reproducibility.
  • Build a comprehensive test suite early in the process: Use a small subset of public CPTAC datasets or small test cases to verify your workflow implementation. Step-by-step comparison with published results can help identify silent failures or incorrect assumptions.
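One lightweight pattern for the test-suite suggestion above is checksumming workflow outputs so a rerun can be diffed against a known-good baseline. This is a sketch under assumptions (directory layout and file pattern are invented); for numeric reports, a tolerance-based comparison is usually more appropriate than exact hashes.

```python
import hashlib
from pathlib import Path

def checksum_outputs(directory, pattern="*.tsv"):
    """SHA-256 checksums of workflow output files, keyed by filename,
    for diffing a rerun against a known-good baseline. Exact hashes
    catch silent failures but are too strict for floating-point output."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(directory).glob(pattern))
    }
```

In practice we would record the baseline dictionary once from a validated run, commit it alongside the workflow, and assert equality (or bounded numeric differences) after every change.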

Alternatively, reach out to us! Our team at DataXight would welcome the opportunity to engage with the scientific community and share what we have learned.

Summary

Fulfilling the promise of FAIR science hinges on our capacity for consistent and reproducible data analysis. The CDAP PDC workflows represent an important step toward standardized, reproducible proteomics analysis. Our experience demonstrates that achieving true reproducibility also requires addressing fundamental challenges in software maintenance, documentation, and platform portability. By sharing our lessons learned, we hope to contribute to ongoing community efforts to improve computational reproducibility in proteomics.

Ready to tackle your own mass spec proteomics challenges? DataXight's team of computational biologists and distributed systems experts can help you build optimized pipelines that scale with your data. Contact us at solutions@dataxight.com to discuss your next project.

Acknowledgment: This work was made possible through the collaborative efforts of DataXight’s ProtoXight team. Their contributions were critical in configuring the Galaxy environment and ensuring a reproducible deployment of the CDAP workflows. We sincerely thank the team for their technical expertise and dedication throughout this project.