Submitted:
29 January 2026
Posted:
30 January 2026
Abstract
Keywords:
1. Introduction
- Multiple providers. Equivalent processing stages may be implemented by different providers with distinct interfaces and default behaviors, making the notion of “the pipeline” ambiguous unless provider choice is explicitly recorded.
- Multiple inputs of the same type. Real analyses frequently involve many recordings, where individual steps may produce zero, one, or multiple outputs per input, requiring an explicit mechanism for maintaining these associations.
- Multiple data types. Pipelines operate on heterogeneous data types (e.g., videos, cell sets, summary tables), each with different semantics. Steps must therefore know both the content and the type of each input to behave correctly.
- Multiple formats per data type. The same conceptual data type may be stored in different file formats depending on the algorithm or provider, requiring these formats to be distinguishable and explicitly tracked.
- High disk usage. Large imaging datasets and their intermediate products consume substantial disk space. Since many steps generate outputs comparable in size to their inputs, accumulated intermediates can quickly exceed available storage.
- Input selection. When multiple outputs of the same type are available, selection rules must be explicit to prevent accidental cherry-picking or inconsistent reuse.
- Algorithm parameterization. Reproducibility depends on recording parameter values. Default parameters must be easy to define, and the mapping between steps and parameters must remain unambiguous.
- Experiment reproducibility. Retaining only final outputs is insufficient without a systematic mechanism that records which algorithms ran, with which parameters, on which inputs.
- Testability. From a development perspective, the system must remain easy to modify without breaking existing behavior. Rapid, automated testing is essential to support safe, iterative development.
- Repeated analyses across subjects. Users often need to run identical analyses on data from different animals or sessions. Simple pipeline abstractions frequently assume a single subject, requiring more flexible execution models.
2. Methods
2.1. Traces
- Step number and branch
- Algorithm or provider implementation executed
- Parameter values used
- Input and output file paths
- Identifiers linking outputs back to original inputs
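The fields listed above could be captured in one serializable record per executed step. The following is a minimal Python sketch under stated assumptions, not the package's actual implementation: the class name `TraceRecord`, its field names, the example file names, and the choice of JSON as the storage format are all hypothetical and introduced here only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical trace entry describing one executed pipeline step."""

    step: int                 # step number within the pipeline
    branch: str               # branch on which the step ran
    provider: str             # provider implementation that was executed
    algorithm: str            # algorithm name (illustrative)
    params: Dict[str, Any]    # parameter values actually used
    inputs: List[str]         # input file paths
    outputs: List[str]        # output file paths
    # Maps each output path to the identifier of the original input it derives from.
    source_ids: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the record with sorted keys so traces diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only; none of these names come from the paper.
record = TraceRecord(
    step=2,
    branch="main",
    provider="example_provider",
    algorithm="motion_correction",
    params={"max_shift": 20},
    inputs=["rec_01.tif"],
    outputs=["rec_01-mc.tif"],
    source_ids={"rec_01-mc.tif": "rec_01"},
)

# Round trip: a record restored from its JSON form equals the original.
restored = TraceRecord(**json.loads(record.to_json()))
```

Keeping the output-to-input mapping inside the record means that a step producing zero, one, or several outputs per input can still be traced back without relying on file naming conventions.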
2.2. Algorithm Libraries and Providers
2.3. Wrappers Around Provider Algorithms
2.4. Parameter Handling
- Parameters passed directly to a pipeline step
- Pipeline-level default parameters
- Provider-level defaults
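If the three sources above are read as ordered from highest to lowest precedence (an assumption, since the list itself does not state an order), parameter resolution reduces to a layered dictionary merge. The sketch below is illustrative only: the function name `resolve_params` and the parameter names `max_shift` and `downsample` are hypothetical.

```python
from typing import Any, Dict, Mapping


def resolve_params(
    provider_defaults: Mapping[str, Any],
    pipeline_defaults: Mapping[str, Any],
    step_params: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge three parameter sources; later updates override earlier ones."""
    resolved = dict(provider_defaults)  # assumed lowest precedence
    resolved.update(pipeline_defaults)  # overrides provider-level defaults
    resolved.update(step_params)        # assumed highest precedence
    return resolved


# Illustrative values only.
params = resolve_params(
    provider_defaults={"max_shift": 10, "downsample": 1},
    pipeline_defaults={"downsample": 2},
    step_params={"max_shift": 20},
)
```

Here `max_shift` takes the step-level value and `downsample` takes the pipeline-level default. Storing the fully resolved dictionary, rather than only the overrides, keeps the mapping between a step and its effective parameters unambiguous.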
2.5. Keys and Identifiers
2.6. Storage Efficiency, Cleanup and Recovery
2.7. Branching and Batch Execution
2.8. Processing Tools and Interoperability
2.9. Quality Control
2.10. Test-Driven Development
3. Results
3.1. Source Repository
3.2. Documentation
3.3. Package Distribution
4. Discussion


Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).