Submitted:
31 March 2025
Posted:
02 April 2025
You are already at the latest version
Abstract
Keywords:
1. Introduction
2. Framework Architecture and Workflow
2.1. Configuration-Driven Control
- Project Setup: Metadata, root directory paths.
- Data Sources: Paths and metadata for climate model outputs (GCM/RCM), soil profile databases, GIS shapefiles (e.g., administrative boundaries, soil maps), and location coordinate files.
- Computational Settings: Logging levels, number of parallel workers for local execution, flags and environment variable names for HPC job array integration.
- Climate Parameters: Active climate data sources, specific GCMs/RCMs and emission scenarios (e.g., SSPs) to include, target variables, reference and future time periods, spatial domain for subsetting, unit conversion specifications.
- Crop Model Selection: List of PSMs to run in the ensemble, paths to their executables, required model-specific inputs (e.g., module names, default cultivar IDs), and specification of expected output files for parsing.
- Simulation Design: Target simulation locations, baseline management practices (default sowing rules, fertilization, irrigation regimes).
- Adaptation Scenarios: Definitions of alternative management strategies or genetic traits, specifying parameter overrides relative to the baseline (e.g., earlier sowing dates, drought-tolerant cultivar ID, modified irrigation thresholds).
- Analysis Specification: List of standard output variables for analysis, the mapping rules from model-specific outputs to these standard names, definition of the baseline period for comparison.
- Optional Modules: Flags to enable/disable and configure parameters for advanced simulation options (biotic stress, soil carbon, HYDRUS) or the machine learning surrogate component.
2.2. Modular Pipeline Stages
- 00_setup_environment.py: Initializes the run by validating the config.yaml file, resolving all configured paths (relative to a defined base_dir or current working directory), checking for the existence of essential input files and specified crop model executables, and creating the required directory structure for outputs, logs, and intermediate simulation files.
- 01_prepare_climate_data.py: Handles the ingestion and preparation of climate data. It identifies relevant climate files based on the configuration (source, model, scenario, period), loads them (typically NetCDF) using xarray [24] (optionally leveraging dask [25] for parallel I/O and processing), performs spatial and temporal subsetting, selects required variables, executes necessary unit conversions (e.g., K to °C, flux rates to daily amounts), potentially renames variables to internal standards, and extracts point-based time series for each simulation location defined in the locations file. These extracted time series are cached (e.g., as Parquet files) to avoid redundant processing in subsequent steps. Optionally, it can compute derived climate indices (e.g., heat stress days, drought metrics) if the calculation logic is implemented within src/climate_processing.py.
- 02_setup_simulations.py: Orchestrates the generation of all necessary input files for each individual crop model simulation run. It iterates through the complete factorial combination of configured factors (location, climate projection, crop model, adaptation strategy, sowing date, etc.). For each unique combination (sim_id), it retrieves the corresponding cached climate data, fetches the relevant soil profile information (potentially performing a GIS lookup using geopandas [28] if needed via src/soil_processing.py), collates the specific management parameters by merging baseline settings with adaptation overrides, and then invokes the appropriate functions (generate_weather, generate_soil, generate_experiment) within the designated crop model interface module (src/crop_model_interface/) to create all files required by that model in its specific format within a dedicated simulation directory (simulations/setup/.../sim_id/). This stage also initializes or updates a central simulation status tracking file (e.g., simulations/simulation_status.csv or an SQLite database) marking newly prepared simulations as ready for execution (e.g., READY_TO_RUN).
- 03_run_simulations_parallel.py: Executes the crop model simulations identified as READY_TO_RUN in the status tracker. It manages parallel execution using Python's multiprocessing module for local runs (respecting num_workers in config) or, if configured for HPC use, distributes the simulation tasks across nodes within a job array by interpreting environment variables set by the scheduler (e.g., SLURM_ARRAY_TASK_ID, SLURM_ARRAY_TASK_COUNT). For each simulation task, it navigates to the appropriate working directory and invokes the run_model function from the corresponding crop model interface, passing the experiment filename and executable path. Upon model completion (or failure), it captures the exit status and any relevant messages returned by the interface and updates the central status tracker (e.g., to SUCCESS or RUN_ERROR), using file locking or atomic operations to ensure safe concurrent updates.
- 04_process_outputs.py: Handles the post-simulation data extraction and standardization. It identifies simulations marked as SUCCESS in the status tracker. For each, it calls the parse_output function of the relevant model interface, providing the simulation's working directory and the configuration specifying which output files to read. The interface returns a dictionary of parsed results. This script then applies the variable mapping rules (defined in config.yaml) via src/analysis.py to translate model-native output variable names (e.g., DSSAT's GWAD) into predefined standard names (e.g., Yield_kg_ha). If optional modules like HYDRUS or biotic stress simulation were enabled and implemented, their outputs are integrated or post-processing effects applied at this stage. Finally, it aggregates the standardized results and associated run metadata (location, climate scenario, adaptation, etc.) from all successfully processed simulations into a single consolidated dataset, typically saved in Parquet format for efficient downstream analysis (analysis_outputs/combined_results_std_vars.parquet). The status tracker is updated accordingly (e.g., to OUTPUT_PARSED or OUTPUT_ERROR).
- 05_analyze_impacts.py: Performs the core climate change impact analysis using the consolidated, standardized results from Step 04. It focuses on the subset of simulations run using the 'baseline' adaptation strategy. Key operations, implemented in src/analysis.py, include: calculating baseline statistics (e.g., mean yield, phenological dates) averaged over the historical reference period for each relevant grouping (location, sowing date, crop model); calculating future changes (both absolute differences and percentage changes) for various output variables relative to these calculated baselines for different future periods and scenarios; and computing ensemble statistics (e.g., mean, median, standard deviation, percentiles like p10 and p90) across the dimensions defined as ensemble members (typically GCM/RCM and/or PSM). Results are saved as CSV or other tabular formats in the analysis_outputs/ directory.
- 06_analyze_adaptations.py: Evaluates the effectiveness of the configured adaptation strategies. It compares the projected climate change impacts (i.e., the changes calculated in Step 05 or recalculated here across all adaptations) under each adaptation scenario against the impacts projected for the baseline adaptation scenario within the same future climate context (location, climate projection, period, etc.). The 'effectiveness' is typically calculated as the difference in the projected change (e.g., Effectiveness = Change_with_Adaptation - Change_with_Baseline). Functions in src/analysis.py perform these comparisons and can calculate ensemble statistics across adaptation effectiveness metrics. Results are saved to analysis_outputs/.
- 07_generate_visualizations.py: Creates graphical outputs based on the results generated in Steps 05 and 06. Using functions defined in src/visualization.py, which leverage libraries like matplotlib [26] and seaborn [27], it can produce various plot types: time series showing ensemble mean impacts and uncertainty ranges over future periods; box plots illustrating the distribution of impacts across ensemble members; bar charts comparing the effectiveness of different adaptation strategies; and spatial maps (using geopandas [28]) visualizing the geographic distribution of impacts or adaptation benefits, potentially aggregated to administrative boundaries. Figures are saved to the figures/ directory.
- 08_train_surrogate.py, 09_predict_surrogate.py (Optional): These scripts manage the machine learning surrogate modeling component. 08_train_surrogate.py uses the consolidated simulation database (combined_results_std_vars.parquet) as training data. It applies feature engineering (e.g., creating climate indices from daily data, encoding categorical inputs like GCM name or soil type), trains a specified ML model (e.g., Random Forest, configured via config.yaml) using scikit-learn [19] pipelines that include preprocessing steps (like scaling), evaluates the model's performance on a held-out test set, and saves the fitted pipeline object using joblib. 09_predict_surrogate.py loads a previously trained pipeline and uses it to rapidly generate predictions for target variables based on new input feature combinations provided in an external file, facilitating exploration beyond the explicitly simulated scenarios. The implementation resides in src/surrogate_model.py.
2.3. Crop Model Interfaces (src/crop_model_interface/)
- generate_weather(climate_df, site_info, output_path): Translates standardized climate pandas.DataFrame into the model's specific weather file format.
- generate_soil(soil_profile, output_path): Creates the model's soil file from a standardized dictionary representation of the soil profile (optional if soil data is embedded elsewhere).
- generate_experiment(exp_details, output_path, **kwargs): Constructs the main experiment or simulation control file, incorporating management details, cultivar parameters, simulation options, and links to other input files. May utilize template files (e.g., .apsimx).
- run_model(exp_file_name, executable_path, working_dir): Executes the model binary or script, monitors its execution, and returns a Status enum member (src/crop_model_interface/status_codes.py) indicating success or failure type, along with a descriptive message.
- parse_output(output_dir, output_files_config): Reads specified raw output files produced by the model and extracts relevant variables into a Python dictionary.
2.4. HPC Integration and Parallelism
- Simulation Parallelism: The 03_run_simulations_parallel.py script leverages Python's multiprocessing for efficient use of multi-core local machines. For cluster environments, it can automatically distribute simulation tasks across the nodes assigned to an HPC job array by parsing standard environment variables (e.g., SLURM_ARRAY_TASK_ID, SLURM_ARRAY_TASK_COUNT, configurable in config.yaml), allowing potentially thousands of simulations to execute concurrently.
- I/O and Memory Management: The optional use of dask [25] within xarray [24] enables lazy evaluation and parallel, chunked processing of large climate datasets, mitigating memory bottlenecks during climate data preparation (Step 01). Caching the extracted point climate data prevents costly re-reading and reprocessing for each simulation setup. The use of the efficient Parquet format for the consolidated results database (Step 04) aids downstream analysis performance.
- Modular Parallelism Potential: The framework's structure inherently allows for the future introduction of parallelism within other stages if they prove time-consuming for specific applications. For instance, climate point extraction (01), output parsing (04), or ML surrogate training (08) could be parallelized using libraries like joblib or further multiprocessing.
3. Discussion
3.1. Advantages and Novelty
- Configuration-Centric Workflow: Centralizing the entire experimental design—from input data selection and model choices to management scenarios, adaptation options, and analysis parameters—within a single, version-controllable YAML file provides a transparent and highly reproducible alternative to dispersed script parameters or complex GUI setups.
- Native Multi-Model Ensemble Support: The standardized interface system is fundamental, providing a clear pathway to incorporate diverse PSMs within a unified workflow. This facilitates direct comparison and the generation of MME results, crucial for representing simulation uncertainty. Standardized variable mapping ensures analytical consistency across models.
- Streamlined HPC Deployment: By directly interpreting HPC job array environment variables, PyCIAT simplifies the execution of large-scale simulation ensembles on clusters, reducing the need for users to write complex submission script logic for task distribution.
- Modularity and Extensibility: The division into discrete Python scripts and the abstract interface design enable researchers to readily substitute alternative methods (e.g., different climate downscaling techniques, new analysis metrics), extend functionality (e.g., incorporate economic models), or add support for additional crop models without requiring modification of the core framework logic.
- Leveraging the Python Ecosystem: Built entirely in Python, PyCIAT integrates seamlessly with the extensive scientific Python stack (pandas, xarray, geopandas, scikit-learn, matplotlib, seaborn, etc.), allowing users to leverage familiar tools for custom analysis, visualization, or further integration. The open-source nature encourages community development and adaptation.
3.2. Limitations and Future Directions
- Developing, testing, and distributing fully implemented interfaces for widely used versions of major PSMs (e.g., DSSAT, APSIM, STICS, potentially others like WOFOST or JULES-Crop).
- Implementing robust versions of the placeholder advanced simulation modules (e.g., generic pest/disease impact functions, interfaces to common soil carbon models, validated HYDRUS coupling routines).
- Integrating more sophisticated uncertainty quantification techniques beyond basic ensemble statistics (e.g., variance decomposition, parameter sensitivity analysis frameworks).
- Strengthening the concurrent status file update mechanism, potentially defaulting to an SQLite database backend or mandating the use of robust file-locking libraries.
- Optionally refactoring the script-based pipeline orchestration using a formal workflow manager (Snakemake/Nextflow) to gain enhanced features like intricate dependency tracking, caching of intermediate results, and detailed execution provenance.
- Establishing standardized mechanisms within the configuration to manage or link to crop model parameter files, particularly cultivar coefficients derived from calibration exercises.
- Advancing the ML surrogate component with features like automated hyperparameter tuning (e.g., using Optuna or scikit-learn's tools) and exploring techniques beyond standard regression (e.g., emulation of time-series outputs).
3.3. Applicability
- Impacts of various climate change scenarios (different GCMs/RCMs, SSPs/RCPs) and time horizons.
- Sensitivity to specific climate variables or extreme events.
- Uncertainty stemming from different climate models and crop simulation models through MME analysis.
4. Conclusion
Data Availability Statement
References
- Kaushik, P. (2025). Artificial Intelligence in Agriculture: A Review of Transformative Applications and Future Directions. Preprints. [DOI/URL Placeholder].
- Malhi, G. S., Kaur, M., & Kaushik, P. (2021). Impact of climate change on agriculture and its mitigation strategies: A review. Sustainability, 13(3), 1318. [CrossRef]
- IPCC. (2022). Climate Change 2022: Impacts, Adaptation and Vulnerability. Contribution of Working Group II to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press.
- FAO, IFAD, UNICEF, WFP & WHO. (2021). The State of Food Security and Nutrition in the World 2021. FAO. [CrossRef]
- Dave, K., Kumar, A., Dave, N., Jain, M., Dhanda, P. S., Yadav, A., & Kaushik, P. (2024). Climate change impacts on legume physiology and ecosystem dynamics: a multifaceted perspective. Sustainability, 16(14), 6026. [CrossRef]
- Jones, J. W., Hoogenboom, G., Porter, C. H., Boote, K. J., Batchelor, W. D., Hunt, L. A., Wilkens, P. W., Singh, U., Gijsman, A. J., & Ritchie, J. T. (2003). The DSSAT cropping system model. European Journal of Agronomy, 18(3-4), 235-265. [CrossRef]
- Holzworth, D. P., Huth, N. I., deVoil, P. G., Zurcher, E. J., Herrmann, N. I., McLean, G., Chenu, K., Van Oosterom, E. J., Snow, V., Murphy, C., Moore, A. D., Brown, H., Harsdorf, J., & Keating, B. A. (2014). APSIM–Evolution towards a new generation of agricultural systems simulation. Environmental Modelling & Software, 62, 327-350. [CrossRef]
- Brisson, N., Gary, C., Justes, E., Roche, R., Mary, B., Ripoche, D., Zimmer, D., Sierra, J., Bertuzzi, P., Burger, P., Bussière, F., Cabidoche, Y. M., Cellier, P., Debaeke, P., Gaudillère, J. P., Hénault, C., Maraux, F., Seguin, B., & Sinoquet, H. (2003). An overview of the STICS crop model. European journal of agronomy, 18(3-4), 309-332. [CrossRef]
- Eyring, V., Bony, S., Meehl, G. A., Senior, C. A., Stevens, B., Stouffer, R. J., & Taylor, K. E. (2016). Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geoscientific Model Development, 9(5), 1937-1958. [CrossRef]
- Giorgi, F., Jones, C., & Asrar, G. R. (2009). Addressing climate information needs at the regional level: the CORDEX framework. World Meteorological Organization (WMO) Bulletin, 58(3), 175-183.
- Brar, N. S., Bijalwan, P., Kumar, T., & Kaushik, P. (2024). Harnessing integrated technologies for profitable vegetable cultivation under changing climate conditions: An outlook. In Living with Climate Change (pp. 127-148).
- Brar, N. S., Kumar, T., & Kaushik, P. (2020). Integration of technologies under climate change for profitability in vegetable cultivation: an outlook. Preprints.
- Rosenzweig, C., Jones, J. W., Hatfield, J. L., Ruane, A. C., Boote, K. J., Thorburn, P., Antle, J. M., Nelson, G. C., Porter, C., Hillel, D., ... & Mutter, C. Z. (2014). Assessing agricultural risks of climate change in the 21st century in a global gridded crop model intercomparison. Proceedings of the National Academy of Sciences, 111(9), 3268-3273. [CrossRef]
- Müller, C., Elliott, J., Kelly, D., Arneth, A., Balkovič, J., Ciais, P., Deryng, D., Folberth, C., Hoek, S., Izaurralde, R. C., ... & Wang, X. (2017). Global gridded crop model evaluation: benchmarking, skills, deficiencies and implications. Geoscientific Model Development, 10(4), 1403-1422. [CrossRef]
- Asseng, S., Ewert, F., Rosenzweig, C., Jones, J. W., Hatfield, J. L., Ruane, A. C., Boote, K. J., Thorburn, P. J., Rötter, R. P., Cammarano, D., ... & Wolf, J. (2013). Uncertainty in simulating wheat yields under climate change. Nature Climate Change, 3(9), 827-832. [CrossRef]
- Martre, P., Wallach, D., Asseng, S., Ewert, F., Jones, J. W., Rötter, R. P., Boote, K. J., Ruane, A. C., Thorburn, P. J., Cammarano, D., ... & Zhu, Y. (2015). Multimodel ensembles of wheat growth: many models are better than one. Global change biology, 21(2), 911-925. [CrossRef]
- Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226-1227. [CrossRef]
- Gil, Y., Deelman, E., Ellisman, M., Fahringer, T., Fox, G., Gannon, D., Goble, C., Livny, M., Moreau, L., & Schopf, J. (2016, November). Examining the challenges of scientific workflows. Computer, 49(11), 26-34. [CrossRef]
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. the Journal of machine Learning research, 12, 2825-2830.
- Rosenzweig, C., Jones, J. W., Hatfield, J. L., Antle, J. M., Ruane, A. C., Boote, K. J., Thorburn, P. J., Valdivia, R. O., Porter, C. H., Janssen, S., & Winter, J. M. (2013). The Agricultural Model Intercomparison and Improvement Project (AgMIP): Protocols and pilot studies. Agricultural and Forest Meteorology, 170, 166-182. [CrossRef]
- Müller, C., Franke, J., Jägermeyr, J., Ruane, A. C., Elliott, J., Folberth, C., François, L., Hank, T., Izaurralde, R. C., Jones, C. D., ... & Wang, X. (2019). The global gridded crop model intercomparison phase 1 simulation dataset. Scientific data, 6(1), 50. [CrossRef]
- Köster, J., & Rahmann, S. (2012). Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520-2522. [CrossRef]
- Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature biotechnology, 35(4), 316-319. [CrossRef]
- Hoyer, S., & Hamman, J. J. (2017). xarray: ND labeled arrays and datasets in Python. Journal of Open Research Software, 5(1), 10. [CrossRef]
- Rocklin, M. (2015). Dask: Parallel computation with blocked algorithms and task scheduling. Proceedings of the 14th Python in Science Conference, 130-136. [CrossRef]
- Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in science & engineering, 9(03), 90-95. [CrossRef]
- Waskom, M. L. (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021. [CrossRef]
- Jordahl, K., Van Den Bossche, J., Fleischmann, M., McBride, J., Wasserman, J., Richards, M., Badar, M., Gerard, J., Snow, A. D., Paget, J., ... & geopandas contributors. (2014). GeoPandas: Python tools for geographic data [Software]. https://github.com/geopandas/geopandas.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
