1. Introduction
The frequency and severity of extreme weather events driven by climate change is escalating and are placing unprecedented stress on critical infrastructure systems, including transportation networks, port facilities, energy grids, and hydraulic structures. The IPCC’s Sixth Assessment Report (AR6) [
1] envisions risk through three interacting drivers, namely, hazard, exposure, and vulnerability. This framework establishes a robust scientific foundation for conducting infrastructure risk assessments. Despite this clear framework, creating tools that practitioners can easily use is a major challenge as many existing climate risk decision support tools (DSTs) demand advanced technical skills. Several recent studies have proposed frameworks for climate adaptation and resilience assessment of infrastructure systems [
2,
3,
4]. Nevertheless, non-experts often struggle to understand and apply the results. This was confirmed by Šedov’a et al. [
5] in their recent 2024 review, where they identified communication of uncertainty as one of the main gaps in existing climate risk assessment tools. This is not a technical challenge as the data layers required for an open-access alternative are nowadays available. OpenStreetMap (OSM) offers flexible, on-demand access to infrastructure data while being license-free, also European climate projection model suites such as EURO-CORDEX provide spatially consistent, RCP-aligned hazard information across the continent [
6]. By merging these datasets, it is possible to build intuitive interfaces for climate risk screening. Yet, the field lacks a unified, reproducible platform that non-specialists can operate fluidly without sacrificing processing speed.
In this context particularly nature-based solutions (NbS) have gained popularity as a mitigation strategy for climate-related risks that are affecting infrastructure. Institutions like IUCN [
7], the World Bank [
8], and the European Commission [
9], have highlighted the role of NbS as cost-effective reinforcements, and in some cases alternatives, to conventional grey infrastructure, especially for hazards such as flooding, erosion, and landslides. Embedding NbS options within the risk assessment step itself, rather than scheduling them as a follow-on activity, could shorten the distance between identifying a problem and implementing a response. Despite their potential for disaster risk reduction, quantitative risk models have largely not been extended to incorporate NbS, and published work on spatial tools or software systems for deploying them at scale remains limited [
10,
11]. Researchers working on climate services have framed a related difficulty: even technically robust tools tend to fall short of actual use when their outputs are expressed in terms that practitioners cannot readily trust or act on [
12,
13]. Bridging that gap requires not just a strong computational backend, but an interpretation layer that turns index scores into contextually grounded narrative; historically, that interpretation has only been delivered through manual expert reporting, at high cost and low scalability.
Recent advances in large language models (LLMs) open a new route to that interpretation layer: automated generation of natural-language reports from structured quantitative data. LLMs are being applied across different fields of science to generate context-aware summaries, answer domain-specific questions, and automate structured research narratives [
14]. Yet, trusting LLMs with high-stakes infrastructure decisions brings a major complication [
15]. These models might "hallucinate" and generate content that are plausible but factually unsupported [
16,
17]. Since this specific workflow relies on strict numerical indices to justify physical engineering interventions, any deviation from the raw data fatally compromises the tool. To mitigate this problem, several solutions have been proposed, including retrieval-augmented generation [
18], chain-of-thought prompting [
19], and in-context learning with worked exemplars [
14,
20].
Only a handful of recent projects have attempted to ground LLMs specifically in climate contexts, and their objectives differ sharply from the approach proposed here. ChatClimate [
21], for example, anchors a chat interface in the IPCC AR6 corpus. It functions as an exploratory dialogue tool for climate science, which is functionally distinct from the goal of translating structured risk metrics into standardized practitioner reports. Another effort, CHATREPORT [
22], targets corporate environmental, social, and governance (ESG) reporting. It extracts answers from sustainability documents based on templates from the Task Force on Climate-related Financial Disclosures (TCFD), focusing strictly on document-level analysis rather than data-to-text generation. The present work is also explicitly distinguished from ClimateBert [
23]. Because ClimateBert operates as a domain-specific encoder, researchers use it primarily to classify claims or analyze sentiment, rather than to generate cohesive, readable narratives from scratch. The present work occupies a different point in this design space: it does not retrieve from a document corpus, and it does not classify text. Instead, it translates a small, highly structured input table of quantitative risk indices into a stakeholder-facing narrative through one-shot in-context learning, with the input data itself acting as the grounding source. To the authors’ knowledge, the systematic integration of LLM-generated narrative into a chained, index-driven climate risk assessment, together with a controlled ablation of the contribution of system instructions versus embedded exemplars, has not previously been reported in this domain.
This paper addresses the interpretation and communication gap within the context of an end-to-end decision support platform developed by the author. The platform, briefly described in
Section 2.1, integrates polygon-based site characterization (OSM extraction plus Köppen-Geiger sampling), a qualitative perceived-risk workflow, and a quantitative IPCC AR6-aligned risk-assessment chain whose outputs feed a downstream nature-based solution recommendation engine. A Grounded Reporting Framework (GRF) is applied across each workflow’s analytical outputs, using one-shot in-context learning to translate quantitative results into stakeholder-facing narratives. The platform is introduced here solely as the deployment context for the methodological contribution; only the aspects directly relevant to the GRF are described, and the platform’s full capabilities are not evaluated in this paper.
The evaluation is organized around two research questions, examined in a single representative dam-infrastructure case as a first step toward broader infrastructure coverage:
What does the embedded reference exemplar contribute to the faithfulness of generated reports, over and above the strict-protocol system instruction alone? This question is addressed under three complementary faithfulness metrics: a deterministic numeric token detector, sentence-level natural language inference against the input table, and an LLM-as-judge claim decomposition.
Does the prompt pattern behave consistently across model families, so that the framework remains portable as the underlying LLM is updated or replaced? This question is addressed by replicating the experiment across three model families (Gemini 2.5 Flash Lite, Llama 3.1 8B Instruct, and GPT-5.4 mini) and comparing the direction of effects.
The empirical results provide preliminary evidence of the approach’s technical reliability under these two questions. The contributions of this paper are:
A Grounded Reporting Framework that transforms structured climate-risk indicators into stakeholder-facing narratives through a one-shot in-context learning pattern. The pattern combines strict-protocol system instructions with embedded reference exemplars, that is, an example input table paired with an expert-written reference report which is included in every prompt to demonstrate the desired output structure to the language model.
A multi-metric faithfulness evaluation methodology for assessing the reliability of LLM-generated climate-risk reports, combining a deterministic numeric token detector with sentence-level NLI and an LLM-as-judge claim decomposition.
A cross-model evaluation showing the directional consistency of the reporting framework across three LLM families, with an accompanying observation that judge-based support rates are themselves sensitive to the choice of judging model.
The ablation results are framed throughout as exploratory empirical evidence rather than as established findings, and the limitations of the present evaluation, including the single-case design and the absence of expert-rated human evaluation, are discussed in
Section 4.3.
The paper is structured as follows:
Section 2 describes the workflow components and the evaluation methodology.
Section 3 depicts the functional demonstration on three European case-study locations (Rotterdam Maasvlakte, Athens, Innsbruck-Brenner), while focusing on performance latency, the ablation study under three faithfulness metrics, and the cross-model replication on Llama 3.1 8B.
Section 4 discusses methodological implications, practical relevance, and limitations;
Section 5, in the end, presents the conclusions.