Constrained LLM Reporting for Geospatial Climate Risk: A One-Shot In-Context Framework for Critical Infrastructure

Farid Arabameri; Jörn Plönnigs; Maryam Imani; Panagiotis Spyridis

doi:10.20944/preprints202606.0566.v1

Submitted:

05 June 2026

Posted:

08 June 2026

You are already at the latest version

Abstract

Climate risk assessments for critical infrastructure are central to identifying and predicting vulnerabilities early in the asset life cycle, enabling proactive mitigation before impacts occur through the implementation of appropriate technical and nature-based solutions. However, such assessments often rely on quantitative indices that can be difficult for non-technical stakeholders to interpret. To address this challenge, this paper presents an open-source decision support platform that combines OpenStreetMap site characterization, qualitative pre-screening, a quantitative IPCC AR6-aligned risk chain, and a downstream nature-based solution (NbS) recommendation layer. The approach uses Large Language Models (LLM) to translate analytical outputs into accessible narrative explanations. End-to-end site-characterization processing across three tested European demonstration sites took between 29 and 70 seconds. An exploratory ablation study investigated the faithfulness of AI-generated explanations using three complementary metrics. It showed that the generated hazard assessments remained factually grounded and free from fabricated numerical values. Including example reports in the prompt further improved the reliability of explanations for more complex risk indicators. While the results demonstrate the potential of AI-assisted climate risk communication, expert evaluation of stakeholder utility is identified as the most important next step.

Keywords:

climate risk

;

critical infrastructure

;

nature-based solutions

;

large language models

;

in-context learning

;

faithfulness evaluation

;

OpenStreetMap

;

geospatial decision support

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

The frequency and severity of extreme weather events driven by climate change is escalating and are placing unprecedented stress on critical infrastructure systems, including transportation networks, port facilities, energy grids, and hydraulic structures. The IPCC’s Sixth Assessment Report (AR6) [1] envisions risk through three interacting drivers, namely, hazard, exposure, and vulnerability. This framework establishes a robust scientific foundation for conducting infrastructure risk assessments. Despite this clear framework, creating tools that practitioners can easily use is a major challenge as many existing climate risk decision support tools (DSTs) demand advanced technical skills. Several recent studies have proposed frameworks for climate adaptation and resilience assessment of infrastructure systems [2,3,4]. Nevertheless, non-experts often struggle to understand and apply the results. This was confirmed by Šedov’a et al. [5] in their recent 2024 review, where they identified communication of uncertainty as one of the main gaps in existing climate risk assessment tools. This is not a technical challenge as the data layers required for an open-access alternative are nowadays available. OpenStreetMap (OSM) offers flexible, on-demand access to infrastructure data while being license-free, also European climate projection model suites such as EURO-CORDEX provide spatially consistent, RCP-aligned hazard information across the continent [6]. By merging these datasets, it is possible to build intuitive interfaces for climate risk screening. Yet, the field lacks a unified, reproducible platform that non-specialists can operate fluidly without sacrificing processing speed.

In this context particularly nature-based solutions (NbS) have gained popularity as a mitigation strategy for climate-related risks that are affecting infrastructure. Institutions like IUCN [7], the World Bank [8], and the European Commission [9], have highlighted the role of NbS as cost-effective reinforcements, and in some cases alternatives, to conventional grey infrastructure, especially for hazards such as flooding, erosion, and landslides. Embedding NbS options within the risk assessment step itself, rather than scheduling them as a follow-on activity, could shorten the distance between identifying a problem and implementing a response. Despite their potential for disaster risk reduction, quantitative risk models have largely not been extended to incorporate NbS, and published work on spatial tools or software systems for deploying them at scale remains limited [10,11]. Researchers working on climate services have framed a related difficulty: even technically robust tools tend to fall short of actual use when their outputs are expressed in terms that practitioners cannot readily trust or act on [12,13]. Bridging that gap requires not just a strong computational backend, but an interpretation layer that turns index scores into contextually grounded narrative; historically, that interpretation has only been delivered through manual expert reporting, at high cost and low scalability.

Recent advances in large language models (LLMs) open a new route to that interpretation layer: automated generation of natural-language reports from structured quantitative data. LLMs are being applied across different fields of science to generate context-aware summaries, answer domain-specific questions, and automate structured research narratives [14]. Yet, trusting LLMs with high-stakes infrastructure decisions brings a major complication [15]. These models might "hallucinate" and generate content that are plausible but factually unsupported [16,17]. Since this specific workflow relies on strict numerical indices to justify physical engineering interventions, any deviation from the raw data fatally compromises the tool. To mitigate this problem, several solutions have been proposed, including retrieval-augmented generation [18], chain-of-thought prompting [19], and in-context learning with worked exemplars [14,20].

Only a handful of recent projects have attempted to ground LLMs specifically in climate contexts, and their objectives differ sharply from the approach proposed here. ChatClimate [21], for example, anchors a chat interface in the IPCC AR6 corpus. It functions as an exploratory dialogue tool for climate science, which is functionally distinct from the goal of translating structured risk metrics into standardized practitioner reports. Another effort, CHATREPORT [22], targets corporate environmental, social, and governance (ESG) reporting. It extracts answers from sustainability documents based on templates from the Task Force on Climate-related Financial Disclosures (TCFD), focusing strictly on document-level analysis rather than data-to-text generation. The present work is also explicitly distinguished from ClimateBert [23]. Because ClimateBert operates as a domain-specific encoder, researchers use it primarily to classify claims or analyze sentiment, rather than to generate cohesive, readable narratives from scratch. The present work occupies a different point in this design space: it does not retrieve from a document corpus, and it does not classify text. Instead, it translates a small, highly structured input table of quantitative risk indices into a stakeholder-facing narrative through one-shot in-context learning, with the input data itself acting as the grounding source. To the authors’ knowledge, the systematic integration of LLM-generated narrative into a chained, index-driven climate risk assessment, together with a controlled ablation of the contribution of system instructions versus embedded exemplars, has not previously been reported in this domain.

This paper addresses the interpretation and communication gap within the context of an end-to-end decision support platform developed by the author. The platform, briefly described in Section 2.1, integrates polygon-based site characterization (OSM extraction plus Köppen-Geiger sampling), a qualitative perceived-risk workflow, and a quantitative IPCC AR6-aligned risk-assessment chain whose outputs feed a downstream nature-based solution recommendation engine. A Grounded Reporting Framework (GRF) is applied across each workflow’s analytical outputs, using one-shot in-context learning to translate quantitative results into stakeholder-facing narratives. The platform is introduced here solely as the deployment context for the methodological contribution; only the aspects directly relevant to the GRF are described, and the platform’s full capabilities are not evaluated in this paper.

The evaluation is organized around two research questions, examined in a single representative dam-infrastructure case as a first step toward broader infrastructure coverage:

What does the embedded reference exemplar contribute to the faithfulness of generated reports, over and above the strict-protocol system instruction alone? This question is addressed under three complementary faithfulness metrics: a deterministic numeric token detector, sentence-level natural language inference against the input table, and an LLM-as-judge claim decomposition.
Does the prompt pattern behave consistently across model families, so that the framework remains portable as the underlying LLM is updated or replaced? This question is addressed by replicating the experiment across three model families (Gemini 2.5 Flash Lite, Llama 3.1 8B Instruct, and GPT-5.4 mini) and comparing the direction of effects.

The empirical results provide preliminary evidence of the approach’s technical reliability under these two questions. The contributions of this paper are:

A Grounded Reporting Framework that transforms structured climate-risk indicators into stakeholder-facing narratives through a one-shot in-context learning pattern. The pattern combines strict-protocol system instructions with embedded reference exemplars, that is, an example input table paired with an expert-written reference report which is included in every prompt to demonstrate the desired output structure to the language model.
A multi-metric faithfulness evaluation methodology for assessing the reliability of LLM-generated climate-risk reports, combining a deterministic numeric token detector with sentence-level NLI and an LLM-as-judge claim decomposition.
A cross-model evaluation showing the directional consistency of the reporting framework across three LLM families, with an accompanying observation that judge-based support rates are themselves sensitive to the choice of judging model.

The ablation results are framed throughout as exploratory empirical evidence rather than as established findings, and the limitations of the present evaluation, including the single-case design and the absence of expert-rated human evaluation, are discussed in Section 4.3.

The paper is structured as follows: Section 2 describes the workflow components and the evaluation methodology. Section 3 depicts the functional demonstration on three European case-study locations (Rotterdam Maasvlakte, Athens, Innsbruck-Brenner), while focusing on performance latency, the ablation study under three faithfulness metrics, and the cross-model replication on Llama 3.1 8B. Section 4 discusses methodological implications, practical relevance, and limitations; Section 5, in the end, presents the conclusions.

2. Methodology

This work is motivated by the need to effectively assess climate risks at critical infrastructure sites. An operator approaching such an assessment does not face a single question, but rather a sequence of dependent analytical stages. The process begins by delineating the specific site and cataloging the infrastructure it contains. Following this, a preparatory screening captures the qualitative risk profile through expert judgment, while a separate quantitative breakdown is available independently based on the analyst’s needs and available data. Where a deeper investigation is necessary, analysts calculate individual hazard, exposure, and vulnerability indices to form a composite potential risk score. To close the loop, these quantified risks are mapped against available nature-based and technical solutions to find the most effective mitigation strategies.

Each step in this chain depends on a different kind of input. Site selection requires the ability to define an area of interest, since infrastructure portfolios do not always align with administrative boundaries. Infrastructure identification requires an open, queryable source of feature data with consistent European coverage. Climate framing requires a global classification reference so that the user can understand the climate regime of the chosen site prior to analyzing specific climate projections. Preparatory screening requires a structured way to capture expert opinion in a reproducible form. Quantitative evaluation requires access to projected climate indicators under standardized scenarios and time horizons, together with a transparent index-construction methodology. Adaptation choice requires a curated catalog of nature-based solutions linked to the hazards each is designed to mitigate. Across all of these steps, communicating the resulting analysis to non-technical stakeholders requires a layer that translates numerical indices and categorical codes into narrative form, since the audience for an infrastructure adaptation decision typically extends beyond the analyst who produced the indices.

These needs determine the design choices of the platform described below. The assessment boundary is established by manually delineating a custom polygon directly onto the platform’s mapping interface, which makes the boundary an input. Infrastructure features are retrieved from OpenStreetMap through the Overpass API at query time, which provides license-clean, on-demand coverage of the European study area without the maintenance burden of bundled datasets. Climate framing is provided by sampling the Köppen-Geiger 1991–2020 baseline raster [24] at the polygon centroid, which yields a compact climate-zone code (for example Cfb, Csa, or Dfb) suitable for inclusion in downstream narrative reports. Structured expert judgment is captured by a qualitative Level 1 perceived-risk module that records hazard scoring through guided indicators. Quantitative evaluation is performed by a Level 2 risk-assessment chain that consumes projected climate indicators retrieved at runtime from the project’s Climate-Indices-Visualization API1 which is based on the NASA-IBM Climate model [25]. A nature-based solution recommendation engine consumes the Level 1 and Level 2 outputs to rank candidate solutions by their projected risk-reduction potential. The interpretation layer that connects all of these analytical outputs to a non-technical audience is the Grounded Reporting Framework (GRF), introduced in Section 2.4, which is the object of the methodological evaluation reported later in this paper. The remainder of this section is organized in the order in which an operator encounters the workflow. Section 2.1 introduces the three analytical tools that implement the decision chain. Section 2.2 details the polygon-based site-characterization pipeline. Section 2.3 describes the Level 2 quantitative risk-assessment chain. Section 2.4 introduces the Grounded Reporting Framework that produces the narrative reports. Section 2.5 sets out the evaluation methodology, including the way the experiments were actually carried out against the deployed system.

2.1. Decision Support Tool: Workflow Overview

The Decision Support Tool (DST) developed for the Nature-Demo project is an open-source web platform that exposes the decision chain described above through a single point-and-click interface, deployed and publicly accessible across the EURO-CORDEX domain [6]. Figure 1 presents a simplified conceptual overview of the Custom Site Analysis environment, which integrates three distinct analytical modules under a unified interface and a shared reporting architecture.

The first tool, Extraction · Mapping & Data (top tier of Figure 1), implements the site-selection and infrastructure-identification steps of the decision chain described in Section 2. It supports the user with the geospatial data retrieval process of the selected site. The user draws an arbitrary polygon on an interactive map, selects the infrastructure categories of interest, and receives a tabulated inventory of OpenStreetMap features within the polygon together with the Köppen-Geiger climate zone sampled at the polygon centroid. The Gemini AI subsequently synthesizes this structured data into a cohesive narrative report, transforming raw geospatial extractions into a clear situational picture of the environment.

The second tool, Level 1 · Perceived Risks [26] (middle tier of Figure 1), implements the preparatory-screening step of the decision chain. It captures expert opinion on infrastructure risk through structured indicators and produces a narrative perceived-risk report. The qualitative screening logic of this tool is mentioned here for completeness; it is not evaluated further in this paper.

The third tool, Level 2 · Technical Analysis [27] (bottom tier of Figure 1), implements the quantitative-evaluation step of the decision chain. The user enters infrastructure-hazard pairings together with the climate-projection settings (RCP scenario and time horizon), and the tool retrieves the corresponding climate indicator from the Climate-Indices-Visualization API, computes the Hazard, Exposure, Vulnerability, and Potential Risk Indices, and produces two narrative reports: a Hazard Report that interprets the hazard table, and a PRI Assessment Report that interprets the integrated risk table. The output of this tool feeds the downstream nature-based solution recommendation engine.

The three tools are architecturally independent. The only piece of data that crosses tool boundaries is the polygon centroid produced by the first tool, which the user can optionally reuse as the location input for the climate-indicator retrieval performed by the third tool. This independence is a deliberate design choice: it allows an analyst to enter the workflow at any level depending on what is already known about the site, rather than forcing a fixed top-down sequence.

2.2. Polygon-Based Site Characterization

Site selection and infrastructure identification are managed via a polygon-based extraction pipeline. The assessed region is defined by manually tracing a custom shape on an interactive map, a method that avoids reliance on predefined administrative borders. The use of a hand-drawn polygon, rather than a pre-defined boundary, allows the assessed area to follow the actual extent of an infrastructure asset, such as a port complex, a railway corridor, or a dam catchment, etc., instead of an unrelated municipal or regional outline.

The infrastructure of interest, such as Roads, Railways, Bridges, or Tunnels, are designated at query time utilizing a chip-based interface toolbar. Each category label is mapped to one or more underlying OpenStreetMap tag values (for example, the Roads category resolves to a small set of highway=* values such as motorway, primary, and secondary), and these mappings function as internal constraints within the Overpass query itself. retrieval is limited strictly to relevant elements, rendering post-query classification unnecessary. The closed spatial geometry is then dispatched to the Overpass API as a solitary request, capturing only the categorized elements situated inside the drawn footprint.

The polygon centroid is computed on a WGS84 sphere and is sampled against the 0.1°-resolution Köppen-Geiger 1991–2020 baseline raster [24] to obtain the climate-zone code at the site. Polygon area is also computed by spherical excess and is reported alongside the OpenStreetMap element counts. The extracted element counts, the polygon area, and the climate-zone code together constitute the structured input that the reporting layer translates into a narrative Geographical Context report and a climate interpretation report.

Figure 2. Polygon-based extraction in detail (Innsbruck case study, INN, Section 3.1). (a) The user draws a polygon over the area of interest after selecting infrastructure categories from the chip toolbar. (b) The polygon vertices are translated into an Overpass QL filter; an example query for the four selected categories is shown after the first occurrence to preserve readability. (c) The returned elements are tabulated and aggregated by category (the Bridges section is omitted for compactness; the same tabular format applies). (d) The polygon centroid is sampled against the Köppen-Geiger raster to obtain the climate class, here Dfb (cold without dry season, warm summer).

2.3. Quantitative Risk-Assessment Chain

Quantitative risk evaluation is the role of the Level 2 module, which follows the multi-level framework of the parent project [27]. The user begins by selecting one or more impact models from a curated catalog. Each impact model describes a specific combination of infrastructure asset, climate hazard, and resulting consequence (for example, “track buckling due to extreme heat” on a railway, or “embankment instability due to heavy precipitation” on a road) and carries with it a pre-attached EURO-CORDEX climate indicator that quantifies the relevant climate driver. The user therefore composes the assessment by choosing what to study, not by choosing how to measure it; the indicator-to-impact mapping is fixed in the catalog so that the same hazard is always characterized by the same climate variable across assessments.

Once the scope is composed, four indices are computed in sequence for each selected impact model. The Hazard Index (

H I

) is a 1–5 ordinal value derived from the climate indicator attached to that impact model. For each row of the user’s scope table, the platform issues a request to the Climate-Indices-Visualization API that includes the row’s indicator, the analysis location, the user-selected RCP scenario (RCP4.5 or RCP8.5), and the user-selected time horizon (short, medium, or long term). The API returns the indicator’s historical baseline value, its projected value under the requested scenario and horizon, and the relative variation. The platform then converts the returned indicator value into the 1–5 hazard index through an internally defined mapping function specified in [27]. The Exposure Index (

E I

) is computed from two user-supplied values, annual revenue exposure and capacity exposure, mapped to an integer 1–5 scale through a piecewise lookup. The Vulnerability Index (

V I

) is computed as

S e n s i t i v i t y \times (1 - A d a p t i v e_C a p a c i t y)

, where sensitivity is rated 1–5 by the user and adaptive capacity is the sum of a base capacity and three normalized user inputs (asset lifetime, maintenance regime, and design topology), with adaptive capacity capped at

0.4

. The Potential Risk Index (

P R I

) is the composite score

H I \times E I \times V I

, mapped to an ordinal six-class scale (No Risk, Very Low, Low, Medium, High, Extreme) by threshold lookup.

A key operational feature of this module is its dynamic execution. The indices are computed on demand from user inputs and live climate-indicator retrievals, superseding the use of the tables. The polygon centroid from the site-characterization pipeline is optionally reused as the location input for the climate-indicator fetch, and the chain is integrated with the reporting framework described next.

2.4. Grounded Reporting Framework

While quantitative outputs are essential for technical analysis, they are often insufficient for communicating risk to non-technical stakeholders. Disconnected metrics, such as a discrete Hazard Index value, a categorical Potential Risk Index label, or raw geospatial element counts, do not intrinsically provide the descriptive context needed to communicate a site’s actual risk profile. The Grounded Reporting Framework (GRF) is the layer of the workflow that turns these structured outputs into stakeholder-facing narrative reports. It is instantiated independently in the site-characterization tool (where it produces a Geographical Context report and a Köppen Interpretation), in the Level 1 tool (where it produces a narrative perceived-risk report), and in the Level 2 tool (where it produces a Hazard Report and a PRI Assessment Report). Each instance makes its own call to the underlying LLM with its own system prompt, exemplar, and structured input.

The GRF was designed against four constraints derived from the parent project’s data-management plan. Reports must be reproducible in tone and structure across runs and across analyzed sites, so that they can be compared across stakeholders and over time. Reports must contain no factual claims that are not present in, or directly derivable from, the data passed in the prompt. The same software pattern must extend to all report types produced by the workflow. The framework must remain model-agnostic, so that the underlying LLM can be replaced as new models become available without rewriting the application logic.

Each report prompt is built from three composable blocks, illustrated in the right panel of Figure 1:

System prompt: fixes the analyst role and enumerates a small set of strict protocols (zero hallucination, narrative Markdown output, scope restriction to the input table, and terminology constraints specific to the report type).
Reference exemplar: an embedded input/output pair consisting of a representative input table paired with an expert-written reference report; it fixes the desired structure, level of detail, terminology, and discussion logic of the output.
Structured input data: the actual analytical output on which the report is to be produced, formatted as Markdown so that it matches the model’s training distribution.

This pattern is functionally equivalent to retrieval-augmented generation [28] with a single retrieved exemplar and is often called one-shot in-context learning [14]; the latter terminology is used in the remainder of this paper.

In the deployed configuration of the platform, the PRI Assessment Report uses a low sampling temperature of

0.3

in the LLM. The sampling temperature controls the degree of variation in the generated text: a low temperature favors high-probability words and phrases and yields more consistent outputs across runs. The PRI Assessment Report inherits this low setting because it interprets the most consequential output of the risk chain and can influence the decision of stakeholders regarding the implementation of the nature-based solutions. The other GRF reports (Geographical Context, Köppen Interpretation, Level 1 narrative, and Hazard Report) use a temperature of

0.7

, which allows minor stylistic variation that improves readability. The ablation experiment in Section 3.3 mirrors these deployed temperatures so that its observations apply directly to the deployed configuration.

Table 1. Configuration of the four report types in the Grounded Reporting Framework. “Exemplar” indicates whether an input/output reference pair is embedded in the prompt. T is the sampling temperature.

Report type	Input	Exemplar	T	Output structure
Geographical Context	OSM element counts, raw tags, polygon centroid, area	No	$0.7$	Four fixed sections (geography, infrastructure, detail, limitations)
Köppen Interpretation	Köppen-Geiger code (e.g., Cfb)	No	$0.7$	Three fixed sections (name, characteristics, ecology)
Hazard Report (L2)	$H I$ table for one impact-model row set	Yes	$0.7$	Hazards grouped by severity and chronicity; exact indices preserved
PRI Assessment (L2)	$H I$ / $E I$ / $V I$ / $P R I$ table across all impact models	Yes	$0.3$	Risks ranked highest first; drivers attributed to each index

The reference exemplars used for the Hazard and PRI report types were authored by domain experts within the project consortium and reviewed for accuracy, terminology, and tone before being embedded in the prompt. The exemplars are deliberately specific: each describes a single representative case (a railway asset in mountainous terrain) at a fixed length and depth, rather than offering a generic template. The intent is to anchor the model’s output style without constraining the content scope of the actual analysis the user requests. As an illustration, the system instruction used for the PRI Assessment Report is reproduced verbatim below:

You are a senior infrastructure risk analyst. Write a formal PRI Assessment Report. STRICT PROTOCOLS: 1. Zero Hallucination, base analysis EXCLUSIVELY on provided table. 2. Narrative Markdown output (do not reconstruct the table). 3. Use abbreviations HI, EI, VI, PRI.

Every report rendered in the user interface is presented inside a uniformly styled “AI-Generated Content” panel that displays the model name and version, a permanent disclaimer about the non-deterministic nature of the output, and an “AI Limitations & Responsible Use” expander, in accordance with European Union AI Act transparency requirements [29] and with the parent project’s data-management plan.

2.5. Evaluation Methodology

The Custom Site Analysis workflow was assessed through four complementary evaluations, each targeting a distinct aspect of the platform’s behavior: The first is a functional demonstration that confirms the workflow runs end-to-end and produces coherent outputs at three diverse European sites. The second is a performance characterization that quantifies processing latency and identifies the dominant contributors. The third is an ablation study on the Grounded Reporting Framework, which examines how individual prompt components affect the faithfulness of the generated reports relative to their structured inputs. The fourth is a cross-model replication that tests whether the ablation findings hold when the underlying LLM is replaced. The ablation and cross-model stages are exploratory in scale and are intended to provide directional evidence rather than statistically significant findings.

2.5.1. Experimental Execution and Reproducibility

The ablation and cross-model studies were carried out through standalone Python scripts that reproduce the prompt construction of the deployed DST verbatim. The system instructions, exemplar pairs, sampling temperatures, and Markdown input serializations used in the live platform are imported by the scripts as immutable string constants, so that the evaluated configuration matches the deployed configuration exactly. Report generation and faithfulness scoring are separated into two independent stages: a generation script writes one Markdown file per generated report by calling the relevant model API (Gemini 2.5 Flash Lite), as well as two other models used exclusively for this study: GPT-5.4 mini (via hosted endpoints) and Llama 3.1 8B Instruct (via a local Ollama runtime2 in 4-bit quantization). A separate scoring script then computes the three faithfulness metrics across all generated reports. The deterministic numeric token detector is applied locally; the sentence-level natural language inference analysis is performed with a RoBERTa-large model fine-tuned on MNLI running on local hardware; the LLM-as-judge analysis uses two judges (Llama 3.1 8B Instruct and GPT-5.4 mini, both at sampling temperature 0) so that the sensitivity of the support-rate metric to the choice of judge can be examined directly. All generation and scoring scripts, the generated Markdown reports, and the per-report metric CSV files are released as open source alongside the platform.

2.5.2. Functional Demonstration on Three Self-Selected Locations

To illustrate the workflow’s applicability beyond any preconfigured site, three European polygons were defined to span distinct climate zones, infrastructure typologies, and dominant hazard families. The first (ROT) covers the Rotterdam Maasvlakte port complex (Köppen class Cfb), dominated by port and intermodal-transport infrastructure. The second (ATH) covers the central urban core of Athens (Köppen class Csa), dominated by road and building stock. The third (INN) covers a section of Innsbruck’s Brenner highway and railway corridor (Köppen class Dfb), dominated by alpine highway and rail infrastructure with cross-cutting bridge, tunnel, and slope-stabilization features. The site-characterization workflow was run end-to-end at each location, from polygon completion to fully rendered reports, and per-component latencies (Overpass extraction, Köppen sampling, LLM generation of the Geographical Context report and the Köppen Interpretation) were measured over three repetitions per site. The Level 2 risk chain was not included in this latency measurement because its latency profile depends primarily on user-supplied inputs rather than on the polygon-extraction pipeline.

2.5.3. Performance Characterization

End-to-end latency was further characterized by drawing 20 polygons over the centers of major European urban regions (Berlin, Paris, London, Madrid, Rome, Amsterdam, Brussels, Budapest, Prague, Lisbon, Dublin, Oslo, Zagreb, Bucharest, Athens, Innsbruck, Milan, Luxembourg, Maasvlakte, and Marseille). Polygon areas spanned approximately

0.5

–100 km², and returned element counts spanned almost three orders of magnitude. Each polygon was queried three times against the public Overpass endpoint, which allowed within-polygon variance driven by transient load on the public service to be separated from across-polygon variance driven by polygon size and complexity. Köppen-Geiger sampling latency was recorded alongside each Overpass call.

2.5.4. Ablation Study on Embedded Exemplars

The contribution of the embedded reference exemplar to the GRF is examined by generating the Hazard Report and PRI Assessment Report under two prompt conditions. The first is the deployed full-prompt configuration, in which the exemplar pair is included; this is referred to in the result tables as the with-exemplar condition (WE). The second is an ablated no-exemplar configuration (NE), in which the reference exemplar is removed while the strict-protocol system instruction and the input data block are retained verbatim. The input table consists of analyst-validated impact-model rows and

H I / E I / V I / P R I

scores for a representative dam infrastructure site within one of the project demonstration regions, comprising five infrastructure assets (intake tunnel, spillway chute, service road, service building, and dam) and ten impact-model rows. Each (type × condition) combination is sampled three times under independent calls to the deployed model, producing 12 generated reports.

The generated reports are scored under three complementary faithfulness metrics, each targeting a different category of failure that automatic similarity scores cannot directly detect. Each metric is described below first in terms of the failure mode it is intended to surface, then in terms of how that failure mode is operationalized.

Deterministic numeric token detector

The motivation is straightforward: in a report whose role is to interpret a structured numerical input, every numerical value that appears in the prose should either be present in the input table or be directly derivable from it. Any numerical token in the output that is neither in the input nor derivable from it is, by construction, a value the model has invented, since the model has no other grounded source for the number. Counting such tokens therefore provides a conservative, non-semantic lower bound on numerical hallucination. The detector operationalizes this idea by tokenizing the generated report with a regular expression that captures all numeric tokens, including single digits, and flagging every token that does not appear in the input table. The legitimate ordinal index scale 0–5 used by HI, EI, VI, and PRI is whitelisted, since these single digits would otherwise be flagged trivially. Each flagged token is recorded together with ±40 characters of surrounding context so that a domain reviewer can rapidly distinguish a genuine fabrication from an arithmetic intermediate (for example, the product

H I \times E I \times V I

shown before normalization) or a category-range label such as “Moderate-High PRI (2.0–2.9)”.

Sentence-level natural language inference (NLI)

The numeric detector is blind to non-numeric fabrications. A sentence such as “embankment failure is driven by thermal shock to rebar” contains no flagged numbers but is, in the input considered here, an unsupported causal claim. To probe such cases, every sentence of the generated report is treated as a hypothesis and tested for entailment against a serialized version of the input table, used as the premise. The classifier is a RoBERTa-large model [30] fine-tuned on the MultiNLI corpus [31], which labels each (premise, hypothesis) pair as entailed, neutral, or contradicted (a brief description of MNLI is given in Section 2.5.5). The serialization of the input table is constructed row by row, with each header field rendered as “header is value.” so that the premise is natural-language prose rather than markdown. The headline NLI metric is the entailment rate

r_{ent} = n_{ent} / (n_{ent} + n_{con})

, where

n_{ent}

is the number of sentences labelled entailed and

n_{con}

is the number labelled contradicted. Neutral sentences are excluded from the denominator because narrative reports legitimately include framing prose that the input table neither confirms nor contradicts (for example, generic statements about cascading risks or stakeholder context).

LLM-as-judge claim decomposition

NLI evaluates one sentence at a time. A sentence may, however, contain several factual assertions, only some of which are grounded in the input table; sentence-level scoring therefore conflates a partially supported sentence with a fully supported one. To obtain a more fine-grained signal, each generated report is decomposed by a judging LLM into a set of atomic factual claims, following the FActScore decomposition approach [32]. Each claim is then adjudicated against the input table by a separate call to the same judging model and assigned one of four verdicts: supported (the claim is directly stated or trivially derivable from the table), partial (part of the claim is supported but part adds detail not in the table), contradicted (the claim asserts something the table refutes), or unverifiable (the claim addresses interpretive content the table is not expected to confirm, for example an adaptation recommendation). To avoid the bias of a model judging its own outputs, the primary judging model is Llama 3.1 8B Instruct, run at sampling temperature

T = 0

; a second judge, GPT-5.4 mini, is used as a robustness check (Section 3.4). The headline judge metric is the verifiable support rate

ρ_{\sup} = \frac{n_{\sup}}{n_{\sup} + n_{par} + n_{con}},

where

n_{\sup}

,

n_{par}

,

n_{con}

, and

n_{unv}

are the counts of supported, partial, contradicted, and unverifiable claims returned by the judge for the report under consideration. Unverifiable claims are excluded from the denominator because the metric is intended to characterize the verifiable share of the claim space; the raw support rate

n_{\sup} / n_{tot}

, where

n_{tot} = n_{\sup} + n_{par} + n_{con} + n_{unv}

, is also reported in the result tables for completeness.

In addition to the three faithfulness metrics above, ROUGE-L F-measure [33] and BERTScore F1 [34] are computed against a human-authored expert reference for the same dam-infrastructure site; the corresponding numbers are reported in Appendix A. These two scores measure stylistic and semantic alignment with the reference rather than faithfulness to the input table, and are therefore retained as supplementary indicators rather than headline metrics. Their roles are described briefly in Section 2.5.5.

2.5.5. Brief Definitions of Referenced Evaluation Metrics

The metric pipeline above borrows three components from the natural-language-processing literature. Since readers of this journal may not be familiar with them, short descriptions are given here.

ROUGE-L [33] is one of the oldest automatic metrics for evaluating generated text. The variant used here scores a generated report against a reference text by finding the longest sequence of words that appears in both (in order, though not necessarily contiguously), and reporting the result as an F-measure. Two generated reports that paraphrase the reference with different vocabulary will receive low ROUGE-L scores even if both are correct, which is why ROUGE-L is treated here as a stylistic-alignment indicator rather than a faithfulness metric.

BERTScore [34] was introduced to address the lexical-overlap limitation of ROUGE-style metrics. Rather than counting matching words, it represents each token in the candidate and the reference as a contextual embedding (RoBERTa-large, in this paper’s configuration), then matches every candidate token to its most-similar reference token via cosine similarity. The resulting precision, recall, and F1 scores reward semantic agreement even when wording differs. BERTScore is therefore a stricter alignment test than ROUGE-L, but it shares the same fundamental property: it compares against a reference report, not against the input data table, and so is not by itself a measure of factual grounding.

MNLI [31] is a corpus, not a metric. It contains sentence pairs labeled entailment, contradiction, or neutral, where the label describes how the second sentence (the hypothesis) relates to the first (the premise). Language models fine-tuned on MNLI learn to predict these three labels for any new pair, and the resulting classifier is what is used in the NLI step described earlier in this section. In the configuration here, the premise is a sentence-by-sentence rendering of the input table and the hypothesis is one sentence of the generated report, so the entailment classifier acts as a grounding test against the data rather than against a reference.

2.5.6. Cross-Model Replication

To examine whether the GRF prompt pattern is portable beyond the deployed proprietary model, the same four report

t y p e \times c o n d i t i o n

configurations were replicated under two further models in addition to the deployed Gemini 2.5 Flash Lite. Llama 3.1 8B Instruct [35] was served locally through the Ollama runtime with default 4-bit quantization. GPT-5.4 mini was accessed through its hosted OpenAI API. The same input table, system instructions, sampling temperatures, exemplar configurations, and three-repetition design were used in every arm, producing 36 generations across the three model families. The cross-model replication is not designed to establish quantitative equivalence between models, which is not its purpose, but to test whether the principal ablation effects observed under the deployed model reproduce in direction under different model families.

3. Results

3.1. Functional Demonstration Across the Three Case-Study Locations

To illustrate the workflow’s general applicability beyond any preconfigured site, three European polygons were defined to span distinct climate zones, infrastructure typologies, and dominant hazard families. The first polygon (ROT) covers the Rotterdam Maasvlakte port complex (centroid 51.97°N, 4.09°E; area 100.5 km²); the second (ATH) covers the central core of Athens (centroid 37.98°N, 23.74°E; area 19.5 km²); the third (INN) covers a section of Innsbruck’s Brenner corridor (centroid 47.24°N, 11.44°E; area 70.5 km²). The Köppen-Geiger sampling at the centroid of each polygon returned, respectively, Cfb (temperate oceanic), Csa (hot-summer Mediterranean), and Dfb (warm-summer humid continental), consistent with the intended climate-zone diversity.

The on-demand Overpass extraction returned 34,416 elements for ROT, 60,282 for ATH, and 18,872 for INN. The contrast between the ATH and INN counts is informative: ATH yields the highest element density of the three (approximately 3,100 elements/km²), reflecting the dense building stock of central Athens (45,811 building features alone), whereas INN, despite covering 3.6 times the area of ATH, returns the smallest element count because the polygon traverses alpine terrain in which infrastructure is concentrated along a narrow valley corridor. ROT lies between the two and is dominated by buildings (26,066) and roads (6,887). Per-category extraction counts are reported in Table 2 and indicate that the polygon-based extraction captures the infrastructure profile of each location: railway and slope-stabilization features dominate the INN profile, urban-building density dominates ATH, and a mixed port/transport/building profile is returned for ROT, including 245 water-body features that reflect the Maasvlakte’s port and waterway character.

End-to-end execution of the workflow produced the polygon-based extraction, the Köppen sample, the Geographical Context report, and the Köppen Interpretation report for each location. An excerpt from the generated reports for Rotterdam is illustrated in Figure 3. Single-pixel Köppen sampling completed in well under 0.6 s in all runs, as expected for a raster lookup. Overpass extraction times averaged 14.5 s across the three sites (range 9.7–19.2 s, with substantial within-site variance reflecting the public-endpoint variability characterized in Section 3.2). LLM-generation times were dominated by the Geographical Context report, particularly for the data-rich ROT prompt (34.8 s), reflecting the larger number of detailed OSM tag dictionaries serialized into that prompt, and ranged from 8.4 s to 20.0 s for the Köppen Interpretation report. Total end-to-end time from polygon completion to fully displayed reports was 69.9 s for ROT, 42.4 s for ATH, and 29.2 s for INN.

Taken together, the three case studies establish that the workflow operates end-to-end across structurally and climatically dissimilar sites without bespoke configuration: a coastal industrial port complex in a temperate-oceanic climate, a dense urban core in a Mediterranean climate, and a mountainous transport corridor in a cold continental climate were all processed by the same pipeline with no site-specific tuning, parameter changes, or manual data preparation. The contrast in returned element counts and dominant feature types (Table 2) shows that the polygon-based extraction adapts to the actual infrastructure profile of each site rather than relying on a fixed schema, and the Köppen sample correctly recovered the expected climate class at each centroid (Cfb, Csa, Dfb). Figure 3 provides a single fully rendered example for ROT, since the visual layout of the four panels is the same at all three sites; the corresponding outputs for ATH and INN are reproduced in the released data alongside the per-site latency measurements.

The functional demonstration verifies that the workflow runs end to end and produces coherent outputs of the expected structure at each site. It does not verify the factual correctness of the generated reports against ground truth; that question is addressed for the Hazard and PRI reports specifically by the ablation study in Section 3.3, and is identified as a target for expert review in Section 4.3.

3.2. Performance Characterization

End-to-end latency was characterized to confirm that the workflow remains in the interactive-use regime and to identify the components that dominate response time. Total time from polygon completion to fully displayed reports was 29.2 s, 42.4 s, and 69.9 s for the three case-study locations (INN, ATH, ROT). Component-level measurements identified Overpass extraction as both the dominant and the most variable contributor (mean 14.5 s across the three case-study sites, range 9.7–19.2 s for repeated queries against the same INN polygon), reflecting transient load on the public Overpass endpoint rather than any property of the query itself. Köppen-Geiger sampling contributed negligibly (per-call latency 0.45 ± 0.10 s). LLM generation under hosted Gemini was approximately 3 to 8 times faster than the equivalent local Llama 3.1 8B Instruct (4-bit quantized) generation; the deployed system uses hosted Gemini, with the Llama configuration retained only for the cross-model replication described in Section 3.4. These observations motivate caching Overpass responses in operational deployments and clarify the cost of model portability. A broader 20-polygon characterization, including the size–latency scaling fit and per-polygon distributions, is available in the open data repository.

3.3. Ablation Study on Embedded Exemplars

The ablation study compares the Hazard Report and the PRI Assessment Report under the two prompt conditions defined in Section 2.5, namely the deployed full-prompt configuration in which the reference exemplar pair is included (WE) and the ablated configuration in which the exemplar pair is removed (NE). Each (type × condition) combination was sampled three times under independent calls to the deployed Gemini 2.5 Flash Lite model, producing 12 generated reports in total. Reports were scored under the three faithfulness metrics described in Section 2.5.4.

3.3.1. Ablation Results

Per-condition aggregates for the Hazard and PRI Assessment Reports under both conditions are reported in Table 3. Throughout the rest of this paper, the two ablation conditions are referenced in tables using two short labels: WE (with exemplar) refers to the deployed full-prompt configuration of the GRF, in which the reference exemplar pair is embedded in the prompt as described in Section 2.4; NE (no exemplar) refers to the ablated configuration, in which the exemplar pair has been removed while every other element of the prompt, including the strict-protocol system instruction and the structured input data block, is kept identical. The column

Δ

reports the per-row difference

NE - WE

, so that a positive

Δ

on any metric indicates a larger value under the no-exemplar condition. Four observations emerge from Table 3.

First, the exemplar’s effect on output length differs in sign between the two report types. Removing the exemplar shortened the Hazard Report (from 401 to 311 words) and lengthened the PRI Assessment Report (from 595 to 1109 words). Inspection of the no-exemplar PRI outputs showed that the additional length was produced by detailed per-risk breakdowns and explicit exposition of the

H I \times E I \times V I

arithmetic before normalization–content the model omitted when an exemplar with the concise expert format was available. The exemplar therefore acts as a stylistic and structural anchor in this single case, with the apparent magnitude of anchoring depending on the structural strictness of the target report.

Second, the numeric token detector reveals a structured grounding pattern. Hazard reports produced no flagged tokens under either condition (zero across all six runs). PRI reports under the deployed full-prompt configuration produced 2.0 flagged tokens per report on average (6 across three runs); manual inspection of these tokens is reported below. PRI reports under exemplar removal produced 7.0 flagged tokens per report on average (21 across three runs), a 3.5-fold increase. The exemplar therefore appears to play a scope-limiting role on numeric content for the more interpretive of the two report types, in addition to its structural anchoring role. Manual inspection of the 6 tokens flagged under the deployed PRI configuration found these to consist of arithmetic intermediates from the

H I \times E I \times V I

product chain (e. g., 40, 32, 24, 12, 8) and one category-range label (“Moderate-High PRI (2.0–2.9)”); no token corresponded to a fabricated value with no derivable interpretation from the input table.

Third, NLI and LLM-judge metrics confirm a degradation in PRI grounding under exemplar removal that the length-and-overlap metrics could not see. For PRI, NLI sentences classified as contradicted rose from 1.0 to 3.0 under exemplar removal, and the deterministic numeric token detector results rose as described above. For Hazard, NLI contradictions rose more modestly (from 4.7 to 5.3) and the entailment rate fell slightly (from 0.56 to 0.46), but the absolute level of contradictions in the Hazard condition is partly attributable to NLI labelling of legitimate framing prose (statements about expected cascading effects, regional context) that the input table does not address. The deterministic-detector numeric-hallucination count remained zero under both Hazard conditions.

Fourth, the judge’s verifiable support rate $ρ_{\sup}$ rises under exemplar removal for the PRI report (from 0.53 to 0.68 under the Llama judge), a result that requires explanation. The mechanism is visible in the unverifiable column: the deployed full-prompt PRI reports include substantial interpretive narrative (mean 11.3 unverifiable claims per report) that the judge cannot ground in the input table, while the no-exemplar PRI reports lean more heavily on direct restatement of table values (mean 6.0 unverifiable claims). The full-prompt reports therefore generate more material that is neither supportable nor contradictable by the table; the no-exemplar reports score more highly on

ρ_{\sup}

partly because they make fewer interpretive claims, not because their interpretive claims are better grounded. Read alongside the deterministic-detector numeric-hallucination result (+5.0 per report), the judge result therefore reflects a trade-off introduced by exemplar removal: shallower interpretation but more numeric speculation. As Section 3.4 shows, this support-rate swing is itself judge-dependent, and is absent under the GPT-5.4 mini judge. The deployed configuration retains the exemplar.

Figure 4. Ablation effect on the three faithfulness metrics for the Hazard Report (top row) and PRI Assessment Report (bottom row) under Gemini 2.5 Flash Lite. (a, d) Numeric hallucinations under the deterministic numeric token detector. (b, e) NLI entailment rate against the input table (RoBERTa-large MNLI). (c, f) Judge verifiable support rate

ρ_{\sup}

. Cross-model replication results (Section 3.4) follow the same patterns; see Table 4. Boxes show the distribution of values across

n = 3

generations per condition.

Figure 4. Ablation effect on the three faithfulness metrics for the Hazard Report (top row) and PRI Assessment Report (bottom row) under Gemini 2.5 Flash Lite. (a, d) Numeric hallucinations under the deterministic numeric token detector. (b, e) NLI entailment rate against the input table (RoBERTa-large MNLI). (c, f) Judge verifiable support rate

ρ_{\sup}

. Cross-model replication results (Section 3.4) follow the same patterns; see Table 4. Boxes show the distribution of values across

n = 3

generations per condition.

Figure 5. Judge verdict breakdown for the PRI Assessment Report across all six (

m o d e l \times c o n d i t i o n

) configurations, with the Gemini and Llama generations judged by Llama 3.1 8B Instruct (left of the separator) and the GPT-5.4 mini generations judged by GPT-5.4 mini (right). Bars show the proportion of decomposed claims labelled supported, partial, contradicted, and unverifiable. Under the Llama judge the supported share swings markedly with condition; under the GPT-5.4 mini judge it stays high and nearly flat. The contrast illustrates that the judge-based support rate is judge-dependent and must be read alongside the judge-independent metrics.

Figure 5. Judge verdict breakdown for the PRI Assessment Report across all six (

m o d e l \times c o n d i t i o n

) configurations, with the Gemini and Llama generations judged by Llama 3.1 8B Instruct (left of the separator) and the GPT-5.4 mini generations judged by GPT-5.4 mini (right). Bars show the proportion of decomposed claims labelled supported, partial, contradicted, and unverifiable. Under the Llama judge the supported share swings markedly with condition; under the GPT-5.4 mini judge it stays high and nearly flat. The contrast illustrates that the judge-based support rate is judge-dependent and must be read alongside the judge-independent metrics.

The ablation is intentionally small (

n = 3

per condition, 12 generations total) and was designed to provide controlled directional evidence rather than to establish statistical significance. With this sample size, the paired Wilcoxon signed-rank test [36] cannot reach a p-value below 0.25 even for the largest observed effect, so effect sizes and the directional consistency of those effects across the two model families (Section 3.4) are reported instead of inference statistics. Generalization to a larger and more diverse set of input cases, and a complementary expert-rated human evaluation, are identified as the most important next steps.

3.4. Cross-Model Replication

To examine whether the GRF prompt pattern is portable beyond the deployed proprietary model, the Hazard and PRI report generations were replicated under all four

t y p e \times c o n d i t i o n

configurations using two further models: the open-weight Llama 3.1 8B Instruct, served locally through the Ollama runtime with default 4-bit quantization; and the proprietary GPT-5.4 mini, accessed through its hosted API. The same input table, system instructions, sampling temperatures, exemplar configurations, and three-repetition design were used in every arm. The Llama generations are reported in Table 4 alongside the Gemini results from Section 3.3; the GPT-5.4 mini generations are reported separately in Table 5. The cross-model replication was not designed to establish quantitative equivalence between models, which is not its purpose, but to test whether the principal ablation effects observed under Gemini reproduce in direction under different model families. All 36 generations completed successfully. Four observations stand out:

First, the prompt pattern produced reports of the intended structure, which are narratives divided into sections according to hazard level or PRI score with

H I / E I / V I / P R I

abbreviations preserved verbatim, for all 36 generations across the three model families, including every no-exemplar run. This is evidence that the framework’s design (strict-protocol system instruction, structured input table, exemplar block) is portable across model families rather than being a property of any single model.

Second, the central numeric-hallucination asymmetry reproduces under all three models: hazard reports stay near zero under both conditions, while PRI reports remain clean under the full prompt and acquire fabricated numeric tokens once the exemplar is removed (Gemini, 2.0 to 7.0; Llama, 0.0 to 2.3; GPT-5.4 mini, 0.0 to 3.0). The length asymmetry likewise reproduces, most strongly on PRI, where exemplar removal lengthens the report in every arm (Gemini +515 words, Llama +183, GPT-5.4 mini +545). Three independent models, two proprietary and one open-weight, therefore agree on the direction of the two effects on which the paper’s main claim rests (Figure 6).

Third, the exemplar-removal effect on the PRI report is directionally consistent across model families on five of seven metrics under the Llama judge: output length (Gemini +515 / Llama +183), numeric hallucinations (Gemini +5.0 / Llama +2.3), NLI entailment rate (Gemini +0.01 / Llama +0.25), verifiable support rate (Gemini +0.16 / Llama +0.10), and unverifiable claim count (Gemini −5.3 / Llama −7.0). The hazard report shows weaker cross-model consistency precisely because its exemplar effect is small in absolute terms in both models; when the underlying effect is small, the sign of the residual is dominated by per-realization noise. This reinforces the per-model finding that the exemplar’s role is most consequential for the PRI report.

Fourth, the LLM-judge support rate proved sensitive to the choice of judging model, which is itself an informative result. Under the Llama judge, the verifiable support rate for the PRI report rose noticeably on exemplar removal (0.53 to 0.68), an effect attributed to the parallel fall in unverifiable claims. Under the GPT-5.4 mini judge, the same support rate barely moved (0.94 to 0.98) while the unverifiable count still rose (8.3 to 14.0), because GPT-5.4 mini labelled a larger share of claims as supported regardless of condition. GPT-5.4 mini also decomposed reports more granularly and more consistently than Llama (35 to 111 claims per report, with none of the extreme outliers seen under Llama). The two judges agree on the direction of the core asymmetry but disagree on the magnitude of the support-rate swing. The support-rate metric should therefore be read as judge-dependent and triangulated against the deterministic numeric token detector and the NLI metric, both of which are judge-independent and both of which reproduce the asymmetry. This divergence supplies direct evidence for the single-judge-bias concern that would otherwise have to be stated only as a caveat.

Latency. Llama generations ran approximately 3 to 8 times slower than the equivalent Gemini hosted inference, consistent with the cost of running an 8B-parameter model on consumer-grade local hardware in 4-bit quantization. This penalty is acceptable for a replication study but would be a deployment concern; the deployed system continues to use the hosted Gemini service, with the Llama and GPT-5.4 mini configurations retained solely to examine framework portability.

The cross-model replication thus provides preliminary evidence for the central architectural claims of the GRF: the prompt pattern produced the intended structured output across three model families; the exemplar’s role as a scope-limiting anchor on numeric content for the PRI report was preserved in direction across all three; and the asymmetry between hazard and PRI report types (hazard robust, PRI sensitive) held throughout. The replication does not establish quantitative equivalence between models, which is not its purpose, and absolute report quality scales with model capability as expected.

4. Discussion

4.1. Applied Contribution

The principal contribution of this work is the Grounded Reporting Framework introduced in Section 2.4: a structured one-shot in-context learning pattern that translates a chained, IPCC AR6-aligned set of quantitative climate-risk indices into stakeholder-facing narrative reports under explicit grounding constraints. The framework is deployed and evaluated inside the open-source DST described in Section 2.1, Section 2.2 and Section 2.3, which provides its application context: a polygon-based OSM extraction pipeline, a Köppen-Geiger classification step, and the Level 2 risk chain produce the structured inputs that the reporting layer interprets. The DST itself is a useful artifact of the present work, but it is not the object of evaluation in this paper. To the authors’ knowledge, no prior open decision support tool for the climate risk assessment of critical infrastructure has incorporated a structured, one-shot in-context learning reporting layer of this kind. The contrast with adjacent LLM-in-climate efforts is informative. ChatClimate [21] retrieves from the AR6 corpus to support open-ended dialogue; CHATREPORT [22] mines ESG disclosures using TCFD-aligned question templates; ClimateBert [23] provides a domain-adapted encoder for downstream NLP tasks. None of these address the specific design problem solved here: translating a structured table of quantitative risk indices, generated by a chained climate-risk workflow, into a stakeholder-facing narrative under explicit grounding constraints. The present work demonstrates that, in this restricted setting, a single embedded exemplar paired with strict-protocol system instructions can produce structured output suitable for a transparent decision-support panel.

The ablation results in Section 3.3 provide exploratory empirical evidence that refines, rather than confirms, the standard intuition about in-context learning [14,20]. Under three complementary faithfulness metrics (deterministic numeric token detector, NLI against the input table, LLM-as-judge claim decomposition), the deployed full-prompt configuration produced no fabricated numeric tokens on hazard reports and an average of two flagged tokens per PRI report, all of which on manual inspection were arithmetic intermediates or category-range labels rather than fabricated values. Exemplar removal had an asymmetric effect: hazard reports were robust to it, while PRI reports lengthened sharply and acquired fabricated numeric tokens (a mean of seven per report under Gemini, 2.3 under Llama, and 3.0 under GPT-5.4 mini). In this prompt design, the exemplar therefore appears to play not only a stylistic anchoring role but a scope-limiting role on numeric speculation for the more interpretive report type. Non-numeric and causal hallucinations are not directly addressed by these metrics and remain a target for expert review in future work. The asymmetric pattern was directionally consistent across all three model families, providing preliminary evidence that the prompt pattern of the GRF is model-portable. The LLM-judge support rate, by contrast, diverged between the Llama and GPT-5.4 mini judges, which is taken here as direct evidence that judge-based support rates must be triangulated against the judge-independent metrics rather than read in isolation. This interpretation is framed as exploratory rather than methodologically established, given the small sample, single-case design, and the absence of expert-rated human evaluation.

4.2. Practical Implications for Climate-Adaptation Decision Support

The polygon-based extraction pipeline of Section 2.2, the Extraction · Mapping & Data tool, lowers the barrier to understanding the geospatial and climatic context of a candidate infrastructure site by automating two previously manual steps: obtaining and pre-processing OpenStreetMap data for an arbitrary polygon, and locating the polygon within the global Köppen-Geiger climate classification. The Geographical Context and Köppen Interpretation reports produced from these inputs translate raw OSM tag dictionaries and a four-letter climate code into professionally styled narrative summaries that an infrastructure manager can read without GIS or climatology expertise. The Extraction tool is a context-gathering workflow rather than a risk-assessment workflow: it does not produce hazard, exposure, or vulnerability indices, and it is not connected by an internal data pipeline to the two risk-assessment tools (Level 1 and Level 2). Its output is intended to inform a user’s subsequent independent judgment or to serve as documentation alongside a separately conducted risk assessment. The performance characterization of Section 3.2, supplemented by a broader 20-polygon extraction characterization in the open data repository, quantifies the latency profile of this workflow: end-to-end (extraction + climate sample + LLM reports) times of 29–70 s on the three case-study locations, with Overpass extraction identified as the dominant and most variable contributor due to public-endpoint load variability. These figures place the workflow firmly in the regime of interactive use, consistent with recent calls in the climate-services literature for tools that bridge the “useful, used, usable” gap [13].

The narrative reports themselves serve a complementary function. Quantitative risk indices are powerful for technical analysts but are generally opaque to non-technical stakeholders, who in many critical-infrastructure operators include senior management, municipal authorities, and regulators. The combination of strict system-instruction protocols, embedded exemplars, persistent disclaimers, and a low-temperature configuration for the highest-stakes report (PRI Assessment, T = 0.3) appears to be a defensible compromise between the communication benefits of generative AI and the trustworthiness requirements of decision-support contexts. The ablation results offer preliminary support for this design: hazard reports were robust to exemplar removal across all three faithfulness metrics, while PRI reports under the deployed configuration produced an average of two flagged numeric tokens per report (all benign on manual inspection) and verifiable support rates on the LLM-judge metric. The exemplar’s apparent contribution to limiting numeric speculation under the more interpretive PRI configuration argues for retaining exemplar-anchored prompting in production deployments. This trustworthy narrative layer is what allows the platform’s downstream nature-based solution recommendations to be communicated in context: an adaptation measure such as slope stabilization or drainage reinforcement is far more likely to be acted upon when it is presented alongside a grounded, stakeholder-readable explanation of the hazard, exposure, and vulnerability drivers that motivate it, rather than as an isolated index score. In this sense the reporting framework is not an adjunct to the risk-and-NbS workflow but the component that makes its outputs usable by the non-specialist audiences who commission and approve infrastructure adaptation. Whether the resulting reports actually improve stakeholder comprehension and trust is a question that automatic metrics cannot answer; it is identified in Section 4.3 as the most important next step.

4.3. Limitations

Several limitations of the present work warrant discussion. First, the OSM extraction inherits all known limitations of OpenStreetMap as a data source, including spatially heterogeneous completeness and tag-quality variation by region [37]. The workflow partially mitigates this by displaying raw element counts alongside the AI-generated narrative, allowing users to detect cases where extraction is incomplete, but it does not actively assess OSM data quality. The contrast in returned-element counts between the dense urban polygon (ATH) and the alpine-corridor polygon (INN) in Section 3.1 illustrates this dependence, even though the contrast is also a genuine geographical signal.

Second, the ablation study (Section 3.3) is intentionally small (

n = 3

per condition, 12 generations per model family) and was conducted on a single dam-infrastructure case. The cross-model replication (Section 3.4) extends this to 36 generations across three model families, but a more diverse ablation across multiple infrastructure types and report inputs is left for future work. The present manuscript does not include an expert-rated human evaluation of LLM outputs, which is the natural complement to the automatic-metric layer reported here. Such an evaluation would assess subjective qualities of the reports (usefulness for stakeholder communication, perceived accuracy, completeness as judged by domain experts) that automatic metrics cannot directly capture, and would also probe non-numeric and causal hallucinations that the deterministic detector used here cannot identify. This is identified as the highest-priority extension of the present work.

Third, the three faithfulness metrics used in Section 3.3 each have well-known limitations and are reported together precisely because they are complementary rather than substitutable. The deterministic numeric detector is blind to non-numeric and causal hallucinations; the NLI classifier produces “neutral” verdicts on legitimate framing prose that the input table does not address; the LLM-as-judge analysis depends on the judging model’s own behavior, and its sensitivity to that choice was demonstrated directly by the divergence between the Llama and GPT-5.4 mini judges in Section 3.4. The ROUGE-L and BERTScore values against an expert reference, also computed for each report, are reported as supplementary stylistic-alignment indicators in Appendix A; they measure surface and semantic similarity to a single expert-authored interpretation rather than faithfulness to the input table. All metrics are reported with their respective caveats, and the convergent direction of the judge-independent metrics, rather than any one metric in isolation, is what supports the qualitative findings.

Fourth, although the prompt pattern appears portable across LLMs in this small replication, the absolute report quality reported in Section 3.3 was obtained primarily with a single proprietary model (Gemini 2.5 Flash Lite) whose behavior may change without notice as the provider updates the model. The framework’s design (exemplars as immutable code constants, model-agnostic prompt structure) is intended to make such drift detectable and recoverable, but it does not eliminate it.

Fifth, the case-study locations were chosen by the authors specifically to span distinct climate zones and infrastructure types; they are demonstrative rather than representative of any pre-specified European population. Generalization of the latency and quality estimates beyond European temperate, Mediterranean, and alpine settings should be verified before relying on the workflow in markedly different contexts.

5. Conclusions

This paper has presented an exploratory evaluation of a Grounded Reporting Framework (GRF) for translating chained quantitative climate-risk indices into stakeholder-facing narrative reports. The framework combines domain-specific strict-protocol system instructions, an embedded input/output reference exemplar pair, and a low-temperature sampling configuration for the highest-stakes report type, applied through a one-shot in-context learning pattern. The framework is deployed inside an open-source, web-based decision support tool (DST) for climate risk assessment and nature-based adaptation planning for critical infrastructure in Europe, organized into three independent analytical modules: a geospatial extraction and context-gathering workflow (Extraction · Mapping & Data), a perceived-risk screening tool (Level 1), and a quantitative hazard–exposure–vulnerability indexing tool (Level 2) whose outputs feed a nature-based solution recommendation engine. The DST is the deployment context that motivates and grounds the framework but is not itself the object of evaluation in this paper.

Demonstrations on three European locations spanning temperate-oceanic, Mediterranean, and alpine settings showed end-to-end latency of 29.2–69.9 s from polygon completion to fully displayed reports, well within the regime of interactive use, with Overpass extraction identified as both the dominant and the most variable contributor. An exploratory ablation study on real expert-validated dam-site input data, applied to the Level 2 Hazard Report and PRI Assessment Report and evaluated with three complementary faithfulness metrics (deterministic numeric token detector, sentence-level NLI against the input table, LLM-as-judge claim decomposition), revealed an asymmetric pattern: hazard reports were robust to exemplar removal, while PRI reports lengthened substantially and showed a clear rise in flagged numeric tokens once the exemplar was removed. Under the deployed full-prompt configuration, hazard reports produced no flagged numeric tokens; the two flagged PRI tokens per report on average were, on manual inspection, arithmetic intermediates or category-range labels rather than fabricated values. The exemplar’s apparent contribution therefore extends beyond stylistic anchoring to include scope-limiting of speculative numeric content, particularly for the more interpretive of the two report types. Replication across three model families (Gemini 2.5 Flash Lite, Llama 3.1 8B Instruct, and GPT-5.4 mini) reproduced the direction of this asymmetry, while a divergence between two LLM judges established that judge-based support rates must be triangulated against judge-independent metrics.

Future extensions of this work include expert-rated human evaluation of stakeholder utility (identified as the highest-priority next step), characterizing the framework on additional asset classes beyond transport and dam infrastructure, evaluating self-hosted Overpass deployments that would reduce latency variance, and extending the ablation to additional open-weight model scales.

Together, these results provide support for the idea that carefully constrained generative AI can play a useful role as a stakeholder-communication layer atop quantitative climate risk assessment and nature-based-solution recommendation, provided that the framework around it enforces grounding, transparency, and reproducibility. For infrastructure operators weighing nature-based adaptation measures, the value of such a platform lies in lowering the expertise barrier between a quantitative risk assessment and an actionable, well-communicated adaptation decision. The framework presented here uses a functional decomposition between system instruction (primary numerical grounding) and embedded exemplar (style anchoring plus scope-limiting of numeric speculation on interpretive report types), and this pattern reproduced across three model families. The evaluation provides evidence that the reporting layer is both reliable and portable as underlying models evolve, with broader validation left to future work. The tool is publicly available3 and the source code, deployment configuration, and all measurement data are released open-source on github4.

Author Contributions

Conceptualization, F.A., J.P.; methodology, F.A., J.P.; software, F.A.; validation, F.A.; formal analysis, F.A.; investigation, F.A.; data curation, F.A.; writing—original draft preparation, F.A., J.P.; writing—review and editing, J.P., P.S., and M.I.; visualization, F.A.; supervision, J.P., P.S., and M.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the European Union under the Horizon Europe research and innovation program, project NATURE-DEMO (Demonstrating Nature-Based Solutions for the Resilience of Critical Infrastructures), Grant Agreement No. 101157448. The views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the granting authority. Neither the European Union nor the granting authority can be held responsible for them.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code of the Decision Support Tool, the deployment configuration, the raw measurement data and aggregate metrics underlying the results in Section 3.1, Section 3.2, Section 3.3, and Section 3.4, and the test scripts used for the latency characterization and the ablation study, are openly released under an open-source license at https://github.com/saturngreen67/MDPI1. The deployed application is publicly accessible at https://nature-demo-dst.dic-cloudmate.eu.

Acknowledgments

The authors thank the University of Cantabria team for the multi-level risk assessment framework that underpins the Level 2 quantitative chain used in this work; the BOKU team for the Level 1 perceived-risk methodology and for the Level 1 NbS ranking method, both communicated to the implementation team in consortium working meetings; the ANRI team, and Lauren Machí Castañer in particular, for the Level 2 NbS scoring methodology (SSF, SEI, and HIA factors), also communicated in working meetings; the IBM Research team for their work on their Climate Indices Visualization and the underlying climate foundation models and all NATURE-DEMO consortium partners for their feedback during the design, testing, and deployment of the platform. During the preparation of this manuscript, the authors used Claude and Google Gemini for the purposes of proofreading and refining the writing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
API	Application Programming Interface
AR6	IPCC Sixth Assessment Report
DST	Decision Support Tool
EI	Exposure Index
GIS	Geographic Information System
GRF	Grounded Reporting Framework
HI	Hazard Index
IPCC	Intergovernmental Panel on Climate Change
LLM	Large Language Model
NbS	Nature-Based Solutions
OSM	OpenStreetMap
PRI	Potential Risk Index
RCP	Representative Concentration Pathway
VI	Vulnerability Index

Appendix A. Supplementary Stylistic-Alignment Metrics

Table A1. Stylistic alignment with the human-authored expert reference, measured by ROUGE-L F-measure (surface n-gram overlap) and BERTScore

F 1

(semantic similarity, RoBERTa-large), under both ablation conditions and both model families. These metrics measure alignment between the generated report and a single expert reference, not faithfulness to the input table; they are retained for transparency but are not used as faithfulness indicators in the main text (Section 3.3 and Section 3.4). Cells report mean ± SD over

n = 3

generations.

Table A1. Stylistic alignment with the human-authored expert reference, measured by ROUGE-L F-measure (surface n-gram overlap) and BERTScore

F 1

(semantic similarity, RoBERTa-large), under both ablation conditions and both model families. These metrics measure alignment between the generated report and a single expert reference, not faithfulness to the input table; they are retained for transparency but are not used as faithfulness indicators in the main text (Section 3.3 and Section 3.4). Cells report mean ± SD over

n = 3

generations.

Metric	Gemini WE	Gemini NE	Llama WE	Llama NE
Hazard Report
ROUGE-L F	0.216 ± 0.020	0.170 ± 0.020	0.201 ± 0.024	0.167 ± 0.021
BERTScore $F 1$	0.849 ± 0.006	0.832 ± 0.005	0.850 ± 0.001	0.842 ± 0.002
PRI Assessment Report
ROUGE-L F	0.207 ± 0.016	0.151 ± 0.005	0.255 ± 0.003	0.176 ± 0.002
BERTScore $F 1$	0.834 ± 0.002	0.817 ± 0.004	0.847 ± 0.008	0.814 ± 0.007

The four columns show consistent patterns: both ROUGE-L and BERTScore decrease on exemplar removal in every cell of the 2 × 2 × 2 (type × condition × model) design. These declines indicate reduced stylistic and lexical alignment with the expert reference under exemplar removal. They do not, however, measure whether the generated content is faithful to the input table; that role is filled by the three metrics in Section 3.3 and Section 3.4. BERTScore values near 0.83–0.85 are characteristic of coherent English text on related topics, and the small absolute deltas (

Δ

BERTScore ≈ 0.01–0.03) sit within the range where surface lexical overlap on shared domain vocabulary (HI, EI, VI, PRI, asset names) can dominate the metric.

Notes

1	https://github.com/NATURE-DEMO/clima-ind-viz
2	http://ollama.com
3	https://nature-demo-dst.dic-cloudmate.eu
4	https://github.com/NATURE-DEMO/Decision_Support_Tool

References

Pörtner, H.O.; Roberts, D.C.; Tignor, M.; Poloczanska, E.S.; Mintenbeck, K.; Alegría, A.; Craig, M.; Langsdorf, S.; Löschke, S.; Möller, V.; et al. (Eds.) Climate Change 2022: Impacts, Adaptation and Vulnerability. Contribution of Working Group II to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change; Cambridge University Press: Cambridge, UK and New York, NY, USA, 2022. [Google Scholar]
Hajializadeh, D.; Imani, M. RV-DSS: Towards a resilience and vulnerability-informed decision support system framework for interdependent infrastructure systems. Comput. Ind. Eng. 2021, 156, 107276. [Google Scholar] [CrossRef]
Chang, C.M.; Hossain, A. A climate adaptation asset risk management approach for resilient roadway infrastructure. Infrastructures 2024, 9, 226. [Google Scholar] [CrossRef]
Vlachogiannis, D.; Zarikos, I.; Sfetsos, A.; Rimlinger, J.; Jaumouillé, A.; Freissinet, C.; Santala, V.; Tzempelikos, D.; Dubovik, M. A Uniform Framework for Climate Change Adaptation of Critical Infrastructure Using Nature-Based Solutions. Infrastructures 2026, 11, 65. [Google Scholar] [CrossRef]
Šedová, B.; Binder, L.; Michelini, S.; Schellens, M.; Rüttinger, L. A Review of Climate Security Risk Assessment Tools. Environ. Secur. 2024. [Google Scholar] [CrossRef]
Jacob, D.; Petersen, J.; Eggert, B.; Alias, A.; Christensen, O.B.; Bouwer, L.M.; Braun, A.; Colette, A.; Déqué, M.; Georgievski, G.; et al. EURO-CORDEX: New High-Resolution Climate Change Projections for European Impact Research. Reg. Environ. Change 2014, 14, 563–578. [Google Scholar] [CrossRef]
IUCN. IUCN Global Standard for Nature-based Solutions: A User-Friendly Framework for the Verification, Design and Scaling Up of NbS, 1 ed.; International Union for Conservation of Nature: Gland, Switzerland, 2020. [Google Scholar]
World Bank. Implementing Nature-Based Flood Protection: Principles and Implementation Guidance; World Bank Group: Washington, DC, USA, 2017. [Google Scholar]
European Commission. Evaluating the Impact of Nature-Based Solutions: A Handbook for Practitioners; Publications Office of the European Union: Luxembourg, 2021. [Google Scholar]
Kabisch, N.; Frantzeskaki, N.; Pauleit, S.; Naumann, S.; Davis, M.; Artmann, M.; Haase, D.; Knapp, S.; Korn, H.; Stadler, J.; et al. Nature-Based Solutions to Climate Change Mitigation and Adaptation in Urban Areas: Perspectives on Indicators, Knowledge Gaps, Barriers, and Opportunities for Action. Ecol. Soc. 2016, 21, 39. [Google Scholar] [CrossRef]
Sowińska-Świerkosz, B.; García, J. What are Nature-based Solutions (NBS)? Setting Core Ideas for Concept Clarification. Nat.-Based Solut. 2022, 2, 100009. [Google Scholar] [CrossRef]
Vincent, K.; Daly, M.; Scannell, C.; Leathes, B. What Can Climate Services Learn from Theory and Practice of Co-Production? Clim. Serv. 2018, 12, 48–58. [Google Scholar] [CrossRef]
Lemos, M.C.; Kirchhoff, C.J.; Ramprasad, V. Narrowing the Climate Information Usability Gap. Nat. Clim. Change 2012, 2, 789–794. [Google Scholar] [CrossRef]
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the NeurIPS, 2020; Vol. 33. [Google Scholar]
Ploennigs, J.; Berger, M.; Wortmann, T.; Kirchner, J.; Beetz, J.; Roitberg, A.; Menzel, K.; Ommer, B. Building foundation models-potentials, challenges and research directions for using LLM and LVM in AEC. In Proceedings of the EC3, 2025; Vol. 6. [Google Scholar]
Maynez, J.; Narayan, S.; Bohnet, B.; McDonald, R. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the ACL, 2020; pp. 1906–1919. [Google Scholar] [CrossRef]
Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 248. [Google Scholar] [CrossRef]
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the NeurIPS; 2020; Vol. 33. [Google Scholar]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the NeurIPS, 2022; Vol. 35. [Google Scholar]
Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; Zettlemoyer, L. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the EMNLP, 2022; pp. 11048–11064. [Google Scholar]
Vaghefi, S.A.; Stammbach, D.; Muccione, V.; Bingler, J.; Ni, J.; Kraus, M.; Allen, S.; Colesanti-Senni, C.; Wekhof, T.; Schimanski, T.; et al. ChatClimate: Grounding Conversational AI in Climate Science. Commun. Earth Environ. 2023, 4, 480. [Google Scholar] [CrossRef]
Ni, J.; Bingler, J.; Colesanti-Senni, C.; Kraus, M.; Gostlow, G.; Schimanski, T.; Stammbach, D.; Vaghefi, S.A.; Wang, Q.; Webersinke, N.; et al. CHATREPORT: Democratizing Sustainability Disclosure Analysis through LLM-based Tools. In Proceedings of the EMNLP; 2023; pp. 21–51. [Google Scholar] [CrossRef]
Webersinke, N.; Kraus, M.; Bingler, J.A.; Leippold, M. ClimateBert: A Pretrained Language Model for Climate-Related Text. arXiv 2021, arXiv:2110.12010. [Google Scholar] [CrossRef]
Beck, H.E.; Zimmermann, N.E.; McVicar, T.R.; Vergopolan, N.; Berg, A.; Wood, E.F. Present and Future Köppen-Geiger Climate Classification Maps at 1-km Resolution. Sci. Data 2018, 5, 180214. [Google Scholar] [CrossRef] [PubMed]
Jakubik, J.; Roy, S.; Phillips, C.; Fraccaro, P.; Godwin, D.; Zadrozny, B.; Szwarcman, D.; Gomes, C.; Nyirjesy, G.; Edwards, B.; et al. Foundation models for generalist geospatial artificial intelligence. arXiv 2023, arXiv:2310.18660. [Google Scholar] [CrossRef]
Strauss, A.; Fernandes, S.; Ionescu, F.D.; Hübl, J.; Stangl, R.; Kuschel, E.; Obriejetan, M.; Machí Castañer, L.; Wirth, M.; Canga, E.; et al. Methodology for NbS Integration into Digital Tools. Technical Report Deliverable D1.2, NATURE-DEMO: Nature-Based Solutions for Demonstrating Climate-Resilient Critical Infrastructure, 2025. Horizon Europe Grant Agreement No. 101157448.
Barrios Crespo, E.; Fuentes Álvarez de Eulate, M.; López Lara, J. Methodological Framework for Risk Reduction Analysis of Infrastructures Based on Nature-Based Solutions. Deliverable d2.1, NATURE-DEMO Consortium, 2025.
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the NeurIPS; 2020; Vol. 33. [Google Scholar]
European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act), 2024.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
Williams, A.; Nangia, N.; Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the NAACL-HLT, 2018; pp. 1112–1122. [Google Scholar]
Min, S.; Krishna, K.; Lyu, X.; Lewis, M.; Yih, W.t.; Koh, P.; Iyyer, M.; Zettlemoyer, L.; Hajishirzi, H. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the Conf. on Empirical Methods in Natural Language Processing, 2023; pp. 12076–12100. [Google Scholar]
Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2004; pp. 74–81. [Google Scholar]
Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the ICLR, 2020. [Google Scholar]
Grattafiori, A.; Dubey, A.; Jauhri, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
Wilcoxon, F. Individual Comparisons by Ranking Methods. Biom. Bull. 1945, 1, 80–83. [Google Scholar] [CrossRef]
Barrington-Leigh, C.; Millard-Ball, A. The World’s User-Generated Road Map Is More Than 80% Complete. PLoS ONE 2017, 12, e0180698. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Simplified architecture of the Custom Site Analysis interface, showing the three independent analytical tools and the shared reporting layer. The top tier illustrates the polygon-based site-characterization workflow that supports site selection and infrastructure identification. The middle tier represents the qualitative Level 1 preparatory screening. The bottom tier outlines the quantitative Level 2 risk-assessment chain. The Grounded Reporting Framework (right panel) translates the analytical outputs of each tier into stakeholder-facing narrative reports through a shared one-shot prompting pattern.

Figure 3. End-to-end Custom Site Analysis on the Rotterdam Maasvlakte polygon (ROT, Cfb, 100.5 km², 34,416 elements). (a) Polygon outline on satellite imagery, centroid marked. (b) Köppen-Geiger classification map at the polygon centroid, returning class Cfb. (c) Excerpt from the Geographical Context report generated by Gemini 2.5 Flash Lite. (d) Excerpt from the Köppen-Geiger Interpretation report generated by the same model. Reports are truncated for display.

Figure 6. Reproduction of the two principal exemplar-removal effects on the PRI Assessment Report across three model families. (a) Flagged numeric tokens per report under the deterministic numeric token detector. (b) Output length in words. In both panels the no-exemplar bar exceeds the full-prompt bar for every model, confirming that exemplar removal raises numeric speculation and report length regardless of the underlying model. Bars are means over

n = 3

generations per condition.

Figure 6. Reproduction of the two principal exemplar-removal effects on the PRI Assessment Report across three model families. (a) Flagged numeric tokens per report under the deterministic numeric token detector. (b) Output length in words. In both panels the no-exemplar bar exceeds the full-prompt bar for every model, confirming that exemplar removal raises numeric speculation and report length regardless of the underlying model. Bars are means over

n = 3

generations per condition.

Table 2. Per-component latencies and OSM extraction profile for the three case-study locations. Extraction and Köppen-sampling latencies are mean ± standard deviation (SD) over three repetitions; LLM-generation latencies are single-run measurements. “Total end-to-end” is the sum of all components. Dashes indicate categories with zero or negligible elements at that site.

	ROT (Rotterdam)	ATH (Athens)	INN (Innsbruck)
Polygon centroid	51.970, 4.090	37.980, 23.735	47.235, 11.440
Polygon area (km²)	100.5	19.5	70.5
Köppen-Geiger code	Cfb (temperate oceanic)	Csa (hot-summer Mediterr.)	Dfb (warm-summer cont.)
OSM elements (total)	34,416	60,282	18,872
Roads & highways	6,887	13,877	15,620
Railways	1,193	—	3,067
Bridges	204	—	301
Tunnels	—	—	258
Slope stabilization	—	—	113
Embankments & levees	—	—	24
Buildings	26,066	45,811	—
Urban green spaces	—	571	—
Power & utilities	—	25	—
Water bodies & rivers	245	—	—
Overpass extraction (s)	14.6 ± 6.9	19.2 ± 9.9	9.7 ± 8.7
Köppen sampling (s)	0.53 ± 0.17	0.40 ± 0.02	0.40 ± 0.02
Geographical Context report (s)	34.8	12.8	10.7
Köppen Interpretation report (s)	20.0	10.0	8.4
Total end-to-end (s)	69.9	42.4	29.2

Table 3. Ablation results for the Hazard and PRI Assessment Reports under Gemini 2.5 Flash Lite, evaluated with the three faithfulness metrics described in Section 2.5.4. Column WE (with exemplar) reports the deployed full-prompt configuration in which the reference exemplar pair is embedded in the GRF prompt. Column NE (no exemplar) reports the ablated configuration in which the exemplar pair is removed and only the strict-protocol system instruction and the input data block are retained. Column

Δ

reports the per-row difference

NE - WE

; a positive

Δ

therefore indicates a larger value under the no-exemplar condition. Each cell reports the mean over

n = 3

independent generations. Hallucinations are counted per report under the deterministic numeric token detector. NLI ent. rate is computed over entailed + contradicted sentences. Judge

ρ_{\sup}

is the verifiable support rate.

Table 3. Ablation results for the Hazard and PRI Assessment Reports under Gemini 2.5 Flash Lite, evaluated with the three faithfulness metrics described in Section 2.5.4. Column WE (with exemplar) reports the deployed full-prompt configuration in which the reference exemplar pair is embedded in the GRF prompt. Column NE (no exemplar) reports the ablated configuration in which the exemplar pair is removed and only the strict-protocol system instruction and the input data block are retained. Column

Δ

reports the per-row difference

NE - WE

; a positive

Δ

therefore indicates a larger value under the no-exemplar condition. Each cell reports the mean over

n = 3

independent generations. Hallucinations are counted per report under the deterministic numeric token detector. NLI ent. rate is computed over entailed + contradicted sentences. Judge

ρ_{\sup}

is the verifiable support rate.

Metric	WE	NE	$Δ$	Effect
Hazard Report
Output length (words)	401	311	−90 (−22%)	Length decreases
Numeric hallucinations (flagged)	0.0	0.0	0.0	None in either condition
NLI entailment rate	0.56	0.46	−0.11	Grounding decreases
NLI sentences contradicted (count)	4.7	5.3	+0.7	Slight rise
Judge $ρ_{\sup}$ (verifiable)	0.81	0.79	−0.02	Approximately stable
PRI Assessment Report
Output length (words)	595	1109	+515 (+87%)	Length increases sharply
Numeric hallucinations (flagged)	2.0	7.0	+5.0	Hallucinations rise on removal
NLI entailment rate	0.67	0.68	+0.01	Stable
NLI sentences contradicted (count)	1.0	3.0	+2.0	Contradictions rise
Judge $ρ_{\sup}$ (verifiable)	0.53	0.68	+0.16	Rises (see text)
Judge unverifiable claims (count)	11.3	6.0	−5.3	Falls

Table 4. Cross-model comparison: ablation results under Gemini 2.5 Flash Lite and Llama 3.1 8B Instruct, evaluated with the three faithfulness metrics described in Section 2.5.4. Columns labelled WE report the with-exemplar (deployed) configuration; columns labelled NE report the no-exemplar (ablated) configuration. The same WE / NE convention introduced in Table 3 is used throughout. Cells report the mean over

n = 3

generations. The rightmost column reports whether the within-model

Δ = NE - WE

has the same sign in both models. “≈0” indicates that both deltas are below 0.01 in absolute value. The Hazard report’s exemplar effect is below the noise floor of these metrics in both models; the PRI report shows directional consistency on five of seven metrics.

Table 4. Cross-model comparison: ablation results under Gemini 2.5 Flash Lite and Llama 3.1 8B Instruct, evaluated with the three faithfulness metrics described in Section 2.5.4. Columns labelled WE report the with-exemplar (deployed) configuration; columns labelled NE report the no-exemplar (ablated) configuration. The same WE / NE convention introduced in Table 3 is used throughout. Cells report the mean over

n = 3

generations. The rightmost column reports whether the within-model

Δ = NE - WE

has the same sign in both models. “≈0” indicates that both deltas are below 0.01 in absolute value. The Hazard report’s exemplar effect is below the noise floor of these metrics in both models; the PRI report shows directional consistency on five of seven metrics.

Metric	Gemini WE	Gemini NE	Llama WE	Llama NE	Direction match
Hazard Report
Output length (words)	401	311	408	301	Yes
Numeric halluc. (flagged)	0.0	0.0	0.0	0.0	≈0
NLI entailment rate	0.56	0.46	0.40	0.51	No
NLI contradicted (count)	4.7	5.3	5.3	4.3	No
Judge $ρ_{\sup}$ (verifiable)	0.81	0.79	0.57	0.75	No
PRI Assessment Report
Output length (words)	595	1109	350	533	Yes
Numeric halluc. (flagged)	2.0	7.0	0.0	2.3	Yes
NLI entailment rate	0.67	0.68	0.75	1.00	Yes
NLI contradicted (count)	1.0	3.0	1.0	0.0	No
Judge $ρ_{\sup}$ (verifiable)	0.53	0.68	0.57	0.67	Yes
Judge unverifiable (count)	11.3	6.0	7.3	0.3	Yes

Table 5. Third generation arm: ablation results under GPT-5.4 mini, with GPT-5.4 mini also acting as the judge. Cells report the mean over

n = 3

generations. Columns WE and NE use the same with-exemplar / no-exemplar convention introduced in Table 3;

Δ = NE - WE

. Judge

ρ_{\sup}

is the verifiable support rate. The numeric-hallucination and length asymmetries reproduce the pattern seen under Gemini and Llama; the judge support rate is discussed in the text alongside the Llama judge.

Table 5. Third generation arm: ablation results under GPT-5.4 mini, with GPT-5.4 mini also acting as the judge. Cells report the mean over

n = 3

generations. Columns WE and NE use the same with-exemplar / no-exemplar convention introduced in Table 3;

Δ = NE - WE

. Judge

ρ_{\sup}

is the verifiable support rate. The numeric-hallucination and length asymmetries reproduce the pattern seen under Gemini and Llama; the judge support rate is discussed in the text alongside the Llama judge.

Metric	WE	NE	$Δ$	Effect
Hazard Report
Output length (words)	536	571	+35	Approximately stable
Numeric hallucinations (flagged)	0.0	0.3	+0.3	Near zero in both
NLI entailment rate	0.50	0.35	−0.15	Grounding decreases
Judge $ρ_{\sup}$ (verifiable)	0.97	0.97	0.00	Stable and high
PRI Assessment Report
Output length (words)	713	1259	+545 (+76%)	Length increases sharply
Numeric hallucinations (flagged)	0.0	3.0	+3.0	Hallucinations rise on removal
NLI entailment rate	0.58	0.44	−0.14	Grounding decreases
Judge $ρ_{\sup}$ (verifiable)	0.94	0.98	+0.04	Approximately stable
Judge unverifiable claims (count)	8.3	14.0	+5.7	Rises

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Constrained LLM Reporting for Geospatial Climate Risk: A One-Shot In-Context Framework for Critical Infrastructure

Abstract

Keywords:

Subject:

1. Introduction

2. Methodology

2.1. Decision Support Tool: Workflow Overview

2.2. Polygon-Based Site Characterization

2.3. Quantitative Risk-Assessment Chain

2.4. Grounded Reporting Framework

2.5. Evaluation Methodology

2.5.1. Experimental Execution and Reproducibility

2.5.2. Functional Demonstration on Three Self-Selected Locations

2.5.3. Performance Characterization

2.5.4. Ablation Study on Embedded Exemplars

Deterministic numeric token detector

Sentence-level natural language inference (NLI)

LLM-as-judge claim decomposition

2.5.5. Brief Definitions of Referenced Evaluation Metrics

2.5.6. Cross-Model Replication

3. Results

3.1. Functional Demonstration Across the Three Case-Study Locations

3.2. Performance Characterization

3.3. Ablation Study on Embedded Exemplars

3.3.1. Ablation Results

3.4. Cross-Model Replication

4. Discussion

4.1. Applied Contribution

4.2. Practical Implications for Climate-Adaptation Decision Support

4.3. Limitations

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Appendix A. Supplementary Stylistic-Alignment Metrics

Notes

References

MDPI Initiatives

Important Links

Subscribe