BMRS: Bongard–Maximov Problems for Remote Sensing

Nikita Firsov; Olga Terekhova; Nikita Odinets; Alexey Fedotov; Artem Muzyka; Anna Ukhanaeva; Anastasia Sarycheva; Sergei Gladilin; Dmitry Sidorchuk

doi:10.20944/preprints202606.1484.v1

Submitted: 18 June 2026

Posted: 22 June 2026


Abstract

Vision-language models (VLMs) are increasingly used in remote sensing (RS), yet their ability to perform abstract visual reasoning remains poorly understood. Existing RS benchmarks primarily evaluate perception-oriented capabilities, such as scene classification, object detection, image captioning, and visual question answering, providing limited insight into higher-level reasoning. To address this gap, we introduce BMRS, the first remote-sensing benchmark based on the Bongard problem paradigm. The benchmark comprises 122 problems constructed from satellite and aerial imagery and organized into seven categories: Shape, Semantic, Presence, Spatial, Size, Number, and Same. Unlike conventional perception benchmarks, BMRS requires models to infer abstract concepts from multiple images, thereby evaluating few-shot learning, concept induction, analogy making, and relational reasoning. To establish a human baseline, we conducted a study involving 113 participants, who achieved an average accuracy of 74.5%. We then evaluated three groups of models: proprietary large-scale VLMs, open-source general-purpose VLMs, and remote sensing vision-language models (RSVLMs). The strongest proprietary models, ChatGPT and Gemini, achieved accuracies of 89.3% and 86.9%, respectively, surpassing human performance. In contrast, the best open-source general-purpose model achieved 42.6% accuracy, while the strongest RSVLM reached only 20.5%. Analysis across problem categories revealed that Semantic problems were the easiest for both humans and models, indicating that object-level semantic understanding transfers effectively to the remote-sensing domain: several RSVLMs achieved performance comparable to substantially larger general-purpose models. Conversely, Spatial problems proved the most challenging, highlighting spatial-relational reasoning as a key limitation of current RSVLMs. 
BMRS provides a challenging benchmark for measuring progress toward reasoning-capable RSVLMs and offers new insights into the strengths and limitations of current model adaptation strategies.

Keywords: Bongard problems; remote sensing; vision–language models; benchmark

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Timely and reliable information from remote sensing (RS) is essential for addressing environmental challenges and supporting evidence-based natural resource management. Remote sensing data are widely used in environmental monitoring [1], agriculture [2], land-cover mapping [3], urban analysis [4], disaster assessment [5], and natural resource inventory [6]. The volume and diversity of available remote sensing imagery (RSI) continue to increase due to advances in imaging technologies and the growing availability of compact satellite platforms [7,8]. At the same time, improvements in image-processing methods further enhance the quality and usability of RSI [9,10,11]. As a result, manual interpretation becomes increasingly difficult to scale, creating a growing demand for automated approaches to high-level semantic analysis of RS data. Consequently, vision-language models (VLMs) are routinely applied to RSI. By aligning visual and textual modalities within a shared representation space, VLMs can interpret image content and express their understanding in natural language, enabling tasks that require both visual understanding and language-based reasoning. However, general-purpose VLMs are predominantly trained on human-perspective (ground-level) photographs and often perform poorly on RS-specific applications [12]. This domain gap is largely explained by the distinctive properties of RSI, including large spatial extents, pronounced scale variation, and sparse or subtle visual cues [13]. Consequently, current research focuses on remote sensing vision-language models (RSVLMs), i.e., VLMs adapted specifically to RS data.

RSVLMs can be applied to established remote sensing tasks, such as scene classification, object detection, semantic segmentation, and change detection. Beyond these traditional computer vision tasks, modern conversational VLMs support a broad range of vision-language applications, including image captioning, cross-modal retrieval, and visual question answering (VQA), i.e., answering free-form natural-language questions about an input image. Among these tasks, VQA is particularly challenging because success depends not only on recognizing individual objects but also on understanding their relationships and broader scene context [12]. Although several remote sensing VQA benchmarks have been proposed [14,15], recent surveys emphasize that the range of available benchmarks for evaluating RSVLM reasoning capabilities remains limited [16].

Benchmarking is more developed in general-purpose VLMs than in RSVLMs. Existing benchmarks assess sensitivity to visual illusions [17], logical learning [18], mathematical reasoning [19], and analogy-based visual reasoning [20]. Among the most challenging evaluation paradigms are Bongard problems (BPs) [21]. A BP presents a Bongard matrix containing two sets of images, arranged on the left and right sides according to an underlying rule or concept that must be inferred and expressed in natural language (Figure 1). BPs are typically designed so that human observers can solve them within a few minutes. BPs are particularly appealing because they capture several fundamental aspects of human cognition [22,23]. First, they require few-shot concept learning, where a visual concept must be inferred from only a small number of right and left images. Second, they involve context-dependent reasoning, since the interpretation of an image depends on the other images present in the Bongard matrix. Third, they require analogy-making perception, in which structural relationships rather than individual visual features must be recognized and generalized. Originally proposed by Mikhail Bongard and colleagues in 1967 [24] and later popularized through the work of Hofstadter [25], BPs remain a concise, interpretable, and challenging benchmark for abstract visual reasoning.

To the best of our knowledge, no Bongard-problem benchmark has yet been developed for remote sensing imagery. To address this gap, we introduce the first remote sensing benchmark based on the Bongard problem paradigm. To acknowledge the contributions of Vadim Maximov to the development of the original Bongard problems, we name our benchmark BMRS: Bongard–Maximov Problems for Remote Sensing (see Appendix C).

The proposed benchmark enables systematic evaluation of both general-purpose VLMs and RSVLMs on reasoning capabilities that extend beyond domain-specific visual perception. The main contributions of this work are threefold. First, we introduce BMRS, the first remote-sensing benchmark based on the Bongard problem paradigm. The benchmark comprises 122 problems constructed from aerial and satellite imagery and is designed to evaluate few-shot concept learning, context-dependent reasoning, analogy making, and relational reasoning. Second, we establish a human baseline through a study involving 113 participants and evaluate a diverse set of proprietary general-purpose VLMs, open-source general-purpose VLMs, and specialized remote-sensing VLMs. Third, we provide a detailed analysis of model performance across multiple categories of Bongard problems, identifying strengths and limitations of current remote-sensing model adaptation strategies and highlighting key directions for future research.

2. Related Work

2.1. Vision-Language Models

Vision-language models can be broadly categorized into contrastive and conversational models [12]. Contrastive VLMs, such as CLIP [26], learn a shared embedding space for images and text through cross-modal alignment. While effective for representation learning and retrieval, they cannot directly generate natural-language responses and are not designed for instruction-following or complex reasoning tasks [16].

Conversational VLMs, such as LLaVA [27], typically consist of three components: a visual encoder that extracts semantic information from images, a connector module that maps visual features into the language-model embedding space, and a large language model (LLM) that processes multimodal inputs and generates natural-language responses. Through multimodal alignment and instruction tuning, these models learn to connect visual content with language and follow user instructions. Unlike contrastive VLMs, conversational VLMs can explain their predictions and perform reasoning in natural language, making them suitable for tasks that require abstract visual inference, such as Bongard problems.

Despite the rapid progress of general-purpose VLMs, their performance on remote sensing tasks is often limited by the substantial domain gap between natural imagery and remote sensing imagery (RSI) [16]. Differences in viewpoint, spatial scale, scene composition, and semantic structure make RSI fundamentally different from the data on which most generic VLMs are trained. To address this challenge, recent research has focused on remote sensing vision-language models (RSVLMs), which are adapted to remote sensing data through domain-specific training.

However, the relatively limited size and diversity of remote sensing image-text datasets make training large multimodal models from scratch impractical. Consequently, most RSVLMs inherit architectures and pre-trained weights from successful general-purpose VLMs and are subsequently adapted using remote sensing datasets [16]. Many open-source RSVLMs are based on LLaVA-1.5, including VHM [28], GeoChat [29], RS-LLaVA [30], and SkySenseGPT [31]. LHRS-Bot [32] shares only the visual encoder of LLaVA-1.5, while RSGPT [33] shares only its LLM component. More recent models explore alternative architectures, although relatively few of them, such as TeoChat [34] and EarthDial [35], are open source.

Benchmark results indicate that domain adaptation can substantially improve performance on remote sensing tasks. On GeoBench, RSVLMs based on LLaVA-1.5 often outperform their generic counterpart and, in some cases, surpass more advanced general-purpose VLMs. Nevertheless, RSVLMs continue to lag behind generic VLMs on tasks requiring complex reasoning, as evidenced by results on the CHOICE benchmark [36]. This suggests that current adaptation strategies effectively improve domain-specific perception but do not fully close the reasoning gap.

Recent work has therefore explored reinforcement learning (RL) as a means of enhancing reasoning capabilities in RSVLMs [37]. Examples include GeoVLM-R1 [37], which demonstrates substantial gains in visual reasoning, ViLaCD-R1 [38] for semantic change detection, and the RS-EoT framework [13], which integrates iterative reasoning with visual inspection. As RL-based training continues to advance, benchmark datasets must evolve accordingly to provide challenging and reliable evaluations of RSVLM capabilities.

2.2. RSVLM Benchmarks

The dominant evaluation paradigm for RSVLMs is Remote Sensing Visual Question Answering (RSVQA), in which a model is provided with a remote sensing image and a natural-language question and is expected to produce a short textual answer. Early RSVQA benchmarks were fragmented and limited in scope: most datasets focused on isolated capabilities, lacked support for multi-temporal or context-aware reasoning, and relied on rigid exact-match evaluation protocols [36]. To address these limitations, the CHOICE benchmark was introduced as a structured and hierarchical evaluation framework for RSVLMs. It organizes model capabilities into two primary dimensions – perception and reasoning – which are further decomposed into six sub-dimensions and 23 sub-tasks, enabling a more comprehensive assessment of model performance.

As RSVLM research has progressed, benchmarks have become increasingly challenging, often by adapting evaluation settings from general vision-language benchmarks to remote sensing. These include, for example, mathematical reasoning and multi-image understanding benchmarks. GeoMath [39] evaluates mathematical reasoning across geometry, logic, statistics, arithmetic, counting, and algebra. Multi-image benchmarks are designed to assess the ability to model relationships, analogies, and fine-grained visual differences that cannot be inferred from a single image. In remote sensing, representative examples include GeoBench [40] and VLRS-Bench [41], which use multiple images to evaluate tasks such as change detection.

Bongard problems also belong to this category of multi-image benchmarks. They are specifically designed to evaluate abstract visual reasoning and pattern induction from sets, requiring generalization across multiple images. However, to the best of our knowledge, Bongard-style benchmarks have not yet been adapted to the remote sensing domain.

2.3. Bongard Benchmarks

Several Bongard-based benchmarks have been introduced over time, including some that predate the emergence of VLMs. One of the earliest is the Bongard-LOGO benchmark [22], introduced in 2020. Its authors proposed an automatic BP generator, thereby greatly expanding the original set of BPs. Bongard-HOI [23] later transferred the paradigm to real-world images, although it replaced the original geometric concepts with human-object interactions. The more recent Bongard-OpenWorld benchmark [42] extended this paradigm to diverse real-world concepts. Developed in the era of VLMs and LLMs, it showed that large commercial neural networks can perform slightly better than humans, although most modern neural models still perform substantially below the average human level.

Subsequent work [43] hypothesized that the poor performance of VLMs on the original BPs is caused by a domain gap, because the models were trained on photographs but evaluated on abstract drawings. To test this hypothesis, the authors introduced Bongard-RWR, in which the concepts of the original Bongard problems were recreated using photographs; this translation was achieved for 60 of the 100 original BPs. In a follow-up study, the authors proposed an AI-based augmentation method and applied it to Bongard-RWR, resulting in the substantially larger Bongard-RWR+ benchmark [44]. This augmentation included generation of image descriptions based on the original Bongard-RWR dataset, text-to-text paraphrasing, text-to-image rendering, manual quality filtering, and diversity-aware subset composition to create new problem instances.

Nevertheless, verifying open-form answers becomes difficult on large datasets, and the tasks remain challenging for models. Therefore, several simpler task formats have been proposed, typically involving selection of the correct answer from a set of options. Although models achieve higher results on domain-adapted Bongard tasks than on the original Bongard problems, these analogues still prove difficult for current models across all task formats.

3. Materials and Methods

3.1. Remote Sensing Bongard Benchmark

Various Bongard-related task formats exist in the literature, whereas Bongard himself originally proposed two formats: free-form concept formulation and image-to-side classification [24]. We use the free-form concept formulation — identifying the rule that distinguishes the two sides — as the most challenging format, since it prevents cues and guessing.

The original set of problems included only geometric problems, and Bongard required them to satisfy the following conditions [24]:

1. the problems include black-and-white images without halftones (i.e., line drawings);
2. the information contained in the images themselves is sufficient to solve the problem (analytic reasoning);
3. the problems are solvable by a human observer.

Relaxing the first requirement is the central idea of the present work: instead of line drawings, we consider real photographs. Similar extensions have been explored in [42,43], but, to the best of our knowledge, this is the first Bongard benchmark based on RSI. The second requirement has also been relaxed in recent work [23,42]. This change is motivated by the emergence of VLMs, which are trained on large-scale datasets and are expected to combine visual perception with the common sense and preconceived notions of a human observer. Accordingly, in addition to problems that require purely analytic reasoning, our benchmark also includes problems that require broader semantic or knowledge-based (synthetic) reasoning.

We consider the third requirement essential and therefore preserve it. In his book, Bongard gave an example of a formally valid but excessively difficult problem in which the difference between the total number of angles and the total number of shapes on each side serves as an organizing rule. To verify that our problems remain solvable for humans, we conducted human studies (Section 3.4).

The remote sensing images were selected to cover a wide range of acquisition heights and spatial resolutions, including imagery captured from satellites, aircraft, and unmanned aerial vehicles. Most images were acquired in nadir view, whereas a smaller subset was captured at an oblique angle. All images are RGB. The data sources include Google Earth and the VisDrone [45], AID [46], DOTA [47], MLRSNet [48], RRealHyperPDID [11], and WAID [49] datasets. The resulting benchmark contains 122 Bongard problems. Table 1 compares BMRS with previously published Bongard benchmarks.

3.1.1. Collection Methods

We employed two complementary collection strategies. First, most problems were constructed manually. For each problem, the underlying concept was defined by the authors, and then suitable images were selected through visual inspection. Some of these problems were designed as RS analogues of original Bongard tasks (Figure 2).

Second, several categories of problems were assembled semi-automatically using object annotations. This approach was used for presence/absence tasks, class-contrast tasks (e.g., vehicles versus boats), counting tasks, relative-position tasks within an image crop (e.g., upper versus lower image regions), and size-based tasks (e.g., large versus small objects). Examples of semi-automatically assembled problems are shown in Figure 3.
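The semi-automatic assembly of a presence/absence problem can be pictured as a filter over object annotations followed by sampling. The sketch below is only an illustration under stated assumptions and not the pipeline used to build BMRS: the annotation format (a mapping from image identifier to a list of class labels) and the function name `assemble_presence_problem` are hypothetical, while the six images per side follow the matrix layout used in the human study.

```python
import random
from typing import Dict, List, Tuple


def assemble_presence_problem(
    annotations: Dict[str, List[str]],
    target_class: str,
    per_side: int = 6,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Pick images with (left) and without (right) `target_class`.

    `annotations` is a hypothetical format: image id -> list of class labels.
    """
    # Sort identifiers so the result does not depend on dictionary ordering.
    with_target = sorted(
        img for img, labels in annotations.items() if target_class in labels
    )
    without_target = sorted(
        img for img, labels in annotations.items() if target_class not in labels
    )
    if len(with_target) < per_side or len(without_target) < per_side:
        raise ValueError("not enough candidate images for one of the sides")

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    left = rng.sample(with_target, per_side)
    right = rng.sample(without_target, per_side)
    return left, right
```

Candidates selected this way would still need visual inspection, since annotations alone cannot rule out an unintended alternative rule that also separates the two sides.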

3.1.2. Classification of Problems

To categorize BMRS problems, we adopted the taxonomy of original Bongard problems proposed by [21]. In this taxonomy, each problem is assigned to one of five mutually exclusive rule categories:

1. Size: the discriminative rule is based on object size;
2. Number: the discriminative rule is based on object count;
3. Spatial: the discriminative rule is based on the spatial arrangement of objects;
4. Same: objects within each image are either identical or non-identical with respect to a shared property;
5. Concept: all remaining cases.

We found this taxonomy largely applicable to RS, with one modification. Because the Concept category encompasses a broad range of rules, we further subdivided it into Shape, Semantic, and Presence subclasses:

Shape: the discriminative rule depends on shape-related properties irrespective of object semantics. For example, the BMRS analogue of original problem 97 ([24], p. 246) distinguishes rectangular from circular shapes (Figure 2).
Semantic: the discriminative rule is defined by object identity or category rather than by visual properties alone. For example, the left problem in Figure 3 contrasts watercraft and aircraft.
Presence: the discriminative rule is based on the presence or absence of a particular object or object category. For example, the right task in Figure 3 contrasts images containing houses with images containing trees.

The class distributions of BMRS and the original Bongard dataset are compared in Table 2.

3.2. Vision–Language Models

The evaluated models were selected to cover three complementary categories: open-source general-purpose VLMs, open-source RSVLMs, and state-of-the-art proprietary VLMs. Model selection was guided by architectural diversity, availability of model weights and/or source code, feasibility of local deployment, prevalence in recent literature, and relevance to complex visual reasoning tasks.

Among general-purpose VLMs, we included LLaVA-1.5-7B [27] as a representative baseline because many RSVLMs are derived from its architecture and pre-trained weights. We additionally evaluated LLaVA-1.5-13B to assess the effect of parameter scaling within the same model family. To further analyze the impact of architectural improvements while keeping the model scale comparable, we included LLaVA-1.6-7B as an updated version of the LLaVA family. Finally, we evaluated LLaVA-1.6-34B to jointly assess the effect of architectural changes and substantially increased model scale. We further included the medium-scale Qwen-3-VL-32B-Instruct and the larger InternVL-3.5-38B, two recent open-source VLMs with strong multimodal capabilities and competitive performance across a wide range of general vision-language tasks, since models of these families are frequently used in other Bongard benchmarks [21,43,44].

The RSVLM group consisted of VHM [28], RS-LLaVA [30], RS-EoT [13], GeoChat [29], SkySenseGPT [31], EarthDial [35], LHRS-Bot [32] and TeoChat [34]. These models were specifically developed for remote sensing imagery and collectively represent several generations of RSVLM development, ranging from domain-adapted LLaVA variants to more recent architectures incorporating explicit reasoning mechanisms.

To provide a reference based on current proprietary systems, we additionally evaluated Gemini 3.1 Pro Preview and ChatGPT 5.5 Pro. Although their architectures and parameter counts are not publicly disclosed, these models represent the current state of the art in proprietary multimodal AI systems and substantially exceed the scale of most open-source models considered in this study.

All models were evaluated on BMRS using a unified experimental protocol. Open-source general-purpose VLMs were used in their original form, enabling an assessment of their out-of-the-box performance on remote sensing tasks. RSVLMs were evaluated using their publicly released checkpoints without fine-tuning. Proprietary models were accessed exclusively through their respective APIs and were evaluated in inference-only mode. Table 3 summarizes the evaluated models.

Inference of RSVLMs was performed locally. VHM, RS-LLaVA, and RS-EoT were evaluated on a system equipped with an Intel Xeon W-2133 CPU, 128 GB of RAM, and an NVIDIA Quadro GV100 GPU (32 GB VRAM). GeoChat, SkySenseGPT, EarthDial, LHRS-Bot, and TeoChat were evaluated on a second system equipped with an Intel Xeon Platinum 8468 CPU, 2 TB of RAM, and an NVIDIA H100 GPU (80 GB VRAM).

Open-source general-purpose models were evaluated on a computing cluster managed via the Slurm workload manager. All experiments were executed on a single compute node with 96 logical CPU cores (Intel Xeon Platinum 8462Y), six NVIDIA A100 SXM4 GPUs (40 GB VRAM each), and approximately 1008 GB of system memory.

3.3. Prompting Strategies

Prompting strategy has a substantial impact on the performance of large language models across modalities [50]. Moreover, explicit decomposition of a task into subproblems is often critical for strong performance [51,52]. Similar findings have been reported for VLMs, where multi-stage prompting improves multi-step and spatial reasoning compared with single-query approaches and simple chain-of-thought formats [53].

To evaluate both end-to-end and decomposition-based reasoning in Bongard problems, we consider the prompting strategies of [43]: Direct, Descriptive, Descriptive-direct, Descriptive-iterative, Contrastive, Contrastive-direct, Contrastive-iterative. The Direct strategy presents the full Bongard matrix to directly predict the underlying rule.

The Descriptive family first generates image-level descriptions and then infers the rule from these representations. Variants differ in how descriptions are produced and aggregated. Descriptive processes images independently. Descriptive-direct provides the full matrix during final inference. Descriptive-iterative incrementally refines a shared side description. The Descriptive family encourages reasoning over textual abstractions rather than raw visual input.

The Contrastive family instead operates on pairwise image comparisons between the left and right sides to derive a general rule. Contrastive analyzes each pair independently and aggregates the results at the end. Contrastive-direct provides the full matrix during final reasoning. Contrastive-iterative considers pairs sequentially while accumulating context. The Contrastive family explicitly encourages relational reasoning between sides.

We implemented five of the strategies described above: Direct, Descriptive-iterative, Descriptive-direct, Contrastive-iterative, and Contrastive-direct. The original Descriptive and Contrastive strategies were omitted because most RSVLMs do not support text-only interaction without image inputs. All implemented strategies used the prompts provided in Bongard-RWR [43] with minor changes. The modified prompts are listed in Appendix A.
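The control flow of two of these strategies can be sketched as follows. This is a schematic illustration, not the implementation used in the experiments: the model interface `ask(prompt, images)` is hypothetical, the prompt strings are placeholders rather than the prompts of Appendix A, and pairing the i-th left image with the i-th right image is an assumption. Note that every query carries at least one image, in line with the constraint that most RSVLMs do not accept text-only input.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: (prompt, image paths) -> text answer.
Ask = Callable[[str, Sequence[str]], str]


def descriptive_direct(ask: Ask, left: List[str], right: List[str]) -> str:
    """Describe each image, then infer the rule with the full matrix attached."""
    descriptions = []
    for side_name, side in (("left", left), ("right", right)):
        for image in side:
            text = ask("Describe this image.", [image])
            descriptions.append(f"{side_name}: {text}")
    prompt = (
        "Image descriptions:\n"
        + "\n".join(descriptions)
        + "\nState the rule that separates the left images from the right images."
    )
    # Final inference sees both the descriptions and all images.
    return ask(prompt, left + right)


def contrastive_iterative(ask: Ask, left: List[str], right: List[str]) -> str:
    """Visit left/right pairs in order, carrying the current rule as context."""
    rule = ""
    for left_image, right_image in zip(left, right):
        prompt = (
            f"Rule so far: {rule or 'none'}. "
            "Update the rule using this left/right pair."
        )
        rule = ask(prompt, [left_image, right_image])
    return rule
```

Because `contrastive_iterative` threads the running rule through successive queries, its output can depend on the order of the pairs, which is the effect the Shuffle experiment probes.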

In addition to the implemented strategies, we introduced a Shuffle modification. Because iterative prompting strategies may be sensitive to the order in which images are presented, we evaluated this effect using TeoChat by running ten trials with randomly shuffled image orders. Model stochasticity was ruled out as a factor in this experiment, because TeoChat produced character-identical outputs when the image order was fixed. However, some evaluated models exhibited stochastic behavior; therefore, multiple runs were conducted for highly variable models.

3.4. Human Study Design

To assess the solvability of the developed problems, we evaluated human performance on the resulting dataset. The study involved 113 participants recruited from the academic community. All participants were native Russian speakers, and the study was conducted in Russian.

In a single session, each participant was asked to solve up to 20 problems randomly sampled from the dataset and ordered by difficulty. Problem difficulty was estimated during dataset development using a two-stage procedure. First, all problems were independently cross-reviewed by the authors and assigned to coarse difficulty categories based on the perceived complexity of the underlying concept and the expected effort required to identify it. Second, a pilot study involving laboratory staff who were not involved in the problem development was conducted to assess problem clarity and verify the proposed difficulty ordering. Based on participant feedback and observed solution patterns, ambiguous problems were revised or removed, and the final difficulty ranking used in the human study was established. Problem sets were constructed to preserve the class distribution of the full dataset (Table 2) while maintaining a comparable average difficulty across participants. At least 14 participants evaluated each problem.
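One way to build a problem set that preserves the class distribution is proportional allocation followed by sampling within each category. The sketch below illustrates this idea under assumptions that are not taken from the study: the largest-remainder rounding rule, the function names, and the input format (problem identifier mapped to category and to a numeric difficulty rank) are all hypothetical, and the balancing of average difficulty across participants is not modelled.

```python
import random
from collections import Counter
from typing import Dict, List


def category_quotas(categories: Dict[str, str], set_size: int) -> Dict[str, int]:
    """Split `set_size` slots across categories in proportion to their size."""
    counts = Counter(categories.values())
    total = len(categories)
    exact = {cat: set_size * n / total for cat, n in counts.items()}
    quotas = {cat: int(share) for cat, share in exact.items()}
    # Largest-remainder rule: leftover slots go to the biggest fractional parts.
    leftover = set_size - sum(quotas.values())
    by_remainder = sorted(
        exact, key=lambda cat: (exact[cat] - quotas[cat], cat), reverse=True
    )
    for cat in by_remainder[:leftover]:
        quotas[cat] += 1
    return quotas


def sample_problem_set(
    categories: Dict[str, str],
    difficulty: Dict[str, int],
    set_size: int = 20,
    seed: int = 0,
) -> List[str]:
    """Draw a stratified problem set and order it from easy to hard."""
    rng = random.Random(seed)
    chosen: List[str] = []
    for cat, quota in sorted(category_quotas(categories, set_size).items()):
        pool = sorted(p for p, c in categories.items() if c == cat)
        chosen.extend(rng.sample(pool, quota))
    return sorted(chosen, key=lambda p: (difficulty[p], p))
```

Using a different seed per participant yields different problem sets with the same category proportions.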

We adopted the classical open-ended Bongard problem format: participants were presented with six left and six right images arranged as a matrix, together with a text field for entering their answer. After one minute, participants were given the option to select “I don’t know”. Neither the total session duration nor the time allotted to individual problems was restricted.

Before the study began, participants received written instructions and a worked example based on the original Bongard problem, including several acceptable answer formulations. For each problem, participants were asked to provide a detailed explanation of the distinguishing concept, describing both the left and right sides.

3.5. Answer Evaluation

To evaluate the corpus of more than 10 000 human and model-generated responses, we adopted the LLM-as-a-judge framework [21,54], using DeepSeek-v4-flash in “Think Max” mode [55] as the automated evaluator. This model was selected for three reasons: DeepSeek-family models were previously used as judges in Bongard-problem research [43]; DeepSeek-v4-flash represents one of the strongest currently available lightweight reasoning models; and no DeepSeek models were included among the evaluated BMRS solvers, eliminating potential bias arising from evaluating a model family with itself.

To establish a reliable ground truth, three experts manually graded all human responses and a representative subset of model outputs. The evaluation protocol consisted of both general assessment criteria and class-specific guidelines covering all seven BMRS categories (Appendix B). The resulting expert-annotated corpus was divided into three subsets: few-shot examples for prompt construction, a validation set, and a test set. To prevent data leakage, the validation and test sets contained disjoint subsets of Bongard problems.
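Keeping the validation and test sets disjoint at the level of Bongard problems means that the split is made over problem identifiers rather than over individual responses. A minimal sketch of such a grouped split is given below; the record format (a dictionary with a `problem_id` key), the function name, and the 50/50 default fraction are assumptions for illustration, and the separate few-shot subset is not modelled.

```python
import random
from typing import Dict, List, Tuple


def split_by_problem(
    responses: List[Dict],
    val_fraction: float = 0.5,
    seed: int = 0,
) -> Tuple[List[Dict], List[Dict]]:
    """Split graded responses so that no problem appears in both subsets."""
    # Shuffle problem identifiers, not responses, to avoid leakage.
    problem_ids = sorted({r["problem_id"] for r in responses})
    rng = random.Random(seed)
    rng.shuffle(problem_ids)

    n_val = int(len(problem_ids) * val_fraction)
    val_ids = set(problem_ids[:n_val])

    validation = [r for r in responses if r["problem_id"] in val_ids]
    test = [r for r in responses if r["problem_id"] not in val_ids]
    return validation, test
```

A response-level random split would instead place answers to the same problem on both sides, so prompts tuned on the validation set could overfit to problems that reappear in the test set.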

The validation set was used to iteratively refine both the system prompt and class-specific prompts (Appendix B) until the judge achieved 100% agreement with expert annotations. The final evaluation on the held-out test set yielded an accuracy of 93%, computed as

$$\mathrm{Acc}_{j} = \frac{T_{P} + T_{N}}{T_{A}},$$

where $T_{P}$ and $T_{N}$ denote the numbers of true-positive and true-negative evaluations, respectively, and $T_{A}$ is the total number of evaluated responses.
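The judge accuracy can be computed directly from paired expert and judge verdicts. The following sketch assumes each verdict is a Boolean (True meaning the answer was accepted as correct); the function name and input format are hypothetical.

```python
from typing import Sequence


def judge_accuracy(expert: Sequence[bool], judge: Sequence[bool]) -> float:
    """Share of responses on which the judge agrees with the expert label."""
    if len(expert) != len(judge) or not expert:
        raise ValueError("need two non-empty label sequences of equal length")
    # True positive: both accept the answer; true negative: both reject it.
    t_p = sum(1 for e, j in zip(expert, judge) if e and j)
    t_n = sum(1 for e, j in zip(expert, judge) if not e and not j)
    return (t_p + t_n) / len(expert)
```

For instance, with toy labels (not study data) `expert = [True, True, False, False]` and `judge = [True, False, False, False]`, there is one true positive and two true negatives among four responses, so the function returns 0.75.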

4. Results

Performance was measured using accuracy, defined as the proportion of correct answers. Overall accuracy was computed as

$$\mathrm{Acc}_{a} = \frac{C_{a}}{T_{a}},$$

where $C_{a}$ is the number of correct answers and $T_{a}$ is the total number of answers. For individual problem categories, we report class-specific accuracy,

$$\mathrm{Acc}_{t} = \frac{C_{t}}{T_{t}},$$

where $C_{t}$ is the number of correct answers within a given category and $T_{t}$ is the total number of answers for that category.
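Both quantities can be obtained in a single pass over the graded answers. The sketch below assumes each answer is recorded as a (category, is_correct) pair; this record format and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracies(
    records: Iterable[Tuple[str, bool]],
) -> Tuple[float, Dict[str, float]]:
    """Return overall accuracy and a per-category accuracy dictionary."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, is_correct in records:
        total[category] += 1
        correct[category] += int(is_correct)
    if not total:
        raise ValueError("no answers to score")
    # Overall accuracy pools all answers; it is not the mean of class accuracies.
    overall = sum(correct.values()) / sum(total.values())
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    return overall, per_category
```

Because the overall figure pools all answers, categories with more answers contribute more to it than small categories do.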

4.1. Human Study Results

A total of 113 participants completed all assigned BMRS problems. To reduce fatigue, each participant was presented with a subset of up to 20 problems rather than the full benchmark. The mean participant age was 25.1 years, and the average completion time was 29.6 minutes.

Across 2 255 collected solutions, the overall human accuracy was 74.5%. Performance by problem category is summarized in Table 4. Participants achieved the highest accuracy on Semantic problems (85.4%), followed by Number (81.8%), Shape (79.7%), and Size (78.9%) problems. The lowest accuracy was observed for Spatial problems (53.3%), with Same problems also proving relatively challenging (57.7%). The most difficult individual BMRS problems were predominantly in the Spatial category. In particular, problems where a discriminative rule depended on object location within the image (e.g., objects appearing in the upper versus lower part of the scene) exhibited the lowest solution rates. Overall, the results indicate that humans reliably solve object- and concept-based BMRS problems, whereas spatial-relational reasoning is substantially more challenging. Note that we consider the benchmark to satisfy the solvability criterion, as every problem was solved correctly by at least one participant. Detailed results, including examples of the most challenging problems, are provided in Appendix D.
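The solvability criterion used here, namely that every problem is solved correctly by at least one participant, amounts to a simple check over the graded answers. A minimal sketch, assuming answers are stored as (problem_id, is_correct) pairs (a hypothetical format):

```python
from typing import Iterable, Set, Tuple


def unsolved_problems(answers: Iterable[Tuple[str, bool]]) -> Set[str]:
    """Return problems that received answers but no correct one."""
    seen: Set[str] = set()
    solved: Set[str] = set()
    for problem_id, is_correct in answers:
        seen.add(problem_id)
        if is_correct:
            solved.add(problem_id)
    return seen - solved
```

Under this check, a benchmark satisfies the criterion when the returned set is empty.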

4.2. Prompting Strategies Evaluation

The performance of different prompting strategies is summarized in Table 5. Proprietary General-Large models were excluded from this analysis because preliminary experiments showed that explicit decomposition of Bongard problems provided little benefit compared with direct prompting.

For open-source general-purpose VLMs, Descriptive strategies consistently outperformed Contrastive and Direct prompting. In particular, Descriptive-direct achieved the highest accuracy for all evaluated models except LLaVA-1.5-13B, for which Descriptive-iterative performed slightly better. These results suggest that generating intermediate image descriptions before concept induction is beneficial for Bongard problem solving.

In contrast, no single prompting strategy emerged as universally optimal for RSVLMs. LHRS-Bot and VHM achieved their best performance with Descriptive-direct, whereas RS-EoT, EarthDial, and SkySenseGPT performed best with Contrastive-iterative. GeoChat achieved its highest accuracy with Contrastive-direct, while TeoChat performed best with Direct prompting. This variability indicates that the effectiveness of problem decomposition depends strongly on the underlying model architecture and training procedure.

Overall, Descriptive prompting appears most effective for general-purpose VLMs, whereas RSVLMs exhibit less consistent behavior and derive varying benefits from decomposition-based strategies.

4.2.1. Sensitivity to Image Ordering (Shuffle) in Iterative Prompting

To evaluate the robustness of iterative prompting strategies, we investigated their sensitivity to the presentation order of the images within the Bongard matrix for each BMRS problem. Because iterative strategies incrementally build or refine rules based on sequential images, the specific order within the Bongard matrix can significantly influence the final inference. We evaluated this effect using the TeoChat model by performing ten independent trials per problem, randomly shuffling the image order within the left and right sides of the Bongard matrix for each trial. Since TeoChat produces character-identical outputs under fixed inputs, any variance in performance across these runs is entirely attributable to the permutation of the images rather than model stochasticity.
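The shuffle protocol can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function name, the file names, and the use of a seeded generator are hypothetical, and only the structure (ten trials, independent permutations within the left and right sides) comes from the description above.

```python
import random

def shuffled_trials(left, right, n_trials=10, seed=0):
    """Return n_trials (left_order, right_order) pairs.

    Images are permuted only within their own side, so the class
    membership of every image is preserved in each trial.
    """
    rng = random.Random(seed)  # seeded so the trials are reproducible
    trials = []
    for _ in range(n_trials):
        left_order = list(left)    # copy, so the inputs stay untouched
        right_order = list(right)
        rng.shuffle(left_order)
        rng.shuffle(right_order)
        trials.append((left_order, right_order))
    return trials

# Hypothetical file names for one problem with six images per side.
left = [f"left_{i}.png" for i in range(6)]
right = [f"right_{i}.png" for i in range(6)]
trials = shuffled_trials(left, right)
```

Because the model is reported to be deterministic for fixed inputs, feeding each of these orderings to the same iterative strategy isolates the effect of image order.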

The results for the Descriptive-iterative and Contrastive-iterative strategies, visualized across all 10 shuffles in Figure 4, reveal a stark disparity in performance across the BMRS taxonomy. Both strategies show the highest success rates within the Semantic category, indicating that the model is primarily capable of identifying rules based on object identity and categorical differences. Moderate success is observed in the Presence and Shape categories. Conversely, problems requiring abstract relational reasoning, specifically those categorized under Spatial, Number, and Same, remain almost entirely unsolved across all permutations, highlighting a significant limitation of the tested model.

A key finding from the shuffle experiments is the extreme fragility of the model’s reasoning paths. As shown in the frequency plots (bottom panels of Figure 4), very few problems are consistently solved across all 10 shuffles. The vast majority of solved tasks are answered correctly in only a fraction of the permutations (often fewer than half). This indicates that the model does not reason robustly under either the Descriptive-iterative or the Contrastive-iterative strategy; rather, it relies on lucky sequences of images that happen to align with its inductive biases during the iterative refinement process.

Interestingly, a comparative analysis of the two prompting strategies reveals that the specific problems solved by the Descriptive-iterative strategy and the Contrastive-iterative strategy frequently do not overlap. While both struggle with the same overarching categories (e.g., Spatial, Number), the individual problems they manage to solve within the Semantic or Presence categories differ. This suggests that describing images independently before inference (Descriptive) versus iteratively comparing them (Contrastive) triggers distinct reasoning pathways within the model. The lack of overlap implies that these prompting strategies are complementary, and a hybrid approach could potentially yield a higher overall solve rate on the BMRS benchmark.
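The two quantities discussed in this subsection, per-problem solve frequency across shuffles and the overlap between strategies, can be computed with a few set operations. The sketch below assumes outcomes are stored as a dictionary mapping a problem identifier to a list of ten booleans; this layout and the toy values are hypothetical and do not reproduce the reported results.

```python
def solve_counts(outcomes):
    """Map each problem id to the number of shuffles solved correctly."""
    return {pid: sum(runs) for pid, runs in outcomes.items()}

def solved_at_least_once(outcomes):
    """Problems answered correctly in at least one shuffle."""
    return {pid for pid, runs in outcomes.items() if any(runs)}

def solved_in_every_shuffle(outcomes):
    """Problems answered correctly in all shuffles (order-robust)."""
    return {pid for pid, runs in outcomes.items() if all(runs)}

# Toy outcomes for four problems and ten shuffles per strategy.
descriptive = {
    "p1": [True] * 10,
    "p2": [True] * 3 + [False] * 7,
    "p3": [False] * 10,
    "p4": [False] * 10,
}
contrastive = {
    "p1": [False] * 10,
    "p2": [True] * 2 + [False] * 8,
    "p3": [True] * 4 + [False] * 6,
    "p4": [False] * 10,
}

desc_solved = solved_at_least_once(descriptive)
cont_solved = solved_at_least_once(contrastive)
overlap = desc_solved & cont_solved   # solved by both strategies
combined = desc_solved | cont_solved  # upper bound for an ideal hybrid
robust = solved_in_every_shuffle(descriptive)
```

A small overlap relative to the union is what motivates the hybrid approach: the union bounds what a perfect selector between the two strategies could solve.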

4.3. Comparison of Solution Results Across Problem Classes

Table 6 and Figure 5 summarize the performance of humans and VLMs on BMRS. Detailed results are provided in Appendix F. For RS and General-Mid models, the best-performing prompting strategy was used.

Among all evaluated systems, proprietary large-scale VLMs (General-Large) achieved the highest performance. ChatGPT achieved the best overall accuracy (89.3%), closely followed by Gemini (86.9%). Both models exceeded human performance (74.5%), demonstrating higher accuracy across all problem categories. Spatial reasoning remained challenging for General-Large VLMs, although their performance stayed considerably above the human baseline.

In contrast, open-source general-purpose VLMs (General-Mid) exhibited substantially lower performance. Qwen-3-VL achieved the strongest result within this group (42.6%), followed by InternVL-3.5 (40.2%). The LLaVA family performed considerably worse, demonstrating accuracy ranging from 13.1% to 20.5%. Despite their lower overall performance, Qwen-3-VL and InternVL-3.5 consistently outperformed all evaluated RSVLMs.

The performance of RSVLMs varied considerably across models. LHRS-Bot achieved the highest overall accuracy among remote-sensing-specific models (20.5%), exceeding General-Mid LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-1.6-7B and matching LLaVA-1.6-34B despite having substantially fewer parameters. RS-EoT obtained the second-best RSVLM result (17.2%), demonstrating competitive performance on Semantic, Shape, Presence, and Spatial tasks, but performing poorly on Number, Same, and Size problems.

Among RSVLMs derived from the LLaVA-1.5 architecture, VHM achieved performance comparable to the corresponding general-purpose LLaVA baseline, whereas GeoChat, RS-LLaVA, and SkySenseGPT performed substantially worse. This observation suggests that domain adaptation alone does not guarantee improved abstract reasoning performance.

Across both human and model evaluations, Semantic problems were consistently the easiest category. Several RSVLMs, including VHM, EarthDial, and RS-EoT, achieved their strongest results on Semantic tasks, while LHRS-Bot surpassed all open-source general-purpose models except Qwen-3-VL and InternVL-3.5 in this category. In contrast, Spatial problems remained challenging for both humans and models, highlighting spatial-relational reasoning as a key area for improvement.

5. Discussion

We introduced BMRS, the first remote-sensing benchmark based on the Bongard-problem paradigm. Human participants achieved an average accuracy of 74.5% solving BMRS, which is comparable to the performance reported in previous studies on Bongard-style reasoning benchmarks – 65% [43] and 86% [21]. This suggests that the overall difficulty of BMRS is broadly consistent with that of existing Bongard benchmarks while extending the paradigm to remote sensing imagery.

Direct comparison with prior work is challenging because model architectures and capabilities evolve rapidly. Nevertheless, one useful reference point is LLaVA-v1.6 7B, which was evaluated both in our study and in [43]. Using the Direct prompting strategy, LLaVA-v1.6 7B achieved 13% accuracy on BMRS, compared with reported accuracies of 0% on the original Bongard problems, 5% on Bongard-HOI, 12% on Bongard-OpenWorld, and 0% on Bongard-RWR. This suggests that the transition from line drawings to human-perspective photographs [43] and then to remote sensing imagery is not the primary obstacle for current open-source VLMs. Instead, their main limitation appears to be the underlying reasoning task itself.

This interpretation is further supported by the category-level results. Both general-purpose VLMs and RSVLMs achieved their strongest performance on Semantic problems, which primarily require object recognition and category-level knowledge. Notably, the best RSVLMs approached or exceeded the performance of substantially larger general-purpose models on Semantic problems. In addition, the General-Mid Qwen-3-VL model achieved 84.6% accuracy on Presence problems, indicating that reliable recognition of remote-sensing objects is already possible even without remote-sensing-specific adaptation. Taken together, these results suggest that the domain gap between natural and remote-sensing imagery is not the primary limitation for current VLMs. In contrast, performance dropped markedly on Spatial problems, indicating that spatial-relational reasoning remains a major challenge even when object recognition is reliable. These findings suggest that current RSVLM adaptation strategies are more successful at transferring domain-specific perceptual knowledge than at developing abstract reasoning capabilities, such as concept induction, analogy making, and relational reasoning.

The performance of ChatGPT and Gemini demonstrates that Bongard-style reasoning is achievable at, and even beyond, human-level accuracy in modern frontier VLMs. It is worth noting that, judging by the responses, at least ChatGPT appears to be familiar with Bongard benchmarks: “It looks like there might be an interesting challenge here, possibly related to the "Bongard-HOI" task, involving satellite images like those from BigEarth”. This may indicate that Bongard benchmarks were used during training. Nevertheless, the high performance of these models is attributable to their reliance on computational resources that are far beyond those available for most practical remote-sensing applications. Consequently, an important research direction is the development of efficient domain-specialized models that retain strong reasoning abilities while remaining computationally tractable.

Future work will focus on expanding BMRS in both scale and diversity. First, additional remote-sensing analogues of original Bongard problems will be developed, potentially with the assistance of generative AI to overcome the difficulty of locating suitable images in real RS imagery. Second, we plan to investigate AI-based augmentation strategies similar to those proposed in [44]. Finally, we intend to study the effect of input representation, comparing traditional Bongard matrices with multi-image interfaces that allow models to process individual images separately. Such experiments may clarify whether limitations arise primarily from reasoning itself or from difficulties in parsing the visual structure of the Bongard matrix.

6. Conclusions

We introduced BMRS, the first remote sensing benchmark designed around the Bongard problem paradigm for evaluating vision-language models. Unlike conventional remote sensing benchmarks that primarily assess perception, BMRS evaluates the ability to perform multi-image, context-dependent reasoning, concept induction, and analogy-making. The benchmark contains 122 problems spanning seven categories: Shape, Semantic, Presence, Spatial, Size, Number, and Same. In contrast to classical Bongard problem collections, BMRS introduces a Semantic category that requires recognizing and reasoning about object identity and category information in remote sensing imagery.

A human study involving 113 participants established a reference accuracy of 74.5%. We evaluated three groups of models: proprietary large-scale general-purpose VLMs, open-source general-purpose VLMs, and specialized remote sensing VLMs (RSVLMs). The strongest models, ChatGPT and Gemini, achieved 89.3% and 86.9%, respectively, exceeding average human performance. In contrast, the best open-source general-purpose model, Qwen-3-VL, achieved 42.6%, while the strongest RSVLM, LHRS-Bot, achieved 20.5%.

Analysis across problem categories revealed a consistent pattern. RSVLMs performed relatively well on Semantic tasks involving object recognition, but remained substantially weaker on tasks requiring relational and abstract reasoning. Additional prompting and shuffle experiments further demonstrated that, although current RSVLMs can acquire domain-specific perceptual knowledge, their reasoning capabilities remain fragile.

These findings suggest that adapting visual perception to remote sensing is considerably easier than transferring general reasoning abilities. Consequently, future RSVLM research should focus not only on improving domain-specific understanding but also on developing methods that preserve or enhance high-level reasoning capabilities. We hope that BMRS will provide a challenging benchmark for measuring progress toward this goal.

Author Contributions

Conceptualization, D. S.; methodology, D. S. and O. T.; software, N. F., O. T., N. O., A. F. and A. M.; validation, N. F., A. S., S. G. and D. S.; formal analysis, N. F., O. T. and A. F.; investigation, N. F., O. T., N. O., A. F. and A. M.; resources, N. F., O. T., A. U., A. M. and S. G.; data curation, O. T., A. F., A. U. and D. S.; writing—original draft preparation, O. T., N. O., A. U. and D. S.; writing—review and editing, O. T., A. S. and D. S.; visualization, O. T. and A. F.; supervision, S. G.; project administration, A. U. and D. S.; funding acquisition, N. F. and S. G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Economic Development of the Russian Federation (agreement identifier 000000C313925P3U0002, grant No

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Ethics Board of the Institute for Information Transmission Problems of the Russian Academy of Sciences (protocol of the meeting of the Ethical Committee of IITP RAS No. EC-2026/3 of 27 March 2026).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The proposed dataset BMRS is publicly available at

Acknowledgments

We thank Maria Koval for her valuable assistance in the development of the LLM-as-a-judge approach. The research was carried out using the infrastructure of the Shared Research Facilities "High Performance Computing and Big Data" (CKP "Informatics") of FRC CSC RAS (Moscow). This research was inspired by the work of Mikhail Bongard and the research group that investigated this field while working at the Institute for Information Transmission Problems, with which the authors of the present article are affiliated. Special thanks are extended to Elena Maximova and Pavel Maximov for the reminiscences provided in Appendix C.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial intelligence
BP	Bongard problems
LLM	Large language models
RL	Reinforcement learning
RS	Remote sensing
RSI	Remote sensing imagery
RSVLMs	Remote sensing vision-language models
RSVQA	Remote Sensing Visual Question Answering
VLM	Vision-language model
VQA	Visual question answering

Appendix A. Vision–Language Model Inference Prompts

This appendix reports the prompts used for vision–language model inference on the BMRS dataset. The evaluated models were used in an image-text-to-text setting, where each input consisted of a visual component, either an individual image, an image pair, or a full Bongard collage, and a textual instruction.

For each prompting strategy, the model input contained a common system prompt followed by a strategy-specific prompt. The common system prompt specified the general Bongard problem-solving objective and the required answer format, while the strategy-specific prompt defined the particular inference procedure. The resulting VLM answers were evaluated separately using the LLM-as-a-judge protocol described in Appendix B.

Appendix A.1. Common System Prompt

Table A1. Common system prompt used for all evaluated strategies.

Prompt A1: Common system prompt.
`You are a vision understanding module designed to provide short, clear, and accurate answers. Your goal is to solve a Bongard problem consisting of a collage with six images on the left side and six images on the right side.`
`All left images share a common concept that none of the right images have, and all right images share a different common concept that none of the left images have. Your task is to identify both concepts.`
`The answer must consist of exactly two plain sentences: the first sentence describes the concept of the left side, and the second sentence describes the concept of the right side. Do not use markdown, bullet points, or any formatting. Keep each sentence short and clear.`

Appendix A.2. Direct Strategy

Table A2. Prompt used for the direct strategy.

Prompt A2: Direct strategy.
`Here is the Bongard problem collage. Provide your answer as two plain sentences: first the concept for all left images, then the concept for all right images.`

Appendix A.3. Descriptive-Direct Strategy

Table A3. Prompts used for the descriptive-direct strategy.

Prompt A3: Descriptive-direct strategy.
`This is a single image from either the left or right side of a Bongard problem. Do not solve the problem yet. Describe this image in detail.`
`Focus on all visual features: objects, shapes, colors, spatial arrangements, textures, and any other distinctive properties. Be thorough, as these descriptions will later be used to identify a common concept.`
`Here is the Bongard problem collage. Below are the detailed descriptions of each left image and each right image. Based on this information, provide your answer.`
`Left class image descriptions: {}. Right class image descriptions: {}.`
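The two-stage flow implied by these prompts can be sketched as below. The callable `ask_vlm(images, prompt)` is a hypothetical stand-in for whatever inference interface a given model exposes (it is assumed to prepend the common system prompt itself), and the way individual descriptions are joined into the `{}` slots is likewise an assumption; the source does not specify either detail.

```python
DESCRIBE_PROMPT = (
    "This is a single image from either the left or right side of a Bongard "
    "problem. Do not solve the problem yet. Describe this image in detail."
)
SOLVE_PROMPT = (
    "Here is the Bongard problem collage. Below are the detailed descriptions "
    "of each left image and each right image. Based on this information, "
    "provide your answer. "
    "Left class image descriptions: {}. Right class image descriptions: {}."
)

def descriptive_direct(ask_vlm, collage, left, right):
    """Stage 1: describe every image on its own. Stage 2: solve on the collage."""
    left_desc = [ask_vlm([img], DESCRIBE_PROMPT) for img in left]
    right_desc = [ask_vlm([img], DESCRIBE_PROMPT) for img in right]
    # Joining with " | " is an illustrative choice, not taken from the paper.
    final_prompt = SOLVE_PROMPT.format(" | ".join(left_desc), " | ".join(right_desc))
    return ask_vlm([collage], final_prompt)

# A fake model used only to show the call pattern.
calls = []

def fake_vlm(images, prompt):
    calls.append((list(images), prompt))
    if prompt == DESCRIBE_PROMPT:
        return f"description of {images[0]}"
    return "Left concept. Right concept."

left = [f"left_{i}.png" for i in range(6)]
right = [f"right_{i}.png" for i in range(6)]
answer = descriptive_direct(fake_vlm, "collage.png", left, right)
```

With six images per side this makes twelve description calls followed by one solving call, which is the main cost of the strategy relative to Direct prompting.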

Appendix A.4. Contrastive-Direct Strategy

Table A4. Prompts used for the contrastive-direct strategy.

Prompt A4: Contrastive-direct strategy.
`Here is a pair: one image from the left side of a Bongard problem and one from the right side. Do not solve the full problem yet. Instead, carefully examine both images and list all differences between them.`
`Focus on contrasting visual features: objects, shapes, colors, positions, textures, counts, orientations, or any other properties that distinguish the left image from the right image. Output your answer as a simple list of differences. Use plain text, no markdown.`
`Here is the full Bongard problem collage. Below are all the difference lists obtained from comparing left-right image pairs.`
`Based on this information, determine the common concept that unites all left images but no right images, and the common concept that unites all right images but no left images. Provide your answer.`

Appendix A.5. Contrastive-Iterative Strategy

Table A5. Prompts used for the contrastive-iterative strategy.

Prompt A5: Contrastive-iterative strategy.
`Here is a pair: one image from the left side of a Bongard problem and one from the right side. Carefully examine both images and list all differences between them.`
`Focus on contrasting visual features: objects, shapes, colors, positions, textures, counts, orientations, or any other properties that distinguish the left image from the right image. Propose a candidate concept that distinguishes the left image from the right image.`
`Here is the next left-right pair. Your goal is to generalize the concept to fit all of the pairs you have seen. Carefully examine the new pair and find all differences between the images.`
`Focus on contrasting visual features: objects, shapes, colors, positions, textures, counts, orientations, or any other properties that distinguish the left image from the right image. Your previous candidate concept is: {}. Check whether the previous concept applies to the new pair.`
`If it is fully correct, output the same concept unchanged. If it is partially correct, refine the concept by removing or adjusting the failing parts, keeping only the aspects valid for all left images seen so far and false for all right images. If it is completely wrong, discard it and formulate a completely new concept based on all pairs seen now, including this one.`
`Output the updated concept as one detailed sentence covering all distinguishing features common to all processed pairs.`
`This is the last left-right pair. Based on all six pairs and your iterative refinement, provide your final answer.`
`Output exactly two plain sentences: the first describes the concept that holds for all left images but no right images, and the second describes the concept that holds for all right images but no left images. Your iterative concept is: {}.`
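The refinement loop behind these prompts can be sketched as follows. Several details are assumptions made for illustration: `ask_vlm(images, prompt)` is a hypothetical inference callable, the i-th left image is paired with the i-th right image, only the latest concept (not the full dialogue history) is carried forward, and the prompt strings are shortened excerpts of the ones listed above.

```python
FIRST_PROMPT = (
    "Propose a candidate concept that distinguishes the left image "
    "from the right image."
)
NEXT_PROMPT = (
    "Here is the next left-right pair. Your previous candidate concept is: {}. "
    "Check whether the previous concept applies to the new pair."
)
LAST_PROMPT = (
    "This is the last left-right pair. Output exactly two plain sentences. "
    "Your iterative concept is: {}."
)

def contrastive_iterative(ask_vlm, left, right):
    """Carry one candidate concept through the pairs, refining it at each step."""
    pairs = list(zip(left, right))  # pairing by position is an assumption
    if len(pairs) < 2:
        raise ValueError("need at least two left-right pairs")
    concept = ask_vlm(pairs[0], FIRST_PROMPT)
    for pair in pairs[1:-1]:
        concept = ask_vlm(pair, NEXT_PROMPT.format(concept))
    return ask_vlm(pairs[-1], LAST_PROMPT.format(concept))

# A fake model that just numbers its replies, to make the data flow visible.
log = []

def fake_vlm(images, prompt):
    log.append((images, prompt))
    return f"concept-v{len(log)}"

answer = contrastive_iterative(
    fake_vlm,
    [f"L{i}" for i in range(6)],
    [f"R{i}" for i in range(6)],
)
```

Because each step sees only the previous concept and one new pair, an early misleading pair can steer every later step, which is consistent with the order sensitivity reported in Section 4.2.1.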

Appendix A.6. Descriptive-Iterative Strategy

Table A6. Prompts used for the descriptive-iterative strategy.

Prompt A6: Descriptive-iterative strategy.
`This is the first image from one side of a Bongard problem. Later you will see more images from this same side. Your goal is to formulate a general description of all images on this side for solving the problem later.`
`Do not solve the problem yet. Describe this image in detail. Focus on all visual features: objects, shapes, colors, spatial arrangements, textures, counts, orientations, and any other distinctive properties. Be thorough, as these descriptions will later be used to identify a common concept.`
`Here is the next image from the same side. Your goal is to generalize the class description to fit all images from this side that you have seen so far.`
`Carefully examine this new image and list all its relevant visual features: objects, shapes, colors, spatial arrangements, textures, counts, orientations, or any other distinctive properties. Your previous candidate description for this side is: {}. Check whether it applies to the new image.`
`If it is fully correct, output the same description unchanged. If it is partially correct, refine the description by removing or adjusting the failing parts, keeping only what is true for all images seen so far. If it is completely wrong, discard it and formulate a completely new description based on all images seen now, including this one.`
`This is the last image from this side. Based on all six images and your iterative refinement, provide the final description that unites all images on this side. Your iterative description is: [previous concept].`
`Here is the Bongard problem collage. Below are generalized descriptions of the left and right sides of images. Based on this information, solve the Bongard problem.`
`Left class image descriptions: {}. Right class image descriptions: {}.`

Appendix B. LLM Judge Prompting Protocol

This appendix describes the prompting protocol used for automatic evaluation of VLM answers by an LLM judge. While Appendix A reports the prompts used to generate model predictions, this appendix reports the prompts used to verify their correctness.

For each task, the judge received a common system prompt and a task-specific user prompt. The system prompt defined the general evaluation objective and restricted the output to a binary verdict. The user prompt contained class-specific evaluation instructions, reference answers, examples of correct and incorrect answers, and the answer produced by the evaluated model.

Class-specific instructions were added to clarify the type of target property to be evaluated, including presence, number, size, shape, spatial arrangement, semantics, or intra-class consistency. These instructions were used to guide the judge in handling paraphrases, semantic generalizations, and common incorrect formulations.

Appendix B.1. System Prompt for the LLM Judge

Table A7. System prompt used for the LLM judge.

Prompt B1: LLM judge system prompt.
`You evaluate the user’s answer by comparing it with reference answers for a given task.`
`You see: reference answers, examples of correct and incorrect answers, and the user answer.`
`Each answer contains features of the right and left class. Identify the target property that distinguishes the right class from the left. This may describe semantics, form, presence, quantity, or size of objects.`
`The user answer may contain features of both right and left classes, or an explicit target property separating these classes. If two features are indicated, formulate the target property based on them.`
`The answer is correct if the target property matches the reference or correct examples, accounting for generalization, paraphrasing, synonyms, word order changes, or simplification. The answer is incorrect if the target property is missing or wrong.`
`Do not accept answers similar to or matching incorrect examples. Focus on meaning, not exact wording. Ignore minor differences in style, grammar, or phrasing.`
`Respond with only one word: correct or incorrect.`

Appendix B.2. User Prompt Template

Table A8. User prompt template used for the LLM judge.

Prompt B2: LLM judge user prompt template.
`Task-specific evaluation instruction:`
`{class_specific_prompt}`
`Reference answers:`
`Left: {reference_left}`
`Right: {reference_right}`
`Examples of correct answers:`
`Example 1:`
`Left: {correct_example_left}`
`Right: {correct_example_right}`
`Examples of incorrect answers:`
`Example 1:`
`Left: {incorrect_example_left}`
`Right: {incorrect_example_right}`
`Model answer:`
`{model_answer}`
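Filling this template and reading back the binary verdict can be sketched as below. The helper names and the example field values are hypothetical; only the template text and the one-word "correct or incorrect" output format come from the prompts above.

```python
JUDGE_TEMPLATE = """Task-specific evaluation instruction:
{class_specific_prompt}
Reference answers:
Left: {reference_left}
Right: {reference_right}
Examples of correct answers:
Example 1:
Left: {correct_example_left}
Right: {correct_example_right}
Examples of incorrect answers:
Example 1:
Left: {incorrect_example_left}
Right: {incorrect_example_right}
Model answer:
{model_answer}"""

def build_judge_prompt(**fields):
    """Fill the user-prompt template; raises KeyError if a field is missing."""
    return JUDGE_TEMPLATE.format(**fields)

def parse_verdict(reply):
    """Map the judge's one-word reply to a boolean, rejecting anything else."""
    word = reply.strip().strip(".").lower()
    if word == "correct":
        return True
    if word == "incorrect":
        return False
    raise ValueError(f"unexpected judge reply: {reply!r}")

# Invented example values, for illustration only.
prompt = build_judge_prompt(
    class_specific_prompt="(shape instruction text)",
    reference_left="islands",
    reference_right="peninsulas",
    correct_example_left="land fully surrounded by water",
    correct_example_right="land connected to the mainland",
    incorrect_example_left="green areas",
    incorrect_example_right="brown areas",
    model_answer="Left shows islands. Right shows peninsulas.",
)
```

The parser is deliberately strict: a reply that adds a justification would raise an error here and would need either looser parsing or a re-query, a design choice the source does not describe.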

Appendix B.3. Class-Specific Evaluation Instructions

Table A9. Class-specific evaluation instruction for size-based tasks.

Prompt B3: Size.
`You are given answers to a problem of the “size” type. Your task is to check whether each answer is correct. A correct answer must explicitly mention one of the following differences between the objects in the left and right sets of images: the objects in one set are closer and the objects in the other set are farther; the objects in one set are bigger and the objects in the other set are smaller; or each set has a name, such as “barns” on the left and “estates” on the right, and these names reflect a real-world size difference.`
`For example, barns are typically smaller than estates, so stating that difference would be correct logic.`
`If an answer contains the phrase “I do not know” or any similar expression of uncertainty, it is incorrect.`
`Output your judgment as correct or incorrect for each answer, with a brief justification.`

Table A10. Class-specific evaluation instruction for presence-based tasks.

Prompt B4: Presence.
`In this task, the reference target property is based on the presence of an object of a certain class in one set of images and the absence of that same object in the other set. The property is not based on the shape of objects, their color, size, or spatial arrangement.`
`The target property can be rephrased in semantic terms while preserving meaning. For example, the property “images in the left class contain clouds, images in the right class do not contain clouds” can be rephrased as “the left class depicts cloudy weather, and the right class depicts clear weather,” because clear weather means the absence of clouds.`
`Analyze the correct answers to understand which semantic rephrasings have already been accepted as valid. Analyze the incorrect answers to understand which semantic formulations lead to errors.`
`If the answer to be checked is formulated in terms of presence or absence of an object, evaluate it according to the standard rules. If the answer is formulated in semantic terms, first check whether such semantics has appeared in examples of correct or incorrect answers. If it has, follow those examples. If the semantics is new and has not appeared before, analyze whether the named properties always imply the presence or absence of the target object. If they do, mark the answer as correct; otherwise, mark it as incorrect.`
`If the answer refers to “image” in the singular rather than “images” as a set, accept it as valid as long as the distinction between the two sets is preserved in meaning. The answer is also valid if it includes additional information such as a cause-and-effect relationship, but it is incorrect if it includes additional distinguishing features.`
`If the answer describes a characteristic of an object being “more” or “less” of something rather than simply the presence or absence of the target object, accept it only if no additional distinguishing features are provided beyond that quantitative difference. If the answer includes other differences beyond the more-or-less comparison, reject it as incorrect.`

Table A11. Class-specific evaluation instruction for number-based tasks.

Prompt B5: Number.
`You are given answers to a problem of the “number” type. Your task is to check whether each answer is correct.`
`A correct answer must explicitly mention one of the following differences between the objects in the left and right sets of images: every image in one set contains more or fewer specific objects than every image in the other set; or every image in one set contains a specific fixed number of objects, and every image in the other set contains a different specific fixed number of objects.`
`For example, an answer is correct if it states that left-set images each have more apples and right-set images each have fewer apples, or that the left set always has three cars and the right set always has seven cars.`
`If an answer contains “I do not know” or any similar expression of uncertainty, it is incorrect.`

Table A12. Class-specific evaluation instruction for shape-based tasks.

Prompt B6: Shape.
`In this task, the reference target property is based on the shape of objects in the image, regardless of their semantics.`
`The target property can be rephrased in terms of object semantics while preserving meaning. For example, the property “the boundary between land and water is closed or not closed” can be rephrased semantically as “island or peninsula.”`
`Analyze the correct answers to understand which semantic rephrasings have already been accepted as valid. Also analyze the incorrect answers to understand which semantic formulations are mistaken.`
`If the answer to be checked is formulated in terms of shape, evaluate it according to the standard rules. If the answer is formulated in terms of semantics, first check whether such semantics has appeared in examples of correct or incorrect answers. If it has, follow those examples. If the semantics is new and has not appeared before, analyze whether objects with such semantics always have the required shape difference. If they do, the answer is correct; otherwise, it is incorrect.`

Table A13. Class-specific evaluation instruction for spatial-relation tasks.

Prompt B7: Spatial.
`In this task, the reference target property is based exclusively on the position of objects in the image, the orientation of objects relative to the image frame, or the mutual arrangement of objects relative to each other. The property is not based on shape, color, size, quantity, or semantics.`
`Semantic rephrasings are allowed as long as they preserve the meaning of the spatial arrangement. For example, “left-hand traffic” means that cars move on the left side of the road, and “right-hand traffic” means that cars move on the right side.`
`As a general rule, if an answer does not refer to the location or orientation of objects at all, it is incorrect. If the answer refers to a different type of spatial relation, it is also incorrect. The answer is still valid if it includes additional information such as a cause-and-effect relationship of this spatial relation, but it is incorrect if it includes additional objects as a distinguishing feature.`
`For tasks about absolute location, the answer must clearly indicate that the difference between the classes lies in the location. It is not necessary to specify exact parts of the image for each class, but the answer must state what the difference is. For example, “on different sides of the picture” is acceptable, whereas “in different places” is not, because it does not specify the difference. The object can be referred to generically as an “object.” If the object is named specifically, it must be named correctly. Rephrasings are allowed, such as “forest” instead of “trees.”`
`For tasks about orientation, the answer must clearly indicate that the difference between the classes lies in orientation. It is not necessary to specify the exact orientation for each class, but the answer must state what the difference is. For example, “oriented horizontally and vertically” is acceptable, whereas “oriented differently” is not, because it does not specify the difference. The object can be referred to generically as an “object.” If the object is named specifically, it must be named correctly. Rephrasings are allowed, such as “forest” instead of “trees.”`
`For tasks about mutual arrangement of objects, the answer must clearly indicate that the difference lies in the relative arrangement, not in absolute positions. If both objects or object classes are correctly identified, it is sufficient to state that there is a difference in their mutual arrangement. If specific mutual arrangements are given, they must be correct. It is acceptable to name only one object, but then the answer must specifically state that the difference is in its position relative to the other object. If no object is named, the answer is not counted as correct.`
`When checking, first determine the type of the reference answer: absolute location, orientation, or mutual arrangement. Then evaluate the user’s answer according to these rules. Analyze the correct answers to understand which formulations have already been accepted as valid. Analyze the incorrect answers to understand common mistakes.`

Table A14. Class-specific evaluation instruction for intra-class consistency tasks.

Prompt B8: Same.
`For this task, the target property should be a shared visual characteristic that unifies objects in one class and sets them apart from the other class.`
`A correct answer must describe a visual attribute that is uniform across the images in one set and non-uniform in the other collage.`
`The key distinction should rely on intra-set consistency, for example all objects facing the same direction or all objects having the same color or shape, versus variability within the opposing set.`

Table A15. Class-specific evaluation instruction for semantic tasks.

Prompt B9: Semantic.
`A correct answer identifies a difference in object type, purpose, class, activity, or real-world meaning between the two sets of images.`
`The target property must be the object type itself, such as rivers versus roads, or a direct equivalent, such as the presence of water versus the absence of water.`
`An answer is incorrect if it replaces the object types with a broader category, such as “natural landscapes versus human-made infrastructure,” “rural versus urban,” or “ecologically rich versus degraded.”`
`An answer is also incorrect if it adds an unnecessary distinguishing attribute that is not guaranteed by the reference answer. For example, “winding rivers” is incorrect if winding is typical but not required, and “urban roads” is incorrect because roads can be rural.`
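As a rough illustration of how class-specific instructions such as those above could be combined with a reference answer and previously graded answers into a single judge prompt, the following Python sketch may help. It is an assumption-laden outline, not the authors' pipeline: the function `build_judge_prompt`, the dictionary `CLASS_INSTRUCTIONS`, the field labels, and the placeholder strings are all hypothetical.

```python
# Hypothetical sketch: assembling a judge prompt from class-specific instructions.
# The instruction texts are placeholders; the real prompt wording is given in the tables above.
CLASS_INSTRUCTIONS = {
    "Spatial": "<text of the Spatial prompt>",
    "Same": "<text of Prompt B8>",
    "Semantic": "<text of Prompt B9>",
}


def build_judge_prompt(category, reference, answer, accepted=(), rejected=()):
    """Join the class-specific instruction with the reference and graded examples."""
    parts = [
        CLASS_INSTRUCTIONS[category],
        f"Reference answer: {reference}",
        "Previously accepted answers:",
        *[f"- {a}" for a in accepted],
        "Previously rejected answers:",
        *[f"- {a}" for a in rejected],
        f"User answer: {answer}",
    ]
    return "\n".join(parts)


# Toy inputs (illustrative only, not taken from the benchmark).
prompt = build_judge_prompt(
    "Semantic",
    reference="rivers versus roads",
    answer="water channels versus roads",
    accepted=["rivers vs. roads"],
    rejected=["natural versus human-made"],
)
```

Keeping the instructions in a dictionary keyed by category means that a new problem category only requires a new entry, while the assembly logic stays unchanged.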

Appendix C. Historical Background

The foundational work on algorithmic geometric problem solving, illustrating the original set of problems, was published in 1967 by Mikhail Bongard [56]. In this book, Bongard acknowledges collaborative contributions from Modest Vaintsvaig, Vadim Maximov, and Mikhail Smirnov, all of whom were affiliated with the Institute for Information Transmission Problems (IITP) at the time and working in what is now referred to as artificial intelligence. Although the precise division of contributions is not fully documented, Maximov’s role in the development of the original problem set appears to have been substantial.

Elena Maximova, a biologist and currently a senior researcher at IITP, who was affiliated with the laboratory at the time and was married to Vadim Maximov, recalls:

“The ideas for the pictures were conceived and discussed together. Then Vadim would draw them. He was fond of drawing. I remember those sheets of Whatman paper, the ink, and the pens. Was it during the time when M. M. Bongard was preparing "Problema uznavaniya" [56], or several years later, in 1975, for his own candidate dissertation, which was a continuation of "Geometry"? I am not sure – perhaps both. As a playful dedication, M. M. Bongard wrote on the copy gifted to Vadim:

‘To dear Vadim, my partner in all this mischief. Mika, 26 August 1967’”

Moving from recollections to documented facts, Vadim Maximov was a postgraduate student under Mikhail Bongard, working on geometric problem solving. In 1968 and 1971, Bongard and Maximov presented their findings at international symposia [57,58]. Maximov further developed this line of research in a series of independent publications [59,60,61] and continued it after Bongard tragically passed away in 1971 [62,63]. Bongard and Maximov remained the sole authors of these works, both during Bongard’s lifetime and posthumously. In light of this historical context, we name our benchmark Bongard–Maximov Problems for Remote Sensing to reflect the contributions of both researchers to the original formulation of the problem.

Appendix D. Extended Human Study Results

Figure A1 presents the distribution of individual participant accuracies across problem categories. Accuracy variability was particularly large for the Number, Presence, Same, and Size categories, where participant performance ranged from 0% to 100%. Despite this variability, more than half of the participants solved all problems correctly in each of these categories. In contrast, Spatial problems exhibited consistently lower performance, and were the only category for which the interquartile range did not include perfect accuracy. This observation further supports the conclusion that spatial-relational reasoning is the most challenging aspect of BMRS for human participants.

Figure A1. Distribution of human accuracies across problem categories. Each point corresponds to one participant.

Each problem was solved by at least 14 participants, allowing us to estimate problem difficulty as the proportion of participants who answered correctly. Figure A2 shows the distribution of problem difficulties, while Figure A3 provides a problem-level heatmap grouped by problem category. Importantly, every problem was solved correctly by at least one participant, indicating that all BMRS problems are human-solvable.
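The difficulty estimate described above, the proportion of participants who answered a problem correctly, can be sketched in Python as follows. This is a minimal illustration under assumed inputs: the `(participant, problem, is_correct)` record layout, the function name `solve_rates`, and the toy records are hypothetical and do not come from the study data.

```python
from collections import defaultdict


def solve_rates(responses):
    """Return {problem_id: fraction of participants who answered correctly}.

    `responses` is an iterable of (participant_id, problem_id, is_correct) tuples.
    A lower rate means a harder problem.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _participant, problem, is_correct in responses:
        total[problem] += 1
        correct[problem] += int(bool(is_correct))
    return {problem: correct[problem] / total[problem] for problem in total}


# Toy records (hypothetical, not real study data).
toy_responses = [
    ("p1", "toy_01", True),
    ("p2", "toy_01", False),
    ("p3", "toy_01", True),
    ("p4", "toy_01", True),
    ("p1", "toy_02", False),
    ("p2", "toy_02", False),
]
rates = solve_rates(toy_responses)

# A problem counts as human-solvable if at least one participant solved it.
human_solvable = {problem: rate > 0 for problem, rate in rates.items()}
```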

Figure A2. Distribution of BMRS problems by human-estimated difficulty.

Because Spatial problems were the most challenging, we conducted a more detailed analysis of this category. Spatial problems were divided into three subclasses according to the underlying discriminative rule:

Absolute position: object location relative to the image frame;
Orientation: object orientation relative to the image frame;
Relative position: spatial relationships between multiple objects.

Examples of these subclasses are provided in Appendix D.1.

Figure A4 compares human performance across the three subclasses. Problems based on absolute position were the most difficult, whereas orientation-based problems were solved most reliably.

Figure A3. Human solution rates for individual problems grouped by problem category.

Figure A4. Human performance across Spatial subclasses.

Appendix D.1. Most Challenging Problems for Humans

The nine most difficult BMRS problems all belong to the Spatial category: bb_m_97, bb_s_08, bb_s_14, bb_s_20, bb_s_15, bb_s_57, bb_m_98, bb_m_95, and bb_s_30, listed from most to least difficult.

Six of these problems (bb_m_97, bb_s_08, bb_s_14, bb_s_20, bb_s_15, and bb_s_57) belong to the Absolute Position subclass and require distinguishing object locations within the image (left versus right, or top versus bottom). Problems bb_m_98 and bb_s_30 belong to the Orientation subclass, while bb_m_95 belongs to the Relative Position subclass.

Representative examples of the most difficult problems from the Spatial subclasses are shown in Figure A5.

Figure A5. Representative examples of the three Spatial subclasses among the most challenging BMRS problems for human participants.

There were also several difficult problems in the Shape, Same, Presence, and Semantic categories for which fewer than 50% of participants provided correct solutions. Figure A6 presents the most difficult problem from each of these categories.

Notably, all problems in the Number and Size categories were solved correctly by more than half of the participants, indicating that these categories were comparatively easy for humans.

Figure A6. The most challenging non-Spatial BMRS problems for human participants, shown for each problem category.

Appendix E. Most Challenging Problems for Models

Similarly to the human analysis, model difficulty was estimated as the proportion of evaluated models that solved a problem correctly. Figure A7 presents a problem-level heatmap of model performance.

Two problems were not solved by any evaluated model despite being solved correctly by more than 80% of human participants. These problems are shown in Figure A8, highlighting instances where human reasoning substantially outperformed current VLMs.

Three problems were solved by General-Mid VLMs but not by General-Large models. These examples are shown in Figure A9 and illustrate that superior overall performance does not necessarily imply dominance on every individual problem.
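The two comparisons above (problems that most humans solve but no model does, and problems solved by one model tier but not another) reduce to simple set operations. The sketch below is illustrative only: the function names, the data layout, and the toy values are assumptions and are not taken from the benchmark results.

```python
def human_easy_model_unsolved(human_rates, model_results, threshold=0.8):
    """Problems with a human solve rate above `threshold` that no model solved."""
    solved_by_any = set().union(*model_results.values())
    return sorted(
        problem
        for problem, rate in human_rates.items()
        if rate > threshold and problem not in solved_by_any
    )


def solved_by_tier(model_results, tiers, tier):
    """Union of problems solved by at least one model belonging to `tier`."""
    solved = set()
    for model, problems in model_results.items():
        if tiers[model] == tier:
            solved |= problems
    return solved


# Toy inputs (hypothetical model names and problem ids).
human_rates = {"toy_01": 0.9, "toy_02": 0.85, "toy_03": 0.4, "toy_04": 0.95}
model_results = {
    "large_a": {"toy_01", "toy_03"},
    "mid_a": {"toy_01", "toy_04"},
    "rs_a": set(),
}
tiers = {"large_a": "General-Large", "mid_a": "General-Mid", "rs_a": "RSVLM"}

unsolved = human_easy_model_unsolved(human_rates, model_results)
mid_only = solved_by_tier(model_results, tiers, "General-Mid") - solved_by_tier(
    model_results, tiers, "General-Large"
)
```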

Figure A7. VLMs’ solution rates for individual BMRS problems.

Figure A8. Problems solved by most human participants but by none of the evaluated VLMs.

Figure A9. Problems unsolved by frontier proprietary VLMs but solved by medium-sized general-purpose models.

Appendix E.1. Examples of Incorrect Answers

Understanding how models fail, and not only which problems they fail on, is valuable for further model improvement. We therefore provide examples of incorrect responses.

Some responses did not constitute valid answers to the Bongard problems in terms of their wording. For instance, we observed repetitions of the prompt, descriptions of the images, isolated symbols, as well as incoherent phrases and sentence fragments. This issue was particularly frequent for EarthDial, which tended to produce descriptions of individual images rather than the underlying concept. Examples of such responses are shown in Table A16.

Among the responses that were valid answers to the Bongard problems but still incorrect, the most frequent errors involved either ignoring or hallucinating objects. The majority of such responses were formulated around semantics, presence, and visual attributes, with very few addressing object positions or counts. Examples of these errors are given in Table A17.

It may also be of interest to see how ChatGPT and Gemini answered problems bb_m_08, bb_m_36, and bb_m_113, shown in Figure A9. These answers are provided in Table A18 and Table A19.

Table A16. Examples of model answers that are not valid answers to the Bongard problem.

EarthDial on bb_m_01 using Direct strategy.
`port, harbor, water, lake`

TeoChat on bb_m_28 using Contrastive-iterative strategy.

None of the left images have the concept None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right

Table A17. Examples of model answers that are valid answers to the Bongard problems but still incorrect. All responses correspond to problem bb_m_29, shown in Figure A8.

EarthDial on bb_m_29 using Contrastive-iterative strategy.
`The left image is a roundabout, while the right image is a building.`

LHRS-Bot on bb_m_29 using Contrastive-direct strategy.
`The difference between the two sides of the problem is the arrangement of the buildings, roads, and trees.`

VHM on bb_m_29 using Descriptive-direct strategy.
`The left side images share the concept of roundabouts, while the right side images share the concept of residential areas.`

SkySenseGPT on bb_m_29 using Descriptive-direct strategy.
`The common concept that unites all left images but no right images is "1 intersection". The common concept that unites all right images but no left images is "1 roundabout".`

Table A18. ChatGPT’s responses to the problems bb_m_08, bb_m_36, and bb_m_113 (shown in Figure A9).

ChatGPT on bb_m_08.
`The left images focus on cars on paved lots or roads. The right images show vehicle depots or parking areas mixed with buildings, buses, or trailers.`

ChatGPT on bb_m_36.
`The left images show fields with clear crop rows or furrows. The right images show fields without clear crop rows or furrows.`

ChatGPT on bb_m_113.
`The left images show square or rectangular roof forms. The right images show circular or radial roof forms.`

Table A19. Gemini’s responses to the problems bb_m_08, bb_m_36, and bb_m_113 (shown in Figure A9).

Gemini on bb_m_08.
`The left images feature groups of multi-colored vehicles. The right images feature fleets of uniformly white vehicles`

Gemini on bb_m_36.
`The left images show agricultural fields with distinct parallel rows or furrows. The right images show agricultural fields with uniform textures lacking visible rows.`

Gemini on bb_m_113.
`The left images show buildings with square or rectangular roofs. The right images show buildings with circular roofs.`

Appendix F. Model Accuracy Across Problem Types

This appendix reports detailed results for human observers and for models, with models grouped into three tiers: General-Large VLM, General-Mid VLM, and RSVLM.

References

Li, J.; Pei, Y.; Zhao, S.; Xiao, R.; Sang, X.; Zhang, C. A Review of Remote Sensing for Environmental Monitoring in China. Remote Sens. 2020, 12. [Google Scholar] [CrossRef]
Pavlova, M.; Sidorchuk, D.; Bocharov, D.; Sarycheva, A. Crop Classification Using Reduced-Dimensionality NDVI Time Series. In Proceedings of the ECMS 2023, European Council for Modelling and Simulation, 2023; Vol. 37, pp. 306–312. [Google Scholar] [CrossRef]
Pavlova, M.A.; Timofeev, V.A.; Bocharov, D.A.; Sidorchuk, D.S.; Nurmukhametov, A.L.; Nikonorov, A.V.; Yarykina, M.S.; Kunina, I.A.; Smagina, A.A.; Zagarev, M.A. Low-parameter method for delineation of agricultural fields in satellite images based on multi-temporal MSAVI2 data. Comput. Opt. 2023, 47, 451–463. [Google Scholar]
Yu, D.; Fang, C. Urban Remote Sensing with Spatial Big Data: A Review and Renewed Perspective of Urban Studies in Recent Decades. Remote Sens. 2023, 15. [Google Scholar] [CrossRef]
Im, J.; Park, H.; Takeuchi, W. Advances in Remote Sensing-Based Disaster Monitoring and Assessment. Remote Sens. 2019, 11. [Google Scholar] [CrossRef]
Omoniyi, T.O.; Sims, A. Enhancing the Precision of Forest Growing Stock Volume in the Estonian National Forest Inventory with Different Predictive Techniques and Remote Sensing Data. Remote Sens. 2024, 16. [Google Scholar] [CrossRef]
Ivliev, N.A.; Podlipnov, V.V.; Ivanushkin, M.A.; Skidanov, R.V.; Fedorov, V.V.; Kazanskiy, N.L.; Soifer, V.A. Imaging of the Earth’s surface with an ultra-compact camera with a hybrid lens mounted on the CubeSat 3U platform. Comput. Opt. 2026, 50, 1742. [Google Scholar] [CrossRef]
Pellegrino, A.; Pancalli, M.G.; Gianfermo, A.; Marzioli, P.; Curianò, F.; Angeletti, F.; Piergentili, F.; Santoni, F. HORUS: Multispectral and Multiangle CubeSat Mission Targeting Sub-Kilometer Remote Sensing Applications. Remote Sens. 2021, 13. [Google Scholar] [CrossRef]
Borisov, A.N.; Myasnikov, V.V.; Sergeev, V.V. Method of automatic coregistration of digital remote sensing images from different sources. Comput. Opt. 2024, 48, 932–943. [Google Scholar] [CrossRef]
Konovalov, V.F.; Myasnikov, V.V.; Sergeev, V.V. A unified neural network-based single super-resolution method for heterogeneous digital earth remote sensing images. Comput. Opt. 2024, 48, 944–955. [Google Scholar]
Nikonorov, A.; Sidorchuk, D.; Odinets, N.; Volkov, V.; Sarycheva, A.; Dudenko, E.; Zhidkov, M.; Nikolaev, D. HyperHazeOff: Hyperspectral Remote Sensing Image Dehazing Benchmark. J. Imaging 2025, 11, 422. [Google Scholar] [PubMed]
Tao, L.; Zhang, H.; Jing, H.; Liu, Y.; Yan, D.; Wei, G.; Xue, X. Advancements in vision–language models for remote sensing: Datasets, capabilities, and enhancement techniques. Remote Sens. 2025, 17, 162. [Google Scholar] [CrossRef]
Shao, R.; Li, Z.; Zhang, Z.; Xu, L.; He, X.; Yuan, H.; He, B.; Dai, Y.; Yan, Y.; Chen, Y.; et al. Asking like Socrates: Socrates helps VLMs understand remote sensing images. arXiv 2025, arXiv:2511.22396. [Google Scholar]
Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566. [Google Scholar] [CrossRef]
Wang, J.; Zheng, Z.; Chen, Z.; Ma, A.; Zhong, Y. Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5481–5489. [Google Scholar]
Weng, X.; Pang, C.; Xia, G.S. Vision-language modeling meets remote sensing: Models, datasets, and perspectives. IEEE Geoscience and Remote Sensing Magazine, 2025. [Google Scholar]
Liu, F.; Guan, T.; Li, Z.; Chen, L.; Yacoob, Y.; Manocha, D.; Zhou, T. Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v(ision), llava-1.5, and other multi-modality models. arXiv 2023, arXiv:2310.14566. [Google Scholar]
Helff, L.; Stammer, W.; Shindo, H.; Dhami, D.S.; Kersting, K. V-lol: A diagnostic dataset for visual logical learning. arXiv 2023, arXiv:2306.07743. [Google Scholar]
Lu, P.; Bansal, H.; Xia, T.; Liu, J.; Li, C.; Hajishirzi, H.; Cheng, H.; Chang, K.W.; Galley, M.; Gao, J. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv 2023, arXiv:2310.02255. [Google Scholar]
Moskvichev, A.; Odouard, V.V.; Mitchell, M. The conceptarc benchmark: Evaluating understanding and generalization in the arc domain. arXiv 2023, arXiv:2305.07141. [Google Scholar]
Wüst, A.; Woydt, T.; Helff, L.; Ibs, I.; Stammer, W.; Dhami, D.S.; Rothkopf, C.A.; Kersting, K. Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad? arXiv 2024, arXiv:2410.19546. [Google Scholar]
Nie, W.; Yu, Z.; Mao, L.; Patel, A.B.; Zhu, Y.; Anandkumar, A. Bongard-logo: A new benchmark for human-level concept learning and reasoning. Adv. Neural Inf. Process. Syst. 2020, 33, 16468–16480. [Google Scholar]
Jiang, H.; Ma, X.; Nie, W.; Yu, Z.; Zhu, Y.; Anandkumar, A. Bongard-hoi: Benchmarking few-shot visual reasoning for human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022; pp. 19056–19065. [Google Scholar]
Bongard, M. Pattern Recognition; Spartan Books: New York, 1970. [Google Scholar]
Hofstadter, D.R. Gödel, Escher, Bach: an eternal golden braid; Basic books, 1999. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 2021; pp. 8748–8763. [Google Scholar]
Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar] [CrossRef]
Pang, C.; Weng, X.; Wu, J.; Li, J.; Liu, Y.; Sun, J.; Li, W.; Wang, S.; Feng, L.; Xia, G.S.; et al. Vhm: Versatile and honest vision language model for remote sensing image analysis. Proc. AAAI Conf. Artif. Intell. 2025, 39, 6381–6388. [Google Scholar] [CrossRef]
Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 27831–27840. [Google Scholar]
Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Ricci, R.; Melgani, F. Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery. Remote Sens. 2024, 16, 1477. [Google Scholar]
Luo, J.; Pang, Z.; Zhang, Y.; Wang, T.; Wang, L.; Dang, B.; Lao, J.; Wang, J.; Chen, J.; Tan, Y.; et al. Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding. arXiv 2024, arXiv:2406.10100. [Google Scholar]
Muhtar, D.; Li, Z.; Gu, F.; Zhang, X.; Xiao, P. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In Proceedings of the European Conference on Computer Vision, 2024; Springer; pp. 440–457. [Google Scholar]
Hu, Y.; Yuan, J.; Wen, C.; Lu, X.; Liu, Y.; Li, X. Rsgpt: A remote sensing vision language model and benchmark. ISPRS J. Photogramm. Remote Sens. 2025, 224, 272–286. [Google Scholar] [CrossRef]
Irvin, J.; Liu, E.; Chen, J.; Dormoy, I.; Kim, J.; Khanna, S.; Zheng, Z.; Ermon, S. Teochat: A large vision-language assistant for temporal earth observation data. Proc. Int. Conf. Learn. Represent. 2025, Vol. 2025, 68883–68911. [Google Scholar]
Soni, S.; Dudhane, A.; Debary, H.; Fiaz, M.; Munir, M.A.; Danish, M.S.; Fraccaro, P.; Watson, C.D.; Klein, L.J.; Khan, F.S.; et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 14303–14313. [Google Scholar]
An, X.; Sun, J.; Gui, Z.; He, W. Choice: benchmarking the remote sensing capabilities of large vision-language models. arXiv 2024, arXiv:2411.18145. [Google Scholar]
Fiaz, M.; Debary, H.; Fraccaro, P.; Paudel, D.; Van Gool, L.; Khan, F.; Khan, S. Geovlm-r1: Reinforcement fine-tuning for improved remote sensing reasoning. arXiv 2025, arXiv:2509.25026. [Google Scholar]
Ma, X.; Feng, S.; Zhang, B.; Wang, B. ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing. arXiv 2025, arXiv:2512.23244. [Google Scholar]
Zhou, Y.; Feng, L.; Lan, M.; Ke, Y.; Jiang, X.; Zhang, W. GeoMath: A benchmark for multimodal mathematical reasoning in remote sensing. 2025. [Google Scholar] [PubMed]
Danish, M.; Munir, M.A.; Shah, S.R.A.; Kuckreja, K.; Khan, F.S.; Fraccaro, P.; Lacoste, A.; Khan, S. Geobench-vlm: Benchmarking vision-language models for geospatial tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 7132–7142. [Google Scholar]
Luo, Z.; Wang, D.; Guo, H.; Zhang, J.; Du, B. VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing. arXiv 2026, arXiv:2602.07045. [Google Scholar]
Wu, R.; Ma, X.; Zhang, Z.; Wang, W.; Li, Q.; Zhu, S.C.; Wang, Y. Bongard-openworld: Few-shot reasoning for free-form visual concepts in the real world. arXiv 2023, arXiv:2310.10207. [Google Scholar]
Małkiński, M.; Pawlonka, S.; Mańdziuk, J. Reasoning limitations of multimodal large language models. a case study of bongard problems. arXiv 2024, arXiv:2411.01173. [Google Scholar]
Pawlonka, S.; Małkiński, M.; Mańdziuk, J. Bongard-rwr+: Real-world representations of fine-grained concepts in bongard problems. arXiv 2025, arXiv:2508.12026. [Google Scholar]
Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef]
Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796. [Google Scholar] [CrossRef]
Qi, X.; Zhu, P.; Wang, Y.; Zhang, L.; Peng, J.; Wu, M.; Chen, J.; Zhao, X.; Zang, N.; Mathiopoulos, P.T. MLRSNet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding. ISPRS J. Photogramm. Remote Sens. 2020, 169, 337–350. [Google Scholar] [CrossRef]
Mou, C.; Liu, T.; Zhu, C.; Cui, X. WAID: A Large-Scale Dataset for Wildlife Detection with Drones. Appl. Sci. 2023, 13. [Google Scholar] [CrossRef]
Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar]
Khot, T.; Trivedi, H.; Finlayson, M.; Fu, Y.; Richardson, K.; Clark, P.; Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv 2022, arXiv:2210.02406. [Google Scholar]
Zhang, Y.; Du, L.; Cao, D.; Fu, Q.; Liu, Y. An examination on the effectiveness of divide-and-conquer prompting in large language models. arXiv 2024, arXiv:2402.05359. [Google Scholar]
Ji, B.; Agrawal, S.; Tang, Q.; Wu, Y. Enhancing spatial reasoning in vision-language models via chain-of-thought prompting and reinforcement learning. arXiv 2025, arXiv:2507.13362. [Google Scholar]
Li, D.; Jiang, B.; Huang, L.; Beigi, A.; Zhao, C.; Tan, Z.; Bhattacharjee, A.; Jiang, Y.; Chen, C.; Wu, T.; et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025; pp. 2757–2791. [Google Scholar] [CrossRef]
DeepSeek-AI. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. 2026. [Google Scholar]
Bongard, M. Problema Uznavaniya [Pattern Recognition]; Nauka: Moscow, 1967; p. 3. [Google Scholar]
Maximov, V.; Bongard, M. Programma, obuchayushchayasya klassifikatsii geometricheskikh izobrazheniy [Program Learning to Classify Geometric Images]. In Proceedings of the Mezhdunarodnyy simpozium IFAC po tekhnicheskim i biologicheskim problemam upravleniya, Tezisy dokladov [IFAC International Symposium on Technical and Biological Problems of Control. Abstracts of Papers], Yerevan, 1968; pp. 86–87. [Google Scholar]
Maximov, V.; Bongard, M. Programma, obuchayushchayasya klassifikatsii geometricheskikh izobrazheniy [Program Learning to Classify Geometric Images]. In Proceedings of the Trudy Mezhdunarodnogo simpoziyma po tekhnicheskim i biologicheskim problemam upravleniya; Moscow, Tsypkin, Ya.Z., Ed.; 1971; Vol. 1971, pp. 128–133. [Google Scholar]
Maximov, V. Programma, obuchayushchayasya klassifikatsii geometricheskikh izobrazheniy. Yazyk i eksperimenty [Program Learning to Classify Geometric Images. Language and Experiments]. In Strukturnyye metody opoznaniya i avtomaticheskoye chteniye [Structural Methods of Recognition and Automatic Reading]; Mikhaylov, A.I., Ed.; 1970; pp. 106–126. [Google Scholar]
Maximov, V. Programma, obuchayushchayasya klassifikatsii geometricheskikh figur [Program Learning to Classify Geometric Images]. In Proceedings of the Abstracts of the 4th Colloquium on Microwave Communication, Budapest, 1970; p. 42. [Google Scholar]
Maximov, V. Modelirovaniye protsessa uznavaniya geometricheskikh izobrazheniy [Modeling the process of recognition of geometric images]. In Pererabotka zritel’noy informatsii i regulyatsiya dvigatel’noy deyatel’nosti – Trudy Mezhdunarodnogo simpoziyma [Processing of Visual Information and Regulation of Motor Activity – Proceedings of the International Symposium]; Gidikov, A., Ed.; Sofia, 1971; pp. 217–226. [Google Scholar]
Maximov, V. Sistema, obuchayushchayasya klassifikatsii geometricheskikh izobrazheniy [A System Learning to Classify Geometric Images]. In Modelirovaniye obucheniya i povedeniya [Modeling of Learning and Behavior]; Smirnov, M.S., Ed.; Moscow, 1975; pp. 29–120. [Google Scholar]
Maximov, V. Sistema, obuchayushchayasya klassifikatsii geometricheskikh izobrazheniy [A System Capable of Learning to Classify Geometric Images]. Candidate of Technical Sciences Dissertation (Specialty Code 05.13.01), Akademiya nauk SSSR, Moscow, 1975. [Google Scholar]

Figure 1. Example of Bongard Problem #16. All images depict non-self-intersecting spiraling curves that may be smooth, wiggly, or angular. The left side contains clockwise spirals, whereas the right side contains counterclockwise spirals. Adapted from [24], p.219.

Figure 2. Examples of BMRS problems constructed by analogy with original Bongard problems. In Bongard Problem #16 (adapted from [24], p.219) and its BMRS analogue, the left and right sides are distinguished by the winding direction of spirals (clockwise versus counterclockwise). In Bongard Problem #97 (adapted from [24], p.246) and its BMRS analogue, the discriminative rule is object shape: the original problem contrasts triangular (left side) and circular (right side) objects, whereas the BMRS analogue contrasts square (left side) and circular (right side) buildings.

Figure 3. Examples of BMRS problems assembled semi-automatically from annotated imagery. In the left problem, the discriminative rule distinguishes watercraft (left side) from aircraft (right side). In the right problem, the discriminative rule distinguishes images containing houses (left side) from images containing trees (right side).

Figure 4. Performance of the Descriptive-iterative prompting strategy across 10 random image shuffles, grouped by BMRS taxonomy categories. Top: heatmap showing the success (dark green) or failure (light green) of the model for each shuffle ID across the problem set; green marks problems that were solved in at least one permutation. Bottom: bar chart showing the total number of correct solutions (out of a maximum of 10) for each problem. The results demonstrate a high sensitivity to image ordering and a strong bias toward solving Semantic problems, while Spatial, Number, and Same problems remain unsolved.

Figure 5. Model performance compared with the performance of individual human subjects. Each blue point corresponds to the mean accuracy of one human subject measured on a 20-problem sample. Green, yellow, and red points correspond to the mean accuracy of a VLM measured on the full BMRS dataset (122 problems).

Table 1. Comparison of the Bongard benchmarks.

Dataset	Images	Reasoning	Size
Original BPs [24]	Line drawing	Analytic	100
Bongard LOGO [22]	Line drawing	Analytic	12 000
Bongard HOI [23]	Real ground-level	Synthetic	53 000
Bongard OpenWorld [42]	Real ground-level	Synthetic	1 010
Bongard RWR [43]	Real ground-level	Analytic	60
Bongard RWR+ [44]	AI-generated ground-level	Analytic	5 400
BMRS	Real Remote Sensing	Synthetic + Analytic	122

Table 2. Comparison of task distributions by class. The number of problems is specified in parentheses.

Dataset	Concept			Spatial	Size	Number	Same	Total
	Shape	Semantic	Presence
Original Bongard problem [21]	31% (31)			41% (41)	6% (6)	15% (15)	7% (7)	100
BMRS	20% (25)	30% (37)	11% (13)	21% (26)	4% (5)	10% (12)	3% (4)	122
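The BMRS percentages in Table 2 follow directly from the per-category problem counts given in parentheses. As a quick arithmetic check, the Python sketch below recomputes them, assuming the percentages were rounded to the nearest integer (the variable names are illustrative).

```python
# Per-category BMRS problem counts, copied from the parentheses in Table 2.
bmrs_counts = {
    "Shape": 25,
    "Semantic": 37,
    "Presence": 13,
    "Spatial": 26,
    "Size": 5,
    "Number": 12,
    "Same": 4,
}

total = sum(bmrs_counts.values())

# Share of each category in percent, rounded to the nearest integer.
shares = {name: round(100 * count / total) for name, count in bmrs_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}% ({bmrs_counts[name]})")
```

Because each share is rounded independently, the rounded percentages need not sum to exactly 100.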

Table 3. Summary of the vision-language models used in the experiments.

Model	Checkpoint / API	Params	Type
LLaVA-v1.5 7B	`llava-hf/llava-1.5-7b-hf`	7B	Open
LLaVA-v1.5 13B	`llava-hf/llava-1.5-13b-hf`	13B	Open
LLaVA-v1.6 7B	`llava-hf/llava-v1.6-vicuna-7b-hf`	7B	Open
LLaVA-v1.6 34B	`llava-hf/llava-v1.6-34b-hf`	34B	Open
InternVL-3.5	`OpenGVLab/InternVL3_5-38B-HF`	38B	Open
Qwen-3-VL	`Qwen/Qwen3-VL-32B-Instruct`	32B	Open
VHM	`FitzPC/vhm_7B`	7B	Open, RS
RS-LLaVA	`BigData-KSU/RS-llava-v1.5-7b-LoRA`	7B	Open, RS
RS-EoT	`ShaoRun/RS-EoT-7B`	7B	Open, RS
GeoChat	`MBZUAI/geochat-7B`	7B	Open, RS
SkySenseGPT	`ll-13/SkySenseGPT-7B-CLIP-ViT`	7B	Open, RS
EarthDial	`akshaydudhane/EarthDial_4B_RGB`	4B	Open, RS
LHRS-Bot	`LHRS/LHRS-Bot-Nova`	7B	Open, RS
TeoChat	`jirvin16/TEOChat`	7B	Open, RS
Gemini-3.1-Pro	API	>1T	Proprietary
ChatGPT-5.5-Pro	API	>1T	Proprietary

RS denotes models specifically designed or adapted for remote sensing tasks.

Table 4. Performance per type for humans.

	All	Number	Presence	Same	Semantic	Shape	Size	Spatial
Accuracy	0.745	0.823	0.734	0.577	0.853	0.801	0.778	0.518

Table 5. Model performance per prompting strategy. An asterisk (*) marks the highest accuracy in each row.

Model	Contrastive-direct	Contrastive-iterative	Descriptive-direct	Descriptive-iterative	Direct
LLaVA-v1.5 7B	0.11	0.14	0.15*	0.12	0.13
LLaVA-v1.5 13B	0.09	0.10	0.16	0.18*	0.13
LLaVA-v1.6 7B	0.09	0.07	0.13*	0.07	0.13*
LLaVA-v1.6 34B	0.15	0.13	0.20*	0.08	0.17
InternVL-3.5	0.35	0.26	0.40*	0.20	0.36
Qwen-3-VL	0.39	0.32	0.43*	0.22	0.39
VHM	0.07	0.14	0.15*	0.14	0.08
RS-LLaVA	0.06	0.02	0.03	0.03	0.07*
RS-EoT	0.08	0.17*	0.03	0.07	0.14
GeoChat	0.09*	0.07	0.03	0.07	0.07
SkySenseGPT	0.07	0.08*	0.01	0.06	0.05
EarthDial	0.07	0.13*	0.05	0.07	0.07
LHRS-Bot	0.05	0.07	0.20*	0.16	0.11
TeoChat	0.08	0.07	0.06	0.07	0.10*
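Selecting the best strategy for a model amounts to taking the row maximum in Table 5. The sketch below does this for three rows copied from the table; the names `STRATEGIES`, `table5_rows`, and `best_strategies` are illustrative, and returning all tied strategies is a choice made here, not necessarily how the authors resolved ties.

```python
# Column order of Table 5.
STRATEGIES = [
    "Contrastive-direct", "Contrastive-iterative",
    "Descriptive-direct", "Descriptive-iterative", "Direct",
]

# Three rows copied from Table 5 (accuracy per strategy).
table5_rows = {
    "LLaVA-v1.6 7B": [0.09, 0.07, 0.13, 0.07, 0.13],
    "Qwen-3-VL": [0.39, 0.32, 0.43, 0.22, 0.39],
    "RS-EoT": [0.08, 0.17, 0.03, 0.07, 0.14],
}


def best_strategies(scores):
    """Return every strategy that attains the row maximum (ties included)."""
    best = max(scores)
    return [name for name, value in zip(STRATEGIES, scores) if value == best]


best = {model: best_strategies(scores) for model, scores in table5_rows.items()}
for model, names in best.items():
    print(f"{model}: {', '.join(names)}")
```

For LLaVA-v1.6 7B two strategies tie at 0.13 at the two-decimal precision of Table 5, so a tie-breaking rule is needed whenever a single strategy per model has to be reported.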

Table 6. Performance per problem type using the best strategy for each model.

Model	All	Number	Presence	Same	Semantic	Shape	Size	Spatial
	Human baseline
Humans	0.745	0.823	0.734	0.577	0.853	0.801	0.778	0.518
	General-Large VLMs
Gemini-3.1-Pro	0.869	0.833	0.917	1.000	0.946	0.875	0.800	0.750
ChatGPT-5.5-Pro	0.893	0.833	0.923	1.000	0.973	0.840	0.800	0.846
	General-Mid VLMs
LLaVA-v1.5 7B Descriptive-direct	0.148	0.000	0.154	0.000	0.324	0.120	0.200	0.000
LLaVA-v1.5 13B Descriptive-iterative	0.180	0.250	0.231	0.000	0.378	0.040	0.000	0.038
LLaVA-v1.6 7B Descriptive-direct	0.131	0.000	0.077	0.000	0.270	0.120	0.200	0.038
LLaVA-v1.6 34B Descriptive-direct	0.205	0.167	0.385	0.000	0.378	0.160	0.000	0.000
InternVL-3.5 Descriptive-direct	0.402	0.500	0.462	0.250	0.595	0.440	0.200	0.077
Qwen-3-VL Descriptive-direct	0.426	0.417	0.846	0.250	0.514	0.480	0.400	0.077
	Remote Sensing VLMs
VHM Descriptive-direct	0.148	0.083	0.231	0.000	0.351	0.040	0.000	0.000
RS-LLaVA Direct	0.066	0.000	0.077	0.000	0.189	0.000	0.000	0.000
RS-EoT Contrastive-iterative	0.172	0.083	0.231	0.000	0.351	0.120	0.000	0.038
GeoChat Contrastive-direct	0.090	0.000	0.000	0.000	0.243	0.040	0.200	0.000
SkySenseGPT Contrastive-iterative	0.082	0.000	0.154	0.000	0.189	0.040	0.000	0.000
EarthDial Contrastive-iterative	0.131	0.083	0.154	0.000	0.324	0.040	0.000	0.000
LHRS-Bot Descriptive-direct	0.205	0.250	0.231	0.250	0.459	0.040	0.000	0.000
TeoChat Direct	0.098	0.000	0.077	0.000	0.270	0.000	0.200	0.000

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the authors and the preprint are cited in any reuse.
