Submitted: 18 June 2026
Posted: 22 June 2026
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Visual Language Models
2.2. RSVLM Benchmarks
2.3. Bongard Benchmarks
3. Materials and Methods
3.1. Remote Sensing Bongard Benchmark
1. the problems include black-and-white images without halftones (i.e., line drawings);
2. the information contained in the images themselves is sufficient to solve the problem (analytic reasoning);
3. the problems are solvable by a human observer.
3.1.1. Collection Methods
3.1.2. Classification of Problems
1. Size: the discriminative rule is based on object size;
2. Number: the discriminative rule is based on object count;
3. Spatial: the discriminative rule is based on the spatial arrangement of objects;
4. Same: objects within each image are either identical or non-identical with respect to a shared property;
5. Concept: all remaining cases.
- Semantic: the discriminative rule is defined by object identity or category rather than by visual properties alone. For example, the left problem in Figure 3 contrasts watercraft and aircraft.
- Presence: the discriminative rule is based on the presence or absence of a particular object or object category. For example, the right task in Figure 3 contrasts images containing houses with images containing trees.
3.2. Vision–Language Models
3.3. Prompting Strategies
3.4. Human Study Design
3.5. Answers Evaluation
4. Results
4.1. Human Study Results
4.2. Prompting Strategies Evaluation
4.2.1. Sensitivity to Image Ordering (Shuffle) in Iterative Prompting
4.3. Comparison of Solution Results Across Problem Classes
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| AI | Artificial intelligence |
| BP | Bongard problems |
| LLM | Large language models |
| RL | Reinforcement learning |
| RS | Remote sensing |
| RSI | Remote sensing imagery |
| RSVLMs | Remote sensing vision-language models |
| RSVQA | Remote Sensing Visual Question Answering |
| VLM | Vision-language model |
| VQA | Visual question answering |
Appendix A. Vision–Language Model Inference Prompts
Appendix A.1. Common System Prompt
| Prompt A1: Common system prompt. |
|---|
| You are a vision understanding module designed to provide short, clear, and accurate answers. Your goal is to solve a Bongard problem consisting of a collage with six images on the left side and six images on the right side. |
| All left images share a common concept that none of the right images have, and all right images share a different common concept that none of the left images have. Your task is to identify both concepts. |
| The answer must consist of exactly two plain sentences: the first sentence describes the concept of the left side, and the second sentence describes the concept of the right side. Do not use markdown, bullet points, or any formatting. Keep each sentence short and clear. |
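
The system prompt fixes the output format at exactly two plain sentences, left concept first. A minimal sketch of how such an answer could be split and format-checked before judging is given below. The function name `split_concepts` and the naive sentence-splitting rule are assumptions made for illustration, not part of the described pipeline; abbreviations such as "e.g." would defeat this simple splitter.

```python
import re
from typing import Optional, Tuple


def split_concepts(answer: str) -> Optional[Tuple[str, str]]:
    """Return (left_concept, right_concept), or None if the answer
    does not consist of exactly two sentences."""
    # Collapse newlines and repeated spaces into single spaces.
    text = " ".join(answer.split())
    # Naive rule (assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if len(sentences) != 2:
        return None
    return sentences[0], sentences[1]


if __name__ == "__main__":
    good = ("The left images show square or rectangular roof forms. "
            "The right images show circular or radial roof forms.")
    print(split_concepts(good))
    # A keyword list such as the EarthDial output in Appendix E.1 fails the check.
    print(split_concepts("port, harbor, water, lake"))
```

A check of this kind only validates the format; whether the two concepts are right is still decided by the judge described in Appendix B.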
Appendix A.2. Direct Strategy
| Prompt A2: Direct strategy. |
|---|
| Here is the Bongard problem collage. Provide your answer as two plain sentences: first the concept for all left images, then the concept for all right images. |
Appendix A.3. Descriptive-Direct Strategy
| Prompt A3: Descriptive-direct strategy. |
|---|
| This is a single image from either the left or right side of a Bongard problem. Do not solve the problem yet. Describe this image in detail. |
| Focus on all visual features: objects, shapes, colors, spatial arrangements, textures, and any other distinctive properties. Be thorough, as these descriptions will later be used to identify a common concept. |
| Here is the Bongard problem collage. Below are the detailed descriptions of each left image and each right image. Based on this information, provide your answer. |
| Left class image descriptions: {}. Right class image descriptions: {}. |
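
Prompt A3 implies a two-stage flow: each image is described separately, then the collage is solved with all descriptions filled into the two `{}` slots. The sketch below illustrates that flow under stated assumptions: `vlm` is a hypothetical callable taking a prompt and a list of image paths, the prompt strings are abridged from Prompt A3, and joining descriptions with `" | "` is an arbitrary choice.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: (prompt, image paths) -> generated text.
VLM = Callable[[str, Sequence[str]], str]

# Abridged from Prompt A3.
DESCRIBE = ("This is a single image from either the left or right side of a "
            "Bongard problem. Do not solve the problem yet. "
            "Describe this image in detail.")
SOLVE = ("Here is the Bongard problem collage. Below are the detailed "
         "descriptions of each left image and each right image. Based on this "
         "information, provide your answer. Left class image descriptions: {}. "
         "Right class image descriptions: {}.")


def descriptive_direct(vlm: VLM, collage: str,
                       left: List[str], right: List[str]) -> str:
    # Stage 1: describe every image on its own.
    left_desc = [vlm(DESCRIBE, [img]) for img in left]
    right_desc = [vlm(DESCRIBE, [img]) for img in right]
    # Stage 2: solve on the collage with the descriptions filled in.
    # The " | " separator is an assumption.
    prompt = SOLVE.format(" | ".join(left_desc), " | ".join(right_desc))
    return vlm(prompt, [collage])
```

With six images per side this amounts to twelve description calls followed by one solving call.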
Appendix A.4. Contrastive-Direct Strategy
| Prompt A4: Contrastive-direct strategy. |
|---|
| Here is a pair: one image from the left side of a Bongard problem and one from the right side. Do not solve the full problem yet. Instead, carefully examine both images and list all differences between them. |
| Focus on contrasting visual features: objects, shapes, colors, positions, textures, counts, orientations, or any other properties that distinguish the left image from the right image. Output your answer as a simple list of differences. Use plain text, no markdown. |
| Here is the full Bongard problem collage. Below are all the difference lists obtained from comparing left-right image pairs. |
| Based on this information, determine the common concept that unites all left images but no right images, and the common concept that unites all right images but no left images. Provide your answer. |
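
Prompt A4 first collects a difference list per left-right pair and then asks for the concepts on the full collage. The sketch below shows one way this could be wired up. Several details are assumptions, since the prompt text does not specify them: the hypothetical `vlm` callable, the index-wise pairing of left and right images, and the numbered layout used to append the difference lists. The prompt strings are abridged from Prompt A4.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: (prompt, image paths) -> generated text.
VLM = Callable[[str, Sequence[str]], str]

# Abridged from Prompt A4.
COMPARE = ("Here is a pair: one image from the left side of a Bongard problem "
           "and one from the right side. Do not solve the full problem yet. "
           "Instead, carefully examine both images and list all differences "
           "between them.")
SOLVE = ("Here is the full Bongard problem collage. Below are all the "
         "difference lists obtained from comparing left-right image pairs.")


def contrastive_direct(vlm: VLM, collage: str,
                       left: List[str], right: List[str]) -> str:
    if len(left) != len(right):
        raise ValueError("expected the same number of left and right images")
    # Assumption: the i-th left image is compared with the i-th right image.
    diffs = [vlm(COMPARE, [l, r]) for l, r in zip(left, right)]
    # Assumption: difference lists are appended as numbered lines.
    listing = "\n".join(f"Pair {i}: {d}" for i, d in enumerate(diffs, start=1))
    return vlm(SOLVE + "\n" + listing, [collage])
```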
Appendix A.5. Contrastive-Iterative Strategy
| Prompt A5: Contrastive-iterative strategy. |
|---|
| Here is a pair: one image from the left side of a Bongard problem and one from the right side. Carefully examine both images and list all differences between them. |
| Focus on contrasting visual features: objects, shapes, colors, positions, textures, counts, orientations, or any other properties that distinguish the left image from the right image. Propose a candidate concept that distinguishes the left image from the right image. |
| Here is the next left-right pair. Your goal is to generalize the concept to fit all of the pairs you have seen. Carefully examine the new pair and find all differences between the images. |
| Focus on contrasting visual features: objects, shapes, colors, positions, textures, counts, orientations, or any other properties that distinguish the left image from the right image. Your previous candidate concept is: {}. Check whether the previous concept applies to the new pair. |
| If it is fully correct, output the same concept unchanged. If it is partially correct, refine the concept by removing or adjusting the failing parts, keeping only the aspects valid for all left images seen so far and false for all right images. If it is completely wrong, discard it and formulate a completely new concept based on all pairs seen now, including this one. |
| Output the updated concept as one detailed sentence covering all distinguishing features common to all processed pairs. |
| This is the last left-right pair. Based on all six pairs and your iterative refinement, provide your final answer. |
| Output exactly two plain sentences: the first describes the concept that holds for all left images but no right images, and the second describes the concept that holds for all right images but no left images. Your iterative concept is: {}. |
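
Prompt A5 describes a refinement loop: a candidate concept is proposed on the first pair, carried through each following pair via the `{}` slot, and turned into the final two-sentence answer on the last pair. A minimal sketch of that loop follows. It assumes a hypothetical `vlm` callable, index-wise pairing, and that only the carried concept (not the full dialogue history) is passed between steps; the prompt strings are abridged from Prompt A5.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: (prompt, image paths) -> generated text.
VLM = Callable[[str, Sequence[str]], str]

# Abridged from Prompt A5.
FIRST = ("Here is a pair: one image from the left side of a Bongard problem "
         "and one from the right side. Propose a candidate concept that "
         "distinguishes the left image from the right image.")
NEXT = ("Here is the next left-right pair. Your previous candidate concept "
        "is: {}. Check whether the previous concept applies to the new pair.")
LAST = ("This is the last left-right pair. Based on all six pairs and your "
        "iterative refinement, provide your final answer. "
        "Your iterative concept is: {}.")


def contrastive_iterative(vlm: VLM, left: List[str], right: List[str]) -> str:
    if len(left) != len(right) or len(left) < 2:
        raise ValueError("need at least two left-right pairs of equal count")
    pairs = list(zip(left, right))
    # First pair: propose an initial candidate concept.
    concept = vlm(FIRST, list(pairs[0]))
    # Middle pairs: keep, refine, or replace the carried concept.
    for pair in pairs[1:-1]:
        concept = vlm(NEXT.format(concept), list(pair))
    # Last pair: produce the final answer from the carried concept.
    return vlm(LAST.format(concept), list(pairs[-1]))
```

Because each step depends only on the previous concept and the current pair, the result can depend on the order in which pairs are presented, which is the effect examined in Section 4.2.1.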
Appendix A.6. Descriptive-Iterative Strategy
| Prompt A6: Descriptive-iterative strategy. |
|---|
| This is the first image from one side of a Bongard problem. Later you will see more images from this same side. Your goal is to formulate a general description of all images on this side for solving the problem later. |
| Do not solve the problem yet. Describe this image in detail. Focus on all visual features: objects, shapes, colors, spatial arrangements, textures, counts, orientations, and any other distinctive properties. Be thorough, as these descriptions will later be used to identify a common concept. |
| Here is the next image from the same side. Your goal is to generalize the class description to fit all images from this side that you have seen so far. |
| Carefully examine this new image and list all its relevant visual features: objects, shapes, colors, spatial arrangements, textures, counts, orientations, or any other distinctive properties. Your previous candidate description for this side is: {}. Check whether it applies to the new image. |
| If it is fully correct, output the same description unchanged. If it is partially correct, refine the description by removing or adjusting the failing parts, keeping only what is true for all images seen so far. If it is completely wrong, discard it and formulate a completely new description based on all images seen now, including this one. |
| This is the last image from this side. Based on all six images and your iterative refinement, provide the final description that unites all images on this side. Your iterative description is: [previous concept]. |
| Here is the Bongard problem collage. Below are generalized descriptions of the left and right sides of images. Based on this information, solve the Bongard problem. |
| Left class image descriptions: {}. Right class image descriptions: {}. |
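
Prompt A6 applies the same refinement idea to each side separately and only then solves the problem on the collage. The sketch below illustrates this with a per-side helper called once for the left images and once for the right images. The hypothetical `vlm` callable and the function names are assumptions; the prompt strings are abridged from Prompt A6, and the `[previous concept]` placeholder in the last-image prompt is treated as the same kind of slot as `{}`.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: (prompt, image paths) -> generated text.
VLM = Callable[[str, Sequence[str]], str]

# Abridged from Prompt A6.
FIRST = ("This is the first image from one side of a Bongard problem. "
         "Do not solve the problem yet. Describe this image in detail.")
NEXT = ("Here is the next image from the same side. Your previous candidate "
        "description for this side is: {}. Check whether it applies to the "
        "new image.")
LAST = ("This is the last image from this side. Provide the final description "
        "that unites all images on this side. "
        "Your iterative description is: {}.")
SOLVE = ("Here is the Bongard problem collage. Below are generalized "
         "descriptions of the left and right sides of images. Based on this "
         "information, solve the Bongard problem. "
         "Left class image descriptions: {}. "
         "Right class image descriptions: {}.")


def refine_side(vlm: VLM, images: List[str]) -> str:
    """Carry one generalized description through all images of one side."""
    if len(images) < 2:
        raise ValueError("need at least two images per side")
    description = vlm(FIRST, [images[0]])
    for img in images[1:-1]:
        description = vlm(NEXT.format(description), [img])
    return vlm(LAST.format(description), [images[-1]])


def descriptive_iterative(vlm: VLM, collage: str,
                          left: List[str], right: List[str]) -> str:
    left_desc = refine_side(vlm, left)
    right_desc = refine_side(vlm, right)
    return vlm(SOLVE.format(left_desc, right_desc), [collage])
```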
Appendix B. LLM Judge Prompting Protocol
Appendix B.1. System Prompt for the LLM Judge
| Prompt B1: LLM judge system prompt. |
|---|
| You evaluate the user’s answer by comparing it with reference answers for a given task. |
| You see: reference answers, examples of correct and incorrect answers, and the user answer. |
| Each answer contains features of the right and left class. Identify the target property that distinguishes the right class from the left. This may describe semantics, form, presence, quantity, or size of objects. |
| The user answer may contain features of both right and left classes, or an explicit target property separating these classes. If two features are indicated, formulate the target property based on them. |
| The answer is correct if the target property matches the reference or correct examples, accounting for generalization, paraphrasing, synonyms, word order changes, or simplification. The answer is incorrect if the target property is missing or wrong. |
| Do not accept answers similar to or matching incorrect examples. Focus on meaning, not exact wording. Ignore minor differences in style, grammar, or phrasing. |
| Respond with only one word: correct or incorrect. |
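
Since the judge is instructed to reply with a single word, its output can be mapped to a boolean and aggregated into an accuracy score. The sketch below shows one way to do this. The function names are hypothetical, and counting unparseable replies as incorrect is an assumption, not something the protocol states.

```python
from typing import Iterable, Optional


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a judge reply to True (correct), False (incorrect),
    or None when the reply is neither word."""
    # Exact match after normalisation: "incorrect" contains "correct",
    # so a substring test would misread it.
    word = reply.strip().strip(".!\"'").lower()
    if word == "correct":
        return True
    if word == "incorrect":
        return False
    return None


def accuracy(replies: Iterable[str]) -> float:
    """Share of replies judged correct.
    Assumption: unparseable replies count as incorrect."""
    verdicts = [parse_verdict(r) for r in replies]
    if not verdicts:
        raise ValueError("no judge replies given")
    return sum(v is True for v in verdicts) / len(verdicts)


if __name__ == "__main__":
    # Illustrative replies only, not results from the study.
    print(accuracy(["correct", "incorrect", "Correct.", "n/a"]))
```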
Appendix B.2. User Prompt Template
| Prompt B2: LLM judge user prompt template. |
|---|
| Task-specific evaluation instruction: |
| {class_specific_prompt} |
| Reference answers: |
| Left: {reference_left} |
| Right: {reference_right} |
| Examples of correct answers: |
| Example 1: |
| Left: {correct_example_left} |
| Right: {correct_example_right} |
| Examples of incorrect answers: |
| Example 1: |
| Left: {incorrect_example_left} |
| Right: {incorrect_example_right} |
| Model answer: |
| {model_answer} |
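
The template above uses named placeholders, so it can be filled with ordinary string formatting. The following sketch reproduces the template with one correct and one incorrect example, as shown in Prompt B2. The function name `build_judge_prompt` and all field values in the usage example are hypothetical.

```python
# Template text follows Prompt B2; placeholder names are taken from it.
JUDGE_TEMPLATE = """Task-specific evaluation instruction:
{class_specific_prompt}
Reference answers:
Left: {reference_left}
Right: {reference_right}
Examples of correct answers:
Example 1:
Left: {correct_example_left}
Right: {correct_example_right}
Examples of incorrect answers:
Example 1:
Left: {incorrect_example_left}
Right: {incorrect_example_right}
Model answer:
{model_answer}"""


def build_judge_prompt(**fields: str) -> str:
    """Fill the judge template; str.format raises KeyError
    if any placeholder is left without a value."""
    return JUDGE_TEMPLATE.format(**fields)


if __name__ == "__main__":
    # All values below are made up for illustration.
    print(build_judge_prompt(
        class_specific_prompt="<text of the class-specific instruction>",
        reference_left="Images contain houses.",
        reference_right="Images contain trees.",
        correct_example_left="Houses are visible.",
        correct_example_right="Trees are visible.",
        incorrect_example_left="Urban areas.",
        incorrect_example_right="Natural areas.",
        model_answer="The left images show houses. The right images show trees.",
    ))
```

Failing loudly on a missing field is a deliberate choice here: a judge prompt with an unfilled placeholder would otherwise be evaluated silently.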
Appendix B.3. Class-Specific Evaluation Instructions
| Prompt B3: Size. |
|---|
| You are given answers to a problem of the “size” type. Your task is to check whether each answer is correct. A correct answer must explicitly mention one of the following differences between the objects in the left and right sets of images: the objects in one set are closer and the objects in the other set are farther; the objects in one set are bigger and the objects in the other set are smaller; or each set has a name, such as “barns” on the left and “estates” on the right, and these names reflect a real-world size difference. |
| For example, barns are typically smaller than estates, so stating that difference would be correct logic. |
| If an answer contains the phrase “I do not know” or any similar expression of uncertainty, it is incorrect. |
| Output your judgment as correct or incorrect for each answer, with a brief justification. |
| Prompt B4: Presence. |
|---|
| In this task, the reference target property is based on the presence of an object of a certain class in one set of images and the absence of that same object in the other set. The property is not based on the shape of objects, their color, size, or spatial arrangement. |
| The target property can be rephrased in semantic terms while preserving meaning. For example, the property “images in the left class contain clouds, images in the right class do not contain clouds” can be rephrased as “the left class depicts cloudy weather, and the right class depicts clear weather,” because clear weather means the absence of clouds. |
| Analyze the correct answers to understand which semantic rephrasings have already been accepted as valid. Analyze the incorrect answers to understand which semantic formulations lead to errors. |
| If the answer to be checked is formulated in terms of presence or absence of an object, evaluate it according to the standard rules. If the answer is formulated in semantic terms, first check whether such semantics has appeared in examples of correct or incorrect answers. If it has, follow those examples. If the semantics is new and has not appeared before, analyze whether the named properties always imply the presence or absence of the target object. If they do, mark the answer as correct; otherwise, mark it as incorrect. |
| If the answer refers to “image” in the singular rather than “images” as a set, accept it as valid as long as the distinction between the two sets is preserved in meaning. The answer is also valid if it includes additional information such as a cause-and-effect relationship, but it is incorrect if it includes additional distinguishing features. |
| If the answer describes a characteristic of an object being “more” or “less” of something rather than simply the presence or absence of the target object, accept it only if no additional distinguishing features are provided beyond that quantitative difference. If the answer includes other differences beyond the more-or-less comparison, reject it as incorrect. |
| Prompt B5: Number. |
|---|
| You are given answers to a problem of the “number” type. Your task is to check whether each answer is correct. |
| A correct answer must explicitly mention one of the following differences between the objects in the left and right sets of images: every image in one set contains more or fewer specific objects than every image in the other set; or every image in one set contains a specific fixed number of objects, and every image in the other set contains a different specific fixed number of objects. |
| For example, an answer is correct if it states that left-set images each have more apples and right-set images each have fewer apples, or that the left set always has three cars and the right set always has seven cars. |
| If an answer contains “I do not know” or any similar expression of uncertainty, it is incorrect. |
| Prompt B6: Shape. |
|---|
| In this task, the reference target property is based on the shape of objects in the image, regardless of their semantics. |
| The target property can be rephrased in terms of object semantics while preserving meaning. For example, the property “the boundary between land and water is closed or not closed” can be rephrased semantically as “island or peninsula.” |
| Analyze the correct answers to understand which semantic rephrasings have already been accepted as valid. Also analyze the incorrect answers to understand which semantic formulations are mistaken. |
| If the answer to be checked is formulated in terms of shape, evaluate it according to the standard rules. If the answer is formulated in terms of semantics, first check whether such semantics has appeared in examples of correct or incorrect answers. If it has, follow those examples. If the semantics is new and has not appeared before, analyze whether objects with such semantics always have the required shape difference. If they do, the answer is correct; otherwise, it is incorrect. |
| Prompt B7: Spatial. |
|---|
| In this task, the reference target property is based exclusively on the position of objects in the image, the orientation of objects relative to the image frame, or the mutual arrangement of objects relative to each other. The property is not based on shape, color, size, quantity, or semantics. |
| Semantic rephrasings are allowed as long as they preserve the meaning of the spatial arrangement. For example, “left-hand traffic” means that cars move on the left side of the road, and “right-hand traffic” means that cars move on the right side. |
| As a general rule, if an answer does not refer to the location or orientation of objects at all, it is incorrect. If the answer refers to a different type of spatial relation, it is also incorrect. The answer is still valid if it includes additional information such as a cause-and-effect relationship of this spatial relation, but it is incorrect if it includes additional objects as a distinguishing feature. |
| For tasks about absolute location, the answer must clearly indicate that the difference between the classes lies in the location. It is not necessary to specify exact parts of the image for each class, but the answer must state what the difference is. For example, “on different sides of the picture” is acceptable, whereas “in different places” is not, because it does not specify the difference. The object can be referred to generically as an “object.” If the object is named specifically, it must be named correctly. Rephrasings are allowed, such as “forest” instead of “trees.” |
| For tasks about orientation, the answer must clearly indicate that the difference between the classes lies in orientation. It is not necessary to specify the exact orientation for each class, but the answer must state what the difference is. For example, “oriented horizontally and vertically” is acceptable, whereas “oriented differently” is not, because it does not specify the difference. The object can be referred to generically as an “object.” If the object is named specifically, it must be named correctly. Rephrasings are allowed, such as “forest” instead of “trees.” |
| For tasks about mutual arrangement of objects, the answer must clearly indicate that the difference lies in the relative arrangement, not in absolute positions. If both objects or object classes are correctly identified, it is sufficient to state that there is a difference in their mutual arrangement. If specific mutual arrangements are given, they must be correct. It is acceptable to name only one object, but then the answer must specifically state that the difference is in its position relative to the other object. If no object is named, the answer is not counted as correct. |
| When checking, first determine the type of the reference answer: absolute location, orientation, or mutual arrangement. Then evaluate the user’s answer according to these rules. Analyze the correct answers to understand which formulations have already been accepted as valid. Analyze the incorrect answers to understand common mistakes. |
| Prompt B8: Same. |
|---|
| For this task, the target property should be a shared visual characteristic that unifies objects in one class and sets them apart from the other class. |
| A correct answer must describe a visual attribute that is uniform across the images in one set and non-uniform in the other collage. |
| The key distinction should rely on intra-set consistency, for example all objects facing the same direction or all objects having the same color or shape, versus variability within the opposing set. |
| Prompt B9: Semantic. |
|---|
| A correct answer identifies a difference in object type, purpose, class, activity, or real-world meaning between the two sets of images. |
| The target property must be the object type itself, such as rivers versus roads, or a direct equivalent, such as the presence of water versus the absence of water. |
| An answer is incorrect if it replaces the object types with a broader category, such as “natural landscapes versus human-made infrastructure,” “rural versus urban,” or “ecologically rich versus degraded.” |
| An answer is also incorrect if it adds an unnecessary distinguishing attribute that is not guaranteed by the reference answer. For example, “winding rivers” is incorrect if winding is typical but not required, and “urban roads” is incorrect because roads can be rural. |
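
Prompts B3 to B9 define one instruction per problem class, which implies that every judged answer carries a class label. A small sketch of grouping judge verdicts by that label to obtain per-class accuracy is given below. The lowercase class keys, the function name, and the data in the usage example are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Class labels follow the titles of Prompts B3 to B9.
PROBLEM_CLASSES = ("size", "presence", "number", "shape",
                   "spatial", "same", "semantic")


def accuracy_by_class(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """results: (problem_class, judged_correct) pairs, one per judged answer."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for cls, ok in results:
        if cls not in PROBLEM_CLASSES:
            raise ValueError(f"unknown problem class: {cls!r}")
        totals[cls] += 1
        hits[cls] += int(ok)
    # Only classes that actually occur in the input are reported.
    return {cls: hits[cls] / totals[cls] for cls in totals}


if __name__ == "__main__":
    # Made-up verdicts, not results from the study.
    demo = [("size", True), ("size", False), ("shape", True)]
    print(accuracy_by_class(demo))
```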
Appendix C. Historical Background
“The ideas for the pictures were conceived and discussed together. Then Vadim would draw them. He was fond of drawing. I remember those sheets of Whatman paper, the ink, and the pens. Was it during the time when M. M. Bongard was preparing "Problema uznavaniya" [56], or several years later, in 1975, for his own candidate dissertation, which was a continuation of "Geometry"? I am not sure – perhaps both. As a playful dedication, M. M. Bongard wrote on the copy gifted to Vadim:
‘To dear Vadim, my partner in all this mischief. Mika, 26 August 1967’”
Appendix D. Extended Human Study Results


- Absolute position: object location relative to the image frame;
- Orientation: object orientation relative to the image frame;
- Relative position: spatial relationships between multiple objects.


Appendix D.1. Most Challenging Problems for Humans


Appendix E. Most Challenging Problems for Models



Appendix E.1. Examples of Incorrect Answers
| EarthDial on bb_m_01 using Direct strategy. |
|---|
| port, harbor, water, lake |
| TeoChat on bb_m_28 using Contrastive-iterative strategy. |
|---|
| None of the left images have the concept None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right images have None of the left images have None of the right |
| EarthDial on bb_m_29 using Contrastive-iterative strategy. |
|---|
| The left image is a roundabout, while the right image is a building. |
| LHRS-Bot on bb_m_29 using Contrastive-direct strategy. |
|---|
| The difference between the two sides of the problem is the arrangement of the buildings, roads, and trees. |
| VHM on bb_m_29 using Descriptive-direct strategy. |
|---|
| The left side images share the concept of roundabouts, while the right side images share the concept of residential areas. |
| SkySenseGPT on bb_m_29 using Descriptive-direct strategy. |
|---|
| The common concept that unites all left images but no right images is "1 intersection". The common concept that unites all right images but no left images is "1 roundabout". |
| ChatGPT on bb_m_08. |
|---|
| The left images focus on cars on paved lots or roads. The right images show vehicle depots or parking areas mixed with buildings, buses, or trailers. |
| ChatGPT on bb_m_36. |
|---|
| The left images show fields with clear crop rows or furrows. The right images show fields without clear crop rows or furrows. |
| ChatGPT on bb_m_113. |
|---|
| The left images show square or rectangular roof forms. The right images show circular or radial roof forms. |
| Gemini on bb_m_08. |
|---|
| The left images feature groups of multi-colored vehicles. The right images feature fleets of uniformly white vehicles |
| Gemini on bb_m_36. |
|---|
| The left images show agricultural fields with distinct parallel rows or furrows. The right images show agricultural fields with uniform textures lacking visible rows. |
| Gemini on bb_m_113. |
|---|
| The left images show buildings with square or rectangular roofs. The right images show buildings with circular roofs. |
Appendix F. Model Accuracy Across Problem Types

References
- Li, J.; Pei, Y.; Zhao, S.; Xiao, R.; Sang, X.; Zhang, C. A Review of Remote Sensing for Environmental Monitoring in China. Remote Sens. 2020, 12.
- Pavlova, M.; Sidorchuk, D.; Bocharov, D.; Sarycheva, A. Crop Classification Using Reduced-Dimensionality NDVI Time Series. In Proceedings of the ECMS 2023, European Council for Modelling and Simulation, 2023; Vol. 37, pp. 306–312.
- Pavlova, M.A.; Timofeev, V.A.; Bocharov, D.A.; Sidorchuk, D.S.; Nurmukhametov, A.L.; Nikonorov, A.V.; Yarykina, M.S.; Kunina, I.A.; Smagina, A.A.; Zagarev, M.A. Low-parameter method for delineation of agricultural fields in satellite images based on multi-temporal MSAVI2 data. Comput. Opt. 2023, 47, 451–463.
- Yu, D.; Fang, C. Urban Remote Sensing with Spatial Big Data: A Review and Renewed Perspective of Urban Studies in Recent Decades. Remote Sens. 2023, 15.
- Im, J.; Park, H.; Takeuchi, W. Advances in Remote Sensing-Based Disaster Monitoring and Assessment. Remote Sens. 2019, 11.
- Omoniyi, T.O.; Sims, A. Enhancing the Precision of Forest Growing Stock Volume in the Estonian National Forest Inventory with Different Predictive Techniques and Remote Sensing Data. Remote Sens. 2024, 16.
- Ivliev, N.A.; Podlipnov, V.V.; Ivanushkin, M.A.; Skidanov, R.V.; Fedorov, V.V.; Kazanskiy, N.L.; Soifer, V.A. Imaging of the Earth’s surface with an ultra-compact camera with a hybrid lens mounted on the CubeSat 3U platform. Comput. Opt. 2026, 50, 1742.
- Pellegrino, A.; Pancalli, M.G.; Gianfermo, A.; Marzioli, P.; Curianò, F.; Angeletti, F.; Piergentili, F.; Santoni, F. HORUS: Multispectral and Multiangle CubeSat Mission Targeting Sub-Kilometer Remote Sensing Applications. Remote Sens. 2021, 13.
- Borisov, A.N.; Myasnikov, V.V.; Sergeev, V.V. Method of automatic coregistration of digital remote sensing images from different sources. Comput. Opt. 2024, 48, 932–943.
- Konovalov, V.F.; Myasnikov, V.V.; Sergeev, V.V. A unified neural network-based single super-resolution method for heterogeneous digital earth remote sensing images. Comput. Opt. 2024, 48, 944–955.
- Nikonorov, A.; Sidorchuk, D.; Odinets, N.; Volkov, V.; Sarycheva, A.; Dudenko, E.; Zhidkov, M.; Nikolaev, D. HyperHazeOff: Hyperspectral Remote Sensing Image Dehazing Benchmark. J. Imaging 2025, 11, 422.
- Tao, L.; Zhang, H.; Jing, H.; Liu, Y.; Yan, D.; Wei, G.; Xue, X. Advancements in vision–language models for remote sensing: Datasets, capabilities, and enhancement techniques. Remote Sens. 2025, 17, 162.
- Shao, R.; Li, Z.; Zhang, Z.; Xu, L.; He, X.; Yuan, H.; He, B.; Dai, Y.; Yan, Y.; Chen, Y.; et al. Asking like Socrates: Socrates helps VLMs understand remote sensing images. arXiv 2025, arXiv:2511.22396.
- Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566.
- Wang, J.; Zheng, Z.; Chen, Z.; Ma, A.; Zhong, Y. Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering. Proc. AAAI Conf. Artif. Intell. 2024, Vol. 38, 5481–5489.
- Weng, X.; Pang, C.; Xia, G.S. Vision-language modeling meets remote sensing: Models, datasets, and perspectives. IEEE Geoscience and Remote Sensing Magazine, 2025.
- Liu, F.; Guan, T.; Li, Z.; Chen, L.; Yacoob, Y.; Manocha, D.; Zhou, T. Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality models. arXiv 2023, arXiv:2310.14566.
- Helff, L.; Stammer, W.; Shindo, H.; Dhami, D.S.; Kersting, K. V-lol: A diagnostic dataset for visual logical learning. arXiv 2023, arXiv:2306.07743.
- Lu, P.; Bansal, H.; Xia, T.; Liu, J.; Li, C.; Hajishirzi, H.; Cheng, H.; Chang, K.W.; Galley, M.; Gao, J. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv 2023, arXiv:2310.02255.
- Moskvichev, A.; Odouard, V.V.; Mitchell, M. The conceptarc benchmark: Evaluating understanding and generalization in the arc domain. arXiv 2023, arXiv:2305.07141.
- Wüst, A.; Woydt, T.; Helff, L.; Ibs, I.; Stammer, W.; Dhami, D.S.; Rothkopf, C.A.; Kersting, K. Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad? arXiv 2024, arXiv:2410.19546.
- Nie, W.; Yu, Z.; Mao, L.; Patel, A.B.; Zhu, Y.; Anandkumar, A. Bongard-logo: A new benchmark for human-level concept learning and reasoning. Adv. Neural Inf. Process. Syst. 2020, 33, 16468–16480.
- Jiang, H.; Ma, X.; Nie, W.; Yu, Z.; Zhu, Y.; Anandkumar, A. Bongard-hoi: Benchmarking few-shot visual reasoning for human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022; pp. 19056–19065.
- Bongard, M. Pattern Recognition; Spartan Books: New York, 1970.
- Hofstadter, D.R. Gödel, Escher, Bach: An Eternal Golden Braid; Basic Books, 1999.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 2021; pp. 8748–8763.
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916.
- Pang, C.; Weng, X.; Wu, J.; Li, J.; Liu, Y.; Sun, J.; Li, W.; Wang, S.; Feng, L.; Xia, G.S.; et al. Vhm: Versatile and honest vision language model for remote sensing image analysis. Proc. AAAI Conf. Artif. Intell. 2025, Vol. 39, 6381–6388.
- Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 27831–27840.
- Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Ricci, R.; Melgani, F. Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery. Remote Sens. 2024, 16, 1477.
- Luo, J.; Pang, Z.; Zhang, Y.; Wang, T.; Wang, L.; Dang, B.; Lao, J.; Wang, J.; Chen, J.; Tan, Y.; et al. Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding. arXiv 2024, arXiv:2406.10100.
- Muhtar, D.; Li, Z.; Gu, F.; Zhang, X.; Xiao, P. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In Proceedings of the European Conference on Computer Vision, 2024; Springer; pp. 440–457.
- Hu, Y.; Yuan, J.; Wen, C.; Lu, X.; Liu, Y.; Li, X. Rsgpt: A remote sensing vision language model and benchmark. ISPRS J. Photogramm. Remote Sens. 2025, 224, 272–286.
- Irvin, J.; Liu, E.; Chen, J.; Dormoy, I.; Kim, J.; Khanna, S.; Zheng, Z.; Ermon, S. Teochat: A large vision-language assistant for temporal earth observation data. Proc. Int. Conf. Learn. Represent. 2025, Vol. 2025, 68883–68911.
- Soni, S.; Dudhane, A.; Debary, H.; Fiaz, M.; Munir, M.A.; Danish, M.S.; Fraccaro, P.; Watson, C.D.; Klein, L.J.; Khan, F.S.; et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 14303–14313.
- An, X.; Sun, J.; Gui, Z.; He, W. Choice: benchmarking the remote sensing capabilities of large vision-language models. arXiv 2024, arXiv:2411.18145. [Google Scholar]
- Fiaz, M.; Debary, H.; Fraccaro, P.; Paudel, D.; Van Gool, L.; Khan, F.; Khan, S. Geovlm-r1: Reinforcement fine-tuning for improved remote sensing reasoning. arXiv 2025, arXiv:2509.25026. [Google Scholar]
- Ma, X.; Feng, S.; Zhang, B.; Wang, B. ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing. arXiv 2025, arXiv:2512.23244. [Google Scholar]
- Zhou, Y.; Feng, L.; Lan, M.; Ke, Y.; Jiang, X.; Zhang, W. GeoMath: A benchmark for multimodal mathematical reasoning in remote sensing. 2025. [Google Scholar] [PubMed]
- Danish, M.; Munir, M.A.; Shah, S.R.A.; Kuckreja, K.; Khan, F.S.; Fraccaro, P.; Lacoste, A.; Khan, S. Geobench-vlm: Benchmarking vision-language models for geospatial tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 7132–7142. [Google Scholar]
- Luo, Z.; Wang, D.; Guo, H.; Zhang, J.; Du, B. VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing. arXiv 2026, arXiv:2602.07045. [Google Scholar]
- Wu, R.; Ma, X.; Zhang, Z.; Wang, W.; Li, Q.; Zhu, S.C.; Wang, Y. Bongard-openworld: Few-shot reasoning for free-form visual concepts in the real world. arXiv 2023, arXiv:2310.10207. [Google Scholar]
- Małkiński, M.; Pawlonka, S.; Mańdziuk, J. Reasoning limitations of multimodal large language models. a case study of bongard problems. arXiv 2024, arXiv:2411.01173. [Google Scholar]
- Pawlonka, S.; Małkiński, M.; Mańdziuk, J. Bongard-rwr+: Real-world representations of fine-grained concepts in bongard problems. arXiv 2025, arXiv:2508.12026. [Google Scholar]
- Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef]
- Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
- Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796. [Google Scholar] [CrossRef]
- Qi, X.; Zhu, P.; Wang, Y.; Zhang, L.; Peng, J.; Wu, M.; Chen, J.; Zhao, X.; Zang, N.; Mathiopoulos, P.T. MLRSNet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding. ISPRS J. Photogramm. Remote Sens. 2020, 169, 337–350. [Google Scholar] [CrossRef]
- Mou, C.; Liu, T.; Zhu, C.; Cui, X. WAID: A Large-Scale Dataset for Wildlife Detection with Drones. Appl. Sci. 2023, 13. [Google Scholar] [CrossRef]
- Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar]
- Khot, T.; Trivedi, H.; Finlayson, M.; Fu, Y.; Richardson, K.; Clark, P.; Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv 2022, arXiv:2210.02406. [Google Scholar]
- Zhang, Y.; Du, L.; Cao, D.; Fu, Q.; Liu, Y. An examination on the effectiveness of divide-and-conquer prompting in large language models. arXiv 2024, arXiv:2402.05359. [Google Scholar]
- Ji, B.; Agrawal, S.; Tang, Q.; Wu, Y. Enhancing spatial reasoning in vision-language models via chain-of-thought prompting and reinforcement learning. arXiv 2025, arXiv:2507.13362. [Google Scholar]
- Li, D.; Jiang, B.; Huang, L.; Beigi, A.; Zhao, C.; Tan, Z.; Bhattacharjee, A.; Jiang, Y.; Chen, C.; Wu, T.; et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025; pp. 2757–2791. [Google Scholar] [CrossRef]
- DeepSeek-AI. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. 2026. [Google Scholar]
- Bongard, M. Problema Uznavaniya [Pattern Recognition]; Nauka: Moscow, 1967; p. 3. [Google Scholar]
- Maximov, V.; Bongard, M. Programma, obuchayushchayasya klassifikatsii geometricheskikh izobrazheniy [Program Learning to Classify Geometric Images]. In Proceedings of the Mezhdunarodnyy simpozium IFAC po tekhnicheskim i biologicheskim problemam upravleniya, Tezisy dokladov [IFAC International Symposium on Technical and Biological Problems of Control. Abstracts of Papers], Yerevan, 1968; pp. 86–87. [Google Scholar]
- Maximov, V.; Bongard, M. Programma, obuchayushchayasya klassifikatsii geometricheskikh izobrazheniy [Program Learning to Classify Geometric Images]. In Trudy Mezhdunarodnogo simpoziyma po tekhnicheskim i biologicheskim problemam upravleniya; Tsypkin, Ya.Z., Ed.; Moscow, 1971; pp. 128–133. [Google Scholar]
- Maximov, V. Programma, obuchayushchayasya klassifikatsii geometricheskikh izobrazheniy. Yazyk i eksperimenty [Program Learning to Classify Geometric Images. Language and Experiments]. In Strukturnyye metody opoznaniya i avtomaticheskoye chteniye [Structural Methods of Recognition and Automatic Reading]; Mikhaylov, A.I., Ed.; 1970; pp. 106–126. [Google Scholar]
- Maximov, V. Programma, obuchayushchayasya klassifikatsii geometricheskikh figur [Program Learning to Classify Geometric Images]. In Proceedings of the Abstracts of the 4th Colloquium on Microwave Communication, Budapest, 1970; p. 42. [Google Scholar]
- Maximov, V. Modelirovaniye protsessa uznavaniya geometricheskikh izobrazheniy [Modeling the Process of Recognition of Geometric Images]. In Pererabotka zritel’noy informatsii i regulyatsiya dvigatel’noy deyatel’nosti: Trudy Mezhdunarodnogo simpoziyma [Processing of Visual Information and Regulation of Motor Activity: Proceedings of the International Symposium]; Gidikov, A., Ed.; Sofia, 1971; pp. 217–226. [Google Scholar]
- Maximov, V. Sistema, obuchayushchayasya klassifikatsii geometricheskikh izobrazheniy [A System Capable of Learning to Classify Geometric Images]. In Modelirovaniye obucheniya i povedeniya [Modeling of Learning and Behavior]; Smirnov, M.S., Ed.; Moscow, 1975; pp. 29–120. [Google Scholar]
- Maximov, V. Sistema, obuchayushchayasya klassifikatsii geometricheskikh izobrazheniy [A System Capable of Learning to Classify Geometric Images]. Candidate of Technical Sciences Dissertation (Specialty Code 05.13.01), Akademiya nauk SSSR, Moscow, 1975. [Google Scholar]





| Dataset | Images | Reasoning | Size |
|---|---|---|---|
| Original BPs [24] | Line drawing | Analytic | 100 |
| Bongard LOGO [22] | Line drawing | Analytic | 12 000 |
| Bongard HOI [23] | Real ground-level | Synthetic | 53 000 |
| Bongard OpenWorld [42] | Real ground-level | Synthetic | 1 010 |
| Bongard RWR [43] | Real ground-level | Analytic | 60 |
| Bongard RWR+ [44] | AI-generated ground-level | Analytic | 5 400 |
| BMRS | Real Remote Sensing | Synthetic + Analytic | 122 |

| Dataset | Concept: Shape | Concept: Semantic | Concept: Presence | Spatial | Size | Number | Same | Total |
|---|---|---|---|---|---|---|---|---|
| Original Bongard problems [21] | 31% (31), Concept class not subdivided | | | 41% (41) | 6% (6) | 15% (15) | 7% (7) | 100 |
| BMRS | 20% (25) | 30% (37) | 11% (13) | 21% (26) | 4% (5) | 10% (12) | 3% (4) | 122 |
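
The BMRS percentages in the class-distribution table follow from the raw counts given in parentheses. A minimal Python sketch, assuming each share is the class count divided by the 122 problems and rounded to a whole percent (the rounding rule and the helper name `class_shares` are illustrative assumptions, not taken from the paper):

```python
# Problem counts per class for BMRS, copied from the table above.
bmrs_counts = {
    "Shape": 25,
    "Semantic": 37,
    "Presence": 13,
    "Spatial": 26,
    "Size": 5,
    "Number": 12,
    "Same": 4,
}


def class_shares(counts):
    """Return (total, {class: whole-number percentage}) for a dict of counts.

    Hypothetical helper: assumes shares are rounded to the nearest percent.
    """
    total = sum(counts.values())
    shares = {name: round(100 * n / total) for name, n in counts.items()}
    return total, shares


total, shares = class_shares(bmrs_counts)
print(total)   # number of problems in the benchmark
print(shares)  # per-class share in percent
```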

| Model | Checkpoint / API | Params | Type |
|---|---|---|---|
| LLaVA-v1.5 7B | llava-hf/llava-1.5-7b-hf | 7B | Open |
| LLaVA-v1.5 13B | llava-hf/llava-1.5-13b-hf | 13B | Open |
| LLaVA-v1.6 7B | llava-hf/llava-v1.6-vicuna-7b-hf | 7B | Open |
| LLaVA-v1.6 34B | llava-hf/llava-v1.6-34b-hf | 34B | Open |
| InternVL-3.5 | OpenGVLab/InternVL3_5-38B-HF | 38B | Open |
| Qwen-3-VL | Qwen/Qwen3-VL-32B-Instruct | 32B | Open |
| VHM | FitzPC/vhm_7B | 7B | Open, RS |
| RS-LLaVA | BigData-KSU/RS-llava-v1.5-7b-LoRA | 7B | Open, RS |
| RS-EoT | ShaoRun/RS-EoT-7B | 7B | Open, RS |
| GeoChat | MBZUAI/geochat-7B | 7B | Open, RS |
| SkySenseGPT | ll-13/SkySenseGPT-7B-CLIP-ViT | 7B | Open, RS |
| EarthDial | akshaydudhane/EarthDial_4B_RGB | 4B | Open, RS |
| LHRS-Bot | LHRS/LHRS-Bot-Nova | 7B | Open, RS |
| TeoChat | jirvin16/TEOChat | 7B | Open, RS |
| Gemini-3.1-Pro | API | >1T | Proprietary |
| ChatGPT-5.5-Pro | API | >1T | Proprietary |

| | All | Number | Presence | Same | Semantic | Shape | Size | Spatial |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.745 | 0.823 | 0.734 | 0.577 | 0.853 | 0.801 | 0.778 | 0.518 |

| Model | Contrastive-direct | Contrastive-iterative | Descriptive-direct | Descriptive-iterative | Direct |
|---|---|---|---|---|---|
| LLaVA-v1.5 7B | 0.11 | 0.14 | 0.12 | 0.13 | |
| LLaVA-v1.5 13B | 0.09 | 0.10 | 0.16 | 0.13 | |
| LLaVA-v1.6 7B | 0.09 | 0.07 | 0.07 | ||
| LLaVA-v1.6 34B | 0.15 | 0.13 | 0.08 | 0.17 | |
| InternVL-3.5 | 0.35 | 0.26 | 0.20 | 0.36 | |
| Qwen-3-VL | 0.39 | 0.32 | 0.22 | 0.39 | |
| VHM | 0.07 | 0.14 | 0.14 | 0.08 | |
| RS-LLaVA | 0.06 | 0.02 | 0.03 | 0.03 | |
| RS-EoT | 0.08 | 0.03 | 0.07 | 0.14 | |
| GeoChat | 0.07 | 0.03 | 0.07 | 0.07 | |
| SkySenseGPT | 0.07 | 0.01 | 0.06 | 0.05 | |
| EarthDial | 0.07 | 0.05 | 0.07 | 0.07 | |
| LHRS-Bot | 0.05 | 0.07 | 0.16 | 0.11 | |
| TeoChat | 0.08 | 0.07 | 0.06 | 0.07 |

| Model (prompting strategy) | All | Number | Presence | Same | Semantic | Shape | Size | Spatial |
|---|---|---|---|---|---|---|---|---|
| Human baseline | | | | | | | | |
| Humans | 0.745 | 0.823 | 0.734 | 0.577 | 0.853 | 0.801 | 0.778 | 0.518 |
| General-Large VLMs | | | | | | | | |
| Gemini-3.1-Pro | 0.869 | 0.833 | 0.917 | 1.000 | 0.946 | 0.875 | 0.800 | 0.750 |
| ChatGPT-5.5-Pro | 0.893 | 0.833 | 0.923 | 1.000 | 0.973 | 0.840 | 0.800 | 0.846 |
| General-Mid VLMs | | | | | | | | |
| LLaVA-v1.5 7B (Descriptive-direct) | 0.148 | 0.000 | 0.154 | 0.000 | 0.324 | 0.120 | 0.200 | 0.000 |
| LLaVA-v1.5 13B (Descriptive-iterative) | 0.180 | 0.250 | 0.231 | 0.000 | 0.378 | 0.040 | 0.000 | 0.038 |
| LLaVA-v1.6 7B (Descriptive-direct) | 0.131 | 0.000 | 0.077 | 0.000 | 0.270 | 0.120 | 0.200 | 0.038 |
| LLaVA-v1.6 34B (Descriptive-direct) | 0.205 | 0.167 | 0.385 | 0.000 | 0.378 | 0.160 | 0.000 | 0.000 |
| InternVL-3.5 (Descriptive-direct) | 0.402 | 0.500 | 0.462 | 0.250 | 0.595 | 0.440 | 0.200 | 0.077 |
| Qwen-3-VL (Descriptive-direct) | 0.426 | 0.417 | 0.846 | 0.250 | 0.514 | 0.480 | 0.400 | 0.077 |
| Remote Sensing VLMs | | | | | | | | |
| VHM (Descriptive-direct) | 0.148 | 0.083 | 0.231 | 0.000 | 0.351 | 0.040 | 0.000 | 0.000 |
| RS-LLaVA (Direct) | 0.066 | 0.000 | 0.077 | 0.000 | 0.189 | 0.000 | 0.000 | 0.000 |
| RS-EoT (Contrastive-iterative) | 0.172 | 0.083 | 0.231 | 0.000 | 0.351 | 0.120 | 0.000 | 0.038 |
| GeoChat (Contrastive-direct) | 0.090 | 0.000 | 0.000 | 0.000 | 0.243 | 0.040 | 0.200 | 0.000 |
| SkySenseGPT (Contrastive-iterative) | 0.082 | 0.000 | 0.154 | 0.000 | 0.189 | 0.040 | 0.000 | 0.000 |
| EarthDial (Contrastive-iterative) | 0.131 | 0.083 | 0.154 | 0.000 | 0.324 | 0.040 | 0.000 | 0.000 |
| LHRS-Bot (Descriptive-direct) | 0.205 | 0.250 | 0.231 | 0.250 | 0.459 | 0.040 | 0.000 | 0.000 |
| TeoChat (Direct) | 0.098 | 0.000 | 0.077 | 0.000 | 0.270 | 0.000 | 0.200 | 0.000 |
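
The overall score in the per-class table can be related to the per-class scores through the class sizes. A minimal Python sketch, assuming the overall column is the fraction of all 122 problems solved, i.e., a mean of per-class accuracies weighted by the BMRS class counts from the class-distribution table (this weighting and the helper name `overall_accuracy` are assumptions used for illustration; the check below uses the Qwen-3-VL row only):

```python
# BMRS class sizes (number of problems per class), from the class-distribution table.
class_sizes = {
    "number": 12, "presence": 13, "same": 4, "semantic": 37,
    "shape": 25, "size": 5, "spatial": 26,
}

# Per-class accuracies reported for Qwen-3-VL with Descriptive-direct prompting.
qwen_per_class = {
    "number": 0.417, "presence": 0.846, "same": 0.250, "semantic": 0.514,
    "shape": 0.480, "size": 0.400, "spatial": 0.077,
}


def overall_accuracy(per_class, sizes):
    """Size-weighted mean of per-class accuracies (hypothetical helper)."""
    total = sum(sizes.values())
    solved = sum(per_class[name] * sizes[name] for name in sizes)
    return solved / total


overall = overall_accuracy(qwen_per_class, class_sizes)
print(round(overall, 3))  # compare with the reported overall value for this row
```

Under this assumption the weighted mean reproduces the reported overall value for this row up to rounding of the per-class entries. Small classes such as Same (4 problems) and Size (5 problems) admit only a few distinct accuracy values, so their columns are coarse.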
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).