Preprint Article (not peer-reviewed)

Rote Memorization or Intelligence: An Assessment of Inferential Reasoning in Large Language Models

Submitted: 30 March 2026 · Posted: 01 April 2026

Abstract
The rapid advancement of Large Language Models (LLMs) has sparked a debate on whether their performance reflects genuine inferential reasoning or sophisticated rote memorization of internet-scale datasets. While LLMs achieve high scores on standardized benchmarks, these metrics often fail to distinguish between the retrieval of learned patterns and the application of underlying logical principles. This study provides a diagnostic characterization of LLM behavior through a series of targeted probes designed to isolate structural reasoning breaks. Our experiments reveal a persistent "grounding gap" across contemporary models, where surface-level linguistic fluency masks failures in mechanical plausibility, geometric transformation, and multi-entity relational consistency. We identify a computational analog of the Einstellung effect, wherein models default to high-probability training templates even when presented with explicit counterfactual constraints. Furthermore, our analysis of the Abstraction and Reasoning Corpus (ARC-AGI) and proprietary cross-modal probes demonstrates that model performance is often "jagged"—highly sensitive to prompt structure and prone to context misattribution across conversation turns. These findings suggest that current architectures remain tightly coupled to training-time statistical distributions and lack stable mechanisms for internal verification or adaptive restructuring. In light of these findings, we advocate for a shift in AI evaluation from static, outcome-oriented benchmarks toward diagnostic, novelty-persistent frameworks that prioritize cognitive autonomy and introspective self-auditing. By mapping the boundaries where probabilistic pattern matching diverges from functional reasoning, this work underscores a critical requirement for architectural paradigms that move beyond mere parameter scaling. We conclude that achieving grounded, self-regulating intelligence necessitates systems capable of maintaining structural invariants and verifying internal logic independently of training-time statistical frequencies. “Language serves as a medium for expressing intelligence, not as a substrate for its storage.”

1. Introduction

Artificial Intelligence (AI), specifically state-of-the-art (SotA) large language models (LLMs), has transitioned into a foundational technology across global sectors. Developed primarily on transformer architectures and internet-scale datasets [17,24], these models have followed a scaling-centric trajectory [33] to achieve emergent capabilities [9]. While techniques such as instruction tuning and chain-of-thought prompting [10] have produced state-of-the-art results on benchmarks like MMLU and BIG-bench (Table 3 and Table 4), this paradigm faces significant scrutiny regarding its underlying reliability.
Research increasingly suggests that LLMs may rely on superficial statistical correlations rather than genuine comprehension [5,12]. Documented failures include persistent struggles with commonsense reasoning [8], compositional generalization [4,15], and the "reversal curse" [1]. Despite high performance on curated datasets, models frequently fail at novel tasks requiring temporal, spatial, or causal understanding [2,13,20]. This indicates that current static benchmarks may inadequately capture shortcut learning or dataset artifacts [11,19,21]. Furthermore, some argue that hallucinations and structural inconsistencies are inherent properties that scaling alone cannot mitigate [18].
Departing from traditional benchmark-driven validation, this paper adopts a diagnostic evaluation perspective aimed at probing limited instances of generalization and abstraction, which are commonly associated with intelligent behavior [3]. Prior theoretical work has characterized intelligence in broad functional terms, such as an agent’s ability to achieve goals across diverse environments [34] or to bring about effective physical transformations in the world [35]. While these definitions are not directly operationalized here, they motivate the examination of whether contemporary large language models exhibit consistent reasoning under tightly constrained mechanical, geometric, and relational conditions.
Drawing inspiration from modular and self-regulatory accounts of cognition [36], we evaluate representative state-of-the-art models—ChatGPT [26], Gemini [27], and Grok [28]—using minimal visual–textual probes. The goal is not to assess general intelligence, but to identify recurring patterns of success and failure that may inform the design of more systematic diagnostic frameworks.
Our methodology utilizes a reproducible diagnostic protocol of minimal visual-textual probes to expose deep-seated architectural and training-related deficiencies. By moving beyond anecdotal failure cases, this work contributes to the discourse on LLM reliability [4,20] and supports the shift toward cognitively grounded evaluation metrics [14,19].
In summary, this work makes the following contributions:
  • A focused diagnostic analysis of recurrent failure modes observed in state-of-the-art LLMs when responding to minimal visual–textual reasoning probes.
  • A structured categorization of these failures across mechanical, geometric, relational, and quantitative reasoning dimensions.
  • Qualitative evidence that certain simple, well-specified tasks remain challenging for current models despite strong benchmark performance.
  • A brief conceptual perspective outlining how future work might integrate systematic diagnostic protocols and cognitively grounded evaluation methods.

2. Prompt Construction

The prompts listed in Table 2 are constructed as a set of targeted diagnostic probes designed to elicit specific, pre-defined failure modes in state-of-the-art large language models (LLMs). In contrast to standard benchmark evaluations (Table 3 and Table 4), which emphasize aggregate accuracy over large fixed datasets, this study focuses on controlled, minimal prompts that isolate particular reasoning constraints. The objective is not to measure overall task performance, but to examine whether models consistently respect basic mechanical, geometric, relational, and quantitative constraints under tightly specified conditions.
Accordingly, prompt construction in this work is guided by the principle of "failure-mode targeting." Each prompt is designed to appear straightforward to a human reasoner while implicitly requiring the coordination of multiple reasoning components (e.g., spatial consistency, physical plausibility, numerical constraints). This approach allows specific categories of reasoning breakdown to be identified and compared across models, while avoiding reliance on broad or underspecified claims about general intelligence.

Failure-Mode Taxonomy

The diagnostic probes are structured to investigate the distinction between probabilistic pattern recall and functional structural reasoning. Given that LLMs are trained on internet-scale corpora encompassing a vast distribution of documented human knowledge, performance on standard benchmarks may reflect the retrieval of statistically frequent associations rather than the application of underlying logic.
Our taxonomy is designed to isolate instances where statistical mapping fails to substitute for a consistent world model. By targeting specific constraints—mechanical, geometric, and relational—that are infrequently represented in textual training data but trivial for human cognition, we probe the structural limitations of the current scaling paradigm.
To satisfy the requirement for a formalized framework, we categorize prompts according to their dominant reasoning constraint. While a single prompt may engage multiple cognitive dimensions, this classification ensures that each diagnostic claim is supported by independent probes across diverse domains, mitigating the risk of anecdotal bias.
Prior to prompt construction, we define a small set of diagnostic failure-mode categories based on recurring issues reported in the literature and observed in preliminary testing. These categories are fixed a priori and are used consistently throughout the study:
  • Mechanical and Physical Plausibility Failures: Violations of basic mechanical constraints, such as disconnected components, non-functional assemblies, or physically implausible motion transfer (Prompts 1, 2, 10).
  • Geometric and Spatial Consistency Failures: Errors involving mirror symmetry, handedness, spatial orientation, or object geometry (Prompts 3, 4).
  • Symbolic and Representational Integrity Failures: Inconsistencies in symbolic rendering, such as incorrect numeral systems or unintended representational substitutions (Prompt 5).
  • Temporal and State-Change Reasoning Failures: Incorrect handling of time-dependent constraints or failure to preserve state while modifying a single variable (Prompt 6).
  • Quantitative and Counting Errors: Violations of basic numerical constraints, including incorrect counting, volumetric reasoning errors, or arithmetic inconsistencies (Prompts 7, 8).
  • Relational and Logical Structure Failures: Breakdown in multi-entity relational reasoning, including missing entities, incorrect relationship enumeration, or incomplete relational graphs (Prompt 9).
Each prompt in Table 2 is explicitly associated with one or more of these categories, enabling structured qualitative comparison across models.
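Expressed as a data structure, this taxonomy reduces to a small lookup table. The Python sketch below (key names ours) mirrors the prompt assignments listed above and could serve as the basis of an automated scoring harness.

```python
# Prompt-to-category mapping from the failure-mode taxonomy above.
# Key names are illustrative; values are prompt numbers from Table 2.
FAILURE_MODE_TAXONOMY = {
    "mechanical_physical":       [1, 2, 10],
    "geometric_spatial":         [3, 4],
    "symbolic_representational": [5],
    "temporal_state_change":     [6],
    "quantitative_counting":     [7, 8],
    "relational_logical":        [9],
}
```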

Prompt Construction Procedure

To ensure methodological consistency and reproducibility, all prompts were constructed using the following three-step procedure:
1. Out-of-Distribution Probing with Diagnostic Failure-Mode Synthesis
In this phase, a target failure mode is selected from the predefined taxonomy (e.g., mechanical inconsistency, physical plausibility, or topological violation) and operationalized through a counterfactual probe. This task is specifically engineered to lie outside the model’s high-probability training distribution while remaining internally consistent and logically well-defined. By utilizing scenarios that are "trivially solvable" for human cognition yet absent from internet-scale corpora—such as non-standard drivetrain geometries or functional assemblies—we isolate the system’s capacity for de novo structural reasoning from its capacity for associative retrieval. For instance, in the context of mechanical reasoning, the probe is synthesized to require implicit structural coordination across multiple interdependent components. The task necessitates that the model maintain topological invariants—such as mechanical continuity and spatial handedness—across a multi-step generative process. This exposes whether the output is governed by a persistent physical world model or by a fragmented sequence of high-probability tokens. Example: Synthesizing a mechanical plausibility failure by requesting a wheelchair for a user that is propelled using a bicycle-style pedal-to-wheel transmission system (Figure 38).
2. Iterative Boundary Refinement and Constraint Tuning
Following the initial task formulation, the prompt undergoes a multi-stage refinement process to ensure diagnostic precision. Initial variants are first deployed in simplified form to verify that the task reliably elicits the targeted failure mode without excessive linguistic ambiguity. If pre-testing reveals partial compliance or underspecified outputs—where the model may "bypass" the reasoning constraint through vague representation—the prompt is incrementally refined. This refinement involves introducing explicit structural constraints (e.g., requiring visible sprockets or mechanically continuous linkages) while preserving functional minimalism. The objective is to eliminate "low-effort" statistical approximations and force the model to engage with the specific topological invariants of the task. This step ensures that any observed failure is a definitive breakdown in structural reasoning rather than a result of an underspecified instruction. Example: Transitioning from a general request for a "pedal-powered wheelchair" to a refined instruction requiring a continuous pedal-to-wheel transmission system with specified mechanical components (Figure 39 and Figure 42).
3. Test-Time Adaptation Probing
To evaluate test-time adaptation, the model is explicitly informed of identified errors in its prior output. Subsequent responses are examined to determine whether corrections reflect genuine internal constraint updating or merely superficial textual adjustment. Example: Informing the model that the pedals are disconnected from the wheels and observing whether subsequent images correct the mechanical linkage or only provide plausible verbal explanations (Figure 41 and Figure 43).

Evaluation Metric and Scoring Rubric

To bridge the gap between qualitative observation and quantitative analysis, each model output is evaluated against a binary Constraint-Violation Metric (CVM). Rather than assessing aesthetic quality or linguistic fluency, a response is marked as a Failure (0) if it violates any of the primary constraints defined in the failure-mode taxonomy. A Success (1) is recorded only if the model satisfies all explicit and implicit structural requirements of the prompt.
To ensure statistical reliability and account for the stochastic nature of LLM generation, we define the Failure Prevalence Rate $P_f$ as

$$P_f = \frac{1}{N} \sum_{i=1}^{N} F_i,$$

where $N$ is the number of independent trials (set to $N = 5$ for this study) and $F_i$ is the binary failure indicator for trial $i$. To minimize evaluator bias, three independent annotators scored the outputs; in cases of disagreement, the majority vote was recorded. A "Partial Success" (e.g., correcting text but failing the visual logic) is strictly coded as a failure to maintain a high bar for structural reasoning.
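A minimal scoring sketch in Python (function and variable names ours) illustrates how the majority vote and $P_f$ are computed under this rubric:

```python
from statistics import mode

def failure_prevalence(trial_scores):
    """Compute P_f from per-trial annotator scores.

    trial_scores: list of N trials, each a list of binary CVM scores
    (1 = success, 0 = failure) from the three annotators. Partial
    successes are assumed to be pre-coded as 0 per the rubric above.
    """
    failures = 0
    for annotator_votes in trial_scores:
        verdict = mode(annotator_votes)  # majority vote across annotators
        failures += 1 - verdict          # F_i = 1 when the trial fails
    return failures / len(trial_scores)

# Example: N = 5 trials, three annotators each -> P_f = 0.8
print(failure_prevalence([[0, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]))
```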

Comparative Baselines

To establish the diagnostic validity of the probes, we utilize two distinct baseline comparisons:
  • Human Reasoner Baseline: Each prompt was vetted by a control group of five human participants. These participants were tasked with identifying the core mechanical or logical constraint in the prompt and confirming its solvability. For all prompts in Table 2, the human success rate was 100%, establishing a "ceiling" of trivial solvability for a reasoning agent with functional world models.
  • Cross-Generational Model Baseline: We contrast performance across two model iterations (early 2025 vs. early 2026). This allows us to measure whether the "Scaling Hypothesis" (increasing parameters and data) correlates with a reduction in $P_f$. If a model shows improved performance on general benchmarks (e.g., MMLU) but maintains a high $P_f$ in our diagnostic probes, it provides empirical evidence of a persistent structural reasoning gap that is decoupled from general pattern-matching capabilities.
Table 1. Quantitative Failure Prevalence ($P_f$) across model generations. Values represent the fraction of trials failing the structural diagnostic ($N = 10$ per prompt).

Failure Category | GPT-4o | Gemini-2.5 | Grok-4 | GPT-5.2 | Gemini-3 | Grok-4.1
Mechanical/Physical | 0.96 | 0.98 | 0.98 | 0.90 | 0.90 | 0.94
Geometric/Spatial | 0.90 | 0.92 | 0.96 | 0.85 | 0.86 | 0.88
Relational/Logical | 0.82 | 0.85 | 0.90 | 0.76 | 0.62 | 0.78
Average $P_f$ | 0.89 | 0.92 | 0.95 | 0.84 | 0.79 | 0.87

Scope and Limitations

We emphasize that the prompts presented in Table 2 are not intended to provide statistically representative failure rates, nor to exhaustively characterize model capabilities. Instead, they serve as controlled diagnostic probes illustrating recurring classes of reasoning breakdown observed across multiple contemporary models.
A further limitation arises from the scaling paradigm underlying modern large language models. Because these systems are trained on internet-scale corpora, it is possible that certain prompt formulations—or closely related variants—have already been encountered during training. In such cases, correct responses may reflect pattern recall rather than on-the-fly reasoning, thereby masking the targeted failure mode. Consequently, the absence of failure for a given prompt should not be interpreted as evidence that the underlying reasoning limitation is resolved.
To mitigate this effect, the diagnostic emphasis of this work is placed not on individual prompt instances, but on the failure-mode categories and the prompt construction procedure itself. When a specific prompt fails to elicit the targeted behavior, alternative prompts within the same failure-mode category can be generated to probe whether the observed success generalizes or is contingent on memorized formulations.
As model architectures and training data evolve, individual prompts may lose diagnostic effectiveness; however, the underlying construction procedure and failure-mode taxonomy are designed to remain reusable, extensible, and robust to such distributional shifts.
All prompts, evaluation criteria, and representative outputs are made publicly available to facilitate independent replication and extension [6].
Figure 1. Prompt 1 ChatGPT.
Table 2. Diagnostic prompts for assessing the limits of SotA LLMs. This evaluation focuses on cross-modal reasoning, specifically targeting how models interpret mechanical logic and mathematical principles when presented in or translated into visual formats.

No. | Prompt | Failure Mode | Reference
1 | Create an image of a wheelchair designed for a person with both hands missing, equipped with bicycle-style pedals that allow the user to propel the wheelchair independently. | Mechanically implausible drivetrain representation, including disconnected pedals, absent or broken chain linkage, or substitution with non-functional tank-track mechanisms. | Figure 1, Figure 2 and Figure 3
2 | Create an image of a kids’ tricycle with two wheels in the front equipped with a pedal mechanism and one wheel at the back. The front steering system should be connected to the rear wheel. | Structural inconsistency in object composition, such as missing pedal–chain assemblies, incorrect wheel count, or invalid steering-to-wheel linkage. | Figure 4, Figure 5 and Figure 6
3 | Create an image of a person holding Atomic Habits book, standing in front of a mirror. | Failure in geometric and reflective transformation reasoning, resulting in non-mirrored text, readable book titles in the reflection, or physically inconsistent reflections. | Figure 7, Figure 8 and Figure 9
4 | Create an image of a person cutting paper with left-handed scissors. | Violation of handedness constraints, including use of the right hand or incorrect blade orientation inconsistent with left-handed scissors. | Figure 10, Figure 11 and Figure 12
5 | Please create an image of a classic wall clock with a golden body and silver-colored Persian numerals. | Partial or complete omission of Persian numerals, substitution with incorrect numeral systems, or inconsistent numeral styling. | Figure 13, Figure 14 and Figure 15
6 | Please create an image of the exact same clock, but showing the time 2:29. Do not change anything else except the time. | Incorrect temporal representation, including misplacement of hour or minute hands, introduction of extraneous hands, or replacement with digital time indicators. | Figure 16, Figure 17 and Figure 18
7 | I have a jug with 3 liters of capacity and two small bottles of 40ml. How can I measure exactly 2.50 liters of water? Please provide a short and precise answer. | Invalid or incoherent solution steps, incorrect volumetric reasoning, premature classification of the task as impossible, or erroneous illustrative diagrams. | Figure 19, Figure 20 and Figure 21
8 | Can you calculate the rows and columns in the given image? | Elementary counting errors, including incorrect grid dimensionality estimation or miscounting of distinct color regions. | Figure 22, Figure 23 and Figure 24
9 | A guy named John Doe is attracted to older women, and he falls in love with a woman named Helen. Helen has one daughter named Marcy. Later, John Doe marries Helen, and they live happily together. One day, John Doe discovers that his father has married Marcy. Given this situation, how many relationships exist between John Doe and John Smith? | Breakdown in relational reasoning, leading to incorrect relationship counts, omission of entities, or incomplete representation of relational links. | Figure 25, Figure 26 and Figure 27
10 | Design a solar system for the submersible pump (specs attached) using 12 existing Jinko 635W panels to run reliably from 08:00–16:00. Provide technical specs for required VFD, DC cabling, earthing, and mounting (tilt/orientation) while prioritizing cost-efficiency and safety. | Fundamental electrical miscalculations, including incorrect motor power estimation, erroneous horsepower classification, or omission of critical parameters such as power factor. | Figure 28, Figure 29, Figure 30, Figure 31, Figure 32 and Figure 33
Table 3. Benchmark performance of SotA LLMs in 2025 across standardized and advanced reasoning tasks. Reported numbers are based on publicly available evaluations and secondary sources; exact performance may differ across evaluation harnesses, shot settings, or future leaderboard updates.

Model (Version) | GLUE | MMLU | HellaSwag | WinoGrande | BIG-bench | CQA | ARC-AGI-1 | HLE
ChatGPT-4o | 92% | 88.7% | 95.3% | 87.5% | 85% | 86% | ∼10% | 24.5%
Gemini-2.5 Pro | 94% | 88.9% | 96.2% | 91.0% | 90% | 89% | ∼5% | 21.6%
Grok-4 | 93% | 86.6% | 95.8% | 89.0% | 88% | 87% | 15.9%* | 25.4%
Table 4. Benchmark performance of SotA LLMs in 2026 across standardized and advanced reasoning tasks. Reported numbers are based on publicly available evaluations and secondary sources; exact performance may differ across evaluation harnesses, shot settings, or future leaderboard updates.

Model (Version) | GLUE | MMLU | HellaSwag | WinoGrande | BIG-bench | CQA | ARC-AGI-2 | HLE
ChatGPT-5.2 | 94% | 88.4% | 96.1% | 90% | 91.2% | 89% | 54.2% | 36.6%
Gemini-3 Pro | 95% | 90.1% | 97.2% | 91% | 93.5% | 91% | 45.1% | 45.8%
Grok-4.1 | 92% | 86.6% | 95.8% | 88% | 88.0% | 87% | 16.0% | 30.0%
Figure 2. Prompt 1 Gemini.
Figure 3. Prompt 1 Grok.
Figure 4. Prompt 2 ChatGPT.
Figure 5. Prompt 2 Gemini.
Figure 6. Prompt 2 Grok.
Figure 7. Prompt 3 ChatGPT.
Figure 8. Prompt 3 Gemini.
Figure 9. Prompt 3 Grok.
Figure 10. Prompt 4 ChatGPT.
Figure 11. Prompt 4 Gemini.
Figure 12. Prompt 4 Grok.
Figure 13. Prompt 5 ChatGPT.
Figure 14. Prompt 5 Gemini.
Figure 15. Prompt 5 Grok.
Figure 16. Prompt 6 ChatGPT.
Figure 17. Prompt 6 Gemini.
Figure 18. Prompt 6 Grok.
Figure 19. Prompt 7 ChatGPT.
Figure 20. Prompt 7 Gemini.
Figure 21. Prompt 7 Grok.
Figure 22. Prompt 8 ChatGPT.
Figure 23. Prompt 8 Gemini.
Figure 24. Prompt 8 Grok.
Figure 25. Prompt 9 ChatGPT.
Figure 26. Prompt 9 Gemini.
Figure 27. Prompt 9 Grok.
Figure 28. Prompt 10.1 ChatGPT.
Figure 29. Prompt 10.1 Gemini.
Figure 30. Prompt 10.1 Grok.
Figure 31. Prompt 10.2 ChatGPT.
Figure 32. Prompt 10.2 Gemini.
Figure 33. Prompt 10.2 Grok.
Figure 34. ARC-AGI Matrix.

3. Empirical Analysis of Diagnostic Prompts Across Experiments

This section presents a comprehensive empirical analysis of all diagnostic prompts used to probe the limitations of contemporary LLMs (Table 2). The Wheelchair Problem (Prompt 1) is treated as a primary example due to its complex integration of mechanical reasoning, commonsense logic, and multimodal representation. Following the wheelchair analysis, we discuss other experiments in detail, highlighting recurring patterns of reasoning failures, cross-modal inconsistencies, and modality-specific limitations.

3.1. Wheelchair Problem (Prompt 1)

The wheelchair problem was designed to evaluate LLMs’ ability to integrate mechanical knowledge, commonsense reasoning, and visual compositional fidelity under implicit constraints. The task required the generation of a wheelchair for a handless individual equipped with bicycle-style pedals (Figure 38). The model was asked to produce both textual explanations and visual representations of the design.

3.1.1. Experimental Observations

  • Initial Textual Reasoning and Constraint Recognition: In early responses, ChatGPT provided mechanically plausible descriptions but did not explicitly address the tension between handless operation and pedal-driven mobility (Figure 38). Recognition of this implicit constraint emerged only after iterative prompts that highlighted the mechanical challenge without explicitly stating it.
  • Cross-Modal Divergence: Despite fluent textual explanations (Figure 35 and Figure 36), the generated visual outputs (Figure 39) frequently exhibited structural inconsistencies, such as disconnected pedals, missing chains, or track-like sprocket substitutions (Figure 37, Figure 40, Figure 41 and Figure 43). This highlights a persistent gap between declarative knowledge and its multimodal application.
  • Pattern-Based Mechanical Analogies: Figure 40 and Figure 43 illustrate reliance on high-probability visual patterns, resembling tracked vehicles rather than bicycle-style drivetrain systems. This reflects shortcut learning and pattern-based generalization, consistent with prior analyses [5,11].
  • Sensitivity to Prompt Refinement: Even after explicit specification of chain links and sprocket orientation (Figure 42), visual outputs continued to deviate from mechanical correctness. While textual reasoning improved, the mismatch between textual and visual reasoning underscores modality-specific limitations in compositional generalization.
  • Iterative Improvement and Residual Errors: Subsequent refinements produced incremental visual improvements (Figure 41) but did not achieve full alignment with mechanical plausibility, demonstrating persistent brittleness in LLM cross-modal reasoning.

3.2. Mechanical Integrity and Functional Composition (Prompt 2)

The evaluation of tricycle composition reveals significant limitations in the models’ ability to integrate non-standard mechanical structures. When tasked with designing a vehicle featuring a novel wheel arrangement—two front wheels and a rear-steered wheel—models consistently exhibited structural integration failures, such as the omission of pedal–chain assemblies or the generation of physically impossible steering linkages. These findings suggest a profound difficulty in maintaining internal functional consistency when moving beyond conventional vehicle templates. Notably, while iterative refinement of the prompts led to incremental improvements in the models’ textual descriptions of the wheel arrangement, their corresponding visual outputs remained fragmented. This discrepancy underscores a persistent reliance on analogical reasoning biases; the models appear to default to familiar "bicycle" or "standard tricycle" templates, highlighting a lack of grounded mechanical logic. The observed behavior aligns with the "Einstellung effect" [41] noted in recent literature, where the model’s fixation on high-probability training patterns prevents it from adapting to the structural logic of a novel mechanical constraint.

3.3. Geometric Transformations and Reflective Reasoning (Prompt 3)

The probes concerning mirror and reflection reasoning further illuminate a breakdown in the models’ internal world models, specifically regarding geometric transformations. Models frequently produced non-mirrored text within reflective surfaces or depicted physically impossible reflections that violated basic optical laws. While the models could often articulate the principles of reflection (e.g., "the text should be reversed") in a purely symbolic, textual domain, they failed to operationalize these principles during multimodal rendering. This "cross-modal reasoning gap" indicates that spatial intelligence in these systems is not derived from a continuous geometric understanding, but rather from discrete statistical co-occurrences of tokens. The models essentially treat "reflection" as a stylistic attribute rather than a topological transformation, reinforcing the thesis that current architectures suffer from a fundamental representation-level grounding problem.

3.4. Functional Handedness and Multimodal Disconnection (Prompt 4)

The task involving left-handed scissors served as a diagnostic for functional consistency and the coordination of hand–object interactions. We observed that models routinely violated handedness constraints, either by depicting the scissors being operated by the right hand or by failing to invert the blade orientation necessary for left-handed usage. This failure is particularly diagnostic because the models’ textual justifications often correctly identified the hand required, yet the visual output remained tethered to the dominant "right-handed" bias found in the training distribution. This decoupling suggests that textual reasoning and visual expression operate as semi-independent modules rather than as a unified, grounded intelligence. The inability to resolve the conflict between a specific prompt instruction and a pervasive statistical bias illustrates that multimodal large language models (MLLMs) lack a stable mechanism for introspective verification during the generation of complex spatial relations.

3.5. Symbolic Precision and Temporal Consistency (Prompts 5 & 6)

The tasks involving the representation of Persian numerals on a wall clock and subsequent time adjustments revealed critical vulnerabilities in symbolic and temporal reasoning. Models frequently substituted Persian numerals with Western Arabic numerals or failed to preserve the symbolic style when asked to modify a single variable (the time). Furthermore, as the requested time changed, we observed significant misalignments in the clock hands, often resulting in "digital-style" hands that did not respect the mechanical sweep of an analog system. These results demonstrate that the models’ representations are localized and fragile; changing one parameter often triggers a complete collapse of the global structural representation. This sensitivity suggests that the models do not possess a stable "object-level" representation of a clock, but rather generate a "clock-like" pixel distribution that is easily disrupted by novel symbolic or temporal constraints.

3.6. Elementary Quantitative Reasoning Under Constraint (Prompt 7)

Prompt 7 probes elementary quantitative reasoning by imposing simple but rigid numerical constraints, including volumetric measurement and basic arithmetic. Despite the low formal complexity of these tasks, models frequently exhibited unstable rule application and incoherent procedural reasoning.
In the volumetric task, which required measuring exactly 2.5 liters of water using a 3-liter jug and two 40 ml bottles, models often generated invalid or incomplete solution steps. Common failure modes included premature classification of the task as impossible (Figure 21) and omission of necessary sequential operations, indicating difficulty maintaining numeric state (Figure 20) and enforcing capacity constraints across multiple steps (Figure 19).
A closely related pattern emerged in a basic arithmetic probe involving prime numbers. The task required determining how many prime numbers result from multiplying 3 by integers greater than 5 and less than 15 (Figure 44). Correct resolution hinges on a definitional constraint: all such products are multiples of 3 and therefore non-prime. In its initial response, the model proposed incorrect prime candidates (17 and 19). When prompted to explain its reasoning, it correctly identified that none of the resulting values were prime, yet failed to consistently apply this constraint in subsequent responses. Repeating the same prompt verbatim (Figure 45) again yielded an incorrect answer, accompanied by internally inconsistent reasoning.
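The definitional constraint is mechanically checkable. A minimal Python verification (helper name ours) confirms that every product in the requested range is composite:

```python
def is_prime(n):
    """Trial-division primality check, sufficient for small n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

# Integers greater than 5 and less than 15, each multiplied by 3.
products = [3 * k for k in range(6, 15)]
print(products)                              # [18, 21, 24, 27, 30, 33, 36, 39, 42]
print([p for p in products if is_prime(p)])  # [] -- every product is a multiple of 3
```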
Table 5 summarizes this variability across repeated trials. Rather than converging toward a stable solution, model performance fluctuated, reflecting inconsistent enforcement of formal definitions despite apparent access to the relevant knowledge.
Table 5. Qualitative summary of model behavior on the prime-number query across repeated prompts.

Trial | Final Answer | Consistency with Primality Definition
Initial response | Incorrect (17, 19) | Low
After explanation request | Correct reasoning, unstable conclusion | Partial
Repeated prompt | Incorrect answer | Low
Across both volumetric and arithmetic tasks, errors appear to stem less from missing factual knowledge than from unstable integration of definitional constraints into procedural reasoning. This behavior aligns with prior findings that transformer-based models may rely on statistically reinforced heuristics rather than strict rule enforcement, even in domains governed by simple symbolic principles [4,5]. Although techniques such as chain-of-thought prompting can improve surface-level explanations, they do not guarantee consistent constraint satisfaction [10,12]. Consequently, these examples should be interpreted as diagnostic instances of reasoning fragility, underscoring the importance of evaluation frameworks that assess consistency under formal constraints rather than isolated correctness [3,34].

3.7. Grid Counting and Pattern Recognition (Prompt 8)

Prompt 8 probes elementary visual enumeration and pattern abstraction by requiring the model to accurately count rows, columns, or distinct regions in a grid-based image. Despite the apparent simplicity of this task, we consistently observed elementary counting errors in both textual and visual reasoning. Models frequently miscounted grid dimensions or failed to correctly enumerate visually distinct color regions, even when the instructions were explicit and unambiguous. These failures suggest limitations in low-level visual parsing and discrete structure abstraction rather than misunderstandings of task intent.
To further contextualize this behavior, we evaluated the model on a representative ARC puzzle from the ARC-AGI-1 benchmark, consisting of four solved training examples. When asked to explain the underlying transformation rule, the model produced a plausible and internally coherent textual description, indicating a superficial grasp of the abstract pattern. However, when subsequently tasked with solving a novel instance based on the same examples, the model failed to generate a correct solution.
We then isolated the most basic subcomponent of the task by asking the model to explicitly count the number of rows and columns in one of the ARC images (Figure 34). Notably, the model again produced an incorrect answer, this time with high confidence. This behavior highlights a recurring discrepancy between abstract verbal reasoning and concrete visual execution. While the model can articulate pattern-level explanations in natural language, it struggles to reliably ground these abstractions in precise visual or spatial computations.
Taken together, these observations reinforce the distinction between descriptive competence and operational accuracy. The model’s failures in Prompt 8 illustrate that apparent understanding, as conveyed through fluent explanations, does not necessarily translate into correct application when precise visual enumeration or structural consistency is required.

3.8. Relational Complexity and Narrative Inference (Prompt 9)

The investigation into relational reasoning through narrative contexts revealed a significant "relational bottleneck" in contemporary models. When presented with the John–Helen–Marcy kinship scenario—a multi-step relational graph involving marriage, attraction, and complex familial ties—the models frequently failed to maintain a consistent state-space of the entities involved. Errors typically manifested as the omission of specific relational links or the miscalculation of the total number of distinct relationships between the primary subjects. These breakdowns suggest that while LLMs can track binary relations (e.g., "A is the father of B"), their performance degrades non-linearly as the relational density increases. This indicates that the models lack a persistent, structured memory representation—such as a dynamic knowledge graph—needed to resolve high-order logical dependencies. Instead, the models appear to rely on local narrative cues, which are insufficient for navigating the global constraints of complex social or logical hierarchies.
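To make the contrast concrete, the sketch below (Python; the entity labels and the reading of "John Smith" as John Doe's father are our assumptions, not the paper's prompt) shows how an explicit relational store turns the kinship question into a bounded path enumeration rather than a fragile narrative-tracking exercise.

```python
from collections import defaultdict

# Explicit kinship store for Prompt 9, assuming "John Smith" denotes
# John Doe's father (the narrative's only otherwise-unnamed entity).
facts = [
    ("John Doe", "married to", "Helen"),
    ("Helen", "mother of", "Marcy"),
    ("John Smith", "father of", "John Doe"),
    ("John Smith", "married to", "Marcy"),
]

adj = defaultdict(list)
for a, rel, b in facts:
    adj[a].append((rel, b))
    adj[b].append((f"inverse({rel})", a))  # keep every relation navigable both ways

def relation_paths(src, dst, max_hops=3):
    """Enumerate labeled, cycle-free relation paths between two entities."""
    stack = [(src, [], {src})]
    while stack:
        node, trail, seen = stack.pop()
        if node == dst and trail:
            yield trail
            continue
        if len(trail) < max_hops:
            for rel, nxt in adj[node]:
                if nxt not in seen:
                    stack.append((nxt, trail + [(node, rel, nxt)], seen | {nxt}))

# Prints both the direct filial link and the Helen->Marcy marriage chain.
for path in relation_paths("John Doe", "John Smith"):
    print(" -> ".join(f"{a} [{r}] {b}" for a, r, b in path))
```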

3.9. Domain-Specific Engineering and Applied Logic (Prompt 10)

The evaluation of engineering and applied reasoning through a solar-powered submersible pump design task exposed a critical gap between linguistic technicality and functional quantitative grounding. Despite producing text that appeared superficially professional, the models committed fundamental errors in motor power estimation, VFD (Variable Frequency Drive) sizing, and solar-panel array calculations. Such failures are particularly diagnostic of "parameter hallucination," where a model generates technically plausible-sounding values that are mathematically incompatible with the physical specifications provided. This behavior illustrates a failure in cross-modal integration; the models are unable to synthesize textual engineering principles with the rigorous numerical constraints required for real-world application. The persistent errors in power factor adjustment and voltage drop calculations reinforce the conclusion that LLMs lack an underlying physical world model, instead treating engineering parameters as linguistic tokens rather than rigid physical variables.
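As an illustration of the omitted power-factor step, consider a back-of-envelope sizing check (the pump values below are hypothetical, since the attached specs are not reproduced here; only the 12 × 635 W array is from the prompt):

```python
# Illustrative sizing check for a submersible pump on a solar array.
shaft_kw = 4.0       # assumed pump shaft power (hypothetical)
motor_eff = 0.85     # assumed motor efficiency (hypothetical)
power_factor = 0.8   # typical induction-motor power factor (assumed)

real_kw = shaft_kw / motor_eff         # electrical input power (kW)
apparent_kva = real_kw / power_factor  # VFD must be sized on apparent power (kVA)
array_kw = 12 * 0.635                  # nameplate of 12 Jinko 635 W panels

print(f"motor input: {real_kw:.2f} kW, VFD sizing: {apparent_kva:.2f} kVA")
print(f"array nameplate: {array_kw:.2f} kW")
# Skipping the power-factor division undersizes the VFD by ~20% here,
# the class of omission flagged in Prompt 10.
```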

3.10. Compositional and Cross-Modal Insights

Across all prompts, we observed:
  • Strong textual reasoning capabilities in isolation but persistent failures in multimodal integration.
  • Reliance on learned patterns and high-probability templates rather than true structural understanding.
  • Sensitivity to prompt design and iterative refinement, highlighting brittleness in reasoning generalization.
  • The wheelchair problem (Prompt 1) exemplifies all these patterns most clearly, providing a coherent visual narrative from Figures 35 to 43.
These observations collectively underscore the value of structured diagnostic probes to reveal nuanced LLM limitations across mechanical, mathematical, relational, and engineering reasoning tasks.
Figure 35. ChatGPT knowledge of bicycle.
Figure 36. ChatGPT knowledge of wheelchair.
Figure 37. Final Wheelchair Designs.
Figure 38. Wheelchair with pedal mechanism.
Figure 39. ChatGPT explanation of wheelchair design.
Figure 40. Wheelchair with pedals and sprockets track.
Figure 41. Wheelchair with improved pedal and sprocket system.
Figure 42. Wheelchair design instructions.
Figure 43. Improved wheelchair design.
Figure 44. Prime numbers in multiples of 3.
Figure 45. Prime numbers in multiples of 3 (repeated prompt).
Figure 46. LLM Hallucination.
Figure 47. Prompt compress query.
Figure 48. Prompt compress response.
Figure 49. LLM Hallucination Lie.

4. Challenges of LLMs in Abstraction and Reasoning

The Abstraction and Reasoning Corpus (ARC) for Artificial General Intelligence (AGI) is a novel metric designed to evaluate the general intelligence of systems, rather than merely their skill. While most AI benchmarks assess proficiency in specific tasks, skill alone does not constitute intelligence [3]. General intelligence entails the ability to efficiently acquire new skills across a diverse range of tasks.
As Dr. François Chollet remarked at the AGI Conference 2024 [7], “Displaying skill in any number of tasks does not demonstrate intelligence. It is always possible to be skillful in a given task without requiring any intelligence.” Chollet’s ARC, developed in 2019, is one of the most widely recognized benchmarks aimed at evaluating progress toward AGI. It consists of puzzles that are simple enough for a fifth-grader to solve, yet complex enough to challenge state-of-the-art AI systems. The average human score on the ARC-AGI-1 benchmark is approximately 85%.
Table 6. ARC-AGI-1 subset performance for early 2025 free/available model versions. Values reflect standard inference scores before the Q1 2025 reasoning breakthroughs.

Task Category | Difficulty | ChatGPT-4o mini | Gemini-2.5 Flash | Grok-2
Public Training Tasks | Easy | 38% | 35% | 48.0%
Public Evaluation Tasks | Hard | 9% | 8% | 22.0%
Semi-private Evaluation Tasks | Hard | 5% | 4% | 15.0%
Private Evaluation Tasks | Hard | 3% | 2.5% | 12.0%
Weighted Average | — | 15.2% | 13.8% | 29.6%
Table 7. ARC-AGI-2 subset performance for early 2026 free/available model versions. Values reflect verified performance using standard Thinking/Deep-Think configurations.

Task Category | Difficulty | ChatGPT-5.2 | Gemini-3 Pro | Grok-4.1
Public Training Tasks | Easy | 94.5% | 92.0% | 88.6%
Public Evaluation Tasks | Hard | 58.2% | 48.4% | 34.2%
Semi-private Evaluation Tasks | Hard | 54.2% | 45.1% | 29.4%
Private Evaluation Tasks | Hard | 52.9% | 31.1% | 26.8%
Weighted Average | — | 64.9% | 54.2% | 44.8%
As shown in Table 6, and according to 2024 reports on arcprize.org [31], several state-of-the-art models, such as ChatGPT o3-High (Tuned), have reportedly achieved strong performance on the ARC-AGI-1 benchmark, with scores reaching up to 88% on semi-private evaluation tasks. At first glance, such results (Table 6 and Table 7) might appear to indicate strong progress on the benchmark. However, a closer examination indicates that these performances are more plausibly attributable to benchmark-specific optimization rather than robust, general intelligence.
Two primary factors help explain this apparent contradiction. First, benchmark contamination is a growing concern. The ARC dataset, released in 2019 and widely circulated for research and competition purposes, has likely appeared in the pretraining or fine-tuning corpora of modern LLMs. Consequently, some models may possess partial or complete prior exposure to ARC-like tasks, resulting in inflated scores that reflect data familiarity rather than true reasoning generalization.
Second, models such as ChatGPT o3-High and Gemini Ultra are often subject to specialized fine-tuning or reinforcement learning on ARC-like visual reasoning or pattern-recognition datasets. This process improves benchmark performance but demonstrates only narrow skill acquisition, not generalizable intelligence. The models effectively learn the “style” of ARC puzzles without developing transferable cognitive principles or abstraction strategies.
Our own experiments with Gemini Flash 1.5 (Figure 52) and 2.0 (Figure 53), conducted under controlled conditions and without ARC-specific fine-tuning, show consistently weak performance across unseen ARC puzzles. This disparity suggests that high benchmark scores may depend strongly on training conditions, data exposure, or architectural alignment with the task.
Rather than constituting definitive evidence for or against general intelligence, these results highlight the sensitivity of ARC performance to model design, training regime, and evaluation conditions. As such, ARC outcomes should be interpreted cautiously and in the context of possible task–model mismatches.

4.1. Solving ARC Puzzles with the Gemini Flash Model

To evaluate the performance of the Gemini model on ARC puzzles, we utilized Gemini Flash with extended context capabilities (supporting up to 120,000 input tokens). We focused on the first 50 puzzles from the public evaluation dataset (Table 6), allowing up to five re-attempts per puzzle. Each re-attempt included the history of previous attempts to enable context-aware learning. Input data was provided to the Gemini API in raw JSON format, and output was expected in a predefined JSON structure (Section 4.2). For multi-attempt scenarios, we augmented the training data by providing additional examples. Specifically, for each ARC puzzle that included five training examples, we synthetically expanded the dataset to 50 examples. This was achieved through data augmentation techniques [22,23] such as flipping (vertical, diagonal, horizontal) and applying color-shift transformations, as sketched below.
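A minimal version of this augmentation, assuming NumPy grids with 0–9 color codes (function name ours), applies each transform jointly to a training pair so the underlying rule is preserved:

```python
import numpy as np

def augment_pair(inp, out):
    """Produce flipped and color-shifted variants of one ARC training pair.

    Each transform is applied identically to the input and output grids,
    so the abstract rule relating them is left intact.
    """
    variants = []
    for f in (np.flipud,        # vertical flip
              np.fliplr,        # horizontal flip
              np.transpose):    # diagonal flip (reflection across the main diagonal)
        variants.append((f(np.asarray(inp)).tolist(), f(np.asarray(out)).tolist()))
    # Cyclic color shift over the 0-9 palette (one of several shifts used).
    shift = lambda g: ((np.asarray(g) + 1) % 10).tolist()
    variants.append((shift(inp), shift(out)))
    return variants
```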

4.2. Gemini Flash Experimental Setup

All experiments were conducted using Google’s Gemini Flash API (versions 1.5 and 2.0) through the standard developer interface between March and May 2025. Each ARC puzzle was provided as a JSON-encoded input–output matrix pair, with color values normalized to integers from 0 to 9. The Gemini Flash model was required to predict the full output matrix with a short textual explanation of the logic behind its answer, and without step-by-step reasoning assistance unless specified. We developed nine additional variants for each original training example in the puzzle, including color shifting and vertical, horizontal, and diagonal flips of the input and output matrices. For every training example, we also included supplementary metadata such as the shape, size, and ratio of the input and output matrices, as well as the frequency of each digit appearing in both matrices. Additionally, each puzzle included a fixed, unaltered set of instructions that further clarified the abstraction logic underlying ARC puzzles.
Sampling parameters were as follows: temperature = 0.7–1.25 for primary runs, with additional validation runs at temperature = 1.35–1.65; top_p = 0.95; top_k = 40; max_tokens = 20,000. Each puzzle was tested in three independent runs to account for stochastic variation. Similarity between predicted and ground-truth matrices was computed as normalized pixel-wise correspondence, producing the Threshold metric reported in Table 8 and Table 9. All evaluations were executed on an Ubuntu 24.04 workstation using the official Gemini API interface.
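For concreteness, a stripped-down version of the evaluation loop is sketched below using the public google.generativeai Python client; the payload schema and helper names are ours, and the actual harness differs in its retry and augmentation logic.

```python
import json
import numpy as np
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")           # assumes a configured API key
model = genai.GenerativeModel("gemini-1.5-flash")

def attempt_puzzle(puzzle: dict, temperature: float = 0.7) -> str:
    """Send one ARC puzzle (training pairs + test input) as raw JSON."""
    response = model.generate_content(
        json.dumps(puzzle),
        generation_config=genai.GenerationConfig(
            temperature=temperature,
            top_p=0.95,
            top_k=40,
            max_output_tokens=20000,
        ),
    )
    return response.text  # expected to contain the predicted output matrix

def similarity(predicted, truth) -> float:
    """Normalized pixel-wise correspondence, as used for the Threshold metric."""
    p, t = np.asarray(predicted), np.asarray(truth)
    if p.shape != t.shape:
        return 0.0  # a shape mismatch counts as zero correspondence
    return float((p == t).mean())
```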
Figure 50. ARC-AGI Evaluation Puzzle 0b17323b Output.
Figure 51. ARC-AGI Evaluation Puzzle 009d5c81 Output.

4.3. Gemini Flash Outputs and Performance Analysis

To contextualize the evaluation results, representative ARC puzzle outputs are shown in Figure 50 and Figure 51. The corresponding tasks can be interactively explored at: https://arcprize.org/play?task=0b17323b and https://arcprize.org/play?task=009d5c81. These examples illustrate the types of abstraction and transformation challenges posed by the ARC benchmark.
Our experimental results (Table 8 and Table 9) indicate that providing additional training examples did not lead to a clear or consistent improvement in performance. Gemini Flash did not exceed the minimum success threshold (Table 8) in more than half of the evaluated puzzles, as further illustrated in Figure 52 and Figure 53. In particular, when Gemini Flash 1.5 achieves a score of approximately 5% on the ARC-AGI-I benchmark, further simplification or example augmentation appears to offer limited benefit.
We additionally leveraged Gemini’s long-context capabilities by performing multiple attempts per puzzle, incorporating information from previous attempts and varying temperature settings between 0.7 and 1.65. This strategy did not yield consistent or sustained improvements; instead, the outputs exhibited substantial variability across runs. The recurrence of reasoning difficulties suggests that the observed performance constraints are more likely related to underlying architectural factors than to stochastic sampling effects alone.
The failure of Gemini Flash to improve with augmented examples (Table 8) suggests a context-saturation effect. Rather than the additional data clarifying the rule, the increased token count may have introduced "noise" into the attention mechanism, leading to the observed stochastic variability. This indicates that for current LLMs, "more data" at inference time does not necessarily equate to "better abstraction," likely due to the lack of a dedicated latent space for iterative hypothesis testing.
Taken together, these findings suggest that a model’s reasoning performance is strongly influenced by its training regime and internal representations. While prompt engineering, increased context length, and repeated sampling may offer incremental benefits in some settings, they do not consistently compensate for limitations encountered on novel ARC tasks. However, low ARC performance should not be interpreted as direct evidence of a lack of general intelligence. Several alternative explanations may account for the observed results.
First, ARC puzzles are primarily visual–symbolic reasoning tasks, whereas most large language models are trained predominantly on textual data. This modality mismatch may limit performance independently of any broader cognitive capability.
Second, the architecture of current LLMs is optimized for next-token prediction rather than explicit program synthesis or combinatorial search, both of which are often required for solving ARC tasks. As a result, poor performance may reflect architectural misalignment with the task structure rather than a fundamental absence of reasoning ability.
Third, the training objective of LLMs emphasizes statistical pattern completion over systematic generalization. ARC, by contrast, is explicitly designed to test rapid skill acquisition from minimal examples. The resulting objective mismatch may therefore contribute to the observed performance gap.
Table 8. Gemini-1.5-flash results (see Section 4.2 for the Threshold and Above Threshold definitions).

Batch | Temp | Additional Examples | Total Attempted | Above Threshold | Solved 100%
batch-6 | 0.7–1.65 | 0 | 49 | 23 | 2 (4.08%)
batch-7 | 0.7–1.65 | 2 | 50 | 23 | 2 (4.00%)
batch-8 | 0.7–1.65 | 4 | 48 | 19 | 2 (4.17%)
batch-9 | 0.7–1.65 | 9 | 42 | 21 | 2 (4.76%)
Figure 52. ARC-AGI Puzzles, Batches 6, 7, 8, 9.
Table 9. Gemini-2.0-flash results (see Section 4.2 for the Threshold and Above Threshold definitions).

Batch | Temp | Additional Examples | Total Attempted | Above Threshold | Solved 100%
batch-0 | 0.7–1.65 | 0 | 47 | 21 | 0 (0.00%)
batch-1 | 0.7–1.65 | 0+data | 50 | 21 | 1 (2.00%)
batch-2 | 0.7–1.25 | 2+data | 48 | 20 | 1 (2.08%)
batch-3 | 0.7–1.35 | 4+data | 45 | 21 | 0 (0.00%)
Figure 53. ARC-AGI Puzzles, Batches 0, 1, 2, 3.

4.4. Testing Introspective Verification Capabilities

At the conclusion of our experiments with Gemini Flash, we conducted an additional test to evaluate whether the model could recognize and reproduce the correct output when it was explicitly provided. Specifically, we supplied the true output for all puzzles to Gemini, accompanied by direct clues—effectively analogous to giving a student the correct answers labeled as “true_output.” We also allowed the model up to three attempts per puzzle, providing its previous predictions and their corresponding scores after each attempt.
Despite these highly favorable conditions, Gemini achieved only a maximum accuracy of approximately 40%. This finding highlights an additional and noteworthy limitation. If the model correctly recognized that the input explicitly contained the true output, its success rate would reasonably be expected to approach 90% or higher. Conversely, if the model failed to detect the presence of the leaked true output, its accuracy would be expected to remain near 5%, corresponding to chance-level or uninformed attempts. The allowance of repeated attempts was intended to enable the model to infer—based on the outcomes of earlier predictions—that the provided true output was correct. However, the persistently low accuracy suggests that the model does not consistently exhibit mechanisms for consistency checking, introspective verification, and common-sense reasoning.
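A hypothetical payload for this leakage probe is sketched below; the field names are illustrative and do not reproduce the exact prompt used in the study.

```python
# Illustrative probe structure: the correct answer is leaked in-context,
# alongside prior attempts and their pixel-wise similarity scores.
probe = {
    "train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
    "test_input": [[0, 1], [1, 0]],
    "true_output": [[1, 0], [0, 1]],  # the correct answer, supplied verbatim
    "clue": "The field named 'true_output' contains the correct answer.",
    "previous_attempts": [
        {"prediction": [[0, 0], [0, 0]], "score": 0.5},  # 2 of 4 cells match
    ],
}
```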

5. Empirical Characterization of LLM Reasoning Breaks

5.1. The Computational Einstellung Effect and Pattern Fixation

A critical phenomenon observed throughout the prompt experiments is a computational analog of the Einstellung effect—a cognitive trap where a previously learned, high-probability solution is applied to a new problem even when it is structurally inappropriate [41]. In LLMs, this manifests as a rigid adherence to dominant training distributions that overrides specific, counterfactual prompt instructions.
This effect was most pronounced in three specific diagnostic areas:
  • Mechanical Defaulting (Prompt 2): Despite the explicit requirement for a rear-steered tricycle, visual outputs consistently reverted to front-steering architectures. The model’s internal "prior" for vehicle topology—built on millions of images of standard tricycles—exerted a gravitational pull that suppressed the novel mechanical logic requested.
  • Handedness Bias (Prompt 4): The failure to render left-handed scissors operation, even when the model textually acknowledged the constraint, illustrates distributional capture. Because right-handedness is the statistically dominant representation in internet-scale data, the model is unable to "de-center" from this bias to perform a simple geometric inversion.
  • Functional Mirroring (Prompt 3): When generating reflections, models often produced readable, non-inverted text. This suggests that the "object-level" representation of text is so strong that the model cannot apply the "transformation-level" logic of reflection, preferring the familiar pattern of legible characters over the physically accurate mirrored variant.
Unlike human reasoners, who can overcome the Einstellung effect through metacognitive monitoring and strategic shifting, the LLMs in our study showed a "pattern-matching inertia." Even when corrected in subsequent turns (as discussed in Inference-Time Stability), the models frequently relapsed into the high-probability training template. This suggests that the observed reasoning failures are not merely "errors" but are structural consequences of an architecture that prioritizes statistical frequency over logical consistency.
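To make this failure mode reproducible beyond our specific prompts, the counterfactual probes above can be encoded as a small battery of prompt-and-check pairs. The sketch below is a hypothetical scaffold: the probe wording, output annotations, and `check` predicates are illustrative placeholders that would need to be implemented per modality (for example, by human raters or a vision classifier over generated images).

```python
# An illustrative battery of counterfactual probes. Each entry pairs a
# prompt that contradicts the dominant training template with a predicate
# over an annotated description of the model's output.
EINSTELLUNG_PROBES = [
    {
        "name": "mechanical_defaulting",
        "prompt": "Draw a tricycle steered by its single REAR wheel.",
        "check": lambda out: out.get("steering_wheel") == "rear",
    },
    {
        "name": "handedness_bias",
        "prompt": "Show scissors operated with the LEFT hand.",
        "check": lambda out: out.get("operating_hand") == "left",
    },
    {
        "name": "functional_mirroring",
        "prompt": "Render the word 'EXIT' as seen in a mirror.",
        "check": lambda out: out.get("text_mirrored") is True,
    },
]

def run_battery(generate, annotate, trials=10):
    """Estimate the per-probe rate of overcoming pattern fixation.

    `generate` maps a prompt to a model output; `annotate` maps that
    output to a dict of structural attributes (a placeholder for human
    or automated rating). Returns the fraction of trials, per probe,
    in which the counterfactual constraint was actually satisfied.
    """
    report = {}
    for probe in EINSTELLUNG_PROBES:
        passes = sum(
            probe["check"](annotate(generate(probe["prompt"])))
            for _ in range(trials)
        )
        report[probe["name"]] = passes / trials
    return report
```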

5.2. Inference-Time Stability in Mechanical Reasoning

Our experiments reveal that model behavior in mechanical reasoning tasks remains largely stable across repeated interactions, with limited evidence of systematic improvement during inference. In human problem-solving, repeated engagement with a task often leads to progressive refinement of mental models and improved performance. By contrast, in the wheelchair design scenario, additional prompts and clarifications did not consistently yield more coherent mechanical or visual outputs (Figure 37, Figure 39, Figure 41 and Figure 42).
This pattern does not necessarily indicate a deficiency in reasoning but reflects the architectural property that LLMs do not update internal representations during inference. Consequently, performance variations across prompts may arise from prompt sensitivity, stochastic decoding, or modality-specific constraints rather than cumulative learning effects.
Related observations have been reported in studies of model degeneration under recursive self-training, where repeated exposure to self-generated outputs leads to increasing instability in representations [40]. While inference-time prompting differs fundamentally from training-time feedback, these findings provide a contextual framework for interpreting the absence of incremental performance gains in interactive mechanical reasoning tasks.

5.3. Jagged Performance and Cross-Modal Inconsistency

A recurring pattern in the wheelchair experiment is the presence of jagged performance, defined here as non-monotonic and inconsistent model behavior across closely related prompts or modalities. In the textual domain, the model produced mechanically plausible explanations, whereas corresponding visual outputs exhibited structural inconsistencies (Figure 37, Figure 39 and Figure 41). Similarly, incremental prompt refinements did not lead to uniformly improved results (Figure 42).
Such variability is consistent with prior observations that LLM performance does not scale smoothly with task difficulty or prompt specificity but instead fluctuates across seemingly similar conditions. In the present case, jaggedness manifests as a divergence between declarative mechanical knowledge and its multimodal instantiation. This pattern may arise from multiple factors, including stochastic decoding processes, prompt sensitivity, modality-specific training distributions, and reliance on high-probability patterns rather than structured compositional reasoning [4,5].
Importantly, the wheelchair experiment should be interpreted as an illustrative instance of this broader phenomenon rather than definitive evidence of general model limitations. Nevertheless, the observed jagged behavior highlights the need for systematic diagnostic protocols that quantify variability across prompts, tasks, and modalities, rather than relying on isolated success or failure cases.
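One minimal way to operationalize such a protocol is to score the same task under several paraphrased prompts and report dispersion alongside mean accuracy. The sketch below is a hypothetical formulation; the task-specific `evaluate` function is an assumption supplied by the experimenter.

```python
import statistics

def jaggedness(prompt_variants, evaluate, trials=5):
    """Quantify performance variability across near-equivalent prompts.

    `prompt_variants` lists paraphrases of one task; `evaluate` maps a
    prompt to a score in [0, 1] and is assumed to be stochastic under
    repeated calls (e.g., nonzero decoding temperature). A model with
    smooth, prompt-robust behavior should show near-zero spread.
    """
    per_variant = [
        statistics.mean(evaluate(p) for _ in range(trials))
        for p in prompt_variants
    ]
    return {
        "mean": statistics.mean(per_variant),
        "stdev": statistics.pstdev(per_variant),
        "range": max(per_variant) - min(per_variant),  # worst-case gap
    }
```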

5.4. Contextual Hallucinations and Cross-Turn Information Leakage

Across multiple experiments, we observed systematic instances of hallucination arising from inappropriate reuse of contextual information across conversation turns. Figure 46 illustrates a representative example. In this case, the model was provided only with a PDF manuscript and asked whether it could analyze that paper in conjunction with a hypothetical set of twenty additional research articles. Although no such articles were supplied, the model asserted that it had already reviewed both the manuscript and the twenty papers, thereby fabricating nonexistent inputs.
A related failure mode emerged during prompt compression experiments. When instructed to compress a previously used multi-line prompt (itself a question) into three lines (Figure 47), the model produced a compressed version that implicitly contained the answer to the original question (Figure 48). Crucially, this answer was not present in the source text being compressed but had appeared earlier in the conversation context. This behavior indicates unintended information leakage from prior turns rather than faithful transformation of the provided input.
When explicitly queried about the origin of these inserted values (Figure 49), the model initially attributed them to a fabricated reasoning process, claiming that the values were derived from electrical data visible on digital meters in an image. However, no image had been provided in the relevant prompt. Upon further confrontation, the model revised its explanation and acknowledged that the technical details had been implicitly carried over from earlier conversational context (Figure 49).
These observations demonstrate that hallucinations in LLMs are not limited to isolated factual errors but can arise from "context misattribution," where previously seen information is mistakenly treated as part of the current input. Importantly, this occurs even in seemingly simple transformation tasks (e.g., text compression), highlighting that instruction-following fidelity can degrade under extended conversational context. This behavior poses challenges for the use of LLMs in document analysis, summarization, and multi-step reasoning workflows, where strict input–output isolation is often implicitly assumed.
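A practical safeguard suggested by this failure mode is to verify that a transformation's output is grounded solely in its declared input. The token-overlap heuristic below is a crude illustrative check, not a complete solution, and all names are hypothetical.

```python
import re

def tokens(text: str) -> set:
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def leaked_content(source: str, output: str, whitelist: set = frozenset()) -> set:
    """Return output tokens absent from the source being transformed.

    For a faithful compression task, every content-bearing token in the
    compressed text should originate in `source`; tokens appearing only
    in earlier conversation turns indicate cross-turn leakage. The
    `whitelist` covers connective words the model may legitimately add.
    """
    return tokens(output) - tokens(source) - whitelist

# Hypothetical example: a compressed prompt that smuggles in values from
# a previous turn is flagged here.
extras = leaked_content(
    source="Summarize the wiring question in three lines.",
    output="Wiring question: meter reads 42 volts; summarize in three lines.",
)
print(extras)  # {'meter', 'reads', '42', 'volts'} -> possible leakage
```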

5.5. Discussion: Mechanisms of Reasoning Breakdowns

The observed phenomena suggest that LLM reasoning failures are not stochastic glitches but are rooted in three distinct architectural constraints:
  • The Self-Auditing Gap: A critical limitation identified across experiments is the absence of internal verification. Models produced structurally implausible outputs with high confidence, failing to engage error-detection mechanisms during the generative process. This suggests that self-correction is not inherently coupled to instruction following in current transformer-based architectures.
  • Latent Activation vs. Strategy Acquisition: Performance gains observed through prompting or iterative correction do not appear to represent the acquisition of new reasoning strategies. Instead, they reflect the selective activation of latent behaviors already supported by the training distribution. When a task requires a novel logical shift—such as geometric inversion or non-standard vehicle topology—the model lacks the metacognitive flexibility to override its high-probability priors.
  • Surface-Level Fluency and the Competence Mirage: The persistence of the Einstellung effect indicates that linguistic competence remains tightly coupled to training-time statistical frequencies. This creates a "competence mirage," where surface-level fluency masks a fundamental inability to perform grounded, self-regulating abstraction. Addressing these "reasoning breaks" likely requires moving beyond parameter scaling toward architectures that incorporate explicit internal verification and adaptive restructuring.

6. Evaluating Intelligence in LLMs: Observed Limitations and Future Perspectives

Despite the substantial scaling of large language models (LLMs), our diagnostic findings corroborate a fundamental gap between benchmark-oriented performance and grounded reasoning. While models achieve high scores on standardized datasets, the performance divergence between ARC-AGI-I and ARC-AGI-II (see Table 3 and Table 4) suggests that contemporary success often reflects adaptation to specific data distributions rather than robust fluid intelligence.

6.1. The Endogenous Evaluation Gap

A defining characteristic of the failures observed in our study is that they are identified through external evaluation rather than endogenous model processes. Although LLMs can articulate known weaknesses when explicitly prompted, such responses are typically attributable to learned statistical regularities rather than intrinsic mechanisms of self-monitoring. In practice, the identification and mitigation of model failures—such as the Einstellung effect in mechanical reasoning—remain dependent on human supervision or auxiliary verification systems. This lack of internal error detection indicates that meta-cognitive awareness is not yet a structural feature of current transformer-based architectures.

6.2. Future Directions: Towards Resource-Efficient Autonomy

The discrepancy between surface-level fluency and grounded reasoning suggests a critical need for evaluative frameworks that prioritize process-level criteria over static task accuracy. Moving forward, we propose a shift toward characterizing artificial intelligence through dimensions of Resource-Efficient Cognitive Autonomy. Rather than measuring output-only competence, future benchmarks should foreground the following meta-cognitive capacities:
  • Self-Auditing: The detection of internal inconsistencies or logical errors without external cues.
  • Adaptive Strategy Generation: The construction of alternative reasoning paths when initial high-probability templates fail.
  • Iterative Refinement: The execution of corrective logic within fixed computational and memory budgets.
By operationalizing intelligence as sustained improvement under bounded resources, researchers can move beyond idealized models toward systems capable of calibrated uncertainty awareness and principled trade-offs between exploration and efficiency. Formalizing these process-centric criteria remains an essential direction for future work.
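As a concrete illustration of these criteria, the hypothetical harness below scores a system by its improvement trace under a fixed wall-clock budget rather than by single-pass accuracy. The `solve` and `verify` callables are assumptions standing in for the model under test and an external grader; this is a sketch of how the three capacities above might be measured jointly, not a finished benchmark.

```python
import time

def bounded_refinement_score(solve, verify, task, budget_s=30.0, max_iters=8):
    """Measure sustained improvement under a fixed wall-clock budget.

    `solve(task, feedback)` proposes a candidate solution; `verify(task,
    candidate)` returns (score in [0, 1], critique). A system exercising
    self-auditing and adaptive strategy generation should show a rising
    score trace within the budget; a pattern-fixated one plateaus at its
    first high-probability template.
    """
    deadline = time.monotonic() + budget_s
    trace, feedback = [], None
    for _ in range(max_iters):
        if time.monotonic() >= deadline:
            break
        candidate = solve(task, feedback)
        score, critique = verify(task, candidate)
        trace.append(score)
        if score == 1.0:
            break
        feedback = critique  # corrective signal for the next attempt
    improvement = trace[-1] - trace[0] if len(trace) > 1 else 0.0
    return {"final": trace[-1] if trace else 0.0,
            "improvement": improvement,
            "iterations": len(trace)}
```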

7. Conclusion

This study provided a structured diagnostic characterization of contemporary Large Language Models (LLMs) by probing the boundaries of their reasoning, generalization, and self-verification capabilities. By shifting the evaluative focus from aggregate benchmark scores to targeted failure-mode analysis, we identified persistent structural limitations that remain masked by surface-level linguistic fluency. The empirical evidence suggests that while LLMs are capable of sophisticated pattern retrieval, their performance becomes non-monotonic and "jagged" when tasked with novel, mechanically grounded, or cross-modal constraints.
Our results demonstrate that improvements achieved through inference-time interventions—such as prompt refinement and chain-of-thought elicitation—are predominantly fragile. These strategies tend to facilitate the selective activation of latent training distributions rather than the acquisition of new, transferable reasoning principles. This distinction is critical for the field: the "improvements" observed during interactive prompting should be characterized as distributional alignment rather than dynamic learning. Consequently, the observed inability of models to sustain corrections across varying contexts points to a fundamental decoupling between linguistic competence and world-model consistency.
Furthermore, the discrepancy between fluent textual explanation and structural visual-mechanical failure (as seen in our mechanical and spatial probes) reveals a "grounding gap" that scaling alone has yet to bridge. The persistence of the Einstellung effect—where models default to high-probability training templates despite explicit counter-instructions—indicates that current architectures prioritize statistical frequency over logical or physical invariants. These findings suggest that high performance on standardized benchmarks is a necessary but insufficient indicator of generalizable reasoning, as these metrics may fail to capture the stochastic and fragile nature of model outputs in "out-of-distribution" scenarios.
Importantly, this work does not posit an absolute limit on the potential of scaled architectures; rather, it provides a diagnostic baseline for the current state of the art. The evidence highlights that progress toward autonomous, grounded intelligence cannot be measured solely by task-success rates. Instead, it requires a shift toward evaluating process-oriented cognitive dimensions: the capacity for self-auditing, the maintenance of state-consistency in multi-turn reasoning, and the robust handling of physical constraints.
In summary, the reproducible "reasoning breaks" documented in this study underscore a critical gap between probabilistic token prediction and the functional world-modeling required for general reasoning. The limitations identified here suggest that future advancements must move beyond parameter expansion and prompt optimization toward architectural innovations capable of internal consistency-checking and stable, grounded representation.

Declarations

Ethics Approval and Consent to Participate

Not applicable. This study did not involve human participants, animals, or sensitive data requiring ethical approval.

Consent for Publication

Not applicable. No individual person’s data or identifiable information is included in this manuscript.

Availability of Data and Materials

Additional data and resources will be made available upon reasonable request.

Competing Interests

The authors declare that there are no competing interests.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Authors’ Contributions

Rashid Mehmood worked on the proposed scheme and visualization, Dr. Eid Rehman on validation, and Dr. Muhammad Habib on analysis.

Acknowledgements

The author would like to thank the reviewers and editorial team for their constructive feedback, which helped improve the clarity and quality of this manuscript.

Glossary

Above Threshold: Denotes the degree to which the model's predicted output matrix exceeds a predefined similarity benchmark relative to the true output matrix.
SotA LLMs: State-of-the-Art Large Language Models, such as ChatGPT, Gemini, and Grok. These models represent the most advanced implementations of transformer-based neural architectures currently available.
Threshold: The minimum acceptable similarity score (measured as normalized pixel-wise correspondence) required for meaningful alignment between prediction and ground truth. For example, if the threshold for a given puzzle is 78%, a model output achieving 83% similarity is recorded as 5% above threshold. This metric applies only to puzzles where the input and output matrices share identical dimensions; for all other cases involving dimension changes or structural transformations, the threshold value is set to zero, as direct element-wise comparison is not applicable.

References

  1. Evans, O.; Berglund, L.; Tong, M.; Kaufmann, M.; et al. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv 2023, arXiv:2309.12288v4. Available online: https://arxiv.org/abs/2309.12288v4.
  2. Nezhurina, M.; Cipolina-Kun, L.; Cherti, M.; Jitsev, J.; et al. Alice in Wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv 2024, arXiv:2406.02061v4. Available online: https://arxiv.org/abs/2406.02061v4.
  3. Chollet, F. On the measure of intelligence. arXiv 2019, arXiv:1911.01547v2. Available online: https://arxiv.org/abs/1911.01547v2.
  4. Dziri, N.; Lu, X.; Sclar, M.; Li, X. L.; Jiang, L.; et al. Faith and fate: Limits of transformers on compositionality. arXiv 2023, arXiv:2305.18654. Available online: https://arxiv.org/abs/2305.18654v3.
  5. Du, M.; He, F.; Zou, N.; Tao, D.; Hu, X. Shortcut learning of large language models in natural language understanding. arXiv 2022, arXiv:2208.11857v2. Available online: https://arxiv.org/abs/2208.11857.
  6. AINumbat. SotA LLM Limitations (Examples Repository). GitHub, 2024. Available online: https://github.com/ainumbat/llm_eval_notes.git.
  7. Chollet, F. Talk at AGI Conference, ARC Prize. YouTube, 2024. Available online: https://www.youtube.com/watch?v=nL9jEy99Nh0&t=1450s.
  8. Li, X. L.; Kuncoro, A.; Hoffmann, J.; et al. A systematic investigation of commonsense knowledge in large language models. arXiv 2022, arXiv:2111.00607. Available online: https://arxiv.org/abs/2111.00607v3.
  9. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682. Available online: https://arxiv.org/abs/2206.07682v2.
  10. Wei, J.; Wang, X.; Schuurmans, D.; et al. Chain-of-thought prompting elicits reasoning in large language models. arXiv 2022, arXiv:2201.11903v6. Available online: https://arxiv.org/abs/2201.11903v6.
  11. Yin, Z.; Sun, Q.; Guo, Q.; et al. Do large language models know what they don't know? arXiv 2023, arXiv:2305.18153v2. Available online: https://arxiv.org/abs/2305.18153v2.
  12. Turpin, M.; Michael, J.; Perez, E.; Bowman, S. R.; et al. Language models don't always say what they think: Unfaithful explanations in chain-of-thought. arXiv 2023, arXiv:2305.04388v2. Available online: https://arxiv.org/abs/2305.04388v2.
  13. Wenzel, G.; Jatowt, A. An overview of temporal commonsense reasoning and acquisition. arXiv 2023, arXiv:2308.00002. Available online: https://arxiv.org/abs/2308.00002v3.
  14. Chollet, F.; Knoop, M.; Kamradt, G.; Landers, B. ARC Prize 2024: Technical report. arXiv 2024, arXiv:2412.04604. Available online: https://arxiv.org/abs/2412.04604v2.
  15. Zhao, J.; Tong, J.; Mou, Y.; et al. Exploring the compositional deficiency of large language models in mathematical reasoning through trap problems. arXiv 2024, arXiv:2405.06680v4. Available online: https://arxiv.org/abs/2405.06680v4.
  16. Bennett, M. T. Is complexity an illusion? arXiv 2024, arXiv:2404.07227. Available online: https://arxiv.org/abs/2404.07227v4.
  17. Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. Available online: https://arxiv.org/abs/2005.14165v4.
  18. Banerjee, S.; Agarwal, A.; Singla, S. LLMs will always hallucinate, and we need to live with this. arXiv 2024, arXiv:2409.05746. Available online: https://arxiv.org/abs/2409.05746v1.
  19. Herrmann, M.; Lange, J. D.; Eggensperger, K.; et al. Position: Why we must rethink empirical research in machine learning. arXiv 2024, arXiv:2405.02200v2. Available online: https://arxiv.org/abs/2405.02200v2.
  20. Wu, Z.; Qiu, L.; Ross, A.; et al. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. arXiv 2023, arXiv:2307.02477v3. Available online: https://arxiv.org/abs/2307.02477v3.
  21. Akyürek, E.; Damani, M.; Qiu, L.; et al. The surprising effectiveness of test-time training for abstract reasoning. arXiv 2024, arXiv:2411.07279. Available online: https://arxiv.org/abs/2411.07279v1.
  22. Rahman, M. N. H.; Son, S.-H. Feature transforms for image data augmentation. Neural Computing and Applications 2022, 34, 16141–16160.
  23. Kim, Y.-H.; Ahn, J.-M.; Jang, S.-H.; Kim, S.-K.; Kim, H.-K. Data augmentation method by applying color perturbation of inverse PSNR and geometric transformations for object recognition based on deep learning. Applied Sciences 2020, 10, 3755.
  24. Chang, T. A.; Bergen, B. K. Language model behavior: A comprehensive survey. arXiv 2023, arXiv:2303.11504v2. Available online: https://arxiv.org/abs/2303.11504v2.
  25. Dennett, D. C. The Role of Language in Intelligence. In Brainstorms: Philosophical Essays on Mind and Psychology; De Gruyter, 2013.
  26. OpenAI. ChatGPT. 2023. Available online: https://chat.openai.com.
  27. Google DeepMind. Gemini. 2024. Available online: https://deepmind.google/technologies/gemini.
  28. xAI. Grok. 2024. Available online: https://x.ai.
  29. DeepSeek. DeepSeek Language Model. 2024. Available online: https://deepseek.com.
  30. Zhao, H.; Yang, F.; Lakkaraju, H.; Du, M. Towards uncovering how large language model works: An explainability perspective. arXiv 2024, arXiv:2402.10688v2. Available online: https://arxiv.org/abs/2402.10688.
  31. Chollet, F.; Knoop, M.; Kamradt, G.; Landers, B. ARC Prize 2024: Technical Report. arXiv 2024, arXiv:2412.04604.
  32. Chollet, F.; Knoop, M.; Kamradt, G.; Landers, B. ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. arXiv 2025, arXiv:2505.11831.
  33. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; et al. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361.
  34. Legg, S.; Hutter, M. Universal intelligence: A definition of machine intelligence. Minds and Machines 2007, 17, 391–444.
  35. Deutsch, D. Constructor theory. Synthese 2015, 190, 4331–4359.
  36. Minsky, M. The Society of Mind; Simon & Schuster, 1986.
  37. Yampolskiy, R. V. Artificial Intelligence Safety Engineering: Why Machine Ethics Is a Wrong Approach. In Philosophy and Theory of Artificial Intelligence; Springer, 2015.
  38. Schmidhuber, J. Gödel machines: Fully self-referential optimal universal self-improvers. arXiv 2007, arXiv:0705.1865v3. Available online: https://arxiv.org/abs/0705.1865.
  39. Hutter, M. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability; Springer: Berlin, Germany, 2005.
  40. Shumailov, I.; Zhao, Z.; Galke, J.; Papernot, P.; Anderson, R. AI models collapse when trained on recursively generated data. arXiv 2025, arXiv:2505.21677. Available online: https://arxiv.org/abs/2505.21677.
  41. Luchins, A. S. Mechanization in problem solving: The effect of Einstellung. Psychological Monographs 1942, 54.

Author Biographies

Rashid Mehmood is an independent researcher specializing in artificial intelligence, machine learning, and full-stack system development. His work focuses on improving reasoning, adaptability, and test-time learning in AI systems, with the broader goal of advancing paths toward Artificial General Intelligence (AGI). He has developed lightweight, resource-efficient algorithms and adaptive assistants designed to reduce catastrophic forgetting and enhance real-time inference. Recently, he demonstrated that strong generalization can be achieved from extremely sparse data, achieving over 80% accuracy on MNIST using only 1% of the training set. His research continues to explore efficient learning, abstraction, and dynamic knowledge recalibration.
Dr. Eid Rehman is currently serving as an Assistant Professor of Computer Science at the University of Mianwali, Pakistan. He earned his Ph.D. in Computer Science from the International Islamic University, Islamabad, in 2018. Throughout his academic and research career, Dr. Rehman has made significant contributions to the fields of Artificial Intelligence, Large Language Models (LLMs), and Information Security. His passion for advancing knowledge in emerging technologies is reflected in his prolific research record, having authored and co-authored more than 25 research papers published in well-reputed national and international journals. Dr. Rehman's research work bridges theory and practical application, contributing valuable insights to cutting-edge areas critical to today's technological advancements. He remains actively engaged in research, mentoring students, and participating in collaborative projects to foster innovation and excellence in Computer Science. His commitment to academic excellence and research innovation continues to inspire the next generation of computer scientists at the University of Mianwali and beyond.
Dr. Muhammad Habib received his Ph.D. in Computer Science from International Islamic University Islamabad, Pakistan, in 2018. His research interests include Computer Vision, Machine Learning, Deep Learning, Generative AI, and Agentic AI. He has published numerous research papers in reputable journals, contributing significantly to advancements in intelligent systems and AI-driven technologies. His work focuses on developing innovative algorithms and methodologies to enhance machine perception and automation.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.