Submitted: 30 March 2026
Posted: 01 April 2026
Abstract
Keywords:
1. Introduction
- A focused diagnostic analysis of recurrent failure modes observed in state-of-the-art LLMs when responding to minimal visual–textual reasoning probes.
- A structured categorization of these failures across mechanical, geometric, relational, and quantitative reasoning dimensions.
- Qualitative evidence that certain simple, well-specified tasks remain challenging for current models despite strong benchmark performance.
- A brief conceptual perspective outlining how future work might integrate systematic diagnostic protocols and cognitively grounded evaluation methods.
2. Prompt Construction
Failure-Mode Taxonomy
- **Mechanical and Physical Plausibility Failures:** Violations of basic mechanical constraints, such as disconnected components, non-functional assemblies, or physically implausible motion transfer (Prompts 1, 2, 10).
- **Geometric and Spatial Consistency Failures:** Errors involving mirror symmetry, handedness, spatial orientation, or object geometry (Prompts 3, 4).
- **Symbolic and Representational Integrity Failures:** Inconsistencies in symbolic rendering, such as incorrect numeral systems or unintended representational substitutions (Prompt 5).
- **Temporal and State-Change Reasoning Failures:** Incorrect handling of time-dependent constraints or failure to preserve state while modifying a single variable (Prompt 6).
- **Quantitative and Counting Errors:** Violations of basic numerical constraints, including incorrect counting, volumetric reasoning errors, or arithmetic inconsistencies (Prompts 7, 8).
- **Relational and Logical Structure Failures:** Breakdown in multi-entity relational reasoning, including missing entities, incorrect relationship enumeration, or incomplete relational graphs (Prompt 9).
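To make the taxonomy concrete, the sketch below encodes it as a small enumeration together with the prompt-to-category mapping from Table 2. This is an illustrative formalization on our part, not code from the study; all identifiers are ours.

```python
from enum import Enum

class FailureMode(Enum):
    """Failure-mode taxonomy used to tag each diagnostic prompt."""
    MECHANICAL_PHYSICAL = "mechanical/physical plausibility"
    GEOMETRIC_SPATIAL = "geometric/spatial consistency"
    SYMBOLIC_REPRESENTATIONAL = "symbolic/representational integrity"
    TEMPORAL_STATE_CHANGE = "temporal/state-change reasoning"
    QUANTITATIVE_COUNTING = "quantitative/counting"
    RELATIONAL_LOGICAL = "relational/logical structure"

# Mapping of prompt numbers (Table 2) to their targeted failure modes.
PROMPT_TAXONOMY = {
    1: FailureMode.MECHANICAL_PHYSICAL,
    2: FailureMode.MECHANICAL_PHYSICAL,
    3: FailureMode.GEOMETRIC_SPATIAL,
    4: FailureMode.GEOMETRIC_SPATIAL,
    5: FailureMode.SYMBOLIC_REPRESENTATIONAL,
    6: FailureMode.TEMPORAL_STATE_CHANGE,
    7: FailureMode.QUANTITATIVE_COUNTING,
    8: FailureMode.QUANTITATIVE_COUNTING,
    9: FailureMode.RELATIONAL_LOGICAL,
    10: FailureMode.MECHANICAL_PHYSICAL,
}
```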
Prompt Construction Procedure
1. **Out-of-Distribution Probing with Diagnostic Failure-Mode Synthesis:** In this phase, a target failure mode is selected from the predefined taxonomy (e.g., mechanical inconsistency, physical plausibility, or topological violation) and operationalized through a counterfactual probe. The task is specifically engineered to lie outside the model's high-probability training distribution while remaining internally consistent and logically well-defined. By using scenarios that are "trivially solvable" for human cognition yet absent from internet-scale corpora, such as non-standard drivetrain geometries or functional assemblies, we isolate the system's capacity for de novo structural reasoning from its capacity for associative retrieval. For instance, in the context of mechanical reasoning, the probe is synthesized to require implicit structural coordination across multiple interdependent components. The task necessitates that the model maintain topological invariants, such as mechanical continuity and spatial handedness, across a multi-step generative process. This exposes whether the output is governed by a persistent physical world model or by a fragmented sequence of high-probability tokens. Example: synthesizing a mechanical plausibility failure by requesting a wheelchair, for a user with both hands missing, that is propelled using a bicycle-style pedal-to-wheel transmission system (Figure 38).
2. **Iterative Boundary Refinement and Constraint Tuning:** Following the initial task formulation, the prompt undergoes a multi-stage refinement process to ensure diagnostic precision. Initial variants are first deployed in simplified form to verify that the task reliably elicits the targeted failure mode without excessive linguistic ambiguity. If pre-testing reveals partial compliance or underspecified outputs (where the model may "bypass" the reasoning constraint through vague representation), the prompt is incrementally refined. This refinement involves introducing explicit structural constraints (e.g., requiring visible sprockets or mechanically continuous linkages) while preserving functional minimalism. The objective is to eliminate "low-effort" statistical approximations and force the model to engage with the specific topological invariants of the task. This step ensures that any observed failure is a definitive breakdown in structural reasoning rather than a result of an underspecified instruction. Example: transitioning from a general request for a "pedal-powered wheelchair" to a refined instruction requiring a continuous pedal-to-wheel transmission system with specified mechanical components (Figure 39 and Figure 42).
3. **Test-Time Adaptation Probing:** To evaluate test-time adaptation, the model is explicitly informed of identified errors in its prior output. Subsequent responses are examined to determine whether corrections reflect genuine internal constraint updating or merely superficial textual adjustment. Example: informing the model that the pedals are disconnected from the wheels and observing whether subsequent images correct the mechanical linkage or only provide plausible verbal explanations (Figure 41 and Figure 43). (A consolidated sketch of all three phases follows this list.)
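The following is a minimal sketch of the three-phase procedure above, not the authors' actual tooling: `generate` stands in for a model API call and `judge` for the rubric-based (or human) check, both assumptions on our part.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    unambiguous: bool                  # did the output cleanly pass or cleanly fail?
    suggested_constraint: str = ""     # loophole-closing constraint if it did not

@dataclass
class DiagnosticProbe:
    failure_mode: str                  # e.g. "mechanical/physical plausibility"
    prompt: str                        # counterfactual, out-of-distribution task
    constraints: list[str] = field(default_factory=list)

def refine_probe(probe: DiagnosticProbe, generate, judge, max_rounds: int = 5):
    """Phase 2 (iterative boundary refinement): tighten constraints until the
    probe elicits an unambiguous outcome. `generate` and `judge` are
    caller-supplied callables (model API and rubric/human check)."""
    output, verdict = None, None
    for _ in range(max_rounds):
        output = generate(" ".join([probe.prompt, *probe.constraints]))
        verdict = judge(output, probe.failure_mode)
        if verdict.unambiguous:
            break
        # Close the loophole the model used to bypass the constraint, e.g.
        # "the chain must visibly connect the pedals to a rear sprocket".
        probe.constraints.append(verdict.suggested_constraint)
    return probe, output, verdict

def adaptation_probe(generate, error_report: str):
    """Phase 3 (test-time adaptation probing): report the identified error and
    check whether the next output fixes the structure, not just the caption."""
    return generate(f"Your previous output contained this error: {error_report}. "
                    "Produce a corrected output in which the error is fixed.")
```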
Evaluation Metric and Scoring Rubric
Comparative Baselines
- Human Reasoner Baseline: Each prompt was vetted by a control group of five human participants. These participants were tasked with identifying the core mechanical or logical constraint in the prompt and confirming its solvability. For all prompts in Table 2, the human success rate was , establishing a "ceiling" of trivial solvability for a reasoning agent with functional world models.
- Cross-Generational Model Baseline: We contrast performance across two model iterations (early 2025 vs. early 2026). This allows us to measure whether the "Scaling Hypothesis" (increasing parameters and data) correlates with a reduction in failure rate. If a model shows improved performance on general benchmarks (e.g., MMLU) but maintains a high failure rate in our diagnostic probes, it provides empirical evidence of a persistent structural reasoning gap that is decoupled from general pattern-matching capabilities.
| Failure Category | GPT-4o | Gemini-2.5 | Grok-4 | GPT-5.2 | Gemini-3 | Grok-4.1 |
|---|---|---|---|---|---|---|
| Mechanical/Physical | 0.96 | 0.98 | 0.98 | 0.90 | 0.90 | 0.94 |
| Geometric/Spatial | 0.90 | 0.92 | 0.96 | 0.85 | 0.86 | 0.88 |
| Relational/Logical | 0.82 | 0.85 | 0.90 | 0.76 | 0.62 | 0.78 |
| Average | 0.89 | 0.92 | 0.95 | 0.84 | 0.79 | 0.87 |
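As a worked example of how the rubric's numbers compose, the snippet below recomputes the cross-generational change in failure rate; the values are transcribed from the GPT-4o and GPT-5.2 columns of the table above.

```python
def failure_rate(outcomes):
    """Fraction of trials in which the targeted failure mode was observed.
    `outcomes` is a list of booleans: True means the failure occurred."""
    return sum(outcomes) / len(outcomes)

print(failure_rate([True, True, False, True]))  # 0.75

# Failure rates transcribed from the table above (GPT-4o vs. GPT-5.2 columns).
early = {"mechanical": 0.96, "geometric": 0.90, "relational": 0.82}
late  = {"mechanical": 0.90, "geometric": 0.85, "relational": 0.76}

delta = {cat: round(late[cat] - early[cat], 2) for cat in early}
print(delta)  # {'mechanical': -0.06, 'geometric': -0.05, 'relational': -0.06}
```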
Scope and Limitations

| No. | Prompt | Failure Mode | Reference |
|---|---|---|---|
| 1 | Create an image of a wheelchair designed for a person with both hands missing, equipped with bicycle-style pedals that allow the user to propel the wheelchair independently. | Mechanically implausible drivetrain representation, including disconnected pedals, absent or broken chain linkage, or substitution with non-functional tank-track mechanisms. | See Figure 1, Figure 2 and Figure 3 |
| 2 | Create an image of a kids’ tricycle with two wheels in the front equipped with a pedal mechanism and one wheel at the back. The front steering system should be connected to the rear wheel. | Structural inconsistency in object composition, such as missing pedal–chain assemblies, incorrect wheel count, or invalid steering-to-wheel linkage. | See Figure 4, Figure 5 and Figure 6 |
| 3 | Create an image of a person holding Atomic Habits book, standing in front of a mirror. | Failure in geometric and reflective transformation reasoning, resulting in non-mirrored text, readable book titles in the reflection, or physically inconsistent reflections. | See Figure 7, Figure 8 and Figure 9 |
| 4 | Create an image of a person cutting paper with left-handed scissors. | Violation of handedness constraints, including use of the right hand or incorrect blade orientation inconsistent with left-handed scissors. | See Figure 10, Figure 11 and Figure 12 |
| 5 | Please create an image of a classic wall clock with a golden body and silver-colored Persian numerals. | Partial or complete omission of Persian numerals, substitution with incorrect numeral systems, or inconsistent numeral styling. | See Figure 13, Figure 14 and Figure 15 |
| 6 | Please create an image of the exact same clock, but showing the time 2:29. Do not change anything else except the time. | Incorrect temporal representation, including misplacement of hour or minute hands, introduction of extraneous hands, or replacement with digital time indicators. | See Figure 16, Figure 17 and Figure 18 |
| 7 | I have a jug with 3 liters of capacity and two small bottles of 40ml. How can I measure exactly 2.50 liters of water? Please provide a short and precise answer. | Invalid or incoherent solution steps, incorrect volumetric reasoning, premature classification of the task as impossible, or erroneous illustrative diagrams. | See Figure 19, Figure 20 and Figure 21 |
| 8 | Can you calculate the rows and columns in the given image? | Elementary counting errors, including incorrect grid dimensionality estimation or miscounting of distinct color regions. | See Figure 22, Figure 23 and Figure 24 |
| 9 | A guy named John Doe is attracted to older women, and he falls in love with a woman named Helen. Helen has one daughter named Marcy. Later, John Doe marries Helen, and they live happily together. One day, John Doe discovers that his father, John Smith, has married Marcy. Given this situation, how many relationships exist between John Doe and John Smith? | Breakdown in relational reasoning, leading to incorrect relationship counts, omission of entities, or incomplete representation of relational links. | See Figure 25, Figure 26 and Figure 27 |
| 10 | Design a solar system for the submersible pump (specs attached) using 12 existing Jinko 635W panels to run reliably from 08:00–16:00. Provide technical specs for required VFD, DC cabling, earthing, and mounting (tilt/orientation) while prioritizing cost-efficiency and safety. | Fundamental electrical miscalculations, including incorrect motor power estimation, erroneous horsepower classification, or omission of critical parameters such as power factor. | See Figure 28, Figure 29, Figure 30, Figure 31, Figure 32 and Figure 33 |
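To illustrate the arithmetic at stake in Prompt 10 (the failure class in the last row), here is a hedged sanity check. The pump rating is a hypothetical stand-in, since the specs attachment is not reproduced here; the constants (746 W per mechanical horsepower, real vs. apparent power) are standard.

```python
HP_TO_W = 746.0  # mechanical horsepower in watts

def motor_electrical_demand(shaft_kw, efficiency=0.85, power_factor=0.8):
    """Real (kW) and apparent (kVA) input power for an AC induction motor.
    Dropping power_factor is exactly the omission noted for Prompt 10."""
    real_kw = shaft_kw / efficiency        # power drawn from the supply
    apparent_kva = real_kw / power_factor  # sizing figure for the VFD / array
    return real_kw, apparent_kva

shaft_kw = 5.5  # hypothetical pump rating, standing in for the attached specs
real_kw, kva = motor_electrical_demand(shaft_kw)
array_kwp = 12 * 0.635  # 12 Jinko 635 W panels = 7.62 kWp
print(f"{shaft_kw} kW shaft ≈ {shaft_kw * 1000 / HP_TO_W:.1f} hp; "
      f"input ≈ {real_kw:.2f} kW / {kva:.2f} kVA; array = {array_kwp:.2f} kWp")
```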
Early-2025 model generation:

| Model (Version) | GLUE | MMLU | HellaSwag | WinoGrande | BIG-bench | CQA | ARC-AGI-1 | HLE |
|---|---|---|---|---|---|---|---|---|
| ChatGPT-4o | 92% | 88.7% | 95.3% | 87.5% | 85% | 86% | ∼10% | 24.5% |
| Gemini-2.5 Pro | 94% | 88.9% | 96.2% | 91.0% | 90% | 89% | ∼5% | 21.6% |
| Grok-4 | 93% | 86.6% | 95.8% | 89.0% | 88% | 87% | 15.9%* | 25.4% |

Early-2026 model generation:

| Model (Version) | GLUE | MMLU | HellaSwag | WinoGrande | BIG-bench | CQA | ARC-AGI-2 | HLE |
|---|---|---|---|---|---|---|---|---|
| ChatGPT-5.2 | 94% | 88.4% | 96.1% | 90% | 91.2% | 89% | 54.2% | 36.6% |
| Gemini-3 Pro | 95% | 90.1% | 97.2% | 91% | 93.5% | 91% | 45.1% | 45.8% |
| Grok-4.1 | 92% | 86.6% | 95.8% | 88% | 88.0% | 87% | 16.0% | 30.0% |
3. Empirical Analysis of Diagnostic Prompts Across Experiments
3.1. Wheelchair Problem (Prompt 1)
3.1.1. Experimental Observations
- Initial Textual Reasoning and Constraint Recognition: In early responses, ChatGPT provided mechanically plausible descriptions but did not explicitly address the tension between handless operation and pedal-driven mobility (Figure 38). Recognition of this implicit constraint emerged only after iterative prompts that highlighted the mechanical challenge without explicitly stating it.
- Cross-Modal Divergence: Despite fluent textual explanations (Figure 35 and Figure 36), the generated visual outputs (Figure 39) frequently exhibited structural inconsistencies, such as disconnected pedals, missing chains, or track-like sprocket substitutions (Figure 37, Figure 40, Figure 41 and Figure 43). This highlights a persistent gap between declarative knowledge and its multimodal application.
- Sensitivity to Prompt Refinement: Even after explicit specification of chain links and sprocket orientation (Figure 42), visual outputs continued to deviate from mechanical correctness. While textual reasoning improved, the mismatch between textual and visual reasoning underscores modality-specific limitations in compositional generalization.
- Iterative Improvement and Residual Errors: Subsequent refinements produced incremental visual improvements (Figure 41) but did not achieve full alignment with mechanical plausibility, demonstrating persistent brittleness in LLM cross-modal reasoning.
3.2. Mechanical Integrity and Functional Composition (Prompt 2)
3.3. Geometric Transformations and Reflective Reasoning (Prompt 3)
3.4. Functional Handedness and Multimodal Disconnection (Prompt 4)
3.5. Symbolic Precision and Temporal Consistency (Prompts 5 & 6)
3.6. Elementary Quantitative Reasoning Under Constraint (Prompt 7)
| Trial | Final Answer | Consistency with Primality Definition |
|---|---|---|
| Initial response | Incorrect (17, 19) | Low |
| After explanation request | Correct reasoning, unstable conclusion | Partial |
| Repeated prompt | Incorrect answer | Low |
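Beyond the trial-level consistency tabulated above, the jug task of Prompt 7 can itself be checked mechanically. The sketch below is a minimal breadth-first search over textbook water-jug moves (fill from a source, empty, pour between vessels); under these standard operations every reachable volume is a multiple of 40 ml, so any valid answer to the prompt must go beyond the standard moves, and the search makes that constraint explicit.

```python
from collections import deque

def reachable_volumes(capacities_ml, vessel=0):
    """BFS over textbook water-jug moves: fill a vessel from the source,
    empty it, or pour one vessel into another. Returns every volume (ml)
    obtainable in `vessel`."""
    start = tuple(0 for _ in capacities_ml)
    seen, queue = {start}, deque([start])
    volumes = set()
    while queue:
        state = queue.popleft()
        volumes.add(state[vessel])
        successors = []
        for i, cap in enumerate(capacities_ml):
            successors.append(state[:i] + (cap,) + state[i + 1:])  # fill i
            successors.append(state[:i] + (0,) + state[i + 1:])    # empty i
            for j, cap_j in enumerate(capacities_ml):              # pour i -> j
                if i != j:
                    moved = min(state[i], cap_j - state[j])
                    s = list(state)
                    s[i] -= moved
                    s[j] += moved
                    successors.append(tuple(s))
        for s in successors:
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return volumes

# Prompt 7: a 3-litre jug and two 40 ml bottles.
vols = reachable_volumes((3000, 40, 40))
print(2500 in vols)                      # False
print(all(v % 40 == 0 for v in vols))    # True: only multiples of 40 ml arise
```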
3.7. Grid Counting and Pattern Recognition (Prompt 8)
3.8. Relational Complexity and Narrative Inference (Prompt 9)
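Prompt 9 is, at bottom, a small graph problem. The sketch below encodes only the relations stated in the prompt (with the father identified as John Smith, as the question implies) and enumerates relation paths between the two men; it deliberately does not assert the count the prompt expects.

```python
# Only the relations stated in Prompt 9; the expected count is left open.
FACTS = [
    ("Helen",      "mother_of",  "Marcy"),
    ("John Doe",   "husband_of", "Helen"),
    ("John Smith", "father_of",  "John Doe"),
    ("John Smith", "husband_of", "Marcy"),
]

def relation_paths(src, dst, facts, max_hops=3):
    """Enumerate simple relation paths from src to dst, traversing stated
    edges in both directions ('^-1' marks a traversal against edge direction)."""
    adj = {}
    for s, r, o in facts:
        adj.setdefault(s, []).append((r, o))
        adj.setdefault(o, []).append((r + "^-1", s))

    def dfs(node, visited, path):
        if node == dst and path:
            yield list(path)
            return
        if len(path) == max_hops:
            return
        for rel, nxt in adj.get(node, []):
            if nxt not in visited:
                yield from dfs(nxt, visited | {nxt}, path + [(rel, nxt)])

    return list(dfs(src, {src}, []))

for path in relation_paths("John Doe", "John Smith", FACTS):
    print(" -> ".join(f"[{rel}] {who}" for rel, who in path))
# [husband_of] Helen -> [mother_of] Marcy -> [husband_of^-1] John Smith
# [father_of^-1] John Smith
```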
3.9. Domain-Specific Engineering and Applied Logic (Prompt 10)
3.10. Compositional and Cross-Modal Insights
- Strong textual reasoning capabilities in isolation but persistent failures in multimodal integration.
- Reliance on learned patterns and high-probability templates rather than true structural understanding.
- Sensitivity to prompt design and iterative refinement, highlighting brittleness in reasoning generalization.
- The wheelchair problem (Prompt 1) exemplifies all these patterns most clearly, providing a coherent visual narrative from Figures 35 to 43.
4. Challenges of LLMs in Abstraction and Reasoning
| Task Category | Difficulty | ChatGPT-4o mini | Gemini-2.5 Flash | Grok-2 |
|---|---|---|---|---|
| Public Training Tasks | Easy | 38% | 35% | 48.0% |
| Public Evaluation Tasks | Hard | 9% | 8% | 22.0% |
| Semi-private Evaluation Tasks | Hard | 5% | 4% | 15.0% |
| Private Evaluation Tasks | Hard | 3% | 2.5% | 12.0% |
| Weighted Average | — | 15.2% | 13.8% | 29.6% |

| Task Category | Difficulty | ChatGPT-5.2 | Gemini-3 Pro | Grok-4.1 |
|---|---|---|---|---|
| Public Training Tasks | Easy | 94.5% | 92.0% | 88.6% |
| Public Evaluation Tasks | Hard | 58.2% | 48.4% | 34.2% |
| Semi-private Evaluation Tasks | Hard | 54.2% | 45.1% | 29.4% |
| Private Evaluation Tasks | Hard | 52.9% | 31.1% | 26.8% |
| Weighted Average | — | 64.9% | 54.2% | 44.8% |
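The generational jump the two tables document can be made explicit with a few lines; the per-category gains below are computed directly from the ChatGPT-4o mini and ChatGPT-5.2 columns (in percentage points).

```python
# Per-category gains in percentage points between the two generations above.
early = {"public_train": 38.0, "public_eval": 9.0,
         "semi_private": 5.0, "private": 3.0}    # ChatGPT-4o mini column
late  = {"public_train": 94.5, "public_eval": 58.2,
         "semi_private": 54.2, "private": 52.9}  # ChatGPT-5.2 column

gains = {k: round(late[k] - early[k], 1) for k in early}
print(gains)
# {'public_train': 56.5, 'public_eval': 49.2, 'semi_private': 49.2, 'private': 49.9}
```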
4.1. Solving ARC Puzzles with the Gemini Flash Model
4.2. Gemini Flash Experimental Setup
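The setup can be sketched as follows, assuming the public ARC task JSON format (train/test pairs of integer grids). The prompt builder and the transposition-based augmentation standing in for the "Additional Examples" column are our assumptions rather than a reproduction of the exact harness; sampling temperature was swept over the ranges shown in the batch tables below.

```python
import json
import random

def arc_prompt(task_path, extra_examples=0, rng=random):
    """Build a few-shot text prompt from a public ARC task file, whose JSON
    holds 'train' and 'test' lists of {'input': grid, 'output': grid} pairs.
    `extra_examples` mirrors the 'Additional Examples' column: here it appends
    transposed variants of training pairs as extra demonstrations."""
    with open(task_path) as f:
        task = json.load(f)
    lines = ["Each example maps an input grid to an output grid. Infer the rule."]
    demos = list(task["train"])
    for pair in demos:
        lines.append(f"Input: {pair['input']}\nOutput: {pair['output']}")
    for pair in rng.sample(demos, min(extra_examples, len(demos))):
        t_in = [list(row) for row in zip(*pair["input"])]    # transposed input
        t_out = [list(row) for row in zip(*pair["output"])]  # transposed output
        lines.append(f"Input: {t_in}\nOutput: {t_out}")
    lines.append(f"Input: {task['test'][0]['input']}\nOutput:")
    return "\n\n".join(lines)
```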
4.3. Gemini Flash Outputs and Performance Analysis
| Batch | Temp | Additional Examples | Total Attempted | Above Threshold | Solved 100% |
|---|---|---|---|---|---|
| batch-6 | 0.7-1.65 | 0 | 49 | 23 | 2 (4.08%) |
| batch-7 | 0.7-1.65 | 2 | 50 | 23 | 2 (4.00%) |
| batch-8 | 0.7-1.65 | 4 | 48 | 19 | 2 (4.17%) |
| batch-9 | 0.7-1.65 | 9 | 42 | 21 | 2 (4.76%) |

| Batch | Temp | Additional Examples | Total Attempted | Above Threshold | Solved 100% |
|---|---|---|---|---|---|
| batch-0 | 0.7-1.65 | 0 | 47 | 21 | 0 (0.00%) |
| batch-1 | 0.7-1.65 | 0+data | 50 | 21 | 1 (2.00%) |
| batch-2 | 0.7-1.25 | 2+data | 48 | 20 | 1 (2.08%) |
| batch-3 | 0.7-1.35 | 4+data | 45 | 21 | 0 (0.00%) |
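The "Above Threshold" column can be grounded in the Glossary's definition of Threshold. A minimal sketch, assuming plain list-of-lists grids:

```python
def similarity(pred, truth):
    """Normalized pixel-wise correspondence between two grids (lists of lists).
    Returns 0.0 when dimensions differ, since element-wise comparison is then
    not applicable (see the Glossary entry for 'Threshold')."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        return 0.0
    matches = [a == b for p_row, t_row in zip(pred, truth)
               for a, b in zip(p_row, t_row)]
    return sum(matches) / len(matches)

def margin_above_threshold(pred, truth, threshold):
    """Signed margin over the puzzle's threshold; the Glossary's example
    (83% similarity against a 78% threshold) would yield +0.05."""
    return similarity(pred, truth) - threshold

print(round(margin_above_threshold([[1, 2], [3, 4]], [[1, 2], [3, 5]], 0.70), 2))
# 0.05  (similarity 0.75 vs. threshold 0.70)
```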

4.4. Testing Introspective Verification Capabilities

5. Empirical Characterization of LLM Reasoning Breaks
5.1. The Computational Einstellung Effect and Pattern Fixation
- Mechanical Defaulting (Prompt 2): Despite the explicit requirement for a rear-steered tricycle, visual outputs consistently reverted to front-steering architectures. The model’s internal "prior" for vehicle topology—built on millions of images of standard tricycles—exerted a gravitational pull that suppressed the novel mechanical logic requested.
- Handedness Bias (Prompt 4): The failure to render left-handed scissors operation, even when the model textually acknowledged the constraint, illustrates a distributional capture. Because right-handedness is the statistically dominant representation in internet-scale data, the model is unable to "de-center" from this bias to perform a simple geometric inversion.
- Functional Mirroring (Prompt 3): When generating reflections, models often produced readable, non-inverted text. This suggests that the "object-level" representation of text is so strong that the model cannot apply the "transformation-level" logic of reflection, preferring the familiar pattern of legible characters over the physically accurate mirrored variant.
5.2. Inference-Time Stability in Mechanical Reasoning
5.3. Jagged Performance and Cross-Modal Inconsistency
5.4. Contextual Hallucinations and Cross-Turn Information Leakage
5.5. Discussion: Mechanisms of Reasoning Breakdowns
- The Self-Auditing Gap: A critical limitation identified across experiments is the absence of internal verification. Models produced structurally implausible outputs with high confidence, failing to engage error-detection mechanisms during the generative process. This suggests that self-correction is not inherently coupled to instruction following in current transformer-based architectures.
- Latent Activation vs. Strategy Acquisition: Performance gains observed through prompting or iterative correction do not appear to represent the acquisition of new reasoning strategies. Instead, they reflect the selective activation of latent behaviors already supported by the training distribution. When a task requires a novel logical shift—such as geometric inversion or non-standard vehicle topology—the model lacks the metacognitive flexibility to override its high-probability priors.
- Surface-Level Fluency and the Competence Mirage: The persistence of the Einstellung effect indicates that linguistic competence remains tightly coupled to training-time statistical frequencies. This creates a "competence mirage," where surface-level fluency masks a fundamental inability to perform grounded, self-regulating abstraction. Addressing these "reasoning breaks" likely requires moving beyond parameter scaling toward architectures that incorporate explicit internal verification and adaptive restructuring.
6. Evaluating Intelligence in LLMs: Observed Limitations and Future Perspectives
6.1. The Endogenous Evaluation Gap
6.2. Future Directions: Towards Resource-Efficient Autonomy
- Self-Auditing: The detection of internal inconsistencies or logical errors without external cues.
- Adaptive Strategy Generation: The construction of alternative reasoning paths when initial high-probability templates fail.
- Iterative Refinement: The execution of corrective logic within fixed computational and memory budgets.
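A minimal sketch of such a loop, with caller-supplied `generate` and `audit` callables (both assumptions on our part, since no concrete architecture is prescribed here):

```python
def bounded_self_refine(generate, audit, prompt, max_iters=3):
    """Generate-audit-refine under a fixed iteration budget. `audit` must flag
    errors *without* external hints, which is precisely the self-auditing
    capability at issue; it returns (ok, error_description)."""
    output = generate(prompt)
    for _ in range(max_iters):
        ok, error = audit(output)
        if ok:
            return output, True
        # Adaptive strategy generation: re-plan around the detected error
        # rather than re-sampling the same high-probability template.
        prompt = (f"{prompt}\n\nA previous attempt failed because: {error}. "
                  "Take a different approach that avoids this error.")
        output = generate(prompt)
    return output, False
```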
7. Conclusion
Declarations
Ethics Approval and Consent to Participate
Consent for Publication
Availability of Data and Materials
Competing Interests
Funding
Authors’ Contributions
Acknowledgements
Glossary
| Term | Definition |
|---|---|
| Above Threshold | Denotes the degree to which the model's predicted output matrix exceeds a predefined similarity benchmark relative to the true output matrix. |
| SotA LLMs | State-of-the-Art Large Language Models, such as ChatGPT, Gemini, and Grok. These models represent the most advanced implementations of transformer-based neural architectures currently available. |
| Threshold | The minimum acceptable similarity score (measured as normalized pixel-wise correspondence) required for meaningful alignment between prediction and ground truth. For example, if the threshold for a given puzzle is 78%, a model output achieving 83% similarity is recorded as 5% above threshold. This metric applies only to puzzles where the input and output matrices share identical dimensions; for all other cases involving dimension changes or structural transformations, the threshold value is set to zero, as direct element-wise comparison is not applicable. |
References
- Evans, O.; Berglund, L.; Tong, M.; Kaufmann, M.; et al. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv 2023, arXiv:2309.12288v4. Available online: https://arxiv.org/abs/2309.12288v4.
- Nezhurina, M.; Cipolina-Kun, L.; Cherti, M.; Jitsev, J. Alice in Wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv 2024, arXiv:2406.02061v4. Available online: https://arxiv.org/abs/2406.02061v4.
- Chollet, F. On the measure of intelligence. arXiv 2019, arXiv:1911.01547v2. Available online: https://arxiv.org/abs/1911.01547v2.
- Dziri, N.; Lu, X.; Sclar, M.; Li, X. L.; Jiang, L.; et al. Faith and fate: Limits of transformers on compositionality. arXiv 2023, arXiv:2305.18654. Available online: https://arxiv.org/abs/2305.18654v3.
- Du, M.; He, F.; Zou, N.; Tao, D.; Hu, X. Shortcut learning of large language models in natural language understanding. arXiv 2022, arXiv:2208.11857v2. Available online: https://arxiv.org/abs/2208.11857.
- AINumbat. "SotA LLM Limitations (Examples Repository)," GitHub, 2024. Available online: https://github.com/ainumbat/llm_eval_notes.git.
- Chollet, F. "Talk at AGI Conference, ARC Prize," YouTube, 2024. Available online: https://www.youtube.com/watch?v=nL9jEy99Nh0&t=1450s.
- Li, X. L.; Kuncoro, A.; Hoffmann, J.; et al. A systematic investigation of commonsense knowledge in large language models. arXiv 2022, arXiv:2111.00607. Available online: https://arxiv.org/abs/2111.00607v3.
- Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682. Available online: https://arxiv.org/abs/2206.07682v2.
- Wei, J.; Wang, X.; Schuurmans, D.; et al. Chain-of-thought prompting elicits reasoning in large language models. arXiv 2022, arXiv:2201.11903v6. Available online: https://arxiv.org/abs/2201.11903v6.
- Yin, Z.; Sun, Q.; Guo, Q.; et al. Do large language models know what they don't know? arXiv 2023, arXiv:2305.18153v2. Available online: https://arxiv.org/abs/2305.18153v2.
- Turpin, M.; Michael, J.; Perez, E.; Bowman, S. R. Language models don't always say what they think: Unfaithful explanations in chain-of-thought. arXiv 2023, arXiv:2305.04388v2. Available online: https://arxiv.org/abs/2305.04388v2.
- Wenzel, G.; Jatowt, A. An overview of temporal commonsense reasoning and acquisition. arXiv 2023, arXiv:2308.00002. Available online: https://arxiv.org/abs/2308.00002v3.
- Chollet, F.; Knoop, M.; Kamradt, G.; Landers, B. ARC Prize 2024: Technical report. arXiv 2024, arXiv:2412.04604. Available online: https://arxiv.org/abs/2412.04604v2.
- Zhao, J.; Tong, J.; Mou, Y.; et al. Exploring the compositional deficiency of large language models in mathematical reasoning through trap problems. arXiv 2024, arXiv:2405.06680v4. Available online: https://arxiv.org/abs/2405.06680v4.
- Bennett, M. T. Is complexity an illusion? arXiv 2024, arXiv:2404.07227. Available online: https://arxiv.org/abs/2404.07227v4.
- Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. Available online: https://arxiv.org/abs/2005.14165v4.
- Banerjee, S.; Agarwal, A.; Singla, S. LLMs will always hallucinate, and we need to live with this. arXiv 2024, arXiv:2409.05746. Available online: https://arxiv.org/abs/2409.05746v1.
- Herrmann, M.; Lange, J. D.; Eggensperger, K.; et al. Position: Why we must rethink empirical research in machine learning. arXiv 2024, arXiv:2405.02200v2. Available online: https://arxiv.org/abs/2405.02200v2.
- Wu, Z.; Qiu, L.; Ross, A.; et al. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. arXiv 2023, arXiv:2307.02477v3. Available online: https://arxiv.org/abs/2307.02477v3.
- Akyürek, E.; Damani, M.; Qiu, L.; et al. The surprising effectiveness of test-time training for abstract reasoning. arXiv 2024, arXiv:2411.07279. Available online: https://arxiv.org/abs/2411.07279v1.
- Rahman, M. N. H.; Son, S.-H. Feature transforms for image data augmentation. Neural Computing and Applications 2022, 34, 16141–16160.
- Kim, Y.-H.; Ahn, J.-M.; Jang, S.-H.; Kim, S.-K.; Kim, H.-K. Data augmentation method by applying color perturbation of inverse PSNR and geometric transformations for object recognition based on deep learning. Applied Sciences 2020, 10, 3755.
- Chang, T. A.; Bergen, B. K. Language model behavior: A comprehensive survey. arXiv 2023, arXiv:2303.11504v2. Available online: https://arxiv.org/abs/2303.11504v2.
- Dennett, D. C. The Role of Language in Intelligence. In Brainstorms: Philosophical Essays on Mind and Psychology; De Gruyter, 2013.
- OpenAI. ChatGPT. 2023. Available online: https://chat.openai.com.
- Google DeepMind. Gemini. 2024. Available online: https://deepmind.google/technologies/gemini.
- xAI. Grok. 2024. Available online: https://x.ai.
- DeepSeek. DeepSeek Language Model. 2024. Available online: https://deepseek.com.
- Zhao, H.; Yang, F.; Lakkaraju, H.; Du, M. Towards uncovering how large language model works: An explainability perspective. arXiv 2024, arXiv:2402.10688v2. Available online: https://arxiv.org/abs/2402.10688.
- Chollet, F.; Knoop, M.; Kamradt, G.; Landers, B. ARC-AGI-2: A new challenge for frontier AI reasoning systems. arXiv 2025, arXiv:2505.11831.
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; et al. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361.
- Legg, S.; Hutter, M. Universal intelligence: A definition of machine intelligence. Minds and Machines 2007, 17, 391–444.
- Deutsch, D. Constructor theory. Synthese 2015, 190, 4331–4359.
- Minsky, M. The Society of Mind; Simon & Schuster, 1986.
- Yampolskiy, R. V. Artificial intelligence safety engineering: Why machine ethics is a wrong approach. In Philosophy and Theory of Artificial Intelligence; Springer, 2015.
- Schmidhuber, J. Gödel machines: Fully self-referential optimal universal self-improvers. arXiv 2007, arXiv:0705.1865v3. Available online: https://arxiv.org/abs/0705.1865.
- Hutter, M. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability; Springer: Berlin, Germany, 2005.
- Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Papernot, N.; Anderson, R.; Gal, Y. AI models collapse when trained on recursively generated data. Nature 2024, 631, 755–759.
- Luchins, A. S. Mechanization in problem solving: The effect of Einstellung. Psychological Monographs 1942, 54.
Author Biographies
Rashid Mehmood is an independent researcher specializing in artificial intelligence, machine learning, and full-stack system development. His work focuses on improving reasoning, adaptability, and test-time learning in AI systems, with the broader goal of advancing paths toward Artificial General Intelligence (AGI). He has developed lightweight, resource-efficient algorithms and adaptive assistants designed to reduce catastrophic forgetting and enhance real-time inference. Recently, he demonstrated that strong generalization can be achieved from extremely sparse data, achieving over 80% accuracy on MNIST using only 1% of the training set. His research continues to explore efficient learning, abstraction, and dynamic knowledge recalibration.

Dr. Eid Rehman is currently serving as an Assistant Professor of Computer Science at the University of Mianwali, Pakistan. He earned his Ph.D. in Computer Science from the International Islamic University, Islamabad, in 2018. Throughout his academic and research career, Dr. Rehman has made significant contributions to the fields of Artificial Intelligence, Large Language Models (LLMs), and Information Security. His passion for advancing knowledge in emerging technologies is reflected in his prolific research record, having authored and co-authored more than 25 research papers published in well-reputed national and international journals. Dr. Rehman's research work bridges theory and practical application, contributing valuable insights to cutting-edge areas critical to today's technological advancements. He remains actively engaged in research, mentoring students, and participating in collaborative projects to foster innovation and excellence in Computer Science. His commitment to academic excellence and research innovation continues to inspire the next generation of computer scientists at the University of Mianwali and beyond.

Dr. Muhammad Habib received his Ph.D. in Computer Science from International Islamic University Islamabad, Pakistan, in 2018. His research interests include Computer Vision, Machine Learning, Deep Learning, Generative AI, and Agentic AI. He has published numerous research papers in reputable journals, contributing significantly to advancements in intelligent systems and AI-driven technologies. His work focuses on developing innovative algorithms and methodologies to enhance machine perception and automation.
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).