Submitted: 19 January 2026
Posted: 20 January 2026
Abstract
Keywords:
1. Introduction
1. PRIME-Bench: The Most Comprehensive Algorithmic Reasoning Benchmark. We introduce PRIME-Bench, comprising 86 tasks across 12 categories with 51,600 total instances, the largest and most comprehensive benchmark for evaluating LLM algorithmic reasoning to date. PRIME-Bench spans 28 sorting algorithms, 8 automata types (including Turing machines and PDAs), 6 theorem-proving tasks, and 8 real-world system simulations, providing unprecedented coverage of computational complexity, with per-instance step counts ranging from 500 to over 1,000,000. The benchmark is 5–10× larger than existing algorithmic reasoning benchmarks such as BIG-Bench, GSM8K, and MATH, and uniquely includes execution-trace verification requiring sustained state tracking over extended sequences.
2. Structured Prompting Analysis. Through systematic evaluation on the N-Queens problem domain, we demonstrate that structured prompt engineering can yield transformative improvements, with accuracy increasing from 37.4% to 90.0% (a relative gain of 140.6%) while incurring an acceptable latency overhead of 1.56×. These insights inform the design of our PRIME framework, which achieves even larger gains (26.8% to 93.8%) across the full PRIME-Bench benchmark.
3. Scale-Sensitivity Characterization. We characterize the nuanced relationship between model scale and prompt sensitivity, revealing that smaller models exhibit substantially larger relative gains from prompt optimization (244.9% for 8B vs. 66.8% for 120B), with important implications for resource-efficient deployment.
4. PRIME Framework: A Novel Multi-Agent Reasoning Architecture. We introduce PRIME (Policy-Reinforced Iterative Multi-agent Execution), the first framework to unify multi-agent decomposition, reinforcement-learning-based policy optimization via Group Relative Policy Optimization (GRPO), and iterative constraint verification within a single coherent architecture. Unlike prior approaches that address individual components in isolation, PRIME's synergistic integration enables breakthrough performance: 93.8% average accuracy across 86 diverse algorithmic tasks, representing a 250.0% improvement over baseline approaches. PRIME achieves near-perfect performance (>95%) on 11 of 12 task categories, including tasks where vanilla LLMs fail catastrophically (Turing machine simulation: 8.9% → 92.4%).
2. Related Work
2.1. Transformer Architecture and Language Modeling
2.2. LLM Reasoning and Benchmarking
2.3. Prompt Engineering and Optimization
2.4. Scaling Laws and Model Capacity
2.5. Multi-Agent LLM Systems and Reinforcement Learning
3. Methods
3.1. The PRIME Framework

3.1.1. Multi-Agent Architecture
- Executor Agent: The executor is responsible for step-by-step constructive reasoning. At each time step $t$, given the problem context $c$ and execution history $h_{t-1}$, it samples an action from the policy, $a_t \sim \pi_\theta(\cdot \mid s_t, c, h_{t-1})$, where $s_t$ represents the current state. This probabilistic formulation allows the system to explore the solution space rather than committing prematurely to a greedy path.
- Verifier Agent: To prevent error propagation, the verifier provides immediate feedback on state validity. It evaluates the current state $s_t$ against the constraint set $\{c_1, \dots, c_m\}$ to compute a weighted violation score $V(s_t) = \sum_{j=1}^{m} w_j \,\mathbb{1}\!\left[s_t \text{ violates } c_j\right]$, where $w_j$ denotes the severity weight of the j-th constraint. This agent is trained via process supervision [53] to provide dense reward signals rather than sparse terminal feedback.
- Coordinator Agent: The coordinator acts as the control logic, dynamically switching between generation and correction modes. Unlike static execution chains, the coordinator implements a decision policy driven by the verification feedback: a step is committed when its violation score is zero, locally repaired when the score falls below a threshold $\tau$, and discarded (with backtracking) otherwise. This explicit logic enables the system to perform local repairs on minor errors while pruning fundamentally invalid paths before they corrupt the context window; a minimal sketch of the resulting control loop is given below.
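The interaction between the three agents can be summarized as a simple control loop. The sketch below is illustrative only: `propose_step`, `violation_score`, `problem.repair`, and the threshold `tau` are hypothetical stand-ins for the executor policy, verifier scoring, and coordinator logic described above, not the released implementation.

```python
def prime_control_loop(problem, propose_step, violation_score, tau=0.3, max_iters=5):
    """Illustrative executor/verifier/coordinator loop (interfaces are hypothetical).

    propose_step(state, history) -> candidate next state (executor)
    violation_score(state)       -> weighted constraint-violation score >= 0 (verifier)
    tau                          -> coordinator repair/backtrack threshold
    """
    state, history = problem.initial_state(), []
    while not problem.is_terminal(state):
        accepted = False
        for _ in range(max_iters):                    # coordinator retry budget
            candidate = propose_step(state, history)  # executor samples a step
            score = violation_score(candidate)        # verifier scores constraints
            if score == 0.0:                          # valid step: commit it
                history.append(candidate)
                state, accepted = candidate, True
                break
            if score < tau:                           # minor violation: try local repair
                repaired = problem.repair(candidate)
                if violation_score(repaired) == 0.0:
                    history.append(repaired)
                    state, accepted = repaired, True
                    break
            # severe violation: discard the candidate and resample
        if not accepted:                              # prune this path entirely
            if not history:
                break
            state = history.pop()                     # backtrack one committed step
    return state, history
```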
3.1.2. Group Relative Policy Optimization (GRPO)
3.1.3. Composite Reward Modeling
3.1.4. Two-Stage Fine-Tuning Strategy
3.1.5. Iterative Execution Protocol
Algorithm 1: PRIME Iterative Execution Protocol
3.2. Task Formulation: The N-Queens Problem
3.2.1. Constraint Specification
3.3. Prompt Engineering
3.3.1. Baseline Prompt
3.3.2. Optimized Prompt
- Constraint Enumeration: We explicitly state the three constraint types (row, column, diagonal) to prime the attention mechanism on the relevant logical rules.
- Verification Procedure: A mandated step-by-step check where the model must validate a candidate column $c$ for row $r$ against all previously placed queens $(r_i, c_i)$, rejecting the candidate if $c = c_i$ or $|c - c_i| = |r - r_i|$ for any placed queen (a code sketch of this check follows the list).
- Format Specification: Strict output formatting is enforced to separate reasoning traces from the final answer, reducing parsing errors.
- Worked Examples: We include few-shot demonstrations to illustrate the verification pattern. Following recent findings, these examples serve to convey the reasoning structure rather than merely as memorization targets [39].
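To make the verification procedure concrete, the sketch below checks a candidate column against previously placed queens using the row/column/diagonal rules enumerated above (illustrative code, not the exact prompt logic).

```python
def is_safe(placed: list[int], row: int, col: int) -> bool:
    """Check a candidate queen at (row, col) against queens already placed.

    placed[r] is the column of the queen in row r (one queen per row,
    so the row constraint holds by construction).
    """
    for r, c in enumerate(placed):
        if c == col:                      # same column
            return False
        if abs(c - col) == abs(r - row):  # same diagonal
            return False
    return True

# Example: with queens at (0, 1) and (1, 3), column 0 is safe for row 2
assert is_safe([1, 3], row=2, col=0)
assert not is_safe([1, 3], row=2, col=2)  # diagonal conflict with the queen at (1, 3)
```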
3.4. Experimental Setup
3.4.1. Model Selection
3.4.2. Evaluation Protocol
- Accuracy: The strict exact-match rate between the predicted column assignment $\hat{y}_i$ and the ground truth $y_i$: $\text{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$.
- Relative Improvement ($\Delta_{\text{rel}}$): To quantify the marginal benefit of structured prompting, we compute $\Delta_{\text{rel}} = \frac{\text{Acc}_{\text{opt}} - \text{Acc}_{\text{base}}}{\text{Acc}_{\text{base}}} \times 100\%$.
- Latency Overhead: We measure the wall-clock computational cost as the ratio of optimized to baseline latency, $T_{\text{opt}} / T_{\text{base}}$, where total time accounts for input processing, generation, and system overhead.
- Scale Sensitivity (r): We characterize the relationship between model capacity and reasoning accuracy using the Pearson correlation coefficient between log-transformed parameter counts and performance: $r = \frac{\sum_i (\log p_i - \overline{\log p})(\text{Acc}_i - \overline{\text{Acc}})}{\sqrt{\sum_i (\log p_i - \overline{\log p})^2}\,\sqrt{\sum_i (\text{Acc}_i - \overline{\text{Acc}})^2}}$. The metric computations are sketched in code below.
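A minimal sketch of the four metrics; the function names are ours, not taken from the released evaluation code.

```python
import math

def accuracy(preds, targets):
    """Strict exact-match accuracy."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def relative_improvement(acc_base, acc_opt):
    """Relative gain of the optimized prompt over the baseline, in percent."""
    return 100.0 * (acc_opt - acc_base) / acc_base

def latency_overhead(t_base, t_opt):
    """Wall-clock ratio of optimized to baseline latency."""
    return t_opt / t_base

def pearson_r(xs, ys):
    """Pearson correlation; here xs would be log10(parameter counts)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Example with the paper's headline N-Queens numbers
print(relative_improvement(37.4, 90.0))   # ≈ 140.6 (%)
print(latency_overhead(1.0, 1.56))        # 1.56×
```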
3.4.3. Statistical Significance
4. Experiments
4.1. Overall Performance Improvement
4.2. Scaling Analysis
4.3. Performance Across Problem Difficulty
4.4. Latency Analysis
4.5. Comparative Analysis
4.6. Detailed Per-N Analysis
4.7. Error Analysis
4.8. Ablation Studies
4.9. Cross-Task Generalization: Comprehensive Benchmark
4.9.1. Benchmark Design Principles
4.9.2. Task Category Overview
4.9.3. Results: Category-Level Analysis
4.9.4. Results: Top Improvements
4.9.5. Detailed Category Results
4.9.6. Statistical Analysis
5. Discussion
5.1. The Efficacy of Structured Prompting
5.2. Scale-Sensitivity Dynamics
5.3. Problem Complexity Scaling
5.4. Implications for LLM Deployment
5.5. Limitations and Future Work
5.6. Theoretical Implications
5.7. Connections to Cognitive Science
5.8. Practical Recommendations
6. Conclusion
Appendix A. Complete Task Specifications
Appendix A.1. Benchmark Overview
| ID | Category | Tasks | Instances | Max Steps | Primary Cognitive Challenge |
|---|---|---|---|---|---|
| 1 | Comparison-based Sorting | 15 | 9,000 | — | Long-horizon state tracking |
| 2 | Non-comparison Sorting | 3 | 1,800 | — | Distribution-aware reasoning |
| 3 | Advanced/Hybrid Sorting | 10 | 6,000 | — | Adaptive strategy selection |
| 4 | Graph Traversal | 6 | 3,600 | — | Path memory and cycle detection |
| 5 | Tree Data Structures | 5 | 3,000 | — | Hierarchical state management |
| 6 | Classic Algorithm Puzzles | 6 | 3,600 | — | Constraint satisfaction |
| 7 | Automata & State Machines | 8 | 4,800 | — | Transition precision |
| 8 | String & Pattern Matching | 5 | 3,000 | — | Pattern recognition accuracy |
| 9 | Mathematical & Numerical | 8 | 4,800 | — | Arithmetic precision |
| 10 | Logic & Theorem Proving | 6 | 3,600 | — | Formal reasoning chains |
| 11 | Data Structure Operations | 6 | 3,600 | — | Sequential operation tracking |
| 12 | System Simulation | 8 | 4,800 | — | Multi-component state evolution |
| — | Total | 86 | 51,600 | — | — |
Appendix A.2. Category 1: Comparison-based Sorting
| ID | Algorithm | Time | Space | Stable | n Range | Key Invariant |
|---|---|---|---|---|---|---|
| 1.1 | Bubble Sort | $O(n^2)$ | $O(1)$ | Yes | 8–25 | Adjacent swap propagation |
| 1.2 | Selection Sort | $O(n^2)$ | $O(1)$ | No | 8–25 | Minimum selection per pass |
| 1.3 | Insertion Sort | $O(n^2)$ | $O(1)$ | Yes | 8–25 | Sorted prefix maintenance |
| 1.4 | Shell Sort | $O(n^{3/2})$ | $O(1)$ | No | 16–256 | Gap-indexed h-sorting |
| 1.5 | Merge Sort | $O(n \log n)$ | $O(n)$ | Yes | 8–128 | Recursive divide-merge |
| 1.6 | Quick Sort | $O(n \log n)$ avg | $O(\log n)$ | No | 8–128 | Pivot-based partitioning |
| 1.7 | Heap Sort | $O(n \log n)$ | $O(1)$ | No | 8–128 | Max-heap property |
| 1.8 | Tree Sort | $O(n \log n)$ avg | $O(n)$ | Yes | 8–64 | BST inorder traversal |
| 1.9 | Cocktail Shaker | $O(n^2)$ | $O(1)$ | Yes | 8–25 | Bidirectional bubbling |
| 1.10 | Comb Sort | $O(n^2)$ | $O(1)$ | No | 16–128 | Shrinking gap factor |
| 1.11 | Gnome Sort | $O(n^2)$ | $O(1)$ | Yes | 8–20 | Garden gnome positioning |
| 1.12 | Odd-Even Sort | $O(n^2)$ | $O(1)$ | Yes | 8–32 | Alternating index parity |
| 1.13 | Pancake Sort | $O(n^2)$ | $O(1)$ | No | 6–12 | Prefix reversal only |
| 1.14 | Cycle Sort | $O(n^2)$ | $O(1)$ | No | 8–20 | Optimal write count |
| 1.15 | Stooge Sort | $O(n^{2.71})$ | $O(\log n)$ | No | 8–16 | Overlapping thirds recursion |
Appendix A.2.1. Formal Task Definitions
Each task is formally specified by:
1. Algorithm specification defining the step-by-step procedure
2. Input space for each task-specific size set
3. Output space comprising the sorted permutation and execution trace
4. Complexity bounds specifying worst-case and average-case step counts
5. Verification predicate for correctness assessment
| Task | Input Specification | Output Specification | Instances |
|---|---|---|---|
| Bubble Sort | Array , , | Sorted array where ; trace of all comparisons and swaps | 600 |
| Selection Sort | Array A with same constraints; 20% contain duplicates | Sorted array with selection indices per iteration | 600 |
| Merge Sort | Array A, (powers of 2) | Sorted array with recursive call tree and merge sequences | 600 |
| Quick Sort | Array A, ; adversarial cases filtered | Sorted array with pivot selections and partition boundaries | 600 |
| Heap Sort | Array A, | Sorted array with heap construction and extraction phases | 600 |
Appendix A.2.2. Instance Generation Protocol
Algorithm A1: Sorting Task Instance Generation
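Algorithm A1's listing is not reproduced above; the sketch below illustrates the kind of seed-deterministic generation the protocol describes (base seed 42, per Appendix A.16). Parameter names and the duplicate-injection scheme are illustrative assumptions, apart from the 20% duplicate rate stated in the task table.

```python
import random

def generate_sorting_instance(task_id: int, instance_idx: int,
                              n_range=(8, 25), value_range=(1, 999),
                              duplicate_rate=0.2, base_seed=42):
    """Deterministically generate one sorting-task instance (illustrative sketch).

    Each instance is fully determined by (base_seed, task_id, instance_idx),
    so any research group can regenerate the exact benchmark inputs.
    """
    rng = random.Random(base_seed * 1_000_003 + task_id * 10_007 + instance_idx)
    n = rng.randint(*n_range)
    if rng.random() < duplicate_rate:
        # draw from a small pool so duplicate values are likely
        pool = [rng.randint(*value_range) for _ in range(max(2, n // 2))]
        array = [rng.choice(pool) for _ in range(n)]
    else:
        array = rng.sample(range(value_range[0], value_range[1] + 1), n)
    return {"task_id": task_id, "instance": instance_idx,
            "input": array, "ground_truth": sorted(array)}

print(generate_sorting_instance(task_id=1, instance_idx=0))
```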
Appendix A.2.3. Illustrative Execution Traces
| Pass | Step | Compare | Action | State |
|---|---|---|---|---|
| 1 | 1 | Swap | ||
| 1 | 2 | Swap | ||
| 1 | 3 | Swap | ||
| 2 | 1 | Swap | ||
| 2 | 2 | Swap | ||
| 3 | 1 | Swap | ||
| 4 | — | No swaps | Terminate | |
| Total | 10 comparisons | 6 swaps | Sorted | |
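The bubble-sort trace format above can be produced mechanically; a minimal generator is shown below (illustrative, not the benchmark's exact serialization).

```python
def bubble_sort_trace(a):
    """Return (sorted array, trace); each trace row mirrors the table above:
    (pass, step, compared pair, action, array state after the step)."""
    a = list(a)
    trace = []
    n = len(a)
    for p in range(1, n):
        swapped = False
        for step, i in enumerate(range(n - p), start=1):
            pair = (a[i], a[i + 1])
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                action, swapped = "Swap", True
            else:
                action = "No swap"
            trace.append((p, step, pair, action, list(a)))
        if not swapped:          # early-termination pass with no swaps
            break
    return a, trace

sorted_a, trace = bubble_sort_trace([5, 3, 8, 1])
for row in trace:
    print(row)
```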
| Depth | Subproblem | Operation | Result |
|---|---|---|---|
| 0 | Divide | ||
| 1 | Divide | ||
| 2 | Base case | ||
| 2 | Base case | ||
| 1 | Merge | ||
| 1 | Divide | ||
| 2 | Merge | ||
| 0 | Merge | ||
| Total Operations | 5 comparisons | ||
Appendix A.3. Category 2: Non-comparison Sorting
| Algorithm | Time | Space | n Range | Instances | Constraint | Key Challenge |
|---|---|---|---|---|---|---|
| Counting Sort | $O(n + k)$ | $O(n + k)$ | 100–5000 | 600 | Bounded key range $k$ | Maintaining stability through cumulative counts |
| Radix Sort | $O(d\,(n + k))$ | $O(n + k)$ | 100–1000 | 600 | d-digit integers | Digit extraction and stable per-digit sorting |
| Bucket Sort | $O(n + k)$ avg | $O(n + k)$ | 100–1000 | 600 | Uniform key distribution | Uniform distribution assumption and bucket overflow handling |
Appendix A.4. Category 3: Advanced/Hybrid Sorting
| Algorithm | Best | Worst | n Range | Adaptive Strategy |
|---|---|---|---|---|
| Timsort [71] | $O(n)$ | $O(n \log n)$ | 64–512 | Natural run detection + galloping merge |
| Introsort [72] | $O(n \log n)$ | $O(n \log n)$ | 64–512 | Quicksort → Heapsort beyond depth $2\lfloor\log_2 n\rfloor$ |
| Patience Sort | $O(n)$ | $O(n \log n)$ | 32–128 | Pile-based LIS extraction |
| Strand Sort | $O(n)$ | $O(n^2)$ | 32–128 | Iterative sorted strand extraction |
| Bitonic Sort [73] | $O(n \log^2 n)$ | $O(n \log^2 n)$ | 16–64 | Parallel-friendly bitonic sequences |
| Batcher Odd-Even | $O(n \log^2 n)$ | $O(n \log^2 n)$ | 16–64 | Merge network with $O(\log^2 n)$ depth |
| Library Sort | $O(n)$ | $O(n^2)$ | 64–256 | Gapped insertion with rebalancing |
| Smoothsort | $O(n)$ | $O(n \log n)$ | 64–256 | Leonardo heap for near-sorted input |
| Block Sort | $O(n)$ | $O(n \log n)$ | 64–256 | In-place stable via block rotation |
| Tournament Sort | $O(n \log n)$ | $O(n \log n)$ | 32–128 | Winner tree for selection |
Appendix A.5. Category 4: Graph Traversal Algorithms
| Algorithm | Time | Space | Range | Bound | Output Requirements |
|---|---|---|---|---|---|
| DFS on Tree [74] | $O(V + E)$ | $O(V)$ | 50–1000 | — | Discovery/finish times, traversal order |
| BFS on Graph | $O(V + E)$ | $O(V)$ | 20–200 | — | Level assignments, BFS tree |
| Dijkstra [75] | $O((V + E)\log V)$ | $O(V)$ | 20–100 | — | Distance array, predecessor pointers |
| A* Pathfinding [76] | $O(b^d)$ | $O(b^d)$ | Grid 10–30 | — | Optimal path, f-score evolution |
| Floyd-Warshall [77] | $O(V^3)$ | $O(V^2)$ | 8–25 | Dense | All-pairs distance matrix |
| Topological Sort [78] | $O(V + E)$ | $O(V)$ | 20–200 | — | Valid ordering, in-degree trace |
| Iter | Extract | Relaxations | Dist Array |
|---|---|---|---|
| 0 | Init | — | |
| 1 | A (0) | , | |
| 2 | B (3) | , | |
| 3 | D (5) | , , | |
| 4 | C (7) | ||
| 5 | E (8) | — |
Appendix A.6. Category 5: Tree Data Structure Operations
| Task | Time Complexity | n Range | Instances | Key Challenge |
|---|---|---|---|---|
| BST Insertion | $O(\log n)$ avg | 10–100 | 600 | Path tracking per insertion with balance monitoring |
| BST Inorder | $O(n)$ | 10–100 | 600 | Iterative stack management without recursion |
| RB-Tree Insert [79] | $O(\log n)$ | 5–50 | 600 | Rotation case identification and recoloring propagation |
| Huffman Tree [80] | $O(n \log n)$ | 8–50 | 600 | Priority queue merging with frequency tracking |
| Binary Heap Ops | $O(\log n)$ per op | 20–200 ops | 600 | Heapify correctness after each insert/extract operation |
Appendix A.7. Category 6: Classic Algorithm Puzzles
| Puzzle | Optimal Steps | Param Range | Instances | Constraint Type |
|---|---|---|---|---|
| Tower of Hanoi | $2^n - 1$ | — | 600 | No larger disk on smaller; single disk moves |
| N-Queens | Varies | — | 600 | No two queens share row, column, or diagonal |
| Blind Maze | Path length | Grid 10–30 | 600 | Navigate without visual feedback |
| Logic Grid (Zebra) | Deduction steps | 4–6 entities | 600 | Clue-based constraint propagation |
| Sudoku | Fill count | 17–35 givens | 600 | Row, column, box uniqueness |
| 24-Game Extended | Expression length | 4–10 numbers | 600 | Use each number exactly once |
1. All $n-1$ smaller disks must first be on the auxiliary peg (requiring $2^{n-1} - 1$ moves)
2. The largest disk moves to the destination (1 move)
3. All $n-1$ smaller disks must move from the auxiliary peg to the destination (requiring $2^{n-1} - 1$ moves)
Summing the three phases gives the recurrence $T(n) = 2T(n-1) + 1$, hence the optimal count $T(n) = 2^n - 1$; a recursive move generator is sketched below.
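The three-phase decomposition translates directly into the standard recursive move generator; the sketch below also confirms the $2^n - 1$ optimal step count (peg names are illustrative).

```python
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Yield the optimal move sequence for n disks, following the decomposition
    above: n-1 disks to the auxiliary peg, the largest disk, then n-1 to the goal."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, dst, aux)   # phase 1: n-1 disks to auxiliary
    yield (n, src, dst)                            # phase 2: largest disk
    yield from hanoi_moves(n - 1, aux, src, dst)   # phase 3: n-1 disks to destination

moves = list(hanoi_moves(4))
assert len(moves) == 2**4 - 1                      # 15 moves for n = 4
print(moves[:3])   # [(1, 'A', 'B'), (2, 'A', 'C'), (1, 'B', 'C')]
```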
Appendix A.8. Category 7: Automata & State Machines
| Model | States | Input Length | Instances | Verification Requirement |
|---|---|---|---|---|
| DFA Simulation | 5–20 | 100–10000 | 600 | State sequence matches transition function |
| NFA Simulation | 10–30 | 50–1000 | 600 | Correct -closure computation |
| PDA Execution [81] | 5–15 | 20–500 | 600 | Valid stack operations per transition |
| Turing Machine [82] | 5–20 | 10–100 | 600 | Tape modifications and head movements |
| Register Machine | 2–4 regs | 10–50 instr | 600 | Correct increment/decrement/jump |
| Petri Net [83] | 5–20 places | 50–200 firings | 600 | Token conservation per transition |
| Cellular Automaton [84] | 50–200 cells | 100–1000 gen | 600 | Rule application to each cell |
| Markov Chain | 5–10 states | 100–1000 | 600 | Probabilistic transition accuracy |
Appendix A.9. Category 8: String & Pattern Matching
| Algorithm | Time Complexity | Input Size | Instances | Required Output |
|---|---|---|---|---|
| KMP [85] | 600 | Complete failure function array and all match positions | ||
| Regex NFA | 600 | NFA state construction and simulation trace | ||
| CFG Derivation | Varies | Depth | 600 | Leftmost derivation sequence with production rules |
| Translation Chain | 3–10 langs | 600 | Per-language intermediate output with transformation steps | |
| ASCII Art Parse | 80×40 | 600 | Object identification and edge extraction coordinates |
Appendix A.10. Category 9: Mathematical & Numerical
| Algorithm | Complexity | Size Range | Instances | Precision Requirement |
|---|---|---|---|---|
| Long Division | 20–60 digits | 600 | Exact integer quotient and remainder | |
| Matrix Multiplication | to | 600 | Exact element computation | |
| Gaussian Elimination | 3–8 variables | 600 | Rational arithmetic, pivot selection | |
| GCD Euclidean | up to | 600 | Bezout coefficients | |
| Simplex Method [86] | Varies | 3–6 vars | 600 | Tableau pivot sequence |
| Polynomial GCD | Degree | 600 | Polynomial division steps | |
| Continued Fraction | 10–100 terms | 600 | Convergent computation | |
| Symbolic Diff | Depth | 600 | Correct derivative rules |
Appendix A.11. Category 10: Logic & Theorem Proving
| Task | Variables | Clauses | Instances | Technique | Verification Requirement |
|---|---|---|---|---|---|
| SAT/DPLL [87] | 10–100 | 20–400 | 600 | Unit propagation, branching | Complete decision trace with backtracking |
| Resolution [88] | 10–50 | 20–100 | 600 | Refutation proof | Valid resolution steps to empty clause |
| Unification | — | — | 600 | MGU computation | Most general unifier correctness |
| Type Inference [89] | 10–50 nodes | — | 600 | Hindley-Milner | Type environment and constraints |
| -Reduction | 10–30 nodes | — | 600 | -reduction | Normal form with reduction sequence |
| Package SAT | 20–100 pkgs | 50–300 | 600 | Dependency resolution | Valid installation order or conflict |
Appendix A.12. Category 11: Data Structure Operations
| Structure | Op Time | Operations | Instances | State Tracking Requirements |
|---|---|---|---|---|
| Stack | 20–500 | 600 | LIFO order maintenance, underflow/overflow detection | |
| Circular Queue | 20–500 | 600 | Wraparound index computation, full/empty distinction | |
| Doubly Linked List | – | 20–200 | 600 | Bidirectional pointer consistency after each operation |
| Hash Table (LP) | avg | 20–200 | 600 | Linear probe sequences and collision resolution |
| LRU Cache | 50–500 | 600 | Recency ordering and eviction policy correctness | |
| Union-Find [90] | 50–500 | 600 | Path compression and union-by-rank maintenance |
Appendix A.13. Category 12: System Simulation
| System | Components | Operations | Verification Focus |
|---|---|---|---|
| File System | Directories, files | 20–100 cmds | Valid path resolution, permission checks |
| Blockchain Ledger | Blocks, transactions | 20–100 txns | Hash chain integrity, balance consistency |
| Railway Scheduling | Tracks, trains | 5–20 trains | Collision avoidance, timing constraints |
| Meeting Room | Rooms, bookings | 20–100 requests | Conflict resolution, capacity limits |
| Elevator Control | Elevators, requests | 50–200 calls | SCAN/LOOK algorithm correctness |
| Network Routing | Routers, packets | 100–500 packets | TTL management, routing table lookups |
| Assembly Line | Stages, faults | 5–15 stages | Fault propagation tracing |
| Chemical Reaction | Species, reactions | 50–200 steps | Mass conservation, rate equations |
Appendix A.14. Instance Distribution and Quality Assurance
| Category | Easy | Medium | Hard | Total |
|---|---|---|---|---|
| Comparison Sorting | 3,000 | 3,000 | 3,000 | 9,000 |
| Non-comparison | 600 | 600 | 600 | 1,800 |
| Advanced Sorting | 2,000 | 2,000 | 2,000 | 6,000 |
| Graph Traversal | 1,200 | 1,200 | 1,200 | 3,600 |
| Tree Structures | 1,000 | 1,000 | 1,000 | 3,000 |
| Classic Puzzles | 1,200 | 1,200 | 1,200 | 3,600 |
| Automata | 1,600 | 1,600 | 1,600 | 4,800 |
| String/Pattern | 1,000 | 1,000 | 1,000 | 3,000 |
| Mathematical | 1,600 | 1,600 | 1,600 | 4,800 |
| Logic/Theorem | 1,200 | 1,200 | 1,200 | 3,600 |
| Data Structures | 1,200 | 1,200 | 1,200 | 3,600 |
| Simulation | 1,600 | 1,600 | 1,600 | 4,800 |
| Total | 17,200 | 17,200 | 17,200 | 51,600 |
Appendix A.15. Comparison with Prior Benchmarks
| Benchmark | Tasks | Instances | Max Steps | Trace Req. | Categories | Auto Verify |
|---|---|---|---|---|---|---|
| GSM8K [11] | — | 8,500 | ∼20 | No | 1 | Partial |
| MATH [12] | — | 12,500 | ∼50 | No | 7 | Partial |
| BIG-Bench [13] | ∼200 | Varies | ∼100 | No | 10+ | Yes |
| HumanEval [30] | 164 | 164 | N/A | No | 1 | Yes |
| CriticBench [91] | 15 | 3,825 | ∼50 | Partial | 5 | Partial |
| SortBench [92] | 6 | 1,000 | ∼10K | No | 1 | Yes |
| ZebraLogic [93] | 1 | 1,000 | ∼100 | No | 1 | Yes |
| PRIME-Bench | 86 | 51,600 | >1,000,000 | Yes | 12 | Yes |
Appendix A.16. Benchmark Design Principles
1. Reproducibility: Every instance is deterministically generated from fixed random seeds (base seed: 42), enabling exact replication across research groups.
2. Scalability: Tasks span execution lengths from roughly 500 steps to over $10^6$ operations, enabling evaluation across the full spectrum of LLM capabilities.
3. Diversity: The 12 categories cover fundamentally different algorithmic paradigms, including divide-and-conquer, dynamic programming, greedy algorithms, constraint satisfaction, and state machine simulation.
4. Verifiability: Every task has unambiguous correctness criteria, enabling fully automated evaluation without human judgment.
5. Trace Requirement: Unlike benchmarks evaluating only final answers, PRIME-Bench requires complete execution traces, enabling evaluation of reasoning processes [53].
6. Difficulty Calibration: Instances are uniformly distributed across difficulty levels based on empirical step counts and state-space sizes.
7. Contamination Prevention: All instances are algorithmically generated using unpublished procedures, ensuring no overlap with training corpora.
Appendix A.17. Extended Execution Trace Examples
Appendix A.17.1. Quick Sort Partition Trace
| Level | Subarray | Pivot | Partition Result | Partition Steps |
|---|---|---|---|---|
| 0 | 13 | ; scan: 29>13, 10<13→swap(10,29); 14>13; 37>13; place pivot at | ||
| 1L | — | Base case: single element | ||
| 1R | 29 | ; 14<29→swap; 37>29; place pivot | ||
| 2L | — | Base case: single element | ||
| 2R | — | Base case: single element | ||
| Final Sorted Array: | ||||
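A Lomuto-style partition that logs comparisons and swaps in the same spirit as the trace above; the pivot-selection rule here (first element) is an illustrative choice and may differ from the benchmark's exact convention.

```python
def quicksort_trace(a, lo=0, hi=None, depth=0, log=None):
    """Quick sort that records each partition step (pivot = first element)."""
    if log is None:
        log = []
    if hi is None:
        hi = len(a) - 1
    if lo > hi:
        return a, log
    if lo == hi:
        log.append((depth, f"base case [{a[lo]}]"))
        return a, log
    pivot, i = a[lo], lo                 # i marks the right edge of the "< pivot" block
    for j in range(lo + 1, hi + 1):      # scan: compare every element to the pivot
        if a[j] < pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
            log.append((depth, f"swap brings {a[i]} left of pivot {pivot}: {a[lo:hi + 1]}"))
    a[lo], a[i] = a[i], a[lo]            # place the pivot at its final index i
    log.append((depth, f"pivot {pivot} placed at index {i}: {a[lo:hi + 1]}"))
    quicksort_trace(a, lo, i - 1, depth + 1, log)
    quicksort_trace(a, i + 1, hi, depth + 1, log)
    return a, log

arr, log = quicksort_trace([13, 29, 10, 14, 37])
print(arr)                               # [10, 13, 14, 29, 37]
for row in log:
    print(row)
```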
Appendix A.17.2. Heap Sort with Heapify Trace
Algorithm A2: Max-Heapify Procedure
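Algorithm A2's listing is not reproduced above; for reference, a textbook max-heapify and the heap-sort driver it supports are sketched below (illustrative code, not the paper's exact pseudocode; the array values in the trace that follows are likewise illustrative).

```python
def max_heapify(a, i, heap_size):
    """Sift a[i] down until the max-heap property holds for the subtree rooted at i."""
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < heap_size and a[left] > a[largest]:
        largest = left
    if right < heap_size and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        max_heapify(a, largest, heap_size)

def heap_sort(a):
    a = list(a)
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):      # build the max-heap bottom-up
        max_heapify(a, i, n)
    for end in range(n - 1, 0, -1):          # repeatedly extract the maximum
        a[0], a[end] = a[end], a[0]
        max_heapify(a, 0, end)
    return a

print(heap_sort([3, 10, 4, 5]))   # [3, 4, 5, 10]
```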
| Phase | Operation | Array State |
|---|---|---|
| Build Max-Heap | ||
| 1 | Heapify at index 1 (10>5) | |
| 2 | Heapify at index 0 (10>4) | |
| Extract Maximum | ||
| 3 | Extract 10, heapify | |
| 4 | Extract 5, heapify | |
| 5 | Extract 4, heapify | |
| 6 | Extract 3 | |
| Final | ||
Appendix A.17.3. Shell Sort Gap Sequence Trace
| Gap | Pass | Subarray Comparisons | Array State After Pass |
|---|---|---|---|
| 4 | 1 | (swap pairs at distance 4) | |
| 1 | 2 | Insertion sort on full array | (final sorted output) |
| Total: 12 comparisons, 8 swaps | |||
Appendix A.17.4. DFS Traversal with Discovery/Finish Times
For any two vertices u and v, exactly one of the following holds (the parenthesis theorem):
1. The intervals $[d(u), f(u)]$ and $[d(v), f(v)]$ are entirely disjoint (neither is an ancestor of the other)
2. $[d(u), f(u)]$ is nested within $[d(v), f(v)]$ (u is a descendant of v)
3. $[d(v), f(v)]$ is nested within $[d(u), f(u)]$ (v is a descendant of u)
The proof proceeds by cases on discovery order, assuming w.l.o.g. $d(u) < d(v)$:
- If $d(v) > f(u)$: the intervals are disjoint (Case 1)
- If $d(v) < f(u)$: v is discovered while u is still open, so v's interval is nested in u's (Case 3)
- By symmetry with $d(v) < d(u)$: u's interval is nested in v's (Case 2)
| Time | Event | Vertex | Stack | Edge Classification |
|---|---|---|---|---|
| 1 | Discover | A | — | |
| 2 | Discover | B | Tree edge | |
| 3 | Discover | D | Tree edge | |
| 4 | Finish | D | — | |
| 5 | Discover | E | Tree edge | |
| 6 | — | — | — | Back edge detected |
| 7 | Finish | E | — | |
| 8 | Finish | B | — | |
| 9 | Discover | C | Tree edge | |
| 10 | — | — | — | Cross edge detected |
| 11 | Finish | C | — | |
| 12 | Finish | A | — |
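A compact generator for discovery/finish times and edge classification, matching the trace format above; the graph, vertex names, and exact timestamp convention here are illustrative and may differ slightly from the table.

```python
def dfs_timestamps(adj, start):
    """Recursive DFS returning discovery/finish times and a
    tree/back/forward/cross classification for every explored edge."""
    disc, fin, edges = {}, {}, []
    clock = 0

    def visit(u):
        nonlocal clock
        clock += 1
        disc[u] = clock
        for v in adj.get(u, []):
            if v not in disc:
                edges.append((u, v, "tree"))
                visit(v)
            elif v not in fin:
                edges.append((u, v, "back"))       # ancestor still open on the stack
            elif disc[v] > disc[u]:
                edges.append((u, v, "forward"))
            else:
                edges.append((u, v, "cross"))
        clock += 1
        fin[u] = clock

    visit(start)
    return disc, fin, edges

adj = {"A": ["B", "C"], "B": ["D", "E"], "E": ["A"], "C": ["D"], "D": []}
print(dfs_timestamps(adj, "A"))
```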
Appendix A.17.5. A* Pathfinding with Heuristic Computation
| Iter | Expand | g | h | Successors Added to Open | |
|---|---|---|---|---|---|
| 1 | 0 | 8 | 8 | , | |
| 2 | 1 | 7 | 8 | , | |
| 3 | 1 | 7 | 8 | , already in open | |
| 4 | 2 | 6 | 8 | , | |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 12 | 8 | 0 | 8 | Goal reached | |
| Optimal Path: | |||||
Appendix A.17.6. Red-Black Tree Insertion with Rotations
Algorithm A3: Red-Black Tree Insertion Fixup
| Insert | Fixup Case | Tree State (Black=B, Red=R) |
|---|---|---|
| 7 | Root case | 7(B) |
| 3 | None | 7(B)[3(R), —] |
| 18 | Case 1 (recolor) | 7(B)[3(B), 18(B)] |
| 10 | None | 18(B)[10(R), —] |
| 22 | None | 18(B)[—, 22(R)] |
| 8 | Case 3 (rotate) | 10(B)[8(R), 18(R)[—, 22(R)]] under 7(B)[3(B), ...] |
| 11 | Case 2→3 | Restructure with rotations |
| 26 | Case 1 (recolor) | Final balanced tree |
Appendix A.17.7. Turing Machine Execution Trace
| Step | State | Tape | Action |
|---|---|---|---|
| 0 | Start | ||
| 1 | Write X, R | ||
| 2 | R | ||
| 3 | Write Y, L | ||
| 4 | L | ||
| 5 | L | ||
| 6 | R | ||
| 7 | R (skip Y) | ||
| 8 | R | ||
| 9 | Write Y, L | ||
| ⋮ | ⋮ | ⋮ | ⋮ |
| 15 | Accept |
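The marker-based machine traced above (X/Y marking) follows the classic decider for $\{0^n 1^n\}$; a generic single-tape simulator is sketched below, with an illustrative transition table in that style (state names and the exact table are ours, not the benchmark's).

```python
def run_tm(tape, transitions, start, accept, blank="_", max_steps=1000):
    """Simulate a single-tape Turing machine.

    transitions: {(state, symbol): (new_state, write, move)} with move in {"L", "R"}.
    Returns (accepted, trace) where trace records (step, state, tape, head)."""
    tape = list(tape)
    state, head, trace = start, 0, []
    for step in range(max_steps):
        if head < 0:
            tape.insert(0, blank); head = 0
        if head >= len(tape):
            tape.append(blank)
        trace.append((step, state, "".join(tape), head))
        if state == accept:
            return True, trace
        key = (state, tape[head])
        if key not in transitions:
            return False, trace                    # no rule: reject
        state, write, move = transitions[key]
        tape[head] = write
        head += 1 if move == "R" else -1
    return False, trace

# Illustrative transition table for the language 0^n 1^n (X marks 0s, Y marks 1s)
T = {
    ("q0", "0"): ("q1", "X", "R"), ("q0", "Y"): ("q3", "Y", "R"),
    ("q1", "0"): ("q1", "0", "R"), ("q1", "Y"): ("q1", "Y", "R"),
    ("q1", "1"): ("q2", "Y", "L"),
    ("q2", "0"): ("q2", "0", "L"), ("q2", "Y"): ("q2", "Y", "L"),
    ("q2", "X"): ("q0", "X", "R"),
    ("q3", "Y"): ("q3", "Y", "R"), ("q3", "_"): ("acc", "_", "R"),
}
ok, trace = run_tm("0011", T, start="q0", accept="acc")
print(ok, len(trace))   # True, with the X/Y marking pattern shown above
```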
Appendix A.17.8. DPLL SAT Solver Trace
| Step | Operation | Assignment | Clause Status |
|---|---|---|---|
| 1 | Choose | satisfied, active | |
| 2 | Unit propagate: | satisfied, active | |
| 3 | Unit propagate: | All satisfied | |
| SAT: | |||
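A minimal DPLL with unit propagation, in the spirit of the trace above; the CNF encoding and branching heuristic here are simplified relative to the benchmark's solver.

```python
def dpll(clauses, assignment=None):
    """Minimal DPLL: clauses are lists of non-zero ints (negative = negated variable).
    Returns a satisfying assignment dict {var: bool} or None if unsatisfiable."""
    assignment = dict(assignment or {})

    def simplify(cls):
        out = []
        for clause in cls:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue                              # clause already satisfied
            rest = [l for l in clause if abs(l) not in assignment]
            if not rest:
                return None                           # clause falsified: conflict
            out.append(rest)
        return out

    while True:
        cls = simplify(clauses)
        if cls is None:
            return None
        if not cls:
            return assignment                         # all clauses satisfied
        units = [c[0] for c in cls if len(c) == 1]
        if not units:
            break
        for lit in units:                             # unit propagation
            assignment[abs(lit)] = lit > 0

    lit = cls[0][0]                                   # branch on the first unassigned literal
    for value in (lit > 0, lit <= 0):
        result = dpll(clauses, {**assignment, abs(lit): value})
        if result is not None:
            return result
    return None

print(dpll([[1, -2], [2, 3], [-1, -3]]))   # e.g. {1: True, 3: False, 2: True}
```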
Appendix A.17.9. Gaussian Elimination Trace
| Step | Operation | Augmented Matrix |
|---|---|---|
| 0 | Initial | |
| 1 | ||
| 2 | ||
| 3 | ||
| Back Substitution: , , | ||
Appendix A.18. Task Category Deep Dive: Sorting Algorithms
Appendix A.18.1. Comparison-Based Sorting: Formal Properties
- Each internal node represents a comparison $a_i \le a_j$
- The left subtree corresponds to "yes" ($a_i \le a_j$), the right subtree to "no" ($a_i > a_j$)
- Each leaf represents a permutation that produces the sorted output
| Algorithm | Loop Invariant | Termination Proof |
|---|---|---|
| Bubble Sort | After i passes, the largest i elements are in their final sorted positions at the end of the array | Each pass places at least one element; at most passes required |
| Selection Sort | After i iterations, contains the i smallest elements in sorted order | Each iteration places one element; exactly iterations |
| Insertion Sort | After processing element i, is sorted | Each element processed once; n iterations total |
| Merge Sort | Each recursive call correctly sorts its subarray; merge combines two sorted arrays | Recursion depth ; each level processes n elements |
| Quick Sort | All elements left of pivot < pivot; all elements right of pivot ≥ pivot | Each partition reduces problem size; expected depth |
| Heap Sort | After extraction i, the largest i elements are sorted at positions | Each extraction is ; exactly n extractions |
Appendix A.18.2. Expected Step Count Analysis
| Algorithm | |||||
|---|---|---|---|---|---|
| Bubble Sort | 90 | 600 | 2,450 | 9,900 | 65,280 |
| Selection Sort | 45 | 300 | 1,225 | 4,950 | 32,640 |
| Insertion Sort (avg) | 25 | 156 | 625 | 2,500 | 16,384 |
| Shell Sort (Knuth) | 35 | 150 | 450 | 1,200 | 4,500 |
| Merge Sort | 34 | 117 | 282 | 664 | 2,048 |
| Quick Sort (avg) | 30 | 100 | 250 | 580 | 1,800 |
| Heap Sort | 50 | 180 | 450 | 1,100 | 3,500 |
Appendix A.19. Task Category Deep Dive: Graph Algorithms
Appendix A.19.1. Graph Representation Formats
Graph instances are provided in one of three representations:
1. Adjacency List: for each vertex $u$, the list of neighbors $\mathrm{Adj}[u] = \{v : (u, v) \in E\}$
2. Edge List: a list of pairs $(u, v)$ with optional weights $w(u, v)$
3. Adjacency Matrix: $M[u][v] = w(u, v)$ if $(u, v) \in E$, or $\infty$ otherwise
A small example in all three formats is given below.
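For concreteness, the same small weighted digraph in the three formats (vertex names and weights are illustrative):

```python
# One small weighted digraph in the three formats listed above.
adj_list = {"A": [("B", 3), ("C", 7)], "B": [("C", 2)], "C": []}

edge_list = [("A", "B", 3), ("A", "C", 7), ("B", "C", 2)]

INF = float("inf")
vertices = ["A", "B", "C"]
adj_matrix = [[0 if i == j else INF for j in range(3)] for i in range(3)]
for u, v, w in edge_list:
    adj_matrix[vertices.index(u)][vertices.index(v)] = w
```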
Appendix A.19.2. Shortest Path Algorithm Variants
| Algorithm | Negative Weights | All-Pairs | Complexity | Data Structure Requirements |
|---|---|---|---|---|
| BFS | No (unweighted) | No | Queue for frontier | |
| Dijkstra | No | No | Min-heap priority queue | |
| Bellman-Ford | Yes (no neg cycles) | No | Array for distances | |
| Floyd-Warshall | Yes (detect neg cycles) | Yes | distance matrix | |
| A* | No | No | to | Priority queue with f-scores |
Appendix A.19.3. Topological Sort Algorithms
Algorithm A4: Kahn's Topological Sort
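Algorithm A4's listing is not reproduced above; below is a standard queue-based Kahn's algorithm, sketched to show the in-degree trace required by the Category 4 output specification (the exact trace serialization is ours).

```python
from collections import deque

def kahn_toposort(adj):
    """Kahn's algorithm: returns (ordering, in_degree_trace); raises on a cycle."""
    indeg = {u: 0 for u in adj}
    for u in adj:
        for v in adj[u]:
            indeg[v] = indeg.get(v, 0) + 1
    queue = deque(sorted(u for u, d in indeg.items() if d == 0))
    order, trace = [], [dict(indeg)]
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj.get(u, []):
            indeg[v] -= 1                 # "remove" edge u -> v
            if indeg[v] == 0:
                queue.append(v)
        trace.append(dict(indeg))
    if len(order) != len(indeg):
        raise ValueError("graph contains a cycle")
    return order, trace

print(kahn_toposort({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []})[0])
# ['A', 'B', 'C', 'D']
```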
Appendix A.20. Task Category Deep Dive: Automata Theory
Appendix A.20.1. Formal Language Hierarchy
| Type | Grammar | Automaton | Example Language |
|---|---|---|---|
| Type-3 | Regular | DFA/NFA | $a^* b^*$ |
| Type-2 | Context-Free | PDA | $\{a^n b^n\}$ |
| Type-1 | Context-Sensitive | LBA | $\{a^n b^n c^n\}$ |
| Type-0 | Unrestricted | Turing Machine | Halting problem |
Appendix A.20.2. NFA to DFA Conversion (Subset Construction)
Algorithm A5: Subset Construction (NFA to DFA)
Appendix A.20.3. Pushdown Automaton Configurations
| Step | State | Stack | Remaining Input |
|---|---|---|---|
| 0 | |||
| 1 | |||
| 2 | |||
| 3 | (-transition to guess middle) | ||
| 4 | a (pop b, match) | ||
| 5 | (pop a, match) | ||
| 6 | (accept by empty stack/final state) |
Appendix A.21. Evaluation Metrics and Scoring
Appendix A.21.1. Primary Metrics
Appendix A.21.2. Error Taxonomy
| Error Type | Description |
|---|---|
| State Tracking | |
| Carryover Error | Failure to propagate state correctly across steps |
| Reset Error | Incorrectly resetting accumulated state |
| Index Error | Off-by-one or incorrect array indexing |
| Algorithmic | |
| Wrong Operation | Applying incorrect operation for algorithm |
| Ordering Error | Executing steps in wrong sequence |
| Termination Error | Stopping too early or continuing past termination |
| Constraint | |
| Boundary Violation | Exceeding defined constraints |
| Invariant Violation | Breaking algorithmic invariant |
| Format Error | Output not matching required format |
Appendix B. Complete Experimental Results
Appendix B.1. Overall Performance Summary
| Metric | Value |
|---|---|
| Total Tasks Evaluated | 86 |
| Total Task Categories | 12 |
| Total Evaluation Samples | 51,600 |
| Average Baseline Accuracy | 26.8% |
| Average PRIME Accuracy | 93.8% |
| Relative Improvement | +250.0% |
| Absolute Improvement | +67.0 pp |
| Median Baseline Accuracy | 26.7% |
| Median PRIME Accuracy | 93.8% |
Appendix B.2. Category-Level Results

| Category | Tasks | Baseline | Std | PRIME | Std | Improvement |
|---|---|---|---|---|---|---|
| Comparison-based Sorting | 15 | 25.4% | 4.7% | 94.1% | 2.4% | +270.5% |
| Non-comparison Sorting | 3 | 33.4% | 3.8% | 96.9% | 1.5% | +190.1% |
| Advanced/Hybrid Sorting | 10 | 24.8% | 5.1% | 92.9% | 2.8% | +274.6% |
| Graph Traversal | 6 | 29.4% | 4.9% | 93.9% | 2.7% | +219.4% |
| Tree Data Structures | 5 | 27.8% | 5.0% | 93.5% | 2.8% | +236.3% |
| Classic Puzzles | 6 | 27.3% | 4.5% | 94.4% | 2.4% | +245.8% |
| Automata/State Machines | 8 | 24.2% | 5.3% | 93.4% | 2.9% | +286.0% |
| String/Pattern Matching | 5 | 33.6% | 4.5% | 92.9% | 2.9% | +176.5% |
| Mathematical/Numerical | 8 | 22.4% | 5.8% | 93.5% | 2.8% | +317.4% |
| Logic/Theorem Proving | 6 | 19.5% | 6.1% | 90.6% | 3.8% | +364.6% |
| Data Structure Operations | 6 | 32.6% | 4.4% | 95.6% | 2.2% | +193.3% |
| System Simulation | 8 | 26.9% | 5.1% | 93.7% | 2.8% | +248.3% |
| Overall | 86 | 26.8% | 5.0% | 93.8% | 2.7% | +250.0% |
Appendix B.3. Radar Analysis

Appendix B.4. Top Improvements Analysis

Appendix B.5. Detailed Results by Category
Appendix B.5.1. Sorting Algorithms

| Algorithm | Base | PRIME | |
|---|---|---|---|
| Comparison-based | |||
| Bubble Sort | 28.4% | 96.7% | +240.5% |
| Selection Sort | 29.8% | 97.1% | +225.8% |
| Insertion Sort | 31.2% | 95.8% | +207.1% |
| Shell Sort | 25.6% | 94.2% | +268.0% |
| Merge Sort | 26.7% | 94.3% | +253.2% |
| Quick Sort | 24.5% | 93.6% | +282.0% |
| Heap Sort | 19.8% | 91.2% | +360.6% |
| Tree Sort | 22.3% | 92.4% | +314.3% |
| Cocktail Shaker Sort | 27.6% | 96.1% | +248.2% |
| Comb Sort | 26.1% | 94.8% | +263.2% |
| Gnome Sort | 28.9% | 95.9% | +231.8% |
| Odd-Even Sort | 27.1% | 95.3% | +251.7% |
| Pancake Sort | 23.4% | 93.1% | +297.9% |
| Cycle Sort | 21.2% | 91.8% | +333.0% |
| Stooge Sort | 17.8% | 89.7% | +404.0% |
| Non-comparison | |||
| Counting Sort | 35.6% | 97.8% | +174.7% |
| Radix Sort | 31.2% | 96.2% | +208.3% |
| Bucket Sort | 33.4% | 96.8% | +189.8% |
| Advanced/Hybrid | |||
| Timsort | 28.9% | 95.1% | +229.1% |
| Introsort | 27.8% | 94.6% | +240.3% |
| Patience Sort | 26.7% | 93.8% | +251.3% |
| Strand Sort | 24.5% | 92.6% | +278.0% |
| Bitonic Sort | 22.3% | 91.8% | +311.7% |
| Batcher Merge | 23.4% | 92.4% | +295.0% |
| Library Sort | 25.6% | 93.2% | +264.1% |
| Smoothsort | 21.2% | 90.8% | +328.3% |
| Block Sort | 23.4% | 92.1% | +293.6% |
| Tournament Sort | 24.5% | 92.8% | +278.8% |
Appendix B.5.2. Graph, Tree, and Classic Puzzles

Appendix B.5.3. Automata, String, and Mathematical Tasks

Appendix B.5.4. Logic, Data Structures, and System Simulation

Appendix B.6. Statistical Distribution Analysis
Appendix B.6.1. Box Plot Analysis

Appendix B.6.2. Baseline vs. PRIME Correlation

Appendix B.6.3. Improvement Distribution

Appendix B.7. Per-Task Complete Results
| Task | Steps | Base | PRIME | |
|---|---|---|---|---|
| Bubble Sort | 1M | 28.4% | 96.7% | +68.3 |
| Selection Sort | 1M | 29.8% | 97.1% | +67.3 |
| Insertion Sort | 1M | 31.2% | 95.8% | +64.6 |
| Shell Sort | 500K | 25.6% | 94.2% | +68.6 |
| Merge Sort | 800K | 26.7% | 94.3% | +67.6 |
| Quick Sort | 800K | 24.5% | 93.6% | +69.1 |
| Heap Sort | 600K | 19.8% | 91.2% | +71.4 |
| Tree Sort | 600K | 22.3% | 92.4% | +70.1 |
| Cocktail Sort | 1M | 27.6% | 96.1% | +68.5 |
| Comb Sort | 800K | 26.1% | 94.8% | +68.7 |
| Gnome Sort | 1M | 28.9% | 95.9% | +67.0 |
| Odd-Even Sort | 1M | 27.1% | 95.3% | +68.2 |
| Pancake Sort | 500K | 23.4% | 93.1% | +69.7 |
| Cycle Sort | 500K | 21.2% | 91.8% | +70.6 |
| Stooge Sort | 300K | 17.8% | 89.7% | +71.9 |
| Counting Sort | 200K | 35.6% | 97.8% | +62.2 |
| Radix Sort | 300K | 31.2% | 96.2% | +65.0 |
| Bucket Sort | 250K | 33.4% | 96.8% | +63.4 |
| Timsort | 600K | 28.9% | 95.1% | +66.2 |
| Introsort | 600K | 27.8% | 94.6% | +66.8 |
| Patience Sort | 500K | 26.7% | 93.8% | +67.1 |
| Strand Sort | 400K | 24.5% | 92.6% | +68.1 |
| Task | Steps | Base | PRIME | |
|---|---|---|---|---|
| Bitonic Sort | 500K | 22.3% | 91.8% | +69.5 |
| Batcher Merge | 500K | 23.4% | 92.4% | +69.0 |
| Library Sort | 400K | 25.6% | 93.2% | +67.6 |
| Smoothsort | 500K | 21.2% | 90.8% | +69.6 |
| Block Sort | 500K | 23.4% | 92.1% | +68.7 |
| Tournament Sort | 500K | 24.5% | 92.8% | +68.3 |
| DFS on Tree | 100K | 35.6% | 96.2% | +60.6 |
| BFS on Graph | 100K | 34.2% | 95.8% | +61.6 |
| Dijkstra | 50K | 27.8% | 93.1% | +65.3 |
| A* Pathfinding | 80K | 25.6% | 92.4% | +66.8 |
| Floyd-Warshall | 125K | 19.8% | 90.8% | +71.0 |
| Topological Sort | 50K | 33.4% | 95.1% | +61.7 |
| BST Insertion | 100K | 31.2% | 94.8% | +63.6 |
| BST Inorder | 80K | 37.8% | 96.8% | +59.0 |
| Red-Black Insert | 50K | 18.9% | 89.7% | +70.8 |
| Huffman Tree | 30K | 26.7% | 93.4% | +66.7 |
| Heap Operations | 80K | 24.5% | 92.6% | +68.1 |
| Tower of Hanoi | 1M | 33.0% | 98.5% | +65.5 |
| N-Queens | 500K | 37.4% | 96.4% | +59.0 |
| Blind Maze | 50K | 19.8% | 95.8% | +76.0 |
| Logic Grid | 20K | 26.7% | 91.2% | +64.5 |
| Sudoku Solve | 30K | 28.9% | 94.1% | +65.2 |
| Task | Steps | Base | PRIME | |
|---|---|---|---|---|
| 24-Game Ext. | 10K | 17.8% | 90.3% | +72.5 |
| DFA Simulation | 100K | 31.2% | 95.6% | +64.4 |
| NFA Simulation | 80K | 26.7% | 93.4% | +66.7 |
| PDA Execution | 60K | 23.4% | 91.8% | +68.4 |
| Turing Machine | 200K | 8.9% | 92.4% | +83.5 |
| Register Machine | 150K | 17.8% | 90.8% | +73.0 |
| Petri Net | 80K | 25.6% | 93.4% | +67.8 |
| Cellular Automaton | 100K | 28.9% | 94.6% | +65.7 |
| Markov Chain | 50K | 31.2% | 95.2% | +64.0 |
| KMP Pattern | 100K | 33.4% | 95.8% | +62.4 |
| Regex NFA | 80K | 26.7% | 93.2% | +66.5 |
| CFG Derivation | 50K | 22.3% | 91.4% | +69.1 |
| Translation Chain | 10K | 41.2% | 89.1% | +47.9 |
| ASCII Art Parse | 5K | 44.5% | 95.2% | +50.7 |
| Long Division | 1K | 15.6% | 94.3% | +78.7 |
| Matrix Multiply | 8K | 18.9% | 92.7% | +73.8 |
| Gaussian Elim. | 5K | 16.7% | 91.5% | +74.8 |
| GCD Euclidean | 2K | 37.8% | 97.8% | +60.0 |
| Simplex Method | 3K | 14.5% | 89.8% | +75.3 |
| Polynomial GCD | 2K | 17.8% | 91.2% | +73.4 |
| Continued Frac. | 1K | 26.7% | 94.8% | +68.1 |
| Symbolic Diff. | 500 | 31.2% | 95.6% | +64.4 |
| Task | Steps | Base | PRIME | |
|---|---|---|---|---|
| SAT DPLL | 50K | 17.8% | 91.2% | +73.4 |
| Resolution Proof | 30K | 19.8% | 91.8% | +72.0 |
| Unification | 20K | 21.2% | 92.4% | +71.2 |
| Type Inference | 15K | 18.9% | 90.8% | +71.9 |
| Lambda Reduction | 10K | 16.7% | 89.6% | +72.9 |
| Dependency SAT | 40K | 22.3% | 87.6% | +65.3 |
| Stack Simulator | 100K | 37.8% | 97.2% | +59.4 |
| Queue Simulator | 100K | 36.7% | 96.8% | +60.1 |
| Doubly Linked List | 80K | 28.9% | 94.6% | +65.7 |
| Hash Table | 50K | 31.2% | 95.2% | +64.0 |
| LRU Cache | 50K | 27.8% | 93.8% | +66.0 |
| Union-Find | 80K | 33.4% | 95.8% | +62.4 |
| File System Ops | 100K | 26.7% | 94.2% | +67.5 |
| Blockchain Ledger | 50K | 28.9% | 94.8% | +65.9 |
| Railway Scheduling | 30K | 25.6% | 93.2% | +67.6 |
| Meeting Scheduler | 20K | 31.2% | 95.6% | +64.4 |
| Elevator Sched. | 30K | 27.8% | 93.8% | +66.0 |
| Packet Routing | 50K | 24.5% | 92.6% | +68.1 |
| Assembly Line | 20K | 23.4% | 91.8% | +68.4 |
| Chemical Reaction | 30K | 26.7% | 93.4% | +66.7 |
Appendix B.8. Statistical Significance
| Category | Baseline CI | PRIME CI |
|---|---|---|
| Comparison Sorting | [23.1, 27.7]% | [92.8, 95.4]% |
| Non-comparison Sort | [30.2, 36.6]% | [95.6, 98.2]% |
| Advanced Sorting | [22.3, 27.3]% | [91.4, 94.4]% |
| Graph Traversal | [26.5, 32.3]% | [92.1, 95.7]% |
| Tree Operations | [24.8, 30.8]% | [91.8, 95.2]% |
| Classic Puzzles | [24.5, 30.1]% | [93.0, 95.8]% |
| Automata/State | [21.3, 27.1]% | [91.6, 95.2]% |
| String/Pattern | [30.5, 36.7]% | [91.1, 94.7]% |
| Mathematical | [19.1, 25.7]% | [91.8, 95.2]% |
| Logic/Theorem | [15.9, 23.1]% | [88.1, 93.1]% |
| Data Structures | [29.7, 35.5]% | [94.2, 97.0]% |
| System Simulation | [24.0, 29.8]% | [92.0, 95.4]% |
Appendix B.9. Model-Specific Performance Analysis
Appendix B.9.1. Performance by Model Size
| Model | Params | Baseline | Std | PRIME | Std | (pp) | Rel. Improv. |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | 8B | 21.3% | 5.8% | 89.2% | 3.4% | +67.9 | +318.8% |
| Gemma3-12B | 12B | 24.1% | 5.2% | 91.8% | 3.1% | +67.7 | +280.9% |
| Qwen3-14B | 14B | 26.8% | 5.0% | 93.8% | 2.7% | +67.0 | +250.0% |
| GPT-OSS-20B | 20B | 28.4% | 4.8% | 94.6% | 2.5% | +66.2 | +233.1% |
| Gemma3-27B | 27B | 30.2% | 4.6% | 95.1% | 2.4% | +64.9 | +214.9% |
| Qwen3-Coder-30B | 30B | 32.5% | 4.4% | 95.8% | 2.2% | +63.3 | +194.8% |
| GPT-OSS-120B | 120B | 38.7% | 4.1% | 96.9% | 1.9% | +58.2 | +150.4% |
Appendix B.9.2. Scaling Law Analysis
| Metric | Coefficient | Value | 95% CI |
|---|---|---|---|
| Baseline | 4.73 | [4.12, 5.34] | |
| 0.21 | [0.18, 0.24] | ||
| 15.2 | [13.8, 16.6] | ||
| PRIME Gain | 12.8 | [11.2, 14.4] | |
| 0.08 | [0.06, 0.10] | ||
| 56.1 | [54.3, 57.9] |
Appendix B.9.3. Model Architecture Comparison
| Architecture | ∼Params | Baseline | PRIME | |
|---|---|---|---|---|
| ∼12-14B Parameter Models | ||||
| Gemma3-12B (Decoder) | 12B | 24.1% | 91.8% | +67.7 |
| Qwen3-14B (Decoder) | 14B | 26.8% | 93.8% | +67.0 |
| ∼27-30B Parameter Models | ||||
| Gemma3-27B (Decoder) | 27B | 30.2% | 95.1% | +64.9 |
| Qwen3-Coder-30B (Decoder) | 30B | 32.5% | 95.8% | +63.3 |
Appendix B.10. Error Analysis
Appendix B.10.1. Error Distribution by Category
| Category | State | Index | Operation | Ordering | Termination | Format | Other |
|---|---|---|---|---|---|---|---|
| Comparison Sorting | 28.4% | 22.1% | 18.3% | 12.5% | 8.9% | 6.2% | 3.6% |
| Non-comparison Sort | 18.2% | 31.4% | 22.6% | 8.4% | 10.2% | 5.8% | 3.4% |
| Advanced Sorting | 31.5% | 19.8% | 21.2% | 11.3% | 7.8% | 5.2% | 3.2% |
| Graph Traversal | 35.2% | 15.6% | 12.4% | 18.9% | 9.1% | 5.4% | 3.4% |
| Tree Operations | 29.8% | 24.3% | 16.5% | 14.2% | 6.8% | 5.1% | 3.3% |
| Classic Puzzles | 22.4% | 12.8% | 28.6% | 16.4% | 11.2% | 4.8% | 3.8% |
| Automata/State | 38.6% | 8.4% | 14.2% | 22.5% | 8.5% | 4.6% | 3.2% |
| String/Pattern | 25.3% | 28.6% | 18.4% | 10.2% | 9.8% | 4.5% | 3.2% |
| Mathematical | 42.1% | 18.5% | 15.2% | 6.8% | 7.4% | 6.8% | 3.2% |
| Logic/Theorem | 35.8% | 8.2% | 24.6% | 18.4% | 6.2% | 3.6% | 3.2% |
| Data Structures | 26.4% | 32.5% | 14.8% | 12.1% | 6.8% | 4.2% | 3.2% |
| System Simulation | 34.2% | 14.6% | 16.8% | 19.4% | 7.2% | 4.6% | 3.2% |
| Overall | 30.7% | 19.7% | 18.6% | 14.3% | 8.3% | 5.1% | 3.3% |
Appendix B.10.2. Error Severity Analysis
| Severity | Weight | Description |
|---|---|---|
| Critical | 1.0 | Completely incorrect result; algorithm fails |
| Major | 0.6 | Partial correctness; significant deviation |
| Minor | 0.2 | Correct result with suboptimal execution |
| Baseline | PRIME | |||||
|---|---|---|---|---|---|---|
| Category | Crit. | Maj. | Min. | Crit. | Maj. | Min. |
| Sorting | 68.2% | 24.3% | 7.5% | 3.8% | 1.4% | 0.7% |
| Graph | 65.4% | 26.8% | 7.8% | 4.2% | 1.2% | 0.7% |
| Tree | 67.8% | 24.6% | 7.6% | 4.6% | 1.3% | 0.6% |
| Puzzles | 66.1% | 25.4% | 8.5% | 3.4% | 1.5% | 0.7% |
| Automata | 71.2% | 22.4% | 6.4% | 4.8% | 1.2% | 0.6% |
| String | 62.8% | 28.2% | 9.0% | 5.2% | 1.4% | 0.5% |
| Math | 73.4% | 20.8% | 5.8% | 4.6% | 1.2% | 0.7% |
| Logic | 76.2% | 18.6% | 5.2% | 6.8% | 1.8% | 0.8% |
| Data Struct. | 61.8% | 29.4% | 8.8% | 2.8% | 1.0% | 0.6% |
| Simulation | 68.4% | 24.2% | 7.4% | 4.4% | 1.4% | 0.5% |
Appendix B.10.3. First Error Position Analysis
| Category | Baseline | Baseline | PRIME | PRIME |
|---|---|---|---|---|
| Sorting | 18.4% | 12.3% | 72.6% | 18.4% |
| Graph | 22.1% | 14.5% | 68.4% | 21.2% |
| Tree | 24.6% | 15.2% | 71.2% | 19.8% |
| Puzzles | 31.2% | 18.4% | 78.4% | 15.6% |
| Automata | 15.8% | 10.6% | 65.2% | 22.4% |
| Math | 12.4% | 8.2% | 62.8% | 24.6% |
Appendix B.11. Ablation Study Results
Appendix B.11.1. Component-wise Ablation
| Configuration | Acc. | vs Full | State Err | Constraint Err | Avg Steps | Retry Rate |
|---|---|---|---|---|---|---|
| Full PRIME | 93.8% | — | 2.1% | 1.4% | 1.28 | 12.4% |
| − GRPO (use PPO) | 89.2% | −4.6 pp | 3.8% | 2.4% | 1.52 | 18.6% |
| − Verifier Agent | 86.4% | −7.4 pp | 5.2% | 4.8% | 1.34 | 14.2% |
| − Iterative Exec. | 82.8% | −11.0 pp | 6.4% | 3.2% | 1.00 | 0.0% |
| − Self-Consistency | 88.6% | −5.2 pp | 4.2% | 2.1% | 1.28 | 12.4% |
| − Multi-Agent | 78.4% | −15.4 pp | 8.6% | 6.4% | 1.12 | 8.2% |
| Baseline Only | 26.8% | −67.0 pp | 42.4% | 28.6% | 1.00 | 0.0% |
Appendix B.11.2. Component Interaction Effects
| Component Pair | Independent Sum | Combined Effect |
|---|---|---|
| GRPO + Verifier | 12.0 pp | 14.8 pp |
| GRPO + Multi-Agent | 20.0 pp | 24.2 pp |
| Verifier + Iterative | 18.4 pp | 22.6 pp |
| Multi-Agent + Self-Cons. | 20.6 pp | 25.8 pp |
Appendix B.11.3. Hyperparameter Sensitivity
| Parameter | Low Value | Default | High Value | (Low-High) | Sensitivity Notes |
|---|---|---|---|---|---|
| Group size G | 4: 91.2% | 8: 93.8% | 16: 94.1% | 2.9 pp | Diminishing returns above |
| Iterations K | 2: 88.4% | 5: 93.8% | 10: 94.2% | 5.8 pp | Most sensitive; early stopping mitigates |
| Violation | 0.1: 92.4% | 0.3: 93.8% | 0.5: 91.8% | 1.4 pp | U-shaped; optimal at moderate threshold |
| Temperature | 0.5: 92.1% | 0.7: 93.8% | 0.9: 90.6% | 3.2 pp | Balances diversity vs. quality |
| Learning rate | 5e-6: 91.8% | 1e-5: 93.8% | 2e-5: 92.4% | 2.0 pp | Stable within one order of magnitude |
Appendix B.12. Difficulty-Stratified Analysis
Appendix B.12.1. Performance by Difficulty Level
| Easy | Medium | Hard | ||||
|---|---|---|---|---|---|---|
| Category | Base | PRIME | Base | PRIME | Base | PRIME |
| Comparison Sorting | 38.2% | 98.4% | 24.6% | 94.2% | 13.4% | 89.7% |
| Non-comparison Sort | 45.6% | 99.1% | 32.8% | 97.2% | 21.8% | 94.4% |
| Advanced Sorting | 36.4% | 97.6% | 24.2% | 93.1% | 13.8% | 88.0% |
| Graph Traversal | 42.1% | 98.2% | 28.6% | 94.1% | 17.5% | 89.4% |
| Tree Operations | 40.2% | 97.8% | 27.4% | 93.6% | 15.8% | 89.1% |
| Classic Puzzles | 41.8% | 98.6% | 26.4% | 94.8% | 13.7% | 89.8% |
| Automata/State | 38.4% | 98.1% | 23.6% | 93.8% | 10.6% | 88.3% |
| String/Pattern | 46.8% | 97.4% | 32.4% | 93.2% | 21.6% | 88.1% |
| Mathematical | 36.2% | 97.8% | 21.8% | 93.6% | 9.2% | 89.1% |
| Logic/Theorem | 32.4% | 96.2% | 18.6% | 91.2% | 7.5% | 84.4% |
| Data Structures | 46.4% | 98.8% | 31.8% | 96.2% | 19.6% | 91.8% |
| System Simulation | 40.6% | 98.4% | 26.2% | 94.1% | 13.9% | 88.6% |
| Overall | 40.4% | 98.0% | 26.5% | 94.1% | 14.9% | 89.2% |
Appendix B.12.2. Difficulty Degradation Analysis
Appendix B.13. Execution Efficiency Analysis
Appendix B.13.1. Step Count Distribution
| Category | Optimal | PRIME | Overhead |
|---|---|---|---|
| Comparison Sorting | 1.00× | 1.12× | +12% |
| Non-comparison Sort | 1.00× | 1.08× | +8% |
| Advanced Sorting | 1.00× | 1.18× | +18% |
| Graph Traversal | 1.00× | 1.14× | +14% |
| Tree Operations | 1.00× | 1.16× | +16% |
| Classic Puzzles | 1.00× | 1.06× | +6% |
| Automata/State | 1.00× | 1.04× | +4% |
| String/Pattern | 1.00× | 1.10× | +10% |
| Mathematical | 1.00× | 1.08× | +8% |
| Logic/Theorem | 1.00× | 1.22× | +22% |
| Data Structures | 1.00× | 1.06× | +6% |
| System Simulation | 1.00× | 1.12× | +12% |
| Average | 1.00× | 1.11× | +11% |
Appendix B.13.2. Retry and Backtrack Statistics
| Category | Retry Rate | Avg Retries | Backtrack Rate |
|---|---|---|---|
| Sorting | 10.2% | 1.4 | 8.6% |
| Graph | 14.8% | 1.6 | 12.4% |
| Tree | 12.6% | 1.5 | 10.2% |
| Puzzles | 8.4% | 1.3 | 6.8% |
| Automata | 15.2% | 1.7 | 14.6% |
| Math | 11.8% | 1.5 | 9.4% |
| Logic | 18.4% | 1.9 | 16.8% |
| Data Struct. | 9.6% | 1.3 | 7.2% |
| Simulation | 13.2% | 1.5 | 11.4% |
| Overall | 12.4% | 1.5 | 10.8% |
Appendix B.14. Cross-Task Generalization
Appendix B.14.1. Transfer Learning Performance
| Train ∖ Test | Sorting | Graph | Tree | Automata | Math | Logic |
|---|---|---|---|---|---|---|
| Sorting | 94.1 | 78.4 | 82.6 | 68.2 | 72.4 | 64.8 |
| Graph | 76.2 | 93.9 | 84.2 | 72.6 | 68.4 | 70.2 |
| Tree | 80.4 | 82.8 | 93.5 | 70.8 | 74.2 | 68.6 |
| Automata | 64.6 | 70.4 | 68.2 | 93.4 | 66.8 | 78.4 |
| Math | 70.2 | 66.8 | 72.4 | 64.2 | 93.5 | 72.6 |
| Logic | 62.4 | 68.6 | 66.8 | 76.2 | 70.4 | 90.6 |
| All (Full PRIME) | 94.1 | 93.9 | 93.5 | 93.4 | 93.5 | 90.6 |
Appendix B.14.2. Zero-Shot Category Performance
| Held-Out Category | Zero-Shot | Full Training |
|---|---|---|
| Comparison Sorting | 84.2% | 94.1% |
| Graph Traversal | 82.6% | 93.9% |
| Automata/State | 78.4% | 93.4% |
| Mathematical | 80.2% | 93.5% |
| Logic/Theorem | 76.8% | 90.6% |
| System Simulation | 81.4% | 93.7% |
Appendix B.15. Computational Overhead Analysis
Appendix B.15.1. Inference Time Breakdown
| Component | Baseline | PRIME | Overhead |
|---|---|---|---|
| Input Encoding | 12.4 | 18.6 | +50% |
| Policy Forward | 45.2 | 48.4 | +7% |
| Verifier Forward | — | 32.6 | — |
| Majority Voting | — | 8.4 | — |
| State Management | 2.1 | 12.8 | +510% |
| Total | 59.7 | 120.8 | +102% |
Appendix B.15.2. Memory Usage
| Model | Baseline | PRIME | Overhead |
|---|---|---|---|
| Qwen3-8B | 16.2 | 24.8 | +53% |
| Qwen3-14B | 28.4 | 42.6 | +50% |
| Gemma3-27B | 54.2 | 78.4 | +45% |
| GPT-OSS-120B | 240.8 | 312.4 | +30% |
Appendix C. PRIME Algorithm Specification
Appendix C.1. Core Algorithm
Algorithm A6: PRIME Framework
Appendix C.2. Reward Function
Appendix C.3. GRPO Objective
Appendix C.4. Majority Voting
Appendix C.5. Verifier Architecture
Appendix D. Prompt Templates
Appendix D.1. Baseline Prompt Template

Appendix D.2. PRIME Structured Prompt Template

Appendix D.3. Task-Specific Templates
Appendix D.3.1. Sorting Task Template

Appendix D.3.2. State Machine Template

Appendix D.3.3. Mathematical Computation Template

Appendix D.4. Verifier Prompt Template

Appendix E. Theoretical Analysis
Appendix E.1. Convergence Analysis
Appendix E.1.1. GRPO Convergence Theorem
1. The policy space is compact and the policy $\pi_\theta$ is Lipschitz continuous in θ
2. The reward function is bounded: $|R(\cdot)| \le R_{\max}$
3. The learning rate schedule satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
4. The group size satisfies $G \ge 2$, so the group-relative baseline is well defined
Appendix E.1.2. Sample Complexity Bound
Appendix E.2. Verification Agent Analysis
Appendix E.2.1. Constraint Satisfaction Guarantees
Appendix E.2.2. Multi-Agent Coordination
Appendix E.3. Computational Complexity
Appendix E.3.1. Time Complexity Analysis
- Line 6: The policy forward pass requires $O(C_\pi)$ time per step, where $C_\pi$ denotes the cost of one executor forward pass
- Line 8: The verifier forward pass requires $O(C_V)$ time
- Lines 9–10: Backtracking and state updates are $O(1)$ operations
Appendix E.3.2. Space Complexity Analysis
Appendix E.4. Optimality Conditions
Appendix E.4.1. Policy Improvement Guarantee
Appendix E.4.2. Regret Bound
Appendix F. Algorithm Variants
Appendix F.1. Verifier Variants
Appendix F.1.1. Lightweight Verifier
Algorithm A7: Lightweight Rule-Based Verifier
Appendix F.1.2. Ensemble Verifier
Algorithm A8: Ensemble Verifier
Appendix F.2. Policy Optimization Variants
Appendix F.2.1. Standard PPO Baseline
Appendix F.2.2. Reinforce with Baseline
Appendix F.3. Execution Strategy Variants
Appendix F.3.1. Greedy Execution
Algorithm A9: Greedy PRIME Execution
Appendix F.3.2. Beam Search Execution
Algorithm A10: Beam Search Execution
Appendix F.4. Adaptive Configuration
Appendix F.4.1. Dynamic Group Size
Appendix F.4.2. Adaptive Iteration Count
Algorithm A11: Adaptive Iteration Control
Appendix G. Extended Mathematical Derivations
Appendix G.1. GRPO Gradient Derivation
Appendix G.2. Variance Reduction Analysis
Appendix G.3. KL Divergence Bound
Appendix H. Implementation Details
Appendix H.1. Hyperparameter Configuration
| Parameter | Value | Description |
|---|---|---|
| Policy Optimization | ||
| Learning rate | Policy update step size | |
| Group size G | 8 | Rollouts per update |
| Clip range | 0.2 | PPO clipping threshold |
| KL coefficient | 0.01 | Divergence penalty |
| Entropy coefficient | 0.01 | Exploration bonus |
| Execution Control | ||
| Max iterations K | 5 | Retry attempts |
| Violation threshold | 0.3 | Backtrack trigger |
| Temperature | 0.7 | Sampling temperature |
| Top-p | 0.95 | Nucleus sampling |
| Max tokens | 4096 | Output length limit |
| Reward Weights | ||
| Task reward | 10.0 | Completion weight |
| Verify reward | 1.0 | Verification weight |
| Efficiency reward | 0.5 | Step efficiency weight |
| Format reward | 0.1 | Format compliance weight |
Appendix H.2. Hardware Configuration
| Component | Specification |
|---|---|
| GPU | 8× NVIDIA H100 80GB SXM5 |
| GPU Bandwidth | 3.35 TB/s per GPU |
| CPU | Dual AMD EPYC 7773X 64-Core |
| CPU Clock | 2.2 GHz base, 3.5 GHz boost |
| Memory | 4TB DDR4-3200 ECC Registered |
| Storage (Hot) | 61.44TB Solidigm D5-P5336 NVMe |
| Storage (Cold) | 1PB HDD RAID 60 |
| Network | 400Gbps InfiniBand NDR |
Appendix H.3. Software Environment
| Software | Version |
|---|---|
| Operating System | Ubuntu 22.04 LTS |
| Python | 3.11.7 |
| PyTorch | 2.2.0 |
| CUDA | 12.3 |
| cuDNN | 8.9.7 |
| Transformers | 4.38.0 |
| vLLM | 0.3.2 |
| Flash Attention | 2.5.0 |
Appendix H.4. Training Configuration
| Configuration | Value |
|---|---|
| Training duration | 72 hours |
| Total training steps | 50,000 |
| Batch size (per GPU) | 4 |
| Gradient accumulation | 8 |
| Effective batch size | 256 |
| Warmup steps | 1,000 |
| Learning rate schedule | Cosine decay |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Mixed precision | BF16 |
| Optimizer | AdamW |
Appendix H.5. Evaluation Protocol
Appendix H.6. Model Configurations
| Model | Params | Context | Precision |
|---|---|---|---|
| Qwen3-8B | 8B | 32K | BF16 |
| Gemma3-12B | 12B | 8K | BF16 |
| Qwen3-14B | 14B | 32K | BF16 |
| GPT-OSS-20B | 20B | 16K | BF16 |
| Gemma3-27B | 27B | 8K | BF16 |
| Qwen3-Coder-30B | 30B | 32K | BF16 |
| GPT-OSS-120B | 120B | 16K | INT8 |
Appendix H.7. Ablation Study Configuration
| Ablation | Description |
|---|---|
| Full PRIME | Complete framework |
| − Multi-Agent | Single agent, no verifier |
| − GRPO | Replace with standard PPO |
| − Iterative Exec. | Single pass, no retry |
| − Self-Consistency | No majority voting |
| − Verifier Agent | No constraint checking |
| Baseline | No optimizations |
Appendix H.8. Error Analysis Protocol
Appendix H.9. Statistical Analysis
Appendix H.10. Reproducibility Statement
Appendix H.11. Computational Cost
| Component | GPU Hours |
|---|---|
| Policy training | 576 |
| Verifier training | 192 |
| Baseline evaluation | 48 |
| PRIME evaluation | 144 |
| Ablation studies | 288 |
| Total | 1,248 |
Appendix H.12. Limitations and Future Work
Appendix I. Extended Training Details
Appendix I.1. Training Curriculum
| Phase | Epochs | Difficulty | Tasks |
|---|---|---|---|
| Warm-up | 1–5 | Easy only | All 86 |
| Intermediate | 6–15 | Easy + Medium | All 86 |
| Full | 16–30 | All levels | All 86 |
Appendix I.2. Data Augmentation
| Strategy | Description | Prob. |
|---|---|---|
| Value Scaling | Scale numeric values by random factor | 0.3 |
| Index Permutation | Randomly permute array indices (sorting) | 0.2 |
| Graph Relabeling | Randomly relabel graph vertices | 0.2 |
| Constraint Reordering | Reorder constraint presentation | 0.4 |
| Format Variation | Vary output format requirements | 0.1 |
Appendix I.3. Training Stability Techniques
1. Gradient Clipping: Maximum gradient norm of 1.0
2. Learning Rate Warmup: Linear warmup over 1,000 steps
3. Entropy Regularization: Coefficient 0.01 to encourage exploration
4. Value Function Clipping: Clip value function updates to a bounded range around the previous estimate
5. Early Stopping: Stop if validation accuracy plateaus for 5 epochs
Appendix I.4. Loss Function Components
| Component | Coefficient | Purpose |
|---|---|---|
| 1.0 | Primary policy optimization | |
| Value function fitting | ||
| Exploration bonus | ||
| Auxiliary prediction tasks |
Appendix J. Detailed Hyperparameter Studies
Appendix J.1. Learning Rate Sensitivity
| Learning Rate | Train Loss | Val Acc | Stability |
|---|---|---|---|
| 0.142 | 91.8% | High | |
| 0.098 | 93.8% | High | |
| 0.087 | 92.4% | Medium | |
| 0.112 | 88.6% | Low | |
| 0.234 | 82.1% | Very Low |
Appendix J.2. Group Size Analysis
| G | Accuracy | Variance | Time | Memory |
|---|---|---|---|---|
| 2 | 88.4% | 12.3 | 1.0× | 1.0× |
| 4 | 91.2% | 6.8 | 1.8× | 1.8× |
| 8 | 93.8% | 3.2 | 3.4× | 3.4× |
| 16 | 94.1% | 1.6 | 6.6× | 6.8× |
| 32 | 94.2% | 0.9 | 13.0× | 13.6× |
Appendix J.3. Iteration Count Analysis
| K | Accuracy | Avg Iters Used | Time |
|---|---|---|---|
| 1 | 82.6% | 1.00 | 1.0× |
| 2 | 88.4% | 1.42 | 1.4× |
| 3 | 91.8% | 1.68 | 1.6× |
| 5 | 93.8% | 2.14 | 2.1× |
| 10 | 94.2% | 2.28 | 2.3× |
Appendix J.4. Temperature Sweep
| Temp | Accuracy | Diversity | Self-Consistency |
|---|---|---|---|
| 0.3 | 91.4% | 0.12 | 92.4% |
| 0.5 | 92.1% | 0.28 | 88.6% |
| 0.7 | 93.8% | 0.45 | 82.4% |
| 0.9 | 90.6% | 0.68 | 71.2% |
| 1.0 | 88.2% | 0.82 | 64.8% |
Appendix J.5. Clipping Parameter Analysis
| Accuracy | KL Div | Stability | |
|---|---|---|---|
| 0.1 | 92.4% | 0.008 | Very High |
| 0.2 | 93.8% | 0.024 | High |
| 0.3 | 93.2% | 0.048 | Medium |
| 0.4 | 91.8% | 0.086 | Low |
Appendix K. Infrastructure Details
Appendix K.1. Distributed Training Configuration
| Component | Configuration |
|---|---|
| Parallelism Strategy | Fully Sharded Data Parallel (FSDP) |
| Sharding Strategy | FULL_SHARD |
| CPU Offloading | Disabled |
| Activation Checkpointing | Enabled (every 2 layers) |
| Communication Backend | NCCL |
| Gradient Accumulation | 8 steps |
| Synchronization | AllReduce (gradient averaging) |
Appendix K.2. Inference Optimization
| Technique | Description | Speedup | Memory | Applicability |
|---|---|---|---|---|
| Flash Attention 2 | Memory-efficient attention computation with tiling | 2.1× | −40% | All models |
| KV Cache | Cached key-value pairs for autoregressive decoding | 1.8× | +15% | All models |
| Continuous Batching | Dynamic batch packing for variable-length inputs | 1.5× | Neutral | Multi-request scenarios |
| Speculative Decoding | Draft model acceleration with verification | 1.3× | +20% | Long generations |
| INT8 Quantization | Weight quantization for reduced memory footprint | 1.4× | −50% | 120B model only |
Appendix K.3. Memory Management
| Component | Memory (GB) | Percentage |
|---|---|---|
| Policy Model Weights | 28.0 | 56.0% |
| Verifier Model Weights | 12.0 | 24.0% |
| KV Cache (per batch) | 4.2 | 8.4% |
| Activations | 3.8 | 7.6% |
| State Buffers | 1.2 | 2.4% |
| CUDA Kernels | 0.8 | 1.6% |
| Total | 50.0 | 100% |
Appendix L. Evaluation Pipeline Details
Appendix L.1. Instance Generation
Appendix L.2. Verification Protocol
1. Format Parsing: Extract structured output from the model response
2. Syntax Validation: Verify the output conforms to the expected format
3. Semantic Verification: Check intermediate states against the algorithm specification
4. Result Comparison: Compare the final answer with the ground truth
A minimal sketch of this pipeline follows the pass-rate table below.
| Stage | Baseline Pass | PRIME Pass |
|---|---|---|
| Format Parsing | 78.4% | 98.2% |
| Syntax Validation | 72.1% | 97.4% |
| Semantic Verification | 42.6% | 95.1% |
| Result Comparison | 26.8% | 93.8% |
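The four stages compose into a short pipeline; the sketch below is illustrative, with the parser and checker callables standing in for the task-specific components (their interfaces are assumptions, not the released code).

```python
def verify_response(response, parse, validate_syntax, check_trace, ground_truth):
    """Run the four-stage verification pipeline; returns the first failing stage
    name, or 'pass' if all stages succeed. The callables are task-specific
    stand-ins with an illustrative interface."""
    parsed = parse(response)                       # 1. format parsing
    if parsed is None:
        return "format_parsing"
    if not validate_syntax(parsed):                # 2. syntax validation
        return "syntax_validation"
    if not check_trace(parsed["trace"]):           # 3. semantic verification of intermediate steps
        return "semantic_verification"
    if parsed["answer"] != ground_truth:           # 4. final result comparison
        return "result_comparison"
    return "pass"
```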
Appendix L.3. Timeout and Resource Limits
| Resource | Limit |
|---|---|
| Maximum Generation Time | 120 seconds |
| Maximum Output Tokens | 4,096 |
| Maximum Retries (PRIME) | 5 |
| Maximum Rollouts (PRIME) | 8 |
| Memory per Instance | 2 GB |
Appendix M. Benchmark Instance Statistics
Appendix M.1. Instance Size Distribution
| Category | Min | Median | Max | Unit |
|---|---|---|---|---|
| Comparison Sorting | 8 | 32 | 256 | elements |
| Non-comparison Sort | 100 | 500 | 5,000 | elements |
| Advanced Sorting | 16 | 128 | 512 | elements |
| Graph Traversal | 20 | 80 | 200 | vertices |
| Tree Operations | 10 | 50 | 200 | nodes |
| Classic Puzzles | 4 | 8 | 20 | problem size |
| Automata/State | 50 | 500 | 10,000 | input chars |
| String/Pattern | 100 | 1,000 | 10,000 | chars |
| Mathematical | 10 | 30 | 60 | digits/vars |
| Logic/Theorem | 10 | 40 | 100 | vars/clauses |
| Data Structures | 20 | 100 | 500 | operations |
| System Simulation | 20 | 75 | 200 | events |
Appendix M.2. Output Trace Statistics
| Category | Min Steps | Median | Max Steps | Tokens |
|---|---|---|---|---|
| Comparison Sorting | 24 | 2,048 | 65,280 | 8,192 |
| Non-comparison Sort | 200 | 5,000 | 50,000 | 12,288 |
| Advanced Sorting | 48 | 1,024 | 8,192 | 6,144 |
| Graph Traversal | 20 | 400 | 8,000 | 4,096 |
| Tree Operations | 10 | 200 | 2,000 | 3,072 |
| Classic Puzzles | 8 | 512 | 1,048,576 | 8,192 |
| Automata/State | 50 | 1,000 | 20,000 | 6,144 |
| String/Pattern | 100 | 2,000 | 20,000 | 8,192 |
| Mathematical | 10 | 100 | 3,600 | 2,048 |
| Logic/Theorem | 10 | 200 | 5,000 | 4,096 |
| Data Structures | 20 | 200 | 1,000 | 3,072 |
| System Simulation | 20 | 150 | 1,000 | 4,096 |
Appendix N. Code and Data Availability
Appendix N.1. Repository Structure


Appendix N.2. License and Usage
Appendix N.3. Reproducibility Checklist
| Item | Status |
|---|---|
| Training code released | Yes |
| Evaluation code released | Yes |
| Pretrained models released | Yes |
| Hyperparameters documented | Yes |
| Random seeds fixed | Yes (base: 42) |
| Hardware requirements specified | Yes |
| Expected runtime documented | Yes |
| Statistical significance tests | Yes |
| Multiple random seeds evaluated | Yes (3 seeds) |
References
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Advances in Neural Information Processing Systems 2020, 33, 1877–1901. [Google Scholar]
- Huang, J.; Chang, K.C.C. Towards reasoning in large language models: A survey. Findings of the Association for Computational Linguistics: ACL 2023, 1049–1065. [Google Scholar]
- Niu, Q.; Liu, J.; Bi, Z.; Feng, P.; Peng, B.; Chen, K.; Li, M.; Yan, L.K.; Zhang, Y.; Yin, C.H.; et al. Large language models and cognitive science: A comprehensive review of similarities, differences, and challenges. In BIO Integration; 2024. [Google Scholar]
- Sipser, M. Introduction to the Theory of Computation. ACM SIGACT News 1996, 27, 27–29. [Google Scholar]
- Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 2023, 55, 1–35. [Google Scholar] [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.; Le, Q.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 2022, 35, 24824–24837. [Google Scholar]
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 2022, 35, 22199–22213. [Google Scholar]
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2023. arXiv:2203.11171. [CrossRef]
- Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 2023, 36. [Google Scholar]
- Meyerson, E.; Paolo, G.; Dailey, R.; Shahrzad, H.; Francon, O.; Hayes, C.F.; Qiu, X.; Hodjat, B.; Miikkulainen, R. Solving a Million-Step LLM Task with Zero Errors. arXiv 2025, arXiv:2511.09030. [CrossRef]
- Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training verifiers to solve math word problems. arXiv 2021, arXiv:2110.14168. [CrossRef]
- Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. Advances in Neural Information Processing Systems 2021, 34. [Google Scholar]
- Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.M.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. [Google Scholar]
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361.
- Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training compute-optimal large language models. arXiv 2022, arXiv:2203.15556. [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems 2017, 30. [Google Scholar]
- OpenAI. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023. arXiv:2307.09288. [CrossRef]
- Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 herd of models. arXiv 2024. arXiv:2407.21783. [CrossRef]
- Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. OPT: Open pre-trained transformer language models. arXiv 2022, arXiv:2205.01068. [CrossRef]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
- Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 technical report. arXiv 2024. arXiv:2407.10671.
- Gemma Team. Gemma: Open models based on Gemini research and technology. arXiv 2024. arXiv:2403.08295. [CrossRef]
- Shazeer, N. GLU variants improve transformer. arXiv 2020, arXiv:2002.05202. [CrossRef]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research 2023, 24, 1–113. [Google Scholar]
- Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical Report. 2024. [Google Scholar]
- Bi, Z.; Chen, K.; Tseng, C.Y.; Zhang, D.; Wang, T.; Luo, H.; Chen, L.; Huang, J.; Guan, J.; Hao, J.; et al. Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI’s Latest Open Source Models. arXiv 2025, arXiv:2508.12461.
- Sun, J.; Zheng, S.; Chen, J.; Luo, J.; Peng, Y.; Xu, Y.; et al. A survey of reasoning with foundation models. arXiv 2024. arXiv:2312.11562. [CrossRef]
- Lewkowycz, A.; Andreassen, A.; Dohan, D.; Dyer, E.; Michalewski, H.; Ramasesh, V.; Slone, A.; Anil, C.; Schlag, I.; Gutman-Solo, T.; et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems 2022, 35. [Google Scholar]
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 2023, 36. [Google Scholar]
- Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. [Google Scholar]
- Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. PAL: Program-aided language models. In Proceedings of the International Conference on Machine Learning, 2023; pp. 10764–10799. [Google Scholar]
- Chen, W.; Ma, X.; Wang, X.; Cohen, W.W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research, 2023. [Google Scholar]
- Zhou, Y.; Muresanu, A.I.; Han, Z.; Paster, K.; Pitis, S.; Chan, H.; Ba, J. Large language models are human-level prompt engineers. arXiv 2023. arXiv:2211.01910. [CrossRef]
- Fernando, C.; Banarse, D.; Michalewski, H.; Osindero, S.; Rocktäschel, T. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv 2024. arXiv:2309.16797.
- Bi, Z.; Chen, K.; Wang, T.; Hao, J.; Song, X. CoT-X: An Adaptive Framework for Cross-Model Chain-of-Thought Transfer and Optimization. arXiv 2025, arXiv:2511.05747.
- Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024. arXiv:2402.07927. [CrossRef]
- Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022; pp. 11048–11064. [Google Scholar]
- Bi, Z.; Chen, L.; Song, J.; Luo, H.; Ge, E.; Huang, J.; Wang, T.; Chen, K.; Liang, C.X.; Wei, Z.; et al. Exploring efficiency frontiers of thinking budget in medical reasoning: Scaling laws between computational resources and reasoning quality. arXiv 2025, arXiv:2508.12140. [CrossRef]
- Clark, A.; de Las Casas, D.; Guy, A.; Mensch, A.; et al. Unified scaling laws for routed language models. In Proceedings of the International Conference on Machine Learning, 2022; pp. 4057–4086. [Google Scholar]
- Tseng, C.Y.; Zhang, D.; Wang, T.; Luo, H.; Chen, L.; Huang, J.; Guan, J.; Hao, J.; Song, J.; Bi, Z. 47B Mixture-of-Experts Beats 671B Dense Models on Chinese Medical Examinations. arXiv 2025, arXiv:2511.21701.
- Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; et al. PaLM 2 technical report. arXiv 2023, arXiv:2305.10403. [Google Scholar] [CrossRef]
- Sun, C.; Huang, S.; Pompili, D. LLM-based Multi-Agent Reinforcement Learning: Current and Future Directions. arXiv 2024. arXiv:2405.11106.
- Li, P.; et al. AGILE: A Novel Reinforcement Learning Framework of LLM Agents. Advances in Neural Information Processing Systems 2024, 37. [Google Scholar]
- Lyu, X.; et al. LLM Collaboration With Multi-Agent Reinforcement Learning. arXiv 2025, arXiv:2508.04652. [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 2022, 35, 27730–27744. [Google Scholar]
- Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv 2022, arXiv:2204.05862. [Google Scholar] [CrossRef]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
- Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI feedback. arXiv 2022, arXiv:2212.08073. [Google Scholar] [CrossRef]
- Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 2023, 36. [Google Scholar]
- Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Zhang, M.; Li, Y.; Wu, Y.; Guo, D. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv 2024. arXiv:2402.03300.
- Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; Cobbe, K. Let’s Verify Step by Step. In Proceedings of the International Conference on Learning Representations, 2024. [Google Scholar]
- Snell, C.; Lee, J.; Xu, K.; Kumar, A. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv 2024. arXiv:2408.03314. [CrossRef]
- Zhang, K.; et al. A brain-inspired agentic architecture to improve planning with LLMs. Nature Communications 2025, 16. [Google Scholar] [CrossRef] [PubMed]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, 2022. [Google Scholar]
- Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems 2024, 36. [Google Scholar]
- Ziegler, D.M.; Stiennon, N.; Wu, J.; Brown, T.B.; Radford, A.; Amodei, D.; Christiano, P.; Irving, G. Fine-Tuning Language Models from Human Preferences. arXiv 2019, arXiv:1909.08593. [CrossRef]
- Lee, H.; Phatale, S.; Mansoor, H.; Lu, K.; Mesnard, T.; Bishop, C.; Carbune, V.; Rastogi, A. RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. arXiv 2023, arXiv:2309.00267. [CrossRef]
- Wang, Y.; et al. Reasoning Aware Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling. In Proceedings of NAACL, 2025. [Google Scholar]
- Dziri, N.; et al. Faith and Fate: Limits of Transformers on Compositionality. Advances in Neural Information Processing Systems 2024, 36. [Google Scholar]
- Mirzadeh, S.I.; et al. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models. In Apple Machine Learning Research; 2025. [Google Scholar]
- Xiong, W.; Liu, J.; Molybog, I.; Zhang, H.; Bhargava, P.; Hou, R.; Martin, L.; Rungta, R.; et al. Effective long-context scaling of foundation models. arXiv 2024, arXiv:2309.16039. [CrossRef]
- Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar]
- Zhang, S.; Dong, L.; Li, X.; Zhang, S.; Sun, X.; Wang, S.; Li, J.; Hu, R.; Zhang, T.; Wu, F.; et al. Instruction tuning for large language models: A survey. arXiv 2023, arXiv:2308.10792. [Google Scholar]
- Press, O.; Smith, N.A.; Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. In Proceedings of the International Conference on Learning Representations, 2022. [Google Scholar]
- Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H.W.; Chowdhery, A.; Le, Q.V.; Chi, E.H.; Zhou, D.; et al. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv 2022, arXiv:2210.09261. [Google Scholar]
- Chollet, F. On the Measure of Intelligence. arXiv 2019, arXiv:1911.01547.
- Jimenez, C.E.; Yang, J.; Wettig, A.; Yao, S.; Pei, K.; Press, O.; Narasimhan, K.R. SWE-bench: Can Language Models Resolve Real-world Github Issues? In Proceedings of the The Twelfth International Conference on Learning Representations, 2024. [Google Scholar]
- Willard, B.T.; Louf, R. Efficient guided generation for large language models. arXiv 2023. arXiv:2307.09702. [CrossRef]
- Peters, T. Timsort Algorithm. Python Software Foundation 2002. [Google Scholar]
- Musser, D.R. Introspective Sorting and Selection Algorithms. Software: Practice and Experience 1997, 27, 983–993. [Google Scholar] [CrossRef]
- Batcher, K.E. Sorting Networks and Their Applications. AFIPS Conference Proceedings 1968, 32, 307–314. [Google Scholar]
- Tarjan, R. Depth-First Search and Linear Graph Algorithms. SIAM Journal on Computing 1972, 1, 146–160. [Google Scholar] [CrossRef]
- Dijkstra, E.W. A Note on Two Problems in Connexion with Graphs. Numerische Mathematik 1959, 1, 269–271. [Google Scholar] [CrossRef]
- Hart, P.E.; Nilsson, N.J.; Raphael, B. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics 1968, 4, 100–107. [Google Scholar] [CrossRef]
- Floyd, R.W. Algorithm 97: Shortest Path. Communications of the ACM 1962, 5, 345. [Google Scholar] [CrossRef]
- Kahn, A.B. Topological Sorting of Large Networks. Communications of the ACM 1962, 5, 558–562. [Google Scholar] [CrossRef]
- Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms. 2009. [Google Scholar]
- Huffman, D.A. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE 1952, 40, 1098–1101. [Google Scholar] [CrossRef]
- Hopcroft, J.E.; Motwani, R.; Ullman, J.D. Introduction to Automata Theory, Languages, and Computation. 2006. [Google Scholar]
- Turing, A.M. On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society 1936, s2-42, 230–265. [Google Scholar]
- Petri, C.A. Kommunikation mit Automaten. PhD Thesis, University of Bonn, 1962. [Google Scholar]
- Wolfram, S. Universality and Complexity in Cellular Automata. Physica D: Nonlinear Phenomena 1984, 10, 1–35. [Google Scholar] [CrossRef]
- Knuth, D.E.; Morris, J.H.; Pratt, V.R. Fast Pattern Matching in Strings. SIAM Journal on Computing 1977, 6, 323–350. [Google Scholar] [CrossRef]
- Dantzig, G.B. Maximization of a Linear Function of Variables Subject to Linear Inequalities. Activity Analysis of Production and Allocation 1951, 339–347. [Google Scholar]
- Davis, M.; Logemann, G.; Loveland, D. A Machine Program for Theorem-Proving. Communications of the ACM 1962, 5, 394–397. [Google Scholar] [CrossRef]
- Robinson, J.A. A Machine-Oriented Logic Based on the Resolution Principle. Journal of the ACM 1965, 12, 23–41. [Google Scholar] [CrossRef]
- Milner, R. A Theory of Type Polymorphism in Programming. Journal of Computer and System Sciences 1978, 17, 348–375. [Google Scholar] [CrossRef]
- Tarjan, R.E. Efficiency of a Good But Not Linear Set Union Algorithm. Journal of the ACM 1975, 22, 215–225. [Google Scholar] [CrossRef]
- Lin, Z.; Xu, Z.; Zhao, T.; et al. CriticBench: Benchmarking LLMs for Critique-Correct Reasoning. In Findings of the Association for Computational Linguistics: ACL; 2024. [Google Scholar]
- Herbold, S. SortBench: Benchmarking LLMs based on their ability to sort lists. arXiv 2025, arXiv:2504.08312. [CrossRef]
- Lin, B.Y.; et al. ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. arXiv 2025, arXiv:2502.01100. [CrossRef]
- Robbins, H.; Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics 1951, 22, 400–407. [Google Scholar] [CrossRef]
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning. PMLR, 2015; pp. 1889–1897. [Google Scholar]
- Kakade, S.; Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2002; pp. 267–274. [Google Scholar]
| Model | Params | Architecture |
|---|---|---|
| Qwen3-8B | 8B | Grouped-Query Attention |
| Gemma3-12B | 12B | Multi-Query Attention |
| Qwen3-14B | 14B | Grouped-Query Attention |
| GPT-OSS-20B | 20B | Multi-Head Attention |
| Gemma3-27B | 27B | Multi-Query Attention |
| Qwen3-Coder-30B | 30B | Code-Specialized |
| GPT-OSS-120B | 120B | Multi-Head Attention |
| Model | Baseline Accuracy | Optimized Accuracy | Relative Gain |
|---|---|---|---|
| Qwen3-8B | 24.3% | 83.8% | +244.9% |
| Gemma3-12B | 30.5% | 88.2% | +189.2% |
| Qwen3-14B | 28.2% | 85.8% | +204.3% |
| GPT-OSS-20B | 38.8% | 92.1% | +137.4% |
| Gemma3-27B | 37.2% | 89.5% | +140.6% |
| Qwen3-Coder-30B | 45.2% | 94.3% | +108.6% |
| GPT-OSS-120B | 57.8% | 96.4% | +66.8% |
| Average | 37.4% | 90.0% | +140.6% |
| Metric | Baseline | Optimized |
|---|---|---|
| Mean Latency (ms) | 331 | 518 |
| Overhead Ratio | 1.00× | 1.56× |
| Model | |||||
|---|---|---|---|---|---|
| Qwen3-8B | 96% | 88% | 84% | 79% | 72% |
| Gemma3-12B | 96% | 94% | 88% | 82% | 75% |
| Qwen3-14B | 97% | 90% | 85% | 79% | 73% |
| GPT-OSS-20B | 98% | 96% | 93% | 87% | 79% |
| Gemma3-27B | 98% | 93% | 89% | 85% | 82% |
| Qwen3-Coder-30B | 99% | 96% | 97% | 93% | 84% |
| GPT-OSS-120B | 100% | 99% | 97% | 95% | 91% |
| Average | 97.7% | 93.7% | 90.4% | 85.7% | 79.4% |
| Model | Column Conflicts | Diagonal Conflicts | Parsing Errors |
|---|---|---|---|
| Qwen3-8B | 18% | 71% | 11% |
| Gemma3-12B | 15% | 76% | 9% |
| Qwen3-14B | 16% | 74% | 10% |
| GPT-OSS-20B | 12% | 82% | 6% |
| Gemma3-27B | 14% | 79% | 7% |
| Qwen3-Coder-30B | 10% | 86% | 4% |
| GPT-OSS-120B | 8% | 89% | 3% |
| Average | 13% | 80% | 7% |
| Configuration | 8B | 30B | 120B |
|---|---|---|---|
| Full Optimized | 83.8% | 94.3% | 96.4% |
| − Worked Examples | 72.1% | 89.7% | 94.1% |
| − Constraint Enumeration | 65.4% | 85.2% | 91.8% |
| − Verification Procedure | 58.9% | 78.6% | 88.3% |
| − Format Specification | 79.2% | 92.1% | 95.7% |
| Baseline (all removed) | 24.3% | 45.2% | 57.8% |
| Benchmark | Tasks | Instances | Categories | Max Steps | Trace Verify |
|---|---|---|---|---|---|
| GSM8K [11] | 1 | 8,500 | 1 | ∼10 | ✗ |
| MATH [12] | 7 | 12,500 | 7 | ∼50 | ✗ |
| BIG-Bench Hard [67] | 23 | 6,511 | 4 | ∼100 | ✗ |
| ARC-AGI [68] | 1 | 1,000 | 1 | ∼30 | ✗ |
| HumanEval [30] | 164 | 164 | 1 | — | ✗ |
| SWE-Bench [69] | — | 2,294 | 1 | — | ✗ |
| PRIME-Bench (Ours) | 86 | 51,600 | 12 | >1M | ✓ |
| Category | Tasks | Max Steps |
|---|---|---|
| Comparison-based Sorting | 15 | 1,000,000 |
| Non-comparison Sorting | 3 | 300,000 |
| Advanced/Hybrid Sorting | 10 | 600,000 |
| Graph Traversal Algorithms | 6 | 125,000 |
| Tree Data Structure Ops | 5 | 100,000 |
| Classic Algorithm Puzzles | 6 | 1,048,575 |
| Automata & State Machines | 8 | 200,000 |
| String & Pattern Matching | 5 | 100,000 |
| Mathematical/Numerical | 8 | 8,000 |
| Logic & Theorem Proving | 6 | 50,000 |
| Data Structure Operations | 6 | 100,000 |
| System Simulation | 8 | 100,000 |
| Total | 86 | — |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).