Submitted:
16 June 2026
Posted:
22 June 2026
Abstract
Keywords:
1. Introduction
- A 7,440-item upstream petroleum and subsurface MCQ benchmark generated from 372 technical reference books, with exactly 20 mechanically validated accepted questions per source.
- A deterministic validation pipeline for schema, source coverage, answer balance, evidence references, standalone wording, domain labels, question types, rejected-item fingerprints, and manifest consistency.
- A locked nine-model locally hosted evaluation protocol with per-item response caching, strict answer extraction, macro-by-book metrics, latency reporting, and failure diagnostics.
- A transparent release posture for private or copyrighted sources: source chunks and source paths are not redistributed, while source ids, hashes, line references, question metadata, and aggregate results remain auditable by the author.
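As an illustration of what "strict answer extraction" in the evaluation-protocol contribution could mean in practice, here is a minimal sketch. The accepted output pattern (a lone letter A to D, optionally prefixed by "Answer:") and the function name `extract_answer` are assumptions made for this example; the prompt and parser actually used are not shown in this section.

```python
import re
from typing import Optional

# Hypothetical strict pattern: the entire response must be one choice letter,
# optionally written as "Answer: X", "(X)", or "X." and nothing else.
_STRICT = re.compile(r"^\s*(?:answer\s*:\s*)?\(?([ABCD])\)?\.?\s*$", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the chosen letter, or None when the response is not strictly parseable."""
    match = _STRICT.match(response)
    return match.group(1).upper() if match else None
```

Under this sketch, any response that hedges between letters or adds free text returns `None` and would be counted as a parse failure rather than guessed at.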
2. Related Work
3. Benchmark Design
3.1. Scope
3.2. Task Format
3.3. Schema
- id, version, question_format, and language
- domains, topics, difficulty, and question_type
- question, four choices, answer_index, and answer_key
- rationale, requires_calculation, trap_type, and contamination_risk
- source_book with source id, title, logical source URI, rights status, corpus category, and SHA-256
- evidence_refs with evidence id, line span, heading, page hint when available, and excerpt hash
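The field list above can be turned into a simple per-item check. The sketch below is not the paper's validator: it assumes that `answer_index` is zero-based, that `answer_key` is the matching letter in A to D, and that `choices` is stored as a list; the function name `validate_item` is hypothetical.

```python
REQUIRED_FIELDS = {
    "id", "version", "question_format", "language",
    "domains", "topics", "difficulty", "question_type",
    "question", "choices", "answer_index", "answer_key",
    "rationale", "requires_calculation", "trap_type", "contamination_risk",
    "source_book", "evidence_refs",
}


def validate_item(item: dict) -> list[str]:
    """Return a list of human-readable schema problems (empty when the item passes)."""
    errors = []
    missing = REQUIRED_FIELDS - set(item)
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if len(item.get("choices", [])) != 4:
        errors.append("expected exactly four choices")
    idx = item.get("answer_index")
    if idx not in (0, 1, 2, 3):
        errors.append("answer_index must be 0..3 (assumed zero-based)")
    elif item.get("answer_key") != "ABCD"[idx]:
        errors.append("answer_key does not match answer_index")
    if not item.get("evidence_refs"):
        errors.append("at least one evidence reference is required")
    return errors
```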
4. Corpus and Construction Pipeline
4.1. Corpus
4.2. Generation
4.3. Validation
5. Evaluation Protocol
5.1. Model Allowlist
5.2. Prompting and Metrics
6. Results
| Rank | Model | Micro accuracy | Macro-by-book accuracy | Correct | Parse failures | Empty | Length | Errors | Mean latency (s) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | qwen-3-vl-235b-fp8 | 84.70% | 84.70% | 6,302/7,440 | 31 | 0 | 0 | 31 | 0.95 |
| 2 | gemma-4-31b | 84.13% | 84.13% | 6,259/7,440 | 26 | 0 | 0 | 26 | 6.74 |
| 3 | qwen-3-coder | 83.36% | 83.36% | 6,202/7,440 | 86 | 0 | 87 | 0 | 0.76 |
| 4 | qwen-3-vl-30b | 81.83% | 81.83% | 6,088/7,440 | 94 | 0 | 96 | 0 | 1.30 |
| 5 | gpt-oss-120b | 81.67% | 81.67% | 6,076/7,440 | 50 | 50 | 50 | 0 | 4.69 |
| 6 | qwen-3-vl-8b | 80.03% | 80.03% | 5,954/7,440 | 45 | 0 | 16 | 30 | 1.05 |
| 7 | qwen-3-vl-4b | 76.65% | 76.65% | 5,703/7,440 | 44 | 0 | 45 | 0 | 0.79 |
| 8 | minimax-m2.7 | 64.13% | 64.13% | 4,771/7,440 | 2,144 | 2,060 | 2,066 | 77 | 22.49 |
| 9 | qwen-3.6-35b | 25.63% | 25.63% | 1,907/7,440 | 5,322 | 429 | 446 | 4,876 | 45.76 |
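The Micro and Macro columns coincide for every model. A likely reason is that each source book contributes exactly 20 items, so averaging per-book accuracies weights every item equally. The sketch below illustrates that relationship; the `(source_id, is_correct)` record layout and the function name are assumptions for the example, not the paper's evaluation code.

```python
from collections import defaultdict


def micro_and_macro(records):
    """records: iterable of (source_id, is_correct) pairs for one model."""
    per_book = defaultdict(lambda: [0, 0])  # source_id -> [correct, total]
    for source_id, is_correct in records:
        per_book[source_id][0] += int(is_correct)
        per_book[source_id][1] += 1
    total_correct = sum(c for c, _ in per_book.values())
    total = sum(n for _, n in per_book.values())
    micro = total_correct / total                                  # item-weighted
    macro = sum(c / n for c, n in per_book.values()) / len(per_book)  # book-weighted
    return micro, macro
```

With equal items per book the two values agree; they diverge only when books contribute different numbers of items.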
| Model | Medium | Hard | Expert |
|---|---|---|---|
| qwen-3-vl-235b-fp8 | 86.71% | 84.19% | 85.14% |
| gemma-4-31b | 85.93% | 84.31% | 83.47% |
| qwen-3-coder | 83.43% | 83.12% | 83.73% |
| qwen-3-vl-30b | 80.92% | 81.63% | 82.32% |
| gpt-oss-120b | 82.85% | 81.40% | 81.87% |
| qwen-3-vl-8b | 80.54% | 79.31% | 81.10% |
| qwen-3-vl-4b | 76.11% | 76.56% | 76.91% |
| minimax-m2.7 | 74.37% | 64.46% | 61.56% |
| qwen-3.6-35b | 47.01% | 23.35% | 25.14% |
7. Audit and Reproducibility
- data/benchmark/upstreambenchv0.1.jsonl: frozen 7,440-row benchmark.
- data/benchmark/upstreambenchv0.1manifest.json: source coverage, duplicate groups, local model roster, and generation metadata.
- data/working/upstreambench/validationstats.json: strict validation output.
- data/working/upstreambench/modelrosterverification.json: sanitized local model-roster snapshot.
- eval/results/upstreambenchruns.json: canonical exact nine-model result summary rebuilt from validated private local cache records.
- eval/results/upstreambenchleaderboard.md: generated local evaluation table.
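A reader with access to the frozen JSONL file could confirm the headline counts (7,440 rows, 372 sources, 20 items per source) with a short script along these lines. This is a sketch under assumptions: the nested key `row["source_book"]["source_id"]` is a guess at how the source id is stored, and `check_frozen_counts` is a hypothetical helper, not part of the released tooling.

```python
import json
from collections import Counter


def check_frozen_counts(lines, expected_rows=7440, expected_sources=372, per_source=20):
    """Return a list of count mismatches found in an iterable of JSONL lines."""
    rows = [json.loads(line) for line in lines if line.strip()]
    # Assumption: the source id is stored at row["source_book"]["source_id"].
    per_book = Counter(row["source_book"]["source_id"] for row in rows)
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"rows: {len(rows)} != {expected_rows}")
    if len(per_book) != expected_sources:
        problems.append(f"sources: {len(per_book)} != {expected_sources}")
    off_target = sorted(s for s, n in per_book.items() if n != per_source)
    if off_target:
        problems.append(f"sources not at {per_source} items: {off_target}")
    return problems
```

Passing an open file handle for `data/benchmark/upstreambenchv0.1.jsonl` as `lines` should return an empty list if the file matches the reported counts.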
8. Limitations
9. Conclusion
Appendix A. Representative Items and Capability Splits
Appendix A.1. Complete Example Items
- Calculation: gas-processing heat duty. Question: A gas-processing heat duty is 2.5 MJ/s. Using 1 kW = 1.341 hp, what horsepower is equivalent to that duty? Choices: A. About 2,500 hp; B. About 1,860 hp; C. About 3,100 hp; D. About 3,350 hp. Ground truth: D. Rationale: 2.5 MJ/s is 2,500 kW, and 2,500 kW × 1.341 hp/kW ≈ 3,350 hp. Provenance: source id prefix handbook-of-natural-gas-...-afb3b220, evidence hash prefix c2bc3ed34326, lines 14972–14977.
- Assumption check: P&ID documentation. Question: Which item must be excluded from a P&ID per documentation criteria? Choices: A. Relief valve set pressure; B. Controller action, for example direct or reverse acting; C. Analyzer tubing size and specification; D. Local hand switch tags on control panels. Ground truth: B. Rationale: Controller actions, set points, and configuration details are excluded, while operational parameters such as set pressure and tubing specifications must be shown. Provenance: source id prefix pid-doc-...-aa1ccb02, evidence hash prefix b3f4ea868c4f, lines 874–879.
- Failure mode: sulfide stress cracking. Question: Which failure mode is characteristic of sulfide stress cracking in sour environments? Choices: A. Ductile deformation with necking and elongation before fracture; B. Intergranular cracking occurring only at elevated temperatures above ; C. Brittle fracture with minimal plastic deformation, high crack propagation velocity, and transcrystalline cracking; D. Fatigue failure initiated by cyclic pressure fluctuations in the tubing string. Ground truth: C. Rationale: Sulfide stress cracking manifests as brittle fracture with high crack velocity and little visible warning under tensile stress in H2S environments. Provenance: source id prefix advanced-well-completion-...-c4f8e8a1, evidence hash prefix 8847f8abf732, lines 14302–14313.
- Geophysics: thin-bed seismic resolution. Question: In seismic attribute analysis for channel detection, why might a 20-Hz time-frequency slice derived from CWT processing be insufficient to fully resolve the vertical extent and internal architecture of a thin-bedded channel system? Choices: A. Fixed-frequency slices cannot distinguish lateral facies changes from vertical stratigraphy; B. CWT suppresses high-frequency components needed for thin-bed tuning; C. Phase distortion obscures bed boundaries; D. 20 Hz is below the tuning frequency for typical thin beds, causing constructive interference that masks vertical resolution. Ground truth: D. Rationale: The dominant wavelength at 20 Hz can exceed thin-bed thickness, so reflections from top and base interfere and prevent vertical resolution of the channel architecture. Provenance: source id prefix seismic-attributes-...-dca26ca5, evidence hash prefix 8eee3fcfb4ec, line 2963.
- Reservoir engineering: mobility anisotropy. Question: During a formation test, a near probe records horizontal mobility of 490 md/cp and vertical mobility of 13 md/cp, with a skin factor of 3.0. What does this mobility contrast most strongly suggest? Choices: A. Severe near-wellbore damage affecting vertical flow more than horizontal; B. Conductive horizontal fractures enhancing lateral flow; C. Significant vertical compartmentalization with limited interlayer connectivity; D. Anisotropic permeability with , consistent with laminated sandstone. Ground truth: C. Rationale: The large horizontal-to-vertical mobility contrast indicates strong vertical flow resistance and limited vertical communication between layers despite lateral connectivity. Choice D cites the correct ratio but attributes it to grain-scale lamination, whereas the suppressed vertical mobility most directly indicates interlayer flow barriers, hence C. Provenance: source id prefix formation-testing-fluid-analysis-...-3c578615, evidence hash prefix f3cf68a68dce, line 477.
- Structural geology: fault-bounded trap spillpoint. Question: In structural interpretation across faults, what does the spillpoint of a fault-bounded trap represent? Choices: A. The shallowest depth at which hydrocarbons can migrate across the fault plane via juxtaposition of permeable layers; B. The deepest point of structural closure on the downthrown side; C. The point of maximum fault seal capacity; D. The elevation where the fault plane intersects the top reservoir horizon on the upthrown side. Ground truth: A. Rationale: The spillpoint is the shallowest level at which hydrocarbons can leak or migrate across the fault, controlled by reservoir juxtaposition across the fault plane. Provenance: source id prefix 3-d-seismic-interpretation-...-aedee953, evidence hash prefix 5dd9458c2bc3, line 1177.
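The arithmetic behind two of the items above can be written out explicitly. The first three lines work the heat-duty conversion and show that inverting the conversion factor lands near choice B; reading B as an "inverted conversion" distractor is an inference from the numbers, not something the item states. The last line gives the ratio of the two mobilities quoted in the reservoir-engineering item.

```latex
\begin{align*}
2.5\ \text{MJ/s} &= 2{,}500\ \text{kW} \\
2{,}500\ \text{kW} \times 1.341\ \text{hp/kW} &= 3{,}352.5\ \text{hp} \approx 3{,}350\ \text{hp} \quad \text{(choice D)} \\
2{,}500 \div 1.341 &\approx 1{,}864 \quad \text{(close to choice B)} \\
\frac{490\ \text{md/cp}}{13\ \text{md/cp}} &\approx 37.7
\end{align*}
```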
Appendix A.2. Valid-Output Capability Splits
- Coiled tubing RIH precaution in high-pressure zones. Ground truth: D. qwen-3-vl-235b-fp8 answered D correctly, while qwen-3-coder answered C. The split distinguishes surge-pressure formation-fracture risk from swab-kick control.
- Layered-earth integral-equation geometry limit. Ground truth: B. qwen-3-vl-30b answered B correctly, while qwen-3-coder answered A. The split distinguishes lateral homogeneity assumptions from the lack of Green’s functions for non-parallel interfaces.


| Metric | Value |
|---|---|
| Technical source books | 372 |
| Source corpus size | 389 MB |
| Accepted MCQ items | 7,440 |
| Questions per source file | 20 |
| Source files with accepted items | 372 |
| Answer-key count per letter | 1,860 |
| Exact duplicate-content groups | 4 |
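The answer-key count of 1,860 per letter corresponds to 7,440 items split evenly across four letters. A balance check of this kind can be sketched as follows; the function name and the flat list of letters as input are assumptions for illustration.

```python
from collections import Counter


def answer_balance(answer_keys, letters="ABCD"):
    """Return (per-letter counts, True if every letter appears equally often)."""
    counts = Counter(answer_keys)
    target, remainder = divmod(len(answer_keys), len(letters))
    # All listed letters hitting the target also rules out any unlisted letter.
    balanced = remainder == 0 and all(counts[letter] == target for letter in letters)
    return counts, balanced
```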
| Difficulty | Count | Question type | Count | Domain | Count |
|---|---|---|---|---|---|
| Medium | 519 | Mechanism | 3,976 | Geophysics and seismology | 3,915 |
| Hard | 4,296 | Diagnosis | 1,371 | Reservoir engineering | 1,582 |
| Expert | 2,625 | Interpretation | 819 | Petrophysics | 1,158 |
| | | Workflow design | 321 | Geomechanics | 969 |
| | | Comparative reasoning | 259 | Drilling and completions | 839 |
| | | Calculation | 242 | Petroleum geology and stratigraphy | 834 |
| | | Assumption check | 202 | Geochemistry and mineralogy | 596 |
| | | Failure mode | 128 | Production and facilities | 530 |
| | | Edge case | 63 | Economics, management, and safety | 238 |
| | | Counterfactual | 59 | Near-surface, environmental, and marine | 202 |
| | | | | Machine learning and data methods | 153 |
| Model id | Hosted family |
|---|---|
| qwen-3-coder | Qwen/Qwen3-Coder-Next |
| qwen-3-vl-235b-fp8 | Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 |
| qwen-3-vl-30b | Qwen/Qwen3-VL-30B-A3B-Instruct |
| qwen-3-vl-8b | Qwen/Qwen3-VL-8B-Instruct |
| qwen-3-vl-4b | Qwen/Qwen3-VL-4B-Instruct |
| qwen-3.6-35b | Qwen/Qwen3.6-35B-A3B |
| gpt-oss-120b | openai/gpt-oss-120b |
| gemma-4-31b | google/gemma-4-31B-it |
| minimax-m2.7 | MiniMaxAI/MiniMax-M2.7 |
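A locked allowlist like the roster above can be enforced in code by refusing any model id that is not in the table. The mapping below copies the table verbatim; the `resolve_model` helper and its error behavior are hypothetical, shown only to illustrate the idea.

```python
# Model id -> hosted family, copied from the roster table.
MODEL_ALLOWLIST = {
    "qwen-3-coder": "Qwen/Qwen3-Coder-Next",
    "qwen-3-vl-235b-fp8": "Qwen/Qwen3-VL-235B-A22B-Instruct-FP8",
    "qwen-3-vl-30b": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "qwen-3-vl-8b": "Qwen/Qwen3-VL-8B-Instruct",
    "qwen-3-vl-4b": "Qwen/Qwen3-VL-4B-Instruct",
    "qwen-3.6-35b": "Qwen/Qwen3.6-35B-A3B",
    "gpt-oss-120b": "openai/gpt-oss-120b",
    "gemma-4-31b": "google/gemma-4-31B-it",
    "minimax-m2.7": "MiniMaxAI/MiniMax-M2.7",
}


def resolve_model(model_id: str) -> str:
    """Return the hosted family for an allowlisted id; reject anything else."""
    try:
        return MODEL_ALLOWLIST[model_id]
    except KeyError:
        raise ValueError(f"model id not in the locked allowlist: {model_id!r}") from None
```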
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).