Preprint
Article

This version is not peer-reviewed.

PetroAgents: A Multi-Agent, Multi-Modal Large-Language-Model Framework for Integrated Upstream Petroleum Asset Reasoning, Benchmarked on the Equinor Volve Open Dataset

Rong Lu *

Submitted: 16 June 2026

Posted: 18 June 2026

Abstract
We present PetroAgents, a multi-agent, multi-modal large-language-model framework for petroleum-engineering reasoning on the Equinor Volve open dataset. The target architecture mirrors an integrated asset team: discipline-specific evidence surfaces, cross-examination, a Council Synthesiser, distributional well-action proposals, and risk review. Read as an agent-design pattern map, the architecture is organised around the seven cognitive functions of a language agent, with the per-discipline evidence lock as its governance layer; four functions are implemented and evaluated in this submission and three are specified as design. The current quantitative evidence is narrower by design. We specify Volve Bench, a six-task Volve benchmark suite spanning DRILL-NPT root-cause attribution, formation top picking, six-month production forecasting, stuck-pipe early warning, multi-modal Discovery-Report QA, and per-wellbore lifecycle forensic analysis over 26 wellbores, 1,759 daily drilling reports, 56 million WITSML rows, 602 LAS files, 5.7 million horizon points, and a decade of production. This submission reports two landed evidence slices: a three-seed DRILL-NPT study on stratified samples drawn from a 1,750-example pool, and a 12-question DISCOVERY-QA smoke test on three rendered pages of the 194-page Hugin Discovery Report. Every reported LLM call goes through a local OpenAI-compatible gateway using locally-hosted open weights (GPT-OSS-120B, Qwen3.6-35B, Gemma-4-31B, MiniMax-M2.7, Qwen3-VL-235B-FP8); no paid frontier API is invoked. On DRILL-NPT, the four-family vote attains macro-F1 0.464 ± 0.012 on the broad all-wellbore sample and 0.442 ± 0.019 on the WITSML-applicable subset. The Drilling+HSE+Council path lifts WITSML-applicable macro-F1 by +0.048 over the single-LLM baseline, but the paired test is not significant at three seeds (p = 0.22), so we report it as directional evidence rather than a settled win. 
A same-subset evidence-redaction ablation shows that exposing state_detail and proprietary_code lifts B1 from 0.355 to 0.431 macro-F1, quantifying how much of DRILL-NPT is label-code leakage rather than prose reasoning. On DISCOVERY-QA, Qwen3-VL reading rendered page images reaches a 0.958 keyword-hit score versus 0.792 for GPT-OSS-120B reading pdftotext, a bounded +16.7 percentage-point lift concentrated on figure annotations and OCR-damaged numerics.
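The two DRILL-NPT quantities quoted above, a vote across model families and a macro-F1 score, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example labels, and the tie-breaking rule (ties go to the label listed first) are assumptions made for illustration, and the abstract does not state how the four-family vote resolves ties.

```python
from collections import Counter


def plurality_vote(labels):
    """Return the most frequent label among the voters.

    Tie-breaking is an assumption: the tied label that appears
    first in the input wins.
    """
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class never predicted or present scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical root-cause labels from four model families for one report.
votes = ["equipment", "weather", "equipment", "hole_problem"]
print(plurality_vote(votes))

# Hypothetical gold labels and voted predictions for four reports.
gold = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
print(round(macro_f1(gold, pred), 4))
```

Because macro-F1 averages per-class scores without weighting by class frequency, rare root-cause classes count as much as common ones, which is one reason a stratified sample matters for this kind of metric.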
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated