1. Introduction
A key challenge in petroleum engineering is monitoring reservoir and production parameters throughout the lifecycle of wells and reservoirs. Tracking pressure, temperature, and production rates over time is essential for detecting anomalies, identifying issues like scaling or reservoir damage, understanding reservoir configuration, forecasting performance, and optimizing recovery. Reliable, continuous data acquisition is crucial for reservoir engineering, well testing, flow assurance, and operations management.
In modern offshore production systems, these measurements are collected at various nodes of the production and flow system, as described below (and illustrated by Apio et al. [1]):
Topside: Measurements of oil, water, and gas production rates, along with pressures and temperatures upstream and downstream of the choke valve.
Wellhead: Measurements of pressure and temperature via temperature and pressure transducers (TPTs).
Bottom-hole: Measurements of pressure and temperature via permanent downhole gauges (PDGs).
Among these, PDGs are arguably the most important, as they are located near the reservoir and provide essential data on its behavior, notably the Bottom-Hole Pressure (BHP) [1,2]. However, they are also the most prone to failure [3] due to harsh operating conditions (high pressures and high temperatures) [4] and the long distances that signals must travel to reach the monitoring stations at the platforms. Moreover, PDGs are the most challenging to replace: because they are positioned at the bottom of wells, replacement requires complex and costly workover operations that are rarely economically viable [5].
In oilfields with lower productivities and revenues, even the initial installation costs of PDGs in all wells may be prohibitive. In cases where PDG signal loss occurs or new installations are not feasible, the use of soft sensors becomes an essential alternative.
Soft sensors (or software-based sensors) refer to a suite of methods and tools designed to replace or back up physical sensors (the “hard” sensors) or to enable the monitoring of variables that are difficult or impossible to measure directly, due to challenges such as hostile and unreachable environments, disturbances to the process, measurement delays, or high costs [6]. They can be used to estimate the values of real physical quantities or of other virtual variables, such as quality metrics for industrial processes [7], and can be used in real-time applications [8], including digital twins.
The development of soft sensors can be based on methods such as mathematical relationships, statistics, and data-driven machine learning (ML), among others, as well as their combination with analytical hardware data [6,7].
In the field of oil and gas exploration and production, the literature primarily features studies utilizing soft sensors as virtual meters for flow variables and downhole quantities. For the first group, a common problem addressed in the literature regarding petroleum production systems is the allocation of individual well production rates. This is relevant because the production from a group of wells, e.g., all those on a platform, is typically combined in a single separator vessel, and daily flow measurements for accounting and regulatory purposes are performed for the total output [9]. Individual rates must then be calculated or estimated to support other activities (we will later discuss how this impacts our work). This can be done simply by distributing the total based on the most recent individual measurements, but soft sensors can improve these estimates by providing values per fluid and per well based on other continuously measured variables, such as sensor data.
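The simple distribution scheme mentioned above can be sketched as a pro-rata split of the measured total according to each well's most recent individual test. This is only an illustration: the function name, well labels, and rate values are hypothetical and do not come from the cited works.

```python
def allocate_by_last_test(total_rate, last_test_rates):
    """Split a measured total among wells in proportion to each
    well's most recent individual test rate (pro-rata allocation)."""
    test_sum = sum(last_test_rates.values())
    if test_sum <= 0:
        raise ValueError("well-test rates must sum to a positive value")
    return {well: total_rate * rate / test_sum
            for well, rate in last_test_rates.items()}


# Hypothetical example: three wells, rates in m3/d.
last_tests = {"W1": 1200.0, "W2": 800.0, "W3": 500.0}
# Total measured today at the separator (also hypothetical).
allocated = allocate_by_last_test(2250.0, last_tests)

for well, rate in allocated.items():
    print(f"{well}: {rate:.1f} m3/d")
```

The allocated rates always sum to the measured total, but every well is scaled by the same factor, so any change in an individual well since its last test goes unnoticed. This is the limitation that virtual flow meters try to address.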
The use of data-driven methods for this purpose is studied by Paulo et al., who used system identification techniques to obtain a black-box model to predict liquid flow rates from available field measurements such as pressures and temperatures [2]. Song et al. developed a virtual flow meter for an offshore oil platform and compared the results of models based on Multi-Layer Perceptron (MLP) neural networks, random forests, and Long Short-Term Memory (LSTM) networks. They evaluated the data volume required for reasonable results and the positive impact of using transfer learning to reduce that volume [10].
Góes et al. proposed mathematical models to estimate the rates using real-time data from the plant monitoring system and offline data such as fluid properties [11]. A combination of physics-based and data-driven methods is proposed by Ishak et al., using ensemble learning for a virtual multiphase flow meter [12]. The recent work of Alves et al. uses neural networks as proxy models for a physics-based simulator to perform online data reconciliation in real time for monitoring purposes [9], while Rabello et al. deploy parallel computing techniques to improve the performance of data reconciliation for the same problem [13].
A different application in flow, the sequential transport of different fluids through a pipeline, is studied by Yuan et al., who use a knowledge-informed Bayesian-Gaussian mixture regression model to track the fluid interface along the way, serving as an example of a soft sensor for a virtual variable [14].
A more comprehensive review of works on virtual flow meters is provided by Bikmukhametov and Jäschke [15].
Regarding bottom-hole variables, Semwogerere et al. developed and deployed a soft sensor fusion model to monitor temperatures in annuli that cannot be measured by traditional sensors [16]. However, the applications of soft sensors for downhole variables are usually focused on estimating pressures. One relevant application is monitoring BHP in drilling operations, as seen in the works of Ashena and Moghadasi, Zhang et al., and Zhu et al., all of which use machine learning methods with easily measurable drilling parameters as inputs [17,18,19].
For the production phase, there are examples of BHP estimation under static, transient, or steady-state conditions. Prediction of pressure evolution during extended shut-ins is studied by He et al., comparing the results of a machine learning model and a physics-based data-driven one [3]. Apio et al. focus on BHP estimation under slugging conditions, comparing the results of a black-box neural network (NN) model and a grey-box one (Kalman Filter) [1].
More aligned with our own work, other authors approach flowing BHP under steady-state conditions using different methods. Zalavadia et al. developed a hybrid method using physics-based and machine learning models, in which the best correlation is selected for each sample and the estimate is subsequently improved using the ML model; they use as inputs a mix of static data (e.g., Pressure-Volume-Temperature properties) and dynamic data (such as rates and sensor readings) [20]. Aggrey and Davies use an NN as a virtual PDG sensor, taking as inputs pressure and temperature data from other sensors and the positions of the intelligent completion for an offshore well [5].
Eltahan et al. propose an ML approach to improve BHP calculations by deriving correction factors for empirical correlations. They train separate ensembles of linear regression, support vector regression (SVR), and random forest models, using the full training dataset for each method. Predictions from multiple models within each algorithm are averaged to produce final results. Testing on 11 multi-fractured horizontal wells validates the framework’s effectiveness in refining BHP estimation [21].
Ignatov et al. employed tree-based ensemble methods (Random Forest and XGBoost) to estimate BHP using dynamic production parameters and basic well geometry features. The models were validated using a dataset generated from multiphase flow simulations in wellbores, demonstrating the validity of the framework for pressure prediction in complex flow regimes [22].
Campos et al. estimate flowing BHP using bottom-hole temperature, wellhead pressure, flow rates, depth, and tubing internal diameter as inputs to radial basis function (RBF) neural networks, with weights optimized via particle swarm optimization (PSO) [23]. The work of Tariq et al. deploys a PSO-adjusted neural network for this estimate, highlighting its suitability for real-time applications [24]. Also using a similar set of input variables, Nwanwe and Duru gave preference to a white-box approach, using an adaptive neuro-fuzzy model, which performed significantly better than empirical correlations and mechanistic models [25].
Terminiello et al. focus on multi-fractured onshore wells, highlighting that in such plays it is not usual to install PDGs in wells after the initial phase of development. Their method is based mainly on wellhead pressure, complemented by information on well geometry, production, and fluid properties, which feeds Extreme Gradient Boosting (XGBoost) and Categorical Boosting (CatBoost) estimators whose outputs are combined (mean value) for the final results [26].
Rathnayake et al. compare the results of XGBoost and a mixed-effects linear regression for pressure estimation in coal seam gas (CSG) wells, using production rates, pressures from other sensors, and pump parameters [27].
Zheng et al. approached the problem using knowledge-guided machine learning, integrating physics-based loss functions into the models. That loss is calculated from two-phase flow pressure drop equations, and its use improved the accuracy of both NN and XGBoost models [4].
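As a generic illustration of how a physics-based term can enter a training objective, a composite loss is often written in the form below. This is a sketch of the general idea only, not the formulation used in the cited work; all symbols are assumptions introduced here: $p_i$ is the measured BHP, $\hat{p}_i$ the model prediction, $p^{\mathrm{phys}}_i$ the BHP implied by a pressure-drop equation for the same inputs, and $\lambda \ge 0$ a weighting factor.

```latex
\mathcal{L}
  = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{p}_i - p_i\right)^2
  + \lambda\,\frac{1}{N}\sum_{i=1}^{N}\left(\hat{p}_i - p^{\mathrm{phys}}_i\right)^2
```

With $\lambda = 0$ the objective reduces to a purely data-driven mean squared error; larger values penalize predictions that depart from the physical estimate.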
In their work on BHP estimation, Agwu et al. applied multivariate adaptive regression splines, demonstrating their effectiveness through a combination of high predictive accuracy and interpretability. They underscored the model’s suitability for real-time applications, particularly in operational settings where transparency and physical consistency are critical. Additionally, they contributed an extensive literature review that traces the chronology of data-driven BHP estimation techniques [28].
What most of these studies have in common is the attempt to leverage machine learning and hybrid models to achieve more reliable results than those obtained through traditional methods, such as empirical correlations, while being more practical to handle compared to numerical simulations.
Our previous work [29] addressed this problem with a narrower scope, limited to a single platform. LSTM networks, a deep learning technique, yielded slightly better results than those obtained using neural networks and linear regression with ridge regularization.
In this study, we take a step further by expanding the scope to a complete giant oilfield composed of two isolated reservoirs, utilizing data from nine platforms and 60 production wells. This includes a wide range of production conditions, covering significant variations in flow rates, water cuts, and gas-oil ratios, as well as the use of artificial lift via continuous gas lift in some wells. A proposed strategy to enhance estimation within this complex domain involves partitioning the data space using clustering techniques and training ensembles of predictors for the resulting subsets.
This approach has been explored in some industrial applications. Kim et al. developed a clustering-based hybrid soft sensor to improve melt index monitoring in polypropylene production, dividing the operational conditions into four clusters using a Critical to Quality-based method and developing ML models for each of them. Applied in industry for real-time monitoring, the soft sensor improved the process by reducing off-specification products [30].
Yang et al. address the challenge of developing soft sensors for nonlinear and multimodal industrial processes, where global modeling struggles to represent complex and unbalanced data distributions. They propose Quality-Relevant Feature Clustering, which, through balanced grouping and the use of a regulation variable, optimizes feature representation and improves the ML estimation of oxygen concentration in an ammonia synthesis plant [31].
To predict carbon content and temperature in steelmaking processes, Gu et al. developed a method combining clustering and ensemble learning. They employ a graph convolutional network-based supervised clustering to group data into subsets, guided by process labels. Local models trained on these clusters are integrated through grey relational analysis, weighting predictions based on similarity to cluster patterns. This strategy proved effective in handling volatile industrial data and improving endpoint control [32].
The use of clustering strategies for domain partitioning, however, remains a novel contribution within the context of soft sensor applications for oil wells, and this work aims to address that gap.
1.1. Traditional Approaches to Flowing BHP Estimation
Estimating pressures along the path from the wellbore to the platform is a highly complex task due to the characteristics of multiphase flow. Along this path, there are some predominantly vertical sections, such as in the tubing and riser, some predominantly horizontal sections in the flowlines, and even downward-sloping sections in connections with lazy-wave configurations. Additionally, the properties of the transported fluids change along this path, as gas is released from the oil as the pressure decreases below the bubble point. The combination of these factors results in multiple possible flow regimes in the pipelines (e.g., annular, bubbly, slug), directly influencing the gravitational, accelerational, and frictional pressure gradients. The main challenge, therefore, lies in understanding the phase distribution and interface geometries, a problem that often lacks an analytical solution and cannot always be solved, even using advanced simulations such as Computational Fluid Dynamics (CFD) [33].
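For reference, the three pressure-gradient components mentioned above are commonly written, under a simplifying homogeneous (no-slip) mixture assumption, as the sum below. The notation is introduced here for illustration and is not taken from the cited works: $L$ is the distance along the flow direction, $\theta$ the pipe inclination from the horizontal, $\rho_m$ and $v_m$ the mixture density and velocity, $D$ the pipe inner diameter, and $f$ the Darcy (Moody) friction factor.

```latex
-\frac{dp}{dL}
  = \underbrace{\rho_m\, g \sin\theta}_{\text{gravitational}}
  + \underbrace{\frac{f\,\rho_m\, v_m^{2}}{2D}}_{\text{frictional}}
  + \underbrace{\rho_m\, v_m \frac{dv_m}{dL}}_{\text{accelerational}}
```

The difficulty described in the text arises because $\rho_m$, $v_m$, and $f$ all depend on the local phase distribution and flow regime, which in turn change with pressure and temperature along the path.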
Traditionally, multiphase flow calculations are performed using empirical correlations, such as the classical models of Beggs and Brill [34], Duns and Ros [35], and Hagedorn and Brown [36]. Even decades after their publication, these correlations remain widely used in the industry, serving as the foundation for several simulation software applications.
However, these correlations are typically defined for specific conditions and face limitations when applied across a broad range of production parameters [20]. Changes in flow regimes and the concept drift associated with the evolution of well and field production lead to a loss of accuracy and the need for correlation adjustments. Some simplifications are also common in this approach, such as neglecting phase slippage or assuming a homogeneous fluid [37].
Mechanistic approaches, while grounded in flow physics, are often even more restrictive in their applicability, as they generally assume a single flow regime and adopt simplifications of the governing phenomena [20].
These application range limitations also extend to simulation software. While these tools are capable of providing more detailed pressure and temperature gradient estimations, conducting sensitivity analyses, accounting for gas lift operations, and calculating flow rates based on boundary pressure conditions, they still rely on models that require updated parameters reflecting the well conditions, such as Inflow Performance Relationships (IPR). This is particularly relevant for the field where we aim to apply our framework, where well models are typically updated after production tests, when a well stream is individually routed to a separator to accurately measure its production rates and pressures [9,13,37].
Data-driven methods can fill some gaps left by the previous methods, since they allow us to develop soft sensors even if there is no explicit knowledge about the relationships among variables [8], a characteristic that can be useful for obtaining new insights and exploring new applications. Our work does not aim to replace existing tools but rather to serve as an additional resource for simpler and more general use in situations where traditional tools have limitations.
1.2. Objectives and Motivation
This study aims to estimate the flowing bottom-hole pressure of oil wells in regular operation by employing a novel data-driven methodology based on ensemble machine learning techniques. Our primary goal is to create a soft sensor capable of providing reliable instantaneous BHP estimates, suitable for real-time reservoir and well monitoring. The framework is particularly valuable in situations where permanent downhole gauge data is unavailable. By using surface measurements and wellhead gauge data, this methodology eliminates the need for analytical models, empirical formulas, or detailed fluid property information.
An important requirement is obtaining a comprehensive method applicable to fields with varying production conditions. This includes newly drilled wells with high production rates, original gas-oil ratio (GOR), and near-zero water cut, as well as mature wells that have been producing for over a decade. These older wells may be located in more depleted areas of the field, resulting in lower flowing pressures and production rates, and may be influenced by injection processes, leading to significantly higher GOR and water cut. Such diverse conditions can lead to varying relationships between the input variables and the target variable in our estimation problem. Addressing this variability is crucial to ensuring the robustness and applicability of the proposed methodology across a wide range of field scenarios. As we previously mentioned, the alternative we propose involves “partitioning” the problem, making it more manageable for simple machine learning methods that will be trained on subsets and have their results aggregated a posteriori (according to the methodology detailed later in a dedicated section).
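The partitioning idea can be illustrated with a deliberately small, standard-library sketch. It is not the methodology of this work, which is detailed in its own section: here the data are synthetic, the partition uses a minimal 1-D k-means on a single hypothetical feature (water cut), each subset gets a one-variable least-squares line, and predictions are routed to the model of the nearest cluster. All names and numbers are assumptions made for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def nearest(v, centroids):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))


def kmeans_1d(values, k=2, iters=20):
    """Minimal 1-D k-means; centroids start at evenly spaced quantiles."""
    s = sorted(values)
    centroids = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v, centroids)].append(v)
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids


def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


# Synthetic data: two production regimes (low and high water cut) with
# different wellhead-pressure-to-BHP relationships. Purely illustrative.
random.seed(0)
rows = []
for _ in range(200):
    wc = random.choice([0.05, 0.70]) + random.uniform(-0.03, 0.03)
    whp = random.uniform(50.0, 150.0)
    slope, offset = (1.5, 200.0) if wc < 0.4 else (2.2, 260.0)
    rows.append((wc, whp, slope * whp + offset + random.gauss(0.0, 2.0)))
train, test = rows[:150], rows[150:]

# Baseline: one global model for all conditions.
g_a, g_b = fit_line([r[1] for r in train], [r[2] for r in train])

# Partition on water cut, then fit one local model per cluster.
centroids = kmeans_1d([r[0] for r in train], k=2)
local = {}
for c in range(len(centroids)):
    part = [r for r in train if nearest(r[0], centroids) == c]
    local[c] = fit_line([r[1] for r in part], [r[2] for r in part])


def predict_local(wc, whp):
    """Route the sample to the model of its nearest cluster."""
    a, b = local[nearest(wc, centroids)]
    return a * whp + b


true = [r[2] for r in test]
mae_global = mae([g_a * r[1] + g_b for r in test], true)
mae_local = mae([predict_local(r[0], r[1]) for r in test], true)
print(f"global MAE: {mae_global:.2f}  clustered MAE: {mae_local:.2f}")
```

In this toy setting the local models recover each regime's relationship, while the single global line averages over both. Hard routing to one cluster is only one way to aggregate; weighting the local predictions by cluster similarity is another option.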
We compare multiple machine learning models for the task: Ridge Regression, Gradient Boosting (XGBoost and LightGBM), and Multi-Layer Perceptron Neural Networks. These models were chosen to allow an evaluation of diverse characteristics and levels of complexity, enabling us to compare performance on the task and assess how they can benefit from the ensemble framework. We prioritized high adaptability to the problem and low computational cost, with interpretability considered an additional advantage. From our proof of concept [29], we observed that the performance differences between simple and complex models were not particularly significant in the context of our problem and dataset. This insight motivated us to experiment with the ensemble approach using regularized linear regression and non-deep neural networks with the classic multilayer perceptron architecture. We also include methods that are intrinsically composed of ensembles, particularly tree-based ones. The characteristics of our problem, based on a comprehensive dataset with relatively low variability and no relevant tendency to overfit, led us to prioritize boosting strategies (over bagging and Random Forest strategies) [38]. Consequently, we selected XGBoost and LightGBM because of their demonstrated efficiency across a wide range of problems in recent literature and data science competitions, balancing predictive accuracy with computational efficiency [38].
To assess the performance of the proposed methodology, we applied it to datasets from a Brazilian Pre-Salt offshore oilfield, where all wells were originally equipped with PDGs. Over time, however, a significant number of these gauges became inoperative. As a key motivation for this work, our statistical survey of sensor availability, conducted across nine platforms operating in the selected oilfield, revealed that downhole gauge failures occur up to three times more frequently than those at the wellhead. Moreover, the replacement of PDGs is considerably more expensive and logistically complex. Based on our survey findings, the proposed virtual sensing methodology is immediately applicable to at least 15 wells in the field that currently lack reliable PDG data. We emphasize the scalability of the approach, which can be extended to other platforms or oilfields where PDG data is unavailable or where the installation of such sensors is not economically or technically viable, provided that the necessary input variables are available and a representative dataset exists for model training.
Key contributions of this work include:
Utilizing clustering techniques to improve training efficiency for varying production conditions;
Demonstrating adaptability across different reservoir and flow scenarios, validated for wells at various production stages;
Offering a practical monitoring solution for wells lacking PDG data.
We believe that these characteristics distinguish our work from existing studies in the literature, representing an advancement in this research domain.