Cyber-Physical LPG debutanizer distillation columns: machine learning-based soft sensors for product quality monitoring

Refineries execute a series of interlinked processes, where the product of one unit serves as the input to another process. Potential failures within these processes affect the quality of the end products, the operational efficiency, and the revenue of the entire refinery. In this context, implementing a real-time cognitive module, based on predictive machine learning models, makes it possible to provide equipment state monitoring services and to support decision-making for equipment operations. In this paper, we propose two machine learning models: 1) one to forecast the amount of pentane (C5) content in the final product mixture; 2) one to identify whether C5 content exceeds the specification thresholds for the final product quality. We validate our approach on a use case from a real-world refinery. In addition, we develop a visualization to assess which features are considered most important during feature selection, and later by the machine learning models. Finally, we provide insights on the sensor values in the dataset, which help to identify the operational conditions under which such machine learning models can be used.

[...] complicated equations to achieve enough generalization to be applied across different units. Data-driven models overcome limitations regarding equation-solving complexity by utilizing past data to learn and produce possible solutions. While the ability to reuse them across units strongly depends on the model design, once trained, such models can provide forecasts with almost no latency. If the forecasts are good enough, the models can provide frequent insights into the C5 content of the LPG, enabling earlier identification of off-spec product and timely decision-making.

Real-time prediction of C5 content during the debutanization process provides new insights that guide decision-making for process monitoring and control. To create machine learning models capable of such forecasts, we utilize historical sensor data regarding operational temperature and pressure, as well as laboratory results obtained from sample analysis. Such data and analysis results support model training and evaluation by identifying correlations between sensed conditions and measured outcomes, for two purposes: (i) with real-time sensor data, such models can provide real-time C5 content estimates; (ii) as new sensor data and lab analysis data become available, the model performance is expected to improve over time if the model is retrained.

In this paper, we develop machine learning models for a real-world use case, based on sensor data provided by a Tüpras refinery (https://www.tupras.com.tr). By examining the actual process in the use case, we found that different debutanizer columns have different features because of their different designs. Moreover, only a few sensors are located in the debutanizer column. Most sensor data corresponded to the piping system that connects the debutanizer column with the condensation unit and the units that follow. We used several debutanizer unit diagrams to understand where the sensors are located and which sensors are close to the distillation column exit. We take the temperature and pressure conditions from the sensors nearest the column exit, that is, the first ones placed in the pipes close to the related exit but before the condensation unit. We assume such data provides good insight into how operating conditions relate to extracted samples and measured composition. Furthermore, we observed that a temperature sensor and a pressure sensor are not always both available at a given point of the debutanizer column, although at least one of them is. Considering these limitations, machine learning models are developed to predict C5 content based on the inputs of two sensors (one pressure sensor and one temperature sensor). Finally, we develop two machine learning models that provide independent estimates based on the data from these two sensors: (i) one that predicts the expected amount of C5 in the LPG; and (ii) one that forecasts whether C5 content is off-spec (higher than 2%).

The contribution of this paper is the utilization of operational temperature and pressure sensor data to develop:
1. a machine learning model to predict C5 content in the LPG stream;
2. a machine learning model to predict whether C5 content exceeds specification levels.

Machine learning models built using data from a few sensors can be more easily applied to a broad range of debutanizer columns, since they impose fewer restrictions on the number of input data sources required to provide forecasts. Thus, we consider that a major strength of our approach is that it relies only on data from two sensors, one measuring pressure and the other measuring temperature in the debutanizer column, both placed at separate locations within the column.

Along with the development of the aforementioned models, we also provide a prototype dashboard, which provides global explanations to understand which features were considered most relevant during feature selection, and which features were considered relevant by the forecasting model. In addition, we provide insights on the distribution of the sensor values in the training set, to understand the models' operational limitations.

Footnote 1: https://www.aspentech.com/en/products/engineering/aspen-hysys

To evaluate our models, we utilized three metrics: two for measuring regression performance (MAE and RMSE) and one for measuring classification performance, the Area Under the Receiver Operating Characteristic Curve (AUC ROC [1]). AUC ROC is invariant to a priori class probabilities, a relevant property when measuring a model's discrimination power on an imbalanced dataset. After evaluating the models, the results show that our approach can effectively provide real-time C5 content predictions in the LPG debutanization process of our use case.
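As an illustration of the three metrics, the following minimal sketch computes MAE, RMSE, and AUC ROC in plain Python. The function names and the sample values are hypothetical and are not taken from the paper's dataset. AUC ROC is computed here as the fraction of (positive, negative) pairs ranked correctly, which depends only on how the scores are ordered and not on the class proportions.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def auc_roc(labels, scores):
    """Probability that a random positive scores above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC ROC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical lab measurements of C5 content (%) and model estimates.
c5_lab = [0.5, 1.2, 2.4, 0.9]
c5_pred = [0.7, 1.0, 2.0, 1.1]

# Off-spec label: C5 content higher than 2%, as stated in the text.
off_spec = [1 if c > 2.0 else 0 for c in c5_lab]

print(mae(c5_lab, c5_pred), rmse(c5_lab, c5_pred), auc_roc(off_spec, c5_pred))
```

In this sketch the regression estimate doubles as the classification score only for brevity; the paper trains a separate classification model.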

The rest of this paper is structured as follows: Section 2 presents related work. Section 3 describes a Tüpras refinery use case, and Section 4 introduces the features created for the C5 content forecasting model, as well as the way we develop and evaluate these models. Section 5 presents the experiments we performed and the results obtained. Finally, Section 6 offers our conclusions and provides an outline for future work.

[...]ies. Therefore, the objective of the online composition of debutanizer outlet streams is to maximize the production of LPG while meeting the corresponding quality standards.

Currently, the quality of the debutanizer output is measured via laboratory analysis. Hence, changes in quality are identified only upon analysis of the sample, which may take several hours. Therefore, to keep the product quality within the predefined specifications, it is imperative to precisely predict the top and bottom outputs of the debutanization process [2].

To realize this objective, [3] identifies three major approaches to develop the required models: (i) first-principle (a.k.a. fundamental) models, which consider mass, energy, and momentum principles and equations to provide a forecast; (ii) machine learning models, which are created by training an algorithm on input-output data of the process; and (iii) hybrid models, which combine both the fundamental and the empirical models.

First-principle models involve sets of non-linear differential equations (usually in the order of $10^2$ or $10^3$ non-linear differential equations) and a comparable number of algebraic equations [4,5]. The equations usually take into account the global balance of matter, partial balances of matter, pressure, temperature, flow, reflux policies, and the relationship between component concentrations at different levels of the distillation column [6,7]. Additional information regarding the structure of the distillation column (e.g., the number of trays in a column or the column hydraulics [8]) can further enhance such models, at the cost of increased computational complexity.

To alleviate the computational needs, simplified distillation column models have been proposed [9,10], at the expense of increased error and with an applicability that is often restricted to a single column [11]. These models are usually implemented in Advanced [...] simultaneously. In addition to their computational complexity, the usefulness of such models is constrained by the model assumptions, e.g., sensor colocation points [12].

Data-driven models provide an alternative modeling approach for developing the forecast models [13]. In particular, machine learning models are developed based [...] vectors. Through such model features, the machine learning models can accurately learn non-linear features from the data, even when the data is noisy [14].

Hybrid models arise from the combination of the first-principle and data-driven models [15]. Such models are used to retain the theoretical knowledge of the process, which is mirrored in equations. In contrast, the data-driven models can augment such knowledge using data, and can be used to model parts of the process that are hard to formulate and would otherwise require overly complex first-principle models [3,16]. Hybrid models have been implemented widely in various chemical processes such as batch distillation [17], reactive distillation [18], and polymerization processes [19,20]. However, only a handful of models have been implemented in continuous distillation columns.

To evaluate C5 and C4 product concentrations in the debutanizer column, [3] created a dynamic neural model that acts as a soft sensor based on the data provided. In a similar manner, [35] developed an ANN model to predict LPG composition at the top and bottom of a distillation column, comparing its performance to a partial least squares model. A comparison between different models was also performed by [27], who developed multiple linear regression, principal components regression, and neural network models for a debutanizer column. They concluded that the performance of such models was superior to that of the least squares regression models and support vector regression models reported in the literature. Finally, [36] aimed to identify the governing equations of a distillation column using a white-box machine learning approach.

Cyber-physical systems integrate physical processes into the digital world, where monitoring and analytics can be performed [37,38]. A standard abstraction model considers three significant layers: physical, cybernetic, and an interface between both [39]. The concept of cyber-physical systems has been successfully implemented in petrochemical plants [40].

[...] need to understand the logic behind such models, to comply with regulatory requirements, and provide ground for responsible decision-making [41,42]. Insights into how such models process their input to produce a forecast make it possible to decide whether such forecasts can be trusted [43,44].

[...] provided, to enable the user to focus on specific instances and conduct further research [48].

In this research, we complemented our model development with a dashboard that provides insights into the most informative features within the dataset, as identified by feature selection, while also reporting their relevance from the models' point of view. In addition, we report the value ranges of each sensor's readings found in the dataset. Such ranges must be taken into account, since the models can be expected to issue good predictions only within the observed ranges, and not outside them.

Debutanization process

The debutanization process is a fractional distillation process that aims to recover [...]

In our use case, we have sensor data for temperature and pressure measurements. To formalize meaningful features enabling the models to predict C5 content, we have [...]

For the case study, we obtained data from sensors P1 and T2 of the debutanizer unit (see Fig. 1), while sensor readings from T1 and P2 were missing.

Not having both temperature and pressure at a given point of the debutanizer column prevents us from using the Ideal Gas Law equation to compute the gas molar [...]

Figure 2. Schematic diagram of an LPG debutanizer column. In the diagram we reference two locations at which the sensors are placed. In this research, we developed models that take into account only sensors P1 and T2.
Equation 1: Raoult's law. P refers to pressure, x refers to mole fraction, and the n indicates different mixture components.
Equation 3: Combined Gas Law equation. P refers to pressure, V refers to volume, and T refers to temperature. k is a constant.
Equation 4: Clausius-Clapeyron relation. P refers to pressure, T refers to temperature, L is the specific latent heat of the substance, and R is the specific gas constant.
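The equation bodies that these captions describe are not present in the text. For reference, the standard textbook forms of the three laws, written with the symbols named in the captions, are sketched below. The exact notation used by the authors is an assumption; in particular, the Clausius-Clapeyron relation is shown in its integrated form, which assumes a constant latent heat $L$ between two states $(P_1, T_1)$ and $(P_2, T_2)$, and $P_i$ in Raoult's law denotes the vapor pressure of pure component $i$.

```latex
% Raoult's law for a mixture of n components
P = x_1 P_1 + x_2 P_2 + \dots + x_n P_n

% Combined gas law
\frac{P V}{T} = k

% Clausius-Clapeyron relation (integrated form, constant L)
\ln\frac{P_2}{P_1} = -\frac{L}{R}\left(\frac{1}{T_2} - \frac{1}{T_1}\right)
```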
[...] (C5H12)). When doing so, we considered the vaporization temperature and sulfur content (sulfur is removed in later stages).

[...] minutes slots for a time range of an hour and a half (see Fig. 4). Since operational [...]

Figure [...]. Timestamp conciliation between sensor and laboratory sample timestamps, based on insights provided by experts. Since a time range is provided, we decided to sample sensor values in the given interval every fifteen minutes, adding a fifteen-minute tolerance at the interval edges. The times provided in this example do not correspond to real timestamps in the data.
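The sampling scheme described in the caption above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sampling_timestamps` is hypothetical, and it assumes that the fifteen-minute tolerance simply extends the expert-provided interval by fifteen minutes at each edge.

```python
from datetime import datetime, timedelta

def sampling_timestamps(start, end,
                        step=timedelta(minutes=15),
                        tolerance=timedelta(minutes=15)):
    """Return timestamps from (start - tolerance) to (end + tolerance), one per step."""
    if end < start:
        raise ValueError("end must not precede start")
    stamps = []
    current = start - tolerance
    while current <= end + tolerance:
        stamps.append(current)
        current += step
    return stamps

# Hypothetical hour-and-a-half interval; not a real timestamp from the data.
window = sampling_timestamps(datetime(2020, 1, 1, 10, 0),
                             datetime(2020, 1, 1, 11, 30))
for ts in window:
    print(ts.isoformat())
```

Each returned timestamp would then be matched to the sensor reading recorded at that time, so that one laboratory sample is paired with a fixed-length vector of sensor values.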
Equation 5: Ideal Gas Law, $PV = nRT$. P stands for pressure, V stands for volume, n represents the amount of substance, R is the ideal gas constant, and T corresponds to the temperature.

[...] that exceeded the allowed out-of-specification threshold, reaching a total of fourteen off-specification events. We provide the summarized statistics of the sensor readings and target values in Table 1.
Equation 7: Estimated C5 content. We obtain $P$ from sensor data, $P_i$ can be computed based on a given temperature, and $x_B$ and $x_P$ can be approximated by the LPG specification or other useful values.
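Since the body of Equation 7 is not present in the text, the sketch below shows one way such an estimate could be obtained. It assumes Raoult's law applied to a three-component mixture (propane, butane, C5), solved for the C5 mole fraction; whether the authors' equation has exactly this form is an assumption. The function name and all numeric values are hypothetical, and the saturation pressures are passed in by the caller rather than computed from temperature.

```python
def estimate_c5_fraction(p_total, x_propane, x_butane,
                         p_sat_propane, p_sat_butane, p_sat_c5):
    """Solve P = x_P*P_P + x_B*P_B + x_C5*P_C5 for x_C5.

    p_total comes from the pressure sensor; the saturation pressures are
    evaluated at the measured temperature; x_propane and x_butane are
    assumed mole fractions (e.g. taken from the LPG specification).
    """
    if p_sat_c5 <= 0:
        raise ValueError("p_sat_c5 must be positive")
    return (p_total - x_propane * p_sat_propane - x_butane * p_sat_butane) / p_sat_c5

# Illustrative numbers only (arbitrary pressure units, not physical data).
x_c5 = estimate_c5_fraction(p_total=5.06, x_propane=0.30, x_butane=0.68,
                            p_sat_propane=10.0, p_sat_butane=3.0, p_sat_c5=1.0)
print(round(x_c5, 4))
```

Because the assumed propane and butane fractions are fixed, the result is best read as a feature that varies with the measured pressure and temperature rather than as a direct composition measurement.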
[...] that no more than 2% of the LPG volume is composed of C5 hydrocarbons and that the sum of C2+C5 hydrocarbons must not exceed 5% of the LPG volume, a wide range of possible mixture proportions is observed in reality. In some scenarios, the C5 proportion exceeds the specifications, which is detrimental to propane and butane content. The same is observed for C2 content. In our model, we decided to consider five hypothetical LPG compositions, as described in Table 2. Our hypothesis is that such simplifications could be useful towards understanding the real LPG composition given temperature and [...] provided in debutanizer unit diagrams (see Fig. 2). By analyzing temperature and pressure for three segments of measurements, we identified that high or low C5 content is likely associated with certain pressure thresholds. We thus created dummy variables considering those thresholds.
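A threshold-based dummy variable of the kind mentioned above can be sketched as follows. The threshold values and the feature naming scheme are hypothetical, since the text does not report the thresholds that were identified.

```python
def pressure_dummies(pressure, thresholds):
    """Return one 0/1 indicator per threshold: 1 if the pressure exceeds it."""
    return {f"p_above_{t}": int(pressure > t) for t in thresholds}

# Hypothetical thresholds and reading, in the sensor's pressure units.
features = pressure_dummies(10.5, thresholds=[9.0, 11.0])
print(features)
```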

In Table 3 we describe some of the features we developed for our machine learning models. We grouped them into Feature Groups, based on their common characteristics.

Table 3. Some of the features we created for the machine learning models. spr abbreviates saturation pressure ratio, while spt abbreviates saturation pressure total.

[...] the training subset, as suggested by [78]. Feature selection was performed by computing their mutual information [79], and selecting the top K most informative ones. We describe the correlation between the selected features and the target C5 content values we aim to forecast in Figs. 6, 7, and 8.

[...] ranges. In our research, we developed and compared six models. These models include two baseline models and four models that aim to provide enhanced forecasts, which we describe below:

[...] Group ID #1 at [...] T2 sensors at fifteen-minute intervals (see Fig. 1, and all features described in Table 3), for the time range as presented in Fig. 4;

[...] neurons, using a ReLU activation [82] and the Adam solver [83]. The learning rate was set to a fixed constant (0.001), and we trained it for 300 iterations.

We designed Model 4 (VR) (see Fig. 9) as a voting regressor (VR) [84] that takes the [...]

It is important to highlight that, although C5 content data is available from laboratory analysis, we avoid using features based on past C5 measurements to ensure that the final model can consume sensor data alone and provide real-time C5 content estimates.

[...] Group ID #1 at [...] Table 3);

• Model 1 (LgR): logistic regression considering raw sensor measurements of the P1 and T2 sensors at fifteen-minute intervals (see Fig. 1) and all features described in Table 3, for the time range as presented in Fig. 4;

When building the classification models, we avoided using features based on past C5 measurements to ensure the models consume only data that can be provided in real time, and thus issue real-time forecasts.

To evaluate the models presented in Section 4.4, we ran a repeated ten-fold cross-validation [...] test [94] and tested for significance at a 95% level.

[...] 4.4.1, we measured MAE and RMSE metrics. We present the results in Table 4 and [...] in Experiment 2, but their importance faded in the presence of the ones mentioned above.

Figure 10. We pose two experiments: Experiment 1 trains models only with data from the debutanizer unit we aim to predict for (Fig. 10A), while Experiment 2 enriches the training set with data from another debutanizer unit (Fig. 10B).

[...]tions, we found a few authors taking into account the data or the models' training process.

Furthermore, we found no authors who combined insights regarding feature selection with how relevant the selected features are to the model across a repeated cross-validation. We therefore propose a novel visualization that summarizes the aforementioned insights (see Fig. 11).

[...] where the intensity of feature concepts is presented to the users. We consider feature concepts as semantic abstractions that group certain features based on the features' metadata.

In our particular case, such grouping was performed for features computed with the same formula, but using sensor data at different points in time (see Fig. 4). The shades