Time series analysis of geotechnical data from a sensor network controlling a remote pipeline

Extensive but remote oil and gas fields in the United States, Canada, and Russia require the construction and operation of extremely long pipelines. Global warming and local heating effects raise soil temperatures and thus reduce the subgrade capacity of the soils; this changes the spatial positions and shapes of the pipelines and consequently increases the number of accidents. Oil operators are compelled to monitor the soil temperature along the routes of remote pipelines so that remedial measures can be taken in time. They are therefore seeking methods for analyzing large volumes of diagnostic information. To forecast soil temperatures at different depths, we propose compiling a multidimensional dataset; computing descriptive statistics; selecting uncorrelated time series; generating synthetic features; robust scaling of the temperature series; and tuning an additive regression model.


Introduction
Constructing and operating extended pipelines that cross remote permafrost areas leads to thawing and thereby to a reduction in the subgrade capacity of frozen soil. Minor soil temperature changes in the range of 2-5 °C contribute to significant changes in spatial positioning and may damage the facilities. To prevent the development of hazardous processes in permafrost during the operation stage, the thermal regime of the soil is controlled by a geotechnical monitoring system. At the local level of this system, strings of sensors (thermistors) are lowered to various depths in specially equipped thermowells to measure temperature at multiple points simultaneously. Measurements, converted to digital format, are transmitted to reading, storage, and display devices (controllers). Controllers periodically poll the sensors in the string, read the connection line numbers to sort the measurements by depth, and store them in local storage devices. Fiber or wireless networks are used at the regional level to access local archives [1]. The global level of a geotechnical monitoring system for an extended pipeline includes web servers for the synchronization, integration, processing, and maintenance of measurements, as well as for saving the prepared information in a specialized global data warehouse. A longer operating time increases the number of exogenous processes, makes it necessary to install additional thermowells, and leads to the accumulation of a large quantity of information that is difficult to analyze with existing engineering methods. Therefore, implementing data analysis methods in a geotechnical monitoring system that controls a pipeline increases its efficiency in permafrost areas.
According to the Dimensions platform (www.dimensions.ai), which provides access to grants, publications, patents, and other sources, remote control of geotechnical features is applicable in a significant number of research fields (Figure 1).

Figure 1. Fields of research where remote control of geotechnical features is applicable
Geotechnical monitoring systems tend to comprise groups of sensors in thermometric wells along the routes of the pipelines; the sensors are linked by wired or wireless technologies, and measurements are obtained using web-based software. Modern approaches to the design of distributed systems are listed in [2][3][4]. In [5] it is noted that the quality of an open atmospheric optical channel is affected by the frequency of weather changes, fluctuations of the supports, and scintillation. The same work estimates the operating duration of a channel of arbitrary length from the parameters γ of a hyper-exponential distribution of the operating duration of a model channel, a regional coefficient b, the model and arbitrary channel lengths (km), and the number of channels i (basic and backup).
An analysis of the relationship between air temperature and geological processes is described in [6]. Evaluations of the thermal influence of pipelines on permafrost are covered in [7,8]. The influence of geological processes on pipeline parameters is reflected in [9,10]. Methods of designing geotechnical systems and related issues are discussed in articles and patents [11,12].
As we know from [13][14][15], numerical algorithms for solving thermal conductivity problems are based primarily on different versions of the finite element method and the finite difference method. Applying these techniques with a fine grid spacing to spatially extensive objects leads to high computational costs and to complex algorithms, because the computations must be parallelized. Alternative methods of soil data analysis represent streams of temperature measurements as time series [16][17][18] or group them according to some set of features [19]. The growing number of articles and patents [20] on monitoring sites in permafrost areas over the last two decades indicates the topical significance of the issue. The objective of this study is to develop a method that simplifies the analysis and forecasting of soil temperatures along the route of the pipeline. To achieve this objective, we studied a wide variety of papers, examined the areas and features of pipelines crossing permafrost areas, selected uncorrelated features, defined additional features for the time series analysis, and developed regression models of the soil temperature dynamics over time.

Materials and Methods
To manipulate, visualize, and learn from the geotechnical data, we used 14 open-source Python libraries. Primary analysis was carried out with NumPy, which facilitates advanced mathematical and other operations on large amounts of data, and with pandas-profiling, a tool to preview, explore, and summarize the dataset. Our Python packages for data manipulation and handling include SciPy, Sklearn, and Statsmodels. All data visualization was done with Matplotlib, Pandas.plotting, Prophet.plot, and Seaborn. We performed extended data analysis with Association_metrics, a Python module for measuring the degree of nonlinear association between features, and with Pandas, which provides flexible data structures designed for working with structured (tabular, multidimensional, potentially heterogeneous) and time series data. Our feature engineering is based on the Calendar package. The Facebook Prophet and SciPy.Interpolate packages helped with the time series forecast. The code was created in Colaboratory, a Google Research product that allows developers to write and execute Python code through a browser.

Two-step approach of forecasting temperatures based upon a geotechnical dataset
A two-step approach exploits the idea that the machine learning model itself is a method of creating a new feature that solves the task [21]. Features are defined as a mapping

Primary study of the geotechnical data obtained from the network
Within the geotechnical monitoring system, triples of thermowells are installed along the route of the pipeline at intervals of 1 to 77 km, mostly in areas where exogenous processes are developing. Each thermowell contains 8-12 thermistors installed at different soil depths (Figure 3). We therefore have three 3D arrays of temperature measurements collected from the thermowell triples. These arrays share the distance along the route and the time of measurement, and differ in the number of sensors and in the temperature values. To present all measurements on a 2D plot, we transform the data by unpivoting it from a 'wide format' into a 'long format', where selected columns are identifier variables and all other columns are treated as measured variables. Figure 4 compares the temperatures along the pipeline route measured with the triples of thermowells. Temperature intervals are chosen based on the classification of permafrost: high-temperature, (-2; -0.5] °C; high-temperature with the predominance of the interval (-0.5; 1.5] °C; stable, (-3; -2] °C; low-temperature with the predominance of the interval (-5; -3] °C; and low-temperature, (-∞; -5] °C. The interval distribution of temperatures at different depths is obtained for the period from November of one year to October of the next. In the further analysis, we create new synthetic categorical features based on these five categories of soil temperature. The comparative analysis shows that the soil layers are in a high-temperature condition, with the intervals (-2; -0.5] °C and (-0.5; 1.5] °C predominating. Basic statistics of the data, divided into two categories, are given in Table 1.
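The wide-to-long transformation and the binning by permafrost temperature class described above can be sketched with pandas as follows. This is a minimal illustration, not the authors' code: the column names `distance_km`, `time`, `sensor`, `temperature`, and `temp_class`, as well as the sample values, are assumptions; only the interval boundaries come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format table: one row per (distance, time), one column per sensor.
wide_df = pd.DataFrame({
    "distance_km": [2900, 2900, 2908, 2908],
    "time": pd.to_datetime(["2020-11-01", "2020-11-02", "2020-11-01", "2020-11-02"]),
    "h1.1": [-6.0, -4.0, -2.5, 0.3],
    "h1.2": [-1.0, -0.5, -3.0, 1.5],
})

# Unpivot: identifier columns are kept, every sensor column becomes
# a (sensor, temperature) pair in its own row.
long_df = wide_df.melt(
    id_vars=["distance_km", "time"],
    var_name="sensor",
    value_name="temperature",
)

# Right-closed intervals taken from the permafrost classification in the text.
# Temperatures above 1.5 °C fall outside all bins and become NaN.
bins = [-np.inf, -5.0, -3.0, -2.0, -0.5, 1.5]
labels = ["(-inf; -5]", "(-5; -3]", "(-3; -2]", "(-2; -0.5]", "(-0.5; 1.5]"]
long_df["temp_class"] = pd.cut(long_df["temperature"], bins=bins, labels=labels, right=True)
```

The resulting `long_df` has one row per measurement, which is the shape expected by most plotting functions that color or facet by a categorical column.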
Analysis of the density of temperature measurements in different soil layers reveals a transition from a bimodal distribution (surface temperatures) to a unimodal distribution (medium-depth temperatures), with a shift of the mean towards negative temperatures and a significant decrease in the standard deviation (Fig. 5). A further increase in depth leads to a multimodal distribution. The greatest kurtosis values occur at depths of 3 to 4 meters for the medium thermowells (Fig. 5a) and 2 to 3 meters for the distant thermowells (Fig. 5b). The least symmetrical distributions are observed at the same depths. At the same time, the mean varies less from layer to layer and tends to zero. The analysis shows that the temperature distributions at different depths do not follow the normal law. Most methods of incorporating explanatory features into a model are based on two opposing principles: weak mutual correlation of the features, or a strong correlation of each feature with the dependent variable [22]. To assess the linear correlation, we calculated Pearson's correlation coefficients for the temperature measurements at different depths and distances along the route. Figure 6 shows the values of Pearson's linear correlation coefficient whose absolute value exceeds the threshold of 0.7.

Figure 6. Correlation of temperature and distance (for each hi.j, the first digit i={1, 2, 3} is the number of a thermowell triple and the second digit j={1,.., 17} is the number of a sensor at a different depth)
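The comparison of distribution shapes by depth discussed in this subsection (mean, standard deviation, symmetry, kurtosis) could be computed along the following lines. This is a sketch under stated assumptions: the function name `shape_statistics` is hypothetical, and the two synthetic columns only imitate a surface-like and a medium-depth-like sensor; they are not the paper's data.

```python
import numpy as np
import pandas as pd
from scipy import stats

def shape_statistics(temps: pd.DataFrame) -> pd.DataFrame:
    """Mean, standard deviation, skewness, and excess kurtosis per sensor column."""
    return pd.DataFrame({
        "mean": temps.mean(),
        "std": temps.std(),
        "skewness": temps.apply(lambda s: stats.skew(s.dropna())),
        # scipy returns excess (Fisher) kurtosis by default: 0 for a normal law
        "kurtosis": temps.apply(lambda s: stats.kurtosis(s.dropna())),
    })

rng = np.random.default_rng(0)
temps = pd.DataFrame({
    # surface-like sensor: two seasonal modes (synthetic data)
    "h1.1": np.concatenate([rng.normal(-8, 1, 500), rng.normal(6, 1, 500)]),
    # medium-depth-like sensor: one narrow mode below zero (synthetic data)
    "h1.8": rng.normal(-1, 0.3, 1000),
})
summary = shape_statistics(temps)
```

In such a table, a strongly negative excess kurtosis together with a large standard deviation is typical of a two-mode series, while a narrow single-mode series has a small standard deviation and kurtosis near zero.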

Linear correlation
According to the correlation map above, we found a strong correlation between the distance along the route and the temperature measurements of thermistors 13-18 of the first line of thermowells (closest to the pipeline), thermistors 12-13 of the second line (at a medium distance from the pipeline), and thermistors 10-13 of the third line (distant from the pipeline). Because these thermistors are located at depths in the interval [5; 13] m, they capture the man-made effect of the pipe. At the same time, we found a significant number of highly correlated pairs of layered temperature measurements, such as h1.13-h1.17, h1.2-h1.6, and h2.1-h2.6. This may indicate the existence of a solid geological layer with similar characteristics. Thus, as a result of the correlation analysis, the measurements of five thermistors of the first-line thermowells were retained. They were supplemented with the uncorrelated measurements of thermistors h2.7 and h3.10 from the second- and third-line thermowells, giving the set {h1.1, h1.10, h1.14, h1.6, h1.8, h2.7, h3.10}.
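One way to automate this kind of selection is a greedy filter on the absolute Pearson correlation. The sketch below is an assumption about how such a filter might look, not the procedure used in the study (which may have been carried out by inspecting the correlation map); the function name `select_uncorrelated` and the synthetic series are hypothetical, while the 0.7 threshold comes from the text.

```python
import numpy as np
import pandas as pd

def select_uncorrelated(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """Greedily keep columns whose |Pearson r| with every kept column is <= threshold."""
    corr = df.corr(method="pearson").abs()
    kept = []
    for col in df.columns:
        # A column is kept only if it is weakly correlated with all columns kept so far.
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept

rng = np.random.default_rng(1)
base = rng.normal(size=300)
df = pd.DataFrame({
    "h1.2": base,
    "h1.6": base + rng.normal(scale=0.1, size=300),  # nearly a copy of h1.2
    "h2.7": rng.normal(size=300),                    # independent series
})
selected = select_uncorrelated(df)
```

The result of a greedy filter depends on the column order, so in practice the columns would be ordered by some domain criterion (for example, depth) before filtering.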

Nonlinear correlation
To identify a non-linear correlation [23], Spearman's and Cramér's correlation coefficients are used. Spearman's rank correlation coefficient is a nonparametric measure of rank correlation:
$$\rho = \frac{\operatorname{cov}(R(X), R(Y))}{\sigma_{R(X)}\,\sigma_{R(Y)}}$$
where $\operatorname{cov}(R(X), R(Y))$ is the covariance of the rank variables, and $\sigma_{R(X)}$ and $\sigma_{R(Y)}$ are the standard deviations of the rank variables. Cramér's correlation coefficient $\varphi_C$ [24] is based on Pearson's $\chi^2$ test statistic and is used for ordinal and binned interval variables:
$$\varphi_C = \sqrt{\frac{\chi^2}{N \min(r - 1,\, k - 1)}}$$
where N is the number of observations; r (k) is the number of rows (columns) in a contingency table.
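Both coefficients can be computed with SciPy, as in the following sketch. The helper name `cramers_v` and the toy data are assumptions made for illustration; the study itself reports using the Association_metrics module, whose API is not shown here.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x, y) -> float:
    """Cramér's coefficient from the chi-squared statistic of the contingency table."""
    table = pd.crosstab(pd.Series(x), pd.Series(y))
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()   # number of observations N
    r, k = table.shape           # rows and columns of the contingency table
    return float(np.sqrt(chi2 / (n * min(r - 1, k - 1))))

# Spearman: a monotone but non-linear relation has a rank correlation of 1.
depth_rank = np.arange(1, 11)
rho, p_value = stats.spearmanr(depth_rank, depth_rank ** 3)

# Cramér: two categorical features that determine each other completely.
season = ["winter", "winter", "summer", "summer"] * 25
temp_class = ["low", "low", "high", "high"] * 25
v = cramers_v(season, temp_class)
```

Spearman's coefficient keeps the sign of the relation, whereas Cramér's coefficient lies in [0, 1] and only measures the strength of association between categories.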
To apply non-linear correlation, at the first stage we convert the existing time feature 'Time' of the geotechnical dataset into a set of categorical features that sharpen temporal differences: the period of the day, the month, the day of the week, and the season of the year (Table 2). At the second stage, we convert the values of the linearly uncorrelated features (section 3.1.2) into categorical values using the temperature intervals given in section 3.1. The resulting categorical data are assessed in terms of a non-linear relationship between temperature and measurement time (Fig. 7). The colored correlation maps indicate relationships between layered temperature measurements. The color map of Cramér's correlation coefficients (Fig. 7a) shows that the thermistors h1.14 and h3.10 malfunctioned: their measurements depended on the day of the week, the month, and the season. Therefore, these features, together with the feature Day_name, are excluded from further analysis. Additionally, with Spearman's correlation matrix (Fig. 5b) we established a non-linear inverse relation between the measurements of thermistors h1.8 and h1.10, and a correlation between months and temperature fluctuations at the ground surface (thermistor h1.1). To compare temperature measurements at different depths, we applied the robust scaling method because it uses statistics that are robust to outliers:
$$y' = \frac{y - Q_2(y)}{Q_3(y) - Q_1(y)}$$
where y and y' are the unscaled and scaled temperatures, and $Q_i(y)$ is the 1st, 2nd, or 3rd quartile of the temperatures. Fig. 8 shows the scaled temperatures at different depths: in the area of thermistor h2.7 at 2900 km and thermistors h1.8 and h1.10 at 2908 km there are strong temperature changes.
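The two preprocessing steps of this subsection, deriving categorical time features and robust scaling, might be sketched as follows. The column names `Time` and `Day_name` follow the text; `Month`, `Day_period`, and `Season`, the hour boundaries of the day periods, and the month-to-season mapping are assumptions made for illustration. The scaling subtracts the median and divides by the interquartile range, which is the usual definition of robust scaling.

```python
import pandas as pd

# Assumed meteorological seasons of the Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

def add_time_features(df: pd.DataFrame, time_col: str = "Time") -> pd.DataFrame:
    """Return a copy of df with categorical features derived from the timestamp."""
    out = df.copy()
    t = out[time_col]
    out["Month"] = t.dt.month
    out["Day_name"] = t.dt.day_name()
    # Assumed day periods: 0-5 night, 6-11 morning, 12-17 afternoon, 18-23 evening.
    out["Day_period"] = pd.cut(
        t.dt.hour, bins=[-1, 5, 11, 17, 23],
        labels=["night", "morning", "afternoon", "evening"],
    )
    out["Season"] = t.dt.month.map(SEASON_BY_MONTH)
    return out

def robust_scale(y: pd.Series) -> pd.Series:
    """Center on the median (2nd quartile) and divide by the interquartile range."""
    q1, q2, q3 = y.quantile([0.25, 0.5, 0.75])
    return (y - q2) / (q3 - q1)

raw = pd.DataFrame({
    "Time": pd.to_datetime(["2021-01-04 03:00", "2021-07-10 14:00"]),
    "h1.1": [-10.0, 8.0],
})
features = add_time_features(raw)
scaled = robust_scale(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Because the median and the interquartile range are barely affected by a few extreme readings, series from different depths end up on comparable scales even when one sensor has occasional spikes.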

Forecast time series
There are several approaches to forecasting time series, such as smoothing, adaptive models, autoregressive models, and neural networks [25]. One modern approach is implemented in the Facebook Prophet library, released in 2017. It contains an additive regression model with customizable components [26]:
$$y'(t) = g(t) + s(t) + h(t) + \varepsilon$$
where t is time, measured in days; y'(t) is the regression function; g(t) is the trend component, modeled with a piecewise linear, piecewise logistic growth, or flat function; s(t) is the seasonal component, which models periodic changes related to seasonality and is estimated using a partial Fourier sum; h(t) is the component responsible for user-specified abnormal days; and $\varepsilon$ is an error term that contains information not considered by the model.
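To make the structure of the additive model concrete, the sketch below fits a simplified version of it with ordinary least squares: a single linear trend for g(t) and a partial Fourier sum for s(t). It is deliberately not Prophet's fitting procedure (no changepoints, no h(t) component, no uncertainty estimates); the function names, the yearly period of 365.25 days, the Fourier order of 3, and the synthetic series are assumptions made for illustration.

```python
import numpy as np

def design_matrix(t_days, period=365.25, order=3):
    """Columns: intercept, linear trend, and cos/sin pairs of a partial Fourier sum."""
    t = np.asarray(t_days, dtype=float)
    columns = [np.ones_like(t), t]
    for n in range(1, order + 1):
        angle = 2.0 * np.pi * n * t / period
        columns += [np.cos(angle), np.sin(angle)]
    return np.column_stack(columns)

def fit_additive(t_days, y, period=365.25, order=3):
    """Least-squares estimate of the trend and seasonal coefficients."""
    X = design_matrix(t_days, period, order)
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

def predict_additive(coef, t_days, period=365.25, order=3):
    """Evaluate g(t) + s(t) at the given days."""
    return design_matrix(t_days, period, order) @ coef

# Synthetic scaled temperature: slow warming trend plus a yearly cycle.
t_train = np.arange(730)
y_train = -1.0 + 0.001 * t_train + 2.0 * np.sin(2 * np.pi * t_train / 365.25)
coef = fit_additive(t_train, y_train)

# Forecast the next 90 days.
t_future = np.arange(730, 820)
forecast = predict_additive(coef, t_future)
```

A higher Fourier order lets the seasonal component follow sharper periodic changes at the cost of a greater risk of overfitting, which is the same trade-off that the seasonal settings of an additive model control.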

Machine learning
We split our geotechnical dataset into training and testing samples. The testing sample includes the last 90 days of the dataset. We use the training sample to fit the model and then treat the testing sample as a collection of data points that helps us evaluate whether the model generalizes well to unseen data. Figure 6b highlights the seasonality, which has the greatest effect on the measurements of h1.6 (increasing temperatures) and h1.8 (decreasing temperatures).
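A chronological hold-out of the last 90 days, as described above, can be sketched with pandas. The function name `split_last_days` and the synthetic daily series are assumptions; the column names `ds` and `y` follow the naming convention that Prophet expects for its input frame.

```python
import numpy as np
import pandas as pd

def split_last_days(df: pd.DataFrame, time_col: str = "ds", test_days: int = 90):
    """Split chronologically: the last `test_days` days form the testing sample."""
    cutoff = df[time_col].max() - pd.Timedelta(days=test_days)
    train = df[df[time_col] <= cutoff]
    test = df[df[time_col] > cutoff]
    return train, test

# Hypothetical daily series of scaled temperatures for one sensor.
dates = pd.date_range("2020-01-01", periods=365, freq="D")
data = pd.DataFrame({"ds": dates, "y": np.sin(2 * np.pi * np.arange(365) / 365.25)})
train, test = split_last_days(data)
```

Splitting by time rather than at random keeps every testing point strictly later than every training point, so the evaluation mimics a real forecast.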

Model accuracy
The forecasting accuracy is assessed using the mean absolute percentage error (MAPE) for the temperature measurements of the last 90 days. MAPE is calculated from the measured temperatures and their forecasts. As a result, we obtained a forecast quality of about 70%.
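The standard definition of MAPE can be written in a few lines of NumPy. The sketch below is illustrative: the function name `mape` and the sample values are assumptions, and skipping zero-valued measurements is a choice made here to avoid division by zero, not something stated in the text. Note that MAPE becomes unstable when the measured temperatures are close to 0 °C, which is common in high-temperature permafrost.

```python
import numpy as np

def mape(actual, forecast) -> float:
    """Mean absolute percentage error, in percent, over non-zero actual values."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    mask = actual != 0  # assumption: exact zeros are skipped to avoid division by zero
    return float(100.0 * np.mean(np.abs((actual[mask] - forecast[mask]) / actual[mask])))

measured = [-2.0, -4.0]
predicted = [-1.0, -5.0]
error_percent = mape(measured, predicted)  # (50% + 25%) / 2
```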

Discussion
The most effective way to improve the accuracy of machine learning models is to increase the training sample by creating synthetic features based on the available dataset [27].
We propose creating a second group of synthetic features that are not bound to time and that take into account statistical characteristics along two dimensions of the dataset: the minimum, maximum, mean, median, and variance of every time series (after smoothing); the number of peaks and troughs; the area under the curve (before scaling); the sum of squares of the derivative; a trend indicator; the number of crossings of the 25th, 50th, and 75th percentiles; etc.
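Several of the statistical features listed above could be extracted from a single temperature series as in the following sketch. It is an illustration under stated assumptions: the function name `series_features` and the feature keys are hypothetical, unit spacing between samples is assumed for the area, smoothing is omitted, and a crossing is counted whenever consecutive non-zero deviations from the percentile level change sign.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.signal import find_peaks

def series_features(y) -> dict:
    """Time-independent summary features of one temperature series."""
    y = np.asarray(y, dtype=float)
    peaks, _ = find_peaks(y)      # local maxima (endpoints are never counted)
    troughs, _ = find_peaks(-y)   # local minima
    feats = {
        "min": y.min(),
        "max": y.max(),
        "mean": y.mean(),
        "median": float(np.median(y)),
        "variance": y.var(),
        "n_peaks": len(peaks),
        "n_troughs": len(troughs),
        "area": float(trapezoid(y)),  # trapezoidal rule with unit spacing
    }
    for q in (25, 50, 75):
        deviation = np.sign(y - np.percentile(y, q))
        deviation = deviation[deviation != 0]  # ignore points exactly on the level
        feats[f"crossings_p{q}"] = int(np.sum(deviation[1:] != deviation[:-1]))
    return feats

feats = series_features([0.0, 3.0, 1.0, 4.0, 0.0])
```

Applied to every sensor column, such a function yields one feature row per series, which can then be joined to the time-based features.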
The third group consists of two interconnected synthetic features called the subsurface and deep measurement temperature gradients. The temperature gradient of subsurface (deep) measurements, $\nabla_{S(H)}$, is defined as the mean value of the temperature measurements $t_{S(H)}$ taken with sensors located at (below) the depth of seasonal frost penetration, per unit of depth $\Delta h_{S(H)}$:
$$\nabla_{S(H)} = \frac{\bar{t}_{S(H)}}{\Delta h_{S(H)}}$$
This group of two temperature gradients accounts for seasonal fluctuations in subsurface temperatures and facilitates the stripping of the seasonal component from measurements taken at permafrost depth.
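A possible reading of this definition is sketched below: the sensors of one thermowell are divided into a subsurface zone and a deep zone by the depth of seasonal frost penetration, and in each zone the mean temperature is divided by the depth span covered by the zone's sensors. The function name `zone_gradients`, the interpretation of the unit of depth as the depth span of the zone, and all numeric values are assumptions made for illustration.

```python
import numpy as np

def zone_gradients(depths_m, temps_c, frost_depth_m):
    """Return (subsurface, deep) gradients: mean temperature per meter of zone depth span."""
    depths = np.asarray(depths_m, dtype=float)
    temps = np.asarray(temps_c, dtype=float)
    subsurface = depths <= frost_depth_m  # sensors within seasonal frost penetration

    def gradient(mask):
        # Assumption: the unit of depth is the span between the shallowest and
        # deepest sensor of the zone; each zone needs at least two sensors.
        span = depths[mask].max() - depths[mask].min()
        return float(temps[mask].mean() / span)

    return gradient(subsurface), gradient(~subsurface)

# Hypothetical thermowell profile at one moment in time.
depths = [0.5, 1.0, 2.0, 4.0, 6.0, 8.0]      # m
temps = [4.0, 2.0, 0.0, -1.0, -1.5, -2.0]    # °C
grad_s, grad_h = zone_gradients(depths, temps, frost_depth_m=2.0)
```

Computed for every time step, the pair of values forms two new time series: the subsurface one follows the seasons, while the deep one changes slowly.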
The following can be assumed to be additional features: the climatic zones at the locations of the thermowells; the rate of soil erosion; the distance from rivers, roads, and settlements; a detailed characteristic of the terrain (flat with inclinations of up to 2°, hilly with inclinations of up to 4°, variable with inclinations of up to 6°, mountains and foothills with inclinations of more than 6°), a wind rose, ravine networks, etc.
When the oil pipeline is launched, additional features could be used, such as the distance of the thermowell locations from the pipeline profile and the temperature curve along the route of the pipeline as determined by the locations of the oil heating points.

Conclusions
The geotechnical monitoring system that collects soil temperature measurements along an extended oil pipeline includes a significant number of thermowells, fiber or wireless networks for accessing local archives, and web servers for processing the data.
The thermal influence of a pipeline on frozen soils and the influence of geological processes on a pipeline increase over the pipeline's operating time.
The proposed steps for the analysis of multi-temporal measurements along the route of the crude oil pipeline help to forecast changes in temperature trends:
- linear and nonlinear correlation analysis selects uncorrelated time series;
- new synthetic features make it possible to increase the model's accuracy;
- robust scaling helps to compare temperature series at different seasons and depths;
- customizable components make it possible to tune the additive regression model and improve the accuracy of forecasting.
Funding: This research received no external funding.

Data Availability Statement:
The code and dataset supporting the reported results can be found on the author's GitHub: https://github.com/avladova/Time-series-analysis-of-geotechnical-data-

Conflicts of Interest:
The authors declare no conflict of interest.