Evolutionary Prediction Model for Fine-Grained Soils Compression Index Using Gene-Expression Programming

Accurate estimation of soil settlement is important because it directly influences the performance of buildings and infrastructure founded on soil. The settlement of fine-grained soils is particularly critical because of their low permeability and continued settlement over time. The compression index (Cc) is a key parameter for estimating the settlement of fine-grained soil layers. However, determining this parameter is time consuming and requires skilled technicians and specific equipment. In this study, Cc was estimated from several soil parameters: liquid limit (LL), plastic limit (PL), and initial void ratio (e0). Measuring these parameters in the laboratory is straightforward and needs substantially less time and cost than conventional tests for Cc such as the oedometer test. This study presents a novel prediction model for Cc of fine-grained soils using gene-expression programming (GEP). GEP is a biologically inspired technique capable of expressing the optimal solution in closed form. A database of 108 data points was used to develop the model. A closed-form equation was derived to estimate Cc based on LL, PL, and e0. The performance of the developed GEP-based model was evaluated through the coefficient of determination (R²), root mean squared error (RMSE), and mean absolute error (MAE). High R² and low error values indicated the decent performance of the model. The model was also evaluated using additional performance measures and met all the suggested criteria. Furthermore, the model performed better in terms of R², RMSE, and MAE than most existing models. It is expected that the developed model will decrease the time and cost associated with determining Cc of fine-grained soils.


Introduction
Soil compressibility is the reduction in soil volume under load as pore water drains. Precise estimation of this property is critical for calculating the settlement of soil layers [1]. The problem is more critical for fine-grained soils because of their low permeability, and the compression index (Cc) is the most widely accepted parameter to date for representing soil compressibility [2]. This parameter is often used to calculate the settlement of individual soil layers. Different empirical equations have been developed to predict Cc [3-9]. These equations were mainly developed using traditional statistical analyses. Nevertheless, they have a number of drawbacks, such as low correlation between input and output parameters [10]. Thus, it is essential to develop a comprehensive model to capture the complex behaviour of Cc. Such a model should overcome the shortcomings of previous models, such as limited practicality and low correlation between input and output parameters.
Soft computing techniques such as artificial neural networks (ANN) are widely accepted and popular alongside conventional statistical methods (e.g., regression) [11-21]. These techniques have been successfully applied to different geotechnical problems such as Cc prediction [7, 22-27]. However, a major limitation of common soft computing techniques is that they do not provide a closed-form prediction equation. With the introduction of artificial intelligence (AI) techniques, and particularly genetic programming (GP), researchers in the field of soft computing attempted to solve this issue (i.e., obtaining a closed-form solution). AI includes techniques such as ANN, neuro-fuzzy systems, and support vector machines (SVM), which have a strong record of successful applications in a wide range of problems [28,29]. In AI techniques, a learning mechanism is often embedded to construct the intelligent structure of the estimation model (i.e., the solution of the problem). Among these, ANN is a robust tool that is widely used to predict Cc [7, 22-26]. Although ANN, fuzzy inference systems (FIS), and other AI techniques perform well statistically in terms of correlation, they are often known as black-box models in soft computing terms, mainly because they cannot offer closed-form estimation formulas. This is reported to be a drawback of AI techniques that limits their practicality [10,28]. It is worth mentioning that the runtime of most soft computing techniques can be reduced by using parallel processing methods [30].
Genetic programming (GP) is based on evolving individual computer programs and is classified as a major family of soft computing techniques. GP can handle complex and highly nonlinear estimation modelling tasks [31]. While classical GP nominates only a single program, gene expression programming (GEP) includes several genes (sub-programs) for reaching the optimal solution [32]. The application of GEP in the engineering domain is growing significantly compared to GP, mainly due to the accuracy of its predictions [28,29]. The current study investigates the use of GEP to develop a prediction equation for Cc of fine-grained soils found in northeast Iran. The objective of this study was to develop a GEP-based prediction equation for Cc of fine-grained soil from simple tests such as Atterberg LL and PL. Since conventional consolidation tests of fine-grained soil (e.g., the oedometer test) are time consuming and costly, the application of such a prediction equation will lead to substantial savings in cost and time for Cc estimation.

Gene Expression Programming (GEP)
There are several variants of GP, distinguished by the optimization technique they use. Gene expression programming (GEP) is a more recent variant of GP and a powerful tool for approximating the optimized solution of a problem in closed form. Conventional GP generates computer models by mimicking the biological evolution of living organisms and provides a tree-like form of solution, which leads to a closed-form solution for the optimization problem of interest [28,29,31-33]. The main objective of GP is to obtain programs that connect the inputs to the output for each data point, creating a population of programs. The programs in this population resemble the branches of a tree and are built from randomly generated functions and terminals. The final solution of the problem is determined from these tree-like programs.
The fundamentals of GEP were first developed by Ferreira in 2002 and consist of a number of components, i.e., terminal set, function set, control parameters, fitness function, and termination function [34]. Unlike conventional GP, GEP employs fixed-length character strings to model the problem. These strings are subsequently expressed as parse trees of various sizes and shapes, known as expression trees (ETs). The benefit of GEP over conventional GP is that genetic diversity is created by genetic operators acting at the chromosome level. GEP, in fact, evolves a number of genes (sub-programs) [34], which are individual tree-like programs [10,34]. Furthermore, GEP has a flexible multigenic nature suitable for the construction and evolution of complex networks of genes. In the GEP framework, the genes in a chromosome may contain two types of information, stored in either the head or the tail of the gene, i.e., information to generate the overall GEP model, and information on terminals for producing subsequent GEP models. Specific details about GEP can be found elsewhere [10,31,32,34,35].
... with coefficients of c0, c1, and c2 while utilizing the nonlinear terms [31,32]. To obtain c0, c1, and c2, a simple least squares fit was applied to the training data. A partial least squares method could also be employed for this purpose (18,22). The important GEP parameters that need to be selected carefully are the tree depth and the number of genes. However, limiting the tree depth generally results in shorter closed-form equations with fewer terms [29,34].
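The translation of a fixed-length character string into an expression tree can be illustrated with a short sketch. The example below assumes the breadth-first (Karva) reading order described by Ferreira and the usual rule that a head of length h with maximum function arity n requires a tail of length h(n - 1) + 1. The function set, the symbol "Q" for square root, the terminal names a to d, and all function names are illustrative assumptions, not the settings used in this study.

```python
import math
from collections import deque

# Number of arguments each function symbol takes; "Q" denotes square root.
# Any symbol not listed here is treated as a terminal (an input variable).
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1}


def tail_length(head_length, max_arity):
    """Tail size that lets any head decode to a complete tree."""
    return head_length * (max_arity - 1) + 1


def decode(gene):
    """Read a gene left to right and fill the expression tree level by level."""
    symbols = iter(gene)
    root = [next(symbols), []]            # node = [symbol, children]
    pending = deque([root])
    while pending:
        node = pending.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = [next(symbols), []]
            node[1].append(child)
            pending.append(child)
    return root                           # unused trailing symbols are ignored


def evaluate(node, inputs):
    """Evaluate an expression tree for a dict of terminal values."""
    symbol, children = node
    if symbol not in ARITY:
        return inputs[symbol]
    args = [evaluate(child, inputs) for child in children]
    if symbol == "+":
        return args[0] + args[1]
    if symbol == "-":
        return args[0] - args[1]
    if symbol == "*":
        return args[0] * args[1]
    if symbol == "/":
        return args[0] / args[1]
    return math.sqrt(args[0])             # symbol == "Q"


# The string "Q*+-abcd" decodes to sqrt((a + b) * (c - d)).
tree = decode("Q*+-abcd")
value = evaluate(tree, {"a": 3.0, "b": 1.0, "c": 6.0, "d": 2.0})
```

Because only as many symbols are read as the functions require, any string whose tail is long enough decodes to a valid tree, which is why genetic operators can modify the string freely.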

Data Collection
A set of 108 individual consolidation test results obtained from laboratory tests was used to develop the GEP-based prediction equation. As mentioned earlier, the objective of this study was to predict Cc using conventional parameters of fine-grained soils, namely PL, LL, and e0. Of the 108 data points, 101 correspond to tests conducted on soil samples collected from different locations in Mashhad, Iran. The soil samples were classified as silty-clayey sand (SC-SM), gravelly lean clay with sand (CL), and silty clay with sand (CL-ML) based on the Unified Soil Classification System. These samples were cored from a depth of 0.5 m to 1.0 m. LL, PL, and e0 were measured for these samples in the laboratory based on ASTM D4318-17 and ASTM D854-14 [36,37]. Furthermore, Cc was measured using the oedometer test based on ASTM D2435-11 [38]. In addition, seven consolidation test results reported by Malih were integrated into the laboratory database to make it more robust [39]. The descriptive statistics of the influential input parameters (i.e., LL, PL, and e0) and the output parameter (Cc) for the database used in this study are presented in Table 1. Furthermore, Figs 2-5 illustrate the distribution of these parameters using histograms.

Model Structure and Performance
Prediction equations for Cc developed in previous studies clearly indicate that LL, PL, and e0 are the three main parameters that influence Cc [3-9]. Thus, these parameters were considered in the current study to develop a simplified prediction equation for Cc. The main motivation for developing such an equation was that determining LL, PL, and e0 is straightforward compared to performing consolidation tests that directly determine Cc. Therefore, the developed model is anticipated to result in considerable savings in terms of testing time, technician cost, and laboratory equipment. It should be noted that LL, PL, and e0 are influenced by the natural water content of partially saturated soils, thus making the developed equation applicable to any saturated fine-grained soils [28,40,41]. Mathematically, the developed equation had the following structure:

$$C_c = f(LL, PL, e_0)$$

showing that Cc was considered to be a function of LL, PL, and e0. In order to develop the GEP-based prediction equation for Cc, a database containing 108 data points was developed. Each data point corresponded to LL, PL, and e0, as well as Cc, for a particular fine-grained soil sample. GeneXproTools 5.0 was used to develop the GEP-based prediction equation in MATLAB [42]. The performance of the developed GEP models was evaluated using the coefficient of determination (R²), root mean squared error (RMSE), and mean absolute error (MAE) (21-23), applying the following equations:

$$R^2 = \frac{\left[\sum_{i=1}^{n}(h_i-\bar{h})(t_i-\bar{t})\right]^2}{\sum_{i=1}^{n}(h_i-\bar{h})^2 \, \sum_{i=1}^{n}(t_i-\bar{t})^2}$$

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(h_i-t_i)^2}$$

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|h_i-t_i\right|$$

In these equations, $h_i$ and $t_i$ are the measured and predicted output (Cc) values, respectively, for the i-th data point. Furthermore, $\bar{h}$ and $\bar{t}$ are the averages of the measured and predicted values, respectively, and n is the number of samples [28,29].
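The three performance measures can be computed as sketched below. This is a minimal pure-Python sketch that assumes the common definitions: R² as the squared correlation between measured and predicted values, RMSE as the root of the mean squared difference, and MAE as the mean absolute difference. The function names and the sample values are illustrative assumptions and are not taken from the study's database.

```python
import math


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def r_squared(measured, predicted):
    """Squared correlation between measured (h) and predicted (t) values."""
    h_bar, t_bar = mean(measured), mean(predicted)
    cov = sum((h - h_bar) * (t - t_bar) for h, t in zip(measured, predicted))
    ss_h = sum((h - h_bar) ** 2 for h in measured)
    ss_t = sum((t - t_bar) ** 2 for t in predicted)
    return cov ** 2 / (ss_h * ss_t)


def rmse(measured, predicted):
    """Root mean squared error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((h - t) ** 2 for h, t in zip(measured, predicted)) / n)


def mae(measured, predicted):
    """Mean absolute error between measured and predicted values."""
    n = len(measured)
    return sum(abs(h - t) for h, t in zip(measured, predicted)) / n


# Hypothetical Cc values, for illustration only.
measured_cc = [0.10, 0.20, 0.30, 0.25]
predicted_cc = [0.12, 0.19, 0.31, 0.24]
scores = {
    "R2": r_squared(measured_cc, predicted_cc),
    "RMSE": rmse(measured_cc, predicted_cc),
    "MAE": mae(measured_cc, predicted_cc),
}
```

Note that R² measures how well the predictions track the measurements, while RMSE and MAE measure the size of the errors, so a model can have a high R² and still show a systematic offset.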

Model Development
The database was divided into two subsets in order to avoid overfitting: a training subset and a validation subset. The GEP-based model was trained using the training subset, while the validation subset was used for validation purposes and to avoid overfitting [34]. The final model (prediction equation) was selected based on model simplicity and its performance on the training and validation subsets. The performance criteria were the highest R² and the lowest RMSE and MAE for both subsets.
After training, the candidate models were applied to the unseen validation subset to confirm their performance. The training and validation subsets are commonly chosen as 60%-75% and 25%-40% of the whole data set, respectively. In the current study, 75% (81 data points) and 25% (27 data points) of the data points were assigned to the training subset and validation subset, respectively.
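The 75%/25% division can be sketched as follows. The sketch assumes a random assignment of data points to the two subsets, which the text does not state explicitly; the function name and the seed are illustrative assumptions.

```python
import random


def split_indices(n_points, train_fraction, seed):
    """Shuffle the point indices and cut them into training and validation."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    n_train = round(train_fraction * n_points)
    return indices[:n_train], indices[n_train:]


# 108 data points with a 75% training share give 81 and 27 points.
train_idx, val_idx = split_indices(108, 0.75, seed=0)
```

Fixing the seed keeps the split reproducible, so that competing candidate models are compared on the same validation points.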
The GEP algorithm was executed several times with various combinations of the influential parameters in order to identify the best model. The parameter values were based on those suggested by previous works [31,32,34]. Table 2 lists the parameters of the various runs. Reasonably large numbers were chosen for the population size and the number of generations to help ensure that optimal models are found. In the developed GEP-based model, individuals were selected and passed to the next generation based on a fitness evaluation carried out with roulette wheel sampling and elitism. Elitism guarantees that the best individual is cloned into the next generation. Furthermore, variation in the population was introduced by applying genetic operators, including crossover, mutation, and rotation, to the chosen chromosomes [10].
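Roulette wheel sampling with elitism can be sketched as follows. This is a minimal illustration that assumes positive fitness values where larger is better; the function name, chromosome labels, and fitness values are hypothetical, and the sketch does not reproduce the internal implementation of GeneXproTools.

```python
import random


def select_next_generation(population, fitness, rng):
    """Return a same-size population: the best individual plus roulette picks."""
    best = max(range(len(population)), key=lambda i: fitness[i])
    survivors = [population[best]]         # elitism: clone the best unchanged
    # Roulette wheel: selection probability is proportional to fitness.
    survivors += rng.choices(population, weights=fitness, k=len(population) - 1)
    return survivors


rng = random.Random(42)
population = ["chrom_A", "chrom_B", "chrom_C", "chrom_D"]   # hypothetical labels
fitness = [0.2, 0.9, 0.5, 0.4]                              # hypothetical values
next_generation = select_next_generation(population, fitness, rng)
```

In a full GEP run, the genetic operators would then be applied to the selected chromosomes other than the cloned best individual.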
In every GEP-based model, the values of the setting parameters have a significant impact on model performance. These parameters include the number of genes and chromosomes, the head size of the genes, and the rates of the genetic operators. Since little information was available about GEP parameters in the literature, appropriate settings were selected based on a trial and error scheme (see Table 2).
... for the training subset, validation subset, and entire set, respectively. Furthermore, Table 3 summarizes the GEP-based model performance in terms of R², RMSE, and MAE for these sets. Smith states that for a correlation coefficient of |R| > 0.8, a strong correlation exists between measured and predicted values [43]. Based on Table 3, the developed GEP-based model has a high R² for the training subset, validation subset, and entire data set. In addition, the model exhibited relatively low RMSE and MAE for all these sets.

Additional evaluation of model performance
In this section, the performance of the developed GEP-based model is evaluated based on various statistical parameters found in the literature. These statistical parameters, along with their acceptance criteria, are presented in Table 4. The parameters used in this table are all as previously defined. The developed model was evaluated against these statistical parameters and the results are presented in the same table. As can be seen in Table 4, the developed model met all the criteria for the additional statistical parameters, revealing the decent performance of the developed model.
Roy and Roy [45]; criterion: should be close to 1.0; value: 1.000
Roy and Roy [45]; criterion: should be close to 1.0; value: 0.998
Based on Table 5, the developed GEP-based model outperforms regression models, since regression models consider only a small number of base functions. Therefore, such models cannot capture the complex interactions between the soil parameters (i.e., LL, PL, and e0) and Cc. In contrast, the developed GEP-based model considers a variety of base functions and their combinations in order to achieve a closed-form equation with high performance. The developed GEP-based model works directly from the experimental data with no prior assumptions. In other words, contrary to traditional regression models, GEP does not assume any pre-defined shape for the solution equation. The high values of R² presented in Table 5 indicate that the developed GEP-based model was very successful in fitting the measured Cc to the input parameters LL, PL, and e0.

Fig 1
Fig 1 presents a sample program illustration of evolving GEP. d1, d2, and d3 are

Fig 8. Predicted versus measured Cc for entire data set (training + validation).

Table 1. Descriptive statistics for input and output parameters used in the developed GEP-based model.

Table 2. Parameters used for implementation of the GEP-based model.

Table 4. Evaluating the developed GEP-based model using additional statistical parameters.

Table 5 presents a comparison of the developed GEP-based model with previous models found in the literature. The previous models consisted of either regression-based equations or robust AI methods such as multi-expression programming (MEP), artificial neural network (ANN), and multi-gene genetic programming (MGGP). It is worth mentioning that these AI methods do not provide any closed-form solution. The AI methods had relatively high R², mainly due to their black-box nature of connecting inputs and outputs. Nevertheless, the developed GEP-based model had a higher R² than the existing AI methods. However, MEP, ANN, and MGGP had lower errors in terms of RMSE and MAE.

Table 5. Performance comparison of the developed GEP-based model with existing models.
... evaluate the performance of the developed GEP-based model. This evaluation revealed that the model has a decent performance based on the additional performance measures. Contrary to classical models for estimating Cc such as regression models, the developed GEP-based model reveals a highly nonlinear behavior and includes a complex combination of the influential input parameters (i.e., LL, PL, and e0). In general, Cc was positively correlated with e0. Furthermore, LL and e0 had a higher influence on the estimation of Cc than PL. Comparison of the developed model with previous models in the literature revealed its good performance, which supports the use of the GEP-based model in practical applications.