Stacking ensemble of machine learning methods for landslide susceptibility mapping in Zhangjiajie City, Hunan Province, China

The current study aims to apply and compare the performance of six machine learning algorithms, including three basic classifiers: random forest (RF), gradient boosting decision tree (GBDT), and extreme gradient boosting (XGB), as well as their hybrid classifiers, using the logistic regression (LR) method (RF + LR, GBDT + LR, and XGB + LR), to map the landslide susceptibility of Zhangjiajie City, Hunan Province, China. First, a landslide inventory map was created with 206 historical landslide points and 412 non-landslide points, which was randomly divided into two datasets for model training (80%) and model testing (20%). Second, a landslide factor database was initially established by selecting 15 landslide conditioning factors from the topography, hydrology, climate, geology, and artificial activities. Thereafter, the multicollinearity test and information gain ratio (IGR) technique were applied to rank the importance of the factors. Subsequently, we used a series of metrics (e.g., accuracy, precision, recall, f-measure, area under the ROC (receiver operating characteristic) curve (AUC), kappa index, mean absolute error (MAE), and root mean square error (RMSE)) to evaluate the accuracy and performance of the six models. Based on the AUC values derived from the models, the GBDT + LR model with the highest AUC value (0.8168) was identified as the most efficient model for mapping landslide susceptibility, followed by the XGB + LR, XGB, RF + LR, GBDT, and RF models, which achieved AUC values of 0.8124, 0.8118, 0.8060, 0.7927, and 0.7883, respectively. The results from this study suggest that the stacking ensemble machine learning method is promising for use in landslide susceptibility mapping in the Zhangjiajie area and is capable of targeting the areas prone to landslides.


Introduction
Landslides describe various processes that lead to the downward and outward movement of slope-forming materials, including rock, soil, and artificial fill, or a combination of these (Cruden and Varnes 1996). Different natural phenomena and artificial disturbances cause landslides. Natural triggers include meteorological changes (e.g., heavy rainfall or snowmelt) and rapid tectonic forcings (e.g., earthquakes or volcanic eruptions). Artificial disturbances includes land-use change, deforestation, excavation, slope profile change, irrigation, etc. (Guzzetti et al. 2005). Landslides are the seventh most destructive natural disaster globally, destroying transportation, farmland, and villages and causing loss of life or property, and economic collapse (Petley 2012).
Landslide susceptibility refers to the probability of landslide occurring in a certain area based on the influence of artificial, terrain, and environmental conditions, that is, the 1 3 35 Page 2 of 18 probability of slope damage given certain artificial and natural conditions (Guzzetti et al. 2006). Landslide susceptibility mapping (LSM) in a geographic information system (GIS)integrated environment is the key to formulating disaster prevention measures and reducing future risks.
Previous comparative studies have found that machine learning models based on ensemble learning are superior to single machine learning models in accuracy and robustness, which can increase the availability of highresolution LSM (Merghadi et al. 2020;Sadighi et al. 2020;Pham et al. 2017). In particular, some ensemble learning algorithms based on decision tree (DT) have improved the accuracy and efficiency of landslide classification (Merghadi et al. 2020). DT is a basic classifier of various bagging (Breiman 1996) and boosting (Freund 1990) ensemble machine learning algorithms. Kadavi et al. (2019) applied a DT classification method to predict the areas with high potential for future landslide occurrence and reported that the prediction accuracy of the DT model is better than the LR model. Bagging, which stands for boostrap aggregation, is a technique for repeated sampling with put-back from data based on a uniform probability distribution (Breiman 1996). Random forest (RF) is a typical bagging ensemble machine learning model consisting of DTs that can be trained independently or in parallel (Breiman 2001). Thai Pham et al. (2018) applied different machine learning algorithms for the modeling of landslide susceptibility in the Luc Yen District, Northern Vietnam, and found that the RF method could be applied for better landslide susceptibility mapping and management. Lai and Tsai (2019) developed a systematic approach with the RF method to apply satellite remote sensing images, geographic information system (GIS) datasets, and spatial analysis for multi-temporal and event-based landslide susceptibility assessments on a regional scale.
Unlike bagging, boosting is a weight-based integration of multiple resampled weak classifiers to form a strong classifier (Freund 1990). Boosting ensemble machine learning models, such as AdaBoost (Pham et al. 2015), Multi-boostAB (Webb 2000), the gradient boosting decision tree (GBDT) (Jerome 2001), and extreme gradient boosting (XGB) (Chen and Guestrin 2016), are widely used in landslide susceptibility prediction. Chen et al. (2020a) reported that the GBDT method outperformed the other machine learning methods, and was able to provide strong technical support for producing landslide susceptibility maps in the Three Gorges Reservoir (TGR) area. Sahin (2020) produced a landslide susceptibility map using three featured regression DT-based ensemble methods, including GBDT, XGB, and RF, and the results showed that the XGB method achieved a lower prediction error and higher accuracy than the other ensemble methods.
The aforementioned bagging and boosting ensemble learning methods are homogeneous in that they combine several identical models into one prediction method. However, due to the heterogeneity of topographic, geological, hydrological, and artificial conditions in different landslide areas, homogeneous ensemble learning may not be suitable for all LSM scenarios. Stacking is a typical heterogeneous classifier composition framework proposed by Wolpert (1992). By generalizing the outputs of multiple classifiers, a stacking ensemble uses the outputs of the previous classifier as the learning input information of the next layer so that the previous learning can be fully used in the subsequent induction process to discover and correct classification deviations, to achieve higher classification accuracy than the single classifiers. Althuwaynee et al. (2014) proved the efficiency and reliability of ensemble DT and LR models in LSM. He et al. (2014) introduced a model that combines GBDT with LR, outperforming either of these methods on its own by over 3%. Ma et al. (2015) proposed a novel framework to identify gene variation using LR and RF. According to the above literature results, it is feasible to stack DT-based machine learning models with a logistic regression model to improve the prediction accuracy and performance of landslide susceptibility prediction models.
The current study aims to carry out LSM for Zhangjiajie City, Hunan Province, China, and to improve LSM effectiveness for decision-makers. To achieve this purpose, the RF, GBDT, and XGB methods were used for LSM and compared with the hybrid models RF + LR, GBDT + LR, and XGB + LR. Whereafter, we used a series of criteria to evaluate the performance of the six models, including accuracy, precision, recall, f-measure, area under the ROC (receiver operating characteristic) curve (AUC), kappa index, and root mean square error (RMSE). In addition, the frequency ratio method was used to conduct quantitative analysis on the prediction effect of each machine learning model. Finally, based on the above evaluations, suggestions were drawn for landslide disaster risk prevention and management in the study area.

Study area
There are a total of 18,567 potential geological disaster sites in Hunan Province, accounting for 6.5% of the total sites and ranking fourth in China, threatening a population of 712,600 people and property worth CNY 29.824 billion. The types of potential disaster sites in Hunan Province are mainly landslides, with a total of 11,405 sites. Therefore, Zhangjiajie City is one of the key objects of geological disaster prevention and control in Hunan Province.
Zhangjiajie City (located between 28° 52′ and 29° 48′ north latitude, 109° 40′ and 111° 20′ east longitude) is in the northwest part of Hunan Province with an area of 9516 km 2 (Fig. 1). The elevation ranges from 67 to 1840 m above sea level (asl). Notable alluvial terraces and karst landscapes have been formed in the Zhangjiajie area, located in the Wuling Range between the Yun-Gui Plateau to the northeast and the mountainous area of northwestern Hunan Province, with the mountainous area accounting for 76% of the total area (Yang et al. 2011). The lithostratigraphy within this area comprises Silurian, Devonian, Permian, Triassic, and Quaternary strata. The Silurian and Devonian strata account for most of the total area (Hunan Bureau of Geology and Mineral Resources 1988). The study area belongs to the mid-tropical monsoon climate, with abundant and concentrated precipitation, with an average annual precipitation of 1200-1500 mm. Rainfall is an important external condition for the occurrence and development of landslides in the study area, especially the shallow accumulation body. The rivers and streams in Zhangjiajie City are crisscrossed, and the water system is dominated by the Lishui River and the Loushui River. The largest land-use types in Zhangjiajie City are forest land and cropland. The permanent resident population of Zhangjiajie City has reached more than 1.51 million.

Dataset
In this study, a landslide inventory map containing 206 historical landslide sites in the Zhangjiajie City of Hunan Province, China, was obtained based on field survey and high-resolution remote sensing image interpretation carried out by the Hunan Institute of Geological Survey (oa.hnsddy. com, Changsha, China). Most of these historical landslides belong to small-and medium-sized scale, and the general movement type is clay/silt sliding with the characteristics of high frequency and wide distribution, which can be classified as shallow soil landslide according to the classification standard of landslides (Hungr et al. 2014). Landslide points and non-landslide points are extracted as positive and negative samples, respectively. To avoid the close proximity of the positive and negative samples, 412 non-landslide points were randomly extracted outside the 1000 m buffer zone of the landslide sites. Subsequently, 80% of the points were chosen for model training and 20% for testing.
Based on the existing literature and data availability, we chose 15 conditioning factors from multi-disciplines that have an effect on the occurrence of landslide, such as geomorphology, hydrology, climate, geology, and artificial activities. These LCFs include altitude, slope, aspect, plane curvature, profile curvature, relief, roughness, rainfall, topographic wetness index (TWI), normalized difference vegetative index (NDVI), distance to roads, distance to rivers, land use/land cover (LULC), soil texture, and lithology (Tables 1  and 2). All the LCFs were converted into a raster form with a spatial resolution of 90 m (Fig. 2).

Methods
The LSM scheme is presented in Fig. 3, and the main procedures are described as follows.
(1) Selecting and ranking LCFs: in total, 15 landslide conditioning factors were selected from topography, hydrology, climate, geology, and artificial activities, which included altitude, slope, aspect, plane curvature, profile curvature, relief, roughness, rainfall, TWI, NDVI, distance to roads, distance to rivers, LULC, soil texture, and lithology. We used multicollinearity analysis to detect factor multicollinearity and information gain ratio (IGR) technology to measure the importance of LCFs, so as to independently evaluate the correlation of factors and eliminate irrelevant or redundant factors.
(2) Modeling: we first used several machine learning methods as basic classifiers to model landslide susceptibility, including the RF model with the bagging method, and the GBDT and XGB models with the boosting method. Subsequently, these DT-based ensemble models were stacked with the logistic regression (LR) linear classifier model, then the RF + LR, GBDT + LR, and XGB + LR models were constructed for landslide susceptibility prediction.
(3) Model validation and comparison: we used a series of machine learning evaluation methods and error-based evaluation methods to measure and compare the performance of the six models, including accuracy, precision, recall, f-measure, area under the ROC (receiver operating characteristic) curve (AUC), kappa index, and root mean square error (RMSE).
(4) Visualization of landslide susceptibility map: we made landslide susceptibility prediction maps to qualitatively evaluate the prediction capabilities of each model, and used the frequency ratio method to quantitatively evaluate the prediction results and overall performance of all models in detail.

Importance analysis of LCFs
Choosing appropriate landslide conditioning factors (LCFs) as model input is of great significance for landslide susceptibility evaluation. Information gain ratio (IGR) (Nhu et al. 2020a) is the ratio of node information gain to node split information measure. It is a factor ranking technique that can independently evaluate different factors' correlations and detect the irrelevant or redundant factors.
We used the IGR method to rank and choose the factors that affect landslide susceptibility.
It is assumed that T is the total number of input tuples for each factor in the training dataset, T j is the total number of positive or negative tuples in the training dataset, v is the total number of classes in the dataset, and S is one of the factors influencing landslide disaster. The specific calculation of IGR(S) is as follows: where: where E(S) represents the entropy of the S factor in the training dataset, I(P, N) shows the information required to satisfy the given training dataset, P is the total number of positive tuples in the training dataset, N illustrates the total number of negative tuples in the training dataset, P i and N i mean the number of positive and negative tuples of the ith S factor, respectively, and k represents the number of values of the S factor. (1)

Multicollinearity analysis
Multicollinearity refers to the mutual dependence, lack of independence, and complete or nearly complete linear relationship between the variables in the regression equation (Farrar and Glauber 1967). By analyzing whether there is multicollinearity in the model, high multicollinearity factors should be removed to minimize the bias of the model and to optimize the prediction results. Multicollinearity is usually analyzed by variance inflation factor (VIF) and tolerance (TOL) (Green and Stephenson 1986). Their equations are as follows: where R j is the negative correlation coefficient between the jth independent variable x j and the other independent variables. The value of VIF increases with the increase of R 2 j . Generally, if the TOL value is < 0.10 or 0.20 and the VIF value is > 5 or 10, the results indicate a high degree of multicollinearity between variables (Band et al. 2020).

Basic classifiers
The RF classifier is an improved form of bagged DT-based classifier, which is used to solve complex problems of prediction and multi-classification. It combines the bagging integrated learning algorithm with the random subspace algorithm, making it a powerful classification tool capable of recognizing large-scale and multivariable data. RF consists of four parts, i.e., random selection of samples (put-back sampling), random selection of features, construction of decision trees, and random forest voting (averaging) (Breiman 2001).
Gradient boosting decision tree (GDBT) is a multiple decision tree ensemble machine learning method based on a boosting scheme. The GBDT algorithm builds the model by iteratively forming regression DT as a weak classifier. The weak classifier of each round is generated through multiple iterations, trained on the residuals of the previous round (Jerome 2001). The GBDT algorithm uses multiple classifiers to create hundreds of trees that can minimize the overfitting of the decision tree algorithm. Moreover, the design of each sub-classifier is simple, and the training progress can be accelerated accordingly.
Extreme gradient boosting (XGB) is a highly scalable decision tree integrated machine learning method based on the principle of gradient boosting machines (Chen and (6)  Guestrin 2016). The XGB algorithm improves the GBDT by optimizing the function space gradient descent algorithm from a first-order Taylor expansion to a second-order Taylor expansion, and by adding a regular term to avoid overfitting ). The LR method is a predictive analysis process that combines linear regression with a sigmoid function, using events as dependent variables while causal factors in categorical, continuous, or binary variables as independent variables (Hong et al. 2016). In the process of landslide sensitivity analysis, the independent variables are the LCFs, and the dependent variable is the variable of landslide occurrences. The output value of the model is between 0 and 1. Usually, 0 represents a non-landslide event (negative cell), and 1 represents a landslide event (positive cell). The closer the value is to 1, the higher is the probability of landslide occurring; on the contrary, the closer the value is to 0, the lower is the probability of landslide occurring.

Stacking ensemble
The stacking ensemble method is an integration model proposed by Wolpert (1992) for compiling several different algorithms on training processes, which works by estimating raw classifiers with poor performance relative to independent or bootstrapped training data ).
The stacking model fusion system is designed as a two-layer structure, as shown in Fig. 4. The first layer is integrated by N boosted tree models to form the basic classifier layer of the fusion system. The output of each tree is used as the classification input of the second-layer sparse linear classifier, thereby fusing the N basic models. Assuming that the input feature is x i , the hth basic classifier in the first layer is F h , and the meta-classifier in the second layer is F, then the output of the hth basic classifier of the first layer is F h (x i ), which is used as the input feature of the metaclassifier of the second layer, and the output y i is the final prediction result. As shown in Eq. (8): Essentially, the GBDT + LR model is a binary classifier model with a stacking design, so it can be used to solve binary classification problems. The RF + LR and XGB + LR models have therefore been proposed on this basis. The ensemble decision tree is used instead of the single decision tree to build the model, because the expression ability of a single tree is not sufficient to express multiple distinguishing feature combinations. The expression ability of multiple trees can better find effective features and feature combinations.
The detailed process of stacking of the three DT-based basic models with the LR model is shown in Fig. 4. First, after RF random sampling with put-back and GBDT and XGB random sampling without put-back, their respective original training datasets are obtained. Second, the basic  One-hot encoding is the process of representing categorical variables as binary vectors. When the basic model makes a prediction, the output is not the probability value, but the one-hot coding used to record the position of the leaf node where the predicted probability value of each tree in the model is binarized as 1 or 0. Since each weak classifier has only one leaf node to output the prediction results, while in a basic model, there are a weak classifiers and a total of b leaf nodes, each training data will be converted into a sparse vector of 1 * b dimension, in which a elements are 1 and the rest of b − a elements are 0. Thus, a new training dataset is constructed as an input of the LR model. Next, the new training dataset and labels derived from the original training dataset are inputted into the LR classifier for final training. Finally, the prediction results of RF + LR, GBDT + LR, and XGB + LR are obtained. After the basic model extracted the original dataset into a new dataset, not only did the input data of the LR model become sparse, but also, due to the influence of the numbers of weak classifiers and leaf nodes, the feature dimension of the new training data may be too large. Therefore, the regularization method needs to be used in the LR model to reduce the risk of overfitting.

Correlation among LCFs
In this study, we used variance inflation factor (VIF) and tolerances (TOL) to evaluate the multicollinearity of LCFs. Table 3 shows the VIF and TOL of the 15 LCFs. The results show that the VIF value of the "relief" factor is greater than 5.0 and has multicollinearity, which needs to be excluded from the LCFs. However, the remaining 14 LCFs have no obvious correlation and can be used as input variables to model landslide susceptibility.

Validation and evaluation of models
In this study, we used RF, GDBT, XGB, RF + LR, GBDT + LR, and XGB + LR models to execute the modeling process on the training set and obtained evaluation results. We applied the tenfold cross-validation method to prevent overfitting and to reduce model variability, then applied the grid search method to find the best hyper-parameters for each model (Table 5). Accuracy, precision, recall, f-measure, area under the ROC (receiver operating characteristic) curve (AUC), kappa index, mean absolute error (MAE), and root mean square error (RMSE) were used to evaluate the performance of the six models. Higher values of model accuracy, precision, recall, f-measure, AUC, and kappa, as well as lower values of RMSE and MAE, mean better performance of the model. Table 6 lists the machine learning-based assessment results of five indicators: accuracy, precision, recall, f-measure, and AUC. Both the accuracy and precision values of GBDT + LR are greater than 0.800, which is higher than the other five models. The XGB model yielded the highest recall and f-measure values. In addition, the performance of the RF + LR and GBDT + LR models is better than that of the RF and GBDT models in terms of all the statistical measures. As shown in Fig. 5, the AUC values of the GBDT + LR, XGB + LR, XGB, and RF + LR models are above 0.800, indicating that the above four models demonstrate very satisfactory prediction capability. Furthermore, the GBDT + LR model is better than the other models, due to it obtaining the highest AUC value.
In terms of error-based assessment results (Table 7), the highest kappa index value was obtained by the GBDT + LR model, followed by XGB, GBDT, RF + LR, RF, and XGB + LR models. The kappa index values show the compatibility and reliability of the LSM models. The MAE and RMSE metrics indicate that the GBDT + LR model has the smallest errors, followed by the XGB, RF + LR, GBDT, RF, and XGB + LR models.
The validation results show that the six models have good performance in landslide prediction. Therein, the GBDT + LR model has the best accuracy and predictive ability compared with the other models. The analysis of the model results also confirms that the stacking ensemble method is a useful tool for improving the accuracy of model prediction.

Landslide susceptibility mapping (LSM)
After training and validating all models, we ran the models and obtained the output weight (i.e., the landslide sensitivity index (LSI)). We mapped the landslide sensitivity of the study area by predicting an LSI value on each pixel in the study area. Ultimately, the natural break classification method was used to classify the landslide susceptibility of the study area into five categories (i.e., very low, low, moderate, high, or very high) (Fig. 6). LSM can only qualitatively assess landslide susceptibility, and further statistical methods can be used for a quantitative Fig. 6 Landslide susceptibility maps using: a RF, b GBDT, c XGB, d RF + LR, e GBDT + LR, and f XGB + LR models assessment (Zhou et al. 2021b). The FR method was used to investigate the separation between classes (i.e., to assess the classification accuracy of the models), which represents the ratio of the percentage of landslides to the percentage of the total area in each class (Arabameri et al. 2019). Generally, models with the characteristics of landslide points concentrated in high-prone areas are considered to have strong landslide prediction ability. As shown in Table 8 and Fig. 7, the low and very low susceptibility categories of the six landslide susceptibility maps account for the majority, exceeding 70%; however, there are fewer landslide points in these two areas. Conversely, the high and very high susceptibility categories occupy less than 18% of the study area, but all have more than 50% of landslides occurring there. The statistical results show that the FR value gradually increases as the landslide susceptibility level increases from very low to very high, which means the six landslide susceptibility maps are reasonable and reliable.
According to the FR analysis results, the GBDT + LR model has the highest reliability of landslide susceptibility maps compared with the other five models. In the GBDT + LR maps, the percentage of landslide occurrences and the frequency of high susceptibility and very high susceptibility classes are higher than those of other models.
We chose the GBDT + LR model with the best model performance and FR analysis result to verify its actual predictive ability. Several landslide instances, extracted from Google Earth (www. google. com, Mountain View, USA), coincide with areas predicted to have high and very high landslide susceptibility (Fig. 8), which shows that the results predicted by the GBDT + LR model are highly consistent with the actual situation.

Discussion
Landslides in Zhangjiajie City, Hunan Province, China, have received considerable attention. LSM is of great significance for visually analyzing landslide susceptibility. The main goal of this study was to stack several DT-based classifiers with a LR classifier to generate RF + LR, GBDT + LR, and XGB + LR models to obtain the optimal model for the study area, and to compare them with individual RF, GBDT, and XGB classifiers. Subsequently, they were applied in LSM in Zhangjiajie City, Hunan Province, China. In LSM, it is important to evaluate the predictive capability of all LCFs. Factor choosing and sorting methods mainly include filter, wrapper, and embedded methods, and the IGR technique is a typical filter. In this study, we used the IGR technique to identify the LCFs' predictive capacity. The weight of each LCF was calculated using the entropy index. The AM values of 15 LCFs were tested to be greater than 0, indicating varying impacts of landslides in the study area. We demonstrated that profile curvature, roughness, LULC, altitude, distance to roads, TWI, and lithology are more important LCFs for LSM. In contrast, NDVI, slope, distance to rivers, plane curvature, aspect, soil textures, and rainfall were found to be less important in the study area. A systematic review of the literature shows that the importance of LCFs is specific to a region and cannot be extrapolated to other regions. For instance, Nhu et al. (2020b) identified rainfall as most important gully erosion factor in the Salavat Abad saddle, Kurdistan Province, Iran. However, Pham et al. (2020) discovered that slope is the most influential factor in  Zhao et al. (2021) showed that the TWI is the most important LCF in Longnan City, Gansu Province, China.
After choosing suitable LCFs, we used six machine learning models to predict the landslide susceptibility maps. According to the evaluation results of the model performances, XGB was better than GBDT and RF, and the stacking ensemble learning techniques improved the goodness of fit and performances of these models. The model performance of GBDT + LR was better than RF + LR and XGB + LR. In particular, for RF and GBDT, the results after stacking fusion with LR were better than for RF and GBDT alone, respectively. However, XGB alone was better than XGB + LR. This is inconsistent with our expectations for two possible reasons: (1) a simple stacking ensemble process of XGB model will not necessarily improve its performance; and (2) when there are not enough training samples, the effect of XGB + LR tends to deteriorate. In addition, it is not always the case that the modeling performance of a fusion model is better than that of a single model. Zhou et al. (2021a) used XGB, RF, GBDT, LR, XGB + LR, RF + LR, and GBDT + LR models to study lymph node metastasis (LNM) in patients with poorly differentiated-type intramucosal gastric cancer, suggested that a single machine learning algorithm can predict LNM, and that a fusion algorithm cannot improve the performance of machine learning in predicting LNM.
Five landslide susceptibility levels were obtained in the ultimate phase by dividing the LSI (0-1) using the natural break method. The selection of partitioning methods determines the correctness of LSM. Natural breaks, standard deviations, equal intervals, and quantiles have been used in LSM. Among them, the natural break method is considered the most popular, and different classes can be generated based on the inherent characteristics of the dataset without any subjective consideration (Chen and Zhang 2021). As shown in Fig. 7c, among all LSMs obtained in this study, the largest FR value belongs to the very high susceptibility category, followed by the high, moderate, low and very low susceptibility categories, proving that the models have the ability to effectively discriminate landslide occurring areas with different levels of susceptibility. The results also show that the prediction abilities of the percentage and FR of landslide occurrences are more reasonable with the stacking improvement in LSM.
The GBDT + LR model is considered to be the most effective prediction model, and it classified approximately 7.93% and 7.69% of the land into high and very high susceptibility  (Fig. 7a). The landslide-prone areas of the Zhangjiajie City are distributed in the central and western mountainous areas, and the distribution pattern is consistent with the trend of mountains. The results suggest that at least 15.62% of the area requires early warningrelated preventive measures to enable local authorities to take timely and appropriate action to avoid or reduce the impact of landslides.