Preprint
Article

This version is not peer-reviewed.

High-Resolution Soil Organic Carbon Mapping with Interpretability and Uncertainty Quantification in Hungarian Croplands

Submitted:

22 June 2026

Posted:

22 June 2026

You are already at the latest version

Abstract
Accurate prediction of soil organic carbon (SOC) content at fine spatial resolution is crucial for supporting precision soil management and ensuring food security in complex agricultural landscapes. Focusing on croplands in Hungary, this study developed a region‑specific framework that produces a 30 m resolution SOC content map from multi-temporal bare-soil composite images and DEM derivatives, with SHAP‑based interpretability and pixel‑wise uncertainty quantification. The results showed that the Gradient Boosted Decision Trees (GBDT) model achieved the best generalization performance on the test set (R² = 0.518, RMSE = 4.498 g·kg⁻¹, MAE = 3.499 g·kg⁻¹, RPIQ = 2.229, LCCC = 0.621). SHAP analysis revealed that both spectral and topographic variables exhibited markedly nonlinear effects, and that spectral variables contributed more to SOC prediction than topographic variables. In addition, prediction credibility varied considerably across the study area, with low-confidence zones concentrated mainly in transitional hilly and low‑mountain areas characterized by complex terrain. This study produced a 30 m resolution SOC content prediction map for Hungary cropland, offering a transferable methodology for accurate SOC mapping and uncertainty assessment in complex agricultural landscapes, and providing new insights for agricultural management decisions in Hungary and across Europe.
Keywords: 
;  ;  ;  ;  ;  

1. Introduction

Soil organic carbon (SOC) is a key indicator of soil fertility and health in croplands, directly influencing nutrient retention, soil structure, water-holding capacity, and ultimately crop productivity [1], so spatially explicit SOC maps are required for monitoring regional soil quality and safeguarding farmland productivity [2]. High-resolution spatial prediction and mapping of soil organic carbon content can accurately reveal the distribution patterns and spatial variability of regional soil organic carbon. This information provides a decision-making basis for implementing precision fertilization and delineating conservation tillage zones. It also plays a critical supporting role in maintaining ecosystem stability, enhancing crop yield and cropland use efficiency, and formulating sustainable agricultural development strategies.
Although traditional soil sampling and laboratory analysis can provide accurate measurements at discrete points, this time-consuming and labor-intensive manner restricts to capture the spatial-continuous SOC content across large areas. Driven by these constraints and the reliance of sustainable agriculture targets on accurate SOC data, a growing demand for high-resolution SOC spatial information has emerged—one that necessitates accurate, reliable, and cost-effective prediction and mapping methods. In recent years, Remote sensing has become an effective alternative to traditional field surveys for soil property predicting, exploiting soil–spectral relationships to predict properties at unvisited locations. The effectiveness of remote sensing in predicting and mapping cropland soil organic carbon has been widely demonstrated across major agricultural regions, including China [3], the United States [4], and Germany [5], supporting its critical role in SOC studies. However, accurate, reliable, high-resolution SOC maps are still largely unavailable in Hungary. At present, the national SOC content map of Hungary is available at a resolution of 100 m , a spatial detail that remains insufficient for farm-scale precision management. Gridded SOC predictions derived from LUCAS observations, available through the European Soil Data Centre, range from 500 m to 1 km in spatial resolution [6]. The global SoilGrids 250 m product is similarly too coarse to guide precision agricultural management [7]. Meanwhile, soil types and terrain conditions vary widely across Hungarian croplands—low-elevation plains are dominated by Chernozems, whereas Luvisols and Cambisols prevail in the hilly areas—producing pronounced spatial heterogeneity in SOC. Medium- to large-sized fields are widespread across Hungarian croplands, and the SOC variation within them often requires finer spatial detail than coarse-resolution products can provide. In this study, a 30 m resolution SOC map of Hungarian cropland, capturing this fine-scale variability and providing reliable spatial information for precision agricultural management.
SOC prediction models quantify the relationships between SOC and predictor variables, and are used to generate spatially explicit estimates. Recently, machine learning methods such as Random Forest (RF), Support Vector Machines (SVM), and Extreme Gradient Boosting (XGBoost) have been increasingly adopted in SOC modeling because they combine strong noise tolerance with the flexibility to fit complex data structures, capture nonlinear interactions, and deliver high predictive accuracy [8]. However, the performance of these methods is strongly tied to regional context, and a model that excels in one setting may underperform elsewhere. Therefore, it’s necessary to construct region-specific prediction model for the target region. [9]. In addition, although machine learning often achieves high accuracy, the black-box property of them limits its geoscientific explainability [10]. Once the contributions of predictor to SOC content accumulation are not understood, single data-driven models may generate false correlations without geoscientific plausibility [11]. This limitation has motivated the adoption of post-hoc explanation methods such as SHapley Additive exPlanations (SHAP) in SOC content prediction [12,13]. SHAP can decompose model predictions into the contribution of each predictor, thereby generating both importance rankings at global scale and dependence plots, which reveal how individual variables affect SOC content predictions. Unlike variable importance ranking based on permutation, the dependence plots of SHAP can indicate whether a predictor exerts a linear or nonlinear effect and whether there is a threshold response, which will help clarify the mechanism driving the spatial heterogeneity of SOC content. Therefore, SHAP is used in this study to quantify the global importance of spectral and topographic variables, and reveal their nonlinear marginal effects on SOC content.
Assessing the credibility of predictions is equally important as the interpretability itself. In regions with long-tailed SOC content distributions and weak spectral signals, a single deterministic map can mask considerable local prediction errors [14]. Spatial uncertainty can be quantified using several existing methods, such as boot-strap resampling, quantile regression forest and sequential Gaussian simulation, etc. Bootstrap resampling generates multiple predictions at pixel scale by repeatedly sampling the training data with replacement, enabling the computation of both the mean prediction and its standard deviation or coefficient of variation. This pixel-wise uncertainty may reveal where and why model predictions may be unreliable, turning SOC content maps from static images into credible decision-support tools [7]. This involves quantifying pixel-wise prediction uncertainty, examining how prediction credibility varies with terrain complexity, and producing complementary SOC content and spatial credibility maps. The joint application of SHAP-based explanation and pix-el-wise uncertainty analysis within a single, region- adaptive framework has not been fully explored in cropland SOC content prediction, especially in landscapes with com-plex topography.
The primary aim of this study is to produce the first dedicated 30 m resolution soil organic carbon (SOC) map for Hungarian croplands with pixel-wise uncertainty, using an interpretable, region-specific framework that integrates multi-temporal bare-soil composites and topographic variables. The specific objectives of this study are to: (1) build and systematically compare multiple SOC prediction models, and select the best-generalizing one for accurate spatial prediction; (2) apply SHAP to quantify variable importance and reveal the direction, magnitude, and nonlinearity of each predictor variables’s influence on SOC; (3) generate a 30 m resolution SOC content map with a corresponding pixel-wise uncertainty map, delineate the spatial patterns and hotspots of SOC, and evaluate prediction reliability.

2. Materials and Methods

2.1. Study Area

This study selects Hungary as the study area, with a geographical scope of 45°48′~48°35′ N and 16°05′~22°58′ E (Figure 1a). Hungary is a country in Central Europe with a total land area of 93,023 km², of which 55.3% is agricultural land [15]. The climate of the study area is temperate continental climate, with an annual mean temperature of 10~11 °C and mean precipitation around 630 mm. Hungary lies within the Carpathian Basin and its terrain varies considerably (Figure 1b): the flat Great Hungarian Plain occupies the east and center, gentle hills characterize the west, and low mountains rise along the northern border. Cropland elevations generally fall between 70 m and 300 m [16]. The plain regions are predominantly covered by fertile Chernozems and meadow soils rich in SOC, while Luvisols and Cambisols dominate the surrounding hills and mountains [17]. Hungary has a long history of farming, and its main crops are winter wheat, maize, and sunflower. Its continental climate, together with the farming calendar, gives two windows when the soil is bare. One occurs prior to spring sowing (March to May), and anther follows autumn tillage (September to November). Those windows are good for collecting Landsat 8 imagery that is free of clouds and vegetation. Such cloud-free and vegetation-free imagery can then be utilized for the task of predicting SOC content.

2.2. Data Sources and Preprocessing

2.2.1. Soil Sampling Data

For this study, the soil organic carbon data were extracted from the 2015 topsoil database of the Land Use/Cover Area frame Survey (LUCAS) (Figure 1). Eurostat and the Joint Research Centre (JRC) coordinate the LUCAS program, which employs a rigorous stratified systematic sampling design. Across Europe, topsoil (0–20 cm) samples were gathered, and their SOC content (g·kg⁻¹) was measured using dry combustion within a single ISO-certified laboratory [18]. This highly standardized approach eliminated systematic sampling errors, so the raw data turned out to be highly reliable. The initial cropland dataset (n = 261) was screened for potential outliers using the three-sigma (3σ) rule to minimize the influence of erroneous values from sampling or laboratory analysis on model fitting. Two extreme outliers that deviated substantially from the main distribution were identified and removed. A refined dataset with 259 valid samples was retained for subsequent modeling. During model construction, a stratified random sampling strategy was used to divide the refined dataset into two independent subsets: 80% of the data was used for model training and hyperparameter optimization, and the remaining 20% was held out as an independent test set to evaluate the model’s generalization performance.

2.2.2. Box-Cox Transformation and Smearing Bias Correction

To address the right-skewed distribution of SOC content and stabilize the variance, a Box-Cox transformation (Eq. 1) was applied to the target variable ω after data splitting [19].
ω = ( ω λ 1 ) / λ , λ 0 ln ω , λ = 0
ω = λ ω + 1 1 λ , λ 0 e ω , λ = 0
where λ is the transformation parameter, estimated from the training set only, and ω is the transformed value of the target variable ω . The parameter λ was kept fixed and then the same transformation was applied to the test set to prevent data leakage. All models were then fitted with the Box-Cox-transformed SOC as the response variable ω .
When predictions are mapped back to the original SOC scale, directly applying the inverse Box–Cox transformation (Eq. 2) yields the conditional median rather than the conditional mean. For right-skewed SOC data, this leads to a systematic underestimation of high-value areas. To obtain approximately unbiased mean predictions on the original scale we applied the smearing bias correction method [20]. First, the residuals e i in the transformed space were computed from the training set, following Eq. 3:
ε i ^ = ω ^ i ω ^ i p r e d  
where i = 1 , , n   , n is the number of training samples; ω ^ i is the Box–Cox transformed observed value of the i-th training sample (Eq. 1); ω ^ i p r e d   is the model prediction for that sample in the transformed space (typically the fitted value) and ε i ^ is the residual in the transformed space.
For any sample (training or test) with a transformed-space prediction ω ^ * p r e d , the smearing-corrected estimate on the original SOC scale, ω * ˜ , was then obtained by averaging the inverse-transformed values after adding back each training residual, as defined in Eq. 4:
ω * ˜ = 1 n i = 1 n g 1 ( ω ^ * p r e d + ε i ^ )
where g 1 is the inverse Box–Cox transformation (Eq. 2), and ω * ˜ represents the approximately unbiased SOC content prediction for that sample. Subsequently, the bias-corrected predictions were used to assess model performance on the test set and to generate the spatial SOC map and the associated uncertainty map.

2.2.3. Compositing Bare-Soil image from Multi-temporal Landsat 8 Data

The Landsat 8 Collection 2 Tier 1 Surface Reflectance (SR) product, with a spatial resolution of 30 m, was used for bare-soil compositing in this study. This product, generated by the U.S. Geological Survey (USGS), was atmospherically corrected with the LaSRC algorithm and is certified as Analysis Ready Data (ARD) by the Committee on Earth Observation Satellites (CEOS) [21]. It provides six multispectral bands (B2~B7) ranging from 0.452 μm to 2.294 μm, covering the coastal aerosol, visible, near-infrared, and shortwave infrared regions, together with a Quality Assessment (QA) band that can be used to identify and mask clouds, cloud shadows and snow. The data is stored as 16-bit integer GeoTIFF files, and surface reflectance is obtained by applying the scaling equation:
r e f l e c t a n c e   =   D N × 2.75 × 10 5 0.2
where reflectance represents the dimensionless surface reflectance, normally ranging from 0 to 1, and DN denotes the stored digital number in the 16-bit unsigned integer GeoTIFF. Validation over bare soil and desert surfaces indicates that the Landsat 8 SR product has atmospheric correction errors of less than 4% in the shortwave infrared and near-infrared bands, confirming its suitability for bare-soil spectral retrieval and soil property estimation [22].
To construct the bare-soil spectral image, we collected all available surface reflectance (SR) images on the Google Earth Engine (GEE) platform that were acquired during the bare-soil phenological windows (March–May and September–November) in Hungary from 2016 to 2020. Clouds, cloud shadows, and saturated pixels were first masked using the QA band. Surface reflectance values were then derived from the scaled digital numbers using the official scale factors and offsets. The 10-m resolution land cover tiles of Hungary, obtained from the Copernicus Land Monitoring Service (CLMS) [23], were first mosaicked to generate a seamless land cover map. To ensure spatial consistency with the Landsat 8 imagery, the dataset was reprojected into the WGS 84 and resampled to 30m resolution. The resulting map was then reclassified into a binary raster: cropland pixels were assigned a value of 1, and all other land cover types were merged into a single non-cropland class with a value of 0. The binary mask was then imported into Google Earth Engine (GEE), where a dual-threshold rule (NDVI < 0.25 and BSI > 0) was applied to filter out any remaining non-bare-soil pixels, thereby isolating the bare soil within the cropland extent [24]. The choice of this estimator is tied directly to its capacity for suppressing interference—from anomalous outliers, from vegetation that emerges and senesces within short intervals, and from moisture lingering near the soil surface. A bare-soil spectral signal estimated in this pixel-wise fashion therefore possesses a degree of robustness against transient artifacts that the arithmetic mean cannot provide [25]. The resulting bare-soil spectral image (Figure 2) was then used as the primary remote sensing input for SOC prediction modeling.

2.2.4. Derivation of the Spectral and Topographic Variables

The predictor variables were derived from two primary sources: spectral variables extracted from the bare-soil image, and topographic variables from the Copernicus DEM GLO-30. The spectral variables are consisted of six original bands (Blue, Green, Red, NIR, SWIR1, SWIR2) and seven spectral indices, including the Normalized Difference Vegetation Index (NDVI), Soil Adjusted Vegetation Index (SAVI), Bare Soil Index (BSI), Normalized Red-Green Difference Index (NRGDI), and three additional indices (Table 1). The Copernicus DEM GLO-30 (30 m resolution) was used to derive topographic variables. This DEM is based on TanDEM-X interferometric radar data acquired between 2010 and 2015 [26], and was hydrologically conditioned with water surface flattening and stream burning to improve terrain analysis [27]. From the hydrologically conditioned DEM, basic topographic factors were derived: elevation, slope, aspect, plan curvature, and profile curvature. In addition, a set of hydro-topographic variables was extracted, including the Topographic Wetness Index (TWI), Topographic Position Index (TPI), Terrain Ruggedness Index (TRI), Valley Depth, Multiresolution Index of Valley Bottom Flatness (MrVBF), and Multiresolution Index of Ridge Top Flatness (MrRTF). The spectral and topographic variables were computed on the GEE platform using standard median compositing and terrain analysis routines. Detailed definitions and calculation formulas for all variables are summarized in Table 1.

2.3. Machine Learning Methods and Hyperparameter Optimization

Five machine learning algorithms were selected to model the relationship between SOC content and predictor variables: Partial Least Squares Regression (PLS), Support Vector Machine (SVM), Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and Extreme Gradient Boosting (XGBoost). This selection reflects the known differences among these models in handling spatially heterogeneous data and high-dimensional features.
PLS is a linear dimensionality reduction method that extracts latent variables by simultaneously decomposing the predictor and response matrices. These latent variables are chosen to maximize the covariance with the target variable while accounting for predictor variance, which helps mitigate multicollinearity among the predictor variables [28]. The only parameter tuned in PLS was the number of latent variables. SVM employs the ε-insensitive loss function, which penalizes only errors exceeding the ε margin, and uses the kernel trick to map input features into a high-dimensional space where an optimal linear regression hyperplane is constructed. This makes SVM particularly suitable for nonlinear regression with small sample sizes [29]. In this study, the radial basis function (RBF) kernel was adopted, and the error penalty coefficient (C) and kernel width parameter (γ) were tuned. RF builds an ensemble of independent regression trees through bootstrap aggregating (bagging). At each node split, a random subset of features is evaluated, and the split that maximizes information gain is selected. By randomizing both the selection of training instances and the set of features considered at each node, random forest introduces a double layer of stochasticity. This decorrelates the base trees, lowers generalization error, and simultaneously supplies robustness to spatially autocorrelated noise. The ensemble estimate is obtained by computing the arithmetic mean of the predictions from all constituent trees [30]. Hyperparameter tuning addressed the number of trees, the maximum allowable tree depth, and the size of the feature subset randomly sampled per split.
Gradient boosting constructs an additive model through a forward stagewise procedure. Simple CART trees, each a weak learner on its own, are added to the ensemble one after another. Rather than fitting a new tree directly to the observed target, the algorithm trains it on the negative gradient of the loss function, where the gradient is evaluated at the residuals left by the current ensemble. In effect, this procedure corresponds to performing gradient descent in function space. Residuals are corrected in a stepwise fashion, and it is this iterative refinement that endows GBDT with a powerful ability to learn complex nonlinear mappings [31]. Three quantities were addressed by hyperparameter tuning: the learning rate, the number of boosting stages, and the maximum depth of individual trees. XGBoost enhances the standard boosting framework by incorporating curvature information. A truncated second-order Taylor expansion of the objective supplies both the gradient and the Hessian; split-finding and leaf-weight computation are thereby improved together. This leads to faster convergence when the predictor set is of high dimensionality. In addition, a regularization term that penalizes the L2 norm of leaf scores is integrated into the loss, directly constraining model complexity. Overfitting is attenuated by this mechanism when training data are limited [32]. The hyperparameters subjected to tuning were the learning rate, the number of boosting rounds, the maximum tree depth, the minimum loss reduction required to partition a node (γ), and the L2 regularization coefficient (λ).
Hyperparameter tuning for each model was carried out through an exhaustive grid search—a strategy that becomes particularly appropriate when the parameter space is of modest dimensionality. Model performance was evaluated via 10-fold cross-validation [33]. A predefined grid of candidate values was first constructed, and every point across that grid was subsequently tested. The hyperparameter combination that returned the highest average cross-validated R² was retained as the optimal configuration. Once this configuration had been identified, all models were retrained on the complete training dataset. The final SOC content prediction models were then obtained from this retraining step.

2.4. SHAP-Based Model Explanation

To quantify the influence of predictor variables on the spatial distribution of topsoil SOC, we applied the SHAP framework. SHAP originates from cooperative game theory and provides additive feature attributions for model predictions. Its key advantage is that it avoids the bias toward continuous variables often seen in impurity-based importance measures from tree models. SHAP instead decomposes the model prediction into additive feature contributions by averaging the marginal contribution of each feature over all possible feature subsets. For any sample, the prediction can be expressed as the sum of a baseline value (the average prediction over all samples) and the SHAP contributions of all features. The local explanation model is defined as:
g ( z ) = ϕ 0 + j = 1 M ϕ j z j
where g is the explanation model; z { 0 , 1 } M is a binary vector indicating whether each feature is included in the coalition; M is the total dimension of the input prediction variables; ϕ 0 is the model’s baseline prediction when no features are included; and ϕ j is the SHAP marginal contribution of the j t h prediction variables. The sign of ϕ j indicates the direction of influence of that environmental factor on the model prediction: ϕ j > 0 means the current feature value has a positive effect on SOC accumulation; ϕ j < 0 means a negative effect.
SHAP values were extracted for all independent test samples using the best-performing model. For global evaluation, the mean absolute SHAP value of each predictor variable was calculated to establish a global feature importance ranking. This approach identifies the key macro-drivers shaping the spatial pattern of SOC in the study area. [34]. For local interpretation, SHAP dependence plots were generated for the top-ranked predictor variables. These plots illustrate the nonlinear response of SOC predictions to each predictor and help reveal potential threshold effects and interactions between topographic and spectral variables. Recent studies have demonstrated the utility of such dependence plots for revealing nonlinear relationships and interaction effects among predictor variables [35]. This approach helps move beyond purely data-driven prediction toward a mechanistic understanding of how topographic and spectral variables control SOC spatial patterns.

2.5. Model Performance Assessment

The soil samples were divided into training samples and testing samples with a ratio of 8:2 using stratified random sampling. Five metrics (R², RMSE, MAE, RPIQ, and LCCC) were used to evaluate prediction accuracy of model and generalization ability:
R 2 = 1 i = 1 n ( y i y ^ i ) 2 i = 1 n ( y i y ¯ i ) 2
R M S E = 1 n i = 1 n ( y i y ^ i ) 2
M A E = 1 n i = 1 n y i y ^ i
R P I Q = Q 3 Q 1 R M S E
L C C C = 2 ρ σ y σ y ^ σ y 2 + σ y ^ 2 + ( y ¯ y ^ ¯ ) 2
where y i and y ^ i are the measured and predicted values for the i-th sample, and n is the total number of samples; Q 1 and Q 3 are the first (25%) and third (75%) quartiles of the measured SOC data; ρ is the Pearson correlation coefficient between the measured and predicted values; σ y 2 and σ y ^ 2 are the variances of the measured and predicted values, respectively; y ¯ and y ^ ¯ are the means of the measured and predicted values. R2 measures the proportion of spatial variance in the target variable explained by the model, ranging from 0 to 1, with higher values indicating a better fit. RMSE is sensitive to large prediction errors; lower values reflect higher absolute accuracy. MAE, the average absolute difference between predicted and measured values, captures systematic bias. RPIQ, defined as the interquartile range of measured values divided by RMSE, evaluates predictive performance across the full data range, with higher values indicating better performance [36]. The LCCC metric reflects the degree to which predicted values coincide with measured ones. It does so by incorporating two components: the strength of the linear association between the two sets of values, and the extent of any systematic offset that pushes them away from the identity line. This metric is bounded within [−1, 1], with values close to 1 indicating that predictions are both highly correlated with observations and located near the 1:1 concordance line [37].

2.6. Bootstrap-Based Spatial Uncertainty Quantification

When a machine learning model is trained on discrete, sparsely distributed observations and subsequently extrapolated to a continuous spatial field, the resulting predictions carry inherent uncertainty. Three distinct sources feed this uncertainty: the density and spatial distribution of the ground sampling, inaccuracies within the predictor variables, and the structural form of the model itself [38]. To quantify this spatial uncertainty, we applied a model-agnostic bootstrap framework. The framework refrains from imposing parametric constraints on the error distribution; stable variance estimates are obtained even from moderately sized training sets. A collection of replicate training datasets was produced by drawing samples with replacement from the original observations. Each replicate inherited the previously tuned hyperparameters and was used to fit a base learner. That learner then delivered predictions at every pixel across the study domain. The final per-pixel SOC estimate was derived by computing the arithmetic mean of the predictions across all replicates. The standard deviation of these replicate-level predictions delineates the spatial structure of prediction uncertainty. Given that SOC content exhibits pronounced spatial variability across the study region, the coefficient of variation (CV) was also computed to express uncertainty in relative terms, thereby factoring out the influence of the mean magnitude. The resulting pair of maps—showing the standard deviation and the CV—highlights zones where predictions are less reliable and offers a quantitative instrument for prioritizing locations that would most benefit from supplementary field sampling. Bootstrapping-based uncertainty quantification of this kind has been widely applied in contemporary digital soil mapping studies [39].

3. Results and Discussion

3.1. Descriptive Statistics of SOC Content

In this study, the initial number of sampling points for topsoil in cropland was 261. After removing outliers using the 3σ rule, 259 points were retained. Descriptive statistics are shown in Table 2. For the full dataset, SOC content ranged from 4.70 g·kg⁻¹ to 37.80 g·kg⁻¹, with a mean of 16.93 g·kg⁻¹ and a median of 16.80 g·kg⁻¹, and a CV of 38.04%, indicating moderate strong spatial heterogeneity. Skewness was 0.29 and kurtosis was −0.40, indicating an approximately symmetric distribution slightly flatter than a normal distribution (Figure 3a). To assess the validity of data splitting for subsequent modeling, we applied a quantile-based stratified random sampling method. Specifically, all 259 points were divided into five strata based on the quintiles of SOC content. Within each stratum, approximately 80% of the samples were randomly assigned to the training set (N = 207) and 20% to the test set (N = 52). The training set had a mean SOC of 16.97 g·kg⁻¹ (CV = 37.90%), and the test set had a mean of 16.79 g·kg⁻¹ (CV = 38.98%). The violin plot in Figure 3b shows similar SOC content distributions between the two sets. The Kolmogorov–Smirnov (K-S) test confirmed that both datasets followed the same distribution (D = 0.069, p = 0.979). These results validate the quantile-based stratified sampling strategy and support reliable generalization assessment of subsequent models on independent cropland samples.

3.2. Model Performance and Region-Specific

The original SOC data showed mild asymmetry (skewness = 0.29). A Box–Cox transformation was applied before modeling, with λ estimated by maximum likelihood in Section 2.2.2. After model predictions were obtained, the Duan smearing estimator was used to back-transform all predicted values to the original SOC content scale [20].The accuracy metrics (R2, RMSE, MAE, RPIQ and LCCC) in Table 3 were all calculated from these back-transformed values. Table 3 summarizes the metrics for each model on the training set and the test set, and Figure 3 shows the corresponding scatter plots of predicted VS measured SOC on the test set.
Figure 3. Scatter plots of measured versus predicted SOC content for the five machine learning algorithms. (a) Gradient Boosting Decision Tree (GBDT); (b) Partial Least Squares Regression (PLS); (c) Random Forest (RF); (d) Support Vector Machine (SVM); (e) Extreme Gradient Boosting (XGBoost).
Figure 3. Scatter plots of measured versus predicted SOC content for the five machine learning algorithms. (a) Gradient Boosting Decision Tree (GBDT); (b) Partial Least Squares Regression (PLS); (c) Random Forest (RF); (d) Support Vector Machine (SVM); (e) Extreme Gradient Boosting (XGBoost).
Preprints 219625 g004
On the training set, GBDT gave the highest R² (0.660), with an RMSE of 3.952 g·kg⁻¹ and an LCCC of 0.727, indicating that it effectively captured the nonlinear relationships between SOC and the predictor variables. RF and XGBoost followed, with R² values of 0.585 and 0.577. PLS produced the lowest R² (0.442). Linear models cannot adequately capture the complex soil–landscape interactions that characterize this heterogeneous agricultural region, a limitation clearly reflected in the PLS result. SVM, with an R² of 0.520, fell between the ensemble methods and PLS. A consistent drop in performance appeared when the models were applied to the test set. GBDT maintained the highest R² (0.518), the lowest RMSE (4.498 g·kg⁻¹), and the highest RPIQ (2.229), together with an LCCC of 0.621. SVM achieved a slightly higher LCCC (0.631), although its R² fell to 0.506. This pattern reveals two complementary strengths: GBDT explained a larger portion of the variance, while SVM constrained systematic bias more tightly. RF (R² = 0.484) and XGBoost (R² = 0.511) followed GBDT. The clear performance ranking across all models—from GBDT down to PLS—confirms that SOC prediction in this landscape requires models capable of capturing complex, high-order interactions among predictors. GBDT, whose core mechanism is sequential residual correction, held a clear advantage over the parallel bagging strategy employed by RF. XGBoost, although it also belongs to the boosting family, did not outperform RF substantially. Its stronger regularization may have suppressed noise during training but did not translate into better generalization on the given data distribution. PLS again ranked lowest. When prediction surfaces are strongly nonlinear, linear formulations encounter fundamental limits, and these limits were clearly visible in the test-set results.
These contrasting behaviors are directly visible in the prediction scatterplots (Figure 3). GBDT produced predictions that cluster tightly around the 1:1 line, with few outliers and modest deviations across the entire SOC range. The SVM output shows a slight compression of predicted values. In contrast, RF and PLS exhibit considerably larger scatter, particularly at the low and high ends of the SOC distribution. For SOC values above 30 g·kg⁻¹, both models systematically underpredict, displaying a clear tail bias. XGBoost falls between these two extremes: its scatter is smaller than that of RF and PLS but does not match the concentration achieved by GBDT. Based on these consistent results, GBDT was selected as the foundational model and was subsequently used to produce the final SOC maps and to perform the SHAP interpretation and uncertainty analyses.
This clear performance ranking can be attributed to three sets of factors: the heterogeneous environmental conditions across Hungary’s croplands, the internal design logic of each algorithm, and the statistical character of the soil carbon dataset itself. The plain regions are predominantly covered by fertile Chernozems and meadow soils rich in SOC, while Luvisols and Cambisols dominate the surrounding hills and mountains, creating a spatially structured mosaic of soil properties that models must capture. Together with the spatially heterogeneous environment, these factors produce a strongly nonlinear relationship between predictor variables and SOC content, as well as a skewed SOC content distribution. Under these conditions, models have a clear advantage, which effectively reduce bias in complex regions of the predictor space. This is most apparent in the contrast between the two boosting implementations. . GBDT, through sequential residual fitting, concentrates on areas with the largest errors and adaptively captures local nonlinearities, a mechanism central to gradient boosting [31]. The tight clustering of its predictions around the 1:1 line, including at high SOC, indicates that this mechanism effectively captures the nonlinear variation and the spatial heterogeneity of the study area. Although XGBoost is based on the same boosting principle, it consistently performs slightly worse than GBDT across all metrics. It incorporates L1 and L2 regularization along with a more aggressive tree pruning strategy than standard gradient boosting [32]. Where SOC content varies sharply with soil type and topographic position, these constraints may slightly limit the model from fully capturing the complexity of the predictor-response surface. Given a training dataset of moderate size that nonetheless spans a wide SOC interval, the bias incurred by such penalization of model complexity may surpass the corresponding gain in variance reduction [40]. The net effect is a performance gap that, though modest in magnitude, persists across all evaluation criteria.
The behavior of RF reinforces this interpretation. Bagging primarily averages predictions over decorrelated trees, which effectively suppresses variance but lacks a built-in mechanism for bias correction in highly nonlinear feature spaces [30]. Consequently, RF predictions gravitate toward the global mean: low SOC sites are overestimated, while high SOC values are systematically underpredicted. RF records the lowest LCCC among all models, a diagnostic that points directly to bias, rather than variance, as the dominant source of error. The superiority of boosting over bagging in this study is therefore a direct consequence of bias reduction. Across Hungary’s croplands, where SOC patterns are spatially intricate and soil-forming conditions are heterogeneous, boosting algorithms offer a demonstrable advantage when training samples are limited. Bias is explicitly and iteratively corrected, which is precisely what the structure of the data demands.
SVM performed well on a complementary aspect of model evaluation. The highest LCCC among all candidates was delivered by this model, marginally exceeding the value recorded for GBDT. Although a larger fraction of the total variance was explained by GBDT, the systematic departures from the 1:1 identity line were more effectively suppressed by SVM. This behavior traces back to two defining features of the algorithm: the ε-insensitive loss function, which disregards errors smaller than a specified margin, and the structural risk minimization principle that explicitly balances model complexity against empirical risk [29]. A mild compression of the predicted range follows from this design—penalties are applied only when residuals exceed the ε-threshold, so extreme outputs are curtailed while large systematic errors are largely avoided. For Hungarian croplands, this bias–variance trade-off carries practical implications. National-scale carbon stock assessments demand spatially unbiased estimates across regions that differ markedly in soil conditions. In such applications, the lower systematic bias of SVM’s aggregate estimates becomes a distinct asset. Field-level management, by contrast, places a premium on local precision. Here, the higher R² achieved by GBDT may make it the more suitable option. PLS proved to be the weakest learner overall. Its linear latent-variable formulation could not accommodate the nonlinear associations that link the predictors to SOC. The consequence was twofold: the most severe underestimation occurred at high SOC values, and the overall dispersion of predictions was the largest. The performance gap separating PLS from the nonlinear models indicates that imposing a linear structure introduces systematic error when the underlying predictor–response relationship departs from linearity—a problem that becomes especially acute in the extreme ranges. This finding aligns with earlier work in soil spectroscopy, where nonlinear techniques have consistently outperformed PLS [41].
Model performance is never solely a function of the algorithm; it is also shaped by the environmental context where the training and test samples were collected. The data inherit the properties of that context, so a model’s predictive success is, to a large extent, conditioned by the region it is asked to represent. Local factors govern the statistical architecture of the data; the architecture, in turn, dictates how algorithms behave. In this study, five learners drawn from distinct families—partial least squares, support vector regression, random forest, gradient boosting, and its regularized variant XGBoost—were evaluated on Hungarian cropland. Gradient boosting (GBDT) attained the top rank. An equivalent outcome was recorded in the alpine meadows of Gannan [42] and the arid Akesai region [43], where boosting once again outperformed alternative tree ensembles. These three environments—high-elevation grassland, dryland, and temperate farmland—differ dramatically, yet converge on the same winner. Universality is absent from the broader record. Across 11 Latin American nations, the highest-performing algorithm varied spatially [44]. In northwestern Iran, a bagging-based random forest surpassed both support vector regression and XGBoost, achieving an R² of 0.84 [45]. Adjacent Andean catchments subjected to the same GWR model produced an R² of 0.99 in one and a negative coefficient of determination in the other [46]. Taken together, these findings effectively preclude the notion that any single model class can claim universal superiority. Rather than committing in advance to a particular algorithm, model selection should be steered by empirical comparisons conducted at the local scale—an adaptive, data-driven strategy for which the Hungarian investigation serves as a concrete exemplar. Model candidates were systematically benchmarked against a set of environmental descriptors that included soil type diversity, topographic roughness, and the legacy of past cultivation. Distributional properties of the data—SOC dynamic range, skewness, and the degree of nonlinearity manifest in the predictor–response relationship—were also fed into the evaluation. What emerged from this benchmarking was a clear, unambiguous preference for one model. The approach, grounded from start to finish in site-specific evidence, demonstrated its fitness for purpose on Hungarian cropland. Whether the same methodology can be successfully transferred to novel environmental settings is a question that independent corroboration must yet resolve. A logical extension of the present work would widen the assessment to span a broader spectrum of environmental gradients, with predictor suites augmented by climatic variables, lithological information, and vegetation indices extracted from remote sensing imagery.

3.3. SHAP-Based Variable Contribution and Nonlinear Effects

According to the SHAP mean absolute value ranking (Figure 4), the near-infrared band (B5) is by far the most important predictor, followed by Topographic Wetness Index (TWI), Normalized Red-Green Difference Index (NRGDI), elevation (H), red band (B4), and SWIR2 band (B7). Spectral bands together account for 42% of total contribution, confirming that soil reflectance properties dominate SOC prediction in this bare-soil composite dataset. The SHAP bee-swarm plot (Figure 5) shows the direction of each feature’s contribution to the predicted SOC content: B5, NRGDI, H, B7, valley depth (VD), SWIR1 band B6, blue band B2, B4, and NDMI have negative contributions (i.e., higher variable values lead to lower predicted SOC content); TWI, bare soil index (BSI), and aspect have positive contributions. The remainder of this discussion focuses on interpreting the most prominent patterns and their implications.
Figure 4 presents the global predictor importance based on mean absolute SHAP values. The SHAP dependence plots (Figure 6) further reveal complex nonlinear interactions between topographic and spectral covariates across the Hungarian study area. A consistent inverse relationship underlies these patterns: as SOC content increases, soil reflectance decreases across the optical range, which is a well-established spectral response in soil spectroscopy.
B5 (0.85–0.88 μm) makes a predominantly negative contribution to SOC prediction, consistent with this inverse relationship. Its dependence plot (Figure 6) shows a threshold at a reflectance of approximately 0.16. Below this value, SHAP contributions are positive, indicating that low reflectance is associated with higher predicted SOC; above the threshold, contributions become negative, meaning the model interprets bright NIR signals as indicative of carbon-poor, mineral-dominated soils. This threshold marks the transition between two competing optical regimes: at low reflectance, absorption by organic matter suppresses the NIR signal, whereas at high reflectance, the spectral response is dominated by mineral components of inherently higher reflectance. B4 (0.64–0.67 μm), B6 (1.57–1.65 μm), and B7 (2.11–2.29 μm) all exhibit a comparable shift from positive to negative SHAP values at characteristic reflectance levels. B7 has a similar pattern to B5, and the two act together in the model [47]. The consistent threshold behavior across these bands reflects their shared sensitivity to the spectral contrast between organic-rich and mineral-dominated soils; in the SWIR region, this sensitivity encompasses both soil organic matter and clay mineral assemblages [48]. In the visible range, B3 (0.53–0.59 μm) and B2 (0.450–0.51 μm) also show negative SHAP contributions, although their influence is considerably smaller. The blue-band response is primarily attributable to absorption by humus-associated chromophores, which reduces reflectance in this spectral region. In the bare-soil composites, B3 reflectance is dominated by the combined effects of iron oxides and SOC content: darker, carbon-rich soils absorb more radiation in the green band, resulting in lower reflectance.
Among topographic variables, TWI contributes positively. High TWI points to converging water flow, leading to waterlogged, reducing soils that slow organic matter decomposition—a pattern observed in both cropland and grassland systems [49]. Within the study area, the most fertile Chernozems and meadow soils occupy topographic depressions that coincide with regions of high TWI, lending empirical support to this interpretation. The SHAP dependence plot (Figure 6) confirms this threshold-type response. For TWI values below approximately 10, SHAP values are negative, corresponding to well-drained, aerobic conditions on higher or steeper terrain, where rapid moisture loss limits organic matter accumulation. Above this threshold, SHAP values turn positive, reflecting low-lying, convergent positions where persistent water saturation restricts oxygen diffusion and promotes reducing conditions that slow decomposition. Despite this clear pattern, the effect size of TWI is substantially smaller than that of B5, indicating that topographic moisture control acts as a secondary modifying influence rather than a primary driver of SOC accumulation. H exerts a negative influence on predicted SOC and displays an equally abrupt decline in the dependence plot. Upland positions are subject to intensified erosion, and in the northern hills, thin soils overlying volcanic and metamorphic substrates impose an additional constraint on carbon retention. VD follows a similarly negative trajectory: larger VD values denote upper-slope settings where erosive potential is considerably amplified; its dependence plot reveals a gradual decline without a pronounced threshold, consistent with the subdued topographic gradients of the plain. Plan curvature also carries a negative association with predicted SOC, conforming to the classic geomorphic framework wherein convex slopes promote runoff and sediment removal, whereas concave slopes favor deposition and accumulation [50]. TRI and Slope contributed only negligibly, as expected given the gentle gradients that dominate Hungarian croplands. Among all terrain-derived variables, aspect alone shows a positive influence. South-facing slopes receive greater solar radiation, which can stimulate vegetation productivity and increase organic matter input [51]; however, enhanced radiation also accelerates mineralization. The net positive contribution suggests that, in this region, the gain from increased production outweighs the loss from faster decomposition, although this interpretation remains provisional without concurrent vegetation data.
Among spectral indices, NRGDI is negatively associated with SOC. This index characterizes soil redness: high NRGDI values correspond to redder soils, which typically indicate iron-oxide-rich subsoil or eroded surfaces with low organic carbon content. It therefore serves as a reasonable proxy for low-SOC conditions. BSI, a brightness indicator, contributes negatively because high values typically correspond to sandy or saline soils with low organic matter. In the present multi-temporal bare-soil composite, where all pixels represent bare soil or very sparse residue, BSI captures variations in soil surface brightness rather than functioning as a vegetation-vs-soil discriminator. Thus, higher BSI values indicate brighter surfaces associated with lower SOC, consistent with the inverse reflectance–SOC relationship. NDMI likewise exhibits a negative contribution. The bare-soil compositing approach removes vegetation and selects the driest, most exposed moment per pixel, eliminating rainfall and soil moisture signals. Consequently, NDMI no longer functions as a moisture indicator; instead, high NDMI values identify bright mineral surfaces—such as aeolian sands and loess-derived soils—that lack organic matter and produce strong mineral scattering.
Together, these dependence patterns demonstrate that the relationships between SOC and its key predictors are predominantly nonlinear. Many variables—both spectral and topographic—exert their influence through abrupt transitions rather than steady changes. This prevalence of nonlinear structure explains why the classical GBDT, with minimal regularization allowing it to closely track sharp threshold breaks without being penalized for model complexity, systematically outperformed the more strongly regularized XGBoost and the linear alternatives. The reflectance thresholds identified in the SHAP dependence plots hold practical value beyond model interpretation. Under bare-soil conditions, a spectral domain defined by B5 reflectance values below approximately 0.16 enables direct flagging of potentially high-SOC cropland from remote sensing imagery—a simple screening procedure that can guide targeted soil sampling and the spatial delineation of zones where conservation tillage should be prioritized [52].
Several limitations should be noted. The interpretation of TWI relies on general hydrological reasoning; direct empirical evidence—such as soil redox potential or microbial activity measured across contrasting TWI sites—would strengthen the causal claims. The SHAP-derived thresholds are specific to the Hungarian cropland context and may not directly transfer to regions with different soil types, climate regimes, or management histories. Future work should test the generalizability of these thresholds across broader agroecological settings.

3.4. SOC Spatial Prediction and Uncertainty Quantification

We applied an optimized Gradient Boosting Decision Tree (GBDT) model (R² = 0.518, RMSE = 4.498 g·kg⁻¹, 10-fold cross-validation) to produce a 30 m resolution map of topsoil organic carbon (SOC) for Hungarian croplands (Figure 7). Predicted SOC content ranged from 6.36 to 28.27 g·kg⁻¹, with a mean of 16.30 g·kg⁻¹ and a standard deviation of 4.65 g·kg⁻¹. SOC content was classified into five classes using 5 g·kg⁻¹ intervals: I (5–10), II (10–15), III (15–20), IV (20–25), and V (25–30 g·kg⁻¹).
Five elevation zones were derived from a DEM using natural breaks [53]. The area and percentage of each SOC class within these zones are summarized in Table 4. Hungarian cropland is heavily concentrated in the low-elevation zone and decreases markedly with increasing elevation, leaving only a small proportion at higher altitudes. SOC classes (Class IV and Class V) are similarly concentrated in the low-elevation zone. Class V occurs almost exclusively in Zone 1 (0–124 m), covering 1181.64 km², whereas the combined extent across all higher zones totals only 1.28 km². Class IV dominates Zone 1, and together with Class III it accounts for over 78% of the zone. As elevation increases, the composition of SOC classes undergoes a marked shift. In Zone 2 (124–195 m), Class II becomes dominant (53.13%), while Class IV declines to 7.26%. In Zones 3 to 5 (>195 m), Class II is overwhelmingly dominant, comprising 78.81%, 83.58%, and 74.06% of the respective zones. Overall, low-elevation cropland not only covers the largest area but also stores the highest SOC levels. With increasing elevation, cropland area diminishes and the SOC class composition shifts toward a Class II-dominated, lower-SOC profile, with high-value classes becoming rare.
The 30 m SOC map reveals a pronounced east–west zonation across Hungarian croplands: higher SOC values are concentrated in the eastern and southeastern plains, while lower values prevail across the western and hilly regions. Class II accounts for the largest areal extent, whereas SOC categories—Classes IV and V—are largely confined to low-elevation plains. Class I appears sporadically on aeolian sand dunes of the Great Plain and in high-elevation northern croplands, Class III stretches across the central region and reaches the northern foothills and western border, and Class V occurs as scattered patches in northeastern alluvial depressions and southern lowlands. This broad spatial pattern is consistent with both the Global Black Soil Map (GBSmap) [54] and the national-level product of Szatmári et al. [55], confirming that the model effectively captures the regional SOC distribution associated with Chernozems, the dominant high-SOC soil type in Hungarian croplands. Complementing this regional agreement, the 30 m resolution reveals within-class heterogeneity that coarser products cannot resolve. . In the eastern high-SOC belt, contiguous Class IV patches are interspersed with scattered Class V occurrences, whereas small Class I fragments appear on sandy dune landforms in the southern part of the study area. This finer spatial detail exposes local-scale SOC variations tied to soil type boundaries, microtopographic depressions, and isolated sand lenses—features inherently smoothed away in coarser mapping products. Such fine-grained differentiation has direct implications for site-specific management, as it delineates zones of contrasting carbon status within fields that would otherwise be treated as uniform.
Class V is almost entirely confined to Zone 1, even though it represents only a small fraction of that zone. Together with Class IV, these highest-SOC soils are concentrated in a narrow low-elevation range. As elevation increases, the proportion of these high-SOC classes declines sharply, and the elevational extent of each class narrows progressively from Class II through to Class V. This sequential contraction supports the inference that the highest carbon stocks are largely confined to flat, low-relief terrain. Similar elevational trends in SOC have been reported for temperate mountain regions [56]. Within the study area, however, the elevational gradient reflects more than a straightforward climatic control. Fertile alluvial depressions and Chernozem-mantled plains are spatially concentrated at low elevations. In these settings, thick organic horizons have developed from deep, base-rich parent materials and have been sustained by sediment accretion and centuries of agricultural enrichment. The elevational signal is therefore mediated, at least in part, by the distribution of soil types and depositional landforms. Higher on the Transdanubian Hills, Luvisols and Cambisols become the dominant soil groups. Moderate organic inputs combined with relatively rapid decomposition maintain Class II as the prevalent SOC category there. Class I emerges in two sharply contrasting elevational contexts. Across the cool, humid northern uplands, steep slopes experience accelerated erosion that strips fertile topsoil. Organic inputs to the root zone are restricted by the consequent loss of soil volume and nutrients, offsetting any carbon gains that slower decomposition might otherwise have permitted. In the lowland sandy tracts, rapid drainage, low cation exchange capacity, and minimal mineral protection act together to severely suppress organic matter retention. These divergent settings demonstrate that SOC is not governed by elevation alone. Rather, it covaries with an ensemble of soil-forming factors, and it is their combined influence that ultimately shapes the carbon stock.
The spatial configuration of SOC across the study area is governed by the interplay of three primary factors: topography, soil type, and cultivation. In the northern mountains and hills, Luvisol–Chernozem landscapes are interspersed with closed depressions where elevated SOC pockets are concentrated. These topographic lows serve as local sinks, intercepting and retaining sediment and runoff from adjacent slopes. Such inputs sustain localized carbon enrichment; without them, these sites would bear only the moderate SOC levels common across the surrounding region. A distinctly different spatial arrangement characterizes the eastern Great Plain, where extensive Chernozems with deep, dark humus horizons reflect historically strong carbon accumulation. The model predictions confirm extensive and relatively continuous Class IV and Class V cover across this region, consistent with the high SOC levels expected from these fertile soils. Nonetheless, long-term intensive cultivation has likely diminished the original carbon stock to some degree. Repeated mechanical tillage and reduced residue return accelerate SOC mineralization, yet the Chernozem landscape still retains a high baseline. High-SOC areas persist most prominently within alluvial depressions, where elevated moisture and finer sediments retard decomposition and provide a measure of protection, allowing Class V patches to remain embedded within the broader Class IV matrix [57]. The cultivation imprint is further modulated in the Danube–Tisza interfluve, where a fine-scale mosaic of aeolian sands and Chernozems imposes additional edaphic heterogeneity, producing a more varied local SOC pattern within the generally high-SOC plain [58]. In the southern Danube–Tisza interfluve and the Kiskunság sand region, sandy and coarse skeletal soils dominate the landscape. Biomass production is restricted by the high sand content and low water-holding capacity of these substrates; organic matter decomposition is simultaneously hastened, resulting in an inherently low SOC status. On the marginal dunes, the decline to Class I represents the most extreme expression of this textural control, a pattern fully consistent with the poor retention capacity typical of sandy soils [59]. The western part of the study area, corresponding to the Transdanubian Hills, is defined by rapid soil-type transitions from Cambisols to Fluvisols that occur along valley axes. An intimate juxtaposition of Class II and Class III is generated by these transitions, mirroring the patchwork of weakly developed alluvial soils intermingled with material sloughed from adjacent slopes. Carbon stored in this zone is relatively labile: the alluvial deposits are geologically young, and the coarse-textured Fluvisols offer only a limited capacity to stabilize organic matter.Across these contrasting environments, macro-scale SOC patterns are consistently shaped by soil–geomorphology interactions. These interactions establish a physical template upon which more localized processes—hydrological redistribution, cultivation history, and microtopography—exert their influence [60]. The relative importance of these secondary controls varies geographically: cultivation acts as a moderate modifier in the eastern plains, texture and drainage dominate in the south, and alluvial dynamics play the primary role in the west.
A pixel-level coefficient of variation (CV) map derived from 100 bootstrap resampling runs was generated to quantify the spatial distribution of SOC prediction uncertainty across Hungarian croplands (Figure 8). The lowest uncertainty (CV < 5%, dark green) is concentrated in the eastern Great Plain (~46.8–47.3°N, 18.5–21.0°E), closely overlapping the extensive high-SOC zones dominated by Chernozems. This overlap between high prediction confidence and high SOC is directly linked to the dominant role of B5. In this region, the deep, dark humus horizons of Chernozems produce persistently low NIR reflectance, providing a stable spectral signal for B5, the most influential predictor in the model. The subdued topography and gradual soil transitions further reduce the complexity of predictor–response relationships, contributing to greater prediction stability. Higher uncertainty appears in two main zones. The first encompasses the southern Danube–Tisza interfluve and the Kiskunság sand region (~46.0°N), where sandy and coarse-textured soils dominate. The high sand content and low water-holding capacity restrict biomass production and accelerate organic matter decomposition, resulting in inherently low SOC levels. Although the terrain appears flat, a fine-scale mosaic of aeolian sands and Chernozems creates high-frequency spatial variation that the current predictor set cannot fully resolve. The second zone lies in the western Transdanubian Hills (~17°E), where the landscape transitions from hills to plains and soil types shift abruptly from Cambisols to Fluvisols along valley axes. River terraces, floodplains, and hillslopes intermingle, producing rapid alternations of soil and topographic conditions that further elevate prediction uncertainty. The northern high-elevation region appears in light green tones, indicating moderate to moderately low uncertainty. This suggests that, despite the pronounced topographic relief, the existing terrain variables still fall short of fully explaining SOC variation in this complex landscape—particularly the localized carbon enrichment within closed depressions of the Luvisol–Chernozem mosaic, which cannot be entirely captured by regional-scale topographic proxies.
Overall, the spatial pattern of uncertainty reflects how well the environmental predictors match the spatial heterogeneity of SOC. In regions with relatively uniform soil and terrain, the stable signal of the dominant spectral predictor ensures high prediction confidence. Conversely, where soils form fine-scale mosaics or topography transitions abruptly, the current predictor suite is less capable of resolving high-frequency variation, leading to elevated prediction risk. This pattern aligns with the overarching framework in which soil–geomorphology interactions set the physical template for SOC distribution, while local-scale processes superimpose additional variability [52]. This prediction framework, while robust at the Hungarian cropland scale, has notable limitations. Its performance remains constrained by the spatial representativeness of the training samples and by the inherent limitation of static covariates, which cannot fully capture the slow dynamics of soil carbon through real-time spectral signals. Nevertheless, the combined use of the 30 m SOC content map and its corresponding pixel-wise uncertainty map offers a distinct advantage over conventional single-product outputs: it equips decision-makers not only with SOC estimates for each field, but also with a spatially explicit quantification of their reliability. In low-uncertainty zones such as the eastern Great Plain, SOC predictions can directly inform variable-rate fertilization, organic matter amendments, and soil health monitoring. In high-uncertainty zones such as the southern sandy region and the western transitional hills, the confidence layer explicitly indicates where supplementary field sampling or more conservative management strategies should be prioritized, thereby reducing the risk of misguided decisions arising from model error. This “prediction plus confidence” dual-layer framework provides actionable, quantitative support for precision nutrient management, the spatial prioritization of conservation tillage, and the spatially informed implementation of agricultural carbon neutrality goals.

4. Conclusions

This study produced the first dedicated 30 m resolution SOC map for Hungarian croplands with pixel-wise uncertainty, using an interpretable, region-specific framework that integrates multi-temporal bare-soil composites and topographic variables. Systematic model comparison identified GBDT as the best-generalizing model, confirming that nonlinear ensemble methods are well-suited to capturing SOC variability in this landscape. Then SHAP analysis revealed that spectral and topographic predictors exert distinct, strongly nonlinear effects that jointly shape SOC spatial patterns, and it made these controls transparent rather than treating the model as a black box. The pixel-wise uncertainty map, generated via bootstrap resampling, delineates where predictions merit high confidence and where they are less reliable—with zones of elevated uncertainty often coinciding with geologically complex parent materials or a long history of intensive cultivation. By delivering both a 30 m SOC map and a matching confidence layer, this framework provides spatially explicit, reliable information to support cropland management and soil health policy in Hungary. Although the framework was constructed using Hungarian cropland data, its methodological core is transferable. The predictor set can be adapted to the environmental characteristics of new regions, offering a replicable template for regional-scale SOC mapping and providing practical insights for agricultural management in new regions. However, this study has several limitations. Notably, the current predictors rely mainly on spectral and topographic variables, without explicitly incorporating supplementary environmental data (such as climate variables) that influence SOC dynamics at broader scales. Future work could integrate such complementary data to improve prediction accuracy and sharpen confidence diagnostics. Future work could also adapt the framework to track SOC changes over time, rather than providing only a single-date baseline. This would make the framework more useful for both field-level soil management and agricultural policy support in intensively cultivated regions.

Author Contributions

Conceptualization, J.L. and Y.Z.; methodology, J.L. and L.S.; software, J.L.; validation, L.S. and Z.X.; formal analysis, J.L. and L.S.; investigation, J.L. and L.S.; resources, H.X. and W.C.; data curation, J.L.; writing—original draft preparation, J.L. and L.S.; writing—review and editing, J.L. and Y.Z.; visualization, J.L. and L.S.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Heilongjiang Provincial Natural Science Foundation, grant number LH2023D024.

Data Availability Statement

All datasets used in this study are publicly available. The LUCAS topsoil dataset is available from the European Soil Data Centre (ESDAC) at http://esdac.jrc.ec.europa.eu/. Landsat 8 Collection 2 Tier 1 surface reflectance data were accessed and processed through the Google Earth Engine platform (https://earthengine.google.com), and the original data are provided by the U.S. Geological Survey and can be downloaded from https://earthexplorer.usgs.gov/. The Copernicus DEM GLO-30 digital elevation model and the 10-m resolution land cover tiles of Hungary were both obtained from the Copernicus Land Monitoring Service, available at https://land.copernicus.eu/. The processed analysis results generated during this study are available from https://doi.org/10.5281/zenodo.20747014 (alternative access: https://zenodo.org/record/20747014).

Acknowledgments

The authors thank the academic editor and the anonymous reviewers for their constructive comments that helped improve this manuscript. We are also grateful to the European Commission for funding the LUCAS soil survey through its relevant Directorates-General, to the Copernicus Land Monitoring Service for providing land cover and digital elevation data, to the U.S. Geological Survey for the open provision of Landsat imagery, and to Google Earth Engine for the computational platform that enabled large-scale data access and processing.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Lal, R. Soil Carbon Sequestration Impacts on Global Climate Change and Food Security. Science 2004, 304, 1623–1627. [CrossRef]
  2. Xia, Y.; McSweeney, K.; Wander, M.M. Digital Mapping of Agricultural Soil Organic Carbon Using Soil Forming Factors: A Review of Current Efforts at the Regional and National Scales. Front. Soil Sci. 2022, 2, 890437. [CrossRef]
  3. Xue, J.; Zhang, X.; Chen, S.; Chen, Z.; Lu, R.; Liu, F.; Van Wesemael, B.; Shi, Z. National-Scale Mapping Topsoil Organic Carbon of Cropland in China Using Multitemporal Sentinel-2 Images. Geoderma 2025, 456, 117272. [CrossRef]
  4. Zoller, A.L.; Birru, G.; Kharel, T.; Jin, V.L.; Schmer, M.R.; Freidenreich, A.; Wardlow, B.; Kettler, T.; Gala, T. Remote Sensing of Soil Organic Carbon in Varied Tillage-Crop Systems. J. Environ. Qual. 2025, 54, 1535–1547. [CrossRef]
  5. Broeg, T.; Don, A.; Gocht, A.; Scholten, T.; Taghizadeh-Mehrjardi, R.; Erasmi, S. Using Local Ensemble Models and Landsat Bare Soil Composites for Large-Scale Soil Organic Carbon Maps in Cropland. Geoderma 2024, 444, 116850. [CrossRef]
  6. Panagos, P.; Van Liedekerke, M.; Jones, A.; Montanarella, L. European Soil Data Centre: Response to European Policy Support and Public Data Requirements. Land Use Policy 2012, 29, 329–338. [CrossRef]
  7. Poggio, L.; De Sousa, L.M.; Batjes, N.H.; Heuvelink, G.B.M.; Kempen, B.; Ribeiro, E.; Rossiter, D. SoilGrids 2.0: Producing Soil Information for the Globe with Quantified Spatial Uncertainty. SOIL 2021, 7, 217–240. [CrossRef]
  8. Padarian, J.; Minasny, B.; McBratney, A.B. Machine Learning and Soil Sciences: A Review Aided by Machine Learning Tools. SOIL 2020, 6, 35–52. [CrossRef]
  9. Ding, Z.; Liu, K.; Grunwald, S.; Smith, P.; Ciais, P.; Wang, B.; Wadoux, A.M.J. -C.; Ferreira, C.; Karunaratne, S.; Shurpali, N.; et al. Advancing Soil Organic Carbon Prediction: A Comprehensive Review of Technologies, AI, Process-Based and Hybrid Modelling Approaches. Advanced Science 2025, 12, e04152. [CrossRef]
  10. Wadoux, A.M.J.-C.; Samuel-Rosa, A.; Poggio, L.; Mulder, V.L. A Note on Knowledge Discovery and Machine Learning in Digital Soil Mapping. Eur. J. Soil Sci. 2020, 71, 133–136. [CrossRef]
  11. Rentschler, T.; Scholten, T. A Note on Spurious Correlations and Explainable Machine Learning in Digital Soil Mapping. Eur. J. Soil Sci. 2025, 76, e70172. [CrossRef]
  12. Dong, Y.; Wang, X.; Wang, S.; Li, B.; Liu, J.; Huang, J.; Li, X.; Zeng, Y.; Su, W. Enhancing Soil Organic Carbon Prediction by Unraveling the Role of Crop Residue Coverage Using Interpretable Machine Learning. Geoderma 2025, 455, 117225. [CrossRef]
  13. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc., 2017; Vol. 30.
  14. Heuvelink, G. Uncertainty Quantification of GlobalSoilMap Products. Glob.: Basis Glob. Spat. Soil Inf. Syst. - Proc. 1st Glob. Conf. 2014, 335–340. [CrossRef]
  15. World Bank Open Data Available online: https://data.worldbank.org (accessed on 23 May 2026).
  16. Mezősi, G. The Physical Geography of Hungary; Geography of the Physical Environment; Springer International Publishing: Cham, 2017; ISBN 978-3-319-45182-4.
  17. Michailidis, V.; Lugato, E.; Panagos, P.; Freund, F.; Abalos, D. Impact of Healthy Diet Shifts on Soil Greenhouse Gas Emissions across Europe. Global Change Biol. 2025, 31, e70624. [CrossRef]
  18. Orgiazzi, A.; Ballabio, C.; Panagos, P.; Jones, A.; Fernández-Ugalde, O. LUCAS Soil, the Largest Expandable Soil Dataset for Europe: A Review. Eur. J. Soil Sci. 2018, 69, 140–153. [CrossRef]
  19. Box, G.E.P.; Cox, D.R. An Analysis of Transformations. J. R. Stat. Soc.: B (Methodol.) 1964, 26, 211–243. [CrossRef]
  20. Duan, N. Smearing Estimate: A Nonparametric Retransformation Method. J. Am. Stat. Assoc. 1983, 78, 605–610. [CrossRef]
  21. USGS EROS Archive - Landsat Archives - Landsat 8-9 OLI/TIRS Collection 2 Level-2 Science Products | U.S. Geological Survey Available online: https://www.usgs.gov/centers/eros/science/usgs-eros-archive-landsat-archives-landsat-8-9-olitirs-collection-2-level-2 (accessed on 23 May 2026).
  22. Adhikari, S.; Leigh, L.; Pathiranage, D.S. Pressure-Related Discrepancies in Landsat 8 Level 2 Collection 2 Surface Reflectance Products and Their Correction. Remote Sens. 2025, 17. [CrossRef]
  23. Crop Types 2017 - Present (Raster 10m), Europe, Yearly, Nov. 2024 Available online: https://sdi.eea.europa.eu/catalogue/srv/api/records/9db29b07-5968-4ce0-8351-1e356b3d7d47 (accessed on 10 June 2026).
  24. Nguyen, C.T.; Chidthaisong, A.; Diem, P.K.; Huo, L.-Z. A Modified Bare Soil Index to Identify Bare Land Features during Agricultural Fallow-Period in Southeast Asia Using Landsat 8. Land 2021, 10. [CrossRef]
  25. Demattê, J.A.M.; Rizzo, R.; Rosin, N.A.; Poppiel, R.R.; Novais, J.J.M.; Amorim, M.T.A.; Rodriguez-Albarracín, H.S.; Rosas, J.T.F.; Bartsch, B. dos A.; Vogel, L.G.; et al. A Global Soil Spectral Grid Based on Space Sensing. Sci. Total Environ. 2025, 968, 178791. [CrossRef]
  26. German Aerospace Center TanDEM-X - Digital Elevation Model (DEM) - Global, 90m 2018.
  27. Gdulová, K.; Marešová, J.; Moudrý, V. Accuracy Assessment of the Global TanDEM-X Digital Elevation Model in a Mountain Environment. Remote Sens. Environ. 2020, 241, 111724. [CrossRef]
  28. Wold, S.; Sjöström, M.; Eriksson, L. PLS-Regression: A Basic Tool of Chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109–130. [CrossRef]
  29. Schölkopf, B.; Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press, 2002; ISBN 978-0-262-19475-4.
  30. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
  31. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [CrossRef]
  32. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, August 13 2016; pp. 785–794.
  33. Bergstra, J.; Bergstra, J.; Bengio, Y.; Bengio, Y. Random Search for Hyper-Parameter Optimization. 1–25.
  34. Wang, F.; Liang, R.; Li, S.; Xiang, M.; Yang, W.; Lu, M.; Song, Y. Assessing the Impact of Multi-Source Environmental Variables on Soil Organic Carbon in Different Land Use Types of China Using an Interpretable High-Precision Machine Learning Method. Ecol. Indic. 2024, 169, 112865. [CrossRef]
  35. Duan, D.; Wang, P.; Rao, X.; Zhong, J.; Xiao, M.; Huang, F.; Xiao, R. Identifying Interactive Effects of Spatial Drivers in Soil Heavy Metal Pollutants Using Interpretable Machine Learning Models. Sci. Total Environ. 2024, 934, 173284. [CrossRef]
  36. Bellon-Maurel, V.; Fernandez-Ahumada, E.; Palagos, B.; Roger, J.-M.; McBratney, A. Critical Review of Chemometric Indicators Commonly Used for Assessing the Quality of the Prediction of Soil Attributes by NIR Spectroscopy. TrAC, Trends Anal. Chem. 2010, 29, 1073–1081. [CrossRef]
  37. Lin, L.I.-K. A Concordance Correlation Coefficient to Evaluate Reproducibility. Biometrics 1989, 45, 255–269. [CrossRef]
  38. Zhu, A.-X.; Ma, T.; Zhao, F.-H.; Yang, X.; Xia, Y. Uncertainty Quantification for Digital Soil Mapping: An Overview. Pedosphere 2025. [CrossRef]
  39. Hu, B.; Xie, M.; Zhou, Y.; Chen, S.; Zhou, Y.; Ni, H.; Peng, J.; Ji, W.; Hong, Y.; Li, H.; et al. A High-Resolution Map of Soil Organic Carbon in Cropland of Southern China. CATENA 2024, 237, 107813. [CrossRef]
  40. John Lu, Z.Q. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. J. R. Stat. Soc. A 2010, 173, 693–694. [CrossRef]
  41. Mondal, B.P.; Sahoo, R.N.; Das, B.; Ahmed, N.; Bandyopadhyay, K.K.; Mukherjee, J.; Arora, A.; Moursy, A.R.A. Comparison of Multivariate Machine Learning Models for Major Soil Nutrients Prediction Using Laboratory-Based and Airborne (AVIRIS-NG) Visible near-Infrared Spectroscopy. Eur. J. Agron. 2025, 170, 127726. [CrossRef]
  42. Liu, X.; Zhang, M.; Ma, Z. Enhanced Soil Organic Carbon Mapping in Gannan’s Alpine Meadows: A Comparative Analysis of Machine Learning Models and Satellite Data. Ecol. Indic. 2025, 177, 113800. [CrossRef]
  43. Chen, G.; Ge, X.; Zhang, Z.; Han, L. Detecting Drivers and Predicting Spatial Distribution of Soil Organic Carbon in an Arid Region Using Machine Learning. Remote Sens. 2026, 18. [CrossRef]
  44. Guevara, M.; Olmedo, G.F.; Stell, E.; Yigini, Y.; Aguilar Duarte, Y.; Arellano Hernández, C.; Arévalo, G.E.; Arroyo-Cruz, C.E.; Bolivar, A.; Bunning, S.; et al. No Silver Bullet for Digital Soil Mapping: Country-Specific Soil Organic Carbon Estimates across Latin America. SOIL 2018, 4, 173–193. [CrossRef]
  45. Dadgar, M.; Faramarzi, S.E. Assessing the Performance of Machine Learning Models for Predicting Soil Organic Carbon Variability across Diverse Landforms. Environ. Earth Sci. 2024, 83, 657. [CrossRef]
  46. Carbajal, C.; Tumbalobos-Dextre, M.; Condori-Ataupillco, T.; Cuellar-Condori, N.; Gavilan, C. Spatial Prediction of Soil Organic Carbon Stocks across Contrasting Andean Basins, Peru. Geoderma Reg. 2025, 43, e01026. [CrossRef]
  47. Rossel, R.A.V.; Behrens, T. Using Data Mining to Model and Interpret Soil Diffuse Reflectance Spectra. Geoderma 2010, 158, 46–54. [CrossRef]
  48. Stenberg, B.; Viscarra Rossel, R.A.; Mouazen, A.M.; Wetterlind, J. Chapter Five - Visible and near Infrared Spectroscopy in Soil Science. In Advances in agronomy; Sparks, D.L., Ed.; Academic Press, 2010; Vol. 107, pp. 163–215.
  49. Gessler, P.E.; Moore, I.D.; McKENZIE, N.J.; Ryan, P.J. Soil-Landscape Modelling and Spatial Prediction of Soil Attributes. International journal of geographical information systems 1995, 9, 421–432. [CrossRef]
  50. Moore, I.D.; Gessler, P.E.; Nielsen, G.A.; Peterson, G.A. Soil Attribute Prediction Using Terrain Analysis. Soil Science Soc of Amer J 1993, 57, 443–452. [CrossRef]
  51. Bennie, J.; Hill, M.O.; Baxter, R.; Huntley, B. Influence of Slope and Aspect on Long-term Vegetation Change in British Chalk Grasslands. Journal of Ecology 2006, 94, 355–368. [CrossRef]
  52. McBratney, A.B.; Mendonça Santos, M.L.; Minasny, B. On Digital Soil Mapping. Geoderma 2003, 117, 3–52. [CrossRef]
  53. F, J.G. The Data Model Concept in Statistical Mapping. Int. Yearb. Cartogr. 1967, 7, 186–190.
  54. FAO Global Map of Black Soils; FAO: Rome, Italy, 2022;
  55. Szatmári, G.; Laborczi, A.; Mészáros, J.; Takács, K.; Benő, A.; Koós, S.; Bakacsi, Z.; Pásztor, L. Gridded, Temporally Referenced Spatial Information on Soil Organic Carbon for Hungary. Sci. Data 2024, 11, 1312. [CrossRef]
  56. Li, C.; Xiao, C.; Li, M.; Xu, L.; He, N. A Global Synthesis of Patterns in Soil Organic Matter and Temperature Sensitivity along the Altitudinal Gradient. Front. Environ. Sci. 2022, 10. [CrossRef]
  57. Davidson, E.A.; Ackerman, I.L. Changes in Soil Carbon Inventories Following Cultivation of Previously Untilled Soils. Biogeochemistry 1993, 20, 161–193. [CrossRef]
  58. Barczi, A.; Tóth, T.M.; Csanádi, A.; Sümegi, P.; Czinkota, I. Reconstruction of the Paleo-Environment and Soil Evolution of the Csípo˝-Halom Kurgan, Hungary. Quaternary International 2006, 156–157, 49–59. [CrossRef]
  59. Hassink, J. The Capacity of Soils to Preserve Organic C and N by Their Association with Clay and Silt Particles. Plant Soil 1997, 191, 77–87. [CrossRef]
  60. Quinton, J.N.; Govers, G.; Van Oost, K.; Bardgett, R.D. The Impact of Agricultural Soil Erosion on Biogeochemical Cycling. Nature Geosci 2010, 3, 311–314. [CrossRef]
Figure 1. General characterization of the study region. (a) Satellite image; (b) DEM and spatial distribution of sampling points.
Figure 1. General characterization of the study region. (a) Satellite image; (b) DEM and spatial distribution of sampling points.
Preprints 219625 g001
Figure 2. True- color bare-soil composite image of Hungary (Landsat 8 Bands 4-3-2, 2016–2020).
Figure 2. True- color bare-soil composite image of Hungary (Landsat 8 Bands 4-3-2, 2016–2020).
Preprints 219625 g002
Figure 3. Statistical distribution of SOC content. (a) Histogram and probability density curve of the full dataset; (b) Distributions of training and test sets illustrated by violin plots.
Figure 3. Statistical distribution of SOC content. (a) Histogram and probability density curve of the full dataset; (b) Distributions of training and test sets illustrated by violin plots.
Preprints 219625 g003
Figure 4. Feature importance based on mean absolute SHAP values and the contribution of vari-able categories.
Figure 4. Feature importance based on mean absolute SHAP values and the contribution of vari-able categories.
Preprints 219625 g005
Figure 5. Global variable importance and feature contribution analysis based on SHAP values.
Figure 5. Global variable importance and feature contribution analysis based on SHAP values.
Preprints 219625 g006
Figure 6. SHAP dependence plots showing the nonlinear responses of SOC to key spectral and topographic variables. (a) The near-infrared band (B5); (b) Topographic Wetness Index (TWI); (c) Normalized Red-Green Difference Index (NRGDI); (d) Elevation (H);(e) Red band (B4); (f) SWIR2 band (B7);(g) valley depth (VD);(h) SWIR1 band (B6).
Figure 6. SHAP dependence plots showing the nonlinear responses of SOC to key spectral and topographic variables. (a) The near-infrared band (B5); (b) Topographic Wetness Index (TWI); (c) Normalized Red-Green Difference Index (NRGDI); (d) Elevation (H);(e) Red band (B4); (f) SWIR2 band (B7);(g) valley depth (VD);(h) SWIR1 band (B6).
Preprints 219625 g007
Figure 7. Spatial prediction map of topsoil SOC content across Hungary cropland.
Figure 7. Spatial prediction map of topsoil SOC content across Hungary cropland.
Preprints 219625 g008
Figure 8. Spatial distribution of prediction uncertainty for topsoil SOC.
Figure 8. Spatial distribution of prediction uncertainty for topsoil SOC.
Preprints 219625 g009
Table 1. Predictor variables used for SOC content spatial prediction: abbreviations, definitions, and calculations.
Table 1. Predictor variables used for SOC content spatial prediction: abbreviations, definitions, and calculations.
Category Feature Name Abbreviation Definition / Calculation
Topographic factors Elevation H Bare-earth terrain height
Slope Slope First derivative of elevation
Aspect Aspect Direction of the maximum slope
Profile Curvature Prof_Curv Curvature parallel to the direction of maximum slope
Plan Curvature Plan_Curv Curvature perpendicular to the direction of maximum slope
Hydro-Morphological factors Topographic Wetness Index TWI ln ( α / tan β )
Topographic Position Index TPI Difference between central pixel and mean neighborhood elevation
Terrain Ruggedness Index TRI Root mean square of elevation differences
Valley Depth VD Vertical distance to the channel base
Multiresolution Index of Valley Bottom Flatness MrVBF Identification of valley bottoms based on multi-scale slope
Multiresolution Index of Ridge Top Flatness MrRTF Identification of ridge tops based on multi-scale slope
spectral bands Blue, Green, Red B2, B3, B4 Visible reflectance during the bare-soil period
Near Infrared B5 NIR reflectance during the bare-soil period
Shortwave Infrared 1, 2 B6, B7 SWIR reflectance during the bare-soil period
Spectral indices Normalized Difference Vegetation Index NDVI (B5 - B4) / (B5 + B4)
Soil Adjusted Vegetation Index SAVI [ ( B 5 B 4 ) / ( B 5 + B 4 + L ) ] × ( 1 + L )
Bare Soil Index BSI [(B6+B4)-(B5+B2)] / [(B6+B4)+(B5+B2)]
Normalized Red-Green Difference Index NRGDI (B4 - B3) / (B4 + B3)
Normalized Difference Moisture Index NDMI (B5 - B6) / (B5 + B6)
Brightness Index BI sqrt((B4² + B3²) / 2)
Ratio Vegetation Index RVI (B5 / B4)
Table 2. Descriptive statistics of cropland topsoil SOC content for the full, training, and test datasets.
Table 2. Descriptive statistics of cropland topsoil SOC content for the full, training, and test datasets.
Dataset N Min(g·kg⁻¹) Max(g·kg⁻¹) Mean(g·kg⁻¹) Median (g·kg⁻¹) SD(g·kg⁻¹) CV(%) Skewness Kurtosis
Train 207 4.70 37.80 16.97 16.80 6.43 37.90 0.28 -0.35
Test 52 4.70 31.50 16.79 16.40 6.54 38.98 0.35 -0.49
All data 259 4.70 37.80 16.93 16.80 6.44 38.04 0.29 -0.40
Table 3. Performance comparison of five machine learning models.
Table 3. Performance comparison of five machine learning models.
Model Dataset R ² R M S E ( g · k g ¹ ) M A E ( g · k g ¹ ) R P I Q L C C C
RF Training 0.585 4.131 3.163 2.348 0.704
Testing 0.484 4.655 3.591 2.154 0.594
GBDT Training 0.660 3.952 3.048 2.454 0.727
Testing 0.518 4.498 3.499 2.229 0.621
PLS Training 0.442 4.659 3.570 2.082 0.644
Testing 0.469 4.723 3.756 2.122 0.617
SVM Training 0.520 4.444 3.308 2.183 0.670
Testing 0.506 4.557 3.591 2.200 0.631
XGBOOST Training 0.577 4.171 3.174 2.326 0.705
Testing 0.511 4.534 3.588 2.211 0.626
Table 4. Distribution of SOC content classes across different elevation zones.
Table 4. Distribution of SOC content classes across different elevation zones.
Zone Elevation range(m) Area (km²) I(%) II(%) III(%) IV(%) V(%)
1 0-124 26200.49 1179.02(4.50) 3377.24(12.89) 8143.11(31.08) 12319.47(47.02) 1181.64(4.51)
2 124-195 12771.39 1325.67(10.38) 6785.44(53.13) 3731.80(29.22) 927.20(7.26) 1.28(0.01)
3 195-305 3028.42 275.89(9.11) 2386.70(78.81) 356.75(11.78) 9.09(0.30) 0.00
4 305-482 192.96 24.33(12.61) 161.30(83.58) 7.31(3.79) 0.04(0.02) 0.00
5 >482 0.60 0.10(16.94) 0.45(74.06) 0.05(9.00) 0.00 0.00
4 Percentages are computed across each row and sum to 100%. Values are rounded to two decimal places; owing to rounding, intervals with negligible area and calculated values below 0.005% are displayed as 0.00%.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

© 2026 MDPI (Basel, Switzerland) unless otherwise stated

Accessibility

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings