1. Introduction
The real estate market represents an essential component of the global economy [1], substantially impacting macroeconomic and microeconomic dynamics [2,3]. Its importance is evidenced by its ability to influence economic growth [4,5], mainly through changes in real estate prices [2]. A paradigmatic example of the inherent vulnerability of this market was the mortgage crisis of 2007 and 2008, which triggered a global recession [6]. That crisis revealed the structural fragility of the real estate sector [7] and its deep interconnection with the international financial system [8,9], underscoring the necessity for more accurate price forecasting models.
There is a clear need to refine predictive tools in the real estate market [11], particularly in contexts of accelerated growth [2]. Accurate prediction of real estate prices would not only reduce the risk of speculative bubbles and financial collapses [12,13] but also promote local economic development by facilitating more efficient planning [14]. This predictive capability would drive socio-economic growth locally and globally and support informed decision-making by governments, real estate agents, financial institutions, and market analysts [16,17].
To address this problem, traditional models for price prediction have been applied, such as the Comparative Market Method [8], the Income Capitalization Method (Income or Rent Method) [18], the Replacement Cost or Replacement Value Method, the Automatic Valuation Method (AVM) [19], and the Dynamic Residual Value Method (Discounted Cash Flow) [20]. These models have proven effective in stable markets, where price trends typically follow predictable patterns based on historical data [21]. However, they have significant limitations, particularly in environments with high volatility or when prices respond to more complex economic dynamics. Traditional models often underestimate or overestimate prices in highly speculative markets or during crises, as they fail to adequately consider macroeconomic factors, regulatory changes, large data volumes, or sudden fluctuations in supply and demand.
In recent years, the increasing availability and storage of large volumes of data have created new opportunities to tackle the complex issue of price prediction in the real estate market [25-27]. Traditional valuation models, primarily based on statistical methods, have been outperformed in accuracy and predictive capability by machine learning algorithms [28,29], which are proving to be a highly effective alternative for enhancing real estate price estimates [30,31].
Numerous empirical studies have investigated the predictive capability of various machine learning algorithms, yielding promising results in real estate price prediction [24]. Among the most notable methods are Random Forest (RF) [32,33], Support Vector Machines (SVM) [34], Multiple Linear Regression (MLR) [35], and regularization techniques such as LASSO (Least Absolute Shrinkage and Selection Operator) [36]. Additionally, algorithms like K-Nearest Neighbors (KNN) [37] and Decision Trees (DT) [38] have demonstrated remarkable performance across different dataset configurations.
Research on machine learning algorithms applied to regression problems has gained significant importance in the real estate sector in recent years [31,39]. However, regression problems in real-world contexts often involve highly complex internal and external factors [40]. Furthermore, different machine learning algorithms exhibit considerable variation in scalability and predictive performance [41,42], which poses additional challenges for their effective application in practice; a common response is to tune the parameters of several candidate algorithms and select the best-performing model [40,43]. Although this strategy is widely used, it faces three critical challenges [44-46]. First, defining the "best" model is difficult when multiple candidates show comparable performance, complicating the final choice [47]. This issue is exacerbated when the algorithm is sensitive to local optima and the amount of training data is limited [48]. Second, excluding less successful models may lead to the loss of valuable information that could enhance prediction accuracy [49]. Third, variable selection is essential to mitigate the curse of dimensionality, as using too many features may introduce noise and redundancy, impairing model generalization [50].
As a result, research in the real estate market encounters methodological and computational challenges that demand a rigorous and systematic approach. The lack of studies examining these issues highlights the necessity of developing innovative strategies that effectively enhance models' predictive accuracy and generalizability, ensuring their relevance in complex and dynamic scenarios within the real estate sector [51,52].
In this context, the study aimed to develop and compare ensemble models with optimized feature selection for price prediction in real estate markets. The proposed methodology integrates multiple base algorithms and employs advanced dimensionality reduction strategies, including RF, RFE, and Boruta, to identify and retain only contributing variables. This approach aims to maximize predictive accuracy, model robustness, and generalizability, ensuring optimal performance in highly complex and structurally variable environments.
2. Literature Review
An extensive literature review on price prediction in real estate markets was conducted, emphasizing machine learning algorithms. The analysis identified and assessed the most pertinent scientific articles in this field. In this context, Park et al. [53] developed a housing price prediction model utilizing C4.5, RIPPER, Naïve Bayes, and AdaBoost, comparing their classification accuracy. The results indicate that RIPPER surpasses the other algorithms in predictive performance. Additionally, they propose an improved model to assist sellers and real estate agents in making decisions based on precise valuations [53].
In the same line of research, Varma et al. [54] implemented a weighted average approach using multiple regression techniques to enhance the accuracy of predicting real estate values. This method reduces error and outperforms individual models in stability and accuracy, optimizing valuations by incorporating environmental contextual information [54]. Conversely, Rafiei et al. [55] developed an innovative approach based on a Deep Restricted Boltzmann Machine (DBM) combined with a non-mating genetic algorithm. Their model evaluates the viability of the real estate market by incorporating economic variables, seasonality, and temporal effects, achieving computationally efficient optimization on standard workstations. The validity and applicability of the model were verified through a case study, demonstrating its accuracy in predicting real estate prices [55].
An effective strategy to tackle this problem is to implement ensemble methods, which combine multiple base models to enhance the stability and generalizability of the final model [56]. This approach addresses the individual limitations of each algorithm, reducing both variance and bias, thereby minimizing the risk of overfitting and optimizing predictive accuracy in real estate price estimation [57]. In this context, models based on ensemble methods, such as RF, have demonstrated superior performance in predicting prices in real estate markets due to their ability to decrease variance, improve stability, and capture nonlinear relationships in the data [32]. Consistent with these findings, Park and Bae [53] reported that RF outperformed decision trees and linear regression in terms of accuracy and generalization, highlighting its effectiveness in environments with high variability in the data.
Similarly, Varma et al. [54] assessed the performance of various machine learning and deep learning algorithms in predicting housing prices in Boulder, Colorado, comparing them to the hedonic regression model. The results demonstrated that both RF and artificial neural networks outperformed hedonic regression analysis in predictive accuracy, highlighting their potential as more advanced and effective methods for real estate valuation. In a related study, Phan [45] evaluated the performance of SVM, RF, and Gradient Boosting Machines (GBM) in real estate valuation in Hong Kong, finding that RF and GBM surpassed SVM in accuracy. However, while SVM exhibited lower predictive performance, it excelled in efficiency for rapid predictions. The study concluded that machine learning presents a promising alternative for property valuation.
Arabameri et al. [57] employed data mining techniques and regression models, including LASSO, RF, and GBM, to predict prices in the Wroclaw real estate market, achieving 90% accuracy and demonstrating the effectiveness of these approaches in modeling real estate prices. Regarding the ability to capture nonlinear relationships with high accuracy, Ribeiro and dos Santos [58] evaluated several machine learning techniques for real estate price prediction, highlighting the superiority of RF over traditional models such as linear regression and SVM. Furthermore, the authors discussed potential future advancements in real estate estimation.
Adetunji et al. [33] obtained concordant results by employing RF to predict housing prices using the Boston dataset (UCI). Their model achieved an accuracy margin of ±5%, demonstrating its effectiveness in estimating individual real estate values. Ensemble methods like Bagging, GBM, and Extreme Gradient Boosting (XGBoost) have also shown promising outcomes in predicting real estate prices and are noted for their capacity to enhance predictive accuracy and model robustness across various contexts [42,58,59].
Sibindi et al. [59] evaluated the effectiveness of XGBoost in predicting real estate prices, achieving an accuracy of 84.1% compared to the 42% obtained by hedonic regression. Their study, which analyzed 13 variables, reaffirms the superiority of machine learning approaches over traditional models in this field. Similarly, Gonzales [60] examined land price prediction in Seoul from 2017 to 2020 using RF and XGBoost. This study considered 21 variables and assessed their impact on predictive accuracy. The results indicated that XGBoost outperformed RF, demonstrating greater generalizability and accuracy in estimating real estate values.
Meanwhile, Sankar et al. [61] applied regression techniques and ensemble methods to predict real estate prices, considering key variables such as location and demographics. The models developed achieved an accuracy of 94%, evidencing their usefulness in real estate investment decision-making. Consistent with these findings, Kumkar et al. [62] conducted a comprehensive comparative analysis of ensemble methods, including Bagging, RF, Gradient Boosting, and XGBoost, for Mumbai real estate valuation, focusing on hyperparameter optimization. To improve the accuracy of the models, they implemented advanced data preprocessing techniques, which mitigated biases and reduced variance [63]. The results confirm that ensemble methods are robust and efficient tools for price estimation in real estate markets characterized by high complexity and volatility [59].
In a bibliometric analysis, Takouabou et al. [10] examined 70 articles indexed in Scopus, noting that scientific production in this field primarily comes from the USA, China, India, Japan, and Hong Kong, regions known for their high levels of digitization and well-established research ecosystems. However, the review revealed recurring methodological limitations, such as dependence on small datasets and a preference for simple machine learning models, which exacerbates the problem of the curse of dimensionality. These limitations highlight the need to integrate more advanced methodologies that enhance predictive performance and improve the interpretability of models applied to the real estate market.
In recent years, machine learning techniques for predicting real estate prices have significantly advanced, enhancing the models’ accuracy and efficiency. However, critical challenges remain that restrict these methods' interpretability, generalizability, and robustness, particularly in situations with vast data volumes and high market heterogeneity. One major gap in the literature is the identification, selection, and optimization of relevant variables for predictive models in real estate. Although advancements in machine learning have enabled the development of more sophisticated approaches, challenges remain in integrating multiple data sources and addressing biases stemming from the structural complexity of real estate markets.
In this context, the study tackles this gap by creating a model based on ensemble methods while optimizing the selection of key variables with advanced machine learning techniques. The goal is to enhance predictive accuracy and reduce potential biases, thereby contributing to the development of more robust and generalizable models in real estate valuation.
3. Selection Methods
The importance of the features was evaluated using the varImp() function, which employs the RF algorithm to estimate the relative contribution of each variable to the model's prediction. This method allowed us to identify influential features by randomly permuting each variable and measuring the resulting decrease in accuracy or increase in impurity. The Recursive Feature Elimination (RFE) technique was implemented to optimize variable selection using sklearn.feature_selection.RFE, which iteratively eliminates the least relevant variables until an optimal subset is reached [65]. Additionally, the Boruta algorithm was incorporated through BorutaPy, an extension of RF based on statistical hypothesis testing for feature selection. Boruta compares each variable with randomized synthetic attributes (shadow features), providing a robust criterion for assessing their relevance. Its ability to capture nonlinear interactions and relationships makes it a key tool for improving the interpretability and generalization of models in high-dimensional environments [66].
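A minimal Python sketch of how these three selection strategies can be combined is shown below; the arrays X and y, the hyperparameters, and the use of feature_importances_ as the varImp-style ranking are illustrative assumptions rather than the exact configuration used in the study.

```python
# Hypothetical sketch of the three selection strategies described above,
# assuming preprocessed numeric arrays X (features) and y (sale price) exist.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from boruta import BorutaPy

# 1) RF-based importance (varImp-style ranking of the variables)
rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X, y)
importance_ranking = np.argsort(rf.feature_importances_)[::-1]

# 2) Recursive Feature Elimination: iteratively drop the least relevant variables
rfe = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42),
          n_features_to_select=15)
rfe.fit(X, y)
rfe_selected = np.where(rfe.support_)[0]

# 3) Boruta: compare each variable against randomized "shadow" copies
boruta = BorutaPy(estimator=RandomForestRegressor(n_estimators=100, random_state=42),
                  n_estimators='auto', random_state=42)
boruta.fit(np.asarray(X), np.asarray(y))
boruta_selected = np.where(boruta.support_)[0]
```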
In constructing decision trees within the RF model, the quality of each partition was evaluated using impurity metrics, which quantify the reduction of heterogeneity after each split. This criterion ensured that the groups formed were as homogeneous as possible, optimizing the model's predictive capacity. In this context, the impurity function $i(t)$ measures the heterogeneity of node $t$, and the optimal partition is selected by maximizing the impurity reduction, given by the equation:

$$\Delta i(s, t) = i(t) - p_R \, i(t_R) - p_L \, i(t_L)$$

where:
$\Delta i(s, t)$ is the goodness of partitioning at node $t$ using the partitioning criterion $s$;
$i(t)$ is the impurity of the parent node;
$i(t_R)$ and $i(t_L)$ are the impurities of the child nodes after partitioning;
$p_R$ and $p_L$ represent the proportions of data assigned to the right and left nodes, respectively.

The most commonly used impurity functions in decision trees are:
Gini index: it measures the probability that an element is incorrectly classified if it is randomly chosen according to the distribution of classes in the node. It is calculated as

$$i_{\text{Gini}}(t) = 1 - \sum_{k} p_k^2$$

where $p_k$ is the proportion of elements of class $k$ in the node.
Entropy: it measures the amount of information held in the node,

$$i_{\text{Entropy}}(t) = - \sum_{k} p_k \log_2 p_k$$
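As a brief numeric illustration of this criterion, the following sketch computes the Gini and entropy impurities of a hypothetical node and the impurity reduction of a candidate split; all proportions are made up.

```python
# Toy numeric check of the impurity-reduction criterion; class proportions
# and node sizes are hypothetical.
import numpy as np

def gini(p):                     # i_Gini(t) = 1 - sum_k p_k^2
    return 1.0 - np.sum(np.asarray(p) ** 2)

def entropy(p):                  # i_Entropy(t) = -sum_k p_k log2 p_k
    p = np.asarray([q for q in p if q > 0])
    return -np.sum(p * np.log2(p))

parent = [0.5, 0.5]                   # class proportions at node t
left, p_left = [0.9, 0.1], 0.4        # child t_L receives 40% of the data
right, p_right = [0.2, 0.8], 0.6      # child t_R receives 60% of the data

delta_gini = gini(parent) - p_left * gini(left) - p_right * gini(right)
print(f"Gini reduction: {delta_gini:.3f}, parent entropy: {entropy(parent):.3f}")
```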
Evaluating the goodness of partitioning and impurity in RF allows optimal variable selection using well-founded mathematical criteria. Methods such as RFE and Boruta use this structure to select the most relevant variables, optimize predictive models, and reduce the problem's dimensionality.
Learning algorithm selection
Ensemble models are chosen for land price prediction because they can improve the accuracy and robustness of predictions by combining the results of multiple base models. This is especially advantageous in applications such as real estate appraisal, where data can have high variability and nonlinear characteristics.
AdaBoost
AdaBoost for regression, known as AdaBoost.R, extends the boosting methodology by minimizing a continuous error instead of a binary ranking function [67]. At each iteration, a weak regressor is trained on a weight distribution that is dynamically adjusted according to the absolute or quadratic error of the prediction [68]. The error is normalized and used to compute a weighting coefficient, so that regressors with lower error have a greater influence on the final combination [69]. AdaBoost (Adaptive Boosting) optimizes the error by giving more weight to instances that are difficult to predict. In regression, an exponential cost function is minimized by iteratively updating the instance weights, and the final model is the weighted combination:

$$F(x) = \sum_{m=1}^{M} \alpha_m h_m(x)$$

where $h_m(x)$ are base estimators (commonly weak regressors such as decision trees), and $\alpha_m$ represents their weighting coefficients.
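A hedged sketch of this scheme using scikit-learn's AdaBoostRegressor (which implements the closely related AdaBoost.R2 variant) is shown below; X_train, y_train, X_test and all hyperparameters are assumptions for illustration.

```python
# Illustrative AdaBoost regression sketch; not the study's exact configuration.
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=4),  # weak base regressor h_m
    n_estimators=200,
    learning_rate=0.05,
    loss="exponential",   # exponential loss, consistent with the cost above
    random_state=42,
)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
```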
Gradient Boosting Regressor
It is a boosting-based machine learning method that optimizes an additive model by minimizing the loss function through gradient descent in function space [70]. It aims to build a strong model as a sequential combination of weak models, where each new model is designed to correct the errors of the previous ones [71]:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$

where $\gamma_m$ is the learning coefficient obtained by minimization of:

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr)$$

Unlike AdaBoost, which adjusts sample weights according to their error, Gradient Boosting builds a sequential model by fitting each new estimator to the residual gradients. This formulation allows flexibility in the choice of loss function, such as the mean squared error for regression or deviance-based losses for classification [59].
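The following sketch shows how such a model might be configured with scikit-learn's GradientBoostingRegressor; the loss and hyperparameters are illustrative, not the study's settings.

```python
# Gradient Boosting sketch: each new tree is fitted to the gradients of a
# squared-error loss; hyperparameters are placeholders.
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(
    loss="squared_error",   # L(y, F(x)) used to compute the residual gradients
    n_estimators=500,       # number of sequential weak models h_m
    learning_rate=0.05,     # shrinkage applied to each gamma_m * h_m(x)
    max_depth=3,
    subsample=0.8,          # stochastic gradient boosting
    random_state=42,
)
gbr.fit(X_train, y_train)
y_pred_gbr = gbr.predict(X_test)
```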
Random Forest Regressor
It is a decision tree-based ensemble method that improves prediction accuracy and stability by combining multiple trees trained on random subsets of the data [57]. The central idea of RF is to reduce the variance of individual models by taking advantage of the diversity of multiple predictors, which makes it less prone to overfitting than a single decision tree [56]. The ensemble prediction is the average of the trees:

$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$$

where $T_b(x)$ are individual regression trees trained with bootstrap sampling.
Extra Trees Regressor
It is a variant of RF that introduces additional randomization in the construction of the trees, reducing the variance of the model and improving its robustness to noise in the data [72]. Unlike RF, where the split thresholds at each node are selected by optimization, Extra Trees assigns these thresholds completely at random within the subset of selected features. The final prediction is again the average over the trees,

$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$$

where $T_b(x)$ represents the prediction of tree $b$. This strategy introduces a higher bias compared to RF but reduces the variance and the correlation between the trees, improving generalizability.
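The sketch below contrasts the two tree ensembles described above under assumed train/test splits; the hyperparameters are placeholders.

```python
# RandomForest optimizes split thresholds on bootstrap samples, whereas
# ExtraTrees draws them at random; settings are illustrative only.
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

models = {
    "RandomForest": RandomForestRegressor(n_estimators=500, max_features="sqrt",
                                          bootstrap=True, random_state=42),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=500, max_features="sqrt",
                                      bootstrap=False, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "hold-out R²:", round(model.score(X_test, y_test), 4))
```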
Bagging Regressor
It is an ensemble method that decreases the variance of the base models by training them on multiple subsets of data generated by sampling with replacement and combining their predictions [73]. Each base model $h_b$ is fitted independently, and the final ensemble prediction is obtained as the average of the individual predictions:

$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} h_b(x)$$

where each $h_b$ is a base model trained on a random subset of the data. The impact of Bagging on variance reduction is explained by the relationship:

$$\operatorname{Var}\bigl(\hat{f}(x)\bigr) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

where the decrease in variance is more significant when the models $h_b$ are less correlated (smaller $\rho$). Bagging improves model stability and generalizability, reducing overfitting without significantly increasing bias [58].
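A minimal sketch of this averaging scheme with scikit-learn's BaggingRegressor, assuming the usual train/test arrays, is shown below.

```python
# Bagging sketch: B trees fitted on bootstrap resamples and averaged, as in
# the equations above; the base estimator and B are illustrative choices.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # base model h_b
    n_estimators=200,                   # B bootstrap replicates
    bootstrap=True,                     # sampling with replacement
    random_state=42,
)
bag.fit(X_train, y_train)
y_pred_bag = bag.predict(X_test)        # average of the 200 individual predictions
```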
Stacking Regressor
It is an ensemble method that combines multiple regression models in a two-level hierarchical architecture to improve predictive ability [74]. Its theoretical foundation lies in the optimal combination of base estimators using a meta-model that learns to correct their biases and exploit their strengths:

$$\hat{f}(x) = g\bigl(h_1(x), h_2(x), \ldots, h_B(x)\bigr)$$

where $g$ is a meta-model trained with the outputs of the base regressors $h_1, \ldots, h_B$. Stacking can be seen as an optimal predictor-combination problem, where the meta-model learns to minimize the loss function $L(y, \hat{f}(x))$ more efficiently than a simple aggregation by averaging (as in Bagging) or by weighting (as in Voting). Its flexibility allows the integration of heterogeneous models with different biases and variances, achieving a more robust ensemble [75].
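A possible two-level configuration with scikit-learn's StackingRegressor is sketched below; the choice of base regressors and of a ridge meta-model is an assumption for illustration.

```python
# Two-level stacking: tree ensembles as base regressors and a linear
# meta-model g trained on their out-of-fold predictions.
from sklearn.ensemble import (StackingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import RidgeCV

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=300, random_state=42)),
        ("gbr", GradientBoostingRegressor(random_state=42)),
    ],
    final_estimator=RidgeCV(),  # meta-model g(.)
    cv=10,                      # out-of-fold predictions feed the meta-model
)
stack.fit(X_train, y_train)
y_pred_stack = stack.predict(X_test)
```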
Voting Regressor
It is an ensemble method that combines the predictions of multiple regression models to improve stability and generalizability [76,77]. Its underlying principle is that, by merging several estimators with different biases and variances, a more robust prediction is obtained that is less sensitive to fluctuations in the training data. With voting by simple averaging,

$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} h_b(x)$$

where all contributions have the same weight. In weighted voting, each model $h_b$ receives a weight $w_b$ proportional to its performance, usually determined by validation metrics such as the coefficient of determination $R^2$ [78]:

$$\hat{f}(x) = \frac{\sum_{b=1}^{B} w_b\, h_b(x)}{\sum_{b=1}^{B} w_b}$$
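The sketch below shows both variants with scikit-learn's VotingRegressor; the weight values are hypothetical stand-ins for validation R² scores.

```python
# VotingRegressor sketch: a simple average and a weighted average whose
# weights approximate each model's validation R²; values are hypothetical.
from sklearn.ensemble import (VotingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor, AdaBoostRegressor)

estimators = [
    ("rf", RandomForestRegressor(n_estimators=300, random_state=42)),
    ("gbr", GradientBoostingRegressor(random_state=42)),
    ("ada", AdaBoostRegressor(random_state=42)),
]
vote_simple = VotingRegressor(estimators=estimators)                       # equal weights
vote_weighted = VotingRegressor(estimators=estimators, weights=[0.92, 0.92, 0.85])
vote_weighted.fit(X_train, y_train)
y_pred_vote = vote_weighted.predict(X_test)
```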
Model training
The program was developed using the Scikit-learn Python package, employing a rigorous approach centered on ensemble models and k-fold cross-validation (k = 10) to ensure optimal training and evaluation. This resampling procedure offered a robust evaluation, minimizing the bias and variance associated with specific data partitions while reducing the risk of overfitting [79]. A sample partition of 70% for training and 30% for testing was used, guaranteeing a final model evaluation on entirely independent data. During training, the ensemble models iteratively optimize their errors, combining predictions from multiple base models to enhance accuracy and reduce bias.
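A condensed sketch of this protocol is shown below; the random seed, the scoring function, and the generic `model` placeholder are assumptions.

```python
# Training protocol sketch: 70/30 split plus 10-fold cross-validation on the
# training portion; `model` stands for any of the ensembles defined earlier.
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=42)
cv_rmse = -cross_val_score(model, X_train, y_train, cv=10,
                           scoring="neg_root_mean_squared_error")
print("10-fold CV RMSE:", cv_rmse.mean())
model.fit(X_train, y_train)                      # final fit on the full training split
print("Hold-out R²:", model.score(X_test, y_test))
```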
Model evaluation
To evaluate the model's accuracy and robustness in predicting land prices, standard regression metrics were used, including the mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the coefficient of determination (R²). These metrics allow for a comprehensive evaluation, considering both the magnitude of the errors and the model's ability to explain the variability of the data.
Coefficient of determination (R²): statistical metric to evaluate the quality of a regression model. It indicates what proportion of the variability of the dependent variable ($y$) is explained by the model as a function of the independent variables ($X$). The formula for $R^2$ is:

$$R^2 = 1 - \frac{SSE}{SST}$$

where SSE is the sum of squared errors, $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, which represents the variability not explained by the model; SST is the total sum of squares, $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$, representing the total variability in the real data; and $\bar{y}$ is the mean of the real values $y_i$.
Mean Squared Error (MSE): evaluation metric that measures the average squared difference between the actual values ($y_i$) and the model predictions ($\hat{y}_i$); the formula is:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $n$ is the total number of observations, $y_i$ is the actual value of the $i$-th observation, and $\hat{y}_i$ is the value predicted by the model for the $i$-th observation.
Root Mean Squared Error (RMSE): evaluation metric used mainly in regression models. It represents the square root of the average of the squared errors between the actual values $y_i$ and the predictions of the model $\hat{y}_i$; the formula is:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Mean Absolute Error (MAE): metric used to evaluate the accuracy of a model in regression tasks. It represents the average of the absolute differences between the actual values $y_i$ and the predictions of the model $\hat{y}_i$; the formula is:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
Mean Absolute Percentage Error (MAPE): metric used to evaluate the accuracy of a regression model. It represents the average of the absolute errors as a percentage of the true values $y_i$:

$$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

This makes it helpful in interpreting the relative error of the model in percentage terms.
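The following sketch computes all of these metrics with scikit-learn, assuming y_test and y_pred come from the previously trained model.

```python
# Evaluation-metric sketch for the quantities defined above.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)  # returned as a fraction
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  "
      f"MAPE={100 * mape:.2f}%  R²={r2:.4f}")
```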
4. Results and Discussion
Accurate estimation of real estate prices is crucial for investment and urban planning [16,17], given the sector's significance in the global economy [1,15]. Dimensionality reduction improves computational efficiency, enhances model interpretability, and reduces overfitting, resulting in more robust predictions [49,50]. The comparative analysis of ensemble models in machine learning and dimensionality reduction techniques, such as RFE, RF, and Boruta [65,66], shows that effective feature selection maximizes the predictive accuracy and stability of the model, confirming its efficacy in predicting real estate prices.
A dataset from the Kaggle machine learning repository (https://www.kaggle.com/datasets/ahmedmohameddawoud/ames-housing-data) was utilized. The Ames Housing dataset, well-known in scientific literature and machine learning competitions, contains detailed information on 2,930 homes in Ames, Iowa, described by 81 variables, including structural, spatial, and contextual characteristics. A thorough process of preprocessing and data cleaning was implemented to predict the properties' sale prices. First, the dataset's structure was examined to evaluate the types of variables and the existence of missing values. A threshold of 5% was set for the removal of variables with a high proportion of missing data, ensuring a balance between reducing bias and preserving information. Consequently, the database was narrowed down to 69 variables.
To mitigate the impact of null values, we applied differentiated imputation strategies. We used the median to reduce sensitivity to outliers for numerical variables, while for categorical variables, we employed the mode to maintain class coherence. Additionally, we identified and removed duplicate records to minimize potential redundancies in the analysis. We transformed categorical variables using one-hot encoding, ensuring appropriate numerical representation and avoiding collinearity issues by excluding the first category of each variable. Next, we performed a normalization process for the numerical variables using StandardScaler, ensuring each variable had a distribution with a zero mean and a unit standard deviation. Data preprocessing resulted in a final dataset of 2,930 records and 227 variables derived from the transformation of categorical variables through one-hot encoding. This strategy provided a structured and fitting representation for subsequent analysis, facilitating the application of feature selection techniques and machine learning models with an optimized and uniform database.
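A compact sketch of this preprocessing pipeline is given below; the CSV file name, the target column name "SalePrice", and minor ordering details are assumptions based on the description above, while the 5% missing-value threshold follows the text.

```python
# Preprocessing sketch for the Ames Housing data described above.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("AmesHousing.csv")
df = df.loc[:, df.isna().mean() <= 0.05]          # drop variables with >5% missing values
df = df.drop_duplicates()                          # remove duplicate records

num_cols = df.select_dtypes(include="number").columns.drop("SalePrice")
cat_cols = df.select_dtypes(exclude="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())        # median imputation
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])  # mode imputation

df = pd.get_dummies(df, columns=list(cat_cols), drop_first=True)  # one-hot, drop first level
X = df.drop(columns="SalePrice")
y = df["SalePrice"]
X[num_cols] = StandardScaler().fit_transform(X[num_cols])         # z-score numeric variables
```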
The research evaluated the impact of the RFE algorithm in conjunction with an RF classifier to identify the most relevant variables in a structured data set. The results indicated that the progressive elimination of irrelevant features reduced the complexity of the model without significantly affecting its accuracy. The selected features had the highest contribution to predicting the target variable. Additionally, an importance ranking was generated, allowing visualization of the relative influence of each variable in the model, as shown in
Figure 1.
The database was reduced to fifteen variables after feature selection (attribute engineering). For the comparative analysis, seven widely recognized ensemble models were evaluated: AdaBoost, Gradient Boosting, Random Forest, Extra Trees, Bagging, Stacking, and Voting. These were chosen for their ability to enhance prediction accuracy by combining multiple base models, thus optimizing the robustness and generalization of the results.
The predictive performance of each model was quantified using standard metrics for regression problems. Among these metrics, the coefficient of determination (R²) indicated the proportion of variability in the data explained by the model, serving as a crucial indicator of its fit. The root mean square error (RMSE) directly measured the magnitude of error and was expressed in the same units as the dependent variable, allowing for intuitive interpretation of precision. The mean absolute error (MAE) assessed the average absolute deviation between predictions and actual values, providing a complementary metric that does not excessively penalize extreme errors. Lastly, the mean squared error (MSE) was used to impose stricter penalties on significant errors, offering additional insight into model quality by identifying highly discrepant predictions.
The analysis of these metrics enabled a thorough comparison of the models, revealing their strengths and limitations in predicting real estate prices. The observed differences underscore the importance of choosing the correct algorithm based on the problem's specific characteristics and the analysis's goals.
Table 1 presents the results for each model, along with a critical analysis that evaluates their relative performance in terms of accuracy, robustness, and generalizability. This analysis establishes a solid foundation for identifying the optimal configurations for practical applications and future research.
The study evaluated the performance of different machine learning models in real estate price prediction and analyzed the impact of dimensionality reduction through Recursive Feature Elimination (RFE). Base models with 227 features were compared against their optimized versions, with 15 features selected through RFE.
The results demonstrate that Stacking achieved the best overall performance, with an MAE of 14,092.30, an MSE of 5.34 × 10⁸, an RMSE of 23,104.19, and an R² of 0.9241. It was closely followed by Gradient Boosting, which attained an MAE of 14,536.52 and an R² of 0.9197. These values signify a high predictive capacity and adequate generalization across the data, establishing these models as the most effective in terms of accuracy. Conversely, the AdaBoost and RFE+AdaBoost models demonstrated the lowest performance, with an MAE exceeding 23,000 and an R² below 0.85, indicating difficulties in accurately fitting the training data.
While dimensionality reduction is a crucial strategy for enhancing computational efficiency and model interpretability, the results indicate that feature removal through RFE was not always advantageous for predictive accuracy. Gradient Boosting, in its version without RFE, significantly outperformed the optimized version (RFE + Gradient Boosting), with a reduction in MAE of 16.9% and an improvement in R² of 1.6%. A similar trend was observed in the Stacking model, where the version with all features lowered the MAE by 14.6% compared to the optimized version, demonstrating that eliminating certain features may have stripped away relevant information essential for prediction.
Similarly, in the Random Forest, ExtraTrees, Bagging, and Voting models, the use of RFE increased the prediction errors (MAE, MSE, RMSE) and decreased the coefficient of determination (R²), indicating that these algorithms perform better with datasets that contain a larger number of features. While removing features did not always enhance predictive accuracy, reducing dimensionality lowered the computational complexity of the models, which is a substantial advantage in resource-constrained scenarios. Models like RFE+Bagging and RFE+RandomForest managed to sustain competitive performance with only 15 features, considerably cutting down the amount of data processed without severely impacting R².
From an efficiency standpoint, feature selection using RFE can be a viable strategy when seeking a balance between accuracy and computational costs. In situations where model interpretability is crucial, reducing to 15 features enhances understanding of each variable's impact on predictions, which is particularly important in regulatory or explanatory model-based decision-making contexts. Analysis of the models indicates that while Gradient Boosting and Stacking deliver the best performance regarding accuracy, implementing them without feature reduction results in better predictive capability. Conversely, using RFE proves advantageous for computational efficiency, as it reduces dimensionality without significantly compromising performance in specific models.
Generally, the trade-off between accuracy and computational efficiency should be evaluated according to the specific constraints and goals of each application. In cases where computational resources are limited, feature selection using RFE may be a viable alternative, always evaluating the effect on the model’s predictive performance.
The method outlined involves calculating the significance of each variable using RF.
Figure 2 illustrates the results of the variable selection.
The results obtained in the evaluation of the different ensemble algorithms for price prediction show clear differences in performance according to the root mean squared error (RMSE), explained variance (EV), mean absolute error (MAE), and coefficient of determination (R²); the results are presented in Table 2.
This study evaluated the performance of different machine learning models in real estate price prediction, comparing versions with 227 features against their optimized counterparts obtained with Random Forest (RF) feature selection, which reduced the dimensionality to 16 features. The impact of this reduction on predictive accuracy and computational efficiency is analyzed.
The results show that Stacking obtained the best overall performance, with an MAE of 14,092.30, an MSE of 5.34 × 10⁸, an RMSE of 23,104.19, and an R² of 0.9241. This performance was followed by Gradient Boosting, whose MAE reached 14,536.52 and presented an R² of 0.9197, confirming its high generalization capacity and robustness in prediction.
The feature reduction process through RF had varying impacts depending on the model. For Gradient Boosting, the optimized version (RF + Gradient Boosting) showed an increase in MAE by 19.1% and a decrease in R² by 1.6% compared to its counterpart without feature reduction. A similar pattern was observed in the Stacking model, where the full version achieved a lower error than the optimized version (RF + Stacking), with an increase in MAE by 17.7%.
The models with feature reduction also experienced a slight decrease in accuracy for RandomForest and ExtraTrees. For example, the RF + RandomForest model increased its MAE by 13.1%, while RF + ExtraTrees increased its error by 8.3%, indicating that eliminating certain variables negatively affected their predictive ability. In contrast, the RF + AdaBoost model slightly improved its performance compared to the full version, with an MAE reduction of 5.3%. This suggests that dimensionality reduction can help mitigate overfitting and improve prediction stability in models susceptible to overparameterization, such as AdaBoost.
From a computational perspective, using RF significantly reduced the number of features employed in the models, resulting in shorter training times and improved model interpretability. This is particularly vital in situations where computational resources are limited or where model explainability is prioritized.
While variable reduction did not always enhance accuracy, the RF-optimized models maintained reasonable predictive capability with only 16 features, compared to their full versions, which had 227 features. In situations where computational efficiency is crucial, these findings indicate that dimensionality reduction may represent an acceptable balance between accuracy and computational cost.
The results indicate that while stacking and gradient boosting yield the highest accuracy, employing them without variable reduction results in better outcomes. However, using RF allowed for a significant reduction in dimensionality with minimal performance loss in specific models, which can be beneficial in situations where efficiency and interpretability are priorities. Overall, RF feature selection effectively reduces computational complexity, although its effect on accuracy depends on the particular model. This method provides a suitable balance between efficiency and predictive power in scenarios with processing limitations or where explainability is essential.
Concerning the Boruta model, 16 variables were selected based on the ranking established, as shown in
Figure 3.
Similar to the prior models, the performance of several machine learning models in predicting land prices was assessed using metrics such as mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R²), as shown in
Table 3.
The analysis evaluates the impact of feature selection using Boruta, a technique that identifies the most relevant variables for prediction. Models with 227 variables are compared to their optimized versions, which include 16 variables selected using Boruta to assess the effects on model accuracy and computational efficiency. The results indicate that applying Boruta significantly reduced the number of variables without drastically degrading predictive performance. In most cases, the optimized models maintained an R² close to the full version, demonstrating the effectiveness of variable selection.
Stacking proved to be the top performer in both versions. The version with 227 variables recorded an MAE of 14,092.30 and an R² of 0.9241. In contrast, the optimized version (Boruta + Stacking) exhibited a 9.8% increase in MAE and a slight decrease in R² to 0.9082, indicating that the model maintains strong predictive capability despite a notable reduction in dimensionality. Similarly, Gradient Boosting performed well with minimal loss in accuracy. The reduced version (Boruta + Gradient Boosting) obtained an MAE of 16,073.47 and an R² of 0.9071, compared to values of 14,536.52 and 0.9197, respectively, for the full version. This result suggests that Boruta could identify a relevant subset of variables without significantly compromising the quality of the prediction.
On the other hand, models based on individual trees, such as Random Forest and ExtraTrees, showed more significant variability in their performance after the application of Boruta. In particular, the Boruta + ExtraTrees combination presented a 2.7 % increase in MAE and a reduction in R² from 0.9065 to 0.8971. This behavior suggests that eliminating certain variables can significantly impact models whose internal structure depends on the diversity of trees generated, which could affect the stability of their predictions.
A notable case is the AdaBoost model, where the optimized version achieved nearly identical results to the full version (MAE of 23,528.79 compared to 23,627.92, and R² of 0.8518 compared to 0.8514), suggesting that AdaBoost benefits from using fewer variables without losing accuracy. This outcome aligns with previous studies that emphasize AdaBoost's sensitivity to noise in the data and its tendency to overfit on datasets characterized by high variability [79].
Boruta's variable reduction significantly impacted computational efficiency, reducing the number of variables used by 92% (from 227 to 16). This optimization not only reduced the computational load, accelerating the model training process but also improved interpretability by restricting the analysis to a reduced set of variables with greater predictive relevance. In applications where model transparency is essential, such as real estate valuation, this reduction facilitates the identification of the determining factors in the prediction.
In addition, eliminating irrelevant or redundant variables helped mitigate overfitting in specific models, improving their generalizability to new data. Although slight losses in predictive accuracy were observed, these were marginal compared to the benefits of computational efficiency and model robustness. In particular, Boruta's variable selection proved to be highly effective in preserving predictive performance without compromising the quality of the estimates.
Dimensionality reduction significantly improved computational efficiency. Reducing the number of variables optimized training time and model responsiveness, which is crucial for prediction applications in massive data environments [50]. Dimensionality reduction also effectively mitigated the adverse effects of the exponential increase in complexity as the number of variables in the dataset grows [47].
The findings indicate that no single model is universally superior for predicting real estate prices. While Gradient Boosting and Stacking stand out for their accuracy and generalizability, models like Random Forest and Extra Trees offer robustness and stability, especially when utilizing variable selection techniques. Future research could focus on automated hyperparameter optimization and incorporate hybrid methods to improve predictive performance in highly complex situations.