Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

Assessing the Dimensionality Reduction of the Geospatial Dataset Using Principal Component Analysis (PCA) and Its Impact on the Accuracy and Performance of Ensembled and Non-ensembled Algorithms

Version 1 : Received: 7 July 2023 / Approved: 7 July 2023 / Online: 10 July 2023 (11:11:27 CEST)

How to cite: Abbas, F.; Zhang, F.; Iqbal, J.; Abbas, F.; Alrefaei, A.F.; Albeshr, M. Assessing the Dimensionality Reduction of the Geospatial Dataset Using Principal Component Analysis (PCA) and Its Impact on the Accuracy and Performance of Ensembled and Non-ensembled Algorithms. Preprints 2023, 2023070529. https://doi.org/10.20944/preprints202307.0529.v1

Abstract

In this study, we analyze the tradeoff between accuracy and complexity in machine learning models, focusing on how reducing complexity and entropy affects the production of landslide susceptibility maps and the capture of complex patterns within them. To this end, we conducted a comprehensive evaluation of machine learning algorithms for classification tasks, comparing their accuracy and complexity before and after dimensionality reduction with Principal Component Analysis (PCA). Our findings reveal that reducing complexity and lowering entropy can increase model accuracy, but at the cost of losing important complex patterns in the produced susceptibility maps: by simplifying the model and reducing entropy, intricate relationships and uncertain patterns may be overlooked, resulting in a loss of information that can compromise the maps' accuracy. The analysis covered a diverse range of machine learning algorithms: Random Forest (RF), Extra Trees (EXT), XGBoost, LightGBM, CatBoost, Naive Bayes (NB), K-Nearest Neighbors (KNN), Gradient Boosting Machine (GBM), and Decision Trees (DT). Each algorithm was evaluated for its strengths and limitations with respect to the accuracy-complexity tradeoff. Before dimensionality reduction, the algorithms showed promising results, with RF achieving excellent AUC/ROC scores and average accuracy, although its computational cost is a potential drawback, especially on large datasets.
EXT showed robust performance and good accuracy, while XGBoost handled complex relationships within large datasets well, albeit requiring careful hyperparameter tuning. LightGBM's efficiency and scalability made it well suited to large datasets, though it was sensitive to class imbalance. CatBoost excelled at handling categorical features but showed longer training times on larger datasets. NB offered simplicity and computational efficiency but assumes independence among features. KNN captured local patterns and spatial relationships but was sensitive to the choice of distance metric. GBM captured complex relationships effectively but was prone to overfitting without proper regularization. DT, while interpretable and easy to understand, suffered from overfitting and limited generalization. After dimensionality reduction, several algorithms, including RF, EXT, XGBoost, and LightGBM, improved in AUC/ROC scores and average accuracy, whereas a few, such as NB and DT, declined. These results offer insight into the performance characteristics, strengths, and limitations of the algorithms, which researchers and practitioners can use to select algorithms suited to their specific datasets and requirements. We also identify factors contributing to the high accuracy of the ensembled algorithms and examine shortcomings of non-ensembled algorithms that may lower their accuracy, providing insight into the benefits and limitations of ensembled approaches for landslide susceptibility mapping.
Our study sheds light on the challenges of balancing accuracy and complexity in machine learning models for landslide susceptibility mapping. It emphasizes the importance of weighing the level of complexity and entropy reduction against the specific patterns and uncertainties present in the data. By clarifying this tradeoff, our research aims to help researchers and practitioners make informed decisions about model complexity and entropy reduction, ultimately improving the quality and interpretability of landslide susceptibility maps.
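The before/after PCA comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the synthetic data, the number of retained components, and the use of a simple 1-nearest-neighbour classifier are all illustrative assumptions, chosen only to show how accuracy can be measured on the full feature space and again after projecting onto the top principal components.

```python
# Minimal sketch (not the authors' pipeline): compare a 1-NN classifier's
# accuracy on synthetic data before and after PCA. All data shapes and
# parameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features: 2 informative dimensions plus 8 noisy ones.
n = 400
informative = rng.normal(size=(n, 2))
labels = (informative[:, 0] + informative[:, 1] > 0).astype(int)
noise = rng.normal(scale=0.5, size=(n, 8))
X = np.hstack([informative, noise])

def pca_fit_transform(X, k):
    """Project X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)          # covariance of the features
    vals, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]          # largest variance first
    return Xc @ vecs[:, order[:k]]

def knn_accuracy(Xtr, ytr, Xte, yte):
    """1-nearest-neighbour accuracy with Euclidean distance."""
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    pred = ytr[np.argmin(d, axis=1)]
    return (pred == yte).mean()

split = n // 2
acc_full = knn_accuracy(X[:split], labels[:split], X[split:], labels[split:])

Xp = pca_fit_transform(X, k=2)              # reduce 10 dimensions to 2
acc_pca = knn_accuracy(Xp[:split], labels[:split], Xp[split:], labels[split:])

print(f"accuracy before PCA: {acc_full:.2f}")
print(f"accuracy after PCA:  {acc_pca:.2f}")
```

In a real study, PCA would be fitted on the training split only and the same projection applied to the test split; fitting on all rows here keeps the sketch short.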

Keywords

machine learning; accuracy; complexity; entropy; landslide susceptibility mapping; dimensionality reduction; Principal Component Analysis (PCA)

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
