1. Introduction
The increased frequency and intensity of extreme precipitation events is a salient concern for mountainous regions, where, along with polar regions, changes in climate are observed to be occurring most rapidly [1]. The increased likelihood of triggering events places the geographically isolated Norwegian population at higher risk [2], through loss of life, closure of transport routes, isolation, and economic costs from repairs and disrupted economic activity [3]. In recent years, rockfalls [4], landslides [5,6], and floods [7] have impacted built-up areas and major transport routes, causing significant disruption due to the lack of alternative routes.
A common problem for landslide hazard management is a lack of data on previous events. To better understand landslide hazards in Norway, significant efforts have been made over several years to improve the collection of landslide data, using both ground-based and remote sensing approaches [8]. These improvements have been made through collaborations between relevant authorities, primarily the Norwegian Water Resources and Energy Directorate (Norges vassdrags- og energidirektorat; NVE) and the Norwegian Public Roads Administration (Statens vegvesen; SVV), with support from research institutions. Improvements have included systematic reporting by SVV of landslide events that impact roads, and continuous improvements in the design of the National Mass Movement Database (Nationale Skred Database, NSDB) managed by NVE, through the development of IT tools for reporting events and managing data, along with systematic mapping using remote sensing images from satellites and aerial photos.
The landslide inventories produced from systematic mapping of recent mass-triggering events have been used extensively to validate and improve existing landslide hazard management practices, including improving threshold values for early warning models, verifying warnings, providing input to hazard mapping, calibrating, developing and verifying runout models, and evaluating the performance of susceptibility and hazard mapping methods. Using remote sensing images overcomes some limitations of relying on ground-based observations alone, as was the previous practice: ground-based observations show a strong spatial bias towards roads, with events that occur away from roads rarely being reported, and information such as the initiation point, size, and classification of the landslide may be missing or inaccurate [9].
Landslides were manually mapped using the change in Normalized Difference Vegetation Index (dNDVI) derived from Sentinel-2 images following extreme precipitation events at Jølster (30 July 2019) [9,10] and in southern Norway after the Hans storm (7-9 August 2023) [11]. However, freely available satellite images can only detect medium to large landslides, and mapping with optical images is seasonally dependent. Higher-resolution images from other satellites (including Planet, Pléiades, and WorldView), drones, and aerial photos have also been used to map landslides, and field visits and helicopter flights were conducted for verification. Landslide polygons have also been produced for less significant triggering events, albeit on an ad hoc basis. Supplementing ground-based observations with remote sensing data has yielded significant improvements in the number of landslides reported. For example, following the Jølster event the number of registered landslides increased from 14, based on ground-based observations from the road authorities, to 120 using Sentinel-2 images [9]. For Hans, the number increased from 263 from ground-based observations to 648 from remote sensing data [11]. Copernicus Emergency Management Services (Copernicus EMS) mapped landslides in some of the flood-affected valleys based on Pléiades imagery (0.5 m resolution) after Hans [12,13]; however, systematic mapping was necessary to produce a complete inventory over the entire affected area.
Over the same period, there has been continual development of deep-learning models capable of image recognition. Manual landslide mapping with satellite images demands detailed, repetitive visual inspection over large areas, making it both time-consuming and resource-intensive. Thus, there has been increasing research, both in Norway and internationally, into automating this process using deep learning [14,15]. While recent studies demonstrate the potential to significantly advance landslide mapping through improved data processing, feature extraction, and model scalability, challenges remain in an operational setting due to limited model generalizability across regions [16], difficulty accurately distinguishing between landslide types [17], and handling complex or temporally heterogeneous datasets [15]. Scarcity of high-quality landslide inventories is also a significant challenge, both for training and for verifying model performance.
For creating landslide inventories after a triggering event, it is necessary to distinguish between pre-existing and new landslides. Some state-of-the-art methods for landslide detection using high-resolution imagery have only been trained on post-event images. For example, researchers using automatic detection models after the 2024 Taiwan earthquake needed to run their pre-trained models separately on images from before and after the event, then subsequently remove the pre-existing landslides [18]. They concluded it would have been more efficient to use a method that includes change detection inherently. The Copernicus EMS Rapid Mapping service uses manual change detection methods and visual inspection of imagery. High accuracy of automatic feature extraction can be achieved, but this is rarely used during emergency response, as it requires homogeneous pre- and post-event imagery from the same sensor and resolution, which is currently rarely available [13]. Presently, there remains a trade-off between image resolution on the one hand and revisit frequency and cost on the other. Medium-resolution images, for example from Sentinel, are more readily available and can be used as input for change-detection-based automatic detection models; however, the mapping output is inherently less detailed than that from higher-resolution images.
Another major limitation of applying pre-trained detection models based on optical imagery after mass-triggering events to support disaster response is the time delay in obtaining cloud-free images. Synthetic Aperture Radar (SAR) imagery offers the advantage of being unaffected by cloud cover, and SAR-based landslide detection has been tested both in research [19,20,21,22] and in post-disaster contexts, for example following Tropical Storm Grace in Haiti, Cyclone Gabrielle in New Zealand, and the Hualien Earthquake in Taiwan [18,21,23]. However, current SAR-based approaches have not yet demonstrated sufficient accuracy for operational use. Landslide detection rates remain substantially higher using cloud-free optical images of similar resolution [9].
There are numerous examples of studies applying machine learning and deep learning models to automate landslide detection from satellite images. Machine learning algorithms such as Random Forest (RF), Support Vector Machines (SVM), and Gradient Boosting Machines (GBM) are commonly used due to their ability to handle non-linear relationships and multi-source inputs [15]. Since 2015, there has been a growing shift toward deep learning models, which are better suited to image recognition tasks [24,25]. Convolutional Neural Networks (CNNs) and U-Net architectures, in particular, have shown strong capability in capturing spatial patterns and contextual information, especially when high-resolution data are available [14]. Where training data are scarce, certain types of deep-learning models may be better suited to achieving good results, such as self-supervised transformer models and YOLO (You Only Look Once) models [26], or ensemble frameworks that integrate multiple algorithms [27]. Another limitation of optical satellite imagery is that other processes of vegetation removal (forestry, agriculture, river erosion or deposition) can give the same spectral signals as a landslide, leading to false positives in the model predictions. Hybrid models can combine physics-based models or Object-Based Image Analysis (OBIA) algorithms with data-driven deep-learning approaches, which often leads to better predictive performance [28,29].
The current study builds on previous work [9,11,16] and moves towards a robust method for semi-automatic landslide detection that can be run by Norwegian authorities annually or following mass-triggering events, to improve the efficiency of landslide mapping. The objective was to develop and test a pre-operational U-Net model that can be trained and run on multiple geomorphological regions. In addition to using multiple study areas, previous issues with data leakage due to overlapping training images have been rectified. Using the well-verified Jølster and Hans storm landslide inventories [11], we investigate the following:
How effectively can the deep learning (U-Net) model detect post-storm landslides using Sentinel-2–based vegetation change and persistence metrics?
Which factors most limit detection accuracy and how can these be mitigated?
How can we set up a pipeline for general expansion of the study area?
2. Materials and Methods
The input data included Sentinel-2 satellite images from the European Space Agency (ESA), acquired through the Google Earth Engine Python API. We used the Harmonized Level-2A surface reflectance (SR) product, pre-processed from Level-1C products using Sen2Cor. Over 800 previously mapped landslide polygons from the 2019 Jølster [10] and 2023 Hans [11] storms were available as training and validation data, provided by NVE. The study areas are shown in Figure 1.
While the full set of Sentinel-2 tiles was included during initial model development, the experiments reported in the present article show the results of a detailed study using only two tiles: the Jølster tile and one representative Hans tile (marked in red in Figure 1). The detailed study can be reproduced by running the script linked in the Data Availability Statement at the end of the article, under ‘investigate_one_tile.ipynb’. Using a limited subset allowed us to efficiently establish and validate the model workflow, assess performance under controlled conditions, and identify key limitations that need to be addressed before scaling to the full dataset. The detailed study involved focused manual inspection of individual landslides to identify cases where the model struggles to separate landslides from non-landslide features. Additionally, the automated cloud-removal preprocessing steps did not give satisfactory results for the full area, which introduced errors during initial model development; cloud removal was therefore tuned manually for the tiles in the detailed study.
Sentinel-2 surface reflectance imagery was preprocessed following the approach of [9], re-implemented in Python using Google Earth Engine [30]. For the Jølster area, all available images from one month before to one month after the extreme weather event were used. For the Hans tile, two months before and after were used, due to persistent cloud cover. Clouds, shadows, snow, and water were removed from the Sentinel-2 imagery using a combined masking approach: pixels with a cloud probability above 40%, or classified in the Scene Classification Layer as clouds, shadows, snow, or water, were excluded.
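The masking rule above can be sketched as follows. This is an illustrative NumPy version, not the authors' Google Earth Engine code, and the Scene Classification Layer (SCL) class codes listed are the standard Sentinel-2 Level-2A conventions assumed here:

```python
import numpy as np

# Standard Sentinel-2 Level-2A SCL codes (assumed, not taken from the study):
# 3 = cloud shadow, 6 = water, 8 = cloud (medium prob.),
# 9 = cloud (high prob.), 10 = thin cirrus, 11 = snow/ice.
EXCLUDED_SCL = (3, 6, 8, 9, 10, 11)

def valid_pixel_mask(cloud_prob: np.ndarray, scl: np.ndarray) -> np.ndarray:
    """True where a pixel is usable: cloud probability <= 40% and the
    SCL class is not cloud, shadow, snow, or water."""
    bad_class = np.isin(scl, EXCLUDED_SCL)
    return (cloud_prob <= 40) & ~bad_class
```

In the Earth Engine implementation the same logic would be expressed with per-image mask updates; only the thresholds stated above are taken from the text.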
The resulting mask was applied before computing the greenest-pixel composites based on the maximum NDVI value per pixel, ensuring that only high-quality, clear-sky pixels were used for analysis. An additional persistence band was generated, representing the ratio of post-event cloud-free observations with NDVI values below the pre-event median, indicating the persistence of vegetation loss. For a given pixel, after filtering out cloudy observations, the persistence score was calculated as

persistence = N_below / N_total,

where N_below is the number of post-event cloud-free observations with NDVI below the pre-event median, and N_total is the total number of post-event cloud-free observations.
The persistence score indicates whether vegetation damage after the storm events was temporary (suggesting superficial damage due to surface water flow or shallow silt deposition) or persistent (suggesting significant erosion or deposition).
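A minimal sketch of the two derived products, assuming masked observations are stored as NaN in a `(time, rows, cols)` NDVI stack (array and function names are illustrative):

```python
import numpy as np

def greenest_composite(ndvi_stack: np.ndarray) -> np.ndarray:
    """Greenest-pixel composite: per-pixel maximum NDVI over a
    (time, rows, cols) stack, where masked pixels are NaN."""
    return np.nanmax(ndvi_stack, axis=0)

def persistence_score(pre_stack: np.ndarray, post_stack: np.ndarray) -> np.ndarray:
    """Fraction of post-event cloud-free observations whose NDVI falls
    below the per-pixel pre-event median NDVI."""
    pre_median = np.nanmedian(pre_stack, axis=0)
    below = post_stack < pre_median            # NaN comparisons evaluate to False
    n_valid = np.sum(~np.isnan(post_stack), axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        score = below.sum(axis=0) / n_valid
    return np.where(n_valid > 0, score, np.nan)
```

A persistence score near 1 means nearly every clear-sky post-event observation showed depressed NDVI, matching the interpretation of persistent vegetation loss above.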
In total, four input channels were used: the difference in NDVI (dNDVI), pre-event NDVI, post-event NDVI, and persistence. The landslide polygons were converted to a binary raster of landslide and non-landslide pixels.
A standard U-Net model was used. The two tiles were divided into images of 256 x 256 pixels, corresponding to approximately 2560 x 2560 m in 10 m resolution imagery. Data augmentation was performed by rotating the images by 45, 90, and 135 degrees. In total, this gave 188 images for testing and training. Only images containing landslides were included for training, to reduce model bias towards non-landslide pixels. However, the model was tested on the full tile area, to understand how it would perform in operational settings where one does not have prior knowledge of where landslides have occurred. Each model was trained for 300 epochs.
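The tiling and training-set filtering can be sketched as below (illustrative names, not the authors' code; the 45°/90°/135° rotation augmentation is omitted here, since off-axis rotations require resampling, e.g. via `scipy.ndimage.rotate`):

```python
import numpy as np

TILE = 256  # pixels; ~2560 m x 2560 m at 10 m resolution

def split_into_tiles(arr: np.ndarray, tile: int = TILE) -> list:
    """Split a (rows, cols, channels) raster into non-overlapping square
    tiles, discarding any partial tile at the right/bottom edges."""
    return [arr[r:r + tile, c:c + tile]
            for r in range(0, arr.shape[0] - tile + 1, tile)
            for c in range(0, arr.shape[1] - tile + 1, tile)]

def training_tiles(tiles: list, masks: list) -> list:
    """Keep only (tile, mask) pairs whose binary label mask contains at
    least one landslide pixel, mirroring the filter described above."""
    return [(t, m) for t, m in zip(tiles, masks) if m.any()]
```

Filtering out empty tiles at training time, while still predicting over whole tiles at test time, is what produces the operationally realistic evaluation described above.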
During model development, model performance was evaluated repeatedly using strict data separation to avoid leakage. The images from the Jølster and Hans tiles were randomly divided into two independent subsets. The first subset was used to train Model 1, and the second to train Model 2. During testing, images used for training Model 2 were evaluated with Model 1, and vice versa, allowing an unbiased assessment of model generalization within each area. Images that did not contain any landslide pixels were excluded from training and used only for testing, evaluated using Model 1. All augmented versions of the same image (rotated by 45°, 90°, and 135°) were kept within the same subset, so that no transformed variant of a training image could appear in the corresponding test set. This ensured complete separation between training and testing data and avoided inadvertent information transfer between datasets.
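The leakage-safe split can be sketched as follows; `group_id` (an assumed identifier, not from the source) ties each base image to its rotated variants so that a whole group always lands in one subset:

```python
import random

def split_by_group(items, seed=0):
    """items: list of (group_id, image) pairs, where all augmented variants
    of a base image share the same group_id. Returns two disjoint subsets
    with no group straddling the boundary."""
    groups = sorted({gid for gid, _ in items})
    rng = random.Random(seed)  # fixed seed for reproducibility
    rng.shuffle(groups)
    first_half = set(groups[:len(groups) // 2])
    subset_1 = [(gid, img) for gid, img in items if gid in first_half]
    subset_2 = [(gid, img) for gid, img in items if gid not in first_half]
    return subset_1, subset_2
```

Shuffling group identifiers rather than individual images is the key design choice: a per-image shuffle would let a 90° rotation of a training image leak into the test set.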
Model performance was evaluated quantitatively and qualitatively. The quantitative evaluation included both pixel-based and object-based approaches. The pixel-based evaluation used confusion matrix metrics to calculate precision, recall, and accuracy. From the pixel-based evaluation, the F1-score was also computed as the harmonic mean of precision and recall. This metric is suitable for assessing detection quality in datasets with strong class imbalance, such as those dominated by non-landslide pixels.
To better assess the ability of the model to identify individual landslides, which is more relevant in an operational setting, an additional object-based evaluation was conducted. In this approach, detected landslide polygons were compared with reference polygons, and the proportion of overlapping pixels was calculated for each pair. A landslide was considered detected under two criteria: (i) an overlap of at least one pixel, and (ii) an overlap of at least 20%.
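Both evaluation modes can be sketched as below (illustrative, not the authors' code; for the 20% criterion, the overlap fraction is assumed here to be measured against the reference polygon's area):

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, ref: np.ndarray):
    """Precision, recall, and F1 (harmonic mean) from two binary masks."""
    tp = np.sum(pred & ref)
    fp = np.sum(pred & ~ref)
    fn = np.sum(~pred & ref)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def object_detected(pred: np.ndarray, ref_polygon: np.ndarray,
                    min_fraction: float = 0.2) -> bool:
    """A reference landslide counts as detected when the prediction overlaps
    it by at least one pixel and by at least `min_fraction` of its area;
    min_fraction=0.0 reproduces the one-pixel criterion."""
    overlap = np.sum(pred & ref_polygon)
    return bool(overlap >= 1 and overlap / ref_polygon.sum() >= min_fraction)
```

The F1-score is computed only from the pixel-based confusion matrix, while the object-based check deliberately ignores prediction size, since one correctly located polygon is what matters operationally.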
A qualitative evaluation was performed, through detailed inspection of the prediction results, to identify systematic errors that were affecting model performance. To explore why certain landslides were not detected, diagnostic plots were generated for each test image. These plots included the pre- and post-event RGB composites, NDVI difference images, persistence maps, and overlays of predicted versus reference polygons.
4. Discussion
4.1. Model Performance
The results indicate that the model performed comparably across the two test sites, achieving precision and recall of approximately 0.53 for Jølster and 0.35/0.44 for Hans. These values are consistent with those reported in other optical-based automatic landslide detection studies. For example, the pre-trained ALADIM model, applied to the 2021 Haiti earthquake and Tropical Storm Grace, achieved a precision of 76.9% and recall of 52.4% [23], although it used very high-resolution Pléiades imagery (0.5 m) and was applied to more densely clustered, larger landslides than typically occur in Norway.
The morphological and geomorphological contrasts between the Norwegian and Caribbean or Asian case studies make direct comparisons difficult. In Haiti and Taiwan, wide, coalescing slope failures dominate due to seismic triggers, whereas in Norway, landslides triggered by intense rainfall are generally smaller, more elongated downslope, and more spatially dispersed. These differences highlight the importance of developing locally trained models tuned to the glacially sculpted terrain, steep fjord landscapes, and predominantly hydrometeorological triggering mechanisms found in Norway.
Despite the moderate precision and recall achieved so far, the model demonstrates clear potential for semi-automated, post-event mapping of landslides in Norway. With current processing speeds, full-tile predictions can be completed within a few hours, offering a significant time advantage over manual mapping. Nevertheless, human validation remains essential to ensure quality control and to correct systematic misclassifications, particularly in complex terrain and along watercourses. To achieve useful results in an operational setting, however, the systematic errors identified in this detailed study must first be addressed.
4.2. Systematic Errors
The qualitative evaluation revealed several systematic sources of error affecting model performance. The most prominent false positives were linked to residual cloud contamination and misclassification of mass-transporting floods (in Norwegian: masseførende flom). Even after applying combined cloud and shadow masking, small residual cloud patches, especially over water or bright surfaces, produced data gaps or low NDVI values that were erroneously classified as landslides. This limitation is inherent to optical data and reinforces the value of integrating complementary data sources, such as aerial photos taken below the cloud cover and cloud-penetrating Synthetic Aperture Radar (SAR) imagery.
A second major challenge was the spectral similarity in dNDVI between landslides and vegetation loss along river channels caused by mass-transporting floods. The spectral and spatial boundaries between debris flows and mass-transporting floods are often ambiguous. This is illustrated in Figure 5, where the dNDVI signatures of landslides and overflowing rivers are indistinguishable. Currently, mass-transporting floods are not consistently mapped in either the flood or landslide databases. Such inconsistencies propagate into the training data, leading the model to learn contradictory patterns. This results in false positives in some areas and missed detections in others, particularly where different mass-transport mechanisms produce similar signatures.
Missed detections were primarily associated with landslides that did not produce strong NDVI change, either because they were small and involved limited vegetation removal or occurred in sparsely vegetated areas. In some cases, landslides that were visually clear in the NDVI difference images were still missed by the model. This is assumed to be caused by a combination of narrow landslides with weaker dNDVI signals, the limited data set used for training, and the mentioned ambiguity in mass-transporting floods.
4.3. Scaling Up
Previous studies have demonstrated that locally trained deep learning models outperform globally trained models in Norwegian settings, largely due to differences in geomorphology, land cover, and triggering mechanisms that limit model generalization [
16]. However, earlier efforts were constrained to individual case studies due to limited data availability and technical capabilities. The development of the Hans landslide inventory in 2023 provided a valuable opportunity to continue refining a deep learning–based detection model within a Norwegian context and to test its transferability across multiple regions and events.
The framework presented in this article is the first attempt at scaling the workflow up from a single-event case study to a regional framework. It currently includes all tiles from the Jølster and Hans storms, and it is designed in a modular and reproducible way so that more data and new preprocessing methods can easily be added. The first results using this framework, presented in Section 3.2, show the importance of adaptive cloud removal. In the detailed study, cloud removal was manually adapted to the actual cloud conditions of the two tiles used. For a full-scale operational framework, the cloud-removal process must be automated and standardized. To improve the results, one needs to automatically adapt the cloud removal to each tile and possibly set a criterion to exclude training images of insufficient quality. When using the model to detect new landslides, one should also define criteria for the quality of images on which the model can be used.
Computationally, it was shown that the full dataset from both Hans and Jølster was well within the limits of what can be processed on local hardware, without requiring any external computational infrastructure.
4.4. Future Work and Operational Implications
The detailed evaluation highlighted several issues that must be addressed before scaling the model to detect landslides over larger regions or in operational settings. A central challenge is the inconsistent distinction and mapping of overlapping hydrogeological processes, which often exhibit similar NDVI-based signatures. Improved classification definitions, and either including mass-transporting floods in the training data or redefining the model objective as detecting non-anthropogenic rapid vegetation loss more broadly, will be important steps toward reducing misclassification.
Further improvements should focus on increasing model robustness and generalizability. Integrating multi-sensor inputs, including higher-resolution optical imagery and Synthetic Aperture Radar (SAR), would help overcome cloud limitations and support detection of small or narrow landslides. Incorporating auxiliary datasets such as river networks, flood maps, and flow accumulation layers may also enhance the ability of the model to distinguish between slope failures and channel processes. Expanding and refining the training dataset to represent a wider range of geomorphological and seasonal conditions will be essential for improving transferability across regions. NVE currently has more landslide polygon data available that has not yet been incorporated in this workflow.
This workflow can easily be adapted for other locations where landslide polygons are available. Overall, while expert interpretation will remain necessary, the developed workflow demonstrates strong potential as a semi-automated and reproducible tool to support Norwegian authorities in post-event landslide mapping.
Author Contributions
Conceptualization, E.L. and M.S.B.; methodology, E.L. and H.H.; software, H.H.; validation, E.L. and H.H.; formal analysis, H.H.; investigation, E.L. and H.H.; resources, M.S.B.; data curation, M.S.B.; writing—original draft preparation, E.L.; writing—review and editing, E.L., H.H., and M.S.B.; visualization, H.H. and E.L.; project administration, M.S.B. and E.L.; funding acquisition, M.S.B. and E.L. All authors have read and agreed to the published version of the manuscript.