Random forest algorithm for mapping deforestation in the Ituri-Epulu-Aru landscape (Democratic Republic of Congo)

The evaluation of deforestation by optical remote sensing remains a challenge in the humid tropical region due to high cloud cover. This paper develops a simple and reproducible method for mapping deforestation of the old-growth forest using open access software. A map of old-growth forest depletion was created using composites from three different dates (2003, 2010, 2016). Four models were tested: the first model using spectral bands (nir, swir1, swir2 and red), the second model was based on the association of spectral bands and spectral indices (NDVI, B54R, NDWI and NBR), the third model was constructed using spectral bands and geomorphological indices (DEM, Slope and Roughness) and the last model combined spectral bands, spectral indices and geomorphological indices. The optimal random forest ntrees and Mtry parameters were determined for each model to optimize the mapping in each model. The out-of-bag error for these four models was 2.15 %, 2.05 %, 1.86 % and 1.85 %, respectively. The fourth model had the lowest error and was hence used to predict deforestation of the old-growth forest. The annual rates of deforestation amounted 0.26 % (69861 ha) and 0.66 % (145768 ha) between 2003 – 2010 and 2010 – 2016, respectively. The area of the old-growth forest in 2016 was 3601607 ha and 215629 ha of forest lost between 2003 and 2016. These results showed that the Random Forest Classification (RFC) model was able to effectively map the reduction of old-growth forests.


Introduction
Equatorial old-growth forests represent a natural ecosystem having the most important quantity of biomass and among the highest biodiversity [1 -3]. As land use in the world is undergoing profound spatial changes, old-growth forests are increasingly subjected to anthropogenic pressure [4][5].
The methods used to assess Land Use and Cover Change (LUCC) have improved considerably in recent decades [6-7; 4]. This assessment requires a rapid and reliable mapping of vegetation cover dynamics, theoretically hoping for a sustainable exploitation and management of natural resources. If forest inventory is still the main source of LUCC information, remote sensing is a powerful tool for the analysis of forest dynamics [8]. Remote sensing observations are an attractive source of information for survey and land-use mapping. For example, the global assessment of forest resources conducted by the United Nations Food and Agriculture Organization [9] through remote sensing estimated the word forest area at 3999 million ha. The same source indicates that global forest loss was 192 million ha between 1990 and 2015, that is, a rate of 2.16 million ha per year.
Much of the world's old-growth forest is in the tropics. Unfortunately, remote sensing mapping in this region remains challenging given the persistently high cloud cover [10][11][12]. For this reason, a deforestation assessment is more often carried out at the national [13], regional [14][15] or global level [7; 4]. Many natural resource management institutions rely on these assessments but the map products do not always meet managers expectations [16]. Few studies in the Democratic Republic of Congo have been undertaken to assess the deforestation at the local level [17].
Mapping land use dynamics of complex landscapes is a particularly difficult task as some land covers have similar spectral characteristics and some change throughout the season. Thus, a multidate analysis is often required for mapping land cover dynamics [18]. However, when two land cover maps corresponding to two different dates are combined, the individual error of each land cover class is multiplied if one considers that the errors of the two maps are independent [19]. Thus, multi-date analysis reduces classification errors compared to comparing independent classifications produced for multiple dates [20].
In addition to the temporal dimension, studies have shown that auxiliary information can improve the accuracy of land cover classification. At the same time, the inclusion of temporal and geomorphological observations in the classification process can increase the dimensionality of the data. But, the increasing number of input variables to the classification can induce complexity resulting in increased computing time.
The selection of variables is an important step in many machine learning applications. It is not only necessary to obtain more accurate maps, but also to understand which variables are most relevant in the classification process. In this paper, we consider the application of the Random Forest (RF) algorithm for feature selection because of its interesting properties, such as high accuracy and robustness against over-fitting of training data [21]. RF is currently proposed and used to improve land cover mapping from remote sensing imagery [22]. It provides an approach for assessing the importance of characteristics or predictors which can be considered as a useful parameter to study the role of each temporal, spectral or thermal feature in the analysis of old-growth forest reduction discrimination. The purpose of the study was to develop a simple, fast and reproducible method for mapping the deforestation of the old-growth forest using Random Forest.

Study area
The study area is the Ituri-Epulu-Aru landscape (IEAL) is located between 2°37'022''N -0°31'030''N and 27°34'034''E -30°00'039''E, in Democratic Republic of Congo (Fig. 1). Most of the landscape is situated in Ituri province (in the administrative territories of Mambasa, Irumu and Djugu). Only a smaller portion of the landscape is included in the Haut-Uélé (Wamba and Watsa territories) North-Kivu provinces. The average daily temperature is very stable over the year between 23 and 25.5°C. The average annual rainfall in the landscape fluctuates between 1,600 and 2,000 mm [23]. The driest month of the year has an average rainfall of less than 50 mm in some parts of the landscape. During the dry period, the sky is absolutely cloudless, the relative humidity is low and evapotranspiration is very high [24].
The IEAL is covered by a dense semi-evergreen open-canopy forest, although there is a semideciduous forest in the far north-east. The IEAL includes the Okapi Wildlife Reserve (OWR) (13720 3 of 15 km²), the Mai-Tatu Community Reserve (proposed), Enzyme Refines Association (ENRA) private logging concession (520 km²) and three community management areas: Banana (575 Km²), Andekau (6973 Km²) and Bakwanza (2181 Km²) [23]. Also, a reducing emissions from deforestation and degradation (REDD+) pilot project has been implemented in the Mambasa region.  Figure 2 schematically illustrates the methodology used in this study to generate and test the algorithm. It presents the data and their preprocessing, the variables used for the construction of the model, the random sampling of learning datasets, the random forest classifier (RFC) and four land cover classification models.  Ot her l and use ENRA Figure 1. Overview of the analysis framework.

Remote sensing data and preprocessing
The satellite images used to map deforestation in the IEAL are annual image composites of 0.025degree resolution obtained from the Central Africa Regional Program for the Environment (CARPE). The original images are from the Landsat Thematic Mapper (TM), the Enhanced Thematic Mapper (ETM+) and the Operational Land Imager (OLI) sensors. These composites were formed from four spectral bands: nir (0.845-0.885 μm), red (0.630-0.680 μm), SWIR1 (1.560-1.660 μm) and SWIR2 (2.100-2.300 μm) that had undergone atmospheric, radiometric and geometric corrections. The CARPE composites [13] were chosen as they were free of cloud cover, they cover all Congo Basin countries and can be downloaded free of charge from the CARPE website (www.carpe.umd.edu/). These images are organized in square tiles of one degree by one degree. In this study, we used 10 tiles per date. The date range for each composite was less than one year.
The Congo Basin Forest Partnership (CBFP) landscape was created in 2002 and development work began in 2003 in the IEAL [25]. Hence, 2003 was selected as the reference date. In addition, 2016 is the last year CBFP data are available in the IEAL. To ensure the multi-temporal comparability, the images were re-projected in the same reference coordinate system: WGS 84 -UTM zones 35 North. Then, for each spectral band, a mosaic of tiles was created for the selected years. Radiometric offsets due to differences in acquisition dates were minimized by applying histogram equalization.
In addition to the spectral indices, geomorphological indices (elevation, slope, and roughness) derived from a 30-meter Digital Terrain Model (DEM) were introduced to mitigate the relief effect on spectral bands reflectance [33].

Random sample
The training zones were defined on the screen by digitizing several polygons on the multi-date color composition for each spectral class. Thirty percent of the pixels from the 483 polygons was randomly selected for validation of land cover maps. In this study, we randomly sampled pixels from more than 483 polygons divided into six classes (an average of 81). These samples represented more than 53995 ha and were distributed in the studied landscape space. The number of training pixels has been fixed at 3500 pixels per class (i.e., 21000 pixels). This number is sufficient to represent variation throughout the image [34].

Classification algorithm for mapping the deforestation
We used the Random Forest classification algorithm [35] available in the randomForest package [36] of the R software [37]. Random Forest (RF) is a non-parametric classification algorithm that combines the decision tree algorithm, bagging, and a model aggregation technique [22]. The samples omitted from the bootstrap sample are called samples out of bag (OOB). Out-of-bag samples help to evaluate classification error and the variable significance estimate.
In the RF classifier, two parameters must be defined [38]: the number of decision trees (Ntree) and the number of variables to select for the best distribution during tree growth (Mtry). The literature recommends that Ntree be defined at 500 or greater because the errors tend to stabilize before this number of trees is reached [39]. In this project we set Ntree to 2000. Moreover, the optimal Mtry was defined using the function "train" of the CARET package [40].

Land cover classification models and validation
To map the land use of the area, four models were tested: (i) the first model built on the basis of the image channels (red, nir, swir1 and swir2), (ii) the second model was constructed using image channels and vegetation indices (NDVI, B54R, NDWI, NBR), (iii) the third model was constructed using image channels and geomorphological indices (DEM, Slope and Roughnesses) and (iiii) the forth model combined image channels, vegetative indices and geomorphological indices (Table 3). The performance of each model was calculated using out-of-the-bag (OOB) error [41]. The closer the OOB error value is to zero, the better the model [20]. In addition, a confusion matrix was produced to evaluate the quality of learning. The confusion matrix estimates errors of omission and commission. It also validates the land cover map by means of the Kappa coefficient and the estimation of the overall accuracy [42; 43].
The post-classification majority filter (3x3 pixel size) was applied to the land cover maps of each model. This majority filter has been applied to eliminate occurrences of isolated pixels and to avoid the "pixelated" effect (pepper and salt).

The annual deforestation rate
The annual deforestation rate is calculated as the ratio of area cleared over a period, divided by the initial forest area and the number of years of the period [44; 45]. However, several publications have recently highlighted that the evolution of forest loss cannot be obtained with this simple report [46]. The standardized formula proposed by Puyravaud [46] was adopted to calculate the annual deforestation rate for this study.
Where A1= forest area in the initial year; A2= forest area in the final year; t1 = initial year; t2 = final year. ln = natural logarithm

Number of trees and Mtry optimization
The evolution of the OOB error was a function of the increase in the number of trees (Fig. 4). For all models, the OOB error decreases with increasing number of trees and stabilized before the number of trees reaches 500. Also, the four models showed the same behavior when it comes to optimizing the number of variables to select for the best distribution during tree growth. All the models proposed three  (Table 4). The best OOB errror for all four models was obtained when the Mtry value was 2. Models 2 and 4 presented the smallest values of the OOB error (0.021). The largest OOB error was obtained in the model 1 (0.028, with Mtry equal to 12) (Table 4).

. Validation of classification models of land cover
The results of the various tests performed on three groups of variables are presented in Table 5. The conclusion was that model 1 created using only spectral bands had a larger OOB error (2.15%). When we combined the spectral bands with the radiometric indices (NDVI, NIRI and NDWI), the model improved very slightly by 2.05%. The same was true when we combined the spectral bands with the geomorphological indices (1.86%). In addition, the comparison of model 2 and model 3 revealed that the information from geomorphological indices improved the model more than those from spectral indices. Model 4 had the lowest OOB error (1.85%). Therefore, model 4 was selected to map the dynamics of deforestation in the study period 2003 -2009 and 2009 -2016.

Deforestation trends in the IEAL
Statistics on forest cover trends are presented in Figure 5. The landscape was dominated by oldgrowth forests. They account for over 87.67% of the total landscape in 2016. Other stable vegetal covers accounted for almost 6.31% representing more than 259032 ha. The Bare ground / Buildings / Rocks land cover since 2003 represented less than one percent (16.2 ha). Water accounted for 0.38% of the studied landscape (15662 ha).  Table 7 presents the area of old-growth forest cover and deforestation rates observed in the IEAL and in the neighboring eastern highlands. These results suggest that the total area of the old-growth forests in 2016 was 3601608 ha. In the whole, more than 69861 ha were lost during the 2003-2010 period, i.e, 9980 ha per year and approximately 145768 ha for the 2010-2016 period, i.e., 24295 ha per year. This result showed that the annual rate of deforestation increased from 0.03% -in the first period to 0.66% in the second. However, this forest loss was differently distributed within the IEAL. The Banana Community Management area had a relatively constant annual deforestation rate over the two studied periods (0.54%), in contrast with the sharp increase for the other macro zones and in the REDD + Mambasa intervention zone as shown in Table 7. While the ENRA and secondly the REDD + intervention zone had already a high deforestation rate for the 2003-2010 period (3.05% and 0.70% respectively to compare with 0.02%, 0.07% and 0.54% for the three community areas, Andikau, Bakwanza and Banana respectively), the first two communities of Andikau and Bakwanza did see   (145768) 4. Discussion

Adjustment of models and importance of variables
For all models designed in this study, the Kappa coefficient is greater than 80% and the OOB does not reach 2.2% [47]. The four tested models showed little difference in value of the Kappa coefficient and the OOB error (a difference of 0.3% and 0.0031, respectively).
Many studies found that land cover classes can be confounded. For example, Kabuanga [48] showed that the young secondary forest can easily be confounded with the brownfields and old fallows. Elsewhere, AECOM [49] showed some possible confounding between savanna vegetation, areas with little vegetation and crop areas. In our study, however, the obtained results showed that the six land cover classes were not confounded. These results are corroborated by those by Mikwa et al. [50] who found that although the old-growth forests are spectrally similar to secondary forests, their respective discrimination is still possible if the sampling effort is substantial. Thus, merging of secondary forests, wastelands, savannas and agricultural complexes into one class has eliminated the confounding between classes, which in the end impacted the classification model of land use. In the same way, the built-up areas showed spectral signatures close to those of rocks. For this reason, these two classes have been merged. As usual, the water class was clearly distinguishable from other classes [20].
The temporal aspect also played a clear role. More than 90% of the landscape remained stable over the entire observation period. Of the selected six classes, only two (Df1 and Df2) showed spatiotemporal changes expressing deforestation of the old-growth forests. These two classes stood out clearly on the colorful multi-date composition. Image-to-image rectification allowed an objective analysis of the same objects.
Variable importance is presented in Figure 6. The results show that in all four models, the spectral bands play a large role in the mapping of land use. The swir1, swir2, nir and red domains are widely used in land cover mapping [51; 26-30]. The SWIR domain plays a leading role in improving classification models. From a temporal point of view, the 2016 bands occupy the first position in terms of importance. SWIR2 seems to play first in models where geomorphological indices are absent.
The addition of geomorphological indices provided supplementary information and changes the influence of the variables. Indeed, in models made up of geomorphological indices, SWIR1 took the first place in terms of importance. In contrast, in models where the spectral indices are absent, the nir was less important. Also, the geomorphological indices seemed to play a less important role. They improved the total performance of the models by bringing non-redundant information to the models as earlier found by Huo et al. [51] and Oeser et al. [26]. The unique aspect of this study lies in the use of topography information which mitigates the relief effect on the spectral bands reflectance [33]. In this regard, the DEM was the most important geomorphological variable. Figure 6. Importance of the variables in each model.

History of deforestation in the IEAL
The comparison of the results obtained in this study with those of other similar studies should be carried out with caution since the methodologies and data are not compatible (data, methodology, observation periods, post-classification treatments). Lusana et al. [16] estimated the old-growth forests area as 40,492 km² in 2003 and 39,977 km² in 2010, meaning a loss of 51,514 ha between 2003 and 2010, including secondary forest, which explains why their estimations are higher to ours for the same dates. The FACET [52] estimates it as 38,432 km² for 2010. However, both studies unravel rising rates of deforestation.

Conclusions
This paper assessed the potential of multi-spectral remote sensing observations for the classification of land use using RFC technique. The study area presents specific characteristics and various types of natural and artificial surfaces. The key lesson learnt from of our study is that the combination of several types of information (spectral, temporal and topographic data) improves the results. RFC offers a wide range of possibilities regarding variable significance which guides users to important variables or features. Using the RFC importance measures reduces the dimensionality of the input features. RFC importance variables showed how multi-date spectral and topographic characteristics have the greatest influence on class separability in the studied area.
These results showed that the RFC model was able to effectively map the reduction of oldgrowth forests. This study also highlighted the importance of topographic data in addition to multispectral imagery for the purpose of land cover classification. Measuring the importance of variables was another advantage of random forests that was successfully applied in this study. The optimization of the RF was relatively easy because it considers only two parameters (ntrees and Mtry).
In developing countries, several constraints make it difficult to study changes in land use. The use of open access software can allow the reproduction of this method without legal constraints. For this reason, the script used is detailed and entirely reproduced in the supplemental material to facilitate the implementation of RFC.